Piu M
Big Data Engineer
*************@*****.***
Summary of Professional Experience
7 years of IT experience as a software engineer, including 4 years of experience in design and development using Hadoop big data ecosystem tools.
Experience in big data technologies such as Hadoop, Hive, Sqoop, Spark, Kafka, Flume, ZooKeeper, and Oozie.
Expertise in creating data pipelines in Spark using Scala and Python.
Expertise in writing Spark RDD transformations, actions, DataFrames, and case classes for the required input data, and in performing data transformations using Spark Core.
Experience in using Spark SQL with various data sources such as Hive and MySQL: performing transformations and read/write operations and saving the results to output directories in HDFS.
Experience in loading data from an enterprise data lake (EDL) to HDFS using Sqoop scripts.
Experience in migrating data from relational databases (Oracle) and external sources to HDFS.
Experience in deploying industry-scale data lakes on cloud platforms.
Extensively used Scala and Java; created frameworks for processing data pipelines through Spark.
Experience in developing Python code to gather data from HBase and Cassandra and designing solutions implemented with PySpark.
Extended Hive and Pig core functionality with custom User Defined Functions (UDFs).
Extensive experience in designing, installing, configuring, and managing Apache Hadoop clusters, Hadoop ecosystem components, and Spark with both Scala and Python.
Extensive experience in data ingestion, transformation, and analytics using the Apache Spark framework and Hadoop ecosystem components.
Expert in working with the Hive data warehouse: creating tables, distributing data through partitioning and bucketing, and writing and optimizing HiveQL queries.
Experience using Kafka clusters for data integration on secured cloud platforms such as AWS, and performing summarization, querying, and analysis of large datasets stored on HDFS and Amazon S3 using Hive Query Language (HiveQL).
Experience in using ETL tools such as SSIS and Informatica.
Worked with PySpark APIs and pandas for data transformations.
Experienced in Azure cloud services such as HDInsight (HDI), Azure Data Lake Gen1 and Gen2, Azure Event Hubs, Azure Databricks, and Azure SQL Data Warehouse.
Experience in AWS services such as Elastic MapReduce (EMR), S3 storage, EC2 instances, and data warehousing.
Experience in working with different Reporting tools like Tableau and Power BI.
Experience in streaming data from source systems to Hadoop using Apache Storm.
Strong knowledge of implementing data processing on Spark Core using Spark SQL and Spark Streaming.
Hands-on experience with Spark SQL queries: importing data from data sources, performing transformations and read/write operations, and saving the results to output directories in HDFS (see the sketch after this summary).
Migrated existing data from SQL Server and Teradata to Hadoop and performed ETL operations.
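A minimal sketch of the Spark SQL read/transform/write pattern described above (table and path names such as sales_db.sales are hypothetical and used only for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object HiveToHdfsJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-to-hdfs")
      .enableHiveSupport()
      .getOrCreate()

    // Read a Hive table registered in the metastore
    val sales = spark.sql("SELECT region, amount FROM sales_db.sales")

    // Transform: total sales per region
    val totals = sales.groupBy("region").agg(sum("amount").as("total_amount"))

    // Save the results to an output directory in HDFS
    totals.write.mode("overwrite").parquet("/data/output/sales_totals")

    spark.stop()
  }
}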
Technical skills:
Big Data Ecosystem: HDFS, YARN, Pig, Hive, Kafka, Flume, Sqoop, Spark Core, Spark SQL, Spark Streaming, HBase, Oozie, ZooKeeper
Languages: Scala, Python, SQL, Java, Bash
Databases: Oracle, MySQL, MS SQL Server, Teradata, Sydata
NoSQL Databases: HBase, Cassandra
IDE/Testing Tools: IntelliJ, PyCharm, Eclipse
ETL Tools: SSIS, Talend, Informatica
Professional Experience
Client: PNC Bank, Pittsburgh, PA July 2019 – Present
Data Engineer
Roles and Responsibilities:
Involved in requirement analysis, design, and implementation of automated data ingestion and processing for a regulatory reporting solution.
Responsible for building scalable distributed data solutions using Hadoop cluster environment.
Responsible for ingesting data from different sources into Hadoop using Spark.
Loaded incremental daily data as well as historical data.
Involved in creating and loading Hive tables and analyzing data using Hive queries.
Developed Hive queries to process the data and generate data cubes for visualization.
Performed transformations using Spark to create intermediate tables in Hive (see the sketch after this list).
Used the Scala programming language for Spark transformations.
Worked extensively with Spark SQL to perform transformations.
Wrote Bash scripts to schedule both Sqoop and Spark jobs.
Created Oozie workflows, coordinators, and job properties, and updated them to run the changes on the edge node.
Performed manual one-time Sqoop loads for missing data in existing Hive tables.
Validated Hive table schemas and column data types against the source data.
Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and Pair RDDs.
Performed advanced procedures such as text analytics and processing using the in-memory computing capabilities of Spark.
Handled large datasets during the ingestion process itself using partitioning, Spark in-memory capabilities, broadcast variables, and effective and efficient joins and transformations.
Involved in scheduling the Oozie workflow engine to run multiple Hive jobs.
Involved in unit testing, interface testing, system testing, and user acceptance testing of the workflow tool.
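A minimal sketch of the Spark-to-Hive intermediate table step referenced above (database, table, and column names such as staging.transactions are hypothetical):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object IntermediateTableJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("intermediate-table-job")
      .enableHiveSupport()
      .getOrCreate()

    // Read a raw Hive table previously loaded by Sqoop
    val raw = spark.table("staging.transactions")

    // Filter invalid rows and aggregate daily totals per account
    val daily = raw
      .filter(col("amount").isNotNull && col("txn_date").isNotNull)
      .groupBy(col("account_id"), col("txn_date"))
      .agg(sum("amount").as("daily_total"), count("*").as("txn_count"))

    // Persist the intermediate result as a partitioned Hive table
    daily.write
      .mode("overwrite")
      .partitionBy("txn_date")
      .saveAsTable("intermediate.daily_account_totals")

    spark.stop()
  }
}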
Environment: Spark, Spark SQL, Scala, Hive, Sqoop, Oozie, Bash scripting, Oracle, SQL Server, Teradata, Sydata, HDFS, Jenkins, Groovy, YARN, IntelliJ IDEA, Maven
Client: CITI Bank, Irving, TX Jan 2018 – July 2019
Data Engineer
Roles and Responsibilities:
Extensively used Spark Core (RDDs, DataFrames, and Spark SQL) while developing multiple applications in both Python and Scala.
Built multiple data pipelines using Pig scripts to process data for specific applications.
Used different file formats such as Parquet, Avro, and ORC for storing and retrieving data in Hadoop.
Used Spark Streaming to consume event-based data from Kafka and joined this dataset with existing Hive table data to generate performance indicators for an application (see the sketch after this list).
Developed analytical queries on different tables using Spark SQL to find insights and built data pipelines for data scientists to consume when applying ML models.
Tuned Spark performance by applying different techniques: choosing optimum parallelism and serialization format while shuffling data, using broadcast variables, and tuning joins, aggregations, and memory management.
Wrote multiple custom Sqoop import scripts to load data from Oracle into HDFS directories and Hive tables.
Used NiFi to automate and manage data flows between multiple systems.
Used different compression codecs (Snappy and Gzip) while storing data in Hive tables for performance improvement.
Used Impala for faster querying in a time-critical reporting application.
Used HBase for OLTP purposes in an application requiring high scalability on Hadoop.
Wrote Sqoop export scripts to write data from HDFS into an Oracle database.
Used Control-M to simplify and automate different batch workload applications.
Worked closely with multiple data science and machine learning teams in building a data ecosystem to support AI.
Developed a Java-based application to automate most of the manual work in onboarding a tenant to a multi-tenant environment, saving around 4 to 5 hours of manual work per tenant per person every day.
Applied different job tuning techniques while processing data with Hive and Spark to improve job performance.
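A minimal sketch of the Kafka-to-Spark-Streaming join described above (broker address, topic, table names, and the comma-separated event format are all hypothetical):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object KafkaEventJoinJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-event-join")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // Static reference data from an existing Hive table
    val appMetadata = spark.table("reference.application_metadata")

    // Event stream from Kafka; each value is assumed to be "appId,latencyMs"
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "app-events")
      .load()
      .select(col("value").cast("string").as("raw"))
      .select(
        split($"raw", ",").getItem(0).as("app_id"),
        split($"raw", ",").getItem(1).cast("long").as("latency_ms"))

    // Enrich events with the Hive metadata and write indicators to HDFS
    val indicators = events.join(appMetadata, Seq("app_id"))

    indicators.writeStream
      .format("parquet")
      .option("path", "/data/indicators")
      .option("checkpointLocation", "/checkpoints/indicators")
      .start()
      .awaitTermination()
  }
}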
Environment: Spark Core, Spark Streaming, Core Java, Python, Hive, Impala, HBase, Sqoop, Kerberos (security), LDAP, and Control M.
Client: AIG, Houston, TX Oct 2016 – Jan 2018
Data Engineer.
Roles and Responsibilities:
Developed a logical data warehouse on the Hortonworks Data Platform using Hive, Spark SQL, HBase, and Phoenix.
Developed ETL jobs using high-level Apache Spark Scala APIs such as DataFrames and Spark SQL to extract, transform, and load data between source/target systems such as HDFS, AWS S3, and relational databases like Oracle and SQL Server.
Developed Hive scripts to process and reformat raw data and load it into optimized Hive tables using Parquet and ORC formats.
Designed Hive tables using partitioning and bucketing techniques to enable efficient joins and filtering of data for faster retrieval (see the sketch at the end of this list).
Performance-tuned long-running Hive queries using query optimization techniques such as partition predicate pushdown, parallel execution, vectorization, and map joins.
Implemented security roles in Ambari and Ranger for fine-grained access control to Hadoop resources.
Designed and developed Spark jobs using Scala and Spark SQL to aggregate data stored in Hive.
Performance-tuned Spark jobs by planning resources, adjusting the degree of parallelism, and tuning Spark SQL queries.
Experience in performance tuning of Apache Spark and Hive jobs; in-depth knowledge of Spark and Hive processing frameworks.
Developed a GraphQL API to expose data stored in SQL Server using TypeScript, GraphQL, Sequelize, Apollo Server, and OpenShift Container Platform.
Created and maintained code repositories in Bitbucket and used a Git branching strategy to promote code to different environments.
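A minimal sketch of the partitioned, bucketed Hive table design referenced above, created and queried through Spark SQL (database, table, and column names such as dwh.claims are hypothetical):

import org.apache.spark.sql.SparkSession

object ClaimsWarehouseJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("claims-warehouse")
      .enableHiveSupport()
      .getOrCreate()

    // Partition by date and bucket by policy_id so joins and filters prune data
    spark.sql(
      """CREATE TABLE IF NOT EXISTS dwh.claims (
        |  claim_id     STRING,
        |  policy_id    STRING,
        |  claim_amount DOUBLE
        |)
        |PARTITIONED BY (claim_date STRING)
        |CLUSTERED BY (policy_id) INTO 32 BUCKETS
        |STORED AS ORC""".stripMargin)

    // Aggregate claim amounts per policy for a single partition
    val totals = spark.sql(
      """SELECT policy_id, SUM(claim_amount) AS total_amount
        |FROM dwh.claims
        |WHERE claim_date = '2017-06-01'
        |GROUP BY policy_id""".stripMargin)

    totals.write.mode("overwrite").saveAsTable("dwh.policy_claim_totals")
    spark.stop()
  }
}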
Client: EvaTech, India May 2013 – Sep 2015
Software Developer
Roles and Responsibilities:
Responsible for data identification, collection, exploration, and cleaning for modeling.
Involved in creating database solutions, evaluating requirements, preparing design reports, and migrating data from legacy systems to new solutions.
Worked with various database administrators, operations teams, and analysts to secure easy access to data.
Built and maintained data in HDFS by identifying structural and installation solutions.
Analyzed structural requirements for new applications to be sourced.
Developed a Spark batch job to automate the creation and metadata updates of external Hive tables on top of datasets residing in HDFS (see the sketch after this list).
Developed a common Spark data serialization module (Immuta) for converting complex objects into byte sequences using Avro, Parquet, JSON, and CSV formats.
Worked on ER modeling, dimensional modeling (star schema, snowflake schema), data warehousing, and OLAP tools.
Wrote Spark SQL query scripts and optimized query performance.
Implemented Spark scripts using SparkSession, Python, and Spark SQL to access Hive table data in Spark for faster processing.
Worked on proofs of concept for various Spark optimization techniques covering memory management, garbage collection, serialization, and custom partitioning.
Developed Spark programs to parse raw data, populate staging tables, and store refined data in partitioned tables in the EDW.
Created new ETL workflows and maintained existing ones, along with data access, data management, and data query components.
Designed, developed, and orchestrated data pipelines for real-time and batch data processing using AWS Redshift.
Performed exploratory data analysis and data visualizations using Python and Tableau.
Strong communication, analytical, and interpersonal skills working within cross-functional teams.
Used JDBC calls to update the database from the application server.
Wrote UNIX shell scripts to copy data files from one server to another.
Wrote SQL queries and PL/SQL procedures accessed via JDBC.
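A minimal sketch of the external-Hive-table batch job referenced above (paths, field names, and table names such as analytics.orders are hypothetical):

import org.apache.spark.sql.SparkSession

object ExternalTableJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("external-table-job")
      .enableHiveSupport()
      .getOrCreate()

    // Serialize the refined dataset to Parquet in HDFS
    // (the raw JSON is assumed to carry order_id and amount fields)
    val refined = spark.read.json("/data/raw/orders").select("order_id", "amount")
    refined.write.mode("overwrite").parquet("/data/refined/orders")

    // Register an external Hive table pointing at the Parquet location
    spark.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS analytics.orders (
        |  order_id STRING,
        |  amount   DOUBLE
        |)
        |STORED AS PARQUET
        |LOCATION '/data/refined/orders'""".stripMargin)

    spark.stop()
  }
}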
Education:
Bachelor of Technology, India, 2013.