Data Engineer

Location:
United, PA
Posted:
October 26, 2022

Resume:

LALITH ADITYA

Email: ads7a7@r.postjobfree.com PH: 916-***-****

Sr. Big Data Engineer

Professional Summary

* ***** ** ********** ** Data Engineering, Data Pipeline Design, Development and Implementation as a Sr. Data Engineer/Data Developer and Data Modeler.

Strong experience in the Software Development Life Cycle (SDLC), including Requirements Analysis, Design Specification, and Testing, in both Waterfall and Agile methodologies.

Worked on ETL migration services by creating and deploying AWS Lambda functions to provide a serverless data pipeline whose output is registered in the Glue Catalog and queried from Athena (see the sketch at the end of this summary).

Developed ETL pipelines in and out of the data warehouse using a mix of Python and Snowflake's SnowSQL, writing SQL queries against Snowflake.

Good knowledge on AWS IAM services – IAM policies, Roles, Users, Groups, AWS access keys and Multi Factor Authentication.

Analytics and cloud migration from on-premises to AWS Cloud with AWS EMR, S3, and DynamoDB.

Experience in creating and managing reporting and analytics infrastructure for internal business clients using AWS services including Athena, Redshift, Spectrum, EMR, and Quick Sight.

Created an Azure SQL database, monitored it, and restored it. Migrated Microsoft SQL server to Azure SQL database.

Experience with Azure Cloud, Azure Data Factory, Azure Data Lake Storage, Azure Synapse Analytics, Azure Analysis Services, Big Data technologies (Apache Spark), and Databricks.

Extensive experience developing and implementing cloud architecture on Microsoft Azure.

Strong experience in writing scripts using Python API, PySpark API and Spark API for analyzing the data.

Maintained BigQuery, PySpark, and Hive code by fixing bugs and providing the enhancements required by the business users.

Knowledge of the Cloudera platform and Apache Hadoop 0.20.x, as well as OLAP and OLTP.

Well experienced in normalization and de-normalization techniques for optimal performance in relational and dimensional database environments.

Expertise in designing complex mappings and in performance tuning of Slowly Changing Dimension tables and Fact tables.

Experienced in building automated regression scripts for validation of ETL processes between multiple databases such as Oracle, SQL Server, Hive, and MongoDB using Python.

Experience in developing MapReduce programs using Apache Hadoop for analyzing big data as per requirements.

Experience in working with Flume and NiFi for loading log files into Hadoop.

Experience in working with NoSQL databases like HBase and Cassandra.

Good Experience in implementing and orchestrating data pipelines using Oozie and Airflow.

Strong understanding of dimensional modelling (Star Schema and Snowflake Schema) and the BI stack, including SSRS, Tableau, and Power BI.

Good knowledge of Data Marts, OLAP, Dimensional Data Modeling with Ralph Kimball Methodology (Star Schema Modeling, Snow-Flake Modeling for FACT and Dimensions Tables) using Analysis Services.

Knowledge of working with Proofs of Concept (POCs) and gap analysis; gathered necessary data for analysis from different sources and prepared data for exploration using data munging and Teradata.
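
The following is a minimal, illustrative sketch of the serverless Lambda-to-Glue/Athena pattern described above. The bucket, prefix, and crawler names are hypothetical placeholders, and the sketch assumes an S3 put event trigger; it is not code from any of the engagements below.

    import json
    import boto3

    s3 = boto3.client("s3")
    glue = boto3.client("glue")

    # Hypothetical names, used for illustration only.
    RAW_BUCKET = "example-raw-bucket"
    CURATED_PREFIX = "curated/events/"
    CRAWLER_NAME = "example-curated-crawler"

    def handler(event, context):
        # Pull the object key of the file that triggered this invocation (S3 put event).
        src_key = event["Records"][0]["s3"]["object"]["key"]

        # Trivial "transform": re-write the object under a curated prefix.
        obj = s3.get_object(Bucket=RAW_BUCKET, Key=src_key)
        s3.put_object(
            Bucket=RAW_BUCKET,
            Key=CURATED_PREFIX + src_key.split("/")[-1],
            Body=obj["Body"].read(),
        )

        # Start a Glue crawler so the curated data is registered in the Glue Catalog
        # and becomes queryable from Athena.
        glue.start_crawler(Name=CRAWLER_NAME)
        return {"statusCode": 200, "body": json.dumps({"processed": src_key})}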

Technical Skills

Big Data Ecosystem: HDFS, MapReduce, HBase, Pig, Hive, Sqoop, Kafka, Flume, Impala, Oozie, Zookeeper, MapR, Amazon Web Services (AWS), EMR

Cloud Technologies: AWS, Azure, Snowflake (data warehouse)

IDEs: IntelliJ, Eclipse, Spyder, Jupyter

Databases: Oracle 11g/10g/9i, MySQL, DB2, MS SQL Server, HBase

Programming / Query Languages: SQL, Python (Pandas, NumPy, SciPy, Scikit-Learn, Seaborn, Matplotlib, NLTK), NoSQL, PySpark, PySpark SQL, PL/SQL, Linux shell scripts, Scala

Data Engineering / Big Data Tools / Cloud / Visualization / Other Tools: Databricks, Terraform, Kafka, Airflow, Hadoop Distributed File System (HDFS), Hive, Pig, Sqoop, MapReduce, data integration, YARN, Hortonworks, Cloudera, MLlib, Oozie, Zookeeper, AWS, Azure Databricks, Azure Data Explorer, Azure HDInsight, Linux, PuTTY, Bash shell, Unix, Tableau, Power BI

Professional Experience

Sr. Data Engineer

Cummins, Columbus, Indiana April 2021 - Present

Responsibilities:

Exported data into Snowflake by creating staging tables to load data from different file types in Amazon S3 (see the sketch following this section's Environment line).

Developed test automation using SQL, Hive Query Language (HQL), and Python to validate data accuracy and quality.

Performed end-to-end architecture and implementation assessments of various AWS services such as Amazon EMR, Redshift, and S3.

As part of the data migration, wrote many SQL scripts to identify data mismatches and worked on loading the history data from Teradata SQL to Snowflake.

Prepared scripts to automate the ingestion process using PySpark and Python as needed from various sources such as APIs, AWS S3, Teradata, and Redshift.

Using Bash & Python included Boto3 to supplement automation provided by Ansible and Terraform for tasks such as encrypting EBS volumes backing AMI’s.

Managed AWS infrastructure as code using Terraform.

Extensively involved in infrastructure as code, execution plans, resource graph and change automation using Terraform.

Collected data using Spark Streaming from an AWS S3 bucket in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.

Retrieved data from various sources into S3 using Spark commands.

Developed and implemented ETL pipelines on S3 parquet files in a data lake using AWS Glue.

Developed a CloudFormation template in JSON format to utilize content delivery with cross-region replication using Amazon Virtual Private Cloud.

Built S3 buckets, managed their bucket policies, and used S3 and Glacier for storage and backup on AWS.

Implemented AWS solutions using EC2, S3, RDS, EBS. Used IAM to create new accounts, roles and groups.

Analyzed the SQL scripts and designed solutions to implement them using both Scala and Python.

Converted Hive/SQL queries into transformations using Python.

Created performance dashboards in Tableau, Excel, and PowerPoint for the key stakeholders.

Created various notebooks in Databricks to perform various transformations.

Implemented a defect tracking process using JIRA by assigning bugs to the development team.

Created Metric tables and end-user views in Snowflake to feed data for Tableau refreshes.

Generated custom SQL to verify the dependencies for the daily, weekly, and monthly jobs.

Worked with the Spark ecosystem using Spark SQL and Scala queries on different formats such as text and CSV files.

Worked with Airflow to schedule various jobs and automate the process.

Worked with Kafka to process data and connect Kafka with various development tools.

Developed Spark code and Spark SQL/Streaming jobs for faster testing and processing of data.

Environment: Snowflake, ETL, Databricks, Kafka, Apache Airflow, PySpark, Spark SQL, SQL, AWS Glue, Athena, Redshift, Python, GitHub, EMR, Nebula Metadata, Teradata, Terraform, SQL Server, Apache Spark, Sqoop.
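
A minimal sketch of the S3-to-Snowflake staging/load pattern referenced above, using the snowflake-connector-python package; the account, warehouse, stage, and table names are hypothetical placeholders, and the external stage is assumed to already point at the S3 bucket.

    import snowflake.connector

    # Hypothetical connection parameters, for illustration only.
    conn = snowflake.connector.connect(
        account="example_account",
        user="example_user",
        password="example_password",
        warehouse="LOAD_WH",
        database="ANALYTICS",
        schema="STAGING",
    )

    cur = conn.cursor()
    try:
        # Load the staged S3 files into the staging table via the external stage.
        cur.execute("""
            COPY INTO STAGING.ORDERS_RAW
            FROM @S3_ORDERS_STAGE/daily/
            FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"')
            ON_ERROR = 'CONTINUE'
        """)
        # Simple row-count check so downstream validation jobs can compare totals.
        cur.execute("SELECT COUNT(*) FROM STAGING.ORDERS_RAW")
        print("rows loaded:", cur.fetchone()[0])
    finally:
        cur.close()
        conn.close()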

Big Data Engineer

Travelport, Englewood, CO August 2019 to March 2021

Responsibilities:

Installing, configuring, and maintaining data pipelines.

Ingestion of data into one or more Azure Services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processing of data in Azure Databricks.

Monitored the SQL scripts and modified them for improved performance using PySpark SQL.

Managed relational database service in which Azure SQL manages scalability, stability, and maintenance.

Integrated data storage options with Spark, notably with Azure Data Lake Storage and Blob storage.

Configured stream analytics, Event Hubs, and worked with Azure to manage IoT solutions.

Successfully executed a proof of concept for Azure implementation, with the wider objective of transferring on-premises servers and data to the cloud.

Analyzed data quality issues using SnowSQL by creating analytical warehouses on Snowflake.

Created UDFs in Scala and PySpark to satisfy specific business requirements.

Used Hive queries to analyze huge data sets of structured, unstructured, and semi-structured data.

Used structured data in Hive to enhance performance using techniques such as bucketing, partitioning, and optimizing self-joins.

Transforming business problems into Big Data solutions and defining the Big Data strategy and roadmap.

Designing the business requirement collection approach based on the project scope and SDLC methodology.

Authoring Python (PySpark) scripts for custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling, and all cleaning and conforming tasks.

Developed solutions to leverage ETL tools and identified opportunities for process improvements using Informatica and Python.

Conducted root cause analysis and resolved production problems and data issues; handled performance tuning, code promotion, and testing of application changes.

Loaded data from different sources into Snowflake to perform data aggregations for business intelligence using SQL.

Used Sqoop to channel data between RDBMS sources and HDFS.

Developed Spark applications using PySpark and Spark-SQL for data extraction, transformation and aggregation from multiple file formats.

Used Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS using Python and NoSQL databases such as HBase and Cassandra (see the sketch following this section's Environment line).

Validated the test data in DB2 tables on Mainframes and on Teradata using SQL queries.

Environment: Microsoft Azure, Databricks, Snowflake, Python, SQL, PySpark, HDFS, NiFi, Hive, Kafka, Scrum, Git, Informatica, Tableau, OLTP, OLAP, HBase, SQL Server.
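
A rough PySpark Structured Streaming sketch of the Kafka-to-HDFS flow mentioned above; the broker addresses, topic, and HDFS paths are hypothetical, and the job assumes the Spark Kafka connector is available on the cluster.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = (SparkSession.builder
             .appName("kafka-to-hdfs-sketch")
             .getOrCreate())

    # Hypothetical broker list and topic name.
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
              .option("subscribe", "booking-events")
              .option("startingOffsets", "latest")
              .load())

    # Kafka keys/values arrive as binary; cast to strings before persisting.
    parsed = events.select(col("key").cast("string"),
                           col("value").cast("string"),
                           col("timestamp"))

    # Persist the stream to HDFS as Parquet, with checkpointing for fault tolerance.
    query = (parsed.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/streams/booking_events")
             .option("checkpointLocation", "hdfs:///checkpoints/booking_events")
             .outputMode("append")
             .start())

    query.awaitTermination()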

Azure Data Engineer

Experian, Costa Mesa, CA January 2017 to July 2019

Responsibilities:

Migrated SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlled and granted database access; and migrated on-premises databases to Azure Data Lake Store using Azure Data Factory.

Developed Spark applications using PySpark and Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage, consumption patterns, and behavior (see the sketch following this section's Environment line).

Skilled in dimensional modelling (Star schema, Snowflake schema), forecasting with large-scale datasets, transactional modelling, and SCD (Slowly Changing Dimensions).

Developed scripts to transfer data from FTP server to the ingestion layer using Azure CLI commands.

Developed an automated process in the Azure cloud to ingest data daily from a web service and load it into Azure SQL DB.

Created Azure HDInsight clusters using PowerShell scripts to automate the process.

Used the Stored Procedure, Lookup, Execute Pipeline, Data Flow, Copy Data, and Azure Function activities in ADF.

Used Azure Data Lake Storage Gen2 to store Excel and Parquet files and retrieved user data using the Blob API.

Worked on Azure Databricks, PySpark, Spark SQL, Azure DW, and Hive to load and transform data.

Used Azure Data Lake as a source and pulled data using PolyBase.

Used Azure Data Lake and Azure Blob storage for storage and performed analytics in Azure Synapse Analytics.

Worked on Azure Cloud, Azure Data Factory, Azure Data Lake Storage, Azure Synapse Analytics, Azure Analysis Services, Azure Cosmos DB (NoSQL), Azure HDInsight Big Data technologies (Hadoop and Apache Spark), and Databricks.

Developed automated processes for flattening the upstream data from Cassandra, which arrives in JSON format, and used Hive UDFs to flatten the JSON data.

Worked on Data loading into Hive for Data Ingestion history and Data content summary.

Created Impala tables and Secure File Transfer Protocol (SFTP) scripts, and Shell scripts to import data into Hadoop.

Created Hive tables and involved in data loading and writing Hive User defined functions (UDFs). Developed Hive UDFs for rating aggregation.

Provided ad-hoc queries and data metrics to the Business Users using Hive, Pig.

Environment: PL/SQL, Python, Azure Data Factory, Azure Blob Storage, Azure Table Storage, Azure SQL Server, Apache Hive, Apache Spark, MDM, Teradata, Oracle 12c, SQL Server, Teradata SQL Assistant, Microsoft Word/Excel, Flask, Snowflake, DynamoDB, MongoDB, Pig, Sqoop, Tableau, Power BI, UNIX, Docker, Kubernetes.
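
An illustrative Databricks-style PySpark sketch of reading Parquet from ADLS Gen2 and writing an aggregated result back, in the spirit of the usage/consumption analysis described above; the storage account, container names, paths, and column names are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("adls-usage-aggregation-sketch").getOrCreate()

    # Hypothetical ADLS Gen2 locations: abfss://<container>@<account>.dfs.core.windows.net/<path>
    SOURCE_PATH = "abfss://raw@examplestorageacct.dfs.core.windows.net/usage/parquet/"
    TARGET_PATH = "abfss://curated@examplestorageacct.dfs.core.windows.net/usage_daily/"

    # Read raw usage events, then aggregate consumption per customer per day.
    usage = spark.read.parquet(SOURCE_PATH)

    daily = (usage
             .withColumn("event_date", F.to_date("event_timestamp"))
             .groupBy("customer_id", "event_date")
             .agg(F.sum("consumption").alias("total_consumption"),
                  F.count("*").alias("event_count")))

    # Write the aggregate partitioned by date for downstream reporting (Tableau/Power BI).
    (daily.write
     .mode("overwrite")
     .partitionBy("event_date")
     .parquet(TARGET_PATH))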

ETL Developer

Dhruvsoft Services Private Limited, Hyderabad, India June 2015 - Nov 2016

Responsibilities:

Assisted in designing the overall ETL solutions, including analyzing data, preparing high-level and detailed design documents, test and data validation plans, and the deployment strategy.

Prepared the technical mapping specifications, process flow and error handling document.

Developed both simple and complex mappings implementing complex business logic using a variety of transformations such as Unconnected and Connected Lookups, Router, Filter, Expression, Aggregator, Joiner, Update Strategy, Unconnected and Connected Stored Procedures, Normalizer, and more.

Used Informatica Cloud Data Integration for global, distributed data warehouse and analytical projects.

Performed data integration using Informatica for identifying duplicates.

Experienced in data integration, data management, and data warehousing.

Extensive experience in Extraction, Transformation, and Loading (ETL) of data from various data sources into Data Warehouses and Data Marts using Informatica PowerCenter tools (Repository Manager, Designer, Workflow Manager, Workflow Monitor, and Informatica Administration Console).

Worked on dashboard reports, data integration, ETL components.

Created various tasks such as Pre/Post-Session Commands, Timer, Event Wait, Event Raise, Email, and Command tasks.

Experienced in writing live Real-time Processing using Spark Streaming with Kafka.

Experience in integrating various data sources such as Oracle, DB2, SQL Server, flat files, mainframes, and XML files into the Data Warehouse, as well as in data cleansing and data analysis.

Used HiveQL to analyze the partitioned and bucketed data and compute various metrics for reporting.

Involved in managing and monitoring Hadoop cluster using Cloudera Manager.

Used Python and Shell scripting to build pipelines.

Developed a data pipeline using Sqoop, HQL, Spark, and Kafka to ingest enterprise message delivery data into HDFS.

Developed workflows in Oozie and Airflow to automate the tasks of loading data into HDFS and pre-processing it with Pig and Hive (see the sketch following this section's Environment line).

Assisted in creating and maintaining technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.

Extensively worked in database components like SQL, PL/SQL, Stored Procedures, Stored Functions, Packages and Triggers.

Performed code reviews, troubleshot existing Informatica mappings, and deployed code from development to test to production environments.

Environment: Hadoop, HDFS, Spark, HiveQL, Kafka, Oozie, Pig, Airflow, Informatica, Oracle, PL/SQL, SQL Server, Unix, Linux, Shell Scripting.
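
A condensed Airflow sketch of the Sqoop-ingest-then-Hive-preprocess workflow mentioned above; the DAG id, JDBC connection string, credentials handling, table name, and HDFS paths are hypothetical placeholders.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # Hypothetical daily DAG: ingest a table into HDFS with Sqoop, then register the
    # new partition in Hive for downstream Pig/Hive pre-processing.
    with DAG(
        dag_id="message_delivery_ingest_sketch",
        start_date=datetime(2016, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:

        sqoop_import = BashOperator(
            task_id="sqoop_import",
            bash_command=(
                "sqoop import "
                "--connect jdbc:oracle:thin:@db-host:1521/ORCL "
                "--username etl_user --password-file /user/etl/.pwd "
                "--table MESSAGE_DELIVERY "
                "--target-dir /data/raw/message_delivery/{{ ds }} "
                "--num-mappers 4"
            ),
        )

        hive_add_partition = BashOperator(
            task_id="hive_add_partition",
            bash_command=(
                "hive -e \"ALTER TABLE raw.message_delivery "
                "ADD IF NOT EXISTS PARTITION (dt='{{ ds }}') "
                "LOCATION '/data/raw/message_delivery/{{ ds }}'\""
            ),
        )

        # Ingest first, then expose the new data to Hive.
        sqoop_import >> hive_add_partition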

Jr. Data Engineer

IBing Software Solutions Sep 2014 to May 2015

Responsibilities:

Developed Conceptual and Logical Data Models and transformed them into physical schemas using ERwin.

Developed the Data Mart for the base data in Star Schema and Snowflake Schema and was involved in developing the data warehouse for the database.

Performed forward engineering of the data models, reverse engineering of the existing data models, and updates to the data models.

Performed database cloning to synchronize the test and development databases with the production database.

Wrote procedures and packages using dynamic PL/SQL.

Created database tables, views, indexes, triggers, and sequences, and developed the database structure.

Performed the data cleansing process on staging tables.

Wrote complex SQL, PL/SQL procedures, functions, and packages to validate data and support the testing process (see the sketch following this section's Environment line).

Normalized the data and developed the Star Schema.

Reduced query CPU cost by applying optimizer hints and tuning SQL queries.

Extensively performed manual testing.

Extensively generated both logical and physical reports from Erwin.

Involved in developing the Data warehouse for the complete Actuarial Information System Application.

Worked with DBAs after generating DDL to create tables in the database.

Environment: Erwin Data Modeler 7, Oracle 8i, SQL Server 2008, ER/Studio, Informatica, MS Office (Word, Excel, PowerPoint, Project), TOAD.
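
A small Python sketch, using the cx_Oracle driver, of the kind of staging-table validation and cleansing call described above; the connection details, table names, and procedure name are illustrative assumptions rather than the actual objects used.

    import cx_Oracle

    # Hypothetical connection details, for illustration only.
    conn = cx_Oracle.connect("etl_user", "etl_password", "db-host:1521/ORCL")
    cur = conn.cursor()

    # Compare row counts between a staging table and its target dimension,
    # a simple check used while validating data loaded into staging tables.
    cur.execute("SELECT COUNT(*) FROM stg_policy")
    staging_count = cur.fetchone()[0]

    cur.execute("SELECT COUNT(*) FROM dim_policy")
    target_count = cur.fetchone()[0]

    if staging_count != target_count:
        print("Row count mismatch: staging=%d, target=%d" % (staging_count, target_count))
    else:
        print("Row counts match")

    # Invoke a hypothetical PL/SQL cleansing procedure on the staging table.
    cur.callproc("pkg_cleansing.cleanse_policy_stage")
    conn.commit()

    cur.close()
    conn.close()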


