
Senior Data Engineer

Location:
Marietta, GA
Posted:
August 18, 2023


Resume:

Dharma raj T

Email: ady1bo@r.postjobfree.com PH: +1-470-***-****

Sr. Data Engineer.

PROFESSIONAL SUMMARY

Over nine years of IT experience across a variety of industries, with hands-on experience in Big Data analytics design and development.

Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Elastic Load Balancing, Auto Scaling, CloudFront, CloudWatch, SNS, SES, SQS, and other services in the AWS family.

Good working experience with cutting-edge technologies such as Kafka, Spark, and Spark Streaming.

Experience setting up instances behind an Elastic Load Balancer in AWS for high availability, and cloud integration with AWS using Elastic MapReduce (EMR).

Experience working with the Hadoop ecosystem integrated with the AWS cloud platform, using services such as Amazon EC2 instances, S3 buckets, and Redshift.

Good experience working with Azure Cloud Platform services like Azure Data Factory (ADF), Azure Data Lake, Azure Blob Storage, Azure SQL Analytics, HDInsight/Databricks.

Experience in collecting, processing, and aggregating large amounts of streaming data using Kafka, Spark Streaming.

Hands-on experience working with the Databricks environment and the Lakehouse architecture.

Hands-on experience working with the Snowflake database.

Set up Databricks on AWS and Microsoft Azure, configured Databricks workspaces for business analytics, and managed clusters in Databricks.

Partnered with cross-functional teams across the organization to gather requirements, architect, and develop proofs of concept for enterprise Data Lake environments such as Cloudera, Hortonworks, AWS, and Azure.

Experience in designing Data Marts by following Star Schema and Snowflake Schema Methodology.

Experience in Spark-Scala programming with good knowledge on Spark Architecture and its In-memory Processing.

Experience in designing and developing applications in Spark using Python to compare the performance of Spark with Hive.

Hands-on experience with AWS data analytics services such as Athena, Glue Data Catalog, and QuickSight.

Excellent understanding and knowledge of NoSQL databases such as HBase and Cassandra.

Experienced in converting Hive/SQL queries into Spark transformations using Spark Data Frames and Python.
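
For illustration, a minimal PySpark sketch of the kind of Hive-to-DataFrame conversion described above; the database, table, and column names are hypothetical placeholders, not ones from an actual project.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hive_to_dataframe").enableHiveSupport().getOrCreate()

# Original Hive query (hypothetical):
#   SELECT region, SUM(amount) AS total_amount
#   FROM sales_db.orders
#   WHERE order_date >= '2023-01-01'
#   GROUP BY region
# Equivalent DataFrame transformation:
orders = spark.table("sales_db.orders")
totals = (
    orders
    .filter(F.col("order_date") >= "2023-01-01")
    .groupBy("region")
    .agg(F.sum("amount").alias("total_amount"))
)
totals.show()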

Extensive experience working with the Spark distributed framework, including Resilient Distributed Datasets (RDDs) and DataFrames, using Python, Scala, and Java 8.

Strong skills in visualization tools such as Power BI and Microsoft Excel, including formulas, pivot tables, charts, and DAX commands.

Proficient in Data Analysis, Cleansing, Transformation, Data Migration, Data Integration, Data Import, and Data Export through use of ETL tools such as Informatica, SSIS.

Good knowledge of Data Marts, OLAP, and dimensional data modeling with the Ralph Kimball methodology (star schema and snowflake schema modeling for fact and dimension tables) using Analysis Services.

Implemented continuous integration and delivery (CI/CD) pipelines using AWS CodeBuild and GitHub.

TECHNICAL SKILLS:

Databases: Oracle, MySQL, MS SQL, NoSQL, RDBMS, SQL Server 2014, HBase 1.2, Teradata, Cassandra.

Big Data Frameworks: HDFS, MapReduce, Hive, Sqoop, HBase, Amazon EC2, EMR, S3, Redshift, Spark, Storm, Impala, Kafka.

Programming Languages: Python, Java, R, SQL, Scala, UNIX

Machine Learning: Regression, Clustering, SVM, Decision Trees, Classification, Recommendation Systems, Association Rules, Survival Analysis, etc.

Data Visualization: Tableau 9.4/9.2, Power BI.

Modeling tools: Erwin, ER Studio.

Professional Experience:

Senior Data Engineer

Nike - Atlanta, GA
Sept 2021 to Present

Responsibilities:

Collaborated with Business Analysts, SMEs across departments to gather business requirements, and identify workable items for further development.

Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and RDDs.

Used EMR and PySpark for data processing and compute, and stored output data in the S3 data lake.

Consumed data from multiple file formats and stored the resulting data in Parquet format.
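
For illustration, a minimal PySpark sketch of the pattern in the two bullets above (read mixed-format sources on EMR, write Parquet to the S3 data lake); bucket names and columns are placeholders, not the actual pipeline.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("emr_batch_job").getOrCreate()

# Source data arrives in different formats (paths and columns are hypothetical)
csv_df = spark.read.option("header", "true").csv("s3://source-bucket/landing/csv/")
json_df = spark.read.json("s3://source-bucket/landing/json/")

# Align schemas, combine, and derive a date column for partitioning
combined = (
    csv_df.select("id", "event_ts", "payload")
    .unionByName(json_df.select("id", "event_ts", "payload"))
    .withColumn("event_date", F.to_date("event_ts"))
)

# Persist the result to the S3 data lake in Parquet format
combined.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://lake-bucket/curated/events/"
)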

Hands-on experience ingesting data from AWS S3 into Snowflake and vice versa.
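
A sketch of one way such an S3-to-Snowflake load (and unload) can be scripted with the Snowflake Python connector; the account, credentials, stage, and table names below are placeholders and assume external stages pointing at S3 already exist.

import snowflake.connector

# All connection parameters and object names are hypothetical placeholders
conn = snowflake.connector.connect(
    account="my_account",
    user="etl_user",
    password="********",
    warehouse="ETL_WH",
    database="ANALYTICS",
    schema="RAW",
)
try:
    cur = conn.cursor()
    # Load Parquet files from an external S3 stage into a Snowflake table
    cur.execute("""
        COPY INTO RAW.EVENTS
        FROM @S3_EVENTS_STAGE
        FILE_FORMAT = (TYPE = PARQUET)
        MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
    """)
    # Unload the table back to S3 (the "vice versa" direction)
    cur.execute("""
        COPY INTO @S3_EXPORT_STAGE/events/
        FROM RAW.EVENTS
        FILE_FORMAT = (TYPE = PARQUET)
        HEADER = TRUE
    """)
finally:
    conn.close()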

Performed data preprocessing and feature engineering for further predictive analytics using Python Pandas.
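
A small pandas sketch of the kind of preprocessing and feature engineering referred to above; the input path and column names are illustrative only.

import pandas as pd

# Path and columns are hypothetical; reading from S3 assumes pyarrow/s3fs are installed
df = pd.read_parquet("s3://lake-bucket/curated/orders/")
df["order_ts"] = pd.to_datetime(df["order_ts"])

# Basic cleaning and derived features
df["amount"] = df["amount"].fillna(df["amount"].median())
df["order_dow"] = df["order_ts"].dt.dayofweek
df["is_weekend"] = df["order_dow"].isin([5, 6]).astype(int)
df = pd.get_dummies(df, columns=["channel"], drop_first=True)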

Generated report on predictive analytics using Python and Tableau including visualizing model performance and prediction results.

Developed ETL jobs to create aggregated tables in Snowflake from various data sources.

Used Snowflake as the cloud data warehouse for BI and reporting.

Implemented a Lakehouse architecture powered by Databricks to migrate pipelines from AWS to the Lakehouse, using capabilities such as Unity Catalog, Databricks SQL, Delta Live Tables (DLT), and Databricks Workflows.

Used Airflow for scheduling and orchestration of data pipelines; experienced in troubleshooting Airflow ingestion jobs.
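
A minimal Airflow DAG sketch showing the scheduling/orchestration pattern described above; the DAG id, schedule, and ingestion callable are placeholders.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def run_ingestion():
    # Placeholder for the real ingestion step (e.g. submitting an EMR or Databricks job)
    print("ingestion started")

default_args = {"owner": "data-eng", "retries": 2, "retry_delay": timedelta(minutes=10)}

with DAG(
    dag_id="s3_to_snowflake_ingestion",  # hypothetical pipeline name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    ingest = PythonOperator(task_id="ingest", python_callable=run_ingestion)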

Designed and implemented Model and Pipeline validation procedures alongside teams of Data Scientists, Data Engineers, and other Client Engineers.

Worked with Delta tables on Databricks; hands-on experience with the Databricks runtime and performance tuning of Spark jobs.

Scaled up Airflow cluster resources when ingestion jobs ran long.

Experience in integrating Jenkins with various tools like Maven (Build tool) & Git (Repository).

Created AWS Lambda functions and assigned IAM roles to run Python scripts, and used Lambda with Java to perform event-driven processing. Created Lambda jobs and configured roles using the AWS CLI.
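
A minimal Python Lambda handler sketch for the event-driven pattern described above, assuming an S3 put event as the trigger; the processing step is a placeholder and bucket/key names come from the event payload.

import json

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Triggered by an S3 put event; reads each new object and processes it."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        obj = s3.get_object(Bucket=bucket, Key=key)
        payload = obj["Body"].read()
        # Placeholder processing step
        print(f"processed {key}: {len(payload)} bytes")
    return {"statusCode": 200, "body": json.dumps("ok")}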

Utilized Agile and Scrum methodology for team and project management.

Environment: Spark (PySpark, Spark SQL, Spark MLlib), Python 3.x (scikit-learn, NumPy, pandas), Snowflake, Databricks, Tableau 10.1, GitHub, AWS EMR/EC2/S3/Athena/Lambda.

Senior Data Engineer

Alliance Bernstein - Atlanta, GA
May 2018 to Sept 2021

Responsibilities:

Created and maintained reporting infrastructure to facilitate visual representation of manufacturing data for operations planning and execution.

Extracted, transformed, and loaded data from source systems into Azure data storage services using a combination of Azure Data Factory, Azure Databricks, Blob Storage, Azure Synapse, T-SQL, Spark SQL, and Azure Analytics.

Extensively worked on Spark Streaming and Apache Kafka to fetch live stream data.
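
A minimal Spark Structured Streaming sketch of the Kafka consumption pattern described above; broker addresses, topic, and output paths are placeholders, and the spark-sql-kafka package is assumed to be on the classpath.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka_stream").getOrCreate()

# Broker list and topic name are hypothetical
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "clickstream")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; cast the value to string for downstream parsing
events = stream.select(F.col("value").cast("string").alias("raw_event"))

query = (
    events.writeStream.format("parquet")
    .option("path", "abfss://curated@datalake.dfs.core.windows.net/clickstream/")
    .option("checkpointLocation", "abfss://curated@datalake.dfs.core.windows.net/_chk/clickstream/")
    .start()
)
query.awaitTermination()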

Constructed product-usage SDK data and data aggregations by using PySpark, Scala.

Worked with Azure HDInsight clusters and migrated Hive scripts to Spark.

Installed Kafka Manager to track consumer lag and monitor Kafka metrics; also used it for adding topics, partitions, etc.

Created configuration files to deploy SSIS packages across all environments. Wrote queries in SQL and R to extract, transform, and load (ETL) data from large datasets using data staging.

Implemented CI/CD pipelines using Jenkins and built and deployed the applications.

Developed RESTful endpoints to cache application-specific data in in-memory data stores such as Redis.

Created notebooks for moving data from the raw zone to the staging and curated zones using Azure Databricks; hands-on experience working with Delta tables.
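
A short Azure Databricks notebook sketch of the raw-to-curated movement described above; the storage paths and de-duplication key are placeholders, and spark is the session a Databricks notebook provides.

from pyspark.sql import functions as F

# Container and path names are hypothetical
raw_path = "abfss://raw@datalake.dfs.core.windows.net/sales/"
curated_path = "abfss://curated@datalake.dfs.core.windows.net/sales/"

raw_df = spark.read.json(raw_path)

staged = (
    raw_df
    .dropDuplicates(["order_id"])
    .withColumn("ingest_date", F.current_date())
)

# Persist to the curated zone as a Delta table
staged.write.format("delta").mode("append").save(curated_path)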

Performed data pre-processing and cleaning to support feature engineering, and applied data imputation techniques for missing values in the dataset using Python.

Created data quality scripts using SQL and Hive to validate successful data loads and the quality of the data. Created various types of data visualizations using Python and Tableau.

Worked with machine learning engineering teams on EDA and feature extraction, and provided the infrastructure to run their ML models.

Responsible for performing Machine-learning techniques regression/classification to predict the outcomes.

Managed CI/CD pipelines as part of DevOps on this project and worked to automate the workflow.

Familiar with data architecture including data ingestion pipeline design, Hadoop information architecture, data modelling and data mining, machine learning and advanced data processing.

Environment: Hadoop, ETL operations, Data Warehousing, Data Modelling, Azure Data Factory, Azure Databricks, Advanced SQL methods, Python, Linux, Apache Spark, Scala, Spark-SQL, HBase.

Data Engineer

Monsanto/Bayer - St Louis, MO
Aug 2016 to May 2018

Responsibilities:

Transformed batch data from several tables containing tens of thousands of records from SQL Server, MySQL, PostgreSQL, and CSV file datasets into data frames using PySpark.

Developed a PySpark program that writes data frames to HDFS as Avro files.
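
A minimal PySpark sketch of the JDBC-to-Avro flow in the two bullets above; the connection details, table, and HDFS path are placeholders, and the spark-avro package is assumed to be available.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc_to_avro").getOrCreate()

# JDBC URL, credentials, and table name are hypothetical
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/sales")
    .option("dbtable", "public.orders")
    .option("user", "etl_user")
    .option("password", "********")
    .load()
)

# Write the DataFrame to HDFS in Avro format
orders.write.format("avro").mode("overwrite").save("hdfs:///data/raw/orders_avro/")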

Migrated pipelines to run on AWS EMR clusters and store the data on S3.

Utilized Spark's parallel processing capabilities to ingest data.

Created and executed HQL scripts that create external tables in a raw layer database in Hive.

Developed a script that copies Avro-formatted data from HDFS to external tables in the raw layer.

Created PySpark code that uses Spark SQL to generate DataFrames from the Avro-formatted raw layer and writes them to data service layer internal tables in ORC format.


Owned the PySpark code that creates DataFrames from tables in the data service layer and writes them to a Hive data warehouse.

Hosted client calls for project planning and business analysis to provide oversight of long-term and short-term technical project initiatives.

Developed Airflow DAGs in python by importing the Airflow libraries.

Utilized Airflow to schedule, automatically trigger, and execute data ingestion pipelines.

Partnered with ETL developers to ensure data was well cleaned and the data warehouse stayed up to date for reporting purposes using Pig.

Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend, and pair RDDs.

Used Informatica PowerCenter for extraction, transformation, and loading (ETL) of data from heterogeneous source systems into the target database.

Created Mappings using Designer and extracted data from various sources, transformed data according to the requirement.

Parsed high-level design specification to simple ETL coding and mapping standards.

Designed and customized data models for the data warehouse, supporting data from multiple sources in real time.

Environment: AWS Services, Hadoop, Map Reduce, Hive, Sqoop, Spark, Teradata, Parquet, Oracle 12c, SQL Plus, TOAD, SQL Loader, SQL Developer, PL/SQL, Informatica Power Center 10.2.0, Designer, Workflow Manager, Workflow Monitor, Kafka, GitHub, Shell Scripts, Splunk, Redshift.

Python/SQL Developer

Coke - India

June 2014 to May 2016

Responsibilities:

Performed T-SQL tuning and optimized queries and SSIS packages.

Designed Distributed algorithms for identifying trends in data and processing them effectively.

Created an SSIS package to import data from SQL tables into different sheets in Excel.

Used Spark and Scala for developing machine learning algorithms that analyze clickstream data.

Used Spark SQL for data pre-processing, cleaning, and joining very large data sets.

Tested Hadoop MapReduce jobs developed in Python, Pig, and Hive.

Co-developed the SQL server database system to maximize performance benefits for clients.

Assisted senior-level data scientists in the design of ETL processes, including SSIS packages.

Performed database migrations from traditional data warehouses to Spark clusters.

Devised simple and complex SQL scripts to check and validate Dataflow in various applications.

Performed Data Preparation by using Pig Latin to get the right data format needed.

Experience in fact dimensional modeling (Star schema, Snowflake schema), transactional modeling and SCD (Slowly changing dimension).

Devised PL/SQL Stored Procedures, Functions, Triggers, Views, and packages. Made use of Indexing, Aggregation and Materialized views to optimize query performance.

Experience in Object Oriented Design and Programming concepts in Python.

Ensured the data warehouse was populated only with quality entries by performing regular cleaning and integrity checks.

Worked with Oracle relational tables and used them in process design.

Developed Databricks Python notebooks to Join, filter, pre-aggregate, and process the files stored in Azure data lake storage.
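
A small Databricks Python notebook sketch of the join/filter/pre-aggregate step described above; the ADLS paths and column names are placeholders.

from pyspark.sql import functions as F

# Paths and columns are hypothetical; spark is provided by the Databricks notebook
orders = spark.read.parquet("abfss://data@lakestore.dfs.core.windows.net/orders/")
customers = spark.read.parquet("abfss://data@lakestore.dfs.core.windows.net/customers/")

summary = (
    orders.filter(F.col("status") == "COMPLETE")
    .join(customers, on="customer_id", how="inner")
    .groupBy("region")
    .agg(
        F.sum("amount").alias("total_sales"),
        F.countDistinct("customer_id").alias("customers"),
    )
)

summary.write.mode("overwrite").parquet(
    "abfss://data@lakestore.dfs.core.windows.net/agg/sales_by_region/"
)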

Developed SQL queries to perform data extraction from existing sources to check format accuracy.

Developed automated tools and dashboards to capture and display dynamic data.

Developed Python scripts to extract the data from the web server output files to load into HDFS.
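
A small Python sketch of the log-extraction-to-HDFS step described above; the log format, paths, and HDFS directory are assumptions, and the upload relies on the standard hdfs dfs CLI.

import csv
import re
import subprocess
import tempfile

# Common Apache-style access log pattern (assumed format)
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3})'
)

def parse_access_log(path):
    """Yield (ip, timestamp, request, status) tuples from a web server log file."""
    with open(path) as f:
        for line in f:
            m = LOG_PATTERN.match(line)
            if m:
                yield m.group("ip"), m.group("ts"), m.group("request"), m.group("status")

def load_to_hdfs(local_log, hdfs_dir="/data/raw/web_logs/"):
    # Write the parsed records to a temporary CSV, then push it into HDFS via the CLI
    with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["ip", "timestamp", "request", "status"])
        writer.writerows(parse_access_log(local_log))
    subprocess.run(["hdfs", "dfs", "-put", "-f", out.name, hdfs_dir], check=True)

if __name__ == "__main__":
    load_to_hdfs("/var/log/httpd/access_log")  # hypothetical log location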

Coordinated data security issues and instructed other departments about secure data transmission and encryption.

Built various graphs for business decision making using Python Matplotlib library.
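
An illustrative Matplotlib sketch of the kind of chart referred to above; the data values are made up for the example.

import matplotlib.pyplot as plt
import pandas as pd

# Illustrative data only; real inputs came from the warehouse extracts described above
monthly = pd.DataFrame({"month": ["Jan", "Feb", "Mar", "Apr"], "sales": [120, 135, 128, 150]})

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(monthly["month"], monthly["sales"], color="steelblue")
ax.set_title("Monthly Sales")
ax.set_xlabel("Month")
ax.set_ylabel("Sales (units)")
fig.tight_layout()
fig.savefig("monthly_sales.png")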

Environment: SSIS, T-SQL, Hadoop, Spark SQL, Relational Databases, SQL, Linux, Erwin, Cloudera, Teradata, Data Validation, MS Excel.


