
Data Engineer Azure

Location:
Charlotte, NC, 28202
Posted:
July 17, 2023


Ajay Shanmuham

Charlotte, NC | YOE: *+ years

Email: adyckm@r.postjobfree.com | Phone: 408-***-****

Sr. Data Engineer | www.linkedin.com/in/ajay-shanmuham

Summary of Qualifications

Strong industry experience across telecom, e-commerce, and financial services, with keen awareness of cutting-edge technologies and their practical enterprise applications. Highly motivated, proactive, skilled, and dedicated data engineer, well-equipped to contribute to complex data projects and drive valuable insights from datasets.

Experience Synopsis

8+ years of Python/Java-related experience.

7+ years of hands-on experience with the AWS and Azure cloud platforms.

8+ years of SQL-related experience.

7+ years of Data integration/Warehousing experience.

6+ years of Airflow, Snowflake, and Databricks-related experience.

8+ years of Hadoop and related big-data ecosystem experience.

5+ years of feature engineering experience.

Provide mentorship to junior engineers.

Professional Summary

Proficient in Python and Julia; experienced in leveraging these languages for data analysis, designing applications in Spark, and scripting against Spark libraries with PySpark.

Experience in Spark/Scala programming with good knowledge of Spark architecture and its in-memory processing.

Experience in writing MapReduce programs in Java for data cleansing and preprocessing.

Extensive experience working with the Spark distributed framework, including Resilient Distributed Datasets (RDDs) and DataFrames, using Python, Scala, and PySpark.

Experience in collecting and aggregating large amounts of streaming data using Kafka and Spark Streaming.

Good Knowledge of Apache NiFi for automating and managing the data flow between systems.

Experience in designing Data Marts by following Star Schema and Snowflake Schema Methodology.

Deep experience in data pipelines and all phases of ETL and ELT processing, including converting big-data/unstructured data sets (JSON, log data) into structured data sets for product analysts and data scientists.

Experienced in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Python (see the sketch at the end of this summary).

Highly skilled in Business Intelligence tools like Tableau and PowerBI.

Excellent understanding and knowledge of NoSQL databases like HBase and Cassandra.

Expertise in data modeling, migration, design, and ETL pipeline preparation for cloud and On-prem platforms.

Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, and other AWS services.

Experience working in a Hadoop ecosystem integrated with the AWS cloud platform, using services such as Amazon EC2 instances, S3 buckets, and Redshift.

Good experience working with Azure Cloud Platform services like Azure Data Factory (ADF), Azure Data Lake, Azure Blob Storage, Azure SQL Analytics, and Azure Databricks.

Developed JSON Scripts for deploying the Pipeline in Azure Data Factory (ADF) that processes the data using the SQL Activity.

Hands-on experience in Service-Oriented Architecture (SOA), Event-Driven Architecture, Distributed Application Architecture, and Software as a Service (SaaS).

Excellent understanding/knowledge of Hadoop architecture and various components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, Resource Manager, and Node Manager.

Exposed to various software development methodologies such as Agile and Waterfall.

Proficient with version control tools such as Git, Subversion (SVN), and Bitbucket, with experience in tagging and branching on platforms like Linux and Windows.

Extensive experience in integrating various data sources like SQL Server, DB2, PostgreSQL, Oracle, and Excel.

Built an ETL process that invokes a Spark JAR to execute the business analytical model.
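
Below is a minimal PySpark sketch of the Hive/SQL-to-DataFrame conversion pattern mentioned above; the table and column names (sales, region, amount) are hypothetical placeholders, not details of any project listed here.

# Equivalent Hive/SQL:
#   SELECT region, SUM(amount) AS total_amount
#   FROM sales WHERE year = 2022 GROUP BY region
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("hive-to-dataframe")
         .enableHiveSupport()
         .getOrCreate())

df = (spark.table("sales")
      .where(F.col("year") == 2022)
      .groupBy("region")
      .agg(F.sum("amount").alias("total_amount")))
df.show()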

Education:

Anna University, Tamil Nadu, India

B.E. in Computer Science & Engineering, graduated 2013

Professional Experience

Sr. Data Engineer Charter Communications, Negaunee, MI March 2022 to Present

Responsibilities

Ingested and processed data through cleansing and transformations using AWS Lambda, AWS Glue, and Step Functions.

Leveraged AWS Kinesis Firehose for reliable data delivery to AWS services like S3, Redshift, and Elasticsearch, ensuring seamless data flow and storage.

Improved join performance by 60% by Z-ordering fields and bin-compacting files in Delta Lake, reducing storage blow-up during upsert operations from over 150 TB to under 2 TB for a single Delta table (see the sketch at the end of this section).

Reduced downtime by 30% after designing and assisting in implementing a monitoring system using ELK stack that performs timely analysis of failure points in the data pipeline.

Created and maintained the tables/views in the Snowflake data warehouse for downstream consumers.

Increased feature generation performance by 25% in Knowledge Graphs by optimizing queries to persist 15% fewer edges in the database.

Worked on Docker container snapshots, attaching to running containers, removing images, and managing directory structures and containers.

Improved performance by 15% after optimizing queries for Knowledge Graphs handling ~2 billion nodes and 10 billion edges.

Worked on aggregation functions such as COUNT, SUM, AVG, MAX, and MIN in MySQL to calculate summary statistics on data, essential for data analysis and reporting.

Improved feature generation throughput by 50% after custom-building an ELT data pipeline using Julia and bash.

Collected near-real-time data using Spark Streaming from AWS S3 bucket, performed transformations and aggregations on the fly, and persisted data in HDFS using Spark, Python, and PySpark.

Achieved 20% performance gain and 40% reduced utilization of storage by automating ELT for Knowledge Graph feature generation using Amazon Neptune for on-premises systems.

Leveraged AWS X-Ray for debugging and tracing the microservices-based architecture, improving troubleshooting time by 30%.

Reduced round time of backfills by 40%, from 18 hours down to 10 hours, for missing features by employing an automated pool of Amazon EC2 instances.

Developed and implemented over 1000 model features for Data Scientists, effectively facilitating enhanced data interpretation and enabling the derivation of valuable business insights.

Worked with data engineers and data architects to define back-end requirements for data products (aggregations, materialized views, tables, and visualizations).

Scheduled all data pipelines by creating DAGs in Airflow.

Created architecture stack blueprint for data access with NoSQL Database Cassandra.

Brought data from various sources into Hadoop and Cassandra using Kafka.

Created multiple dashboards in Tableau for multiple business needs.

Installed and configured Hive, wrote Hive UDFs, and used Piggybank, a repository of UDFs for Pig Latin.

Successfully implemented a POC (Proof of Concept) in development databases to validate the requirements and benchmark the ETL loads.

Supported continuous storage in AWS using Elastic Block Store (EBS), S3, and Glacier; created volumes and configured snapshots for EC2 instances.

Worked on ETL Migration services by developing and deploying AWS Lambda functions to generate a serverless data pipeline that can be written to Glue Catalog and queried from Athena.

Managed security groups on AWS, focusing on high availability, fault tolerance, and auto-scaling using Terraform templates, along with continuous integration and continuous deployment using AWS Lambda and AWS CodePipeline. Used Pandas in Python for cleansing and validating the source data.

Monitored and troubleshot data streaming issues using AWS CloudWatch and Kinesis Firehose, maintaining high performance and availability.
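
A minimal sketch of the Delta Lake compaction and Z-ordering step referenced above, assuming a Databricks or Delta Lake 2.x environment; the table name (events) and Z-order column (customer_id) are hypothetical placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-optimize").getOrCreate()

# Bin-compact small files and co-locate rows on the join key so data skipping
# can prune files during joins and MERGE/upsert operations.
spark.sql("OPTIMIZE events ZORDER BY (customer_id)")

# Clean up files no longer referenced by the table (default retention applies).
spark.sql("VACUUM events")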

Environment: SQL, Python (3.8+), PySpark, Snowflake, Airflow, AWS Glue, AWS Athena, Julia, EMR, HDFS, Pig, Sqoop, Hive, NoSQL, Spark, Spark SQL, AWS, SQL Server, Tableau, Kafka, Terraform, EC2, S3

Data Engineer Macy's, New York, NY December 2019 to February 2022

Responsibilities

Executed a migration of on-premises systems from a proof-of-concept stage to a live production environment, leveraging Azure Cloud services like AKS, ADLS, Pipelines, REST APIs, Key Vault, and Repos.

Aided in the design and development of logical and physical data models, business rules, and data mapping for the Enterprise Data Warehouse system.

Designed, developed, tested, and maintained Tableau functional reports based on user requirements.

Implemented large Lambda architectures using Azure Data platform capabilities like Azure Data Lake, Azure Data Factory, HDInsight, Azure SQL Server, Azure ML, and Power BI.

Designed end-to-end scalable architecture using various Azure components like HDInsight, Data Factory, Data Lake, Storage, and Machine Learning Studio to solve business problems.

Created Pipelines in ADF using Linked Services/Datasets/Pipeline to extract, transform, and load data from different sources like Azure SQL, Blob storage, Azure SQL Data warehouse, and write-back tool.

Used Azure Data Factory, SQL API, and MongoDB API to integrate data from MongoDB, MS SQL, and cloud storage (Blob, Azure SQL DB, Cosmos DB).

Developed JSON Scripts for deploying the Pipeline in Azure Data Factory (ADF) to process the data using SQL Activity.

Involved in Big data requirement analysis, developing and designing solutions for ETL and Business Intelligence platforms.

Utilized Docker for container management, including snapshots, attaching to running containers, removing images, and managing directory structures.

Developed Spark applications using Scala and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats to uncover insights into customer usage patterns.

Designed complex SSIS Packages for ETL with data from different sources.

Experienced in performance tuning of Spark Applications for optimal batch interval time, parallelism, and memory usage.

Designed and developed real-time stream processing applications using Spark, Kafka, Scala, and Hive to perform streaming ETL and apply machine learning (see the sketch at the end of this section).

Developed PL/SQL triggers and master tables for the automatic creation of primary keys.

Used Apache Spark DataFrames, Spark-SQL, and Spark MLlib extensively, developing and designing POCs using Scala, Spark SQL, and MLlib libraries.
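
An illustrative Structured Streaming sketch of the streaming-ETL pattern described above; the original work used Scala, so this PySpark version, along with the broker address, topic, schema, and output paths, is purely hypothetical.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("streaming-etl").getOrCreate()

schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
])

orders = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "orders")
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("o"))
          .select("o.*")
          .where(F.col("amount") > 0))          # simple streaming transformation

query = (orders.writeStream.format("parquet")
         .option("path", "/data/curated/orders")
         .option("checkpointLocation", "/data/checkpoints/orders")
         .start())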

Environment: Hadoop, Spark, Databricks, Azure, ADF, Blob, Cosmos DB, Python, PySpark, Scala, SQL, Sqoop, Kafka, Airflow, Oozie, HBase, Oracle, Teradata, Cassandra, MLlib, Tableau, Maven, Git, Jira.

Big Data Engineer Ace Hardware, Oak Brook, IL April 2017 to November 2019

Responsibilities

Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats.

Responsible for data analysis and cleaning using Spark SQL queries.

Developed a pre-processing job using Spark DataFrames to flatten nested JSON documents into a flat file (see the sketch at the end of this section).

Involved in writing live Real-time Processing and core jobs using Spark Streaming with Kafka as a data pipeline system.

Developed Spark applications using Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing the data to uncover insights into customer usage patterns.

Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for processing and storage of small data sets; maintained the Hadoop cluster on AWS EMR.

Imported data from AWS S3 into Spark RDD and performed transformations and actions on RDDs.

Migrated data from the DR region to S3 buckets and developed a Python-based connection linking AWS and Snowflake.

Loaded data into Amazon Redshift and used AWS CloudWatch to collect and monitor AWS RDS instances within Confidential.

Wrote Python scripts in PySpark to process semi-structured data in formats like JSON.

Worked on converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala, and Python.

Used Spark Streaming APIs to perform transformations and actions on the fly for building a common learner data model, which gets the data from Kafka in near real-time and persists it to Cassandra.

Used Scala to write the code for all Spark use cases, gained extensive experience with Scala for data analytics on the Spark cluster, and performed map-side joins on RDDs.

Developed UDFs in Hive and Pig, working with Avro, Parquet, RCFile, and JSON file formats.

Performed transformations such as event joins, bot-traffic filtering, and pre-aggregations.

Developed Sqoop and Kafka jobs to load data from RDBMSs and external systems into HDFS and Hive.

Handled large datasets during ingestion using partitioning, Spark in-memory capabilities, broadcast variables, and effective and efficient joins and transformations.
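
A minimal sketch of flattening nested JSON with Spark DataFrames, as in the pre-processing job referenced above; the input path, field names, and output location are hypothetical.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("flatten-json").getOrCreate()

raw = spark.read.json("s3a://landing-bucket/events/")      # nested JSON records

flat = (raw.select(
            F.col("id"),
            F.col("user.name").alias("user_name"),         # lift nested struct fields
            F.explode_outer("items").alias("item"))        # one row per array element
           .select("id", "user_name", "item.sku", "item.price"))

flat.write.mode("overwrite").option("header", True).csv("s3a://curated-bucket/events_flat/")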

Environment: Python, Spark, Spark-Streaming, Spark SQL, AWS tools, Apache Kafka, Scala, Shell scripting, Linux, PySpark, MySQL, Jenkins, Git.

Cloud Data Engineer Cybage Software Private Limited, Hyderabad, India October 2015 to January 2017

Responsibilities

Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, Spark MLlib, DataFrames, Pair RDDs, and Spark on YARN.

Authored Python (PySpark) scripts with custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling, and all cleaning and conforming tasks (see the sketch at the end of this section).

Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats.

Responsible for data analysis and cleaning using Spark SQL queries.

Developed a pre-processing job using Spark DataFrames to flatten JSON documents into a flat file.

Involved in writing live Real-time Processing and core jobs using Spark Streaming with Kafka as a data pipeline system.

Implemented SQL queries using stored procedures and built-in functions to retrieve and update data from the databases; wrote complex SQL queries involving joins to obtain data from the persistence layer.

Developed Spark applications using Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing the data to uncover insights into customer usage patterns.

Imported data from AWS S3 into Spark RDD and performed transformations and actions on RDDs.

Loaded data into Amazon Redshift and used AWS CloudWatch to collect and monitor AWS RDS instances within Confidential.

Wrote Python scripts in PySpark to process semi-structured data in formats like JSON.

Worked on converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala, and Python.

Used Spark Streaming APIs to perform transformations and actions on the fly for building a common learner data model which gets the data from Kafka in near real-time and persists it to Cassandra.

Used Scala to write the code for all Spark use cases, gained extensive experience with Scala for data analytics on the Spark cluster, and performed map-side joins on RDDs.

Created MapReduce jobs using Python scripts to perform ETL tasks.

Performed validation checks after running ETL queries and reported to the client at every stage of the project.

Created external tables with partitions using Hive, AWS Athena, and Redshift.

Migrated existing MapReduce programs to the Spark model using Python.

Developed Kafka consumer APIs in Scala for consuming data from Kafka topics.

Consumed XML messages using Kafka and processed the XML file using Spark Streaming to capture UI updates.
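
A minimal sketch of a custom PySpark UDF for the cleaning and labeling tasks described above; the column names and labeling rule are hypothetical.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("cleaning-udfs").getOrCreate()

@F.udf(returnType=StringType())
def label_amount(amount):
    # Conforming rule: bucket raw amounts into coarse labels.
    if amount is None:
        return "unknown"
    return "high" if amount > 1000 else "low"

df = spark.createDataFrame([(1, 1500.0), (2, 200.0), (3, None)], ["id", "amount"])
cleaned = (df.withColumn("amount", F.coalesce("amount", F.lit(0.0)))   # simple cleaning step
             .withColumn("amount_label", label_amount(F.col("amount"))))
cleaned.show()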

Environment: Python, Spark, Spark-Streaming, Spark SQL, AWS tools, Apache Kafka, Scala, Shell scripting, Linux, PySpark, MySQL, Cassandra, Jenkins, Git.

Big Data Engineer Yana Software Private Limited, Hyderabad, India May 2014 to September 2015

Responsibilities

Applied SQL in querying, data extraction, and data transformations.

Developed high-level design documents, use case documents, detailed design documents, and Unit Test Plan documents and created Use Cases, Class Diagrams, and Sequence Diagrams using UML.

Used dual controllers on various business projects for dual data validation and data consistency.

Interacted with users, analyzed client processes, and documented the project’s business requirements.

Identified root causes, troubleshot issues, and submitted change controls.

Wrote various Hibernate queries using Hibernate Query Language (HQL) and Hibernate Criteria queries to execute queries against the database.

Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats.

Responsible for data analysis and cleaning using Spark SQL queries.

Developed a pre-processing job using Spark DataFrames to flatten JSON documents into a flat file.

Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for processing and storage of small data sets; maintained the Hadoop cluster on AWS EMR.

Imported data from AWS S3 into Spark RDD and performed transformations and actions on RDDs.
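
A minimal sketch of loading S3 data into a Spark RDD and applying transformations and actions, as described above; the bucket, path, and record layout are hypothetical, and a configured S3A connector is assumed.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-rdd").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("s3a://raw-bucket/logs/2015/")            # one record per line
errors = (lines.map(lambda line: line.split(","))             # transformation
               .filter(lambda fields: fields[2] == "ERROR"))  # transformation

print(errors.count())                                         # action triggers the job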

Environment: Python, Spark, Spark-Streaming, Spark SQL, AWS tools, Apache Kafka, Scala, Shell scripting, Linux, PySpark, MySQL.

Certification:

AWS Certified Solutions Architect - Associate

https://www.credly.com/badges/e8fe2d29-2fe8-4819-9c1a-a97f96a6df77/linked_in_profile

Java (Oracle Certified Associate, Java SE 8 Programmer)

https://catalog-education.oracle.com/pls/certview/sharebadge?id=515CA6AE2CCF3E41648290EC6BFA0BF8ABD27C7E195BE820F3D3EEF33F5628C8

Technical Skills:

Big Data Ecosystem

HDFS (Adept), MapReduce (Skilled), Pig (Well-versed), Hive (Proficient), Spark, YARN, Kafka, Flume, Sqoop, Impala, Oozie, Zookeeper, Ambari, HBase, Beeline, NiFi, StreamSets, ELK

Cloud Environment

Amazon Web Services (Business Savvy), Microsoft Azure (Prior Experience)

Hadoop Distributions

Cloudera Distribution, Hortonworks

Reporting and ETL Tools

Tableau, Power BI, Talend, AWS GLUE, Informatica

Languages

Python/Julia (Expert), Java (Proficient), Unix Shell Scripting (Master), SQL (Skilled), Spark (Prior Experience), Scala (Well-versed), JDBC

Databases

Oracle, MySQL, Cassandra, PostgreSQL, MongoDB, HBase, SQL Server

Testing

MR Unit Testing, Quality Center (QC), HP ALM, Pytest Framework

Virtualization

VMWare, AWS/EC2

Version Controls and Tools

Maven, Ant, SBT, Git, Kubernetes, Jenkins, Azure DevOps

References:

Available upon request.


