Data Engineer Azure

Location:
Jersey City, NJ
Posted:
May 12, 2025

WORK EXPERIENCE

Kenvue Inc – Azure Data Engineer

Skillman, New Jersey, USA Mar 2024 - Present

Kenvue Inc. is an American consumer health company. I design, develop, and maintain scalable data pipelines to ingest, process, and transform data from various sources, using Azure Data Factory, Azure Databricks, and Azure Synapse Analytics to orchestrate and automate data workflows.

Responsibilities:

Developed a reusable framework, to be leveraged for future migrations, that automates ETL from RDBMS systems to the data lake using Spark data sources and Hive data objects. Imported data from sources such as HDFS and HBase into Spark RDDs and performed computations using PySpark to generate the output.
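
A minimal PySpark sketch of the kind of HDFS-to-data-lake flow described above; the path, file layout, and table name are assumptions for illustration, not the actual framework:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("rdbms-to-datalake")
         .enableHiveSupport()
         .getOrCreate())

# Read a raw extract landed on HDFS into an RDD (path and layout are hypothetical).
raw_rdd = spark.sparkContext.textFile("hdfs:///landing/orders/")

# Simple computation: count orders per status code (third field of the CSV extract).
status_counts = (raw_rdd
                 .map(lambda line: line.split(","))
                 .map(lambda fields: (fields[2], 1))
                 .reduceByKey(lambda a, b: a + b))

# Persist the result to a Hive-backed table in the curated zone of the data lake.
(status_counts
 .toDF(["status", "order_count"])
 .write.mode("overwrite")
 .saveAsTable("curated.order_status_counts"))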

Enabled monitoring and Azure Log Analytics to alert the support team on usage and statistics of the daily runs.

Worked with Docker containers, developing images and hosting them in Artifactory.

Created and provisioned the Databricks clusters needed for batch and continuous streaming data processing and installed the required libraries on the clusters. Managed relational database services, with Azure SQL handling reliability, scaling, and maintenance. Integrated data storage solutions.

Created data tables using PyQt to display customer and policy information and to add, delete, and update customer records.

Worked with Python ORM classes in SQLAlchemy to retrieve and update data in the database.
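
A hedged sketch of that SQLAlchemy ORM pattern (1.4+ style API): fetch a mapped record and update it. The table, columns, and connection URL are placeholders, not the actual schema:

from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Customer(Base):
    __tablename__ = "customers"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    status = Column(String)

engine = create_engine("sqlite:///example.db")  # stand-in for the real database URL
Base.metadata.create_all(engine)

with Session(engine) as session:
    customer = session.get(Customer, 1)   # fetch by primary key
    if customer is not None:
        customer.status = "active"        # modify the mapped object
        session.commit()                  # persist the update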

Designed and implemented Infrastructure as code using Terraform, enabling automated provisioning and scaling of cloud resources on Azure. Built and configured Jenkins slaves for parallel job execution. Installed and configured Jenkins for continuous integration and performed continuous deployments.

Involved in the entire lifecycle of projects, including design, development, testing, deployment, implementation, and support. Extracted, converted, and loaded data from various sources to Azure data storage services using Azure Data Factory and T-SQL for Data Lake Analytics.

Developed Spark programs to parse raw data, populate staging tables, and store the refined data in partitioned tables in the enterprise data warehouse. Developed streaming applications using PySpark to read from Kafka and persist the data to NoSQL databases such as HBase and Cassandra.
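
A minimal Structured Streaming sketch of that Kafka-to-NoSQL flow; broker, topic, keyspace, and table names are hypothetical, and writing to Cassandra assumes the spark-cassandra-connector package is available on the cluster:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-to-cassandra").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load()
          .select(F.col("key").cast("string"), F.col("value").cast("string")))

def write_batch(batch_df, batch_id):
    # Persist each micro-batch to a Cassandra table with matching columns.
    (batch_df.write
     .format("org.apache.spark.sql.cassandra")
     .options(keyspace="analytics", table="events")
     .mode("append")
     .save())

query = events.writeStream.foreachBatch(write_batch).start()
query.awaitTermination()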

Experience in building Power BI reports on Azure Analysis Services for better performance.

Implemented PySpark scripts using Spark SQL to load Hive tables into Spark for faster data processing.

Worked on Big Data Hadoop cluster implementation and data integration while developing large-scale system software.

Migrated an entire Oracle database to BigQuery and used Power BI for reporting.

Demonstrated skill in parameterizing dynamic SQL to prevent SQL injection vulnerabilities and ensure data security.
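
A hedged sketch of parameterized SQL that avoids string concatenation (and with it, injection). It assumes the pyodbc driver against an Azure SQL database; the DSN, table, and column names are placeholders:

import pyodbc

conn = pyodbc.connect("DSN=AzureSqlDb")  # assumed DSN; use the real connection string
cursor = conn.cursor()

customer_id = 42
new_status = "active"

# The driver binds the parameters; user input is never interpolated into the SQL text.
cursor.execute(
    "UPDATE dbo.Customers SET Status = ? WHERE CustomerId = ?",
    (new_status, customer_id),
)
conn.commit()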

Ensured data integrity and consistency during migration, resolving compatibility issues with T-SQL scripting.

Developed a fully automated continuous integration system using Git, Jenkins, MySQL, and custom tools developed in Python and Bash. Developed Kibana dashboards based on Logstash data and integrated different source and target systems into Elasticsearch for near real-time log analysis and end-to-end transaction monitoring.
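
A small sketch of pushing run events into Elasticsearch for that kind of near real-time analysis, assuming the elasticsearch Python client (8.x-style API); the endpoint, index, and document fields are illustrative:

from datetime import datetime
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Index one pipeline-run event; Kibana dashboards can then aggregate on these fields.
es.index(
    index="pipeline-logs",
    document={
        "pipeline": "orders_etl",
        "status": "SUCCESS",
        "duration_sec": 312,
        "@timestamp": datetime.utcnow().isoformat(),
    },
)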

Implemented Synapse integration with Azure Databricks notebooks, which reduced development work by about half.

Developed and maintained data models and schemas within Snowflake, including the creation of tables, views, and materialized views to support business reporting and analytics requirements.
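
A sketch of creating those kinds of Snowflake objects through the snowflake-connector-python driver; credentials, warehouse, and object names are placeholders, not the actual model:

import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="etl_user", password="***",
    warehouse="ETL_WH", database="ANALYTICS", schema="REPORTING",
)
cur = conn.cursor()

# Base table for reporting.
cur.execute("""
    CREATE TABLE IF NOT EXISTS orders (
        order_id STRING, order_date DATE, amount NUMBER(12,2)
    )
""")
# View for the most recent 30 days of activity.
cur.execute("""
    CREATE OR REPLACE VIEW recent_orders AS
    SELECT * FROM orders WHERE order_date >= DATEADD(day, -30, CURRENT_DATE())
""")
# Materialized view with a pre-aggregated daily total.
cur.execute("""
    CREATE OR REPLACE MATERIALIZED VIEW daily_sales AS
    SELECT order_date, SUM(amount) AS total FROM orders GROUP BY order_date
""")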

Worked on CI/CD tools like Jenkins and Docker on the DevOps team, setting up the application process end to end, using continuous deployment for lower environments and continuous delivery with approvals for higher environments.

Technologies Used: API, Azure, Cassandra, CI/CD, Docker, Elasticsearch, ETL, Git, HBase, HDFS, Hive, Java, Jenkins, MySQL, PySpark, Python, RDBMS, Services, Snowflake, Spark, SQL

PBF Energy – AWS Data Engineer

Parsippany, New Jersey, USA Nov 2022 - Feb 2024

PBF Energy Inc. is a petroleum refining and logistics company that produces and sells transportation fuels, heating oils, lubricants, petrochemical feedstocks, and other petroleum products. Monitored the data pipelines and storage solutions for performance and reliability. Documented the data engineering processes, architectures, and workflows.

Responsibilities:

Developed PySpark applications for various ETL operations across various data pipelines.

Used the AWS CLI to suspend an AWS Lambda function processing an Amazon Kinesis stream.

Wrote multiple MapReduce jobs using the Java API, Pig, and Hive for data extraction, transformation, and aggregation from multiple file formats, including Parquet, Avro, XML, JSON, CSV, and ORC, and compression codecs such as Gzip, Snappy, and LZO. Managed large datasets using Pandas data frames and SQL.

Created Lambda functions with Boto3 to deregister unused AMIs in all application regions to reduce the cost of EC2 resources. Used Kafka features such as distribution, partitioning, and the replicated commit log for messaging systems by maintaining feeds. Involved in loading data from REST endpoints into Kafka.
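
A hedged sketch of a Lambda handler using Boto3 to deregister AMIs no longer referenced by running instances; the region list is illustrative, the "unused" test is simplified, and pagination is omitted:

import boto3

REGIONS = ["us-east-1", "us-west-2"]  # assumed application regions

def lambda_handler(event, context):
    for region in REGIONS:
        ec2 = boto3.client("ec2", region_name=region)

        # AMIs owned by this account in the region.
        images = ec2.describe_images(Owners=["self"])["Images"]

        # Image IDs currently referenced by instances in the region.
        in_use = set()
        for reservation in ec2.describe_instances()["Reservations"]:
            for instance in reservation["Instances"]:
                in_use.add(instance["ImageId"])

        # Deregister anything not in use.
        for image in images:
            if image["ImageId"] not in in_use:
                ec2.deregister_image(ImageId=image["ImageId"])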

Implemented AWS solutions using EC2, S3, RDS, EBS, Elastic Load Balancer, and Auto Scaling groups; optimized volumes and EC2 instances.

Wrote Terraform templates for AWS Infrastructure as Code to build staging and production environments and set up build automation with Jenkins. Configured Elastic Load Balancers (ELB) with EC2 Auto Scaling groups.

Created an Amazon VPC with a public-facing subnet for web servers with internet access, and backend databases and application servers in a private subnet with no internet access. Created AWS launch configurations based on customized AMIs and used them to configure Auto Scaling groups.

Utilized Puppet for configuration management of hosted instances within AWS, including configuration and networking of the Virtual Private Cloud (VPC). Utilized S3 buckets and Glacier for storage and backup on AWS.

Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.

Worked with AWS Terraform templates in maintaining the infrastructure as code.

Actively Participated in all phases of the Software Development Life Cycle (SDLC) from implementation to deployment.

Used Jira for ticketing and tracking issues and Jenkins for continuous integration and continuous deployment.

Successfully managed data migration projects, including importing and exporting data to and from MongoDB, ensuring data integrity and consistency throughout the process.

Automated and monitored AWS infrastructure with Terraform for high availability and reliability, reducing infrastructure management time by 90% and improving system uptime.

Performed end-to-end architecture and implementation assessment of various AWS services such as Amazon EMR, Redshift, and S3. Worked on container orchestration tools such as Docker Swarm, Mesos, and Kubernetes.

Skilled in monitoring servers using Nagios and CloudWatch, and using the ELK stack (Elasticsearch and Kibana).

Implemented a proof of concept deploying the product to an AWS S3 bucket and Snowflake.

Used Bitbucket as source control to push the code and Bamboo as deployment tool to build CI/CD pipeline.

Created a security framework to provide fine-grained access control for objects in AWS S3.

Technologies Used: API, AWS, Bitbucket, CI/CD, Docker, ETL, Hive, Java, Jenkins, Jira, JS, Kafka, Kubernetes, lake, Lambda, MapR, Pig, PySpark, S3, Snowflake, Spark, SQL

Apollo Global Management Inc. – GCP Data Engineer

Mumbai, India Jul 2020 - Jul 2022

Apollo Global Management, Inc. is an American asset management firm that primarily invests in alternative assets. Designed, built, and maintained the scalable and efficient data pipelines to ingest, process, and transform data from various sources (e.g., financial systems, market data feeds, internal databases).

Responsibilities:

Created Hive and HBase tables, and HBase-integrated Hive tables, per the design using the ORC file format and Snappy compression. Worked extensively with Python to optimize code for better performance.

Provided high availability for IaaS VMs and PaaS role instances for access from other services in the VNet with Azure Internal Load Balancer. Created BigQuery authorized views for row-level security and for exposing data to other teams.

Created Data Studio reports to review billing and usage of services, optimize queries, and contribute to cost-saving measures. Experienced in GCP Dataproc, GCS, Cloud Functions, and BigQuery.

Experienced in GCP features including Google Compute Engine, Google Cloud Storage, VPC, Cloud Load Balancing, and IAM.

Created pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks. Good knowledge of using Cloud Shell for various tasks and deploying services.

Worked on building end-to-end Scala-based Spark applications for cleansing, auditing, and transforming raw data feeds from multiple report suites. Designed and implemented data integration solutions using Azure Data Factory to move data between various data sources, including on-premises and cloud-based systems.

Experienced in Google Cloud components, Google Container Builder, GCP client libraries, and Cloud SDKs.

Stored different configs in the NoSQL database MongoDB and manipulated the configs using PyMongo.
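
A simple PyMongo sketch of storing and updating application configs in MongoDB; the connection URI, database, collection, and document shape are assumptions:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
configs = client["app_db"]["configs"]

# Insert or replace a config document keyed by name.
configs.replace_one(
    {"name": "ingestion"},
    {"name": "ingestion", "batch_size": 500, "enabled": True},
    upsert=True,
)

# Read a config back and tweak a single field.
doc = configs.find_one({"name": "ingestion"})
configs.update_one({"name": "ingestion"}, {"$set": {"batch_size": 1000}})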

Used Sqoop import/export to ingest raw data into Google Cloud Storage by spinning up a Cloud Dataproc cluster.

Processed and loaded bounded and unbounded data from Google Pub/Sub topics to BigQuery using Cloud Dataflow with Python.
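
A minimal Apache Beam (Dataflow) sketch of that Pub/Sub-to-BigQuery streaming load; the project, subscription, and table identifiers are placeholders, and runner/temp-location options are omitted:

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)  # plus project/runner/temp_location in practice

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
           subscription="projects/my-project/subscriptions/my-sub")
     | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
     | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
           "my-project:my_dataset.events",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))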

Used Apache Airflow in a GCP Composer environment to build data pipelines, with various Airflow operators such as bash operators, Hadoop operators, Python callables, and branching operators. Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.
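
A sketch of a Composer DAG wiring bash, Python-callable, and branching operators (Airflow 2.x-style imports); task names and logic are illustrative only:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator, BranchPythonOperator

def extract():
    print("pull data from the source system")

def choose_path():
    # Branch at runtime: full load on the first of the month, incremental otherwise.
    return "full_load" if datetime.utcnow().day == 1 else "incremental_load"

with DAG("etl_pipeline", start_date=datetime(2024, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    branch = BranchPythonOperator(task_id="choose_path", python_callable=choose_path)
    full_load = BashOperator(task_id="full_load", bash_command="echo full load")
    incremental_load = BashOperator(task_id="incremental_load",
                                    bash_command="echo incremental load")

    extract_task >> branch >> [full_load, incremental_load]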

Developed streaming and batch processing applications using PySpark to ingest data from various sources into the HDFS data lake. Developed DDL and DML scripts in SQL and HQL for analytics applications in RDBMS and Hive.

Developed and implemented HQL scripts to create partitioned and bucketed tables in Hive for optimized data access.
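
A sketch of that partitioned/bucketed Hive DDL, issued through PySpark's Hive support to stay in Python; the database, table, columns, and bucket count are assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Partition by date for pruning; bucket by customer for more balanced joins.
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.transactions (
        txn_id      STRING,
        customer_id STRING,
        amount      DOUBLE
    )
    PARTITIONED BY (txn_date STRING)
    CLUSTERED BY (customer_id) INTO 16 BUCKETS
    STORED AS ORC
""")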

Used the Cloud SDK in GCP Cloud Shell to configure services such as Dataproc, Cloud Storage, and BigQuery.

Wrote Hive UDFs to implement custom aggregation functions in Hive.

Worked extensively with Sqoop to import and export data between HDFS and relational database systems/mainframes.

Technologies Used: Airflow, Apache, Azure, BigQuery, Data Factory, GCP, HBase, Hive, IaaS, PaaS, PySpark, Python, Scala, SDK, Spark, SQL, Sqoop, VPC

Hindustan Construction Company – Data Engineer

Mumbai, India May 2019 - Jun 2020

Hindustan Construction Company Limited (HCC) is an Indian multinational engineering and construction company. I performed data transformation and processing using BigQuery SQL, Dataproc (for Apache Spark and Hadoop), and Dataflow (for Apache Beam), and optimized data processing jobs for performance and cost-efficiency.

Responsibilities:

Was part of developing ETL jobs for extracting data from multiple tables and loading it into a data mart in Redshift.

Architect & implement medium to large scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics).

Worked on Big Data Integration & Analytics based on Hadoop, SOLR, PySpark, Kafka, Storm and Web Methods.

Wrote Spark-Streaming applications to consume the data from Kafka topics and write the processed streams to HBase.

Processed data efficiently within Azure Databricks and visualized insights through Tableau dashboards.

Developed Spark applications with Azure Data Factory and Spark SQL for data extraction, transformation, and aggregation from different file formats to analyze and transform the data and uncover insights into customer usage patterns.

Involved in various phases of Software Development Lifecycle (SDLC) of the application, like gathering requirements, design, development, deployment, and analysis of the application.

Developed triggers, stored procedures, functions, and packages using cursors associated with the project using PL/SQL.

Created pipelines in Azure Data Factory utilizing Linked Services to extract, transform, and load data to and from many sources, such as Azure SQL Data Warehouse, including write-back scenarios.

Used PySpark, Apache Flink, Kafka, and Hive on a distributed Hadoop cluster to contribute to the creation of real-time streaming applications. Implemented Azure Data Lake, Azure Data Factory, and Azure Databricks to move and conform data from on-premises to the cloud to serve the company's analytical needs.

Built and maintained Docker container clusters managed by Kubernetes, using Linux, Bash, Git, and Docker.

Proficient in using Snowflake utilities, SnowSQL, and Snowpipe, and in applying Big Data modelling techniques using Python.

Instantiated, created, and maintained CI/CD (continuous integration and deployment) pipelines and applied automation to environments and applications.

Technologies Used: Azure, Azure Data Lake, CI/CD, Cluster, Data Factory, Docker, EC2, EMR, ETL, Git, HBase, Hive, JS, Kafka, Kubernetes, Linux, PL/SQL, PySpark, Python, Redshift, S3, Services, Snowflake, Spark, SQL, Storm, Tableau

EDUCATION

Wilmington University, Delaware, USA

Master's in Information Technology, Sep 2023 to Aug 2024

Results-driven Data Engineer with 5+ years of extensive experience in designing, building, and maintaining scalable data infrastructure on cloud platforms, including Azure, AWS, and Google Cloud (GCP). Adept at leveraging cloud-native services to optimize ETL processes, data pipelines, and storage solutions, ensuring high performance and reliability.

*****************@*****.***

862-***-****

Kumar Akash Chowdary Kommina

Data Engineer

PROFILE SUMMARY

5+ years of expertise designing, developing, and executing data pipelines and data lake requirements in numerous companies using the Big Data Technology stack, Python, PL/SQL, SQL, REST APIs, and the Azure cloud platform.

Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Amazon Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SES, SQS, Lambda, EMR and other services of the AWS family.

Expertise in building CI/CD in AWS environments using AWS CodeCommit, CodeBuild, CodeDeploy, and CodePipeline. Worked on production support, looking into logs and hot fixes, and used Splunk for log monitoring along with AWS CloudWatch.

Experience working with the Rivery ELT platform, which performs data integration, data orchestration, data cleansing, and other vital data functions.

Strong experience using AWS and Azure data services and creating efficient data pipelines with optimized solutions.

Implemented Big Data solutions using the Hadoop technology stack, including PySpark, Hive, Sqoop, Avro, and Thrift. Proficient with Scala, Apache HBase, Hive, Pig, Sqoop, Zookeeper, Spark, Spark SQL, Spark Streaming, Kinesis, Airflow, YARN, and Hadoop (HDFS, MapReduce).

Working knowledge of HDFS, Kafka, MapReduce, Spark, Pig, Hive, Sqoop, HBase, Flume, and Apache ZooKeeper as tools for designing and deploying end-to-end big data ecosystems. Good understanding of Spark architecture with Databricks and Structured Streaming. Set up AWS and Microsoft Azure with Databricks, including Databricks workspaces for business analytics, managing clusters in Databricks, and managing the machine learning lifecycle.

Extensively worked with Avro and Parquet files, converting data between the two formats, and parsed semi-structured JSON data and converted it to Parquet using DataFrames in PySpark. Experience with Windows Azure services such as PaaS and IaaS, and worked with storage such as Blob (page and block) and SQL Azure. Well experienced in deployment and configuration management and virtualization.

Extensive work experience in all phases of the Software Development Life Cycle (SDLC), including requirement analysis, design, coding, testing, and implementation in Agile (Scrum) and TDD environments. Experience implementing Azure data solutions: provisioning storage accounts, Azure Data Factory, SQL Server, SQL databases, SQL Data Warehouse, Azure Databricks, and Azure Cosmos DB.

Experience migrating on-premises workloads to Azure using Azure Site Recovery and Azure Backup. Worked with creating Dockerfiles and Docker containers, developing images and hosting them in Artifactory.

Experience with Cisco CloudCenter to more securely deploy and manage applications across multiple data center, private cloud, and public cloud environments.

KEY SKILLS

Big Data Technologies: HDFS, YARN, MapReduce, Hive, Pig, Impala, Sqoop, Storm, Flume, Spark, Apache Kafka, Zookeeper, Ambari, Oozie, MongoDB, Cassandra, Mahout, Puppet, Avro, Parquet, Snappy

NoSQL Databases: Postgres, HBase, Cassandra, MongoDB, Amazon DynamoDB, Redis

Languages: Scala, Python, R, XML, XHTML, HTML, AJAX, CSS, HiveQL, Unix, Shell Scripting

Source Code Control: GitHub, CVS, SVN, ClearCase

Cloud Computing Tools: Amazon AWS (S3, EMR, EC2, Lambda, VPC, Route 53, CloudWatch, CloudFront), Microsoft Azure, GCP

Databases: Teradata, Snowflake, Microsoft SQL Server, MySQL, DB2

DB languages: MySQL, PL/SQL, PostgreSQL & Oracle

Build Tools: Jenkins, Maven, Ant, Log4j

Business Intelligence Tools: Tableau, Power BI

Development Tools: Eclipse, IntelliJ, Microsoft SQL Studio, Toad, NetBeans

ETL Tools: Talend, Pentaho, Informatica, Ab Initio, SSIS

Development Methodologies: Agile, Scrum, Waterfall, V model, Spiral, UML


