Data Engineer Big

Location:
Chicago, IL, 60602
Salary:
120,000
Posted:
May 31, 2025

Resume:

Navyasree Shetty

Data Engineer

309-***-**** *****************@*****.*** Illinois, USA

https://www.linkedin.com/in/navyasree-s-1a4221191/

Objective

Seasoned Data Engineer with 4+ years of extensive experience in designing, implementing, and optimizing data solutions using AWS, Azure, and GCP tools. Proven track record of leveraging cloud platforms to manage large-scale data pipelines, ensure data integrity, and enhance performance.

Profile Summary

• 4+ years of expertise designing, developing, and executing data pipelines and data lake requirements across multiple companies using the Big Data technology stack, Python, PL/SQL, SQL, REST APIs, and the Azure cloud platform.

• Expertise in building batch and real-time data processing systems leveraging AWS services (S3, Redshift, EMR, Lambda, DynamoDB) and Azure services (ADLS, ADF, Databricks, Synapse Analytics).

• Extensive experience in performing ETL on structured and semi-structured data using Pig Latin scripts. Expertise in Azure infrastructure management (Azure Web Roles, Worker Roles, SQL Azure, Azure Storage, Azure AD licenses, Office 365).

• Extensive experience in Hadoop-based development of enterprise-level solutions utilizing Hadoop components such as Apache Spark, MapReduce, HDFS, Sqoop, Pig, Hive, HBase, Oozie, Flume, NiFi, Kafka, Zookeeper, and YARN.

• Extensive experience in developing and deploying applications using WebLogic, Apache Tomcat, and JBoss.

• Hands-on experience with Spark, Databricks, and Delta Lake. Good experience working with PyTest, PyMock, and Selenium WebDriver frameworks for testing front-end and back-end components.

• Expertise in developing production-ready Spark applications utilizing Spark Core, DataFrames, Spark SQL, Spark ML, and Spark Streaming APIs. Experience in automating day-to-day activities using Windows PowerShell.

• Strong understanding of Software Development Lifecycle (SDLC) and various methodologies (Waterfall, Agile).

• Experienced with dimensional modeling, data migration, data cleansing, data profiling, and ETL processes for data warehouses. Implemented production scheduling jobs using Control-M and Airflow.

• Hands-on experience in implementing, building, and deploying CI/CD pipelines, and managing projects that often include tracking multiple deployments across pipeline stages (Dev, Test/QA, Staging, and Production).

• Experience with the Snowflake cloud data warehouse and AWS S3 buckets for integrating data from multiple source systems, including loading nested JSON-formatted data into Snowflake tables. Experience with Cisco CloudCenter to more securely deploy and manage applications across multiple data center, private cloud, and public cloud environments.

• Worked with creating Dockerfiles and Docker containers, developing the images and hosting them in Artifactory. Developed web-based applications using Python, Django, Qt, C++, XML, CSS3, HTML5, DHTML, JavaScript, and jQuery.

• Experienced in fact and dimensional modeling (star schema, snowflake schema), transactional modeling, and SCDs (slowly changing dimensions). Hands-on experience in handling database issues and connections with SQL and NoSQL databases such as MongoDB, HBase, Cassandra, SQL Server, and PostgreSQL.

• Developed, deployed, and managed SaaS applications using cloud platforms such as AWS and Azure.

Relevant Skills

Programming Languages:

• Python: Widely used for scripting, data manipulation, and developing data pipelines.

• SQL: Essential for querying and managing relational databases.

• Java/Scala: Useful for working with big data tools like Apache Spark and Hadoop.

Data Warehousing & Databases:

• Relational Databases: MySQL, PostgreSQL, SQL Server.

• NoSQL Databases: MongoDB, Cassandra, HBase.

• Data Warehousing: Amazon Redshift, Google BigQuery, Snowflake.

Big Data Technologies: HDFS, Hue, MapReduce, Pig, Hive, HCatalog, HBase, Sqoop, Impala, Zookeeper, Flume, Kafka, YARN, Cloudera Manager, Kerberos, PySpark, Airflow, Snowflake, Spark components

Cloud Platforms:

• AWS: Services like S3 (storage), Redshift (data warehousing), Lambda (serverless computing), and EMR (Hadoop/Spark).

• Azure: Services like Azure Data Lake, Azure Synapse Analytics, and Azure Databricks.

• Google Cloud Platform (GCP): BigQuery, Cloud Dataflow, and Cloud Storage.

Visualization & ETL Tools: Tableau, Power BI, Informatica, Talend

Web/Application Servers: Apache Tomcat, WebLogic, WebSphere

Version Control and Tools: Git, Maven, SBT, CBT

Data Processing Frameworks: Apache Hadoop, Apache Spark, Apache Flink, Apache Beam

Education Details

Western Illinois University, Illinois, USA

• Master's in Computer Science (Aug 2023 - May 2024)

Relevant Experience

Azure Data Engineer, JLL, Chicago, Illinois, USA (Feb 2024 - Current)

Jones Lang LaSalle Incorporated (JLL) is a global real estate services company. I design, develop, and maintain robust data pipelines to support data ingestion, processing, and storage; implement and manage ETL (Extract, Transform, Load) processes to ensure data integrity and availability; and optimize and tune data systems for performance and scalability.

Key Responsibilities:

• Developed ETL jobs for extracting data from multiple tables and loading it into a data mart in Redshift.

• Combined various datasets in Hive to generate business reports. Leveraged Control-M to automate and manage end-to-end data pipelines on Azure, facilitating seamless data integration, transformation, and movement.

• Actively involved in designing and developing data ingestion, aggregation, and integration in the Hadoop environment.

• Integrated Kafka with Spark Streaming for real-time data processing. Design, develop, and maintain scalable data architectures using Azure Data Lake, Azure SQL Database, and Azure Synapse Analytics. Build and manage robust ETL/ELT pipelines using Azure Data Factory to ingest data from various on-premises and cloud-based sources.
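
A minimal sketch of this kind of Kafka-to-Spark integration is shown below, assuming Spark Structured Streaming with the Kafka connector available on the cluster; the broker address, topic, schema, and output paths are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-stream-demo").getOrCreate()

# Hypothetical schema for the JSON payload carried on the topic
schema = StructType([
    StructField("event_id", StringType()),
    StructField("amount", DoubleType()),
])

# Read the Kafka topic as a streaming DataFrame (broker/topic names are placeholders)
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "events")
       .load())

# Parse the JSON value column into typed fields
events = raw.select(from_json(col("value").cast("string"), schema).alias("e")).select("e.*")

# Write the parsed stream to a data lake path with checkpointing
query = (events.writeStream
         .format("parquet")
         .option("path", "/mnt/datalake/events")
         .option("checkpointLocation", "/mnt/datalake/_checkpoints/events")
         .start())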

• Automate data transformation, cleansing, and loading into Azure storage and databases for use in analytics and reporting. Developed Databricks ETL pipelines using notebooks, Spark DataFrames, Spark SQL, and Python scripting.

• Implement and optimize data storage solutions using Azure Data Lake Storage, Azure Blob Storage, and Azure SQL Database. Implement security best practices in data management by using Azure Active Directory (AAD), Azure Key Vault, and Azure Security Center.

• Use Azure Databricks and Apache Spark to process and analyze large datasets for real-time insights and analytics.

• Design and deploy APIs for data integration using Azure API Management and Azure Logic Apps to integrate data across systems and applications. Monitor data pipelines and infrastructure performance using Azure Monitor, Log Analytics, and Azure Application Insights.

• Created Data tables utilizing PyQt to display customer and policy information and add, delete, update customer records.

• Expertise in business intelligence and data visualization tools like Tableau; used Tableau to connect to various sources and build graphs. Good knowledge of creating DataFrames using Spark SQL.

• Designing the business requirement collection approach based on the project scope and SDLC methodology.

• Have used T-SQL for MS SQL Server and ANSI SQL extensively on disparate databases.

• Extract, transform, and load data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and Azure Data Lake Analytics. Ingest data into one or more Azure services and process the data in Azure Databricks.
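
The Spark SQL portion of such an ingestion might look roughly like the sketch below, assuming a Databricks cluster already configured for access to an ADLS Gen2 account; the storage account, container, column names, and paths are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adls-ingest-demo").getOrCreate()

# Hypothetical ADLS Gen2 locations (abfss://<container>@<account>.dfs.core.windows.net/<path>)
source_path = "abfss://raw@examplelake.dfs.core.windows.net/sales/2024/*.csv"
target_path = "abfss://curated@examplelake.dfs.core.windows.net/sales_daily"

# Load the raw files into a DataFrame and expose them to Spark SQL
raw_df = spark.read.option("header", "true").csv(source_path)
raw_df.createOrReplaceTempView("raw_sales")

# A simple Spark SQL transformation: aggregate daily totals per region
daily = spark.sql("""
    SELECT region,
           to_date(order_ts) AS order_date,
           SUM(CAST(amount AS DOUBLE)) AS total_amount
    FROM raw_sales
    GROUP BY region, to_date(order_ts)
""")

# Persist the curated result back to the data lake in Parquet format
daily.write.mode("overwrite").partitionBy("order_date").parquet(target_path)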

• Involved in the automation process through Jenkins for CI/CD pipelines. Successfully completed a POC for Azure implementation, with the larger goal of migrating on-premises servers and data to the cloud.

• Dockerized applications by creating Docker images from Dockerfiles and collaborated with the development support team to set up a continuous deployment environment using Docker. Monitored and maintained Elasticsearch clusters using Metricbeat, Kibana, and custom alerts for ingestion lag, disk utilization, and search latency.

• Worked on Azure Data Factory to integrate data from both on-premises (MySQL, Cassandra) and cloud (Blob Storage, Azure SQL DB) sources and applied transformations to load back to Azure Synapse. Conducted performance tuning and optimization of the Snowflake data warehouse, resulting in improved query execution times and reduced operational costs.

Environment: Azure, Azure Data Lake, Azure Data Factory, Blob Storage, Cassandra, CI/CD, Docker, EC2, Elasticsearch, ETL, Hive, Jenkins, Kafka, Python, Redshift, Snowflake, Spark, Spark SQL, Spark Streaming, SQL, Tableau

AWS Data Engineer, Rush University Medical Center, Chicago, Illinois, USA (Oct 2023 - Jan 2024)

RUSH is a leading academic health system dedicated to improving the health of the people and diverse communities it serves. Integrated data from various sources, including internal databases, third-party services, and APIs. Ensured seamless data flow across systems and platforms, enabling effective data utilization.

Key Responsibilities:

• Provisioned high availability of AWS EC2 instances, migrated legacy systems to AWS, and developed Terraform plugins, modules, and templates for automating AWS infrastructure.

• Designed and implemented ETL (Extract, Transform, Load) processes using C# to cleanse, transform, and enrich raw data, ensuring its quality and compatibility with downstream analytics and reporting systems.

• Implemented AWS Lambda functions to preprocess data before loading into Redshift or Snowflake, ensuring data quality and integrity. Analyzed data using the Hadoop components Hive and Pig.
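
As a rough illustration of such a preprocessing Lambda (not the exact production code), the sketch below assumes an S3 put-event trigger and a staging bucket that feeds the Redshift/Snowflake load; bucket names and the cleansing rule are hypothetical.

import csv
import io

import boto3

s3 = boto3.client("s3")
STAGING_BUCKET = "example-staging-bucket"  # hypothetical target bucket


def lambda_handler(event, context):
    """Triggered by an S3 put event; drops incomplete rows before staging."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Read the raw CSV object
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        rows = list(csv.reader(io.StringIO(body)))

        # Simple data-quality rule: keep only rows with no empty fields
        header, data = rows[0], rows[1:]
        clean = [r for r in data if all(field.strip() for field in r)]

        # Write the cleaned file to the staging bucket for the downstream load
        out = io.StringIO()
        csv.writer(out).writerows([header] + clean)
        s3.put_object(Bucket=STAGING_BUCKET, Key=f"clean/{key}", Body=out.getvalue())

    return {"status": "ok", "objects_processed": len(event["Records"])}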

• Consult leadership/stakeholders to share design recommendations and thoughts to identify product and technical requirements, resolve technical problems, and suggest Big Data-based analytical solutions.

• Wrote Spark-Streaming applications to consume the data from Kafka topics and write the processed streams to HBase.

• Wrote Spark applications for data validation, cleansing, transformations, and custom aggregations.

• Worked with AWS Terraform templates in maintaining the infrastructure as code.

• Involved in the entire lifecycle of projects, including design, development, deployment, testing, implementation, and support. Managed large datasets using pandas DataFrames and SQL.

• Experience working with Docker to improve our Continuous Delivery (CD) framework and streamline releases.

• Worked on data management disciplines including data integration, modeling, and other areas directly relevant to business intelligence/business analytics development. Designed and implemented a fault-tolerant data processing framework leveraging Kubernetes and Docker, reducing downtime and increasing system reliability by 25%.

• Skilled in monitoring servers using Nagios and CloudWatch, and using the ELK Stack (Elasticsearch, Kibana).

• Built PySpark code to validate data from raw sources against Snowflake tables.
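
A small sketch of the kind of validation this involves, assuming both the raw source and the Snowflake target have already been read into Spark DataFrames (the Snowflake connector read itself is omitted); the key column name is hypothetical.

from pyspark.sql import DataFrame
from pyspark.sql.functions import col


def validate_load(raw_df: DataFrame, target_df: DataFrame, key_col: str = "order_id") -> dict:
    """Compare a raw-source DataFrame against the loaded target and report basic checks."""
    checks = {}

    # Row counts should match between source and target
    checks["row_count_match"] = raw_df.count() == target_df.count()

    # The business key should not be null or duplicated in the target
    checks["null_keys"] = target_df.filter(col(key_col).isNull()).count()
    checks["duplicate_keys"] = (
        target_df.count() - target_df.dropDuplicates([key_col]).count()
    )

    # Keys present in the source but missing from the target
    checks["missing_keys"] = (
        raw_df.select(key_col).subtract(target_df.select(key_col)).count()
    )
    return checks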

• Worked on building ETL pipelines for data ingestion, transformation, and validation on AWS, working alongside data stewards to meet data compliance requirements.

Environment: AWS, Docker, ETL, HBase, Hive, Kafka, Kubernetes, Lambda, Pig, PySpark, Redshift, Snowflake, Spark, SQL

GCP Data Engineer, HSBC, Hyderabad, India (Nov 2021 - Jun 2023)

HSBC Holdings plc is a British universal bank and financial services company. Implemented data quality measures and performed regular audits to ensure data accuracy, completeness, and consistency. Developed and enforced data governance policies and procedures to maintain data integrity and compliance with industry regulations.

Key Responsibilities:

• Involved in scheduling the Oozie workflow engine to run multiple Hive and Pig jobs, and used Oozie workflows for batch processing and dynamic workflow scheduling.

• Assess the infrastructure needs for each application and deploy it on the Azure platform. Ensured data quality and report accuracy by implementing validation scripts and schema checks in the pipeline feeding Data Studio.

• Automated batch data workflows on GCP using Dataproc and Cloud Composer, ensuring timely data availability for downstream analytics and BI dashboards. Achieved 70% faster EMR cluster launch and configuration, optimized Hadoop job processing by 60%, improved system stability, and utilized Boto3 for seamless file writing to S3 buckets.
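
For the Boto3 file-writing piece, a minimal sketch could look like the following; the bucket name and key prefix are placeholders.

import boto3

s3 = boto3.client("s3")


def publish_to_s3(local_path: str, bucket: str = "example-analytics-bucket",
                  prefix: str = "daily-exports") -> str:
    """Upload a locally produced file to S3 so downstream jobs can pick it up."""
    key = f"{prefix}/{local_path.split('/')[-1]}"
    s3.upload_file(local_path, bucket, key)  # handles multipart uploads for large files
    return f"s3://{bucket}/{key}"


# Example usage with a hypothetical export file:
# print(publish_to_s3("/tmp/metrics_2023-05-01.parquet"))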

• Used Cloud Shell for troubleshooting production data pipelines in GCP.

• Integrated Databricks notebooks with Azure Data Factory for orchestrating batch and streaming workflows.

• Developed multiple notebooks using PySpark and Spark SQL in Databricks for data extraction, analyzing and transforming the data according to the business requirements. Used Python for data validation and analysis purposes.

• Automated deployment and management of GCP resources using the Google Cloud SDK, streamlining infrastructure provisioning for data pipelines across Dataproc, BigQuery, and GCS.
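
A brief, hedged sketch of scripted BigQuery work with the google-cloud-bigquery client, in the spirit of the automation described here; the project, dataset, table, and GCS URI are hypothetical.

from google.cloud import bigquery

# Credentials are assumed to come from the environment (e.g. GOOGLE_APPLICATION_CREDENTIALS)
client = bigquery.Client(project="example-project")

# Load newline-delimited JSON from GCS into a BigQuery table
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
load_job = client.load_table_from_uri(
    "gs://example-bucket/exports/events-*.json",
    "example-project.analytics.events",
    job_config=job_config,
)
load_job.result()  # wait for the load to finish

# Run a follow-up aggregation query
query = """
    SELECT DATE(event_ts) AS event_date, COUNT(*) AS events
    FROM `example-project.analytics.events`
    GROUP BY event_date
    ORDER BY event_date
"""
for row in client.query(query).result():
    print(row.event_date, row.events)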

• Experience in report writing using SQL Server Reporting Services (SSRS), creating various types of reports such as drill-down, parameterized, cascading, conditional, table, matrix, chart, and sub-reports.

• Used Azure Data Factory to ingest data from log files and custom business applications, processed the data in Databricks per day-to-day requirements, and loaded it into Azure Data Lake.

• Extensive use of the Cloud Shell and Cloud SDK in GCP to configure and deploy services, including GCP BigQuery.

• Designed Cassandra schemas for time-series IoT data (500K writes/sec). Worked with GCP services including Cloud Storage, Dataproc, Dataflow, and BigQuery, as well as EMR, S3, Glacier, and EC2 with EMR clusters.

• Integrated Dataflow with BigQuery, Cloud Storage, and Firestore to support analytics, machine learning, and operational use cases. Wrote Python DAGs in Airflow to orchestrate end-to-end data pipelines for multiple applications.

Environment: Azure, Azure Data Lake, Azure Data Factory, BigQuery, Cassandra, EC2, EMR, GCP, Hive, Oozie, PySpark, Python, S3, SDK, Spark, Spark SQL, SQL, SSRS, VPC

Data Engineer, ZF Group, Hyderabad, India (Mar 2020 - Oct 2021)

ZF Group is a German technology manufacturing company that supplies systems for passenger cars, commercial vehicles, and industrial technology. Utilized big data technologies (e.g., Hadoop, Spark) to handle large-scale data processing and analytics. Implemented solutions for real-time data streaming and processing (e.g., Kafka, Flink).

Key Responsibilities:

• Designed and developed ETL pipelines for real-time data integration and transformation using Kubernetes and Docker.

• Integrated AWS DynamoDB with AWS Lambda to store item values and back up DynamoDB streams.

• Responsible for building scalable distributed data solutions using Hadoop. Developed Spark Streaming programs to process near-real-time data from Kafka, and processed data with both stateless and stateful transformations.

• Created several Databricks Spark jobs with PySpark to perform table-to-table operations.

• Experienced in tuning Spark applications for the proper batch interval, level of parallelism, and memory usage. Utilized Elasticsearch and Kibana for indexing and visualizing real-time analytics results, enabling stakeholders to gain actionable insights quickly.

• Actively Participated in all phases of the Software Development Life Cycle (SDLC) from implementation to deployment.

• Imported and exported databases using SQL Server Integration Services (SSIS) and Data Transformation Services (DTS packages). Worked on CI/CD tools like Jenkins and Docker in the DevOps team, setting up the application process end to end using Deployment for lower environments and Delivery for higher environments, with approvals in between.

• Migrated legacy cron-based workflows to Airflow, improving monitoring, failure alerts, and logging with minimal downtime.
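
A compact sketch of what such a cron-to-Airflow migration can look like, assuming Airflow 2.x with the PythonOperator; the DAG id, schedule, and task callables are hypothetical placeholders for the original cron steps.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(**context):
    # Placeholder for the former cron script's extract step
    print("extracting source data")


def load(**context):
    # Placeholder for the load into the warehouse
    print("loading into the warehouse")


with DAG(
    dag_id="legacy_cron_replacement",
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 2 * * *",  # same cadence as the old cron entry
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task  # retries, failure alerts, and logs come from the scheduler/UI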

• Working on migrating data to the cloud (Snowflake and AWS) from legacy data warehouses and developing the infrastructure. Developed metrics based on SAS scripts on the legacy system and migrated them to Snowflake (AWS S3).

• Developed a fully automated continuous integration system using Git, Jenkins, MySQL and custom tools developed in Python and Bash. Hands on experience in working with AWS Cloud Services like EMR, S3 and Redshift.

• Designed and implemented Elasticsearch index schemas to support scalable, high-performance search and analytics over structured and unstructured data.
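
One way such an index schema can be expressed with the official Elasticsearch Python client (8.x-style keyword arguments assumed); the index name, field mappings, and shard settings are hypothetical.

from elasticsearch import Elasticsearch

# Connection details are placeholders
es = Elasticsearch("http://localhost:9200")

# Explicit mappings keep search and aggregations predictable for mixed structured/unstructured data
es.indices.create(
    index="search-events",
    settings={"number_of_shards": 3, "number_of_replicas": 1},
    mappings={
        "properties": {
            "event_ts": {"type": "date"},
            "user_id": {"type": "keyword"},  # exact-match filtering and aggregations
            "message": {"type": "text"},  # full-text search over unstructured content
            "payload": {"type": "object", "enabled": False},  # stored but not indexed
        }
    },
)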

Environment: Airflow, AWS, CI/CD, Docker, DynamoDB, Elasticsearch, EMR, ETL, GCP, Git, Jenkins, Kafka, Kubernetes, Lambda, MySQL, Power BI, PySpark, Python, Redshift, S3, SAS, Snowflake, Spark, Spark Streaming, SQL, SSIS


