
Data Engineer Azure

Location:
Bellevue, WA
Posted:
March 27, 2025

Resume:

SHINY THOTA

Sr. Data Engineer

E: ***********@*****.***

P: +1-980-***-****

SUMMARY:

●Over 5 years of experience as a Data Engineer, with a strong traditional data engineering background and expertise in Apache Spark, PySpark, Kafka, Spark Streaming, Spark SQL, Hadoop, HDFS, Hive, Sqoop, Pig, MapReduce, Flume, and Beam.

●Extensive experience with relational databases including Microsoft SQL Server, Teradata, Oracle, and PostgreSQL, and NoSQL databases including MongoDB, HBase, Azure Cosmos DB, AWS DynamoDB, and Cassandra.

●Hands-on experience with data modeling, physical data warehouse design, and cloud data warehousing technologies including Snowflake, Redshift, BigQuery, and Synapse.

●Experience with major cloud providers & cloud data engineering services including AWS, Azure, GCP & Databricks.

●Created and optimized Talend jobs for data extraction, data cleansing, and data transformation.

●Designed & orchestrated data processing layer & ETL pipelines using Airflow, Azure Data Factory, Oozie, Autosys, Cron & Control-M.

●Hands-on experience with AWS services including EMR, EC2, Redshift, Glue, Lambda, SNS, SQS, CloudWatch, Kinesis, Step Functions, managed Airflow instances, Storage & Compute.

●Hands-on experience with Azure services including Synapse, Azure Data Factory, Azure Functions, Event Hubs, Stream Analytics, Key Vault, Storage & Compute.

●Hands-on experience with GCP services including Dataproc, VM, BigQuery, Dataflow, Cloud Functions, Pub/Sub, Composer, Secrets, Storage & Compute.

●Hands-on experience with Databricks services including Notebooks, Delta Tables, SQL Endpoints, Unity Catalog, Secrets, and Clusters.

●Experience in data governance and Master Data Management (MDM) using Collibra and Informatica, including data standardization to improve MDM and address other common data management issues.

●Designed and maintained data integration solutions to extract, transform, and load (ETL) data from various sources into target systems.

●Experienced in working with ETL tools such as Informatica, DataStage, or SSIS (SQL Server Integration Services).

●Good knowledge in Database Creation and maintenance of physical data models with Oracle, Teradata, Netezza, DB2, MongoDB, HBase and SQL Server databases.

●Experienced in writing complex SQL, including stored procedures, triggers, joins, and subqueries.

●Expert in migrating SQL databases to Azure Data Lake Storage, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory (ADF).

●Integrated Kafka with Spark Streaming for real-time data processing.

●Hands-on experience with ETL, Hadoop, and data governance tools such as Tableau and Informatica Enterprise Data Catalog.

●Experience in designing & developing applications using Big Data technologies such as HDFS, MapReduce, Sqoop, Hive, PySpark & Spark SQL, HBase, Python, Snowflake, S3 storage, and Airflow.

●Understanding of structured datasets, data pipelines, ETL tools, and data reduction, transformation, and aggregation techniques; knowledge of tools such as dbt and DataStage.

●Have good knowledge in Job Orchestration tools like Oozie, Zookeeper & Airflow.

●Knowledge and experience with Continuous Integration and Continuous Deployment (CI/CD) using Docker and Jenkins.

●Excellent working experience in Agile/Scrum development and Waterfall project execution methodologies. Experienced in using Agile methodologies including extreme programming, SCRUM and Test-Driven Development (TDD).

TECHNICAL SKILLS:

Big Data Technologies

Hadoop, MapReduce, Spark, HDFS, Sqoop, YARN, Oozie, Hive, Impala, Zookeeper, Apache Flume, Apache Airflow, Cloudera, HBase

Programming Languages

Python, PL/SQL, SQL, Scala, C, C#, C++, T-SQL, PowerShell scripting, JavaScript

Cloud Services

Azure Data Lake Storage Gen 2, Azure Data Factory, Blob storage, Azure SQL DB, Databricks, Azure Event Hubs, AWS RDS, Amazon SQS, Amazon S3, AWS EMR, Lambda, AWS SNS, GCP Big Query, Dataflow, Pub/Sub, and Cloud Storage.

Databases

MySQL, SQL Server, Oracle, MS Access, Teradata, and Snowflake

NoSQL Databases

MongoDB, Cassandra DB, HBase

Monitoring tool

Apache Airflow

Visualization & ETL tools

Tableau, Informatica, Talend, SSIS, Power BI and SSRS

Version Control & Containerization tools

Jenkins, Git, and SVN

Operating Systems

Unix, Linux, Windows, macOS

WORK EXPERIENCE:

Client: Amazon, Bellevue

Duration: Dec 2024 - Present

Role: Data Engineer II

Responsibilities:

•Proven ability to work in cross-functional Agile teams, ensuring timely and high-quality delivery of scalable data engineering solutions.

•Active participation in Agile ceremonies, including daily stand-ups, sprint planning, and retrospectives, to drive seamless collaboration.

•Designed and implemented scalable ETL pipelines using AWS Glue, PySpark, and SQL, processing high-volume datasets from Amazon S3 into structured data warehouses such as Amazon Redshift (see the sketch after this section).

•Developed robust data cleansing, transformation, and validation frameworks using SQL, Python, and schema enforcement techniques to standardize formats, eliminate duplicates, and maintain high data integrity.

•Built end-to-end orchestration workflows using AWS Step Functions, ensuring seamless execution, monitoring, and error handling of multi-step ETL processes.

•Automated cloud resource provisioning using AWS CDK, ensuring repeatable and scalable deployments aligned with Amazon’s infrastructure standards.

•Leveraged GitFarm for source control and implemented CDK Pipelines for CI/CD automation, ensuring efficient and reliable deployments across environments.

•Integrated AWS CloudWatch for real-time monitoring, logging, and alerting, enabling proactive issue resolution and performance optimization of ETL workloads.

•Used Amazon Athena for querying and analyzing raw and processed datasets in S3, enabling efficient data exploration without impacting production workloads.

•Streamlined deployment workflows by leveraging Brazil CLI for managing build, packaging, and deployment processes within Amazon’s internal ecosystem.

•Optimized Redshift table structures (distribution keys, sort keys, compression techniques) and implemented incremental data loads, reducing redundant processing and improving query performance.

•Achieved 99.9% data accuracy through rigorous validation processes and reduced ETL job execution time by 60% via optimized Spark and SQL query tuning.

Environment: Spark SQL, data validation, data quality, AWS Step Functions, SQL, AWS CloudFormation, Amazon Athena, Amazon Redshift, AWS Glue, JSON, TypeScript, Scala, AWS S3, Amazon CloudWatch, data pipelines, PostgreSQL.
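
Below is a minimal, illustrative sketch of the kind of Glue PySpark ETL described in this role: reading a raw dataset registered in the Glue Data Catalog, applying basic cleansing, and loading the result into Amazon Redshift. The database, table, connection, and bucket names are hypothetical placeholders, not details from the actual project.

```python
# Minimal AWS Glue PySpark job sketch (illustrative only): read raw JSON registered in
# the Glue Data Catalog, apply basic cleansing, and load the result into Amazon Redshift.
# The database, table, connection, and bucket names below are hypothetical placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw dataset registered in the Glue Data Catalog.
raw = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="orders_json"
).toDF()

# Basic cleansing: enforce types, drop duplicates, and require a primary key.
clean = (
    raw.withColumn("order_ts", F.to_timestamp("order_ts"))
       .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
       .dropDuplicates(["order_id"])
       .filter(F.col("order_id").isNotNull())
)

# Load into Redshift through a Glue connection; the temp dir stages data in S3.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=DynamicFrame.fromDF(clean, glue_context, "clean_orders"),
    catalog_connection="redshift-conn",
    connection_options={"dbtable": "analytics.orders", "database": "dev"},
    redshift_tmp_dir="s3://example-bucket/glue-temp/",
)

job.commit()
```

In practice a job like this would be parameterized through Glue job arguments and orchestrated by Step Functions, as described in the bullets above.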

Client: Centene Corporation, St. Louis

Duration: Jan 2023 to Dec 2024

Role: Senior Data Engineer

Responsibilities:

●Involved in full Software Development Life Cycle (SDLC) - Business Requirements Analysis, preparation of Technical Design documents, Data Analysis, Logical and Physical database design, Coding, Testing, Implementing, and deploying to business users.

●Designed and implemented end-to-end data pipelines on AWS using services such as AWS Glue, AWS Lambda, and AWS EMR.

●Wrote complex SQL using joins, subqueries, and correlated subqueries; expertise in SQL queries for cross-verification of data.

●Created an ingestion framework for building a data lake from heterogeneous sources such as flat files, Oracle DB, mainframe, and SQL Server databases.

●Designed and developed ETL processes in AWS Glue to load data from external sources like S3, the Glue Data Catalog, and AWS Redshift.

●Used DynamoDB to log ETL process errors while validating input files against target table structures, including data type mismatches.

●Developed complex ETL mappings for Stage, Dimensions, Facts, and Data marts load. Involved in Data Extraction for various Databases & Files using Talend.

●Optimized Spark jobs in Databricks to enhance performance and reduce processing time, utilizing AWS EMR integration where necessary.

●Collaborated with data scientists to deploy machine learning models using Databricks on AWS for real-time scoring and inference.

●Integrated Databricks with AWS services such as Amazon S3, Amazon Redshift, Amazon RDS, and AWS Glue.

●Implemented data security and access controls within Databricks on AWS to ensure compliance with regulatory requirements.

●Efficiently ingested large files (around 600 GB) into S3.

●Used Glue jobs to read data from S3 and load it into Redshift tables, reading metadata from the Data Catalog in JSON format.

●Developed and maintained data dictionaries, data catalogs, and data lineage documentation for improved data understanding and traceability.

●Developed ETL processes in AWS Glue to migrate campaign data from external sources like S3 (ORC/Parquet/text files) into Snowflake on AWS.

●Created data pipelines involving various AWS services including S3, Kinesis Data Firehose, Kinesis Data Streams, SNS, SQS, Athena, and Snowflake.

●Worked on end-to-end deployment of the project that involved Data Analysis, Data Pipelining, Data Modelling, Data Reporting and Data documentations as per the business needs.

●Gathered data stored in AWS S3 from various third-party vendors, optimized it, and joined it with internal datasets to derive meaningful information.

●Created a custom read/write Snowflake utility function in Python to transfer data from an AWS S3 bucket to Snowflake (see the sketch after this section).

●Created S3 buckets and managed S3 bucket policies, as well as using S3 buckets for storage and backup on AWS.

●Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources like S3 (ORC/Parquet/text files) into AWS Redshift.

●Developed Python scripts to transfer data from on-premises systems to AWS S3 and to call REST APIs and extract data to AWS S3.

●Ingested data through cleansing and transformation steps, leveraging AWS Lambda, AWS Glue, and Step Functions.

●Developed Spark applications using PySpark and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats.

●Developed PySpark and SparkSQL code to process the data in Apache Spark on Amazon EMR to perform the necessary transformations based on the source to target mappings developed.

●Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.

●Performed benchmark tests reading data from databases and object stores using Pandas and PySpark APIs to compare results, identify potential improvement areas, and provide recommendations.

●Read and wrote Parquet and JSON files from S3 buckets using Spark and Pandas DataFrames with various configurations.

●Designed, developed, and maintained complex data pipelines using Apache Airflow for data extraction, transformation, and loading (ETL) processes.

●Orchestrated and scheduled data workflows in Apache Airflow to ensure timely and automated execution of data tasks.

●Designed and implemented data visualizations and charts in Tableau to effectively communicate complex data insights and trends to non-technical users.

●Skilled in visualizing and presenting data using Tableau, creating interactive dashboards, and generating meaningful insights for stakeholders.

●Worked closely with application customers to resolve JIRA tickets related to API issues, data issues, consumption latencies, onboarding, and publishing data.

Environment: Python, Spark, AWS EC2, AWS S3, AWS EMR, AWS Redshift, AWS Glue, AWS RDS, AWS Kinesis Data Firehose, Kinesis Data Streams, AWS SNS, AWS SQS, AWS Athena, Snowflake, SQL, Tableau, Git, REST, Bitbucket, Jira.
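
As a rough illustration of the custom Python utility for moving S3 data into Snowflake mentioned above, the sketch below runs a COPY INTO from an external stage using the Snowflake Python connector. The account, credentials, stage, and table names are hypothetical placeholders.

```python
# Minimal sketch of a Python utility that loads S3 files into Snowflake with COPY INTO.
# Connection parameters, stage, and table names are hypothetical placeholders.
import snowflake.connector


def copy_s3_to_snowflake(s3_prefix: str, target_table: str) -> int:
    """Run COPY INTO from an external stage that points at the S3 bucket.

    Returns the number of files Snowflake reports as loaded.
    """
    conn = snowflake.connector.connect(
        account="my_account",        # placeholder
        user="etl_user",             # placeholder
        password="***",              # use a secrets manager in practice
        warehouse="ETL_WH",
        database="ANALYTICS",
        schema="STAGING",
    )
    try:
        cur = conn.cursor()
        # "@s3_stage" is assumed to be an external stage already configured for the bucket.
        cur.execute(
            f"COPY INTO {target_table} "
            f"FROM @s3_stage/{s3_prefix} "
            f"FILE_FORMAT = (TYPE = PARQUET) "
            f"MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE"
        )
        return len(cur.fetchall())   # one result row per loaded file
    finally:
        conn.close()


if __name__ == "__main__":
    loaded = copy_s3_to_snowflake("campaign/2024-06-01/", "CAMPAIGN_EVENTS")
    print(f"Loaded {loaded} files")
```

A matching "read" helper would typically wrap a SELECT and return a DataFrame, and credentials would come from a secrets manager rather than being hard-coded.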

Client: Northern Trust Bank, Chicago

Duration: Jan 2022 to Dec 2022

Role: Azure Data Engineer

Responsibilities:

●Demonstrated proficiency in Agile methodologies, working within cross-functional Agile teams to deliver data engineering solutions on schedule and within scope.

●Actively participated in Agile ceremonies such as daily stand-ups, sprint planning, sprint reviews, and retrospectives to ensure effective collaboration and communication within the team.

●Designed and implemented end-to-end data solutions on the Azure cloud platform, including Azure Databricks, Azure Synapse Pipeline, and Azure Blob Storage.

●Developed and managed Azure Data Lake and Azure Blob Storage accounts to store, manage, and secure data assets.

●Created and maintained ETL pipelines using Azure Data Factory to orchestrate data workflows.

●Collaborated with cross-functional teams to understand business requirements and translated them into scalable data solutions using Azure services.

●Worked with Azure PostgreSQL and Azure SQL databases to store and retrieve data for various applications and analytics.

●Designed and developed complex data models and database schemas in Azure SQL Database, optimizing data storage, retrieval, and organization.

●Engineered end-to-end data pipelines using SQL for data extraction, transformation, and loading (ETL) processes, ensuring high-quality data for analytics and reporting.

●Conducted data cleansing, transformation, and enrichment using SQL scripts to ensure data accuracy and consistency.

●Created and optimized SQL queries, stored procedures, and data processing logic for data analysis and reporting within Azure SQL Database.

●Managed and optimized data storage, indexing, and partitioning strategies within Azure databases and data lakes for efficient data access.

●Utilized Spark Streaming and Azure Databricks for real-time data processing and streaming analytics.

●Implemented data versioning, change tracking, and data lineage for enhanced data governance and auditing in Azure environments.

●Developed end-to-end data pipelines in Azure Databricks, encompassing the bronze, silver, and gold stages for comprehensive data processing (see the sketch after this section).

●Implemented the Bronze stage in data pipelines, focusing on raw data ingestion, storage, and initial data quality checks.

●Enhanced data quality and usability by transitioning data through the Silver stage, performing data transformations, normalization, and schema changes.

●Orchestrated data cleaning and transformation processes within Azure Databricks, ensuring the silver data was structured and ready for analysis.

●Proficient in designing, developing, and maintaining data pipelines and ETL processes using Azure Databricks.

●Skilled in data ingestion from various sources such as Azure Blob Storage, Azure Data Lake Storage, and Azure SQL Database into Azure Databricks.

●Leveraged Databricks for advanced data transformations, including aggregations, joins, and feature engineering, to prepare data for analytical purposes in the gold stage.

●Stored and managed gold data in Azure data warehousing solutions, optimizing data structures for high-performance querying and reporting.

●Implemented Spark Streaming for real-time data processing and analytics, enabling immediate insights from streaming data sources.

●Ensured data security and access control on Azure through identity and access management, encryption, and compliance measures.

●Developed data archiving and retention strategies in Snowflake to store historical data while optimizing storage costs.

●Utilized Snowflake's data-sharing capabilities to securely share historical data with external partners, enabling collaborative analysis and reporting.

●Implemented CI/CD pipelines with GitHub to automate testing and deployment of data solutions.

●Managed and optimized Snowflake data warehousing solutions for data storage and retrieval.

●Developed and maintained PySpark and Pandas-based data processing scripts and notebooks for data transformation and analysis.

●Developed PySpark-based data processing scripts and notebooks to perform data transformations and analysis on large datasets.

●Developed Python scripts to transfer data from on-premises systems and to call REST APIs and extract data.

●Hands-on experience with Spark SQL, DataFrame, and Dataset APIs for data manipulation and analysis.

●Strong understanding of data warehousing concepts and experience in building data warehouses on Azure using Databricks.

●Proficient in working with structured, semi-structured, and unstructured data formats in Azure Databricks.

●Implemented data security and governance practices in Azure Databricks environments.

●Utilized PySpark to build scalable data processing applications, taking advantage of parallel processing and in-memory computing.

●Assisted in data migration projects, including the movement of on-premises data to Azure cloud environments.

●Worked on continuous integration and continuous deployment (CI/CD) pipelines for automated testing and deployment of data solutions.

●Collaborated with Azure DevOps teams to ensure high availability, scalability, and resource optimization for data systems.

●Read data from databases and object stores using Pandas and PySpark APIs to compare results, identify potential improvement areas, and provide recommendations.

●Used Spark's structured APIs for manipulating all sorts of data, from unstructured log files to semi-structured CSV files and highly structured Parquet files.

●Successfully completed data migration projects to Azure, ensuring data consistency and integrity during the transition.

Environment: Python, SQL, Azure Databricks, Azure Synapse Pipelines, Azure Blob Storage, Azure Data Lake, Terraform, Azure PostgreSQL, Azure SQL, Spark Streaming, GitHub, PyCharm, Snowflake, PySpark, Pandas.
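
The bronze/silver/gold flow described for this role could look roughly like the following Databricks notebook sketch using PySpark and Delta tables; the storage paths, schema names, and column names are hypothetical placeholders, not details from the actual project.

```python
# Minimal PySpark sketch of the bronze/silver/gold (medallion) flow described above,
# written for a Databricks notebook with Delta tables. Paths, schemas, and columns
# are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # provided automatically in Databricks

# Bronze: land raw JSON from ADLS/Blob storage as-is, with ingestion metadata.
bronze = (
    spark.read.json("abfss://raw@examplelake.dfs.core.windows.net/claims/")
         .withColumn("_ingested_at", F.current_timestamp())
)
bronze.write.format("delta").mode("append").saveAsTable("bronze.claims")

# Silver: cleanse and conform - typed columns, deduplication, basic quality filters.
silver = (
    spark.table("bronze.claims")
         .withColumn("claim_amount", F.col("claim_amount").cast("decimal(18,2)"))
         .withColumn("service_date", F.to_date("service_date"))
         .dropDuplicates(["claim_id"])
         .filter(F.col("claim_id").isNotNull())
)
silver.write.format("delta").mode("overwrite").saveAsTable("silver.claims")

# Gold: business-level aggregate ready for querying and reporting.
gold = (
    spark.table("silver.claims")
         .groupBy("member_id", F.year("service_date").alias("service_year"))
         .agg(F.sum("claim_amount").alias("total_claims"))
)
gold.write.format("delta").mode("overwrite").saveAsTable("gold.member_claims_yearly")
```

In practice each stage would run as its own scheduled job with data-quality checks between layers, rather than one linear script.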

Client: Wells Fargo, Hyderabad, India

Duration: June 2020 to July 2021

Role: Data Analyst/ SQL Server Developer

Responsibilities:

●Worked with Technical Architects and Database analysts for the Design of tables for efficient Report Design.

●Developed the SQL Server Integration Services (SSIS) packages to transform data as well as created interface stored procedures used in SSIS to load/transform data to the Database.

●Extracted data from different sources (CSV files, Oracle, and MS SQL Server 2000) and stored it in an intermediate staging database using the SSIS ETL tool.

●Created Dashboards using Tableau Dashboard and prepared user stories to create compelling dashboards to deliver actionable insights.

●Managed datasets using Pandas DataFrames and MySQL; queried the MySQL database from Python using the Python MySQL connector and MySQLdb package to retrieve information (see the sketch after this section).

●Implemented business logic in Python to prevent, detect and claim duplicate payments.

●Wrote scripts to integrate APIs with third-party applications for a Python-based web application with PostgreSQL, including integrations with third-party email, messaging, and storage services.

●Improved the performance of an ASP.NET MVC application by taking advantage of the output cache.

●Transformed data from the staging database using Data Flow Tasks.

●Designed and created data extracts supporting SSRS, Power BI, Tableau, and other visualization and reporting applications.

●Generated reports from the cubes by connecting to Analysis server from SSRS.

●Developed various reports using SSRS, which included writing complex stored procedures for datasets.

●Involved in Report Authoring, Report Delivery (using both Pull and Push strategies), and Report Management for various types of reports.

●Prepared the reports for the day-to-day as well as weekly/monthly purposes in multiple formats like MS Excel, PDF, HTML, and XML etc.

●Migrated a classic ASP project to an ASP.NET application and migrated Windows applications to web applications.

●Created business rules in Informatica Developer and imported them into Informatica PowerCenter to load standardized, well-formatted data into staging tables.

●Built cubes for a production application and partitioned cubes to decrease processing time and improve the performance of queries running on front-end applications using SSAS 2012/2016.

●Extensively involved in the SSAS storage and partitions, Aggregations, calculation of queries with MDX, Data Mining Models, and developing reports using MDX and SQL.

●Responsible for logical and physical data modeling, database design, star schema, data analysis, programming, Documentation, and implementation.

●Fetched data from Oracle 10g by creating a linked server and using SQL Server to query the Oracle database.

●Experienced in developing and extending OLAP cubes, dimensions, and data source views.

●Worked on T-SQL programming, stored procedures, user-defined functions, cursors, views setup, and management of linked servers to Non-SQL-Server databases.

●Developed complex stored procedures using T-SQL to generate ad hoc reports in SSRS.

Environment: SQL Server 2016/2014/2012, ASP.NET, SQL Server Management Studio (SSMS), Tableau 9.7, .NET, Business Intelligence Studio (SSAS, SSIS, SSRS), OBIEE, T-SQL, Power BI, Erwin, MDX, Oracle 10g, MS Excel, MS Office.
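
A minimal sketch of the Python/MySQL/Pandas duplicate-payment check referenced above; the connection settings and the payments table schema are hypothetical placeholders.

```python
# Minimal sketch of pulling a MySQL result set into a Pandas DataFrame and flagging
# likely duplicate payments, as described above. Connection details and the
# table/column names are hypothetical placeholders.
import mysql.connector
import pandas as pd


def load_payments(batch_date: str) -> pd.DataFrame:
    """Fetch one day's payments into a DataFrame."""
    conn = mysql.connector.connect(
        host="localhost", user="report_user", password="***", database="payments"  # placeholders
    )
    try:
        query = (
            "SELECT payment_id, vendor_id, invoice_no, amount, paid_on "
            "FROM payments WHERE paid_on = %s"
        )
        return pd.read_sql(query, conn, params=[batch_date])
    finally:
        conn.close()


def flag_duplicate_payments(df: pd.DataFrame) -> pd.DataFrame:
    """Flag rows that share vendor, invoice number, and amount - likely duplicates."""
    dupes = df.duplicated(subset=["vendor_id", "invoice_no", "amount"], keep=False)
    return df[dupes].sort_values(["vendor_id", "invoice_no"])


if __name__ == "__main__":
    payments = load_payments("2021-03-31")
    print(flag_duplicate_payments(payments))
```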

Client: Micron Technology, Inc., Hyderabad, India - Internship

Duration: Dec 2019 to May 2020

Role: Python Developer

Responsibilities:

●Created a Python/Django-based web application using Python scripting and MySQL Workbench for the database, with HTML/CSS/jQuery and Highcharts for data visualization of the served pages (see the sketch after this section).

●Successfully migrated the Django database from MySQL Workbench to PostgreSQL with complete data integrity.

●Integrated Bootstrap for responsive UI components, utilized React.js for dynamic interfaces, and employed Sass/SCSS for efficient styling in frontend development.

●Used TypeScript for type-safe JavaScript development in frontend components, enhancing code quality and maintainability alongside Python in web applications.

●Deployed Python-based web applications on an Apache Tomcat server, ensuring seamless integration and optimal performance in production environments.

●Experience with Core Java and a strong understanding and working knowledge of object-oriented programming (OOP) concepts such as collections, multithreading, exception handling, and reflection.

●Designed and developed ETL processes that were used to integrate data between various sources using Informatica.

●Involved in writing complex SQL queries to filter the data and fetch data to Tableau desktop to generate reports.

●Extensive experience in deploying, managing, and developing against Oracle databases using Oracle SQL Developer.

●Created Automated Test Scripts in Python for Reports and Data extraction jobs.

●Developed Python scripts to process financial data from SQL Developer based on the specified requirements.

●Refactored existing batch jobs, migrated legacy extracts from Informatica to Python-based microservices, and deployed them in AWS with minimal downtime.

●Worked on Python OpenStack APIs; used Python scripts to update content in the database and manipulate files.

●Worked on Jira for bug tracking and communicating for better results.

●Worked with the production support team to resolve real-time issues based on the specified requirements.

●Experience in handling production-based tickets by working closely with core database team.

Environment: Python, Django, JSON, REST, HTML/CSS, jQuery, MySQL Workbench, PostgreSQL, Bootstrap, TypeScript, Apache Tomcat, Informatica, Oracle SQL Developer, Jira.
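
A minimal Django sketch of the pattern referenced above: a model persisted in the project database and a view returning aggregated rows as JSON for a Highcharts chart on the served page. The app, model, field, and URL names are hypothetical placeholders.

```python
# Minimal Django sketch (illustrative only): a model in an app's models.py and a view
# in views.py that serves aggregated rows as JSON for the frontend chart library.
# Model, field, and endpoint names are hypothetical placeholders.
from django.db import models
from django.db.models import Sum
from django.http import JsonResponse


class Measurement(models.Model):
    """One reading per row; the table is created by the usual Django migration."""
    sensor = models.CharField(max_length=50)
    value = models.FloatField()
    recorded_at = models.DateTimeField(auto_now_add=True)


def chart_data(request):
    """Return per-sensor totals as JSON for the chart on the served page."""
    rows = (
        Measurement.objects.values("sensor")
        .annotate(total=Sum("value"))
        .order_by("sensor")
    )
    return JsonResponse({
        "categories": [r["sensor"] for r in rows],
        "series": [r["total"] for r in rows],
    })
```

The view would be wired up in urls.py (for example, path("api/chart-data/", chart_data)) and consumed by the chart's JavaScript on the page.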

EDUCATION:

Kakatiya Institute of Technology and Science

Bachelor of Science in Electrical and Electronics Engineering, GPA: 8.2

Trine University

Master's in Information Studies, GPA: 3.7


