
Azure Data Engineer

Location:
Financial District, MA, 02109
Posted:
February 28, 2025

Resume:

Sumanth Reddy K

Ph no: 214-***-****

Email:************.*.********@*****.***

Sr. Data Engineer

PROFESSIONAL SUMMARY:

Around 9 years of experience in Spark, Python, SQL, MPP databases, and data warehousing. Experienced in big data technologies, the Hadoop ecosystem, and SQL-related technologies. Performed big data analytics using various Hadoop ecosystem tools and the Spark framework, and designed pipelines using Spark and Spark SQL.

Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.

Years of experience in data modeling, warehousing, and building ETL pipelines using SQL and Python.

Experience installing/configuring/maintaining Apache Hadoop clusters for application development and Hadoop tools like Sqoop, Hive, HBase, Kafka, Hue, Oozie, Spark, Scala, and Python.

Developed data pipelines using Spark on EMR clusters and scheduled jobs using Airflow.
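
For illustration, an Airflow DAG of the kind described above might look like the following minimal sketch; the DAG id, schedule, and script path are hypothetical, and a plain spark-submit via BashOperator stands in for any EMR-specific operators.

    # Minimal sketch: schedule a nightly spark-submit for an EMR-backed pipeline.
    # The DAG id, schedule, and S3 script path are hypothetical placeholders.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    default_args = {"owner": "data-engineering", "retries": 2, "retry_delay": timedelta(minutes=10)}

    with DAG(
        dag_id="nightly_spark_ingestion",
        start_date=datetime(2023, 1, 1),
        schedule_interval="0 2 * * *",   # daily at 02:00
        catchup=False,
        default_args=default_args,
    ) as dag:
        run_spark_job = BashOperator(
            task_id="run_spark_job",
            bash_command=(
                "spark-submit --deploy-mode cluster "
                "s3://example-bucket/jobs/ingest_daily.py --run-date {{ ds }}"
            ),
        )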

Developed AWS CloudFormation templates to create custom-sized EC2 instances, ELBs, Lambda functions, S3 buckets, Glue crawlers, Glue ETL jobs, and security groups.

Extensive experience in developing applications that perform Data Processing tasks using Teradata, Oracle, SQL Server, and MySQL databases.

Worked on migrating an existing on-premises application to AWS. Used AWS services such as EC2 and S3 for small-dataset processing and storage; experienced in maintaining Hadoop clusters on AWS EMR.

Experience working with NoSQL databases like Cassandra and HBase and developed real-time read/write access to very large datasets via HBase.

Developed Spark Applications that can handle data from various RDBMS (MySQL, Oracle Database) and Streaming sources. Implemented ETL processes to extract, transform, and load data into PostgreSQL databases from various sources.
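
As an illustrative sketch of such a pipeline (hosts, tables, and credentials below are hypothetical placeholders), a PySpark job can read from an RDBMS over JDBC, transform the data, and load it into PostgreSQL:

    # Illustrative sketch: read from MySQL over JDBC, transform, and load into PostgreSQL.
    # Hostnames, databases, tables, and credentials are hypothetical placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("rdbms_to_postgres").getOrCreate()

    orders = (
        spark.read.format("jdbc")
        .option("url", "jdbc:mysql://mysql-host:3306/sales")
        .option("dbtable", "orders")
        .option("user", "etl_user")
        .option("password", "***")
        .load()
    )

    # Simple transformation: daily order totals per customer.
    daily_totals = (
        orders.withColumn("order_date", F.to_date("created_at"))
        .groupBy("customer_id", "order_date")
        .agg(F.sum("amount").alias("total_amount"))
    )

    (
        daily_totals.write.format("jdbc")
        .option("url", "jdbc:postgresql://pg-host:5432/analytics")
        .option("dbtable", "daily_order_totals")
        .option("user", "etl_user")
        .option("password", "***")
        .mode("append")
        .save()
    )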

Experience handling large datasets using partitions, Spark in-memory capabilities, and broadcasts in Spark with Python, applying effective and efficient joins and transformations during the ingestion process itself.

Experience developing data pipelines using Sqoop and Flume to extract data from weblogs and store it in HDFS, with development in HiveQL for data analytics.

Designed and optimized NoSQL and MySQL database schemas to ensure efficient data storage and retrieval. Design and implement database solutions in Azure SQL Data Warehouse, Azure SQL.

Administered and maintained CouchDB clusters, ensuring high availability and data replication. Worked on Dimensional Data modeling in Star and Snowflake schemas and Slowly Changing Dimensions (SCD).

Extensively dealt with Spark Streaming and Apache Kafka to fetch live stream data.

Strong expertise in troubleshooting and performance tuning of Spark, MapReduce, and Hive applications. Developed MapReduce jobs for data cleanup in Python. Wrote PySpark jobs in AWS Glue to merge data from multiple tables and used crawlers to populate the AWS Glue Data Catalog with metadata table definitions.
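
A minimal sketch of this kind of Glue PySpark job is shown below; the database, table, and bucket names are hypothetical placeholders, and a crawler is assumed to have already populated the Data Catalog.

    # Illustrative AWS Glue PySpark sketch: join two Data Catalog tables and write Parquet to S3.
    # The database, table, and bucket names are hypothetical placeholders.
    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import Join
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    customers = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="customers"
    )
    orders = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="orders"
    )

    # Merge the two tables on customer_id; the crawler has already populated the schemas.
    merged = Join.apply(customers, orders, "customer_id", "customer_id")

    glue_context.write_dynamic_frame.from_options(
        frame=merged,
        connection_type="s3",
        connection_options={"path": "s3://example-bucket/curated/customer_orders/"},
        format="parquet",
    )
    job.commit()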

Developed and deployed a self-service data platform using native AWS services including S3, Redshift, Glue, and Athena.

Implemented a scalable and automated data pipeline system that enabled efficient data processing and real-time analytics for business users.

Collaborated with DevOps teams to integrate PostgreSQL database deployments into CI/CD pipelines. Developed interactive reports and dashboards in Power BI for data analysis and decision support. Created and maintained interactive dashboards and reports in Tableau for data visualization. Developed RESTful APIs for data access and integrated CouchDB with web applications.

Designed, built, and maintained data pipelines using modern big data technologies such as AWS Redshift, S3, Glue, EMR, Spark, and Hive.

Strong expertise in AWS Cloud services, including Redshift, S3, EMR, and Lambda, integrating them into scalable architectures.

Enhanced data processing efficiency by implementing Spark-based ETL pipelines, reducing processing time by 40%.

Proficient in using Informatica PowerCenter to seamlessly integrate data from diverse sources, ensuring data consistency and accuracy. Collaborated with business users to define KPIs and metrics and implemented them in Power BI. Implemented a one-time migration of multi-state-level data from SQL Server to Snowflake using Python and SnowSQL.
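
A minimal sketch of such a SQL Server-to-Snowflake load using the Snowflake Python connector is shown below; the account, stage, file paths, and table names are hypothetical placeholders, and the SQL Server extract is assumed to already exist as local CSV files.

    # Illustrative sketch of a one-time SQL Server -> Snowflake load via the Python connector.
    # Connection details, stage, file paths, and table names are hypothetical placeholders.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="example_account",
        user="etl_user",
        password="***",
        warehouse="LOAD_WH",
        database="ANALYTICS",
        schema="PUBLIC",
    )
    cur = conn.cursor()
    try:
        # Assume the SQL Server extract was already written to local CSV files.
        cur.execute("CREATE STAGE IF NOT EXISTS migration_stage")
        cur.execute("PUT file:///data/extracts/state_sales_*.csv @migration_stage")
        cur.execute(
            """
            COPY INTO state_sales
            FROM @migration_stage
            FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
            """
        )
    finally:
        cur.close()
        conn.close()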

Experience in designing stunning visualizations using Tableau and publishing and presenting dashboards and stories on web and desktop platforms. Experience in data stream processing using Kafka and ZooKeeper for developing data pipelines with PySpark.

Proficient in establishing robust backup and recovery strategies to safeguard data integrity and maintain high availability in RDBMS environments.

Developed and implemented end-to-end data migration strategies from on-premises Hadoop clusters to AWS Redshift, ensuring data integrity and optimized performance.

Expertise in all aspects of the Agile SDLC, from requirements analysis and design through development, testing, implementation, and maintenance. Worked on data warehousing and ETL tools such as Informatica and Tableau. Acquainted with Agile and Waterfall methodologies. Responsible for handling several client-facing meetings, with strong communication skills.

TECHNICAL SKILLS:

Programming Languages: Python, Scala, Java, R, SQL

Database Technologies: SQL, Teradata, Oracle, MySQL, Cassandra, HBase, CouchDB, MongoDB

Big Data Technologies: Sqoop, Hive, HBase, Apache Kafka, Oozie, Spark, EMR, Scala, PySpark, Apache Pig

Clouds: Azure, AWS, GCP

ETL Tools: Informatica PowerCenter, Apache NiFi, Apache Flume

Data Warehousing: Azure SQL Data Warehouse, Redshift, Star and Snowflake schemas, SCD

Data Visualization: Tableau, Power BI, data visualization in Python (Matplotlib, Plotly)

Web Development: RESTful APIs, Flask, Django

Data Processing Frameworks: Apache Spark, MapReduce, PySpark, Spark SQL

NoSQL Databases: Cassandra, HBase, CouchDB, MongoDB

Version Control: Git, Bitbucket

Methodologies: Agile, Waterfall

Data Analysis: Statistical analysis, data visualization (Pandas, Matplotlib, Plotly), gap analysis

WORK EXPERIENCE:

Client: Wells Fargo, San Francisco, CA Dec 2022 - Present

Role: Senior Data Engineer

Responsibilities:

Collaborated with multiple teams within the department to facilitate quick migration of data. Developed multiple scripts in PySpark, Sqoop, and shell to perform different types of ingestion.

Built ETL data pipelines using Python/MySQL/Spark/Hadoop/Hive/UDFs. Experience in analyzing data using Python, R, SQL, Microsoft Excel, Hive, PySpark, and Spark SQL for Data Mining, Data Cleansing, and Data Munging.

Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure SQL Data Warehouse) and processed the data in Azure Databricks.

Day-to-day responsibility includes developing ETL Pipelines in and out of the data warehouse and developing major regulatory and financial reports using advanced SQL queries in Snowflake.

Experience developing Scala applications for loading/streaming data into NoSQL databases (MongoDB) and HDFS. Implemented data pipelines using Python code.

Used the RUP and Agile methodologies to conduct new development and maintain software.

Integrated diverse data sources into NoSQL and MySQL databases, ensuring data consistency and accuracy. Proficient SQL experience in querying, data extraction/transformations, and developing queries for a wide range of applications.

Well-versed in implementing rigorous security measures and access controls to protect sensitive data within the RDBMS, ensuring compliance with industry standards.

Staged API and Kafka data (in JSON format) into Snowflake by flattening it for different functional services. Development-level experience in Microsoft Azure, providing data movement and scheduling functionality for cloud-based technologies such as Azure Blob Storage and Azure SQL Database.
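
As an illustrative sketch of flattening staged JSON in Snowflake (table, column, and field names are hypothetical placeholders), a LATERAL FLATTEN query can be issued through the Python connector:

    # Illustrative sketch: flatten JSON events staged in a Snowflake VARIANT column into a
    # relational table. Table, column, and field names are hypothetical placeholders.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="example_account", user="etl_user", password="***",
        warehouse="LOAD_WH", database="ANALYTICS", schema="PUBLIC",
    )
    cur = conn.cursor()
    try:
        cur.execute(
            """
            INSERT INTO kafka_events_flat (event_id, event_type, item_sku, item_qty)
            SELECT raw:event_id::STRING,
                   raw:event_type::STRING,
                   f.value:sku::STRING,
                   f.value:qty::NUMBER
            FROM kafka_events_raw,
                 LATERAL FLATTEN(input => raw:items) f
            """
        )
    finally:
        cur.close()
        conn.close()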

Communicated with business analysts and converted their requirements into dimensional models. Worked with different file formats such as Parquet, CSV, XML, JSON, and DAT during data migration. Utilized the Spark SQL API in PySpark to extract and load data and perform SQL queries.
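
A minimal sketch of the Spark SQL pattern mentioned above (paths and column names are hypothetical placeholders):

    # Illustrative sketch: load Parquet files, register a temp view, and query with Spark SQL.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark_sql_extract").getOrCreate()

    # Load Parquet files and expose them to SQL as a temporary view.
    transactions = spark.read.parquet("s3a://example-bucket/raw/transactions/")
    transactions.createOrReplaceTempView("transactions")

    monthly = spark.sql(
        """
        SELECT date_trunc('month', txn_date) AS txn_month,
               account_type,
               SUM(amount) AS total_amount
        FROM transactions
        GROUP BY 1, 2
        """
    )
    monthly.write.mode("overwrite").parquet("s3a://example-bucket/curated/monthly_transactions/")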

Experience analyzing data using HiveQL, Pig Latin, HBase, Spark, RStudio, and custom MapReduce programs in Python. Extended Hive and Pig core functionality by writing custom UDFs. Ingested data from Teradata, HDFS, and SQL Server into BigQuery, which simplified data analysis. Implemented data quality checks and transformations in Informatica to enhance data reliability and maintain high data quality standards.

Integrated Tableau with various data sources, including databases and APIs, to enable real-time reporting. Created and maintained data models and relationships within Power BI for accurate reporting.

Proficient in planning and executing data migration projects, ensuring a smooth transition between different RDBMS platforms or versions.

Set up and managed Kafka clusters for real-time data streaming and event processing. Implemented data synchronization between CouchDB and other data stores for real-time data updates. Used Spark and Scala for developing machine learning algorithms that analyze click stream data.

Analyzed the SQL scripts and designed the solution to implement using Python.

Implemented query optimization techniques to enhance the performance of SQL queries on MySQL databases. Developed JSON definitions for deploying pipelines in Azure Data Factory (ADF) that process the data using the SQL activity.

Created and managed CouchDB views for efficient querying and reporting. Optimized Informatica workflows and mappings to enhance data processing efficiency and reduce ETL execution times.

Implemented row-level security and role-based access control to protect sensitive data in Power BI reports. Implemented role-based access control and data-level security in Tableau for data protection.

Wrote data pipelines in Python to extract data from Hive, MySQL, and Presto. Used packages such as NumPy, Pandas, Matplotlib, and Plotly in Python for exploratory data analysis.

Environment: ETL, PySpark, Sqoop, shell scripting, Azure, T-SQL, RDBMS, Spark SQL, U-SQL, Informatica, Python, MySQL, Hadoop, Hive, Snowflake, MongoDB, CouchDB, Teradata, BigQuery, HDFS, Kafka, R, SQL, Microsoft Excel, HiveQL, Pig Latin, HBase, Spark, MapReduce, Scala, JSON, RUP, Agile, NumPy, Pandas, Matplotlib, Plotly.

Client: Cohesity, Reston, VA June 2020 – Nov 2022

Role: Data Engineer

Responsibilities:

Developed a customized monitoring tool for EMR cluster utilization. Mainly used AWS services such as Glue, Lambda, ECS, and EMR to perform ETL with the PySpark framework.

Also used Amazon ECS and DataSync, depending on the size of the data. Used the Hive data warehouse to create tables for end users. Developed Spark code using Python for Spark/Spark SQL for faster testing and processing of data.

Extensive experience in configuring and managing AWS Redshift Spectrum for querying data across Redshift and S3 with minimal latency.

Expertise in designing and maintaining ETL pipelines to move, clean, and transform data using AWS Glue and custom Python scripts.

Experience in monitoring and troubleshooting Redshift clusters, using AWS CloudWatch, Redshift Insights, and performance tuning best practices.

Successfully migrated complex datasets across different database environments (Teradata, Oracle, etc.) to AWS Redshift, improving overall efficiency.

Strong understanding of AWS IAM roles and security policies, ensuring data governance, privacy, and compliance with industry standards.

Experience in big data ecosystems using Hadoop, Pig, Hive, HDFS, MapReduce, Sqoop, Storm, Spark, Airflow, and Snowflake. Developed an ETL pipeline to extract data from MongoDB, ingest it into AWS S3, and load the transformed data into Redshift.
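
An illustrative sketch of that MongoDB-to-S3-to-Redshift flow is shown below; the hostnames, bucket, IAM role, and table names are hypothetical placeholders.

    # Illustrative sketch of a MongoDB -> S3 -> Redshift flow.
    # Hostnames, bucket, IAM role, and table names are hypothetical placeholders.
    import json

    import boto3
    import psycopg2
    from pymongo import MongoClient

    # 1) Extract documents from MongoDB.
    mongo = MongoClient("mongodb://mongo-host:27017")
    docs = list(mongo["appdb"]["events"].find({}, {"_id": 0}))

    # 2) Land them in S3 as newline-delimited JSON.
    s3 = boto3.client("s3")
    body = "\n".join(json.dumps(d, default=str) for d in docs)
    s3.put_object(Bucket="example-bucket", Key="staging/events/events.json", Body=body)

    # 3) COPY the staged file into Redshift.
    conn = psycopg2.connect(
        host="redshift-cluster.example.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="etl_user", password="***",
    )
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            COPY events_raw
            FROM 's3://example-bucket/staging/events/events.json'
            IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
            FORMAT AS JSON 'auto'
            """
        )
    conn.close()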

Introduced Redshift to the team as a better approach than Hive with performance analysis. Involved in the complete Big Data flow of the application starting from data ingestion from upstream to HDFS, processing and analyzing the data in HDFS.

Implemented robust backup and recovery strategies for NoSQL and MySQL databases to ensure data availability and disaster recovery. Implemented a proof of concept deploying this product in an AWS S3 bucket and Snowflake.

Extensively worked on developing Spark jobs in Python (Spark SQL) using Spark APIs. Experienced in building automated regression scripts for validation of the ETL process between multiple databases such as Oracle, SQL Server, Hive, and MongoDB using Python. Developed a Spark API to import data into HDFS from Teradata and created Hive tables.

Proficient in designing and managing data warehousing solutions within RDBMS to support business intelligence and reporting requirements. Involved in performance tuning of Hive from design, storage, and query perspectives. Developed Flume ETL job for handling data from HTTP Source and Sink as HDFS.

Collected JSON data from an HTTP source and developed Spark APIs that handle inserts and updates in Hive tables. Generated a script in AWS Glue to transfer the data and utilized AWS Glue to run ETL jobs and perform aggregations in PySpark code.

Good understanding of data modeling (Dimensional & Relational) concepts like Star-Schema Modeling, Snowflake Schema Modeling, and Fact and Dimension tables. Created and maintained metadata repositories in Informatica to enable better data lineage and impact analysis.

Worked on reading multiple data formats on HDFS using Python. Ensured data security and access control through CouchDB's authentication and authorization mechanisms. Utilized Tableau Prep to clean, shape, and prepare data for analysis, improving data quality and consistency. Designed and implemented data pipelines to ingest, process, and route streaming data using Kafka.

Implemented horizontal and vertical scaling solutions to accommodate growing data volumes in NoSQL and MySQL databases. Implemented various UDFs in Python as per requirements. Developed Spark scripts to import large files from Amazon S3 buckets. Used Amazon Web Services Elastic Compute Cloud (AWS EC2) to launch cloud instances.
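
A minimal sketch combining a Python UDF with files imported from S3 (the bucket, paths, and column names are hypothetical placeholders):

    # Illustrative sketch: apply a Python UDF to CSV files imported from an S3 bucket.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf_example").getOrCreate()

    # Import large CSV files from S3.
    df = spark.read.option("header", True).csv("s3a://example-bucket/raw/customers/")

    # A simple UDF that normalizes phone numbers to digits only.
    @F.udf(returnType=StringType())
    def normalize_phone(raw):
        return "".join(ch for ch in raw if ch.isdigit()) if raw else None

    cleaned = df.withColumn("phone_clean", normalize_phone(F.col("phone")))
    cleaned.write.mode("overwrite").parquet("s3a://example-bucket/clean/customers/")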

Hands-on experience working with Amazon Web Services (AWS) using Elastic Map Reduce (EMR), Redshift, and EC2 for data processing.

Experienced in implementing data compliance policies and governance frameworks to ensure data quality, privacy, and regulatory compliance within the RDBMS environment.

Developed AWS CloudFormation templates, set up Auto Scaling for EC2 instances, and was involved in the automated provisioning of the AWS cloud environment using Jenkins.

Developed robust error-handling mechanisms in Informatica workflows to ensure data integrity and minimize data loss during ETL processes. Implemented usage monitoring and auditing in Power BI to track report usage and user interactions for performance optimization and compliance.

Designed and deployed Tableau Server environments, managing user access, permissions, and content promotion. Integrated Power BI with custom data connectors and APIs to access data from diverse sources. Designed custom conflict resolution strategies to handle concurrent updates in distributed CouchDB databases.

Utilized AWS S3 as a data lake for storing and versioning large datasets, ensuring data integrity and easy retrieval for analysis.

Developed Spark Core and Spark SQL scripts using Scala for faster data processing. Developed Kafka consumer APIs in Scala for consuming data from Kafka topics. Used Jira for bug tracking and Bitbucket to check in and check out code changes. Continuously monitored and managed the Hadoop cluster through Cloudera Manager.

Environment: PySpark, Spark SQL, Hadoop, Pig, Hive, HDFS, MapReduce, Sqoop, Storm, Spark, Airflow, Snowflake, ETL, MongoDB, HTTP, Redshift, RDBMS, NoSQL, MySQL, Teradata, JSON, Informatica, Python, CouchDB, Tableau, Kafka, AWS, Jenkins, Power BI, Scala, Jira, Bitbucket.

Client: CVS Health, Woonsocket, RI Jan 2019 - May 2020

Role: Data Engineer

Responsibilities:

Utilized Apache Spark with Python to design, develop, and deploy optimized big data pipelines. Provided guidance to the development team during production to enhance the performance of Spark jobs.

Used both Python and Spark SQL to perform transformations and aggregations. Developed Scala scripts and UDFs using DataFrames/Datasets and RDDs in Spark for data aggregation, queries, and writing results back into the S3 bucket.

Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.

Experience developing ETL applications on large volumes of data using different tools: MapReduce, Spark with Scala, PySpark, Spark SQL, and Pig. Hands-on experience with SQL and NoSQL databases such as Snowflake, HBase, Cassandra, and MongoDB. Used Python along with big data technologies and developed the ingestion framework with DynamoDB and Cassandra data stores.

Developed REST APIs using Python with the Flask and Django frameworks and integrated various data sources including JDBC, RDBMS, shell scripting, spreadsheets, and text files. Enhanced the performance of ETL jobs that run in the EMR cluster by applying optimization techniques (code-level and architecture-level). Hands-on experience with the complete software development life cycle (SDLC) for projects using methodologies such as Agile and hybrid methods.
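
As an illustrative sketch of such a REST endpoint in Flask (the route, database, and query are hypothetical placeholders, with SQLite standing in for the JDBC/RDBMS sources):

    # Illustrative sketch of a small Flask REST endpoint serving data from an RDBMS.
    import sqlite3

    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.route("/api/orders/<int:customer_id>", methods=["GET"])
    def get_orders(customer_id):
        # A lightweight SQLite database stands in for the RDBMS sources mentioned above.
        conn = sqlite3.connect("example.db")
        rows = conn.execute(
            "SELECT order_id, amount FROM orders WHERE customer_id = ?", (customer_id,)
        ).fetchall()
        conn.close()
        return jsonify([{"order_id": r[0], "amount": r[1]} for r in rows])

    if __name__ == "__main__":
        app.run(debug=True)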

Experience loading data into Spark RDDs and performing in-memory data computation to generate the output responses. Migrated complex MapReduce programs into in-memory Spark processing using transformations and actions. Wrote UDFs in Python depending on the scenario.

Developed a full-text search platform using NoSQL and the Logstash/Elasticsearch engine, allowing for much faster, more scalable, and more intuitive user searches. Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala. Well-versed with various Hadoop distributions, including Cloudera (CDH), Hortonworks (HDP), and Azure HDInsight.

Monitored and maintained Kafka topics, partitions, and offsets to ensure data reliability. Created scripts to read CSV, JSON, and parquet files from S3 buckets in Python and loaded them into DynamoDB and Snowflake.
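
A minimal sketch of such a script (bucket, key, and table names are hypothetical placeholders; reading Parquet with pandas assumes pyarrow or fastparquet is installed):

    # Illustrative sketch: read a Parquet file from S3 and load rows into DynamoDB.
    import io

    import boto3
    import pandas as pd

    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket="example-bucket", Key="exports/products.parquet")
    df = pd.read_parquet(io.BytesIO(obj["Body"].read()))  # requires pyarrow or fastparquet

    # Cast to strings to avoid DynamoDB's restrictions on float types.
    table = boto3.resource("dynamodb").Table("products")
    with table.batch_writer() as batch:
        for record in df.astype(str).to_dict(orient="records"):
            batch.put_item(Item=record)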

Designed scalable ETL (Extract, Transform, Load) solutions in Informatica to handle large volumes of data, ensuring smooth data processing even as data grows. Conducted regular performance tuning of Power BI reports and dashboards for optimal rendering and responsiveness.

Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back. Developed Sqoop scripts to handle the interaction between Hive and the MySQL database.

Worked on performance enhancement in Hive and HBase on multiple nodes. Collaborated with data architects to define data sources and establish best practices for data governance and data lineage in AWS. Created calculated fields, parameters, and sets in Tableau to enable advanced analytics and user-driven data exploration.

Set up continuous replication between CouchDB clusters in different geographical locations to ensure data redundancy and disaster recovery. Developed MapReduce application using Hadoop, MapReduce programming, and HBase.

Evaluated usage of Oozie for Workflow Orchestration and experienced in cluster coordination using Zookeeper. Developing ETL jobs with organization and project-defined standards and processes. Experienced in enabling Kerberos authentication in the ETL process.

Environment: Apache Spark, Python, SQL, Scala, MapReduce, Pig, NoSQL, Snowflake, HBase, Cassandra, MongoDB, DynamoDB, REST API, SDLC, UDFs, Elasticsearch, Hive, Hadoop, Kafka, Informatica, Power BI, Azure, Sqoop, data governance, Tableau, CouchDB.

Client: Comprobase Inc (Chapman University/MSKCC), Hyderabad, India Jul 2015 - Nov 2018

Role: Data Analyst

Responsibilities:

Prepared reports using MS Excel (VLOOKUP, HLOOKUP, pivot tables, macros, data points). Created process flow charts and presentations and managed defects using JIRA, Visio, PowerPoint, and MS Excel.

Design, build, and operationalize large-scale enterprise data solutions and applications using one or more AWS data and analytics services in combination with 3rd parties - Spark, EMR, DynamoDB, RedShift, Kinesis, Lambda, Snowflake.

Implemented end-to-end systems for Data Analytics and Data Automation and integrated with custom visualization tools using R, Mahout, Hadoop, and MongoDB.

Prepared project and program reports on phased development, UAT testing, the traceability matrix, and planned release work items.

Used HP ALM to document test plans and test cases, and used the Analysis module to create customized reports such as test plan reports, test execution summary reports, and defect reports.

Utilized Python to extract and analyze large datasets, applying statistical methods and data visualization for actionable insights. Wrote optimized SQL queries to retrieve, manipulate, and transform data from databases, ensuring efficient data processing.

Working knowledge of Amazon Web Services (AWS) and Cloud Data Management.

Designed and developed Tableau dashboards to visualize complex data sets, facilitating data-driven decision-making for stakeholders.

Created interactive Power BI reports and dashboards for real-time data insights, enabling informed business decisions. Extensively performed data analysis using Python pandas.
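
An illustrative sketch of that kind of pandas analysis (the file and column names are hypothetical placeholders):

    # Illustrative sketch of exploratory analysis with pandas and Matplotlib.
    import matplotlib.pyplot as plt
    import pandas as pd

    df = pd.read_csv("monthly_metrics.csv", parse_dates=["month"])

    # Basic statistics and a simple trend plot.
    print(df[["revenue", "active_users"]].describe())

    monthly = df.groupby("month", as_index=False)["revenue"].sum()
    plt.plot(monthly["month"], monthly["revenue"])
    plt.title("Monthly revenue trend")
    plt.xlabel("Month")
    plt.ylabel("Revenue")
    plt.tight_layout()
    plt.savefig("revenue_trend.png")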

Design and implement data engineering, ingestion, and curation functions on AWS cloud using AWS native or custom programming.

Designed and maintained QlikView applications, delivering user-friendly interfaces for effective data exploration.

Performed gap analysis of the current state against the desired state and documented requirements to control the gaps identified. Designed, configured, and deployed Amazon Web Services (AWS) for a multitude of applications. Conducted in-depth data analysis using Excel, including complex formula development, pivot tables, and trend analysis.

Environment: VLOOKUP, HLOOKUP, pivot tables, Macros, JIRA, Visio, PowerPoint, AWS, Spark, Python, SQL, Tableau, Power BI, QlikView, GAP, Excel.


