Azure Data Engineer

Location:
Charlotte, NC
Posted:
March 21, 2025

Sai Srinivas

Senior GCP Data Engineer

Email Id: **********@*****.***

Phone: 980-***-****

Professional Summary:

More than 10 years of expertise in Data Engineering, encompassing data pipeline design, development, and implementation across the full Software Development Life Cycle.

Proficient in Software and System analysis, design, development, testing, deployment, maintenance, enhancements, re-engineering, migration, troubleshooting, and support of multi-tiered web applications in high-performing environments.

Extensive hands-on work with PySpark/Spark SQL for data cleansing and for generating DataFrames and RDDs.

Experienced in Cloud migrations and building analytical platforms using AWS Services.

Proficient in creating shared VPCs with different tags in a single GCP project, with knowledge of Kubernetes service deployments in GCP.

Skilled in migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse.

Expert in controlling and granting database access, as well as migrating On-premises databases to Azure Data Lake store using Azure Data Factory.

Excellent understanding and knowledge of NoSQL databases like MongoDB and Cassandra.

Hands-on experience in designing and developing ETL transformations using PySpark, Spark SQL, S3, Lambdas, and Java.

Expertise in using major components of the Hadoop ecosystem, including HDFS, YARN, MapReduce, Hive, Impala, Pig, Sqoop, HBase, Spark, Spark SQL, Kafka, Spark Streaming, Flume, Oozie, Zookeeper, and Hue.

Responsible for loading processed data to AWS Redshift tables, enabling the Business reporting team to build dashboards.

Created views in AWS Athena to facilitate secure and streamlined data analysis access for downstream business teams.

Collaborated extensively with the Data Science team to productionalize machine learning models and build various feature datasets for data analysis and modeling.

Hands-on experience with Python programming and PySpark implementations in AWS EMR, constructing data pipeline infrastructure to support deployments for Machine Learning models, Data Analysis, and cleansing.

Developed and optimized Snowflake data models to support scalable and efficient data warehousing solutions.

Managed golden records and data deduplication using MDM best practices, improving data quality and reducing redundancy.

Integrated MDM platforms (e.g., Informatica MDM, Talend MDM, IBM InfoSphere, SAP MDG) to centralize and standardize business-critical data.

Utilized Snowflake’s native features (such as clustering, micro-partitioning, and semi-structured data support) to optimize data storage and query performance.

Integrated Snowflake with ETL tools (e.g., DBT) to streamline data transformation processes and ensure real-time data updates.

Developed and optimized PySpark-based ETL workflows on Databricks to process large-scale datasets efficiently in a distributed cloud environment.
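
For illustration, a minimal PySpark ETL sketch of the kind of Databricks workflow described above; the paths, column names, and cleansing rules are hypothetical, not taken from this resume.

    # Minimal PySpark ETL sketch (illustrative only; paths and columns are assumed).
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("orders_etl").getOrCreate()

    # Extract: raw JSON landed in cloud storage.
    raw = spark.read.json("dbfs:/mnt/raw/orders/")

    # Transform: deduplicate, derive a partition column, drop bad records.
    clean = (raw
             .dropDuplicates(["order_id"])
             .withColumn("order_date", F.to_date("order_ts"))
             .filter(F.col("amount") > 0))

    # Load: write Delta output partitioned by date to the curated zone.
    (clean.write
          .mode("overwrite")
          .partitionBy("order_date")
          .format("delta")
          .save("dbfs:/mnt/curated/orders/"))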

Optimized Tableau workbooks to handle large datasets, improving performance and user experience.

Trained team members on Tableau best practices, enhancing the overall data visualization capabilities within the organization.

Used DBT’s testing and documentation features to ensure data integrity and consistency across multiple data sources.

Led the migration to DBT Cloud for version control, collaboration, and better job scheduling, enhancing team productivity.

Proficient in building statistical models in Python using Pandas and NumPy, creating visualizations with Matplotlib and Seaborn, and making predictions with scikit-learn and XGBoost.

Experience in working with SDLC, Agile, and Waterfall Methodologies.

Over six years of experience with the Cloudera Data Platform.

Monitored and fine-tuned FiveTran ETL pipelines, ensuring low-latency, high-reliability data replication while optimizing cost and resource utilization.

Designed and implemented Workato automation workflows to streamline data integration, reducing manual processes and improving operational efficiency.

Designed and deployed cloud-native data solutions on GCP (BigQuery, Dataflow, Cloud Composer) and AWS (Glue, Redshift, Lambda) to enable scalable data processing.

Collaborated with cross-functional teams to design and deploy advanced data models and data lakes in Azure Synapse, enabling real-time analytics and improving data-driven decision-making across the organization.

Developed automated reporting systems using Synapse SQL and integrated machine learning models, improving forecasting accuracy by 15% for business operations.

Integrated structured and unstructured data from multiple sources into GCP (BigQuery, Data Fusion, Pub/Sub), Snowflake, and other cloud platforms for analytics and reporting.

Developed real-time and batch data processing solutions using Kafka, Pub/Sub, and Dataflow to support high-velocity data ingestion.
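
As a sketch of the Pub/Sub-to-Dataflow pattern mentioned above, a small Apache Beam pipeline in Python; the subscription, topic, and per-user count logic are assumptions made for the example.

    # Illustrative Apache Beam (Dataflow) streaming pipeline; names are hypothetical.
    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.transforms.window import FixedWindows

    opts = PipelineOptions(streaming=True)  # add --runner=DataflowRunner and GCP options to run on Dataflow

    with beam.Pipeline(options=opts) as p:
        (p
         | "ReadEvents" >> beam.io.ReadFromPubSub(subscription="projects/my-proj/subscriptions/events-sub")
         | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
         | "KeyByUser" >> beam.Map(lambda e: (e.get("user_id"), 1))
         | "Window" >> beam.WindowInto(FixedWindows(60))              # 1-minute windows
         | "CountPerUser" >> beam.CombinePerKey(sum)
         | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}".encode("utf-8"))
         | "Publish" >> beam.io.WriteToPubSub(topic="projects/my-proj/topics/user-counts"))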

Developed Kafka producers and consumers for streaming millions of events per second. Proficient in Extraction, Transformation, and Loading (ETL) and API Gateway integration between homogeneous and heterogeneous sources using SSIS packages and DTS (Data Transformation Services).

Solid experience with Azure Data Lake Analytics, Azure Databricks, Azure Data Lake Storage, Azure Data Factory, Azure SQL databases, and Azure SQL Data Warehouse, providing analytics and reports to enhance marketing strategies.

Extensive experience in IT data analytics projects, with hands-on expertise in migrating on-premises ETLs to Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, Composer, and APIs.

Experience in GCP with technologies such as Dataflow, Pub/Sub, BigQuery, GCS buckets, and related tools. Proficient in creating visualizations in Tableau. Certified PowerCenter Developer and certified in UNIX Korn shell scripting.

Experience in creating and maintaining a centralized repository using Informatica PowerCenter.

Expertise in transferring streaming data and data from different sources into HDFS and NoSQL databases using Apache Flume, along with cluster coordination services through Zookeeper.

Technical Skills:

Big Data Tools

Hadoop, HDFS, Sqoop, HBase, Hive, MapReduce, Cloudera, Spark, Kafka, MicroStrategy

Cloud Technologies

Snowflake, SnowSQL, PostgreSQL, Azure, Databricks, AWS (EMR, EC2, S3, CloudWatch, EventBridge, Lambda, SNS, Glue, Batch, Aurora/RDS)

ETL Tools

SSIS, Informatica PowerCenter

Modeling and Architecture Tools

Erwin, ER Studio, Star-Schema and Snowflake-Schema Modeling, Fact and Dimension Tables, Pivot Tables

Database

Snowflake Cloud Database, Oracle, MS SQL Server, MySQL, Cassandra, DynamoDB

Operating Systems

Microsoft Windows, Unix, Linux

Reporting Tools

MS Excel, Tableau, Tableau Server, Tableau Reader, Matillion, Power BI, QlikView

Methodologies

Agile, UML, System Development Life Cycle (SDLC), Ralph Kimball dimensional modeling, Waterfall Model

Programming Languages

SQL, R (shiny, R-studio), Python (Jupyter Notebook, PyCharm IDE), Scala

Professional Experience:

Molina Healthcare, Bothell, WA Nov 2021 to Present

GCP Data Engineer

Developed Spark applications to process raw data, populate staging tables, and store refined data in partitioned tables within the Enterprise Data Warehouse.

Proficient in creating Power BI reports optimized for performance on Azure Analysis Services.

Created streaming applications using PySpark to read data from Kafka and persist it in NoSQL databases such as HBase and Cassandra (a sketch follows below).

Orchestrated infrastructure management across multiple projects within the organization on the Google Cloud Platform using Terraform, adhering to the principles of Infrastructure as Code (IaC).
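
The sketch referenced above: a PySpark Structured Streaming job that reads from Kafka and writes each micro-batch to Cassandra. The broker, topic, keyspace, and schema are illustrative assumptions; HBase persistence would follow the same foreachBatch pattern with an HBase connector.

    # Sketch of a PySpark Structured Streaming job: Kafka in, Cassandra out.
    # Assumes the spark-sql-kafka and spark-cassandra-connector packages are on the classpath.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("kafka_to_cassandra").getOrCreate()

    schema = StructType([
        StructField("event_id", StringType()),
        StructField("member_id", StringType()),
        StructField("amount", DoubleType()),
    ])

    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "claims-events")
              .load()
              .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
              .select("e.*"))

    def write_batch(df, epoch_id):
        # foreachBatch lets each micro-batch be written with the Cassandra connector.
        (df.write
           .format("org.apache.spark.sql.cassandra")
           .options(keyspace="claims", table="events")
           .mode("append")
           .save())

    (events.writeStream
           .foreachBatch(write_batch)
           .option("checkpointLocation", "/tmp/checkpoints/claims-events")
           .start()
           .awaitTermination())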

Improved the performance of existing BigQuery and Tableau reporting solutions through optimization techniques such as partitioning key columns and thorough testing under various scenarios.

Developed Extract, Load, Transform (ELT) processes using diverse data sources, including Ab Initio and Google Sheets in GCP, and compute services such as Dataprep, Dataproc (PySpark), and BigQuery.

Implemented PySpark scripts with Spark SQL to access Hive tables within Spark for accelerated data processing. Contributed to the implementation of a Big Data Hadoop cluster and integrated data for large-scale system software development.

Led the migration of an entire Oracle database to BigQuery and utilized Power BI for reporting purposes.

Built data pipelines in Google Cloud Platform (GCP) using Airflow for ETL jobs, incorporating various airflow operators.
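
A minimal sketch of an Airflow DAG of the kind described above, assuming Airflow 2.x with the Google provider installed; the bucket paths, dataset, and stored procedure are hypothetical.

    # Minimal Airflow DAG sketch for a daily GCP ETL job (names are assumed).
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

    with DAG(
        dag_id="daily_claims_etl",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        land_files = BashOperator(
            task_id="land_files",
            bash_command="gsutil cp gs://landing/claims/{{ ds }}/*.csv gs://staging/claims/{{ ds }}/",
        )

        load_to_bq = BigQueryInsertJobOperator(
            task_id="load_to_bq",
            configuration={
                "query": {
                    "query": "CALL analytics.load_claims('{{ ds }}')",
                    "useLegacySql": False,
                }
            },
        )

        land_files >> load_to_bq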

Created and maintained real-time data dashboards using Azure Synapse Analytics and Power BI, providing key stakeholders with actionable insights on business performance.

Conducted in-depth data analysis and performance tuning for high-volume data workloads, improving query execution times by up to 50% in Azure Synapse environments.

Integrated Workato with cloud and on-premise applications (e.g., Salesforce, Snowflake, BigQuery) to enable seamless data synchronization and process automation.

Developed streaming and batch processing applications using PySpark to ingest data from diverse sources into the HDFS Data Lake.

Crafted DDL and DML scripts in SQL and Hive to support analytics applications in relational databases (RDBMS) and Hive.

Served as an integrator, fostering collaboration between data architects, data scientists, and other data consumers. Translated SAS code into Python and Spark-based jobs for execution in Cloud Dataproc and BigQuery on the Google Cloud Platform.

Facilitated data transfer between BigQuery and Azure Data Warehouse through Azure Data Factory (ADF) and devised complex DAX expressions for memory optimization in reporting cubes within Azure Analysis Services (AAS).

Utilized Cloud Pub/Sub and Cloud Functions for specific use cases, including workflow triggers upon incoming messages.
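
A minimal sketch of a Pub/Sub-triggered workflow hook, using the first-generation Cloud Functions Python signature; the message fields and the downstream action are assumptions.

    # Hypothetical Pub/Sub-triggered Cloud Function (1st-gen Python signature).
    import base64
    import json

    def trigger_workflow(event, context):
        """Entry point for a background function bound to a Pub/Sub topic."""
        payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
        # Kick off the downstream step here, e.g. trigger a Composer DAG run or a Dataflow template.
        print(f"Received notification for {payload.get('name')} in {payload.get('bucket')}")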

Developed and executed HQL scripts for creating partitioned and bucketed tables in Hive to enhance data access. Utilized the GCP Cloud Shell SDK to configure services such as Dataproc, Cloud Storage, and BigQuery.
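
An illustrative partitioned-and-bucketed Hive table of the kind described, executed here through spark.sql so the example stays in Python; the database, table, and columns are made up.

    # Partitioned + bucketed Hive DDL issued from PySpark (names are illustrative).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    spark.sql("""
        CREATE TABLE IF NOT EXISTS edw.claims_partitioned (
            claim_id STRING,
            member_id STRING,
            amount DOUBLE
        )
        PARTITIONED BY (claim_date STRING)
        CLUSTERED BY (member_id) INTO 16 BUCKETS
        STORED AS ORC
    """)

    # Load a single partition from a staging table (dynamic partitioning also works).
    spark.sql("""
        INSERT OVERWRITE TABLE edw.claims_partitioned PARTITION (claim_date = '2024-01-01')
        SELECT claim_id, member_id, amount
        FROM edw.claims_stage
        WHERE claim_date = '2024-01-01'
    """)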

Created custom Hive User-Defined Functions (UDFs) to implement custom aggregation functions in Hive.

Worked extensively with Sqoop to import and export data between HDFS and relational database systems/mainframes.

Monitored YARN applications and troubleshot cluster-related system issues.

Developed shell scripts to parameterize Hive actions in Oozie workflows and schedule jobs.

Loaded massive volumes of data into HDFS and Cassandra using Apache Kafka.

Played a key role in developing an initial prototype of a NiFi big data pipeline demonstrating end-to-end data ingestion and processing.

Developed custom processors for NiFi to enhance data flow capabilities.

Used PySpark to process and store real-time streaming data in NoSQL databases such as HBase. Leveraged GCP Dataproc, GCS, Cloud Functions, and BigQuery for data processing and analysis.

Utilized Oozie Scheduler systems to automate pipeline workflows and orchestrate map-reduce jobs.

Created Hive queries that helped market analysts identify emerging trends by comparing fresh data with reference tables and historical metrics.

Designed secure and efficient Snowflake schemas, ensuring compliance with data security and privacy regulations.

Monitored and troubleshot Snowflake performance issues, optimizing queries and storage costs and reducing data processing time by 15%.

Assessed technologies and methods to ensure that the EDW/BI architecture meets business needs and supports enterprise growth.

Contributed to Big Data integration and analytics projects involving Hadoop, SOLR, Spark, Kafka, Storm, and webMethods technologies.

Designed and developed data pipelines for integrated data analytics using Hive, Spark, Sqoop, and MySQL.

Environment: GCP (Dataproc, BigQuery, GCS, Cloud Functions), PySpark, SAS, Hive, Sqoop, Hadoop, Oozie, Airflow, Power BI, Spark, Kafka, NoSQL, Big Data, HBase.

Travelport, Englewood, CO June 2019 to Oct 2021

Data Engineer

Demonstrated extensive proficiency in navigating the AWS cloud platform, showcasing hands-on expertise in AWS services like EC2, S3, EMR, Redshift, Lambda, and Glue.

Possess a strong command of Spark, including specialized knowledge in Spark RDD, Data Frame API, Data Set API, Data Source API, Spark SQL, and Spark Streaming.

Developed Spark applications in Python, implementing Apache Spark data processing projects that handle data from various sources, including RDBMS and streaming platforms.

Leveraged Spark to enhance performance and optimize existing algorithms within Hadoop.

Successfully migrated an Oracle SQL ETL process to run on the Google Cloud Platform, leveraging Cloud Dataproc, BigQuery, and Cloud Pub/Sub, with Airflow orchestrating the jobs.

Proficiently utilized Presto, Hive, Spark SQL, and BigQuery alongside Python client libraries to craft efficient and interoperable programs for analytics platforms.

Demonstrated extensive hands-on experience with Google Cloud Platform's big data-related services.

Competent in Spark technologies such as Spark Context, Spark SQL, Spark MLlib, Data Frame, Pair RDD, and Spark YARN.

Employed Spark Streaming APIs for real-time transformations and actions, establishing a common data model by ingesting data from Kafka and persisting it to Cassandra.

Implemented advanced Hive features such as Partitioning, Dynamic Partitions, and Buckets to optimize data storage and retrieval.

Enhanced data security and governance protocols within Azure Synapse environments, ensuring compliance with GDPR, HIPAA, and other regulatory standards.

Developed code for importing and exporting data to and from HDFS and Hive using Apache Sqoop.

Demonstrated expertise in Hive SQL, Presto SQL, and Spark SQL to perform ETL tasks, choosing the most suitable technology for each specific job.

Created a Python-based Kafka consumer API for efficient data consumption from Kafka topics.
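
A minimal consumer sketch using the kafka-python library, one common way to build such a consumer; the broker addresses, topic, and group id are assumptions.

    # Minimal Kafka consumer sketch with kafka-python (names are assumed).
    import json

    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "booking-events",
        bootstrap_servers=["broker1:9092", "broker2:9092"],
        group_id="booking-etl",
        auto_offset_reset="earliest",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )

    for message in consumer:
        event = message.value
        # Downstream: validate, enrich, and hand the record to the processing layer.
        print(message.topic, message.partition, message.offset, event.get("event_type"))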

Processed Extensible Markup Language (XML) messages with Kafka and utilized Spark Streaming to capture User Interface (UI) updates.

Implemented preprocessing jobs using Spark Data Frames to flatten JSON documents into flat files.
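
A small example of the flattening pattern with Spark DataFrames; the nested schema and output location are invented for illustration.

    # Flattening nested JSON into a flat file with Spark DataFrames.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("flatten_json").getOrCreate()

    # Example input: {"id": ..., "passenger": {"name": ...}, "segments": [{"origin": ..., "destination": ...}]}
    docs = spark.read.json("s3://bucket/raw/bookings/")

    flat = (docs
            .withColumn("segment", F.explode("segments"))         # one row per array element
            .select(
                "id",
                F.col("passenger.name").alias("passenger_name"),  # lift nested struct fields
                F.col("segment.origin").alias("origin"),
                F.col("segment.destination").alias("destination"),
            ))

    flat.write.mode("overwrite").option("header", True).csv("s3://bucket/flat/bookings/")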

Loaded DStream data into Spark RDDs and performed in-memory computations to generate output responses. Developed live, real-time processing and core jobs using Spark Streaming in conjunction with Kafka as a data pipeline system.

Successfully migrated an existing on-premises application to AWS, utilizing AWS services like EC2 and S3 for data processing and storage.

Proficient in maintaining Hadoop clusters on AWS Elastic MapReduce (EMR).

Loaded data into S3 buckets using AWS Glue and PySpark, filtered the data stored in S3 using Elasticsearch, and loaded it into Hive external tables.

Configured Snowpipe to extract data from S3 buckets and store it in Snowflake's staging area.
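
An illustrative Snowpipe setup, issued here through the Snowflake Python connector so the example stays in Python; the stage, pipe, table names, and storage integration are assumptions, and AUTO_INGEST additionally requires S3 event notifications configured on the bucket.

    # Illustrative Snowpipe setup via the Snowflake Python connector (names are assumed).
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account", user="etl_user", password="change_me",
        warehouse="LOAD_WH", database="RAW", schema="STAGING",
    )
    cur = conn.cursor()

    # External stage over the S3 landing bucket.
    cur.execute("""
        CREATE STAGE IF NOT EXISTS raw.staging.bookings_stage
        URL = 's3://landing-bucket/bookings/'
        STORAGE_INTEGRATION = s3_int
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    """)

    # Pipe that auto-ingests new files from the stage into the staging table.
    cur.execute("""
        CREATE PIPE IF NOT EXISTS raw.staging.bookings_pipe AUTO_INGEST = TRUE AS
        COPY INTO raw.staging.bookings
        FROM @raw.staging.bookings_stage
    """)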

Created numerous ODI interfaces for loading into Snowflake DB.

Utilized Amazon Redshift to consolidate multiple data warehouses into a single data warehouse.

Designed columnar families in Cassandra, ingested data from RDBMS, performed data transformations, and exported transformed data to Cassandra as per business requirements.

Leveraged the Spark Cassandra Connector for loading data to and from Cassandra.

Configured Kafka from scratch, including settings for managers and brokers.

Developed data models for clients' transactional logs and analyzed data from Cassandra tables using the Cassandra Query Language.

Created interactive Tableau dashboards to visualize complex datasets, enabling stakeholders to make data-driven decisions.

Developed real-time visualizations using Tableau to track key performance indicators (KPIs) and business metrics.

Designed and maintained automated Tableau reports, providing actionable insights and reducing manual reporting time.

Conducted cluster performance testing with the cassandra-stress tool to measure and improve read/write performance. Employed HiveQL to analyze partitioned and bucketed data, executing Hive queries on Parquet tables to meet business specifications.

Used Apache Kafka to aggregate web log data from multiple servers and make it available in downstream systems for data analysis and engineering roles.

Configured and optimized FiveTran connectors to automate data ingestion from multiple sources into cloud data warehouses like BigQuery, Snowflake, and Redshift.

Implemented Kafka security measures and enhanced its performance.

Demonstrated expertise in working with data formats such as Avro, Parquet, RC files, and JSON, including the development of user-defined functions (UDFs) in Hive.

Developed custom UDFs in Python and utilized them for data sorting and preparation.
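
A sketch of a Python UDF registered for Spark SQL, as one way the sorting-and-preparation UDFs described here could look; the normalization rule and table names are made up.

    # Python UDF registered for Spark SQL; the logic is an invented example.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    def normalize_code(value):
        """Trim, upper-case, and zero-pad a product code."""
        if value is None:
            return None
        return value.strip().upper().zfill(8)

    normalize_code_udf = udf(normalize_code, StringType())
    spark.udf.register("normalize_code", normalize_code, StringType())  # callable from SQL too

    df = spark.table("staging.products").withColumn("product_code", normalize_code_udf("product_code"))
    df.orderBy("product_code").write.mode("overwrite").saveAsTable("curated.products")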

Worked on custom loaders and storage classes in Pig to process diverse data formats like JSON, XML, CSV, and generated bags for further processing using Pig.

Developed Sqoop and Kafka jobs to load data from RDBMS and external systems into HDFS and Hive.

Created Oozie coordinators to schedule Hive scripts, effectively establishing data pipelines.

Authored numerous MapReduce-style jobs using PySpark and NumPy, and implemented Jenkins for continuous integration.

Conducted cluster testing and monitoring of HDFS, Hive, Pig, and MapReduce to facilitate access for new users.

Ensured the continuous monitoring and management of the Hadoop cluster through Cloudera Manager.

Environment: AWS EMR, MapR, HDFS, GCS, Python, PySpark, Snowflake, DynamoDB, Oracle Database, Power BI, SDKs, Dataflow, Glacier, EC2, EMR cluster, SQL Database, Databricks, Hive, Pig, Apache Kafka, Sqoop, shell scripting, Linux, MySQL, Oracle Enterprise DB, SOLR, Jenkins, Eclipse, Git, Oozie, Tableau, SOAP, Cassandra, and Agile methodologies.

Macy’s, New York, NY July 2017 to May 2019

Data Engineer II

Engaged in meetings with business/user groups to comprehend business processes, gather requirements, analyze, design, and implement solutions according to client specifications.

Creating and maintaining optimal data pipeline architecture in Microsoft Azure using Data Factory and Azure Databricks.

Executing Extract, Transform, and Load (ETL) processes from various source systems to Azure Data Storage services through a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL in Azure Data Lake Analytics.

Ingesting data into one or more Azure Services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processing it in Azure Databricks.

Designing and developing Azure Data Factory (ADF) extensively for ingesting data from diverse source systems (relational and non-relational) to meet business functional requirements.

Creating event-driven architectures using blob triggers and Data Factory, along with building pipelines, data flows, and intricate data transformations and manipulations using ADF and PySpark with Databricks.

Automating jobs with ADF triggers (event, schedule, and tumbling window), including the creation and provisioning of Databricks clusters, notebooks, jobs, and autoscaling.

Ingesting a substantial volume and variety of data from disparate source systems into Azure Data Lake Gen2 using Azure Data Factory.

Implemented DBT models to transform raw data into clean, actionable datasets, improving data quality and accuracy.

Designed and optimized DBT workflows to automate data pipelines, reducing manual processing time by X%.

Collaborated with data engineers to integrate DBT with other ETL tools and cloud platforms, streamlining data processing and reporting.

Deployed Apache Airflow within a GCP Composer environment to construct data pipelines, utilizing Airflow operators such as the BashOperator, Hadoop operators, Python callables, and branching operators.

Developed innovative techniques for orchestrating Airflow pipelines and employed environment variables for project-level definition and password encryption.

Exhibited competency in Kubernetes within GCP, focusing on devising monitoring solutions with Stackdriver's log router and designing reports in Data Studio.

Developing Spark applications using PySpark and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats, aiming to analyze and transform data to reveal insights into customer usage patterns.

Querying and analyzing data from Cassandra for quick searching, sorting, and grouping through CQL. Involved in the implementation and integration of the Cassandra database.

Creating JSON definitions for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL activity and perform data flow transformations.

Implementing Azure self-hosted integration runtime in ADF and developing streaming pipelines using Apache Spark with Python.

Implemented Delta Lake for scalable, ACID-compliant data lakes, improving data reliability and performance for machine learning and analytics workloads.

Enhancing performance by optimizing computing time for processing streaming data and saving costs by optimizing cluster runtime.

Built scalable data pipelines and automation scripts using Python libraries such as Pandas, NumPy, PySpark, and SQLAlchemy for efficient data processing.
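
A compact sketch of a Pandas-plus-SQLAlchemy batch step of the kind described; the connection string, schemas, and cleanup rules are hypothetical.

    # Small batch step with Pandas and SQLAlchemy (connection details are assumed).
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql+psycopg2://etl_user:password@db-host:5432/sales")

    # Extract
    orders = pd.read_sql("SELECT * FROM staging.orders WHERE load_date = CURRENT_DATE", engine)

    # Transform
    orders = orders.drop_duplicates(subset="order_id")
    orders["amount"] = orders["amount"].fillna(0).astype(float)
    orders["order_month"] = pd.to_datetime(orders["order_date"]).dt.to_period("M").astype(str)

    # Load
    orders.to_sql("orders_clean", engine, schema="curated", if_exists="append", index=False)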

Implemented Terraform modules for deploying various applications across multiple cloud providers and managing structures.

Conducting ongoing monitoring, automation, and refinement of data engineering solutions.

Utilizing SQL Server Import and Export Data tool extensively, working with complex SQL views, Stored Procedures, Triggers, and packages in large databases from various servers.

Setting up Azure infrastructure, including storage accounts, integration runtime, service principal ID, and app registrations to enable scalable and optimized utilization of business user analytical requirements in Azure.

Creating build and release pipelines in VSTS and deploying solutions using an SPN (service principal) to implement Continuous Integration/Continuous Deployment (CI/CD).

Environment: Data Profiling, Mapping, Financial Tests (Python, R, SAS, Power BI), Agile (Jira, Trello), Azure, SQL, BI Tools (Power BI, Tableau, Looker, QlikView), MDM, ETL (Hadoop, Spark, NiFi, Kafka), SQL/Python/Java for Reporting, Data Protection, Incident Response, Epic EHR, Big Data (Hadoop, Spark, NoSQL).

Jarus Technologies, Hyderabad, India July 2014 to March 2017

Hadoop Developer

Orchestrated data ingestion seamlessly from UNIX file systems to Hadoop Distributed File System (HDFS), employing contemporary techniques for efficient data loading.

Utilized Sqoop to facilitate the seamless import and export of data between HDFS, Hive, and external data sources, ensuring streamlined data workflows.

Translated business requirements into comprehensive specifications, aligning with modern project guidelines for program development. Designed and implemented procedures addressing complex business challenges, considering hardware and software capabilities, operating constraints, and desired outcomes.

Conducted extensive analysis of large datasets, identifying optimal methods for data aggregation and reporting. Responsively generated ad hoc reports for internal and external clients.

Led the construction of scalable distributed data solutions using Hadoop, incorporating the latest tools and technologies in the field.

Authored Hive SQL scripts for creating sophisticated tables with high-performance attributes like partitioning, clustering, and skewing.

Engaged in the transformation and analysis of extensive structured and semi-structured datasets through the execution of Hive queries.

Collaborated with the Data Science team to implement advanced analytical models within the Hadoop cluster, utilizing large datasets.

Developed API integrations and automated data workflows using Flask, FastAPI, and Workato/FiveTran connectors, enabling seamless data exchange between systems.

Took a hands-on role in Extract, Transform, Load (ETL) processes, ensuring the efficient handling of data across various stages.

Managed cluster maintenance, encompassing node management, monitoring, troubleshooting, and reviewing data backups and log files within the Hadoop ecosystem. Oversaw the extraction of data from diverse sources, executed transformations using Hive and MapReduce, and effectively loaded data into HDFS.

Leveraged Sqoop for extracting data from Teradata and seamlessly integrating it into HDFS, adhering to contemporary best practices.

Leveraged Power BI and SSRS to create dynamic reports, dashboards, and interactive functionalities for web clients and mobile apps.

Developed SAS scripts for Hadoop, providing data for downstream SAS teams, particularly for SAS Visual Analytics, an in-memory reporting engine.

Monitored data engines to define data requirements and access data from both relational and non-relational databases, including Cassandra and HDFS.

Created complex SQL queries and established JDBC connectivity to retrieve data for presales and secondary sales estimations.
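
One way the JDBC retrieval described above can look, sketched here with a Spark JDBC read; the connection URL, credentials, driver, and query are assumptions.

    # Sketch of a JDBC pull into Spark for downstream estimation (details are assumed).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("presales_extract").getOrCreate()

    presales_sql = """
        SELECT region, product_id, SUM(quantity) AS presales_qty
        FROM presales_orders
        WHERE order_date >= DATE '2016-01-01'
        GROUP BY region, product_id
    """

    presales = (spark.read
                .format("jdbc")
                .option("url", "jdbc:oracle:thin:@//db-host:1521/SALES")
                .option("query", presales_sql)
                .option("user", "report_user")
                .option("password", "change_me")
                .option("driver", "oracle.jdbc.OracleDriver")
                .load())

    presales.write.mode("overwrite").parquet("hdfs:///warehouse/presales/")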

Conducted in-depth data analysis through Hive queries and Pig scripts, revealing user behavior patterns like shopping enthusiasts, travelers, and music lovers.

Exported insights and patterns derived from the analysis back into Teradata using Sqoop for a comprehensive feedback loop. Ensured continuous monitoring and management of the Hadoop cluster through cutting-edge tools like Cloudera Manager.

Integrated the Oozie workflow engine to orchestrate and automate multiple Hive workflows and processes efficiently.

Developed advanced Hive queries to process data and generate data cubes, facilitating data visualization and advanced analytics.

Environment: Hive, Pig, Apache Hadoop, Cassandra, Sqoop, Big Data, HBase, ZooKeeper, Cloudera, CentOS, NoSQL, Sencha ExtJS, JavaScript, AJAX, Hibernate, JMS, WebLogic Application Server, Eclipse, Web Services, Azure, Project Server, Unix, Windows.

Education Details: Bachelor's degree in Computer Science from Vaagdevi College of Engineering, 2014.


