Chandrashekhar Polagoni
Senior Data Engineer
Email: ************@*****.***
Phone: 425-***-****
Professional Summary:
Data Engineer with 10+ years of IT experience designing and implementing cloud ETL and data warehousing solutions using Azure Data Factory, Azure Data Lake Storage Gen2, Azure Databricks, and Azure Synapse Analytics.
6 years of dedicated experience in Azure Cloud Services and Big Data technologies, designing and establishing optimal cloud solutions for efficient data migration and processing.
Hands-on experience in Azure Cloud Services, Azure SQL Server, Azure Logic Apps, Azure Functions, Azure Monitor, Key Vault, Azure Event Hubs, Azure Cosmos DB, and Azure Stream Analytics.
Developed Azure Data Factory (ADF) batch pipelines to ingest and transform data from relational sources into Azure Data Lake Storage (ADLS Gen2), implementing incremental data loads and ETL processes to clean and load data into Delta tables.
Expert in leveraging Azure Data Factory to design and implement scalable ETL pipelines that automate data ingestion and transformation, resulting in improved data accessibility and enhanced decision-making processes.
Proficient in utilizing Azure Databricks for collaborative data processing, employing Apache Spark to optimize large-scale data transformations and analytics, leading to significant performance improvements and actionable insights.
Skilled in architecting and managing data storage solutions using Azure Data Lake Storage Gen2, ensuring efficient storage of unstructured and structured data while maintaining security and compliance across data assets.
Experienced in developing end-to-end analytics solutions using Azure Synapse Analytics, integrating data from multiple sources and enabling advanced analytics capabilities to drive business intelligence and data-driven decision-making.
Involved in developing roadmaps and deliverables to advance the migration of existing on-premises systems and applications to Azure Cloud.
Developed data pipelines, dataflows, event streams, lakehouses, ML models, notebooks, reports, Spark jobs, and warehouses in Synapse Data Engineering on Microsoft Fabric.
Experience in developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming data to uncover insights into customer usage patterns.
Developed real-time streaming solutions using Apache Kafka and Azure Event Hubs, integrated with Azure Databricks for stream processing using PySpark.
Extensive knowledge and experience with Relational Database Management Systems, including normalization, stored procedures, constraints, joins, indexes, data import/export, and triggers.
Expert at data transformations such as Lookup, Derived Column, Conditional Split, Sort, Data Conversion, Multicast, Union All, Merge Join, Merge, Fuzzy Lookup, Fuzzy Grouping, Pivot, Unpivot, and SCD to load data into SQL Server destinations.
Developed and optimized data processing applications across Teradata, Oracle, SQL Server, and MySQL databases, streamlining query execution time by 40% through efficient indexing strategies in high-volume environments.
Experience in Extraction, Transformation, and Loading (ETL) of data from various sources into data warehouses, as well as collecting, aggregating, and moving data from various sources using Apache Flume and Kafka.
Developed iterative algorithms using Spark Streaming in Scala and Python to build near real-time dashboards.
Experienced in loading data into Hive partitions, creating buckets in Hive, and developing MapReduce jobs to automate data transfer from HBase.
Extensive experience in processing and transforming diverse file formats, including CSV, JSON, XML, ORC, Avro, and Parquet, utilizing Spark with Python, PySpark, and Scala to enhance data analysis and development workflows.
Strong expertise across the technology stack, encompassing ETL, data analysis, cleansing, matching, quality, audit, and design, while also possessing extensive experience in designing and optimizing OLAP and OLTP systems for high-performance database environments.
Created interactive dashboards and reports in Power BI, leveraging data from various sources for insightful business intelligence and reporting.
Scheduled and monitored data workflows with Control-M and Apache Airflow for coordinated execution of complex tasks.
Leveraged SharePoint and Confluence for centralized documentation and project collaboration, enhancing team communication and improving accessibility to project plans and technical specifications.
Developed troubleshooting guides for Azure Data Factory, Databricks, and SQL-based ETL processes, reducing incident resolution time by 30%.
Ensured adherence to testing best practices, including unit, integration, and end-to-end testing, to validate data accuracy, reliability, and performance in ETL workflows, improving data integrity across systems.
Involved in all the phases of Software Development Life Cycle (Requirements Analysis, Design, Development, Testing, Deployment, and Support) and Agile methodologies.
Leveraged Jira for Agile project management, facilitating Scrum ceremonies such as daily stand-ups, sprint planning, and retrospectives, which enhanced team collaboration and transparency and yielded actionable insights for continuous improvement.
Technical Skills:
Azure Services: Azure Data Factory, Azure Databricks, Logic Apps, Function Apps, Azure DevOps, Azure Synapse Analytics, Key Vault, Azure Event Hubs, Azure Monitor, Azure Blob Storage, Azure Functions, MS Fabric.
Databases: Snowflake, MS SQL Server 2016/2014/2012, Azure Data Lake Storage Gen2, Azure SQL DB, MS Excel, MS Access, Oracle 11g/12c, Cosmos DB, MongoDB, PostgreSQL, MySQL
Big Data Technologies: MapReduce, Hive, Tez, Python, PySpark, Scala, Kafka, Spark Streaming, Oozie, Sqoop, HDFS, Flume, Zookeeper
Hadoop Distributions: Cloudera, Hortonworks, Apache
Languages: Python, SQL, PL/SQL, Java, HiveQL, Scala, Shell Scripting, Bash, PowerShell
Web Technologies: HTML, CSS, JavaScript, XML, JSP, Restful, SOAP
Operating Systems: Windows (XP/7/8/10), UNIX, LINUX, UBUNTU, CENTOS
Build Automation tools: Ant, Maven, Jenkins
Version Control: GIT, GitHub, Bitbucket
Methodology: Agile, Scrum
IDE & Build Tools: Eclipse, Visual Studio, PyCharm, IntelliJ IDEA, VS Code
Education:
Master's in Information Systems, Trine University, Indiana, USA, 2013
Bachelor's in Computer Science, JNTU, Hyderabad, 2011
Certifications:
Azure Data Engineer Associate (DP-203)
Azure Fundamentals (AZ-900)
Professional Experience:
Client: Casey's, Des Moines, IA July 2023 to Present
Role: Senior Data Engineer
Responsibilities:
Designed and implemented end-to-end ETL pipelines using Azure Data Factory, incorporating Copy, Mapping Data Flow, Control Flow, Get Metadata, and Web activities, which improved data transfer efficiency.
Managed data orchestration with Azure Data Factory, scheduling workflows to integrate Databricks and Synapse Analytics activities, ensuring seamless data processing across platforms.
Optimized data transformation workflows with Azure Databricks, leveraging PySpark DataFrames, RDDs, and Databricks Runtime to enhance data processing performance by 60%.
Developed robust ETL pipelines using Azure Synapse Analytics and Databricks for efficient data ingestion, transformation, and loading from various data sources into Azure Data Lake Storage and SQL databases.
Created robust data pipelines in Azure Synapse Analytics, using Data Flows for code-free ETL and Databricks Notebooks for complex transformations.
Built and managed Delta Lake tables in Databricks, implementing bronze, silver, and gold data layers to ensure data accuracy, consistency, and performance in large-scale analytics.
Implemented Delta Lake architecture on ADLS Gen2, leveraging Hierarchical Namespace and RBAC for secure and efficient data management.
Implemented Delta Live Tables for seamless continuous data processing and leveraged Databricks SQL to perform interactive querying, enhancing data availability and facilitating real-time analytics for informed decision-making.
Implemented Azure Data Lake Storage Gen2 Lifecycle Management policies and Immutable Storage for Data governance, resulting in a scalable and cost-effective analytics platform.
Integrated Function Apps within Azure Data Factory pipelines for automated data processing tasks, enhancing overall system flexibility and responsiveness.
Implemented data integration and transformations within Azure SQL Server, ensuring high availability and performance for enterprise-level applications.
Designed and implemented scalable data governance frameworks using Databricks Unity Catalog to manage permissions, audit data usage, and ensure compliance with industry standards across multiple environments, improving data security and operational efficiency by 30%.
Designed real-time data integration solution using Data Factory and Python, streaming from IoT devices to Data Lake through Databricks, implementing PySpark transformations for analytics processing.
Integrated Cosmos DB with Azure Data Factory and Databricks to create ETL pipelines, facilitating seamless data flow between various sources and Cosmos DB.
Conducted a Proof of Concept (POC) on Microsoft Fabric, demonstrating its capabilities in streamlining data integration and analytics workflows, resulting in improved data processing efficiency for client applications.
Designed and implemented a unified data architecture using Microsoft Fabric's OneLake and Lakehouse features, enabling data ingestion, storage, and analytics across diverse data sources for analytical workloads.
Implemented Kafka connectors for data ingestion from POS terminals into Azure Data Lake Storage (ADLS), and integrated Kafka with Azure Databricks for real-time data processing, enhancing ETL workflows and optimizing customer experience analytics.
Designed and developed high-availability MS SQL databases while optimizing complex T-SQL queries for data manipulation and reporting in Azure SQL Server.
Designed modular, reusable dbt models to transform raw data into analytics-ready datasets within Azure Synapse, improving data consistency and optimizing processing time for efficient data modeling and analytical workflows.
Implemented Azure Monitor for comprehensive monitoring of data pipelines, enabling proactive identification and resolution of performance bottlenecks.
Implemented Azure Key Vault to securely manage secrets, keys, and certificates within Azure Data Factory pipelines, significantly strengthening security and ensuring compliance with industry standards.
Collaborated with data scientists to provide processed, high-quality datasets from Synapse and Databricks for machine learning models, supporting predictive analytics use cases.
Created efficient schema designs in MongoDB, optimizing collections with indexes for improved query performance in document-based data storage.
Configured Linked Services, created Datasets, and set up Integration Runtimes within Azure Data Factory for seamless data integration across diverse environments.
Implemented advanced scheduling and automation in Azure Data Factory using Schedule, Tumbling Window, and Event-based triggers as needed, ensuring timely execution of ETL processes and data synchronization.
Developed and maintained real-time data streaming pipelines using Apache Kafka, enabling efficient data flow and processing for real-time analytics.
Optimized data pipelines and Spark jobs in Azure Databricks through advanced techniques like Spark configuration tuning, data caching, and data partitioning, resulting in superior performance and efficiency.
Developed automated data quality checks and executed large-scale data processing tasks using PySpark in Azure Databricks, significantly enhancing data trustworthiness and improving processing performance by reducing execution times.
Developed and optimized SQL queries for data extraction, transformation, and loading (ETL) processes across various platforms, ensuring data accuracy and efficiency.
Leveraged SQL and Python for querying and managing data within data warehouses, enhancing data availability and supporting analytical requirements.
Automated infrastructure provisioning on Azure using Terraform, managing resources like Data Factories, Storage Accounts, and Databricks clusters, while developing reusable modules that standardized deployments and reduced provisioning time by 40%.
Designed and implemented CI/CD pipelines using Azure DevOps and GitHub Actions to automate the deployment of data pipelines and infrastructure, ensuring efficient and reliable operations.
Streamlined infrastructure provisioning with Azure Repos and automated end-to-end testing for data pipelines, enhancing version control, collaboration, and deployment quality.
Managed and optimized data storage and processing using file formats like Parquet, ORC, JSON, Avro, and CSV, ensuring compatibility and efficiency across systems and improving data retrieval speeds.
Automated Azure resource management using PowerShell scripts, reducing manual configuration time by 40% and improving deployment efficiency.
Developed Star Schema data models for OLAP systems, ensuring efficient query performance and simplified data navigation.
Developed interactive Power BI dashboards to visualize key business metrics, improving decision-making processes and enabling real-time insights for stakeholders.
Integrated Power BI with Azure Synapse Analytics for real-time data visualization, reducing the time to insight by 40%.
Worked in a fast-paced Agile environment, collaborating with cross-functional teams to deliver high-quality data solutions on time and within budget.
Collaborated with cross-functional teams to develop shared Confluence documentation, ensuring alignment and clarity across various data projects.
Encouraged open discussion during retrospectives, fostering a culture of transparency and continuous feedback among team members.
Integrated GitHub with CI/CD tools to streamline the deployment of data pipelines and applications, ensuring quick and reliable updates to production environments.
Environment: Azure Data Factory, Azure Databricks, Azure Data Lake Storage Gen2, Azure Blob Storage, Snowflake Data Warehouse, Azure Event Hubs, Azure Functions, Azure Logic Apps, Function Apps, Azure Monitor, Cosmos DB, MongoDB, Azure Synapse Analytics, Azure DevOps, Terraform, Microsoft Fabric, Azure Key Vault, MS SQL, Oracle, Python, PySpark, JIRA, Kafka, Power BI.
Client: DXC Technology, Chicago, IL Jan 2020 to June 2023
Role: Senior Data Engineer
Responsibilities:
Designed and implemented complex data integration pipelines in Azure Data Factory, utilizing various activities including Copy Activity, Mapping Data Flow, and Control Flow Activities.
Configured Linked Services to connect to multiple data sources and implemented Triggers for automated execution.
Implemented Azure Data Factory's Integration Runtime to enable secure data movement between on-premises and cloud environments. Implemented Self-Hosted Integration Runtime for accessing data behind corporate firewalls.
Optimized Data storage and management in Azure Data Lake Storage Gen2, implementing Hierarchical Namespace for efficient data organization.
Configured Azure Databricks Workspace to create and manage clusters, developing notebooks for data processing and analysis while implementing ETL processes that leveraged Spark DataFrames and RDDs for efficient data transformation.
Implemented Azure Data Lake Storage Gen2 with Azure Active Directory (AAD) integration, ensuring seamless and secure authentication and authorization for robust data access control.
Utilized Python libraries such as Pandas and NumPy for data manipulation, transformation, and analysis, enabling efficient handling of large datasets in Azure Data Lake Storage.
Implemented data validation and cleansing processes in Python, ensuring high data quality before loading into Azure SQL Database and Azure Synapse Analytics.
Leveraged Azure Synapse Link to enable near real-time analytics on operational data, integrating with Azure Cosmos DB for seamless data synchronization.
Developed real-time data ingestion solutions using Azure Data Factory integrated with Azure Event Hubs, improving data synchronization accuracy and reducing latency.
Developed real-time data streaming architectures leveraging Azure Event Hubs, Databricks, and Synapse Analytics, ensuring instant data accessibility and analysis for enhanced business intelligence.
Developed PySpark applications in Databricks to transform 200 GB of daily data, optimizing ETL workflows from Data Lake to Synapse Analytics and reducing processing costs by 40% through efficient Python code.
Engineered Azure Data Factory pipelines integrating Data Lake with Databricks, implementing SQL-based validation checks and Python scripts for data quality monitoring, serving 100+ business users daily.
Architected enterprise data platform using Synapse Analytics and Databricks, processing Data Lake sources through PySpark transformations, achieving 99.9% SLA and 45% faster performance.
Developed production-grade Python and PySpark modules in Databricks for complex ETL operations, orchestrating Data Factory pipelines between Data Lake and SQL databases with 40% improved efficiency.
Built automated testing framework using Python in Azure Data Factory, validating data quality between Data Lake and Synapse, while implementing PySpark optimizations for 50% faster processing.
Implemented Delta Lake architecture using Databricks and PySpark, optimizing Data Lake storage patterns and SQL-based merge operations in Synapse, reducing data processing latency by 60%.
Designed and implemented end-to-end ETL pipelines using Azure Data Factory and Python, incorporating data extraction, transformation, and loading processes that improved data flow efficiency.
Spearheaded migration from legacy ETL to Azure Data Factory, leveraging Databricks notebooks with PySpark and SQL for Data Lake transformations, resulting in 75% reduction in development time.
Migrated and managed large-scale datasets in Azure Data Lake, and Apache Cassandra, optimizing storage strategies for performance and cost-efficiency.
Designed and developed data models in Snowflake, leveraging its MPP architecture for scalable and high-performance data storage and querying.
Implemented Snowflake's Time Travel and Zero-Copy Cloning features for efficient data backup and recovery, reducing downtime and data loss risks.
Developed and optimized complex SnowSQL queries for efficient data extraction, transformation, and loading (ETL) operations within Snowflake.
Designed and implemented Snowflake Schema for optimized data warehousing, enabling efficient data storage and retrieval.
Implemented Azure Key Vault integration to manage secrets, keys, and certificates within data pipelines; strengthened security protocols and ensured compliance with organizational policies while reducing potential vulnerabilities by 35%.
Managed and optimized data storage and processing using various file formats such as ORC, Parquet, Avro, JSON, and CSV, ensuring compatibility and efficiency across multiple systems.
Enabled seamless Data Visualization and reporting through integration with Power BI, driving data-driven decision-making across the organization.
Conducted code reviews using GIT to maintain code quality and adhere to best practices, fostering a culture of continuous improvement.
Managed project tasks and user stories in Jira, facilitating effective tracking of progress and collaboration within the data engineering team.
Utilized SharePoint for document management and collaboration among data engineering team members, ensuring easy access to project documentation and resources.
Participated in Scrum ceremonies, including daily stand-ups, sprint reviews, and planning meetings, to foster collaboration and keep the data engineering team aligned.
Environment: Azure Data Factory, Azure Databricks, Azure Data Lake Storage Gen2, Azure Blob Storage, Logic Apps, Function Apps, Snowflake, MS SQL, Oracle, HDFS, Hive, Sqoop, Pig, PySpark, Oozie, Zookeeper, SQL, Python 3.8.2, Scala, Kafka, Power BI.
Client: Intuitive, Sunnyvale, CA Jan 2017 to Nov 2019
Role: Data Engineer
Responsibilities:
Designed and implemented end-to-end data migration pipelines from the Hadoop ecosystem (Hadoop 2.7.3, Spark 2.4.0) to Azure Cloud, ensuring seamless data transfer and integrity throughout the process.
Architected and implemented end-to-end ETL pipelines using Azure Data Factory v1 and SSIS, integrating with Azure Data Lake Gen1 to process 500+ GB daily data while reducing processing time by 40%.
Leveraged Azure Databricks for complex data transformations and machine learning workflows, processing 2TB+ data weekly while achieving 45% improvement in computational efficiency.
Developed and optimized complex T-SQL stored procedures and SSIS packages for data integration, improving performance by 35% across Azure SQL databases and Azure Data Lake Gen1 storage.
Orchestrated migration of on-premises data warehouse to Azure SQL Data Warehouse, implementing PolyBase for efficient data loading from Azure Data Lake Gen1, resulting in 50% cost reduction.
Designed and deployed Azure Stream Analytics solutions integrated with Event Hubs and Azure Data Lake Gen1, processing 10,000+ events per second with sub-second latency.
Implemented row-level security and column-level encryption in Azure SQL Database using Azure Key Vault, ensuring GDPR compliance while managing access for 200+ users.
Developed scalable data processing applications using PySpark on Azure Databricks and Hadoop, leveraging Spark's distributed computing capabilities to analyze large datasets and optimize data transformation workflows.
Implemented data ingestion workflows using PySpark to extract, transform, and load (ETL) data from diverse sources into Azure Data Lake Storage and HDFS, enhancing data accessibility for analytics and reporting.
Developed and executed PySpark scripts in Azure Databricks for data transformation, leveraging Apache Spark 2.4.x for processing and aggregating data from various Hadoop sources, enhancing data accessibility in Azure.
Collaborated with cross-functional teams to build and deploy end-to-end data pipelines in Azure and Hadoop ecosystems, utilizing PySpark for data cleansing, aggregation, and analytics, resulting in improved data quality and insights.
Created automated Databricks notebooks and Azure Data Factory pipelines for incremental data loads, reducing processing time by 60% and ensuring data consistency across platforms.
Developed and optimized complex MapReduce programs to process large datasets, enhancing data handling capabilities and achieving a 30% performance improvement across critical big data tasks.
Built and maintained scalable HDFS structures, ensuring effective storage and fast retrieval of petabyte-scale datasets, which enhanced the efficiency of data-intensive applications and workflows.
Created and fine-tuned HiveQL scripts for data ETL processes, reducing query execution times by 25%, which significantly accelerated data transformation and extraction operations.
Managed and automated multi-step data workflows using Apache Oozie, orchestrating tasks across the Hadoop ecosystem to streamline big data pipeline execution and operational efficiency.
Integrated Kafka with Hadoop for real-time data streaming, improving data ingestion speed by 40%, which allowed for faster, reliable insights and facilitated real-time analytics applications.
Designed and implemented Sqoop jobs to enable seamless data transfers between Hadoop and relational databases, providing reliable connectivity with systems like MS SQL Server and MySQL.
Developed efficient PySpark and Scala scripts to process massive datasets, achieving a 20% improvement in data processing speeds, enhancing overall data pipeline throughput and analysis speed.
Collaborated with the QA team to develop automated test scripts, enhancing the validation process for data pipelines. This included scenarios for data validation, schema consistency, and performance metrics, resulting in a 30% reduction in data-related issues.
Conducted comprehensive design and code reviews to maintain high standards in ETL processes, identifying optimization opportunities and ensuring adherence to coding standards, leading to improved performance and maintainability.
Configured Jenkins pipelines to automate the build and deployment of data processing applications, significantly reducing deployment time and manual errors.
Utilized Git for version control, managing code repositories and enabling collaborative development across multiple teams, ensuring code integrity and history tracking.
Environment: Azure Data Factory V1, Azure Data Lake Storage Gen1, Azure Databricks, Azure SQL Database, Azure Stream Analytics, Azure Key Vault, Sqoop, Pig, HDFS, Apache Cassandra, Oozie, Zookeeper, Control-M, Flume, Kafka, Apache Spark, Scala, Hive, Hadoop, HBase, Cosmos DB, MySQL, YAML, GIT, JIRA
Client: Mercury Insurance, Austin, TX Oct 2013 to Dec 2016
Role: Data Warehouse Developer
Responsibilities:
Designed and managed SQL Server databases, ensuring efficient data storage, retrieval, and overall database performance.
Led data modeling efforts, performing both physical and logical database design to support complex business requirements.
Integrated front-end applications with SQL Server backend, enhancing data flow and application functionality.
Developed complex T-SQL scripts, including stored procedures, triggers, indexes, and user-defined functions, to automate tasks and optimize database performance.
Executed data migration projects using DTS, facilitating the seamless transfer of data between servers and different environments.
Optimized T-SQL queries for enhanced performance, reducing query execution times and improving system responsiveness.
Leveraged SSIS/DTS to transfer and transform data from diverse sources like MS Excel, MS Access, and flat files into SQL Server, utilizing advanced features like data conversion and derived columns.
Created and deployed sophisticated SSIS packages for ETL processes, with logging mechanisms at both package and task levels to track data processing activities.
Provided 24/7 production support, developing and testing SQL Server queries and Windows command files to monitor and ensure database health and availability.
Designed and generated complex reports using SQL Reporting Services, including cross-tab, drill-down, and OLAP reports, tailored to meet specific business needs.
Environment: SQL Server, Erwin, TOAD, Windows 2000, Oracle 9i, UNIX Shell Scripts, PL/SQL, Data Transformation Services (DTS), SQL Server Integration Services (SSIS), MS Excel, MS Access, SQL Server Reporting Services (SSRS).