
DINESH KUMAR PULLEPALLY

Senior Data Engineer

Phone: +1-470-***-****

Email: ad31e3@r.postjobfree.com

LinkedIn: www.linkedin.com/in/dinesh-kumar-p05

PROFESSIONAL SUMMARY

** ***** ** ******ence in the IT industry. Skilled in leveraging Azure services, big data components such as Hadoop, Hive, and MapReduce, and data warehouse modeling schemas such as Star and Snowflake. Extensive experience with Informatica.

Developed comprehensive data pipeline solutions using a wide range of technologies, including Azure Databricks, ADF, Azure Synapse, Azure Functions, Logic Apps, ADLS Gen2, Hadoop, StreamSets, PySpark, MapReduce, Hive, HBase, Python, Scala, and Snowflake.

Designed and developed complex data transformations using mapping data flows in Azure Data Factory and Azure Databricks, optimizing data processing and enhancing overall efficiency.

Demonstrated expertise in developing efficient data pipelines, utilizing technologies such as Delta Lake, Delta Tables, Delta Live Tables, Data Catalogs, and Delta Lake API.

Proficient in optimizing Spark jobs to enhance performance and efficiency, utilizing best practices and techniques.

Leveraged Azure Synapse Analytics for advanced analytics, demonstrating expertise in seamless integration and utilization within data processing workflows.

Designed serverless solutions leveraging Azure Function Apps, optimizing efficiency and resource utilization in cloud-based environments.

Devised workflows using Azure Logic Apps, ensuring seamless integration and automation of processes within the Azure ecosystem.

Skilled in creating real-time streaming data pipelines using Azure Event Hub and Databricks Auto Loader.
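
As a brief illustration, the sketch below uses Databricks Auto Loader to incrementally pick up newly arriving JSON files from ADLS and stream them into a Delta table; paths and table names are placeholders, and `spark` is assumed to be the session provided by a Databricks notebook.

# Minimal Auto Loader sketch (hypothetical paths and table names).
raw_path = "abfss://landing@examplestorage.dfs.core.windows.net/events/"
checkpoint = "abfss://landing@examplestorage.dfs.core.windows.net/_checkpoints/events/"

stream = (
    spark.readStream
         .format("cloudFiles")                            # Auto Loader source
         .option("cloudFiles.format", "json")             # format of the incoming files
         .option("cloudFiles.schemaLocation", checkpoint)
         .load(raw_path)
)

(
    stream.writeStream
          .option("checkpointLocation", checkpoint)
          .trigger(availableNow=True)                     # process available files, then stop
          .toTable("bronze.events")                       # assumed Delta target table
)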

Implemented robust data encryption strategies using Azure Key Vaults to fortify privacy and security measures, ensuring sensitive information remains protected within the Azure environment.
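
One common form of this pattern is pulling keys and connection secrets from Key Vault at runtime rather than hard-coding them; the sketch below uses the Azure SDK for Python with a placeholder vault URL and secret name.

# Retrieve a storage secret from Azure Key Vault at runtime (placeholder names).
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

credential = DefaultAzureCredential()                    # managed identity or local az login
client = SecretClient(
    vault_url="https://example-kv.vault.azure.net/",     # assumed vault URL
    credential=credential,
)

storage_key = client.get_secret("adls-storage-key").value   # assumed secret name
# The retrieved key is handed to the storage or database client instead of living in code.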

Extensive experience in developing data pipelines with Pandas Data Frame, Spark Data Frame, and RDDs.

Proven track record in Spark performance tuning, Spark SQL, and Spark Streaming for both Big Data and Azure Databricks.

Ensured data quality and integrity through validation, cleansing, and transformation operations.

Experience working with various big data components, including HDFS, YARN, MapReduce, Spark, Sqoop, Oozie, Pig, Zookeeper, Hive, HBase, Kafka, and Airflow.

Optimized query performance in Hive through bucketing and partitioning techniques and gained extensive experience in tuning Spark jobs.
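
The sketch below shows the general shape of such a table definition, partitioned on a date column and bucketed on a join key; database, table, and column names are illustrative, and Spark SQL with Hive support is assumed as the execution engine.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-bucketing-sketch").enableHiveSupport().getOrCreate()

# Partition by date and bucket by the join key so queries can prune partitions
# and reduce shuffle on joins and aggregations (names are placeholders).
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_db.orders_bucketed (
        order_id    BIGINT,
        customer_id BIGINT,
        amount      DECIMAL(12,2)
    )
    PARTITIONED BY (order_date DATE)
    CLUSTERED BY (customer_id) INTO 32 BUCKETS
    STORED AS ORC
""")

# A query that filters on the partition column and aggregates on the bucket key.
spark.sql("""
    SELECT customer_id, SUM(amount) AS total_amount
    FROM sales_db.orders_bucketed
    WHERE order_date = DATE'2020-01-01'
    GROUP BY customer_id
""").show()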

Proven expertise with Apache Sqoop for synchronizing data between HDFS and Hive, as well as configuring and managing workflows with Apache Oozie and Control-M to efficiently schedule and manage Hadoop processes.

Implemented effective strategies for Hive performance tuning, including the use of User-Defined Functions (UDFs).

Proficient in Apache Kafka-driven real-time streaming analytics with Spark Streaming, enabling efficient processing and analysis of high-velocity streaming data while using Kafka as a fault-tolerant data pipeline.
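
A minimal Structured Streaming consumer of this kind is sketched below; the broker address, topic, and schema are placeholders, and the spark-sql-kafka connector is assumed to be available on the cluster.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-streaming-sketch").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("value", DoubleType()),
])

events = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")   # assumed broker
         .option("subscribe", "events")                      # assumed topic
         .load()
         .select(from_json(col("value").cast("string"), event_schema).alias("e"))
         .select("e.*")
)

query = (
    events.writeStream
          .format("console")       # stand-in sink for the sketch
          .outputMode("append")
          .start()
)
query.awaitTermination()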

Built production data pipelines using Apache Airflow, YAML, and Terraform scripts.
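
A minimal Airflow DAG of the kind used for such pipelines is sketched below; the DAG id, schedule, and task bodies are placeholders rather than production code.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from the source system")      # placeholder extract step

def load():
    print("write curated data to the warehouse")   # placeholder load step

with DAG(
    dag_id="example_daily_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task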

Experience in NOSQL databases like HBase, MongoDB, Azure Cosmos DB.

Successfully designed and developed data pipelines utilizing SnowSQL, Snowflake integration services, and Snowpipe.

Leveraged advanced features such as Snowflake Zero-Copy Cloning and Time Travel to enhance and maintain Snowflake database applications. Proficient in working with Snowflake multi-cluster warehouses.

Developed cloud-based data warehouse solutions using Snowflake, optimizing schemas, tables, and views for efficient data storage and retrieval. Implemented SQL Analytical Functions & Window Functions.
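
The sketch below shows a representative windowed query run through the Snowflake Python connector; connection parameters, table, and column names are placeholders.

import snowflake.connector

conn = snowflake.connector.connect(
    account="example_account",      # placeholder credentials
    user="example_user",
    password="example_password",
    warehouse="ANALYTICS_WH",
    database="SALES_DB",
    schema="PUBLIC",
)

query = """
    SELECT customer_id,
           order_date,
           amount,
           SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS running_total,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS recency_rank
    FROM orders
"""

with conn.cursor() as cur:
    cur.execute(query)
    for row in cur.fetchmany(10):
        print(row)

conn.close()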

Skilled in data visualization using Power BI and Power BI DAX.

Well-versed in working with data formats such as ORC, Parquet, Avro, and delimited files.

Successfully designed and developed CI/CD data pipelines, collaborating with DevOps teams for automated pipeline deployment.

Experienced in utilizing version control tools such as GitHub, Azure DevOps, Bitbucket, Gitlab, and ARM templates.

Implemented a multi-node cloud cluster on AWS EC2, utilizing CloudWatch and CloudTrail for monitoring and logging with versioned S3 storage.

Developed Spark applications for extensive data processing and employed Matillion ETL for pipeline design and maintenance.

Enabled real-time data movement using Spark Structured Streaming, Kafka, and Elasticsearch, along with Tableau refresh via AWS Lambda.

Leveraged AWS services including EMR, Glue, and Lambda for transforming, moving data, and automating processes, showcasing a robust AWS-based environment.

TECHNICAL SKILLS

Azure Services

Azure Data Factory, Azure Databricks, Azure Synapse Analytics, Azure Blob Storage, Logic Apps, Function Apps, Azure Data Lake Storage Gen2, Azure SQL Database, Azure Key Vault, Azure DevOps.

AWS Services

EC2, S3, Glue, Lambda functions.

Big Data Technologies

MapReduce, Hive, Tez, HDFS, YARN, PySpark, Hue, Kafka, Spark Streaming, Oozie, Sqoop, Zookeeper, Airflow.

Hadoop Distribution

Cloudera, Hortonworks

Languages

SQL, PL/SQL, Hive Query Language, Azure Machine Learning, Python, Scala, Java.

Web Technologies

JavaScript, JSP, XML, RESTful, SOAP, FTP, SFTP

Operating Systems

Windows (XP/7/8/10), UNIX, Linux, Ubuntu, CentOS.

Build Automation tools

Ant, Maven, Toad, AutoSys

Version Control

GIT, GitHub.

IDE & Build Tools, Design

Eclipse, IntelliJ IDEA, Visual Studio, SSIS, Informatica, Erwin, Tableau, Business Objects, Power BI

Databases

MS SQL Server 2016/2014/2012, Azure SQL DB, Azure Synapse, MS Excel, MS Access, Oracle 11g/12c, Cosmos DB, MongoDB, Cassandra, HBase.

EDUCATION

Bachelor’s degree in Computer Science from Andhra University.

CERTIFICATION

Microsoft Certified: Azure Data Engineer Associate

WORK EXPERIENCE

Senior Data Engineer Sep 2022 – Present

Client: Koantek, Chandler, Arizona

Responsibilities:

•Architected and implemented end-to-end scalable data ingestion pipelines leveraging Azure Data Factory, handling diverse data sources, including SQL databases, JSON files, CSV files, and REST APIs.

•Engineered efficient data processing workflows with Azure Databricks, utilizing PySpark for distributed data processing and transformation tasks.

•Utilized Azure Synapse Analytics for advanced analytics, facilitating seamless data exploration, and generating valuable insights.

•Designed and implemented scalable data storage solutions using Azure Data Lake Storage Gen2, leveraging its hierarchical file system and advanced analytics capabilities.

•Orchestrated the design and implementation of the Medallion architecture on the Azure cloud platform, leveraging robust Azure services, including Azure Cosmos for NoSQL capabilities and Terraform for infrastructure as code, to enhance system scalability and performance.

•Implemented key architectural components within the Medallion system using Azure services, ensuring seamless integration, modularity, and adherence to best practices.

•Demonstrated expertise in utilizing Azure Synapse serverless SQL pools to implement comprehensive data warehousing and analytics solutions.

•Engineered Spark core and Spark SQL scripts with PySpark, enhancing data processing speed and efficiency.

•Designed and implemented Slowly Changing Dimension (SCD) strategies in data warehousing solutions on Azure, ensuring efficient handling of changing data over time (see the Delta Lake merge sketch after this role's Environment line).

•Designed and fine-tuned PySpark jobs for data transformations, executing aggregation tasks efficiently on large datasets.

•Implemented event-based triggers and scheduling mechanisms, automating the orchestration of data pipelines and workflows.

•Architected and implemented real-time data processing solutions utilizing Azure Event Hub, facilitating seamless ingestion, transformation, and analysis of high-volume streaming data.

•Engineered and fine-tuned Snowflake schemas, tables, and views, ensuring optimal data storage and retrieval for analytics and reporting tasks.

•Architected and deployed a cloud-based data warehouse solution on Azure using Snowflake and Azure Cosmos, harnessing their scalability and high-performance features.

•Collaborated with data scientists on data exploration and profiling tasks, fostering a deeper understanding of data characteristics, quality, and identifying potential features for machine learning applications.

•Partnered with data scientists to enhance scalability and performance in data processing and machine learning workflows, leveraging Azure services such as Azure Synapse Analytics, Azure Databricks, and Azure Cosmos.

•Deployed event-based triggers and scheduling mechanisms using Azure services and Terraform to automate Alteryx workflows and maintain timely data processing.

•Engaged closely with data analysts and business stakeholders to comprehend their requirements, translating insights into well-suited data models and structures within the Snowflake environment.

•Deployed data lineage and metadata management solutions, including Azure Cosmos, to effectively track and monitor the flow and transformations of data throughout the system.

•Implemented event-based triggers and scheduling mechanisms in Databricks, automating the orchestration of data pipelines and workflows powered by PySpark.

•Engineered scalable and optimized Snowflake schemas, tables, and views within Azure Synapse Analytics, incorporating Alteryx workflows for advanced data processing tasks.

•Recognized and addressed performance bottlenecks within data processing and storage layers, optimizing query execution and minimizing data latency.

•Deployed data lineage and metadata management solutions, including Azure Cosmos, effectively tracking and monitoring the flow and transformations of data throughout the system, enhancing overall data governance practices.

•Proficient in understanding the varied roles and responsibilities within an Agile team, encompassing crucial functions such as the Product Owner, Scrum Master, and developers.

•Executed performance tuning and capacity planning initiatives to ensure optimal scalability and efficiency within the data infrastructure.

Environment: Azure Databricks, Azure Data Factory, Azure Blob Storage, Azure Synapse Analytics, Azure Data Lake, Azure Event Hub, Azure DevOps, Logic Apps, Function Apps, MS SQL, Python, Snowflake, PySpark, Kafka, Power BI.
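
To illustrate the Slowly Changing Dimension handling mentioned above, the sketch below applies a Type 1 style overwrite with a Delta Lake merge; table and column names are placeholders, and `spark` is assumed to be a Databricks notebook session.

from delta.tables import DeltaTable

dim = DeltaTable.forName(spark, "silver.dim_customer")   # assumed target dimension
updates = spark.table("bronze.customer_updates")         # assumed incoming change set

(
    dim.alias("t")
       .merge(updates.alias("s"), "t.customer_id = s.customer_id")
       .whenMatchedUpdate(set={
           "email": "s.email",
           "address": "s.address",
           "updated_at": "s.updated_at",
       })
       .whenNotMatchedInsertAll()
       .execute()
)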

Azure Data Engineer Sep 2020 – Aug 2022

Client: US Bank, Minneapolis, MN.

Responsibilities:

Designed and implemented comprehensive end-to-end data pipelines using Azure Data Factory, ensuring seamless extraction, transformation, and loading (ETL) of data from diverse sources into Snowflake.

Engineered and implemented robust data processing workflows with Azure Databricks, leveraging PySpark for large-scale data transformations.

Administered Apache Spark clusters within Azure Synapse Analytics, ensuring efficient creation, setup, and maintenance.

Utilized Azure Data Lake Storage as a centralized data lake, implementing effective data partitioning and retention strategies for both raw and processed data.

Integrated Azure Synapse serverless SQL pools seamlessly with other Azure services, including Azure Data Factory and Azure Databricks.

Demonstrated proficiency in the Apache Spark API and ecosystem, along with Azure Synapse Studio, for effective management of Azure Synapse Analytics.

Implemented Azure Blob Storage to optimize storage and retrieval of data files, incorporating compression and encryption techniques for improved efficiency and enhanced security.

Enhanced performance in data pipelines and Spark jobs within Azure Databricks through optimization measures, including fine-tuning Spark configurations, implementing caching strategies, and leveraging data partitioning techniques (see the tuning sketch after this role's Environment line).

Integrated Azure Data Factory seamlessly with Azure Logic Apps to orchestrate intricate data workflows and trigger actions in response to specific events.

Loaded and transformed extensive datasets into Snowflake, employing various methods such as bulk loading, data ingestion, and ETL tools.

Executed data replication and synchronization strategies between Snowflake and other platforms, leveraging Azure Data Factory and Change Data Capture techniques.

Engineered and deployed Azure Functions for handling data preprocessing, enrichment, and validation tasks within data pipelines.

Orchestrated the creation, configuration, and management of Apache Spark clusters within Databricks, ensuring optimal performance and scalability for Spark applications.

Developed a Business Objects dashboard to monitor and analyze the performance of marketing campaigns.

Engineered scalable and optimized Snowflake schemas, tables, and views to accommodate complex analytics queries and meet reporting requirements.

Formulated and implemented data archiving and retention strategies using Azure Blob Storage and Snowflake's Time Travel feature.

Designed and implemented custom monitoring and alerting solutions using Azure Monitor and Snowflake Query Performance Monitoring (QPM) to proactively identify and address performance issues.

Seamlessly integrated Snowflake with Power BI and Azure Analysis Services to develop interactive dashboards and reports, empowering business users with self-service analytics capabilities.

Collaborated with cross-functional teams, including data scientists, data analysts, and business stakeholders, to comprehend data requirements and deliver scalable and reliable data solutions.

Environment: Azure Databricks, Azure Data Factory, Azure Data Lake Storage Gen2, Azure Synapse Analytics, Logic Apps, Azure SQL Database, Oracle, Snowflake, PySpark, Power BI.
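
The tuning sketch referenced above is shown below; the configuration values, paths, and column names are illustrative rather than the project's actual settings.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-tuning-sketch").getOrCreate()

# Right-size shuffle parallelism for the workload (illustrative value).
spark.conf.set("spark.sql.shuffle.partitions", "200")

orders = spark.read.parquet(
    "abfss://curated@examplestorage.dfs.core.windows.net/orders/"   # assumed input path
)

# Repartition on the join/aggregation key and cache a DataFrame that is reused downstream.
orders = orders.repartition(64, "customer_id").cache()
orders.count()   # materialize the cache

# Partition the output on a low-cardinality column so downstream reads can prune files.
(
    orders.write
          .mode("overwrite")
          .partitionBy("order_date")
          .parquet("abfss://curated@examplestorage.dfs.core.windows.net/orders_by_date/")
)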

Data Engineer Dec 2018 – Aug 2020

Client: Option Care, Brecksville, OH

Responsibilities:

Designed and implemented end-to-end data pipelines using AWS services for efficient data ingestion, transformation, and loading (ETL) into Snowflake data warehouse.

Utilized AWS EMR and Redshift for large-scale data processing, transforming, and moving data into and out of AWS S3.

Developed and maintained ETL processes with AWS Glue, migrating data from various sources into AWS Redshift.

Implemented serverless computing with AWS Lambda, executing real-time Tableau refreshes and other automated processes.

Utilized AWS SNS, SQS, and Kinesis for efficient messaging and data streaming, enabling event-driven communication and message queuing.

Designed and orchestrated workflows with AWS Step Functions, automating intricate multi-stage data workflows.

Implemented data movement with Kafka and Spark Streaming for efficient real-time data ingestion and transformation.

Integrated and monitored ML workflows with Apache Airflow, ensuring smooth task execution on Amazon SageMaker.

Leveraged Hadoop ecosystem tools, including Hadoop, MapReduce, Hive, Pig, and Spark for big data processing and analysis.

Managed workflows with Oozie, orchestrating effective coordination and scheduling in big data projects.

Utilized Sqoop for data import/export between Hadoop and RDBMS, importing normalized data from staging areas to HDFS and performing analysis using Hive Query Language (HQL).

Ensured version control with Git/GitHub, maintaining version control of the codebase and configurations.

Automated deployment with Jenkins and Terraform, facilitating the automated deployment of applications and data pipelines.

Worked with various databases, including SQL Server, Snowflake, and Teradata, for efficient data storage and retrieval.

Performed data modeling with Python, SQL, and Erwin, implementing Dimensional and Relational Data Modeling with a focus on Star and Snowflake Schemas.

Implemented and optimized Apache Spark applications, working extensively with Spark DataFrames, the Spark SQL API, and the Spark Scala API for batch processing of jobs (see the batch-processing sketch after this role's Environment line).

Collaborated with business users for Tableau dashboards, facilitating actionable insights based on Hive tables.

Enhanced performance using optimization techniques, leading to the optimization of complex data models in PL/SQL, improving query performance by 30% in high-volume environments.

Developed predictive analytics reports with Python and Tableau, visualizing model performance and prediction results.

Environment: AWS S3, AWS Redshift, HDFS, Amazon RDS, Apache Airflow, Tableau, AWS CloudFormation, AWS Glue, Apache Cassandra, Terraform.
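
The batch-processing sketch referenced above is shown below, combining the DataFrame API with Spark SQL; bucket names, columns, and tables are placeholders.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-etl-sketch").getOrCreate()

claims = spark.read.parquet("s3://example-bucket/raw/claims/")   # assumed S3 input

cleaned = (
    claims.dropDuplicates(["claim_id"])
          .withColumn("claim_month", F.date_trunc("month", F.col("service_date")))
)
cleaned.createOrReplaceTempView("claims_clean")

monthly = spark.sql("""
    SELECT claim_month, payer_id, SUM(paid_amount) AS total_paid
    FROM claims_clean
    GROUP BY claim_month, payer_id
""")

monthly.write.mode("overwrite").parquet("s3://example-bucket/curated/claims_monthly/")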

Big Data Developer Jul 2017 – Nov 2018

Client: Change Healthcare, Nashville, TN

Responsibilities:

Developed an ETL framework utilizing Sqoop, Pig, and Hive to seamlessly extract, transform, and load data from diverse sources, making it readily available for consumption.

Processed HDFS data and established external tables using Hive, while also crafting scripts for table ingestion and repair to ensure reusability across the entire project (see the external-table sketch after this role's Environment line).

Engineered ETL jobs utilizing Spark and Scala to efficiently migrate data from Oracle to new MySQL tables.

Employed Spark (RDDs, Data Frames, Spark SQL) and Spark-Cassandra Connector APIs for a range of tasks, including data migration and the generation of business reports.

Collaborated with Apache Hive, Apache Pig, HBase, Apache Spark, Zookeeper, Flume, Kafka, and Sqoop.

Played a significant role in crafting combiners, implementing partitioning, and leveraging distributed cache to boost the performance of MapReduce jobs.

Designed and implemented a Spark Streaming application for real-time sales analytics.

Analyzed source data, proficiently managed data type modifications, and employed Excel sheets, flat files, and CSV files to generate ad-hoc reports using Power BI.

Successfully addressed intricate technical challenges throughout the development process.

Analyzed SQL scripts and designed solutions using Spark with Scala.

Utilized Sqoop to extract data from diverse sources and load it into Hadoop Distributed File System (HDFS).

Managed data import from diverse sources, executed transformations using Hive and MapReduce, and loaded processed data into Hadoop Distributed File System (HDFS).

Developed detailed requirements specification for a new system using use cases and data flow diagrams.

Expertise in optimizing complex SQL queries for improved performance and reduced execution time, employing Snowflake query profiling and optimization techniques.

Utilized Sqoop to extract data from MySQL and efficiently load it into Hadoop Distributed File System (HDFS).

Implemented automation for deployments using YAML scripts for streamlined builds and releases.

Utilized Git and GitHub repositories to maintain the source code and enable version control.

Environment: Hadoop, MapReduce, Hive, Hue, Spark, Scala, Sqoop, Spark SQL, Machine Learning, Snowflake, Shell scripting, Cassandra, ETL.
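
The external-table sketch referenced above is shown below; database, table, column, and path names are placeholders, and Spark with Hive support is assumed as the execution engine.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-external-table-sketch").enableHiveSupport().getOrCreate()

# Register an external table over data already landed in HDFS (placeholder names).
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS staging.web_logs (
        log_id     STRING,
        user_id    STRING,
        event_type STRING
    )
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET
    LOCATION 'hdfs:///data/staging/web_logs'
""")

# Pick up partitions written directly to HDFS outside of Hive.
spark.sql("MSCK REPAIR TABLE staging.web_logs")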

Data Warehouse Developer Aug 2012 – May 2016

Client: Birla Soft, Hyderabad, India.

Responsibilities:

Developed source data profiling and analysis processes, thoroughly reviewing data content and metadata. This facilitated data mapping and validation of assumptions made in the business requirements.

Created and maintained databases for Server Inventory and Performance Inventory.

Operated within the Agile Scrum Methodology, engaging in daily stand-up meetings. Proficient in utilizing Visual SourceSafe for Visual Studio 2010 and project tracking through Trello.

Created Drill-Through and Drill-Down reports with dropdown menu options, implemented data sorting, and defined subtotals for enhanced data exploration.

Utilized the Data Warehouse to craft a Data Mart feeding downstream reports. Additionally, engineered a User Access Tool empowering users to generate ad-hoc reports and run queries, facilitating data analysis within the proposed Cube.

Developed logical and physical designs of databases and ER Diagrams for both Relational and Dimensional databases using Erwin.

Designed and implemented comprehensive data warehousing solutions, including the development of ETL processes and dimensional data models, resulting in improved data accessibility and analytical capabilities.

Engineered and optimized SQL queries and stored procedures for efficient data extraction, transformation, and loading (ETL) processes in support of data warehousing initiatives.

Proficient in Dimensional Data Modeling for Data Mart design, adept at identifying Facts and Dimensions, and skilled in developing fact tables and dimension tables using Slowly Changing Dimensions (SCD).

Developed a Business Objects dashboard that facilitated the tracking of marketing campaign performance.

Deployed SSIS packages and established jobs to ensure the efficient execution of the packages.

Involved in System Integration Testing (SIT) and User Acceptance Testing (UAT).

Crafted intricate mappings utilizing Source Qualifier, Joiners, Lookups (Connected and Unconnected), Expression, Filters, Router, Aggregator, Sorter, Update Strategy, Stored procedure, and Normalizer transformations.

Proficient in crafting ETL packages using SSIS to extract data from heterogeneous databases, followed by transformation and loading into the data mart.

Implemented and maintained data integration workflows using ETL tools such as Informatica, SSIS, and Talend, facilitating seamless data movement across the data warehouse.

Engaged in the creation of SSIS jobs to automate the generation of reports and refresh packages for cubes.

Experienced in utilizing SQL Server Reporting Services (SSRS) for authoring, managing, and delivering both paper-based and interactive web-based reports.

Designed and implemented stored procedures and triggers to ensure consistent and accurate data entry into the database.

Environment: Informatica 8.6.1, SQL Server 2005, RDBMS, FastLoad, FTP, SFTP, Windows Server, MS SQL Server 2014, SSIS, SSAS, SSRS, SQL Profiler, Dimensions, PerformancePoint Server, MS Office, SharePoint.


