Viswa A

Nashville, Tennessee | 615-***-**** | ad4l9j@r.postjobfree.com

Professional Summary

●8+ years of expertise across various domains, proficient in cloud platforms such as AWS, Azure, GCP, and Databricks; big data technologies including Hadoop and Spark; ETL tools such as Informatica, NiFi, and Airflow; scripting languages such as Python, Bash, and PowerShell; non-relational databases including MongoDB, HBase, and Cassandra; and programming languages including Python, Scala, Java, and SQL.

●Strong coding skills in Python, Scala, Java, and SQL, enabling efficient development and optimization of data pipelines and analytics solutions, as well as proficiency in Apache Spark for distributed data processing and analytics.

●Proficient in ETL and data warehousing concepts, including dimensional modeling, star and snowflake schemas, slowly changing dimensions (SCDs), data mart design, and data warehouse architecture, with expertise in designing and implementing robust data pipelines, data integration, and data governance strategies that ensure accurate and reliable data processing and analysis.

●Proficient in leveraging Amazon Web Services (AWS) services such as Amazon S3 (Simple Storage Service), Amazon EC2 (Elastic Compute Cloud), Amazon Redshift, and Amazon EMR (Elastic MapReduce) for building scalable and reliable data infrastructure in the cloud.

●Experienced in setting up and managing AWS data lakes using services like Amazon S3 and AWS Glue, enabling efficient storage, processing, and analysis of large volumes of structured and unstructured data.

●Experience in orchestrating SQL database migrations to Azure cloud solutions, including Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, and Azure Data Warehouse. Skilled in access control, role-based access implementation, and data security enforcement. Expertise in on-premises database migration to Azure Data Lake Store using Azure Data Factory.

●Experienced in leveraging Google Cloud Platform (GCP) services such as Google Cloud Storage (GCS), Virtual Machines (VMs), Cloud SQL, and BigQuery. Skilled in architecting scalable and reliable data solutions using GCP services, including data storage, compute, and analytics.

●Proficient in leveraging Talend Data Integration to design and implement robust data models, including star schema structures, ensuring optimal performance and scalability for analytical querying and reporting needs.

●Proficient in leveraging Snowflake's features to design and implement scalable and performant data models, enabling efficient data storage, management, and analytics within cloud-based environments.

●Proficient in data modeling methodologies and techniques, including dimensional modeling, entity-relationship modeling, and schema design, to create structured and efficient data models that meet business requirements and support analytical insights.

●Proficient in designing and implementing CI/CD pipelines on AWS, leveraging services to automate software delivery processes and improve deployment efficiency.

●Proficient in Hadoop architecture, including HDFS, JobTracker, TaskTracker, NameNode, DataNode, Secondary NameNode, and MapReduce concepts, coupled with proficiency in Spark for optimizing algorithms and enhancing performance using SparkContext, PySpark, Spark SQL, DataFrames, pair RDDs, Spark on YARN, and memory tuning, improving data processing efficiency.

●Experienced in loading and analyzing large datasets using the Hadoop framework, encompassing MapReduce, HDFS, Pig, Hive, Flume, Sqoop, Spark, Impala, Scala, YARN, Kafka, Oozie, and HBase.

●Utilized a comprehensive suite of ETL tools including Informatica, Apache NiFi, Airflow, and Talend to streamline data pipelines, orchestrate workflows, and ensure seamless data integration and processing.

●Utilized Python and scripting languages like Bash and PowerShell for automation of data engineering tasks, demonstrating fluency in scripting environments to streamline workflows and enhance productivity in data processing operations.

●Utilized a variety of non-relational databases essential for data engineering tasks, including MongoDB, HBase, Cassandra, and Elasticsearch, to efficiently handle unstructured and semi-structured data, ensuring seamless integration and processing within data pipelines.

●Strong experience working with Teradata, proficient in writing complex SQL and PL/SQL for tasks such as creating tables, views, and indexes and developing stored procedures and functions, and adept at utilizing Terraform for infrastructure as code, ensuring efficient deployment and management of cloud resources.

●Skilled Data Engineer adept at effectively communicating with business stakeholders to translate their requirements into technical solutions, fostering collaboration and driving data-driven decision-making for business growth.

SKILLS

●Big data technologies: Hadoop, MapReduce, Hive, Spark (PySpark), Sqoop, ZooKeeper, Airflow, Terraform, Apache Kafka, Apache Cassandra, Apache HBase, Apache Impala, Data Modeling

●Programming languages: Python, Java, Scala, C, SAS, UNIX/LINUX Shell Scripting, JavaScript

●Databases: SQL, PL/SQL, PSQL, SQL Server, MySQL, Oracle, Snowflake, Cassandra, MongoDB, HBase

●CI/CD Tools: Jenkins, GitHub, Jira, Confluence

●ETL Tools: Informatica, Apache NiFi, Airflow, Talend

●Data Visualization Tools: Tableau, Power BI, Data Studio, AWS Athena

●Cloud Platform: AWS Services, Azure, Databricks, GCP, Cloudera

●Project Management: Agile, Waterfall, Asana, Kanban

WORK EXPERIENCE

Amazon, Nashville, Tennessee April 2022 – Present

Data Engineer

Responsibilities:

●Designed and implemented robust data pipelines using AWS Glue and Apache Spark, facilitating efficient data ingestion, processing, and transformation for data analytics, and developing machine learning models.

●Leveraged Amazon EMR for distributed data processing and analytics, optimizing performance and scalability for processing large scale datasets in a cost-effective manner.

●Designed and implemented end-to-end ETL processes in AWS Glue, using PySpark for data transformation and integrating with Amazon Athena for advanced analytics; these pipelines migrated campaign data from diverse external sources (S3, JDBC connectors, and various file formats) into RDS databases, enabling comprehensive analysis and insights through Athena queries (see the Glue/PySpark sketch after this section).

●Developed and orchestrated complex ETL workflows using Apache Airflow in conjunction with AWS Glue, ensuring robust scheduling, monitoring, and execution of the pipelines that migrate campaign data from heterogeneous external sources to RDS databases, and leveraging Airflow's DAG-based architecture for streamlined workflow management and automation (see the Airflow sketch after this section).

●Integrated Apache Kafka into data pipelines for real-time streaming of upstream data, enabling timely analysis and decision-making for adaptive learning experiences.

●Integrated Amazon Athena with Databricks to enable ad-hoc SQL queries and interactive analytics on data stored in Amazon S3, enhancing the data exploration and analysis capabilities of the platform.

●Automated data infrastructure provisioning, configuration, and management using AWS CloudFormation, the AWS CLI, and AWS SDKs, enabling efficient deployment and maintenance of data solutions.

●Implemented Infrastructure as Code (IaC) using AWS CloudFormation or AWS CDK to provision and manage AWS resources required for CI/CD pipelines, ensuring consistency and repeatability across environments.

●Optimized resource utilization and cost-effectiveness by provisioning EC2 instances with appropriate specifications for data processing tasks.

●Spearheaded the planning, design, and execution of the migration of legacy ETL processes to Snowflake, resulting in a 50% improvement in query performance and a 30% reduction in maintenance overhead.

●Designed and implemented data warehouse solutions using Snowflake on AWS, including schema design, table structures, and data loading strategies.

●Developed ETL pipelines using AWS Glue and Snowflake's Snowpipe for efficient data ingestion, transformation, and loading processes.

●Utilized Snowflake features such as data sharing and clustering to enhance data management, security, and performance.

●Implemented Snowflake's semi-structured data support to handle unstructured data sources and complex data structures, ensuring flexibility and adaptability in the data models.

●Implemented robust monitoring for ETL processes using AWS CloudWatch, creating alarms to detect failures and anomalies in AWS Glue jobs and developing custom metrics and dashboards in CloudWatch to track key performance indicators and monitor the health and performance of ETL workflows.

●Designed and implemented a serverless data ingestion pipeline using AWS Lambda and SQS to ingest and process large volumes of real-time data streams, and set up notification mechanisms using AWS SNS to alert stakeholders and teams to critical system failures and performance degradation (see the Lambda/SQS sketch after this section).

●Designed and implemented a robust data exchange pipeline mechanism utilizing AWS architecture to transfer external data from vendor teams to internal stakeholders, ensuring compliance with security requirements.

●Developed and deployed Python scripts to automate the data ingestion process, streamlining extraction, transformation, and loading (ETL) operations from downstream vendor teams.

●Leveraged native AWS services including SQS, Lambda, SNS, EC2, and VPC to optimize data handling and ensure scalability and reliability.

●Provided technical expertise and support to internal stakeholders, troubleshooting issues and implementing solutions to address data exchange challenges effectively.

●Contributed to the development of best practices and standards for data management and exchange within the organization, ensuring compliance with industry regulations and data privacy requirements.

Environment: Spark, Python, SQL, Databricks, Snowflake, Airflow, Data Modeling, Agile, AWS (EMR, S3, SNS, EC2, AWS Batch, Athena, AWS CDK), CI/CD, Kafka.
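
Illustrative sketch of the AWS Glue/PySpark ETL pattern referenced in the campaign-data bullet above. This is a minimal, hedged example, not the production job: the bucket, catalog connection, database, and column names are placeholders.

import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

# Resolve the job name passed in by the Glue job runner.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw campaign files from S3 (hypothetical bucket/prefix).
raw = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-campaign-raw/landing/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Basic cleanup and enrichment with Spark DataFrames.
df = (raw.toDF()
         .dropDuplicates(["campaign_id"])              # hypothetical key column
         .withColumn("load_ts", F.current_timestamp()))

# Write the curated data to RDS through a Glue catalog connection
# (the connection name "rds-campaign-conn" is a placeholder).
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=DynamicFrame.fromDF(df, glue_context, "campaign_curated"),
    catalog_connection="rds-campaign-conn",
    connection_options={"dbtable": "campaign_curated", "database": "marketing"},
)

job.commit()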
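
Minimal Airflow DAG sketch for the Glue orchestration bullet above, triggering the Glue job through boto3 rather than a provider operator; the DAG id, job name, and schedule are assumed placeholders.

from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator


def start_campaign_glue_job(**_):
    """Kick off the (hypothetical) Glue job and return its run id."""
    glue = boto3.client("glue")
    response = glue.start_job_run(JobName="campaign-curation-job")  # placeholder job name
    return response["JobRunId"]


with DAG(
    dag_id="campaign_etl",               # placeholder DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_glue = PythonOperator(
        task_id="run_glue_job",
        python_callable=start_campaign_glue_job,
    )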
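
Hedged sketch of the serverless ingestion pattern described above (Lambda consuming SQS, landing payloads in S3, alerting via SNS); the bucket name, topic ARN, and payload shape are assumptions rather than the actual pipeline.

import json
import os
import uuid

import boto3

s3 = boto3.client("s3")
sns = boto3.client("sns")

# Hypothetical configuration, injected as Lambda environment variables.
LANDING_BUCKET = os.environ.get("LANDING_BUCKET", "example-ingest-landing")
ALERT_TOPIC_ARN = os.environ.get(
    "ALERT_TOPIC_ARN", "arn:aws:sns:us-east-1:123456789012:ingest-alerts"
)


def lambda_handler(event, context):
    """Process a batch of SQS messages and land each payload in S3."""
    records = event.get("Records", [])
    try:
        for record in records:
            payload = json.loads(record["body"])
            key = f"raw/{uuid.uuid4()}.json"
            s3.put_object(
                Bucket=LANDING_BUCKET,
                Key=key,
                Body=json.dumps(payload).encode("utf-8"),
            )
        return {"status": "ok", "processed": len(records)}
    except Exception as exc:
        # Notify stakeholders of ingestion failures via SNS, then re-raise.
        sns.publish(
            TopicArn=ALERT_TOPIC_ARN,
            Subject="Ingestion pipeline failure",
            Message=str(exc),
        )
        raise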

Nike, Portland, Oregon May 2020 – March 2022

Senior Data Engineer

Responsibilities:

●Developed data pipelines in Azure Data Factory (ADF) utilizing Linked Services, Datasets, and Pipelines to orchestrate the extraction, transformation, and loading of data from diverse sources including Azure SQL, Blob Storage, and Azure SQL Data Warehouse.

●Implemented data integration solutions between on-premises and cloud environments using Azure Data Factory and SSIS.

●Designed and implemented data models and schemas for Azure SQL Database and Cosmos DB to support business intelligence and reporting requirements.

●Developed JSON scripts for deploying ADF pipelines that process data using the SQL activity.

●Designed and implemented Spark notebooks to efficiently transform data, partition it, and organize files within Azure Data Lake Storage (ADLS), optimizing data processing and storage for enhanced performance and scalability (see the ADLS write sketch after this section).

●Implemented User-Defined Functions (UDFs) using Scala and PySpark to meet specific business requirements, enabling customized data transformations and analysis within Spark-based data processing workflows (see the UDF sketch after this section).

●Utilized Talend's graphical interface to create data integration jobs, transforming complex data structures into streamlined models suitable for reporting and analytics.

●Implemented comprehensive data extraction solutions by combining Talend Data Integration with Azure services, facilitating seamless integration of API data into Azure-based environments.

●Developed star schema structures that optimize query performance and support complex analytical requirements, ensuring scalability and ease of use for business users.

●Leveraged Talend's tRESTClient component to connect to RESTful APIs and extract structured data in JSON or XML formats, ensuring compatibility with Azure services.

●Integrated Talend jobs with Azure Data Lake Storage using tAzureStorage components to store extracted data in scalable and secure storage environments.

●Developed custom routines using Talend's scripting capabilities, such as tJava or tJavaRow, to handle authentication, pagination, and other API-specific requirements in conjunction with Azure services.

●Utilized Azure Synapse to design and implement end-to-end analytics solutions, integrating data warehousing, big data analytics, and data integration capabilities to enable comprehensive data analysis and insights generation.

●Implemented Azure DevOps for continuous integration and continuous deployment (CI/CD), automating code deployment processes and ensuring seamless delivery of data engineering pipelines across Azure environments, enhancing development agility and efficiency.

●Utilized Apache Airflow in conjunction with Azure services to orchestrate and automate complex data workflows, providing robust scheduling, monitoring, and error handling capabilities for data engineering tasks within Azure environments.

●Implemented email notification functionality using Azure Logic Apps, leveraging its robust workflow automation capabilities to send alerts and notifications based on predefined triggers within data engineering pipelines, enhancing monitoring and communication for efficient workflow management.

●Implemented monitoring and alerting solutions for Azure data pipelines using Azure Monitor and Azure DevOps.

●Worked collaboratively with data science and business intelligence teams to ensure the delivery of clean and reliable data for analysis and reporting purposes.

●Utilized non-relational databases such as MongoDB and Cassandra for handling unstructured data and implementing data warehousing concepts.

●Designed and implemented star schema data modeling techniques to support analytical querying and reporting requirements, ensuring optimal performance and ease of use for business users.

●Utilized dimensional modeling techniques to create fact and dimension tables, establishing clear relationships and hierarchies for efficient data analysis.

●Developed custom Python scripts leveraging Azure SDKs (Software Development Kits) to interact with Azure services programmatically, enabling automation of various cloud management tasks.

●Contributed to the migration of Oracle database to Google BigQuery, employing Google Cloud Data Transfer Service and custom ETL scripts to seamlessly extract, transform, and load data into BigQuery, ensuring compatibility and maintaining data integrity throughout the migration process using Google Dataflow.

●Implemented data storage solutions on Google Cloud Storage, including bucket creation, object versioning, and data archival. Designed storage architectures to support data replication and disaster recovery requirements for mission-critical datasets.

●Developed and optimized complex SQL queries in BigQuery to perform aggregations, joins, and transformations on large datasets, enabling efficient analysis and visualization of data (see the BigQuery sketch after this section).

●Utilized Google Data Studio to create interactive dashboards and reports based on BigQuery datasets, providing stakeholders with intuitive visualizations to facilitate data-driven decision-making.

●Collaborated closely with business stakeholders to understand reporting needs and translate them into comprehensive data models, incorporating business logic, metrics, and KPIs.

●Managed project timelines, prioritized tasks, and adapted to changing requirements using agile methodologies, collaborating effectively with teams, and utilizing GitHub for version control and repository management.

Environment: Python, PySpark, Azure DevOps, Azure Logic Apps, Azure Synapse, Azure Data Factory, Azure SQL Database, GitHub, Apache Airflow, Talend, Data Modeling, GCP (BigQuery, Google Data Studio, Google Cloud Storage, Google Dataflow).
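
Minimal PySpark sketch of the ADLS transformation and partitioning bullet above; the storage account, container paths, and column names are hypothetical.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("adls_partition_sketch").getOrCreate()

# Hypothetical ADLS Gen2 paths (abfss://<container>@<account>.dfs.core.windows.net/...).
SOURCE_PATH = "abfss://raw@examplelake.dfs.core.windows.net/sales/"
TARGET_PATH = "abfss://curated@examplelake.dfs.core.windows.net/sales/"

df = (spark.read.format("parquet").load(SOURCE_PATH)
          .withColumn("order_date", F.to_date("order_ts"))   # placeholder columns
          .withColumn("year", F.year("order_date"))
          .withColumn("month", F.month("order_date")))

# Partition the curated output by year/month so downstream reads prune files.
(df.write
   .mode("overwrite")
   .partitionBy("year", "month")
   .parquet(TARGET_PATH))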
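
Short UDF sketch matching the Scala/PySpark UDF bullet above (PySpark side only); the business rule shown is a made-up placeholder.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf_sketch").getOrCreate()


def size_band(quantity):
    """Hypothetical business rule: bucket an order quantity into a band."""
    if quantity is None:
        return "unknown"
    return "bulk" if quantity >= 100 else "standard"


size_band_udf = udf(size_band, StringType())

orders = spark.createDataFrame(
    [("o-1", 12), ("o-2", 250)], ["order_id", "quantity"]
)
orders.withColumn("size_band", size_band_udf(col("quantity"))).show()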
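
Hedged sketch of querying BigQuery from Python with the google-cloud-bigquery client, as in the BigQuery bullet above; the project, dataset, and table names are placeholders, not the actual datasets.

from google.cloud import bigquery

# Placeholder project id; credentials come from the environment
# (GOOGLE_APPLICATION_CREDENTIALS or an attached service account).
client = bigquery.Client(project="example-analytics-project")

sql = """
    SELECT region, COUNT(*) AS order_count, SUM(amount) AS total_amount
    FROM `example-analytics-project.sales.orders`
    GROUP BY region
    ORDER BY total_amount DESC
"""

# Run the query and print the aggregated rows.
for row in client.query(sql).result():
    print(row.region, row.order_count, row.total_amount)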

Stigentech, Hyderabad March 2017 – December 2019

Big Data Engineer

Responsibilities:

●Responsible for developing the project from scratch using Python and Spark (PySpark) within an Agile model.

●Migrated ETL processes from RDBMS to Hive to validate easier data manipulation (see the PySpark JDBC-to-Hive sketch after this section for an equivalent pattern).

●Designed and developed a customized business rule framework to implement business logic for the existing mainframe process using Hive and Pig UDFs.

●Designed and implemented data processing workflows on YARN-managed Hadoop clusters, optimizing resource utilization and improving overall performance of data processing tasks.

●Developed and maintained batch job scheduling and automation using Autosys, ensuring timely execution and monitoring of data workflows across multiple environments.

●Utilized Sqoop to facilitate data transfer between Hadoop and PostgreSQL databases, enabling efficient data movement and synchronization for real-time and batch processing.

●Performed complex querying on Oracle databases, utilizing advanced SQL techniques and optimization strategies to extract, transform, and analyze data for various business requirements.

●Utilized Talend's tFileInputDelimited and tFileInputExcel components to extract data from delimited files and Excel spreadsheets, serving as the source for migration tasks.

●Designed and implemented transformation logic using components like tMap and tJoin to cleanse and enrich data during the migration process.

●Collaborated with cross-functional teams to ensure the accuracy and completeness of migrated data, providing documentation and training on Talend components for efficient data migration processes.

●Orchestrated the integration of non-relational databases, including Cassandra and MongoDB, into ETL workflows, facilitating the extraction, transformation, and loading of data from diverse sources into target systems.

●Configured and managed Hadoop cluster nodes using PuTTY, SSH, and command-line utilities, optimizing performance and resource utilization for distributed data processing tasks.

●Designed and implemented custom ETL pipelines tailored to the unique data models and query languages of Cassandra and MongoDB, ensuring seamless data extraction and transformation.

●Developed email setup notifications for monitoring and alerting purposes using monitoring tools, ensuring timely notifications of critical events and issues in data processing workflows.

●Collaborated closely with database administrators and stakeholders to understand data schema and requirements, ensuring accurate extraction and transformation of data from relational and non-relational databases.

●Managed and orchestrated daily continuous code deployments using Jenkins automation server, ensuring reliable and efficient delivery of software updates across development, staging, and production environments.

●Integrated CI/CD pipelines with version control systems such as Bitbucket to automatically trigger builds and deployments upon code commits or pull requests.

●Utilized Agile project management tools like Jira to manage sprint backlogs, track user stories, and monitor task progress, ensuring the timely achievement of project milestones and adherence to project timelines.

Environment: PySpark, YARN, Autosys, Oozie, Python and shell scripting, PostgreSQL, Talend, MS SQL Server, Oracle 11g, Jenkins, Cloudera, CI/CD.
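
The RDBMS-to-Hive migration above was performed with Sqoop and Hive jobs; as an illustrative stand-in, the following PySpark JDBC sketch shows the same movement (host, credentials, database, and table names are placeholders).

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("rdbms_to_hive_sketch")
         .enableHiveSupport()
         .getOrCreate())

# Read a source table from PostgreSQL over JDBC (placeholder host/db/credentials).
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://db-host:5432/sales")
          .option("dbtable", "public.orders")
          .option("user", "etl_user")
          .option("password", "change-me")
          .option("driver", "org.postgresql.Driver")
          .load())

# Land the data as a managed Hive table (the "staging" database is assumed to exist).
(orders.write
       .mode("overwrite")
       .format("parquet")
       .saveAsTable("staging.orders"))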

Arrixon Technologies, Hyderabad, India January 2016 – February 2017

Hadoop Developer

Responsibilities:

●Designed and implemented MapReduce applications to process and analyze large volumes of data on Hadoop clusters, utilizing technologies such as Hadoop, MapReduce, Hive, Spark (PySpark), Sqoop, Zookeeper, Apache HBase, and Apache Impala.

●Utilized Apache HBase for real-time data storage and retrieval, designing schemas and optimizing performance for high-throughput applications.

●Deployed and managed Hadoop clusters using Cloudera Manager, ensuring high availability, scalability, and performance of cluster infrastructure.

●Developed data ingestion pipelines using Sqoop to efficiently transfer data between Hadoop and relational databases, ensuring seamless integration and data consistency.

●Implemented complex data processing workflows using Apache Impala for interactive querying and analysis of Hadoop data, enabling rapid insights and decision-making.

●Worked with Apache Zookeeper for distributed coordination and synchronization of Hadoop cluster nodes, ensuring reliability and fault tolerance in cluster operations.

●Collaborated with downstream teams to integrate machine learning algorithms into MapReduce applications using Spark (PySpark), enabling predictive analytics and recommendation systems.

●Developed complex SQL queries in Oracle to extract, transform, and load data into the Hadoop ecosystem, ensuring data accuracy and integrity for downstream analytics.

●Performed comprehensive testing of MapReduce applications and data pipelines, including unit testing, integration testing, and regression testing, to ensure robustness and reliability of software solutions.

●Conducted performance tuning of Hadoop cluster resources, optimizing memory, CPU, and disk utilization for efficient data processing and resource management.

●Optimized Hive queries by leveraging file formats such as Parquet, JSON, and Avro, improving query performance and reducing storage costs (see the Parquet sketch after this section).

●Utilized WinSCP for secure file transfer between local environments and remote Hadoop clusters, ensuring seamless data exchange for development and testing purposes.

●Designed and implemented data models and structures optimized for Tableau reporting, ensuring efficient data retrieval and visualization performance.

●Collaborated with business analysts and stakeholders to understand reporting requirements and translate them into actionable data models and visualizations in Tableau.

Environment: Oracle, MySQL, Teradata, Cloudera, Hive, Spark, HBase, Sqoop, PySpark, Tableau, Oozie, Parquet, JSON.
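
Small Spark SQL sketch of the Parquet-based Hive optimization mentioned above; database, table, and column names are hypothetical.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive_parquet_sketch")
         .enableHiveSupport()
         .getOrCreate())

# Materialize a columnar copy of a raw Hive table (placeholder names).
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.sales_parquet
    STORED AS PARQUET
    AS SELECT * FROM analytics.sales_raw
""")

# Aggregations against the Parquet table scan far fewer bytes than the raw text table.
spark.sql("""
    SELECT region, SUM(amount) AS total_amount
    FROM analytics.sales_parquet
    GROUP BY region
""").show()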


