Data Engineer / ETL Developer

Location:
Reston, VA
Posted:
October 23, 2024


Meghana Ankam

Sr ETL Developer/ Data Engineer

Email: *********.**@*****.*** Ph No: +1-571-***-****

LinkedIn: linkedin.com/in/methane-de-

PROFESSIONAL SUMMARY:

●9+ years of IT experience as an ETL Developer / Data Engineer, with strong expertise in analysis, architecture, and development.

●Extensive experience in designing and deploying scalable data solutions on Azure, utilizing Azure Data Factory, Azure Databricks, and Azure Synapse Analytics.

●Proficient in creating and managing data pipelines in Azure Data Factory for ETL processes, ensuring efficient data flow and integration.

●Proficient in designing and implementing data architectures on AWS, utilizing services like AWS Glue, Amazon Redshift, and Amazon S3 for data warehousing and ETL.

●Experienced in developing and orchestrating complex data pipelines using AWS Glue and AWS Lambda, ensuring seamless data integration and transformation.

●Good experience with Google Cloud Platform (BigQuery, Dataproc, Cloud Composer, Cloud Run).

●Skilled in developing robust ETL processes using tools like Informatica, Talend, and SSIS for seamless data integration.

●Expertise in developing Spark code using Python, Spark SQL, Spark Streaming, and PySpark for faster testing and processing of data (a minimal PySpark sketch follows this summary).

●Skilled in using SparkSQL for querying structured data within Spark applications.

●Experience in writing complex Scala Spark transformations and actions for data processing workflows.

●Extensively worked on Hadoop, Hive, and Spark to build ETL and data processing systems spanning a variety of data sources, targets, and formats.

●Proficient in monitoring Control-M jobs and promptly resolving job failures to ensure data pipeline reliability.

●Extracted data from MySQL and AWS Redshift into HDFS using Sqoop.

●Integrated Snowflake with various data sources and tools, automating data workflows for streamlined operations.

●Extensive experience developing and deploying machine learning models to optimize business processes and drive data-driven decision-making.

●Experienced in using business intelligence & visualization tools (Power BI).

●Designed ETL workflows on Tableau and deployed data from various sources to HDFS.

●Strong ability to collaborate with business stakeholders to understand requirements, design intuitive user interfaces, and deliver actionable insights through QlikView solutions.

●Skilled in leveraging GitHub Actions to automate CI/CD pipelines and enhance deployment efficiency.

●Proficient in setting up and managing CI/CD pipelines using GitLab CI to automate testing and deployment processes.

●Experienced in integrating Azure DevOps with various data engineering tools and services for streamlined project workflows.

●Experience using JIRA, Maven, Jenkins, and Git for version control and error reporting.

●Good interpersonal and communication skills, strong problem-solving abilities, and sound analytical judgment.
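
The following is a minimal, illustrative PySpark/Spark SQL sketch of the kind of transformation and querying work summarized above; the S3 paths, view name, and column names are hypothetical placeholders, not details from any engagement described here.

# Illustrative only: paths, schema, and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-example").getOrCreate()

# Read raw source data (e.g., Parquet files landed in a data lake).
orders = spark.read.parquet("s3://example-bucket/raw/orders/")

# Basic cleansing and transformation with the DataFrame API.
cleaned = (
    orders
    .dropDuplicates(["order_id"])
    .withColumn("order_date", F.to_date("order_ts"))
    .filter(F.col("amount") > 0)
)

# The same data can also be queried with Spark SQL.
cleaned.createOrReplaceTempView("orders_clean")
daily_totals = spark.sql(
    "SELECT order_date, SUM(amount) AS total_amount "
    "FROM orders_clean GROUP BY order_date"
)

daily_totals.write.mode("overwrite").parquet("s3://example-bucket/curated/daily_totals/")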

TECHNICAL SKILLS:

Programming Languages

Python, Scala, SQL, PL/SQL, R, DAX

ETL Tools

Azure Data Factory, AWS Glue, Informatica, Talend, SSIS, Apache Flume, Sqoop, Airflow

Big Data Frameworks

Hadoop, Spark, Hive, HDFS, MapReduce, YARN

Data Warehousing

Snowflake, Redshift, AWS S3, Azure SQL Database, Azure Cosmos DB

Data Processing

Spark, PySpark, SparkSQL, Scala, Spark Streaming, Hadoop

Cloud Platforms

Azure, AWS

Machine Learning

Azure Databricks, Spark MLlib, Python, R, SAS

Business Intelligence & Visualization

Power BI, Tableau, QlikView, Sisense

Version Control & CI/CD

Git, GitHub, GitLab, Jenkins, GitHub Actions, Azure DevOps

Database Management

MySQL, PostgreSQL, SQL Server, Oracle, Aurora, Exadata

Data Analysis & Visualization Tools

Matplotlib, Seaborn, Jupyter Notebooks, DAX, Power Pivot

DevOps Tools

Docker, Kubernetes, Jenkins, GitLab CI, Azure DevOps

Data Modeling

ERD, Power Designer, Erwin

PROFESSIONAL EXPERIENCE:

Client: California Department of Public Health, Sacramento, CA. Jul 2021 – Present

Role: Sr ETL Developer/ Data Engineer

Responsibilities:

●Used various AWS services including S3, EC2, AWS Glue, Athena, Redshift, EMR, SNS, SQS, DMS, and Kinesis.

●Extracted data from multiple source systems (S3, Redshift, RDS) and created tables/databases in the Glue Data Catalog using Glue crawlers.

●Created AWS Glue crawlers for crawling the source data in S3 and RDS.

●Created multiple Glue ETL jobs in Glue Studio, processed the data with various transformations, and loaded it into S3, Redshift, and RDS.

●Developed and implemented ETL pipelines on S3 parquet files in a data lake using AWS Glue.

●Designed and developed ETL processes in AWS Glue to migrate data from external sources such as S3 (Parquet/text files) into AWS Redshift; an illustrative Glue job sketch appears at the end of this section.

●Used the AWS Glue Data Catalog with crawlers to catalog the data in S3 and performed SQL query operations using AWS Athena.

●Developed and maintained DynamoDB tables with optimized key design, indexing, and capacity planning

●Designed and developed a security framework to provide fine-grained access to objects in AWS S3 using AWS Lambda and DynamoDB.

●Configured CloudWatch Logs to collect, aggregate, and analyze application and system logs, and set up log retention policies to manage storage and compliance requirements.

●Implemented data validation and cleansing routines in Python to ensure data quality and integrity throughout the data pipeline.

●Engineered data pipelines with Scala and Apache Spark, optimizing data processing tasks and reducing job completion time by 40%.

●Conducted Exploratory Data Analysis using Python Matplotlib and Seaborn to identify underlying patterns and correlation between features.

●Used Python (NumPy, SciPy, Pandas, Scikit-Learn, Seaborn) and Spark 2.0 (Spark, MLlib) to develop a variety of models and algorithms for analytic purposes.

●Designed and optimized SparkSQL queries for aggregating and analyzing large datasets.

●Optimized Scala Spark jobs for data transformation, cleansing, and enrichment.

●Implemented data ingestion pipelines using tools like Apache Flume and Sqoop for Hadoop integration.

●Monitored and troubleshot Control-M job schedules to maintain system performance and minimize downtime.

●Managed and optimized MS SQL Server databases, ensuring high availability and performance.

●Developed complex T-SQL queries, stored procedures, and functions.

●Managed relational databases using MySQL, PostgreSQL, and Aurora, ensuring high availability and performance.

●Utilized IICS for seamless data integration across multiple cloud and on-premises sources, enhancing data accessibility.

●Developed and enforced data governance policies in Snowflake, including encryption strategies and access monitoring, to safeguard sensitive information and maintain high standards of data privacy and protection across all organizational levels.

●Developed comprehensive documentation and training materials for Snowflake best practices and usage, enabling healthcare teams to efficiently utilize Snowflake's features and maintain data quality across the organization.

●Implemented and optimized Snowflake data warehouse solutions for healthcare clients, leveraging Snowflake's scalable architecture and performance features to handle large volumes of clinical and administrative data, and improving data access and query performance by up to 40%.

●Created custom Snowflake data models for advanced healthcare analytics, including predictive modeling and cohort analysis.

●Implemented granular access control policies in Snowflake, leveraging role-based access management to safeguard sensitive healthcare data.

●Led the migration of legacy data warehouses to Snowflake, modernizing infrastructure to enhance data processing and scalability.

●Managed Snowflake cost optimization through resource monitoring and forecasting, ensuring efficient compute and storage utilization.

●Integrated Snowflake with PowerBI for advanced healthcare data visualizations and reporting.

●Provided training and mentorship on Snowflake best practices, including data modeling and performance optimization.

●Developed and maintained complex SQL Server Integration Services (SSIS) packages to streamline data extraction, transformation, and loading (ETL) processes, improving data pipeline efficiency and reducing ETL execution time by 30%.

●Optimized MySQL database performance through indexing and query optimization techniques.

●Used various sources, such as Oracle and SQL Server, to pull data into Power BI.

●Used the Power BI Query Editor to perform operations such as fetching data from different files.

●Experienced with GitHub, GitLab and CI/CD pipelines.

●Strong experience with essential DevOps tools such as Docker, Kubernetes, Git, and Jenkins.

●Created and maintained Jira dashboards for team visibility on project progress, backlog, and sprint goals.

Environment: Python, Spark, SparkSQL, Airflow, Scala, Snowflake, Snowpipe, SnowSQL, Apache Flink, Hadoop, Control-M, MS SQL Server, PL/SQL, T-SQL, MySQL, PostgreSQL, Aurora, Power BI, GitHub, GitLab, CI/CD pipelines, Docker, Kubernetes, Jenkins, Jira, AWS, Databricks.
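
A minimal sketch of the kind of AWS Glue PySpark job described in this role (catalog table in, transformed Parquet out to S3); the database, table, bucket, and column names are hypothetical placeholders rather than details of the actual engagement.

# Illustrative AWS Glue job sketch: database, table, and bucket names are hypothetical.
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table registered in the Glue Data Catalog by a crawler.
source = glue_context.create_dynamic_frame.from_catalog(
    database="example_db", table_name="raw_claims"
)

# Simple column mapping and type conversion.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("claim_id", "string", "claim_id", "string"),
        ("claim_amount", "string", "claim_amount", "double"),
        ("service_date", "string", "service_date", "string"),
    ],
)

# Write the transformed data back to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/claims/"},
    format="parquet",
)

job.commit()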

Client: Mastercard, Purchase, NY. Dec 2018 – Jun 2021

Role: ETL Developer/ Data engineer

Responsibilities:

●Implemented scalable data storage solutions using AWS S3 for efficient data ingestion and archival.

●Designed and deployed AWS EC2 instances for processing and analysing large datasets with tools such as Spark and Hadoop.

●Managed AWS Lambda functions to automate data processing pipelines and reduce operational overhead.

●Implemented AWS IAM roles and policies to ensure secure access and compliance with data governance standards.

●Created AWS Lambda, EC2 instances provisioning on AWS environment and implemented security groups, administered Amazon VPC's.

●Utilized CloudWatch metrics and logs to troubleshoot performance issues, analyze system behavior, and perform root cause analysis to improve application performance and stability.

●Configured state machines with AWS Step Functions to automate and manage tasks, ensuring reliable execution of workflows with defined error handling and retry mechanisms.

●Leveraged Airflow’s extensibility to create custom plugins and integrations with AWS services, including S3 and Lambda, for efficient data processing and management.

●Implemented end-to-end ETL processes to extract, transform, and load data from heterogeneous sources into data warehouses.

●Optimized Talend jobs for performance tuning and scalability, ensuring efficient processing of large volumes of data.

●Designed and implemented microservices architecture in Scala, resulting in a 30% increase in system scalability and a 20% reduction in deployment times.

●Utilized Databricks Delta Lake to enhance data reliability and manage large-scale datasets, leading to a 20% decrease in data retrieval time.

●Optimized Spark jobs and Databricks clusters to enhance query performance and reduce job execution time

●Designed and developed ETL jobs using AWS Glue to extract, transform, and load data into AWS data lakes and data warehouses.

●Automated data pipelines using Python frameworks such as Airflow to ensure timely data delivery.

●Conducted exploratory data analysis (EDA) and visualization using Python's matplotlib and seaborn.

●Modified selected machine learning models with real-time data in Spark (PySpark).

●Utilized Spark SQL for querying structured data, improving query performance and enabling real-time analytics.

●Leveraged IICS data quality features to profile, cleanse, and enrich data, ensuring high-quality data for reporting and analysis.

●Worked with the architect to improve the cloud Hadoop architecture as needed for research.

●Monitored and maintained Airflow DAGs to ensure reliable and accurate execution of scheduled workflows; an illustrative DAG sketch appears at the end of this section.

●Configured and managed Snowflake’s access control policies, defining user roles and privileges to restrict access to critical data, and performed regular audits to ensure adherence to security protocols and best practices.

●Engineered complex ETL/ELT pipelines with Snowpipe and Streams & Tasks, integrating diverse healthcare data sources for real-time analytics.

●Developed Snowflake SQL scripts for data extraction, transformation, and loading (ETL/ELT).

●Developed and published interactive dashboards and reports using Snowflake as the data source. Integrated Snowflake with BI tools such as Power BI and Tableau for comprehensive data visualization and reporting.

●Developed SQL scripts for creating Tableau reports and dashboards that present data trends as visualizations and reports for the teams.

●Connected Databricks with business intelligence (BI) tools such as Tableau, to provide actionable insights and visualizations from large datasets.

●Created high-level and interactive dashboards using Tableau, facilitating data-driven decision-making for senior stakeholders.

●Collaborated with cross-functional teams to track and prioritize data pipeline development tasks and issues in Jira.

Environment: AWS S3, AWS EC2, Spark, Hadoop, AWS Lambda, AWS IAM, ETL Processes, Talend, AWS Glue, Python, Spark, Spark SQL, Databricks, Amazon Redshift Spectrum, Snowflake SQL, Tableau, Git, GitHub, Jira.
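
A minimal Airflow DAG sketch of the kind of scheduled extract-transform-load workflow described above; the DAG id, task names, and callables are hypothetical placeholders.

# Illustrative only: DAG id, tasks, and callables are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    # Pull data from a source system (placeholder).
    print("extracting data")

def transform():
    # Apply transformations (placeholder).
    print("transforming data")

def load():
    # Load results into the warehouse (placeholder).
    print("loading data")

with DAG(
    dag_id="example_etl_dag",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load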

Client: TATA AIG General Insurance Company Limited, Mumbai, India. Mar 2016 – Oct 2018

Role: Data Engineer

Responsibilities:

●Managed Azure SQL Database and Azure Cosmos DB configurations to optimize performance and scalability for big data solutions.

●Developed and maintained Azure Databricks clusters for data transformation and advanced analytics workflows.

●Designed and implemented ETL processes to extract, transform, and load data from various sources into data warehouses.

●Designed and implemented ETL processes using Informatica PowerCenter to extract, transform, and load data from various sources into the data warehouse.

●Experienced in PyUnit, the Python unit testing framework, for testing Python applications.

●Capable of writing functional and object-oriented Scala code to implement complex data processing logic and algorithms, ensuring code readability and maintainability.

●Implemented data partitioning and caching strategies in Spark, improving data processing efficiency for large-scale datasets; an illustrative PySpark sketch appears at the end of this section.

●Managed Snowflake security and access controls, including configuring role-based access control (RBAC) and data masking policies to ensure compliance with healthcare regulations such as HIPAA, and enhancing data security and privacy.

●Designed and implemented large-scale data processing pipelines using Hadoop ecosystem tools such as HDFS, MapReduce, and YARN.

●Implemented logging and exception handling frameworks in PL/SQL, ensuring reliable data processing and easy troubleshooting.

●Designed and optimized complex MySQL queries to improve data retrieval efficiency.

●Developed and published reports and dashboards using Power BI and wrote effective DAX formulas and expressions.

●Managed version control and code collaboration using Git for multiple data engineering projects.

Environment: Azure SQL Database, Azure Cosmos DB, Azure Databricks, ETL, Informatica, Scala, Spark, Hadoop, PL/SQL, MySQL, Power BI, Git.
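
A minimal PySpark sketch illustrating the partitioning and caching strategy mentioned above; the paths and column names are hypothetical placeholders.

# Illustrative only: paths and column names are hypothetical.
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-cache-example").getOrCreate()

events = spark.read.parquet("/data/raw/events/")

# Repartition by a frequently filtered key to balance work across executors.
events_by_day = events.repartition("event_date")

# Cache a dataset that is reused by several downstream aggregations.
events_by_day.persist(StorageLevel.MEMORY_AND_DISK)

daily_counts = events_by_day.groupBy("event_date").count()
daily_counts.write.mode("overwrite").partitionBy("event_date").parquet("/data/curated/daily_counts/")

events_by_day.unpersist()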

Client: Yashoda Hospitals, Hyderabad, India. Jul 2015 – Feb 2016

Role: Data Analyst

Responsibilities:

●Developed and maintained ETL pipelines on AWS, leveraging services such as Glue, and Lambda.

●Implemented data storage solutions using S3, ensuring secure and scalable data management.

●Designed and executed data migration strategies to AWS, ensuring minimal downtime and data integrity.

●Developed ETL workflows using Sqoop to import/export data between Hadoop and relational databases.

●Utilized R for statistical analysis and data visualization to support business decision-making.

●Created interactive data exploration and visualization using Jupyter Notebooks.

●Conducted data analysis and predictive modelling using SAS.

●Designed and developed business intelligence reports using Sisense.

●Utilized advanced Excel functions (VLOOKUP, pivot tables, macros) for data manipulation and analysis.

●Created insightful dashboards and reports using Tableau for data-driven decision-making.

●Utilized Cognos for reporting and performance management.

●Analyzed large datasets stored in Oracle and Exadata environments.

●Implemented data integration and transformation processes using Informatica.

●Developed BI reports and OLAP cubes to provide multidimensional data analysis.

●Created and managed data models and reports using Power Pivot.

●Built interactive visualizations and dashboards using QlikView.

●Automated data processing and analysis tasks using Python scripting; an illustrative sketch appears at the end of this section.

●Conducted GAP analysis to identify discrepancies and improvements in data processes.

Environment: AWS, Hadoop, Sqoop, R, Jupyter Notebooks, SAS, Sisense, Excel, Tableau, Oracle, Exadata, Informatica, OLAP Cubes, Power Pivot, QlikView, Python, GAP.
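
A minimal Python/pandas sketch of the kind of scripted data-processing automation described above; the file names and columns are hypothetical placeholders.

# Illustrative only: file names and columns are hypothetical.
import pandas as pd

# Load a raw extract and apply basic cleanup.
visits = pd.read_csv("patient_visits.csv", parse_dates=["visit_date"])
visits = visits.drop_duplicates(subset="visit_id").dropna(subset=["department"])

# Aggregate monthly visit counts per department for reporting.
monthly = (
    visits
    .assign(month=visits["visit_date"].dt.to_period("M"))
    .groupby(["month", "department"])
    .size()
    .reset_index(name="visit_count")
)

# Export a summary for downstream dashboards/reports.
monthly.to_excel("monthly_visit_summary.xlsx", index=False)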

EDUCATION: Jawaharlal Nehru Technological University, Hyderabad, TS, India

BTech in Computer Science and Engineering, June 2011 - May 2015


