SNEHA
Sr. Data Engineer
Email: **********@*****.***
Phone: +1-253-***-****
PROFESSIONAL SUMMARY:
Over 9 years of hands-on experience designing, developing, and implementing end-to-end data engineering solutions using various technologies and database management systems such as SQL, Spark-SQL, Python, Scala, Hadoop, and Spark.
Proven expertise in managing and processing large datasets, leveraging technologies like HDFS, PySpark, Sqoop, Hive, Pig, and Flink to extract meaningful insights.
Extensive experience with cloud platforms including AWS, Azure, and GCP, focusing on services such as AWS Redshift, Azure Data Warehouse, BigQuery, and Snowflake for scalable and efficient data storage and processing.
Leveraged open-source tools like Hadoop, Spark, and Kafka for efficient data processing.
Applied predictive analytics and machine learning models for actionable insights.
Experience in developing data pipelines for structured and unstructured data.
Managed and administered Kafka clusters, ensuring optimal performance, scalability, and reliability across multiple environments, including production and staging.
Proficient in working with various databases such as MySQL, DynamoDB, PostgreSQL, and MongoDB, ensuring optimal data storage and retrieval mechanisms.
Adept at implementing real-time data processing using technologies like Kinesis, Kafka, Azure Stream Analytics, and Pub/Sub, enhancing the organization's ability to react to dynamic data changes.
Implemented Snowflake data sharing and cloning features to enable seamless data collaboration across teams.
Developed interactive dashboards and reports using Power BI, integrating with Azure data sources to provide actionable business insights and drive data-driven strategies.
Designed and implemented data pipelines using Snowflake for efficient data storage and retrieval, ensuring data integrity and performance optimization.
Analyzed complex data processing requirements, source, and target models to create sophisticated SQL queries for data extraction, transformation, and loading into downstream applications.
Optimized Spark jobs in Databricks by tuning configurations, caching, and partitioning strategies, leading to significant performance improvements.
Designed and developed comprehensive reports using SQL Server Reporting Services (SSRS) to provide insights and support decision-making processes.
Implemented machine learning models at scale, covering all phases including data ingestion, feature engineering, model training, hyperparameter tuning, model evaluation, and model monitoring.
Designed and developed AWS CloudFormation templates, writing JSON templates and contributing them to the team's GitHub repository.
Implemented stateful stream processing solutions with Flink, managing large states and ensuring consistent data processing across distributed environments.
Expertise in SQL for querying large datasets and ensuring optimal performance.
Used AWS SageMaker to quickly build, train, and deploy machine learning models.
Expert in writing UNIX shell, SQL, and Python scripts.
Studied and stayed current on the features and functionality of PostgreSQL.
Developed data pipelines using Python for medical image pre-processing, training, and testing.
Implemented a multi-tiered data storage solution maximizing cost efficiency by intelligently archiving aged data and reducing storage expenditures by 30%.
Strong knowledge of Relational Database Management Systems (RDBMS) and SQL, delivering optimized queries for complex data operations and large-scale data warehousing.
Designed and built a dashboard for monitoring performance metrics using Tableau and Python, while maintaining existing dashboards for accuracy and efficiency and incorporating requested changes.
Developed and maintained scalable data pipelines using Databricks, ensuring efficient ETL processes for large datasets.
Proficient in setting up monitoring and logging solutions using AWS CloudWatch, ELK Stack, Grafana, and Prometheus, ensuring robust data system health.
A seasoned professional utilizing project management tools like Jira and following Agile and Scrum methodologies, fostering collaborative teamwork and successful project delivery.
Proficient in handling various data formats, including XML and JSON, with scripting capabilities in PowerShell, enhancing data processing flexibility and adaptability.
Consistently delivered high-quality, scalable, and secure data solutions, highlighting a commitment to excellence in every project undertaken.
TECHNICAL EXPERTISE
Programming Languages: SQL, Scala, Python
Scripting: Shell Scripting, PowerShell
AWS: EC2, S3, IAM, EMR, CloudFormation, Lambda
Database Management Systems: MySQL, Microsoft SQL Server, Oracle Database
GCP: Dataflow, Pub/Sub, Dataproc, Dataprep
Big Data Technologies: Hadoop, HDFS, Spark-SQL, Sqoop, Hive, Pig, Spark, MapReduce, Flink, CDH
Data Warehousing: Snowflake, Redshift, Azure DW, BigQuery
Data Storage: MySQL, DynamoDB, PostgreSQL, MongoDB
Data Streaming: Kinesis, Kafka, Azure Stream Analytics, Pub/Sub
ETL and Data Processing: AWS Glue, Apache Airflow, Informatica/Ab Initio, Dataproc, Dataprep
Containerization and Orchestration: Docker, Kubernetes, AWS EKS
Version Control and Collaboration: Git, Bitbucket, GitHub, GitLab, Jenkins, Azure DevOps
Infrastructure as Code (IaC): CloudFormation, Terraform, Ansible, ARM Templates
Data Visualization: AWS QuickSight, Power BI, Tableau
Monitoring and Logging: AWS CloudWatch, ELK Stack, Grafana, Prometheus
Project Management and Agile: Jira, Agile, Scrum
Other Tools and Libraries: Pandas, NumPy, Scikit-Learn, Teradata, TensorFlow, XML, JSON, Maven
PROFESSIONAL EXPERIENCE:
OPTUM, Minnesota Sr. Data Engineer June 2022 – Present
Responsibilities:
Spearheaded the design and implementation of end-to-end data pipelines, leveraging Python, AWS Glue, and AWS Data Pipeline to ensure seamless data flow and processing.
Leveraged Azure Blob Storage and Azure Data Lake Storage for scalable and secure data storage solutions, optimizing data retrieval and processing times.
Designed and developed RESTful APIs using Node.js, Express, and other web frameworks, and integrated them with databases such as MongoDB, MySQL, and PostgreSQL.
Implemented internal process improvements, automated manual processes, and proposed re-designing of infrastructure as appropriate to achieve scalability using SQL, Spark-SQL, and Python.
Utilized SSMS for database management, including writing and optimizing SQL queries, managing database objects, and performing data manipulation tasks.
Implemented data transformations and ETL processes using SSIS to ensure data quality and integrity in reporting solutions.
Integrated Flink with various data sources and sinks, including Apache Kafka, HDFS, and relational databases, facilitating seamless data ingestion and output.
Implemented Snowflake data sharing and cloning features to enable seamless data collaboration across teams.
Documented report development processes, data models, and technical specifications to facilitate knowledge sharing and maintenance.
Designed and developed efficient ETL pipelines using AWS Glue and PySpark, transforming and migrating large datasets across distributed systems (illustrated in the sketch after this role's tech stack).
Delivered insights and analytics from data processing and machine learning models to stakeholders, providing clear explanations of results and actionable business recommendations through dynamic dashboards and reports (e.g., Power BI, Tableau).
Designed and implemented scalable database schemas for Apache Cassandra, optimizing data distribution and query performance for high-traffic applications.
Developed Stored Procedures and focused on creating reusable SQL components, streamlining data transformations and improving operational efficiency.
Collaborated with cross-functional teams to understand business requirements and translate them into technical solutions using Snowflake.
Developed and managed large-scale data processing pipelines in Databricks, leveraging Apache Spark for high-performance data transformations.
Strong experience in creating database objects such as tables, views, functions, stored procedures, and indexes in Teradata.
Configured and managed Kafka Connect and Kafka Streams for real-time data integration and stream processing, enabling seamless data flow between systems.
Implemented database enhancements and modifications based on evolving business requirements and user feedback.
Extensive experience in data warehousing solutions, including OLAP, OLTP, Dimensional Modeling, and Facts/Dimensions.
Designed and implemented scalable data lakes to handle petabytes of structured and unstructured data. Utilized Redshift for high-performance data warehousing, optimizing query performance and enabling timely analytics for stakeholders.
Utilized PostgreSQL replication when writing data integration jobs and maintained all related development.
Leveraged Power BI to create meaningful and actionable data visualizations, enabling stakeholders to make informed decisions.
Developed reusable components using AWS services such as Lambda, S3, CloudFormation, and Airflow, optimizing the ETL process for cloud environments.
Utilized Power Apps to streamline data workflows and improve data accessibility for business users.
Implemented and fine-tuned machine learning models with TensorFlow, providing valuable predictive analytics capabilities.
Managed relational databases, including MySQL and DynamoDB, ensuring data integrity and accessibility for various applications.
Proficient in handling big data technologies such as Hadoop, HDFS, PySpark, Sqoop, Hive, Pig, and Spark for large-scale data processing.
Collaborated with cross-functional teams using Agile and Scrum methodologies, ensuring efficient project delivery.
Built and architected multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP, coordinating tasks among the team.
Utilized Jira for project management, tracking tasks, and facilitating effective communication within the development team.
Employed Infrastructure as Code (IaC) principles with CloudFormation and Ansible for streamlined and reproducible infrastructure deployment.
Containerized applications using Docker, facilitating consistent and scalable deployment across various environments.
Extensive experience with SQL, SSRS, SSMS, and other relevant data management and reporting tools.
Created complex queries based on the business requirement using PySpark/Spark SQL.
Monitored and optimized system performance using AWS CloudWatch, ensuring reliable and scalable data processing.
Demonstrated proficiency in adhering to Agile principles, consistently delivering high-quality software solutions on schedule.
Collaborated with teams to define, prioritize, and manage data engineering tasks, ensuring alignment with business objectives.
Tech Stack: SQL, shell scripting, Spark-SQL, Python, Teradata, REST APIs, Power BI, AWS CloudWatch, NumPy, Scikit-Learn, Terraform, IAM, TensorFlow, MySQL, DynamoDB, Hadoop, HDFS, PySpark, Sqoop, Hive, Pig, Spark, Git, Jenkins, Snowflake, CloudFormation, Ansible, Docker, AWS EKS, QuickSight, Agile, Scrum, Jira.
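Illustrative sketch (not taken from the actual project codebase): a minimal AWS Glue PySpark ETL job of the kind described in this role, reading a Glue Data Catalog table, applying a column mapping, and writing partitioned Parquet to S3. The database, table, column, and bucket names are hypothetical placeholders.

```python
# Minimal AWS Glue PySpark job sketch: read from the Glue Data Catalog,
# apply a simple mapping, and write partitioned Parquet to S3.
# Database, table, column, and bucket names are illustrative placeholders.
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw dataset registered in the Glue Data Catalog (placeholder names).
raw = glue_context.create_dynamic_frame.from_catalog(
    database="claims_db", table_name="raw_claims"
)

# Rename and cast a few columns as a simple transformation step.
mapped = ApplyMapping.apply(
    frame=raw,
    mappings=[
        ("claim_id", "string", "claim_id", "string"),
        ("amount", "double", "claim_amount", "double"),
        ("service_dt", "string", "service_date", "date"),
    ],
)

# Write the curated output as partitioned Parquet to S3 (placeholder bucket).
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={
        "path": "s3://example-curated-bucket/claims/",
        "partitionKeys": ["service_date"],
    },
    format="parquet",
)
job.commit()
```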
NECA, Washington, DC Data Engineer/Analyst Feb 2020 - May 2022
Responsibilities:
Designed and implemented data engineering solutions using AWS services such as EC2, S3, Lambda, and RDS to enable efficient and scalable data processing.
Developed SQL and Python scripts for data manipulation, transformation, and integration, ensuring seamless communication between different data infrastructure components.
Implemented and managed Kafka for real-time data streaming, facilitating timely and accurate data updates across various systems.
Implemented data processing solutions using PySpark and Spark SQL within Databricks, enhancing data processing speed and accuracy.
Used chart visualizations in Tableau such as box plots and bullet graphs.
Orchestrated data workflows using Apache Airflow, optimizing the scheduling and execution of data pipeline tasks.
Deployed Databricks on cloud platforms (e.g., Azure, AWS) and integrated with cloud-native services like Azure Data Factory and AWS Glue for seamless data movement.
Educated developers on how to commit their work and make use of the CI/CD pipelines in place.
Developed and optimized SQL queries in AWS Redshift to support complex analytical requirements.
Handled JSON and BSON formats, ensuring seamless data exchange and compatibility across systems.
Integrated multiple data sources into Power BI to create unified views of business data, improving reporting accuracy and efficiency.
Built and maintained data warehouses and data lakes using AWS Redshift and Athena.
Utilized PostgreSQL and MongoDB for efficient data storage and retrieval, ensuring data integrity and availability.
Migrated data from legacy SQL databases to MongoDB, optimizing data structures for NoSQL architecture.
Implemented a serverless architecture using API Gateway, Lambda, and DynamoDB; deployed AWS Lambda code from Amazon S3 buckets and created a Lambda function configured to receive events from an S3 bucket (illustrated in the sketch after this role's tech stack).
Led incident response efforts, troubleshooting and resolving Kafka-related issues, and ensuring minimal impact on business operations.
Designed data models for data-intensive AWS Lambda applications aimed at complex analysis, creating analytical reports for end-to-end traceability, lineage, and definition of key business elements from Aurora.
Developed reusable components using AWS services such as Lambda, S3, CloudFormation, and Airflow, optimizing the ETL process for cloud environments.
Implemented infrastructure as code using Terraform, enhancing the scalability and maintainability of the overall data architecture.
Designed and maintained Snowflake data warehouses for high-performance analytics and reporting.
Administered and maintained Cassandra clusters, handling node provisioning, data replication, and failure recovery to ensure system reliability.
Proficient in handling XML and JSON data formats, ensuring compatibility and smooth data exchange between systems.
Created AWS Lambda functions using Python for deployment management in AWS, and designed and implemented public-facing websites on Amazon Web Services, integrating them with other application infrastructure.
Integrated Snowflake with various data sources, including cloud storage services (e.g., Amazon S3, Google Cloud Storage) and external APIs.
Created AWS Lambda functions and API Gateway endpoints so that data submitted via API Gateway is accessible to the Lambda functions.
Monitored Databricks job performance using built-in monitoring tools and logs, identifying and resolving bottlenecks and errors.
Executed data processing tasks using PySpark and Hadoop, optimizing large-scale data processing for improved performance.
Implemented data ingestion and extraction using Sqoop, ensuring seamless integration with various data sources.
Utilized Power BI to create dashboards that monitor performance metrics and provide real-time business insights.
Containerized data applications using Docker and orchestrated them with Kubernetes, enhancing scalability and deployment efficiency.
Managed version control using Bitbucket and automated build and deployment pipelines with Jenkins for continuous integration.
Performed data modeling and schema design for AWS Redshift, optimizing query performance.
Used Amazon EMR for MapReduce jobs and tested locally using Jenkins.
Created external tables with partitions using Hive, AWS Athena, and Redshift.
Developed the PySpark code for AWS Glue jobs and EMR.
Good understanding of other AWS services such as S3, EC2, IAM, and RDS; experienced with orchestration and data pipeline services such as AWS Step Functions, Data Pipeline, and Glue.
Experience in writing SAM templates to deploy serverless applications on the AWS cloud.
Configured and optimized Spark, Hive, and Pig for efficient data processing and analytics.
Collaborated with cross-functional teams using JIRA, following Agile and Scrum methodologies for project management.
Implemented ELK Stack for centralized logging and monitoring, ensuring real-time visibility into system performance.
Migrated data from on-premises databases to Snowflake, ensuring data accuracy and integrity throughout the process.
Worked on optimizing and tuning the Teradata views and SQL to improve the performance of batch and response time of data for users.
Leveraged Looker for creating and maintaining interactive data dashboards, providing insights for decision-making.
Knowledge of monitoring and managing Flink jobs, including metrics collection and performance tuning.
Ensured data security and compliance by implementing encryption and access controls across the entire data ecosystem.
Utilized Power BI to develop interactive data visualizations, enabling stakeholders to make data-driven decisions.
Conducted data modeling to design efficient database schemas, optimizing data storage and retrieval.
Used Teradata data mover to copy data and objects such as tables and statistics from one system to another.
Performance-tuned and optimized data pipelines for maximum efficiency, reducing processing times and resource utilization.
Tech Stack: SQL, AWS, Python, Kafka, AWS Redshift, Snowflake, DynamoDB, Lambda, Power BI, Informatica/Ab Initio, PostgreSQL, Tableau, MongoDB, Teradata, Hadoop, Sqoop, PySpark, Docker, Kubernetes, XML, JSON, Bitbucket, Jenkins, Spark, Hive, Pig, JIRA, Agile, Scrum, ELK, Looker.
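Illustrative sketch (not actual project code): a minimal S3-triggered AWS Lambda handler of the kind described in this role, recording basic object metadata in DynamoDB. The bucket, key handling, and the table name ingest_metadata are hypothetical placeholders.

```python
# Minimal AWS Lambda handler sketch: triggered by an S3 ObjectCreated event,
# it reads basic object metadata and records it in a DynamoDB table.
# Bucket, key, and table names are illustrative placeholders.
import urllib.parse
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("ingest_metadata")  # hypothetical table name


def lambda_handler(event, context):
    # Each record corresponds to one S3 event notification.
    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        size = record["s3"]["object"].get("size", 0)

        # Persist a small metadata item so downstream jobs can track arrivals.
        table.put_item(
            Item={"object_key": key, "bucket": bucket, "size_bytes": size}
        )
    return {"status": "ok", "processed": len(records)}
```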
Abjayon, India Data Engineer Nov 2017 – Dec 2019
Responsibilities:
Proficient in programming languages such as SQL, Python, and Scala, utilizing them to develop and optimize data processing algorithms.
Developed and reviewed simultaneous SQL queries in Tableau Desktop to perform both static and dynamic data validation.
Automated and streamlined tasks using PowerShell, improving overall efficiency and reducing manual intervention in data engineering processes.
Managed version control and collaborative development using GitHub, facilitating a streamlined and organized development process.
Shared Power BI dashboards and reports with stakeholders, providing them with valuable insights and fostering a data-driven culture.
Built and deployed scalable data-driven applications and data processing workflows using Java, Scala, and Python. Experience includes designing and implementing end-to-end data pipelines for real-time and batch-processing environments.
Demonstrated strong quantitative and problem-solving skills through complex data engineering challenges.
Passionate about exploring emerging technologies and innovative solutions to improve data processes.
Implemented security measures and disaster recovery procedures to protect against unauthorized access and data loss.
Developed and deployed outcomes using Spark and Scala code on a Hadoop cluster running on GCP (a PySpark sketch of this pattern follows this role's tech stack).
Designed and implemented end-to-end data pipelines using Apache Flink for real-time processing and Apache Cassandra for scalable storage, leading to a 30% improvement in data handling efficiency.
Implemented and maintained JSON-based configurations for seamless integration of data processing workflows.
Developed and maintained monitoring dashboards using Grafana, providing real-time insights into data processing pipelines.
Utilized Power BI for data visualization, creating meaningful and actionable reports for stakeholders.
Collaborated within Agile frameworks, utilizing tools like JIRA for efficient project management and fostering a culture of continuous improvement.
Engaged in Azure DevOps for end-to-end CI/CD pipeline management, ensuring the smooth deployment of data engineering solutions.
Demonstrated adaptability and responsiveness to changing project requirements within an Agile environment, contributing to the team's overall success.
Tech Stack: Python, SQL, AWS, Power BI, Hadoop, Spark, MapReduce, Flink, PowerShell, DynamoDB, GitHub, ARM templates, JSON, Grafana, Tableau, Azure DevOps, JIRA, Agile.
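Illustrative sketch: the work in this role used Spark and Scala on a GCP-hosted Hadoop cluster; the version below shows the same batch-pipeline shape in PySpark, kept in Python for consistency with the other sketches in this document. GCS paths and column names are hypothetical placeholders.

```python
# Minimal PySpark batch pipeline sketch mirroring the GCP/Hadoop work described
# above. The production code was written in Spark/Scala; this PySpark version is
# an illustrative equivalent. GCS paths and column names are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-usage-aggregation").getOrCreate()

# Read raw events from a GCS bucket (the Dataproc GCS connector exposes gs:// paths).
events = spark.read.json("gs://example-raw-bucket/events/2019-11-01/")

# Aggregate usage per customer per day.
daily_usage = (
    events
    .withColumn("event_date", F.to_date("event_ts"))
    .groupBy("customer_id", "event_date")
    .agg(
        F.count("*").alias("event_count"),
        F.sum("usage_kwh").alias("total_usage_kwh"),
    )
)

# Write the result back to GCS as partitioned Parquet for downstream reporting.
(daily_usage.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("gs://example-curated-bucket/daily_usage/"))

spark.stop()
```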
Wissen Technology, India Jr. Data Engineer Jan 2016 – Oct 2017
Responsibilities:
Developed and maintained data pipelines using Google Cloud Platform (GCP) services such as Dataflow for efficient data processing.
Utilized Pub/Sub for real-time messaging and event-driven architectures, ensuring seamless data flow across systems.
Worked with Dataproc to deploy and manage Apache Spark and Hadoop clusters, optimizing data processing tasks.
Implemented Python scripts and REST APIs for data extraction, transformation, and loading (ETL) processes.
Designed and developed CI/CD pipelines using Jenkins.
Created a continuous integration and continuous delivery (CI/CD) pipeline on AWS to help automate steps in the software delivery process.
Experienced in continuous integration/continuous deployment (CI/CD) processes for smooth deployment and updates.
Implemented code optimization techniques to ensure scalable and maintainable solutions.
Collaborated with Git for version control and participated in collaborative coding using GitLab.
Executed data transformations and analysis using Pandas and NumPy libraries, ensuring data accuracy and integrity.
Implemented and optimized Hive and Impala queries for efficient data retrieval from Hadoop clusters.
Built data pipelines in Airflow on GCP for ETL-related jobs using both legacy and current Airflow operators (illustrated in the DAG sketch after this role's tech stack).
Oversaw the utilization of data and log files, managing database logins and permissions to maintain security and operational control.
Utilized Dataprep for data cleaning, transformation, and exploratory data analysis.
Executed complex queries and analytics on large datasets using BigQuery.
Orchestrated workflow automation using Apache Airflow, enhancing the efficiency of data processing tasks.
Experienced with AWS services such as EMR, EC2, S3, CloudFormation, and Redshift, which provide fast and efficient processing of big data.
Applied CDH (Cloudera Distribution for Hadoop) for managing and configuring Hadoop components.
Collaborated with teams using Tableau for data visualization, creating insightful reports and dashboards.
Handled XML data formats for seamless integration with various systems.
Employed Maven for project management and dependency control in Java-based projects.
Implemented monitoring and alerting using Prometheus for proactive issue identification and resolution.
Engaged in project management using Jira, ensuring effective communication and task tracking.
Tech Stack: SQL, AWS, GCP, Python, REST APIs, Hadoop, Spark, Hive, Power BI, Impala, Dataprep, BigQuery, Airflow, Pandas, NumPy, CDH, Git, GitLab, Tableau, XML, Maven, Prometheus, Jira.
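Illustrative sketch (not actual project code): a minimal Airflow DAG for a GCP ETL-style job of the kind described in this role, combining a BashOperator staging step with a PythonOperator transformation. The DAG id, schedule, task logic, and bucket paths are hypothetical placeholders.

```python
# Minimal Airflow DAG sketch for a GCP ETL-style job, combining Bash and Python
# operators as described above. DAG id, schedule, and paths are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def transform_extract(**context):
    # Placeholder transformation step; real jobs called into GCP services here.
    print("Transforming extract for run:", context["ds"])


with DAG(
    dag_id="gcs_daily_etl",
    start_date=datetime(2017, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:
    # Stage the raw extract into a working bucket (placeholder gsutil command).
    stage_raw = BashOperator(
        task_id="stage_raw_extract",
        bash_command="gsutil cp gs://example-landing/raw_{{ ds }}.csv gs://example-staging/",
    )

    transform = PythonOperator(
        task_id="transform_extract",
        python_callable=transform_extract,
    )

    stage_raw >> transform
```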
EDUCATIONAL BACKGROUND
Master’s in Information Technology - Valparaiso University, IN.
Bachelor’s in Electronics and Communication Engineering - JNTUK University.
CERTIFICATIONS
Microsoft Azure Fundamentals AZ-900.
Microsoft Azure Data Engineer certified.
Six Sigma Certified.
ITIL Certified.