
Senior Data Engineer

Location:
Aubrey, TX
Posted:
July 14, 2025


Vaishnavi S.

Senior Data Engineer

*********@*****.*** +1-469-***-****

PROFESSIONAL SUMMARY

●Over 8 years of experience in data engineering, specializing in designing, building, and optimizing scalable, high-performance data pipelines.

●Extensive knowledge of AWS services, including EC2, S3, Lambda, RDS, DynamoDB, Glue, EMR, Redshift, IAM, VPC, and Route 53.

●Proficient in cloud-native data solutions leveraging Azure Data Factory, Databricks, and AWS Glue for ETL/ELT workflows.

●Strong expertise in big data frameworks: Apache Spark (PySpark, Scala), Kafka, Hive, Sqoop, and Spark Streaming.

●Hands-on experience in container orchestration using AWS ECS and EKS, and automating deployments via Jenkins, GitHub Actions, and AWS CodePipeline.

●Skilled in data warehousing and analytics using Snowflake, Redshift, and Azure Synapse Analytics.

●Practical knowledge of implementing and managing data governance, security, and compliance standards in regulated environments.

●Experienced in building RESTful APIs and microservices with Python frameworks (Django, Flask, FastAPI) to support data-driven applications.

●Proficient in workflow orchestration and pipeline automation with Apache Airflow and AWS Step Functions.

●Deep understanding of networking concepts in cloud environments, including security groups, NACLs, load balancers, and DNS management.

●Developed and deployed machine learning models using AWS SageMaker, TensorFlow, and PyTorch, integrating AI workflows into production systems.

●Strong programming skills in Python, Scala, SQL, and shell scripting, with experience in developing reusable code modules and automation scripts.

●Effective collaborator and communicator working closely with cross-functional teams in Agile and Scrum environments.

TECHNICAL SKILLS:

Big Data Ecosystem

HDFS, YARN, MapReduce, Spark, Kafka, Kafka Connect, Hive, Airflow, StreamSets, Sqoop, HBase, Flume, Pig, Ambari, Oozie, Zookeeper, NiFi, Sentry

Hadoop Distributions

Apache Hadoop 2.x/1.x, Cloudera CDP, Hortonworks HDP

Cloud Environment

Amazon Web Services (AWS), Microsoft Azure

Databases

MySQL, Oracle, Teradata, MS SQL SERVER, PostgreSQL, DB2

NoSQL Database

DynamoDB, HBase

AWS

EC2, EMR, S3, Redshift, Lambda, Kinesis, Glue, Data Pipeline

Microsoft Azure

Databricks, Data Lake, Blob Storage, Azure Data Factory, SQL Database, SQL Data Warehouse, Cosmos DB, Azure Active Directory

Operating systems

Linux, Unix, Windows 10, Windows 8, Windows 7, Windows Server 2008/2003, Mac OS

Software/Tools

Microsoft Excel, Statgraphics, Eclipse, Shell Scripting, ArcGIS, Linux, Jupyter Notebook, PyCharm, Vi/Vim, Sublime Text, Visual Studio, Postman

Reporting Tools/ETL Tools

Informatica, Talend, SSIS, SSRS, SSAS, ER Studio, Tableau, Power BI, Arcadia, DataStage, Pentaho

Programming Languages

Python (Pandas, SciPy, NumPy, scikit-learn, statsmodels, Matplotlib, Plotly, Seaborn, Keras, TensorFlow, PyTorch), PySpark, T-SQL/SQL, PL/SQL, HiveQL, Scala, UNIX Shell Scripting

Version Control

Git, SVN, Bitbucket

Development Tools

Eclipse, NetBeans, IntelliJ, Hue, Microsoft Office Suite (Word, Excel, PowerPoint, Access)

PROFESSIONAL EXPERIENCE:

Cigna Healthcare Jan 2025 - Present

Senior Data Engineer

Responsibilities:

●Designed and developed microservices using Python frameworks (Django, Flask, FastAPI) to deliver high-performance APIs, supporting critical data workflows and reducing latency by 30%.

●Built and optimized RESTful APIs for seamless integration between relational databases and external services, ensuring robust transaction management, failure handling, validation, and retries (an illustrative sketch follows this list).

●Developed and deployed Scala-based REST APIs for data-intensive operations, significantly improving API response times through efficient coding and system design.

●Automated infrastructure provisioning and artifact versioning using Terraform integrated with GitHub Actions, ensuring consistent and reproducible deployment environments.

●Implemented embedding-based search models leveraging vector databases to enable intelligent data discovery and contextual search across healthcare datasets.

●Established and maintained CI/CD pipelines with Jenkins and CloudBees, enabling automated testing and deployment across multiple environments.

●Automated build, test, and deployment processes for Scala APIs using GitHub Actions, improving release cycles and ensuring production reliability.

●Leveraged GitHub Actions runners for parallel testing, reducing test execution times by 40%, and enabling multi-environment deployments for development, staging, and production.

●Designed and maintained scalable data warehouse solutions, optimizing complex SQL queries for large-scale analytics and improving query performance by 35%.

●Engineered system integrations between databases and APIs, ensuring data integrity and robust transaction management.

●Improved version control workflows by implementing advanced Git branching strategies, enhancing code collaboration and minimizing merge conflicts.

●Collaborated with cross-functional teams to design and document system architecture for data-driven applications deployed across diverse environments.
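
A minimal sketch of the API pattern referenced above: a FastAPI endpoint with Pydantic validation and a bounded retry around a downstream call. The endpoint path, request model, and claims-service URL are hypothetical placeholders, not the actual Cigna services.

# Hypothetical sketch: validated endpoint with a simple bounded retry.
import time

import httpx
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field

app = FastAPI()


class ClaimQuery(BaseModel):
    member_id: str = Field(..., min_length=1)
    claim_id: str = Field(..., min_length=1)


def fetch_claim(query: ClaimQuery, retries: int = 3) -> dict:
    """Call a downstream claims service, retrying transient failures."""
    for attempt in range(1, retries + 1):
        try:
            resp = httpx.get(
                "https://claims.example.internal/api/v1/claims",  # hypothetical URL
                params={"member_id": query.member_id, "claim_id": query.claim_id},
                timeout=5.0,
            )
            resp.raise_for_status()
            return resp.json()
        except httpx.HTTPError:
            if attempt == retries:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff


@app.post("/claims/lookup")
def lookup_claim(query: ClaimQuery) -> dict:
    try:
        return fetch_claim(query)
    except httpx.HTTPError:
        raise HTTPException(status_code=502, detail="Downstream claims service unavailable")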

Environment:

AWS (EC2, S3, Lambda, Glue, EMR, Redshift, IAM, VPC, Step Functions, Athena), Python (PySpark, Pandas), Apache Airflow, Apache Spark (PySpark, Scala), Kafka, Jenkins, GitLab, Snowflake, Terraform, SQL (Redshift, PostgreSQL), Docker, Linux, Agile/Scrum.

Capital One Oct 2024 - Jan 2025

AWS Data Engineer

Responsibilities:

●Contributed to data migration workflows, transitioning legacy banking systems to AWS while ensuring strict adherence to regulatory standards.

●Built scalable ETL pipelines using Python, PySpark, and AWS Glue to enable high-performance ingestion and transformation of structured and semi-structured data into Amazon S3, Redshift, and Snowflake.

●Specialized in data transformation, standardization, and validation to ensure reliable, high-quality data delivery during AWS cloud migration initiatives.

●Applied consistent naming conventions, data types, and formatting rules (e.g., dates, currency, identifiers) to standardize incoming datasets across ingestion pipelines.

●Developed reusable PySpark modules to enforce schema alignment between source and target systems, supporting accurate and compliant data mapping.

●Maintained a centralized data dictionary and transformation logic repository to facilitate metadata-driven pipeline design and maintain consistency across projects.

●Implemented data profiling using Python and AWS Glue to detect anomalies, outliers, and quality issues pre- and post-migration.

●Applied robust validation logic, including unit-level validations, null checks, and schema enforcement, ensuring audit readiness and compliance.

●Built standardized lookup tables and reference mappings to reconcile inconsistencies in product codes, account types, customer segments, and region codes.

●Created data normalization routines using Python dictionaries to unify values such as country names, transaction types, and account statuses (see the sketch after this list).

●Utilized AWS Glue DynamicFrames and Glue Data Catalog to support schema evolution detection and transformation automation.

●Ensured precision and consistency in downstream analytics by applying rounding rules, timestamp normalization, and currency standardization.

●Participated in cross-functional data governance discussions to help define and enforce enterprise-wide data transformation standards.

●Leveraged AWS services including S3, Glue, Lambda, IAM, EC2, and Step Functions for secure, scalable data orchestration.

●Migrated legacy ETL logic to PySpark, simplifying operational complexity and improving job performance.

●Developed automated workflows using Apache Airflow and AWS Step Functions, enabling seamless and fault-tolerant data movement across environments.

●Created Python-based validation frameworks to uphold high standards of financial data quality throughout transformation and migration stages.

●Automated CI/CD processes using Jenkins and GitLab pipelines, streamlining deployment across development, QA, and production environments.
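
A minimal PySpark sketch of the dictionary-driven normalization and null-check validation described above; the S3 paths, column names, and reference mapping are assumptions for illustration only.

# Hypothetical sketch: map raw values through a Python dictionary and
# separate rows that fail basic null checks.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("standardize_accounts").getOrCreate()

# Reference mapping to unify inconsistent source values (illustrative).
COUNTRY_MAP = {"US": "United States", "USA": "United States", "U.S.": "United States"}

df = spark.read.parquet("s3://example-bucket/raw/accounts/")  # hypothetical path

# Build a map expression from the dictionary and apply it, keeping the
# original value when no mapping exists.
country_expr = F.create_map(*[F.lit(x) for kv in COUNTRY_MAP.items() for x in kv])
df = df.withColumn("country", F.coalesce(country_expr[F.col("country")], F.col("country")))

# Simple validation: required columns must be non-null; route failures aside.
required = ["account_id", "country", "opened_date"]
valid = df.dropna(subset=required)
rejected = df.subtract(valid)

valid.write.mode("overwrite").parquet("s3://example-bucket/curated/accounts/")
rejected.write.mode("overwrite").parquet("s3://example-bucket/rejected/accounts/")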

Environment:

Python, PySpark, AWS Glue, Amazon S3, Redshift, Lambda, Step Functions, EC2, IAM, CloudWatch, Athena, Airflow, Jenkins, GitLab CI/CD, Snowflake, Oracle, SQL, Data Profiling, Data Transformation & Validation, Metadata Management, Data Standardization

GuideOne Insurance Aug 2023 - Sept 2024

AWS Data Engineer

Responsibilities:

●Developed Apache Spark data processing applications on AWS EMR to process data from RDBMS and multiple streaming sources using Python.

●Designed and deployed multi-tier applications leveraging AWS services such as EC2, Route 53, S3, RDS, and DynamoDB, focusing on high availability, fault tolerance, and auto-scaling through AWS CloudFormation.

●Configured and managed AWS EC2 instances to run Spark jobs efficiently on EMR clusters.

●Executed data transformations using Spark DataFrames, Spark SQL, Spark file formats, and RDDs to prepare and cleanse data.

●Processed and transformed data from various formats (Text, CSV, JSON) using Python scripting within Spark applications.

●Ingested data from relational databases like MySQL and Teradata using Sqoop for integration into big data workflows.

●Created custom Python functions to parse and flatten complex JSON datasets in Spark DataFrames (a brief sketch follows this list).

●Enhanced Spark cluster performance by optimizing algorithms and tuning resource utilization.

●Performed wide and narrow Spark transformations such as filtering, joins, lookups, and aggregations on large datasets.

●Worked with Parquet files and Impala through PySpark; developed Spark Streaming apps using RDDs and DataFrames for real-time data processing.

●Built batch and streaming data pipelines to meet business requirements using Spark APIs.

●Utilized AWS Kinesis Data Streams for real-time analytics and integration with Spark Streaming.

●Conducted data cleaning and missing value imputation using Python with backward and forward fill methods; applied feature engineering and encoding using Scikit-learn.

●Developed and deployed machine learning models with TensorFlow and PyTorch for image recognition and NLP applications.

●Improved model accuracy by 50% through data preprocessing and feature engineering.

●Designed and optimized machine learning pipelines on AWS SageMaker for efficient model training and deployment.

●Integrated ML models into production environments, ensuring scalability and real-time performance.

●Collaborated closely with data scientists and product managers to convert business needs into technical solutions and actionable insights.

●Performed A/B testing and evaluation of models to validate and improve performance.

●Created GAN-based synthetic data generators to address data scarcity challenges in training datasets.

●Managed data storage strategies in AWS S3, organizing data into tiers based on access frequency and business needs.

●Connected AWS data stacks to BI tools like Tableau and Power BI to deliver reporting and visualization solutions.

●Worked with DBAs on SQL tuning and optimization for Oracle, MySQL, and MS SQL Server databases.

●Designed and implemented data partitioning and clustering strategies in Snowflake to improve query speed and reduce costs.

●Established data governance frameworks and policies to maintain data quality and security within Snowflake environments.

●Assisted in the setup and maintenance of MongoDB clusters hosted on AWS EC2 instances.

●Monitored Spark applications using Spark UI to identify and troubleshoot executor failures, data skew, and runtime bottlenecks.

●Conducted performance stress testing for DynamoDB deployments on AWS EC2 to ensure production stability.

●Automated deployment pipelines and routine operations using Unix Shell scripting.

●Partnered with data science teams to develop machine learning models on Spark EMR clusters aligned with business objectives.

●Participated in Agile SCRUM processes, delivering iterative project updates and enhancements.
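
A brief PySpark sketch of flattening nested JSON into tabular columns, in the spirit of the custom parsing functions noted above; the event schema and field names are assumed for illustration.

# Hypothetical sketch: promote nested struct fields to columns and explode
# an array of items into one row per element.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("flatten_json").getOrCreate()

raw = spark.read.json("s3://example-bucket/events/*.json")  # hypothetical path

flat = (
    raw.select(
        F.col("event_id"),
        F.col("customer.id").alias("customer_id"),
        F.col("customer.address.state").alias("state"),
        F.explode_outer("items").alias("item"),
    )
    .select("event_id", "customer_id", "state", F.col("item.sku"), F.col("item.amount"))
)

flat.show(truncate=False)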

Environment: Hadoop 2.x, Spark v2.0.2, Hive, Sqoop, Kafka, Spark Streaming, ETL, Scala, Python (Pandas, NumPy), PySpark, Git (version control), MySQL, MS SQL, MongoDB, AWS (EC2, S3, EMR, RDS, Lambda, Kinesis, Redshift, CloudFormation)

Molina Healthcare Jan 2023 – Jul 2023

AWS Data Engineer

Responsibilities:

●Designed and orchestrated complex workflows using AWS Step Functions and State Machines, significantly improving process efficiency and reliability.

●Implemented automated testing frameworks that reduced deployment errors by 50% and accelerated release cycles.

●Developed serverless solutions with AWS Lambda (Python), optimizing application performance and reducing operational costs.

●Provisioned and managed AWS infrastructure, including EC2 instances, S3 buckets, RDS databases, and VPCs to support business requirements.

●Built custom ETL solutions for batch and real-time data ingestion into Hadoop clusters using PySpark and Shell scripting.

●Applied CI/CD best practices through automation of build, test, and deployment processes using Jenkins and GitLab pipelines, enhancing delivery speed and consistency.

●Created sophisticated data transformations using Azure Data Factory (ADF) and Scala to support complex data workflows.

●Developed and optimized Apache Airflow DAGs to schedule ETL jobs, incorporating Pools, Executors, and multi-node setups to improve workflow efficiency (an illustrative DAG sketch follows this list).

●Tuned Airflow performance through configuration optimization to ensure reliable pipeline execution.

●Executed data transformation and aggregation using Apache Spark RDDs, DataFrames, and Spark SQL for high-volume analytics.

●Delivered real-time insights via Spark Scala functions, enhancing cluster performance through code optimizations.

●Processed large-scale datasets leveraging Spark Context, Spark SQL, and Spark Streaming APIs.

●Monitored Spark clusters using Log Analytics and Ambari Web UI to improve stability and reliability.

●Developed custom input adapters using Spark, Hive, and Sqoop for data ingestion from Snowflake, MS SQL, MongoDB, and other sources into HDFS.

●Utilized Sqoop, Flume, and Spark Streaming APIs to import data from web servers and Teradata for analytics.

●Enhanced processing efficiency with Scala-based concurrency and parallelism techniques.

●Built MapReduce jobs in Scala, compiled to JVM bytecode, to accelerate data processing workflows.

●Improved Spark batch processing by tuning batch intervals, parallelism levels, and memory usage.

●Managed and monitored daily incremental data loads from MongoDB, MS SQL, and MySQL sources.

●Implemented indexing on data ingestion pipelines using Flume sinks to write directly to cluster-based indexers.
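
An illustrative Airflow 2.x-style DAG sketch showing a small extract-transform-load chain with tasks assigned to a pool, as referenced above; the DAG id, pool name, schedule, and task bodies are placeholders.

# Hypothetical DAG sketch: three Python tasks chained and throttled by a pool.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pull incremental records from source systems")


def transform():
    print("apply Spark/Python transformations")


def load():
    print("write curated output to the warehouse")


with DAG(
    dag_id="claims_incremental_load",  # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract, pool="etl_pool")
    t_transform = PythonOperator(task_id="transform", python_callable=transform, pool="etl_pool")
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load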

Environment: Hadoop, Spark, Hive, Sqoop, HBase, Flume, Ambari, Scala, MS SQL, MySQL, Snowflake, MongoDB, Git, Data Storage Explorer, Python, AWS Lambda, API Gateway, DynamoDB, S3, EC2, RDS

Happiest Minds Technologies Aug 2018 – June 2021

Data Engineer

Responsibilities:

●Designed and implemented complex network algorithms using Python data structures and OOP concepts.

●Developed PHP/MySQL backend solutions integrating with Flash front-end for dynamic data entry and retrieval.

●Built database models, APIs, and Views using Python frameworks to develop interactive web applications.

●Created Django forms for efficient online user data collection and management (a brief sketch follows this list).

●Developed PyQt-based data tables for CRUD operations on patient and policy information.

●Built Python modules to connect and monitor Apache Cassandra instances.

●Developed MVC prototype applications using Django as a replacement for legacy systems.

●Improved data security and reporting through caching and efficient data reuse.

●Created user interfaces using JavaScript, HTML5, and CSS3.

●Managed datasets using Pandas and MySQL, executing complex queries through Python-MySQL connectors.

●Followed test-driven development practices; wrote unit tests with unittest and pytest frameworks.

●Developed web prototypes using jQuery and AngularJS to enhance frontend capabilities.

●Managed code versioning and deployed applications on Heroku using Git.

●Maintained and enhanced applications based on client feedback and requirements.
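
A brief Django sketch of a form and the view that validates and handles it, as referenced above; the form fields, template name, and URL name are hypothetical.

# Hypothetical sketch: a simple data-collection form and its view.
from django import forms
from django.shortcuts import redirect, render


class ContactForm(forms.Form):
    full_name = forms.CharField(max_length=100)
    email = forms.EmailField()
    message = forms.CharField(widget=forms.Textarea, required=False)


def contact_view(request):
    if request.method == "POST":
        form = ContactForm(request.POST)
        if form.is_valid():
            # Persist or forward form.cleaned_data here (omitted).
            return redirect("thanks")  # hypothetical URL name
    else:
        form = ContactForm()
    return render(request, "contact.html", {"form": form})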

Environment: Python 3, Django 1.6, Tableau 8.2, Beautiful Soup, HTML5, CSS/CSS3, Bootstrap, XML, JSON, JavaScript, jQuery, AngularJS, Backbone.js, RESTful web services, Apache Spark, Linux, Git, Amazon S3, Jenkins, MySQL, MongoDB, T-SQL, Eclipse.

Mouritech Dec 2015 – July 2018

Python Developer

Responsibilities:

●Implemented complex network algorithms using data structures such as dictionaries, tuples, and object-oriented class inheritance in Python.

●Developed PHP/MySQL backend systems for data entry integrated with Flash, collaborating with frontend developers to ensure accurate data retrieval via query strings.

●Designed and implemented database models, APIs, and views with Python to create interactive web applications.

●Created and managed user data collection forms using the Django framework.

●Built PyQt-based data tables for managing patient records and policy information, including add, update, delete, and display functionalities.

●Developed Python modules for monitoring Apache Cassandra cluster status.

●Designed an MVC prototype using Django to replace legacy applications, improving maintainability and scalability.

●Enhanced data security and reporting by implementing effective caching strategies to optimize data reuse.

●Built dynamic user interfaces using JavaScript, HTML5, and CSS3 to enhance user experience.

●Managed and analyzed datasets using Pandas and MySQL, performing complex queries through Python-MySQL connector.

●Developed and executed unit and integration tests with pytest and Python’s unittest framework, following test-driven development practices (an illustrative test sketch follows this list).

●Created responsive web prototypes using jQuery and AngularJS for improved front-end interactivity.

●Automated deployment processes by deploying applications to Heroku using Git version control.

●Provided ongoing application maintenance and enhancements based on client feedback and business needs.
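
An illustrative pytest sketch of test-driven unit tests around a small normalization helper; the helper function and its test cases are hypothetical examples, not project code.

# Hypothetical sketch: parametrized unit tests for a value-normalization helper.
import pytest


def normalize_status(value: str) -> str:
    """Map assorted source spellings onto a canonical account status."""
    mapping = {"a": "active", "act": "active", "c": "closed", "cl": "closed"}
    key = value.strip().lower()
    return mapping.get(key, key)


@pytest.mark.parametrize(
    "raw, expected",
    [("A", "active"), (" act ", "active"), ("CL", "closed"), ("frozen", "frozen")],
)
def test_normalize_status(raw, expected):
    assert normalize_status(raw) == expected


def test_normalize_status_requires_string():
    with pytest.raises(AttributeError):
        normalize_status(None)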

Environment: Python 3, Django 1.6, Tableau 8.2, Beautiful Soup, HTML5, CSS/CSS3, Bootstrap, XML, JSON, JavaScript, jQuery, AngularJS, Backbone.js, RESTful web services, Apache Spark, Linux, Git, Amazon S3, Jenkins, MySQL, MongoDB, T-SQL, Eclipse.


