
Senior Data Engineer

Location:
Plano, TX, 75093
Posted:
October 15, 2025

Harshitha O

Senior Data Engineer

+1-732-***-**** ****************@*****.*** LinkedIn

Professional Summary:

Senior Data Engineer with 5+ years of experience in building scalable data pipelines, cloud infrastructure, and real-time processing systems using AWS and Apache Spark.

Expert in developing ETL workflows using PySpark, Airflow, and SQL, enabling clean, structured data for analytics, reporting, and machine learning across large datasets.

Designed and deployed cloud-native data architectures on AWS including S3, EMR, Redshift, and Glue, improving data availability, performance, and cost-efficiency.

Skilled in containerizing data applications using Docker and automating deployments via GitHub Actions, ensuring consistent environments across development and production.

Proficient in Python, Scala, and Java, applying them to build high-performance data transformation pipelines and backend data systems.

Engineered data lakes and data warehouses using Delta Lake, Snowflake, and Redshift, integrating structured and semi-structured data at scale.

Led end-to-end development of data platforms, including ingestion, cleaning, transformation, governance, and visualization, supporting cross-functional teams in decision-making.

Implemented role-based access control, data lineage, and auditing through Unity Catalog and Informatica, aligning with enterprise governance standards.

Developed Power BI and Tableau dashboards, enabling leadership to visualize KPIs, monitor data trends, and make informed decisions in real time.

Applied best practices in performance tuning, query optimization, and storage formats like Parquet to reduce latency and increase processing throughput.

Automated validation and data quality checks using Python and SQL, maintaining high trust and accuracy in critical reporting workflows.

Created real-time data solutions using Kafka and Spark Streaming, supporting business operations that require low-latency processing and alerting (a minimal sketch of this pattern appears at the end of this summary).

Collaborated with business and engineering teams to translate requirements into technical solutions, ensuring delivery aligned with goals and timelines.
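
Illustrative only: below is a minimal PySpark sketch of the Kafka-to-Spark Structured Streaming pattern referenced in the summary above. The broker address, topic name, message schema, and S3 paths are hypothetical placeholders, and the Kafka and Delta Lake connectors are assumed to be available on the cluster.

# Minimal sketch: stream JSON events from Kafka, parse and filter them,
# and append the results to a Delta table for low-latency alerting.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("realtime-alerts-sketch").getOrCreate()

# Assumed schema of the incoming JSON messages (placeholder fields).
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("metric", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Read the stream from Kafka (broker and topic are placeholders).
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
)

# Parse the JSON payload and keep only well-formed records.
events = (
    raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
    .where(col("event_id").isNotNull())
)

# Append micro-batches to a Delta table that downstream alerting reads.
query = (
    events.writeStream.format("delta")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/events")
    .outputMode("append")
    .start("s3://example-bucket/delta/events")
)
query.awaitTermination()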

Technical Skills:

Big Data Ecosystem

Apache Spark, HDFS, YARN, MapReduce, Kafka, Hive, Sqoop, Airflow, Oozie, Zookeeper, Flume, IntelliJ, Impala

Cloud Environments

AWS, Azure, GCP

Operating Systems

Linux, Windows, Unix, macOS, CentOS, Red Hat

Web Technologies

HTML, XML, JSON, CSS, JavaScript, Node.js

Databases

MS SQL Server, Oracle 11g/10g, MongoDB, Snowflake, Cassandra, DB2, Teradata

Tools Used

Eclipse, VSCode, WinSCP, NetBeans, JUnit, Power BI, IntelliJ, GitHub, Spring Boot

Methodologies

Agile/Scrum, Jira and Waterfall

Monitoring Tools

Azure Monitor, AWS CloudWatch, Zookeeper

Programming Languages

Python, Scala, Java, SQL, PostgreSQL, PL/SQL

Work Experience:

Client: Amgen – Thousand Oaks, CA Dec 2023 - Present

Role: Sr. Data Engineer

Responsibilities:

Optimized data workflows and enhanced Python scripts for processing and aggregation in Databricks.

Containerized data processing applications using Docker for scalable deployment and consistent environment management.

Developed scalable data pipelines using Python scripting and PySpark to process large datasets, reducing processing time.

Designed and implemented scalable ETL pipelines on AWS using EMR for processing large datasets and Redshift for data storage and analytics.

Orchestrated and automated ETL workflows with Airflow, enhancing pipeline efficiency and scheduling (a minimal DAG sketch follows this section).

Utilized Python scripting along with PySpark for distributed data processing, transforming raw data into clean, structured formats for analysis.

Tuned T-SQL queries for performance by optimizing execution plans, indexing, and joins.

Designed solutions for real-time analytics using Microsoft Fabric’s Lakehouse architecture, supporting large-scale data ingestion and interactive querying capabilities.

Developed and optimized ETL processes using AWS services like EMR and Redshift, ensuring seamless data transformation and storage across cloud platforms.

Built and maintained ETL pipelines with Python and PySpark to handle large-scale data processing tasks on cloud platforms like AWS.

Enabled seamless Power BI reporting and dashboards on top of Fabric-based data, allowing business users to access live insights through Direct Lake mode for enhanced performance.

Implemented data quality checks and validations with Python and SQL to ensure data accuracy and integrity.

Managed data processing and analytics pipelines in AWS, utilizing EMR for big data processing and Redshift for business intelligence and reporting.

Leveraged PySpark and Python to implement parallelized data processing jobs, improving the efficiency of data transformation workflows.

Applied object-oriented programming principles in Python, Scala, and Java for scalable data engineering solutions.

Engineered scalable and efficient data pipelines in Microsoft Fabric using Dataflows, Pipelines, and Notebooks, reducing latency and improving data processing efficiency.

Built scalable NoSQL databases in Cosmos DB, MongoDB, and Cassandra, and used Parquet for efficient data storage.

Led data governance initiatives and managed access control using Databricks Unity Catalog for centralized data asset management.

Worked with AWS Redshift for building data warehouses, and used EMR clusters to efficiently process and transform large-scale data sets in real time.

Automated CI/CD pipelines for data engineering with GitHub Actions, including testing, code linting, and packaging.

Implemented data storage and analytics solutions using AWS Redshift, ensuring optimal performance and cost-efficiency for business-critical applications.

Collaborated on real-time troubleshooting of complex data migration challenges, ensuring smooth project execution.

Environment: Python, Scala, SQL, MS SQL Server, Delta Lake, Parquet, PostgreSQL, .NET, DBT, SSMS, Metadata, Visual Studio, VS Code, SSIS, Docker, ETL, Miro Board.
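
Illustrative only: a minimal Airflow DAG sketch of the orchestration and data-quality pattern described in this section. The DAG id, application path, connection id, and the stubbed row-count check are hypothetical, and the Apache Spark provider package for Airflow is assumed to be installed.

# Minimal sketch: a daily DAG that submits a PySpark job and then runs a
# simple data-quality gate before downstream reporting consumes the output.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator


def validate_row_count():
    # Stand-in data-quality gate: a real check would count rows in the
    # Redshift target (or the job's Parquet output) and fail the run on zero.
    row_count = 1  # placeholder value for illustration
    if row_count <= 0:
        raise ValueError("Data-quality check failed: no rows produced")


with DAG(
    dag_id="daily_sales_etl",  # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:
    transform = SparkSubmitOperator(
        task_id="transform",
        application="s3://example-bucket/jobs/transform_sales.py",  # placeholder path
        conn_id="spark_default",
    )

    quality_check = PythonOperator(
        task_id="quality_check",
        python_callable=validate_row_count,
    )

    # Run the transformation first, then gate the result on the quality check.
    transform >> quality_check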

Client: Morning Star - Chicago, IL Jan 2023 – Nov 2023

Role: Sr. Data Engineer

Responsibilities:

Engineered Snowflake data pipelines for data extraction, transformation, and aggregation; implemented Snowflake Streams and Tasks for real-time data ingestion and processing.

Tuned PySpark jobs by optimizing partition sizes, caching, and serialization, improving job execution time and resource efficiency (a tuning sketch follows this section).

Automated data pipeline workflows using AWS EMR, integrating data from multiple sources and loading it into AWS Redshift for streamlined querying and analysis.

Automated validation and cleanup of CSV files, ensuring data consistency and accuracy before ingestion.

Created custom Python scripts using PySpark for big data processing, transforming data across various storage systems and loading it into data warehouses.

Implemented Snowflake's data governance features, such as role-based access control (RBAC), auditing, and data lineage, supporting secure and compliant data usage.

Developed and optimized Power BI reports, using techniques to enhance performance and enable incremental refresh for efficient reporting.

Designed and deployed distributed data processing solutions using PySpark and Python, enabling real-time analytics on large-scale datasets.

Utilized AWS EMR for data transformation tasks, followed by storing processed data in Redshift for fast querying and analytics, reducing latency.

Architected customer data integration workflows in Microsoft Customer Data Platform (CDP), unifying diverse data sources for enhanced insights.

Used PySpark and Python to implement data wrangling and cleaning routines, preparing raw data for downstream analytics and machine learning models.

Developed and maintained high-performance data pipelines in AWS, including EMR for distributed data processing and Redshift for data warehousing.

Developed containerized data pipelines and automated deployments using GitHub Actions and Docker, streamlining deployment processes and minimizing environment inconsistencies.

Utilized Parquet for big data storage, benefiting from its columnar structure for faster queries and reduced storage costs.

Automated complex data processing tasks with Python and PySpark, improving pipeline efficiency and reducing manual intervention.

Created reusable C# components and custom Python scripts for seamless data migration and workflow automation.

Optimized Redshift clusters for better query performance, while leveraging AWS EMR for large-scale data processing and transformation across multiple datasets.

Designed Snowflake architectures using features such as Virtual Warehouses, Time Travel, and Snowpipe for continuous data ingestion.

Integrated PySpark with Python scripts to optimize data transformation jobs, reducing the overall data processing time and improving query performance.

Built and deployed end-to-end data pipelines using AWS services, including EMR for data transformation and Redshift for fast, scalable querying and reporting.

Employed Alteryx for predictive and geospatial analysis, enhancing advanced analytics capabilities.

Managed data governance initiatives, maintaining data quality, accuracy, and compliance, including creating data dictionaries, catalogs, and lineage documentation.

Utilized AWS EMR for distributed processing of large datasets and integrated with Redshift for real-time data analysis and reporting.

Led version control and code release processes using Git and Bitbucket.

Collaborated with cross-functional teams and addressed data issues and latency concerns, resolving JIRA tickets to support application customers.

Optimized and fine-tuned Redshift clusters for efficient data querying, while automating data processing workflows with AWS EMR to handle large data volumes.

Environment: SQL Server, SQL Server Management Studio, Teradata, Visual Studio, VSTS, Power BI, PowerShell, SSIS, DataGrid, ETL (Extract, Transform, and Load), Business Intelligence (BI).
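
Illustrative only: a short PySpark sketch of the partition, caching, and serialization tuning described in this section. The input and output paths, column names, and partition counts are placeholder assumptions rather than values from the actual engagement.

# Minimal sketch: tune serialization and shuffle parallelism, repartition on
# the aggregation key, cache the reused dataset, and control output file count.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("pyspark-tuning-sketch")
    # Kryo is typically faster and more compact than default Java serialization.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # Size shuffle parallelism to the cluster instead of the default of 200.
    .config("spark.sql.shuffle.partitions", "400")
    .getOrCreate()
)

# Read raw Parquet input (placeholder path).
trades = spark.read.parquet("s3://example-bucket/raw/trades/")

# Repartition on the aggregation key to avoid skewed, oversized tasks,
# then cache because several downstream aggregations reuse this dataset.
trades = trades.repartition(400, "account_id").cache()

daily_totals = (
    trades.groupBy("account_id", F.to_date("trade_ts").alias("trade_date"))
    .agg(F.sum("amount").alias("daily_amount"))
)

# Coalesce before writing to avoid producing thousands of small output files.
daily_totals.coalesce(50).write.mode("overwrite").parquet(
    "s3://example-bucket/curated/daily_totals/"
)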

Client: Medvarsity – Hyderabad, India Jan 2020 – June 2022

Role: Data Engineer

Responsibilities:

Designed and developed data integration solutions with Informatica PowerCenter, enabling seamless ETL of data from various sources.

Designed and deployed scalable data pipelines using AWS services, including S3, Lambda, and Glue, to automate data ingestion and transformation.

Implemented Unity Catalog features, such as role-based access control (RBAC), auditing, and data lineage, supporting secure and compliant data usage.

Migrated legacy Parquet and Avro data lakes to Delta Lake format, enhancing performance, concurrency, and reliability (a migration sketch follows this section).

Managed cloud infrastructure and data workflows on AWS, leveraging services such as S3, EC2, and RDS to optimize data storage and processing.

Implemented schema evolution in Delta Lake to allow data lake modifications without downtime or reprocessing.

Created automated ETL procedures to transfer transactional data from OLTP systems to OLAP and Big Data platforms for advanced analytics.

Worked with ACID-compliant OLTP databases such as SQL Server, PostgreSQL, MySQL, Oracle, and Aurora to ensure data integrity.

Engineered data processing workflows using AWS Lambda and Step Functions to automate data transformations and ensure real-time data availability.

Developed and implemented data governance standards for OLTP systems, ensuring adherence to security, encryption, and retention policies.

Implemented data governance practices using tools like Alteryx for data lineage, metadata management, and governance initiatives.

Built and maintained cloud-native data solutions on AWS, utilizing S3 for data storage, Redshift for analytics, and Glue for ETL operations.

Designed and executed integrations between Oracle NetSuite and other enterprise systems, ensuring seamless synchronization of financial, operational, and transactional data.

Customized Oracle NetSuite scripts and workflows to align with business-specific requirements, enhancing ERP system functionality.

Utilized AWS EC2 instances for scalable compute resources to support big data processing and AWS S3 for reliable, cost-effective data storage.

Proficient in Teradata database design, development, optimization, and maintenance, including stored procedures and query performance tuning.

Environment: Python, SQL, Oracle NetSuite, Unity Catalog, Kafka, Zookeeper, HDFS, Oozie, Spark, Informatica, SQL Server, HBase, Delta Lake, Flume, Parquet, Teradata, PySpark, Tableau.
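
Illustrative only: a hedged PySpark sketch of the Parquet-to-Delta Lake migration and schema-evolution pattern described in this section. The table paths and landing directory are hypothetical, and the delta-spark package with its session extensions is assumed to be configured.

# Minimal sketch: rewrite a legacy Parquet lake as a Delta table, then let
# later loads evolve the schema in place via mergeSchema.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-migration-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)

source_path = "s3://example-bucket/lake/events_parquet/"  # legacy Parquet lake
target_path = "s3://example-bucket/lake/events_delta/"    # new Delta location

# One-time conversion: rewrite the legacy Parquet data as a Delta table.
spark.read.parquet(source_path).write.format("delta").mode("overwrite").save(target_path)

# Later batches may carry extra columns; mergeSchema evolves the table schema
# in place instead of failing the write or forcing a full reprocess.
new_batch = spark.read.parquet("s3://example-bucket/landing/events_2022_06/")
(
    new_batch.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save(target_path)
)

# The resulting table supports ACID reads while appends continue.
DeltaTable.forPath(spark, target_path).toDF().printSchema()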


