
Data Engineer Senior

Location:
Overland Park, KS
Posted:
September 10, 2025


DHRUVA A

Senior Data Engineer

EMAIL: ***************@*****.*** PH NO: 913-***-****

LINKEDIN: www.linkedin.com/in/dhruvatra

PROFESSIONAL SUMMARY:

8+ years of professional experience as a Data Engineer with expertise in Database Development, ETL Development, Data Modeling, Report Development, and Big Data Technologies.

Hands-on experience with Amazon Web Services (AWS) and its services, including EC2, S3, RDS, EMR, Lambda, Redshift, CloudWatch, AWS Data Pipeline, and other AWS ETL services.

Extensive experience in Azure Data Factory, Data Lake, Azure SQL Database, Databricks, Stream Analytics, Logic Apps, and Azure Monitor.

Hands-on experience designing and operating secure, cloud-native solutions in Azure using Terraform, CI/CD pipelines, and containerization (Docker, Kubernetes) for financial services.

Skilled in Azure Policy authoring and governance, ensuring compliance with security and regulatory frameworks.

Strong knowledge of application development, system integration, and automation using Python with the Azure SDK across the full SDLC.

Experienced in designing and developing machine learning models in Python using Scikit-learn and TensorFlow for predictive analytics and trend forecasting.

Expertise in developing Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation to uncover business insights.

Experienced in developing complex ETL pipelines using Spark and PySpark for data transformation, cleansing, and aggregation.

Skilled in integrating Apache Kafka with Spark for real-time data ingestion and processing, utilizing Kafka topics and partitions to ensure reliable and scalable data streaming solutions.

Experienced in managing Kafka clusters, schema evolution and real-time data streaming.

Extensive experience in Hadoop ecosystem components including HDFS, MapReduce, Hive, Pig, and HBase for managing and analyzing large datasets.

Proficient in optimizing Hadoop YARN clusters and MapReduce jobs to improve performance and efficiency in handling large datasets.

Experienced in extracting, transforming, and loading data from heterogeneous data sources into SQL Server using SQL Server Integration Services (SSIS) packages.

Experienced in optimizing SQL and PL/SQL queries for data extraction, transformation, and reporting.

Skilled in writing and optimizing complex MySQL queries, stored procedures, and triggers to improve query performance and data retrieval.

Skilled in integrating NoSQL databases with data warehousing systems for holistic data analysis and reporting.

Ensured data quality through validation frameworks and implemented fine-grained security using IAM policies and role-based access control across cloud platforms.

Experienced in developing and customizing Power BI reports, including paginated reports and branded themes for impactful data visualization.

Skilled in creating advanced Tableau dashboards with features like custom extensions for deep insights.

Proficient in leveraging Snowflake’s cloud-based data warehousing capabilities for scalable and efficient data storage, processing, and analytics.

Proficient in using GitLab for version control, managing source code, and tracking changes through branches, commits, and merge requests.

Skilled in configuring and customizing Jira workflows, screens, and fields to meet specific project requirements and streamline processes.

TECHNICAL SKILLS:

Programming Languages

Python, SQL, PL/SQL, T-SQL, Scala, R

Data Processing

PySpark, Spark SQL, Apache Kafka, Spark Streaming, Hadoop, HDFS, MapReduce, Hive, Pig, HBase, Snowflake, Terraform, ARM Templates, Docker, Kubernetes, AKS

Clouds

AWS, Azure (Azure Policies, RBAC, IAM, Network Security Groups, Secrets Management, Baseline Hardening)

ETL Tools

Informatica PowerCenter, Informatica Data Quality (IDQ), SSIS

Data Visualization

Power BI, Tableau, Excel

Database Management

MySQL, PostgreSQL, SQL Server, NoSQL Databases (MongoDB, Cassandra, HBase)

Version Control & CI/CD

Git, Git hooks, Jira, Jenkins, Spinnaker, GitLab CI/CD

Data Integration

Apache Sqoop, Apache Flume, Kafka Connect, Azure Data Factory, AWS Glue

Machine Learning

Scikit-learn, TensorFlow, Spark MLlib

PROFESSIONAL EXPERIENCE:

Client: BOK Financial, Tulsa, OK. Feb 2024 – Present

Role: Senior Data Engineer

Responsibilities:

Designed and deployed infrastructure-as-code solutions using Terraform for Azure Data Factory and Databricks environments, ensuring consistent, automated provisioning.

Authored and enforced Azure Policies for compliance, cost control, and security governance across data workloads.

Built and maintained CI/CD pipelines in Jenkins and GitLab to automate deployment of PySpark and Python applications, improving delivery speed by 40%.

Containerized data processing applications using Docker and orchestrated them in Kubernetes (AKS) for scalable, cloud-native execution.

Implemented cloud security best practices including RBAC, IAM, secrets management, and network segmentation for sensitive financial datasets.

Created and optimized ETL workflows and mappings in Informatica PowerCenter to extract, transform, and load data from multiple sources into target data warehouses.

Designed and developed machine learning models in Python using Scikit-learn and TensorFlow to predict trends and patterns in large datasets.

Developed Python-based data wrangling and preprocessing pipelines to transform raw data into analytical datasets, leveraging libraries like Dask for parallel processing on large datasets.

Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation across multiple file formats to uncover business insights.
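
A minimal PySpark sketch of this kind of extract-transform-aggregate job (the paths, schema, and column names below are illustrative, not taken from any client system):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Illustrative ETL sketch: bucket paths and column names are hypothetical.
spark = SparkSession.builder.appName("txn-aggregation").getOrCreate()

# Extract: read two different file formats into DataFrames.
txns = spark.read.option("header", True).csv("s3://example-bucket/raw/transactions/")
accounts = spark.read.parquet("s3://example-bucket/curated/accounts/")

# Transform: cast types, drop bad rows, and join the sources.
txns = (txns.withColumn("amount", F.col("amount").cast("double"))
            .filter(F.col("amount").isNotNull()))
joined = txns.join(accounts, on="account_id", how="inner")

# Aggregate: daily totals per account, also exposed to Spark SQL as a view.
daily = (joined.groupBy("account_id", F.to_date("txn_ts").alias("txn_date"))
               .agg(F.sum("amount").alias("total_amount"),
                    F.count("*").alias("txn_count")))
daily.createOrReplaceTempView("daily_totals")
spark.sql("SELECT * FROM daily_totals WHERE total_amount > 10000").show()
```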

Optimized PySpark jobs for high-dimensional financial datasets, improving query performance for analytical models.

Led the implementation of Databricks-based data pipelines supporting financial analytics across 10TB+ datasets, optimizing compute costs and query performance by 40%.

Applied Spark’s MLlib library to conduct large-scale data analysis and build predictive models.

Developed custom Spark UDFs (User Defined Functions) to handle complex data processing logic and extend Spark's built-in functionalities.
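
A short sketch of a custom Spark UDF of the kind described; the normalization logic and column names are hypothetical examples, not the actual production functions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-example").getOrCreate()

# Hypothetical logic not covered by Spark built-ins: normalize free-text
# merchant names (trim, uppercase, collapse whitespace) before grouping.
def normalize_merchant(name):
    if name is None:
        return None
    return " ".join(name.strip().upper().split())

normalize_udf = F.udf(normalize_merchant, StringType())

df = spark.createDataFrame([(" acme  corp",), ("Acme Corp",)], ["merchant"])
df.withColumn("merchant_norm", normalize_udf("merchant")).show()
```

Native functions and pandas UDFs are generally preferred for performance, with row-wise Python UDFs reserved for logic that cannot be expressed otherwise.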

Automated data pipeline deployment and monitoring using CI/CD practices for PySpark applications, enhancing development efficiency and reliability.

Used schema registry tools to handle schema evolution and ensure compatibility across Kafka producers and Kafka consumers.

Conducted Kafka cluster upgrades and maintenance activities, including schema evolution and version management, to ensure system stability and scalability.

Generated ad-hoc SQL queries using joins, database connections and transformation rules to fetch data from legacy DB2 and SQL Server database systems.

Integrated NoSQL databases with data warehousing solutions to combine NoSQL data with traditional data sources for comprehensive analysis.

Conducted regular cluster maintenance, performed capacity planning, and applied tuning optimizations on Hadoop YARN clusters.

Optimized MapReduce jobs to process and analyze large datasets efficiently within the Hadoop ecosystem.

Developed Power BI paginated reports using Power BI Report Builder for highly formatted reports.

Designed custom Power BI themes and branding for reports to align with corporate identity and enhance visual consistency.

Utilized Git for CI/CD pipelines to automate code quality checks, streamline testing, and enhance security scans, ensuring efficient code and Docker image distribution.

Used Jira automation tools to trigger alerts and notifications for task dependencies and critical blockers, improving team response time to issues during sprints.

Environment: Azure (Data Factory, Databricks, Logic Apps, Monitor, Stream Analytics, Synapse Analytics, Data Lake Storage (ADLS) Gen2, Functions), Informatica PowerCenter, Python, Scikit-learn, TensorFlow, Dask, PySpark, Spark SQL, Spark MLlib, Spark UDFs, CI/CD, Kafka, Schema Registry, DB2, SQL Server, NoSQL, Hadoop YARN, MapReduce, Snowflake, Power BI, Git, GitLab, Jira.

Client: Oscar Health, New York, New York. Aug 2023 – Jan 2024

Role: Senior Data Engineer

Responsibilities:

Utilized AWS S3 as a primary storage solution for data integration workflows, integrating with AWS Glue and AWS Lambda to automate ETL processes.

Monitored and optimized performance of AWS Redshift and RDS databases using AWS CloudWatch and Performance Insights, identifying and resolving bottlenecks to enhance system efficiency.

Designed and implemented AWS Step Functions to orchestrate serverless workflows, integrating with AWS Lambda, S3, and DynamoDB for automated data processing.

Built AWS Glue Spark jobs to transform raw JSON in S3 into Parquet, reducing storage costs by 30%.
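
A minimal sketch of this kind of JSON-to-Parquet conversion, written as plain PySpark for brevity rather than the Glue-specific job APIs; bucket names and the partition column are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

# Read raw JSON text files from S3 (paths are hypothetical).
raw = spark.read.json("s3://example-raw-bucket/events/")

# Write columnar, compressed Parquet; this, plus partition pruning on a date
# column (assumed to exist in the data), drives the storage and scan-cost
# savings over raw JSON.
(raw.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://example-curated-bucket/events_parquet/"))
```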

Utilized AWS IAM policies and roles to enforce fine-grained access control across AWS services, ensuring security and compliance with organizational standards.

Developed robust data pipelines and Delta Lake architectures on Databricks, integrated with Unity Catalog and S3 for structured access control.

Utilized Informatica Data Quality (IDQ) to implement data cleansing, profiling, and validation processes to improve data quality.

Automated report generation and data visualization tasks by creating Python scripts using Matplotlib, Seaborn, and Plotly.
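
A small sketch of this kind of scripted chart generation with Matplotlib; the input file, column names, and output name are illustrative:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script can run unattended
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical input: one row per month with a pre-aggregated metric.
df = pd.read_csv("monthly_metrics.csv")

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(df["month"], df["claims_processed"], marker="o")
ax.set_title("Claims Processed per Month")
ax.set_xlabel("Month")
ax.set_ylabel("Claims")
fig.tight_layout()
fig.savefig("claims_processed.png", dpi=150)
```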

Implemented Python-based unit testing and continuous integration for data pipeline workflows, improving code reliability.

Delivered data pipelines compliant with healthcare regulations, including HIPAA and internal audit policies.

Identified AWS cost-saving opportunities by analyzing Redshift cluster usage, reducing spend by 20%.

Created and optimized complex SQL queries using Spark SQL for interactive data analysis and reporting.

Implemented Spark-based data validation frameworks to ensure the accuracy and consistency of data across various stages of processing.

Executed unit tests for PySpark transformations and actions to ensure code reliability and correctness.
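
A sketch of a pytest-style unit test for a PySpark transformation using a local SparkSession; the function under test, add_age_band, is a hypothetical example:

```python
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical transformation under test: bucket an age column into bands.
def add_age_band(df):
    return df.withColumn(
        "age_band",
        F.when(F.col("age") < 18, "minor")
         .when(F.col("age") < 65, "adult")
         .otherwise("senior"),
    )

@pytest.fixture(scope="session")
def spark():
    # Small local session so tests run without a cluster.
    return SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()

def test_add_age_band(spark):
    df = spark.createDataFrame([(10,), (30,), (70,)], ["age"])
    result = {row["age"]: row["age_band"] for row in add_age_band(df).collect()}
    assert result == {10: "minor", 30: "adult", 70: "senior"}
```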

Optimized PySpark jobs by tuning configurations and applying best practices to enhance performance and reduce processing time.

Implemented Kafka Streams applications for real-time data processing and enrichment, leveraging Kafka’s stream processing capabilities.

Integrated Spark with Hadoop ecosystem tools such as Hive and HBase to perform complex data transformations and aggregations.

Automated the extraction of complex event-based data from real-time systems using Hadoop technologies such as Flume and Kafka, enabling near real-time analytics on operational data.

Automated data validation and auditing processes using SQL and PL/SQL to ensure consistency, quality and accuracy of data within data warehouses and reporting systems.

Engineered data partitioning strategies for NoSQL databases to improve query performance and manage large-scale data distributions.

Implemented and maintained Snowflake data warehouse, enabling efficient storage and querying of vast datasets.

Created Data Marts and multi-dimensional models like Star schema and Snowflake schema.

Developed complex relational database and data warehouse structures to feed data continuously to Power BI reports without load failures.

Developed Ad-Hoc reports using Power BI by connecting to various data sources and using data blending techniques.

Leveraged GitLab for source code management, CI/CD pipelines, and project tracking, improving collaboration and productivity in the development lifecycle.

Customized Jira boards to support specific data engineering workflows, adding custom fields, statuses, and categories to better organize and manage tasks throughout the development lifecycle.

Environment: AWS S3, AWS Glue, AWS Lambda, AWS Redshift, AWS RDS, AWS CloudWatch, AWS Step Functions, AWS IAM, Informatica PowerCenter, Informatica Data Quality (IDQ), Matplotlib, Seaborn, Plotly, Python Unit Testing, Spark SQL, PySpark, Spark Data Validation Frameworks, Kafka Streams, Kafka Cluster Management, Hadoop Hive, Hadoop HBase, Hadoop Flume, SQL, PL/SQL, NoSQL databases, Snowflake, Power BI, Power BI Data Blending, GitLab, Jira.

Client: Bata India Limited, Gurgaon, India. Apr 2019 – Jun 2023

Role: Data Engineer

Responsibilities:

Involved in the design and implementation of data ingestion pipelines from multiple sources using Azure Data Factory and Azure Databricks.

Managed and optimized data storage solutions using Azure Data Lake Storage and Azure SQL Database, ensuring scalable and cost-effective data storage for large datasets.

Developed custom transformations and reusable components in Informatica to address specific business needs.

Utilized Python and Pandas to handle complex data manipulations, including the merging of large datasets from multiple sources, allowing for more comprehensive analysis and reporting.

Automated data validation using Python libraries such as Pandas and NumPy to ensure data accuracy and consistency across multiple data sources.
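
A minimal sketch of this style of Pandas/NumPy validation over merged sources; file names, columns, and rules are illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: sales transactions and a store dimension file.
sales = pd.read_csv("sales.csv")
stores = pd.read_csv("stores.csv")

# validate="many_to_one" makes Pandas fail fast if store_id is not unique in stores.
merged = sales.merge(stores, on="store_id", how="left", validate="many_to_one")

checks = {
    "no_orphan_sales": merged["store_name"].notna().all(),
    "amounts_non_negative": (merged["amount"] >= 0).all(),
    "amounts_numeric": np.issubdtype(merged["amount"].dtype, np.number),
}

failed = [name for name, ok in checks.items() if not ok]
if failed:
    raise ValueError(f"Data validation failed: {failed}")
```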

Monitored and debugged PySpark applications, using Spark logs and metrics to troubleshoot issues and improve system stability.

Integrated Kafka with Spark Streaming for real-time data processing and analytics, leveraging Kafka as a data source for Spark applications.
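
A minimal sketch of this Kafka-to-Spark integration, shown here with Spark's Structured Streaming Kafka source rather than the older DStream API; the broker address, topic, and checkpoint path are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

# Subscribe to a Kafka topic; connection details are hypothetical.
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "orders")
          .option("startingOffsets", "latest")
          .load())

# Kafka delivers key/value as binary; cast the value to a string before parsing.
orders = stream.select(F.col("value").cast("string").alias("payload"))

query = (orders.writeStream
         .format("console")
         .option("checkpointLocation", "/tmp/checkpoints/orders")
         .outputMode("append")
         .start())
query.awaitTermination()
```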

Managed data ingestion from various sources into Hadoop Distributed File System (HDFS) using tools such as Apache Sqoop for structured and semi-structured data.

Optimized data transformation and aggregation jobs using Apache Pig and Hive on Hadoop, improving the performance of complex queries and reducing execution times for analytics tasks.

Performed data analysis and data profiling using complex SQL queries on various source systems, including Oracle and SQL Server.

Created custom T-SQL procedures to load data from flat files into SQL Server databases using the SQL Server Import and Export Wizard.

Designed and implemented sharding strategies to horizontally scale NoSQL databases and efficiently distribute data across multiple nodes.

Incorporated advanced visual analytics techniques such as clustering, trend lines, and forecasting within Tableau dashboards to provide deeper insights.

Enhanced Tableau dashboards with geospatial data, leveraging map visualizations and spatial analytics to provide location-based insights.

Designed and implemented automated GitLab pipelines to handle data schema changes and database migrations.

Environment: Azure (Data Factory, Databricks, Data Lake Storage, SQL Database), Informatica, Python, Pandas, NumPy, PySpark, Kafka, Spark Streaming, Sqoop, Pig, Hive, Hadoop, SQL, T-SQL, NoSQL, Tableau, GitLab.

Client: Medanta, Hyderabad, India. Jun 2017 – Mar 2019

Role: Data Analyst

Responsibilities:

Developed and maintained automated workflows for data ingestion and processing in AWS S3 and AWS EMR, integrating with other AWS services to streamline ETL processes and enhance data accessibility.

Implemented security best practices for AWS S3 and AWS EC2, including encryption of data at rest and in transit.

Optimized and refactored existing Python code to improve performance and scalability, reducing execution time and resource consumption for data processing tasks.

Developed custom data transformation scripts in Python to handle complex data conversions and aggregations, ensuring data consistency across multiple systems and reports.

Developed and maintained Hadoop-based data pipelines for ETL processes, automating data flow from source systems to HDFS.

Optimized HiveQL queries for large-scale data analysis, ensuring efficient data retrieval and transformation in the Hadoop environment.

Involved in writing SQL Queries and used Joins to access data from Oracle and MySQL.

Developed and maintained data pipelines using SQL and PL/SQL scripts to extract, transform, and load (ETL) data across heterogeneous data sources.

Designed MySQL databases, including schema design, table creation, indexing, and performance tuning, to support robust data storage and retrieval systems.

Utilized Excel’s data validation and conditional formatting features to ensure data integrity and highlight anomalies for further investigation.

Designed and implemented structured templates in MS Word for standardized reporting, ensuring consistency and professionalism across all documentation.

Developed custom Tableau extensions and integrations to enhance dashboard functionality and incorporate external data sources and tools.

Environment: AWS (S3, EMR, EC2), ETL, Python, Hadoop, HDFS, HiveQL, SQL, PL/SQL, Oracle, MySQL, Excel, MS Word, Tableau.

EDUCATION: B.Tech in Computer Science and Engineering, June 2013 - May 2017, JNTUH, India

Master's in Computer Science, Aug 2023 - Dec 2024, UCM, Lee's Summit, MO


