
Data Engineer Big

Location:
San Antonio, TX, 78240
Salary:
90000
Posted:
October 15, 2025

Pradeep Theja

Sr. Data Engineer

314-***-**** *************@*****.***

Professional Summary:

7+ years of experience in IT, including Big Data technologies, the Hadoop ecosystem, data warehousing, and SQL-related technologies across various sectors. 6+ years of experience in Big Data analytics using Hadoop ecosystem tools, the Spark framework, and Azure cloud services, with Scala and Python as the primary programming languages. 3+ years of experience in a data warehouse developer role.

Technical Summary:

Extensive experience in designing and implementing Azure-based data pipelines using Azure Data Factory (ADF), Azure Synapse Analytics, Databricks, and Azure Data Lake Storage (ADLS).

Developed and optimized ETL/ELT workflows for data warehousing solutions using Azure Synapse, SQL Server, and traditional RDBMS such as Oracle, Teradata, and PostgreSQL.

Proficient in Hadoop ecosystem components including HDFS, Hive, Sqoop, MapReduce, and YARN, and in migrating on-prem big data workloads to Azure cloud platforms.

Designed and implemented real-time data streaming solutions using Apache Kafka and Azure Event Hubs, processing large-scale event-driven data.

Developed batch and streaming data processing frameworks in Azure Databricks using PySpark and optimized Spark job execution with caching and partitioning.

Built data ingestion pipelines using Kafka Connect to capture Change Data Capture (CDC) events from SQL Server and Oracle into Synapse Analytics.

Implemented Data Lakehouse architecture using Delta Lake in Databricks, enabling ACID transactions, schema enforcement, and time-travel features.
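
As a minimal illustration of those Delta Lake features (the storage account, paths, and table names are hypothetical, and a Databricks or delta-spark-enabled session is assumed):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Delta support is built into Databricks runtimes

# Initial load into a Delta table; schema enforcement rejects mismatched writes by default
orders = spark.read.parquet("abfss://raw@<account>.dfs.core.windows.net/orders/")
orders.write.format("delta").mode("overwrite").save("/mnt/lakehouse/orders")

# ACID append: a failed job never leaves partially written files visible to readers
new_orders = spark.read.parquet("abfss://raw@<account>.dfs.core.windows.net/orders_incremental/")
new_orders.write.format("delta").mode("append").save("/mnt/lakehouse/orders")

# Time travel: read the table as it existed at an earlier version
orders_v0 = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/lakehouse/orders")
```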

Developed and managed large-scale data warehousing models in Synapse Analytics, implementing star and snowflake schema designs for optimized analytics.

Configured and managed Kafka topics with appropriate partitioning and replication strategies for fault-tolerant and high-throughput data processing.

Developed Spark Streaming applications to consume, transform, and store Kafka messages in ADLS and Synapse for real-time analytics.
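
A hedged sketch of that consume-transform-persist pattern using Spark Structured Streaming; the broker, topic, schema, and ADLS paths below are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

schema = StructType().add("event_id", StringType()).add("amount", DoubleType())

# Consume Kafka messages as a streaming DataFrame
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "<broker>:9092")   # placeholder broker
          .option("subscribe", "transactions")                   # placeholder topic
          .option("startingOffsets", "latest")
          .load())

# Kafka values arrive as bytes; parse the JSON payload into columns
parsed = events.select(from_json(col("value").cast("string"), schema).alias("e")).select("e.*")

# Persist the stream to ADLS for downstream Synapse queries
query = (parsed.writeStream.format("delta")
         .option("checkpointLocation", "abfss://chk@<account>.dfs.core.windows.net/transactions/")
         .outputMode("append")
         .start("abfss://curated@<account>.dfs.core.windows.net/transactions/"))
```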

Implemented workload isolation and performance tuning strategies in Synapse dedicated SQL pools, optimizing resource allocation and indexing strategies.

Developed metadata-driven ingestion frameworks using ADF and Kafka, dynamically handling schema evolution and incremental data loads.

Configured role-based access control (RBAC) and data encryption policies in Azure SQL, Synapse Analytics, and Kafka to ensure security and compliance.

Integrated Hadoop-based processing workflows with Azure Synapse and Databricks, ensuring seamless data transformation across cloud and on-premises environments.

Designed real-time fraud detection pipelines using Kafka Streams and Spark Structured Streaming, integrating with Synapse for real-time analytics.

Optimized data storage and retrieval strategies using Parquet and ORC formats in ADLS, improving query performance for large datasets.

Developed and deployed Azure Functions to automate event-driven processing of streaming data from Kafka and Event Hubs.

Integrated Azure Event Hubs with Databricks Structured Streaming to enable near real-time data processing and anomaly detection in IoT data.

TECHNICAL SKILL-SET:

Big Data Technologies: HDFS, MapReduce, Hive, Sqoop, Flume, Oozie, Spark, Zookeeper, Delta Lake

Languages: Python, SQL, PL/SQL, Scala, HiveQL, Unix Shell Scripting, JSON, YAML

Cloud Architecture: Azure Data Lake Storage (ADLS), Azure Data Factory (ADF), Azure Databricks, Azure Synapse Analytics, Azure SQL Database, Azure Cosmos DB, Azure Blob Storage, Azure Functions, Azure Kubernetes Service (AKS), Azure Purview

Streaming & Messaging Systems: Apache Kafka, Azure Event Hubs, Apache Flink, RabbitMQ

Data Warehousing & Databases: Azure Synapse, Microsoft SQL Server, PostgreSQL, Teradata, Snowflake

NoSQL Databases: HBase, MongoDB, Cosmos DB

Reporting & Visualization Tools: Power BI, Tableau, SSRS

Web Services & API Development: REST API, SOAP, GraphQL, Microservices, Azure API Management

DevOps & CI/CD Tools: Azure DevOps, Jenkins, GitHub Actions, Terraform, Bicep, Docker, Kubernetes

Version Control & Build Tools: GitHub, GitLab, Bitbucket

Tools & IDEs: Visual Studio Code, Eclipse, IntelliJ IDEA

Professional Experience:

Webster Bank, Dallas, Texas Sep 2023 – Present

Sr. Azure Data Engineer.

Responsibilities:

Created batch and streaming pipelines in Azure Data Factory (ADF) using Linked Services, Datasets, and Pipelines to extract, transform, and load data.

Created Azure Data Factory (ADF) batch pipelines to ingest data from relational sources into Azure Data Lake Storage (ADLS Gen2) in an incremental fashion, then load the data into Delta tables after cleansing.

Created Azure Logic Apps that trigger when a new email with an attachment is received and load the file into Blob Storage.

Implemented CI/CD pipelines using Azure DevOps in the cloud with Git and Maven, along with Jenkins plugins.

Developed Spark applications using PySpark and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats.
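
A small sketch of that extract-transform-aggregate flow in PySpark and Spark SQL (file paths and column names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Extract from different file formats
csv_df = spark.read.option("header", "true").csv("/mnt/raw/branch_feed.csv")
json_df = spark.read.json("/mnt/raw/online_feed.json")

# Transform: align columns and types, then union the sources
combined = (csv_df.selectExpr("account_id", "cast(amount as double) as amount")
            .unionByName(json_df.selectExpr("account_id", "cast(amount as double) as amount")))

# Aggregate with Spark SQL
combined.createOrReplaceTempView("transactions")
totals = spark.sql("""
    SELECT account_id, SUM(amount) AS total_amount
    FROM transactions
    GROUP BY account_id
""")
totals.write.mode("overwrite").parquet("/mnt/curated/account_totals")
```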

Designed and developed ELT pipelines in Azure Data Factory (ADF) to ingest structured and semi-structured data from APIs, FTP, and relational databases into Azure Data Lake Storage (ADLS).

Implemented automated data partitioning and clustering strategies in Azure Synapse Analytics to improve query execution performance for large datasets.

Developed PySpark and Scala-based ETL jobs in Azure Databricks to cleanse, transform, and aggregate high-volume streaming and batch data.

Configured and managed Kafka Connect to ingest streaming data from SQL Server, PostgreSQL, and NoSQL sources into Azure Synapse and Cosmos DB.

Integrated Databricks Delta Lake with Kafka for real-time data lake ingestion and transformation, implementing incremental processing using structured streaming.

Developed event-driven architectures using Azure Event Grid and Azure Functions to trigger data transformations based on real-time events.

Optimized Azure Synapse workloads by implementing Workload Management Classifications to prioritize critical queries and optimize resource usage.

Designed a real-time anomaly detection system using Kafka, Azure Stream Analytics, and Azure Machine Learning to detect fraud in financial transactions.

Implemented Databricks Job Clusters and Autoscaling features to optimize cost and performance for heavy computational workloads.

Developed and managed complex ETL workflows in Databricks using Task Orchestration, Notebook Workflows, and Parameterized Jobs.

Implemented fault-tolerant and retry mechanisms in ADF pipelines to handle transient failures during data movement across different sources.

Created data ingestion frameworks that integrate IoT sensor data from Azure IoT Hub into ADLS and Synapse Analytics for predictive maintenance.

Configured and monitored Kafka Streams applications in Azure Kubernetes Service (AKS) to enable microservices-based event-driven data processing.

Developed dynamic, metadata-driven data pipelines in ADF and Databricks, automating schema inference and reducing manual development efforts.

Orchestrated big data processing pipelines using Apache Airflow in combination with Databricks to automate end-to-end workflows.

Implemented data quality checks using Great Expectations in Databricks to validate and profile data before loading into Synapse Analytics.
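
One way such a check can look, assuming the legacy Great Expectations SparkDFDataset API; the staging table and columns are illustrative:

```python
from pyspark.sql import SparkSession
from great_expectations.dataset import SparkDFDataset  # legacy GE Spark wrapper (assumption)

spark = SparkSession.builder.getOrCreate()

claims = spark.table("staging.claims")   # illustrative staging table
ge_claims = SparkDFDataset(claims)

# Validate before loading into Synapse
ge_claims.expect_column_values_to_not_be_null("claim_id")
ge_claims.expect_column_values_to_be_between("billed_amount", min_value=0)
results = ge_claims.validate()

if not results.success:
    raise ValueError("Data quality checks failed; aborting Synapse load")
```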

Optimized Spark SQL queries by tuning Shuffle Partitions, Skew Handling, and Broadcast Joins in Databricks for improved performance.
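
A hedged sketch of those tuning levers in a Databricks notebook (the table names and partition count are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Right-size shuffle parallelism for the cluster instead of the 200-partition default
spark.conf.set("spark.sql.shuffle.partitions", "400")

# Let Adaptive Query Execution coalesce partitions and split skewed ones at runtime
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

facts = spark.table("curated.transactions")   # large fact table (illustrative)
dims = spark.table("curated.merchants")       # small dimension table

# Broadcast the small side to avoid shuffling the large fact table
joined = facts.join(broadcast(dims), "merchant_id")
```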

Implemented data lake lifecycle policies in ADLS to automatically tier cold and archival data to reduce storage costs and maintain compliance.

Transformed and copied data from JSON files stored in Data Lake Storage into an Azure Synapse Analytics table using Azure Databricks.

Environment: Azure Cloud (Azure Data Factory, Azure Databricks, Azure Data Lake Storage Gen2, Azure Synapse Analytics, Azure Logic Apps, Azure Event Grid, Azure Functions, Azure Key Vault, Azure Monitor, Azure Blob Storage, Azure IoT Hub, Azure Kubernetes Service, Azure Cosmos DB)

Cigna, Bloomfield, Connecticut Nov 2021 – Aug 2023

Azure Data Engineer.

Responsibilities:

Involved in the complete big data flow of the application, from ingesting upstream data into HDFS through processing and analyzing the data in HDFS.

Worked on Azure cloud components (Databricks, Data Lake, Blob Storage, Data Factory, Storage Explorer, SQL DB, SQL DWH, Cosmos DB).

Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data from Azure SQL, Blob Storage, and Azure SQL Data Warehouse.

Worked with Azure Blob and Data Lake storage, loading data into Azure Synapse Analytics (SQL DW).

Involved in developing data ingestion pipelines on Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL.

Conducted performance tuning and optimization activities to ensure optimal performance of Azure Logic Apps and associated data processing pipelines.

Developed Spark Streaming applications to process real-time data from sources such as Kafka and Azure Event Hubs.

Built streaming ETL pipelines using Spark Streaming to extract data from various sources, transform it in real time, and load it into a data warehouse such as Azure Synapse Analytics.

Developed end-to-end data pipelines on Azure using Data Factory (ADF), Databricks, and Synapse Analytics for batch and real-time data processing.

Ingested data from on-prem sources and cloud applications into Azure Data Lake Storage (ADLS) using ADF Linked Services, Pipelines, and Datasets.

Transformed and processed large-scale structured and semi-structured datasets using PySpark in Databricks, leveraging Spark SQL and DataFrames.

Implemented streaming ETL pipelines using Azure Event Hubs and Databricks Structured Streaming to process real-time transactional data from Kafka topics.

Designed and optimized Delta Lake tables in Azure Databricks, enabling ACID transactions, schema enforcement, and time-travel functionality.

Developed Spark Streaming applications to consume and transform real-time event data from Kafka and Azure Event Hubs, persisting data into ADLS and Synapse.

Built and configured Azure Synapse SQL pools with partitioning, indexing, and distribution strategies for improved query performance.

Implemented Change Data Capture (CDC) pipelines using ADF and Kafka to capture incremental changes from SQL Server and Oracle databases.
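
On the sink side, applying such captured changes can be sketched as a Delta MERGE in Databricks; the landing path, target table, key, and operation codes below are hypothetical:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Incoming CDC batch (e.g., landed by ADF or read from a Kafka topic) -- illustrative source
changes = spark.read.format("delta").load("/mnt/landing/members_cdc")

target = DeltaTable.forName(spark, "curated.members")

# Apply inserts, updates, and deletes from the change feed to the curated table
(target.alias("t")
 .merge(changes.alias("c"), "t.member_id = c.member_id")
 .whenMatchedDelete(condition="c.op = 'D'")
 .whenMatchedUpdateAll(condition="c.op = 'U'")
 .whenNotMatchedInsertAll(condition="c.op = 'I'")
 .execute())
```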

Developed batch processing jobs using PySpark and Azure HDInsight to process data stored in ADLS and export refined datasets to Synapse Analytics.

Created parameterized data pipelines in ADF to dynamically handle multiple source systems, reducing ETL development time and improving reusability.

Built real-time dashboards in Power BI by connecting to Azure Synapse using DirectQuery mode to visualize near-instant updates from Kafka event streams.

Developed Python scripts for data validation, cleansing, and transformation before loading data into Azure SQL Database.

Orchestrated Spark jobs and Hive queries using Apache Oozie and integrated with ADF for automated workflow execution.

Implemented Role-Based Access Control (RBAC) and data encryption in Synapse Analytics and ADLS to ensure secure data governance.

Integrated Azure Purview for metadata tracking and lineage management, ensuring compliance with data governance policies.

Automated CI/CD deployments of ADF, Databricks, and Synapse scripts using Azure DevOps pipelines and Terraform.

Environment: Azure Cloud (Azure Data Factory, Azure Databricks, Azure Data Lake Storage Gen2, Azure Synapse Analytics, Azure Blob Storage, Azure Event Hubs, Azure HDInsight, Azure Key Vault, Azure Monitor, Azure Purview, Azure Cosmos DB, Azure Storage Explorer, Azure SQL Database, Azure SQL Data Warehouse, Azure Logic Apps, Azure Event Grid, Azure Functions)

Change Healthcare, Nashville, TN Oct 2020 – Nov 2021

Data Engineer.

Responsibilities:

Architected and implemented robust, scalable healthcare data pipelines using PySpark, enabling efficient processing, transformation, and real-time analytics of clinical, claims, and patient health data, ensuring regulatory compliance and system reliability.

Designed and optimized enterprise-grade relational databases using SQL Server, applying advanced query optimization, indexing strategies, and performance tuning to support critical healthcare applications, electronic health records (EHR), and patient management systems.

Developed seamless database integration layers with SQLAlchemy, leveraging its ORM capabilities to efficiently manage clinical data interactions between healthcare applications and backend systems, enhancing scalability and maintainability.

Built secure, high-performance healthcare APIs with FastAPI, incorporating robust authentication mechanisms, comprehensive logging, and version control to ensure secure, efficient patient data access and interoperability.
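
A minimal FastAPI sketch of such an endpoint, with a hypothetical header-token check standing in for the real authentication layer:

```python
from fastapi import FastAPI, Depends, HTTPException, Header

app = FastAPI(title="Patient Data API", version="1.0")

# Hypothetical token check; a production service would validate against an identity provider
def verify_token(x_api_token: str = Header(...)) -> None:
    if x_api_token != "expected-token":   # placeholder check
        raise HTTPException(status_code=401, detail="Invalid token")

@app.get("/patients/{patient_id}", dependencies=[Depends(verify_token)])
def get_patient(patient_id: str) -> dict:
    # In the real service this would query the clinical database
    return {"patient_id": patient_id, "status": "active"}
```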

Engineered resilient healthcare data orchestration workflows using Dagster, ensuring reliable execution, monitoring, and troubleshooting of complex data pipelines critical for clinical reporting, analytics, and care coordination.

Streamlined healthcare software development lifecycles by configuring CI/CD pipelines with Bitbucket, automating code quality checks, rigorous testing, compliance validation, and deployments to improve operational efficiency and reduce software release cycles.

Delivered actionable clinical insights and analytics by designing interactive, visually engaging dashboards and reports using AWS QuickSight, empowering healthcare stakeholders to make informed, data-driven decisions.

Optimized and managed clinical databases hosted on AWS RDS (PostgreSQL and SQL Server), ensuring secure, highly available, cost-effective database solutions supporting patient data management, clinical decision-making, and regulatory compliance.

Developed scalable, event-driven healthcare architectures leveraging AWS SNS, SQS, Step Functions, and Lambda, facilitating asynchronous processing for patient alerts, real-time monitoring, and seamless integration between clinical systems.

Built real-time, high-throughput data streaming solutions using Kafka, ensuring reliable, scalable data flow between electronic health records (EHR), telehealth systems, and remote patient monitoring platforms.

Automated clinical data ingestion, cataloging, and ETL processes using AWS Glue Crawler, simplifying metadata management, data discoverability, and compliance reporting across diverse healthcare datasets.

Designed and managed advanced healthcare analytics and big data workflows using Databricks, utilizing its collaborative analytics environment to enhance predictive modeling, clinical research, and patient outcome analyses.

Deployed and scaled containerized healthcare applications using AWS ECS, ensuring fault-tolerance, optimized resource utilization, and efficient orchestration for modern healthcare microservices architectures.

Automated cloud infrastructure provisioning and management with Terraform, implementing secure, scalable, and repeatable infrastructure-as-code solutions tailored to meet stringent healthcare regulatory requirements such as HIPAA.

Collaborated with clinical teams and data scientists to design, deploy, and monitor machine learning pipelines on AWS, transforming predictive healthcare models into production-ready solutions for early disease detection, patient risk stratification, and clinical decision support.

Configured and managed essential cloud infrastructure components including EFS, FSx, VPCs, subnets, and NAT Gateways, ensuring secure, compliant, and high-performance environments for healthcare data engineering workloads.

Environment: AWS Cloud (AWS Glue, AWS Lambda, AWS Step Functions, AWS SNS, AWS SQS, AWS RDS – PostgreSQL/SQL Server, AWS ECS, AWS EFS, AWS FSx, AWS VPC, AWS NAT Gateway, AWS QuickSight), Azure Cloud (Databricks)

State of Tennessee, Nashville, Tennessee May 2019 – Oct 2020

Big Data Engineer.

Responsibilities:

Worked on development of data ingestion pipelines using the ETL tool Informatica and Bash scripting, together with big data technologies including, but not limited to, Hadoop, Hive, Spark, and Kafka.

Gathered requirements for ingestion of new data sources including life cycle, data quality check, transformations, and metadata enrichment.

Developed data pipelines using Flume, Sqoop, Pig, and Java MapReduce to ingest customer behavioral data into HDFS for analysis.

Involved in Sqoop implementation, which helps load data from various RDBMS sources into Hadoop systems and vice versa.

Developed high-performance data pipelines in Azure Databricks using Spark Core and PySpark to process large datasets from ADLS and relational databases.

Integrated Azure Event Hubs with Apache Kafka to enable real-time event-driven data streaming and processing within Azure data pipelines.

Developed real-time data enrichment pipelines using Spark Structured Streaming, processing Kafka events before storing them in Delta Lake.

Implemented Kafka partitioning and replication strategies to optimize message throughput and fault tolerance in distributed streaming environments.
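
As an illustration, a topic with explicit partition and replication settings can be declared through the kafka-python admin client; the broker, topic name, and counts are assumptions:

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="<broker>:9092")  # placeholder broker

# Partitions set the consumer-parallelism ceiling; replication factor gives broker-failure tolerance
topic = NewTopic(name="payments", num_partitions=12, replication_factor=3)
admin.create_topics([topic])
```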

Built a unified batch and streaming processing architecture using Apache Spark, Kafka, and Azure Synapse to support hybrid data workflows.

Implemented HBase and Cassandra integrations with Databricks for fast, scalable NoSQL-based analytics on unstructured and semi-structured data.

Optimized Hive query performance on HDInsight by tuning Tez execution, Bucketing, and vectorized query execution.

Built data ingestion pipelines using Apache Sqoop to migrate structured data from MySQL, Oracle, and Teradata into Azure Data Lake Storage (ADLS).

Configured and optimized Spark Shuffle operations, including Skew Join handling, Salting, and Adaptive Query Execution (AQE) for performance improvements.

Deployed and managed Azure Kubernetes Service (AKS) clusters to run Kafka workloads for real-time event processing at scale.

Implemented log aggregation using Kafka, Logstash, and Azure Monitor to centralize application logs and perform real-time analytics.

Developed and scheduled Apache Oozie workflows to orchestrate multi-step ETL processes in Azure HDInsight and Databricks.

Utilized HDFS file compression techniques such as Snappy and Gzip to optimize storage and reduce read/write latencies in HDInsight.
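
For example, compression codecs can be chosen per write in Spark; the paths below are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
logs = spark.read.json("/data/raw/app_logs")

# Snappy: fast and splittable with Parquet; a good default for hot analytical data
logs.write.option("compression", "snappy").parquet("/data/curated/app_logs_parquet")

# Gzip: higher compression ratio; better suited to cold or archival text data
logs.write.option("compression", "gzip").json("/data/archive/app_logs_json")
```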

Integrated Apache Flink with Kafka for complex event processing (CEP) to detect fraud patterns and anomalies in real-time transaction data.

Built a streaming data warehouse using Kafka, Delta Lake, and Azure Synapse, allowing near real-time insights into customer behavior.

Implemented Spark RDD transformations for batch processing workloads in Hadoop, ensuring efficient parallel execution of large-scale data.

Configured and fine-tuned Kafka Consumer Group settings to ensure efficient message processing while minimizing lag in Databricks jobs.

Developed time-series data analytics pipelines using Spark, HBase, and ADLS to support IoT telemetry and predictive maintenance use cases.

Leveraged Azure Data Explorer (ADX) for real-time log analysis and integrated it with Kafka for high-speed ingestion of telemetry data.

Implemented UNIX scripts to define use-case workflows, process data files, and automate jobs.

Environment: Azure Cloud (Azure Databricks, Azure Data Lake Storage Gen2, Azure Synapse Analytics, Azure Event Hubs, Azure Kubernetes Service (AKS), Azure Monitor, Azure Data Explorer (ADX), Azure HDInsight), Hadoop Ecosystem (HDFS, Hive, Spark Core, Spark Structured Streaming, Flume, Sqoop, Pig, Tez, Oozie, HBase, Cassandra, Delta Lake, Snappy, Gzip)

Kroger, Cincinnati, Ohio (Intern) March 2018 – May 2019

Data Engineer.

Responsibilities:

Designed high-throughput Kafka producers and consumers using Python and Spark Streaming to process real-time financial transactions in Azure.
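
A minimal kafka-python producer along those lines; the broker, topic, and payload fields are placeholders:

```python
import json
from kafka import KafkaProducer  # kafka-python client

producer = KafkaProducer(
    bootstrap_servers=["<broker>:9092"],                      # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",                                               # wait for full ISR acknowledgement
    linger_ms=5,                                              # small batching window for throughput
)

# Key by account so related events land on the same partition and preserve ordering
producer.send(
    "transactions",
    key=b"account-1042",
    value={"account_id": "account-1042", "amount": 125.50, "currency": "USD"},
)
producer.flush()
```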

Optimized Spark workloads by implementing Broadcast Joins, AQE (Adaptive Query Execution), and caching techniques in Azure Databricks.

Built NoSQL-based data processing pipelines integrating Azure Cosmos DB and Databricks for handling semi-structured and hierarchical data formats like JSON and Parquet.

Developed automated Spark job monitoring and logging frameworks using Azure Monitor and Log Analytics to track performance and failures.

Designed incremental ingestion pipelines using ADF with watermarking techniques to process only newly added records from relational databases.
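
The same watermark idea, sketched here in PySpark against a JDBC source rather than in ADF itself; the control table, connection string, and columns are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import max as _max

spark = SparkSession.builder.getOrCreate()

# Last successfully loaded watermark, persisted in a control table
control = spark.table("etl.watermarks").where("source_table = 'dbo.orders'")
last_watermark = control.select("last_modified").first()[0]

# Pull only rows newer than the stored watermark from the source database
incremental = (spark.read.format("jdbc")
               .option("url", "jdbc:sqlserver://<server>;database=<db>")   # placeholder
               .option("dbtable", f"(SELECT * FROM dbo.orders WHERE modified_date > '{last_watermark}') q")
               .option("user", "<user>").option("password", "<secret>")
               .load())

incremental.write.format("delta").mode("append").saveAsTable("curated.orders")

# Advance the watermark to the max modified_date just processed (then write it back to etl.watermarks)
new_watermark = incremental.agg(_max("modified_date")).first()[0]
```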

Implemented a unified schema registry in Kafka for ensuring compatibility across multiple producer-consumer applications in a real-time streaming environment.

Developed automated metadata-driven ETL frameworks to dynamically handle schema evolution across multiple data sources.

Designed a real-time analytics dashboard powered by Azure Stream Analytics and Power BI, enabling instant visibility into data pipeline health metrics.

Developed custom event-driven microservices in Azure Functions and Kafka Streams to enrich real-time streaming data before landing in Azure Data Lake.

Implemented role-based security for data pipelines using Azure Managed Identities, ensuring granular access control to ADLS, Synapse, and Kafka topics.

Configured Kafka MirrorMaker to synchronize real-time messages across multi-cloud environments, ensuring business continuity and minimal downtime.

Developed automated job scheduling frameworks using Apache Airflow to coordinate multi-step ETL workflows across Azure Synapse and Databricks.

Implemented HDFS to ADLS migration strategies using DistCp and Azure Storage Blob APIs to modernize legacy Hadoop-based data lakes.

EDUCATION

Master's in Information Technology, Webster University.


