Work Experience
First United Bank | Azure Data Engineer | Oklahoma, USA | Aug 2024 – Present
Description: First United Bank is one of the largest privately owned banks, offering a wide range of financial services to individuals, businesses, and communities. I am implementing Azure SQL Database and Cosmos DB to support transactional and analytical workloads, ensuring reliability and scalability.
Key Responsibilities:
Designed and implemented scalable, reliable, and efficient data pipelines using Python, Java/Scala, and Shell Scripting to ingest, transform, and process large datasets
Partnered with data analysts and scientists to provide clean, well-structured datasets for advanced analytics and machine learning projects.
Designed and implemented scalable ETL pipelines to ingest, transform, and load large datasets into Amazon Redshift and Google BigQuery.
Configured role-based access controls in Snowflake and Hadoop, ensuring secure data access and compliance with organizational standards.
Designed and implemented distributed data processing workflows using MapReduce to handle large-scale datasets efficiently.
Highly involved in data architecture and application design using cloud and big data solutions on Microsoft Azure.
Extensively worked on relational databases such as PostgreSQL as well as MPP databases like Redshift.
Leading the migration of legacy systems to a Microsoft Azure cloud-based solution, re-designing legacy application solutions with minimal changes to run on the cloud platform.
Designed and implemented scalable data pipelines using PySpark for batch and real-time data processing.
Managed source code repositories using Git for version control, ensuring code integrity and streamlined collaboration across teams.
Designed and developed data integration programs in a Hadoop environment with the NoSQL data store Cassandra for data access and analysis, using the Apache Oozie job scheduler to execute workflows.
Worked on Apache Solr for indexing and load-balanced querying to search for specific data in large datasets.
Involved in performance tuning of Spark jobs using caching and taking full advantage of the cluster environment.
Implemented version control and CI/CD practices for Airflow DAGs to ensure consistent deployment and maintainability.
Designed and implemented Sqoop jobs to efficiently transfer large-scale data between Hadoop HDFS and relational databases such as MySQL, Oracle, and SQL Server.
Designed and deployed scalable data models in Cassandra for handling large-scale distributed systems.
Migrated legacy systems to PostgreSQL, enhancing data reliability and reducing operational costs.
Maintained comprehensive documentation for ETL workflows, database schemas, and coding standards for scalability and team collaboration.
Automated data workflows involving MapReduce, Pig, Hive, and HBase using tools like Apache Oozie.
Designed and implemented robust ETL pipelines to ingest, transform, and load data into Snowflake from HDFS, ensuring data quality and consistency.
Designed and implemented schema models in MongoDB.
Designed and implemented robust ETL/ELT pipelines in Azure Data Factory to ingest and transform data into Azure Data Lake and Azure Synapse Analytics.
Integrated Cassandra with streaming platforms like Apache Kafka for real-time data processing.
Developed and maintained HBase-based data storage solutions for high-throughput, low-latency applications.
Environment: Python, Java/Scala, Shell Scripting, Azure Data Factory, Azure Data Lake, Azure Synapse Analytics, Azure SQL Database, Cosmos DB, Amazon Redshift, Google BigQuery, Snowflake, MapReduce, PySpark, Git, CI/CD, MongoDB, HDFS, Cassandra, Apache Kafka
Johnson & Johnson | AWS Data Engineer | Oklahoma City, Oklahoma, USA | Apr 2023 – Nov 2023
Description: Johnson & Johnson (J&J) is a global leader in healthcare and pharmaceuticals. I implemented data pipelines using AWS services such as AWS Glue, Amazon Redshift, S3, and Lambda to streamline the extraction, transformation, and loading (ETL) of large datasets from multiple sources into the data warehouse.
Key Responsibilities:
Developed and optimized ETL workflows in Python and SQL to extract, clean, and load data into data warehouses, ensuring high data quality and consistency.
Designed and implemented efficient ETL pipelines using Python and Scala to extract, transform, and load large datasets into MySQL and PostgreSQL databases.
Automated data ingestion workflows using tools like AWS Lambda, Cloud Functions, and orchestration frameworks such as Apache Airflow or Google Cloud Composer.
Implemented time-series and real-time data storage solutions using Cassandra's wide-column architecture.
Automated infrastructure provisioning and deployment using AWS CloudFormation and Terraform to ensure consistency and scalability.
Automated ETL workflows using tools like Apache Airflow, Informatica, and custom Python scripts, improving data ingestion efficiency.
Optimized MapReduce jobs for performance, reducing processing time.
Used AWS S3 as the storage layer; created and set up a self-hosted integration runtime on virtual machines to access private networks.
Designed and set up an Enterprise Data Lake to enable a variety of use cases covering analytics, processing, storing, and reporting on large amounts of data.
Built visuals and dashboards using the Power BI reporting tool.
Utilized PySpark for ETL processes, including data cleansing, transformation, and enrichment to prepare datasets for analysis.
Configured and maintained Zookeeper clusters to manage distributed systems and ensure synchronization between Kafka, Flume, and other Hadoop components.
Integrated HBase with MapReduce jobs for efficient data processing and retrieval in distributed systems.
Optimized Snowflake queries and workloads, leveraging clustering, partitioning, and caching to improve execution times.
Developed and maintained data models in Snowflake, optimizing schema design for query performance and storage efficiency.
Used AWS EMR to move large volumes of data into other platforms such as AWS data stores, Amazon S3, and Amazon DynamoDB.
Developed AWS Lambda functions using Python and Step Functions to orchestrate data pipelines.
Built and maintained CI/CD pipelines using tools like Jenkins, GitLab CI/CD, or Azure DevOps to automate deployment processes for data engineering projects.
Developed and maintained optimized data models in Redshift and BigQuery to support efficient querying and analytics workflows.
Configured and managed HBase clusters, ensuring optimal performance and reliability.
Integrated diverse data sources and APIs into unified datasets using Python, ensuring seamless data flow across systems.
Optimized MongoDB queries and indexing strategies to enhance database performance and scalability.
Environment: Python, Scala, SQL, MySQL, PostgreSQL, AWS Glue, Amazon Redshift, S3, AWS Lambda, EMR, DynamoDB, Step Functions, CloudFormation, Terraform, Apache Airflow, Google Cloud Composer, Informatica, MapReduce, Zookeeper, Hadoop, HBase, Cassandra, Snowflake, Jenkins, MongoDB
Digit Insurance | GCP Data Engineer | Bengaluru, India | May 2021 – Jul 2023
Description: Digit Insurance is a digital-first general insurance company known for its innovative approach to simplifying insurance products. I worked on and maintained a centralized data warehouse in BigQuery, enabling advanced analytics and business intelligence reporting, and monitored and optimized GCP resource utilization to reduce cloud costs.
Key Responsibilities:
Built distributed data processing systems leveraging Apache Spark and Hadoop ecosystems with Java/Scala for real-time and batch data processing.
Architected and maintained HBase tables for low-latency read/write access to real-time data.
Managed, optimized, and ensured the integrity of relational databases (MySQL, PostgreSQL) to support high-volume data storage and retrieval.
Monitored Zookeeper logs and resolved coordination-related issues to maintain system stability.
Designed and implemented branching strategies (e.g., Git flow) to support multiple parallel development workflows.
Designed, implemented, and maintained scalable data pipelines using GCP services like BigQuery, Cloud Dataflow, Cloud Pub/Sub, and Cloud Storage.
Experienced in building Power BI reports on Azure Analysis Services for better performance.
Built and maintained ETL pipelines to ingest, transform, and load data into NoSQL databases like MongoDB, Cassandra, and HBase.
Developed streaming applications using PySpark to read from Kafka and persist the data to NoSQL databases such as HBase and Cassandra.
Used Stored Procedure, Lookup, Execute Pipeline, Data Flow, Copy Data, and Azure Function activities in ADF.
Worked on Big Data Hadoop cluster implementation and data integration while developing large-scale system software.
Developed and maintained ETL pipelines using Apache Pig to process and transform unstructured and semi-structured data into actionable formats.
Implemented data pipelines to ingest and query data in HBase from sources like Hadoop and Spark.
Tuned Cassandra performance by optimizing compaction strategies, read/write paths, and partition keys.
Developed distributed data processing workflows leveraging Scala and big data frameworks for real-time and batch data processing.
Integrated unit and integration tests into CI/CD pipelines to validate ETL processes and ensure code quality.
Built and managed data lakes using GCP Cloud Storage, enabling secure and scalable storage of structured and unstructured data.
Tuned SQL queries for performance improvement in Redshift and BigQuery, reducing query execution time.
Worked on migration of data from on-prem SQL Server to cloud databases (Azure Synapse Analytics (DW) and Azure SQL DB).
Developed mapping documents to map columns from source to target.
Deployed scalable data pipelines in Azure Data Lake, AWS S3, and EMR environments to handle petabyte-scale data.
Created Azure Data Factory (ADF) pipelines using Azure PolyBase and Azure Blob Storage.
Worked on Python scripting to automate generation of scripts.
Performed data curation using Azure Databricks.
Utilized Azure Databricks to process and analyze large datasets, leveraging Apache Spark for distributed data processing and real-time analytics.
Developed ETL pipelines to integrate data from MongoDB into analytics platforms and data warehouses.
Managed large-scale data storage and processing using HDFS, ensuring efficient data distribution and replication
Environment: Apache Spark, Hadoop, Java/Scala, HBase, MySQL, PostgreSQL, Zookeeper, GCP, BigQuery, Cloud Dataflow, Cloud Pub/Sub, Cloud Storage, Power BI, PySpark, Kafka, Cassandra, Apache Pig, Azure Data Factory, Azure Databricks, MongoDB
Levi Strauss & Co | Data Engineer | Bengaluru, India | Mar 2019 – Apr 2021
Description: Levi Strauss & Co has a strong retail and e-commerce presence, with products sold through Levi’s® stores, third-party retailers, and online platforms. I improved dashboards and reporting by integrating real-time data pipelines with BI tools like Tableau and Power BI, enhancing decision-making speed.
Key Responsibilities:
Created and managed relational and NoSQL databases, writing complex SQL queries and stored procedures to support analytics and reporting needs.
Monitored data workflows using AWS CloudWatch and optimized performance through Athena for serverless querying and cost-effective analytics.
Created and optimized complex SQL queries to enhance database performance and support analytics teams with timely insights.
Developed and maintained Maven-based build pipelines for data engineering projects, automating dependency management and project packaging.
Converted existing AWS infrastructure to a serverless architecture (AWS Lambda, Kinesis), deploying via Terraform and AWS CloudFormation templates.
Developed analytical components using Scala and Spark.
Built visuals and dashboards using the Power BI reporting tool.
Designed and implemented end-to-end ETL pipelines using a combination of Sqoop, Flume, and Kafka for batch and real-time data processing.
Created and optimized Hive queries for data aggregation and analysis, ensuring high performance on massive datasets stored in HDFS.
Designed and maintained optimized Tableau data sources by implementing efficient SQL queries and leveraging advanced data modeling techniques.
Documented data engineering workflows, CI/CD processes, and pipeline architecture for team reference and knowledge sharing.
Monitored and maintained Cassandra clusters to ensure consistent uptime and fault tolerance.
Utilized Hue for querying and analyzing data within Hadoop and Snowflake environments, enabling self-service analytics for business users.
Automated MongoDB backup and recovery processes to ensure data integrity and minimize downtime.
Managed cloud-based data warehouses, ensuring data integrity, scalability, and availability across AWS Redshift and Google Cloud BigQuery platforms.
Performed data migration and integration tasks between HBase and other databases (e.g., Hive, Impala).
Designed and implemented relational and dimensional data models in MySQL and PostgreSQL to meet business analytics needs.
Environment: SQL, NoSQL, AWS CloudWatch, Athena, AWS Lambda, Kinesis, Terraform, CloudFormation, Maven, Sqoop, Flume, Kafka, Hive, Hadoop, HDFS, Hue, Cassandra, Tableau, Power BI, MySQL, PostgreSQL, MongoDB, AWS Redshift, Google Cloud BigQuery
Education
Lindsey Wilson College, Kentucky, USA (2023 – 2024)
Master's in Information Technology Management
Chakrala DEEPAK
Oklahoma, USA
***************@*****.***
Data Engineer
Summary:
Experienced Data Engineer with 5+ years of expertise in designing, building, and optimizing scalable data pipelines, ETL processes, and data warehousing solutions.
Proficient in Python, SQL, Spark, Hadoop, and AWS/GCP/Azure, with hands-on experience in data modeling, CI/CD pipeline automation, and real-time data processing.
Skilled in cloud-based data engineering solutions, including AWS Redshift, GCP BigQuery, and Azure Data Lake.
Proficient in designing and implementing data solutions using Snowflake's cloud data platform, including schema design, data ingestion, and performance tuning.
Expertise in managing and analyzing large-scale datasets using distributed systems and tools like Apache Spark, Kafka, and Hive.
Track record of optimizing data pipeline performance and reducing processing times.
Strong collaboration with data scientists, analysts, and stakeholders to deliver end-to-end data solutions aligned with business goals.
Knowledge of data governance, security protocols, and compliance standards like GDPR and HIPAA.
Experienced in building modern web applications with Python, Flask, and Angular.
Experienced in installing, configuring, modifying, testing, and deploying applications with Apache.
Expert in developing web-based applications using PHP, XML, CSS3, HTML5, DHTML, XHTML, JavaScript, and DOM scripting.
Good knowledge of web services with SOAP and REST protocols.
Hands-on experience with programming languages such as Python and SAS.
Extensive experience in data mining solutions for various business problems and generating data visualizations using Tableau, Power BI, Birst, and Alteryx.
Strong understanding of data warehousing principles, fact tables, dimension tables, and star and snowflake schema modeling.
Experience working with business intelligence and data warehouse software including SSAS/SSRS/SSIS, Business Objects, Amazon Redshift, Azure Data Warehouse, and Teradata.
Working experience in data analysis using Python libraries like NumPy, Pandas, and SciPy and visualization libraries like Seaborn and Matplotlib.
Worked on big data technologies like Hadoop/HDFS, Spark, Scala, MapReduce, Pig, Hive, and Sqoop to extract and load data from heterogeneous sources such as Oracle, flat files, XML, and other streaming sources into the EDW and transform it for analysis (ETL/ELT).
Experienced in MVW frameworks like Django, AngularJS, JavaScript, Backbone.js, jQuery, and Node.
Skills:
Programming Languages: Python, Java/Scala, SQL, R, Shell Scripting
Databases: MySQL, PostgreSQL, Oracle, Microsoft SQL Server, MongoDB, Cassandra, DynamoDB, Snowflake, Amazon Redshift, Google BigQuery, Azure Synapse.
Big Data Frameworks: Apache Hadoop, Apache Spark, Hive, Pig, Flink
ETL and Data Integration Tools: Apache NiFi, Talend, Informatica, Airflow, AWS Glue, Microsoft SSIS
Cloud Platforms: Amazon Web Services (AWS): S3, Redshift, Glue, Lambda, EMR; Google Cloud Platform (GCP): BigQuery, Dataflow, Dataproc; Microsoft Azure: Azure Data Factory, Synapse, Blob Storage.
DevOps and CI/CD: Docker, Kubernetes, Jenkins, Terraform, Ansible
Data Visualization: Tableau, Power BI, Looker, Superset
Machine Learning: TensorFlow, PyTorch, Scikit-learn, MLflow
Objective:
Seasoned data engineer with 5+ years of experience in data architecture and team management, aiming to lead cross-functional teams in designing end-to-end data solutions that drive operational excellence and strategic growth.