
Data Engineer Machine Learning

Location:
Irving, TX
Posted:
September 14, 2025


AKHIL NALABOLU

Email: ************@*****.*** Phone: +1-214-***-****

PROFESSIONAL SUMMARY

Results-driven Data Engineer with 4+ years of experience designing and optimizing cloud-based big data pipelines across AWS, Azure, and GCP. Expert in ETL/ELT workflows, real-time streaming architectures, and data lakehouse solutions using Apache Spark, Kafka, Snowflake, and Databricks. Skilled in Python, SQL, and PySpark for large-scale transformations and advanced analytics. Strong background in machine learning integration, orchestration, and performance tuning, delivering up to 30% faster processing speeds and enabling real-time, data-driven decision-making. Proven track record of cloud migrations, cost optimization, and compliance with GDPR and CCPA standards.

TECHNICAL SKILLS

Hadoop Ecosystem: HDFS, Hive, Pig, YARN, Spark, Spark SQL, MapReduce, Kafka, Sqoop, Delta Lake, Iceberg

Data Processing & Analytics: Apache Spark, Spark MLlib, Spark Streaming, Spark GraphX, dbt (Data Build Tool), Apache Flink

Programming Languages: Python, Scala, SQL, PL/SQL, UNIX Shell Scripting, PySpark

Databases (RDBMS & NoSQL): Teradata, Oracle, DB2, SQL Server, MySQL, MongoDB, Cassandra, HBase, Elasticsearch

Cloud Databases: Amazon Redshift, Snowflake, PostgreSQL, Google BigQuery, Azure Synapse Analytics

Cloud Platforms: AWS (EC2, S3, Glue, Lambda, IAM), Azure (Data Factory, Databricks), GCP (BigQuery, Dataflow)

ETL & ELT Tools: IBM InfoSphere, SQL Server Integration Services (SSIS), Apache Sqoop, Fivetran, Matillion

Workflow Orchestration: Apache Airflow, AWS Glue, Azure Data Factory, Dagster

Machine Learning Models: Linear/Logistic Regression, Naïve Bayes, Decision Trees, Random Forest, KNN, SVM, Gradient Boosting, PCA, LDA, Time Series Analysis

Deep Learning & NLP: TensorFlow, Keras, CNN, RNN, NLP (spaCy, Gensim), Hugging Face Transformers

Visualization & BI Tools: Tableau, Power BI, Matplotlib, Seaborn, Looker

Development Environments: PyCharm, Jupyter Notebook, Visual Studio, VS Code

Version Control & CI/CD: Git, GitHub, JIRA, GitLab, Jenkins, Bitbucket Pipelines

Containerization & Deployment: Docker, Kubernetes, Terraform

Operating Systems: Linux, Unix, Windows, macOS

PROFESSIONAL EXPERIENCE

Goldman Sachs Global Jan 2024 – Present

Data Engineer

Designed and deployed scalable ETL/ELT pipelines ingesting multi-terabyte datasets into Snowflake and Redshift, ensuring 99.9% data accuracy.

Built real-time streaming solutions using Apache Kafka and Spark Streaming, processing 2M+ daily transactions for fraud detection and anomaly monitoring.

Automated complex workflows in Apache Airflow and Azure Data Factory, reducing operational overhead by 40%.

Developed ML pipelines in Spark MLlib and TensorFlow, improving predictive model accuracy for customer churn by 12%.

Leveraged Azure Databricks Delta Lake to handle slowly changing dimensions, improving historical accuracy in BI reporting.

Implemented dbt transformation layers in Snowflake, enabling self-service analytics for 20+ business teams.

Created CI/CD automation via Jenkins and Git, reducing deployment cycles by 25% and introducing automated testing gates.

Built real-time Power BI dashboards that aggregated KPIs across business lines, cutting decision-making time from hours to minutes.

Partnered with InfoSec to integrate GDPR/CCPA compliance measures into pipelines, achieving a 100% audit pass rate.

Optimized Spark cluster resource allocation, reducing compute costs by 20% through intelligent job parallelization.

Coforge Aug 2019 – Jul 2022

Data Engineer

Engineered AWS Redshift pipelines to integrate data from ERP, CRM, and streaming sources, improving BI data availability by 35%.

Designed and optimized T-SQL stored procedures and partitioned tables to handle billions of records efficiently.

Built data lakes on AWS S3 and Azure Data Lake, enabling scalable storage for unstructured, semi-structured, and structured datasets.

Integrated Apache Sqoop to migrate high-volume data from on-premises Oracle DB to cloud storage.

Collaborated with data scientists to operationalize ML models in Spark MLlib, reducing fraudulent transaction rates by 15%.

Led on-prem to AWS migration of 50+ data pipelines, achieving a 20% cost reduction and better scalability.

Tuned Spark jobs using Dynamic Resource Allocation, reducing job runtimes by 30%.

Implemented Airflow DAG monitoring with automated alerts, improving incident response time by 45%.

Established version-controlled SQL and ETL scripts in Git for better change tracking and rollback.

Configured AWS IAM roles and encryption to meet enterprise security standards and compliance frameworks.

EDUCATION

Master of Science in Computer Science – University of Central Missouri, USA

Bachelor of Technology in Computer Science – Vignan’s Institute of Science and Technology, India

PROJECT HIGHLIGHTS

Real-Time Fraud Detection Platform

Architected a Kafka–Spark Streaming pipeline processing 500K+ transactions/hour with ML-driven risk scoring in under 2 seconds.

Cloud Data Lake Modernization

Migrated 15 TB of structured and unstructured data to AWS S3 with Glue crawlers, reducing retrieval latency by 40%.
