Venkata Sai
Email: *********@*****.***
Mobile: +1-216-***-****
Data Engineer
PROFESSIONAL SUMMARY:
Data Engineer with 5+ years of experience designing ETL pipelines and cloud data solutions across diverse business applications, applying analytical thinking and attention to detail to solve complex data problems.
Built scalable batch and streaming pipelines with Apache Spark and SQL to support cross-functional reporting and analytics, maintaining an end-to-end view of data flows across applications.
Designed real-time data processing systems and communicated results to both technical and non-technical audiences, from detailed technical deep dives to executive summaries.
Implemented cloud-native data platforms on Google Cloud, using strong SQL and PL/SQL skills to write and analyze complex queries and stored procedures for data analysis.
Developed reusable Python frameworks with GCP SDKs for ingestion from REST APIs, databases, and flat files, and documented them using the Microsoft Office suite.
Engineered BigQuery data models optimized for cost and performance, drawing on experience with Oracle Exadata and Oracle 10g and above for efficient storage and retrieval in large databases.
Created secure service accounts, IAM roles, and VPC networks in GCP, working in Agile/Scrum teams to prioritize work and align resource assignments with project goals.
Built Airflow DAGs in Cloud Composer to orchestrate data pipelines, acting as a strong team player who can influence and guide the team in a collaborative environment.
Managed machine learning feature pipelines with Vertex AI, working independently with minimal supervision while asking questions and seeking assistance when needed.
Used Terraform and GitHub Actions to deploy infrastructure and pipelines as code, managing priorities across multiple concurrent projects to meet deadlines and deliverables.
Designed loosely coupled ingestion architectures using Cloud Functions, GCS triggers, and BigQuery, and contributed effort and cost estimates for project planning and resource allocation.
Used Dataflow and Apache Beam to unify batch and streaming pipelines, adapting communication style to suit technical or business audiences across the organization.
Implemented cost optimization strategies by monitoring query usage, applying storage lifecycle policies, and scheduling resources to improve efficiency.
Delivered high-quality datasets to stakeholders through BigQuery and Looker, providing consistent, trusted, and timely data for decision-making.
Ensured secure cloud-to-on-prem connectivity using GCP VPN and service accounts across connected applications.
Partnered with cross-functional teams to gather data requirements and translate business needs into production-grade data engineering solutions.
Contributed to data governance and documentation by standardizing schemas, tagging assets, and recording data lineage.
TECHNICAL SKILLS:
Cloud Platforms - Google Cloud Platform (BigQuery, Dataflow, Composer, GCS, Pub/Sub), Microsoft Azure, AWS
Big Data - Apache Spark, Apache Kafka, Apache Beam, Hive, Hadoop, Databricks
Languages - Python, SQL, Java, Scala, Bash, PL/SQL
Data Warehousing - BigQuery, Snowflake, Azure Synapse, Oracle Exadata
ETL & Orchestration - Apache Airflow, GCP Dataflow, Azure Data Factory
Version Control & CI/CD - Git, GitHub, GitLab, Terraform, Cloud Build, Jenkins
Visualization - Power BI, Tableau, Looker
Other Tools - Apache NiFi, Cloud Functions, GCP IAM, Microsoft Office
Databases - Oracle 10g
Methodologies - Agile, Scrum
PROFESSIONAL EXPERIENCE:
Bank of America January 2025 – Present
Data Engineer
Responsibilities:
Designed and implemented high-throughput ETL pipelines using Apache Spark and GCP Dataflow, maintaining an end-to-end view across source applications to ensure data quality and accuracy.
Integrated diverse data sources, including transaction logs and third-party APIs, writing and tuning complex SQL and PL/SQL queries and stored procedures to consolidate data for fraud detection and regulatory reporting.
Migrated legacy data workflows into BigQuery, reducing ETL runtimes by 40% while improving data quality and operational SLA compliance.
Applied BigQuery optimization techniques such as clustering and partitioning, significantly improving query performance and GCP billing efficiency.
Automated infrastructure provisioning with Terraform for GCP resources, ensuring reproducibility and scalability while minimizing manual error.
Scheduled and monitored over 100 production-grade pipelines using Airflow on Cloud Composer, with built-in error handling and SLA alerts.
Designed and maintained reusable Python libraries for data transformations, enabling code reuse and standardization across engineering teams and improving development velocity.
Developed Looker dashboards backed by BigQuery, giving business stakeholders real-time insight into transaction patterns and financial health metrics.
Partnered with QA, analysts, and auditors in an Agile/Scrum setting to validate pipeline outputs, ensuring consistent data delivery and building confidence across finance, risk, and compliance teams.
Pfizer June 2021 – April 2023
Data Engineer
Responsibilities:
Designed scalable data pipelines using Azure Data Factory, Python, and Spark on Databricks to ingest and process structured and unstructured clinical trial data.
Automated ingestion for diverse research datasets, applying schema evolution logic and metadata tagging to support traceability and regulatory reporting.
Collaborated with QA and compliance teams to implement data pipelines aligned with FDA 21 CFR Part 11 guidelines, ensuring traceable and validated data movement for R&D analytics.
Leveraged Delta Lake on Databricks to maintain ACID-compliant, version-controlled raw and curated datasets, supporting reproducibility and lineage tracking for clinical study data.
Built parameterized, reusable ADF pipeline templates and Databricks notebooks, standardizing ingestion and transformation logic across scientific data domains.
Created domain-specific data models in Azure Synapse Analytics, collaborating with medical experts to represent drug efficacy metrics and trial outcomes accurately.
Enabled secure access to cloud resources by integrating Azure Key Vault secrets into all data pipelines, protecting credentials and sensitive metadata.
Developed Power BI dashboards for research scientists, visualizing patient enrollment trends and trial KPIs in near real time.
Participated in Agile ceremonies, including sprint planning, stand-ups, and retrospectives, ensuring timely delivery of data assets that supported product and trial decision-making.
Goldman Sachs August 2018 – May 2021
Data Engineer
Responsibilities:
Built scalable Spark- and Kafka-based data pipelines for risk and compliance analytics, processing high-frequency trade data across batch and micro-batch architectures.
Developed reusable Scala ingestion modules to support real-time and intraday data loads, reducing engineering overhead and speeding up onboarding of new market data feeds.
Collaborated with risk analysts to define data models for VaR calculations and financial reporting, ensuring alignment with internal risk frameworks and regulatory requirements.
Migrated legacy Oracle PL/SQL pipelines to Hive and Spark, leveraging distributed processing to improve pipeline throughput and maintainability.
Integrated multiple data sources, including SFTP feeds and Kafka streams, into unified workflows supporting enterprise-wide financial analytics.
Applied advanced Spark SQL techniques for joins and aggregations on large datasets, optimizing memory usage and ensuring sub-second query response times.
Tuned Spark executor and memory configurations, improving performance and cluster stability while lowering resource consumption across scheduled data jobs.
Implemented schema evolution strategies and source format handling to build resilient ingestion pipelines that adapt to upstream structure changes.
Used GitLab CI/CD pipelines for automated code deployment and version control, ensuring quality and consistency across development environments.
Maintained Hive metadata catalogs for analytics teams, enabling consistent data discovery and improving data governance across risk and finance domains.
EDUCATION:
Master of Science in Information Systems - Indiana Institute of Technology
Bachelor of Technology in Civil Engineering - Marri Laxman Reddy Institute of Technology and Management, India