Yashwanth Venkat
+1-910-***-**** **************@*****.*** http://www.linkedin.com/in/yashwanth-venkat-g-1b5828173
PROFESSIONAL SUMMARY:
Over 7 years of experience in data engineering, skilled in designing, developing, and optimizing high-scale, mission-critical data ecosystems, with expertise in big data engineering, cloud data platforms, and advanced data pipelines across multi-cloud and on-premises environments.
Extensive experience with Hadoop, Apache Spark, Databricks, and Snowflake, optimizing petabyte-scale distributed data processing workloads and leveraging Spark tuning, parallelization strategies, and advanced memory management techniques.
Adept with AWS EMR, EC2, S3, Lambda, Athena, Glue, RDS, Redshift, and DynamoDB, building serverless big data architectures and auto-scaling processing pipelines.
Proficient in Azure Data Lake Storage, Azure Data Factory, Synapse, and Azure SQL Server, designing high-performance ELT/ETL solutions leveraging PolyBase, Dataflows, and Spark pools.
Hands-on experience with Apache Beam, DataProc, BigQuery, Dataflow, and Composer, developing low-latency, streaming analytics solutions with real-time event processing.
Well-versed in Python, Scala, and Java, developing highly parallelized ETL/ELT frameworks, building custom ML-based anomaly detection models, and optimizing multi-threaded processing logic in Spark and Databricks.
Demonstrated mastery of SQL, PL/SQL, Hive, MySQL, DB2, PostgreSQL, and Oracle, optimizing terabyte-scale queries and implementing advanced indexing, partitioning, materialized views, and columnstore strategies for real-time, high-performance querying.
Skilled in version control using Git, GitLab, GitHub, SVN, and CVS, implementing best-practice branching strategies such as GitFlow and trunk-based development, designing containerized big data pipelines with Docker and Kubernetes, and automating CI/CD workflows using Jenkins, GitHub Actions, and Terraform for seamless multi-cloud deployments.
Designed and implemented ML-driven data pipelines, integrating Scikit-learn, TensorFlow, and PyTorch into big data ecosystems, automating anomaly detection, forecasting, and entity resolution at scale.
Proficient in Apache Airflow, Oozie, Azkaban, and AWS Step Functions, designing dynamic DAG-based orchestration workflows with intelligent retry mechanisms and SLA-driven automation.
Expertise in the management, deployment, and optimization of Hadoop clusters built on Cloudera and Hortonworks, enhancing HDFS/YARN configurations, ensuring high-availability setups, and implementing robust Kerberos security protocols.
Architected and implemented multi-layered data architectures with Kimball and Inmon methodologies, utilizing Erwin for data modeling and optimizing schema evolution strategies across cloud data warehouses.
Deep expertise in multi-format data processing, working extensively with CSV, Parquet, ORC, Avro, JSON, XML, and raw text across distributed computing environments, optimizing serialization & compression strategies.
Experience in designing and developing executive-level, real-time dashboards using Tableau, Power BI, and Looker, integrating multi-source analytics pipelines and embedding custom DAX/MDX calculations for business intelligence.
Expert in constructing high-performance OLAP cubes using Kyvos, Apache Druid, Azure Analysis Services, and SSAS, supporting multi-dimensional, ad-hoc analytics for enterprise-scale reporting.
Mastery of JIRA, ServiceNow, Rally, and TFS, implementing agile sprint planning, backlog grooming, and DevOps-driven incident management for high-availability systems.
TECHNICAL SKILLS:
Big Data Frameworks: Apache Spark, Hive, Sqoop, Kafka, Hadoop, MapReduce, HDFS, Pig, Oozie, Apache Airflow
Languages: Python, Shell Scripting, PySpark, HiveQL, Scala, Java
Visualization: Power BI, Tableau, Looker
Databases: Microsoft SQL Server, PostgreSQL, MySQL, Oracle, DB2
Data Warehouses: Snowflake, Teradata
Cloud:
Azure: Azure Data Lake, Azure Data Factory, Azure Databricks, Azure SQL Database, Azure Synapse, Azure Analysis Services, Azure Blob Storage, Azure Monitor, Azure Key Vault, Azure HDInsight, Azure Data Explorer
AWS: S3, EC2, Redshift, Redis, EMR, Kinesis, Lambda, Step Functions, Glue, Athena, DynamoDB, Elasticsearch, Service Catalog, CloudWatch, IAM, SageMaker, RDS, API Gateway, AWS Glue ML Transforms
GCP: Google Cloud Storage, Dataflow, BigQuery, DataProc, Pub/Sub, Composer, AI Platform, Cloud Functions, Dataplex, Data Catalog
IDEs: PyCharm, Jupyter Notebooks, Visual Studio Code, IntelliJ IDEA, Eclipse
Web Programming: HTML, CSS, JavaScript, Angular, Flask, Django
Workflow Automation: Apache Airflow, Oozie, Step Functions, Prefect, Azkaban, Apache NiFi
Security & Access Control: IAM, OAuth, JWT, Apache Ranger, Kerberos, SSL/TLS Encryption
Machine Learning & AI: Scikit-learn, TensorFlow, PyTorch, MLlib, SageMaker, AutoML
PROFESSIONAL EXPERIENCE:
Freddie Mac May 2023 - Present
Senior Data Engineer
Responsibilities:
Architected and engineered highly scalable ETL/ELT data pipelines integrating diverse data sources such as SQL Server, Oracle, web analytics, and APIs to enable real-time insights and analytics at enterprise scale.
Designed and optimized high-performance Databricks architectures, utilizing Spark’s distributed computing capabilities to process petabyte-scale datasets efficiently and reduce query execution time.
Developed and maintained Informatica PowerCenter workflows for enterprise-grade data integration, performing complex transformations, data cleansing, and job orchestration across on-premises and cloud platforms.
Automated complex workflows in Azure Databricks, orchestrating SQL- and Python-based transformations with Databricks Jobs, Delta Live Tables, and Apache Airflow, ensuring fault-tolerant data processing pipelines.
Configured Azure Data Lake Storage Gen2 for high-throughput data streaming, managing structured and semi-structured datasets in CSV, Parquet, and Delta formats, and balancing storage costs and access speeds for high-frequency analytics workloads.
Engineered and optimized Oracle SQL and PL/SQL procedures, implementing parallel execution strategies and query tuning techniques, and using DBMS_SCHEDULER to automate batch processing.
Pioneered the migration of a mission-critical enterprise data warehouse from Oracle to Azure SQL Data Warehouse, designing and implementing conceptual, logical, and physical data models using star and snowflake schemas. Applied distributed processing, columnstore indexing, and partitioning strategies to optimize reporting and support near real-time dashboarding.
Designed real-time streaming architectures using Azure Event Hubs, Kafka, and Spark Structured Streaming, enabling near-instantaneous processing of clickstream and customer engagement data.
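A minimal PySpark sketch of this streaming pattern, assuming a Kafka-compatible endpoint and Delta Lake output; the broker address, topic, event schema, and paths are hypothetical placeholders:

# Illustrative sketch only: consumes clickstream events from a Kafka-compatible
# endpoint with Spark Structured Streaming and appends them to a Delta table.
# Broker, topic, schema, and paths are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("clickstream-streaming").getOrCreate()

event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("page", StringType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
       .option("subscribe", "clickstream")                  # placeholder topic
       .option("startingOffsets", "latest")
       .load())

events = (raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
             .select("e.*"))

query = (events.writeStream.format("delta")
         .option("checkpointLocation", "/mnt/checkpoints/clickstream")  # placeholder path
         .outputMode("append")
         .start("/mnt/delta/clickstream"))                              # placeholder path
query.awaitTermination()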
Developed highly interactive, self-service Tableau dashboards for C-level executives, integrating data from Google BigQuery, Azure SQL Data Warehouse, and on-premises databases, empowering data-driven business decisions.
Profiled and tuned ETL workflow performance, identifying bottlenecks and optimizing cluster configurations in Databricks, tuning Spark shuffle behavior, caching, partitioning, and DAG execution, resulting in a reduction in processing time.
Built an Alteryx-based data validation framework to automate multi-system cross-checking between disparate datasets, flagging inconsistencies and ensuring high data quality, and publishing insights to Power BI Server.
Enabled multi-cloud data strategies, integrating Azure, Google BigQuery, and third-party API-based data sources into a unified data warehouse while implementing cost-efficient cloud storage and query execution strategies.
Designed, developed, and optimized data structures, SQL queries, and automated pipelines for migrating legacy systems to Snowflake Data Warehouse, integrating Airflow, AWS S3, Web APIs, and Teradata for seamless data processing and reporting.
Designed and enforced data governance, security, and compliance policies across cloud data lakes, ensuring adherence to GDPR, CCPA, and enterprise security standards for sensitive data handling.
Engineered high-performance OLAP cubes in Azure Analysis Services, optimizing complex aggregations for enterprise reporting, ensuring real-time insights for sales, marketing, and finance teams.
Utilized J2EE tools to build and maintain enterprise-grade web applications, ensuring seamless integration with backend data processing pipelines and enhancing the scalability and performance of data-driven services.
Developed and published custom Python packages to PyPI, ensuring seamless distribution and easy installation across teams and projects. Implemented version control for packages, facilitating collaborative development, promoting code reusability, and minimizing duplicated effort by enabling consistent access to shared functionality.
Environment: SQL Server, Oracle, PL/SQL, Google BigQuery, Apache Spark, Delta Lake, Databricks, Azure Data Factory, Apache Airflow, Azure Event Hubs, Kafka, Azure SQL Data Warehouse, Azure Analysis Services, Alteryx, Power BI Server, CI/CD pipelines, Snowflake, Teradata.
Fidelity Investments Jan 2021 – Dec 2022
Senior AWS Data Engineer
Responsibilities:
Architected and optimized large-scale distributed data processing workflows using Apache Spark, Hadoop, and Hive integrated with AWS S3 for scalable storage, handling terabytes of structured and unstructured data. This infrastructure was optimized with EMR to improve processing efficiency in an enterprise environment.
Developed highly optimized Spark applications using PySpark, DataFrames, RDDs, and SparkSQL, tuning performance with strategies such as broadcast joins, partitioning, and caching. Utilized AWS Glue for efficient ETL processes, reducing query execution times.
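An illustrative PySpark fragment of these tuning strategies, showing a broadcast join of a small dimension table, repartitioning on the aggregation key, and caching a reused DataFrame; dataset names and S3 paths are hypothetical:

# Illustrative tuning sketch: broadcast join, repartitioning, and caching.
# Dataset names and S3 paths are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("etl-tuning").getOrCreate()

transactions = spark.read.parquet("s3://example-bucket/transactions/")  # large fact dataset
accounts = spark.read.parquet("s3://example-bucket/accounts/")          # small dimension table

# Broadcasting the small table avoids shuffling the large side of the join.
enriched = transactions.join(broadcast(accounts), "account_id")

# Repartition on the downstream aggregation key and cache the reused DataFrame.
enriched = enriched.repartition(200, "account_id").cache()

daily_totals = enriched.groupBy("account_id", "txn_date").sum("amount")
daily_totals.write.mode("overwrite").parquet("s3://example-bucket/daily_totals/")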
Engineered robust ETL pipelines, integrating real-time and batch data ingestion from multiple data sources using Kafka, Sqoop, AWS Lambda, and Spark Streaming to ensure reliable data flow into AWS S3, Hive, and AWS Redshift.
Designed and implemented schema evolution and data modeling strategies, including conceptual and physical models using Hive external tables, star schemas, and optimized formats like ORC/Parquet. Enabled scalable querying of S3-stored datasets through AWS Athena for faster BI and reporting workloads, supporting evolving business needs.
Developed advanced data validation and enrichment frameworks in PySpark using custom UDFs and AWS Glue ML Transforms to incorporate machine learning-based anomaly detection to enhance data quality and reduce inconsistencies across multiple data sources.
Created custom data processing pipelines to extract, transform, and load high-velocity consumer response data into AWS S3 and Hive, optimizing storage and query execution strategies to support real-time and batch analytical workloads. Utilized AWS Redshift for further data aggregation and reporting.
Organized and automated complex workflows using AWS Step Functions and Apache Airflow, building and scheduling dynamic DAGs that adapt to dependencies, ensuring fault tolerance, retry mechanisms, and SLA-driven data processing.
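A minimal Apache Airflow sketch of this retry- and SLA-driven orchestration pattern; the DAG id, schedule, and task callables are hypothetical placeholders:

# Illustrative Airflow DAG sketch with retries, retry delay, and a task-level SLA.
# DAG id, schedule, and task logic are hypothetical placeholders.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extract step (placeholder)")

def transform():
    print("transform step (placeholder)")

default_args = {
    "owner": "data-eng",
    "retries": 3,                        # automatic retry mechanism
    "retry_delay": timedelta(minutes=5),
    "sla": timedelta(hours=1),           # SLA-driven alerting
}

with DAG(
    dag_id="example_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task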
Designed optimized data extraction strategies from legacy systems with Sqoop and AWS RDS, improving ingestion speeds through direct JDBC optimizations and split-by tuning; AWS DynamoDB was also utilized for low-latency NoSQL storage.
Developed dynamic Python-based ETL frameworks that adapt to schema changes, utilizing pandas, NumPy, and Spark DataFrames for high-speed data cleansing, transformation, and aggregation. Utilized Elasticsearch for fast search queries on processed datasets.
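A small pandas sketch of the schema-adaptive cleansing idea; the expected schema and input file are hypothetical:

# Illustrative sketch: normalize an incoming file against an expected schema,
# adding missing columns and dropping unexpected ones before aggregation.
# The expected schema and file path are hypothetical placeholders.
import pandas as pd

EXPECTED_SCHEMA = {"customer_id": "string", "amount": "float64", "event_date": "datetime64[ns]"}

def normalize(df: pd.DataFrame) -> pd.DataFrame:
    for col_name, dtype in EXPECTED_SCHEMA.items():
        if col_name not in df.columns:
            df[col_name] = pd.Series(dtype=dtype)   # tolerate missing columns
    df = df[list(EXPECTED_SCHEMA)]                  # drop unexpected columns
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce").fillna(0.0)
    df["event_date"] = pd.to_datetime(df["event_date"], errors="coerce")
    return df

raw = pd.read_csv("incoming_batch.csv")             # placeholder input file
clean = normalize(raw)
summary = clean.groupby("customer_id")["amount"].sum()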
Implemented complex Hive queries involving multi-level joins, window functions, and dynamic partition pruning. These optimizations enabled real-time analytics, with AWS Athena used for querying large datasets stored in S3, drastically reducing reporting times for business stakeholders.
Devised a metadata-driven ingestion pipeline to auto-discover and register datasets in AWS Glue and Hive, enabling self-service analytics. This approach reduced manual data engineering efforts and improved scalability of data cataloging.
Optimized cluster configurations and resource allocation in Spark and Hadoop on AWS EMR, fine-tuning YARN resource management, driver/executor tuning, and shuffle optimizations. This resulted in a reduction in infrastructure costs, and real-time analytics were powered using Kinesis and AWS Lambda.
Built Alteryx-based data validation frameworks, automating multi-system reconciliation processes between disparate datasets, flagging inconsistencies, and ensuring high data quality, publishing insights to Tableau Server.
Environment: PySpark, Hadoop, Hive, Spark Streaming, Kafka, Airflow, HDFS, ORC/Parquet, Sqoop, Hive UDFs, MySQL, RDBMS, Linux, Python, YARN, Apache Ranger, Databricks, MLlib, Snowflake, Teradata, AWS S3, EC2, Redshift, Redis, EMR, Kinesis, Lambda, Step Functions, Glue, Athena, DynamoDB, Elasticsearch, Service Catalog, CloudWatch, IAM, SageMaker, RDS, API Gateway, Tableau.
Capital One Jan 2019 – Jan 2021
Big Data Engineer
Responsibilities:
Built and deployed multi-node Hadoop clusters from scratch, configuring HDFS, YARN, MapReduce, Hive, and HBase for high-performance distributed data processing.
Developed and optimized highly parallelized Java-based MapReduce programs, implementing custom partitioners, combiners, and in-memory caching strategies to significantly reduce computation time for complex data cleaning and preprocessing pipelines.
Designed and implemented Hadoop-based ETL pipelines, utilizing Hive, Pig, and Sqoop to efficiently transform and load terabyte-scale datasets between HDFS, relational databases, and cloud storage platforms.
Supervised Hadoop ecosystem performance tuning, optimizing block sizes, memory allocation, compression codecs such as Snappy and LZO, and file formats such as ORC and Parquet, reducing query execution times.
Engineered a distributed, fault-tolerant data ingestion framework, automating data imports and exports between HDFS, Hive, and RDBMS using Sqoop, Kafka, and Spark, ensuring seamless data movement across enterprise systems.
Designed custom Hive UDFs in Java, extending Hive’s capabilities for advanced text processing, fuzzy matching, and analytics at scale.
Managed Hadoop job workflows using Apache Oozie and Airflow, automating complex dependencies, job retries, and error handling mechanisms for critical data processing tasks.
Configured and managed Cloudera-based Hadoop infrastructure, overseeing multi-node cluster deployments, upgrades, access control policies like Kerberos, Ranger, and high-availability setups, ensuring system uptime.
Developed and optimized HBase schema designs, implementing row key modeling, Bloom filters, and coprocessors to support millisecond-level random access queries on billions of records.
Utilized Zookeeper for distributed coordination and leader election mechanisms, ensuring cluster stability and consistent metadata synchronization across services like HBase, Kafka, and Oozie.
Designed near real-time data processing pipelines using Kafka and Spark Streaming, enabling low-latency analytics for clickstream, IoT, and log-processing use cases.
Implemented advanced cluster monitoring and logging solutions, integrating Grafana, Prometheus and ELK Stack to proactively detect performance bottlenecks and system failures.
Managed data governance and compliance initiatives, implementing fine-grained access controls (Apache Ranger, Sentry) and encryption strategies to secure sensitive and PII data.
Built self-service analytical environments for business users, integrating Hive, Presto, and BI tools to enable ad-hoc querying of massive datasets.
Environment: Java, Python, SQL, Hadoop HDFS, Hadoop YARN, Hadoop MapReduce, Apache Spark, Apache Hive, Apache Pig, Apache HBase, Apache Kafka, Apache Sqoop, Apache Oozie, Apache Airflow, Cloudera, Kerberos, Apache Ranger, Apache Sentry, Apache Zookeeper, ORC, Parquet, Snappy, LZO, Grafana, Prometheus, ELK Stack (Elasticsearch, Logstash, Kibana), AWS S3, AWS Athena, AWS EC2, AWS Auto Scaling, Azure, Presto, Tableau.
Tata Consultancy Services Jan 2017 – Jan 2019
Python Developer
Responsibilities:
Involved in the full project lifecycle, from design and development to deployment, testing, implementation, and post-production support, ensuring seamless application delivery and system performance.
Developed web-based applications using PHP, XML, JSON, and the MVC3 architecture, writing Python scripts to automate database updates, content management, and file manipulations for enhanced operational efficiency.
Built robust database models, APIs, and views using Python, facilitating smooth integration and the creation of an interactive, efficient web application.
Built and developed the presentation layer of web applications, utilizing HTML, CSS, JavaScript, jQuery, AJAX, and Bootstrap to create responsive, dynamic, and user-friendly interfaces.
Utilized PyPI to manage Python dependencies, optimizing package installations and ensuring compatibility across diverse environments, improving project efficiency and reducing setup time.
Developed XML schema documents and implemented frameworks for parsing XML data, ensuring the successful exchange of structured data between systems and applications.
Utilized Python for processing XML and JSON data, implementing core business logic and enhancing data exchange, contributing to the overall automation and efficiency of the web application.
Managed large datasets using pandas DataFrames and MySQL, integrating with multiple relational databases including MySQL, Oracle, and PostgreSQL, and contributed to the development of REST APIs with Flask, SQLAlchemy, and OpenStack API integration, while automating deployment using Jenkins and Git within Agile SCRUM methodologies.
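A minimal Flask sketch of this REST pattern, assuming the Flask-SQLAlchemy extension; the model, connection string, and route are hypothetical placeholders:

# Illustrative Flask + SQLAlchemy REST sketch; model, connection string, and
# route are hypothetical placeholders.
from flask import Flask, jsonify
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "sqlite:///example.db"  # placeholder connection string
db = SQLAlchemy(app)

class Customer(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String(120), nullable=False)

@app.route("/customers", methods=["GET"])
def list_customers():
    customers = Customer.query.all()
    return jsonify([{"id": c.id, "name": c.name} for c in customers])

if __name__ == "__main__":
    with app.app_context():
        db.create_all()
    app.run(debug=True)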
Environment: Python, XML, MySQL, Apache, HTML, CSS, JavaScript, Shell Scripts, Oracle, PostgreSQL, REST, SOAP, JSON, Django.