Yash Chowdary
Data Engineer
Email: *.*****@*******.***
Ph: +1-201-***-****
LinkedIn: https://www.linkedin.com/in/yashchowdary/
Professional Summary
Data Engineer with 5+ years of experience designing scalable data architectures, optimizing cloud-based ETL workflows, and implementing high-performance data solutions. Specialized in the Azure, AWS, and GCP ecosystems, leveraging Databricks, ADF, Redshift, and Snowflake to drive real-time analytics, big data processing, and cloud migrations. Adept at building robust data pipelines, reducing query times, and enhancing system efficiency to support business intelligence and decision-making.
Summary:
• Rich experience in Azure, AWS, and GCP, including services such as Azure Data Factory, AWS Glue, EMR, Lambda, and GCP BigQuery.
• Developed complex ETL pipelines using Apache Spark (PySpark/Scala), Snowflake, and Python on terabyte-scale datasets across multi-cloud platforms.
• Strong background in data warehousing and dimensional modeling (SCD2) with hands-on experience using Snowflake, Redshift, and Azure Synapse Analytics.
• Automated ML pipelines and CI/CD workflows using Terraform, Jenkins, Docker, Kubernetes, and Apache Airflow (a minimal DAG sketch follows this list).
• Deployed real-time monitoring and logging with AWS CloudWatch, configuring alarms and logs on EC2, RDS, Lambda, and Glue to reduce downtime.
• Integrated data sources such as Oracle databases and APIs, using Pandas, SQL, and Python for data extraction, transformation, loading, and analysis.
• Created and managed Power BI dashboards and semantic models; skilled in Tableau and Amazon QuickSight for presenting insights through interactive visualizations.
• Hands-on experience with SnowSQL, Snowpipe, streams, tasks, time travel, and performance tuning; managed Snowflake advanced features like metadata management and data sharing.
• Working knowledge of REST API integrations and real-time data processing with Kafka, Spark Streaming, Cloud Pub/Sub, and Cloud Dataflow.
• Contributed to Infrastructure-as-Code using Terraform and CloudFormation, and ensured deployments and version control using Git in agile team environments.
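Illustrative example: a minimal Apache Airflow DAG sketch for the kind of daily orchestration described above. The DAG id, task names, and callables are hypothetical placeholders, not artifacts of any client project.

```python
# Illustrative only: a minimal Airflow 2.x DAG for a daily extract-and-load job.
# DAG id, task names, and callables are hypothetical placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_orders():
    """Pull the previous day's orders from the source system (placeholder body)."""


def load_to_warehouse():
    """Load the transformed files into the warehouse staging table (placeholder body)."""


default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="daily_orders_load",        # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_orders", python_callable=extract_orders)
    load = PythonOperator(task_id="load_to_warehouse", python_callable=load_to_warehouse)
    extract >> load
```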
Technical Skills:
• Operating Systems: Sun Solaris 11, 10, 9, 8, 7, Red Hat Linux 7.x
• Monitoring tools: AppDynamics and Kibana.
• ETL Tools: ADF, ADB, dbt, Airflow, Dataproc, Dataflow, AWS Glue, Redshift, Ab Initio GDE
• Provisioning Tools: Terraform, CloudFormation, ARM Template
• Databases: DB2 (z/OS), Oracle (11g, 10g), SQL Server 7.0, MySQL, and MS Access (using VBA).
• Configuration Management: Ant, Ansible, Chef
• Cloud Platforms: AWS, Azure, and GCP; on-premises and HPE environments
• Test and Build Systems: Jenkins, Maven.
• Methodologies: Agile, Waterfall, DevOps
• Design tools: UML (Use cases, Class Diagrams, Sequence Diagrams)
• Big-Data Technologies: Hadoop, Apache Spark, Hive, Scala, HDFS, PySpark.
• Database/BI Tools: SQL Server Management Studio, SQL Assistant
• Versioning tools: SVN, Git, GitHub
• ETL/Data Virtualization Tools: Informatica, Pentaho, Denodo
• Workflow/Scheduling Tools: Apache Airflow, Redwood
• Visualization/Reporting: Tableau and Power BI
• SDLC Process: Agile Software Development with Scrum, Waterfall methodology
Certifications:
• Microsoft Certified: Azure Data Fundamentals
• Microsoft Certified: Fabric Data Engineer Associate (DP-700)
Education details:
Master’s Degree: Computer Science, University of Central Missouri
Bachelor’s Degree: Electronics and Communication Engineering, GITAM
Projects Summary
Client: Intrust Bank, Wichita, KS, Aug 2024 to Present
Role: Senior Data Engineer
Roles and Responsibilities
• Designed and optimized data models, ensuring efficient storage, retrieval, and processing for high-performance analytics.
• Designed and implemented massive-scale data conversion pipelines using PySpark on datasets larger than 100M records.
• Developed scalable ETL pipelines with Azure Data Factory (ADF) to load and transform data from various sources.
• Used Azure Synapse Analytics for creating and optimizing enterprise-level data warehousing solutions.
• Processed and stored big data using Azure Data Lake Storage (ADLS) with security access controls.
• Set up Azure Blob Storage for storing semi-structured and unstructured datasets for analysis and retrieval.
• Used AWS S3 buckets to store raw and transformed datasets with lifecycle policies.
• Implemented serverless data pipelines using AWS Lambda and AWS Glue for cost-efficient automation.
• Actively participated in daily stand-ups, sprint planning, and retrospectives as part of a cross-functional Agile team.
• Queried large datasets using Amazon Redshift for business intelligence and analytics.
• Conducted ad-hoc analysis with Google BigQuery for fast exploration of large-scale datasets.
• Enforced IAM roles and policies across cloud platforms for safe access to data and services.
• Enhanced Spark operations using partitioning, caching, and broadcast variables to cut down execution time (see the tuning sketch after this list).
• Utilized Azure Monitor and CloudWatch for performance monitoring, alerting, and debugging of pipelines.
• Created linked services in Azure Data Factory to land data from different sources.
• Created advanced SQL scripts for data transformation, validation, and loading in data warehouse environments.
• Created and operated ETL/ELT pipelines to automate data ingest and enrichment workflows.
• Employed SSIS for legacy data movement and migration operations involving SQL Server and flat files.
• Designed relational and dimensional data structures (Star/Snowflake schema) for BI reporting.
• Developed Python-based scripts for data cleansing, transformation, and automation.
• Incorporated third-party APIs and flat files into ETL processes for regular ingestion.
• Improved pipeline performance using partitioning, parallelism, and indexing techniques.
• Integrated Spark with Azure Data Lake and Blob Storage to handle batch and stream data in a distributed setting.
• Implemented data quality checks and validation rules for clean and accurate datasets.
• Developed maintainable, reusable parameterized pipelines in ADF and modular Python code.
• Managed high-volume data intake with parallel pipelines, with logging and retry logic properly implemented.
• Created interactive dashboards and KPIs with Power BI and Tableau from cloud warehouse data.
• Utilized Git for versioning and engaged in code review to ensure code quality in a collaborative environment.
• Deployed and tracked pipelines using CI/CD tools such as Azure DevOps and GitHub Actions.
• Ensured accurate documentation of all data flows, schemas, and transformations for audit purposes.
• Worked with cross-functional groups and stakeholders to collect requirements and provide data-driven solutions.
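Illustrative example: a minimal PySpark tuning sketch showing the broadcast join, caching, and partitioned writes referenced above. Storage paths, column names, and table layouts are assumed placeholders, not client artifacts.

```python
# Illustrative only: broadcast join, caching, and partitioned writes in PySpark.
# Storage paths and column names are assumed placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders_enrichment").getOrCreate()

# Large fact dataset stored in ADLS (placeholder path).
orders = spark.read.parquet("abfss://raw@examplelake.dfs.core.windows.net/orders/")

# Small dimension table: broadcast it to avoid shuffling the large side of the join.
customers = spark.read.parquet("abfss://curated@examplelake.dfs.core.windows.net/customers/")
enriched = orders.join(F.broadcast(customers), on="customer_id", how="left")

# Cache the enriched frame because it feeds more than one downstream action.
enriched.cache()

daily_revenue = enriched.groupBy("order_date").agg(F.sum("amount").alias("revenue"))

# Repartition by the write key so output files align with the partition layout.
(
    enriched.repartition("order_date")
    .write.mode("overwrite")
    .partitionBy("order_date")
    .parquet("abfss://curated@examplelake.dfs.core.windows.net/orders_enriched/")
)
daily_revenue.write.mode("overwrite").parquet(
    "abfss://curated@examplelake.dfs.core.windows.net/daily_revenue/"
)
```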
Environment: Python, PySpark, Azure SQL DB, Azure Data Factory, Azure Databricks, Azure Data Lake, Hadoop, Spark-SQL, Kafka, Cassandra, Apache Airflow, MongoDB, MapReduce, AWS
Client: Ajackus, India, Sep 2019 to Nov 2022
Role: Data Engineer
Roles and Responsibilities
• Developed Spark applications using Scala and Spark-SQL for data extraction, transformation, and aggregation from multiple data file formats.
• Developed and modified ETL processes and workflows that load data for the data warehouse using ETL tools (transitioning from SSIS to Azure integration technologies) and SQL code.
• Built and optimized data pipelines using Apache Spark in big data systems.
• Built and maintained event streaming pipelines using Kafka for real-time data intake and processing (see the streaming sketch after this list).
• Processed and analyzed big data using Hadoop and its ecosystem (HDFS, Hive, HBase).
• Wrote complex SQL queries, stored procedures, and views for data transformation.
• Built ETL jobs using Scala, Python, and Java for large-scale data processing.
• Designed batch and streaming ETL pipelines with Apache Airflow for scheduled tasks.
• Designed scalable data storage infrastructure on HDFS, Hive, HBase, and Amazon S3.
• Applied IAM, access control, and data governance across AWS and Azure environments.
• Applied Docker and Kubernetes for container-based data processing workloads.
• Performed database testing on data warehouse by using complex SQL queries in SQL Server to validate data.
• Enabled serverless, cloud-native ETL on AWS Glue and Azure Data Factory.
• Implemented continuous integration and delivery (CI/CD) principles to automate deployment of data pipelines in Agile releases.
• Tuned Spark jobs and cluster performance for high throughput and low latency.
• Ensured data validation, reliability, and integrity throughout pipelines.
• Enforced monitoring and alerting using AWS CloudWatch or Azure Monitor.
• Collaborated with stakeholders to gather and map business requirements to data solutions.
• Developed dimensional data models and data marts (snowflake schema, star schema).
• Created and maintained optimal data pipeline architecture in Microsoft Azure Cloud using Data Factory and Azure Databricks.
• Improved cloud cost efficiency by reducing storage and compute usage.
• Automated deployment of ETL through CI/CD pipelines (GitHub Actions, Azure DevOps).
• Established metadata catalogs and documented lineage for data governance.
• Worked with NoSQL databases (HBase, Cassandra) for high-concurrency, flexible-schema workloads.
• Developed reusable data transformation modules and libraries in Python/Scala.
• Migrated on-premises systems to cloud-based data architectures (AWS, Azure, GCP).
• Performed ad-hoc analytics using BigQuery, Redshift, or Synapse.
• Performed design/code reviews to validate best practices in data engineering.
• Mentored junior engineers on Spark, Kafka, and cloud data solutions.
• Stayed current with emerging data engineering technologies (e.g., Databricks certification).
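Illustrative example: a minimal Kafka-to-Spark Structured Streaming sketch of the real-time intake described above. The broker address, topic, schema, and sink paths are hypothetical, and the job assumes the spark-sql-kafka connector package is on the Spark classpath.

```python
# Illustrative only: Kafka -> Spark Structured Streaming -> data lake sink.
# Broker, topic, schema, and paths are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("events_stream").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_ts", TimestampType()),
    StructField("amount", DoubleType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "orders_events")                # placeholder topic
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers bytes; cast the value column and parse the JSON payload.
parsed = (
    raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Write micro-batches to the lake with checkpointing for fault tolerance.
query = (
    parsed.writeStream.format("parquet")
    .option("path", "s3a://example-bucket/events/")                       # placeholder
    .option("checkpointLocation", "s3a://example-bucket/checkpoints/events/")
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()
```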
Environment: SQL, Python, Hadoop, MapReduce, Hive, Sqoop, Kafka, Oozie, Scala, AWS Glue, Redshift, S3, EC2, IAM, EMR, Spark, Azure Data Factory, Azure Databricks, Logic Apps, Blob Storage
Client: Monocept, India, May 2018 to July 2019
Role: Hadoop Developer
Roles and Responsibilities
• Integrated Snowflake into the data architecture, improving BI reporting and real-time analytics.
• Designed and implemented end-to-end ETL/ELT pipelines using Azure Data Factory, AWS Glue, and Apache Airflow for automating ingestion and transformation of unstructured and structured data.
• Designed and implemented a fully operational, production-grade, large-scale data solution on the Snowflake data warehouse.
• Designed reusable, maintainable Spark modules for ETL pipelines that can be shared across projects.
• Built and queried large-scale data warehouses using Azure Synapse Analytics, Amazon Redshift, and Google BigQuery.
• Managed data storage in Azure Data Lake, Blob Storage, and AWS S3, along with corresponding backups and lifecycle policies.
• Developed data transformation logic in Python, Scala, and SQL with a focus on performance, modularity, and code reusability.
• Utilized star and snowflake schema modeling for reporting and optimized pipeline performance through indexing and partitioning.
• Enforced data validation, quality checks, and error handling through automated scripts and logging with retry mechanisms (see the validation sketch after this list).
• Integrated third-party APIs, batch feeds, and real-time streams into cloud data ecosystems.
• Used Jira and Confluence for sprint progress tracking, technical solution documentation, and business objective alignment.
• Used Spark SQL for data discovery, ad-hoc queries, and data preparation for subsequent analytics.
• Used Cloud IAM, RBAC, and governance in AWS and Azure environments for secure access management.
• Used Azure Monitor and AWS CloudWatch for proactive monitoring of pipelines, logging, and alerting.
• Created and shared Power BI/Tableau dashboards for cloud data sources that delivered business insights.
• Used Git for version control and published pipelines with CI/CD tools like Azure DevOps and GitHub Actions.
• Collaborated with cross-functional teams in transforming business requirements into cloud-native, scalable data solutions using Docker and Kubernetes as applicable.
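Illustrative example: a minimal data-quality validation sketch with logging and retry, in the spirit of the checks described above. Column names, paths, and retry settings are assumptions made for the sketch.

```python
# Illustrative only: automated data-quality checks with logging and retry.
# Column names, paths, and retry settings are assumptions.
import logging
import time

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("quality_checks")

spark = SparkSession.builder.appName("quality_checks").getOrCreate()


def validate(df):
    """Fail fast on empty loads, null keys, or duplicate business keys."""
    total = df.count()
    if total == 0:
        raise ValueError("no rows ingested")
    null_keys = df.filter(F.col("order_id").isNull()).count()
    if null_keys:
        raise ValueError(f"{null_keys} rows have a null order_id")
    dupes = total - df.dropDuplicates(["order_id"]).count()
    if dupes:
        raise ValueError(f"{dupes} duplicate order_id values")


def load_with_retry(path, attempts=3, backoff_seconds=60):
    """Read and validate a batch, retrying transient failures with a fixed backoff."""
    for attempt in range(1, attempts + 1):
        try:
            df = spark.read.parquet(path)
            validate(df)
            log.info("batch %s passed validation", path)
            return df
        except Exception as exc:
            log.warning("attempt %d failed: %s", attempt, exc)
            if attempt == attempts:
                raise
            time.sleep(backoff_seconds)


if __name__ == "__main__":
    load_with_retry("s3a://example-bucket/raw/orders/")  # placeholder path
```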
Environment: GCP BigQuery, Dataflow, Cloud Storage, Cloud Run, Jenkins, Power BI, Snowflake, Apache Spark, Hadoop, EMR, SQL