SANJAY AREPALLY
Email: ******************@*****.*** Phone: +1-813-***-**** LinkedIn
SUMMARY
• Data Engineer with 3+ years of experience and a strong foundation in building scalable, cloud-native data pipelines for batch and real-time processing.
• Proficient in modern data technologies including PySpark, Apache Spark, Kafka, Databricks, Delta Lake, and Airflow.
• Delivered impactful data engineering solutions at JPMorgan Chase and PNC Bank, including ESG analytics and enterprise data lake modernization.
• Skilled in cloud platforms (AWS, Azure, GCP), with expertise in Snowflake, Redshift, and BigQuery for data warehousing and reporting.
• Hands-on with ETL/ELT development using Informatica, Talend, dbt, and workflow automation tools like Jenkins and GitHub Actions.
• Adept at implementing data modeling techniques (OLTP, OLAP, dimensional modeling) and ensuring regulatory compliance (GDPR, HIPAA).
• Experienced in version control and agile collaboration using Git, Jira, and Bitbucket.
• Strong analytical and visualization skills using Pandas, Power BI, Qlik Sense, and Matplotlib.
• Passionate about data architecture, automation, and transforming raw data into meaningful business insights.
SKILLS
• Programming & Scripting: Python, SQL, Bash, Shell Scripting, PySpark
• Big Data & Distributed Systems: Apache Spark, Apache Flink, Apache Hadoop, Hive, Kafka, Databricks, Delta Lake
• Database Systems: MySQL, PostgreSQL, SQL Server, MongoDB
• Data Warehousing: Snowflake, Amazon Redshift, Google BigQuery
• Data Integration Tools: dbt, Talend, Informatica (MDM, PowerCenter), Qlik Sense
• Cloud Platforms & Services: AWS (S3, Redshift, SageMaker, Data Pipeline), Azure (Data Factory, Azure Machine Learning), GCP (BigQuery, Cloud Storage)
• DevOps & CI/CD Tools: Jenkins, GitHub Actions, Docker, Kubernetes, Apache Airflow, MLflow
• Infrastructure as Code: Terraform, Ansible
• Data Modeling & Architecture: OLTP, OLAP, Data Lakes, Dimensional Modeling, Snowflake Schema
• Version Control & Collaboration: Git, Bitbucket, Jira, Trello, Asana
• Data Analysis & Visualization: Pandas, NumPy, Matplotlib, Power BI, Qlik Sense
• Machine Learning (Basic): TensorFlow, Scikit-learn, MLlib
• Data Governance Frameworks: GDPR, HIPAA
WORK EXPERIENCE
JPMorgan Chase & Co. January 2025 – Present
Data Engineer – Intern
• Developed scalable data ingestion pipelines using PySpark and Apache Spark to aggregate ESG metrics from third-party APIs and internal sustainability reports.
• Implemented Delta Lake architecture on Databricks, enabling efficient time-travel and schema evolution for evolving ESG datasets.
• Automated data workflows using Apache Airflow and GitHub Actions, reducing manual intervention in daily ingestion processes by 85% (see the DAG sketch after this list).
• Modeled ESG datasets using dimensional modeling techniques to support executive dashboards built in Power BI.
• Ensured compliance with GDPR by embedding automated data masking and lineage tracking via Terraform and governance frameworks.
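A minimal sketch of the kind of daily ingestion DAG described above, assuming Airflow 2.x (2.4+ for the schedule argument); the DAG id, task names, and callables are hypothetical placeholders, not the production pipeline.

```python
# Hypothetical daily ESG ingestion DAG; all names here are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest_esg_metrics(**context):
    """Placeholder: pull ESG metrics from third-party APIs (assumed endpoints)."""
    ...


def write_to_delta(**context):
    """Placeholder: append the batch to a Delta Lake table on Databricks."""
    ...


with DAG(
    dag_id="esg_daily_ingestion",  # hypothetical DAG id
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest_esg_metrics", python_callable=ingest_esg_metrics)
    load = PythonOperator(task_id="write_to_delta", python_callable=write_to_delta)

    # Ingestion must finish before the Delta write runs.
    ingest >> load
```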
PNC Bank November 2020 – July 2023
Data Engineer
• Designed and implemented batch ETL pipelines using Apache Spark and Hive to unify siloed financial datasets (loans, deposits, transactions) into a centralized Hadoop-based data lake.
• Migrated over 30 TB of structured and semi-structured data from legacy SQL Server and mainframe systems to scalable data lake storage layers.
• Developed reusable ETL templates with Informatica PowerCenter and Talend, standardizing ingestion and transformation across 12+ business domains.
• Automated daily and weekly pipeline executions with Jenkins and Ansible, reducing manual job triggers by over 90%.
• Created star and snowflake schema models to support downstream reporting and analytics in Qlik Sense and Amazon Redshift.
• Enabled metadata lineage tracking and documentation using dbt and internal governance policies for regulatory traceability.
• Conducted performance tuning of Spark jobs, achieving up to 65% improvement in batch runtime through memory optimization and partitioning (see the tuning sketch after this list).
• Collaborated with cross-functional teams (data stewards, analysts, compliance) to align architecture with GDPR/HIPAA data retention policies.
• Led migration readiness assessments and performed QA testing for over 100 critical data pipelines to ensure end-to-end data quality.
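A minimal sketch of the partitioning-based tuning pattern referenced above; the paths, column names, and partition counts are hypothetical and would be sized to the actual cluster.

```python
# Illustrative PySpark tuning pattern; table paths and columns are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("batch_etl_tuning_sketch")
    # Match shuffle parallelism to cluster cores to avoid tiny or oversized tasks.
    .config("spark.sql.shuffle.partitions", "200")
    .getOrCreate()
)

txns = spark.read.parquet("/data/lake/transactions")  # hypothetical source path

# Repartition on the join/aggregation key so shuffles stay balanced, then write
# partitioned by date so downstream readers can prune partitions.
(
    txns.repartition(200, "account_id")
        .write.mode("overwrite")
        .partitionBy("txn_date")
        .parquet("/data/lake/curated/transactions")  # hypothetical target path
)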
PROJECTS
Smart Traffic Flow Analysis System
• Built data pipelines using Apache Spark and Kafka to process and analyze streaming data from over 10,000 sensors (see the streaming sketch after this list).
• Designed a predictive model using Scikit-learn to forecast congestion hotspots and recommend alternative routes.
• Deployed the system on Azure Kubernetes Service (AKS) for seamless scalability and performance optimization.
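A minimal sketch of the Spark-plus-Kafka streaming read described in the first bullet, assuming Spark Structured Streaming with the spark-sql-kafka connector on the classpath; the broker address, topic, and sensor schema are hypothetical.

```python
# Illustrative Structured Streaming read from Kafka; names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, from_json, window
from pyspark.sql.types import (DoubleType, StringType, StructField,
                               StructType, TimestampType)

spark = SparkSession.builder.appName("traffic_stream_sketch").getOrCreate()

# Assumed JSON payload emitted by each sensor.
sensor_schema = StructType([
    StructField("sensor_id", StringType()),
    StructField("speed_kmh", DoubleType()),
    StructField("event_time", TimestampType()),
])

stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "traffic-sensors")            # hypothetical topic
    .load()
    .select(from_json(col("value").cast("string"), sensor_schema).alias("r"))
    .select("r.*")
)

# Average speed per sensor over 5-minute windows as a rough congestion signal.
congestion = stream.groupBy(window("event_time", "5 minutes"), "sensor_id").agg(
    avg("speed_kmh").alias("avg_speed_kmh")
)

query = congestion.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```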
Healthcare Claims Data Pipeline
• Extracted and transformed healthcare claims data using Informatica and loaded it into a Snowflake data warehouse.
• Developed Python scripts to perform data cleaning and ensure compliance with HIPAA regulations (see the masking sketch below).
• Automated CI/CD pipelines for deployment using GitHub Actions and integrated monitoring via Prometheus.
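A minimal sketch of the kind of PHI-masking cleanup step mentioned above, using pandas; the file names, column names, and hash truncation are hypothetical illustration, not the production rules.

```python
# Illustrative claims-cleaning step with PHI masking; names are hypothetical.
import hashlib

import pandas as pd


def mask_id(value: str) -> str:
    """One-way hash so member IDs remain joinable without exposing PHI."""
    return hashlib.sha256(value.encode()).hexdigest()[:16]


claims = pd.read_csv("claims_extract.csv")  # hypothetical extract file
claims = claims.drop_duplicates(subset="claim_id")
claims["member_id"] = claims["member_id"].astype(str).map(mask_id)
# Strip direct identifiers outright; errors="ignore" tolerates absent columns.
claims = claims.drop(columns=["patient_name", "ssn"], errors="ignore")
claims.to_csv("claims_clean.csv", index=False)
```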
EDUCATION
• Master of Science in Data Analytics
Indiana Wesleyan University GPA: 3.5
• Bachelor of Technology in Mechanical Engineering
Kakatiya Institute of Technology and Science
CERTIFICATIONS
• AWS Certified Data Engineer
• Databricks Data Engineer
• IBM Data Analyst Professional
• AWS Certified Cloud Practitioner