SANJAY AREPALLY
Email: ******************@*****.*** Phone: +1-813-***-**** LinkedIn
SUMMARY
• Data Engineer with 3+ years of experience and a strong foundation in building scalable, cloud-native data pipelines for batch and real-time processing.
• Proficient in modern data technologies including PySpark, Apache Spark, Kafka, Databricks, Delta Lake, and Airflow.
• Delivered impactful data engineering solutions at JPMorgan Chase and PNC Bank, including ESG analytics and enterprise data lake modernization.
• Skilled in cloud platforms (AWS, Azure, GCP), with expertise in Snowflake, Redshift, and BigQuery for data warehousing and reporting.
• Hands-on with ETL/ELT development using Informatica, Talend, dbt, and workflow automation tools like Jenkins and GitHub Actions.
• Adept at implementing data modeling techniques (OLTP, OLAP, dimensional modeling) and ensuring regulatory compliance (GDPR, HIPAA).
• Experienced in version control and agile collaboration using Git, Jira, and Bitbucket.
• Strong analytical and visualization skills using Pandas, Power BI, Qlik Sense, and Matplotlib.
• Passionate about data architecture, automation, and transforming raw data into meaningful business insights.
SKILLS
• Programming & Scripting: Python, SQL, Bash, Shell Scripting, PySpark
• Big Data & Distributed Systems: Apache Spark, Apache Flink, Apache Hadoop, Hive, Kafka, Databricks, Delta Lake
• Database Systems: MySQL, PostgreSQL, SQL Server, MongoDB
• Data Warehousing: Snowflake, Amazon Redshift, Google BigQuery
• Data Integration Tools: dbt, Talend, Informatica (MDM, PowerCenter), Qlik Sense
• Cloud Platforms & Services: AWS (S3, Redshift, SageMaker, Data Pipeline), Azure (Data Factory, Azure Machine Learning), GCP (BigQuery, Cloud Storage)
• DevOps & CI/CD Tools: Jenkins, GitHub Actions, Docker, Kubernetes, Apache Airflow, MLflow
• Infrastructure as Code: Terraform, Ansible
• Data Modeling & Architecture: OLTP, OLAP, Data Lakes, Dimensional Modeling, Snowflake Schema
• Version Control & Collaboration: Git, Bitbucket, Jira, Trello, Asana
• Data Analysis & Visualization: Pandas, NumPy, Matplotlib, Power BI, Qlik Sense
• Machine Learning (Basic): TensorFlow, Scikit-learn, MLlib
• Data Governance Frameworks: GDPR, HIPAA
WORK EXPERIENCE
JPMorgan Chase & Co. January 2025 – Present
Data Engineer – Intern
• Developed scalable data ingestion pipelines using PySpark and Apache Spark to aggregate ESG metrics from third-party APIs and internal sustainability reports.
• Implemented Delta Lake architecture on Databricks, enabling efficient time-travel and schema evolution for evolving ESG datasets.
• Automated data workflows using Apache Airflow and GitHub Actions, reducing manual intervention in daily ingestion processes by 85% (see the DAG sketch after this list).
• Modeled ESG datasets using dimensional modeling techniques to support executive dashboards built in Power BI.
• Ensured compliance with GDPR by embedding automated data masking and lineage tracking via Terraform and governance frameworks.
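A minimal sketch of the kind of daily ingestion DAG described above, assuming Airflow 2.x (2.4+ for the schedule argument); the DAG id, task names, and callables are hypothetical placeholders, not the production pipeline.

```python
# Hypothetical daily ESG ingestion DAG; all names here are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest_esg_metrics(**context):
    """Placeholder: pull ESG metrics from third-party APIs (assumed endpoints)."""
    ...


def write_to_delta(**context):
    """Placeholder: append the batch to a Delta Lake table on Databricks."""
    ...


with DAG(
    dag_id="esg_daily_ingestion",  # hypothetical DAG id
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest_esg_metrics", python_callable=ingest_esg_metrics)
    load = PythonOperator(task_id="write_to_delta", python_callable=write_to_delta)

    # Ingestion must finish before the Delta write runs.
    ingest >> load
```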
PNC Bank November 2020 – July 2023
Data Engineer
• Designed and implemented batch ETL pipelines using Apache Spark and Hive to unify siloed financial datasets (loans, deposits, transactions) into a centralized Hadoop-based data lake.
• Migrated over 30 TB of structured and semi-structured data from legacy SQL Server and mainframe systems to scalable data lake storage layers.
• Developed reusable ETL templates with Informatica PowerCenter and Talend, standardizing ingestion and transformation across 12+ business domains.
• Automated daily and weekly pipeline executions with Jenkins and Ansible, reducing manual job triggers by over 90%.
• Created star and snowflake schema models to support downstream reporting and analytics in Qlik Sense and Amazon Redshift.
• Enabled metadata lineage tracking and documentation using dbt and internal governance policies for regulatory traceability.
• Conducted performance tuning of Spark jobs, achieving up to 65% improvement in batch runtime through memory optimization and partitioning (see the tuning sketch after this list).
• Collaborated with cross-functional teams (data stewards, analysts, compliance) to align architecture with GDPR/HIPAA data retention policies.
• Led migration readiness assessments and performed QA testing for over 100 critical data pipelines to ensure end-to-end data quality.
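A minimal sketch of the partitioning-based tuning pattern referenced above; the paths, column names, and partition counts are hypothetical and would be sized to the actual cluster.

```python
# Illustrative PySpark tuning pattern; table paths and columns are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("batch_etl_tuning_sketch")
    # Match shuffle parallelism to cluster cores to avoid tiny or oversized tasks.
    .config("spark.sql.shuffle.partitions", "200")
    .getOrCreate()
)

txns = spark.read.parquet("/data/lake/transactions")  # hypothetical source path

# Repartition on the join/aggregation key so shuffles stay balanced, then write
# partitioned by date so downstream readers can prune partitions.
(
    txns.repartition(200, "account_id")
        .write.mode("overwrite")
        .partitionBy("txn_date")
        .parquet("/data/lake/curated/transactions")  # hypothetical target path
)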
PROJECTS
Smart Traffic Flow Analysis System
• Built data pipelines using Apache Spark and Kafka to process and analyze streaming data from over 10,000 sensors (see the streaming sketch after this list).
• Designed a predictive model using Scikit-learn to forecast congestion hotspots and recommend alternative routes.
• Deployed the system on Azure Kubernetes Service (AKS) for seamless scalability and performance optimization.
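A minimal sketch of the Spark-plus-Kafka streaming read described in the first bullet, assuming Spark Structured Streaming with the spark-sql-kafka connector on the classpath; the broker address, topic, and sensor schema are hypothetical.

```python
# Illustrative Structured Streaming read from Kafka; names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, from_json, window
from pyspark.sql.types import (DoubleType, StringType, StructField,
                               StructType, TimestampType)

spark = SparkSession.builder.appName("traffic_stream_sketch").getOrCreate()

# Assumed JSON payload emitted by each sensor.
sensor_schema = StructType([
    StructField("sensor_id", StringType()),
    StructField("speed_kmh", DoubleType()),
    StructField("event_time", TimestampType()),
])

stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "traffic-sensors")            # hypothetical topic
    .load()
    .select(from_json(col("value").cast("string"), sensor_schema).alias("r"))
    .select("r.*")
)

# Average speed per sensor over 5-minute windows as a rough congestion signal.
congestion = stream.groupBy(window("event_time", "5 minutes"), "sensor_id").agg(
    avg("speed_kmh").alias("avg_speed_kmh")
)

query = congestion.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```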
Healthcare Claims Data Pipeline
• Extracted and transformed healthcare claims data using Informatica and loaded it into a Snowflake data warehouse.
• Developed Python scripts to perform data cleaning and ensure compliance with HIPAA regulations (see the masking sketch below).
• Automated CI/CD pipelines for deployment using GitHub Actions and integrated monitoring via Prometheus.
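A minimal sketch of the kind of PHI-masking cleanup step mentioned above, using pandas; the file names, column names, and hash truncation are hypothetical illustration, not the production rules.

```python
# Illustrative claims-cleaning step with PHI masking; names are hypothetical.
import hashlib

import pandas as pd


def mask_id(value: str) -> str:
    """One-way hash so member IDs remain joinable without exposing PHI."""
    return hashlib.sha256(value.encode()).hexdigest()[:16]


claims = pd.read_csv("claims_extract.csv")  # hypothetical extract file
claims = claims.drop_duplicates(subset="claim_id")
claims["member_id"] = claims["member_id"].astype(str).map(mask_id)
# Strip direct identifiers outright; errors="ignore" tolerates absent columns.
claims = claims.drop(columns=["patient_name", "ssn"], errors="ignore")
claims.to_csv("claims_clean.csv", index=False)
```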
EDUCATION
• Master of Science in Data Analytics
Indiana Wesleyan University GPA: 3.5
• Bachelor of Technology in Mechanical Engineering
Kakatiya Institute of Technology and Science
CERTIFICATIONS
• AWS Certified Data Engineer
• Databricks Data Engineer
• IBM Data Analyst Professional
• AWS Certified Cloud Practitioner