Aneela Garapati
Data Engineer
*********@*****.*** +1-978-***-**** Lowell, MA
SUMMARY
Data Engineer with 5+ years of experience in designing, developing, and optimizing scalable data pipelines and ETL workflows in cloud and big data environments. Proven expertise in leveraging tools such as Databricks, PySpark, Snowflake, and Apache Airflow to build feature engineering and machine learning pipelines supporting credit risk and healthcare analytics. Skilled in data governance, security compliance (HIPAA, PII masking), and regulatory reporting (CCAR, Basel III). Proficient in SQL, Python, R, and big data technologies including Hadoop, Spark, and Kafka, with hands-on experience in AWS and Azure cloud platforms. Adept at collaborating with cross-functional teams to deliver reliable, high-quality data solutions that drive business insights.
SKILLS
Methodology: SDLC, Agile, Waterfall
Programming Languages: R, Python, SQL
IDEs: PyCharm, Jupyter Notebook, Visual Studio Code
ML Algorithms: Linear Regression, Logistic Regression, Decision Trees, Supervised Learning, Unsupervised Learning, Classification, SVM, Random Forests, Naive Bayes, KNN, K-Means, CNN
Big Data Ecosystem: Hadoop, MapReduce, Hive, Apache Spark, Pig
Frameworks: Kafka, Airflow, Snowflake, Django, Docker
Cloud Technologies: AWS (S3, EMR, Glue, Athena, Redshift, IAM, EC2, Lambda), Azure (Virtual Machines, Kubernetes Service, Functions, Data Lake Storage, Data Factory, Synapse Analytics, Monitor, Active Directory)
Packages: NumPy, Pandas, Matplotlib, SciPy, Scikit-learn, Seaborn, TensorFlow
Databases: MS SQL Server, PostgreSQL, MongoDB, MySQL
Reporting Tools/Other Software: Tableau, Power BI, SSRS, SSIS, Anaconda, Sitecore, CMS, Jira, Confluence, Jenkins, Git, MS Office
Other Skills: Data Cleaning, Data Wrangling, Critical Thinking, Communication Skills, Presentation Skills, Problem-Solving
Operating Systems: Windows, Linux
EXPERIENCE
Commonwealth of Massachusetts, USA Data Engineer Jan 2024 - Current
● Led data migration projects involving Oracle to SQL Server and SQL Server to Amazon Redshift using advanced SQL queries and Python-based ETL scripts, improving data accessibility by 40% across business units.
● Developed and maintained robust ETL processes for extracting, cleansing, transforming, and loading data across multiple systems, reducing manual data processing time by 35%.
● Created Power BI dashboards by connecting to diverse data sources, including Oracle, SQL Server, and Redshift, enabling real-time reporting and improving stakeholder decision-making by 25%.
● Performed rigorous data validation and cleansing to ensure accuracy and consistency across migrated datasets, reducing error rates by 30%.
● Collaborated with business and technical stakeholders to gather data requirements and delivered optimized SQL/AWS-based solutions, enhancing system performance by 20%.
● Utilized Git and TFMS for version control and Jira for tracking project progress and issue resolution in an Agile development environment.
● Implemented data partitioning and query optimization strategies in Redshift, reducing query latency by 45% on analytical workloads.
● Automated daily data ingestion workflows using AWS Lambda and Step Functions, achieving 99.9% uptime and improving scalability.
● Built data quality monitoring scripts using Python and SQL that triggered alerts for anomalies, improving data trustworthiness and reducing SLA breaches by 15% (a minimal sketch of this pattern follows this list).
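The monitoring scripts above followed a simple row-count and null-rate pattern. Below is a minimal sketch in Python; the sales.daily_orders table, column names, and thresholds are hypothetical stand-ins, not the production script, and it assumes Redshift is reached over the PostgreSQL wire protocol via psycopg2:

```python
import psycopg2  # Redshift accepts PostgreSQL-protocol connections

# Hypothetical daily-load query; the real checks covered several tables.
ANOMALY_QUERY = """
    SELECT COUNT(*) AS total_rows,
           SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_ids
    FROM sales.daily_orders
    WHERE load_date = CURRENT_DATE
"""

def check_daily_load(conn_params: dict, min_rows: int = 1000) -> list:
    """Return human-readable alerts for today's load; empty list if healthy."""
    with psycopg2.connect(**conn_params) as conn:
        with conn.cursor() as cur:
            cur.execute(ANOMALY_QUERY)
            total_rows, null_ids = cur.fetchone()
    alerts = []
    if total_rows < min_rows:
        alerts.append(f"row count {total_rows} below expected minimum {min_rows}")
    if null_ids:  # SUM returns NULL on an empty load; falsy either way
        alerts.append(f"{null_ids} rows loaded with NULL customer_id")
    return alerts  # a non-empty result would be routed to the alerting channel
```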
Capital One, USA Data Engineer Jan 2023 - Dec 2023
● Developed and scaled end-to-end feature engineering pipelines using Databricks and PySpark to support ML-based credit risk scoring models (see the sketch after this list).
● Implemented Delta Lake for schema evolution, data versioning, and ACID transactions to enable reproducible and audit-compliant model training.
● Integrated MLflow for experiment tracking, model registry, and automated deployment of credit risk models into production environments.
● Collaborated with cross-functional teams including data scientists and ML engineers to automate feature store integration and real-time scoring pipelines.
● Built scalable ETL pipelines using Python and Snowflake for ingestion, transformation, and validation of structured and semi-structured credit risk data.
● Utilized AWS S3 and Databricks Auto Loader for incremental data ingestion with checkpointing and file notifications to streamline batch and streaming ingestion.
● Established robust data quality checks and automated validations using Great Expectations to ensure high integrity in ML model inputs.
● Applied data governance standards through Unity Catalog and implemented fine-grained access controls and data masking for PII compliance.
● Optimized pipeline performance and cost by leveraging Photon Engine and Delta caching on Databricks for large-scale data transformations.
● Orchestrated workflows using Apache Airflow to manage feature extraction, training, and scoring pipelines with dependency-aware DAGs.
● Designed time-series data models for training and validating longitudinal risk trends using partitioned datasets with Delta Lake.
● Maintained CI/CD pipelines with GitHub Actions and Terraform for infrastructure automation, pipeline deployment, and environment reproducibility.
● Documented pipeline architecture, metadata lineage, and operational runbooks to support audit readiness and regulatory reporting.
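A minimal sketch of the rolling-window feature engineering described above, in PySpark on Delta Lake. The table paths, column names, and 90-day window are hypothetical illustrations; the real pipelines additionally handled schema evolution, MLflow logging, and Unity Catalog governance:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("credit-risk-features").getOrCreate()

# Read raw transactions from a Delta table (hypothetical path).
txns = spark.read.format("delta").load("/mnt/raw/credit_transactions")

# 90-day rolling window per account, ordered by event time in epoch seconds.
w90 = (Window.partitionBy("account_id")
             .orderBy(F.col("txn_ts").cast("long"))
             .rangeBetween(-90 * 86400, 0))

features = (txns
    .withColumn("util_90d", F.avg("utilization").over(w90))
    .withColumn("delinq_90d",
                F.sum(F.when(F.col("days_past_due") > 30, 1).otherwise(0)).over(w90))
    .select("account_id", "txn_ts", "util_90d", "delinq_90d"))

# Write back to Delta so training runs can pin a reproducible table version.
features.write.format("delta").mode("overwrite").save("/mnt/features/credit_risk_v1")
```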
Byteworks Solutions, India Data Engineer Jan 2019 - Aug 2021
● Implemented Snowflake security features including Role-Based Access Control (RBAC) to enforce granular access management and compliance.
● Developed complex SQL queries to extract, join, and clean large-scale population health datasets for chronic disease tracking and reporting.
● Designed and maintained data warehouse schemas in BigQuery, supporting analytics use cases such as disease prevalence and treatment adherence.
● Built and published interactive dashboards in Tableau to monitor chronic conditions (e.g., diabetes, asthma) across demographic cohorts.
● Created dbt transformation models for reproducible and modular ETL pipelines, ensuring data consistency and reusability.
● Collaborated with clinical and business analysts to define population health metrics and translate requirements into data models and queries.
● Implemented scheduled ETL jobs to process structured health data using SQL scripting and dbt Cloud orchestration features.
● Optimized slow-running SQL queries by using partitioning, clustering, and indexing techniques in BigQuery.
● Conducted data validation and reconciliation between raw extracts and Looker dashboards to ensure reporting accuracy.
● Performed data wrangling on claims, EMR, and social determinant datasets, integrating heterogeneous sources into a unified model.
● Defined data dictionaries, field-level documentation, and data lineage in Confluence to support compliance and team collaboration.
● Partnered with BI developers to build Looker Explores and LookML models aligned with business logic and clinical KPIs.
● Automated data refresh workflows using legacy schedulers to support daily and weekly dashboard updates.
● Created materialized views and intermediate tables in BigQuery to accelerate Tableau visualizations on large population-level datasets.
● Implemented access controls and row-level security in Looker and BigQuery to maintain HIPAA-compliant data handling practices.
● Participated in data quality monitoring by implementing validation checks and exception handling for missing or inconsistent health data (see the sketch after this list).
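A minimal sketch of the kind of validation check described above, in Python/pandas. The column names and rules are hypothetical examples; the production checks were scheduled alongside the dbt runs:

```python
import pandas as pd

REQUIRED_COLS = ["patient_id", "diagnosis_code", "service_date"]  # hypothetical

def validate_claims(df: pd.DataFrame) -> dict:
    """Return counts of common quality problems found in a claims extract."""
    missing = [c for c in REQUIRED_COLS if c not in df.columns]
    if missing:
        return {"missing_columns": missing}  # fail fast on a bad schema
    issues = {
        "null_patient_ids": int(df["patient_id"].isna().sum()),
        "duplicate_claims": int(df.duplicated(subset=REQUIRED_COLS).sum()),
    }
    # Unparseable or future service dates usually indicate a bad extract.
    dates = pd.to_datetime(df["service_date"], errors="coerce")
    issues["unparseable_dates"] = int(dates.isna().sum())
    issues["future_dates"] = int((dates > pd.Timestamp.today()).sum())
    return issues
```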
CERTIFICATIONS
AWS Certified Solutions Architect
Machine Learning: Microsoft Certified Systems Administrator
Introduction to Generative AI and Generative AI Fundamentals by Google
Raspberry Pi Platform and Python Programming for Raspberry Pi
EDUCATION
Master's in Computer Science, University of Massachusetts Lowell, Lowell, MA
Bachelor's in Electronics and Computer Science, KL University, Guntur, Andhra Pradesh