Dayananda M
DATA ENGINEER
New Jersey, USA | 201-***-**** | ********.******@*****.*** | LinkedIn
SUMMARY
• Versatile Data Engineer with 4 years of experience in designing, implementing, and optimizing large-scale data solutions across healthcare and financial sectors.
• Proven expertise in cloud-based data architecture, particularly with Microsoft Azure services including Azure Data Factory, Databricks, and Stream Analytics.
• Strong background in data visualization and reporting using tools like Tableau and Power BI to drive business insights and support decision-making.
• Strong proficiency in big data technologies, including Apache Spark, Kafka, and Hadoop, with experience processing and analyzing datasets exceeding 200 TB.
• Skilled in data warehouse design and optimization, utilizing dimensional modeling techniques, star/snowflake schemas, and slowly changing dimensions (SCD) implementations.
• Experienced in developing robust ETL/ELT pipelines, integrating diverse data sources and APIs, and ensuring data quality and governance.
• Proficient in multiple programming languages and tools, including Python, SQL, PySpark, and dbt, for efficient data transformation and analysis.
• Adept at collaborating with cross-functional teams, documenting data processes, and driving internal improvements to enhance overall data management efficiency.
EXPERIENCE
Data Engineer | Goldman Sachs, NY | May 2023 – Present
• Contributed to a large-scale data migration project, transitioning on-premises SSIS packages to Azure. Redesigned ETL processes, achieving 40% faster data processing and a 30% reduction in infrastructure costs.
• Architected and implemented a robust ETL pipeline using Azure Data Factory, handling over 100 million daily transactions. Integrated diverse data sources including Azure Blobs, Cosmos DB, and Azure SQL databases.
• Leveraged Spark SQL and PySpark within Databricks for complex data transformations on large datasets, optimizing query performance by 25% through effective partitioning, caching strategies, and code refactoring.
• Created and optimized ad-hoc SQL queries for data analysis and troubleshooting. Reduced time-to-insight by 50% for business users by providing efficient, reusable query templates.
• Integrated Power BI for real-time data visualization, creating interactive dashboards for key financial metrics. This increased stakeholder decision-making efficiency through improved data accessibility.
• Utilized CI/CD pipelines with Azure DevOps for automated testing and deployment of data pipelines. Reduced release cycles from weeks to days and improved code quality through systematic version control.
• Followed robust data governance policies and security measures ensuring GDPR compliance. Achieved 100% adherence to regulatory requirements through data encryption, access controls, and audit logging.
• Designed and optimized data warehouse schemas using dimensional modeling techniques in Azure Synapse Analytics. Implemented slowly changing dimensions (SCD) and star/snowflake schemas to improve query performance and historical data tracking.
• Integrated multiple third-party APIs and JDBC connectors to ingest external financial data sources. Expanded the data ecosystem by 40%, enabling more comprehensive market analysis.
• Utilized Azure Stream Analytics to process real-time streaming data from trading platforms. Implemented a fraud detection system capable of generating alerts within 30 seconds of a suspicious transaction.
Data Engineer | Capgemini, India | Jul 2019 – Aug 2022
• Developed and maintained scalable data pipelines using Apache Spark, Apache Kafka, and Apache Flink on AWS to process large volumes of healthcare data for analysis and reporting.
• Implemented 8 data integration solutions for health data sources, including electronic health records (EHR) systems, 15+ medical devices, and 5 claims databases, leveraging 12+ APIs and AWS Glue for data extraction and transformation.
• Designed and optimized 5 data models for efficient health data storage and retrieval using MySQL, PostgreSQL, and Amazon S3, resulting in a 30% improvement in query performance.
• Utilized Amazon S3 to securely store 5 PB of health data and AWS EMR to perform distributed analytics and parallel processing on 200+ TB of data.
• Created data visualizations and reports using Tableau to present health-related insights and support strategic decision-making.
• Collaborated with the data science team to implement machine learning models for health data analysis, including predictive analytics for patient readmission rates and disease progression, using Python and Scikit-learn.
• Designed and implemented a cloud-based data warehouse using Amazon Redshift, applying dimensional modeling techniques like star and snowflake schemas to meet complex business requirements.
• Utilized dbt for data transformation and modeling, improving data quality and reducing development time by 25%.
• Developed and maintained scalable data pipelines using Apache Spark and PySpark on AWS Glue, processing 10+ TB of healthcare data daily and integrating 8 diverse data sources including EHR systems and claims databases.
• Implemented real-time stream analytics solutions using Amazon Kinesis to process data from 15+ medical devices, enabling timely insights and alerts for critical health events.
• Designed and optimized Slowly Changing Dimensions (SCD) Type 2 for tracking historical patient and provider data changes, ensuring accurate point-in-time reporting.
• Diagnosed and optimized long-running queries, implementing performance tuning techniques that resulted in a 40% improvement in overall data warehouse efficiency.
• Developed comprehensive data lineage documentation and data dictionaries, enhancing data governance and usability across the organization.
• Implemented robust monitoring solutions for ETL jobs and data pipelines using AWS CloudWatch, reducing mean time to resolution for issues by 30%.
• Led internal process improvements, automating manual data validation processes and redesigning the data ingestion infrastructure, reducing data processing time by 20%.
• Utilized Git for version control and collaborative development, implementing branching strategies and code review processes to ensure code quality and maintain data pipeline integrity.
SKILLS
Methodologies: SDLC, Agile, Waterfall
Programming Languages: Python, SQL, R
Packages: NumPy, Pandas, Matplotlib, SciPy, Scikit-learn, TensorFlow, Seaborn
Visualization Tools: Tableau, Power BI, Advanced Excel (Pivot Tables, VLOOKUP), Amazon QuickSight
IDEs: Visual Studio Code, PyCharm, Jupyter Notebook, IntelliJ
Databases: MySQL, PL/SQL, MSSQL, PostgreSQL, MongoDB, SQL Server
Data Engineering Concepts: Apache Spark, Apache Hadoop, Apache Kafka, Apache Beam, ETL/ELT, PySQL, PySpark
Cloud Platforms: Microsoft Azure (Azure Blobs, Databricks, Data Lake), Amazon Web Services (AWS)
Other Technical Skills: SSIS, SSRS, SSAS, Maven, Docker, Kubernetes, Jenkins, Terraform, Informatica, Talend, Snowflake, Google BigQuery, Data Quality and Governance, Machine Learning Algorithms, Natural Language Processing, Big Data, Advanced Analytics, Statistical Methods, Data Mining, Data Visualization, Data Warehousing, Data Transformation, Critical Thinking, Communication Skills, Presentation Skills, Problem-Solving
Version Control Tools: Git, GitHub
Operating Systems: Windows, Linux, Mac OS
EDUCATION
Master of Science in Computer Science – Rutgers, The State University of New Jersey, New Brunswick, NJ, USA
Bachelor of Technology in Computer Science and Engineering – National Institute of Technology, Silchar, India
CERTIFICATION
• Azure Data Engineer Associate
• SAFe Practitioner