
Data Engineer Processing

Location:
Fairborn, OH
Posted:
March 03, 2025


Resume:

Anu Perice Synalla

Data Engineer

+1-937-***-**** : **********@*****.***

PROFESSIONAL SUMMARY

●Accomplished Data Engineer with 5+ years of experience in transforming business operations and enhancing decision-making through data-driven insights.

●Results-driven in designing, developing, and deploying large-scale distributed systems and in optimizing data infrastructures on Azure, AWS, and GCP across the healthcare, payments, retail, mortgage, and fintech industries.

●Proficient in architecting and deploying cloud solutions on AWS, Azure, and GCP, leveraging services such as Amazon Redshift, EC2, Azure Databricks, and Google Cloud Dataproc for distributed data processing.

●In-depth knowledge of data processing (requirements gathering, design, development, implementation, testing, and documentation), data modeling (analysis using star schemas and dimension tables), data integration, and data transformation (mapping, monitoring, debugging, performance tuning, and troubleshooting of Hadoop clusters).

●Expert in developing scalable data pipelines and real-time analytics systems using ETL processes, the Hadoop ecosystem (HDFS, MapReduce, Hive), Apache Spark, Kafka, dbt, and workflow automation with Airflow.

●Specialized expertise in Data warehousing solutions with Snowflake, enhancing data processing, acquisition, and transformation capabilities to streamline data retrieval and support scalable analytics.

●Managed both relational databases (MySQL, Oracle) and non-relational databases (MongoDB, Cassandra) with a strong focus on performance optimization and query tuning through DDL/DML statements to improve database reliability.

●Utilized a broad spectrum of Python libraries such as NumPy, Pandas, Scikit-learn, and TensorFlow to execute complex data manipulation and develop machine learning models, significantly enhancing predictive analytics and decision-making (illustrated in the sketch following this summary).

●Experience in cloud architecture, deploying serverless and fault-tolerant containerized applications with Docker and Kubernetes. Adept at leveraging DevOps tools such as Terraform and Jenkins for CI/CD pipelines and workflow automation, significantly improving the efficiency and security of data processing.

●Extensive experience working in the Software Development Life Cycle (SDLC), utilizing Agile, Scrum, and Waterfall methodologies to ensure project agility and precision from analysis through to deployment.
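
Illustrative sketch of the pandas and scikit-learn workflow pattern mentioned in the summary above; all file, column, and label names (claims.csv, is_fraud, etc.) are hypothetical placeholders, not details from any project listed in this resume.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical input file and columns -- placeholders only.
df = pd.read_csv("claims.csv")
df["claim_ratio"] = df["claim_amount"] / df["policy_limit"].clip(lower=1)

features = ["claim_ratio", "customer_age", "prior_claims"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["is_fraud"], test_size=0.2, random_state=42
)

# Fit a simple classifier and report AUC on the held-out split.
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))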

EDUCATION

Wright State University, USA

Master of Science in Computer Science 2022 – 2024

SKILLS

Programming Languages: SQL, Python (NumPy, Pandas, Matplotlib, SciPy, Scikit-learn, Seaborn, TensorFlow)

Big Data Ecosystem: Apache Kafka, Apache Spark (PySpark), Airflow, Hadoop, DBT, HDFS, MapReduce, Hive, Pig, Sqoop

Databases: SQL Server, PostgreSQL, MySQL, Oracle, Snowflake, DynamoDB, MongoDB, Cassandra, Teradata, HBase

BI and ETL Tools: Power BI, Tableau, MS Excel, SSIS, SSRS, SSAS, Alteryx, dbt, ELK, Talend

Cloud Platforms: AWS (S3, EC2, EMR, Redshift, VPC, IAM, CodeDeploy, ELB, CloudWatch, Lambda, AWS Glue, SNS, KMS)

Azure (VM, Storage, Data Lake, Data Factory, Databricks, Synapse, AKS)

GCP (Dataproc, Compute Engine, BigQuery, Dataflow, Data Fusion, Cloud Load Balancing, Cloud Spanner)

Containers and Orchestration: Docker, Kubernetes

DevOps: Terraform, Jenkins

Operating Systems: Windows, Linux

IDEs: PyCharm, Anaconda, Jupyter Notebook

Software Methodologies: SDLC, Waterfall, Scaled Agile (Scrum, Kanban)

EXPERIENCE

Perrigo, Grand Rapids, MI

Azure Data Engineer July 2024 – Present

Perrigo is a global leader in over-the-counter consumer goods and pharmaceuticals, focusing on health and wellness products. They aim to make lives healthier and better through affordable, high-quality products.

●Built data pipelines using Azure Data Factory and Databricks, enabling effective data management and workflow coordination. Transformed large volumes of structured, semi-structured, and unstructured data in HDFS, sourced from SQL and NoSQL (Cassandra) databases via Java and RESTful API services, improving data throughput by 40%.

●Utilized Azure Synapse Studio to develop and deploy advanced big data solutions, writing T-SQL and Python scripts to analyze and transform critical data for improved outcomes.

●Architected medium- to large-scale BI solutions tailored for analytics using Azure Data Factory, Data Lake Analytics, and Stream Analytics, improving data accessibility and supporting healthcare decisions.

●Developed PySpark applications for ETL operations across various data pipelines, enhancing data processing and contributing to a 35% efficiency increase in handling data across Azure Data Lake and Azure Synapse Analytics.

●Managed real-time data ingestion with Kafka as the messaging system, streaming data into Spark Streaming for immediate analysis and ensuring high data quality through validation flags (illustrated in the sketch after this section's environment list).

●Optimized Snowflake query performance using clustering keys, materialized views, and query optimization techniques, reducing execution times by 30% and enhancing data accessibility.

●Developed and implemented role-based access control (RBAC) and row-level security policies in Snowflake to ensure compliance with data governance and security best practices.

●Optimized Azure Databricks workflows for secure data extraction from SQL Server, using PySpark to process and securely upload sensitive data to SFTP, reinforcing regulatory compliance.

●Designed data integration solutions within Snowflake, significantly enhancing the management and analysis of large-scale datasets. Built PySpark and Spark (Scala) code to validate data from raw sources to Snowflake tables using SnowSQL.

●Created data tables using PyQt to display customer and policy information and to add, delete, and update customer records.

●Conducted data blending and preparation using Alteryx and SQL for Tableau consumption, enhancing the visualization of healthcare analytics and publishing data sources to Tableau server.

●Designed and implemented data ingestion pipelines integrating REST and SOAP APIs with Snowflake and Azure Data Lake, enabling seamless data transfer and real-time processing.

●Automated secure data exchange using SFTP and message queuing (Kafka, ActiveMQ), ensuring reliable and scalable data ingestion across distributed systems.

●Monitored data operations and system health using the ELK Stack, optimizing performance and ensuring the system reliability essential for maintaining data integrity.

●Implemented Terraform for automating Azure cloud resource scaling and provisioning, directly supporting the operational needs of applications and services, resulting in a 20% reduction in provisioning time.

●Oversaw container orchestration with Docker Swarm and Kubernetes, ensuring reliable deployment of applications. Built and configured Jenkins agents for parallel job execution, integrating CI/CD processes using Git, which enhanced the deployment pipeline and operational efficiency by 25%.

●Spearheaded all phases of the Software Development Life Cycle (SDLC) of the application using JIRA and Confluence, including requirements gathering, design, development, deployment, and analysis.

Environment: Azure, Azure Data Lake, Azure Databricks, Python, Java, Spark, Kafka, PySpark, ETL, Airflow, Snowflake, Hadoop, SQL, NoSQL, Elasticsearch, Kibana, PyQt, Tableau, Alteryx, CI/CD, Terraform, Kubernetes, SDLC
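
Illustrative sketch of the Kafka-to-Spark Structured Streaming ingestion and data-quality flagging pattern referenced in the Kafka bullet above; the broker address, topic name, schema, and Azure Data Lake paths are hypothetical placeholders, and the Delta sink assumes a Databricks/Delta Lake environment rather than the actual project configuration.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("kafka_ingest_sketch").getOrCreate()

# Hypothetical event schema.
schema = (StructType()
          .add("event_id", StringType())
          .add("amount", DoubleType())
          .add("event_ts", StringType()))

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
       .option("subscribe", "orders")                      # placeholder topic
       .load())

parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", schema).alias("r"))
          .select("r.*")
          # Simple stand-in for the data-quality flags described above.
          .withColumn("dq_flag",
                      F.when(F.col("amount") <= 0, "invalid").otherwise("ok")))

query = (parsed.writeStream
         .format("delta")
         .option("checkpointLocation",
                 "abfss://lake@account.dfs.core.windows.net/_chk/orders")   # placeholder path
         .start("abfss://lake@account.dfs.core.windows.net/bronze/orders")) # placeholder path
query.awaitTermination()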

Rocket Companies, Detroit, MI

AWS Data Engineer Sep 2023 – June 2024

Rocket Companies is a prominent player in the fintech space, specializing in digital mortgage (home loans) solutions and financial services. They leverage technology to streamline processes and enhance the customer experience.

●Led the data migration to AWS, using Amazon Redshift for efficient data warehousing and HiveQL for advanced reporting, reducing data retrieval times by 30% and enhancing report accuracy.

●Provisioned highly available AWS EC2 instances and utilized Terraform to migrate existing AWS infrastructure to a serverless architecture (AWS Lambda, Kinesis, Glue, CloudFormation), which led to a 20% improvement in financial data processing speed.

●Managed AWS configurations and network setups within Virtual Private Cloud (VPC), ensuring robust, secure data operations for mortgage processing environments.

●Developed and optimized data pipelines using Pig and Sqoop for ETL processes, transforming and aggregating data before storage in HDFS from UNIX file system, supporting advanced analytics, automation, and decision-making.

●Employed Elasticsearch and Kibana for indexing and visualizing real-time analytics, allowing stakeholders to derive actionable insights from data quickly.

●Developed and optimized data ingestion pipelines using AWS Glue and Lambda, integrating REST and SOAP APIs to streamline data flow into Redshift and S3 for real-time analytics (illustrated in the sketch after this section's environment list).

●Implemented secure data exchange using SFTP and AWS SQS for message queuing, ensuring reliable, event-driven data processing across distributed AWS environments.

●Developed and optimized ETL workflows using SSIS, implemented multidimensional data models with SSAS, and enhanced SQL Server performance for efficient data warehousing and analytics.

●Responsible for estimating the size of, monitoring, and troubleshooting the Spark Databricks cluster. Applied the Spark DataFrame API to complete data manipulation within Spark sessions, enhancing data precision and performance by 40%.

●Developed and optimized data pipelines using DBT models for data transformations in AWS Redshift.

●Imported and exported databases using SQL Server Integration Services (SSIS) and Data Transformation Services (DTS Packages), improving the efficiency and reliability of data transfers to and from the AWS platform.

●Utilized Docker and Apache Mesos to develop and manage application environments, improving deployment workflows and system consistency across finance management applications.

●Orchestrated continuous integration and deployment processes using Jenkins looper, enhancing deployment cycles and maintaining high standards for application deployments, reducing deployment-related issues by 15%.

●Implemented and managed CI/CD pipelines using Jenkins and OneOps, transforming and cataloging mortgage-related data, which streamlined data processing and improved data accuracy in reporting.

Environment: Amazon Web Services (AWS), Redshift, VPC, EC2, Kinesis, Glue, DBT, HDFS, Spark, Pig, Sqoop, ETL, ELK, SSIS, SSAS, Jenkins, OneOps, CI/CD, Docker
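
Illustrative sketch of an AWS Glue PySpark job of the kind referenced in the Glue/Lambda bullet above; the catalog database, table, column, and S3 bucket names are hypothetical placeholders, not details from the actual pipelines.

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a source table registered in the Glue Data Catalog (placeholder names).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="mortgage_raw", table_name="loan_applications"
)

# Apply a simple derived-column transform and filter before the curated write.
df = (dyf.toDF()
      .withColumn("loan_to_value", F.col("loan_amount") / F.col("appraised_value"))
      .filter(F.col("application_status").isNotNull()))

# Land Parquet in S3 for downstream Redshift loads (placeholder bucket).
(df.write.mode("overwrite")
   .parquet("s3://example-curated-bucket/loan_applications/"))

job.commit()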

AXA (Capgemini), Bangalore, India

Data Engineer May 2019 – Aug 2022

●Utilized Google Cloud infrastructure, including Google Compute Engine, Google Cloud Storage, VPC, IAM, Cloud Load Balancing, and Glacier, for robust data operations, ensuring secure and scalable data handling. Engaged in requirement gathering, business analysis, and technical design for Hadoop and Big Data projects tailored to financial services.

●Created Data Studio reports to review billing and usage of services, optimizing queries and implementing cost-saving measures crucial for enhancing operations.

●Enhanced the performance of Spark applications, adjusting batch interval times, parallelism levels, and memory tuning to optimize real-time data processing for transaction analysis and predictive analytics in receivables management.

●Developed and deployed Spark-SQL applications in Databricks for data extraction, transformation, and aggregation from multiple file formats, uncovering insights into customer usage patterns for enhancing client interactions.

●Built PySpark applications for interactive analysis, batch processing, and stream processing, facilitating dynamic data interaction to support SaaS-based financial decision-making.

●Managed large datasets using Python libraries such as NumPy and the pandas DataFrame API along with SQL; monitored and scheduled data pipelines using triggers in Azure Data Factory, ensuring timely updates and data accuracy in reporting and analytics.

●Utilized Sqoop for data import/export to ingest raw data into Google Cloud Storage, employing a Cloud Dataproc cluster to efficiently handle large volumes of financial data.

●Operated within GCP, managing tasks in GCP Cloud Storage, Cloud Shell, DataProc and utilized Google Cloud Dataflow with Python SDK for deploying both streaming and batch jobs for custom cleaning of text and JSON files and their subsequent analysis through ad hoc reports from BigQuery.

●Achieved a 70% performance optimization on an ETL pipeline by leveraging MapReduce for parallel ingestion and Python DAGs in Airflow, significantly enhancing processing speeds and data throughput (illustrated in the sketch after this section's environment list).

●Upgraded on-premises data gateways between various data sources such as SQL Server, Azure Analysis Services, and the Power BI service, facilitating seamless data integration and reporting across systems.

●Managed data ingestion to cloud services and processed the data in Databricks. Coordinated with the team to develop frameworks to generate daily ad-hoc reports and extracts from enterprise data from BigQuery, enhancing the efficiency and responsiveness of financial reporting.

●Visualized the results using Tableau dashboards and used the Python Seaborn library for data interpretation in deployment.

Environment: Google Cloud Platform (GCP), Dataflow, BigQuery, Dataproc, Cloud Storage, VPC, IAM, Hadoop, MapReduce, Spark, Spark SQL, Databricks, Airflow, ETL, Sqoop, Python SDK, Power BI, Tableau.
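
Illustrative sketch of an Airflow DAG with Python tasks of the kind referenced in the Airflow bullet above; the DAG id, task names, and callables are hypothetical placeholders standing in for the actual cleaning and BigQuery-load logic.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def clean_files(**context):
    # Placeholder for the custom text/JSON cleaning logic described above.
    print("cleaning raw files for", context["ds"])

def load_to_bigquery(**context):
    # Placeholder for the BigQuery load step (e.g. via the google-cloud-bigquery client).
    print("loading cleaned data for", context["ds"])

with DAG(
    dag_id="daily_finance_etl_sketch",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    clean = PythonOperator(task_id="clean_files", python_callable=clean_files)
    load = PythonOperator(task_id="load_to_bigquery", python_callable=load_to_bigquery)
    # Run the cleaning step before the load step each day.
    clean >> load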

Citibank, Bangalore, India

Software Data Engineer June 2018 – Apr 2019

●Created data models and established a data lake on Amazon S3 queried through AWS Athena for use with Amazon QuickSight, providing scalable analytics solutions tailored to the retail sector and supporting advanced data querying.

●Scheduled and managed data jobs and crawlers using AWS workflow features, enhancing the automation of data ingestion through AWS CodePipeline and processing tasks within the ecosystem, increasing operational efficiency by 20%.

●Utilized the Data Build Tool (dbt) for transformations in the ETL process, leveraging AWS Lambda and AWS SQS to streamline data flow and operational efficiency, allowing for more agile data handling and improved turnaround times.

●Performed metadata validation, reconciliation, and error handling in ETL (extract, transform, load) processes, ensuring data integrity and accuracy critical for data management.

●Loaded data from Oracle database and Teradata into HDFS using Sqoop, supporting deeper analytical capabilities.

●Streamlined quality control measures through comprehensive data validations and cleaning through Python, achieving 80% data completeness to ensure reliability and accuracy.

●Created Hive tables with dynamic partitions and buckets, loading various formats of data such as Avro and Parquet, and analyzed the data using HiveQL which facilitated detailed consumer behavior analysis and trend identification, aiding in targeted marketing and product placement strategies.

●Implemented real-time streaming solutions using Kafka KTables as a messaging system and Spark Streaming, enhancing decision-making capabilities in payment processing.

●Developed Databricks ETL pipelines using Spark DataFrames, Spark SQL, and Python scripting, optimizing data transformation processes for marketing and sales data, improving throughput by 35%.

●Scheduled jobs using Airflow scripts, adding various tasks to DAGs, enhancing automation and scheduling reliability with Lambda functions for timely data updates.

●Engineered efficient data retrieval processes by writing and executing complex MySQL queries from Python, utilizing the mysql-connector-python and MySQLdb packages (illustrated in the example after this section's environment list).

●Used JIRA for tracking issues and managing project tasks within agile methodologies, ensuring cross-collaboration among teams and alignment with business objectives, improving project delivery timelines by 25%.

Environment: AWS, Amazon QuickSight, Athena, Redshift, Lambda, SQS, S3, CodePipeline, Data Lake, Hadoop, HiveQL, Sqoop, Spark SQL, DataFrames, ETL, dbt, Oracle, Teradata, MySQL, Java, Python, Airflow, DAGs
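
Illustrative sketch of running a parameterized MySQL query from Python with mysql-connector-python, as referenced in the MySQL bullet above; the connection settings, table, and column names are hypothetical placeholders.

import mysql.connector

# Placeholder connection settings -- not real credentials or hosts.
conn = mysql.connector.connect(
    host="db.example.internal",
    user="etl_user",
    password="********",
    database="payments",
)
try:
    cursor = conn.cursor(dictionary=True)
    # Parameterized query: the %s placeholder keeps the date out of the SQL string.
    cursor.execute(
        """
        SELECT customer_id, SUM(amount) AS total_spend
        FROM transactions
        WHERE txn_date >= %s
        GROUP BY customer_id
        ORDER BY total_spend DESC
        LIMIT 10
        """,
        ("2019-01-01",),
    )
    for row in cursor.fetchall():
        print(row["customer_id"], row["total_spend"])
finally:
    conn.close()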


