
Data Engineer

Location:
Seattle, WA
Posted:
July 03, 2024


Soumya Sharma

Bellevue, WA ***********@*****.*** +1-253-***-**** linkedin.com/in/soumya-s

SUMMARY

●Data Engineer with 4+ years of experience across Big Data technologies, cloud platforms, and databases, including AWS, S3, Hadoop, Hive, Spark, and Python.

●Working knowledge of all phases of the Software Development Life Cycle (SDLC), including Requirement Analysis, Implementation, and Maintenance, with solid experience in both Agile and Waterfall.

●Good understanding of tools in the Hadoop Ecosystem including Pig, Hive, HDFS, MapReduce, Spark, and Scala for complex business problems.

●Proven knowledge of the Azure and AWS cloud platforms, DevOps, configuration management, infrastructure automation, and Continuous Integration and Delivery (CI/CD).

●Hands-on experience in Unified Data Analytics with Databricks, Databricks Workspace User Interface, and Managing Databricks Notebooks.

●Proficient in Python, including Discriminant Analysis and libraries such as NumPy, Pandas, SciPy, Matplotlib, Seaborn, and Scikit-learn.

●Experienced in designing and developing SQL queries, ETL packages, and business reports using the Microsoft BI Suite (SSIS/SSRS) and Tableau.

●Skilled in RDBMSs such as SQL Server and MySQL, and non-relational databases like MongoDB.
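The SQL-query and ETL-package work summarized above follows the classic extract-transform-load pattern; a minimal pure-Python sketch (with hypothetical field names, standing in for an SSIS package) might look like:

```python
import csv
import io

def extract(csv_text):
    """Extract: parse raw CSV rows into dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def transform(rows):
    """Transform: normalize names and cast amounts to float."""
    return [
        {"name": r["name"].strip().title(), "amount": float(r["amount"])}
        for r in rows
    ]

def load(rows):
    """Load: aggregate per name (a real job would write to a database)."""
    totals = {}
    for r in rows:
        totals[r["name"]] = totals.get(r["name"], 0.0) + r["amount"]
    return totals

raw = "name,amount\n alice ,10.5\nBOB,2\n alice ,4.5\n"
totals = load(transform(extract(raw)))
print(totals)  # {'Alice': 15.0, 'Bob': 2.0}
```

Each stage stays a pure function over rows, which keeps the pipeline easy to test stage by stage, much as ETL tools separate source, transformation, and destination components.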

CERTIFICATIONS

●AWS Cloud Practitioner: Amazon Web Services, Cloud Computing, Cloud Services, Cloud Platform

●IBM CE-Essentials of Big Data with Hadoop Using IBM InfoSphere BigInsights: Concepts of Big Data, Distributed Computing, Data Analytics, a business use case for Data Prediction, MapReduce, Hive, Hadoop, Pig, Oozie.

●HackerRank: SQL(Advanced)

●PwC Power BI Virtual Case Experience: Data Modelling · Data Visualization · Data Collection · Data Transformation · Power View Reports · DAX Functions

SKILLS

●Methodology: SDLC, Agile, Waterfall

●Programming Language: R, Python, SQL

●IDEs: PyCharm, Jupyter Notebook

●Big Data Ecosystem: Hadoop, MapReduce, Hive, Apache Spark, Pig

●ETL Tools: SSIS, Snowflake

●Cloud Technologies: AWS, Azure, Databricks

●Packages: NumPy, Pandas, Matplotlib, SciPy, Scikit-learn, Seaborn, TensorFlow, PySpark

●Reporting Tools: Tableau, Power BI, SSRS

●Databases: MongoDB, MySQL, T-SQL

●Other Tools: Git, MS Office, Microsoft Excel

●Operating Systems: Windows, Linux, Mac

WORK & LEADERSHIP EXPERIENCE

WaferWire Cloud Technologies Mar 2024 – Current

Client: Microsoft

Data Engineer:

●Resolved compliance governance alerts by tracking and addressing active bugs through Azure DevOps and Power BI dashboards for alert monitoring, following a structured process to implement fixes and ensuring all checks were validated and published.

●Experienced with DevOps, the Azure cloud platform and its features, and the CI/CD (Continuous Integration / Continuous Deployment) process.

●Efficiently managed the deletion of unused Privacy Compliance Dashboard (PCD) agents by unlinking data assets and following structured removal procedures, ensuring compliance and troubleshooting issues as needed.

●Worked on the migration of Classic release pipelines to YAML format in Azure DevOps, employing both manual methods and the OneBranchMigrator tool to adhere to the OneBranch standard.

●Onboarded email notifications in Azure DevOps repositories by configuring Component Governance settings, enabling email notifications, and adding the reporting address for each repo as per the provided Excel list.

●Investigated designing and developing custom applications with Microsoft PowerApps to meet specific business requirements, integrating seamlessly with backend data sources like SharePoint, Azure SQL Database, and Microsoft Dataverse.

Molina Healthcare Feb 2023 - Feb 2024

Data Engineer:

●Worked in an Agile environment, accommodating and testing newly proposed changes at any point during the release.

●Worked with Data Lake architecture and set up the Hadoop environment for ingesting claim and provider data from multiple sources.

●Developed and implemented MapReduce programs for analyzing Big Data in different file formats, covering both structured and unstructured data.

●Responsible for estimating the cluster size, monitoring, and troubleshooting of the Spark Databricks cluster.

●Installed the application on AWS EC2 instances, configured storage on S3 buckets, and optimized data processing using Redshift for enhanced query performance and analytics.

●Transformed data, converted it into Parquet format, and loaded it using AWS Glue jobs.

●Designed and managed AWS Identity and Access Management (IAM) policies to secure AWS resources and ensure least privilege access.

●Developed various data connections from data sources to SSIS and Tableau Server for report and dashboard development.

●Designed, implemented, and tested a more robust and efficient database to power data behind SSRS.

●Used Pandas, NumPy, SciPy, and Scikit-learn in Python for scientific computing and data analysis.

●Led the capability-building exercise into Big Data with open-source MongoDB and in the cloud, utilizing NoSQL for non-relational data storage and retrieval.

●Incorporated Git to track history, merge code between different versions of the software, and manage check-in/check-out.
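The MapReduce work on claim data described above boils down to a map, shuffle, and reduce phase; a toy pure-Python illustration of that pattern, with hypothetical provider/amount records, could be:

```python
from collections import defaultdict

# Toy claim records (hypothetical fields: provider, amount)
claims = [
    ("prov_a", 120.0), ("prov_b", 75.0),
    ("prov_a", 30.0), ("prov_b", 25.0), ("prov_a", 50.0),
]

def map_phase(records):
    """Map: emit (key, value) pairs — here, (provider, amount)."""
    for provider, amount in records:
        yield provider, amount

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum claim amounts per provider."""
    return {key: sum(values) for key, values in groups.items()}

totals = reduce_phase(shuffle(map_phase(claims)))
print(totals)  # {'prov_a': 200.0, 'prov_b': 100.0}
```

In a real Hadoop job the mapper and reducer run on separate nodes and the framework performs the shuffle, but the per-key grouping logic is exactly this.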

Infosys April 2018 - July 2021

Client: E.ON

Data Engineer:

●Involved throughout the Software Development Life Cycle (SDLC) using Waterfall.

●Worked on Apache Spark streaming API on Big Data distributions in the active cluster environment.

●Integrated diverse data sources into the data warehouse, enabling comprehensive data analysis and reporting.

●Handled importing of data from various data sources, performed transformations using Hive and Pig, and loaded data into HDFS.

●Reduced data integration time by 25% by streamlining ETL processes and leveraging Azure Data Factory to orchestrate data workflows, giving teams faster access to accurate and up-to-date data.

●Worked on Azure Synapse Analytics to implement PySpark notebooks.

●Transformed and loaded data from source systems to Azure Data Storage services using Azure Databricks.

●Implemented Integration test cases and developed predictive analytics using Apache Spark Scala APIs.

●Automated Power BI reports after data cleaning, reducing report generation time by 30%, providing stakeholders with real-time insights for data-driven decision-making, and identifying critical KPIs.

●Developed and implemented 30+ functions, triggers, views, and stored procedures in MySQL, optimizing data processing efficiency by 25% and enhancing overall database performance by 20%.
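The MySQL views and triggers mentioned above can be sketched with Python's built-in sqlite3 module as a lightweight stand-in (MySQL syntax differs slightly; table and column names here are hypothetical):

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL instance.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE readings (meter_id TEXT, kwh REAL);
CREATE TABLE audit_log (meter_id TEXT, kwh REAL);

-- Trigger: mirror every insert into an audit table.
CREATE TRIGGER log_reading AFTER INSERT ON readings
BEGIN
    INSERT INTO audit_log VALUES (NEW.meter_id, NEW.kwh);
END;

-- View: total consumption per meter.
CREATE VIEW usage_per_meter AS
    SELECT meter_id, SUM(kwh) AS total_kwh
    FROM readings GROUP BY meter_id;
""")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("m1", 2.5), ("m2", 1.0), ("m1", 1.5)],
)
totals = dict(conn.execute("SELECT meter_id, total_kwh FROM usage_per_meter"))
audit_rows = conn.execute("SELECT COUNT(*) FROM audit_log").fetchone()[0]
print(totals, audit_rows)  # {'m1': 4.0, 'm2': 1.0} 3
```

Pushing aggregation into a view and auditing into a trigger keeps that logic in the database rather than in every client, which is the kind of consolidation that yields the processing-efficiency gains described.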

EDUCATION

New Jersey Institute of Technology Master of Science in Data Science - Computational Track Dec 2022

Coursework: Machine Learning, Deep Learning, Applied Statistics, Data Analytics with R, Introduction to Big Data

Activities: Exam Proctor

Lakshmi Narain College of Technology and Science Bachelor of Engineering in Computer Science & Engineering July 2017

Coursework: Data Structures, Java Programming, Object-Oriented Technology, Analysis & Design of Algorithms, Operating Systems, Database Management Systems.

Activities: Paper Presentation - National Conference, Start-up fair - Indian Institute of Technology (IIT BHU), core team member of HOPE and Raahat (NGOs), Volunteer - Business Master (Planet Engineer) & Go-Kart.


