Data Engineer

Location:

United States

Salary:

$60/hr

Posted:

August 13, 2024

Resume:

Manoj Osuri

Sr. Data Engineer

Email: *********.*********@*****.***

Phone: 609-***-****

SUMMARY:

9+ years of experience in Information Technology, including 4 years in cloud data technologies, along with broad proficiency in the Business Intelligence space. Worked across the full Software Development Life Cycle (SDLC), including design, development, data analysis, UAT testing, deployment, and post-production support and maintenance.

Expertise in providing ERP (Enterprise Resource Planning) solutions and managing various implementation projects.

Involved in the design and implementation of data ingestion pipelines from multiple sources using Azure Data Factory and Databricks.

Involved in end-to-end automation of the data pipelines in Azure Data Factory and Azure Databricks.

Coordinated multiple projects simultaneously to provide solutions to the business.

Strong working experience with database and reporting tools such as SQL Server, Tableau, and Power BI.

Developed ETL pipelines using tools like Snowpipe, Matillion, and Talend to efficiently load and transform data from various sources into Snowflake.

Used Pandas, NumPy, Matplotlib, and Scikit-learn in Python to develop and apply machine learning algorithms.

Hands-on experience coding MapReduce/YARN programs in Java, Scala, and Python to analyze big data, and strong experience building data pipelines using big data technologies.

Migrated legacy data systems to Snowflake, resulting in a 30% reduction in maintenance costs and improved data accessibility for business users.

Proficient in writing advanced SQL queries to extract insights and perform data analysis on Snowflake.

Good understanding of database and data warehousing concepts (OLTP & OLAP).

Develops proper documentation while adhering to quality standards and project delivery timelines.

Proficient in understanding and learning new technologies as project requirements demand.

Flexible and able to adapt to any situation by continually improving knowledge of emerging technologies.

TECHNICAL SKILLS:

Programming Languages

SQL, PL/SQL, Python, Shell scripting

Web Technologies

HTML, CSS

Reporting Tools

Tableau, Power BI, SSIS, SSRS

Databases

MySQL, Oracle, PostgreSQL

Operating System

Windows, Linux

Other Data Technologies

Hadoop, Hive, HBase, PySpark, Informatica, Azure Data Factory, Databricks, Azure DevOps

Certifications:

Microsoft Certified (Certification ID: 994951901) - Power BI Analyst Associate

Databricks Certified: Spark Developer Associate

EDUCATION:

Master of Science in Project Management - June 2019

Harrisburg University of Science and Technology

Bachelor of Technology - June 2014

Jawaharlal Nehru Technological University, Kakinada, India

WORK EXPERIENCE:

Role: Sr. Data Engineer

Client: BNY Mellon, New Jersey May 2023 – Present

Responsibilities:

Gained experience handling data in various formats such as Avro, Parquet, and ORC.

Involved in different stages of project development, from planning, development, report design, and testing to deployment and maintenance.

Performed extensive debugging, data validation, and data cleanup on large datasets.

Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, Spark SQL, and Azure Data Lake Analytics. Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.

Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data between different sources such as Azure SQL, Blob Storage, Azure SQL Server (on premises), and a write-back tool, in both directions.

Used Python packages to do data transformations for our applications.

Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats to uncover business insights and customer usage patterns.

Created SQL scripts and stored procedures for data manipulation to build special reports based on requirements.

Troubleshot wireless issues for users.

Collaborated with data scientists and biostatisticians to build and deploy machine learning models for predictive analysis, drug discovery, and clinical trial optimization using Databricks Machine Learning and MLflow.

Implemented ETL pipelines in Snowflake using DBT, tested data quality and performed schema tests, referential integrity tests, and custom tests using DBT, debugged complex queries with DBT by splitting them into multiple models and macros for separate testing, and scheduled and automated DBT runs to keep data models up-to-date and synchronized with source systems.

Programmed in Python, Java, and Scala to develop data engineering solutions.

Developed ad hoc reports in Power BI by connecting to various data sources and using data blending techniques.

Identified required tables and views and exported them into Hive. Performed ad hoc queries using Hive joins, partitioning, and bucketing for faster data access.

Implemented various Hive queries for analytics. Created external tables, optimized Hive queries, and improved cluster performance by 30%.

Leveraged GitLab for source code management, CI/CD pipelines, and project tracking, improving collaboration and productivity in the development lifecycle.

Performed extraction, aggregation, and consolidation of Adobe data within AWS Glue using PySpark.

Proficient in writing complex SQL queries and leveraging Snowflake's advanced analytical functions for data transformations, aggregations, and business intelligence tasks.

Utilized Snowflake's security features, including role-based access control (RBAC), encryption, and external token-based authentication, to enforce data protection and compliance.

Developed MapReduce jobs in Java to perform data cleansing, preprocessing, and transformations on large datasets.

Created Technician Standard report time dashboards to monitor and audit technicians' performance, assigned job activities, and related timelines.

Actively worked on project migration activities by changing and mapping the new data flows and gateway connections so that they are reflected in reports.

Role: Data Engineer

Client: Cummins, Indianapolis Nov 2021 – May 2023

Responsibilities:

Performed project management activities, including developing project plans and overseeing the progress of SDLC activities for the product.

Involved in gathering, analyzing, and documenting business requirements and functional requirements.

Developed and maintained end-to-end operations of ETL data pipelines and worked with large data sets in Azure Data Factory.

Leveraged Cassandra CQL and Java APIs to retrieve and manipulate data from Cassandra's tables.

Developed multiple notebooks in Azure Databricks to ingest data from source systems such as Oracle, SQL Server, and flat files.

Analyzed historical customer data and built a model to identify customers who are most likely to default on premiums, using different machine learning algorithms like linear and logistic regression.

Developed ETL pipelines using PySpark, Spark SQL, and the DataFrame APIs.

Implemented full and incremental load using Azure Data Factory as per business requirement.

Involved in end-to-end automation of the pipelines in Azure Data Factory and implemented audit logs for all pipelines.

Designed, built, and managed ELT data pipelines leveraging Airflow, Python, DBT, Stitch Data, and GCP solutions.

Designed ETL pipelines using Apache Airflow to extract, transform, and load data from various sources into Snowflake.

Involved in troubleshooting for wireless clients.

Developed MapReduce/Spark Python modules for machine learning and predictive analytics in Hadoop on AWS. Implemented a Python-based distributed random forest via Python streaming.

Engineered custom Java MapReduce jobs to extract insights and patterns from diverse data structures in support of data-driven decision-making.

Optimized query performance and resource utilization on Snowflake, leading to reduced costs and faster data access for the team.

Involved in setting up Kubernetes clusters for running microservices and deploying applications into production clusters via CI/CD pipelines, using Jenkins as the build and deploy tool.

Utilized Control-M for scheduling event-based and time-based tasks, and managed source code in Azure Git and GitHub repositories for ADF pipelines and Azure Databricks.

Implemented and maintained Snowflake data warehouse, enabling efficient storage and querying of vast datasets.

Implemented custom MapReduce jobs in Java to process and analyze structured and unstructured data and extract meaningful insights and patterns.

Involved in extracting and enriching multiple Cassandra tables using joins in Spark SQL. Also converted Hive queries into Spark transformations.

Implemented and maintained data models using tools like Erwin, ER/Studio, and Snowflake's data modeling features.

Worked on improving the performance of the pipelines/jobs.

Wrote programs using PySpark in Azure Databricks.

Created reports in Power BI to show the status of work and presented them to IT leads.

Participated in daily scrum and other sprint meetings under the Agile methodology; developed and prioritized product backlogs and sprint burndown charts to meet daily deliverables and timelines.

Embedded Power BI reports in Office 365 and SharePoint through a Power BI service account.

Tuned report performance by refining data queries and variables to improve query response time.

Created reports using SSRS for ad hoc SQL queries.

Coordinated with source system owners; handled daily ETL jobs, change progress monitoring, and data warehouse target schema design (star schema) and maintenance.

Developed multiple data pipelines to ingest data from multiple source systems using Azure Data Factory.

Troubleshot and supported reports in Power BI workspaces whenever issues arose in production.

Role: Data Engineer/ETL Developer

Client: HABP GLOBAL LLC, New Jersey Aug 2017 - Nov 2021

Responsibilities:

Created pipelines in AWS to migrate data from S3 to Redshift.

Implemented Apache Airflow to monitor pipelines.

Involved in the design, analysis, implementation, testing, and support of ETL processes for stage, ODS, and mart, and prepared ETL flow documentation.

Created project plans for design and development, and built a data model according to business needs in collaboration with the data architect.

Configured Staging tables, foreign key relationships, static lookups, dynamic lookups, queries and packages.

Worked with the Informatica Data Quality (IDQ) toolkit, including its analysis, data cleansing, data matching, data conversion, exception handling, reporting, and monitoring capabilities.

Adopted Agile practices and was responsible for driving them to achieve the best possible outcome.

Developed MapReduce/Spark Python modules for machine learning and predictive analytics in Hadoop on AWS.

Gained exposure to version control systems like Git and collaborated on code repositories.

Involved in developing multiple MapReduce programs in Java to extract, transform, and merge data from multiple file formats, including XML, JSON, and CSV.

Created SQL scripts and stored procedures for data manipulation at the database level to build special reports based on requirements.

Worked with Azure Functions, Web Apps, and Logic Apps, and implemented encryption techniques.

Designed and developed Row-Level Security (RLS) to give access to technicians and managers based on hierarchy.

Developed SSIS packages to transfer data between platforms and validate it during transfer to different DBMSs.

Analyzed and optimized pertinent data stored in Snowflake using PySpark and Spark SQL.

Created common reusable objects for the ETL team, oversaw coding standards, and reviewed high-level design specifications and ETL coding and mapping standards.

Defined System Trust and Validation rules for base object column.

Involved in migrating the ETL application from development environment to testing environment.

Involved in migrating and cloning reports that were connected to on-premises databases to the AWS cloud.

Developed ad hoc reports in Power BI by connecting to various data sources and using data blending techniques.

Developed Oozie Bundles to Schedule Pig, Sqoop and Hive jobs to create data pipelines.

Developed Hive queries to do analysis of the data and to generate the end reports to be used by business users.

Migrated SSRS reports to Power BI.

Developed a supply chain dashboard that helps the business track backordered items and restock SKUs as required.

Created various DAX measures as per the requirements of the dashboard.

Worked with on-premises and cloud data sources such as SQL Server, Oracle, AWS Athena, and Salesforce, and connected them to Power BI.

Developed an application in PowerApps using Excel and SharePoint Online.

Developed, tested, and maintained data pipelines in Databricks by connecting various data sources and publishing the data to AWS.

Developed an understanding of the complex relational database and data warehouse structures to feed data continuously to the Power BI reports without any load failures.

Analyzed PL/SQL procedures and troubleshot related query issues.

Created a gateway connection for an on-premises data source and mapped the data source to the right path.

Role: ETL Developer

Client: Geneca Solutions, Hyderabad July 2014 - Feb 2017

Responsibilities:

Gathered requirements and wrote ad hoc SQL query batches to update data and metadata for delta and history loads.

Created data marts and multi-dimensional models such as star schema and snowflake schema.

Created complex Informatica mappings to load the data mart and monitored them. The mappings involved extensive use of transformations like Aggregator, Filter, Router, Expression, Joiner, Union, Normalizer and Sequence generator.

Designed and developed PL/SQL packages, stored procedures, tables, views, indexes, and functions. Worked with partitioned tables and automated the process of dropping and creating partitions in Oracle Database.

Designed and developed a table-driven ETL process with Perl and shell scripts, which in turn call OWB mappings, Informatica, and Oracle code to load the EDW.

Worked with ETL team in creating External Batches to execute mappings using Informatica workflow designer to integrate data from varied sources like Oracle, flat files, and SQL databases.

Implemented an insert/update strategy using a nightly batch process, SOAP for near-real-time data, and IDD.

Extracted data from flat files, MS Excel, and MS Access, transformed it based on user requirements using Informatica PowerCenter, and loaded it into the target on a schedule.

Designed and developed in the Informatica IDQ environment.

Performed unit testing, tuned for better performance, and created documents such as the source-to-target data mapping document and the unit test cases document.
