NAGENDRA PASAM
Columbus, OH 43202 | nagendrababupasam78@gmail.com
PROFESSIONAL SUMMARY
• Dynamic and motivated IT professional with around 11 years of experience as a Data Engineer, with expertise in designing data-intensive applications using the Hadoop ecosystem, Big Data analytics, Cloud Data Engineering, Data Warehouse/Data Mart, Data Visualization, Reporting, and Data Quality solutions.
• In-depth knowledge of Hadoop architecture and its components, including YARN, HDFS, Name Node, Data Node, Job Tracker, Application Master, Resource Manager, Task Tracker, and the MapReduce programming paradigm.
• Extensive experience in Hadoop-led development of enterprise-level solutions utilizing Hadoop components such as Apache Spark, MapReduce, HDFS, Sqoop, Pig, Hive, HBase, Oozie, Flume, NiFi, Kafka, Zookeeper, and YARN.
• Profound experience in performing data ingestion and data processing (transformations, enrichment, and aggregations).
• Strong knowledge of distributed systems architecture and parallel processing, with an in-depth understanding of the MapReduce programming paradigm and the Spark execution framework.
• Experienced with Spark in improving the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark-SQL, the DataFrame API, Spark Streaming, and Pair RDDs; worked extensively with PySpark and Scala.
• Handled ingestion of data from different sources into HDFS using Sqoop and Flume, and performed transformations using Hive and MapReduce.
• Managed Sqoop jobs with incremental load to populate Hive external tables.
• Experience importing streaming data into HDFS using Flume sources and sinks, and transforming the data using Flume interceptors.
• Experience with the Oozie workflow scheduler to manage Hadoop jobs as Directed Acyclic Graphs (DAGs) of actions with control flows.
• Designed and developed Power BI graphical and visualization solutions from business requirement documents, and planned interactive dashboards.
• Utilized Azure PaaS services to analyze, plan, and develop modern data solutions that facilitate data visualization.
• Recognized an application's current state in production and assessed how a new implementation would affect current business procedures.
• Used a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics) to extract, transform, and load data from source systems into Azure data storage services.
• Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, and Azure DW) and processed it in Azure Databricks.
• Created serverless YAML files to provision AWS resources.
• Experience with partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance.
• Experience with different file formats such as Avro, Parquet, ORC, JSON, and XML.
• Expertise in creating, debugging, scheduling, and monitoring jobs using Control-M and Oozie.
• Hands-on experience in handling database issues and connections with SQL and NoSQL databases such as MongoDB, HBase, Cassandra, SQL Server, and PostgreSQL.
• Created Java applications to handle data in MongoDB and HBase; used Phoenix to create a SQL layer on HBase.
• Experience in developing enterprise-level solutions using batch processing (Apache Pig) and streaming frameworks (Spark Streaming, Apache Kafka, and Apache Flink).
• Migrated databases from SQL databases (Oracle and SQL Server) to NoSQL databases (Cassandra/MongoDB).
• Experience in designing and creating RDBMS tables, views, user-defined data types, indexes, stored procedures, cursors, triggers, and transactions.
• Configured alerting rules and set up PagerDuty alerting in Grafana for Kafka, Zookeeper, Druid, Cassandra, Spark, and various microservices.
• Set up and maintained logging and monitoring subsystems using tools like Elasticsearch, Fluentd, Kibana, Prometheus, Grafana, and Alertmanager.
• Expert in designing ETL data flows by creating mappings/workflows to extract data from SQL Server, and in data migration and transformation from Oracle/Access/Excel sheets using SQL Server SSIS.
• Expert in designing parallel jobs using various stages such as Join, Merge, Lookup, Remove Duplicates, Filter, Dataset, Lookup File Set, Complex Flat File, Modify, Aggregator, and XML.
• Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SES, SQS, Lambda, EMR, and other services of the AWS family.
• Created and configured new batch jobs in the Denodo scheduler with email notification capabilities; implemented cluster settings for multiple Denodo nodes and configured load balancing to improve performance.
• Instantiated, created, and maintained CI/CD (continuous integration and deployment) pipelines and applied automation to environments and applications; worked with automation tools such as Git, CFT, and Ansible.
• Experienced with JSON-based RESTful web services and XML-based SOAP web services; worked on various applications using Python-integrated IDEs such as Sublime Text and PyCharm.
• Built and productionized predictive models on large datasets utilizing advanced statistical modeling, machine learning, and other data mining techniques.
• Developed intricate algorithms based on deep-dive statistical analysis and predictive data modeling to deepen relationships, strengthen longevity, and personalize customer interactions.
SKILLS
• Hadoop
• MapReduce
• HDFS
• Sqoop
• PIG
• Hive
• HBase
• Oozie
• Flume
• NiFi
• Kafka
• Zookeeper
• YARN
• Apache Flink
• Apache Spark
• Mahout
• Spark MLlib
• Apache Druid
• Oracle
• MySQL
• SQL Server
• MongoDB
• Cassandra
• DynamoDB
• PostgreSQL
• Teradata
• Java
• Python
• PySpark
• Pandas
• NumPy
• Scala
• Shell script
• Perl script
• SQL
• GCP
• AWS
• Microsoft Azure
• PyCharm
• Eclipse
• Visual Studio
• SQL*Plus
• SQL Developer
• TOAD
• SQL Navigator
• Query Analyzer
• SQL Server Management Studio
• SQL Assistant
• Postman
• SVN
• Git
• GitHub
• Windows 7/8/XP/2008/2012
• Ubuntu Linux
• MacOS
• Kerberos
• Dimension Modeling
• ER Modeling
• Star Schema Modeling
• Snowflake Modeling
• Control-M
• Grafana
WORK HISTORY
AWS Data Engineer, 03/2024 - Current
AgFirst, Columbia, SC
• Designed and developed scalable and cost-effective architecture in AWS Big Data services for data life cycle of collection, ingestion, storage, processing, and visualization
• Developed a PySpark data ingestion framework to ingest source claims data into Hive tables by performing data cleansing and aggregations and applying de-dup logic to identify updated and latest records (see the illustrative sketch at the end of this role)
• Involved in creating an end-to-end data pipeline within a distributed environment using Big Data tools, the Spark framework, and Tableau for data visualization
• Worked on developing CFTs (CloudFormation templates) for migrating infrastructure from lower environments to higher environments
• Leveraged Spark features such as in-memory processing, distributed cache, broadcast variables, accumulators, and map-side joins to implement data preprocessing pipelines with minimal latency
• Created a Python topology script to generate CloudFormation templates for provisioning EMR clusters in AWS
• Created serverless YAML files to provision AWS resources
• Experience in using the AWS services Athena, Redshift, and Glue ETL jobs
• Integrated Big Data Spark jobs with EMR and Glue to create ETL jobs for around 450 GB of data daily
• Designed, developed, and deployed data lakes, data marts, and data warehouses on AWS using services such as AWS S3, AWS RDS, and AWS Redshift, provisioned with Terraform
• Designed, developed, and deployed ETL pipelines using AWS services such as Lambda, Glue, EMR, Step Functions, CloudWatch Events, SNS, Redshift, S3, and IAM
• Developed Spark applications using Scala and Spark-SQL for data extraction, transformation, and aggregation from multiple data file formats to uncover insights into the customer usage patterns
• Designed and developed ETL pipelines and dashboards using Step Functions, SageMaker, Lambda, Glue, and QuickSight
• Worked on TensorFlow and Scikit-learn to develop and deploy predictive models for a variety of applications, including natural language processing, computer vision, and time series forecasting
• Implemented end-to-end machine learning pipelines on AWS, utilizing services such as Amazon SageMaker for model training and deployment, AWS Lambda for serverless computing, and Amazon S3 for data storage and retrieval
• Optimized model performance and scalability through integration with AWS tools, employing techniques like distributed training with Amazon SageMaker's managed infrastructure and leveraging AWS Elastic Compute Cloud (EC2) for parallel processing of large datasets
• Created Terraform modules and resources to deploy AWS services
• Worked on processing batch and real-time data using Spark with Scala
• Designed and developed a real-time stream processing application using Spark, Kafka, Scala, and Hive to perform streaming ETL and apply machine learning
• Wrote Spark programs in Scala for data quality checks
• Ingested data into the Cargill data lake from different sources and performed transformations in the data lake with Spark/Scala per business requirements
• Used Apache Spark DataFrames, Spark-SQL, and Spark MLlib extensively, designing and developing POCs using Scala, Spark SQL, and the MLlib libraries
• Developed CI/CD pipelines and built the required Docker images for the pipelines
• Worked on building centralized Data Lake on AWS Cloud utilizing primary services like S3, EMR, Redshift, and Athena
• Created different Power BI reports using Power BI Desktop and the online service, and scheduled refreshes
• Assisted end users with problems installing Power BI Desktop, installing and configuring the Personal and On-Premises gateways, connecting to data sources, and adding different users
• Designed and developed Flink pipelines to consume streaming data from Kafka and applied business logic to massage, transform, and serialize raw data
• Built data ingestion pipelines and moved terabytes of data from existing data warehouses to the cloud, scheduled through AWS Step Functions and using EMR, S3, and Spark
• Worked extensively on fine-tuning Spark applications and optimizing SQL queries
• Developed PySpark-based pipelines using Spark DataFrame operations to load data to the EDL, using EMR for job execution and AWS S3 as the storage layer
• Created a full spectrum of data engineering pipelines: data ingestion, data transformations, and data consumption
• Developed an ETL application using Spark, Scala, and Java on EMR to process/transform files and loaded them into AWS S3
• Queried and ran analysis over processed Analytics data using Athena
• Improved the performance of the pipelines further using Apache Spark and Scala with batch and stream processing of the data based on the requirement
• Migrated databases from SQL databases (Oracle and SQL Server) to NoSQL databases (Cassandra/MongoDB)
• Automated the data flow and data validations on the input and output data to simplify the testing process using Shell Scripting and SQL
• Established infrastructure and service monitoring using Prometheus and Grafana
• Environment: Azure Databricks, AWS EMR, S3, RDS, Redshift, Lambda, Boto3, DynamoDB, Apache Spark, Google Cloud Platform (GCP), Pandas, NumPy, HBase, Apache Kafka, Hive, Sqoop, Flink, MapReduce, Snowflake, Scala, Apache Pig, Java, SSRS, Tableau
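Illustrative sketch for the de-dup ingestion bullet above: a minimal PySpark example of keeping only the latest record per business key before loading a Hive table. The column names (claim_id, updated_ts), the S3 path, and the table name are hypothetical placeholders, not details from the actual project.

```python
# Minimal, illustrative sketch; paths, columns, and table names are hypothetical.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("claims-ingestion-dedup")
         .enableHiveSupport()
         .getOrCreate())

# Read raw claims extracts landed on S3/HDFS.
raw = spark.read.parquet("s3://example-bucket/landing/claims/")

# Basic cleansing: trim the business key and drop rows missing it.
cleansed = (raw
            .withColumn("claim_id", F.trim(F.col("claim_id")))
            .dropna(subset=["claim_id"]))

# De-dup: keep only the latest record per claim, based on the update timestamp.
w = Window.partitionBy("claim_id").orderBy(F.col("updated_ts").desc())
latest = (cleansed
          .withColumn("rn", F.row_number().over(w))
          .filter(F.col("rn") == 1)
          .drop("rn"))

# Load the curated records into a Hive table.
latest.write.mode("overwrite").saveAsTable("claims_db.claims_curated")
```

A window ordered by the update timestamp is a common way to drop superseded rows without maintaining a separate lookup of "latest" keys.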
AWS Data Engineer, 05/2022 - 02/2024
Republic Services, Phoenix, AZ
• Created and managed various types of Snowflake tables, including transient, temporary, and persistent tables, to cater to specific data storage and processing needs
• Implemented advanced partitioning techniques in Snowflake to significantly enhance query performance and expedite data retrieval
• Defined robust roles and access privileges within Snowflake to enforce strict data security and governance protocols
• Designed solutions to process high-volume data stream ingestion, processing, and low-latency data provisioning using Hadoop ecosystem components (Hive, Pig, Sqoop, and Kafka) along with Python, Spark, Scala, NoSQL, NiFi, and Druid
• Implemented regular expressions in Snowflake for seamless pattern matching and data extraction tasks
• Developed and implemented Snowflake scripting solutions to automate critical data pipelines, ETL processes, and data transformations
• Developed and optimized ETL workflows using AWS Glue to extract, transform, and load data from diverse sources into Redshift for efficient data processing
• Configured and fine-tuned Redshift clusters to achieve high-performance data processing and streamlined querying
• Integrated AWS SNS and SQS to enable real-time event processing and efficient messaging
• Implemented AWS Athena for ad-hoc data analysis and querying on data stored in AWS S3 (see the illustrative sketch at the end of this role)
• Designed and implemented data streaming solutions using AWS Kinesis, enabling real-time data processing and analysis
• Designed and developed Flink pipelines to consume streaming data from Kafka and applied business logic to massage, transform, and serialize raw data
• Effectively managed DNS configurations and routing using AWS Route53, ensuring efficient deployment of applications and services
• Implemented robust IAM policies and roles to ensure secure user access and permissions for AWS resources
• Developed and optimized data processing pipelines using Hadoop ecosystem technologies such as HDFS, Sqoop, Hive, MapReduce, and Spark
• Implemented Spark Streaming for real-time data processing and advanced analytics
• Demonstrated expertise in scheduling and job automation using IBM Tivoli, Control-M, Oozie, and Airflow, for execution of data processing and ETL pipelines
• Designed and developed database solutions using Teradata, Oracle, and SQL Server, including schema design and optimization, stored procedures, triggers, and cursors
• Proficient in utilizing version control systems such as Git, GitLab, and VSS for efficient code repository management and collaborative development processes
• Environment: AWS, AWS S3, Redshift, EMR, SNS, SQS, Athena, Glue, CloudWatch, Kinesis, Route53, IAM, Sqoop, MySQL, HDFS, Apache Spark, Hive, Cloudera, Kafka, Google Cloud Platform (GCP), Zookeeper, Oozie, PySpark, Pandas, NumPy, Ambari, JIRA, IBM Tivoli, Control-M, Flink, Druid, Airflow, Teradata, Oracle, SQL
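Illustrative sketch for the Athena bullet above: a minimal boto3 example of how an ad-hoc Athena query over S3-backed data might be submitted and polled. The region, database, table, and result-bucket names are placeholders, not details from this engagement.

```python
# Minimal, illustrative sketch; region, database, table, and bucket names are hypothetical.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Kick off an ad-hoc query against data already catalogued over S3.
response = athena.start_query_execution(
    QueryString="SELECT service_area, COUNT(*) AS pickups FROM pickups_raw GROUP BY service_area",
    QueryExecutionContext={"Database": "analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/adhoc/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query reaches a terminal state, then fetch the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:  # the first row returned is the header row
        print([col.get("VarCharValue") for col in row["Data"]])
```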
Azure Data Engineer, 01/2021 - 04/2022
Value Labs Inc, Princeton, NJ
• The main objective of the project is to create and manage moderate to advanced data models, along with developing and maintaining advanced reports that deliver precise and timely data for both internal and external clients
• This involves conducting Data Analysis, Data Profiling, Data Modelling, and Data Governance utilizing Python, Microsoft Azure Cloud Services, ETL, and Big Data Technologies
• The project centers on gathering vital data for prominent software companies renowned for their robust machine learning models, covering text, speech, and image analysis
• Responsibilities:
• Developed ETL pipelines in and out of the data warehouse using a combination of Python and Snowflake's SnowSQL, writing SQL queries against Snowflake
• Developed Spark applications using Scala to perform data cleansing, validation, transformation and summarization activities according to the requirement
• Developed data pipelines using StreamSets Data Collector to store data from Kafka into HDFS and HBase
• Used Azure Data Factory as an orchestration tool for integrating data from upstream to downstream systems
• Created an application interface document for the downstream team to build a new interface to transfer and receive files through Azure Data Share
• Created linked service to land the data from SFTP location to Azure Data Lake
• Used CosmosDB for storing catalog data and for event sourcing in order processing pipelines
• Worked on streaming the data using Kafka, Spark, and Hive, and developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process the data using SQL
• Worked with Spark Python libraries to manipulate data using broadcast joins and sort-merge joins (see the illustrative sketch at the end of this role)
• Built NiFi dataflow to consume data from Kafka, make transformations on data, place in HDFS and exposed port to run Spark streaming job
• Worked on migration of on-premises data (Oracle/SQL Server/DB2/MongoDB) to Azure Data Lake Storage (ADLS) using Azure Data Factory (ADF V1/V2)
• Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark-SQL, DataFrame, Pair RDD, and Spark YARN
• Developed Spark applications using PySpark and Spark-SQL for data transformation, and aggregation from multiple file formats for analyzing & transforming the data to uncover insights into the customer usage patterns
• Involved in developing various Machine Learning Models such as Logistic regression, KNN, and Gradient Boosting with Pandas, NumPy, Seaborn, Matplotlib, Scikit-learn in Python
• Automated all the jobs, for pulling data from FTP server to load data into Hive tables, using Oozie workflows
• Developed Pig Latin scripts to extract the data from the web server output files to load into HDFS
• Converted Hive/SQL queries into Spark transformations using Spark Data Frames and Scala
• Worked with different Hive file formats (Text, SequenceFile, ORC, Parquet, and Avro) to analyze the data and build the data model, reading from HDFS, processing through Parquet files, and loading into HBase tables
• Developed several new MapReduce programs to analyze and transform the data to uncover insights into customer usage patterns
• Designed and developed custom aggregation framework for reporting and analytics in Hive, Presto and Vertica
• Worked on Apache Airflow to run tasks in parallel to create a database in MongoDB and Cassandra
• Performed data quality issue analysis using SnowSQL by building analytical warehouses on Snowflake
• Developed MapReduce applications using Hadoop MapReduce programming for processing and used compression techniques to optimize MapReduce Jobs
• Implemented Jenkins and built pipelines to drive all microservice builds out to Docker Registry and deployed to Kubernetes
• Built CI/CD pipelines using Azure DevOps
• Deployed applications integrating Git version control with it and optimized the PySpark jobs to run on Kubernetes Cluster for faster data processing
• Worked on optimization of existing ETL workflows, implementing strategic improvements that resulted in a marked enhancement of system performance and reliability
• Responsible for creating, debugging, scheduling and monitoring jobs using Airflow for ETL batch processing to load into Snowflake for analytical processes
• Engineered automated systems for routine data quality checks, integrating proactive monitoring tools that pre-empted data quality issues before they affected downstream processes
• Created action filters, parameters, and calculated sets for preparing dashboards and worksheets using Power BI
• Environment: Microsoft Azure (Data Factory, Databricks, Data Share, Data Lake, CosmosDB, Azure DevOps), Hadoop, Hive, Pig, Spark, Kafka, MapReduce, Flume, Oozie, HDFS, HBase, NiFi, Python, Scala, PySpark, Pandas, NumPy, R, SQL, Snowflake, Apache Airflow, Jenkins, Docker, Kubernetes, Git, MongoDB, Cassandra, Power BI.
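Illustrative sketch for the broadcast vs. sort-merge join bullet above: a small PySpark example where the dataset paths and join key are hypothetical. Broadcasting the small dimension avoids shuffling the large fact table, while disabling the auto-broadcast threshold falls back to the default sort-merge strategy for large-large joins.

```python
# Minimal, illustrative sketch; paths and the join key (country_code) are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("join-strategies").getOrCreate()

orders = spark.read.parquet("/data/orders")        # large fact dataset
countries = spark.read.parquet("/data/countries")  # small dimension table

# Broadcast join: ship the small dimension to every executor to avoid a shuffle.
enriched = orders.join(F.broadcast(countries), on="country_code", how="left")
enriched.show(5)

# Sort-merge join: disable auto-broadcast so Spark shuffles and sorts both sides.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
large_join = orders.join(countries, on="country_code", how="left")
large_join.show(5)
```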
Data Engineer/Data Analyst, 02/2015 - 11/2019
Reverse Informatics, India
• As a Sr. Data Engineer, designed and deployed scalable, highly available, and fault-tolerant systems on Azure
• Involved in the complete SDLC of the big data project, including requirement analysis, design, coding, testing, and production
• Led the estimation, reviewed the estimates, identified complexities, and communicated them to all stakeholders
• Defined the business objectives comprehensively through discussions with business stakeholders, functional analysts and participating in requirement collection sessions
• Migrated the on-premises environment to the cloud using MS Azure
• Performed data Ingestion for the incoming web feeds into the Data lake store which includes both structured and unstructured data
• Designed the business requirement collection approach based on the project scope and SDLC (Agile) methodology
• Migrated data warehouses to Snowflake Data warehouse
• Installed and configured Hive, wrote Hive UDFs, and managed cluster coordination services through Zookeeper
• Installed and configured Hadoop Ecosystem components
• Defined virtual warehouse sizing in Snowflake for different types of workloads
• Extensively used Agile Method for daily scrum to discuss the project related information
• Worked on data ingestion from multiple sources into the Azure SQL data warehouse
• Transformed and loaded data into Azure SQL Database
• Wrote Spark applications for Data validation, cleansing, transformations and custom aggregations
• Developed HIVE scripts to transfer data from and to HDFS
• Implemented Hadoop based data warehouses, integrated Hadoop with Enterprise Data Warehouse systems
• Performed reverse engineering using Erwin to redefine entities, attributes, and relationships in the existing database
• Used MongoDB to store Big Data in JSON format, and used MongoDB aggregation for Match, Sort, and Group operations (see the illustrative sketch at the end of this role)
• Developed back-end web services using Python and designed the front end of the application using Python, CSS, AJAX, JSON, Drupal, and jQuery
• Developed and maintained data pipelines on the Azure analytics platform using Azure Databricks
• Created Airflow Scheduling scripts in Python
• Consumed web services using WSDL and SOAP and tested them using SoapUI
• Ingested data into HDFS using Sqoop and scheduled an incremental load to HDFS
• Worked with Hadoop infrastructure to store data in HDFS and used Hive SQL to migrate the underlying SQL codebase to Azure
• Exposed Java APIs for other applications to access data using REST API
• Implemented Web Services clients for APIs by using Spring Web Service Template class
• Created Data Pipeline to migrate data from Azure Blob Storage to Snowflake
• Worked on Snowflake modeling and highly proficient in data warehousing techniques for data cleansing, Slowly Changing Dimension phenomenon, surrogate key assignment and change data capture
• Maintained a NoSQL database to handle unstructured data: cleaned the data by removing invalid records, unifying the format, and rearranging the structure, then loaded it for subsequent steps
• Participated in NoSQL database maintenance alongside Azure SQL DB
• Worked with Kafka, building use cases relevant to our environment
• Identified data within different data stores, such as tables, files, folders, and documents to create a dataset in pipeline using Azure HDInsight
• Optimized and updated UML Models (Visio) and Relational Data Models for various applications
• Wrote Python scripts to parse XML documents and load the data in database
• Wrote DDL and DML statements for creating and altering tables and converting character values into numeric values
• Translated business concepts into XML vocabularies by designing XML Schemas with UML
• Worked on Data load using Azure Data factory using external table approach.
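Illustrative sketch for the MongoDB aggregation bullet above (Match, Sort, and Group map directly onto an aggregation pipeline): a minimal pymongo example. The connection string, database, collection, and field names are hypothetical placeholders.

```python
# Minimal, illustrative sketch; connection string, database, collection,
# and field names (analytics.events, status, user_id) are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
events = client["analytics"]["events"]

# $match filters documents, $group aggregates per user, $sort orders the output.
pipeline = [
    {"$match": {"status": "active"}},
    {"$group": {"_id": "$user_id", "event_count": {"$sum": 1}}},
    {"$sort": {"event_count": -1}},
]

for doc in events.aggregate(pipeline):
    print(doc["_id"], doc["event_count"])
```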
ETL Developer, 08/2013 - 01/2015
Reliance Industries, India
• Defined various Facts and Dimensions in the data mart, Aggregate and Summary facts
• Extensively used Informatica PowerCenter/PowerMart to design multiple mappings with embedded business logic
• Created Transformations like Aggregator, Expression, Filter, Router, Sequence Generator, Update Strategy, Joiner, Rank and Source Qualifier Transformations in the Informatica Designer
• Created complex mappings using Lookup, Sorter, Aggregator, and Router transformations to populate target tables in an efficient manner
• Prepared the ETL Mapping Spreadsheet and Fact Dimension Matrix simultaneously prior to performing the ETL process
• Created Mapplets and used them in different Mappings
• Performance tuning of the Informatica mappings using various components like Parameter files, Variables and Dynamic Cache
• Excellent experience with ETL tools such as Informatica and in implementing Slowly Changing Dimensions (SCD)
• Maintained Warehouse Metadata, Naming Standards and Warehouse Standards for future application development
• Extensively used SQL queries, indexes, and Oracle functions
• Environment: Oracle 10g, Informatica PowerCenter 9
CERTIFICATIONS
Databricks Certified Data Engineer Professional
SnowPro Advanced Data Engineer
AWS Certified Data Engineer
Certified Azure Data Engineer Associate