
Data Engineer Warehouse

Location:
Plano, TX
Posted:
February 13, 2025


Resume:

RISHITHA REDDY

*******************@*****.*** 469-***-****

PROFESSIONAL SUMMARY

Data Engineer with 9+ years of professional experience and expertise across the Big Data ecosystem: data acquisition, ingestion, modeling, storage, analysis, integration, and processing, as well as the Hadoop ecosystem, data warehousing, and cloud engineering.

• Working experience with the AWS cloud (EMR, EC2, S3, Lambda, Glue, Elasticsearch, Kinesis, DynamoDB, Redshift).

• Worked on ETL migration services by creating and deploying AWS Lambda functions to provide a serverless data pipeline whose output is registered in the Glue Data Catalog and queried from Athena.
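
For illustration, a minimal sketch of such a Lambda step, assuming hypothetical bucket, database, and table names: on an S3 PUT event it registers the new file's partition in the Glue Data Catalog so the data becomes queryable from Athena.

    import boto3

    glue = boto3.client("glue")

    def handler(event, context):
        """Triggered by an S3 PUT; registers the new partition in the Glue Data Catalog."""
        record = event["Records"][0]["s3"]
        bucket = record["bucket"]["name"]
        key = record["object"]["key"]          # assumes keys like raw/dt=2025-01-31/part-0.parquet
        dt = key.split("/")[1].split("=")[1]

        glue.batch_create_partition(
            DatabaseName="events_db",          # hypothetical Glue database
            TableName="raw_events",            # hypothetical Glue table
            PartitionInputList=[{
                "Values": [dt],
                "StorageDescriptor": {
                    "Location": f"s3://{bucket}/raw/dt={dt}/",
                    "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
                    "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
                    "SerdeInfo": {
                        "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
                    },
                },
            }],
        )
        # Athena queries against events_db.raw_events now see the new partition.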

• Experienced working with AWS utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs on AWS.

• Experienced in implementing ETL workflows, leveraging Snowflake’s Snowpipe, Streams, and Tasks to automate data ingestion, transformation, and real-time processing, improving data availability and reducing latency.
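
As a rough sketch of that pattern (warehouse, schema, and object names are hypothetical), a Streams-and-Tasks setup can be driven from Python with the Snowflake connector:

    import os
    import snowflake.connector

    conn = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        warehouse="ETL_WH",
        database="ANALYTICS",
        schema="RAW",
    )

    statements = [
        # Capture inserts on the landing table as they arrive (e.g. via Snowpipe).
        "CREATE OR REPLACE STREAM RAW.ORDERS_STREAM ON TABLE RAW.ORDERS",
        # A task that runs every 5 minutes and loads new rows into the curated table.
        """
        CREATE OR REPLACE TASK RAW.LOAD_ORDERS_CURATED
          WAREHOUSE = ETL_WH
          SCHEDULE = '5 MINUTE'
        WHEN SYSTEM$STREAM_HAS_DATA('RAW.ORDERS_STREAM')
        AS
          INSERT INTO CURATED.ORDERS
          SELECT order_id, customer_id, amount, loaded_at
          FROM RAW.ORDERS_STREAM
          WHERE METADATA$ACTION = 'INSERT'
        """,
        "ALTER TASK RAW.LOAD_ORDERS_CURATED RESUME",
    ]

    cur = conn.cursor()
    try:
        for stmt in statements:
            cur.execute(stmt)
    finally:
        cur.close()
        conn.close()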

• Strong AWS EC2 experience, providing complete compute, query-processing, and storage solutions for a wide range of applications.

• Strong experience integrating Snowflake with cloud platforms such as AWS and Azure.

• Performed analytics and cloud migration from on-premises systems to the AWS cloud using AWS EMR, S3, and DynamoDB.

• Working experience with the Azure cloud: Azure Portal, Azure Cosmos DB, Azure Synapse Analytics, Azure Data Lake Storage, Azure Data Factory, Azure Stream Analytics, Azure Databricks, Azure Log Analytics, and Azure Blob Storage.

• Working knowledge of Azure cloud components such as Storage Explorer, SQL DWH, Cosmos DB, HDInsight, Databricks, Data Factory, and Blob Storage.

• Extensive experience with micro-batching on Snowflake, ingesting millions of files as they arrive in the staging area; imported data into the Snowflake cloud data warehouse using Snowpipe.

• Working knowledge of Apache Spark Streaming and batch processing; created Spark jobs to transform and aggregate data.

• Deep knowledge of the Spark ecosystem and architecture, developing production-ready Spark applications using Spark Core, Spark Streaming, Spark SQL, DataFrames, Datasets, and Spark ML.

• Expertise in building PySpark and Scala applications for interactive analysis, batch processing, and stream processing.

• Expert in Spark architecture and components; efficient working with Spark Core, Spark SQL, and Spark Streaming.

• Imported enterprise data from various sources into HDFS using Sqoop, performed transformations using Hive and MapReduce, and loaded the results into HBase tables.

• Designed, implemented, and optimized data pipelines for large-scale AI/ML projects, ensuring high-quality, real-time data flow using tools like Apache Spark, Kafka, and Airflow.
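
A minimal Airflow DAG sketch of that kind of orchestration; the DAG id, script paths, and schedule are hypothetical:

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    default_args = {
        "owner": "data-eng",
        "retries": 2,
        "retry_delay": timedelta(minutes=5),
    }

    with DAG(
        dag_id="events_etl",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@hourly",
        catchup=False,
        default_args=default_args,
    ) as dag:
        # Drain the latest Kafka offsets into the raw zone.
        ingest = BashOperator(
            task_id="ingest_kafka_to_s3",
            bash_command="python /opt/jobs/kafka_to_s3.py --window 1h",
        )

        # Transform and aggregate the raw data with a Spark job.
        transform = BashOperator(
            task_id="spark_transform",
            bash_command="spark-submit /opt/jobs/transform_events.py --date {{ ds }}",
        )

        ingest >> transform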

• Built scalable data pipelines, designed data models, and implemented data transformations using DBT.

• Proficient in SQL across several dialects (MySQL, PostgreSQL, Redshift, SQL Server, and Oracle).

• Involved in converting Hive/SQL queries into Spark transformations using Spark SQL and Scala

• Proficient in writing Spark scripts in Python, Scala, and SQL for development and analysis.

• Performed ETL operations on terabytes of data using Python, Spark SQL, S3, and Redshift to extract consumer insights.

• Solid knowledge of dimensional data modeling with star and snowflake schemas for fact and dimension tables using Analysis Services.


• Extensive expertise with SQL and NoSQL databases, data modeling, and data pipelines, as well as creation and automation of ETL pipelines using SQL and Python.

• Skilled in Sqoop with extensive experience ingesting data from RDBMS including Oracle, MS SQL Server, Teradata, PostgreSQL, and MySQL.

• Developed transformations and aggregations over different file formats such as XML, JSON, CSV, Avro, Parquet, and ORC using Python (PySpark), Java, Scala, and Spark SQL.

• Experienced in designing interactive dashboards, reports, performing ad-hoc analysis, and visualizations using Tableau, Power BI, Arcadia, and Matplotlib

• Used Agile (Scrum) and Waterfall methodologies; involved in all aspects of the SDLC.


TECHNICAL SKILLS

Big Data: HDFS, YARN, MapReduce, Pig, Hive, HBase, Cassandra, Sqoop, Apache Spark, Scala, Kafka, Elasticsearch, Kibana

Hadoop Platforms: Apache Hadoop 2.x/1.x, Cloudera CDP, Hortonworks HDP, Amazon EMR (EMR, EC2, EBS, RDS, S3, Glue, Elasticsearch, Lambda, Kinesis, SQS, DynamoDB, Redshift, ECS), Azure HDInsight (Databricks, Data Lake, Blob Storage, Data Factory, SQL DB, SQL DWH, Cosmos DB, Azure DevOps, Active Directory)

Programming Languages: Python, C, Java, J2EE, SQL, Pig Latin, HiveQL, Scala, Unix Shell Scripting, R

Cloud Environments: Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP), OpenStack

Reporting/ETL Tools: Informatica, Talend, SSIS, SSRS, SSAS, ER Studio, Tableau, Power BI, Arcadia, DataStage

Databases: MySQL, Teradata, Oracle, MS SQL Server, PostgreSQL, DB2, Cassandra, MongoDB, Redis

Development Tools: Jupyter Notebook, Eclipse, NetBeans, IntelliJ, Hue, Microsoft Office Suite (Word, Excel, PowerPoint, Access)

Others: Machine Learning, NLP, StreamSets, Spring Boot, Terraform, Docker, Kubernetes, Jenkins, Chef, Ansible, Splunk, Jira

Version Control: Git, SVN, Bitbucket

Methodologies: Agile/Scrum, Waterfall

PROFESSIONAL EXPERIENCE

RNDC Dallas, Texas

Big Data Engineer

Apr 2023 - Present

Key Responsibilities:

• Built a streamlined ETL process, ingesting transactional and event data from a web application into AWS S3 using Python and Kafka for a daily user base of 4,500, resulting in annual cost savings of over 24% for the client.
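
A simplified sketch of that ingestion step, assuming the kafka-python client and hypothetical topic and bucket names; events are batched and flushed to S3 as newline-delimited JSON:

    import json
    import time

    import boto3
    from kafka import KafkaConsumer

    s3 = boto3.client("s3")
    consumer = KafkaConsumer(
        "web-events",                               # hypothetical topic
        bootstrap_servers=["broker-1:9092"],
        group_id="s3-sink",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        auto_offset_reset="earliest",
    )

    batch, last_flush = [], time.time()
    for message in consumer:
        batch.append(message.value)
        # Flush every 500 events or every 60 seconds, whichever comes first.
        if len(batch) >= 500 or time.time() - last_flush > 60:
            body = "\n".join(json.dumps(e) for e in batch)
            key = f"raw/events/{int(time.time())}.json"
            s3.put_object(Bucket="company-data-lake", Key=key, Body=body.encode("utf-8"))
            batch, last_flush = [], time.time()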

• Developed and optimized ETL pipelines using Snowflake and Apache Spark, improving data processing efficiency by 30%

• Developed PySpark and Pandas scripts to perform comprehensive data validation for SQL Server databases, ensuring data accuracy and consistency and reducing the workload on the client's Quality Engineering (QE) resources by 35%.
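
A small Pandas-based sketch of that style of validation, assuming a pyodbc connection and hypothetical staging and target tables:

    import os

    import pandas as pd
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=sqlserver-host;"
        "DATABASE=sales;UID=etl_user;PWD=" + os.environ["MSSQL_PASSWORD"]
    )

    source = pd.read_sql("SELECT * FROM dbo.orders_staging", conn)
    target = pd.read_sql("SELECT * FROM dbo.orders", conn)

    checks = {
        "row_count_match": len(source) == len(target),
        "no_null_keys": target["order_id"].notna().all(),
        "no_duplicate_keys": not target["order_id"].duplicated().any(),
        "amount_totals_match": abs(source["amount"].sum() - target["amount"].sum()) < 0.01,
    }

    failed = [name for name, ok in checks.items() if not ok]
    if failed:
        raise ValueError(f"Data validation failed: {failed}")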

• Leveraged AWS Glue and Lambda functions to ensure efficient and automated data extraction, transformation, and loading (ETL) processes.

• Collaborated with data scientists to build ETL processes that prepared large datasets for machine learning models.

• Integrated Terraform scripts into a CI/CD pipeline (using Jenkins) to enable automated infrastructure provisioning and updates with version control, reducing deployment time by 30%.

• Analyzed data quality reports by reconciling data using Spark and Scala, reducing data quality issues by 35%.

• Migrated legacy ETL pipelines to a modern ELT framework, integrating DBT with Snowflake.

• Automated data pipeline deployment using Cloud Composer (Airflow) for workflow orchestration, improving pipeline reliability and efficiency.

• Designed and developed ETL workflows using Snowflake's Snowpipe and Streams, automating real-time data ingestion from external sources.

• Built scalable and efficient ETL pipelines for data ingestion, transformation, and loading into BigQuery and Cloud SQL from multiple sources, including Google Analytics and Cloud Pub/Sub.

• Monitored SQL scripts and modified them for improved performance using PySpark SQL.

• Analyzed large amounts of data sets using AWS Athena to determine the optimal way to aggregate & report on it.
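
For example, an aggregation can be submitted to Athena and polled to completion with boto3; the database, query, and result location below are hypothetical:

    import time

    import boto3

    athena = boto3.client("athena")

    qid = athena.start_query_execution(
        QueryString="""
            SELECT event_type, count(*) AS events
            FROM events_db.raw_events
            WHERE dt = '2024-01-01'
            GROUP BY event_type
            ORDER BY events DESC
        """,
        QueryExecutionContext={"Database": "events_db"},
        ResultConfiguration={"OutputLocation": "s3://company-data-lake/athena-results/"},
    )["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    if state == "SUCCEEDED":
        rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]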

• Designed and implemented modular data transformation pipelines using DBT, reducing processing time by 28%.

• Implemented advanced Spark procedures like text analytics & processing using the in-memory computing capabilities.

• Created dashboards and reporting deliverables in Tableau, utilizing advanced features, capabilities, and designs.

Environment: Spark, Spark Streaming, Spark SQL, AWS EMR, AWS Athena, AWS S3, AWS Glue, AWS Lambda, Confluent Control Center (C3), Python, PySpark, SQL Server Management Studio, Confluent Kafka library, Git, Tableau, Agile methodologies.

Apple Dallas, Texas

Big Data Engineer

Apr 2022 - Apr 2023

Key Responsibilities:

• Worked in the migration of data from an on-premises Cloudera cluster to AWS EC2 instances running on an EMR cluster.

• Designed an ETL pipeline to extract logs, store them in an AWS S3 Data Lake, and analyze them further using PySpark.

• Developed scripts and automated the workflow as a product using Apache Airflow and shell scripting to ensure daily execution in production.

• Participated in user meetings, gathered business requirements & specifications for the Data-warehouse design.

• Implemented AWS EMR to transform and transfer large volumes of data between AWS data stores and databases such as Amazon S3 and DynamoDB.
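
A hedged sketch of submitting such a transfer as a Spark step on a running EMR cluster via boto3; the cluster id, script path, and table name are placeholders:

    import boto3

    emr = boto3.client("emr")

    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXX",                 # existing EMR cluster id (placeholder)
        Steps=[{
            "Name": "s3-to-dynamodb",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "s3://company-data-lake/jobs/s3_to_dynamodb.py",
                    "--source", "s3://company-data-lake/curated/orders/",
                    "--table", "orders",
                ],
            },
        }],
    )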

• Designed and managed data warehouse architecture using BigQuery, ensuring seamless data modeling, transformation, and reporting for business analytics.

• Integrated Cloud Dataproc clusters with Apache Spark for distributed data processing, reducing processing time by 40%

• Automated ETL workflows for data extraction, transformation, and loading using Python, SQL, and Apache Beam to support AI/ML model development.

• Created on-demand tables on S3 files using Lambda functions and AWS Glue with Python and PySpark.

• Worked with Teradata to import data from multiple sources and process to downstream applications.

• Constructed an orchestration tool using Python and Docker so developers can deploy products they build themselves as DAGs using AWS Batch and Step Functions on the AWS cloud stack.

• Implemented data lakes and warehouses to streamline data storage, retrieval, and analysis for AI/ML workflows.

• Deployed Python code on Lambda and executed using AWS S3, Lambda, RDS, IAM, and Secrets Manager.

• Designed and implemented modular data transformation pipelines using DBT, reducing processing time by 30%.

• Explored Spark to improve performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and pair RDDs.

• Responsible for Data Collection & Management for the AI/ML Engineering Operations Group.

• Created an AWS RDS (Relational Database Service) instance to serve as the Hive metastore and consolidated EMR cluster metadata into a single RDS database, preventing metadata loss even when the EMR cluster is terminated.

• Collaborated with business analysts, data engineers, and product teams to translate business requirements into actionable data visualizations, reducing manual reporting effort by 40%

• Expertly executed Extraction, Transformation, and Loading (ETL) operations utilizing Oozie workflows or Airflow to carry out numerous Python (PySpark), Hive, Shell, and SSH actions based on business need.

• Involved in performance tuning of Spark applications, setting the right batch interval and tuning memory.

• Implemented Spark scripts using SparkSession, Python, and Spark SQL to access Hive tables in Spark for faster data processing.

• Developed Python UDFs for transformation and Spark Streaming, and validated the data against the Hive data warehouse using HiveQL.
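
A minimal PySpark sketch in the spirit of that bullet, with hypothetical table and column names: a Python UDF applied to a Hive table, with the cleaned result written back for validation in HiveQL:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf-example").enableHiveSupport().getOrCreate()

    @udf(returnType=StringType())
    def normalize_phone(raw):
        """Keep digits only, e.g. '(469) 555-0100' -> '4695550100'."""
        return "".join(ch for ch in raw if ch.isdigit()) if raw else None

    customers = spark.table("warehouse.customers")          # hypothetical Hive table
    cleaned = customers.withColumn("phone_clean", normalize_phone("phone"))
    cleaned.write.mode("overwrite").saveAsTable("warehouse.customers_clean")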

• Worked in container-based technologies like Docker and Kubernetes.

• Experience with analytical reporting and facilitating data for Tableau dashboards.

• Worked in an Agile methodology and used JIRA to maintain project stories.

Environments: AWS, EC2, S3, Lambda, Glue, Spark, Scala, RDS, DynamoDB, Redshift, ECS, Python, Java, SQL, Sqoop, Kafka, Airflow, Oozie, HBase, Teradata, Cassandra, MLlib, Tableau, HiveQL, Git.

Transunion Dallas, Texas

Big Data Engineer

Jan 2021 - Apr 2022

Key Responsibilities:

• Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.

• Developed and implemented ETL pipelines on S3 parquet files in a data lake using AWS Glue.
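
A skeleton of such a Glue job over parquet files in the data lake, following the standard awsglue boilerplate; the database, table, and output path are hypothetical:

    import sys

    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read the catalogued parquet table, drop rows missing the key, write back curated parquet.
    raw = glue_context.create_dynamic_frame.from_catalog(database="lake_db", table_name="raw_orders")
    curated_df = raw.toDF().dropna(subset=["order_id"])
    curated = DynamicFrame.fromDF(curated_df, glue_context, "curated")

    glue_context.write_dynamic_frame.from_options(
        frame=curated,
        connection_type="s3",
        connection_options={"path": "s3://company-data-lake/curated/orders/"},
        format="parquet",
    )
    job.commit()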

• Loaded data into S3 buckets using AWS Glue and PySpark; involved in filtering data stored in S3 buckets using Elasticsearch and loading it into Hive external tables.

• Involved in building a data pipeline and performing analytics using the AWS stack (EMR, EC2, S3, RDS, Lambda, Glue, SQS, Redshift).

• Used AWS services like EC2 and S3 for dataset processing and storage; experienced in maintaining the Hadoop cluster on AWS EMR.

• Migrated on-premises data systems to Snowflake, reducing infrastructure costs by 40% and improving query performance by 50%.

• Used AWS Data Pipeline to schedule an Amazon EMR cluster to clean and process web server logs stored in an Amazon S3 bucket.

• Created a data pipeline utilizing processor groups and numerous processors in Apache NiFi for flat-file and RDBMS sources as part of a proof of concept (POC) on Amazon EC2.

• Collected data using Spark Streaming from an AWS S3 bucket in near real time, performed the necessary transformations and aggregations to build the common learner data model, and persisted the data in HDFS.


• Developed custom ETL solutions, batch processing, and a real-time data ingestion pipeline to transport data into and out of Hadoop using PySpark and shell scripting.

• Used AWS EC2 to select and generate data into CSV files, which were then stored in AWS S3 and subsequently structured and loaded into AWS Redshift.

• Performed ETL testing, including extracting data from databases with the necessary queries, transforming it, and loading it into the data warehouse servers.

• Developed Spark Scala notebook to perform data cleaning and transformation on various tables.

• Performed ETL operations using Python, SparkSQL, S3 and Redshift on terabytes of data to obtain customer insights.

• Worked on creating Hive tables and writing Hive queries for data analysis to meet business requirements; experienced in Sqoop to import and export data from Oracle and MySQL.

• Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.


• Experience in developing real-time processing and core tasks with Spark Streaming and Kafka as a data pipeline system, as well as a good understanding of Cassandra architecture, replication strategy, gossip, snitches, and other features.
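
A simplified Structured Streaming sketch of that kind of pipeline, reading JSON events from Kafka and appending to a parquet sink (a Cassandra sink via the spark-cassandra-connector would follow the same pattern); broker, topic, schema, and paths are hypothetical:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import DoubleType, StringType, StructField, StructType

    spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

    schema = StructType([
        StructField("event_id", StringType()),
        StructField("user_id", StringType()),
        StructField("amount", DoubleType()),
    ])

    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker-1:9092")
        .option("subscribe", "transactions")
        .load()
        .select(from_json(col("value").cast("string"), schema).alias("e"))
        .select("e.*")
    )

    query = (
        events.writeStream.format("parquet")
        .option("path", "hdfs:///data/transactions/")
        .option("checkpointLocation", "hdfs:///checkpoints/transactions/")
        .outputMode("append")
        .start()
    )
    query.awaitTermination()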

• Optimized ETL processes to reduce data processing time by 40% using Snowflake's task scheduling and stored procedures.

• Extensive experience with Hadoop Big Data Integration and ETL for data extraction, loading, and transformation for ERP data

• Worked on Spark SQL: created DataFrames by loading data from Hive tables, prepared and stored data in AWS S3, and interacted with the SQL interface using the command line.

• Worked on fetching live data from Oracle database using Spark Streaming and Kafka using the feed from API Gateway REST service.

• Utilized Spark's in-memory capabilities to handle large datasets on S3; loaded data into S3 buckets, then filtered it and loaded it into Hive external tables.

• Strong Hands-on experience in creating and modifying SQL stored procedures, functions, views, indexes, and triggers.

• As part of the data project, wrote many SQL scripts to resolve data inconsistencies and worked on migrating historical data from Teradata SQL to Snowflake.

• Experienced working with CI/CD pipelines using Jenkins and Airflow for containers built with Docker and orchestrated with Kubernetes.

• Worked on projects to ingest data files in JSON formats and store data on GCP Storage buckets.

• Used Git for version control and Jira for project management, tracking issues and bugs.

• Experience in using various Python libraries such as Pandas, SciPy, TensorFlow, Keras, Scikit-learn.

• Worked in both Waterfall and Agile methodologies in a fast-paced environment.

Environments: Hadoop 2.x, Hive v2.3.1, Spark v2.1.3, AWS, EC2, S3, Lambda, Glue, Elasticsearch, RDS, DynamoDB, Redshift, ECS, Python, Java, Scala, SQL, Sqoop v1.4.6, Kafka, Airflow v1.9.0, Oozie, HBase, Teradata, Cassandra, MLlib, Tableau, Git.

Fiserv Dallas, Texas

Data Engineer

Oct 2019 - Dec 2020

Key Responsibilities:

• Created ADF Pipelines to load data from on-premises storage and databases into AZURE cloud storage and databases.

• Oversaw resources and scheduling across the cluster using Azure Kubernetes Service (AKS).

• Worked on Azure Data Factory to integrate data from both on-premises (MySQL, Cassandra) and cloud (Blob Storage, Azure SQL DB) sources and applied transformations to load it into Azure Synapse.

• Used Azure Kubernetes Service to create, configure, and manage a cluster of virtual machines.

• Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend, and pair RDDs.

• Worked with Hortonworks distribution and installed, configured, and maintained a Hadoop cluster based on the business requirements.

• Proficient in composing Python scripts to construct ETL pipelines and Directed Acyclic Graph (DAG) workflows utilizing Airflow and Apache NiFi.

• Worked on Kafka to process ingested data in real time from flat files and APIs.

• Designed custom-built input adapters using Spark, Hive, and Sqoop to ingest and analyze data (Snowflake, MS SQL, MongoDB) into HDFS.

• Loaded data from web servers and Teradata using Sqoop, Flume, and the Spark Streaming API.

• Generated report on predictive analytics using Python and Tableau including visualizing model performance and prediction results

• Supported in building of data pipelines to extract, transform, load, and integrate data from various sources.

• Worked on improving the data ingestion system by building more robust and secure data pipelines.

• Experience tuning Spark applications (batch interval time, level of parallelism, memory) to improve processing time and efficiency.

• Executed UNIX scripts to define the use-case workflow, process the data files, and automate the work.

• Used the Pandas API to structure the data in time series and tabular formats for easy timestamp-based data manipulation and retrieval.
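
A small illustration of that timestamp-indexed handling, with hypothetical file and column names:

    import pandas as pd

    df = pd.read_csv("events.csv", parse_dates=["event_time"])
    ts = df.set_index("event_time").sort_index()

    # Hourly aggregates and easy time-range slicing once the index is a DatetimeIndex.
    hourly = ts["amount"].resample("1H").sum()
    january = ts.loc["2018-01"]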

• Used Python for XML and HTML processing, data exchange, and business logic implementation.

• Used Scala for its strong concurrency support and its key role in parallelizing processing of large data sets; developed MapReduce jobs in Scala, compiling program code into JVM bytecode for data processing.

• Planned and deployed data pipelines using Data Lake, Databricks, and Apache Airflow.

• Used Tableau as a front-end BI tool and MS SQL Server as a back-end database to design and develop dashboards, workbooks, and complex aggregate calculations.

• Developed spark-based ingestion framework for ingesting data into HDFS, creating tables in Hive and executing complex computations and parallel data processing.

• Proposed an automated framework using shell scripts to drive the Sqoop jobs.

• Worked with NoSQL databases like HBase in creating HBase tables to load large sets of semi-structured data coming from various sources.

• Enhanced scripts of existing Python modules. Worked on writing APIs to load the processed data to HBase tables.

• Participated in daily scrum and some other design-related meetings while working in an Agile development environment.

Environments: Hortonworks 2.0, Hadoop 2.x (HDFS, MapReduce, YARN), Hive v1.0.0, HBase, HDInsight, Databricks (ADBX), Data Lake (ADLS), MySQL, Snowflake, MongoDB, Power BI, Azure AD, Git, Blob Storage, Data Factory, Data Storage Explorer, Scala, Spark v2.0.2, Airflow, Sqoop v1.4.4, Kafka v0.8.1, Python, SQL, Java, Teradata, Oracle, Tableau v9.x, SVN, Jira.

Fullerton India Mumbai, India

Big Data Developer

Aug 2018 - June 2019

Key Responsibilities:

• Used Hadoop/Big Data concepts, loaded and transformed large sets of structured, semi - structured and unstructured data.

• Worked on developing Pig scripts for change data capture and delta record processing between newly arrived data and existing data in HDFS.

• Developed Pig Scripts to generate Map Reduce jobs and performed ETL procedures on the data in HDFS.

• Developed Spark applications using Pyspark and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats for analyzing & transforming the data to uncover insights into the customer usage patterns.
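
A condensed sketch of that multi-format extraction and aggregation pattern, with hypothetical paths and columns:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("usage-analysis").getOrCreate()

    json_events = spark.read.json("hdfs:///raw/events/*.json")
    csv_users = spark.read.option("header", True).csv("hdfs:///raw/users/*.csv")

    usage = (
        json_events.withColumn("day", F.to_date("event_time"))
        .join(csv_users, "user_id")
        .groupBy("plan", "day")
        .agg(F.count("*").alias("events"), F.countDistinct("user_id").alias("active_users"))
    )
    usage.write.mode("overwrite").parquet("hdfs:///curated/daily_usage/")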

• Developed SQL queries to search the repository DB for deviations from the company's ETL standards for items such as sources, targets, transformations, log files, mappings, sessions, and workflows produced by users.

• Optimized Hive queries to extract the customer information from HDFS.

• Worked on a Scala log producer that monitors application logs, transforms incremental logs, and transmits them to a Kafka-based log collecting platform.

• Imported data into Spark RDD from sources such as HDFS/HBase.

• Developed a data pipeline using Flume, Sqoop, and Pig to extract data from web logs and store it in HDFS.

• Responsible for performing SQL development and SSIS/SSAS/SSRS development.

• Hands-on experience developing SQL scripts for automation purposes.

• Used Talend for Big data Integration using Spark and Hadoop.

• Performed reading, processing, and generating the Spark-SQL tables using the Scala API.

• Used advanced approaches such as bucketing, partitioning, and optimizing self-joins on structured data in Hive to make queries run faster.

• Used Spark SQL through the Scala API to read, analyze, and generate tables from Parquet data.

• Exposure to Spark architecture and how RDDs work, processing data from local files, HDFS, and RDBMS sources by constructing RDDs and optimizing them for performance.

• Used Apache Flume to collect and aggregate huge volumes of log data, then staging the data in HDFS for later analysis.

• Working on BI reporting with OLAP for Big Data

• Responsible for the Extraction, Transformation and Loading of data from Multiple Sources to Data Warehouse using SSIS.


• Worked in an Agile methodology and used JIRA to maintain project stories.

Environments: Python, PySpark, Spark, Spark MLlib, Spark SQL, Power BI, YARN, Hive, Pig, Scala, NiFi, Hadoop, NoSQL, Sqoop, SSIS, MySQL.

Dunzo Bangalore, India

Data Analyst

May 2017 - July 2018

Key Responsibilities:

• Developed custom reporting dashboards, reports, and scorecards, mainly focusing on Tableau and its components, from multiple sources including Oracle databases, SQL Server, Teradata, MS Access, DB2, Excel, flat files, and CSV.

• Performed the process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.

• Performed data collection, cleaning, wrangling, and analysis on the data sets in Python.

• Performed statistical data analysis and data visualization using Python and R.

• Implemented Power BI as a front-end BI tool and MS SQL Server as a back-end database to design and develop dashboards, workbooks, and complicated aggregate computations.

• Used Python (NumPy, SciPy, pandas, scikit-learn, seaborn) to develop variety of models and algorithms for analytic purposes.
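
A brief example of that kind of modeling workflow, using a hypothetical churn dataset and column names:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("customers.csv")                      # hypothetical dataset
    X = df[["tenure_months", "monthly_spend", "support_tickets"]]
    y = df["churned"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))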

• Responsible for building data analysis infrastructure to collect, analyze, and visualize data.

• Developed advanced SQL queries for data analysis using multi-table joins, group functions, subqueries, and set operations, as well as stored procedures and user-defined functions (UDFs).

• Performed data quality analytics, communicated and executed data validation, analyzed data quality results, measured and audited large volumes of data for quality issues and improvements, and performed root-cause analysis of data anomalies.

• Worked with the team to analyze and evaluate the data.

• Optimized data collection procedures and generated reports on a weekly, monthly, and quarterly basis.

• Developed a variety of models and methods for analytic purposes using Python and R libraries such as NumPy, Pandas, Matplotlib, SciPy, TensorFlow, Keras, scikit-learn, OpenCV, and ggplot2.

• Collected, analyzed, and extracted data from a variety of sources to create reports, dashboards, and analytical solutions, and assisted with debugging Tableau dashboards.

• Worked in an Agile methodology and used JIRA to maintain project stories.

Environments: Python, GitLab, Tableau, Snowflake SQL, Oracle, SQL Server, Power BI, Agile methodology.

Amigos Solutions Hyderabad, India

Internship

Dec 2016 - May 2017

Key Responsibilities:

• Worked in the analysis, design, and development phases of the Software Development Lifecycle (SDLC).

• Created SQL queries, sequences, and views for the backend Oracle database.

• Used Log4j for logging and notification-tracing mechanisms and Splunk for analyzing application performance.

• Developed a web application using Java EE and an Oracle database, with a microservices architecture and REST services tested via Postman.

• Extensive database and query development, including stored procedures, triggers, data quality checks, views, query optimization, DDL, DML, and CTEs.

• Created application using Eclipse IDE.

• Installed WebLogic Server for handling HTTP requests/responses.

• Used SOAP web services to transmit large chunks of XML data over HTTP.

• Followed an Agile methodology, working directly with clients on features, applying the best solutions, and customizing the application to the client's needs.

Environments: Java/J2EE, Spring, Oracle, Linux, JDBC, Maven, Git, Jira, HTML, CSS, Angular, Splunk, Log4j, Servlets, Struts, JSP, WebLogic/SQL, Agile methodology.

EDUCATION

• Master's in Computer Science

• Bachelor of Technology in Computer Science & Engineering


