
Data Engineer

Name: Navya Sri Ankireddy

Contact: +1-779-***-****

Email: ******************@*****.***

PROFESSIONAL SUMMARY:

·Over 4 years of experience in data engineering, data analysis, and data mining, working with large structured and unstructured datasets, data acquisition, data validation, predictive modeling, statistical modeling, data modeling, and data visualization.

·Proficient in statistical programming languages like Python, R, and SAS, along with Big Data technologies such as Hadoop, Hive, and Pig.

·High exposure to Big Data technologies and the Hadoop ecosystem, with an in-depth understanding of MapReduce and Hadoop infrastructure.

·Expertise in writing end-to-end data processing jobs using MapReduce, Spark, and Hive for data analysis (a brief PySpark sketch follows this summary).

·Experience with the Apache Spark ecosystem, including Spark Core, Spark SQL, DataFrames, and RDDs, with working knowledge of Spark MLlib.

·Extensive experience in developing Spark Streaming jobs using RDDs (Resilient Distributed Datasets) with Scala, PySpark, and Spark-Shell.

·Hands-on experience with ETL processes, including Informatica, Azure Data Factory, and data pipeline development.

·Strong expertise in SQL and NoSQL databases such as Snowflake, HBase, Cassandra, and MongoDB.

·Proficient in writing complex SQL queries, stored procedures, and performance tuning for relational databases.

·Experienced in data manipulation using Python for extraction and loading, leveraging libraries like NumPy, SciPy, and Pandas for data analysis and numerical computations.

·Experience in using Pig scripts for transformations, event joins, filters, and pre-aggregations before storing data into HDFS.

·Strong knowledge of Hive analytical functions and experience in extending Hive functionality by writing custom UDFs.

·Expertise in writing MapReduce jobs in Python for processing large structured, semi-structured, and unstructured datasets stored in HDFS.

·Hands-on experience with Amazon Web Services (AWS), including EC2, Elastic MapReduce (EMR), and Redshift for data processing.

·Proficient in setting up workflows using Apache Airflow and Oozie for managing and scheduling Hadoop jobs.

·Experience in developing interactive dashboards and reports using Power BI for data visualization and business insights.

·Strong understanding of data modeling concepts, including dimensional modeling (Star and Snowflake schemas) and fact and dimension tables.

·Experienced in working within SDLC, Agile, and Scrum methodologies.

·Strong analytical, communication, problem-solving, and presentation skills, with the ability to work independently and collaboratively while following best practices.
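
To make the Spark experience above concrete, the following is a minimal PySpark sketch of an end-to-end batch job of the kind referenced in this summary; the paths, schema, and column names are illustrative assumptions rather than details from any specific project.

```python
# Minimal PySpark sketch: read raw CSV, apply basic cleaning and pre-aggregation,
# and write Parquet. File paths and column names are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily_sales_rollup").getOrCreate()

raw = (spark.read
       .option("header", True)
       .option("inferSchema", True)
       .csv("hdfs:///data/raw/sales/*.csv"))

# Basic validation/cleaning: drop rows missing keys, normalize the date column.
clean = (raw.dropna(subset=["order_id", "order_date"])
            .withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd")))

# Pre-aggregation before loading into the curated/warehouse layer.
daily = (clean.groupBy("order_date", "region")
              .agg(F.sum("amount").alias("total_amount"),
                   F.countDistinct("order_id").alias("order_count")))

daily.write.mode("overwrite").partitionBy("order_date").parquet(
    "hdfs:///data/curated/daily_sales")
```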

TECHNICAL SKILLS:

Databases

AWS RDS, Teradata, Oracle, MySQL, Microsoft SQL Server, PostgreSQL.

NoSQL Databases

MongoDB, Hadoop HBase, and Apache Cassandra.

Programming Languages

Python, SQL, Scala, Pig, C, C++, MATLAB, Java, JavaScript.

Cloud Technologies

AWS (EC2, S3), Azure, GCP, Docker.

Querying Languages

SQL, NoSQL, PostgreSQL, MySQL, Microsoft SQL Server.

Deployment Tools

Anaconda Enterprise v5, RStudio, Machine Learning Studio, AWS Lambda, Informatica (ETL).

Scalable Data Tools

Hadoop, Hive, Apache Spark, Pig, MapReduce, Sqoop.

Operating Systems

Red Hat Linux, Unix, Ubuntu, Windows, MacOS.

Reporting & Visualization

Power BI, Looker, Matplotlib, Seaborn, Bokeh, ggplot, Plotly.

Education:

NORTHERN ILLINOIS UNIVERSITY DeKalb, IL

Master of Science in Computer Science. Aug 2022 - May 2024

KL UNIVERSITY Vijayawada, India

Bachelor of Technology in Electronics and Communications Engineering. Jun 2017 - May 2021

Certifications:

GOOGLE PROFESSIONAL MACHINE LEARNING ENGINEER

https://www.credly.com/earner/earned/badge/d6a7b330-53ad-4675-95db-cb4a5701e2b9

Professional Experience:

One Community Global INC San Gabriel, CA

Role: Data Engineer Sep 2024 - Dec 2024

Responsibilities:

·Gathered, analyzed, and translated business requirements into technical requirements, and communicated with other departments to collect client business requirements and access available data.

·Acquired, cleaned, and structured data from multiple sources and maintained databases/data systems; identified, analyzed, and interpreted trends and patterns in complex data sets.

·Developed, prototyped, and tested predictive algorithms; filtered and cleaned data and reviewed reports and performance indicators.

·Developed and implemented data collection systems and other strategies that optimized statistical efficiency and data quality.

·Created and statistically analyzed large data sets of internal and external data.

·Worked closely with the marketing team to deliver actionable insights from large volumes of data coming from different marketing campaigns and customer interaction metrics such as web portal usage, email campaign responses, public site interaction, and other customer-specific parameters.

·Performed incremental and full loads to transfer data from OLTP systems to a snowflake-schema data warehouse using different data flow and control flow tasks, and maintained existing jobs.

·Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, and to write data back.

·Designed and implemented secure data pipelines into a Snowflake data warehouse from on-premises and cloud data sources.

·Used Kafka as a message broker to collect large volumes of data and to analyze the collected data in the distributed system.

·Designed an ETL process using Talend to load data from source systems into Snowflake through data transformations.

·Knowledgeable in partitioning Kafka messages and setting up replication factors in a Kafka cluster.

·Developed Snowpipe pipes for continuous ingestion of data from AWS S3 using event notifications.

·Designed and developed end-to-end ETL processes from various source systems to the staging area, and from staging to data marts, including data loads.

·Loaded data into Snowflake tables from internal stages using SnowSQL (a comparable load using the Python connector is sketched after this list).

·Prepared the data warehouse in Snowflake using Star/Snowflake schema concepts and SnowSQL.

·Responsible for data cleaning, feature scaling, and feature engineering using NumPy and Pandas in Python.

·Conducted exploratory data analysis using Matplotlib and Seaborn in Python to identify underlying patterns and correlations between features.

·Experimented with multiple classification algorithms, such as Random Forest and Gradient Boosting, using scikit-learn, and evaluated performance on customer discount optimization across millions of customers.

·Built models using Python and PySpark to predict the probability of attendance for various campaigns and events.

·Performed data visualization, designed dashboards with Tableau, and generated complex reports, including charts, summaries, and graphs, to communicate findings to the team and stakeholders.
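
As a rough illustration of the Snowflake loading described above, the following Python sketch runs a COPY INTO from an external S3 stage using the Snowflake Python connector (rather than the SnowSQL CLI); the account, credentials, stage, and table names are hypothetical placeholders, not actual project values.

```python
# Rough sketch: load staged S3 files into a Snowflake table via the Python connector.
# Connection parameters, stage, and table names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",        # placeholder account identifier
    user="etl_user",             # placeholder user
    password="********",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="STAGING",
)

try:
    cur = conn.cursor()
    # COPY from an external stage pointing at the S3 bucket into the target table.
    cur.execute("""
        COPY INTO staging.campaign_events
        FROM @s3_campaign_stage/events/
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
        ON_ERROR = 'CONTINUE'
    """)
finally:
    conn.close()
```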

Environment: Snowflake, SnowSQL, AWS S3, Hadoop, Hive, HBase, Spark, R/RStudio, Python (Pandas, NumPy, Scikit-Learn, SciPy, Seaborn, Matplotlib), Informatica, SQL, PowerShell, Machine Learning, Kafka.

COGNIZANT TECHNOLOGY SOLUTIONS Chennai, India

Role: Data Engineer Feb 2021 - Jun 2022

Client: Zoetis

Responsibilities:

·Responsible for the execution of big data analytics, predictive analytics and machine learning initiatives.

·Implemented a proof of concept deploying the product in an AWS S3 bucket and Snowflake.

·Utilized AWS services with a focus on big data architecture/analytics, enterprise data warehousing, and business intelligence solutions to ensure optimal architecture, scalability, flexibility, availability, and performance, and to provide meaningful and valuable information for better decision-making.

·Developed Scala scripts and UDFs using both DataFrames/Spark SQL and RDDs in Spark for data aggregation, querying, and writing back into the S3 bucket.

·Experience in data cleansing and data mining.

·Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics); ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed it in Azure Databricks.

·Wrote, compiled, and executed programs as necessary using Apache Spark in Scala to perform ETL jobs with ingested data.

·Used Spark Streaming to divide streaming data into batches as an input to Spark engine for batch processing.

·Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation; used the Spark engine and Spark SQL for data analysis and provided the results to data scientists for further analysis.

·Automated resulting scripts and workflow using Apache Airflow and shell scripting to ensure daily execution in production.

·Created Python scripts to read CSV, JSON, and Parquet files from S3 buckets and load the data into AWS S3, DynamoDB, and Snowflake.

·Implemented AWS Lambda functions to run scripts in response to events in an Amazon DynamoDB table or S3 bucket, or to HTTP requests via Amazon API Gateway.

·Migrated data from AWS S3 buckets to Snowflake by writing custom read/write Snowflake utility functions in Scala.

·Worked on Snowflake schemas and data warehousing, and processed batch and streaming data load pipelines using Snowpipe and Matillion from the Confidential data lake (AWS S3 bucket).

·Profiled structured, unstructured, and semi-structured data across various sources to identify patterns, and implemented data quality metrics using the necessary queries or Python scripts depending on the source.

·Installed and configured Apache Airflow for the S3 bucket and Snowflake data warehouse, and created DAGs to run in Airflow.

·Created DAGs using the Email, Bash, and Spark Livy operators to execute tasks on an EC2 instance (see the sketch after this list).

·Deployed the code to EMR via CI/CD using Jenkins.

·Extensively used Codecloud for code check-ins and checkouts for version control.
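
A minimal sketch of the kind of Airflow DAG referenced above, wiring a BashOperator task to an EmailOperator notification; the DAG id, schedule, script path, and email address are hypothetical, and the Livy step is omitted.

```python
# Minimal Airflow DAG sketch: run a load script via BashOperator, then send a
# success email via EmailOperator. All identifiers below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.email import EmailOperator

with DAG(
    dag_id="s3_to_snowflake_daily",          # hypothetical DAG id
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_load = BashOperator(
        task_id="run_spark_load",
        # hypothetical spark-submit of the load job on the EC2/EMR host
        bash_command="spark-submit /opt/jobs/s3_to_snowflake.py",
    )

    notify = EmailOperator(
        task_id="notify_team",
        to="data-team@example.com",          # placeholder address
        subject="s3_to_snowflake_daily succeeded",
        html_content="Daily S3-to-Snowflake load completed.",
    )

    run_load >> notify
```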

Environment: Informatica, Power BI, Agile/Scrum, MapReduce, Snowflake, Pig, Spark, Scala, Hive, Kafka, Python, Airflow, Parquet, CSV, Codecloud, Azure Cloud.

CREATORS TOUCH VIJAYAWADA, INDIA

Role: Data Analyst Jul 2019 - Dec 2020

Responsibilities:

•Extracted, cleaned, and processed structured and unstructured data from MongoDB, Hadoop HBase, and Apache Cassandra.

•Designed and maintained ETL pipelines for data transformation and integration using industry-standard ETL tools.

•Utilized AWS (EC2, S3), Azure, and GCP to store, manage, and analyze large datasets.

•Performed data modeling and optimization for NoSQL databases like HBase, Cassandra, and MongoDB.

•Developed and implemented scalable data processing solutions using Hadoop and cloud-based services.

•Conducted exploratory data analysis (EDA) and generated insights to support business decisions.

•Created interactive dashboards and reports using visualization tools like Tableau, Power BI, and Looker.

•Collaborated with data engineers and business stakeholders to ensure data availability and accuracy.

•Deployed and managed containerized data applications using Docker for efficient workload management.

•Implemented security and compliance measures for data storage and processing in cloud environments.

•Participated in feature engineering such as feature generation, PCA, feature normalization, and label encoding with scikit-learn preprocessing; performed data imputation using various methods in the scikit-learn package in Python (see the sketch after this list).

·Used Sqoop to move data from an Oracle database into Hive by creating delimiter-separated files, exposing them at an external location as a Hive external table, and then moving the data into refined tables in Parquet format using Hive queries.

·Used Teradata utilities such as FastExport and MultiLoad (MLOAD) to handle data migration/ETL tasks from OLTP source systems to OLAP target systems.

·Developed Spark programs using Scala APIs to compare the performance of Spark with Hive and SQL.

·Designed and created Hive external tables using a shared metastore instead of Derby, with partitioning, dynamic partitioning, and buckets.
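
As a small illustration of the scikit-learn preprocessing mentioned above, the following sketch chains imputation, scaling, and PCA, with label encoding of the target; the input file and column names are hypothetical.

```python
# Illustrative scikit-learn preprocessing pipeline: imputation, scaling, and PCA
# on a hypothetical feature matrix; column names and parameters are placeholders.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.read_csv("customer_features.csv")            # hypothetical extract
X = df.drop(columns=["churn_label"])
y = LabelEncoder().fit_transform(df["churn_label"])  # label encoding of the target

numeric_cols = X.select_dtypes(include="number").columns.tolist()

preprocess = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),    # data imputation
    ("scale", StandardScaler()),                     # feature normalization
    ("pca", PCA(n_components=0.95)),                 # keep ~95% of the variance
])

X_reduced = preprocess.fit_transform(X[numeric_cols])
```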

Environment: MongoDB, Hadoop HBase, Apache Cassandra, AWS, Azure, GCP, AWS EC2, S3, Azure Data Lake, GCP BigQuery, Apache NiFi, Talend, Informatica, AWS Glue, Hadoop, Spark, Docker, Kubernetes, Tableau, Power BI, Looker, Python, SQL, R, Scala, Git, Jenkins.

SUBRAIN SOLUTIONS VIJAYAWADA, INDIA Jan 2019 - May 2019

Role: Python Intern

Responsibilities:

·Developed a pipeline using Hive (HQL) to retrieve data from the Hadoop cluster and SQL to retrieve data from the Oracle database, and used ETL for data transformation.

·Analyzed and gathered business requirements from clients, conceptualized solutions with technical architects, verified the approach with appropriate stakeholders, and developed end-to-end (E2E) scenarios for building the application.

·Derived data from relational databases to perform complex data manipulations and conducted extensive data checks to ensure data quality; performed data wrangling to clean, transform, and reshape the data using the NumPy and Pandas libraries.

·Worked with datasets of varying size and complexity, including both structured and unstructured data; participated in all phases of data mining, data cleaning, data collection, variable selection, feature engineering, model development, validation, and visualization; and performed gap analysis.

·Optimized numerous SQL statements and PL/SQL blocks by analyzing SQL execution plans, and created and modified triggers, SQL queries, and stored procedures for performance improvement.

·Implemented predictive analytics and machine learning algorithms in Databricks to forecast key metrics, delivered as dashboards on AWS (S3/EC2) and a Django platform for the company's core business.

·Led engagement planning: developed and managed Tableau implementation plans for the stakeholders, ensuring timely completion and successful delivery according to stakeholder expectations.

·Managed workload and utilization of the team. Coordinated resources and processes to achieve Tableau implementation plans.

·Integrated APIs to collect real-time data for analysis and reporting (see the sketch after this list).
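
A brief sketch of the kind of API integration described in the last bullet: pulling JSON from a REST endpoint with requests and shaping it with pandas. The endpoint, query parameters, and column names are hypothetical.

```python
# Illustrative API ingestion sketch: pull JSON from a hypothetical REST endpoint,
# normalize it with pandas, and persist a snapshot for downstream reporting.
import pandas as pd
import requests

API_URL = "https://api.example.com/v1/metrics"   # placeholder endpoint

response = requests.get(API_URL, params={"window": "1h"}, timeout=30)
response.raise_for_status()

records = response.json()                        # expected: list of dicts
df = pd.json_normalize(records)

# Light cleanup before loading/reporting.
df = (df.dropna(subset=["metric_id"])
        .drop_duplicates(subset=["metric_id", "timestamp"]))
df.to_csv("metrics_snapshot.csv", index=False)
```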

Environment: Python, Pandas, NumPy, SQL, MySQL, PostgreSQL, MongoDB, Hadoop, Spark, Kafka, AWS, Azure, GCP, AWS Lambda, Google Cloud Functions, Azure Functions, Docker, Kubernetes, Matplotlib, Seaborn, Power BI, Tableau, Git, Jupyter Notebook, VS Code, API Integration, Web Scraping.


