Divya Yennam
Email: ***************@*****.***
Phone number: 469-***-****
Data Engineer
About me
Experienced professional in the design, development, and maintenance of data models and data warehouses, and in performing advanced analytics. Thrives in a dynamic, challenging environment; a fast learner capable of adopting new technologies and handling projects, tasks, and operations as required.
Professional Summary
5 years of progressive experience focused on Big Data ecosystems, Hadoop architecture, and data warehousing, including the design, integration, implementation, maintenance, and testing of various web applications.
Expertise in data architecture, including data ingestion, pipeline design, Hadoop architecture, data modeling, data mining, and optimizing ETL workflows.
Expertise in Hadoop (HDFS, MapReduce), Hive, Sqoop, Spark, Databricks, PySpark, Airflow, Scala, and AWS.
Solid experience developing Spark applications for highly scalable data transformations using the RDD, DataFrame, and Spark SQL APIs.
Proficiency with Cloudera, Hortonworks, Amazon EMR, Redshift, and EC2 for project creation, implementation, deployment, and maintenance using Java/J2EE, Hadoop, and Spark.
Worked closely on a regular basis with business, product, production support, and engineering teams to dive deep into data, make effective decisions, and support analytics platforms.
Strong experience working in UNIX/Linux environments and writing shell scripts.
Extensive experience debugging and fine-tuning Spark applications and Hive queries to maximize system performance.
Good knowledge of partitioning and bucketing concepts in Hive; processed large sets of structured, semi-structured, and unstructured data.
Experience transferring data between HDFS and relational database systems using Sqoop according to client requirements.
Used Git and Bitbucket for version control, Jenkins for builds, and Jira for issue tracking.
Education
Master's in Big Data Analytics and Information Technology, University of Central Missouri, Warrensburg, MO, May 2022.
Bachelor's in Electrical and Electronics Engineering, JNTUH, Hyderabad, May 2017.
TECHNICAL SKILLS:
Big Data Technologies: HDFS, MapReduce, Pig, Hive, Sqoop, Scala, Spark
Hadoop Distributions: Cloudera CDH, Hortonworks HDP, MapR
Databases: Oracle 10g/11g, PL/SQL, MySQL, MS SQL Server 2012, DB2
Programming Languages: Java, Scala, Python, PySpark
Cloud: AWS, Azure
Methodologies: Agile, Waterfall
Build/Deployment Tools: Jenkins, Kubernetes
IDE Tools: Eclipse, NetBeans, IntelliJ
BI Tools: Tableau, Power BI
Operating Systems: Windows 7/8/10, Vista, UNIX, Linux, Ubuntu, Mac OS X
Professional Experience
Dataquad Inc, Houston, Texas, Oct 2022 to Present
Data Engineer
Responsibilities:
Worked with application teams to install operating systems and perform Hadoop updates, patches, and version upgrades as required.
Developed data pipelines using Flume and Sqoop to ingest customer behavioral data and purchase histories into HDFS for analysis.
Wrote Spark applications in Scala and Python.
Imported and exported data between HDFS and Hive using Sqoop.
Developed MapReduce programs to parse the raw data, populate staging tables, and store the refined data in partitioned tables in the EDW.
Deployed Spark jobs on Amazon EMR and ran them on AWS clusters.
Created Databricks notebooks using SQL and Python and automated them using Databricks jobs.
Performed data extraction, aggregation, and analysis in HDFS using PySpark and stored the required data in Hive.
Configured Spark Streaming in Python to receive real-time data from Kafka and store it in HDFS.
Developed Spark applications in Python (PySpark) to transform data according to business rules.
Created executable rules based on business requirements as reference data.
Performed complex HQL coding, including data flows for evaluating business rules and workflows to segregate data and load it into final database objects for visualization, using the analytical features of Hive.
Performed ETL and ran Sqoop jobs on received data files to extract data and load it into DB2, the data lake, and other targets.
Documented low-level design specifications, including data flows, workflows, data integration rules, and data normalization and standardization methods.
Communicated day-to-day progress of both onsite and offshore teams to the client manager, ensuring work was tracked and completed per project schedules.
Technologies: Apache Spark 2.1, Spark SQL, Scala, Hadoop, Python, Amazon S3, Athena, EC2, EMR, HDFS, Hive
Cloud Solutions Inc, January 2022 to September 2022
Data Engineer
Responsibilities:
Used PySpark and Hive to analyze sensor data and cluster users based on their behavior in events.
Worked with AWS services such as EMR and EC2 for fast and efficient processing of big data.
Developed Spark programs, created DataFrames, and worked on transformations.
Created Databricks notebooks using SQL and Python and automated them using Databricks jobs.
Created Spark clusters and configured high-concurrency clusters in Databricks to speed up the preparation of high-quality data.
Designed PySpark applications for interactive analysis, batch processing, and stream processing, with knowledge of Spark's architecture and components.
Worked with Spark SQL and Python to convert Hive/SQL queries into Spark transformations.
Proficient in Python scripting; worked with NumPy for statistics, Matplotlib for visualization, and pandas for data organization, parsing JSON and CSV documents.
Loaded structured and semi-structured data into Spark clusters using Spark SQL and the DataFrame API.
Created a Databricks Delta Lake process for real-time data loads from various sources into the data lake using Python/PySpark code.
Used Spark SQL to load data into Hive tables and wrote queries to fetch data from those tables.
Developed ingestion and egress frameworks using Python and Spark to load reference data into tables for the different value streams in the project.
Wrote Hive queries to analyze massive data sets of structured, semi-structured, and unstructured data.
Technologies: Apache Spark 2.1, Spark SQL, Hive, AWS EMR, AWS Glue, AWS S3, AWS Athena, Amazon EC2, Scala, Hadoop, Python, Shell Script, SQL.
Veritas Soft Solutions Pvt Ltd, May 2017 to December 2020
Data Analyst
Designed, developed, and deployed batch and streaming jobs to load data from source systems into the HDFS data lake.
Developed complex PySpark, Spark SQL, and Python code.
Developed Spark applications for highly scalable data transformations using the RDD, DataFrame, and Spark SQL APIs.
Developed and automated ETL pipelines, submitting PySpark jobs to process data and load it into Hive tables.
Designed and developed data models in Hive.
Validated Sqoop jobs and shell scripts and performed data validation to check that data was loaded correctly without discrepancies.
Performed complex HQL coding, including data flows for evaluating business rules and workflows to segregate data and load it into final database objects for visualization, using the analytical features of Hive.
Loaded the analyzed data into Tableau and presented regression, trend, and forecast views in dashboards for the datasets considered.
Prepared data per business needs using PySpark and published the processed data to HDFS.
Documented low-level design specifications, including data flows, workflows, data integration rules, and data normalization and standardization methods.
Performed data profiling, data analysis, and data visualization for PHI and sensitive data.
Responsible for daily communications to management and internal organizations regarding the status of all assigned projects and tasks.
Provided support to various pipelines running in production.
Environment: Data lake, HDFS, PySpark, Hive, Apache Spark 2.1, Scala, Hadoop 2.6.x, Python, Shell Script, SQL, Oracle, Git.