Divya Yennam
Email: ***************@*****.***
Phone number: 469-***-****
Data Engineer
About me
Experienced professional in the design, development, and maintenance of data models and data warehouses, and in performing advanced analytics. Thrives in a dynamic, challenging environment; a fast learner capable of adopting new technologies and handling projects, tasks, and operations as required.
Professional Summary
5 years of progressive experience focused on Big Data ecosystems, Hadoop architecture, and data warehousing, including the design, integration, implementation, maintenance, and testing of various web applications.
Expertise in data architecture, including data ingestion, pipeline design, Hadoop architecture, data modeling, data mining, and optimizing ETL workflows.
Expertise in Hadoop (HDFS, MapReduce), Hive, Sqoop, Spark, Databricks, PySpark, Airflow, Scala, and AWS.
Solid experience developing Spark applications for highly scalable data transformations using the RDD, DataFrame, and Spark SQL APIs.
Proficiency with Cloudera, Hortonworks, Amazon EMR, Redshift, and EC2 for project creation, implementation, deployment, and maintenance using Java/J2EE, Hadoop, and Spark.
Worked closely on a regular basis with business, product, production support, and engineering teams to dive deep into data, make effective decisions, and support analytics platforms.
Strong experience working in UNIX/Linux environments and writing shell scripts.
Extensive experience debugging and fine-tuning Spark applications and Hive queries to maximize system performance.
Good knowledge of partitioning and bucketing concepts in Hive; processed large sets of structured, semi-structured, and unstructured data.
Experience transferring data between HDFS and relational database systems using Sqoop according to client requirements.
Used Git and Bitbucket for version control, Jenkins for builds, and Jira for issue tracking.
Education
Master's in Big Data Analytics and Information Technology, University of Central Missouri, Warrensburg, MO, May 2022.
Bachelor's in Electrical and Electronics Engineering, JNTUH, Hyderabad, May 2017.
TECHNICAL SKILLS:
Big Data Technologies: HDFS, MapReduce, Pig, Hive, Sqoop, Scala, Spark
Hadoop Distributions: Cloudera CDH, Hortonworks HDP, MapR
Databases: Oracle 10g/11g, PL/SQL, MySQL, MS SQL Server 2012, DB2
Programming Languages: Java, Scala, Python, PySpark
Cloud: AWS, Azure
Methodologies: Agile, Waterfall
Build/Deployment Tools: Jenkins, Kubernetes
IDE Tools: Eclipse, NetBeans, IntelliJ
BI Tools: Tableau, Power BI
Operating Systems: Windows 7/8/10, Vista, UNIX, Linux, Ubuntu, Mac OS X
Professional Experience
Dataquad Inc, Houston, Texas, Oct 2022 to Present
Data Engineer
Responsibilities:
Worked with application teams to install operating systems and perform Hadoop updates, patches, and version upgrades as required.
Developed data pipelines using Flume and Sqoop to ingest customer behavioral data and purchase histories into HDFS for analysis.
Wrote Spark applications in Scala and Python.
Imported and exported data between HDFS and Hive using Sqoop.
Developed MapReduce programs to parse the raw data, populate staging tables, and store the refined data in partitioned tables in the EDW.
Deployed Spark jobs on Amazon EMR and ran them on AWS clusters.
Created Databricks notebooks using SQL and Python and automated them using Databricks jobs.
Performed data extraction, aggregation, and analysis in HDFS using PySpark and stored the required data in Hive.
Configured Spark Streaming in Python to receive real-time data from Kafka and store it in HDFS.
Developed Spark applications in Python (PySpark) to transform data according to business rules.
Created executable rules based on business requirements as reference data.
Performed complex HQL coding, including data flows for evaluating business rules and workflows to segregate data and load it into final database objects for visualization, using the analytical features of Hive.
Performed ETL and ran Sqoop jobs on received data files to extract data and load it into DB2, the data lake, and other targets.
Documented low-level design specifications, including data flows, workflows, data integration rules, and data normalization and standardization methods.
Communicated day-to-day progress of both onsite and offshore teams to the client manager, ensuring work was tracked and completed per project schedules.
Technologies: Apache Spark 2.1, Spark SQL, Scala, Hadoop, Python, Amazon S3, Athena, EC2, EMR, HDFS, Hive
Cloud Solutions Inc, January 2022 to September 2022
Data Engineer
Responsibilities:
Used PySpark and Hive to analyze sensor data and cluster users based on their behavior in events.
Worked with AWS services such as EMR and EC2 for fast and efficient processing of big data.
Developed Spark programs, created DataFrames, and worked on transformations.
Created Databricks notebooks using SQL and Python and automated them using Databricks jobs.
Created Spark clusters and configured high-concurrency clusters in Databricks to speed up the preparation of high-quality data.
Designed PySpark applications for interactive analysis, batch processing, and stream processing, with knowledge of Spark's architecture and components.
Worked with Spark SQL and Python to convert Hive/SQL queries into Spark transformations.
Proficient in Python scripting; worked with NumPy for statistics, Matplotlib for visualization, and pandas for data organization, parsing JSON and CSV documents.
Loaded structured and semi-structured data into Spark clusters using Spark SQL and the DataFrame API.
Created a Databricks Delta Lake process for real-time data loads from various sources into the data lake using Python/PySpark code.
Used Spark SQL to load data into Hive tables and wrote queries to fetch data from those tables.
Developed ingestion and egress frameworks using Python and Spark to load reference data into tables for the different value streams in the project.
Wrote Hive queries to analyze massive data sets of structured, semi-structured, and unstructured data.
Technologies: Apache Spark 2.1, Spark SQL, Hive, AWS EMR, AWS Glue, AWS S3, AWS Athena, Amazon EC2, Scala, Hadoop, Python, Shell Script, SQL.
Veritas Soft Solutions Pvt Ltd, May 2017 to December 2020
Data Analyst
Designed, developed, and deployed batch and streaming jobs to load data from source systems into the HDFS data lake.
Developed complex PySpark, Spark SQL, and Python code.
Developed Spark applications for highly scalable data transformations using the RDD, DataFrame, and Spark SQL APIs.
Developed and automated ETL pipelines, submitting PySpark jobs to process data and load it into Hive tables.
Designed and developed data models in Hive.
Validated Sqoop jobs and shell scripts and performed data validation to check that data was loaded correctly without discrepancies.
Performed complex HQL coding, including data flows for evaluating business rules and workflows to segregate data and load it into final database objects for visualization, using the analytical features of Hive.
Loaded the analyzed data into Tableau and presented regression, trend, and forecast views in dashboards for the datasets considered.
Prepared data per business needs using PySpark and published the processed data to HDFS.
Documented low-level design specifications, including data flows, workflows, data integration rules, and data normalization and standardization methods.
Performed data profiling, data analysis, and data visualization for PHI and sensitive data.
Responsible for daily communications to management and internal organizations regarding the status of all assigned projects and tasks.
Provided support to various pipelines running in production.
Environment: Data lake, HDFS, PySpark, Hive, Apache Spark 2.1, Scala, Hadoop 2.6.x, Python, Shell Script, SQL, Oracle, Git.