Saipavan Konireddy
814-***-**** • **************@*****.***
Professional Summary
Around 3 years of IT experience in analysis, design, development, and Big Data engineering in Scala, PySpark, Hadoop, and HDFS environments, with additional experience in Python. Implemented Big Data solutions using the Hadoop technology stack, including PySpark, Hive, Sqoop, Avro, and Thrift.
Proficient in developing SQL against relational databases such as Oracle and SQL Server in support of data warehousing and data integration solutions. Strong experience migrating other databases to Snowflake, with in-depth knowledge of Snowflake database, schema, and table structures.
Experienced in requirement analysis, application development, application migration, and maintenance using the Software Development Life Cycle (SDLC) and Python technologies. Defined user stories, drove the Agile board in JIRA during project execution, and participated in sprint demos and retrospectives.
Strong working experience with SQL and NoSQL databases, data modeling, and data pipelines. Involved in end-to-end development and automation of ETL pipelines using SQL and Python. Managed multiple tasks under tight deadlines in fast-paced environments. Excellent analytical and communication skills that help in understanding business logic and building good relationships between stakeholders and team members. Strong work ethic, leadership skills, and the ability to work efficiently in a team. Implementation and support using Agile and Waterfall methodologies.
Skills
TECHNICAL SKILLS:
Big Data Ecosystem: HDFS, MapReduce, HBase, Pig, Hive, Sqoop, Kafka, Flume, Cassandra, Impala, Oozie, Zookeeper, MapR, Amazon Web Services (AWS), EMR
Databases: Oracle 11g/10g/9i, MySQL, DB2, MS SQL Server, HBase
Programming and Query Languages: Java, SQL, Python (Pandas, NumPy, SciPy, Scikit-Learn, Seaborn, Matplotlib, NLTK), NoSQL, PySpark, PySpark SQL, SAS, R (Caret, Glmnet, XGBoost, rpart, Ggplot2, sqldf), RStudio, PL/SQL, Linux shell scripts, Scala
Machine Learning - Classification Algorithms: Logistic Regression, Decision Tree, Random Forest, K-Nearest Neighbor (KNN), Gradient Boosting Classifier, Extreme Gradient Boosting Classifier, Support Vector Machine (SVM), Artificial Neural Networks (ANN), Naïve Bayes Classifier, Extra Trees Classifier, Stochastic Gradient Descent, etc.
Machine Learning - Ensemble and Stacking: Averaged Ensembles, Weighted Averaging, Base Learning, Meta Learning, Majority Voting, Stacked Ensemble, AutoML (Scikit-Learn, MLjar), etc.
Cloud Technologies: AWS, Azure, Google Cloud Platform (GCP)
IDEs: IntelliJ, Eclipse, Spyder, Jupyter
Data Engineer - Big Data Tools / Cloud / Visualization / Other Tools: Databricks, Hadoop Distributed File System (HDFS), Hive, Pig, Sqoop, MapReduce, Spring Boot, Flume, YARN, Hortonworks, Cloudera, Mahout, MLlib, Oozie, Zookeeper, etc.; AWS, Azure Databricks, Azure Data Explorer, Azure HDInsight, Salesforce, GCP, Google Shell, Linux, PuTTY, Bash Shell, Unix, etc.; Tableau, Power BI, SAS, Web Intelligence, Crystal Reports, Dashboard Design
Work History
AWS Data Engineer, 06/2019 to 07/2022
Infotel Groups India
Implemented the installation and configuration of a multi-node cluster in the cloud using Amazon Web Services (AWS) EC2
Handled AWS management tools such as CloudWatch and CloudTrail
Stored the log files in AWS S3
Used versioning in S3 buckets where highly sensitive information is stored
Developed Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns
Gained working experience with data streaming processes using Kafka, Apache Spark, and Hive
Analyzed SQL scripts and designed solutions to implement them using Scala
Designed both 3NF data models for OLTP systems and dimensional data models using star and snowflake schemas
Optimized PySpark jobs to run on a Kubernetes cluster for faster data processing
Developed parallel reports using SQL and Python to validate the daily, monthly, and quarterly reports
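The parallel-report validation idea above can be sketched in plain Python: two independently produced sets of report rows are aggregated and compared, and any disagreeing period/metric pairs are flagged. The row layout and field names here are hypothetical, not taken from the actual reports:

```python
from collections import defaultdict

def summarize(rows):
    """Aggregate report rows into {(period, metric): total}."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["period"], row["metric"])] += row["value"]
    return dict(totals)

def compare_reports(sql_rows, python_rows, tolerance=0.01):
    """Return the (period, metric) keys where the two reports disagree."""
    a, b = summarize(sql_rows), summarize(python_rows)
    keys = set(a) | set(b)
    return sorted(k for k in keys if abs(a.get(k, 0.0) - b.get(k, 0.0)) > tolerance)

# Matching daily totals produce no mismatches; a drifted value would be flagged.
sql_side = [{"period": "2022-06-01", "metric": "orders", "value": 120.0}]
py_side = [{"period": "2022-06-01", "metric": "orders", "value": 120.0}]
print(compare_reports(sql_side, py_side))  # → []
```

The same comparison scales from daily to monthly or quarterly reports simply by changing what goes into the `period` field.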
Designed and developed a real-time stream processing application using Spark, Kafka, Scala, and Hive to perform streaming ETL and apply machine learning
Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs
Primarily responsible for converting a manual reporting system into a fully automated CI/CD data pipeline that ingests data from different marketing platforms into an AWS S3 data lake
Designed and developed AWS architecture, cloud migration, AWS EMR, DynamoDB, Redshift, and event processing using Lambda functions
Conducted ETL data integration, cleansing, and transformations using AWS Glue Spark scripts
Worked on ETL migration services by developing and deploying AWS Lambda functions to generate a serverless data pipeline that writes to the Glue Catalog and can be queried from Athena
Developed AWS Lambda functions to trigger AWS Step Functions that process the scheduled EMR jobs
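A minimal sketch of such a trigger, assuming a hypothetical state machine ARN, bucket, and input layout (none of these names come from the actual project):

```python
import json
from datetime import date

# Hypothetical ARN; in practice this would come from an environment variable.
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:emr-batch"

def build_execution_input(run_date):
    """Build the JSON input the state machine passes to the EMR step."""
    return json.dumps({
        "run_date": run_date.isoformat(),
        "input_prefix": f"s3://example-bucket/raw/{run_date:%Y/%m/%d}/",
        "job": "daily-aggregation",
    })

def handler(event, context):
    """Lambda entry point: start one Step Functions execution per invocation."""
    import boto3  # boto3 ships with the Lambda runtime
    sfn = boto3.client("stepfunctions")
    return sfn.start_execution(
        stateMachineArn=STATE_MACHINE_ARN,
        input=build_execution_input(date.today()),
    )

print(build_execution_input(date(2022, 6, 1)))
```

Keeping the input-building logic separate from the boto3 call makes the payload easy to unit test without touching AWS.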
Created partitioned and bucketed Hive tables in Parquet file format with Snappy compression, then loaded data into the Parquet Hive tables from Avro Hive tables
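The table layout described above can be sketched as the HiveQL DDL a PySpark job would submit; the table, column, and partition names here are hypothetical:

```python
def partitioned_parquet_ddl(table, columns, partition_cols, buckets=16):
    """Render CREATE TABLE DDL for a partitioned, bucketed Parquet Hive table."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    parts = ", ".join(f"{name} {dtype}" for name, dtype in partition_cols)
    bucket_col = columns[0][0]  # bucket on the first (key) column
    return (
        f"CREATE TABLE {table} ({cols}) "
        f"PARTITIONED BY ({parts}) "
        f"CLUSTERED BY ({bucket_col}) INTO {buckets} BUCKETS "
        "STORED AS PARQUET "
        "TBLPROPERTIES ('parquet.compression'='SNAPPY')"
    )

ddl = partitioned_parquet_ddl(
    "events_parquet",
    [("event_id", "BIGINT"), ("payload", "STRING")],
    [("event_date", "STRING")],
)
# A SparkSession would run this as spark.sql(ddl); here we just print it.
print(ddl)
```

The load from the Avro table would then be an INSERT OVERWRITE ... SELECT into this table, letting Hive handle the Parquet/Snappy encoding.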
Involved in designing and developing tables in HBase and storing aggregated data from Hive tables
Analyzed SQL scripts and redesigned them using PySpark SQL for faster performance
Developed Spark applications in Python (PySpark) on a distributed environment to load huge numbers of CSV files with different schemas into Hive ORC tables
Used Apache Spark DataFrames, Spark SQL, and Spark MLlib extensively, developing and designing POCs using Scala, Spark SQL, and MLlib libraries
Created PySpark code that uses Spark SQL to generate DataFrames from the Avro-formatted raw layer and writes them to data service layer internal tables in ORC format
Used SQL Server Integration Services (SSIS) for extracting, transforming, and loading data into target systems from multiple sources
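A minimal sketch of that raw-to-service-layer hop, assuming hypothetical paths and table names, and that the Spark Avro reader is available on the cluster:

```python
def normalize_columns(names):
    """Service-layer convention assumed here: lowercase, underscores, no spaces."""
    return [n.strip().lower().replace(" ", "_") for n in names]

def raw_avro_to_service_orc(raw_path, service_table):
    """Read the Avro raw layer and rewrite it as an ORC internal table."""
    from pyspark.sql import SparkSession  # imported lazily: needs a cluster

    spark = SparkSession.builder.appName("raw-to-service").getOrCreate()
    df = spark.read.format("avro").load(raw_path)
    df = df.toDF(*normalize_columns(df.columns))
    df.write.mode("overwrite").format("orc").saveAsTable(service_table)

print(normalize_columns(["Customer ID", " Event Date "]))  # → ['customer_id', 'event_date']
```

Splitting out the column-renaming rule keeps the naming convention testable independently of Spark.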
Developed Airflow DAGs in Python by importing the Airflow libraries
Environment: AWS, JMeter, Kafka, Ansible, Jenkins, Docker, Maven, Linux, Red Hat, GIT, CloudWatch, Python, shell scripting, Golang, WebSphere, Splunk, Tomcat, SoapUI, Kubernetes, Terraform, PowerShell
Education
Master of Science: Data Science, May 2023
GANNON UNIVERSITY - ERIE, PA
GPA: 3.85/4
Bachelor of Technology: Electronics and Communication Engineering, May 2019
S.R.M INSTITUTE OF SCIENCE AND TECHNOLOGY - CHENNAI
Relevant coursework: Applied Machine Learning, Introduction to Statistics, Applied Algorithms, Linear Regression, Data Mining, Text Mining, AWS, Data Visualization and Analysis, Manage and Access Big Data, Advanced Database Management.
Accomplishments
Jan 2016-April 2016
Title: Signal Jammer
Description: A prototype that blocks the transmission of signals between a cell phone and a base station
By using the same frequency as the mobile handset, the jammer creates strong interference in the communication between caller and receiver
Depending on the frequencies to be blocked, the values of the inductor and capacitor can be altered
Title: Movement of a Car by Hand Gestures
Description: To control a car's movement in the desired direction, the wheels rotate with a specific velocity
The robotic car can move forward, reverse, left, or right in response to hand gestures and is designed using Arduino
Title: Speed of Moving Objects Using an Ultrasonic Sensor
Description: A prototype using an Ultrasonic HC-SR04 sensor and an Arduino UNO to calculate the distance between the ultrasonic device and an object
A Processing application is used to display the distance between the ultrasonic device and the object on the laptop's monitor
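The distance arithmetic behind that prototype can be sketched in Python: the HC-SR04 reports an echo round-trip time in microseconds, and 343 m/s is the assumed speed of sound at roughly 20 °C.

```python
SPEED_OF_SOUND_CM_PER_US = 0.0343  # 343 m/s expressed in cm per microsecond

def echo_to_distance_cm(echo_us):
    """Convert an HC-SR04 echo pulse width (microseconds) to distance in cm."""
    # The pulse covers the round trip to the object and back, so halve it.
    return echo_us * SPEED_OF_SOUND_CM_PER_US / 2

print(echo_to_distance_cm(5830))  # ≈ 100 cm: a 5830 µs round trip means the object is about 1 m away
```

On the Arduino side the same formula runs in the sketch; tracking distance over successive pings is what yields the object's speed.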
Title: Pollution and Weather Monitoring Using LoRa
Description: Measures the amount of pollution in the atmosphere caused by human activities and provides a daily weather report
The data collected from the sensors is transferred to an Arduino UNO via analog-to-digital converters and then sent to LoRa transmitters
The LoRa module is used for long-range coverage of wireless transmission
The data collected from the sensors can be stored and monitored by transferring it to cloud storage.
Additional Information
Hands-on experience with dimensional modeling using star schema and snowflake models.
Firm understanding of Hadoop architecture and its components, including HDFS, Job Tracker, Task Tracker, Name Node, Data Node, and MapReduce programming.
Involved in setting up a Jenkins master and multiple slaves for the entire team as a CI tool as part of the continuous development and deployment process.
Installed and configured Apache Airflow for workflow management; created workflows in Python and built DAGs in Airflow to run jobs sequentially and in parallel.
Experienced in optimizing PySpark jobs to run on a Kubernetes cluster for faster data processing.
Involved in converting Hive queries into Spark actions and transformations by creating RDDs and DataFrames from the required files in HDFS.
Experienced in supporting data analysts in running Hive queries and building ETL pipelines.
Performed importing and exporting of data into HDFS and Hive using Sqoop.
Worked on AWS Cloud: used S3 buckets for storing data, EMR for spinning up clusters and running Spark jobs, Athena for creating external tables, IAM for authentication, and AWS Glue for pulling data from sources.
Experienced in designing, architecting, and implementing scalable cloud-based web applications using AWS and Azure.
Experienced in migrating SQL databases to Azure Data Lake Storage, Azure Data Factory (ADF), Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse, as well as controlling and granting database access and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
Involved in software development, data warehousing, analytics, and data engineering projects using Hadoop, MapReduce, Hive, and other open-source tools and technologies.
Worked on reading and writing multiple data formats such as JSON, ORC, and Parquet on HDFS using PySpark.
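The sequential-and-parallel scheduling idea behind those Airflow DAGs can be sketched without Airflow itself: given each task's upstream dependencies, compute the "waves" of tasks that may run in parallel. The task names below are hypothetical.

```python
def execution_waves(deps):
    """Group tasks into waves: tasks within a wave may run in parallel,
    and each wave waits for all previous waves (Kahn-style topological sort)."""
    remaining = {task: set(upstream) for task, upstream in deps.items()}
    waves = []
    while remaining:
        ready = sorted(t for t, ups in remaining.items() if not ups)
        if not ready:
            raise ValueError("cycle detected in task dependencies")
        waves.append(ready)
        for t in ready:
            del remaining[t]
        for ups in remaining.values():
            ups.difference_update(ready)
    return waves

# extract must finish first; the two loads run in parallel; report runs last.
dag = {
    "extract": [],
    "load_orders": ["extract"],
    "load_users": ["extract"],
    "report": ["load_orders", "load_users"],
}
print(execution_waves(dag))
# → [['extract'], ['load_orders', 'load_users'], ['report']]
```

In Airflow the same shape is expressed with operators and `>>` dependencies; the scheduler derives the parallelism automatically.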