Big Data Engineer

Location:

Camarillo, CA

Posted:

November 30, 2023


Resume:

*****.********@*****.***

660-***-****

https://www.linkedin.com/in/dilip-somaraju-9884a8183

Dilip Somaraju

Big Data Engineer

As a passionate and highly recommended Big Data Engineer, I am eager to leverage my skills and expertise to make a positive impact in a new organization that values innovation and data-driven decision-making. I am confident that I can contribute to the success and growth of any organization that requires high-quality and scalable data solutions.

WORK EXPERIENCE

Kaiser Permanente

Sr. Big Data Engineer

Pleasanton, CA May 2019 - May 2023

Led end-to-end data pipeline implementations, optimizing data processing with PySpark, DataFrames, and custom shell scripts. Transferred data efficiently between sources and Azure Data Lake Storage (ADLS) using Apache NiFi and designed data migration strategies from Hive to Snowflake.

Demonstrated expertise in Apache Airflow, creating custom operators, and maintaining up-to-date documentation for efficient workflow management.

Contributed to the successful loading of large data volumes into Databricks Delta Live Tables, ensuring data quality and accuracy. Managed Azure Cloud server configurations, including SQL database migration and Geo-replication setup for application and reporting data tiers.

Utilized Spark Submit on Databricks for ETL optimization and loaded data from Hive to Databricks Delta Live Tables using the Auto Loader feature.

Designed and implemented a real-time data pipeline to process semi-structured data by integrating 150 million raw records from 30+ data sources using Kafka and PySpark.

Led the design and architecture of a real-time data warehouse system that seamlessly integrated data from multiple sources, enabling instant access to critical business insights.

Optimized ETL processes for real-time data ingestion, transformation, and loading, resulting in a 30% reduction in data latency and improved decision-making speed

Implemented the Audit Balance and Control Framework with Airflow, built custom Python operators and sensors, and authored DAGs for use cases with various technologies.

Implemented dynamic data modeling techniques to accommodate evolving business requirements, allowing for rapid schema changes and ensuring the data warehouse's agility.

Used Spark in Python to distribute data processing on large streaming datasets, improving ingestion and speed by 67%

Supported implementation and active monitoring of controls and programs for precision and efficacy

Built, maintained, scaled, and supported 10+ existing data pipelines

Ensured the proper storing of both raw and processed data

Caesars Entertainment Corp

Big Data Engineer

Las Vegas, NV Nov 2017 - Apr 2019

Analyzed and reviewed business requirements and technical specs, ensuring alignment with data engineering processes

SKILLS

Cloud Computing

Microsoft Azure (Azure Data Factory, Azure HDInsight, Azure Databricks, Azure Synapse Analytics, Azure SQL DB); AWS (Amazon S3, AWS Glue, Amazon Redshift, AWS Lambda, Amazon Kinesis, Amazon SageMaker); Cloud Data Storage & Management (Azure Data Lake Storage, Snowflake); Cloud Migration & Integration (Data migration from on-premises to cloud platforms); Cloud Computing Optimization & Performance Tuning

Data Analytics

Data Analysis & Transformation (Using SQL, Hive, PySpark); Data Visualization (Arcadia Data, Power BI, Tableau); Big Data Analytics (Using Hadoop, Hive, PySpark, Spark); Machine Learning (Experience with Amazon SageMaker); Complex Query Optimization & Data Modeling; Data Ingestion & Transformation Framework (Design and Implementation); Python Libraries for Data Analytics (NumPy, Pandas, Matplotlib)

Big Data Technologies

Hadoop Ecosystem (HDFS, MapReduce, Hive, Oozie, Sqoop); Big Data Management (Using tools like Informatica Power Center, Informatica Big Data Management); Data Pipeline Construction & Management (Using tools like Apache NiFi, Apache Airflow); Big Data Analytics & Processing (Using tools like Apache Spark,

Designed and developed data ingestion framework workflows and utilities using Big Data technologies, enabling efficient data collection, storage, and processing.

Leveraged AWS services, including Amazon S3, AWS Glue, and Amazon Redshift, to create scalable and efficient data pipelines, optimizing data ingestion, processing, and storage.

Utilized Amazon SageMaker for proof-of-concept (POC) projects, building, training, and deploying machine learning models for various applications, such as customer segmentation and fraud detection.

Implemented serverless computing solutions with AWS Lambda, reducing infrastructure costs and improving scalability while minimizing operational overhead.

Developed real-time data streaming solutions using Amazon Kinesis, enabling real-time capture and analysis of customer behavior and gaming transactions.

Collaborated with cross-functional teams, including data scientists and business analysts, to design and develop data solutions aligned with organizational goals.

Performed data refinement using SQL, Spark flows, and Big Data scripts to transform raw data based on business requirements, ensuring data quality and usability.

Maintained up-to-date documentation of real-time data warehouse processes and actively conducted training sessions for team members to ensure proficiency in managing real-time data workflows

Implemented the entire data pipeline, from capturing and storing data to processing it with Apache Spark, and maintained scalable data models and pipelines

Collaborated with business stakeholders to define reporting requirements and developed business intelligence solutions that leverage the data warehouse for meaningful insights and data-driven decision-making

Wrote numerous BTEQ scripts to run complex queries on the Teradata database

Used DataStage as the ETL tool to pull data from source systems and files, then cleanse, transform, and load it into Teradata using Teradata utilities

Systrac Solutions

Hadoop Associate Consultant

Hyderabad, India June 2012 - Jan 2014

Involved in design discussions on ingestion and logic to perform analytics on top of ingested data

Prepared design, data model, and production-related documentation for the complete process

Took end-to-end leadership for developing data ingestion and queries to perform data analytics

Developed automation scripts for execution and log generation

Implemented ingestion code for all possible scenarios and performed integration activities

Implemented MapReduce programs to handle complex logic that could not be achieved in Hive

Involved in unit testing and integration testing across all environments

Involved in production support activities

Optimized the performance of the existing codebase

Involved in production implementation and the corresponding monitoring and validation

EDUCATION

Master of Science, Computer Science

University of Central Missouri

Jul 2016 - Feb 2018 Warrensburg, MO, USA

PySpark); Big Data Storage & Retrieval (Using HBase)

Databases

Relational Databases (Oracle, Microsoft SQL Server, Teradata); NoSQL Databases (HBase); Database Management & Query Optimization (Using SQL, Hive, HBase); Database Tools (Cloudera Hue, Razor SQL, DB Visualizer, Oracle SQL Developer, Teradata SQL Assistant, Microsoft SQL Server Management Studio); Data Migration & Integration (Between different database systems)

Programming & Scripting

Python (Advanced skills including building and maintaining Python-based applications); UNIX Shell Scripting (Automation, job scheduling, and scripting); SQL & Hive QL (Data query, transformation, and analysis); JavaScript (Including libraries like jQuery); HTML & CSS (For web development)

Other

Integrated Development Environments (IDEs) - NetBeans, Eclipse, PyCharm; Utilities - Putty, WinSCP; Job Schedulers - Tivoli Workload Scheduler; Project Leadership & Collaboration; Strategy Design & Implementation (Including data pipelines and ingestion frameworks); Process Optimization & Performance Tuning; Cross-functional Team Collaboration; Documentation & Knowledge Sharing

Relevant courses

Data Mining; Intro to Machine Learning; Adv Algorithm & Data Structures; Advanced Database systems; Big Data: Storage, Analytics & Visualization; Software Testing

Master of Science, Software Engineering

Blekinge Institute of Technology

Jan 2014 - Apr 2016 Karlskrona, Sweden

Relevant courses

Research Methodologies in Software Engineering and Computer Science; Global software engineering; Applied software project management; Software Metrics; Software Requirement and estimation; Software Design Engineering; Advanced topic in computing; Software Quality Management

Master of Technology, Computer Science & Engineering

Jawaharlal Nehru Technological University Hyderabad

May 2013 - Jan 2014 Hyderabad, India

Bachelor of Technology, Computer Science & Engineering

Jawaharlal Nehru Technological University Hyderabad

May 2010 - Apr 2013 Hyderabad, India

Relevant courses

Computer Programming & Data Structures; Probability & statistics; Advanced data structures; Object-Oriented programming language; Mathematical foundation of Computer science; Database management systems; Principles of programming language; Compiler Design; Software Engineering; Operating systems; Data communication and computer networks; Object Oriented Analysis and Design; Mobile computing; Network Security; Computer graphics

PUBLICATIONS

https://www.semanticscholar.org/paper/Prediction-of-Time%2C-Cost-and-Effort-needed-for-to-%3A-Somaraju/6c14aa49b4ad604e44bb0d153ce890ce7b1410d0
