*****.********@*****.***
https://www.linkedin.com/in/dilip-somaraju-9884a8183
Dilip Somaraju
Big Data Engineer
As a passionate and highly recommended Big Data Engineer, I am eager to leverage my skills and expertise to make a positive impact in a new organization that values innovation and data-driven decision-making. I am confident that I can contribute to the success and growth of any organization that requires high-quality and scalable data solutions.
WORK EXPERIENCE
Kaiser Permanente
Sr. Big Data Engineer
Pleasanton, CA May 2019 - May 2023
Led end-to-end data pipeline implementations, optimizing data processing with PySpark, DataFrames, and custom shell scripts. Transferred data efficiently between sources and Azure Data Lake Storage (ADLS) using Apache NiFi and designed data migration strategies from Hive to Snowflake.
Demonstrated expertise in Apache Airflow, creating custom operators, and maintaining up-to-date documentation for efficient workflow management.
Contributed to the successful loading of large data volumes into Databricks Delta Live Tables, ensuring data quality and accuracy. Managed Azure Cloud server configurations, including SQL database migration and Geo-replication setup for application and reporting data tiers.
Utilized Spark Submit on Databricks for ETL optimization and loaded data from Hive to Databricks Delta Live Tables using the Auto Loader feature.
Designed and implemented a real-time data pipeline to process semi-structured data by integrating 150 million raw records from 30+ data sources using Kafka and PySpark.
Led the design and architecture of a real-time data warehouse system that seamlessly integrated data from multiple sources, enabling instant access to critical business insights.
Optimized ETL processes for real-time data ingestion, transformation, and loading, resulting in a 30% reduction in data latency and improved decision-making speed.
Implemented the Audit Balance and Control Framework with Airflow, built custom Python operators and sensors, and authored DAGs for use cases with various technologies.
Implemented dynamic data modeling techniques to accommodate evolving business requirements, allowing for rapid schema changes and ensuring the data warehouse's agility.
Used Spark in Python to distribute data processing on large streaming datasets, improving ingestion and speed by 67%.
Supported implementation and active monitoring of controls and programs for precision and efficacy.
Built, maintained, scaled, and supported 10+ existing data pipelines.
Ensured the proper storing of both raw and processed data.
Caesars Entertainment Corp
Big Data Engineer
Las Vegas, NV Nov 2017 - Apr 2019
Analyzed and reviewed business requirements and technical specs, ensuring alignment with data engineering processes.
SKILLS
Cloud Computing
Microsoft Azure (Azure Data
Factory, Azure HDInsight,
Azure Databricks, Azure
Synapse Analytics, Azure SQL
DB); AWS (Amazon S3, AWS
Glue, Amazon Redshift, AWS
Lambda, Amazon Kinesis,
Amazon SageMaker); Cloud
Data Storage & Management
(Azure Data Lake Storage,
Snowflake); Cloud Migration &
Integration (Data migration
from on-premises to cloud
platforms); Cloud Computing
Optimization & Performance
Tuning
Data Analytics
Data Analysis &
Transformation (Using SQL,
Hive, PySpark); Data
Visualization (Arcadia Data,
Power BI, Tableau); Big Data
Analytics (Using Hadoop, Hive,
PySpark, Spark); Machine
Learning (Experience with
Amazon SageMaker); Complex
Query Optimization & Data
Modeling; Data Ingestion &
Transformation Framework
(Design and Implementation);
Python Libraries for Data
Analytics (NumPy, Pandas,
Matplotlib)
Big Data Technologies
Hadoop Ecosystem (HDFS,
MapReduce, Hive, Oozie,
Sqoop); Big Data
Management (Using tools like
Informatica Power Center,
Informatica Big Data
Management); Data Pipeline
Construction & Management
(Using tools like Apache NiFi,
Apache Airflow); Big Data
Analytics & Processing (Using
tools like Apache Spark,
Designed and developed data ingestion framework workflows and utilities using Big Data technologies, enabling efficient data collection, storage, and processing.
Leveraged AWS services, including Amazon S3, AWS Glue, and Amazon Redshift, to create scalable and efficient data pipelines, optimizing data ingestion, processing, and storage.
Utilized Amazon SageMaker for proof-of-concept (POC) projects, building, training, and deploying machine learning models for various applications, such as customer segmentation and fraud detection.
Implemented serverless computing solutions with AWS Lambda, reducing infrastructure costs and improving scalability while minimizing operational overhead.
Developed real-time data streaming solutions using Amazon Kinesis, enabling real-time capture and analysis of customer behavior and gaming transactions.
Collaborated with cross-functional teams, including data scientists and business analysts, to design and develop data solutions aligned with organizational goals.
Performed data refinement using SQL, Spark flows, and Big Data scripts to transform raw data based on business requirements, ensuring data quality and usability.
Maintained up-to-date documentation of real-time data warehouse processes and actively conducted training sessions for team members to ensure proficiency in managing real-time data workflows.
Implemented the entire data pipeline, from capturing and storing data to processing it using Apache Spark, and maintained scalable data models and pipelines.
Collaborated with business stakeholders to define reporting requirements and developed business intelligence solutions that leverage the data warehouse for meaningful insights and data-driven decision-making.
Wrote numerous BTEQ scripts to run complex queries on the Teradata database.
Used DataStage as the ETL tool to pull data from source systems and files, then cleanse, transform, and load it into Teradata using Teradata Utilities.
Systrac Solutions
Hadoop Associate Consultant
Hyderabad, India June 2012 - Jan 2014
Participated in design discussions on ingestion and the logic to perform analytics on top of ingested data.
Prepared design, data model, and production-related documentation for the complete process.
Took end-to-end leadership for developing data ingestion and queries to perform data analytics.
Developed automation scripts for execution and log generation.
Implemented ingestion code for all possible scenarios and performed integration activities.
Implemented MapReduce programs to handle complex logic that could not be achieved in Hive.
Involved in unit testing and integration testing across all environments.
Involved in production support activities.
Optimized the performance of the existing codebase.
Participated in production implementation and the corresponding monitoring and validations.
EDUCATION
Master of Science, Computer Science
University of Central Missouri
Jul 2016 - Feb 2018 Warrensburg, MO, USA
PySpark); Big Data Storage &
Retrieval (Using HBase);
Databases
Relational Databases (Oracle,
Microsoft SQL Server,
Teradata); NoSQL Databases
(HBase); Database
Management & Query
Optimization (Using SQL, Hive,
HBase); Database Tools
(Cloudera Hue, Razor SQL, DB
Visualizer, Oracle SQL
Developer, Teradata SQL
Assistant, Microsoft SQL
Server Management Studio);
Data Migration & Integration
(Between different database
systems)
Programming & Scripting
Python (Advanced skills
including building and
maintaining Python-based
applications); UNIX Shell
Scripting (Automation, job
scheduling, and scripting);
SQL & Hive QL (Data query,
transformation, and analysis);
JavaScript (Including libraries
like jQuery); HTML & CSS (For
web development)
Other
Integrated Development
Environments (IDEs) -
NetBeans, Eclipse, PyCharm;
Utilities - Putty, WinSCP; Job
Schedulers - Tivoli Workload
Scheduler; Project Leadership
& Collaboration; Strategy
Design & Implementation
(Including data pipelines and
ingestion frameworks);
Process Optimization &
Performance Tuning; Cross-functional
Team Collaboration; Documentation &
Knowledge Sharing
Relevant courses
Data Mining; Intro to Machine Learning; Adv Algorithm & Data Structures; Advanced Database systems; Big Data: Storage, Analytics & Visualization; Software Testing
Master of Science, Software Engineering
Blekinge Institute of Technology
Jan 2014 - Apr 2016 Karlskrona, Sweden
Relevant courses
Research Methodologies in Software Engineering and Computer Science; Global software engineering; Applied software project management; Software Metrics; Software Requirement and estimation; Software Design Engineering; Advanced topic in computing; Software Quality Management
Master of Technology, Computer Science & Engineering
Jawaharlal Nehru Technological University Hyderabad
May 2013 - Jan 2014 Hyderabad, India
Bachelor of Technology, Computer Science & Engineering
Jawaharlal Nehru Technological University Hyderabad
May 2010 - Apr 2013 Hyderabad, India
Relevant courses
Computer Programming & Data Structures; Probability & statistics; Advanced data structures; Object Oriented programming language; Mathematical foundation of Computer science; Database management systems; Principles of programming language; Compiler Design; Software Engineering; Operating systems; Data communication and computer networks; Object Oriented Analysis and Design; Mobile computing; Network Security; Computer graphics
PUBLICATIONS
https://www.semanticscholar.org/paper/Prediction-of-Time%2C-Cost-and-Effort-needed-for-to-%3A-Somaraju/6c14aa49b4ad604e44bb0d153ce890ce7b1410d0