Shalini Chenna Reddy
Seattle 206-***-**** *******.*****@*****.***
Profile
Accomplished Data Engineer with over 5 years of experience in analyzing, designing, developing, and maintaining Big Data solutions. Proficient in leveraging Hadoop ecosystem components and cloud platforms like AWS and Azure to efficiently solve data processing challenges. Experienced in data transformation and profiling using Spark, Hive, Pig, and Python, with a strong grasp of both Waterfall and Agile-SCRUM methodologies. Adept at implementing real-time data streaming and ETL processes, ensuring data consistency and optimal performance across distributed systems. Skilled in developing end-to-end data pipelines and workflow scheduling using tools like Oozie and Zookeeper.
Experience
Big Data Engineer Toyota
Seattle, WA
Jan 2023 – Current
Developed ETL jobs in Spark (Scala) to migrate data from Oracle into Hive tables, transforming the data with Hive and MapReduce and loading it into HDFS.
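A minimal PySpark sketch of this Oracle-to-Hive pattern (the production jobs were in Scala; the endpoint, credentials, and table names here are hypothetical placeholders):

from pyspark.sql import SparkSession

# Read an Oracle table over JDBC and persist it as a partitioned Hive table on HDFS.
spark = (SparkSession.builder
         .appName("oracle-to-hive-etl")
         .enableHiveSupport()
         .getOrCreate())

orders = (spark.read.format("jdbc")
          .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCL")   # placeholder endpoint
          .option("dbtable", "SALES.ORDERS")
          .option("user", "etl_user")
          .option("password", "***")
          .option("driver", "oracle.jdbc.OracleDriver")
          .load())

(orders.write.mode("overwrite")
       .partitionBy("order_date")       # assumes a date column suitable for partitioning
       .saveAsTable("warehouse.orders"))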
Designed and implemented various stages of data flow (ingestion, processing, consumption) within the Hadoop ecosystem using tools such as HDFS, Hive, Pig, HBase, Sqoop, Kafka.
Created self-service reporting solutions in Azure Data Lake Storage Gen2 using an ELT approach and wrote PySpark and Spark SQL transformations in Azure Databricks to implement complex business rules.
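A hedged sketch of such a Databricks ELT step, assuming workspace-managed access to ADLS Gen2; the storage account, containers, and business rule are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("elt-business-rules").getOrCreate()

# Stage raw data from ADLS Gen2, express the rule in Spark SQL, persist the curated result.
raw = spark.read.parquet("abfss://raw@examplestorage.dfs.core.windows.net/sales/")
raw.createOrReplaceTempView("sales_raw")

curated = spark.sql("""
    SELECT customer_id,
           SUM(amount) AS total_amount,
           CASE WHEN SUM(amount) > 10000 THEN 'premium' ELSE 'standard' END AS tier
    FROM sales_raw
    GROUP BY customer_id
""")
curated.write.mode("overwrite").parquet("abfss://curated@examplestorage.dfs.core.windows.net/sales/")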
Managed wide-ranging data ingestion using Sqoop, Flume, and Kafka, storing data in partitioned form in formats such as text, JSON, and Parquet, and developed Spark applications to consume Kafka data and ingest it into Cassandra.
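A sketch of the Kafka-to-Cassandra path, assuming the spark-cassandra-connector package is on the classpath; the broker, topic, keyspace, and schema are placeholders:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-to-cassandra").getOrCreate()
schema = StructType().add("event_id", StringType()).add("value", DoubleType())

# Parse Kafka message values as JSON into typed columns.
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

def write_to_cassandra(batch_df, batch_id):
    # Append each micro-batch to a Cassandra table via the connector.
    (batch_df.write.format("org.apache.spark.sql.cassandra")
     .options(keyspace="analytics", table="events")
     .mode("append").save())

events.writeStream.foreachBatch(write_to_cassandra).start().awaitTermination()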
Tuned HBase, Hive queries, and Spark jobs for performance optimization, installed and configured Hadoop ecosystem components, and monitored cluster health using Nagios and Ganglia.
Architected and implemented modern data stack solutions utilizing cloud-native technologies and scalable data processing frameworks.
Developed Java/Spring-based middleware services for data retrieval using Phoenix SQL, wrote custom UDFs and UDTFs to extend Hive and Pig functionality, and scripted HBase table-management tasks for MapReduce job analytics.
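The Hive and Pig UDFs above were Java; as a PySpark analogue of the same idea, extending SQL with a custom function, with illustrative names:

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

def mask_account(acct):
    # Keep only the last four characters of an account number.
    return "****" + acct[-4:] if acct else None

# Register the function so it can be called from SQL, like a Hive UDF.
spark.udf.register("mask_account", mask_account, StringType())
spark.sql("SELECT mask_account(account_no) FROM warehouse.accounts").show()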
Implemented data security measures in Amazon Redshift to protect sensitive data, leveraging encryption, IAM roles, and VPC for secure data storage and access (a provisioning sketch follows the list below):
I. Configured Amazon Redshift to use AWS Key Management Service (KMS) for data encryption at rest, ensuring all sensitive data stored in the database is protected.
II. Implemented fine-grained access control using IAM roles and policies, ensuring that only authorized users and applications can access specific datasets and operations.
III. Set up Amazon Redshift to operate within a Virtual Private Cloud (VPC), providing network isolation and enhanced security controls, including subnet-level access control and security groups.
IV. Conducted regular security audits and vulnerability assessments on the Redshift clusters to identify and mitigate potential security risks.
V. Applied best practices for Redshift data security, including logging and monitoring with AWS CloudTrail and Amazon CloudWatch for real-time visibility into database activities and potential security incidents.
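A minimal boto3 sketch of provisioning a Redshift cluster with the controls listed above (KMS encryption at rest, VPC placement, IAM roles); every identifier, ARN, and group name is a placeholder:

import boto3

redshift = boto3.client("redshift", region_name="us-west-2")

redshift.create_cluster(
    ClusterIdentifier="analytics-cluster",
    NodeType="ra3.xlplus",
    NumberOfNodes=2,
    MasterUsername="admin",
    MasterUserPassword="***",
    Encrypted=True,                                   # at-rest encryption backed by KMS
    KmsKeyId="arn:aws:kms:us-west-2:111122223333:key/example-key-id",
    ClusterSubnetGroupName="private-subnets",         # keeps the cluster inside the VPC
    VpcSecurityGroupIds=["sg-0123456789abcdef0"],     # subnet- and group-level isolation
    IamRoles=["arn:aws:iam::111122223333:role/redshift-s3-read"],  # scoped data access
)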
Environment: Sqoop, Hive, AWS, Kafka, AWS S3, Python, Oracle, Terraform, MapReduce, Pig, Spark, Scala, Azure Databricks, Kubernetes, Zookeeper, Redshift, DAX, Azure Data Lake, Storm, HBase, Hadoop, AWS Lambda.
Data Engineer Google
Hyderabad, Telangana
April 2019 – Dec 2022
Developed batch scripts to fetch data from AWS S3 storage and performed necessary transformations using Scala within the Spark framework.
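The production jobs were Scala; an equivalent PySpark sketch of the fetch-and-transform pattern, with hypothetical bucket names and columns:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("s3-batch-transform").getOrCreate()

# Pull raw events from S3, drop malformed rows, derive a date column, write back curated data.
df = spark.read.json("s3a://raw-bucket/customer-events/2022/")
cleaned = (df.filter(col("event_type").isNotNull())
             .withColumn("event_date", to_date(col("event_ts"))))
cleaned.write.mode("overwrite").parquet("s3a://curated-bucket/customer-events/")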
Created Sqoop scripts to import and export customer profile data between RDBMS and S3 buckets.
Utilized Spark Streaming APIs to build a common learner data model, processing data from Kafka in near real-time and persisting it to Redshift clusters.
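A sketch of that near-real-time path using Structured Streaming with micro-batch JDBC writes into Redshift, assuming the Redshift JDBC driver is available; the endpoint, topic, and table are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-redshift").getOrCreate()

stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "learner-events")
          .load())

def write_to_redshift(batch_df, batch_id):
    # Persist each micro-batch to a Redshift table over JDBC.
    (batch_df.selectExpr("CAST(value AS STRING) AS payload")
     .write.format("jdbc")
     .option("url", "jdbc:redshift://example.abc123.us-west-2.redshift.amazonaws.com:5439/dev")
     .option("dbtable", "public.learner_events")
     .option("user", "etl_user")
     .option("password", "***")
     .mode("append").save())

stream.writeStream.foreachBatch(write_to_redshift).start().awaitTermination()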
Employed Apache Kafka and Spark Streaming to ingest data from Adobe live stream REST API connections.
Optimized Hive queries using best practices and appropriate tuning parameters, leveraging technologies such as Hadoop, YARN, Python, and PySpark.
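An illustrative example of the kind of tuning involved, run here through a Hive-enabled SparkSession; the settings, table, and query are examples only:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Representative Hive knobs: dynamic partitioning and vectorized execution.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("SET hive.vectorized.execution.enabled=true")

# Filtering on the partition column lets Hive prune partitions instead of scanning the table.
spark.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM warehouse.orders
    WHERE order_date = '2022-06-01'
    GROUP BY customer_id
""").show()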
Analyzed and redesigned SQL scripts using PySpark SQL for enhanced performance.
Automated the creation and termination of AWS EMR clusters, enhancing operational efficiency.
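A hedged boto3 sketch of that EMR lifecycle automation; the cluster name, instance sizes, and roles are placeholders:

import boto3

emr = boto3.client("emr", region_name="us-west-2")

resp = emr.run_job_flow(
    Name="nightly-spark-etl",
    ReleaseLabel="emr-6.9.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,   # cluster self-terminates when steps finish
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

# An explicit teardown path is also available when a long-lived cluster must be stopped.
emr.terminate_job_flows(JobFlowIds=[resp["JobFlowId"]])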
Troubleshot Spark applications to improve error tolerance and reliability.
Used Spark DataFrames and the Spark API for batch job processing, implementing concepts like broadcast variables, caching, and dynamic allocation for scalable applications.
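A short PySpark sketch of those batch patterns (broadcast join, caching, dynamic allocation); paths and column names are illustrative:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = (SparkSession.builder
         .config("spark.dynamicAllocation.enabled", "true")   # let executor count scale with load
         .getOrCreate())

facts = spark.read.parquet("s3a://curated-bucket/facts/")
dims = spark.read.parquet("s3a://curated-bucket/dims/")

# Broadcast the small dimension table to avoid shuffling the large fact table,
# then cache the join result because it is reused downstream.
joined = facts.join(broadcast(dims), "dim_id")
joined.cache()
joined.count()   # materializes the cache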
Used Python libraries (NumPy, Matplotlib, Pandas, Sklearn) to create dashboards and visualizations in Spyder/Jupyter notebooks.
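A toy example of such a notebook visualization, using synthetic data in place of real metrics:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Build a small synthetic time series and chart it, as one pane of a dashboard might.
df = pd.DataFrame({
    "day": pd.date_range("2022-01-01", periods=30),
    "events": np.random.poisson(lam=120, size=30),
})

df.plot(x="day", y="events", kind="line", title="Daily event volume")
plt.tight_layout()
plt.show()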
Implemented CI/CD processes using tools such as Jenkins, Git, and Maven, and configured Jenkins plugins to support project-specific tasks.
Environment: AWS EMR, Spark, Hive, HDFS, Sqoop, Kafka, Flink, HBase, Scala, Oozie, MapReduce, AWS Redshift.
Education
BACHELOR OF SCIENCE IN ELECTRONICS AND COMMUNICATION ENGINEERING
MALLA REDDY ENGINEERING COLLEGE FOR WOMEN, HYDERABAD, INDIA.
Certifications
Certification program from 360DIGITMG covering Python, Power BI, Tableau, Big Data, PySpark, and AWS Data Analytics.
Skills & Abilities
Languages
Python, Scala, R, SQL, PL/SQL, Java.
Data Visualization
QuickSight, Power BI, Tableau, Informatica, Microsoft Excel.
Hadoop Eco System
Hadoop, MapReduce, Spark, HDFS, Sqoop, Hive, Impala, HBase.
Data Analysis
Web Scraping, Data Visualization, Modern Data Stack, Statistical Analysis, Data Mining, Data Warehousing, Data Migration, Database Management.
Database
MySQL, SQL Server, Oracle, AWS Redshift.
Cloud Platform
AWS, Azure, CloudStack/OpenStack.
Open Source
Kubernetes, Terraform.