Big Data Engineer

Location:
Maryland Heights, MO, 63146
Posted:
December 27, 2024


Shrustii V

Dallas, TX

937-***-****

***********@*****.***

Summary

•9+ years of experience in analysis, design, development, maintenance, and user training of enterprise applications, working with distributed technologies such as Spark (Scala), Hadoop, Hive, and orchestration tools.

•Deployed DBT models to automate data transformation processes, enhancing data consistency and reducing processing durations.

•Expert in implementing ETL processes with Talend, Python, Apache Airflow, and AWS Glue, ensuring efficient data transformation and workflow automation (an illustrative Airflow sketch appears at the end of this summary).

•Leveraged Amazon Redshift and AWS Glue for data warehousing and storage solutions, improving data query performance and scalability. Utilized AWS Step Functions to orchestrate complex workflows, integrating various AWS services seamlessly.

•Developed scalable data processing pipelines using Scala and Apache Spark on AWS EMR for big data analytics.

•Implemented data transformation and aggregation workflows in Scala for processing large datasets stored in AWS S3.

•Developed Python-based data processing pipelines using AWS Lambda and AWS Step Functions for serverless workflows.

•Created and maintained ETL pipelines with Python and AWS Glue for transforming and loading data into AWS Redshift.

•Automated data analysis and reporting tasks in Python, leveraging AWS S3 for storage and AWS Athena for querying (an Athena query sketch appears at the end of this summary).

•Created custom Scala applications to interact with AWS DynamoDB for high-performance NoSQL data access.

•Optimized Scala-based Spark jobs for cost-effective processing on AWS EMR by tuning configurations and resource allocation.

•Hands-on experience with Terraform for infrastructure as code (IaC) deployment, automating the provisioning of AWS resources, and managing configurations.

•Developed real-time data streaming and messaging solutions using AWS SNS, SQS, and Apache Kafka, optimizing data ingestion and processing capabilities.

•Experienced with PySpark for developing scalable data processing pipelines, performing transformations, and ensuring efficient data management in distributed environments.

•Deployed and orchestrated Docker containers with Kubernetes on managed platforms such as AWS EKS, ECS, and GKE, ensuring reliable and scalable containerized applications.

•Designed and implemented data marts to support analytical and reporting needs, enhancing data accessibility and decision-making processes.

•Good knowledge of AWS services like EC2, ELB, VPC, Route53, Auto Scaling, AMIs, AWS Identity and Access Management (IAM), AWS CloudWatch, Amazon EBS, EMR, Lambda, and Athena for cloud-based solutions.

•Utilized AWS DynamoDB and MongoDB for deploying NoSQL database solutions, optimizing data structure and query performance.

•Worked extensively with AWS S3 for scalable object storage, ensuring efficient data access and cost management.

•Experience working with various Hadoop file formats such as JSON, Parquet, ORC, and Avro, and performing data processing with the Spark framework by creating Spark DataFrames and RDDs.

•In-depth knowledge of the Hadoop ecosystem, including HDFS, YARN, MapReduce, Hive, HBase, Sqoop, Kafka, Spark, Oozie, NiFi, and Cassandra.

•Extensive experience in importing and exporting data using stream-processing platforms such as Flume and Kafka (Apache or Confluent).

•Advanced proficiency in Python, SQL, and Scala for scripting, data manipulation, and backend development in various data-intensive applications.

•Skilled in using AWS QuickSight and Tableau for developing dashboards and reports that provide deep insights into operational data.

•Integrated Terraform with CI/CD pipelines (e.g., Jenkins, GitLab CI) to automate infrastructure deployments and updates, enabling rapid iteration and reducing manual intervention.
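
The bullet above on ETL with Python, Apache Airflow, and AWS Glue can be illustrated with a minimal Airflow DAG sketch that chains an extract task and a load task. This is a generic example under stated assumptions: the DAG id, schedule, and the two placeholder callables are hypothetical and not taken from any engagement listed here.

# Minimal, illustrative Airflow DAG (names and schedule are hypothetical).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_orders():
    # Placeholder: pull source data (for example from an API or S3) here.
    return "extracted"


def load_to_warehouse():
    # Placeholder: load the transformed data into the warehouse here.
    return "loaded"


with DAG(
    dag_id="daily_orders_etl",          # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_orders", python_callable=extract_orders)
    load = PythonOperator(task_id="load_to_warehouse", python_callable=load_to_warehouse)
    extract >> load                     # run extract first, then load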
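
The Athena reporting bullet can similarly be sketched with boto3: submit a query, poll until it finishes, then read the result rows. The bucket, database, table, and region names below are hypothetical placeholders.

# Illustrative sketch: run an Athena query over data in S3 with boto3.
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

resp = athena.start_query_execution(
    QueryString="SELECT event_date, COUNT(*) AS views FROM analytics_db.page_views GROUP BY event_date",
    QueryExecutionContext={"Database": "analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/queries/"},
)
query_id = resp["QueryExecutionId"]

# Poll until the query reaches a terminal state, then fetch the rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])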

Technical Skills:

Operating System

Red Hat Linux, Ubuntu and Windows

Big Data

Spark, Hadoop, YARN, Hive, HBase, Flume, Sqoop, Zookeeper, Databricks, Airflow and Oozie

Streaming

Kafka and Spark DStreams

Cloud

AWS Glue, S3, EMR, EC2, Athena, Redshift Spectrum and Kinesis

Languages

Java 8, Python, SQL, Scala, Spark, PySpark

IDE/Tools

Eclipse, IntelliJ, SQL Developer, Terraform

Databases/DB Languages

Oracle 10g/11g, SQL, MySQL, HQL

Version Controls

Git, Bitbucket

Software Methodologies

Agile (Scrum) and Waterfall

Professional Experience

Goldman Sachs, Dallas, TX Duration: Aug 2022 – Present

Sr. Data Engineer – AWS, PySpark, Python, Hive

•Created external tables with partitions using Hive, AWS Athena, and Redshift; designed external and managed tables in Hive; and ingested data into HDFS using Sqoop (a partitioned-table sketch appears at the end of this role).

•Built and maintained ETL processes to integrate healthcare data from various sources into a centralized data warehouse.

•Built real-time data streaming applications using Scala and Spark Streaming, processing data from AWS Kinesis.

•Automated data workflows and job scheduling with Scala and AWS Step Functions, ensuring reliable execution of data pipelines.

•Implemented data quality checks and validation in Scala applications to ensure accuracy and integrity of data in AWS environments.

•Integrated Python applications with AWS API Gateway to create RESTful APIs and manage service interactions.

•Implemented data quality checks and validation in Python scripts, ensuring accurate and reliable data processing on AWS.

•Created Python-based AWS Lambda functions for event-driven data processing and automation tasks (an illustrative Lambda handler sketch appears at the end of this role).

•Developed data visualization tools in Python and integrated them with AWS QuickSight for interactive reporting.

•Used AWS services such as S3, EMR, and Lambda to build scalable data processing workflows.

•Developed Python scripts to automate data extraction, transformation, and loading tasks, improving overall process efficiency.

•Used AWS Redshift, Redshift Spectrum, S3, and Athena to query large amounts of data stored in S3, creating a virtual data lake without a separate ETL process.

•Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.

•Created scripts to append data from temporary HBase tables to target HBase tables in Spark.

•Worked on NoSQL databases such as HBase and used Spark for real-time streaming of data into the cluster.

•Used HQL to load data into Hive tables from various formats like CSV, JSON, and Parquet, and to export processed data for use in other systems.

•Supported current and new services that leverage AWS cloud computing architecture, including EMR, S3, and other managed service offerings.

•Developed and managed ETL processes to load data into DataMarts, utilizing tools like SQL, Python, and ETL frameworks to ensure data integrity and consistency.

•Implemented Amazon S3 and Amazon SageMaker to deploy and host the model.

•Used AWS SageMaker to quickly build, train, and deploy machine learning models.

•Configured Spark Streaming to receive real-time data from Apache Kafka and persisted the streamed data to HDFS using Scala.

•Built real-time data streaming solutions using Apache Spark Streaming and Kafka.

•Implemented Apache Spark code to read multiple tables from real-time records and filter the data based on requirements.

•Worked with NoSQL databases such as HBase; imported data from Oracle, processed it using Hadoop tools, and exported the results to the Cassandra NoSQL database.
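
A minimal sketch of the partitioned external tables described in this role, expressed as HiveQL issued from PySpark. The table name, columns, and S3 location are hypothetical; the same DDL could equally be run from the Hive CLI or Athena.

# Illustrative sketch: create and query a partitioned Hive external table from PySpark.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-external-table-demo")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_ext (
        order_id STRING,
        amount   DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    STORED AS PARQUET
    LOCATION 's3://example-bucket/warehouse/sales_ext/'
""")

# Register partitions already present under the table location, then query a
# single partition so the scan is pruned.
spark.sql("MSCK REPAIR TABLE sales_ext")
spark.sql("SELECT order_id, amount FROM sales_ext WHERE order_date = '2024-01-01'").show()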
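
The event-driven Lambda bullet can be illustrated with a small handler that reacts to S3 "object created" events, applies a trivial transformation, and writes the output back under a different prefix. The bucket layout and the transformation itself are placeholders.

# Illustrative sketch of an event-driven AWS Lambda handler for S3 events.
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")


def lambda_handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Read the newly arrived object and apply a lightweight transformation.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        processed = body.upper()  # placeholder transformation

        # Write the processed output to a separate prefix (hypothetical layout).
        s3.put_object(Bucket=bucket, Key=f"processed/{key}", Body=processed)

    return {"statusCode": 200, "body": json.dumps("ok")}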

ADP, Atlanta, GA Duration: Sep 2021 – July 2022

Sr. Data Engineer

Responsibilities:

•Developed Streaming Applications using PySpark and Kafka to read messages from Kafka topics hosted on AWS, transforming and writing JSON data to AWS S3 buckets, ensuring efficient data storage and retrieval for downstream processing (an illustrative streaming sketch appears at the end of this role).

•Optimized ETL Pipelines using Spark and PySpark to process large datasets, transforming and loading data into Amazon Redshift, utilizing AWS Glue for seamless integration and automation of data flows.

•Designed and Implemented a Notification System using AWS SNS to automatically send alerts to subscribers via email and SMS when specific events occur in the data pipeline, improving system monitoring and response times (a boto3 SNS sketch appears at the end of this role).

•Optimized Scala-based Spark jobs for cost-effective processing on AWS EMR by tuning configurations and resource allocation.

•Implemented real-time data processing applications using Python and AWS Kinesis for streaming analytics.

•Built machine learning models with Python and deployed them on AWS SageMaker for scalable model training and inference.

•Developed custom Python scripts for data extraction and transformation, interfacing with AWS DynamoDB for NoSQL operations.

•Utilized Python with AWS CloudFormation for infrastructure as code, automating the provisioning of cloud resources.

•Integrated Scala with AWS Glue to automate ETL processes and manage data cataloging.

•Developed Scala-based machine learning models using Spark MLlib and deployed them on AWS SageMaker.

•Used Scala to implement distributed data processing and analytics on AWS Redshift Spectrum, enabling queries on data stored in S3.

•Integrated AWS SNS with Lambda Functions to trigger automated workflows in response to incoming notifications, ensuring real-time processing and handling of critical events.

•Automated Infrastructure Provisioning on AWS using Terraform, reducing deployment time and ensuring consistent and repeatable infrastructure setups across multiple environments.

•Orchestrated Complex Workflows using AWS Step Functions to automate and manage the execution of serverless applications, reducing manual intervention and improving operational efficiency.

•Developed Asynchronous Processing Pipelines using AWS SQS to decouple and manage message queues between microservices, improving application scalability and fault tolerance.

•Built Real-Time ETL Processes using AWS Glue, Kinesis, and PySpark to continuously stream and process data, ensuring near real-time availability of data in Amazon Redshift, enhancing decision-making processes.

•Orchestrated Complex Workflows using AWS Step Functions, integrating various AWS services like Lambda, SQS, and SNS, to automate and streamline ETL processes and data pipeline operations.

•Automated Repetitive Tasks by developing tools using Python, Shell scripting, and XML, including the management and orchestration of AWS services, to increase operational efficiency.

•Accessed and Processed Hive Tables into Spark using PySpark scripts and Spark SQL, achieving faster data processing and enhanced performance in data analytics tasks.

•Developed and Automated ETL Processes to load data into Amazon Redshift from various sources such as S3, RDS, and DynamoDB using optimized SQL commands like COPY, ensuring timely and efficient data ingestion (a Redshift COPY sketch appears at the end of this role).

•Ingested Data from Various Sources into HDFS Data Lake using PySpark for both streaming and batch processing, ensuring scalable and efficient data storage and retrieval.

•Performed Performance Tuning of HQL Queries by analyzing query plans, optimizing indexing, and applying partition pruning, resulting in improved execution times and resource utilization.

•Implemented Spark Applications in Scala for real-time analysis and fast querying, utilizing Spark APIs and functions such as reduceByKey to process large-scale datasets efficiently.

•Developed Back-End Web Services using Python and the Django REST framework, integrating with AWS services like Lambda and API Gateway for serverless application deployment.

•Enhanced Existing Hadoop Models by utilizing Spark Context, Spark-SQL, DataFrames, and Pair RDDs, optimizing them for better performance and scalability.

•Imported Data Using Sqoop from various relational databases like Oracle, Teradata, and MySQL into Hadoop, enabling large-scale data processing and analytics.
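
The Kafka-to-S3 streaming bullet at the top of this role can be sketched with PySpark Structured Streaming: read from a Kafka topic, cast the message value to a string, and write JSON files to S3 with checkpointing. Broker address, topic, and paths are hypothetical.

# Illustrative sketch: stream Kafka messages to S3 as JSON with PySpark.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-s3-stream").getOrCreate()

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")   # placeholder broker
    .option("subscribe", "payroll-events")                 # placeholder topic
    .option("startingOffsets", "latest")
    .load()
)

# Kafka values arrive as bytes; cast to string before writing out as JSON.
events = raw.select(col("value").cast("string").alias("payload"))

query = (
    events.writeStream
    .format("json")
    .option("path", "s3a://example-bucket/streams/payroll-events/")
    .option("checkpointLocation", "s3a://example-bucket/checkpoints/payroll-events/")
    .start()
)
query.awaitTermination()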
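
The SNS notification bullet can be illustrated with a boto3 publish call; email and SMS delivery then happens through whatever subscriptions are attached to the topic. The topic ARN and message fields are placeholders.

# Illustrative sketch: publish a pipeline alert to an SNS topic with boto3.
import json

import boto3

sns = boto3.client("sns", region_name="us-east-1")

TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:pipeline-alerts"  # placeholder ARN


def notify_failure(job_name: str, error: str) -> None:
    # Subscribers (email/SMS) attached to the topic receive this alert.
    sns.publish(
        TopicArn=TOPIC_ARN,
        Subject=f"Pipeline failure: {job_name}",
        Message=json.dumps({"job": job_name, "error": error}),
    )


notify_failure("nightly_redshift_load", "COPY step exceeded retry limit")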
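
The Redshift load bullet mentions the COPY command; one hedged way to issue it from Python is the Redshift Data API, as sketched below. Cluster, database, table, bucket, and IAM role values are all hypothetical.

# Illustrative sketch: run a Redshift COPY from S3 through the Redshift Data API.
import boto3

redshift_data = boto3.client("redshift-data", region_name="us-east-1")

copy_sql = """
    COPY analytics.orders
    FROM 's3://example-bucket/exports/orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS PARQUET;
"""

resp = redshift_data.execute_statement(
    ClusterIdentifier="example-cluster",
    Database="analytics",
    DbUser="etl_user",
    Sql=copy_sql,
)
print("Submitted COPY, statement id:", resp["Id"])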

AIG, New Jersey, NJ Duration: Feb 2021 – Aug 2021

Sr. Data Engineer

•Designed and Developed ETL Pipelines using AWS Glue and PySpark to process and integrate data from multiple sources into a centralized database on Amazon Redshift, ensuring efficient and scalable data transformation and loading (an illustrative Glue job skeleton appears at the end of this role).

•Authored Complex SQL Queries for data extraction, transformation, and analysis, leveraging AWS Redshift and RDS to support business decision-making and reporting.

•Implemented data quality checks and validation in Scala applications to ensure accuracy and integrity of data in AWS environments.

•Developed custom Scala connectors for integrating with various AWS services, including S3, DynamoDB, and RDS.

•Created and managed Python-based data pipelines on AWS Data Pipeline for orchestrating data movement and transformation.

•Developed Python applications for integrating with AWS S3 for data storage and retrieval operations.

•Automated data archiving and backup tasks using Python and AWS Glacier for long-term data retention.

•Managed and monitored Scala-based Spark jobs on AWS EMR, leveraging AWS CloudWatch for performance tracking.

•Created and maintained data processing frameworks in Scala for large-scale analytics on AWS infrastructure.

•Implemented Spark SQL on DataFrames to access Hive tables within Spark for faster data processing, optimizing data pipelines that leverage AWS Glue Catalog and S3 for efficient data management.

•Created and Managed Hive External Tables on AWS EMR, implementing static partitioning, dynamic partitioning, and bucketing to optimize data retrieval and storage.

•Enhanced Hadoop Algorithms using Spark Context, Spark-SQL, DataFrame, Pair RDDs, and Spark YARN, deploying the optimized applications on AWS EMR for improved performance and cost-efficiency.

•Deployed Applications on AWS EC2 Instances and configured storage using S3 buckets, ensuring secure and scalable infrastructure for big data processing.

•Utilized AWS S3 and Local Hard Disk as HDFS for Hadoop storage, enabling seamless integration with AWS cloud services and enhancing data processing capabilities.

•Fine-tuned Spark Applications and Jobs using PySpark and Scala to improve efficiency and overall processing time, leveraging AWS Glue and Step Functions for orchestration and automation.

•Authored Complex HQL Queries for data extraction, transformation, and analysis, integrating results with AWS Redshift for business intelligence and reporting purposes.

•Designed and Implemented DataMarts using AWS Glue and Redshift to aggregate and organize data from various sources, supporting efficient data storage and retrieval for analytical purposes.

•Developed Spark-Streaming Applications to consume data from Kafka topics and process streams, integrating with AWS SQS for message queuing and inserting the processed data into AWS Redshift.

•Utilized Broadcast Variables in Spark and applied effective joins, transformations, and other capabilities for efficient data processing, ensuring scalability and reliability within AWS environments (a broadcast-join sketch appears at the end of this role).

•Converted Hive or SQL Queries into Spark Transformations using Python and Scala, enhancing data processing workflows on AWS Glue and EMR for better performance and scalability.

•Built Batch, Real-Time, and Streaming Analytics Pipelines using AWS services like Kinesis, Glue, and Lambda, processing data from event streams, NoSQL databases, and APIs.

•Exported Analyzed Data Using Sqoop into relational databases like AWS RDS, generating reports for the BI team and ensuring seamless data integration and accessibility.
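
The Glue/PySpark ETL bullet that opens this role can be illustrated with a standard Glue job skeleton: read a catalog table into a DynamicFrame, transform it with DataFrame operations, and write Parquet to S3. The database, table, and output path are hypothetical.

# Illustrative skeleton of an AWS Glue PySpark job.
import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Source table registered in the Glue Data Catalog (placeholder names).
source = glue_context.create_dynamic_frame.from_catalog(
    database="claims_db", table_name="raw_claims"
)

# Transform with plain DataFrame operations, then convert back to a DynamicFrame.
cleaned = source.toDF().dropDuplicates(["claim_id"])
cleaned_dyf = DynamicFrame.fromDF(cleaned, glue_context, "cleaned_claims")

# Write the curated result to S3 as Parquet (placeholder path).
glue_context.write_dynamic_frame.from_options(
    frame=cleaned_dyf,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/claims/"},
    format="parquet",
)
job.commit()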
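
The broadcast-variable bullet can be shown with the most common pattern: broadcasting a small dimension table so a join against a large fact DataFrame avoids a full shuffle. The data here is made up purely for illustration.

# Illustrative sketch: broadcast join in PySpark.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-demo").getOrCreate()

claims = spark.createDataFrame(
    [(1, "NJ", 1200.0), (2, "TX", 300.0), (3, "NJ", 75.5)],
    ["claim_id", "state_code", "amount"],
)
states = spark.createDataFrame(
    [("NJ", "New Jersey"), ("TX", "Texas")],
    ["state_code", "state_name"],
)

# broadcast() ships the small lookup table to every executor, so the join
# happens locally instead of shuffling the large claims DataFrame.
enriched = claims.join(broadcast(states), on="state_code", how="left")
enriched.show()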

CISIN, India Duration: Jun 2016 – Sep 2020

Hadoop Developer

•Hands-on Experience in Loading Data from UNIX File System to HDFS, utilizing AWS Glue for ETL operations and implementing Terraform scripts for infrastructure automation. Also performed parallel data transfer on AWS S3 using AWS DataSync and DistCp.

•Built and deployed Python-based web applications on AWS Elastic Beanstalk for scalable and managed application hosting.

•Created Python-based monitoring solutions using AWS CloudWatch to track application performance and log metrics (a custom-metric sketch appears at the end of this role).

•Developed serverless Python functions with AWS Lambda to process data and integrate with other AWS services.

•Developed Scala-based algorithms for data encryption and security in AWS environments, ensuring compliance with best practices.

•Created data archiving solutions in Scala to manage and optimize storage on AWS S3 and Glacier.

•Automated the deployment and scaling of Scala-based applications using AWS Elastic Beanstalk for seamless scaling and management.

•Utilized Python for managing AWS IAM roles and permissions programmatically, ensuring secure access to cloud resources (an IAM sketch appears at the end of this role).

•Designed and Implemented Data Marts on Amazon Redshift and AWS Glue to support analytical and reporting needs, ensuring efficient data storage and retrieval with optimized query performance.

•Built ETL Pipelines using AWS Glue and PySpark to process large volumes of data, ensuring efficient data integration, transformation, and loading into AWS Redshift and S3.

•Created and Optimized SQL Queries for data extraction and transformation on AWS RDS and Redshift, ensuring high performance, accuracy, and integration with other AWS services.

•Used Apache Sqoop for Data Transfer, importing and exporting data between relational databases like AWS RDS and the Hadoop Distributed File System (HDFS) for further processing and analysis.

•Implemented Flume Script to Load Streamed Data into HDFS, leveraging AWS Kinesis and SQS for real-time data ingestion and processing within a cloud environment.

•Implemented Partitioning, Dynamic Partitions, and Buckets in Hive on AWS EMR, increasing performance benefits and logically organizing data for faster retrieval and processing.

•Enabled Dynamic Partitioning in Hive Tables using HQL on AWS EMR, allowing for efficient data loading and retrieval based on partition keys stored in S3.

•Designed and Created Hive Table Schemas using HQL, including external tables for data stored in HDFS and AWS S3, ensuring proper data organization and access patterns within the AWS ecosystem.

•Involved in Creating Mappings and Loading Data into Target Tables on AWS Redshift, employing logic and transformation to validate and process source data for business intelligence and reporting needs.

•Automated Processes in the Cloudera Environment and built OOZIE workflows, integrating them with AWS Step Functions and Lambda for orchestration and automation of ETL tasks.

•Developed Hive Internal and External Tables on AWS EMR and queried them using HiveQL, optimizing data processing for large datasets stored in HDFS and S3.

•Created Hive Tables Based on Business Requirements, using AWS Glue and S3 to store and analyze large datasets efficiently, enabling data-driven decision-making.

•Extensive Working Knowledge of Partitioned Tables, UDFs, Performance Tuning, and Compression-related Properties in Hive on AWS EMR, ensuring optimal performance for big data processing.

•Reviewed and Managed Hadoop Logs using AWS CloudWatch for monitoring and troubleshooting, ensuring the stability and performance of big data applications on AWS.

•Participated in Daily SCRUM Meetings and provided daily status reports, collaborating with cross-functional teams to ensure timely delivery of AWS-based data solutions.

Environment: HDFS, Hive, Teradata, MapReduce, XML, JSON, Oracle, MySQL, Java, PL/SQL Developer, Stored procedures, Triggers, QlikView
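
The CloudWatch monitoring bullet in this role can be sketched with a custom metric published from Python; the namespace, metric name, and job name below are hypothetical.

# Illustrative sketch: publish a custom application metric to CloudWatch.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")


def record_rows_processed(job_name: str, row_count: int) -> None:
    # The metric can then be graphed or alarmed on in the CloudWatch console.
    cloudwatch.put_metric_data(
        Namespace="DataPipelines",
        MetricData=[{
            "MetricName": "RowsProcessed",
            "Dimensions": [{"Name": "JobName", "Value": job_name}],
            "Value": float(row_count),
            "Unit": "Count",
        }],
    )


record_rows_processed("daily_hdfs_ingest", 125000)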
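
The programmatic IAM bullet can be illustrated with boto3 calls that create a Lambda execution role and attach a managed policy. The role name and policy choice are hypothetical placeholders.

# Illustrative sketch: create an IAM role and attach a managed policy with boto3.
import json

import boto3

iam = boto3.client("iam")

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

role = iam.create_role(
    RoleName="etl-lambda-role",                      # placeholder role name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description="Execution role for ETL Lambda functions",
)

# Grant read-only S3 access through an AWS managed policy.
iam.attach_role_policy(
    RoleName="etl-lambda-role",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)
print("Created role:", role["Role"]["Arn"])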

Educational Background

Bachelor's in Computer Science – JNTU, India.


