
Data Engineer Analysis

Location:
Pittsburgh, PA
Posted:
September 30, 2024


Resume:

SHANTHI T

***********@*****.***,

+1-816-***-****

https://www.linkedin.com/in/tshanthi-91242420a/

DATA ENGINEER CREATING EFFICIENT AND SCALABLE SOLUTIONS THROUGH OPTIMIZATION AND COLLABORATION

I am a Data Engineer with 8 years of experience in data analysis, ETL development, and project management. I have strong programming skills in Python, Scala, Java, and C++, and extensive experience with SQL and NoSQL databases such as Oracle, Snowflake, MySQL, Hive, Impala, Postgres, DynamoDB, and MongoDB. I have developed Java-, Scala-, and Python-based Spark applications, deployed applications using CI/CD tools, and worked with AWS services such as EMR, Glue, SNS, API Gateway, S3, CloudWatch, EC2, Lambda, Step Functions, Athena, DynamoDB, Service Catalog, and Redshift. I have expertise in data modeling, data management, database performance, and data analysis using SQLAlchemy, Pandas, NumPy, and Snowflake. I am a team player with excellent communication and presentation skills, strong interpersonal skills, and a solid work ethic.

Technical Summary

Big Data Technologies: Spark, PySpark, Hive, Impala, Sqoop, Flume, Oozie, HDFS, MongoDB, Databricks

AWS: EMR, Glue, S3, EC2, Lambda, Athena, Step Functions, API Gateway, SNS, Glue Catalog, Redshift, DynamoDB, CloudWatch

GCP: Dataproc, BigQuery, Cloud Storage, VM instances and images

Languages: Python, Scala, Java, C++, SQL, MySQL, NoSQL, Shell scripting

Workflow: Airflow, Step Functions, Dataflow, Control-M

Web Technologies: HTML, CSS, JavaScript, JSON, XML/SOAP, REST, WSDL

Operating Systems: Linux (Ubuntu, Fedora & CentOS), Unix, Windows

IDEs: Databricks, Jupyter, IntelliJ, PyCharm, Eclipse, Visual Studio Code

Version Control: Git, Subversion, CVS, MKS

Databases: Snowflake, SQL, NoSQL

Tracking Tools: Rally, JIRA

CI/CD Tools: Jenkins, Chef, Confluence, Bitbucket

EXPERIENCE

Sr Data Engineer

Garmin, Kansas, KS

Oct 2022 – Present

● Engineered a scalable Data Lake on AWS using S3, EMR, Redshift, and Athena to centralize data storage and processing.

● Designed and developed ETL processes using Apache NiFi for seamless extraction, transformation, and loading of data from diverse sources.

● Automated data ingestion pipelines, migrating terabytes of data from legacy data warehouses to the cloud, enhancing data accessibility and scalability.

● Created and maintained real-time streaming pipelines with Kafka, Spark Streaming, and Redshift, including Kafka producers with the Kafka Java Producer API (see the sketch at the end of this section).

● Optimized Spark and Scala applications for both daily batch jobs and real-time data processing, improving performance and efficiency.

● Upgraded Spark and Scala applications on AWS EMR, transitioning from version 5.31 to 6.11 for better functionality and stability.

● Developed and executed complex SQL queries for data extraction and processing, ensuring robust data operations and insights.

● Implemented workflow automation solutions to accelerate data delivery and streamline data operations, enhancing productivity.

● Performed data quality profiling to measure and ensure data accuracy, integrity, and completeness.

● Collaborated with business and data science teams to translate requirements into efficient data pipelines, aligning technical solutions with business needs.

● Developed Hive scripts and Spark applications to generate analytical datasets for digital marketing teams, supporting data-driven decision-making.

● Managed and fine-tuned EMR clusters for optimized Spark application performance and automated infrastructure setups.

● Implemented and managed Splunk reporting systems to provide enhanced data visibility and real-time monitoring for improved operational insights.

Environment: Snowflake, Python, Databricks, AWS Lambda, EMR, CloudWatch, Airflow, MySQL, Shell-scripting, Linux, Jenkins, Bitbucket.
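
Illustrative sketch for the streaming work above: a minimal PySpark Structured Streaming job that reads JSON events from Kafka and lands them as Parquet in the S3 data lake. The broker address, topic name, schema, and S3 paths are placeholders rather than Garmin systems, and the production producers themselves were written with the Kafka Java Producer API as noted above; this only shows the consuming pipeline pattern.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Requires the spark-sql-kafka connector package on the cluster (e.g. via --packages).
spark = SparkSession.builder.appName("device-events-stream").getOrCreate()

# Hypothetical event schema; real payloads differ.
schema = StructType([
    StructField("device_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker-1:9092")  # placeholder brokers
       .option("subscribe", "device-events")                # placeholder topic
       .option("startingOffsets", "latest")
       .load())

events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("e"))
          .select("e.*"))

# Land micro-batches as Parquet in the S3 data lake; downstream jobs copy into Redshift.
query = (events.writeStream
         .format("parquet")
         .option("path", "s3://example-bucket/landing/device_events/")  # placeholder path
         .option("checkpointLocation", "s3://example-bucket/checkpoints/device_events/")
         .trigger(processingTime="1 minute")
         .start())

query.awaitTermination()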

Sr Data Engineer

Fidelity Investments, Charlotte

Nov 2021 – Oct 2022

● Developed data pipelines in Databricks Delta Lake using PySpark for efficient data transformation and loading (see the sketch at the end of this section).

● Created PySpark applications to transform data and load it into a feature store using Python object-oriented programming and Shell scripting.

● Designed and automated Databricks notebooks with PySpark, Python, and SQL, and scheduled them using automated jobs.

● Automated backup jobs using AWS CloudWatch and Lambda services, ensuring reliable and timely data backup and recovery.

● Integrated feature stores with data APIs by category, enhancing data management and accessibility.

● Developed Python APIs to capture array structures for debugging, aiding in effective troubleshooting and issue resolution.

● Implemented Snowpipe for continuous data loading into Snowflake, streamlining and improving data ingestion processes.

● Developed ETL processes using AWS Glue and Athena, and automated data validation with Great Expectations to ensure data integrity.

● Deployed applications across Dev, Stage, and Prod environments, managing transitions and ensuring consistent performance.

● Estimated, monitored, and troubleshot Spark Databricks clusters, optimizing performance and resource usage for efficient data processing.

● Created scripts for Delta table DDL creation and table analysis from PySpark jobs, facilitating efficient data management.

● Designed Tableau dashboards with category-wise views for weekly data analysis, delivering actionable insights and visualizations.

● Utilized various Spark modules such as Spark Core, Spark SQL, Spark Streaming, and DataFrames for comprehensive data processing and analytics.

Environment: PySpark, Python, NoSQL, Glue, Databricks, AWS Lambda, EMR, CloudWatch, Airflow, Shell-scripting, Linux, Jenkins, Bitbucket.
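
Illustrative sketch for the Delta Lake pipelines above: a minimal PySpark job, of the kind run from a Databricks notebook, that aggregates a raw table into account-level features and saves the result as a Delta table. The table names and feature definitions are hypothetical placeholders, not Fidelity schemas.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Databricks the session is provided by the runtime; getOrCreate() reuses it.
spark = SparkSession.builder.getOrCreate()

# Hypothetical source table; real schemas differ.
txns = spark.read.table("raw.transactions")

# Simple account-level features derived from transaction history.
features = (txns
            .groupBy("account_id")
            .agg(F.count(F.lit(1)).alias("txn_count"),
                 F.sum("amount").alias("txn_amount_total"),
                 F.max("txn_ts").alias("last_txn_ts")))

# Save as a managed Delta table for downstream feature-store consumers.
(features.write
 .format("delta")
 .mode("overwrite")
 .saveAsTable("feature_store.account_txn_features"))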

Hadoop Developer

Bank of America, Charlotte

Apr 2019 – Nov 2021

● Created and optimized Spark applications using Scala to enrich clickstream data and integrate it with enterprise user data (see the sketch at the end of this section).

● Designed and developed Spark applications for comprehensive data extraction, transformation, and aggregation from multiple sources.

● Performed performance tuning of Spark applications, including setting optimal batch intervals, parallelism levels, and memory configurations.

● Developed applications for integration with REST APIs, Merchant UI, and custom Python libraries, enhancing data interoperability.

● Built a robust framework for data migration and developed Spark jobs to streamline data movement and processing.

● Translated business and data requirements into logical data models supporting Enterprise Data Models, OLAP, OLTP, and analytical systems.

● Visualized store layouts with adjacency, left/right, opposite, and perpendicular mapping for improved spatial analysis.

● Designed, deployed, and managed infrastructure on GCP cloud, ensuring effective cloud resource management and configuration.

● Automated ETL processes to manage and post recommendation data store-by-store, improving data processing efficiency.

● Developed and maintained a Data Lake with regulatory data using HDFS, Apache Impala, Apache Hive, and Cloudera distribution for comprehensive data storage and management.

● Built horizontally and vertically scalable distributed data solutions with Python multiprocessing, REST APIs, and GCP, ensuring scalable data processing capabilities.

● Developed multi-threaded Java-based input adaptors for daily ingestion of clickstream data from FTP servers and Google Storage.

● Utilized Spark Streaming for real-time data processing and applied Spark’s in-memory computing for text analysis, enhancing performance and processing speed.

Environment: Scala, PySpark, REST, Pandas, JQ, GCP, Dataproc, NoSQL, VM, Elasticsearch, TrainMe, Shell-scripting, Linux.
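
Illustrative sketch for the clickstream enrichment above: the production applications were written in Scala, so this PySpark version only shows the equivalent pattern of joining clickstream events against an enterprise user dimension with a broadcast join on Dataproc. The bucket paths, column names, and join key are placeholders.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("clickstream-enrichment").getOrCreate()

# Hypothetical GCS locations; real buckets, formats, and schemas differ.
clicks = spark.read.json("gs://example-bucket/clickstream/dt=2021-01-01/")
users = spark.read.parquet("gs://example-bucket/dim/users/")

# Broadcast the smaller user dimension so the large clickstream side avoids a shuffle.
enriched = (clicks
            .join(F.broadcast(users), on="user_id", how="left")
            .withColumn("event_date", F.to_date("event_ts")))

# Write the enriched events back to the lake, partitioned by day.
(enriched.write
 .mode("overwrite")
 .partitionBy("event_date")
 .parquet("gs://example-bucket/enriched/clickstream/"))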

Hadoop Developer

JP Morgan, Delaware

Jan 2018 – Apr 2019

● Developed PySpark applications to extract data from multiple third-party applications (TPAs) using Python and Shell scripting, enhancing data integration capabilities.

● Automated daily backup jobs using AWS CloudWatch and Lambda services, ensuring reliable data backup and recovery.

● Created and optimized Spark jobs to classify and categorize data, storing it efficiently in Hadoop HDFS.

● Designed and implemented ETL pipelines to extract data from MySQL, Oracle, and flat files, and load it into Hadoop HDFS using Sqoop.

● Developed Hive queries and Spark SQL to transform and load data into Hive tables, supporting complex data analysis (see the sketch at the end of this section).

● Wrote Pig scripts to transform data and load it into Hadoop HDFS, facilitating diverse data processing needs.

● Conducted comprehensive data profiling and validation to ensure data quality and integrity across the data pipeline.

● Designed and developed data models for Hive and Impala tables, optimizing data organization and access.

● Processed large volumes of data using Hadoop MapReduce and Spark, enhancing data handling capabilities.

● Monitored and managed Hadoop clusters using Cloudera Manager, and worked with the team to troubleshoot and optimize cluster performance.

● Implemented real-time data streaming with Spark and Kafka, managing data flow from web server logs for timely insights.

● Developed and maintained data pipelines for incremental data import from DB2 and Teradata into Hive tables using Sqoop, facilitating efficient data integration and reporting.

Environment: Hadoop, Hive, Sqoop, Pig, Spark, Cloudera, DynamoDB, Oracle, MySQL, Shell-scripting, Linux.
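
Illustrative sketch for the Hive/Spark SQL loading above: a minimal PySpark job that reads Sqoop-landed staging files from HDFS, registers them as a temporary view, and loads a partitioned Hive table with Spark SQL. The paths, table names, and columns are placeholders, and the target Hive table is assumed to already exist.

from pyspark.sql import SparkSession

# Hive support lets Spark SQL read and write managed Hive tables.
spark = (SparkSession.builder
         .appName("staging-to-hive")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical staging files landed in HDFS by Sqoop; real paths and schemas differ.
staged = spark.read.option("header", "true").csv("hdfs:///staging/orders/")
staged.createOrReplaceTempView("orders_stage")

# Transform with Spark SQL and load into a partitioned Hive table (assumed to exist).
spark.sql("""
    INSERT OVERWRITE TABLE analytics.orders PARTITION (load_dt = '2019-01-01')
    SELECT order_id,
           customer_id,
           CAST(order_amount AS DOUBLE) AS order_amount
    FROM orders_stage
""")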

Java Developer

LTIMindtree, Hyderabad, India

Jan 2014 – Jun 2016

● Developed robust applications using Java and Spring Boot, incorporating JDBC for reliable database interactions and operations.

● Created a RESTful API with Spring Boot for an e-commerce platform, streamlining product catalog management and customer order processing.

● Configured and maintained the Spring Application Framework, ensuring high application stability, performance, and scalability.

● Implemented design patterns such as Singleton, Business Delegate, Value Object, and Spring DAO, enhancing application architecture and design.

● Developed DAO classes using Spring JDBC, facilitating efficient database operations and management of account information.

● Mapped business objects to the database using Hibernate, optimizing data handling, persistence, and object-relational mapping.

● Wrote and maintained Spring Configuration XML files for defining application context and bean configurations, ensuring proper application setup.

● Utilized Tomcat web server for the development and deployment of web applications, ensuring effective server management and application operation.

● Created and executed unit tests to validate application functionality, ensuring code reliability and reducing bugs.

● Managed build and deployment processes using Maven, streamlining development workflows and ensuring consistent builds.

● Employed Oracle for database management, writing complex SQL scripts and PL/SQL code for advanced queries and stored procedures.

● Used Log4J for logging and debugging, effectively monitoring application behavior and resolving issues during both development and production.

Environment: Java, Spring Boot, XML, Singleton, Hibernate, Spring, Log4J, Oracle, SQL, JavaScript, Eclipse, Maven, Tomcat.
