
Big Data Engineer

Location:
Chicago, IL
Posted:
October 25, 2023

Name: Praharshini | Title: Data Engineer | Phone: +1-952-***-****

ad0mc8@r.postjobfree.com

SUMMARY

Diligent Data Engineer with 5+ years of experience working with large datasets of structured and unstructured data.

Performed risk analysis, data visualization, and reporting on various projects. Technical expertise in data pipelines, data integration, data profiling, and data cleaning. Worked with Tableau, Power BI, and Shiny in R to create dashboards and visualizations.

Experience in Data warehousing, Data engineering, Feature engineering, big data, ETL/ELT, and Business Intelligence.

As a big data architect and engineer, specializes in AWS frameworks, Cloudera, the Hadoop ecosystem, Spark/PySpark/Scala, Databricks, Hive, Redshift, Snowflake, relational databases, tools such as Tableau, Airflow, DBT, and Presto/Athena, and Data DevOps frameworks/pipelines, with Python scripting skills for building data-intensive applications and tackling challenging architectural and scalability problems.

Experience and training in Amazon Web Services: S3, EC2, IAM, Route 53, databases (RDS, DynamoDB, Redshift), VPC, Lambda, EBS, EFS, Glue, Athena, SQS, SNS, API Gateway, and Kinesis.

Business intelligence application developer with expertise in implementing and developing data warehousing/business intelligence solutions applying advanced techniques using Oracle OBIEE; in-depth and comprehensive experience in design, development, testing, security, and support for data warehousing and client/server projects.

Experience developing Spark applications using Spark SQL, PySpark, and Delta Lake in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.

Good understanding of Spark and MPP architectures, including Spark Core, Spark SQL, DataFrames, Spark Streaming, driver and worker nodes, stages, executors, and tasks.

Productionized models in cloud environments, including automated processes, CI/CD pipelines, monitoring/alerting, and issue troubleshooting. Presented models and results to technical and non-technical audiences.

Highly skilled in various automation tools, continuous integration workflows, managing binary repositories and containerizing application deployments and test environments.

Experience in Data Analysis, Data Profiling, Data Integration, Migration, Data governance and Metadata Management, Master Data Management and Configuration Management.

Experience in creating Docker containers leveraging existing Linux containers and AMIs, in addition to creating Docker containers from scratch.

Experience in extracting, transforming, and loading (ETL) data from various sources into data warehouses, as well as data processing such as collecting, aggregating, and moving data from various sources using Apache Flume, Kafka, Power BI, and Microsoft SSIS.

Used Kafka for activity tracking and log aggregation.

Experience in efficient ETL using Spark in-memory processing, Spark SQL, and Spark Streaming with the Kafka distributed messaging system.

Hands-on experience with Amazon EC2, S3, RDS, VPC, IAM, Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SES, SQS, Lambda, EMR, DynamoDB, and other services in the AWS family.

Experience in handling the Python and Spark contexts when writing PySpark programs for ETL.

TECHNICAL SKILLS

Programming Languages: Java 1.7/1.8, SQL, Python, Scala, UNIX Shell Script, PowerShell, YAML

Cloud Platform: AWS

Application/Web Servers: WebLogic, Apache Tomcat 5.x/6.x/7.x/8.x

Hadoop Distributions: Hortonworks, Cloudera Hadoop

Hadoop/Big Data Technologies: HDFS, Hive, Sqoop, YARN, Spark, Spark SQL

Hadoop/Spark Ecosystem: Hadoop, MapReduce, Pig, Hive/Impala, YARN, Kafka, Flume, Oozie, ZooKeeper, Spark, Airflow

AWS Services: EC2, S3, RDS, IAM, Auto Scaling, CloudWatch, SNS, Athena, Glue, Kinesis, Lambda, EMR, Redshift

ETL Tools: Informatica, Data Studio

Reporting Tools: Power BI, Tableau, SSRS

Virtualization: Citrix, VDI, VMware

PROFESSIONAL EXPERIENCE

Truist (Charlotte, NC) (Jan 2022 - Jul 2023)

Sr. Data Engineer

Responsibilities:

Extracted data and loaded it into HDFS using Sqoop commands, scheduling MapReduce jobs on Hadoop.

Worked extensively on importing metadata into Hive using Python and migrated existing tables and applications to the AWS cloud (S3).

Developed Python scripts to manage AWS resources through API calls using the boto3 SDK and worked with the AWS CLI.
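
For illustration, a minimal sketch of the kind of boto3-based resource script described above; the tag filter and resource names are hypothetical placeholders, not the actual production scripts:

    import boto3

    # List S3 buckets, then stop EC2 instances carrying an illustrative Environment=dev tag
    s3 = boto3.client("s3")
    for bucket in s3.list_buckets()["Buckets"]:
        print(bucket["Name"])

    ec2 = boto3.client("ec2")
    response = ec2.describe_instances(
        Filters=[{"Name": "tag:Environment", "Values": ["dev"]}]
    )
    instance_ids = [
        inst["InstanceId"]
        for reservation in response["Reservations"]
        for inst in reservation["Instances"]
    ]
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)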

Set up CI/CD pipelines using Maven, GitHub, and AWS.

Experienced in writing real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system.

Designed and implemented data models and ETL workflows using Airflow, Snowflake, and DBT, resulting in improved data accessibility and accuracy for business analysts and data scientists.
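
A minimal sketch of such a workflow, assuming Airflow 2.x with BashOperator tasks; the DAG id, load script, and dbt project path are hypothetical:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # Daily pipeline: load raw data into Snowflake, then build DBT models on top of it
    with DAG(
        dag_id="daily_snowflake_dbt",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        load_raw = BashOperator(
            task_id="load_raw_to_snowflake",
            bash_command="python /opt/etl/load_to_snowflake.py",
        )
        dbt_run = BashOperator(
            task_id="dbt_run",
            bash_command="dbt run --project-dir /opt/dbt/analytics",
        )
        load_raw >> dbt_run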

Performed data cleaning and feature selection using the MLlib package in PySpark; applied deep learning with CNNs, RNNs, ANNs, and reinforcement learning.
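
As an illustration of MLlib-based feature selection, a small sketch using VectorAssembler and ChiSqSelector on a toy DataFrame; the column names and values are made up:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler, ChiSqSelector

    spark = SparkSession.builder.appName("feature_selection_demo").getOrCreate()

    # Toy dataset; a real pipeline would read cleaned feature tables instead
    df = spark.createDataFrame(
        [(1.0, 0.0, 3.0, 1), (0.0, 2.0, 1.0, 0), (4.0, 1.0, 0.0, 1)],
        ["f1", "f2", "f3", "label"],
    )
    assembled = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features").transform(df)
    selector = ChiSqSelector(
        numTopFeatures=2, featuresCol="features", outputCol="selected", labelCol="label"
    )
    selector.fit(assembled).transform(assembled).select("selected", "label").show()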

Processed structured and semi-structured datasets using PySpark by loading them into RDDs.
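
A hedged sketch of processing semi-structured JSON alongside structured CSV in PySpark; the bullet mentions RDDs, while this example uses the DataFrame API for brevity, and the S3 paths and columns are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("semi_structured_etl").getOrCreate()

    # Semi-structured JSON events and structured CSV reference data (paths are placeholders)
    orders = spark.read.json("s3a://example-bucket/raw/orders/")
    customers = spark.read.option("header", True).csv("s3a://example-bucket/raw/customers/")

    enriched = (
        orders.join(customers, "customer_id")
              .filter("order_total > 0")
    )
    enriched.write.mode("overwrite").parquet("s3a://example-bucket/curated/orders/")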

Implemented new tools such as Kubernetes with Docker to assist with auto-scaling and continuous integration (CI); uploaded Docker images to the registry so services are deployable through Kubernetes, and used the Kubernetes dashboard to monitor and manage the services.

Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as S3 and Amazon DynamoDB.

Implemented reporting data marts for the Sales and Marketing teams on AWS Redshift. Handled data schema design and development, ETL pipelines in Python and MySQL stored procedures, and automation using Jenkins.
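
For instance, loading a Parquet export into a Redshift data mart could look roughly like the following psycopg2 sketch; the cluster endpoint, schema, S3 path, and IAM role are placeholders:

    import psycopg2

    # Connection details and COPY source are illustrative placeholders
    conn = psycopg2.connect(
        host="redshift-cluster.example.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="analytics",
        user="etl_user",
        password="...",
    )
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute("""
            COPY sales_mart.daily_sales
            FROM 's3://example-bucket/exports/daily_sales/'
            IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
            FORMAT AS PARQUET
        """)
    conn.close()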

Extensive experience with query performance tuning in relational and distributed systems.

Designed ETL architecture for data transfer from OLTP to OLAP systems.

Imported attribution data from a proprietary database into the Hadoop ecosystem.

Worked extensively on moving data across cloud architectures, including Redshift, Hive, and S3 buckets.

Scripted in Unix and Python to simplify repetitive support tasks.

Provided data engineering support on Python, Spark, Hive, Airflow for modelling projects.

Strong experience and knowledge of real-time data analytics using Spark Streaming, Kafka, and Flume.

Configured Spark Streaming to consume ongoing data from Kafka and store it in HDFS.

Integrated product data feeds from Kafka into the Spark processing system and stored order details in a PostgreSQL database.
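
A minimal sketch of such a feed using Spark Structured Streaming (rather than the older DStream API) with a JDBC sink per micro-batch; the topic, schema, and connection details are placeholders, and the Kafka and PostgreSQL JDBC packages must be on the Spark classpath:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import DoubleType, StringType, StructType

    spark = SparkSession.builder.appName("orders_stream").getOrCreate()
    schema = StructType().add("order_id", StringType()).add("amount", DoubleType())

    # Read the Kafka topic and parse the JSON message body into columns
    orders = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "orders")
        .load()
        .select(from_json(col("value").cast("string"), schema).alias("o"))
        .select("o.*")
    )

    def write_batch(df, epoch_id):
        # Write each micro-batch to PostgreSQL over JDBC (placeholder connection details)
        (df.write.format("jdbc")
           .option("url", "jdbc:postgresql://db-host:5432/orders")
           .option("dbtable", "public.order_details")
           .option("user", "etl_user")
           .option("password", "...")
           .option("driver", "org.postgresql.Driver")
           .mode("append")
           .save())

    orders.writeStream.foreachBatch(write_batch).start().awaitTermination()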

Designed a social listening platform to scrape data from Twitter and Instagram; the application calls the Google sentiment analysis and image detection APIs to produce a brand index, competitor analysis, and product analysis.

Collaborated with the Marketing team to implement a search bidding algorithm.

Automated build and deployment using Jenkins to reduce human error and speed up production processes.

Managed GitHub repositories and permissions, including branching and tagging.

Drove the strategy for migrating from Perforce to GitHub, including branching, merging, and tagging.

Created tables, triggers, stored procedures, and SQL*Loader scripts in relational SQL databases.

Designed services to store and retrieve user data using a MongoDB database, communicating with remote servers through REST-enabled web services on the Jersey framework.

Performed Hive performance tuning using methods including, but not limited to, dynamic partitioning, bucketing, indexing, file compression, and cost-based optimization.
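
As an example of the partitioning and bucketing side of that tuning, a sketch issued through Spark SQL with Hive support; the table and column names are hypothetical:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("hive_tuning_demo")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Allow dynamic partitions so inserts can route rows by partition column
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    # Partitioned and bucketed table stored as ORC for faster joins and scans
    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales_bucketed (
            order_id STRING,
            customer_id STRING,
            amount DOUBLE
        )
        PARTITIONED BY (sale_date STRING)
        CLUSTERED BY (customer_id) INTO 32 BUCKETS
        STORED AS ORC
    """)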

Environment: HDFS, Hive, AWS, MapReduce, Kafka, Amazon DynamoDB, PostgreSQL, Spark SQL, Python, Scala, PySpark, SSIS, ELK.

HCL Technologies (Bangalore, India) (Jan 2018 - Nov 2021)

Data Engineer / Data Analyst

Responsibilities:

Implemented Apache Airflow for authoring, scheduling, and monitoring Data Pipelines

Designed several DAGs (Directed Acyclic Graph) for automating ETL pipelines.

Performed data extraction, transformation, loading, and integration across the data warehouse, operational data stores, and master data management systems.

Developed ETL pipelines in and out of the data warehouse using a combination of Python and Snowflake's SnowSQL, writing SQL queries against Snowflake.
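
A hedged sketch of that Python/SnowSQL combination using the Snowflake Python connector; the account, warehouse, stage, and table names are placeholders:

    import snowflake.connector

    # Connection parameters are illustrative placeholders
    conn = snowflake.connector.connect(
        account="xy12345.us-east-1",
        user="ETL_USER",
        password="...",
        warehouse="ETL_WH",
        database="ANALYTICS",
        schema="STAGING",
    )
    cur = conn.cursor()
    try:
        # Load staged CSV files into a table, then sanity-check the row count
        cur.execute("""
            COPY INTO staging.orders
            FROM @raw_stage/orders/
            FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 1)
        """)
        cur.execute("SELECT COUNT(*) FROM staging.orders")
        print(cur.fetchone()[0])
    finally:
        cur.close()
        conn.close()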

Strong understanding of AWS components such as EC2 and S3

Implemented AWS Lambda functions to drive real-time monitoring dashboards from system logs.
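
One plausible shape for such a function, assuming a CloudWatch Logs subscription as the trigger; the metric namespace and name are hypothetical:

    import base64
    import gzip
    import json

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    def handler(event, context):
        # CloudWatch Logs delivers a base64-encoded, gzip-compressed JSON payload
        payload = json.loads(gzip.decompress(base64.b64decode(event["awslogs"]["data"])))
        errors = [e for e in payload["logEvents"] if "ERROR" in e["message"]]

        # Publish a custom metric that a dashboard can chart in near real time
        cloudwatch.put_metric_data(
            Namespace="AppMonitoring",
            MetricData=[{"MetricName": "ErrorCount", "Value": len(errors), "Unit": "Count"}],
        )
        return {"error_count": len(errors)}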

Conducted statistical analysis on healthcare data using Python and various tools.

Developed simple and complex MapReduce programs in Java for data analysis on different data formats.

Used AWS for Tableau Server scaling and secured Tableau Server on AWS, protecting the Tableau environment using Amazon VPC, security groups, AWS IAM, and AWS Direct Connect.

Responsible for data services and data movement infrastructures

Experienced in ETL concepts, building ETL solutions and Data modeling.

Worked on architecting the ETL transformation layers and writing spark jobs to do the processing.

Aggregated daily sales team updates to send reports to executives and to organize jobs running on Spark clusters.

Loaded application analytics data into the data warehouse at regular intervals.

Experienced in fact/dimensional modeling (star schema, snowflake schema), transactional modeling, and SCDs (slowly changing dimensions).

Designed and developed Security Framework to provide fine grained access to objects in AWS S3 using AWS Lambda, DynamoDB.
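
A rough sketch of how such a framework might gate access, looking up an entitlement in DynamoDB before issuing a short-lived presigned URL; the table, key schema, and bucket are hypothetical:

    import boto3

    dynamodb = boto3.resource("dynamodb")
    s3 = boto3.client("s3")

    def handler(event, context):
        # Entitlement lookup; table and attribute names are illustrative placeholders
        table = dynamodb.Table("object_permissions")
        item = table.get_item(
            Key={"user_id": event["user_id"], "object_key": event["object_key"]}
        ).get("Item")
        if not item or not item.get("allowed"):
            return {"statusCode": 403}

        # Grant temporary, object-level read access via a presigned URL
        url = s3.generate_presigned_url(
            "get_object",
            Params={"Bucket": "example-secure-bucket", "Key": event["object_key"]},
            ExpiresIn=300,
        )
        return {"statusCode": 200, "url": url}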

Used AWS Athena to query data directly from AWS S3.
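
For example, a query against S3-backed data can be kicked off with boto3 roughly as below; the database, table, and results bucket are placeholders:

    import boto3

    athena = boto3.client("athena")

    # Database, query, and output location are illustrative placeholders
    response = athena.start_query_execution(
        QueryString="SELECT event_date, COUNT(*) AS events FROM analytics.events GROUP BY event_date",
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )
    print(response["QueryExecutionId"])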

Worked on Confluence and Jira

Designed and implemented a configurable data delivery pipeline, built with Python, for scheduled updates to customer-facing data stores.

Proficient in Machine Learning techniques (Decision Trees, Linear/Logistic Regressors) and Statistical Modeling

Configured AWS Lambda with multiple functions.

Compiled data from various sources to perform complex analysis for actionable results.

Measured the efficiency of the Hadoop/Hive environment, ensuring SLAs were met.

Optimized the TensorFlow Model for efficiency.

Implemented a generalized solution model using AWS SageMaker.

Analyzed the system for new enhancements/functionality and performed impact analysis of the application for implementing ETL changes.

Implemented a Continuous Delivery pipeline with Docker, GitHub, and AWS

Environment: AWS, Python, HDFS, MapReduce, Flume, Kafka, ZooKeeper, Pig, Hive, HQL, HBase, Spark, ETL, web services, Red Hat Linux, Unix.

Education: Bachelor's in Computer Science, JNTUK; Master's in ITMS, Concordia St. Paul, USA.


