
Hima.S

Big Data Engineer

ad1bes@r.postjobfree.com

636-***-****

Professional Summary:

Over 9 years of Software development experience with expertise in Data Engineering using Big data tools and Application development using Java, Scala and Python.

Strong experience working with various tools in the Hadoop ecosystem such as MapReduce, Hive, YARN, HDFS, Kafka, Sqoop, Flume, Oozie, HBase, and Impala.

Profound knowledge in Data Extraction, Data Cleaning, Data Loading, Statistical data analysis, Exploratory Data Analysis, Data Wrangling, Predictive Modelling using Power BI, R, Python, and Visualization using Tableau.

Strong development skills with Azure Data Lake, Azure Data Factory, Azure Databricks, Azure SQL Data Warehouse, Azure Blob Storage, and Azure Storage Explorer.

Strong experience troubleshooting long running spark applications, designing highly fault tolerant spark applications and fine-tuning spark applications.

Experience in dimensional modelling using snowflake schema methodologies for data warehouse and integration projects.

Worked on processes to transfer/migrate data from AWS S3, relational databases, and flat files into common staging tables in various formats, and to load meaningful data into Snowflake.

Strong experience in Data migration from RDBMS to Snowflake cloud data warehouse.

Worked on building various data ingestion pipelines to pull data from various sources like S3 buckets, FTP servers and Rest Applications.

Strong knowledge of building data lakes in AWS Cloud utilizing services like S3, EMR, Athena, Redshift, Redshift Spectrum, Glue, and the Glue Metastore.

Worked extensively on Sqoop for performing both batch loads as well as incremental loads from relational databases.

Experience working with NoSQL databases like Cassandra, MongoDB and HBase.

Proficient SQL experience in querying, data extraction/transformations and developing queries for a wide range of applications.

Strong working experience in SQL Server with T-SQL and Dynamic SQL, constructing triggers, tables, joins, stored procedures, sub-queries, user functions, views, indexes, user profiles, relational database models, data dictionaries, data integrity, and table partitioning.

Experience in using SQL Server, MSBI, SSAS, SSRS, Oracle Enterprise One, and Tableau.

Very good experience creating SQL jobs and SSIS packages to manage SQL Server databases and move SQL Server objects between instances.

Strong in core Java concepts including Object-Oriented Design (OOD) and Java components like Collections Framework, Exception handling, I/O system.

Solid understanding of data science concepts and experience working with data scientists and analysts.

Technical Skills Summary:

Big Data

HDFS, Spark, YARN, Hive, HBase, Flume, Kafka, Sqoop, NiFi, Oozie, Zookeeper

Hadoop Distributions

Cloudera, Hortonworks, and AWS EMR, Dataproc

Languages

Python, Java, Scala, Shell Scripting, SQL, R, SAS

Database

Oracle 10g, MySQL, Microsoft SQL Server, PostgreSQL

NoSQL Databases

HBase, Cassandra, MongoDB

Version Control

Git, Bitbucket

Cloud Services

AWS EMR, EC2, S3, Redshift, Athena; GCP Dataproc, GCS, BigQuery

Reporting Tools

Tableau, QlikView, Power BI, SSRS, SSAS

Experience Profile

Client: CVS Health, RI February 2022 - Present

Role: Big Data Engineer

Responsibilities:

Ingest data from Teradata and SQL Server, land it in S3, write PySpark ETL scripts to read it, and use Spark on EMR for computation; write results back into Hive and Athena.
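
A minimal PySpark sketch of this kind of job, assuming hypothetical bucket paths, table names, and columns (none of these come from the resume):

```python
# Hypothetical PySpark ETL: read landed data from S3, transform, write to Hive.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("claims-etl")          # hypothetical job name
         .enableHiveSupport()
         .getOrCreate())

# Read raw extracts previously landed in S3 (path is a placeholder).
raw = spark.read.parquet("s3://example-bucket/landing/claims/")

# Light cleansing/enrichment before publishing.
cleaned = (raw
           .dropDuplicates(["claim_id"])
           .withColumn("load_dt", F.current_date()))

# Write back to a Hive table for downstream Hive/Athena consumers.
(cleaned.write
        .mode("overwrite")
        .partitionBy("load_dt")
        .saveAsTable("analytics.claims_clean"))
```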

Write Spark Streaming applications that ingest data from AWS Kinesis and write it into HBase for API consumption.

Develop Airflow DAGs in Python to schedule all the workflows we build.
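
A minimal Airflow DAG sketch for scheduling such a workflow; the DAG id, schedule, and called functions are hypothetical placeholders:

```python
# Hypothetical Airflow DAG: schedule a daily ingest-then-transform workflow.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    # Placeholder for the ingestion step (e.g., pull extracts and land them in S3).
    pass

def transform():
    # Placeholder for the PySpark/EMR transformation step.
    pass

with DAG(
    dag_id="daily_ingest_pipeline",      # hypothetical name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    ingest_task >> transform_task
```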

Export the latest snapshot data into Snowflake for analytical purposes; visualization developers build Tableau dashboards on top of it.

Gathering data events from various databases such as neo4j and legacy mainframes for developing, supporting and maintaining the ETL processes using Informatica PowerCenter.

Build data pipelines via source connectors to push data from the ETL process to Kafka topics.

Ingest and map real-time and batch data into different Kafka topics and process it in a way that is usable for consumers.

Design, model, and implement data structures and topic event schemas in Kafka.

Build streaming applications in Java to connect different Kafka topics.

Involved in creating DDL scripts, pulling data into Oracle and MongoDB databases from Kafka topics, and running SQL queries to process the data in those databases.

Created Avro schemas and Swagger files for API consumption.

Experience in developing Reports, Charts, and Dashboards using Power BI Desktop and managing security based on requirements.

Worked with different evaluation contexts such as row context, filter context, and query context in Power BI Desktop.

Environment: GCS, Dataproc, BigQuery, Spark, Scala, Python, Airflow, Jupyter Notebooks, GitHub, Kafka, Java, Tableau, Informatica, SQL, Oracle, AWS, Power BI.

Client: Amazon, USA March 2021 - February 2022

Role: Big Data Engineer

Responsibilities:

Worked on building centralized Data Lake on AWS Cloud utilizing primary services like S3, EMR, Redshift and Athena.

Worked on migrating datasets and ETL workloads from On-prem to AWS Cloud services.

Built series of Spark Applications and Hive scripts to produce various analytical datasets needed for digital marketing teams.

Working with Informatica 9.5.1 and Informatica 9.6.1 Big Data edition for ETL development.

After the data transformation is done, the transformed data is moved to a Spark cluster, where it goes live to the application using Spark Streaming and Kafka.

Extracting data from the data warehouse (Teradata) onto Spark RDDs.

Working on Stateful Transformations in Spark Streaming.

Good hands-on experience loading data into Hive from Spark RDDs.

Worked on Spark SQL UDFs and Hive UDFs.

Worked extensively on building and automating data ingestion pipelines and moving terabytes of data from existing data warehouses to cloud.

Worked extensively on fine tuning spark applications and providing production support to various pipelines running in production.

Worked closely with business teams and data science teams and ensured all the requirements are translated accurately into our data pipelines.

Worked on full spectrum of data engineering pipelines: data ingestion, data transformations and data analysis/consumption.

Worked on automating the infrastructure setup, including launching and terminating EMR clusters.

Created Hive external tables on top of datasets loaded in S3 buckets and created various Hive scripts to produce a series of aggregated datasets for downstream analysis.
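
To keep the code samples in one language, here is a sketch of such an external-table definition and aggregate issued through Spark SQL on a Hive-enabled session; the database, columns, and S3 location are placeholders:

```python
# Hypothetical Hive external table over an S3 dataset, plus a simple aggregate.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.page_views (
        user_id STRING,
        url     STRING,
        ts      TIMESTAMP
    )
    STORED AS PARQUET
    LOCATION 's3://example-bucket/curated/page_views/'
""")

# Aggregated dataset for downstream analysis.
daily_counts = spark.sql("""
    SELECT to_date(ts) AS view_date, COUNT(*) AS views
    FROM analytics.page_views
    GROUP BY to_date(ts)
""")
daily_counts.write.mode("overwrite").saveAsTable("analytics.daily_page_views")
```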

Built a real-time streaming pipeline utilizing Kafka, Spark Streaming, and Redshift.

Worked on creating Kafka producers using the Kafka Producer API to connect to an external REST live-stream application and produce messages to a Kafka topic.
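
A minimal Python sketch of a producer that polls a REST endpoint and publishes to a Kafka topic, using the kafka-python client for illustration; the endpoint, topic, and broker address are placeholders:

```python
# Hypothetical producer: poll a REST endpoint and publish records to a Kafka topic.
import json
import time

import requests
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],                   # placeholder brokers
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    resp = requests.get("https://api.example.com/events")   # placeholder endpoint
    resp.raise_for_status()
    for event in resp.json():
        producer.send("clickstream-events", value=event)    # placeholder topic
    producer.flush()
    time.sleep(5)
```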

Environment: AWS S3, EMR, Redshift, Athena, Glue, Spark, Scala, Python, Java, Hive, Kafka

Client: Burlington Coat Factory, India October 2018 - December 2020

Role: Big Data Engineer

Responsibilities:

Responsible for ingesting large volumes of user behavioral data and customer profile data to Analytics Data store.

Developed custom multi-threaded Java based ingestion jobs as well as Sqoop jobs for ingesting from FTP servers and data warehouses.

Developed Scala based Spark applications for performing data cleansing, event enrichment, data aggregation, de-normalization and data preparation needed for machine learning and reporting teams to consume.

Worked on troubleshooting Spark applications to make them more error tolerant.

Worked on fine-tuning spark applications to improve the overall processing time for the pipelines.

Wrote Kafka producers to stream the data from external rest APIs to Kafka topics.

Wrote Spark Streaming applications to consume data from Kafka topics and write the processed streams to HBase.
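
A sketch of a Structured Streaming consumer in PySpark under assumed broker and topic names; it requires the spark-sql-kafka package at runtime, and the HBase write is stubbed inside foreachBatch since a real sink would go through an HBase connector:

```python
# Hypothetical Spark Structured Streaming job: consume a Kafka topic and hand
# each micro-batch to a sink function (HBase write replaced by a placeholder).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-consumer").getOrCreate()

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder brokers
          .option("subscribe", "clickstream-events")            # placeholder topic
          .load())

parsed = stream.select(F.col("key").cast("string"),
                       F.col("value").cast("string"))

def write_batch(batch_df, batch_id):
    # In the real pipeline this batch would be written to HBase via a connector;
    # here it is persisted to Parquet purely as a stand-in.
    batch_df.write.mode("append").parquet("/tmp/stream_out")

query = (parsed.writeStream
         .foreachBatch(write_batch)
         .option("checkpointLocation", "/tmp/stream_chk")
         .start())
query.awaitTermination()
```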

Experienced in handling large datasets using Spark in-memory capabilities, broadcast variables, effective and efficient joins, transformations, and other features.

Worked extensively with Sqoop for importing data from Oracle.

Designed & implemented medium to large scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL MongoDB).

Involved in creating Hive tables, loading, and analyzing data using hive scripts.

Implemented Partitioning, Dynamic Partitions, Buckets in Hive.
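
A sketch of a partitioned and bucketed Hive table with a dynamic-partition insert, issued through Spark SQL to stay in one language; every database, table, and column name is a placeholder:

```python
# Hypothetical partitioned + bucketed Hive table with a dynamic-partition insert.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Allow dynamic partition inserts.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.orders_bucketed (
        order_id    STRING,
        customer_id STRING,
        amount      DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    CLUSTERED BY (customer_id) INTO 16 BUCKETS
    STORED AS ORC
""")

# Dynamic-partition insert: the partition value comes from the last selected column.
spark.sql("""
    INSERT OVERWRITE TABLE analytics.orders_bucketed PARTITION (order_date)
    SELECT order_id, customer_id, amount, order_date
    FROM staging.orders_raw
""")
```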

Worked on Azure DevOps, Azure Repos, and Azure Pipelines for CI/CD releases, builds, and deployments.

Used reporting tools like Tableau connected to Impala to generate daily reports of the data.

Collaborated with the infrastructure, network, database, application, and BA teams to ensure data quality and availability.

Environment: Hadoop, Spark, Scala, Python, Hive, Sqoop, Oozie, Kafka, YARN, JIRA, Shell Scripting, SBT, GitHub, Maven, NoSQL, MongoDB, Azure

Client: AIG Life Insurance, Hyderabad March 2017 - September 2018

Role: Data Engineer

Responsibilities:

Migrated an entire Oracle database to BigQuery and used Power BI for reporting.

Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.

Experience with GCP Dataproc, GCS, Cloud Functions, and BigQuery.

Experience in moving data between GCP and Azure using Azure Data Factory.

Experience in building Power BI reports on Azure Analysis Services for better performance.

Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Cloud Storage, and BigQuery.

Coordinated with the team and developed a framework to generate daily ad hoc reports and extracts of enterprise data from BigQuery.

Coordinated with the data science team in designing and implementing advanced analytical models over large datasets in the Hadoop cluster.

Wrote Hive SQL scripts to create complex tables with high-performance features such as partitioning, clustering, and skewing.

Worked on downloading BigQuery data into pandas or Spark DataFrames for advanced ETL capabilities.
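
A minimal sketch of pulling BigQuery results into a pandas DataFrame with the google-cloud-bigquery client; the project, dataset, and query are placeholders, and pandas support libraries must be installed:

```python
# Hypothetical pull of BigQuery data into pandas for further ETL.
from google.cloud import bigquery

client = bigquery.Client(project="example-project")   # placeholder project id

query = """
    SELECT policy_id, premium, effective_date
    FROM `example-project.analytics.policies`
    WHERE effective_date >= '2018-01-01'
"""

# to_dataframe() requires pandas (and optionally pyarrow / the BQ Storage API).
df = client.query(query).to_dataframe()
print(df.head())
```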

Worked with Google Data Catalog and other Google Cloud APIs for monitoring, query, and billing-related analysis of BigQuery usage.

Worked on creating a POC utilizing ML models and Cloud ML for table quality analysis for the batch process.

Good knowledge in using cloud shell for various tasks and deploying services.

Expertise in designing and deploying Hadoop clusters and various big data analytics tools, including Pig, Hive, Sqoop, and Apache Spark, with Cloudera distributions.

Environment: Hadoop, Apache Spark, Sqoop, Scala, Python, Hive, SQL, Shell Scripting, NoSQL, GCP BigQuery, Azure, Cloudera distributions.

Client: Genpact, Hyderabad May 2014 – February 2017

Role: Data Engineer

Responsibilities:

Worked with the analysis teams and management teams and supported them based on their requirements.

Involved in extraction, transformation and loading of data directly from different source systems (flat files/Excel/Oracle/SQL/Teradata) using SAS/SQL, SAS/macros.

Generated PL/SQL scripts for data manipulation, validation and materialized views for remote instances.

Used Agile (SCRUM) methodologies for Software Development.

Created and modified several database objects such as Tables, Views, Indexes, Constraints, Stored procedures, Packages, Functions and Triggers using SQL and PL/SQL.

Created large datasets by combining individual datasets using various inner and outer joins in SAS/SQL and dataset sorting and merging techniques using SAS/Base.

Developed live reports in a drill down mode to facilitate usability and enhance user interaction.

Extensively worked on Shell scripts for running SAS programs in batch mode on UNIX.

Wrote Python scripts to parse XML documents and load the data into the database.
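
A minimal sketch of this kind of loader using only the Python standard library; the XML layout, file name, and SQLite target are placeholders standing in for the actual feed and database:

```python
# Hypothetical XML-to-database loader using only the standard library.
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect("staging.db")          # placeholder target database
conn.execute("CREATE TABLE IF NOT EXISTS records (id TEXT, name TEXT, amount REAL)")

tree = ET.parse("weekly_feed.xml")            # placeholder input file
for rec in tree.getroot().findall("record"):  # placeholder element layout
    conn.execute(
        "INSERT INTO records (id, name, amount) VALUES (?, ?, ?)",
        (rec.get("id"), rec.findtext("name"), float(rec.findtext("amount", "0"))),
    )

conn.commit()
conn.close()
```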

Used Python to extract weekly information from XML files.

Developed Python scripts to clean the raw data.

Used Hive, Impala and Sqoop utilities and Oozie workflows for data extraction and data loading.

Environment: SAS, SQL, Teradata, Oracle, PL/SQL, UNIX, XML, Python, AWS, SSRS, TSQL, Hive, Sqoop.

Education:

Bachelor of Engineering, Electronics and Instrumentation Engineering. JNTUH (2013)


