
Data Engineer

Location:
Charlotte, NC
Posted:
December 13, 2023

Vinay H

Sr. Data Engineer

Contact: +1-980-***-****

Email: ad1xq2@r.postjobfree.com

LinkedIn: www.linkedin.com/in/vinayhde

Professional Summary

• 8+ years of experience in data engineering and data pipeline design, development, and implementation as a Data Engineer, Data Developer, and Data Modeler.

• Experience in all stages of the SDLC (Agile, Waterfall): writing technical design documents and developing, testing, and implementing enterprise-level data marts and data warehouses.

• Experience in developing Spark Streaming jobs built on RDDs (Resilient Distributed Datasets) using Scala, PySpark, and the Spark shell.

• Hands-on experience with the Spark and Scala APIs, comparing the performance of Spark against Hive and SQL and using Spark SQL to manipulate DataFrames in Scala.

• Strong experience in writing data analysis scripts using the Python, PySpark, and Spark APIs.

• Expertise in Python and Scala, including writing user-defined functions (UDFs) for Hive and Pig in Python.

• Experience in developing MapReduce programs on Apache Hadoop for analyzing big data as per requirements.

• Worked with various data formats including CSV, Avro, Parquet, JSON, text, SequenceFile, TSV, XML, XLSX, and PDF.

• Experience in using Python and SQL for data engineering and data modeling (a minimal PySpark sketch appears at the end of this summary).

• Extensive experience with Informatica Power Center (9.0/8.x/7.x) for Data Extraction, Transformation and Loading into Data Warehouses/Data Marts.

• Experience creating visual reports, graphical analyses, and dashboards using Tableau; also performed data analysis with Splunk Enterprise on historical data stored in HDFS.

• Experience in writing MapReduce jobs in Python for processing large structured, semi-structured, and unstructured data sets and storing them in HDFS.

• Involved in writing Python scripts to automate extraction of web logs using Airflow DAGs.

• Proficient in creating Scala and PySpark apps for interactive analysis, batch processing, and stream processing.

• Proven expertise in stream processing using AWS Kinesis Firehose, Apache Kafka, AWS EventBridge, SQS, and SNS.

• Hands-on experience designing and building data models and data pipelines on data warehouses and data lakes.

• Hands-on experience with star schema and snowflake schema modeling, fact and dimension tables, and physical and logical data modeling using Erwin.

• Hands-on experience with the Amazon Web Services stack, including S3, Elastic MapReduce (EMR), Redshift, EC2, Athena, Kinesis, and Step Functions for processing, transforming, and storing data.

• Expert knowledge of ad-hoc analysis using Tableau, Power BI, SSAS, and AWS QuickSight.

• Experience in Importing and exporting data into HDFS and Hive using Sqoop.

• Experienced with Integration Services (SSIS), Reporting Services (SSRS), and Analysis Services (SSAS).

• Good hands-on experience with NoSQL databases such as MongoDB, Cassandra, DynamoDB, and HBase, and relational databases such as Oracle, SQL Server, MySQL, and AWS RDS.

• In-depth knowledge of Hadoop architecture and components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, MapReduce, and Spark.

• Extensive knowledge in using Spark APIs for real-time data streaming, data staging, data cleaning, data transformation, and data preparation.

• Worked on importing continuously updated data using Sqoop incremental imports in append and lastmodified modes.

• Actively participated in sprint planning and handled scrum workloads in an Agile environment using JIRA.

• Strong analytical, presentation, communication, and problem-solving skills, with the ability to work independently or in a team and to follow the best practices and principles defined for the team.
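Illustrative sketch of the kind of Python/PySpark batch job referenced above: read raw CSV, clean it, and write partitioned Parquet. The bucket paths and column names below are hypothetical placeholders, not taken from any client project.

# Minimal PySpark batch job: read CSV, deduplicate, derive a date column,
# and write partitioned Parquet. Paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

orders = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("s3a://example-bucket/raw/orders/"))

cleaned = (orders
           .dropDuplicates(["order_id"])
           .withColumn("order_date", F.to_date("order_ts")))

(cleaned.write
 .mode("overwrite")
 .partitionBy("order_date")
 .parquet("s3a://example-bucket/curated/orders/"))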

Certifications

AWS Certified Data Analytics – Specialty, Aug 2022.

Technical Skills

Relational Databases: AWS RDS, Oracle, MySQL, Microsoft SQL Server
NoSQL Databases: MongoDB, Hadoop HBase, DynamoDB, Apache Cassandra
Programming Languages: Python, SQL, Scala, Java, Linux shell, Pig, HiveQL
AWS Cloud Services: S3, S3 Glacier, Redshift, RDS, DynamoDB, EMR, EC2, Lambda, Glue, SNS, Step Functions, Athena, CloudWatch, Kinesis
Azure Cloud Services: Azure Synapse Analytics, Data Lake, Stream Analytics, Data Factory, Databricks, HDInsight, Blob Storage
Data Warehouses & Lakes: Snowflake, Teradata, Hive, AWS Redshift, AWS S3
Big Data Technologies: Hadoop, Cloudera CDH, Databricks, Sqoop, HDFS, MapReduce, YARN, Zookeeper
Stream Processing Frameworks: Apache Spark, Apache Kafka, AWS Kinesis
Orchestration Tools: Apache Airflow, AWS Step Functions
ETL Tools: Informatica PowerCenter, SSIS, AWS Glue
CI/CD & Version Control Tools: Jenkins, Git, GitHub, CloudFormation
Operating Systems: Red Hat Linux, Unix, Windows, macOS
Reporting & Visualization: Tableau, Power BI, SSAS
Python Libraries: PySpark, TensorFlow, Pandas, NumPy, Matplotlib

Professional Experience

Client: Change Healthcare, Nashville, TN (Apr 2022 – Present)
Role: Sr. Data Engineer

Responsibilities:

• Involved in analysis, system architecture design, process interface design, and design documentation.

• Designed, built, and launched efficient and reliable ETL/data pipelines to move data across several platforms including on-prem and cloud Data Warehouse, online caches, and real-time systems.

• Leveraged AWS services to develop a self-serve, domain-driven data mesh platform, reducing operational and engineering costs, empowering data consumers and producers, and enabling greater observability.

• Developed Spark code using Scala, Spark SQL, and Spark Streaming for faster testing and processing of streaming data from Kafka and Amazon Kinesis (see the streaming sketch at the end of this section).

• Extensively used Kafka, Amazon Kinesis Firehose, and Kinesis Data Streams to feed real-time data into Spark.

• Developed Spark jobs to clean data obtained from various feeds to make it suitable for ingestion into Hive tables for analysis.

• Designed flexible and scalable data models for rich datasets, enabling Data Scientists and Analytical engineers to make data driven decisions, aligning product design with user experience.

• Developed batch scripts to fetch the batch data from AWS S3 data lake and performed transformations in Scala using Spark framework.

• Developed Scala scripts and UDFs using both DataFrames/Spark SQL and RDDs in Spark for data aggregation, querying, and writing data back to the RDBMS.

• Supported existing data ETL processes running in production, resolving issues to ensure data accuracy.

• Built Spark jobs using PySpark to perform ETL for data in S3 Data Lake.

• Involved in developing data pipeline using Kafka, Spark, and Hive to ingest, transform and analyze data.

• Developed Pig scripts, Pig UDFs, Hive scripts, and Hive UDFs to analyze HDFS data.

• Leveraged a combination of AWS Lambda, AWS Glue, and AWS EMR to build an automated ETL pipeline for sourcing, mapping, converting, and transforming raw data and loading it from S3 into the AWS Redshift data warehouse.

• Used the Glue Data Catalog and Glue crawlers for data discovery and metadata management across S3 and Redshift.

• Actively contributed to designing star, snowflake, and hybrid schemas for data warehouses and databases.

• Developed Tableau connections to core and peripheral data sources such as flat files, Microsoft Excel, Tableau Server, Amazon Redshift, and Microsoft SQL Server to analyze complex data.

• Used Apache Kafka to aggregate web log data from multiple servers and make them available in downstream systems for analysis.

• Performed analysis on the unused user navigation data by loading into HDFS and writing MapReduce jobs.

• Created Hive tables, loaded and analyzed data using Hive scripts, and implemented partitioning, dynamic partitions, and bucketing in Hive (see the partitioning sketch at the end of this section).

• Ingested the data from/to Teradata and HDFS and used Kafka to extract and ingest the unstructured data into HDFS.

• Maintained data pipelines by designing workflows as Directed Acyclic Graphs (DAGs) of tasks using Apache Airflow.

• Monitored end-to-end infrastructural events with the help of CloudWatch and used SNS for notifications.

• Worked on different file formats like Text, Sequence files, Avro, Parquet, JSON, XML files and Flat files.

• Configured AWS Redshift clusters and Redshift Spectrum for querying, and Redshift data sharing for transferring data among clusters.

• Used Apache Airflow to automate the ETL processes and store the data in batches in Amazon S3.

• Analyzed raw files from S3 data lake using AWS Athena and Glue without loading the data in the database.

• Performed ad-hoc analysis using AWS QuickSight and Tableau to address immediate business requirements.

• Involved in creating, modifying SQL queries, prepared statements and stored procedures used by the application.

• Implemented the project in an Agile project management environment, following the Scrum iterative-incremental model and planning sprints for execution.

• Worked on CI/CD Pipelines and CloudFormation templates.

• Actively participated in weekly iteration review meetings, providing constructive feedback to track progress and identify issues for each cycle.

Stack: Spark, Scala, Python, PySpark, Apache Kafka, Tableau, Hive, AWS S3, AWS Redshift, Apache Airflow, Teradata, Kinesis Firehose, Kinesis Data Streams, Kinesis Data Analytics, MS SQL, CloudWatch, AWS Glue, Glue Catalog, Glue Crawler.
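A minimal PySpark Structured Streaming sketch of the Kafka-to-Spark pattern described in this section; the broker addresses, topic name, schema, and S3 paths are hypothetical, and the job assumes the spark-sql-kafka connector package is available.

# Read JSON events from a Kafka topic, parse them, and append to a Parquet
# sink on S3. Broker, topic, schema, and paths are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-claims-stream").getOrCreate()

event_schema = StructType([
    StructField("claim_id", StringType()),
    StructField("status", StringType()),
    StructField("event_ts", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
       .option("subscribe", "claims-events")
       .option("startingOffsets", "latest")
       .load())

events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", event_schema).alias("e"))
          .select("e.*"))

query = (events.writeStream
         .format("parquet")
         .option("path", "s3a://example-bucket/streams/claims/")
         .option("checkpointLocation", "s3a://example-bucket/checkpoints/claims/")
         .outputMode("append")
         .start())

query.awaitTermination()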
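And a sketch of the Hive partitioning pattern mentioned above, issued through Spark SQL with Hive support; the database, table, and column names are hypothetical, and bucketing could be added with a CLUSTERED BY ... INTO n BUCKETS clause.

# Create a partitioned Hive table and load it with a dynamic-partition insert.
# Database, table, and column names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partitioning")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.claims_by_day (
        claim_id  STRING,
        member_id STRING,
        amount    DOUBLE
    )
    PARTITIONED BY (claim_date DATE)
    STORED AS PARQUET
""")

# Allow fully dynamic partitions for the insert below.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

spark.sql("""
    INSERT OVERWRITE TABLE analytics.claims_by_day PARTITION (claim_date)
    SELECT claim_id, member_id, amount, CAST(event_ts AS DATE) AS claim_date
    FROM staging.claims_raw
""")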

Client: Texas Mutual Insurance, Austin, TX (Jan 2021 – Mar 2022)
Role: Data Engineer

Responsibilities:

• Gathered, analyzed, and translated business requirements to technical requirements, communicated with other departments to collect client business requirements and access available data.

• Built data pipelines with Azure Data Factory to load data from SQL Server to Azure Data Lake using Data Factory pipelines, API gateway services, SSIS packages, Talend jobs, custom .NET code, database jobs, and Python scripts.

• Developed Spark applications in Scala to perform various enrichments of user behavioral (clickstream) data merged with user profile data.

• Involved in developing production-ready Spark applications using the Spark RDD API, DataFrames, Spark SQL, and the Spark Streaming API.

• Involved in implementing advanced procedures like text analytics and processing using Apache Spark written in Scala.

• Designed and deployed data pipelines using Databricks, Azure Data Lake, and Apache Airflow (see the Airflow sketch at the end of this section).

• In charge of monitoring and debugging the Databricks Spark cluster, as well as estimating required cluster size.

• Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Spark SQL using Scala.

• Leveraged Apache Kafka to implement the real-time data streaming solutions, facilitating seamless integration of disparate data sources.

• Designed and implemented secure data pipelines into a Snowflake data warehouse from on-premises and cloud data sources.

• Integrated data from numerous source systems, including nested JSON-formatted data, using the Snowflake cloud data warehouse.

• Contributed to the data transfer from an on-premises MySQL server to Azure Synapse Analytics Data Warehouse & Azure SQL Database.

• Used Azure HDInsight and Azure Stream Analytics to perform ad-hoc analysis and provide insights to business teams.

• Developed simple to complex MapReduce jobs using Hive and Pig, and developed shell and Python scripts to automate and provide control flow to Pig scripts.

• Involved in Extraction, Transformation and Loading (ETL) of data from multiple sources like Flat files, XML files, Avro, Parquet, JSON, and Databases.

• Developed interactive reports and dashboards in Tableau using Cross tabs, Heat maps, Box and Whisker charts, Scatter Plots, Geographic Map, Pie Charts and Bar Charts and Density Chart.

• Built models using Python and PySpark to predict the probability of attendance for various campaigns and events.

• Worked on Kafka messaging platform for real-time transactions and streaming of data from APIs and databases to Reporting tools for analysis.

• Maintained data pipelines by designing workflows as Directed Acyclic Graphs (DAGs) of tasks using Apache Airflow.

• Implemented the migration of data from multiple on-premises servers into cloud using Azure Data Factory service and Data Migration Assistant application.

• Built an Azure Web Job for Product Management teams to connect to different APIs and sources to harvest data and put it into Azure Data Lake.

• Worked on custom Pig Loaders and Storage classes to work with a variety of data formats such as JSON, Compressed CSV, etc.

• Developed NoSQL databases using CRUD operations, indexing, replication, and sharding in MongoDB and Cassandra.

• Performed cluster setup, configuration, and monitoring of clusters using nodetool in Cassandra.

• Designed and created SQL Server tables, views, stored procedures, and functions in Azure SQL and MS SQL databases.

• Troubleshot Zookeeper issues such as performance bottlenecks and resource contention, and optimized Zookeeper configurations for better cluster performance.

• Managed the cluster of Hadoop nodes using Apache Zookeeper.

• Involved in Agile methodologies, daily scrum meetings, and sprint planning.

• Actively participated in code reviews and meetings and resolved technical issues.

Stack: Spark, Scala, Python, PySpark, Tableau, Airflow, Kafka, Hive, MongoDB, Cassandra, Databricks, MS SQL, Azure Data Factory, Azure SQL, MySQL, Azure Blob Storage, Azure Data Lake, Synapse Analytics, Stream Analytics, and Zookeeper.
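A minimal Apache Airflow DAG sketch (Airflow 2.x style) of the extract-transform-load workflows described in this section; the DAG id, task callables, and schedule are hypothetical placeholders.

# Minimal Airflow DAG: extract web logs, transform them, and load to the lake.
# Task callables and the DAG id are hypothetical placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_logs(**context):
    # e.g. pull the previous day's web logs from the source system
    ...

def transform_logs(**context):
    # e.g. clean and enrich the extracted logs with Spark or pandas
    ...

def load_to_lake(**context):
    # e.g. write the transformed data to the data lake
    ...

with DAG(
    dag_id="weblog_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_logs)
    transform = PythonOperator(task_id="transform", python_callable=transform_logs)
    load = PythonOperator(task_id="load", python_callable=load_to_lake)

    extract >> transform >> load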

Client: Fidelity Investments, St. Louis, MO (Aug 2019 – Dec 2020)
Role: Data Engineer

Responsibilities:

• Worked with the business users to gather, define business requirements, and analyze the possible technical solutions.

• Developed Spark applications by using Scala and Python and implemented Apache Spark for data processing from various streaming sources.

• Developed Spark-Streaming applications to consume the data from Kafka topics and to insert the processed streams to HBase.

• Involved in writing Spark applications using Scala to perform various data cleansing, validation, transformation, and summarization activities according to the requirement.

• Built reusable Hive UDF libraries for business requirements, enabling users to apply these UDFs in Hive queries.

• Developed Spark jobs on Databricks for data cleansing, validation, and standardization, then applied transformations per the use cases.

• Developed MapReduce programs for applying business rules to the data.

• Designed and developed Power BI reports and visualizations, including dashboards built with calculations, Power Query, Power Pivot, and DAX functions.

• Designed and developed end-to-end ETL processes from various source systems to the staging area and from staging to the data marts.

• Involved in writing PySpark user-defined functions (UDFs) for various use cases and applied business logic wherever necessary in the ETL process (see the UDF sketch at the end of this section).

• Performed data gathering, cleaning, and wrangling using Python.

• Performed incremental and full loads to transfer data from OLTP systems to a snowflake-schema data warehouse using various data flow and control flow tasks, and maintained existing jobs.

• Ran Sqoop jobs to land data on HDFS and performed validations.

• Created Hive tables, loaded and analyzed data using Hive scripts, and implemented partitioning, dynamic partitions, and bucketing in Hive.

• Maintained data pipelines by designing workflows as Directed Acyclic Graphs (DAGs) of tasks using Apache Airflow.

• Loaded data from Teradata to Hadoop Cluster by using TDCH scripts.

• Wrote complex SQL scripts and PL/SQL packages, to extract data from various source tables of data warehouse.

• Involved in Agile methodologies, daily scrum meetings, and sprint planning.

• Actively participated in code reviews and meetings and resolved technical issues.

Stack: Spark, Scala, Python, PySpark, Power BI, HBase, Hive, Snowflake, Sqoop, Databricks, Airflow, Kafka, Teradata.
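A minimal sketch of a PySpark UDF of the kind described in this section, used to normalize a free-text column during ETL; the column names and input path are hypothetical. Built-in Spark functions are preferred when available, since Python UDFs add serialization overhead.

# Register a simple Python UDF and apply it to a staging DataFrame.
# Column names and the input path are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-example").getOrCreate()

@F.udf(returnType=StringType())
def normalize_status(raw):
    # Map free-text status values to a consistent uppercase code.
    if raw is None:
        return "UNKNOWN"
    return raw.strip().upper().replace(" ", "_")

accounts = spark.read.parquet("hdfs:///staging/accounts/")
accounts = accounts.withColumn("status_clean", normalize_status(F.col("status")))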

Client: Wipro, Bengaluru, India (Sep 2015 – Jul 2018)
Role: Data Engineer

Responsibilities:

• Worked with the business users to gather, define business requirements, and analyze the possible technical solutions.

• Developed Spark scripts using Python and the PySpark shell as per requirements.

• Developed Spark code and Spark SQL/Streaming jobs for faster testing and processing of data.

• Developed Spark/Scala and Python code for a regular expression (regex) project in a Hadoop/Hive environment for big data resources.

• Developed Pig scripts for the analysis of semi-structured data.

• Used Pig as ETL tool to do transformations, event joins, filters, and some pre-aggregations before storing the data onto HDFS.

• Built key business metrics, visualizations, dashboards, and reports with Tableau.

• Involved in building the ETL architecture and Source to Target mapping to load data into Data warehouse.

• Developed Map Reduce jobs for data cleaning and manipulation.

• Involved in writing PySpark user-defined functions (UDFs) for various use cases and applied business logic wherever necessary in the ETL process.

• Performed data gathering, cleaning, and wrangling using Python.

• Wrote Spark programs in Scala for data quality checks (see the data-quality sketch at the end of this section).

• Created Hive internal and external tables as required, designed for query efficiency.

• Worked in a Snowflake environment to remove redundancy and loaded real-time data from various data sources into HDFS using Kafka.

• Used AWS S3 to store large amounts of data in a single, consistent repository.

• Wrote complex SQL scripts and PL/SQL packages, to extract data from various source tables of data warehouse.

• Actively participated in code reviews and meetings and resolved technical issues.

Stack: Spark, Scala, Hadoop, Python, PySpark, AWS S3, MapReduce, Pig, ETL, HDFS, Hive, HBase, SQL, Agile, and Windows.
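A minimal data-quality check of the kind described above, sketched in PySpark rather than Scala for brevity; the input path and key column are hypothetical placeholders.

# Count nulls per column and flag duplicate keys before loading downstream.
# The table path and key column are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq-checks").getOrCreate()

df = spark.read.parquet("hdfs:///data/curated/transactions/")

# Null count per column.
null_counts = df.select([
    F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns
])
null_counts.show()

# Duplicate primary keys.
dup_keys = (df.groupBy("txn_id")
            .count()
            .filter(F.col("count") > 1))
print("duplicate keys:", dup_keys.count())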

Client: BetaBulls, Hyderabad, India (Jul 2014 – Aug 2015)
Role: Hadoop Developer

Responsibilities:

• Extensive hands-on experience in data warehouse/database design and modeling, building SQL objects such as tables, views, user-defined/table-valued functions, stored procedures, triggers, and indexes.

• Created HBase tables from Hive and wrote HiveQL statements to access HBase table's data.

• Created dynamic partitions and bucketing in Hive to enhance query efficiency and developed intricate Hive Scripts to analyze the data.

• Developed MapReduce applications utilizing the Hadoop MapReduce programming framework, and optimized MapReduce Jobs using compression techniques.

• Developed Pig UDFs (user-defined functions) to analyze customer behavior, and Pig scripts for processing data in Hadoop.

• Utilized Oozie to set up automated schedules for the execution of tasks, including loading data into HDFS using Sqoop, and performing data preprocessing with Pig and Hive.

• Developed Oozie actions such as Hive, shell, and Java actions to submit and schedule applications on the Hadoop cluster.

• Worked with production support team to provide necessary support for issues with CDH Hadoop cluster and the data ingestion.

• Designed, Developed and Deployed data pipelines for moving data across various systems.

• Developed solutions for import/export of data from Teradata, Oracle databases to HDFS.

• Resolved Spark and YARN resource-management issues, including shuffle problems, out-of-memory and heap space errors, and schema compatibility issues.

• Used Sqoop to import and export the data from/to HDFS and Oracle database.

• Involved in converting HiveQL queries into Spark transformations and actions using Spark DataFrames and RDDs (see the conversion sketch at the end of this section).

• Extensively involved in installation and configuration of Cloudera Hadoop Distribution (CDH).

• Built an ETL pipeline to scale data processing with rapid data growth, using Spark to improve the performance of existing Hadoop algorithms via SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.

Stack: Spark, Cloudera Hadoop CDH, Teradata, Oracle, SQL, HiveQL, HBase, MapReduce, Sqoop, YARN, Unix/Linux.
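A sketch of converting a HiveQL aggregation into equivalent Spark DataFrame transformations, as referenced in this section; the table and column names are hypothetical placeholders.

# Equivalent DataFrame version of a HiveQL aggregation query.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("hiveql-to-dataframe")
         .enableHiveSupport()
         .getOrCreate())

# HiveQL version:
#   SELECT region, COUNT(*) AS orders, SUM(amount) AS total
#   FROM sales WHERE order_date >= '2015-01-01'
#   GROUP BY region;

sales = spark.table("sales")
summary = (sales
           .filter(F.col("order_date") >= "2015-01-01")
           .groupBy("region")
           .agg(F.count("*").alias("orders"),
                F.sum("amount").alias("total")))
summary.show()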

Education

Master of Science in Computer Science, University of Dayton, Dayton, Ohio (Aug 2018 – Dec 2019)

Bachelor of Engineering in Electronics and Communications Engineering, Chennai, India (Aug 2010 – May 2014)

References: Will be provided upon request.


