
SANKAR YADAV DATA ENGINEER

Dallas, Texas 857-***-**** ad04f6@r.postjobfree.com

PROFESSIONAL SUMMARY

5.7 years of software development experience, including 4+ years of expertise in Big Data, the Hadoop ecosystem, analytics, cloud engineering, and data warehousing.

Experience in large-scale application development using the Big Data ecosystem: Hadoop (HDFS, MapReduce, YARN), Spark, Kafka, Hive, Impala, HBase, Sqoop, Pig, Airflow, Oozie, Zookeeper, Ambari, Flume, NiFi, and AWS.

Hands-on experience with AWS services such as Amazon EC2, S3, EMR, Amazon RDS, VPC, Elastic Load Balancing, IAM, Auto Scaling, CloudFront, CloudWatch, SNS, SES, SQS, and Lambda for event-driven triggering of resources.
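
A minimal sketch of the event-driven triggering pattern mentioned above, assuming a hypothetical S3-triggered Lambda that forwards object details to an SQS queue (the queue URL, bucket, and message shape are illustrative placeholders, not details from this resume):

```python
# Hypothetical AWS Lambda handler: on S3 object creation, forward a message to SQS.
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/ingest-events"  # placeholder

def lambda_handler(event, context):
    # S3 put events arrive as a list of records; extract bucket and key from each.
    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"bucket": bucket, "key": key}),
        )
    return {"processed": len(records)}
```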

Experience building data pipelines with Azure Data Factory and Azure Databricks, and loading data into Azure Data Lake, Azure SQL Database, and Azure SQL Data Warehouse while controlling and granting database access.

Good experience with Azure services like HDInsight, Stream Analytics, Active Directory, Blob Storage, Cosmos DB, Storage Explorer.

In-depth understanding of Hadoop architecture and its components, such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, MapReduce, and Spark.

Experience in job workflow scheduling and monitoring tools like Oozie and Zookeeper.

Expertise in writing Hadoop jobs using MapReduce, Apache Crunch, Hive, Pig, and Splunk.

Deep knowledge of developing production-ready Spark applications using Spark components such as Spark SQL, GraphX, DataFrames, Datasets, Spark ML, and Spark Streaming, along with Matplotlib for visualization.

Strong working experience with SQL and NoSQL databases (Cosmos DB, MongoDB, HBase, Cassandra), data modelling, tuning, disaster recovery, backup and creating data pipelines.

Experienced in scripting with Python (PySpark), Java, Scala, and Spark SQL for developing and aggregating data from various file formats such as XML, JSON, CSV, Avro, Parquet, and ORC.

Strong experience in data analysis using HiveQL, Hive ACID tables, Pig Latin queries, and custom MapReduce programs, with demonstrated performance improvements.

Developed end-to-end ETL data pipelines that take data from source systems and load it into an RDBMS using Spark.
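
As an illustration of this kind of Spark-based ETL, a minimal PySpark sketch (source paths, column names, and the JDBC connection details are hypothetical placeholders):

```python
# Sketch of a PySpark batch ETL job: read source files, aggregate, load into an RDBMS via JDBC.
# Requires the target database's JDBC driver on the Spark classpath.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("source-to-rdbms-etl").getOrCreate()

orders = spark.read.json("hdfs:///data/raw/orders/")           # JSON source (placeholder path)
customers = spark.read.parquet("hdfs:///data/raw/customers/")  # Parquet source (placeholder path)

daily_totals = (
    orders.join(customers, "customer_id")
          .groupBy("customer_id", F.to_date("order_ts").alias("order_date"))
          .agg(F.sum("amount").alias("daily_amount"))
)

# Load the aggregated result into a relational table through JDBC.
(daily_totals.write
    .format("jdbc")
    .option("url", "jdbc:mysql://db-host:3306/reporting")
    .option("dbtable", "daily_order_totals")
    .option("user", "etl_user")
    .option("password", "etl_password")
    .mode("append")
    .save())
```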

Experience configuring Spark Streaming to receive real-time data from Apache Kafka and persist the stream to HDFS, and expertise in using Spark SQL with data sources such as JSON, Parquet, and Hive.
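
A minimal sketch of that Kafka-to-HDFS flow, written here with Spark Structured Streaming (broker address, topic, and paths are hypothetical; the spark-sql-kafka connector package must be available to the job):

```python
# Sketch: consume a Kafka topic with Spark Structured Streaming and persist it to HDFS as Parquet.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

events = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
         .option("subscribe", "patient-events")               # placeholder topic
         .option("startingOffsets", "latest")
         .load()
         .select(F.col("value").cast("string").alias("json_value"))
)

query = (
    events.writeStream
          .format("parquet")
          .option("path", "hdfs:///data/streams/patient_events/")
          .option("checkpointLocation", "hdfs:///checkpoints/patient_events/")
          .start()
)
query.awaitTermination()
```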

Experience with the ELK stack to build search capabilities over unstructured data held in NoSQL databases and HDFS.

Created Kibana visualizations and dashboards to view the number of messages processing through the streaming pipeline for the platform.

Implemented CRUD operations using Cassandra Query Language (CQL) and analyzed data from Cassandra tables for quick searching, sorting, and grouping on top of the Cassandra File System.
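
A small sketch of CQL-style CRUD driven from Python with the DataStax driver (host, keyspace, table, and columns are hypothetical):

```python
# Sketch of CRUD operations against Cassandra using CQL via the DataStax Python driver.
from datetime import date
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])          # placeholder contact point
session = cluster.connect("patient_ks")   # placeholder keyspace

# Create
session.execute(
    "INSERT INTO visits (patient_id, visit_date, provider) VALUES (%s, %s, %s)",
    ("p-1001", date(2023, 11, 1), "Dr. Smith"),
)

# Read
row = session.execute(
    "SELECT patient_id, visit_date, provider FROM visits WHERE patient_id = %s",
    ("p-1001",),
).one()
print(row)

# Update
session.execute(
    "UPDATE visits SET provider = %s WHERE patient_id = %s AND visit_date = %s",
    ("Dr. Jones", "p-1001", date(2023, 11, 1)),
)

# Delete
session.execute(
    "DELETE FROM visits WHERE patient_id = %s AND visit_date = %s",
    ("p-1001", date(2023, 11, 1)),
)
cluster.shutdown()
```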

Hands-on experience on Ad-hoc queries, Indexing, Replication, Load balancing, Aggregation in MongoDB.
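
A brief pymongo sketch of the indexing, ad-hoc query, and aggregation operations listed above (connection string, database, collection, and field names are hypothetical):

```python
# Sketch: indexing, ad-hoc queries, and aggregation in MongoDB with pymongo.
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")  # placeholder connection
claims = client["healthcare"]["claims"]            # placeholder db/collection

# Secondary index to speed up ad-hoc queries on patient_id.
claims.create_index([("patient_id", ASCENDING)])

# Ad-hoc query using the index: latest ten claims for one patient.
recent = claims.find({"patient_id": "p-1001"}).sort("claim_date", -1).limit(10)

# Aggregation pipeline: total claim amount per provider, highest first.
totals = claims.aggregate([
    {"$group": {"_id": "$provider", "total_amount": {"$sum": "$amount"}}},
    {"$sort": {"total_amount": -1}},
])
for doc in totals:
    print(doc)
```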

Good understanding of security requirements and of Azure Active Directory, Sentry, Ranger, and Kerberos authentication and authorization infrastructure.

Experience developing Pig Latin and HiveQL scripts for data analysis and ETL, extending default functionality by writing User Defined Functions (UDFs) and User Defined Aggregate Functions (UDAFs) for custom, data-specific processing.

Sound knowledge of building highly scalable and resilient RESTful APIs, ETL solutions, and third-party integrations as part of an enterprise platform using Informatica.

Experience using bug tracking and ticketing tools such as Jira and Remedy; used Git and SVN for version control.

Closely involved in all phases of the SDLC using Waterfall and Agile Scrum methodologies.

Sound knowledge of and hands-on experience with NLP, image detection, MapR, the IBM InfoSphere suite, Storm, Flink, Talend, ER Studio, and Ansible.

Extensive experience implementing Continuous Integration (CI) and Continuous Delivery/Deployment (CD) for various Java-based applications using Jenkins, TeamCity, Azure DevOps, Maven, Git, Nexus, Docker, and Kubernetes.

SKILLS

Big Data Ecosystem:

HDFS, Yarn, MapReduce, Spark, Kafka, Kafka Connect, Hive, Airflow, Stream Sets, Sqoop, HBase, Flume, Pig, Ambari, Oozie, Zookeeper, Nifi, Sentry, Ranger

Hadoop Distributions:

Apache Hadoop 2.x/1.x, Cloudera CDP, Hortonworks HDP, Amazon AWS (EMR, EC2, EBS, RDS, S3, Athena, Glue, Elasticsearch, Lambda, SQS, DynamoDB, Redshift, ECS, QuickSight, Kinesis), Microsoft Azure (Databricks, Data Lake, Blob Storage, Azure Data Factory, SQL Database, SQL Data Warehouse, Cosmos DB, Azure Active Directory)

Scripting Languages:

Python, Java, Scala, R, PowerShell Scripting, Pig Latin, HiveQL.

Cloud Environment:

Amazon Web Services (AWS), Microsoft Azure.

NoSQL Database:

Cassandra, Redis, MongoDB, Neo4j

Database:

MySQL, Oracle, Teradata, MS SQL SERVER, PostgreSQL, DB2

ETL/BI:

Snowflake, Informatica, Talend, SSIS, SSRS, SSAS, ER Studio, Tableau, Power BI

Operating systems:

Linux (Ubuntu, Centos, RedHat), Windows (XP/7/8/10)

Web Development:

JavaScript, NodeJS, HTML, CSS, Spring, J2EE, JDBC, Okta, Postman, Angular, JFrog, Mockito, Flask, Hibernate, Maven, Tomcat, WebSphere

Version Control:

Git, SVN, Bitbucket.

Others:

Machine learning, NLP, Spring Boot, Jupyter Notebook, Jenkins, Splunk, Jira.

EDUCATION

Harrisburg University, GPA: 3.50

Master of Science, Computer Systems

Coursework: Facility Systems Design, Big Data Analytics, Project Management.

Sir CRR College of Engineering, AP, India, GPA: 3.92

Bachelor’s in information technology.

Coursework: basic programming, systems and architecture, software development and databases, algorithms.

CERTIFICATIONS

AWS Certified: AWS Data Engineer Associate

Exam DP-200: Implementing an Azure Data Solution

Exam DP-201: Designing an Azure Data Solution

Analyzing and Visualizing Data with Power BI

May 2017

PROFESSIONAL EXPERIENCE

GE Healthcare Jan 2022 - Present

Role: Data Analytics Engineer

Building and maintaining data streaming pipelines that handle the creation and mastering of new and existing patient records, consolidating patient information across healthcare providers and sharing it securely with multiple organizations (such as insurance and settlement companies) while preserving confidentiality and security. This data feeds a CRM platform that adds value to overall population health insight.

Developed various data loading strategies and performed transformations to analyze the datasets using the Hortonworks Distribution of the Hadoop ecosystem.

Worked in a Databricks Delta Lake environment on AWS using Spark.

Developed a Spark-based ingestion framework for loading data into HDFS, creating tables in Hive, and executing complex computations and parallel data processing.

Ingested real-time data from flat files and APIs using Kafka.

Developed data ingestion pipeline from HDFS into AWS S3 buckets using Nifi.

Created external and permanent tables in Snowflake on top of the AWS data.

Created Hive tables and wrote Hive queries for data analysis to meet business requirements, and used Sqoop to import and export data from Oracle and MySQL.

Used Spark to migrate MapReduce jobs to Spark RDD transformations and Spark Streaming.

Developed an application to clean semi-structured data such as JSON into structured files before ingesting them into HDFS.

Automated the process of transforming and ingesting terabytes of monthly data using Kafka, S3, Lambda, and Oozie.

Provided Production support to Tableau users and Wrote Custom SQL to support business requirements.

Modified the pipeline to allow messages to pick up newly arriving records for merging, handling missing (NULL) values and triggering a corresponding merge of records on receipt of a merge event.

Integrated Apache Storm with Kafka to perform web analytics and to move clickstream data from Kafka to HDFS.

Worked on automating CI and CD pipeline with AWS Code Pipeline, Jenkins and AWS Code Deploy.

Created an internal tool, using shell scripts, for comparing the RDBMS and Hadoop so that all data in source and target matches, reducing the complexity of moving data.

Wrote Python scripts for internal testing that read data from a file and push it into a Kafka queue, which in turn is consumed by the Storm application.
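
A minimal sketch of such a test script, assuming the kafka-python client (broker address, topic name, and file path are placeholders):

```python
# Sketch: read newline-delimited records from a file and publish each to a Kafka topic.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="broker1:9092")  # placeholder broker

with open("/data/test/records.jsonl", "rb") as fh:          # placeholder test file
    for line in fh:
        line = line.strip()
        if line:
            # Downstream Storm topology consumes this topic.
            producer.send("storm-input", value=line)

producer.flush()
producer.close()
```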

Configured Snowpipe to pull data from S3 buckets into Snowflake tables.
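
Snowpipe is configured with Snowflake DDL; a hedged sketch of what such a pipe definition might look like, issued here through the Snowflake Python connector (account, credentials, stage, table, and file format are placeholders):

```python
# Sketch: define a Snowpipe that auto-ingests files landing in an external S3 stage into a table.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345", user="etl_user", password="***",  # placeholder credentials
    warehouse="LOAD_WH", database="ANALYTICS", schema="RAW",
)
cur = conn.cursor()

cur.execute("""
    CREATE PIPE IF NOT EXISTS raw.patient_events_pipe
      AUTO_INGEST = TRUE
      AS
      COPY INTO raw.patient_events
      FROM @raw.s3_patient_stage
      FILE_FORMAT = (TYPE = 'JSON')
""")

cur.close()
conn.close()
```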

Created and managed Kafka topics and producers for the streaming data.

Worked in an Agile development environment and participated in daily scrums and other design-related meetings.

Imported and exported the analyzed data to relational databases using Sqoop for visualization, and produced reports for the BI team in Power BI with an automated trigger API.

Tata Consultancy Services April 2019 – Dec 2021

Big Data Engineer

Worked with Hortonworks distribution. Installed, configured, and maintained a Hadoop cluster based on the business and the team requirements.

Experience with big data components such as HDFS, MapReduce, YARN, Hive, HBase, Druid, Sqoop, Pig, and Ambari.

Enhanced scripts of existing modules written in Python. Migrated ETL jobs to Pig scripts to apply transformations, joins, and aggregations and to load data into HDFS.

Developed an ETL pipeline in PySpark to extract archived logs from different sources for further processing, and used cron schedules for weekly automation.
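
A minimal sketch of a cron-driven PySpark job of this kind (paths, the log pattern, and the crontab entry are placeholders):

```python
# Sketch: parse archived (gzipped) log files with PySpark and write a cleaned Parquet copy.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("archived-log-etl").getOrCreate()

logs = spark.read.text("hdfs:///archive/app-logs/*.gz")  # Spark reads .gz text transparently

parsed = logs.select(
    F.regexp_extract("value", r"^(\S+ \S+)", 1).alias("event_time"),
    F.regexp_extract("value", r"\[(\w+)\]", 1).alias("level"),
    F.regexp_extract("value", r"\] (.*)$", 1).alias("message"),
).filter(F.col("event_time") != "")

parsed.write.mode("append").parquet("hdfs:///data/processed/app_logs/")

# Example crontab entry for weekly automation (runs Sundays at 02:00):
# 0 2 * * 0 /usr/bin/spark-submit /opt/jobs/archived_log_etl.py
```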

Implemented a proof of concept deploying this product in Amazon Web Services (AWS). Operated in the AWS environment with Elastic MapReduce (EMR) to set up AWS EC2 instances and to develop and deploy custom Hadoop applications.

Designed and created Apache NiFi jobs to move records from transaction systems into the data lake raw zone.

Wrote Spark applications running on an Amazon EMR cluster that fetch data from the Amazon S3/One Lake location and queue it in Amazon SQS (Simple Queue Service).

Responsible for loading, managing, and reviewing terabytes of log files using Ambari and Hadoop streaming jobs.

Migrated from JMS Solace to Kafka and used Zookeeper to manage synchronization, serialization, and coordination.

Used Sqoop to migrate data between traditional RDBMS and HDFS. Ingested data into HDFS from Teradata, Oracle, and MySQL. Identified required tables and views and exported them into Hive. Performed ad-hoc queries using Hive joins, partitioning, bucketing techniques for faster data access.

Developed Tableau data visualization using Cross tabs, Heat maps, Box and Whisker charts, Scatter Plots, Geographic Map, Pie Charts and Bar Charts and Density Chart.

Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for data sets processing and storage. Experienced in Maintaining the Hadoop cluster on AWS EMR.

Responsible for gathering, cleaning, and extracting data from various sources to generate reports, dashboards, and analytical solutions. Helped in debugging the Tableau dashboards.

Troubleshot defects by identifying the root cause and fixed them during the QA phase. Used SVN for version control.

Suneratech, Hyderabad April 2018 - April 2019

Data Analyst

Extensively used Informatica Client tools Power Center Designer, Workflow Manager, Workflow Monitor and Repository Manager. Extracted data from various heterogeneous sources like Oracle, Flat Files.

Performed data operations like Text Analytics and Data Processing, using the in-memory computing capabilities of Spark using Scala.

Implemented a data interface to retrieve customer information using REST APIs.

Maintained the ELK stack (Elasticsearch, Logstash, and Kibana) and wrote Spark scripts in Scala.

Designed SSIS (ETL) Packages to extract data from various heterogeneous data sources such as Access database, Excel spreadsheet and flat files into SQL Server and maintain the data.

Created Informatica workflows to load the source data into CSDR. Involved in creating various UNIX scripts used during the ETL load process.

Experience with the Hadoop framework, HDFS, and MapReduce processing implementation.

Worked on the development of data warehouse, data lake, and ETL systems using relational and non-relational tools (SQL and NoSQL). Involved in converting business requirements into technical design documents.

Created new repositories from scratch and handled backup and restore.

Used Spark-SQL to read, process the parquet data, and create the tables using the Scala API.

Monitored Spark Application to capture the logs generated by Spark jobs.

Implemented MapReduce jobs and used Sqoop to import data from various relational databases, filter out unstructured data, and perform data cleaning and pre-processing on HDFS.

Periodically cleaned up Informatica repositories; monitored the daily load and shared the stats with the QA team.



