AWS Data Engineer

Location: Dallas, TX

Saheshchandra

469-***-****

adxajw@r.postjobfree.com

PROFESSIONAL SUMMARY:

·7+ years of experience as a Data Engineer in Big Data using the Hadoop and Spark frameworks, covering analysis, design, development, documentation, deployment, and integration with SQL and Big Data technologies.

·2+ years of experience as a Snowflake Engineer.

·Well versed in configuring and administering Hadoop clusters using Cloudera and Hortonworks.

·Experience in creating separate virtual data warehouses with different size classes in Snowflake on AWS.

·Experience with data transformations utilizing SnowSQL and Python in Snowflake.

·Hands-on experience in bulk loading and unloading data into Snowflake tables using the COPY command (a brief sketch follows this summary).

·Experience working with AWS S3 and the Snowflake cloud data warehouse.

·Expertise in building Azure-native enterprise applications and migrating applications from on-premises to the Azure environment.

·Implementation experience with data lakes and business intelligence tools in Azure.

·Experience in creating real-time data streaming solutions using Apache Spark/Spark Streaming, Apache Storm, Kafka, and Flume.

·Currently working on Spark applications extensively, using Scala as the main programming language.

·Processing streaming data using the Spark Streaming API with Scala.

·Used Spark DataFrames, Spark SQL, and the Spark RDD API for performing various data transformations and dataset building.

·Developed RESTful web services to retrieve, transform, and aggregate data from different endpoints into Hadoop (HBase, Solr).

·Created Jenkins pipelines using Groovy scripts for CI/CD.

·Exposure to data lake implementation; developed data pipelines and applied business logic utilizing Apache Spark.

·Involved in converting Cassandra/Hive/SQL queries into Spark transformations using RDDs and Scala.

·Implemented Spark scripts using Scala and Spark SQL to load Hive tables into Spark for faster processing of data.

·Hands-on experience with real-time workloads on NoSQL databases like MongoDB, HBase, and Cassandra.

·Experience in creating MongoDB clusters and hands-on experience with complex MongoDB aggregation functions and mapping.

·Experience in using Flume to load log files into HDFS and Oozie for data scrubbing and processing.

·Experience in performance tuning of Hive queries and MapReduce programs for scalability and faster execution.

·Experienced in handling real-time analytics using HBase on top of HDFS data.

·Experience in transformations, grouping, aggregations, and joins using the Kafka Streams API.

·Hands-on experience deploying Kafka Connect in standalone and distributed modes and creating Docker containers using Docker.

·Created topics and wrote Kafka producers and consumers in Python as required; developed Kafka source/sink connectors to store newly streamed data into topics and move it from topics to the required databases by performing ETL tasks; also used the Akka toolkit with Scala to perform some builds.

·Experienced in collecting metrics for Hadoop clusters using Ambari & Cloudera Manager.

·Knowledge of Storm architecture; experience in using data modeling tools like Erwin.

·Excellent experience in using scheduling tools to automate batch jobs

·Hands on experience in using Apache SOLR/Lucene

·Expertise in SQL Server: SQL queries, procedures, and functions.

·Hands-on experience in application development using Hadoop, RDBMS, and Linux shell scripting.

·Strong experience in Extending Hive and Pig core functionality by writing custom UDFs

·Experience in transforming business requirements into analytical models, designing algorithms, building models, developing data mining and reporting solutions that scale across big volume of structured and unstructured data.

·Extensive experience in text analytics, developing statistical machine learning and data mining solutions to various business problems, and building data visualizations using Python and R.

·Ability to work in a team and individually on many cutting-edge technologies, with excellent management skills, business understanding, and strong communication skills.
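The COPY-based bulk loading mentioned above can be pictured with a minimal sketch using the snowflake-connector-python client. The connection parameters, stage, and table names below are hypothetical placeholders, not details taken from this resume.

    import snowflake.connector

    # Hypothetical connection parameters; real values would come from a secrets store.
    conn = snowflake.connector.connect(
        account="my_account",
        user="etl_user",
        password="...",
        warehouse="LOAD_WH",
        database="ANALYTICS",
        schema="STAGING",
    )

    # Bulk-load staged S3 files into a table with the COPY command.
    copy_sql = """
        COPY INTO staging.orders
        FROM @my_s3_stage/orders/
        FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
        ON_ERROR = 'ABORT_STATEMENT'
    """

    try:
        cur = conn.cursor()
        cur.execute(copy_sql)
        print(cur.fetchall())  # COPY returns one status row per loaded file
    finally:
        conn.close()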

TECHNICAL SKILLS:

Hadoop/Big Data: HDFS, MapReduce, YARN, HBase, Pig, Hive, Sqoop, Flume, Oozie, Zookeeper, Splunk, Hortonworks, Cloudera

Programming Languages: SQL, Python, R, Scala, Spark, Linux shell scripts

Databases: RDBMS (MySQL, DB2, MS SQL Server, Teradata, PostgreSQL), NoSQL (MongoDB, HBase, Cassandra), Snowflake virtual warehouse

OLAP & ETL Tools: Tableau, Spyder, Spark, SSIS, Informatica PowerCenter, Pentaho, Talend

Data Modelling Tools: Microsoft Visio, ER Studio, Erwin

Python and R Libraries: R – tidyr, tidyverse, dplyr, reshape, lubridate; Python – BeautifulSoup, NumPy, SciPy, matplotlib, python-twitter, pandas, scikit-learn, Keras

Machine Learning: Regression, Clustering, MLlib, Linear Regression, Logistic Regression, Decision Tree, SVM, Naive Bayes, KNN, K-Means, Random Forest, Gradient Boost & AdaBoost, Neural Networks, and Time Series Analysis

Data Analysis Tools: Machine Learning, Deep Learning, Data Warehouse, Data Mining, Data Analysis, Big Data, Visualization, Data Munging, Data Modelling

Cloud Computing Tools: Snowflake, SnowSQL, Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP)

Amazon Web Services: EMR, EC2, S3, RDS, CloudSearch, Redshift, Data Pipeline, Lambda

Reporting Tools: JIRA, MS Excel, Tableau, Power BI, QlikView, Qlik Sense, D3, SSRS, SSIS

IDEs: PyCharm

Development Methodologies: Agile, Scrum, Waterfall

EDUCATION: Master of Science in Informatics Analytics, Northeastern University, Boston, MA

PROFESSIONAL EXPERIENCE:

The Vanguard Group, King of Prussia, Pennsylvania Oct 2021 – Present

Role: Data Engineer

Responsibilities:

·Developing ETL pipelines to move on-prem data (data sources that include flat files, mainframe files, and databases) to AWS S3 using PySpark; created and embedded Python modules in the ETL pipeline to automatically migrate data from S3 to Redshift using AWS Glue (see the PySpark sketch at the end of this section).

·Collected data from an AWS S3 bucket using Spark Streaming in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.

·Used the AWS cloud product suite (S3, EMR, SQS, Redshift) and Spark SQL.

·Operated AWS cloud services such as RDS, Athena, CloudWatch, EC2, and IAM policies.

·Designed and architected solutions to load multipart files that cannot rely on a scheduled run and must be event-driven, leveraging AWS SNS, SQS, and Glue.

·Created a Lambda function to run the AWS Glue job based on the defined Amazon S3 event (see the Lambda sketch at the end of this section).

·Optimized the Glue jobs to run on an EMR cluster for faster data processing.

·Performed data preprocessing and feature engineering for further predictive analytics using Python Pandas.

·Explored the usage of Spark for improving the performance and optimization of the existing algorithms in Big Data using Spark Context, Spark SQL and Spark Yarn.

·Involved in converting Hive/SQL queries into Spark Transformations using Spark RDDs

·Used the Oozie workflow engine to manage independent Hadoop jobs and to automate several types of Hadoop jobs, such as Hive and Sqoop, as well as system-specific jobs.

·Working with Amazon Web Services (AWS), AWS CloudFormation, and AWS CloudFront, using containers like Docker, and familiar with Jenkins.

·Managed security groups on AWS, focusing on high availability, fault tolerance, and auto scaling using Terraform templates, along with continuous integration and continuous deployment with AWS Lambda and AWS CodePipeline.

·Implemented a serverless architecture using API Gateway, Lambda, and DynamoDB, and deployed AWS Lambda code from Amazon S3 buckets.

·Worked on Cognos modernization, which involved migrating the Cognos data source to the AWS data lake.

·Built Datablocks in the data lake using the raw tables as the data source.

·Used Git as the version control tool and Jenkins as the continuous integration (CI) tool.

·Published the dashboard reports to Tableau Server for navigating the developed dashboards on the web.

·Followed agile methodology and actively participated in daily scrum meetings.
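A minimal sketch of an S3-event-driven Lambda that starts a Glue job, as described above. The Glue job name, argument keys, and event wiring are hypothetical placeholders rather than the actual function.

    import boto3

    glue = boto3.client("glue")

    # Hypothetical Glue job name; in practice this would come from an environment variable.
    GLUE_JOB_NAME = "s3_to_redshift_job"

    def lambda_handler(event, context):
        """Triggered by an S3 ObjectCreated event; starts a Glue job run for each new object."""
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            response = glue.start_job_run(
                JobName=GLUE_JOB_NAME,
                Arguments={"--source_bucket": bucket, "--source_key": key},
            )
            print(f"Started Glue run {response['JobRunId']} for s3://{bucket}/{key}")
        return {"statusCode": 200}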

Environment: AWS, Spark, Hive, Spark SQL, Spark Streaming, AWS EC2, S3, Glue, EMR
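The on-prem-to-S3 ETL step can be pictured with a short PySpark sketch. The paths, column names, and bucket below are hypothetical, and the Redshift load (handled by Glue in the work described) is omitted.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("onprem_to_s3").getOrCreate()

    # Hypothetical landing path for extracted flat files.
    raw = spark.read.option("header", "true").csv("/landing/flatfiles/accounts/")

    cleaned = (
        raw.dropDuplicates(["account_id"])              # hypothetical key column
           .filter(F.col("account_id").isNotNull())
           .withColumn("load_date", F.current_date())
    )

    # Write partitioned Parquet to the S3 data lake for downstream Glue/Redshift loading.
    cleaned.write.mode("overwrite").partitionBy("load_date").parquet("s3a://example-data-lake/accounts/")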

Wayfair, Boston, MA Feb 2020 – Sep 2021

Role: Data Engineer

Responsibilities:

·Developed Talend Big Data jobs to load heavy volumes of data into the S3 data lake and then into Snowflake.

·Developed Snowpipe pipes for continuous ingestion of data using an event handler from AWS (S3 bucket); see the Snowpipe sketch at the end of this section.

·Developed SnowSQL scripts to deploy new objects and update changes in Snowflake.

·Developed a Python script to integrate DDL changes between the on-prem Talend warehouse and Snowflake.

·Working with the AWS stack: S3, EC2, Snowball, EMR, Athena, Glue, Redshift, DynamoDB, RDS, Aurora, IAM, Firehose, and Lambda.

·Designing and implementing new HIVE tables, views, schema and storing data optimally.

·Performing Sqoop jobs to land data on HDFS and running validations.

·Configuring Oozie Scheduler jobs to run the extract jobs and queries in an automated way.

·Querying data by optimizing the query and increasing the query performance.

·Designing and creating SQL Server tables, views, stored procedures, and functions.

·Performing ETL operations using Apache Spark, also using Ad-Hoc queries and implementing Machine Learning techniques.

·Worked on configuring CI/CD for CaaS deployments (Kubernetes).

·Involved in migrating master data from Hadoop to AWS.

·Worked with Spark for improving performance and optimization of the existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and Pair RDDs.

·Developed a preprocessing job using Spark DataFrames to transform JSON documents into flat files.

·Loaded DStream data into Spark RDDs and did in-memory data computation to generate the output response.

·Processed big data with Amazon EMR across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).

·Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.

·Worked on Big Data infrastructure for batch processing and real-time processing using Apache Spark

·Developed Apache Spark applications by using Scala for data processing from various streaming sources

·Processed the Web server logs by developing Multi-Hop Flume agents by using Avro Sink and loaded into Cassandra for further analysis, Extracted files from Cassandra through Flume

·Responsible for design and development of Spark SQL Scripts based on Functional Specifications

·Worked on the large-scale Hadoop YARN cluster for distributed data processing and analysis using Spark, Hive, and Cassandra

·Involved in converting Cassandra/Hive/SQL queries into Spark transformations using RDDs and Scala.

·Implemented Spark scripts using Scala and Spark SQL to access Hive tables from Spark for faster processing of data.

·Developed helper classes for abstracting the Cassandra cluster connection, acting as a core toolkit.

·Involved in creating a data lake by extracting customers' data from various data sources into HDFS, including data from Excel, databases, and server log data.

·Moved data from HDFS to Cassandra using MapReduce and the BulkOutputFormat class.

·Extracted files from Cassandra through Sqoop, placed them in HDFS, and processed them using Hive.

·Wrote MapReduce (Hadoop) programs to convert text files into Avro and load them into Hive tables.

·Experienced in writing real-time processing and core jobs using Spark Streaming with Kafka as a data pipeline system

·Extended Hive/Pig core functionality by writing custom User Defined Functions (UDFs), User Defined Table-Generating Functions (UDTFs), and User Defined Aggregating Functions (UDAFs) for Hive and Pig.

·Involved in loading data from REST endpoints to Kafka producers and transferring the data to Kafka brokers (see the Kafka sketch at the end of this section).

·Used Apache Kafka capabilities such as distribution, partitioning, and the replicated commit log service for messaging.

·Partitioned data streams using Kafka; designed and configured the Kafka cluster to accommodate heavy throughput.

·Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team

·Used Apache Oozie for scheduling and managing multiple Hive Jobs. Knowledge of HCatalog for Hadoop based storage management

·Migrated an existing on-premises application to Amazon Web Services (AWS) and used its services like EC2 and S3 for small data sets processing and storage, experienced in maintaining the Hadoop cluster on AWS EMR

·Developed solutions to pre-process large sets of structured, semi-structured data, with different file formats like Text, Avro, Sequence, XML, JSON, and Parquet

·Generated various kinds of reports using Pentaho and Tableau based on Client specification

·Gained exposure to tools such as Jenkins, Chef, and RabbitMQ.

·Worked with SCRUM team in delivering agreed user stories on time for every Sprint
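A minimal sketch of the Kafka producer/consumer pattern mentioned above, using the kafka-python client. The broker address, topic name, and payload fields are hypothetical placeholders.

    import json
    from kafka import KafkaProducer, KafkaConsumer

    BROKERS = ["broker1:9092"]   # hypothetical broker address
    TOPIC = "order-events"       # hypothetical topic name

    # Producer: publish a JSON-encoded event to the topic.
    producer = KafkaProducer(
        bootstrap_servers=BROKERS,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send(TOPIC, value={"order_id": 1234, "status": "shipped"})
    producer.flush()  # block until buffered records are delivered

    # Consumer: read events back from the beginning of the topic.
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=BROKERS,
        auto_offset_reset="earliest",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    for message in consumer:
        print(message.value)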

Environment: Snowflake, SnowSQL, Hadoop, MapReduce, HDFS, Yarn, Hive, Sqoop, Oozie, Spark, Scala, AWS, EC2, S3, EMR, Cassandra, Flume, Kafka, Pig, Linux, Shell Scripting
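The Snowpipe continuous-ingestion step can be sketched as a one-time pipe definition executed through the Python connector. The stage, pipe, and table names are hypothetical, and the S3 event-notification wiring to the pipe's queue is configured separately.

    import snowflake.connector

    # Hypothetical credentials; real values would come from a secrets store.
    conn = snowflake.connector.connect(
        account="my_account", user="etl_user", password="...",
        warehouse="LOAD_WH", database="ANALYTICS", schema="STAGING",
    )

    # AUTO_INGEST = TRUE lets S3 event notifications trigger the pipe's COPY automatically.
    create_pipe = """
        CREATE OR REPLACE PIPE staging.orders_pipe AUTO_INGEST = TRUE AS
        COPY INTO staging.orders
        FROM @staging.s3_orders_stage
        FILE_FORMAT = (TYPE = 'JSON')
    """
    conn.cursor().execute(create_pipe)
    conn.close()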

Travelers Insurance, Westport, CT Jun 2019 – Dec 2019

Role: Data Engineer

Responsibilities:

·Worked on Snowflake Shared Technology Environment for providing stable infrastructure, secured environment, reusable generic frameworks, robust design architecture, technology expertise, best practices and automated SCBD (Secured Database Connections, Code Review, Build Process, Deployment Process) utilities.

·Designed ETL process using Pentaho Tool to load from Sources to Targets with Transformations.

·Worked on Snowflake Schemas and Data Warehousing.

·Developed Pentaho Bigdata jobs to load heavy volume of data into S3 data lake and then into Redshift data warehouse.

·Migrated the data from Redshift data warehouse to Snowflake database.

·Built dimensional models and a data vault architecture on Snowflake.

·Built a scalable distributed Hadoop cluster running Hortonworks Data Platform (HDP 2.6).

·Involved in developing Spark code using Scala and Spark SQL for faster testing and processing of data and explored optimizing it using SparkContext, Spark SQL, and Pair RDDs.

·Serializing JSON data and storing the data into tables using Spark SQL

·Used Spark Streaming to collect data from Kafka in near real time and perform the necessary transformations and aggregations to build the common learner data model, storing the data in a NoSQL store (HBase); see the streaming sketch at the end of this section.

·Worked on the Spark framework for both batch and real-time data processing.

·Hands-on experience with Spark MLlib, used for predictive intelligence, customer segmentation, and smooth maintenance in Spark Streaming.

·Developed Spark Streaming programs that take data from Kafka and push it to different targets.

·Loaded data from different data sources (Teradata, DB2, Oracle, and flat files) into HDFS using Sqoop and loaded it into partitioned Hive tables.

·Created different Pig scripts and wrapped them as shell commands to provide aliases for common operations in the project business flow.

·Implemented partitioning and bucketing in Hive for better organization of the data (see the Hive DDL sketch at the end of this section).

·Created a few Hive UDFs to hide or abstract complex, repetitive rules.

·Developed Oozie workflows for daily incremental loads, which get data from Teradata and import it into Hive tables.

·Developed bash scripts to bring log files from the FTP server and process them for loading into Hive tables.

·All the bash scripts are scheduled using Resource Manager Scheduler.

·Developed Map Reduce programs for applying business rules on the data.

·Developed a NiFi Workflow to pick up the data from Data Lake as well as from server and send that to Kafka broker

·Involved in loading and transforming large sets of structured data from router location to EDW using an Apache NiFi data pipeline flow

·Implemented Kafka event log producer to produce the logs into Kafka topic which are utilized by ELK (Elastic Search, Log Stash, Kibana) stack to analyze the logs produced by the Hadoop cluster

·Implemented Apache Kafka as a replacement for a more traditional message broker (JMS Solace) to reduce licensing costs, decouple processing from data producers, and buffer unprocessed messages.

·Implemented a receiver-based approach, working on Spark Streaming linked with the StreamingContext using Python and handling proper closing and waiting stages.

·Experience implementing rack topology scripts for the Hadoop cluster.

·Implemented changes to resolve issues related to the old Hazelcast EntryProcessor API.

·Used the Akka toolkit with Scala to perform a few builds.

·Excellent knowledge with Talend Administration console, Talend installation, using Context and global map variables in Talend

·Used dashboard tools like Tableau

·Used the Talend Admin Console Job Conductor to schedule ETL jobs on a daily and weekly basis.
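The Kafka-to-Spark streaming flow described above can be pictured with a short sketch. The work itself used Spark Streaming with an HBase sink; the sketch below swaps in PySpark Structured Streaming with a console sink purely to stay self-contained, and the broker address and topic name are hypothetical.

    from pyspark.sql import SparkSession, functions as F

    # Requires the spark-sql-kafka connector package on the Spark classpath.
    spark = SparkSession.builder.appName("kafka_stream_sketch").getOrCreate()

    events = (
        spark.readStream.format("kafka")
             .option("kafka.bootstrap.servers", "broker1:9092")   # hypothetical broker
             .option("subscribe", "learner-events")               # hypothetical topic
             .load()
    )

    # Kafka delivers key/value as bytes; cast the value to a string for downstream parsing.
    parsed = events.select(F.col("value").cast("string").alias("payload"))

    # The real pipeline persisted to HBase; a console sink keeps this sketch runnable anywhere.
    query = parsed.writeStream.outputMode("append").format("console").start()
    query.awaitTermination()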

Environment: Hadoop HDP, Linux, MapReduce, HBase, HDFS, Hive, Pig, Tableau, NoSQL, Shell Scripting, Sqoop, Open source technologies Apache Kafka, Apache Spark, Git, Talend.
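The Hive partitioning and bucketing mentioned above can be illustrated with a small DDL sketch submitted through PyHive to HiveServer2. The host, database, table, and column names are hypothetical, and the database is assumed to exist.

    from pyhive import hive

    # Hypothetical HiveServer2 host and user.
    conn = hive.Connection(host="hive-server", port=10000, username="etl_user")
    cursor = conn.cursor()

    # Partition by event_date for partition pruning; bucket by policy_id for efficient joins.
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS claims_db.policy_events (
            policy_id  STRING,
            event_type STRING,
            amount     DOUBLE
        )
        PARTITIONED BY (event_date STRING)
        CLUSTERED BY (policy_id) INTO 32 BUCKETS
        STORED AS ORC
    """)

    conn.close()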

Micro Intellects Solutions, Hyderabad, India Jan 2015 – June 2018

Role: Hadoop Engineer

Responsibilities:

·Interacted with the business requirements and design teams and prepared the low-level and high-level design documents.

·Provided in-depth technical and business knowledge to ensure efficient design, programming, implementation, and ongoing support for the application.

·Involved in identifying possible ways to improve the efficiency of the system.

·Worked on the logical implementation of and interaction with HBase.

·Efficiently put and fetched data to/from HBase by writing MapReduce jobs (a client-level sketch follows this section).

·Developed Map Reduce jobs to automate transfer of data from/to HBase.

·Assisted with the addition of Hadoop processing to the IT infrastructure.

·Implemented and executed MapReduce jobs to process the log data from the ad servers.

·Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.

·Worked on MongoDB and Cassandra.

·Prepared multi-cluster test harness to exercise the system for better performance.

Environment: Hadoop, HDFS, MapReduce, HBase, Hive, Cassandra, Hortonworks and Cloudera Hadoop distributions, SQL*Plus, and Oracle 10g.
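The HBase put/fetch work above was done through MapReduce jobs; as a much simpler, client-level illustration of the same put/get pattern (not the MapReduce job itself), here is a sketch using the happybase Python library, which talks to HBase via the Thrift server. The host, table, row key, and column names are hypothetical.

    import happybase

    # Hypothetical HBase Thrift host, table, and column family.
    connection = happybase.Connection(host="hbase-thrift-host")
    table = connection.table("ad_server_logs")

    # Put a row keyed by request id, then read it back.
    table.put(b"req-0001", {b"cf:status": b"200", b"cf:latency_ms": b"87"})
    row = table.row(b"req-0001")
    print(row[b"cf:status"])

    connection.close()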


