Data Engineer

Location:
Alpharetta, GA
Salary:
$70/hr on C2C
Posted:
May 04, 2022


VARUN

Sr Data Engineer

Mobile: +1-469-***-****

E-mail: adqytc@r.postjobfree.com

Professional Summary:

Over 7 years of experience as a Big Data Engineer, with expertise in designing data-intensive applications using the Hadoop ecosystem, big data analytics, cloud data engineering, data warehouse/data mart, data visualization, reporting, and data quality solutions.

In-depth knowledge of Hadoop architecture and its components like YARN, HDFS, Name Node, Data Node, Job Tracker, Application Master, Resource Manager, Task Tracker and the MapReduce programming paradigm.

Extensive experience in Hadoop-based development of enterprise-level solutions utilizing Hadoop components such as Apache Spark, MapReduce, HDFS, Sqoop, PIG, Hive, HBase, Oozie, Flume, NiFi, Kafka, Zookeeper, and YARN.

Experience in developing web applications by using Python, Django, C++, XML, CSS, HTML, JavaScript and jQuery.

Profound experience in performing data ingestion and data processing (transformations, enrichment, and aggregations). Strong knowledge of distributed systems architecture and parallel processing, with an in-depth understanding of the MapReduce programming paradigm and the Spark execution framework.

Experience with Requests, Report Lab, NumPy, SciPy, Pytables, cv2, imageio, Python-Twitter, Matplotlib, HTTPLib2, Urllib2, Beautiful Soup, Data Frame and Pandas python libraries during development lifecycle.

Experienced with Spark in improving the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark-SQL, the DataFrame API, Spark Streaming, MLlib, and Pair RDDs.

Experience with Azure SQL Data Warehouse, controlling and granting database access, and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory; worked extensively on PySpark and Scala.

Handled ingestion of data from different data sources into HDFS using Sqoop and Flume, performed transformations using Hive and MapReduce, and loaded the processed data into HDFS. Managed Sqoop jobs with incremental load to populate Hive external tables.

Good working experience in using Apache Hadoop eco system components like MapReduce, HDFS, Hive, Sqoop, Pig, Oozie, Flume, HBase and Zookeeper.

Extensive experience working on spark in performing ETL using Spark Core, Spark-SQL and Real-time data processing using Spark Streaming.

Extensively worked with Kafka as middleware for real-time data pipelines.

Writing UDFs and integrating with Hive and Pig using Java.

Expertise in creating, debugging, scheduling and monitoring jobs using Airflow and Oozie. Experienced with the most commonly used Airflow operators: PythonOperator, BashOperator, GoogleCloudStorageDownloadOperator, GoogleCloudStorageObjectSensor, and GoogleCloudStorageToS3Operator.
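
Illustrative sketch of such an Airflow DAG (a minimal example assuming Airflow 2.x; the DAG id, task names and schedule are hypothetical, not from a specific project):

    # Minimal Airflow DAG sketch; assumes Airflow 2.x, hypothetical task names and schedule.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator


    def extract_partition(**context):
        # Placeholder callable; real logic would pull that day's partition.
        print("extracting partition for", context["ds"])


    with DAG(
        dag_id="daily_ingest_sketch",
        start_date=datetime(2022, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = PythonOperator(task_id="extract", python_callable=extract_partition)
        load = BashOperator(task_id="load", bash_command="echo 'load step placeholder'")
        extract >> load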

Hands-on experience in handling database issues and connections with SQL and NoSQL databases such as MongoDB, HBase, Cassandra, SQL server, and PostgreSQL. Created Java apps to handle data in MongoDB and HBase. Used Phoenix to create SQL layer on HBase.

Experience in designing and creating RDBMS Tables, Views, User Created Data Types, Indexes, Stored Procedures, Cursors, Triggers and Transactions.

Expert in designing ETL data flows by creating mappings/workflows to extract data from SQL Server, and in data migration and transformation from Oracle, Access, and Excel sheets using SQL Server SSIS.

Expert in designing Parallel jobs using various stages like Join, Merge, Lookup, remove duplicates, Filter, Dataset, Lookup file set, Complex flat file, Modify, Aggregator, XML.

Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Amazon Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SES, SQS, Lambda, EMR and other services of the AWS family.

Experienced in fact dimensional modeling (Star schema, Snowflake schema), transactional modeling and SCD (Slowly changing dimension).

Experience with Azure Data Factory (ADF), Integration Runtime (IR), file system data ingestion, and relational data ingestion.

Experience with Docker container orchestration using ECS, ALB, and Lambda.

Experienced with JSON-based RESTful web services and XML/QML-based SOAP web services; also worked on various applications using Python IDEs such as Sublime Text and PyCharm.

Building and productionizing predictive models on large datasets by utilizing advanced statistical modeling, machine learning, or other data mining techniques.

Developed intricate algorithms based on deep-dive statistical analysis and predictive data modeling that were used to deepen relationships, strengthen longevity and personalize interactions with customers.

Technical Skills:

Big Data Technologies: Hadoop, MapReduce, HDFS, Sqoop, PIG, Hive, HBase, Oozie, Flume, NiFi, Kafka, Zookeeper, Yarn, Apache Spark, Mahout, Sparklib

Databases: Oracle, MySQL, SQL Server, MongoDB, Cassandra, DynamoDB, PostgreSQL, Teradata, Cosmos.

Programming: Python, PySpark, Pandas, Scala, Java, C, C++, Shell script, Perl script, SQL

Cloud Technologies: AWS (Lambda, EC2, EMR, Amazon S3, Kinesis, SageMaker), Microsoft Azure, GCP

Frameworks: Django REST framework, MVC, Hortonworks

Tools: PyCharm, Eclipse, Visual Studio, SQL*Plus, SQL Developer, TOAD, SQL Navigator, Query Analyzer, SQL Server Management Studio, SQL Assistance, Postman

Versioning tools: SVN, Git, GitHub (Version Control)

Network Security: Kerberos

Database Modelling: Dimension Modeling, ER Modeling, Star Schema Modeling, Snowflake Modeling

Monitoring Tool: Apache Airflow, Agile, Jira, Rally

Visualization/ Reporting: Tableau, ggplot2, matplotlib, SSRS and Power BI

Machine Learning Techniques: Linear & Logistic Regression, Classification and Regression Trees, Random Forest, Associative rules, NLP and Clustering.

Machine Learning Tools: Scikit-learn, Pandas, TensorFlow, SparkML, SAS, R, Keras

Professional Experience:

Walgreens, Chicago, IL Aug 2020 – Present

Role: Sr Data Engineer

Responsibilities:

Extensive experience in working with AWS cloud Platform (EC2, S3, EMR, Redshift, Lambda and Glue).

Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.

Developed REST APIs using Python with the Flask and Django frameworks and integrated various data sources including Java, JDBC, RDBMS, shell scripting, spreadsheets, and text files.
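
Illustrative sketch of a minimal Flask REST endpoint (the route and payload fields are assumptions, not the actual service):

    # Minimal Flask REST API sketch; route and payload fields are illustrative.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/api/records", methods=["GET", "POST"])
    def records():
        if request.method == "POST":
            payload = request.get_json(force=True)   # incoming JSON record
            # A real service would persist this to the backing RDBMS.
            return jsonify({"status": "accepted", "record": payload}), 201
        return jsonify({"records": []})              # placeholder listing

    if __name__ == "__main__":
        app.run(debug=True)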

Working knowledge of Spark RDD, Data Frame API, Data set API, Data Source API, Spark SQL and Spark Streaming.

Developed Spark Applications by using Python and Implemented Apache Spark data processing Project to handle data from various RDBMS and Streaming sources.

Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python.
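
A minimal pandas/NumPy sketch of this kind of cleaning and feature scaling (the file and column names are assumed):

    # Illustrative cleaning, feature engineering and min-max scaling with pandas/NumPy.
    import numpy as np
    import pandas as pd

    df = pd.read_csv("input.csv")                          # hypothetical source file
    df = df.drop_duplicates().dropna(subset=["amount"])    # basic cleaning
    df["amount_log"] = np.log1p(df["amount"])              # simple engineered feature
    # Min-max scale a numeric column into the [0, 1] range.
    rng = df["amount"].max() - df["amount"].min()
    df["amount_scaled"] = (df["amount"] - df["amount"].min()) / rng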

Worked with Spark to improve the performance and optimization of the existing algorithms in Hadoop, using Spark Context, Spark-SQL, Spark MLlib, DataFrames, Pair RDDs and Spark on YARN.

Used Spark Streaming APIs to perform transformations and actions on the fly for building a common learner data model, which gets the data from Kafka in real time and persists it to Cassandra.

Developed a Kafka consumer API in Python for consuming data from Kafka topics.
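
A sketch of what such a consumer can look like with the kafka-python client (the topic name, brokers and group id are assumptions):

    # Kafka consumer sketch using kafka-python; connection details are illustrative.
    import json

    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "events",                                   # hypothetical topic
        bootstrap_servers=["localhost:9092"],
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        auto_offset_reset="earliest",
        group_id="example-consumer-group",
    )

    for message in consumer:
        # Deserialized payload is available on message.value.
        print(message.topic, message.offset, message.value)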

Consumed Extensible Markup Language (XML) messages using Kafka and processed the XML file using Spark Streaming to capture User Interface (UI) updates.

Developed a pre-processing job using Spark DataFrames to flatten JSON documents into flat files.
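
A hedged sketch of flattening nested JSON with Spark DataFrames (the field names and paths are assumed):

    # Flatten nested JSON into a flat file with PySpark; schema and paths are illustrative.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode

    spark = SparkSession.builder.appName("flatten-json-sketch").getOrCreate()

    raw = spark.read.json("hdfs:///data/raw/events.json")      # hypothetical input
    flat = (
        raw.select(
            col("id"),
            col("user.name").alias("user_name"),
            explode(col("items")).alias("item"),
        )
        .select("id", "user_name", col("item.sku"), col("item.qty"))
    )
    flat.write.mode("overwrite").option("header", True).csv("hdfs:///data/flat/events")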

Loaded D-Stream data into Spark RDDs and performed in-memory data computation to generate the output response.

Experienced in writing live Real-time Processing and core jobs using Spark Streaming with Kafka as a Data pipeline system.
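
As a rough illustration, a Structured Streaming variant of a Kafka-to-HDFS pipeline might look like the sketch below; the resume work used Spark Streaming/DStreams, and this example (topic, brokers and paths included) is an assumption that also requires the spark-sql-kafka connector package:

    # Structured Streaming sketch: Kafka source to Parquet sink on HDFS.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

    stream = (
        spark.readStream.format("kafka")
             .option("kafka.bootstrap.servers", "localhost:9092")
             .option("subscribe", "events")
             .load()
    )

    # Kafka delivers key/value as binary; cast the value to string before parsing downstream.
    parsed = stream.select(col("value").cast("string").alias("json_value"))

    query = (
        parsed.writeStream.format("parquet")
              .option("path", "hdfs:///data/stream/out")
              .option("checkpointLocation", "hdfs:///data/stream/_chk")
              .start()
    )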

Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for data sets processing and storage.

Experienced in Maintaining the Hadoop cluster on AWS EMR.

Loaded data into S3 buckets using AWS Glue and PySpark. Involved in filtering data stored in S3 buckets using Elasticsearch and loaded data into Hive external tables.
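
A minimal sketch of landing curated data in S3 with PySpark and exposing it as a Hive external table (bucket, database and column names are assumptions):

    # Write curated Parquet to S3 and register a Hive external table over it.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("s3-to-hive-sketch")
        .enableHiveSupport()
        .getOrCreate()
    )

    df = spark.read.parquet("s3://example-bucket/raw/orders/")       # hypothetical bucket
    df.filter("order_status = 'COMPLETE'") \
      .write.mode("overwrite").parquet("s3://example-bucket/curated/orders/")

    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS analytics.orders_curated (
            order_id STRING, order_status STRING, order_total DOUBLE)
        STORED AS PARQUET
        LOCATION 's3://example-bucket/curated/orders/'
    """)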

Configured Snowpipe to pull the data from S3 buckets into Snowflake tables.
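
A hedged sketch of defining such a pipe with the Snowflake Python connector (the account, stage, table and pipe names are assumptions):

    # Create a Snowpipe over an S3-backed stage; identifiers and credentials are illustrative.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="example_account", user="example_user", password="***",
        warehouse="LOAD_WH", database="ANALYTICS", schema="STAGING",
    )

    conn.cursor().execute("""
        CREATE PIPE IF NOT EXISTS staging.orders_pipe
        AUTO_INGEST = TRUE
        AS COPY INTO staging.orders_raw
           FROM @staging.s3_orders_stage
           FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
    """)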

Stored incoming data in the Snowflake staging area.

Created numerous ODI interfaces and loaded data into Snowflake DB.

Worked on Amazon Redshift to consolidate all data warehouses into one data warehouse.

Good understanding of Cassandra architecture, replication strategy, gossip, snitches etc.

Designed columnar families in Cassandra and Ingested data from RDBMS, performed data transformations, and then exported the transformed data to Cassandra as per the business requirement.

Used the Spark Data Cassandra Connector to load data to and from Cassandra.

Worked from scratch on Kafka configurations such as managers and brokers.

Designed tables for quick searching, sorting and grouping using the Cassandra Query Language.

Used HiveQL to analyze the partitioned and bucketed data, and executed Hive queries on Parquet tables stored in Hive to perform data analysis and meet the business specification logic.
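
For illustration, a partition-pruned HiveQL query of this kind can be run through Spark with Hive support (table and column names are assumed):

    # HiveQL over a partitioned, Parquet-backed table; names are illustrative.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("hiveql-sketch")
        .enableHiveSupport()
        .getOrCreate()
    )

    daily_totals = spark.sql("""
        SELECT store_id, SUM(sale_amount) AS total_sales
        FROM sales.transactions_parquet
        WHERE sale_date = '2021-06-01'        -- partition column, enables pruning
        GROUP BY store_id
    """)
    daily_totals.show()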

Used Apache Kafka to aggregate web log data from multiple servers and make it available to downstream systems for data analysis and engineering teams.

Worked in Implementing Kafka Security and Boosting its performance.

Developed custom UDFs in Python and used them for sorting and preparing the data.
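
A small sketch of a Python UDF used to normalize a key before sorting (the function and column names are hypothetical):

    # Custom PySpark UDF that normalizes a key so sorting/grouping is consistent.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf-sketch").getOrCreate()

    @udf(returnType=StringType())
    def normalize_key(value):
        # Trim whitespace and lower-case; pass nulls through untouched.
        return value.strip().lower() if value is not None else None

    df = spark.createDataFrame([(" Alpha ",), ("beta",)], ["raw_key"])
    df.withColumn("sort_key", normalize_key("raw_key")).orderBy("sort_key").show()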

Worked on custom loaders and storage classes in Pig to handle several data formats like JSON, XML and CSV, and generated bags for processing using Pig.

Developed Sqoop and Kafka Jobs to load data from RDBMS, External Systems into HDFS and HIVE.

Developed Oozie coordinators to schedule Hive scripts to create Data pipelines.

Wrote several MapReduce jobs using PySpark and NumPy, and used Jenkins for continuous integration.

Developed JSON scripts for deploying the pipeline in Azure Data Factory (ADF) that processes the data using the SQL Activity.

Set up and worked on Kerberos authentication principals to establish secure network communication on the cluster, and tested HDFS, Hive, Pig and MapReduce access for new users.

Continuous monitoring and managing the Hadoop cluster through Cloudera Manager.

Environment: Spark, Spark Streaming, Spark SQL, Django, AWS EMR, MapR, HDFS, Hive, Pig, Apache Kafka, Sqoop, Python, PySpark, shell scripting, Linux, MySQL, Oracle Enterprise DB, SOLR, Jenkins, Eclipse, Oracle, Git, Oozie, Tableau, SOAP, Cassandra & Agile methodologies.

State Farm, Bloomington, IL Apr 2019 – Jul 2020

Role: Big Data / Hadoop Engineer

Responsibilities:

Worked as a Sr. Big Data/Hadoop Engineer with Hadoop ecosystem components like HBase, Sqoop, Zookeeper, Oozie, Hive and Pig on the Cloudera Hadoop distribution.

Involved in Agile development methodology as an active member in scrum meetings.

Worked in Azure environment for development and deployment of Custom Hadoop Applications.

Designed and implemented scalable Cloud Data and Analytical architecture solutions for various public and private cloud platforms using Azure.

Involved in the end-to-end process of Hadoop jobs that used various technologies such as Sqoop, PIG, Hive, MapReduce, Spark, and shell scripts.

Implemented various Azure platforms such as Azure SQL Database, Azure SQL Data Warehouse, Azure Analysis Services, HDInsight, Azure Data Lake and Data Factory.

Extracted and loaded data into Data Lake environment (MS Azure) by using Sqoop which was accessed by business users.

Developed interfaces using middleware tools: IIB, DataStage, ITX and Sterling MFT.

Managed and supported enterprise data warehouse operations and big data advanced predictive application development using Cloudera & Hortonworks HDP.

Developed PIG scripts to transform the raw data into intelligent data as specified by business users.

Utilized Apache Spark with Python to develop and execute Big Data Analytics and Machine learning applications, executed machine learning use cases under Spark ML and MLLib.

Installed Hadoop, Map Reduce, HDFS, Azure to develop multiple MapReduce jobs in PIG and Hive for data cleansing and pre-processing.

Used Spark API over Hortonworks Hadoop YARN to perform analytics on data in Hive.

Improved the performance and optimization of the existing algorithms in Hadoop using Spark Context, Spark-SQL, Data Frame, Pair RDD's, Spark YARN.

Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.

Developed a Spark job in Java which indexes data into Elastic Search from external Hive tables which are in HDFS.

Performed transformations, cleaning and filtering on imported data using Hive, MapReduce, and loaded final data into HDFS.

Imported data from different sources like HDFS/HBase into Spark RDDs and developed a data pipeline using Kafka and Storm to store data in HDFS.

Used Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS using Scala, and in NoSQL databases such as HBase and Cassandra.

Documented the requirements including the available code which should be implemented using Spark, Hive, HDFS, HBase and Elastic Search.

Performed transformations like event joins, filtering bot traffic and some pre-aggregations using Pig.

Explored MLlib algorithms in Spark to understand the possible machine learning functionalities that could be used for our use case.

Used Windows Azure SQL Reporting Services to create reports with tables, charts, and maps.

Executed Hive queries on Parquet tables stored in Hive to perform data analysis to meet the business requirements.

Configured Oozie workflow to run multiple Hive and Pig jobs which run independently with time and data availability.

Imported and exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.

Environment: Hadoop 3.0, Azure, Sqoop 1.4.6, PIG 0.17, Hive 2.3, MapReduce, Spark 2.2.1, shell scripts, SQL, Hortonworks, Python, MLlib, HDFS, YARN, Java, Kafka 1.0, Cassandra 3.11, Oozie, Agile

Merck, Branchburg, NJ Dec 2017 – Mar 2019

Role: Big Data/Hadoop Engineer

Responsibilities:

Imported data in various formats like JSON, Sequence, Text, CSV, Avro and Parquet into the HDFS cluster, with compression for optimization.

Worked on ingesting data from RDBMS sources like Oracle, SQL Server and Teradata into HDFS using Sqoop.

Loaded all datasets into Hive and Cassandra from source CSV files using Spark.

Created environment to access Loaded Data via Spark SQL, through JDBC & ODBC (via Spark Thrift Server).

Developed real time data ingestion/ analysis using Kafka / Spark-streaming.

Created Hive External tables and loaded the data into tables and query data using HQL.
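
A minimal sketch of creating and querying such an external table through Spark's Hive support (the schema, paths and database names are assumptions):

    # Define a Hive external table over landed text files and query it with HQL.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("hive-external-sketch")
        .enableHiveSupport()
        .getOrCreate()
    )

    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS logs.web_events (
            event_ts STRING, user_id STRING, url STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        STORED AS TEXTFILE
        LOCATION 'hdfs:///data/landing/web_events/'
    """)

    spark.sql("""
        SELECT user_id, COUNT(*) AS hits
        FROM logs.web_events
        GROUP BY user_id
        ORDER BY hits DESC
        LIMIT 10
    """).show()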

Wrote Hive jobs to parse the logs and structure them in a tabular format to facilitate effective querying of the log data.

Used Hive join queries to join multiple tables of a source system and load them into Elasticsearch tables.

Performed real-time streaming and transformations on the data using Kafka and Kafka Streams.

Implemented Spark using Scala and Spark SQL for faster testing and processing of data.

Experience in managing and reviewing huge Hadoop log files.

Involved in Cluster maintenance, Cluster Monitoring and Troubleshooting.

Migrated the computational code in HQL to PySpark.

Completed data extraction, aggregation and analysis in HDFS using PySpark and stored the needed data in Hive.
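
As a sketch, this kind of extract-aggregate-persist flow in PySpark can look like the following (the HDFS path, columns and target table are assumed):

    # Read from HDFS, aggregate, and persist the result as a Hive table.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder.appName("hdfs-agg-sketch")
        .enableHiveSupport()
        .getOrCreate()
    )

    events = spark.read.parquet("hdfs:///data/clean/events/")      # hypothetical path
    daily = (
        events.withColumn("event_date", F.to_date("event_ts"))
              .groupBy("event_date", "event_type")
              .agg(F.count(F.lit(1)).alias("event_count"))
    )
    daily.write.mode("overwrite").saveAsTable("analytics.daily_event_counts")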

Developed Python code to gather the data from HBase (Cornerstone) and designed the solution to implement it using PySpark.

Maintained technical documentation for each step of the development environment and for launching Hadoop clusters.

Worked on different file formats like Parquet, ORC, Avro and Sequence files using MapReduce/Hive/Impala.

Worked with Avro Data Serialization system to work with JSON data formats.

Used Amazon Web Services (AWS) S3 to store large amount of data in identical/similar repository.

Worked with the Data Science team to gather requirements for various data mining projects.

Automated the process of rolling day-to-day reporting by writing shell scripts.

Involved in building applications using Maven and integrated with continuous integration servers like Jenkins to build jobs.

Worked on BI tools such as Tableau to create dashboards and weekly, monthly and daily reports using Tableau Desktop and published them to the HDFS cluster.

Environment: Spark, Spark SQL, Spark Streaming, Scala, Kafka, Hadoop, HDFS, Hive, Oozie, Pig, Nifi, Sqoop, AWS (EC2, S3, EMR), Shell Scripting, HBase, Jenkins, Tableau, Oracle, MySQL, Teradata and AWS.

HighMark – Pittsburgh, PA Aug 2016 – Nov 2017

Role: Hadoop/Bigdata Engineer

Responsibilities:

Worked as Hadoop Engineer and responsible for taking care of everything related to the clusters.

Developed Spark scripts using Java and Python shell commands as per the requirements.

Involved in ingesting data received from various relational database providers onto HDFS for analysis and other big data operations.

Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.

Worked on Spark SQL and Data frames for faster execution of Hive queries using Spark SQL Context.

Performed analysis on implementing Spark using Scala.

Used Data frames/ Datasets to write SQL type queries using Spark SQL to work with datasets on HDFS.

Extracted files from MongoDB through Sqoop, placed them in HDFS, and processed them.

Created and imported various collections, documents into MongoDB and performed various actions like query, project, aggregation, sort and limit.
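
A short pymongo sketch of those query, projection, aggregation, sort and limit operations (the database, collection and field names are assumptions):

    # Basic MongoDB operations with pymongo; identifiers are illustrative.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")      # hypothetical connection string
    orders = client["shop"]["orders"]

    orders.insert_one({"customer": "c1", "total": 42.5, "status": "PAID"})

    # Query with a projection, then an aggregation with sort and limit.
    paid = orders.find({"status": "PAID"}, {"_id": 0, "customer": 1, "total": 1})
    top_customers = orders.aggregate([
        {"$group": {"_id": "$customer", "spend": {"$sum": "$total"}}},
        {"$sort": {"spend": -1}},
        {"$limit": 5},
    ])
    for doc in top_customers:
        print(doc)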

Extensively experienced in deploying, managing and developing MongoDB clusters.

Created Hive tables to import large data sets from various relational databases using Sqoop and export the analyzed data back for visualization and report generation by the BI team.

Involved in creating Shell scripts to simplify the execution of all other scripts (Pig, Hive, Sqoop, Impala and MapReduce) and move the data inside and outside of HDFS.

Implemented some of the big data operations on AWS cloud.

Used Hibernate reverse engineering tools to generate domain model classes, perform association mapping and inheritance mapping using annotations and XML.

Developed Pig Scripts, Pig UDFs and Hive Scripts, Hive UDFs to analyze HDFS data.

Maintained the cluster securely using Kerberos and kept the cluster up and running at all times.

Experienced in loading and transforming large sets of structured, semi-structured, and unstructured data using Sqoop from the Hadoop Distributed File System to relational database systems.

Created Hive tables to store the processed results in a tabular format.

Used Hive QL to analyze the partitioned and bucketed data and compute various metrics for reporting.

Performed data transformations by writing MapReduce as per business requirements.

Implemented schema extraction for Parquet and Avro file Formats in Hive.

Involved in implementing and integrating various NoSQL databases like HBase and Cassandra.

Queried and analyzed data from Cassandra for quick searching, sorting, and grouping through CQL.
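
A brief sketch of running such CQL reads through the DataStax Python driver (the contact point, keyspace, table and partition key are assumptions):

    # Query Cassandra with CQL via the Python driver; names are illustrative.
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])                 # hypothetical contact point
    session = cluster.connect("weblogs")             # hypothetical keyspace

    # Clustering order on the table returns rows pre-sorted within the partition.
    rows = session.execute(
        "SELECT page, hits FROM page_hits WHERE day = %s LIMIT 20", ("2017-06-01",)
    )
    for row in rows:
        print(row.page, row.hits)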

Responsible for developing a data pipeline using Flume, Sqoop and Pig to extract the data from weblogs and store it in HDFS.

Environment: Java, Spark, Python, HDFS, YARN, Hive, Scala, SQL, MongoDB, Sqoop, AWS, Pig, MapReduce, Cassandra, NoSQL.

Sutherlands, India Apr 2013 – Jul 2015

Role: Data Analyst/Engineer

Responsibilities:

Responsible for gathering requirements from Business Analyst and Operational Analyst and identifying the data sources required for the request.

Created value from data and drove data-driven decisions by applying advanced analytics and statistical techniques to deepen insights, determine optimal solution architecture, and improve efficiency, maintainability, and scalability, making predictions and generating recommendations.

Enhancing data collection procedures to include information that is relevant for building analytic systems.

Worked closely with a data architect to review all the conceptual, logical and physical database design models with respect to functions, definition, maintenance, review, and support of data analysis, data quality and the ETL design that feeds the logical data models.

Maintained and developed complex SQL queries, stored procedures, views, functions, and reports that qualify customer requirements using SQL Server 2012.

Creating automated anomaly detection systems and constantly tracking their performance.

Support Sales and Engagement's management planning and decision making on sales incentives.

Used statistical analysis, simulations, predictive modelling to analyze information and develop practical solutions to business problems.

Extending the company's data with third-party sources of information when needed.

Developed several types of sub-reports, drill-down reports, summary reports, parameterized reports, and ad-hoc reports using SSRS, delivered through mailing server subscriptions and SharePoint server.

Generated ad-hoc reports using Crystal Reports 9 and SQL Server Reporting Services (SSRS).

Developed the reports and visualizations based on the insights mainly using Tableau and dashboards for the company insight teams.

Environment: SQL Server 2012, SSRS, SSIS, SQL Profiler, Tableau, QlikView, Agile, ETL, anomaly detection.


