Abhilash
***********@*****.***
Sr. Data Engineer
Professional Summary:
Data Engineering professional with 7.9 years of combined experience in data engineering, Big Data implementations, and Spark technologies.
Experience in Big Data ecosystems using Hadoop, Pig, Hive, HDFS, MapReduce, Sqoop, Storm, Spark, Airflow, Snowflake, Teradata, Flume, Kafka, Yarn, Oozie, and Zookeeper.
High exposure to Big Data technologies and the Hadoop ecosystem, with in-depth understanding of MapReduce and Hadoop infrastructure.
Expertise in writing end-to-end data processing jobs to analyze data using MapReduce, Spark, and Hive.
Experience with the Apache Spark ecosystem, including Spark Core, Spark SQL, DataFrames, and RDDs, and working knowledge of Spark MLlib.
Experienced in data manipulation using Python for loading and extraction, as well as with Python libraries such as NumPy, SciPy, and Pandas for data analysis and numerical computation.
Experienced in improving the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
Solid experience with and understanding of designing and operationalizing large-scale data and analytics solutions on the Snowflake Data Warehouse.
Developed ETL pipelines in and out of the data warehouse using a combination of Python and SnowSQL.
Experience in extracting files from MongoDB through Sqoop, placing them in HDFS, and processing them.
Worked with NoSQL databases like HBase, creating HBase tables to load large sets of semi-structured data coming from various sources.
Implemented a cluster for the NoSQL tool HBase as part of a POC to address HBase limitations.
Strong knowledge of Spark architecture and components, and efficient in working with Spark Core.
Strong knowledge of Hive analytical functions, extending Hive functionality by writing custom UDFs.
Expertise in writing MapReduce jobs in Python for processing large structured, semi-structured, and unstructured data sets and storing them in HDFS.
Good understanding of data modeling (Dimensional & Relational) concepts like Star-Schema Modeling, Snowflake Schema Modeling, Fact and Dimension tables.
Used Amazon Web Services Elastic Compute Cloud (AWS EC2) to launch cloud instances.
Hands-on experience working with Amazon Web Services (AWS), using Elastic MapReduce (EMR), Redshift, and EC2 for data processing.
Hands-on experience with SQL and NoSQL databases such as Snowflake, HBase, Cassandra, and MongoDB.
Hands-on experience in setting up workflows using Apache Airflow and the Oozie workflow engine for managing and scheduling Hadoop jobs.
Experience in data warehousing and business intelligence across various domains.
Created Tableau dashboards designed for large data volumes sourced from SQL Server.
Extracted, transformed, and loaded (ETL) source data into the respective target tables to build the required data marts.
Active involvement in all scrum ceremonies - Sprint Planning, Daily Scrum, Sprint Review and Retrospective meetings and assisted Product owner in creating and prioritizing user stories.
Strong experience in working with UNIX/LINUX environments, writing shell scripts.
Worked with various formats of files like delimited text files, clickstream log files, Apache log files, Avro files, JSON files, XML Files.
Strong analytical, presentation, communication, and problem-solving skills, with the ability to work independently as well as in a team and to follow the best practices and principles defined for the team.
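As an illustrative sketch of the Python MapReduce work described above (a minimal word-count example; in a real Hadoop Streaming job the mapper and reducer would be separate scripts, and all names here are hypothetical):

```python
# Minimal Hadoop Streaming-style MapReduce word count in pure Python.
# In production the mapper and reducer run as separate scripts driven by
# hadoop-streaming; here they are chained in-process for illustration.
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Emit (word, 1) pairs, one per word, as a Streaming mapper would via stdout."""
    for line in lines:
        for word in line.strip().split():
            yield (word.lower(), 1)

def reducer(pairs):
    """Sum counts per key; the sort stands in for the shuffle phase."""
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (key, sum(count for _, count in group))

if __name__ == "__main__":
    sample = ["big data big spark", "spark streaming"]
    print(dict(reducer(mapper(sample))))
```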
Technical Skills:
Hadoop/Spark Ecosystem: Hadoop, MapReduce, Pig, Hive/Impala, YARN, Kafka, Flume, Oozie, Zookeeper, Spark, Airflow, MongoDB, Cassandra, HBase, and Storm.
Hadoop Distributions: Cloudera and Hortonworks.
Programming Languages: Scala, Hibernate, JDBC, JSON, HTML, CSS, SQL, R, Shell Scripting
Script Languages: JavaScript, jQuery, Python.
Databases: Oracle, SQL Server, MySQL, Cassandra, Teradata, PostgreSQL, MS Access, Snowflake; NoSQL databases (HBase, MongoDB).
Operating Systems: Linux, Windows, Ubuntu, Unix
Web/Application Servers: Apache Tomcat, WebLogic, WebSphere
IDEs/Tools: Eclipse, NetBeans
Data Visualization Tools: Tableau, Power BI, SAS, Excel, ETL
OLAP/Reporting: SQL Server Analysis Services and Reporting Services.
Cloud Technologies: MS Azure, Amazon Web Services (AWS).
Professional Experience:
Peraton, Reston, Virginia Jan 2022 to Present
Senior Azure Data Engineer
Responsibilities:
Used Azure Data Factory extensively for ingesting data from disparate source systems.
Used Azure Data Factory as an orchestration tool for integrating data from upstream to downstream systems.
Automated jobs using different triggers (Event, Scheduled and Tumbling) in ADF.
Used Cosmos DB for storing catalog data and for event sourcing in order processing pipelines.
Designed and developed user-defined functions, stored procedures, and triggers for Cosmos DB.
Analyzed the data flow from different sources to target to provide the corresponding design Architecture in Azure environment.
Took initiative and ownership to deliver business solutions on time.
Created High level technical design documents and Application design documents as per the requirements and delivered clear, well-communicated and complete design documents.
Created DA specs and Mapping Data flow and provided the details to the developer along with HLDs.
Created Build definition and Release definition for Continuous Integration and Continuous Deployment.
Created Application Interface Document for the downstream to create a new interface to transfer and receive the files through Azure Data Share.
Created pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks.
Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming to perform streaming analytics in Databricks.
Created and provisioned the different Databricks clusters needed for batch and continuous streaming data processing, and installed the required libraries on the clusters.
Integrated Azure Active Directory authentication into every Cosmos DB request and demoed the feature to stakeholders.
Improved performance by optimizing the computing time to process streaming data and saved the company cost by optimizing cluster run time.
Performed ongoing monitoring, automation, and refinement of data engineering solutions; prepared complex SQL views and stored procedures in Azure SQL DW and Hyperscale.
Designed and developed a new solution to process NRT data using Azure Stream Analytics, Azure Event Hub, and Service Bus queues.
Created a linked service to land data from an SFTP location into Azure Data Lake.
Created several Databricks Spark jobs with PySpark to perform table-to-table operations.
Extensively used SQL Server Import and Export Data tool.
Created database users, logins, and permissions as part of environment setup.
Worked with complex SQL, stored procedures, triggers, and packages in large databases across various servers.
Helped team members resolve technical issues; handled troubleshooting, project risk and issue identification and management, resource issues, monthly one-on-ones, and weekly meetings.
Environment: Azure Cloud, Azure Data Factory (ADF v2), Azure Function Apps, Azure Data Lake, Blob Storage, SQL Server, Teradata utilities, Windows Remote Desktop, UNIX shell scripting, Azure PowerShell, Databricks, Python, Erwin Data Modeling Tool, Azure Cosmos DB, Azure Stream Analytics, Azure Event Hub, Azure Machine Learning.
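As an illustrative sketch of the tumbling-window semantics behind the ADF tumbling triggers used in this role (contiguous, non-overlapping, fixed-size time windows over a range; pure Python, helper names are hypothetical and not ADF APIs):

```python
# Tumbling-window slicing as an ADF tumbling trigger would schedule it:
# the time range is cut into contiguous, non-overlapping windows of a
# fixed size, each driving one pipeline run.
from datetime import datetime, timedelta

def tumbling_windows(start: datetime, end: datetime, size: timedelta):
    """Yield (window_start, window_end) slices covering [start, end)."""
    cursor = start
    while cursor < end:
        yield (cursor, min(cursor + size, end))
        cursor += size

if __name__ == "__main__":
    for w_start, w_end in tumbling_windows(
            datetime(2022, 1, 1), datetime(2022, 1, 2), timedelta(hours=8)):
        print(w_start.isoformat(), "->", w_end.isoformat())
```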
Anthem, Indianapolis, Indiana Feb 2020 to Dec 2021
Senior Data Engineer
Responsibilities:
Experienced in development using Cloudera Distribution System.
Designed and developed ETL integration patterns using Python on Spark.
Optimized PySpark jobs to run on secured clusters for faster data processing.
Developed Spark applications in Python (PySpark) on a distributed environment to load a large number of CSV files with differing schemas into Hive ORC tables.
Designed and developed Apache NiFi jobs to move files from transactional systems into the data lake raw zone.
Analyzed the user requirements and implemented the use cases using Apache NiFi.
Worked on reading and writing multiple data formats like JSON, ORC, Parquet on HDFS using PySpark.
Experienced with Azure Micro Services, Azure Functions and Azure Solutions.
Hands-on experience with Azure VPN (Point-to-Site), virtual networks, Azure custom security, endpoint security, and firewalls.
Worked with caching, SQL Azure, NoSQL, Storage, network services, Azure Active Directory, API Management, scheduling, auto scaling, and PowerShell automation.
Involved in developing the Azure Solution and Services like IaaS and PaaS.
As a Hadoop Developer, my role is to manage the Data Pipelines and Data Lake.
Experience working with the Snowflake data warehouse and hands-on experience with SnowSQL.
Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
Designed custom Spark REPL application to handle similar datasets.
Used Hadoop scripts for HDFS (Hadoop Distributed File System) data loading and manipulation.
Performed Hive test queries on local sample files and HDFS files.
Developed the application on Eclipse IDE.
Worked on configuration of internal load balancers, load-balanced sets, and Azure Traffic Manager.
Developed Hive queries to analyze data and generate results.
Used Spark Streaming to divide streaming data into batches as an input to Spark Engine for batch processing.
Worked on analyzing the Hadoop cluster and different Big Data analytic tools including Pig, Hive, HBase, Spark, and Sqoop.
Exported data from HDFS to RDBMS via Sqoop for Business Intelligence, visualization, and user report generation.
Used Scala to write code for all Spark use cases.
Analyzed user request patterns and implemented various performance optimization measures including implementing partitions and buckets in HiveQL.
Assigned name to each of the columns using case class option in Scala.
Worked on migrating MapReduce programs into Spark transformations using Spark and Scala; initial versions were done in Python (PySpark).
Involved in converting HQL queries into Spark transformations using Spark RDDs, with Python and Scala.
Developed multiple Spark SQL jobs for data cleaning.
Created Hive tables and worked on them using Hive QL.
Experienced with Big Data on Azure, connecting HDInsight to Azure and working with Big Data technologies.
Assisted in loading large sets of data (Structure, Semi Structured, and Unstructured) to HDFS.
Developed Spark SQL to load tables into HDFS to run select queries on top.
Developed analytical component using Scala, Spark, and Spark Stream.
Used Visualization tools such as Power view for excel, Tableau for visualizing and generating reports.
Worked on the NoSQL databases HBase and MongoDB.
Environment: Azure, Hadoop, Hive, Oozie, Java, Linux, Maven, Apache NiFi, Oracle 11g/10g, Zookeeper, MySQL, Spark.
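As an illustrative sketch of the schema-reconciliation step needed when loading CSV files with differing schemas into one Hive ORC table (a pure-Python stand-in for what Spark's union-by-name behavior does on the cluster; all names and data are hypothetical):

```python
# Union the columns of several CSV documents with differing headers and
# fill the gaps with NULLs, so all rows fit one target table schema.
import csv
import io

def unify_rows(csv_texts):
    """Parse several CSV documents; return (columns, rows) where every row
    carries the union of all columns, with missing values as None."""
    parsed = [list(csv.DictReader(io.StringIO(text))) for text in csv_texts]
    columns = []
    for doc in parsed:
        for row in doc:
            for col in row:
                if col not in columns:
                    columns.append(col)
    rows = [[row.get(col) for col in columns] for doc in parsed for row in doc]
    return columns, rows

if __name__ == "__main__":
    file_a = "id,name\n1,alice\n"
    file_b = "id,age\n2,30\n"
    print(unify_rows([file_a, file_b]))
```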
UPS, Boston ridge, New jersey Oct 2018 to Jan 2020
Data Engineer
Responsibilities:
Integrated services like Bitbucket, AWS CodePipeline, and AWS Elastic Beanstalk to create a deployment pipeline.
Created S3 buckets in the AWS environment to store files, some of which are required to serve static content for a web application.
Configured S3 buckets with various life cycle policies to archive the infrequently accessed data to storage classes based on requirement.
Good knowledge in creating and launching EC2 instances using AMIs of Linux, Ubuntu, RHEL, and Windows; wrote shell scripts to bootstrap instances.
Used IAM to create roles, users, and groups, and implemented MFA to provide additional security for the AWS account and its resources; used AWS ECS and EKS for Docker image storage and deployment.
Used Bamboo pipelines to drive all micro services builds out to the Docker registry and then deployed to Kubernetes, Created Pods and managed using Kubernetes.
Designed an ELK system to monitor and search enterprise alerts; installed, configured, and managed the ELK Stack for log management on EC2, with an Elastic Load Balancer in front of Elasticsearch.
Created development and test environments for different applications by provisioning Kubernetes clusters on AWS using Docker, Ansible, and Terraform.
Worked on deployment automation of all the micro services to pull image from the private Docker registry and deploy to Docker Swarm Cluster using Ansible.
Installed a registry for local upload and download of Docker images, in addition to pulls from Docker Hub.
Created and maintained highly scalable, fault-tolerant, multi-tier AWS and Azure environments spanning multiple availability zones using Terraform and CloudFormation.
Implemented Domain Name Service (DNS) through Route 53 for highly available and scalable applications; maintained monitoring and alerting of production and corporate servers using the CloudWatch service.
Worked on scalable distributed data system using Hadoop ecosystem in AWS EMR.
Migrated on premise database structure to Confidential Redshift data warehouse.
Wrote various data normalization jobs for new data ingested into Redshift.
Wrote scripts and indexing strategy for a migration to Confidential Redshift from SQL Server and MySQL databases.
Ingested data into this application using Hadoop technologies like Pig and Hive.
Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
Used JSON schema to define table and column mapping from S3 data to Redshift.
Built an on-demand, secure EMR launcher with custom spark-submit steps using S3 Events, SNS, KMS, and Lambda functions.
Created EBS volumes for storing application files, mounted to EC2 instances as needed.
Experienced in creating RDS instances to serve data through servers for responding to requests. Automated regular tasks using Python code and leveraged Lambda function wherever required.
Knowledge of containerization management and setup tools Kubernetes and ECS.
Environment: AWS (EC2, S3, EBS, ELB, RDS, SNS, SQS, VPC, CloudFormation, CloudWatch, ELK Stack), Bitbucket, Ansible, Python, shell scripting, PowerShell, ETL, AWS Glue, Jira, JBoss, Bamboo, Docker, WebLogic, Maven, WebSphere, Unix/Linux, AWS X-Ray, DynamoDB, Kinesis, CodeDeploy, Splunk.
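As an illustrative sketch of the S3-to-Redshift loads described in this section: a COPY statement that uses a JSONPaths file to map S3 JSON data to table columns (the bucket, table, and IAM role names are placeholders, not real resources):

```python
# Build a Redshift COPY command for JSON data in S3 with an explicit
# JSONPaths mapping file, as used when loading data marts from S3.
def redshift_copy(table: str, s3_uri: str, jsonpaths_uri: str, iam_role: str) -> str:
    """Return the COPY statement that loads S3 JSON data into a table."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"FORMAT AS JSON '{jsonpaths_uri}';"
    )

if __name__ == "__main__":
    print(redshift_copy(
        "analytics.events",                              # placeholder table
        "s3://example-bucket/events/",                   # placeholder data prefix
        "s3://example-bucket/jsonpaths/events.json",     # placeholder JSONPaths file
        "arn:aws:iam::123456789012:role/RedshiftCopyRole",
    ))
```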
Genpact, India Feb 2016 to Aug 2017
Hadoop Engineer
Responsibilities:
Involved in importing the real time data to Hadoop using Kafka and implemented the Oozie job for daily data.
Loaded the data from Teradata to HDFS using Teradata Hadoop connectors.
Imported data from different sources like HDFS and HBase into Spark RDDs.
Developed Spark scripts using Python shell commands as per requirements.
Issued SQL queries via Impala to process the data stored in HDFS and HBase.
Used the Spark - Cassandra Connector to load data to and from Cassandra.
Used a RESTful web services API to connect to MapR tables; the database connection was developed through the RESTful API.
Involved in developing Hive DDLs to create, alter, and drop Hive tables, and worked with Storm and Kafka.
Worked with application teams to install operating system, Hadoop updates, patches, version upgrades as required.
Experience in data migration from RDBMS to Cassandra. Created data-models for customer data using the Cassandra Query Language.
Responsible for building scalable distributed data solutions using a Hadoop cluster environment with the Hortonworks distribution.
Involved in developing Spark scripts for data analysis in both Python and Scala. Designed and developed various modules of the application with J2EE design architecture.
Implemented modules using Core Java APIs, Java collection and integrating the modules.
Experienced in transferring data from different data sources into HDFS systems using Kafka producers, consumers, and Kafka brokers.
Installed Kibana using Salt scripts and built custom dashboards to visualize important data stored in Elasticsearch.
Used File System Check (fsck) to check the health of files in HDFS, and used Sqoop to import data from SQL Server into Cassandra.
Streamed transactional data to Cassandra using Spark Streaming and Kafka.
Implemented a distributed messaging queue to integrate with Cassandra using Apache Kafka and Zookeeper.
Created ConfigMap and DaemonSet files to install Filebeat on Kubernetes pods and ship log files to Logstash or Elasticsearch, to monitor different types of logs in Kibana.
Created a database in InfluxDB, worked on the interface created for Kafka, and checked the measurements in the databases.
Installed Kafka Manager to monitor consumer lag and Kafka metrics; also used it for adding topics, partitions, etc.
Successfully generated consumer group lags from Kafka using its API.
Ran log aggregation, website activity tracking, and commit logs for distributed systems using Apache Kafka.
Involved in creating Hive tables, and loading and analyzing data using hive queries.
Developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
Loaded data from different sources (databases and files) into Hive using the Talend tool.
Used Oozie and Zookeeper operational services for coordinating cluster and Scheduling workflows.
Implemented Flume, Spark, and Spark Streaming framework for real time data processing.
Environment: Hadoop, Python, HDFS, Hive, Scala, MapReduce, Agile, Cassandra, Kafka, Storm, AWS, YARN, Spark, ETL, Teradata, NoSQL, Oozie, Java, Talend, Linux, Kibana, HBase
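As an illustrative sketch of the consumer-group lag computation mentioned in this section: per partition, lag is the end offset minus the committed offset, floored at zero. In practice both offset maps come from the Kafka consumer/admin APIs; here they are plain dicts so the arithmetic is the focus (all values are hypothetical):

```python
# Compute Kafka consumer-group lag per partition from two offset maps.
def consumer_lag(end_offsets, committed_offsets):
    """Return {partition: lag}; a partition with no commit carries full lag."""
    return {
        partition: max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    }

if __name__ == "__main__":
    ends = {0: 120, 1: 95, 2: 40}      # latest offsets per partition
    commits = {0: 120, 1: 80}          # partition 2 has never committed
    print(consumer_lag(ends, commits))
```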
A3 IT Solutions, India Jun 2014 – Jan 2016
Data Analyst
Responsibilities:
Understand the data visualization requirements from the Business Users.
Writing SQL queries to extract data from the Sales data marts as per the requirements.
Developed Tableau data visualizations using scatter plots, geographic maps, pie charts, bar charts, and density charts.
Designed and deployed rich graphic visualizations with drill-down and drop-down menu options and parameters using Tableau.
Created action filters, parameters, and calculated sets for preparing dashboards and worksheets in Tableau.
Explored traffic data from databases, connecting it with transaction data, presenting and writing reports for every campaign, and providing suggestions for future promotions.
Extracted data using SQL queries and transferred it to Microsoft Excel and Python for further analysis.
Performed data cleaning, merging, and export of datasets in Tableau Prep.
Carried out data processing and cleaning techniques to reduce text noise and dimensionality in order to improve the analysis.
Environment: Python, Informatica v9.x, MS SQL SERVER, T-SQL, SSIS, SSRS, SQL Server Management Studio, Oracle, Excel.
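As an illustrative sketch of the text-noise-reduction step described in this section (a stdlib-only stand-in for the cleaning done across Python and Tableau Prep; the rules shown are typical assumptions, not the exact pipeline):

```python
# Normalize free-text fields before analysis: lowercase, strip URLs,
# drop punctuation/symbols, and collapse runs of whitespace.
import re

def clean_text(text: str) -> str:
    """Return a noise-reduced version of a free-text field."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # drop punctuation/symbols
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace

if __name__ == "__main__":
    print(clean_text("Big SALE!! visit https://example.com  NOW"))
```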
Education:
Bachelor's in Computer Science from Sree Nidhi Institute of Science and Technology, India.