
Data Engineer

Location:
Texas City, TX
Posted:
May 09, 2023

Resume:

Achyut P

Email: ********@*****.***

Contact: 512-***-****

Professional Experience:

8+ years of strong software development experience in the IT industry, involving design, development, testing, maintenance, and documentation of applications across various domains.

Actively involved in each phase of the Software Development Life Cycle (SDLC), with experience in Agile methodology.

Good understanding of Apache Spark high-level architecture and performance-tuning patterns.

Parsed data from S3 through Python API calls via Amazon API Gateway, generating batch sources for processing.

Extracted, transformed, and loaded data from different formats such as JSON files and relational databases, and exposed it for ad-hoc/interactive queries using Spark SQL.
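
A minimal PySpark sketch of this kind of Spark SQL exposure (the bucket, connection details, and table names below are illustrative placeholders, not from an actual project):

```python
from pyspark.sql import SparkSession

# Build a Spark session with Hive support for ad-hoc SQL access
spark = SparkSession.builder.appName("adhoc-queries").enableHiveSupport().getOrCreate()

# Load semi-structured JSON and relational data into DataFrames
events = spark.read.json("s3a://example-bucket/raw/events/")          # hypothetical path
accounts = (spark.read.format("jdbc")
            .option("url", "jdbc:postgresql://db-host:5432/sales")    # hypothetical database
            .option("dbtable", "public.accounts")
            .option("user", "etl_user").option("password", "***")
            .load())

# Register temp views so analysts can run interactive Spark SQL
events.createOrReplaceTempView("events")
accounts.createOrReplaceTempView("accounts")

spark.sql("""
    SELECT a.region, COUNT(*) AS event_count
    FROM events e JOIN accounts a ON e.account_id = a.id
    GROUP BY a.region
""").show()
```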

Excellent experience with various Python and Java IDEs such as Sublime Text, PyCharm, Eclipse, and NetBeans.

Worked with MySQL and Cassandra databases, Storm, and Kafka to read values into the application.

Hands-on experience with tools like Pig and Hive for data analysis, Sqoop for data ingestion, Oozie for scheduling, and ZooKeeper for coordinating cluster resources.

Worked on Scala code bases for Apache Spark, performing actions and transformations on RDDs, DataFrames, and Datasets using Spark SQL and Spark Streaming contexts.

Proficient in analyzing large unstructured data sets using Pig, and in designing and developing POCs using MapReduce and Scala deployed on YARN clusters.

Experienced in developing MapReduce programs using Apache Hadoop for working with Big Data

Good experience performing CRUD operations and writing complex queries against Oracle and Postgres.

Experience in developing and consuming web services (WSDL, SOAP, and REST) with Python.

Experience working with the Android mobile operating system.

Expertise in full life-cycle application development and good experience with unit testing, Test-Driven Development (TDD), and Behavior-Driven Development (BDD).

Hands-on experience with version control tools such as SVN, GitHub, and GitLab, and with CI/CD.

Experience in Database modeling with SQL, MySQL, Postgres and Oracle.

Experience in testing frameworks like Behave.

Accessed RESTful web services from Python and processed the responses using libraries such as Pandas and NumPy.

Experience working on different operating systems: Windows, Linux, Unix, and Ubuntu.

Excellent communication, teamwork, interpersonal, and presentation skills; fast learner and organized self-starter.

Good knowledge of establishing database connections from Python by configuring packages such as MySQL-Python (MySQLdb).
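
A short illustrative snippet of such a connection using the MySQL-Python (MySQLdb) package (host, credentials, and table are placeholders):

```python
import MySQLdb  # the MySQL-Python package

# Connection parameters are placeholders for illustration only
conn = MySQLdb.connect(host="localhost", user="app_user", passwd="***", db="inventory")
try:
    cursor = conn.cursor()
    cursor.execute("SELECT id, name FROM assets LIMIT 5")
    for row in cursor.fetchall():
        print(row)
finally:
    conn.close()
```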

Hands-on experience with GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, the gsutil and bq command-line utilities, Dataproc, and Stackdriver.

Wrote Python modules to extract/load asset data from the Oracle source database

Skills:

Big Data frameworks: HDFS, Spark, MapReduce, Pig, Hive, Sqoop, Oozie, Kafka, Cassandra, Spark Streaming, Spark SQL

Programming languages: Python, Core Java, SQL, Scala, MapReduce

Cloud technologies: GCP, AWS

Databases: SQL Server, MySQL, Oracle

Operating systems: Windows, Linux (Ubuntu, CentOS)

File formats: CSV, ORC, JSON, Sequence, Delimited/Fixed Width

Development methodologies: Agile, Waterfall

Web technologies: JavaScript, JSON, HTML, XML

Version tools: Git and CVS

Others: PuTTY, WinSCP

Education:

Bachelor of Technology in Computer Science from JNTUK, India

Experience:

Driven Brands, Charlotte, NC August 2022 – Present

Sr. Data Engineer

Responsibilities:

Worked on the entity-relationship diagram for the transaction build DDM.

Worked on Non-FRED Macro data ingestion from GCS to BigQuery, which includes 6 different sources.

Worked on the data inventory and automation for non-FRED data.

Also worked on gathering a list of customer data sources for the Driven data prototype.

Prepared the necessary documentation and validated that the process fits the architecture standards for Non-FRED Macro data.

Loaded data into BigQuery for CSV files on arrival in the GCS bucket, and worked with the Airflow server and Google Cloud Platform.

Performed data migration from GCS to Google BigQuery (GBQ).

Scheduled all jobs using Airflow scripts written in Python, adding tasks to DAGs and defining dependencies between the tasks.
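
An illustrative sketch of such an Airflow DAG with task dependencies (DAG name, schedule, and task logic are placeholders, assuming the Airflow 2.x PythonOperator API):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pull source files")       # placeholder task logic


def transform():
    print("apply transformations")


def load():
    print("load into BigQuery")


with DAG(
    dag_id="example_daily_pipeline",  # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Dependencies between the tasks
    t_extract >> t_transform >> t_load
```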

Installed and configured Apache Airflow for workflow management and created workflows in Python.

Created Airflow Scheduling scripts in Python.

Implemented Apache Airflow for authoring, scheduling and monitoring Data Pipelines.

Processed bounded and unbounded data from Google Cloud Platform into Google BigQuery using Airflow with Python.

Worked with Google Cloud (GCP) Services like Compute Engine, Cloud Functions, Cloud DNS, Cloud Storage and Cloud Deployment Manager.

Involved in data analysis, data mapping, and data modelling; performed data-integrity and data-portability testing by executing SQL statements against the customer data.

Added indexes to improve performance on tables.

Involved in gathering business requirements, analysing the project, and creating Use Cases and Design Document for new requirements.

Designing architecture and building new features into the existing product.

Environment: GCP, Cloud Dataflow, Apache Airflow, Python, Pandas, ETL workflows, BigQuery, ER diagrams.

Lending Club, San Francisco, CA July 2021 – Aug 2022

Sr. GCP Data Engineer

Responsibilities:

Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala

Expertise in creating, debugging, scheduling, and monitoring jobs using Airflow for ETL batch processing to load into Snowflake for analytical processes

Loaded data into Snowflake tables from internal stage using SnowSQL

Performed Data Migration to GCP

Evaluated the suitability of Hadoop and its ecosystem for the project by implementing and validating various proof-of-concept (POC) applications, eventually adopting them to benefit the Big Data Hadoop initiative

Scheduled all jobs using Airflow scripts written in Python, adding tasks to DAGs and defining dependencies between the tasks.

Used IAM to create new accounts, roles, groups, and policies, and developed critical modules such as generating Amazon Resource Names (ARNs) and integration points with S3, DynamoDB, RDS, Lambda, and SQS queues

Populated HDFS and Cassandra with huge amounts of data using Apache Kafka.

Integrated Apache Storm with Kafka to perform web analytics. Uploaded clickstream data from Kafka to HDFS, HBase, and Hive by integrating with Storm.

Configured Spark Streaming to receive real-time data from Apache Kafka and store the streaming data to HDFS using Scala.
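
The bullet above refers to a Scala implementation; a roughly equivalent sketch in PySpark Structured Streaming (broker, topic, and HDFS paths are placeholders) would look like:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Subscribe to a Kafka topic (broker address and topic name are placeholders)
stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "clickstream")
          .load())

# Kafka delivers the payload as binary; cast it to a string column
events = stream.selectExpr("CAST(value AS STRING) AS raw_event", "timestamp")

# Continuously append the stream to HDFS, with checkpointing for fault tolerance
query = (events.writeStream.format("parquet")
         .option("path", "hdfs:///data/clickstream/")               # hypothetical path
         .option("checkpointLocation", "hdfs:///checkpoints/clickstream/")
         .start())
query.awaitTermination()
```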

Wrote various data normalization jobs for new data ingested into Redshift

Developed Spark applications in Python (PySpark) in a distributed environment to load large numbers of CSV files with different schemas into Hive ORC tables.
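
A condensed sketch of this load pattern (landing path and Hive table name are placeholders):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("csv-to-hive-orc")
         .enableHiveSupport().getOrCreate())

# Read a batch of CSV files; header/inferSchema handle per-file schema differences
df = (spark.read.option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///landing/daily_feeds/*.csv"))   # hypothetical landing path

# Write into a Hive-managed table stored as ORC
(df.write.format("orc")
   .mode("append")
   .saveAsTable("analytics.daily_feeds"))          # hypothetical database.table
```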

Worked with Google Data Catalog and other Google Cloud APIs for monitoring, query, and billing-related analysis of BigQuery usage.

Developed PySpark and SparkSQL code to process the data in Apache Spark on Amazon EMR to perform the necessary transformations based on the STMs developed

Used the AWS Glue Data Catalog with crawlers to catalog data from S3 and perform SQL query operations

Reviewed explain plans for SQL queries in Snowflake

Created an e-mail notification service that alerts the requesting team upon job completion.

Experience building and architecting multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP

Estimated the software and hardware requirements for the NameNode and DataNodes in the cluster.

Experience migrating existing databases from on premises to AWS Redshift using various AWS services

Developed PySpark code for AWS Glue jobs and for EMR.

Designed and Developed ETL jobs to extract data from Salesforce replica and load it in data mart in Redshift.

Developed ETL jobs using Spark/Scala to migrate data from Oracle to new Cassandra tables.

Used Spark/Scala (RDDs, DataFrames, Spark SQL) and the Spark-Cassandra Connector APIs for tasks such as data migration and business report generation.

Performed data extraction, aggregation, and consolidation of Adobe data within AWS Glue using PySpark.

Installed and configured Apache Airflow for workflow management and created workflows in Python

Implemented security to meet PCI requirements, using VPC public/private subnets, security groups, NACLs, IAM roles and policies, VPN, WAF, Trusted Advisor, CloudTrail, etc., to pass penetration testing against the infrastructure

Defined job workflows according to their dependencies in Oozie.

Played a key role in productionizing the application after testing by BI analysts.

Developed ETL parsing and analytics using Python/Spark to build a structured data model in Elasticsearch for consumption by the API and UI.

Responsible for Designing Logical and Physical data modelling for various data sources on Confidential Redshift

Developed data pipelines using Sqoop, HQL, Spark, and Kafka to ingest enterprise message delivery data into HDFS.

Developed Java MapReduce programs to analyze sample log files stored in the cluster

Implemented Spark using Python and Spark SQL for faster testing and processing of data.

Designed and built a multi-terabyte, end-to-end data warehouse infrastructure from the ground up on Confidential Redshift, handling millions of records every day at large scale

Used Apache Airflow in the GCP Cloud Composer environment to build data pipelines, using various Airflow operators such as BashOperator, Hadoop operators, Python callables, and branching operators.

Imported data using Sqoop to load data from MySQL to HDFS on a regular basis.

Created Airflow Scheduling scripts in Python

Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs in Java and Scala for data cleaning and preprocessing

Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Cloud Storage, and BigQuery

Coordinated with the team and developed a framework to generate daily ad-hoc reports and extracts from enterprise data in BigQuery.

Developed and deployed data pipelines in clouds such as AWS and GCP

Created BigQuery authorized views for row-level security and for exposing data to other teams.

Prepared data warehouse using Star/Snowflake schema concepts in Snowflake using SnowSQL

Created partitions and buckets based on state to further process data using bucket-based Hive joins.

Environment: GCP, Cloud Dataflow, Cloud Shell, gsutil, Docker, Kubernetes, AWS, Apache Airflow, Python, Pandas, ETL workflows, PySpark, Scala, Spark

DoorDash, San Francisco, CA Dec 2020 – June 2021

Sr. Data Engineer

Responsibilities:

Processed and loaded bounded and unbounded data from Google Pub/Sub topics into BigQuery using Cloud Dataflow with Python.

Wrote scripts and an indexing strategy for a migration to Confidential Redshift from SQL Server and MySQL databases

Strong understanding of AWS components such as EC2 and S3

Implemented a continuous-delivery pipeline with Docker and GitHub

Worked with Google Cloud Functions in Python to load data into BigQuery for CSV files arriving in the GCS bucket
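
An illustrative sketch of such a GCS-triggered Cloud Function using the google-cloud-bigquery client (project, dataset, and table IDs are placeholders):

```python
from google.cloud import bigquery


def load_csv_to_bq(event, context):
    """Triggered by a new object in the GCS bucket; loads the CSV into BigQuery."""
    client = bigquery.Client()
    uri = f"gs://{event['bucket']}/{event['name']}"

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,                  # infer the schema from the CSV
        write_disposition="WRITE_APPEND",
    )

    # Kick off the load job and wait for it to finish
    load_job = client.load_table_from_uri(
        uri, "my_project.my_dataset.landing_table", job_config=job_config
    )
    load_job.result()
```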

Developed POC for Apache Kafka and Implementing real-time streaming ETL pipeline using Kafka Streams API.

Performed data engineering functions: data extract, transformation, loading, and integration in support of enterprise data infrastructures - data warehouse, operational data stores and master data management

Developed a framework for converting existing PowerCenter mappings to PySpark (Python and Spark) jobs.

Experienced in writing live Real-time Processing using Spark Streaming with Kafka

Performed Data Analysis, Data Migration, Data Cleansing, Transformation, Integration, Data Import, and Data Export through Python.

Created PySpark DataFrames to bring data from DB2 into Amazon S3.
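
A minimal sketch of this DB2-to-S3 movement over JDBC (connection details, driver availability on the classpath, and paths are assumptions for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("db2-to-s3").getOrCreate()

# Read a DB2 table over JDBC (host, database, and credentials are placeholders;
# the DB2 JDBC driver jar must be available to Spark)
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:db2://db2-host:50000/SALESDB")
          .option("dbtable", "SCHEMA.ORDERS")
          .option("user", "etl_user").option("password", "***")
          .option("driver", "com.ibm.db2.jcc.DB2Driver")
          .load())

# Land the data in S3 as Parquet, partitioned by order date
(orders.write.mode("overwrite")
       .partitionBy("ORDER_DATE")
       .parquet("s3a://example-bucket/landing/orders/"))
```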

Devised PL/SQL Stored Procedures, Functions, Triggers, Views and packages. Made use of Indexing, Aggregation and Materialized views to optimize query performance.

Developed logistic regression models in Python to predict subscription response rates based on customer variables such as past transactions, responses to prior mailings, promotions, demographics, interests, and hobbies.

Developed PySpark scripts that read MSSQL tables and push the data to the big data platform, where it is stored in Hive tables.

Developed near-real-time data pipelines using Spark

Implemented Apache Airflow for authoring, scheduling and monitoring Data Pipelines

Proficient in Machine Learning techniques (Decision Trees, Linear/Logistic Regressors) and Statistical Modeling

Designed tables and columns in Redshift for data distribution across data nodes in the cluster, keeping columnar database design considerations in mind.

Used AWS Glue for data transformation, validation, and cleansing.

Used Python Boto3 to configure services such as AWS Glue, EC2, and S3
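
A brief illustrative Boto3 snippet touching the same services (bucket, job, and region names are placeholders):

```python
import boto3

# Clients for the services mentioned above
s3 = boto3.client("s3", region_name="us-east-1")
glue = boto3.client("glue", region_name="us-east-1")
ec2 = boto3.client("ec2", region_name="us-east-1")

# Upload a local extract to S3
s3.upload_file("daily_extract.csv", "example-bucket", "landing/daily_extract.csv")

# Start a Glue ETL job by name
run = glue.start_job_run(JobName="daily-etl-job")
print("Glue job run id:", run["JobRunId"])

# Inspect EC2 instances used by the pipeline
for reservation in ec2.describe_instances()["Reservations"]:
    for instance in reservation["Instances"]:
        print(instance["InstanceId"], instance["State"]["Name"])
```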

Hands on experience with big data tools like Hadoop, Spark, Hive

Experience implementing machine learning back-end pipelines with Pandas and NumPy

Responsible for data services and data-movement infrastructure; good experience with ETL concepts, building ETL solutions, and data modeling

Architected several DAGs (Directed Acyclic Graph) for automating ETL pipelines

Hands-on experience architecting the ETL transformation layers and writing Spark jobs to do the processing.

Architected, designed, and developed Hadoop ETL using Kafka.

Gathered and processed raw data at scale, including writing scripts, web scraping, calling APIs, writing SQL queries, and writing applications

Experience in fact dimensional modeling (Star schema, Snowflake schema), transactional modeling and SCD (Slowly changing dimension)

Worked on AWS Data Pipeline to configure data loads from S3 into Redshift

Used a JSON schema to define table and column mappings from S3 data to Redshift

Encoded and decoded JSON objects using PySpark to create and modify DataFrames in Apache Spark
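
A small sketch of JSON encoding/decoding with PySpark's from_json/to_json (the schema and sample payload are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, to_json, col, struct
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("json-encode-decode").getOrCreate()

# Sample rows carrying a JSON payload column
raw = spark.createDataFrame([("1", '{"user": "a", "clicks": 3}')], ["id", "payload"])

schema = StructType([
    StructField("user", StringType()),
    StructField("clicks", IntegerType()),
])

# Decode: parse the JSON string into typed columns
decoded = raw.withColumn("parsed", from_json(col("payload"), schema)).select("id", "parsed.*")

# Encode: pack selected columns back into a JSON string
encoded = decoded.withColumn("payload", to_json(struct("user", "clicks")))
encoded.show(truncate=False)
```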

Devised simple and complex SQL scripts to check and validate Dataflow in various applications.

Environment: MapReduce, Hive, Sqoop, Oozie, Python, Scala, Spark, Kafka, PySpark, Cassandra, Linux, AWS EMR, S3, Storm

Michelin Tires, Charlotte, NC Jan 2019 – Nov 2020

Role: Sr. Big Data Engineer

Responsibilities:

Wrote Redshift UDFs and Lambda functions using Python for custom data transformation and ETL.

Used AWS Redshift, S3, Spectrum, and Athena services to query large amounts of data stored on S3, creating a virtual data lake without having to go through an ETL process.

Designed and set up an Enterprise Data Lake to support various use cases including analytics, processing, storing, and reporting of voluminous, rapidly changing data.

Responsible for maintaining quality reference data in source by performing operations such as cleaning, transformation and ensuring Integrity in a relational environment by working closely with the stakeholders & solution architect.

Created Lambda functions with Boto3 to deregister unused AMIs in all application regions to reduce the cost of EC2 resources.
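
An illustrative sketch of such a cleanup Lambda (the "InUse" tag convention is a hypothetical placeholder, not the actual project's rule):

```python
import boto3


def lambda_handler(event, context):
    """Deregister AMIs owned by this account that are not tagged as in use."""
    ec2 = boto3.client("ec2")

    for image in ec2.describe_images(Owners=["self"])["Images"]:
        tags = {t["Key"]: t["Value"] for t in image.get("Tags", [])}
        if tags.get("InUse") != "true":              # hypothetical tagging convention
            ec2.deregister_image(ImageId=image["ImageId"])
            print("Deregistered", image["ImageId"])
```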

Imported and exported databases using SQL Server Integration Services (SSIS) and Data Transformation Services (DTS packages).

Performed end-to-end architecture and implementation assessments of various AWS services such as Amazon EMR, Redshift, and S3.

Coded Teradata BTEQ scripts to load and transform data and to fix defects such as SCD Type 2 date chaining and cleaning up duplicates.

Developed Spark scripts using Python on Azure HDInsight for Data Aggregation, Validation and verified its performance over MR jobs.

Developed Kibana dashboards based on Logstash data and integrated different sources and targets

Designed and developed Security Framework to provide fine grained access to objects in AWS S3 using AWS Lambda, DynamoDB.

Set up and worked on Kerberos authentication principals to establish secure network communication on cluster and testing of HDFS, Hive, Pig and MapReduce to access cluster for new users.

Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.

Developed reusable framework to be leveraged for future migrations that automates ETL from RDBMS systems to the Data Lake utilizing Spark Data Sources and Hive data objects.

Conducted Data blending, Data preparation using Alteryx and SQL for Tableau consumption and publishing data sources to Tableau server.

Used the Spark SQL Scala and Python interfaces, which automatically convert RDDs of case classes to schema RDDs.

Developed PySpark script to merge static and dynamic files and cleanse the data.

Imported data from different sources such as HDFS and HBase into Spark RDDs and performed computations using PySpark to generate the output response.

Implemented machine learning algorithms in Python to predict the quantity a user might want to order for a specific item, so that suggestions can be made automatically, using Kinesis Firehose and an S3 data lake.

Environment: Apache Spark, HBase, AWS EMR, S3, RDS, Redshift, Lambda, Boto3, DynamoDB, Amazon SageMaker, Apache Kafka, Hive, Sqoop, MapReduce, Snowflake, Apache Pig, Python, SSRS, Tableau

Objectsoft, India Aug 2016 – Feb 2018

Role: Big Data Engineer

Responsibilities:

Understood business requirements and was involved in preparing design documents according to client requirements.

Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs in Java and Scala for data cleaning and preprocessing.

Installed and configured Hive, wrote Hive UDFs, and used MapReduce and JUnit for unit testing

Responsible for automating build processes towards CI/CD automation goals

Installed and configured Hadoop ecosystem components such as Hive, Oozie, and Sqoop on a Cloudera Hadoop cluster, helping with performance tuning and monitoring

Involved in Agile methodologies, daily scrum meetings, and sprint planning

Analyzed the existing data flow to the warehouses and took a similar approach to migrate the data into HDFS

Created partitioning, bucketing, map-side joins, and parallel execution to optimize Hive queries, decreasing execution time from hours to minutes

Defined the application architecture and design for the Big Data Hadoop initiative to maintain structured and unstructured data and create an architecture for the enterprise

Identified data sources, created source-to-target mappings and storage estimates, and provided support for Hadoop cluster setup and data partitioning

Involved in gathering requirements from the client and estimating timelines for developing complex Hive queries for a logistics application.

Worked with cloud provisioning team on a capacity planning and sizing of the nodes (Master and Slave) for an AWS EMR Cluster.

Responsible for creating an instance on Amazon EC2 (AWS) and deployed the application on it

Worked with Amazon EMR to process data directly in S3, and to copy data from S3 to the Hadoop Distributed File System (HDFS) on the Amazon EMR cluster, setting up Spark Core for analysis work

Exposure to Spark architecture and how RDDs work internally, processing data from local files, HDFS, and RDBMS sources by creating RDDs and optimizing for performance

Involved in building data pipelines using Pig and Sqoop to ingest cargo data and customer histories into HDFS for analysis

Created HBase tables to load large sets of semi structured data coming from various sources

Responsible for loading customer data and event logs from Kafka into HBase using the REST API

Created custom UDFs in Scala for Spark and Kafka procedures to cover non-working functionality in the production environment

Developed workflows in Oozie and scheduled jobs on mainframes, preparing the data refresh strategy and capacity planning documents required for project development and support

Worked with different Oozie actions to design workflows, such as Sqoop, Pig, Hive, and shell actions.

Environment: AWS, Kafka, Map Reduce, Snowflake, Apache Pig, Python, SSRS, Tableau

Sutherland, India April 2014 – July 2016

Data Analyst

Responsibilities:

Created SQL queries to simplify migration progress reports and analyses.

Wrote SQL queries using joins, grouping, nested sub-queries, and aggregation depending on data needed from various relational customer databases.

Developed reporting and various dashboards across all areas of the client's business to help analyze the data.

Cleansed and manipulated data by sub-setting, sorting, and pivoting on need basis

Used SQL Server and MS Excel on daily basis to manipulate the data for business intelligence reporting needs.

Developed the stored procedures as required, and user defined functions and triggers as needed using T-SQL.

Created reports from OLAP, sub reports, bar charts and matrix reports using SSIS.

Used Excel and PowerPoint on various projects as needed for presentations and summarization of data to provide insight on key business decisions

Designed Ad-hoc reports using SQL and Tableau dashboards, facilitating data driven decisions for business users.

Created reports for the Data Analysis using SQL Server Reporting Services.

Designed data reports in Excel for easy sharing, and used SSRS for report deliverables to aid in statistical data analysis and decision making

Involved in extensive data validation by writing several complex SQL queries; involved in back-end testing and worked on data quality issues

Collected, analyzed, and interpreted complex data for reporting and/or performance trend analysis

Performed Data Manipulation using MS Excel Pivot Sheets and produced various charts for creating the mock reports

Extracted data from different sources performing Data integrity and quality checks

Performed Data Analysis and Data Profiling and worked on data transformations and data quality rules

Developed Stored Procedures in SQL Server to consolidate common DML transactions such as insert, update and delete from the database.

Worked with Data Analysts to understand Business logic and User Requirements.

Closely worked with cross functional Data warehouse members to import data into SQL Server and connected to SQL Server to prepare spreadsheets.

Environment: SQL Server, MS Excel, T-SQL, SSRS, SSIS, OLAP, PowerPoint


