Location:
Tampa, FL
Posted:
December 06, 2023


VAMSI.V

Phone: 475-***-****

Email: ad1q32@r.postjobfree.com

Professional Summary:

Data Engineer and Python developer with 8+ years of experience. Proficient in designing, documenting, developing, and implementing data models for enterprise-level applications. Background in data lakes, data warehousing, ETL data pipelines, and data visualization. Proficient in big data storage, processing, analysis, and reporting on all major cloud vendors: AWS, Azure, and GCP.

Experience in Big Data ecosystems using Hadoop, MapReduce, YARN, HDFS, HBase, Hive, Sqoop, Storm, Spark, Scala, Airflow, Flume, Kafka, Oozie, Impala, and Zookeeper.

Experience developing Spark applications using Spark Core, Spark Streaming, Spark SQL, DataFrames, Datasets, and Spark ML. Developed Spark Streaming jobs built on RDDs using Scala, PySpark, and spark-shell.
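For illustration only (not from a specific project on this resume), a minimal PySpark sketch of the RDD and DataFrame/Spark SQL APIs mentioned above; the S3 path and column names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-dataframe-demo").getOrCreate()
sc = spark.sparkContext

# RDD API: parse hypothetical comma-separated event lines and count by event type
lines = sc.textFile("s3://example-bucket/events/*.csv")            # hypothetical path
events = lines.map(lambda line: line.split(","))                   # transformation (lazy)
counts = events.map(lambda e: (e[1], 1)).reduceByKey(lambda a, b: a + b)
print(counts.take(10))                                             # action triggers execution

# DataFrame / Spark SQL API over the same data
df = spark.read.option("header", "true").csv("s3://example-bucket/events/")
df.createOrReplaceTempView("events")
spark.sql("SELECT event_type, COUNT(*) AS cnt FROM events GROUP BY event_type").show()
```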

In-depth knowledge of Hadoop and Spark architecture, with data mining and stream processing technologies including Spark Core, Spark SQL, DataFrames, and Spark Streaming for developing Spark programs for batch and real-time processing.

Experienced in writing Hive queries to process large sets of structured, semi-structured, and unstructured data. Experience in loading data into HDFS using Sqoop and saving data in Hive tables.

Leveraged Databricks for processing and analyzing large-scale datasets, optimizing performance and efficiency.

Experience with the Snowflake cloud data warehouse and AWS S3 buckets for integrating data from multiple source systems, including loading nested JSON-formatted data into Snowflake tables.

Experience in Python development and scientific programming, using NumPy and Pandas in Python for data manipulation.

Good working knowledge of Data Warehousing and ETL processes using tools like Informatica.

Involved in end-to-end implementation of enterprise data warehouses, data lakes, and data marts with batch and real-time processing using Spark Streaming and AWS Kinesis.

Experience in setting up workflows using Apache Airflow and Oozie to manage and schedule Hadoop jobs.

Experience in configuration, deployment, and automation of major cloud environments (AWS and GCP).

Worked with AWS EC2 cloud instance. Used EMR, Redshift, and Glue for data processing.

Worked with AWS storage, OLTP, OLAP, and NoSQL services, as well as the AWS Redshift data warehouse.

Worked on creating IAM policies for delegated administration within AWS. Configured users, roles, and policies.
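As a hedged illustration of the IAM work described above, a minimal boto3 sketch that creates a customer-managed policy and attaches it to a role; the policy name, role name, bucket, and allowed actions are hypothetical examples.

```python
import json
import boto3

iam = boto3.client("iam")

# Hypothetical policy allowing read-only access to a single S3 bucket
policy_doc = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": ["arn:aws:s3:::example-data-lake",
                     "arn:aws:s3:::example-data-lake/*"],
    }],
}

resp = iam.create_policy(PolicyName="ExampleDataLakeReadOnly",
                         PolicyDocument=json.dumps(policy_doc))

# Attach the managed policy to an existing (hypothetical) role
iam.attach_role_policy(RoleName="example-analyst-role",
                       PolicyArn=resp["Policy"]["Arn"])
```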

Proficient in AWS CodePipeline and worked with CodeCommit, CodeBuild, and CodeDeploy.

Developed and designed a system to collect data from multiple portals using Kafka and process it using Spark. Designed and implemented Kafka by configuring topics in a new Kafka cluster in all environments.
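A hedged PySpark Structured Streaming sketch of consuming portal events from Kafka and landing them in storage, in the spirit of the Kafka-plus-Spark system above; the brokers, topic, schema, and paths are placeholders, and the spark-sql-kafka connector package is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-ingest-demo").getOrCreate()

schema = StructType([                        # hypothetical message schema
    StructField("portal_id", StringType()),
    StructField("event", StringType()),
    StructField("ts", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")  # placeholder brokers
       .option("subscribe", "portal-events")                            # placeholder topic
       .load())

parsed = (raw.select(from_json(col("value").cast("string"), schema).alias("m"))
             .select("m.*"))

query = (parsed.writeStream
         .format("parquet")
         .option("path", "s3://example-bucket/portal-events/")          # placeholder sink
         .option("checkpointLocation", "s3://example-bucket/checkpoints/portal-events/")
         .start())
```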

Worked extensively on GCP services such as GCS (Google Cloud Storage), Compute Engine, Dataproc, Cloud SQL, BigQuery, and Bigtable for building data lakes on Google Cloud Platform.

Exposure to and development experience in microservices architecture best practices, the Java Spring Boot framework, Docker, Kubernetes, Jenkins, and Python.

Experience with Python and SQL on the AWS cloud platform, and a good understanding of data warehouses such as Snowflake and the Databricks platform.

Experience in data preprocessing (data cleaning, data integration, data reduction, and data transformation) using Python libraries including NumPy, SciPy and Pandas for data analysis and numerical computations.
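An illustrative pandas/NumPy preprocessing sketch covering the cleaning, transformation, and reduction steps listed above; the input file and column names are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("customers.csv")                      # hypothetical input file

# Data cleaning: drop exact duplicates, normalize missing markers, fill numeric gaps
df = df.drop_duplicates()
df["age"] = df["age"].replace({-1: np.nan})
df["age"] = df["age"].fillna(df["age"].median())

# Data transformation: type casting, derived column, simple outlier clipping
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
df["tenure_days"] = (pd.Timestamp.today() - df["signup_date"]).dt.days
df["spend"] = df["spend"].clip(lower=0, upper=df["spend"].quantile(0.99))

# Data reduction: keep only the columns downstream consumers need
df = df[["customer_id", "age", "tenure_days", "spend"]]
```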

Experience working on various file formats including delimited text files, clickstream log files, Apache log files, Parquet files, Avro files, JSON files, XML files and others.

Good understanding of data modeling (dimensional and relational) concepts such as star-schema and snowflake-schema modeling and fact and dimension tables.

Involved in various projects related to Data Modeling, System/Data Analysis, Design and Development for both OLTP and Data warehousing environments.

Experience with visualization tools such as Power BI, Tableau, Jupyter Notebook, TIBCO Spotfire, QlikView, MicroStrategy, Information Builders, and other reporting and analytical tools.

Hands on experience with Shell Scripting and UNIX/LINUX.

A Data Science enthusiast with strong Problem solving, Debugging and Analytical capabilities, who actively engages in understanding and delivering business requirements.

Technical Skills:

Hadoop Distributions: Apache, Cloudera CDH4/CDH5, Hortonworks

Big Data Ecosystem: Apache Hadoop (HDFS/MapReduce), Hive, Pig, Sqoop, Zookeeper, Oozie, Hue, Spark, Spark SQL, PySpark, Apache Kafka

NoSQL Databases: HBase, Cassandra, MongoDB

Languages: SQL, Python, Scala, Core Java, PL/SQL, Shell Scripting, XML, Azure PowerShell

Java / J2EE Technologies: Core Java, Servlets, Hibernate, Spring, Struts, JMS, EJB

Operating Systems: Linux, Unix, Windows 7/8, Windows Server 2003/2008, Mac OS

Cloud: Google Cloud Platform (GCP), Amazon Web Services (AWS), EC2, EMR, S3, VPC, ELB, RDS, Redshift, Lambda, Glue, Data Pipeline, Athena, Databricks, GCS, Dataproc, BigQuery, Bigtable

Visualization Tools: Power BI, Tableau Desktop and Server, TIBCO Spotfire, QlikView, MicroStrategy, Jupyter Notebook, Information Builders

Databases: Oracle 10g/9i/8i, DB2, MySQL, MS SQL Server

Application Servers: WebLogic, WebSphere, JBoss, Tomcat

IDE / Dev Tools: PyCharm, Vi/Vim, Sublime Text, Visual Studio Code, Jupyter Notebook

Build Tools: Jenkins, Maven, ANT

Software Engineering: Agile/Scrum and Waterfall methodologies

Professional Experience:

Edelman, New York, NY March 2022 to present

GCP Data Engineer

Overview: Used GCP tools and services to build a data pipeline for customer segmentation and market analysis. The pipeline collects data from multiple sources, transforms it using PySpark, and analyzes it in Databricks. The results are stored in GCS and monitored using Stackdriver.

Responsibilities:

Involved in developing a roadmap for migration of enterprise data from multiple data sources, such as SQL Server and provider databases, into S3, which serves as a centralized data hub across the organization.

Hands-on experience in GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil, bq command-line utilities, Dataproc, and Stackdriver.

Developed PySpark applications and deployed them on the Databricks cluster.

Managed Teradata instances on GCP for efficient data warehousing, enabling the storage and retrieval of historical market data alongside real-time streams.

Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.

Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Cloud Storage, and BigQuery.

Deployed and managed Teradata instances on Google Cloud for high-performance data warehousing, improving query response times and reducing infrastructure costs.

Led a team responsible for migrating terabytes of data from on-premises Teradata to Google Cloud Storage and BigQuery, ensuring data accuracy and minimal downtime

Developed custom ETL processes using Google Dataflow to integrate Teradata data with real-time streaming data sources on GCP

Integrated Teradata data with GCP's machine learning services to build predictive models for market forecasting, resulting in more informed trading strategies.

Created pipelines, data flows and complex data transformations and manipulations using ADF and PySpark with Databricks

Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs with Scala

Experience in building and architecting multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP.

Day-to-day responsibilities included developing ETL pipelines in and out of the data warehouse and developing major regulatory and financial reports using advanced SQL queries in Snowflake.

Used Apache Airflow in a GCP Composer environment to build data pipelines, using various Airflow operators such as the bash operator, Hadoop operators, Python callables, and branching operators.
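A minimal Airflow DAG sketch in the style of a Cloud Composer environment, showing the bash, Python-callable, and branching operators mentioned above; the DAG id, schedule, commands, and branching rule are hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import BranchPythonOperator, PythonOperator


def _load(**_):
    print("load batch into staging")                    # placeholder load logic


def _pick_branch(**_):
    # Hypothetical rule: full rebuild on Mondays, incremental load otherwise
    return "full_refresh" if datetime.utcnow().weekday() == 0 else "incremental_load"


with DAG(dag_id="example_etl", start_date=datetime(2023, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    extract = BashOperator(task_id="extract",
                           bash_command="gsutil ls gs://example-bucket/raw/")  # placeholder
    load = PythonOperator(task_id="load", python_callable=_load)
    branch = BranchPythonOperator(task_id="branch", python_callable=_pick_branch)
    full_refresh = BashOperator(task_id="full_refresh", bash_command="echo full refresh")
    incremental_load = BashOperator(task_id="incremental_load", bash_command="echo incremental")

    extract >> load >> branch >> [full_refresh, incremental_load]
```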

Implemented real-time data streaming pipelines using EKS and containerized stream processing technologies.

Worked on building input adapters for data dumps from FTP servers using Apache Spark.

Wrote Spark applications to perform operations like inspecting, cleaning, loading, and transforming large sets of structured and semi-structured data.

Developed Spark with Scala and Spark-SQL for testing and processing of data.

Made Spark job statistics reporting, monitoring, and data quality checks available for each dataset.

Leveraged and integrated Google Cloud Storage and BigQuery applications, which connected to Tableau for end-user web-based dashboards and reports. Leveraged Google Cloud SDKs and client libraries.

Staged API and Kafka data (in JSON format) into Snowflake by flattening it (FLATTEN) for different functional services.
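A hedged sketch of flattening staged JSON into a relational Snowflake table with LATERAL FLATTEN, issued here through the Snowflake Python connector; the connection parameters, table names, and JSON paths are placeholders, not the project's actual objects.

```python
import snowflake.connector

# Placeholder connection parameters; real credentials would come from a secrets manager
conn = snowflake.connector.connect(account="example_account",
                                   user="example_user",
                                   password="example_password",
                                   warehouse="ETL_WH",
                                   database="RAW_DB",
                                   schema="KAFKA")

# Hypothetical objects: explode the nested "items" array into one row per element
flatten_sql = """
    INSERT INTO ANALYTICS.PUBLIC.EVENTS_FLAT (event_id, service, item_amount)
    SELECT payload:event_id::string,
           payload:service::string,
           f.value:amount::number
    FROM RAW_DB.KAFKA.EVENTS_JSON,
         LATERAL FLATTEN(input => payload:items) f
"""

cur = conn.cursor()
try:
    cur.execute(flatten_sql)
finally:
    cur.close()
    conn.close()
```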

Created GCP Dataproc clusters with a max-idle-time parameter and executed Spark jobs on them to be more cost-efficient than BigQuery.
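A sketch of one way to create such a cost-efficient, self-deleting Dataproc cluster from Python by shelling out to the gcloud CLI; the cluster name, region, sizing, and idle window are hypothetical.

```python
import subprocess

# Create a transient cluster that auto-deletes after 30 idle minutes (scheduled deletion),
# then submit a PySpark job to it. Names, region, and sizes are placeholders.
subprocess.run([
    "gcloud", "dataproc", "clusters", "create", "example-etl-cluster",
    "--region=us-central1",
    "--num-workers=2",
    "--max-idle=30m",           # idle time before the cluster is deleted
], check=True)

subprocess.run([
    "gcloud", "dataproc", "jobs", "submit", "pyspark",
    "gs://example-bucket/jobs/transform.py",
    "--cluster=example-etl-cluster",
    "--region=us-central1",
], check=True)
```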

Utilized Talend's capabilities to orchestrate and automate data processing tasks in a GCP project, leading to improved data quality and reduced time-to-insight. Involved in creating a data lake in Google Cloud Platform (GCP) to allow business teams to perform data analysis in BigQuery.

Worked with Snowflake features like clustering, Time Travel, cloning, logical data warehouses, and caching.

Worked closely with GCP services, such as Cloud Storage, BigQuery, and Dataflow, to configure seamless integration of Apache NiFi pipelines with GCP infrastructure, enabling efficient data processing and storage.

Utilized Google Cloud Storage as a data lake and ensured all the processed data was written to GCS directly from Spark and Hive jobs.

Environment: GCP, Google Cloud Storage, Apache Spark, Spark SQL, Snowflake, UNIX, Kafka, Scala, Teradata.

Alexion Pharmaceuticals, Boston, Massachusetts Jan 2021 to Feb 2022

AWS Data Engineer

Overview: Used AWS tools such as Glue, EKS, and Kinesis to gather data from multiple sources, maintained the appropriate data stores, and handed the data over to the manufacturing and data science teams.

Responsibilities:

Designed robust, reusable, and scalable data-driven solutions and data pipelines to automate the ingestion, processing, and delivery of both structured and semi-structured batch and real-time streaming data.

Experience in data integration and data warehousing using ETL tools like Informatica PowerCenter, AWS Glue, SQL Server Integration Services (SSIS), and Talend.

Applied efficient and scalable data transformations on the ingested data using the Spark framework.

Built end-to-end Spark applications for performing data cleansing, transformations, and aggregations of varieties of data sources ingested from downstream applications.

Worked on real-time data streaming projects using AWS Kinesis and Databricks streaming capabilities.

Worked closely with machine learning teams to deliver feature datasets in an automated manner to help them with model training and model scoring.

Expert knowledge of MongoDB and NoSQL data modeling, tuning, and disaster recovery backups; used it for distributed storage and processing with CRUD operations.

Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.

Managed version control and collaborated effectively with cross-functional teams by utilizing Git and GitHub for code repositories, tracking changes, and facilitating code reviews in an AWS Data Engineering environment.

Proactively monitored Teradata systems for performance issues and executed routine maintenance tasks, such as updates and backups.

Gained good knowledge in troubleshooting and performance tuning Spark applications and Hive scripts to achieve optimal performance.

Developed real-time and batch data load modules to ingest data into S3, Redshift, and Snowflake using AWS Kinesis, AWS Glue, AWS Lambda, and AWS Step Functions

Led testing efforts in support of projects/programs across a large landscape of technologies (Unix, AngularJS, AWS, SauceLabs, Cucumber JVM, MongoDB, GitHub, Bitbucket, SQL, NoSQL databases, API scripting, Java, Jenkins).

Developed various custom UDFs in Spark to perform transformations on date fields and complex string columns and to encrypt PII fields.
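An illustrative PySpark sketch of the kinds of custom UDFs described above (date normalization and sensitive-field masking); the hash-based masking is a stand-in for the project's actual encryption, and the column names are hypothetical.

```python
import hashlib
from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()

@udf(StringType())
def normalize_date(raw):
    # Accept a couple of source formats and emit ISO dates; unknown formats become NULL
    for fmt in ("%m/%d/%Y", "%Y%m%d"):
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except (TypeError, ValueError):
            continue
    return None

@udf(StringType())
def mask_field(value):
    # Stand-in for real encryption: one-way hash so the raw value never leaves the job
    return hashlib.sha256(value.encode("utf-8")).hexdigest() if value else None

df = spark.createDataFrame([("03/15/2023", "123-45-6789")], ["order_date", "ssn"])
df = (df.withColumn("order_date", normalize_date(col("order_date")))
        .withColumn("ssn", mask_field(col("ssn"))))
df.show(truncate=False)
```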

Deployed and managed Amazon EKS clusters for containerized data processing applications and services

Wrote complex Hive scripts for performing various data analyses and creating reports requested by business stakeholders.

Developed and maintained data pipelines using AWS Glue and Databricks.

Utilized the Glue metastore as a common metastore between EMR clusters and the Athena query engine, with S3 as the storage layer for both.

Worked extensively on migrating our existing on-Prem data pipelines to the AWS cloud for better scalability and infrastructure maintenance.

Worked extensively in automating the creation/termination of EMR clusters as part of starting the data pipelines.

Orchestrated the deployment of Spring Boot applications within Docker containers.

Designed AWS Glue pipelines to ingest, process, and store data interacting with different services in AWS

Worked extensively on migrating/rewriting existing Oozie jobs to AWS Simple Workflow.

Loaded the processed data into Redshift tables for allowing downstream ETL and reporting teams to consume the processed data.

Configured Apache Airflow for workflow management.

Environment: AWS Cloud, S3, EMR, Redshift, Athena, Databricks, Lambda, Teradata, Scala, Spark, Kafka, Hive, YARN, HBase, Jenkins, Terraform, Docker.

Bank of Hope, Los Angeles, CA Nov 2018 to Dec 2020

Data Engineer

Responsibilities:

Worked on requirements gathering, analysis, and design of the systems.

Developed Spark programs using Scala to compare the performance of Spark with Hive and SparkSQL.

Developed a robust data ingestion system using Spring Boot to efficiently process and integrate large volumes of streaming data from diverse sources into a centralized data repository.

Experience working in the Spark ecosystem using Spark SQL and Scala queries on different formats like text files and CSV files.

Designed and implemented RESTful APIs with Spring Boot to facilitate seamless communication between data processing modules, enabling efficient data flow within the system.

Implemented Spark using Scala and SparkSql for faster testing and processing of data.

Involved in developing a MapReduce framework that filters bad and unnecessary records.

Ingested data from RDBMS sources, performed data transformations, and then exported the transformed data to Cassandra per the business requirements.

Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs with Scala.

Used Spark API over Hadoop YARN as execution engine for data analytics using Hive.

Exported the analyzed data to the relational databases using Sqoop to further visualize and generate reports for the BI team.

Engineered real-time data analytics solutions by combining Spring Boot with Apache Kafka and Apache Spark, enabling timely insights into streaming data for informed decision-making.

Migrated the computational code in HQL to PySpark.
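A small sketch of the HQL-to-PySpark migration pattern: the original Hive query can be run unchanged through spark.sql or rewritten with the DataFrame API; the table, columns, and output path are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder.appName("hql-migration-demo")
         .enableHiveSupport().getOrCreate())

# Option 1: run the original HiveQL as-is through Spark SQL
agg_sql = spark.sql("""
    SELECT account_id, SUM(amount) AS total_amount
    FROM transactions
    WHERE txn_date >= '2020-01-01'
    GROUP BY account_id
""")

# Option 2: the equivalent rewritten with the DataFrame API
agg_df = (spark.table("transactions")
          .filter(F.col("txn_date") >= "2020-01-01")
          .groupBy("account_id")
          .agg(F.sum("amount").alias("total_amount")))

agg_df.write.mode("overwrite").parquet("/warehouse/reports/account_totals")  # placeholder path
```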

Worked with Spark Ecosystem using Scala and Hive Queries on different data formats like Text file and parquet.

Using AWS Redshift, extracted, transformed, and loaded data from various heterogeneous data sources and destinations.

Worked on migrating HiveQL into Impala to minimize query response time.

Responsible for migrating the code base to Amazon EMR and evaluated Amazon ecosystems components like Redshift.

Collected log data from web servers and integrated it into HDFS using Flume; developed Python scripts to clean the raw data.

Implemented Nifi flow topologies to perform cleansing operations before moving data into HDFS.

Worked on different file formats (ORC, Parquet, Avro) and different compression codecs (GZIP, Snappy, LZO).

Experience writing Pig and Hive scripts and using Snowflake Clone and Time Travel.

Created applications using Kafka that monitor consumer lag within Apache Kafka clusters.
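A hedged kafka-python sketch of how consumer lag can be computed per partition (latest broker offset minus the group's committed offset); the broker address, consumer group, and topic are placeholders.

```python
from kafka import KafkaConsumer, TopicPartition

TOPIC = "portal-events"                                      # placeholder topic
consumer = KafkaConsumer(bootstrap_servers="broker1:9092",   # placeholder broker
                         group_id="analytics-consumers",     # group being monitored
                         enable_auto_commit=False)

partitions = [TopicPartition(TOPIC, p) for p in consumer.partitions_for_topic(TOPIC)]
end_offsets = consumer.end_offsets(partitions)               # latest offset per partition

total_lag = 0
for tp in partitions:
    committed = consumer.committed(tp) or 0                  # group's committed position
    lag = end_offsets[tp] - committed
    total_lag += lag
    print(f"partition={tp.partition} lag={lag}")

print(f"total lag for group: {total_lag}")
consumer.close()
```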

Created and maintained data models in Teradata, optimizing query performance and data consistency.

Developed Databricks ETL pipelines using notebooks, Spark DataFrames, Spark SQL, and Python scripting.

Created and provisioned numerous Databricks clusters needed for batch and continuous streaming data processing and installed the required libraries on the clusters.

Developed workflow in Oozie to automate the tasks of loading the data into HDFS.

Improved SQL query performance by X% through query optimization and index tuning in Teradata.

Worked in Agile environment using Scrum methodology.

Environment: Hadoop, Hive, MapReduce, Sqoop, Kafka, Spark, YARN, Pig, PySpark, Cassandra, Oozie, NiFi, Solr, Shell Scripting, HBase, MongoDB, Scala, AWS, Maven, Teradata, Java, JUnit, Agile methodologies, Hortonworks, SOAP, Python, Java Spring Boot, MySQL.

Axis Bank, Hyderabad, India Sep 2016 to Oct 2018

Hadoop Developer

Responsibilities:

Mined the locations of users on social media sites in a semi-supervised environment on a Hadoop cluster using MapReduce.

Implemented single-source shortest path on the Hadoop cluster.

Involved in loading and transforming large sets of structured, semi-structured, and unstructured data and analyzed them by running Hive queries and Pig scripts.

Evaluated suitability of Hadoop and its ecosystem to the above project and implemented various proof of concept (POC) applications to eventually adopt them to benefit from the Big Data Hadoop initiative.

Estimated software and hardware requirements for the NameNode and DataNodes and planned the cluster.

Participated in requirement gathering from the experts and business partners and converted the requirements into technical specifications.

Extracted the needed data from the server into HDFS and Bulk Loaded the cleaned data into HBase.

Wrote MapReduce programs and Hive UDFs in Java where the functionality was too complex.

Involved in running Hadoop jobs for processing millions of records of text data.

Involved in loading data from LINUX file system to HDFS.

Prepared design documents and functional documents.

Added extra nodes to the cluster based on requirements to make it scalable.

Developed HIVE queries for the analysis, to categorize different items.

Assisted application teams in installing Hadoop updates, operating system patches, and version upgrades when required.

Designed and created Hive external tables using a shared metastore instead of Derby, with partitioning, dynamic partitioning, and buckets.
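An illustrative sketch of a partitioned external Hive table and a dynamic-partition insert, issued here through spark.sql against a shared metastore; the table names, columns, and location are hypothetical, and bucketing (CLUSTERED BY) would be added in Hive DDL where needed.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("hive-ddl-demo")
         .enableHiveSupport().getOrCreate())

# Hypothetical external table, partitioned by event_date and stored as ORC
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_events (
        event_id STRING,
        item     STRING,
        amount   DOUBLE
    )
    PARTITIONED BY (event_date STRING)
    STORED AS ORC
    LOCATION '/data/warehouse/sales_events'
""")

# Allow dynamic partitioning so inserts create event_date partitions automatically
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
    INSERT INTO TABLE sales_events PARTITION (event_date)
    SELECT event_id, item, amount, event_date
    FROM staging_sales_events
""")
```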

Delivered a POC of Flume to handle real-time log processing for attribution reports.

Maintained system integrity of all sub-components (primarily HDFS, MapReduce, HBase, and Hive).

Environment: Hadoop, HDFS, MapReduce, YARN, Hive, Pig, Oozie, Sqoop, HBase, Flume, Linux, Shell scripting, Talend, Java Spring Boot, Eclipse, SQL.

Innohub Technologies India Pvt Ltd Hyderabad India July 2014 to Aug 2016

ETL/SQL Developer/Data Modeler

Responsibilities:

Designed, developed, and tested processes for extracting data from legacy systems and production databases.

Participated in performance, integration, and system testing.

Created Sample Data Sets using Informatica Test Data Management.

Worked on Informatica Data Quality (IDQ) to standardize addresses, including address profiling, merging, and parsing, using components like RBA, Token Parser, and Character Parser.

Counseled team members on the evaluation of data using the Informatica Data Quality (IDQ) toolkit.

Applied the data analysis, data cleansing, data matching, exception handling, and reporting and monitoring capabilities of IDQ.

Extensive experience working on AddressDoctor, matching, de-duping, and standardizing.

Provided in-depth knowledge of data governance and data management.

In-depth knowledge of MDM.

Performed hands-on development on the data quality tools IDQ and Data Archive using ILM and TDM.

Created and reused existing Unix shell scripts to schedule and automate the Informatica mappings and batch jobs, FTP files from different UNIX boxes, and run match/merge utilities.

Worked extensively with complex mappings using Expression, Aggregator, Filter, Lookup, Joiner, and Update Strategy transformations.

Developed procedures to populate data warehouse relational databases, transform and cleanse data, and load it into data marts.

Coordinated testing by writing and executing test cases, procedures, and scripts and creating test scenarios.

Environment: MapReduce, HBase, HDFS, Hive, IDQ, SQL, Cloudera Manager, Sqoop, Flume.


