Name: Vamshitha
Email: ************@*****.***
Phone No: 475-***-****
PROFESSIONAL SUMMARY:
●5+ years of professional experience in information technology as a Data Engineer, with expertise in Database Development, ETL Development, Data Modeling, Report Development and Big Data Technologies.
●Experience in Data Integration and Data Warehousing using ETL tools such as Informatica PowerCenter, AWS Glue, SQL Server Integration Services (SSIS) and Talend.
●Experience designing Business Intelligence solutions with Microsoft SQL Server using SQL Server Integration Services (SSIS), SQL Server Reporting Services (SSRS) and SQL Server Analysis Services (SSAS).
●Extensively used Informatica PowerCenter and Informatica Data Quality (IDQ) as ETL tools for extracting, transforming, loading and cleansing data from various source inputs into various targets, in batch and in real time.
●Experience working with Amazon Web Services (AWS) and services such as EC2, S3, RDS, EMR, VPC, IAM, Elastic Load Balancing, Lambda, Redshift, ElastiCache, Auto Scaling, CloudFront, CloudWatch, Data Pipeline, DMS and Aurora, as well as Snowflake on AWS.
●Strong expertise in relational database systems such as Oracle, MS SQL Server, Teradata, MS Access and DB2, with design and database development using SQL, PL/SQL, SQL*Plus, TOAD and SQL*Loader. Highly proficient in writing, testing and implementing triggers, stored procedures, functions, packages and cursors using PL/SQL.
●Hands-on experience with the Snowflake cloud data warehouse on AWS and S3 buckets for integrating data from multiple source systems, including loading nested JSON-formatted data into Snowflake tables.
●Launched Personalization and Loyalty programs at Confidential; initiated implementation of the Action IQ Customer Data Platform (CDP).
●Extensive experience in integration of Informatica Data Quality (IDQ) with Informatica PowerCenter.
●Extensive experience applying Data Mining solutions to various business problems and generating data visualizations using Tableau, Power BI and Alteryx.
●Good knowledge of and experience with the Cloudera ecosystem, including HDFS, Hive, Sqoop, HBase and Kafka, as well as data pipelines, data analysis and processing with Hive SQL, Impala, Spark and Spark SQL.
●Worked with different scheduling tools such as Talend Administration Center (TAC), UC4/Automic, Tidal, Control-M, Autosys, cron and TWS (Tivoli Workload Scheduler).
●Experienced in design, development, unit testing, integration, debugging, implementation and production support, as well as client interaction and understanding business applications, business data flows and data relations.
●Used Flume, Kafka and Spark Streaming to ingest real-time or near-real-time data into HDFS.
●Analyzed data and provided insights with Python Pandas.
●Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
●Worked on Data Migration from Teradata to AWS Snowflake Environment using Python and BI tools like Alteryx.
●Experience in moving data between GCP and Azure using Azure Data Factory.
●Developed Python scripts to parse flat files, CSV, XML and JSON files, extract data from various sources and load it into the data warehouse (a minimal illustrative sketch follows this summary).
●Developed automated migration scripts using Unix shell scripting, Python, Oracle/Teradata SQL, and Teradata macros and procedures.
●Expert in designing and developing complex mappings to extract data from diverse sources including flat files, RDBMS tables, legacy system files, XML files, applications, COBOL sources and Teradata.
●Used JIRA for defect/issue logging and tracking and documented all work in Confluence.
●Experience with ETL workflow management tools like Apache Airflow and significant experience writing Python scripts to implement workflows.
●Experience in identifying Bottlenecks in ETL Processes and Performance tuning of the production applications using Database Tuning, Partitioning, Index Usage, Aggregate Tables, Session partitioning, Load strategies, commit intervals and transformation tuning.
●Experience converting mainframe binary files from EBCDIC to ASCII.
●Completed certifications covering core AWS services such as EC2, EBS, S3, Lambda, Batch, Glue, Athena, CloudWatch, CloudTrail, ECS, ECR, EMR, IAM and SNS.
●Worked on performance tuning of user queries by analyzing explain plans, recreating user driver tables with the right primary index, scheduling collection of statistics, and adding secondary or other join indexes.
●Experience with scripting languages like PowerShell, Perl, Shell, etc.
●Expert knowledge and experience in dimensional modeling (star schema, snowflake schema), transactional modeling and slowly changing dimensions (SCDs).
●Design, implement and support Cloud Data Management and Advanced Analytics platforms.
●Excellent interpersonal and communication skills, experienced in working with senior level managers, businesspeople and developers across multiple disciplines.
●Strong problem-solving and analytical skills, with the ability to work both independently and as part of a team. Highly enthusiastic, self-motivated and quick to assimilate new concepts and technologies.
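Illustrative sketch (hypothetical example, not client code) of the Python file-parsing and warehouse-load pattern referenced above; the connection string, file paths and table names are placeholder assumptions:

import json
import pandas as pd
from sqlalchemy import create_engine

# Placeholder warehouse connection - adjust driver and credentials for the target database
engine = create_engine("postgresql://user:password@warehouse-host:5432/dw")

def load_csv(path, table):
    # Read the CSV into a DataFrame and append it to a staging table
    df = pd.read_csv(path)
    df.to_sql(table, engine, schema="staging", if_exists="append", index=False)

def load_json(path, table):
    # Flatten (possibly nested) JSON records before loading
    with open(path) as fh:
        records = json.load(fh)
    df = pd.json_normalize(records)
    df.to_sql(table, engine, schema="staging", if_exists="append", index=False)

load_csv("data/customers.csv", "customers_stg")
load_json("data/orders.json", "orders_stg")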
TECHNICAL SKILLS:
ETL
Informatica Power Center 10.x/9.6/9.1, AWS Glue, Talend 5.6, SQL Server Integration Services (SSIS)
Databases & Tools
MS SQL Server 2014/2012/2008, Teradata 15/14, Oracle 11g/10g, SQL Assistant, Erwin 8/9, ER Studio
Cloud Environment
AWS Snowflake, AWS RDS, AWS Aurora, Redshift, EC2, EMR, S3, Lambda, Glue, Data Pipeline, Athena, Data Migration Services, SQS, SNS, ELB, VPC, EBS, Route 53, CloudWatch, AWS Auto Scaling, Git, AWS CLI, Jenkins, Microsoft Azure, Google Cloud Platform (GCP)
Reporting Tools
Tableau, PowerBI
Big Data Ecosystem
HDFS, MapReduce, Hive/Impala, Pig, Sqoop, HBase, Spark, Scala, Kafka
Programming languages
Unix Shell Scripting, SQL, PL/SQL, Perl, Python, T-SQL
Data Warehousing & BI
Star Schema, Snowflake schema, Facts and Dimensions tables, SAS, SSIS, and Splunk
AWS Certifications
AWS Cloud Quest: Solutions Architect, AWS Knowledge: Architecting, Cloud Practitioner
LinkedIn Certifications
Python for Data Science, Data Science Foundations: Data Engineering, Advanced NoSQL for Data Science, DevOps for Data Scientists, SQL Essential Training
Udemy Certifications
Azure Databricks and Spark for Data Engineer, Snowflake-The Masterclass
Google Cloud Certification
BigQuery Fundamentals for Snowflake Professionals
Microsoft Certifications
Microsoft Azure Databricks for Data Engineer
HackerRank Certifications
SQL (Basic, Intermediate, Advanced)
PROFESSIONAL EXPERIENCE:
Client: Factory Mutual Insurance, Alpharetta, GA March 2023 – Till Date
Role: Data Engineer
Responsibilities:
●Involved in building and architecting multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation.
●Performed Informatica Cloud Services and Informatica PowerCenter administration, defined ETL strategies and built Informatica mappings. Set up the Secure Agent and connected different applications and their data connectors to process different kinds of data, including unstructured (logs, click streams, shares, likes, topics, etc.), semi-structured (XML, JSON) and structured (RDBMS) data.
●Worked extensively with AWS services like EC2, S3, VPC, ELB, Auto Scaling Groups, Route 53, IAM, CloudTrail, CloudWatch, CloudFormation, CloudFront, SNS, and RDS.
●Developed Python scripts to parse XML and JSON files and load the data into the Snowflake data warehouse on AWS.
●Conducted data mapping workshops with business data owners and functional teams to perform data mapping of data migration objects.
●Developed a well-structured framework implementing the entire pipeline functionality using a Maven template.
●Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (Parquet/text files) into Amazon Redshift (see the illustrative sketch following this list).
●Built a program with Python and Apache Beam, executed in Cloud Dataflow, to run data validation between raw source files and BigQuery tables.
●Created new Solace queues, implemented JMS listeners to receive price messages, and used an XML parser with JAXB to parse and format the messages.
●Worked with the Data Architect to break complex data objects into smaller data objects in order to simplify the data migration process and increase data migration accuracy.
●Strong background in Data Warehousing, Business Intelligence and ETL process (Informatica, AWS Glue) and expertise on working on large data sets and analysis.
●Developed UDFs in Java for Hive and Pig and worked on reading multiple data formats from HDFS using Scala.
●Built a configurable Scala and Spark based framework to connect to common data sources such as MySQL, Oracle, Postgres, SQL Server, Salesforce and BigQuery and load the data into BigQuery.
●Extensive knowledge of and hands-on experience implementing PaaS, IaaS and SaaS delivery models inside the enterprise (data center) and in public clouds using AWS, Google Cloud, Kubernetes, etc.
●Applied required transformation using AWS Glue and loaded data back to Redshift and S3.
●Created the functional and technical specifications for the data migration programs.
●Opened firewalls to enable communication with databases and Solace queues from Pivotal.
●Profiled data in IBM Information Analyzer using rule and column analysis against identified ECDEs and BVDEs.
●Used Azure DevOps Services for source code repository and to build project artifacts.
●Automated OpenStack and AWS deployments using CloudFormation, Ansible, Chef and Terraform.
●Designed end-to-end ETL pipelines for data coming from external sources into the GBI division of the company and stored in several data warehouses.
●Provide technological guidance on Data Lake and Enterprise Data Warehouse design, development, implementation and monitoring.
●Understand, Design and Implement Data Security around cloud infrastructures.
●Provide support and guidance to Data Services and other application development teams on various AWS Products.
●Work with leadership on process improvement and strategic initiatives on Cloud Platform.
●Attended Agile/Scrum events (stand-ups, planning, reviews and retrospectives) and provided constant input in support of a high-performing team.
●Developed programs in Java and Scala/Spark to reformat data after extraction from HDFS for analysis.
●Involved in performing the Linear Regression using Scala API and Spark.
●Integrated Apache Storm with Kafka to perform web analytics; uploaded clickstream data from Kafka to HDFS, HBase and Hive by integrating with Storm.
●Extensively worked on Informatica tools like source analyzer, mapping designer, workflow manager, workflow monitor, Mapplets, Worklets and repository manager.
●Worked on Data Extraction, aggregations and consolidation of Adobe data within AWS Glue using PySpark.
●Developed SSIS packages to extract, transform and load data from legacy mainframe data sources into the SQL Server database.
●Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.
●Used Postman to send HTTP GET requests to RESTful APIs and validate the API calls.
●Used Provisioners in Terraform to execute scripts.
●Created custom data migration programs to load Material BOM and Sales BOM.
●Created Informatica workflows and IDQ mappings for batch and real-time processing.
●Developed the PySpark code for AWS Glue jobs and for EMR.
●Developed ETL Python scripts for ingestion pipelines running on an AWS infrastructure of EMR, S3, Redshift and Lambda.
●Applied write concerns to set the level of acknowledgement for MongoDB write operations and to avoid rollbacks.
●Understood the existing internal pipeline framework written in Python and designed the external pipeline in parallel to conform with it.
●Specializing in DR/BCP/CDP technologies. Worked with CDP Continuous Data Protection products to implement near real time DR and branch data consolidation solutions.
●Monitored BigQuery, Dataproc and Cloud Dataflow jobs via Stackdriver for all environments.
●Configured EC2 instances, IAM users and roles, and created an S3 data pipeline using the Boto API to load data from internal data sources.
●Took part in a team environment implementing an Agile/Scrum software development approach.
●Provided Best Practice document for Docker, Jenkins, Puppet and GIT.
●Expertise in implementing a DevOps culture through CI/CD tools such as Repos, CodeDeploy, CodePipeline and GitHub.
●Installed and configured a Splunk Enterprise environment on Linux; configured Universal and Heavy Forwarders.
●Developed a server-based web traffic statistical analysis tool using RESTful APIs, Flask and Pandas.
●Analyzed various types of raw files such as JSON, CSV and XML with Python using Pandas, NumPy, etc.
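Illustrative sketch of the kind of AWS Glue PySpark job described above (S3-to-Redshift campaign load). The bucket paths, Glue connection name and target table are hypothetical placeholders, not actual client resources:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

# Standard Glue job bootstrap
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read campaign data (Parquet) from S3 as a DynamicFrame (hypothetical bucket)
campaigns = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-bucket/campaigns/"]},
    format="parquet",
)

# Example transformation: keep only records that carry a campaign_id
campaigns = campaigns.filter(lambda rec: rec["campaign_id"] is not None)

# Write to Redshift through a pre-defined Glue connection (placeholder name)
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=campaigns,
    catalog_connection="redshift-connection",
    connection_options={"dbtable": "stg.campaigns", "database": "analytics"},
    redshift_tmp_dir="s3://example-bucket/temp/",
)
job.commit()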
Environment: Informatica PowerCenter 10.x/9.x, IDQ, AWS Redshift, Snowflake, S3, Postgres, Google Cloud Platform (GCP), MS SQL Server, BigQuery, Salesforce SQL, Python, Postman, Tableau, Unix Shell Scripting, EMR, GitHub.
Client: Societe Generale, Bangalore, India July 2021 – Jan 2023
Role: Data Engineer
Responsibilities:
●Involved in Agile methodologies, daily Scrum meetings, Sprint planning.
●Implemented a proof of concept deploying the product in Amazon Web Services (AWS).
●Implemented solutions for ingesting data from various sources and processing the Data-at-Rest utilizing Big Data technologies using Hadoop, MapReduce, HBase, Hive and Cloud Architecture.
●Responsible for creating on-demand tables on S3 files using Lambda functions and AWS Glue with Python and PySpark.
●Developed complex Hive ETL logic for data cleansing and transformation of data coming through relational systems.
●Incomplete Shocks Finder tool: worked on the tool that replaced the manual work of finding data quality commands and eased the process of relaunching them; an Oozie Spark job that runs every day and generates a report containing the data quality commands in the appropriate format.
●CCR Error Management OPAL library: worked on the development of the OPAL Java library for Credit Risk Error Management, used by downstream consumers to read error-managed data from the data lake.
●CCR NRT tool: Developed a Non-Regression Testing tool for CCR Error Managed Data, which would be used to compare pre-production and production outputs while testing.
●CCR Error Notify: an Azure Kubernetes Service (AKS) job used to send notifications downstream via Event Hub when the data is ready.
●CCR Monitoring UI Migration: Migrated the monitoring UI to the newer version for Credit Risk while adding support for the Time Range filter and fixed common bugs on both production and non-production data. Updated the module to support the new data transformation changes.
●Worked with Apache Spark, creating RDDs and DataFrames, applying transformations and actions, and converting RDDs to DataFrames (see the illustrative sketch following this list).
●Developed Spark batch jobs using Spark SQL and data migration strategies.
●Pre-processed and cleaned data using MapReduce and performed feature engineering using Python. Implemented REST API data interfaces.
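A minimal PySpark illustration of the RDD-to-DataFrame pattern mentioned above; the column names and sample values are invented for the example:

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("rdd-to-dataframe-demo").getOrCreate()

# Build an RDD and apply lazy transformations before any action runs
trades = spark.sparkContext.parallelize([
    ("CPTY-1", 125.0), ("CPTY-2", -40.0), ("CPTY-1", 310.5),
])
exposures = (trades
             .filter(lambda t: t[1] > 0)                              # transformation
             .map(lambda t: Row(counterparty=t[0], exposure=t[1])))   # transformation

# Convert the RDD to a DataFrame and query it with Spark SQL
df = spark.createDataFrame(exposures)
df.createOrReplaceTempView("exposures")
spark.sql(
    "SELECT counterparty, SUM(exposure) AS total_exposure "
    "FROM exposures GROUP BY counterparty"
).show()                                                              # action triggers execution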
Environment: Hadoop, Scala, Spark, Spark Streaming, Spark SQL, Apache Kafka, Hive, HBase, MS SQL, MySQL, Python, Azure (Data Storage Explorer, ADF, AKS, Blob Storage), Linux, Azure SQL Database, MongoDB, Azure ML Studio, Jenkins, Flask Framework, IntelliJ, PyCharm, Git, Eclipse, Azure Data Factory, Tableau, Postman, Agile Methodologies, AWS Lambda, Azure Cloud, Docker.
Client: Equiniti India, Pvt Ltd, India June 2020 – June 2021
Role: Data Engineer
Responsibilities:
●Contributing to the development of key data integration and advanced analytics solutions leveraging Apache Hadoop and other big data technologies for leading organizations using major Hadoop Distributions like Hortonworks.
●Involved in Agile methodologies, daily Scrum meetings, Sprint planning.
●Performed data transformations in Hive and used partitions and buckets for performance improvements (see the illustrative sketch following this list).
●Created Hive external tables on the MapReduce output, then applied partitioning and bucketing on top of them.
●Developed business-specific custom UDFs in Hive and Pig.
●Published data-related articles and documentation in GitHub repositories, effectively communicating complex technical concepts and solutions to a broader audience.
●Demonstrated strong documentation skills by maintaining detailed technical documentation for SSIS packages, ensuring knowledge sharing and team collaboration.
●Developed end-to-end architecture designs for big data solutions based on a variety of business use cases.
●Worked as a Spark expert and performance optimizer.
●Member of the Spark COE (Center of Excellence) in the Data Simplification project at Cisco.
●Experienced with SparkContext, Spark SQL, DataFrames, pair RDDs and Spark.
●Handled data skew in Spark SQL.
●Implemented Spark using Scala and Java, utilizing DataFrames and the Spark SQL API for faster processing of data.
●Developed Spark code and Spark SQL/Streaming for faster testing and processing of data.
●Developed a data pipeline using Kafka, HBase, Spark and Hive to ingest, transform and analyze customer behavioral data; also developed Spark jobs and Hive jobs to summarize and transform data.
●Worked extensively with importing metadata into Hive and migrated existing tables and applications to work on Hive and Spark.
●Implemented Sqoop imports from Oracle and MongoDB into Hadoop and loaded the data back in Parquet format.
●Used Spark for interactive queries, processing of streaming data and integration with popular NoSQL databases for huge volumes of data; worked under the MapR distribution and familiar with HDFS.
●Handled importing data from different data sources into HDFS using Sqoop, performing transformations using Hive and MapReduce, and then loading the data into HDFS.
●Collecting and aggregating large amounts of log data using Flume and staging data in HDFS for further analysis.
●Worked on the Oozie scheduler to automate the pipeline workflow and orchestrate the Sqoop, Hive and Pig jobs that extract the data in a timely manner.
●Exported data from HDFS to RDBMS via Sqoop for Business Intelligence, visualization, and user report generation.
●Preparation of docs like Functional Specification document and Deployment Instruction documents.
●Fixed defects during the QA phase, supported QA testing, troubleshot defects and identified the source of defects.
●Involved in installing Hadoop Ecosystem components (Hadoop, MapReduce, Spark, Pig, Hive, Sqoop, Flume, Zookeeper and HBase).
●Worked collaboratively with all levels of business stakeholders to architect, implement and test Big Data based analytical solution from disparate sources.
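Illustrative sketch of the partitioned, bucketed Hive external table pattern referenced above, expressed through PySpark with Hive support; the database, table name and HDFS location are hypothetical placeholders:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partitioning-demo")
         .enableHiveSupport()
         .getOrCreate())

# External table over the MapReduce output, partitioned by load date and bucketed by customer_id
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.transactions (
        customer_id STRING,
        amount      DOUBLE
    )
    PARTITIONED BY (load_date STRING)
    CLUSTERED BY (customer_id) INTO 16 BUCKETS
    STORED AS PARQUET
    LOCATION 'hdfs:///warehouse/analytics/transactions'
""")

# Register newly landed partition directories so Hive/Spark queries can see them
spark.sql("MSCK REPAIR TABLE analytics.transactions")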
Environment: AWS S3, RDS, EC2, Redshift, Hadoop, Hive, Pig, Sqoop, Oozie, HBase, Flume, Hortonworks, MapReduce, Kafka, HDFS, Oracle, Microsoft, Java, GIS, Spark, Zookeeper
Client: Brigade Corporation India, Pvt Ltd, India May 2019 – May 2020
Role: Data Engineer
Responsibilities:
●Experience in building and architecting multiple data pipelines, end-to-end ETL and ELT for data ingestion and transformation in GCP, and coordinating tasks among them.
●Implemented and managed ETL solutions and automated operational processes.
●Designed and developed ETL integration patterns using Python on Spark.
●Developed a framework for converting existing PowerCenter mappings to PySpark (Python and Spark) jobs.
●Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators (see the illustrative DAG sketch following this list).
●Used Stitch ETL tools to integrate data into the central data warehouse.
●Experience with GCP Dataproc, GCS, Cloud Functions, Dataprep, Data Studio and BigQuery.
●Experience using Google Cloud Functions with Python to load data into BigQuery upon arrival of CSV files in a GCS bucket.
●Experience loading bounded and unbounded data from Google Pub/Sub topics to BigQuery using Cloud Dataflow with Python.
●Used REST APIs with Python to ingest data from various sites into BigQuery.
●Implemented Spark RDD transformations to map business analysis and apply actions on top of Transformations.
●Designed star schemas in BigQuery.
●Worked on creating various types of indexes on different collections to get good performance in the MongoDB database.
●Monitored BigQuery, Dataproc and Cloud Dataflow jobs via Stackdriver for all environments.
●Used Agile for the continuous model deployment.
●Worked with Google Data Catalog and other Google Cloud APIs for monitoring, query and billing-related analysis of BigQuery usage.
●Knowledge of Cloud Dataflow and Apache Beam.
●Used Snowflake for data storage and processing, which is easier and faster to use.
●Wrote a Python program to maintain raw file archival in a GCS bucket.
●Used Airflow to manage task scheduling, progress and success status using DAGs.
●Created BigQuery authorized views for row-level security and for exposing data to other teams.
●Integrated services such as GitHub and Jenkins to create a deployment pipeline.
●Implemented new project build frameworks using Jenkins as the build framework tool.
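Illustrative Airflow DAG sketch for the GCS-to-BigQuery loads described above; the bucket, project, dataset and table names are hypothetical placeholders, and the example assumes the Airflow Google provider package is installed:

from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="gcs_csv_to_bigquery",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Load the day's CSV drops from a landing bucket into a BigQuery table
    load_daily_csv = GCSToBigQueryOperator(
        task_id="load_daily_csv",
        bucket="example-landing-bucket",
        source_objects=["incoming/{{ ds }}/*.csv"],
        destination_project_dataset_table="example-project.analytics.daily_events",
        source_format="CSV",
        skip_leading_rows=1,
        autodetect=True,
        write_disposition="WRITE_APPEND",
    )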
Environment: T-SQL, PL/SQL, Google Cloud, Python, BigQuery, Dataflow, Dataproc, Dataprep, Data Studio, Bigtable, Stitch ETL, PySpark, Snowflake, MySQL, Airflow, Shell Scripts, MongoDB, GIT, Apache Spark, Docker