
Data Engineer Integration

Location:
Denton, TX
Posted:
October 28, 2024

Resume:

Name: Vamsi G

Mobile: +1-682-***-****

Email: **********@*****.***

PROFESSIONAL SUMMARY

9+ years of experience as a Data Engineer working on data integration, data ingestion, and data warehousing using Teradata and Oracle technologies.

Developed multi-cloud strategies that make better use of GCP (for its PaaS offerings) and Azure (for its SaaS offerings).

Experience in building and architecting multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP, and in coordinating tasks among the team.

Experience in importing and exporting data using Sqoop from HDFS to Relational Database Systems and vice versa.

Experienced in Python data manipulation for loading and extraction, as well as Python libraries such as NumPy, PySpark, pandas, PyTorch, Matplotlib, Seaborn, scikit-learn, and SciPy for data analysis and numerical computations.
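
The following is a minimal, hypothetical sketch of this kind of pandas/NumPy data manipulation; the file name and column names are illustrative placeholders, not taken from any specific project.

    import numpy as np
    import pandas as pd

    # Hypothetical input file and column names, used only for illustration.
    df = pd.read_csv("claims.csv", parse_dates=["service_date"])

    # Basic cleansing: drop duplicate claims and fill missing amounts with 0.
    df = df.drop_duplicates(subset=["claim_id"]).fillna({"claim_amount": 0.0})

    # Monthly summary using NumPy-backed aggregations.
    monthly = (
        df.assign(month=df["service_date"].dt.to_period("M"))
          .groupby("month")["claim_amount"]
          .agg(total="sum", average="mean", p95=lambda s: np.percentile(s, 95))
    )
    print(monthly.head())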

Used PyTorch, an open-source machine learning library for Python, primarily for developing deep learning models.

Strong experience in NoSQL databases like HBase, Cassandra, and MongoDB and SQL databases like Teradata, Oracle, PostgreSQL, and SQL Server.

Strong experience with ETL processes using tools such as Informatica, Teradata utilities, and SQL.

Worked with Spark and Scala to compare the performance of Spark with Hive and SQL, and used Spark SQL to manipulate DataFrames in Scala.

Designed and developed various scalable systems using Hadoop technologies and analyzed data using MapReduce, Hive, and Pig.

Knowledge and experience with CI/CD pipelines, using Docker for containerization and Jenkins for build automation.

Led analytics projects with budgets exceeding $1 million, demonstrating proficiency in managing substantial financial resources to deliver impactful results.

Extensive experience in various business domains like Healthcare, Financial, Investment and Retail.

Extensively worked on designing, developing, and implementing data models for enterprise-level applications and BI solutions. Experience with Agile and Waterfall methodologies and the SDLC.

Extensively worked on designing and building the data management lifecycle, covering data ingestion, integration, consumption, and delivery, as well as reporting, analytics, and system-to-system integration.

Strong expertise in requirement gathering, requirement analysis, design, coding/development, testing, support, and documentation using Apache Spark and Scala, Python, HDFS, YARN, Sqoop, Hive, MapReduce, and Kafka, as well as analysis, design, development, implementation, modeling, testing, and support for data warehousing applications.

Experience in big data environments and Hadoop components for working with large-scale data, including structured and semi-structured data.

Good knowledge of AWS services such as EMR, Redshift, S3, EC2, Lambda, and Glue, including configuring servers for auto-scaling.

Extensive experience with Azure cloud technologies such as Azure Data Lake Storage, Azure Data Factory, Azure SQL, Azure SQL Data Warehouse, Azure Synapse Analytics, Azure Analysis Services, Azure HDInsight, and Databricks.

Experience in building Power BI reports on Azure Analysis Services for better performance.

Hands-on experience with Google Cloud Platform: BigQuery, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil, the bq command-line utility, and Dataproc.

Used the Cloud SDK from Cloud Shell in GCP to configure services such as Dataproc, Cloud Storage, and BigQuery.

Hands-on experience on Google Cloud Platform (GCP) across its big data products: BigQuery, Cloud Dataproc, Cloud Storage, and Composer (Airflow as a service).
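
As an illustration of this kind of BigQuery work, the sketch below runs a query through the google-cloud-bigquery Python client; the project, dataset, table, and column names are hypothetical placeholders, and application-default credentials are assumed to be configured.

    from google.cloud import bigquery

    # Placeholder project/dataset/table names; assumes default credentials.
    client = bigquery.Client(project="my-gcp-project")

    sql = """
        SELECT store_id, SUM(sales_amount) AS total_sales
        FROM `my-gcp-project.analytics.daily_sales`
        WHERE sale_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
        GROUP BY store_id
        ORDER BY total_sales DESC
        LIMIT 10
    """

    # Run the query and iterate over the result rows.
    for row in client.query(sql).result():
        print(row.store_id, row.total_sales)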

Working knowledge of data warehousing concepts such as star schema, snowflake schema, data marts, and the Kimball methodology, used in relational, dimensional, and multidimensional data modeling.

TECHNICAL SKILLS

Teradata Utilities

BTEQ, Fast Load, Multiload, TPT, TPump, SQL Assistant, Viewpoint, Query Monitor, SQL workbench

ETL Tools

Informatica Power Center 9.x/8.x (Source Analyzer, Repository Manager, Transformation Developer, Mapplet Designer, Mapping Designer, Workflow Manager, Workflow Monitor, Warehouse Designer and Informatica Server)

Databases

Teradata 14.10/14/13.10/13, Oracle 11g/10g/8i, DB2/UDB, SQL Server, MySQL, MongoDB, DynamoDB, Cassandra

Languages

SQL, PL/SQL, XML, UNIX Shell Scripting

Operating Systems

Windows 95/98/NT/2000/XP, UNIX, Linux, NCR MP-RAS UNIX

Data Modeling

Erwin, ER Studio

Tools/Utilities

PL/SQL, TOAD, SQL Server, SQL Developer, Erwin, Microsoft Visio, Tableau, MicroStrategy

Scheduler

UC4, Control M, Autosys

Big Data

Hadoop, MapReduce, Spark, HDFS, Sqoop, YARN, Oozie, Hive, Impala, Zookeeper, Apache Flume, Apache Airflow, Cloudera, HBase

Programming Languages

Python, PyTorch, Java, PL/SQL, SQL, Scala, PowerShell, C, C++, T-SQL

Cloud Services

Azure Data Lake Storage Gen 2, Azure Data Factory, Blob Storage, Azure SQL DB, Databricks, Azure Event Hubs, AWS RDS, Amazon SQS, Amazon S3, AWS EMR, Redshift, Glue, Lambda, AWS SNS, BigQuery, GCS buckets, GCP (Google Cloud Platform) Cloud Functions, Dataflow, Pub/Sub, Cloud Shell

PROFESSIONAL EXPERIENCE

Client: Walgreens, Chicago, IL Aug 2022 - Present

Role: Sr. Data Engineer

Responsibilities:

Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcasts, effective and efficient joins, and transformations during the ingestion process itself.

Upgraded the existing HDInsight code to use Azure Databricks for better performance, since Databricks provides optimized Spark clusters.

Worked with Azure Data Factory (ADF) and Integration Runtime (IR) for file-system and relational data ingestion.

Used Spark Streaming with Kafka and Azure Event Hubs services.
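
A minimal sketch of this streaming pattern with PySpark Structured Streaming reading from a Kafka-compatible endpoint (Azure Event Hubs exposes one); the broker address, topic, schema, and paths are hypothetical placeholders, and the spark-sql-kafka connector is assumed to be on the classpath.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

    # Placeholder event schema.
    schema = StructType([
        StructField("customer_id", StringType()),
        StructField("amount", DoubleType()),
    ])

    # Placeholder broker list and topic name.
    events = (
        spark.readStream.format("kafka")
             .option("kafka.bootstrap.servers", "broker1:9092")
             .option("subscribe", "transactions")
             .load()
             .select(from_json(col("value").cast("string"), schema).alias("e"))
             .select("e.*")
    )

    # Land the parsed events as Parquet with checkpointing for fault tolerance.
    query = (
        events.writeStream
              .format("parquet")
              .option("path", "/data/stream/transactions")
              .option("checkpointLocation", "/data/stream/_checkpoints")
              .start()
    )
    query.awaitTermination()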

Used pattern-matching algorithms to recognize customers across different sources, built risk profiles for each customer using Hive, and stored the results in HBase.

Implemented a proof of concept deploying this product on Amazon Web Services (AWS).

Ingested data from various sources and processed the Data-at-Rest utilizing Big Data technologies.

Developed advanced PL/SQL packages, procedures, triggers, functions, Indexes, and Collections to implement business logic using SQL Navigator.

Worked as a Data Engineer on several Hadoop Ecosystem components with Cloudera Hadoop distribution.

Worked on Apache Spark using Scala and Python.

Extensively used the Spark Core and Spark SQL libraries to perform transformations on the data.
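
The sketch below illustrates that DataFrame/Spark SQL transformation style; the source path, columns, and filter values are hypothetical placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("transform-sketch").getOrCreate()

    # Placeholder source path and columns.
    orders = spark.read.parquet("/data/raw/orders")

    # DataFrame-API transformation.
    daily = (
        orders.filter(F.col("status") == "COMPLETE")
              .groupBy("order_date")
              .agg(F.sum("order_total").alias("revenue"))
    )

    # Equivalent Spark SQL over a temporary view.
    orders.createOrReplaceTempView("orders")
    daily_sql = spark.sql("""
        SELECT order_date, SUM(order_total) AS revenue
        FROM orders
        WHERE status = 'COMPLETE'
        GROUP BY order_date
    """)

    daily.write.mode("overwrite").parquet("/data/curated/daily_revenue")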

Designed and developed the Azure Data Lake storage and placed files from various sources into the data lake.

Stored data from source systems in the data lake and processed it using Spark.

Worked with AWS to implement client-side encryption, as DynamoDB did not support encryption at rest at the time.

Provided thought leadership for the architecture and design of big data analytics solutions for customers, actively driving proofs of concept (POCs) to implement big data solutions.

Developed and implemented logical and physical data models using the enterprise modeling tool Erwin.

Created Hive queries and tables that helped line of business identify trends by applying strategies on historical data before promoting them to production.

Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process the data using SQL activities.

Designed and developed cubes using SQL Server Analysis Services (SSAS) using Microsoft Visual Studio.

Performed performance tuning of OLTP and Data warehouse environments using SQL.

Created data structures to store the dimensions in a way that supports efficient retrieval, deletion, and insertion of data.

Used Hive to analyze data ingested into HBase via Hive-HBase integration and computed various metrics for dashboard reporting.

Developed shell scripts, stored procedures, and macros to automate jobs to run at the required frequency.

Involved in handling day-to-day needs of the development team within the specified SLA, as well as ad hoc requests.

Developed Pig scripts to parse the raw data, populate staging tables and store the refined data in partitioned DB2 tables for Business analysis.

Environment: HBase, Oozie 4.3, Hive 2.3, Sqoop 1.4, SDLC, OLTP, SSAS, SQL, Oracle 12c, PL/SQL, ETL, AWS, Flume, Spark Core, Spark RDDs, Python, Impala, Kerberos (security), LDAP, crontab, Delta Lake, Azure Event Hubs, Stream Analytics, Azure Blob Storage, PowerShell, Apache Airflow, Hadoop, YARN, PySpark.

Client: Invesco, New York, NY March 2020 - July 2022

Role: Sr. Data Engineer

Responsibilities:

Worked on creating Azure Data Factory pipelines and managing policies for Data Factory, and utilized Blob Storage for storage and backup on Azure.

Developed and ingested the data in Azure cloud from web service and loaded it to Azure SQL DB.

Used PyTorch, an open-source machine learning library for Python primarily used for developing deep learning models, leveraging features like its dynamic computational graph for easier debugging.
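
A minimal, illustrative PyTorch sketch of this kind of model; the layer sizes and the randomly generated training batch are placeholders, not project data.

    import torch
    import torch.nn as nn

    # Illustrative feed-forward network; sizes are placeholders.
    model = nn.Sequential(
        nn.Linear(16, 32),
        nn.ReLU(),
        nn.Linear(32, 1),
    )
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    x = torch.randn(64, 16)   # dummy feature batch
    y = torch.randn(64, 1)    # dummy targets

    for epoch in range(10):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)   # the graph is rebuilt dynamically on each forward pass,
        loss.backward()               # which is what makes step-through debugging straightforward
        optimizer.step()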

Worked with DevOps practices and Terraform.

Managed Teradata servers across the globe: monitoring, responding to alerts, tuning, creating appropriate indexes, and opening incidents with Teradata with assistance and follow-up through CSR.

Performed system-level and application-level tuning and supported the application development team on database needs and guidance using tools and utilities such as Explain, Visual Explain, PMON, and DBC views.

Developed shell scripts, stored procedures, and macros to automate jobs to run at the required frequency.

Involved in handling day-to-day needs of the development team within the specified SLA, as well as ad hoc requests.

Developed multi-cloud strategies that make better use of GCP (for its PaaS offerings) and Azure (for its SaaS offerings).

Used PostgreSQL's geospatial data types and functions for location data, such as mapping and geocoding, making it an excellent choice for location-aware applications.

Used Postgres as a data warehouse for storing large volumes of data for analytics, leveraging advanced SQL features such as window functions, common table expressions, and subqueries for complex analysis.
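
A minimal sketch of that CTE-plus-window-function style of analysis run from Python with psycopg2; the connection details, table, and column names are hypothetical placeholders.

    import psycopg2

    # Placeholder connection details and table/column names.
    conn = psycopg2.connect(host="localhost", dbname="warehouse",
                            user="etl", password="secret")

    sql = """
        WITH daily AS (
            SELECT account_id, trade_date, SUM(amount) AS daily_total
            FROM trades
            GROUP BY account_id, trade_date
        )
        SELECT account_id,
               trade_date,
               daily_total,
               SUM(daily_total) OVER (PARTITION BY account_id
                                      ORDER BY trade_date) AS running_total
        FROM daily
        ORDER BY account_id, trade_date;
    """

    # Execute the query and print the running totals per account.
    with conn, conn.cursor() as cur:
        cur.execute(sql)
        for account_id, trade_date, daily_total, running_total in cur.fetchall():
            print(account_id, trade_date, daily_total, running_total)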

Used Postgres as the database backend for content management systems (CMS) such as Drupal and Joomla.

Used Data lineage to track data's path, transformations, and interactions with various systems and processes throughout its lifecycle.

As a Data Engineer, worked primarily with databases and many different types of data software.

Worked in complete Software Development Life Cycle (SDLC) which includes gathering and analyzing business requirements, understanding the functional workflow of information from source systems to destination systems.

Ingested, analyzed, and interpreted large data sets to develop technical and data-driven solutions to difficult business problems using tools such as SQL and Python.

Worked with databases, both relational and multidimensional.

Migrated applications from Teradata to Azure Data Lake Storage Gen 2 using Azure Data Factory; created tables and loaded and analyzed data in the Azure cloud. Also worked on cloning data and tables.

Environment: Teradata 14, Oracle, Teradata SQL Assistant, SQL Developer, BTEQ, FastLoad, MultiLoad, FastExport, UNIX, shell scripting, Viewpoint, Teradata Administrator, Tableau, Python, SQL, Cassandra, Azure Data Lake Storage Gen 2, Azure Data Factory, Azure SQL DB, Spark, Databricks, SQL Server, Kafka.

Client: Extended Stay America, Charlotte, NC Oct 2018 - Feb 2020

Role: Sr. Data Engineer

Responsibilities:

Used AWS cloud technologies such as S3 for data storage, EMR for data processing, EC2 for virtual Linux instances, CloudWatch for log analysis, and Lambda functions for triggering various jobs in AWS.
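
As an illustration, the sketch below uses boto3 to upload an extract to S3 and trigger a Lambda function asynchronously; the bucket, key, and function names are hypothetical placeholders.

    import json
    import boto3

    s3 = boto3.client("s3")
    lam = boto3.client("lambda")

    # Placeholder bucket, key, and function names.
    s3.upload_file("daily_extract.csv", "my-data-bucket", "raw/daily_extract.csv")

    # Asynchronously invoke a Lambda that processes the new object.
    response = lam.invoke(
        FunctionName="process-daily-extract",
        InvocationType="Event",
        Payload=json.dumps({"bucket": "my-data-bucket",
                            "key": "raw/daily_extract.csv"}).encode("utf-8"),
    )
    print(response["StatusCode"])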

Worked on complex Hive data types such as arrays, maps, and structs.

Utilized analytical skills to design, develop, and maintain reports, dashboards, and visualizations that provide insight to internal partners and external clients in reporting tools like Tableau and Excel.

Responsible for validation of transactional and profile data from RDBMS sources, which were transformed with Spark SQL in Azure Databricks and loaded to the data lake using Hadoop big data technologies.

Performed tasks such as writing scripts, building Azure Data Factory pipelines, calling APIs, and writing SQL queries.

Imported and exported data between MySQL and HDFS using Sqoop and managed data coming from different sources.

Worked in Azure environment for development and deployment of Custom Hadoop Applications.

Designed and implemented scalable cloud data and analytics solutions for various public and private cloud platforms using Azure.

Developed numerous MapReduce jobs in Scala for data cleansing and for analyzing data in Impala.

Used Hive to analyze data ingested into HBase via Hive-HBase integration and computed various metrics for dashboard reporting.

Worked on migration of existing Azure SQL Data warehouse to Azure Synapse Workspace.

Worked on migrating ADF into the Synapse workspace by setting up a PowerShell environment.

Migrated data from on-premises to Azure Synapse using StreamSets and AzCopy.

Performed data transformation in Azure Data Factory using data pipelines and data flows; implemented PolyBase, bulk load, and copy utilities for various transformations.

Worked extensively on PL/SQL development for producing billing and statutory reports.

Implemented complex Spark programs in Azure Databricks to perform joins, such as map-side joins using Spark SQL and distributed cache, in Java.

Environment: Spark Core, DataFrame API, AWS S3, EMR, Oracle 19c, Lambda, CloudFormation, CloudWatch, Python, Hive, Presto, crontab, Elasticsearch, Kibana, Hadoop 3.0, MS Azure, Sqoop, Agile, Kafka, OLTP, Erwin 9.7, MapReduce, HBase 1.2, Pig 0.17, OLAP

Client: HP, Inc, Spring, TX Dec 2016 - Sep 2018

Role: Sr. Data Engineer

Responsibilities:

Developed Spark applications in Python for distributed data processing, loading high-volume files into PySpark DataFrames, and incorporating snowflake schemas for structured data handling.
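
A minimal sketch of loading high-volume delimited files into a PySpark DataFrame with an explicit schema; the paths, columns, and delimiter are hypothetical placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField, StringType,
                                   DecimalType, DateType)

    spark = SparkSession.builder.appName("bulk-load-sketch").getOrCreate()

    # An explicit schema avoids a costly inferSchema pass over large files.
    schema = StructType([
        StructField("order_id", StringType(), False),
        StructField("customer_id", StringType(), True),
        StructField("order_date", DateType(), True),
        StructField("order_total", DecimalType(12, 2), True),
    ])

    orders = (
        spark.read.option("header", "true")
             .option("delimiter", "|")
             .schema(schema)
             .csv("/landing/orders/*.dat")   # placeholder landing path
    )

    # Write partitioned Parquet for downstream dimensional/snowflake-style models.
    orders.write.mode("overwrite").partitionBy("order_date").parquet("/curated/orders")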

Utilized PyTorch for developing deep learning models, leveraging dynamic computational graphs for easier debugging, and integrating snowflake schemas for enhanced data organization.

Coordinated with Data Modelers to design dimensional models, implementing snowflake schemas for optimized data warehousing.

Generated high-level and low-level design documents for source-to-target transformations, incorporating snowflake schemas for improved data structure.

Created Test Scenarios, test plans, and executed them within SLA, focusing on snowflake schema designs for accurate testing.

Developed reports using Tableau for executive reporting, integrating snowflake schemas for comprehensive data representation.

Designed, developed, and tested ETL workflows, SQL queries, stored procedures, and Shell scripts to implement complex business rules within the existing framework.

Migrated historical data and transferred data from various sources like XML, flat files, and different databases to the main and semantic database, incorporating snowflake schemas for improved data organization.

Performed ETL tasks including data cleansing, conversion, and transformations, collaborating with the DBA team to optimize queries and ensure efficient data loading.

Utilized Cassandra for handling diverse data types and ensuring data availability, with a focus on snowflake schema designs for enhanced data management.

Employed PowerShell scripting for data maintenance and configuration, automated and validated data using Apache Airflow, integrating snowflake designs for improved data structure.

Built data pipelines in Airflow for ETL jobs, incorporating snowflake schemas for consistent data representation.
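
A minimal, illustrative Airflow DAG for this kind of ETL pipeline; the DAG id, schedule, and Python callables are hypothetical placeholders (Airflow 2.x style).

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        print("pull data from source")   # placeholder extract step

    def transform():
        print("apply business rules")    # placeholder transform step

    def load():
        print("load into warehouse")     # placeholder load step

    with DAG(
        dag_id="daily_etl_sketch",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        t_extract = PythonOperator(task_id="extract", python_callable=extract)
        t_transform = PythonOperator(task_id="transform", python_callable=transform)
        t_load = PythonOperator(task_id="load", python_callable=load)

        # Linear dependency: extract -> transform -> load.
        t_extract >> t_transform >> t_load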

Utilized Delta Lakes for streamlined data processing, incorporating snowflake schemas for unified data representation and ACID transactions using Apache Spark.
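
A minimal sketch of the Delta Lake pattern with an ACID upsert via the DeltaTable merge API; the table paths and join key are hypothetical placeholders, and the delta-spark package is assumed to be installed and configured.

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = (
        SparkSession.builder.appName("delta-sketch")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    updates = spark.read.parquet("/landing/dim_customer_updates")  # placeholder path

    # First run: create the dimension as a Delta table (versioned, ACID).
    updates.write.format("delta").mode("overwrite").save("/curated/dim_customer")

    # Subsequent runs: upsert new or changed rows into the Delta table.
    dim = DeltaTable.forPath(spark, "/curated/dim_customer")
    (
        dim.alias("t")
           .merge(updates.alias("s"), "t.customer_id = s.customer_id")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute()
    )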

Environment: SQL Workbench, Teradata 14, Informatica Power Center 9.1/9.5, Tableau, GitHub, Autosys, Jira, ServiceNow, Python, SQL, Cassandra, Apache Spark, Delta Lake, Stream Analytics, PowerShell, Apache Airflow, Hadoop, YARN, PySpark, Hive

Client: Trane Technologies, Minneapolis, MN June 2015 - Nov 2016

Role: Data Analyst

Responsibilities:

Worked closely with Business Analysts and report developers in writing source-to-target specifications for data warehouse tables based on business requirements.

Exported data into Excel for business meetings, which made discussions easier while looking at the data.

Performed analysis after requirements gathering and walked the team through major impacts.

Provided and debugged crucial reports for finance teams during month-end period.

Addressed issue reported by Business Users in standard reports by identifying the root cause.

Resolved reporting issues by identifying whether they were report-related or source-related.

Created ad hoc reports per users' needs.

Investigating and analyzing any discrepancy found in data and then resolving it. Implemented reporting Data Warehouse with online transaction system data.

Developed and maintained a data warehouse for the PSN project.

Provided reports and publications to Third Parties for Royalty payments.

Managed user accounts, groups, and workspace creation for different users in PowerCenter. Wrote complex UNIX/Windows scripts for file transfers and emailing tasks over FTP/SFTP.

Worked with PL/SQL procedures and used them in Stored Procedure Transformations.

Extensively worked on Oracle and SQL Server; wrote complex SQL queries against the ERP system for data analysis purposes.

Worked on most critical Finance projects and had been the go-to person for any data-related issues for team members.

Migrated ETL code from Talend to Informatica. Involved in development, testing and post-production for the entire migration project.

Environment: Informatica Power Center 9.1/9.0, Talend 4.x & Integration suite, Business Objects XI, Oracle 10g/11g, Oracle ERP, EDI, SQL Server 2005, UNIX, Windows Scripting, JIRA


