Goutham
Email: *******************@*****.***
PH: (205) 615-105
Sr Data Engineer
PROFESSIONAL SUMMARY
10+ years of professional IT experience as a Data Engineer with expertise in designing data-intensive applications using the Hadoop ecosystem, Big Data analytics, cloud data engineering, data science, data warehousing, data visualization, reporting, and data quality solutions.
Excellent knowledge of setting up, designing, and using Hadoop ecosystem components such as Hadoop MapReduce, HDFS, HBase, Oozie, Hive, Sqoop, Pig, Spark, Kafka, Storm, Zookeeper, and Flume.
Worked on data architecture, including the design of data ingestion pipelines, Hadoop information architecture, data modelling, data mining, and advanced data processing.
Experience with CDH3, CDH4, CDH5, and CDH6 clusters, as well as the installation, configuration, maintenance, and management of the Cloudera Hadoop platform.
Understanding of controlling and providing database access, migrating on-premises databases to Azure using Azure Data Factory and Azure Data Lake Storage, and leveraging Azure Data Lake Analytics, Azure SQL Database, Azure Databricks, and Azure SQL Data Warehouse.
Good expertise in the setup of Azure data solutions: provisioning storage accounts, Azure Data Factory, SQL Server, SQL databases, SQL Data Warehouse, Azure Databricks, and Azure Cosmos DB.
Expertise in dimensional modelling, data cleansing, data profiling, data migration, and ETL processes for data warehouses.
Knowledge of automating and securing AWS infrastructure using AWS CloudFormation, API Gateway, and AWS Lambda; proficient in building CI/CD in an AWS environment using AWS CodeCommit, CodeBuild, CodeDeploy, and CodePipeline.
Knowledge of the AWS family of services, including serverless offerings, Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Elastic Load Balancing, Auto Scaling, CloudFront, CloudWatch, Redshift, SNS, SES, and SQS.
Worked with the Hadoop architecture and its ecosystem, including HDFS, Hive, Pig, Sqoop, JobTracker, TaskTracker, and NameNode.
Experience setting up jobs using various stages such as Join, Merge, Lookup, Remove Duplicates, Filter, Dataset, Lookup File Set, Complex Flat File, Modify, Aggregator, and XML.
Experience in database creation and data modelling with Oracle, Teradata, Netezza, DB2, MongoDB, HBase, and SQL Server databases.
Involved with the Design and Development of ETL pipelines to move data into the data warehouse from different sources.
Experience in Data Modelling for Data Warehouse and Data Mart using Star Schema, Snowflake Schema, Fact and Dimension Tables, ER Diagrams, OLAP & OLTP Systems.
Utilized ERWIN and PowerDesigner extensively to create logical and physical data models.
Good knowledge in using UNIX shell scripts to process massive volumes of data from a variety of sources and load them into Teradata databases.
Implemented the real time data analytics using Spark Streaming, Kafka and Flume.
Designed reports and dashboards using Tableau, Power BI, and TIBCO Spotfire for end users to make business decisions based on the visualizations.
Excellent knowledge of Python libraries such as pandas and NumPy to efficiently manipulate and analyse large datasets and perform numerical operations on data.
Worked with Python libraries such as Seaborn and Plotly to create a wide range of plots, including bar charts, line graphs, and scatter plots.
Extensive experience in IT data analytics projects, with hands-on experience migrating on-premises ETL workloads to Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, and Composer.
Experience in GitHub and Bitbucket version control tools.
Expertise in software development, project management, and Agile.
Excellent knowledge of machine learning, deep learning, mathematical modelling, and operations research. Comfortable with R, Python, SQL, SAS, SPSS, Tableau, Power BI, PowerApps, MATLAB, and relational databases. Deep understanding of and exposure to the Big Data ecosystem.
Experience in preparing project documentation and tracking tasks and reporting regularly on the status of projects to all project stakeholders.
Efficient team player with client-facing experience and proficient communication and analytical skills.
TECHNICAL SKILL SET
Big Data Ecosystem: Impala, HDFS, Kafka, HBase, Storm, Sqoop, Oozie, Hadoop MapReduce, Flume, Pig, Spark, Hive
Hadoop Distributions: Amazon EMR (EMR, EC2, EBS, RDS, S3, Athena, Glue, Elasticsearch, Lambda, DynamoDB, Redshift, ECS, QuickSight), Azure HDInsight (Databricks, Data Lake, Blob Storage, Data Factory ADF, SQL DB, SQL DWH, Cosmos DB, Azure AD), Cloudera CDP, Apache Hadoop 2.x/1.x, Hortonworks HDP
Programming Languages: HiveQL, R, UNIX Shell Scripting, PL/SQL, SQL, Scala, Java, SAS, Python
Machine Learning: Linear Regression, Logistic Regression, Decision Tree, Random Forest, SVM, XGBoost, GBM, CatBoost, Naïve Bayes, PCA, LDA, K-Means, KNN, Neural Network
Deep Learning: Recurrent Neural Networks (RNN), LSTM, Convolutional Neural Networks (CNN), GRUs, GANs
Databases: Teradata, Snowflake, DB2, MS SQL Server, Oracle, PostgreSQL, MySQL
NoSQL Databases: DynamoDB, Cosmos DB, HBase, MongoDB, Cassandra
Version Control & SCM: Git, SVN, Bitbucket, GitHub, Azure Repos
ETL/BI: Power BI, SSIS, Informatica, SSRS, QlikView, SSAS, Tableau, Erwin, Arcadia
Operating Systems: Windows 7/8/10, UNIX, Linux, Ubuntu
DevOps Tools: Jenkins, Azure DevOps, Docker, Terraform, CloudFormation
Methodologies: Jira, Confluence, Waterfall Model, Agile, System Development Life Cycle (SDLC), RAD, JAD, UML
PROFESSIONAL WORK EXPERIENCE:
Safeway, Pleasanton, CA April 2022 - present
Role: Sr Data Engineer
Responsibilities:
Built and implemented relational servers and databases on the Azure cloud based on analysis of the organization's current and future needs.
Planned and delivered data migrations from on-premises systems and applications to the Azure cloud, developing pipelines to advance the current solutions.
Implemented Azure Data Factory, Spark SQL, and U-SQL with Azure Data Lake Storage Gen2 (ADLS) and Azure Data Lake Analytics to extract, transform, and load data from source systems into Azure data storage services.
Integrated the data from on-premises SQL servers to cloud databases (Azure Synapse Analytics & Azure SQL DB).
Created Spark jobs using Azure Databricks to replace the existing ETL/SSIS solution and provided data warehouse solutions on Azure Synapse using PolyBase/external tables.
Performed data analysis using HiveQL, HBase, and custom MapReduce programs, and imported and exported data between RDBMS, HDFS, and Hive.
Built REST APIs in Java to link the web application to Cassandra for data consumption.
Performed various database transformations to cleanse the data and ensure its quality, and provided production support, monitoring ETL jobs to ensure they ran smoothly.
Implemented the cloud-based Matillion ETL tool to create fact and dimensional models in MS SQL Server and Snowflake Database.
Ensured that assigned systems were patched, set up, and optimized by working on performance tuning.
Created applications for real-time processing using Spark Streaming and utilized the Spark components like Spark SQL, MLlib, and Spark Streaming.
Applied data warehousing principles, including data normalization, OLTP and OLAP systems, and physical and logical data models, with extensive expertise in star and snowflake schemas.
Worked in several UNIX environments using cron, FTP, and UNIX shell scripting.
Built streaming applications in Azure Databricks notebooks using Spark Streaming and Kafka for cluster processing and data ingestion (a sketch of this pattern follows this section).
Automated the entire data pipeline by creating and scheduling Oozie workflows, and put automation scripts in place for Hadoop ETL tasks.
Applied analytics and big data expertise to support downstream feature extraction and classification by data scientists.
Implemented the on-premises migration, using ARM templates, Azure DevOps, Azure CLI, and App Services to connect .NET apps, the DevOps platform, and Azure CI/CD processes.
Environment: Azure Data Factory, Azure Data Lake, Azure Databricks, Azure Logic App, Azure ARM, Azure SQL, Shell Scripting, PySpark, Spark 3.0, Spark SQL, Terraform.
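Illustrative example: a minimal PySpark Structured Streaming sketch of the Kafka-to-Azure-Data-Lake ingestion pattern referenced above, as it might run in a Databricks notebook. The broker, topic, schema, and storage paths are hypothetical placeholders, not values from the actual project.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

    spark = SparkSession.builder.appName("kafka-to-adls-stream").getOrCreate()

    # Hypothetical schema for the incoming JSON messages.
    event_schema = StructType([
        StructField("store_id", StringType()),
        StructField("sku", StringType()),
        StructField("amount", DoubleType()),
        StructField("event_time", TimestampType()),
    ])

    # Read the raw stream from Kafka (placeholder broker and topic).
    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")
        .option("subscribe", "sales_events")
        .option("startingOffsets", "latest")
        .load()
    )

    # Parse the JSON payload and apply a basic data-quality filter.
    parsed = (
        raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
        .select("e.*")
        .filter(col("amount") > 0)
    )

    # Write the curated stream to ADLS Gen2 in Delta format (placeholder paths).
    query = (
        parsed.writeStream.format("delta")
        .option("checkpointLocation", "abfss://checkpoints@account.dfs.core.windows.net/sales")
        .outputMode("append")
        .start("abfss://curated@account.dfs.core.windows.net/sales_events")
    )
    query.awaitTermination()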
AT&T, Dallas, TX October 2020 - March 2022
Sr. AWS Data Engineer
Responsibilities:
Designed and implemented AWS data pipelines using a variety of AWS resources, including API Gateway, AWS Lambda, DynamoDB, and AWS S3; Lambda functions retrieved data from the Snowflake database and converted the responses into JSON format for API Gateway (a sketch of this pattern follows this section).
Developed and built ETL processes in AWS Glue to import campaign data from external sources, including S3 objects in ORC, Parquet, and text formats, into AWS Redshift.
Participated in the evaluation and setup of bare-metal systems and Cassandra databases on AWS for non-production environments.
Processed data with Cloudera beyond merely gathering and storing it, and processed and stored smaller datasets using AWS services such as EC2 and S3.
Created and built Spark jobs in Scala to pull data from an AWS S3 bucket and transform it using Snowflake.
Implemented a one-time data migration of multi-state-level data from SQL Server to Snowflake using Python and SnowSQL.
Implemented AWS Lambda, AWS Glue, and Step Functions and worked on ingesting data and doing data transformations and cleaning.
Utilized the in-memory computing features of Spark and Scala to carry out complex operations such as text analytics and processing.
Ingested data from many sources using Apache Flume into sinks such as Avro and HDFS, and imported data from relational database management systems (RDBMS) into HDFS and Hive using Sqoop.
Installed and configured Pig, created Pig scripts to convert data from text files to Avro format, and built Docker images to run Airflow in a local environment to test the ingestion and ETL pipelines.
Built/Maintained Docker container clusters managed by Kubernetes and Utilized Kubernetes and Docker for the runtime environment of the CI/CD system to build, test and deploy.
Created Airflow DAGs to schedule the Ingestions, ETL jobs and various business reports. Worked on Supporting the Production Environment and debugging the issues using Splunk logs.
Developed ETL pipelines in and out of the data warehouse as part of day-to-day responsibilities, and developed major regulatory and financial reports using advanced SQL queries in Snowflake.
Optimized Snowflake loading with Matillion, which connects to the relevant data sources and simplifies building no-code data pipelines into Snowflake; of the five Matillion activities, used the extract job to pull data from Kronos via the Kronos API or a custom script and stage it in a format that could be easily loaded into Snowflake, such as CSV or JSON.
Staged API and Kafka data (in JSON format) into Snowflake by FLATTENING it for different functional services.
Imported real time weblogs using Kafka as a messaging system and ingested the data to Spark Streaming. Implemented data quality checks using Spark Streaming and arranged bad and passable flags on the data.
Developed business logic using Kafka & Spark Streaming and implemented business transformations. Supported Continuous storage in AWS using Elastic Block Storage, S3, Glacier.
Created volumes and configured snapshots for EC2 instances, wrote PL/SQL scripts to extract data from the operational database into simple flat text files using the UTL_FILE package, and developed Spark code using Scala.
Environment: AWS, MapReduce, Snowflake, Pig, Spark, Scala, Airflow, Kafka, Python, JSON, Parquet, and CSV.
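Illustrative example: a minimal sketch of an AWS Lambda handler that queries Snowflake and returns JSON to API Gateway, as described above. It assumes the snowflake-connector-python package; the account variables, warehouse, database, and table names are hypothetical placeholders.

    import json
    import os

    import snowflake.connector

    def lambda_handler(event, context):
        # Connection details would normally come from environment variables or Secrets Manager.
        conn = snowflake.connector.connect(
            account=os.environ["SNOWFLAKE_ACCOUNT"],
            user=os.environ["SNOWFLAKE_USER"],
            password=os.environ["SNOWFLAKE_PASSWORD"],
            warehouse="ANALYTICS_WH",   # placeholder warehouse
            database="CAMPAIGN_DB",     # placeholder database
            schema="PUBLIC",
        )
        try:
            params = event.get("queryStringParameters") or {}
            campaign_id = params.get("campaign_id")
            cur = conn.cursor(snowflake.connector.DictCursor)
            cur.execute(
                "SELECT campaign_id, channel, spend, clicks "
                "FROM campaign_summary WHERE campaign_id = %s",  # placeholder table
                (campaign_id,),
            )
            rows = cur.fetchall()
        finally:
            conn.close()

        # Response shape expected by API Gateway's Lambda proxy integration.
        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(rows, default=str),
        }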
State of WA, Seattle, WA December 2019 – September 2020
Big Data Engineer
Responsibilities:
Handled the installation, configuration, and usage of the various components of the Hadoop ecosystem.
Implemented Google Cloud Platform's BigQuery and Cloud Dataproc services to build and develop production data engineering solutions that deliver the required pipeline patterns (a sketch of this pattern follows this section).
Worked with the tools of the Hadoop ecosystem, including HDFS, MapReduce, Hive, Pig, YARN, Spark, HBase, Oozie, Sqoop, and Zookeeper.
Normalized databases such as Oracle, SQL, PostgreSQL, and Cassandra in compliance with design principles and best practices.
Used Talend Data Integration and Informatica PowerCenter as ETL technologies for developing data warehouses, BI, analytics, ETL procedures, data mining, data mapping, data conversion, and data migration.
Built reports and dashboards in Power BI to understand the data and support business decisions while working as a downstream data analyst.
Developed Hadoop MapReduce jobs, Spark jobs, and Kafka producers and consumers.
Integrated every step of the process, from gathering data from MySQL and pushing it to the Hadoop Distributed File System to running Pig/Hive and MapReduce processes using Oozie.
Developed and deployed unique Hadoop apps while working in an AWS environment.
Applied structural changes using MapReduce and Hive, and analyzed data using data visualization and reporting tools.
Gathered customer requirements and estimated development time for complex queries against the Hive database for a logistics application.
Created a data pipeline utilizing Flume, Pig, and Sqoop to input customer histories and cargo data into HDFS for analysis.
Configured the Hive metastore with MySQL, which holds the metadata for Hive tables, and imported data from MySQL to HDFS and vice versa using Sqoop.
Designed the data refresh strategy document and the capacity planning documents necessary for project development and support, and developed Oozie processes and mainframe task scheduling.
Constructed workflows from multiple Oozie actions, including the Sqoop, Pig, Hive, and Shell actions.
Developed multiple open-source projects and prototypes for numerous applications using cutting-edge Big Data tools, including major Hadoop distributions such as Hortonworks and Cloudera.
Configured the Fair Scheduler on the JobTracker to distribute cluster resources across MapReduce jobs.
Developed and built a complete end-to-end data warehouse infrastructure on Amazon Redshift from scratch for processing enormous volumes of data, including millions of records each day.
Environment: Big Data tools, Hadoop, Hive, Sqoop, HBase, Pig, Oozie, MySQL, Cloudera Distribution of Hadoop (CDH), HDFS
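Illustrative example: a minimal Python sketch of the BigQuery pattern referenced above, loading files landed in Cloud Storage into a BigQuery table and running a query with the google-cloud-bigquery client library. The project, dataset, table, and bucket names are hypothetical placeholders.

    from google.cloud import bigquery

    client = bigquery.Client(project="my-gcp-project")  # placeholder project

    # Load Parquet files landed in GCS into a BigQuery table.
    load_job = client.load_table_from_uri(
        "gs://landing-bucket/shipments/*.parquet",      # placeholder bucket/path
        "my-gcp-project.logistics.shipments",           # placeholder table
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.PARQUET,
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        ),
    )
    load_job.result()  # wait for the load to finish

    # Aggregate the loaded data for a downstream report.
    query = """
        SELECT origin_state, COUNT(*) AS shipment_count
        FROM `my-gcp-project.logistics.shipments`
        GROUP BY origin_state
        ORDER BY shipment_count DESC
    """
    for row in client.query(query).result():
        print(row.origin_state, row.shipment_count)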
JPMC, NYC, NY March 2018 - November 2019
Data Analyst
Responsibilities:
Examined data at the cluster level across several databases during data transformation and data loading.
Built SQL scripts to identify data discrepancies as part of data migration, and worked on migrating historical data from Teradata SQL to Snowflake.
Deployed SQL Server Integration Services (SSIS) to build automated data pipelines to transport the data more effectively, and used SQL Server Management Studio (SSMS) to construct database connections and write scripts to extract and convert the data.
Defined and controlled the policies for S3 buckets, which were used by AWS as backup and storage.
Integrated with TIBCO Spotfire and performed computations along with predictive modeling using Python scripts.
Evaluated promotional activities from a number of angles to maximize the return on investment for clients including click-through rates, conversion rates (CON), seasonal trends, search queries, quality scores (QS), competitors, distribution channels, etc.
Monitored click events on a blended home screen, such as the click-through rate, exchange rate, bounce-back rate, and list views, which provide useful data for finding relevant optimizations.
Comprehended client requirements and developed a testing approach based on firm principles.
Built a defect management plan to track problems and assign tasks using JIRA.
Created process flows to show and clarify project requirements.
Coordinated across teams such as QA, development, and automation, and frequently used Quality Center to design and run test cases.
Involved in performance testing, integration testing, unit testing, and validation; tested Hadoop MapReduce jobs written in Python, Pig, and Hive.
Used Spark SQL to facilitate faster data testing and processing (a validation sketch follows this section).
Wrote custom SQL queries to check the parameters for the daily, weekly, and monthly jobs.
Heavily involved in daily and monthly job scheduling with pre- and post-conditions according to necessity.
Environment: Apache Spark, Sqoop, AWS S3, GitHub, Service Now, Hadoop, MapReduce, AWS, Snowflake, Jira, Teradata, SQL Server.
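Illustrative example: a minimal PySpark sketch of the kind of Spark SQL validation checks referenced above (row-count reconciliation and null checks between a source extract and a migrated table). The paths, view names, and columns are hypothetical placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("migration-validation").getOrCreate()

    # Register both sides of the comparison as temporary views (placeholder paths).
    spark.read.parquet("s3a://migration-staging/teradata_extract/accounts") \
        .createOrReplaceTempView("src_accounts")
    spark.read.parquet("s3a://migration-staging/snowflake_load/accounts") \
        .createOrReplaceTempView("tgt_accounts")

    # Row-count reconciliation between source and target.
    src_rows = spark.sql("SELECT COUNT(*) AS n FROM src_accounts").collect()[0].n
    tgt_rows = spark.sql("SELECT COUNT(*) AS n FROM tgt_accounts").collect()[0].n
    assert src_rows == tgt_rows, f"Row counts differ: {src_rows} vs {tgt_rows}"

    # Null checks on key columns in the migrated table.
    bad_rows = spark.sql("""
        SELECT COUNT(*) AS n
        FROM tgt_accounts
        WHERE account_id IS NULL OR open_date IS NULL
    """).collect()[0].n
    assert bad_rows == 0, f"{bad_rows} rows have null key columns"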
TCS, Hyderabad, India July 2014 – October 2017
PL/SQL DEVELOPER
Responsibilities:
Developed complicated stored procedures, functions, and views using dynamic SQL (an invocation sketch follows this section).
Used SQL Profiler, DMVs, and execution plans for performance tuning.
Designed tables, primary and foreign keys, indexes, and stored procedures for databases.
Developed SQL and PL/SQL scripts to create, install, and drop relational objects, including tables, views, primary keys, indexes, constraints, packages, sequences, grants, and synonyms.
Created, maintained, and modified PL/SQL packages, complex database triggers, and data migration scripts; mentored team members in writing complicated SQL statements; performed data modeling; and tuned SQL statements using hints for maximum efficiency and performance.
Modified the database in accordance with new demands by adding fields and tables to the current database.
Established and managed linked servers, as well as the ability to move data between SQL Server and other heterogeneous relational databases like Oracle and DB2.
Designed and maintained both clustered and non-clustered tables to maintain SQL Server performance.
Developed and implemented various procedures that required sophisticated join statements, such as outer joins and self-joins.
Created shell scripts for daily backups and task automation, and wrote business process documentation using Designer.
Developed, debugged, and modified stored procedures, triggers, tables, views, and user-defined functions.
Built indexes to improve database performance and speed up client information retrieval.
Supported the front-end applications need for flexibility by developing dynamic SQL.
Environment: SQL Developer, Informatica, Oracle10g, PL/SQL Developer, and UNIX
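Illustrative example: a minimal Python sketch (using the python-oracledb driver, an assumption) of how a stored procedure like the ones described above might be invoked and checked from a test script. The connection details, package, procedure name, and parameters are hypothetical placeholders.

    import datetime

    import oracledb

    # Placeholder credentials and DSN; real values would come from configuration.
    conn = oracledb.connect(user="app_user", password="app_pwd", dsn="dbhost:1521/ORCLPDB1")

    with conn.cursor() as cur:
        status = cur.var(str)  # OUT parameter capturing the status message
        cutoff = datetime.date(2014, 1, 1)
        # Hypothetical procedure: archives orders older than the cutoff date.
        cur.callproc("pkg_orders.archive_old_orders", [cutoff, status])
        print("procedure status:", status.getvalue())

    conn.commit()
    conn.close()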
Education and Certifications:
Bachelor’s in Computer Science