Prashanth Patel
Data Engineer
Email: *****************@*****.*** Phone: 512-***-****
PROFESSIONAL SUMMARY
•Over 7 years of professional experience in Data Engineering, Data Analytics, Business Intelligence, and software development using open-source, Microsoft Azure, and Amazon Web Services (AWS) technologies.
•Data modelling, ingestion, processing, ETL, storage, data-driven quantitative analysis, data integration, and resource utilization in the Big Data ecosystem.
•Experience in project development, implementation, deployment, and maintenance using Java/J2EE, Hadoop, and Spark-related technologies on Cloudera, Hortonworks, Amazon EMR, and Azure HDInsight.
•Experienced in data architecture, including data ingestion pipeline design, Hadoop information architecture, data modelling, data mining, machine learning, and advanced data processing.
•Strong experience in business and data analysis, data profiling, data migration, data integration, data governance, metadata management, master data management, and configuration management.
•Working knowledge of Azure cloud components (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, Storage Explorer, SQL DB, SQL DWH, Cosmos DB).
•Experience working with AWS services (EMR, EC2, RDS, EBS, S3, Lambda, Glue, Elasticsearch, Kinesis, SQS, DynamoDB, Redshift, API Gateway, Athena, ECS).
•Experienced in developing data pipelines and datasets in Azure Data Factory for ETL processes across Azure SQL, Blob Storage, and Azure SQL Data Warehouse.
•Familiar with stream processing on Apache Kafka, Apache Flink, and Spark Streaming.
•Migrated ETL code from Talend to Informatica; involved in development, testing, and post-production support for the entire migration project.
•Expert in using Talend troubleshooting and DataStage to diagnose job errors, and in using the map/expression editor to evaluate complex expressions and inspect transformed data to resolve mapping issues.
•Worked on ETL migration services by developing and deploying AWS Lambda functions for a serverless data pipeline that writes to the Glue Data Catalog and can be queried from Athena.
•Experience in using Talend features such as context variables, triggers, and connectors for databases and flat files.
•Hands-on experience with Cloudera Hadoop, Hortonworks Hadoop, various ETL tools, Cassandra, and various Confidential IaaS/PaaS services.
•Excellent in handling Big Data ecosystems such as Apache Hadoop, MapReduce, Spark, HDFS architecture, Cassandra, HBase, Sqoop, Hive, Pig, MLlib, and ELT.
•Expertise in writing Spark RDD transformations, actions, DataFrames, and case classes for the required input data, and in performing data transformations using Spark Core and Spark SQL.
•Used Flink's pipelined streaming engine to process data streams and to deploy new APIs, including the definition of flexible windows.
•Strong experience and knowledge of real time data analytics using Spark Streaming, Kafka, and Flume.
•Expertise in building PySpark and Spark-Scala applications for analysis, batch processing, and stream processing.
•Developed high-performance batch processing applications on Apache Hive, Spark, Impala, Sqoop, HDFS, and Oozie.
•Experience in development and implementation of a Big Data management platform using Hadoop 2.x, HDFS, MapReduce/YARN/Spark, Hive, Pig, Oozie, Apache NiFi, and Sqoop.
•Good understanding of Big Data Hadoop and YARN architecture along with various Hadoop daemons such as Job Tracker, Task Tracker, NameNode, DataNode, Resource/Cluster Manager, and Kafka.
•Experience in integrating Apache Storm with Kafka to perform web analytics and uploading clickstream data from Kafka to HDFS, HBase, and Hive.
•Experience in managing Ansible playbooks with Ansible roles; used the file module in Ansible playbooks to copy and remove files on remote systems, and wrote YAML scripts to automate software updates and verify functionality.
•Extensive experience with Docker and Kubernetes on multiple cloud providers; helped developers build and containerize their application CI/CD pipelines to deploy them on the cloud, and used kOps for managing Kubernetes.
•Hands-on experience with ARM templates and Azure DevOps CI/CD pipelines.
•Accessed and ingested Hive tables into Apache Kudu for data warehousing after performing record joins using the Kudu context in Apache Spark with Oozie.
•Experience in migrating Hive and MapReduce jobs to EMR and Qubole, automating the workflows using Airflow.
•Implemented CI/CD pipelines using Azure DevOps (VSTS, TFS) in both cloud and on-premises environments with Git, MSBuild, Docker, and Maven, along with Jenkins plugins.
•Experienced in writing Spark scripts in both Python and Scala for development and data analysis.
•Sound knowledge and hands-on experience with MapReduce, Ansible, Presto, Amazon Kinesis, Storm, Flink, StreamSets, Star Schema, Snowflake Schema, ER modelling, and Talend.
•Experience in analysing data from multiple sources and creating reports with interactive dashboards using Power BI, Tableau, Arcadia, and Matplotlib.
•Experience in job workflow scheduling and monitoring tools like Oozie and Zookeeper.
•Implemented Kerberos, Azure AD, Sentry, and Ranger authentication for client/server applications using secret-key cryptography.
•Experience in importing and exporting data between HDFS and databases such as MS SQL Server, Oracle, Cassandra, Teradata, and PostgreSQL using Sqoop and Talend.
•Extensive experience in developing Bash scripts, T-SQL, PL/SQL, and UNIX shell commands.
•Experience in Developing ETL workflows using Informatica PowerCenter 9.X/8.X and IDQ. Worked extensively with Informatica client tools- Designer, Repository Manager, Workflow Manager, and Workflow Monitor.
•Well versed with the design and development of presentation layer for web applications using technologies like HTML, CSS, jQuery, and JavaScript.
•Leverages DevOps techniques and has experience with DevOps tools like GitHub, Jira, DocuSign, and Box within an Agile methodology.
•Experience with ticketing & Project Management tools like JIRA, Azure DevOps, Bugzilla and ServiceNow.
•Experience in software methodologies like Agile and Waterfall.
•Excellent communication and interpersonal skills.
TECHNICAL SKILLS
•Big Data Ecosystem: HDFS, MapReduce, YARN, Spark, Kafka, Flink, Airflow, Hive, Impala, StreamSets, Sqoop, HBase, Flume, Pig, Ambari, Oozie, Zookeeper, NiFi, Sentry, Ranger.
•Hadoop Distributions: Apache Hadoop 2.x/1.x, Cloudera CDP, Hortonworks HDP, AWS (EMR, EC2, EBS, RDS, S3, Athena, Glue, Elasticsearch, Lambda, DynamoDB, Redshift, ECS, QuickSight), Azure (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory ADF, SQL DB, SQL DWH, Cosmos DB, Azure AD).
•Programming Languages: Python, Scala, Java, Shell Scripting, Pig Latin, HiveQL.
•NoSQL Databases: MongoDB 3.x, HBase 0.98, Apache Cassandra, Redis.
•Databases: Snowflake, AWS RDS, Teradata, Oracle 9i/10g, MySQL 5.5/5.6/8.0, Azure SQL Database, PostgreSQL.
•Version Control: Git, SVN, Bitbucket
•ETL/BI: Snowflake, Informatica, SSIS, SSRS, SSAS, Tableau, Arcadia.
•Reporting & Visualization: Tableau 9.x/10.x, Matplotlib, Power BI.
•Web Development: JavaScript, Node.js, HTML, CSS, Spring, J2EE, JDBC, Okta, Postman, Angular, JFrog, Mockito, Flask, Hibernate, Maven, Tomcat, WebSphere.
•Operating Systems: Linux (Ubuntu, CentOS, RedHat), Windows (XP/7/8/10), macOS.
•Others: StreamSets, Terraform, Docker, Kubernetes, Jenkins, Ansible, Splunk, Jira.
PROFESSIONAL EXPERIENCE
Client: AT&T Oct 2021 – Present
Dallas, TX
Data Engineer
Responsibilities:
Responsible for the execution of big data analytics, predictive analytics, and machine learning initiatives.
Implemented a proof of concept deploying the product in an AWS S3 bucket and Snowflake.
Utilized AWS services, with a focus on big data architecture, analytics, enterprise data warehousing, and business intelligence solutions, to ensure optimal architecture, scalability, flexibility, availability, and performance, and to provide meaningful and valuable information for better decision-making.
Developed Scala scripts and UDFs using both DataFrames/SQL and RDDs in Spark for data aggregation, queries, and writing back into the S3 bucket.
Wrote, compiled, and executed programs as necessary using Apache Spark in Scala to perform ETL jobs with ingested data.
Involved in the development of real-time streaming applications using PySpark, Apache Flink, Kafka, and Hive on a distributed Hadoop cluster.
Experience in data cleansing and data mining.
Loaded real time data from various data sources into HDFS using Kafka.
Developed a common Flink module for serializing and deserializing AVRO data by applying schemas.
Created Talend jobs to copy the files from one server to another and utilized Talend FTP components.
Designed and implemented ETL for data loads from heterogeneous sources to SQL Server and Oracle target databases, including facts and slowly changing dimensions (SCD Type 1 and SCD Type 2).
Used Talend for big data integration with Spark and Hadoop.
Developed a queryable state for Flink in Scala to query streaming data and enriched the functionalities of the framework.
Involved in maintaining Informatica workflows and Domains.
Used Spark Streaming to divide streaming data into batches as an input to Spark engine for batch processing.
Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation and used Spark engine, Spark SQL for data analysis and provided to the data scientists for further analysis.
Expertise in Transact-SQL (DDL, DML) and in design and normalization.
Prepared scripts to automate the ingestion process using Python and Scala as needed, across various sources such as APIs, AWS S3, Teradata, and Snowflake.
Validated data from SQL Server against Snowflake to ensure an apples-to-apples match.
Built solutions once for all, with no band-aid approaches.
Designed and developed Spark workflows using Scala to pull data from AWS S3 buckets and Snowflake and apply transformations to it.
Created ETL Mapping with Talend Integration Suite to pull data from Source, apply transformations, and load data into target database.
Designed and developed Flink pipelines to consume streaming data from Kafka, apply business logic, and transform and serialize raw data.
Worked on joblets (reusable code) and Java routines in Talend.
Implemented Spark RDD transformations to map business analysis and applied actions on top of the transformations.
Automated resulting scripts and workflow using Apache Airflow and shell scripting to ensure daily execution in production.
Created Python scripts to read CSV, JSON, and Parquet files from S3 buckets and load them into AWS S3, DynamoDB, and Snowflake (see the PySpark sketch after this section).
Implemented AWS Lambda functions to run scripts in response to events in an Amazon DynamoDB table or S3 bucket, or to HTTP requests through Amazon API Gateway.
Migrated data from AWS S3 buckets to Snowflake by writing a custom read/write Snowflake utility function in Scala.
Worked on Snowflake schemas and data warehousing, and processed batch and streaming data load pipelines using Snowpipe and Matillion from the Confidential data lake in an AWS S3 bucket.
Profiled structured, unstructured, and semi-structured data across various sources to identify patterns, and implemented data quality metrics using the necessary queries or Python scripts based on the source.
Installed and configured Apache Airflow for the S3 bucket and Snowflake data warehouse, and created DAGs to run in Airflow.
Created DAGs using the Email operator, Bash operator, and Spark Livy operator to execute jobs on an EC2 instance (see the Airflow sketch after this section).
Deployed the code to EMR via CI/CD using Jenkins.
Extensively used Code Cloud for code check-ins and checkouts for version control.
Environment: AWS EC2, S3, Athena, Lambda, Glue, Elasticsearch, RDS, DynamoDB, RedShift, ECS, Hadoop 2.x, HBase, Cassandra, Oracle, Teradata, MS SQL, Agile, Unix, Quicksight.
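Illustrative sketch for the S3-to-Snowflake loading described above: a minimal PySpark job using assumed placeholder bucket, table, and connection values (the production utilities also had Scala variants), not the exact production code.

    # Minimal PySpark sketch: read Parquet from S3 and load into Snowflake.
    # Bucket, table, and connection values are placeholders, not production settings.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("s3-to-snowflake-load")   # assumes the Snowflake Spark connector
             .getOrCreate())                    # and hadoop-aws jars are on the classpath

    df = spark.read.parquet("s3a://example-bucket/landing/orders/")  # hypothetical path

    sf_options = {                              # placeholder Snowflake connection options
        "sfURL": "example_account.snowflakecomputing.com",
        "sfUser": "etl_user",
        "sfPassword": "********",
        "sfDatabase": "ANALYTICS",
        "sfSchema": "STAGING",
        "sfWarehouse": "LOAD_WH",
    }

    (df.write
       .format("net.snowflake.spark.snowflake") # Snowflake Spark connector source name
       .options(**sf_options)
       .option("dbtable", "ORDERS_STG")         # hypothetical target table
       .mode("append")
       .save())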
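Illustrative sketch for the Airflow DAGs described above: a minimal DAG with a Bash task and an email notification, using hypothetical names and paths (the production DAGs also used the Spark Livy operator).

    # Minimal Airflow DAG sketch (Airflow 1.x-style imports); names, schedule, paths,
    # and addresses are placeholders.
    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator
    from airflow.operators.email_operator import EmailOperator

    default_args = {"owner": "data-eng", "retries": 1, "retry_delay": timedelta(minutes=5)}

    with DAG(dag_id="daily_s3_snowflake_load",        # hypothetical DAG name
             default_args=default_args,
             start_date=datetime(2021, 10, 1),
             schedule_interval="@daily",
             catchup=False) as dag:

        run_ingest = BashOperator(
            task_id="run_ingest",
            bash_command="spark-submit /opt/jobs/s3_to_snowflake.py",  # placeholder job path
        )

        notify = EmailOperator(
            task_id="notify_success",
            to="data-team@example.com",               # placeholder address
            subject="Daily S3-to-Snowflake load finished",
            html_content="The daily load completed successfully.",
        )

        run_ingest >> notify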
Client: Bank of America Jan 2019 to July 2021
Plano, TX
Role: Data Analytics Engineer II
Responsibilities:
Analyze, design, and build modern data solutions using Azure PaaS services to support visualization of data. Understand the current production state of the application and determine the impact of new implementations on existing business processes.
Extract, transform, and load data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingest data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and process the data in Azure Databricks.
Implemented proofs of concept for SOAP and REST APIs.
Used REST APIs to retrieve analytics data from different data feeds.
Collaborate with data architects/engineers to establish data governance and cataloging for MDM and security (Key Vault, network security, schema-level and row-level), resource groups, integration runtime settings, integration patterns, and aggregated functions for Databricks development.
Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back in the reverse direction.
Developed complex Talend ETL jobs to migrate data from flat files to databases.
Developed Spark applications using PySpark and Spark-SQL for data Extraction, transformation, and aggregation from multiple file formats for analyzing & transforming the data to uncover insights into the customer usage patterns.
Expertise in Creating, Debugging, Scheduling and Monitoring jobs using Airflow and Oozie.
Created Databricks notebooks using Python (PySpark), Scala, and Spark SQL to transform data stored in Azure Data Lake Storage Gen2 from the Raw zone to the Stage and Curated zones (see the sketch after this section).
Built numerous technology demonstrators using the Confidential Edison Arduino shield with Azure Event Hubs and Stream Analytics, integrated with Power BI and Azure ML, to demonstrate the capabilities of Azure Stream Analytics.
Responsible for estimating the cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
Experienced in performance tuning of Spark Applications for setting right Batch Interval time, correct level of Parallelism and memory tuning.
Developed JSON Scripts for deploying the Pipeline in Azure Data Factory (ADF) that processes the data using the SQL Activity.
Hands-on experience on developing SQL Scripts for automation purposes.
Created Build and Release for multiple projects (modules) in production environment using Visual Studio Team Services.
Created custom RBAC roles for management groups, subscriptions, and resource groups using PowerShell and ARM templates in the Azure DevOps pipeline.
Loaded data from web servers using Flume and Spark Streaming API. Used Flume sink to write directly to indexes deployed on cluster, allowing indexing during ingestion.
Developed Python code for the different tasks, dependencies, SLA watcher, and time sensor of each job for workflow management and automation using Airflow.
Monitor systems life cycle deliverables and activities to ensure that procedures and methodologies are followed, and that appropriate complete documentation is captured.
Technologies: Azure (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, SQL DB, SQL DWH, AD), Python, Hadoop 2.x (HDFS, MapReduce, Yarn), Hive, Sqoop, Agile, Jira, Jenkins, Docker.
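Illustrative sketch for the Databricks notebook work described above: a minimal PySpark transform that moves data from a Raw zone to a Curated zone in ADLS Gen2, with hypothetical storage account, container, path, and column names.

    # Minimal PySpark sketch of a Raw-to-Curated transform on ADLS Gen2.
    # Storage account, containers, paths, and columns are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # In a Databricks notebook, `spark` is already provided; created here for completeness.
    spark = SparkSession.builder.appName("raw-to-curated").getOrCreate()

    raw_path = "abfss://raw@exampleaccount.dfs.core.windows.net/sales/"          # hypothetical
    curated_path = "abfss://curated@exampleaccount.dfs.core.windows.net/sales/"  # hypothetical

    raw_df = spark.read.json(raw_path)

    curated_df = (raw_df
                  .dropDuplicates(["order_id"])                        # placeholder key column
                  .withColumn("order_ts", F.to_timestamp("order_ts"))  # normalize types
                  .withColumn("load_date", F.current_date()))          # audit column

    (curated_df.write
     .mode("overwrite")
     .partitionBy("load_date")
     .parquet(curated_path))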
Client: Freedom Financial Network March 2016 to Dec 2018
Tempe, Arizona
Role: AWS Data Engineer
Responsibilities:
Experienced in using distributed computing architectures such as AWS (EC2, Redshift, EMR, Elasticsearch), Hadoop, Spark, and Python, and in the effective use of MapReduce, SQL, and Cassandra to solve big data problems.
Worked with Spark to improve performance and optimize the existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, RDDs, and Spark on YARN.
Developed and implemented various Spark jobs using AWS EMR to perform big data operations in AWS.
Migrated the AWS EMR-based data lake and data ingestion systems onto the Snowflake cloud database.
Installed the application on AWS EC2 instances and configured the storage on S3 buckets.
Utilized Spark's in-memory capabilities to handle large datasets stored in the S3 data lake.
Worked on data ingestion through cleansing and transformations, leveraging AWS Lambda, AWS Glue, and Step Functions, and loaded the data into S3 buckets.
Developed and executed a migration strategy to move Data Warehouse from an Oracle platform to AWS Redshift.
Worked with engineering team to plan and deploy new Hadoop environments and expand existing Hadoop clusters.
Loaded data from web servers using Flume and Spark Streaming API.
Have experience with processing frameworks such as Spark and Spark SQL.
Worked on Spark to improve performance and optimize the existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and Pair RDDs.
Created, dropped, and altered tables at run time without blocking updates and queries, using Spark and Hive.
Converted Hive/SQL queries into Spark transformations using Spark RDD, Scala and Python.
Created Hive tables and involved in data loading and writing Hive UDFs. Developed Hive UDFs for rating aggregation.
Created DataFrames and performed analysis using Spark SQL.
Automated the Informatica jobs using UNIX shell scripting.
Developed Impala queries for faster querying and performed data transformations on Hive tables.
Created Hive external and managed tables.
The cluster's deployment mode was achieved through the YARN scheduler, and its size is auto-scalable.
Worked on querying data using Spark SQL on top of PySpark engine jobs to perform data cleansing, validation, and transformations, and executed the programs using the Python API.
Used various Spark transformations and actions for cleansing the input data.
Used Apache Kafka for importing real-time network log data into HDFS; integrated Kafka with Spark for real-time data processing and deployed on the YARN cluster (see the streaming sketch after this section).
Coordinated the data pipeline using Kafka and Spark Streaming with the feed from the API Gateway REST service.
Involved in CI/CD process using Jenkins and GIT.
Developed various Python scripts to find vulnerabilities with SQL Queries by doing SQL injection, checks & analysis.
Created and maintained Teradata tables, views, macros, triggers, and stored procedures.
Performed interactive analytics such as cleansing, validation, and checks on data stored in S3 buckets using AWS Athena (see the Athena sketch after this section).
Involved in migrating tables from RDBMS into Hive using Sqoop and later generated data visualizations using Tableau.
Experience with analytical reporting and facilitating data for Quicksight and Tableau dashboards.
Used Git for version control and Jira for project management and to track issues and bugs.
Technologies: AWS EC2, S3, Athena, Lambda, Glue, Elasticsearch, RDS, DynamoDB, Redshift, ECS, Hadoop 2.x (HDFS, MapReduce, Yarn), Hive v2.3.1, Spark v2.1.3, Python, SQL, Sqoop v1.4.6, Kafka v2.1.0, Airflow v1.9.0, HBase, Cassandra, Oracle, Teradata, MS SQL Server, Agile, Unix, Informatica, Talend, Tableau
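Illustrative streaming sketch for the Kafka-to-HDFS ingestion described above: a minimal Spark Structured Streaming job with hypothetical broker, topic, and output paths; it assumes the spark-sql-kafka package is on the classpath.

    # Minimal Spark Structured Streaming sketch: Kafka -> HDFS as Parquet.
    # Broker, topic, and paths are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-logs-to-hdfs").getOrCreate()

    logs = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical broker
            .option("subscribe", "network-logs")                # hypothetical topic
            .option("startingOffsets", "latest")
            .load()
            .selectExpr("CAST(value AS STRING) AS log_line", "timestamp AS event_time"))

    query = (logs.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/network_logs/")      # hypothetical HDFS path
             .option("checkpointLocation", "hdfs:///checkpoints/network_logs/")
             .outputMode("append")
             .start())

    query.awaitTermination()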
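Illustrative Athena sketch for the S3 data checks described above: a minimal boto3 query runner with hypothetical region, database, query, and result-bucket names.

    # Minimal boto3 sketch: run an Athena validation query and poll until it finishes.
    # Region, database, query, and result bucket are placeholders.
    import time
    import boto3

    athena = boto3.client("athena", region_name="us-east-1")  # hypothetical region

    resp = athena.start_query_execution(
        QueryString="SELECT COUNT(*) FROM raw_events WHERE event_id IS NULL",  # sample check
        QueryExecutionContext={"Database": "data_lake"},                       # placeholder DB
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )
    query_id = resp["QueryExecutionId"]

    while True:  # simple polling loop
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    if state == "SUCCEEDED":
        rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
        print(rows)  # first row is the header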