Big Data Processing

Location: Southfield, MI
Posted: April 18, 2024

Resume:

Praveen Rondi

ad435k@r.postjobfree.com

+1-260-***-****

Summary:

8+ years of professional IT experience in developing, implementing, and configuring Hadoop and Big Data technologies.

Well versed in Big Data workloads on GCP services such as VMs, Cloud Storage, Pub/Sub, and Dataflow.

Exposure to Spark, Spark Streaming, Spark MLlib, and Scala, including creating and handling DataFrames in Spark with Scala.

Designed and implemented data processing pipelines using GCP services such as Cloud Dataflow and Apache Beam to ingest, transform, and analyze large volumes of data.

Developed and optimized ETL processes to extract data from various sources, including databases, APIs, and streaming platforms, and load it into BigQuery for analysis.

Hands-on cloud data engineering experience with Hadoop and Big Data ecosystem components such as HDFS, Hive, Spark, BigQuery, Databricks, Kafka, and YARN on AWS cloud services and cloud relational databases.

Experience in extracting, transforming, and loading data from various sources into data warehouses, as well as collecting, aggregating, and moving data using Apache Flume, Kafka, Power BI, and Microsoft SSIS.

Collaborated with business analysts and stakeholders to gather requirements and translate them into technical specifications.

Designed and implemented a scalable data warehouse solution using Teradata, reducing query response times by 30% and improving overall data accessibility.

Provided technical leadership and guidance to junior developers on best practices for Informatica development.

Participated in code reviews and testing activities to ensure the quality and reliability of ETL solutions.

3+ years of experience writing Python ETL frameworks and PySpark jobs to process huge amounts of data daily.

Developed Python code to gather data from HBase (Cornerstone) and designed the solution for implementation using PySpark.

Hands-on experience streaming data using Kafka and the Spark Streaming API in Scala.

Good exposure to Spark SQL, Spark Streaming, and the core Spark API for building data pipelines.

Hands-on experience setting up workflows using Apache Airflow and the Oozie workflow engine for managing and scheduling Hadoop jobs.

Proficient with the Apache Spark ecosystem, including Spark Core and Spark Streaming, using Scala and Python.

Developed Spark code for AWS Glue jobs and for EMR.

Worked with Business Intelligence tools such as Business Objects and data visualization tools such as Tableau.

Hands-on experience with Spark SQL queries and DataFrames: importing data from data sources, performing transformations and read/write operations, and saving results to output directories in HDFS.
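
A minimal PySpark sketch of the read, transform, and write-to-HDFS pattern described above; the paths, column names, and the transactions dataset are illustrative assumptions, not details from an actual project.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hdfs-etl-example").getOrCreate()

# Import data from a source directory on HDFS
df = spark.read.option("header", "true").csv("hdfs:///data/raw/transactions/")

# Perform transformations: filter, derive a date column, aggregate
daily_totals = (
    df.filter(F.col("status") == "COMPLETED")
      .withColumn("txn_date", F.to_date("txn_timestamp"))
      .groupBy("txn_date")
      .agg(F.sum("amount").alias("total_amount"))
)

# Save the results to an output directory in HDFS as Parquet
daily_totals.write.mode("overwrite").parquet("hdfs:///data/curated/daily_totals/")
```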

Involved in converting Hive/SQL queries into Spark transformations using Spark Data Frames and Python.

Hands-on experience developing Spark applications using RDD transformations, Spark Core, Spark MLlib, Spark Streaming, and Spark SQL.

Strong experience and knowledge of real-time data analytics using Spark, Kafka, and Flume.

Hands-on experience capturing data from existing relational databases (Oracle, MySQL, SQL Server, and Teradata) that provide SQL interfaces, using Sqoop.

Good understanding of cloud-based technologies such as GCP, AWS, and Azure.

Hands-on experience writing MapReduce programs in Java to handle different data sets using map and reduce tasks.

Wrote PySpark jobs in AWS Glue to merge data from multiple tables and used crawlers to populate the AWS Glue Data Catalog with metadata table definitions.
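
A hedged sketch of such a Glue PySpark job, assuming the Data Catalog tables (here a hypothetical sales_db database with orders and customers tables) have already been populated by a crawler; database, table, and bucket names are placeholders.

```python
import sys
from pyspark.context import SparkContext
from awsglue.transforms import Join
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the crawled tables from the Glue Data Catalog
orders = glueContext.create_dynamic_frame.from_catalog(database="sales_db", table_name="orders")
customers = glueContext.create_dynamic_frame.from_catalog(database="sales_db", table_name="customers")

# Merge the two tables on the shared key
merged = Join.apply(orders, customers, "customer_id", "customer_id")

# Write the merged result back to S3 as Parquet
glueContext.write_dynamic_frame.from_options(
    frame=merged,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/orders_customers/"},
    format="parquet",
)
job.commit()
```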

Experience analyzing SQL scripts and designing solutions for implementation in Spark.

Worked on Implementing and optimizing Hadoop/MapReduce algorithms for Big Data analytics.

Adept in Agile/Scrum methodology and familiar with the SDLC from requirements analysis and system study through design, testing, debugging, documentation, and implementation.

Education:

Master’s in Business Analytics – 2023

Bachelor’s in Mathematics - 2014

Work Experience:

State Farm Insurance, Bloomington Sep 2022 - Present

Role: Senior Data Engineer

Responsibilities:

Wrote ETL jobs using Spark data pipelines to process data from different sources and transform it for multiple targets.

Created a test automation framework in Python.

Experienced in writing Spark applications in Scala and Python.

Designed and implemented data processing pipelines using GCP services such as Cloud Dataflow and Apache Beam to ingest, transform, and analyze large volumes of data.
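
A simplified Apache Beam pipeline of the kind described above, runnable on Dataflow; the project, bucket, BigQuery table, and parsing logic are assumptions for illustration only.

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_event(line):
    """Parse one JSON line into a flat dict matching the BigQuery schema."""
    record = json.loads(line)
    return {"user_id": record.get("user_id"), "event": record.get("event"), "ts": record.get("ts")}

options = PipelineOptions(
    runner="DataflowRunner",          # use DirectRunner for local testing
    project="example-gcp-project",
    region="us-central1",
    temp_location="gs://example-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromGCS" >> beam.io.ReadFromText("gs://example-bucket/raw/events-*.json")
        | "ParseJson" >> beam.Map(parse_event)
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "example-gcp-project:analytics.events",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,  # table assumed to exist
        )
    )
```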

Developed and optimized ETL processes to extract data from various sources, including databases, APIs, and streaming platforms, and load it into BigQuery for analysis.

Implemented real-time data streaming solutions using GCP Pub/Sub and Dataflow for continuous data ingestion and processing.
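
The Pub/Sub side of such a flow can be sketched as below with the google-cloud-pubsub client; the project, subscription, and message handling are placeholder assumptions rather than project specifics.

```python
import json
from concurrent.futures import TimeoutError
from google.cloud import pubsub_v1

project_id = "example-gcp-project"
subscription_id = "events-sub"

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, subscription_id)

def callback(message):
    # Decode the event and hand it to the downstream pipeline (e.g. Dataflow or a staging table)
    payload = json.loads(message.data.decode("utf-8"))
    print(f"received event: {payload}")
    message.ack()

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
with subscriber:
    try:
        streaming_pull_future.result(timeout=60)   # listen for one minute in this example
    except TimeoutError:
        streaming_pull_future.cancel()
        streaming_pull_future.result()
```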

Collaborated with data scientists to deploy machine learning models on GCP using TensorFlow and integrated them into data pipelines for predictive analytics.

Designed and maintained data warehouses on GCP, optimizing performance and scalability for analytical queries.

Developed and optimized ETL processes to load and transform data from various sources into Teradata, ensuring data quality and consistency.

Collaborated with business stakeholders to identify data analytics requirements and deliver actionable insights using Teradata's advanced analytics capabilities.

Created interactive dashboards and reports using Google Data Studio to visualize insights and facilitate data-driven decision-making.

Collaborated with business analysts and stakeholders to gather requirements and translate them into technical specifications.

Created a framework for data profiling.

Created a framework for data encryption.

Designed “Data Services” to intermediate data exchange between the Data Clearinghouse and the Data Hubs.

Conducted version control and deployment of SSIS packages using Visual Studio Team Services (VSTS) or other version control systems.

Involved in the ETL phase of the project; designed and analyzed data in Oracle and migrated it to Redshift and Hive.

Used Spark to process live streaming data from Apache Kafka.

Develop Scala Source Code to process heavy RAW JSON data.

Proficient in database development using PostgreSQL, SQL Server, Oracle, and MySQL, with a strong emphasis on Postgres.

Designed and implemented database schemas, tables, indexes, and views in Postgres to support application requirements and ensure efficient data storage and retrieval.

Experienced in data visualization tools such as Tableau for creating interactive dashboards and reports.

Involved in client meetings, explaining the views and supporting requirements gathering.

Worked in an Agile methodology, understanding the requirements of user stories.

Prepared high-level design documentation for approval.

Provided 24x7 on-call support.

Environment: Spark, Spark SQL, Python, Cloud, GCP, AWS, Glue, HDFS, Hive, Apache Kafka, Sqoop, Scala, Shell scripting, Linux, Terraform, MySQL, Oracle Enterprise DB, Jenkins, Git, Cassandra, and Agile methodologies

Wipro Ltd (Truist Financials, India) Mar 2020 - Jan 2022

Role: Senior Data Engineer

Responsibilities:

Experienced in writing Spark Applications in Scala and Python.

Created ETL Framework using spark on AWS EMR in Scala/Python.

Wrote scripts and an indexing strategy for a migration to Confidential Redshift from SQL Server and MySQL databases.

Analyzed and cleansed raw data using Spark, SQL, and PySpark.

Developed a POC for project migration from an on-premises Hadoop MapR system to GCP.

Built ETL data pipelines on Hadoop/Teradata using Pig, Hive, and UDFs.

Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.

Used JSON schema to define table and column mapping from S3 data into Redshift.
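
A hedged illustration of that mapping: a JSONPaths file pairs JSON attributes with Redshift columns, and a COPY command (issued here from Python via psycopg2) loads the S3 data; the cluster endpoint, bucket, IAM role, and table are placeholders.

```python
import psycopg2

# The referenced JSONPaths file lists one JSONPath expression per target column,
# e.g. ["$.user_id", "$.page", "$.viewed_at"] for (user_id, page, viewed_at).
copy_sql = """
    COPY analytics.page_views
    FROM 's3://example-bucket/raw/page_views/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS JSON 's3://example-bucket/config/page_views_jsonpaths.json'
    TIMEFORMAT 'auto';
"""

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="etl_user",
    password="***",
)
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)   # the connection context manager commits on success
conn.close()
```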

Involved in migrating tables from RDBMS into Hive tables using SQOOP and later generated data visualizations using Tableau.

Design and Develop ETL Process in AWS Glue to migrate Campaign data from external sources.

Constructed AWS data pipelines using VPC, EC2, S3, Auto Scaling groups, EBS, Snowflake, IAM, CloudFormation, Route 53, and CloudWatch.

Worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially done using python (PySpark).

Generated a script in AWS Glue to transfer the data and utilized AWS Glue to run ETL jobs and run aggregation on PySpark code.

Created ingestion framework using Kafka, EMR, Aurora, Cassandra in Python/Scala.

Created Airflow scheduling scripts in Python.
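
A simplified Airflow DAG (Airflow 2.x style) of the kind such scripts define; the schedule, task names, and callables are illustrative assumptions.

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from the source system")

def transform():
    print("run the Spark/PySpark transformation step")

def load():
    print("load curated data into the warehouse")

default_args = {"owner": "data-eng", "retries": 2, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="daily_etl_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 2 * * *",   # run daily at 02:00
    catchup=False,
    default_args=default_args,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load
```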

Analyzed user needs, interacted with various SORs to understand their incoming data structures, and ran POCs with the best possible processing frameworks on the big data platform.

Documented the results with various tools and technologies which can be implemented accordingly based on the business use case.

Experience in Developing Spark applications using Spark - SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats for analyzing & transforming the data to uncover insights into the customer usage patterns.

Very capable at using AWS utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop/Spark jobs.

Developed Cloud Functions in Python to process JSON files from the source and load them into BigQuery.
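
A minimal sketch of such a function: a background Cloud Function triggered by new objects in a bucket that loads newline-delimited JSON into BigQuery; the dataset and table names are assumptions.

```python
from google.cloud import bigquery

def load_json_to_bigquery(event, context):
    """Triggered by google.storage.object.finalize on the landing bucket."""
    client = bigquery.Client()
    uri = f"gs://{event['bucket']}/{event['name']}"

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    load_job = client.load_table_from_uri(uri, "analytics.raw_events", job_config=job_config)
    load_job.result()   # wait for the load job to finish
    print(f"loaded {uri} into analytics.raw_events")
```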

Worked with Spark to improve performance and optimize existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.

Developed UNIX shell scripts to load a large number of files into HDFS from the Linux file system.

Designed AWS CloudFormation templates to create VPC, subnets, and NAT to ensure successful deployment of web applications and database templates.

Experience in creating, dropping, and altering tables at runtime without blocking updates and queries, using HBase and Hive.

Assisted users in creating/modifying worksheets and data visualization dashboards in Tableau.

Encoded and decoded JSON objects using Spark to create and modify DataFrames in Apache Spark.

Involved in converting Hive/SQL queries into Spark transformations using Spark RDD's.

Created S3 buckets, managed bucket policies, and utilized S3 and Glacier for storage and backup on AWS.

Created Spark clusters and configured high-concurrency clusters using Databricks to speed up the preparation of high-quality data.

Optimized Spark jobs to run on a Kubernetes cluster for faster data processing.

Analyzed SQL scripts and designed the solution for implementation using Spark.

Installed and configured Apache Airflow to work with S3 buckets and the Snowflake data warehouse.

Developed a capability to implement audit logging at required stages while applying business logic.

Used the AWS Glue Data Catalog with crawlers to pull data from S3 and perform SQL query operations.

Optimized PySpark jobs to run on a Kubernetes cluster for faster data processing.

Expert in building AWS notebook functions using Python, Scala, and Spark.

Environment: Spark, Spark SQL, GCP, Python, Cloud, AWS, Glue, HDFS, Hive, Apache Kafka, Sqoop, Scala, Shell scripting, Linux, MySQL, Oracle Enterprise DB, Jenkins, Git, Oozie, SOAP, Cassandra, and Agile methodologies.

Wipro Ltd (Sears Holding Corporation, India) Dec 2018 - Feb 2020

Role: Data Engineer

Responsibilities:

Work in a fast-paced agile development environment to quickly analyze, develop, and test potential use cases for the business.

Developed real time data processing applications by using Scala and Python and implemented Apache Spark Streaming from various streaming sources like Kafka and JMS.

Converted data into different formats per user requirements through streaming data pipelines from various sources, including Snowflake and unstructured data.

Developed Python code to gather data from HBase (Cornerstone) and designed the solution for implementation using PySpark.

Experience in building and architecting multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP.

Developed a framework for converting existing PowerCenter mappings to PySpark jobs.

Used Spark Streaming APIs to perform on-the-fly transformations and actions, building a common learner data model that gets data from Kafka in near real time and persists it to Cassandra.
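
A hedged Structured Streaming sketch of that Kafka-to-Cassandra flow (the original work used the DStream API); the topic, schema, keyspace, and table are placeholders, and the spark-sql-kafka and DataStax Spark Cassandra connector packages are assumed to be on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = (SparkSession.builder
         .appName("kafka-to-cassandra")
         .config("spark.cassandra.connection.host", "cassandra-host")
         .getOrCreate())

schema = StructType([
    StructField("learner_id", StringType()),
    StructField("course_id", StringType()),
    StructField("score", DoubleType()),
])

# Read learner events from Kafka in near real time
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "learner-events")
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

def write_to_cassandra(batch_df, batch_id):
    # Persist each micro-batch via the Spark Cassandra connector
    (batch_df.write
     .format("org.apache.spark.sql.cassandra")
     .options(keyspace="learning", table="learner_events")
     .mode("append")
     .save())

query = events.writeStream.foreachBatch(write_to_cassandra).start()
query.awaitTermination()
```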

Migrated on-premises database structures to the Confidential Redshift data warehouse.

Created Databricks notebooks using SQL and Python and automated the notebooks using jobs.

Performed data extraction, aggregation, and consolidation of Adobe data within AWS Glue using PySpark.

Worked on Amazon S3 for persisting transformed Spark DataFrames in S3 buckets and used S3 as a data lake for data pipelines running on Spark and MapReduce.

Loaded all datasets from source CSV files into Hive and Cassandra using Spark/PySpark.

Developed Kafka consumer APIs in Scala for consuming data from Kafka topics.

Consumed XML messages using Kafka and processed the XML using Spark Streaming to capture UI updates.

Developed a preprocessing job using Spark DataFrames to flatten JSON documents into flat files.
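
An illustrative PySpark version of such a flattening step; the nested field names are hypothetical and stand in for whatever the source documents actually contain.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("flatten-json").getOrCreate()

raw = spark.read.json("hdfs:///data/raw/documents/")

# Explode the nested array and promote nested struct fields to top-level columns
flat = (raw
        .withColumn("item", F.explode("order.items"))
        .select(
            F.col("order.id").alias("order_id"),
            F.col("customer.name").alias("customer_name"),
            F.col("item.sku").alias("sku"),
            F.col("item.qty").alias("qty"),
        ))

# Write the flattened records out as a delimited flat file
flat.write.mode("overwrite").option("header", "true").csv("hdfs:///data/curated/orders_flat/")
```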

Loaded DStream data into Spark RDDs and performed in-memory computation to generate output responses.

Experienced in writing real-time processing and core jobs using Spark Streaming with Kafka as a data pipeline system.

Advanced knowledge of Confidential Redshift and MPP database concepts.

Built AWS CI/CD data pipelines and an AWS data lake using EC2, AWS Glue, and AWS Lambda.

Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.

Leveraged cloud and GPU computing technologies for automated machine learning and analytics pipelines, such as AWS.

Completed data extraction, aggregation, and analysis in HDFS using PySpark and stored the required data in Hive.

Worked on importing and exporting data from Snowflake, Oracle, and DB2 into HDFS and Hive using Sqoop for analysis, visualization, and report generation.

Implemented Elasticsearch on the Hive data warehouse platform.

Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.

Used the Spark DataStax Cassandra Connector to load data to and from Cassandra.

Experienced in creating data models for client data sets and analyzing data from Cassandra tables for quick searching, sorting, and grouping using the Cassandra Query Language (CQL).

Tested cluster performance using the cassandra-stress tool to measure and improve read/write throughput.

Used HiveQL to analyze partitioned and bucketed data and executed Hive queries on Parquet tables stored in Hive to meet business requirements.

Used Kafka capabilities such as distribution, partitioning, and the replicated commit log service for messaging systems, maintaining feeds.

Used Apache Kafka to aggregate web log data from multiple servers and make them available in downstream systems for analysis.

Developed Autosys jobs for scheduling.

Participated in development/implementation of Cloudera Hadoop environment.

Developed Sqoop and Kafka Jobs to load data from RDBMS, External Systems into HDFS and HIVE.

Used Jira for bug tracking and Bitbucket to check in and check out code changes.

Created a test automation framework using Scala.

Created a utility method to flatten JSON events to a granular level using Scala and Spark.

Loaded XML, JSON, CSV, and Parquet files using PySpark and Spark/Scala jobs.
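
A short sketch of loading those formats with PySpark; the paths are placeholders, and the XML read assumes the external spark-xml data source (e.g. com.databricks:spark-xml) is available on the cluster.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multi-format-loads").getOrCreate()

csv_df = spark.read.option("header", "true").option("inferSchema", "true").csv("hdfs:///landing/csv/")
json_df = spark.read.json("hdfs:///landing/json/")
parquet_df = spark.read.parquet("hdfs:///landing/parquet/")

# XML is not built in; this relies on the spark-xml package being installed
xml_df = (spark.read.format("xml")
          .option("rowTag", "record")
          .load("hdfs:///landing/xml/"))

for name, df in [("csv", csv_df), ("json", json_df), ("parquet", parquet_df), ("xml", xml_df)]:
    print(name, df.count())
```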

Reviewed code developed by the offshore team and validated the test results.

Worked on reading and writing multiple data formats like JSON, ORC, Parquet on HDFS using Spark.

Responsible for generating actionable insights from complex data to drive real business results for various application teams and worked in Agile Methodology projects extensively.

Provided guidance to the development team working on PySpark as the ETL platform.

Worked with SCRUM team in delivering agreed user stories on time for every Sprint.

Environment: Spark, Scala, Python, Cloud, GCP, AWS, Glue, HDFS, Hive, Apache Kafka, Sqoop, Shell scripting, Linux, MySQL, Oracle Enterprise DB, Jenkins, Git, Oozie, SOAP, NiFi

Wipro Ltd (HCL Infosystems, India) Apr 2016 - Oct 2018

Role: Data Engineer

Responsibilities:

Experience in Big Data Analytics and design in Hadoop ecosystem using MapReduce Programming, Spark, Hive, Pig, Sqoop, HBase.

Responsible for designing logical and physical data models for various data sources on Confidential Redshift.

Optimized Hive queries using best practices and the right parameters, and technologies such as Hadoop, YARN, and PySpark.

Involved in converting Hive/SQL queries into Spark transformations using Spark RDD, Scala, and Python.

Developed Spark jobs using Scala and Python on top of Yarn/MRv2 for interactive and Batch Analysis.

Very capable at using AWS utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop/Spark jobs.

Performed Hive tuning techniques such as partitioning, bucketing, and memory optimization.
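
A hedged example of that tuning pattern, expressed here through spark.sql with Hive support enabled; the table, columns, and bucket count are assumptions, and loading Hive-bucketed tables from Spark has version-specific caveats, so loads may instead run in Hive itself.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-tuning")
         .enableHiveSupport()
         .getOrCreate())

# Partitioned, bucketed table definition (HiveQL DDL)
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_partitioned (
        order_id STRING,
        customer_id STRING,
        amount DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    CLUSTERED BY (customer_id) INTO 32 BUCKETS
    STORED AS ORC
""")

# A query that benefits from partition pruning: only the requested date partitions are scanned
recent = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_amount
    FROM sales_partitioned
    WHERE order_date >= '2018-01-01'
    GROUP BY customer_id
""")
recent.show()
```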

Worked with different file formats such as Parquet, ORC, JSON, and text files.

Experience in developing scalable & secure data pipelines for large datasets.

Partnered with ETL developers to ensure that data was well cleaned and the data warehouse was kept up to date for reporting purposes using Pig.

Supported data quality management by implementing proper data quality checks in data pipelines.

Worked on the data support team, handling bug fixes, schedule changes, memory tuning, schema changes, and loading of historic data.

Enhancing Data Ingestion Framework by creating more robust and secure data pipelines.

Analyzed SQL scripts and redesigned them using PySpark SQL for faster performance.

Implemented data streaming capability using Kafka and Talend for multiple data sources.

Worked with multiple storage formats (Avro, Parquet) and databases (Hive, Impala, Kudu).

Used Spark SQL to load data and created schema RDDs on top of it that load into Hive tables, handling structured data with Spark SQL.

Involved in the development of agile, iterative, and proven data modeling patterns that provide flexibility.

Performed data validation with record-wise counts between the source and destination.

Worked with SCRUM team in delivering agreed user stories on time for every Sprint.

Worked on analyzing and resolving the production job failures in several scenarios.

Performed data extraction, aggregation, and consolidation of Adobe data within AWS Glue using PySpark.

Worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially done using Python (Spark).

Environment: Hadoop, MapReduce, Cloud, HDFS, Hive, Cassandra, Sqoop, Oozie, SQL, Kafka, Spark, PySpark, Scala, Python, GitHub, Talend, Big Data Integration, Impala.


