Venkata S Parupudi
Data Engineer
Professional Summary:
•3+ years of IT experience across a variety of industries working on Big Data technologies, including the Cloudera and Hortonworks distributions. Hadoop working environment includes Hadoop, Spark, MapReduce, Kafka, Hive, Ambari, Sqoop, HBase, and Impala.
•Good working experience with Spark (Spark Streaming, Spark SQL) using Scala and Kafka; worked on reading multiple data formats from HDFS using Scala.
•Extensive knowledge of writing Hadoop jobs for data analysis per business requirements using Hive; wrote HiveQL queries for data extraction and join operations, developed custom UDFs as required, and have good experience optimizing Hive queries.
•Experience importing and exporting data between HDFS and relational database systems using Sqoop and loading it into partitioned Hive tables.
•Experience in cloud data migration using AWS and Snowflake.
•Developed custom Kafka producers and consumers for publishing to and subscribing from different Kafka topics.
•Good understanding of distributed systems, HDFS architecture, and the internal workings of the MapReduce and Spark processing frameworks.
•Excellent programming skills with experience in Java, C, SQL, and Python.
•Experience in data modeling, data warehousing, and ETL design and development using the Ralph Kimball methodology with Star/Snowflake schema designs, covering analysis and definition, database design, testing, and implementation.
•Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.
•Experience using Kafka and Kafka brokers with Spark to process live streaming data (see the sketch at the end of this list).
•Hands-on experience writing MapReduce programs in Java to handle different data sets using Map and Reduce tasks.
•Experience in Microsoft Azure/Cloud Services like SQL Data Warehouse, Azure SQL Server, Azure Databricks, Azure Data Lake, Azure Blob Storage, Azure Data Factory.
•Working experience with the Amazon Web Services (AWS) Cloud Platform, including services such as EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, and Security Groups.
•Good understanding and knowledge of NoSQL databases such as MongoDB, HBase, and Cassandra, as well as PostgreSQL.
•Strong experience in core Java, Scala, SQL, PL/SQL and Restful web services.
•Experience developing custom UDFs for Pig and Hive to incorporate Python/Java methods and functionality into Pig Latin and HiveQL, and used UDFs from the Piggybank UDF repository.
•Worked in various programming languages using IDEs such as Eclipse, NetBeans, and IntelliJ, along with tools like PuTTY and Git.
•Experience with ETL concepts using Informatica PowerCenter and Ab Initio.
•Experience extracting data from MongoDB through Sqoop, placing it in HDFS, and processing it.
•Excellent understanding and knowledge of job workflow scheduling and coordination tools/services such as Oozie and ZooKeeper.
•Good experience in database design, creating Tables, Views, Stored Procedures, Functions, Triggers and Indexes.
•Hands-on experience with the data ingestion tools Kafka and Flume and the workflow management tool Oozie.
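Below is a minimal Scala sketch of the Kafka-to-Spark streaming pattern referenced above, using Spark Structured Streaming. The broker address, topic name, and HDFS paths are illustrative placeholders, not details from any specific engagement.

import org.apache.spark.sql.SparkSession

object KafkaStreamToHdfs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-stream-to-hdfs")
      .getOrCreate()

    // Subscribe to a Kafka topic; broker and topic are placeholders.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "events-topic")
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    // Land raw records in HDFS as Parquet, with checkpointing for recovery.
    events.writeStream
      .format("parquet")
      .option("path", "hdfs:///data/raw/events")
      .option("checkpointLocation", "hdfs:///checkpoints/events")
      .start()
      .awaitTermination()
  }
}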
Technical Skills:
•Programming Languages: Python, SQL, PL/SQL, Scala, and UNIX shell scripting.
•Big Data Tools: Hadoop, HDFS, MapReduce, Spark, Airflow, NiFi, HBase, Hive, Pig, Sqoop, Kafka, Oozie, Zookeeper
•Methodologies: RAD, JAD, System Development Life Cycle (SDLC), Agile
•Cloud Platform: AWS (Amazon Web Services), Microsoft Azure
•Cloud Management: Amazon Web Services (AWS) - EC2, EMR, S3, Redshift, Lambda, Athena
•Data Modeling Tools: Erwin Data Modeler, ER Studio v17
•OLAP Tools: Tableau, SSAS, Business Objects, and Crystal Reports 9
•Databases: Oracle 12c/11g, Teradata R15/R14.
•ETL/Data warehouse Tools: Informatica 9.6/9.1, and Tableau.
•Operating System: Windows, Unix, Sun Solaris
Project Experience:
Client: TeknXpert, Charlotte, NC Jan 2024 – Present
Role: Data Engineer
Responsibilities:
•Responsible for the design and development of high-performance data architectures that support data warehousing, real-time ETL, and batch big-data processing.
•Analyzed SQL scripts and designed solutions for implementation in PySpark.
•Exported tables from Teradata to HDFS using Sqoop and built tables in Hive.
•Loaded and transformed large sets of structured, semi-structured, and unstructured data using Hadoop/Big Data concepts.
•Used Spark SQL to load JSON data, create schema RDDs, and load them into Hive tables, and handled structured data using Spark SQL (a sketch follows this project's environment line).
•Developed Spark code using Scala and Spark SQL/Streaming for faster data processing.
•Developed RDDs/DataFrames in Spark and applied several transformations to load data from Hadoop data lakes.
•Developed Spark programs in Python, applying functional-programming principles to process complex structured data sets.
•Worked with Hadoop infrastructure to store data in HDFS and used Spark/Hive SQL to migrate the underlying SQL codebase to AWS.
•Used Python in Spark to extract data from Snowflake and upload it to Salesforce on a daily basis.
•Worked with the Hadoop ecosystem; implemented Spark using Scala and utilized the DataFrame and Spark SQL APIs for faster data processing.
•Utilized Apache Spark with Python to develop and execute big data analytics and machine learning applications; executed machine learning use cases with Spark ML and MLlib.
•Developed a Spark Streaming job to consume data from Kafka topics of different source systems and push the data into HDFS locations.
•Converted Hive/SQL queries into Spark transformations using Spark RDDs and PySpark.
•Filtered and cleaned data using Scala code and SQL queries.
•Troubleshot errors in the HBase shell/API, Pig, Hive, and MapReduce.
•Installed and configured a multi-node cluster in the cloud on Amazon Web Services (AWS) EC2.
•Designed and developed the architecture for a data services ecosystem spanning relational, NoSQL, and big data technologies. Extracted large volumes of data from Amazon Redshift, AWS, and Elasticsearch using SQL queries to create reports.
•Designed and developed a real-time stream processing application using Spark, Kafka, Scala, and Hive to perform streaming ETL and apply machine learning.
•Used Python to write an event-based service on AWS Lambda that delivers real-time data to One-Lake (a data lake solution in the Cap-One enterprise).
•Used Talend for Big Data Integration using Spark and Hadoop.
•Responsible for analyzing large data sets and deriving customer usage patterns by developing new MapReduce programs in Java.
•Performed structural modifications using MapReduce and Hive and analyzed data using visualization/reporting tools (Tableau).
•Designed a Kafka producer client using Confluent Kafka and produced events into Kafka topics.
Environment: Hadoop, Spark, Scala, HBase, Hive, Python, PL/SQL, AWS, EC2, S3, Lambda, Auto Scaling, CloudWatch, CloudFormation, IBM InfoSphere DataStage, MapReduce, Oracle 12c, flat files, TOAD, MS SQL Server, XML files, Cassandra, MongoDB, Kafka, MS Access, Autosys, UNIX, Erwin.
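A minimal Scala sketch of the Spark SQL pattern described above (loading JSON into a DataFrame and writing to a partitioned Hive table). The paths, database, and column names are illustrative assumptions, not the client's actual schema.

import org.apache.spark.sql.SparkSession

object JsonToHive {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("json-to-hive")
      .enableHiveSupport()  // needed to write managed Hive tables
      .getOrCreate()

    // Load JSON from HDFS; Spark infers the schema from the records.
    val raw = spark.read.json("hdfs:///data/landing/events.json")

    // Register a temp view so existing HiveQL logic can be reused as Spark SQL.
    raw.createOrReplaceTempView("raw_events")
    val cleaned = spark.sql(
      """SELECT id, event_type, event_date
        |FROM raw_events
        |WHERE id IS NOT NULL""".stripMargin)

    // Append into a Hive table partitioned by date.
    cleaned.write
      .mode("append")
      .partitionBy("event_date")
      .saveAsTable("analytics.events")
  }
}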
Client: MetLife, Hyderabad, India Jan 2021 – July 2022
Role: Data Engineer
Responsibilities:
•Worked with Apache Solr for indexing and querying.
•Created custom Solr query segments to optimize search matching.
•Stored time-series data transformed by the Spark engine, built on top of a Hive platform, in Amazon S3 and Redshift.
•Facilitated deployment of a multi-cluster environment using AWS EC2 and EMR, and deployed Docker containers for cross-functional deployments.
•Collected and aggregated large amounts of web log data from different sources such as web servers and mobile and network devices using Apache Flume, and stored the data in HDFS for analysis.
•Involved in developing Spark SQL queries and DataFrames, importing data from data sources, performing transformations and read/write operations, and saving results to an output directory in HDFS (a sketch follows this project).
•Implemented the workflows using Apache Oozie framework to automate tasks.
•Designed and implemented Incremental Imports into Hive tables.
•Ingested real-time and near-real-time (NRT) streaming data into HDFS using Flume.
•Involved in collecting, aggregating, and moving data from servers to HDFS using Flume.
•Imported and exported data between relational data sources such as DB2, SQL Server, and Teradata and HDFS using Sqoop.
Environment: Hadoop, Cloudera, Flume, HBase, HDFS, MapReduce, AWS, YARN, Hive, Pig, Sqoop, Oozie, Tableau, Java, Solr.
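A minimal Scala sketch of the Spark SQL read-transform-write flow described above. The input path, column names, and output location are hypothetical placeholders.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DailyAggregates {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("daily-aggregates")
      .getOrCreate()

    // Read source data from HDFS; path and columns are placeholders.
    val logs = spark.read
      .option("header", "true")
      .csv("hdfs:///data/weblogs/")

    // Example transformation: count requests per host per day.
    val daily = logs
      .groupBy(col("host"), col("date"))
      .agg(count("*").as("requests"))

    // Save the results to an output directory in HDFS as Parquet.
    daily.write.mode("overwrite").parquet("hdfs:///data/output/daily_requests")
  }
}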