Data Engineer Software Development

Location: Hoffman, NJ 08831
Posted: December 13, 2023

PAVAN KUMAR KANDUKURI

ad1xo5@r.postjobfree.com

+1-630-****-***

Senior Big Data Engineer

Overall Summary:

•8+ years of strong experience in the Software Development Life Cycle (SDLC), including requirements analysis, design specification, and testing, in DevOps, Scrum, Waterfall, and Agile environments.

•Strong interpersonal skills and the ability to work both independently and in a group; quick to learn and adapt to new working environments.

•Excellent programming skills with experience in Java, SQL, and Python.

•Evaluated innovative technologies and best practices for the team.

•Applied knowledge of modern software delivery methods such as TDD, BDD, and CI/CD, along with Infrastructure as Code (IaC).

•Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Scala (see the PySpark sketch at the end of this summary).

•Experience with AWS serverless services such as Fargate, SNS, SQS, and Lambda.

•Experience in data modeling tools such as Erwin and Toad Data Modeler.

•Experience using Kafka and Kafka brokers to initiate the Spark context and process live streaming data.

•Utilized Python, including Pandas, NumPy, PySpark, and Impala, for data manipulation and analysis, resulting in streamlined processes and data-driven recommendations.

•Developed interactive dashboards and reports using Power BI, Tableau, and Qlik Sense to visualize complex datasets and communicate insights to stakeholders effectively.

•Conducted data analysis and extraction on Hadoop, contributing to improved data accessibility and accuracy.

•Collaborated with cross-functional teams to implement data integration solutions, leveraging tools like Apache NiFi, Azure Data Factory, and Pentaho.

•Worked with Enterprise Business Intelligence Platform and Data Platform to support data-driven decision-making at an organizational level.

•Conducted hands-on data modeling, programming, and querying, handling large volumes of granular data to deliver custom reporting solutions.

•Applied data mining and machine learning algorithms to identify patterns and trends, providing valuable insights to drive business strategies.

•Assisted in collecting, standardizing, and summarizing data while identifying inconsistencies and suggesting data quality improvements.

•Utilized strong SQL skills for data retrieval and analysis.

•Hands-on experience in developing web applications, RESTful web services, and APIs using Python, Flask, and Django. Experienced in working on big data integration and analytics based on Hadoop, PySpark, and NoSQL databases like HBase and MongoDB.

•Working experience with the Amazon Web Services (AWS) cloud platform, including services such as EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, Security Groups, EC2 Container Service (ECS), CodeCommit, CodePipeline, CodeBuild, CodeDeploy, Redshift, CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS, SES, Fargate, Glue, and Athena.

•Extensive knowledge in writing Hadoop jobs for data analysis per business requirements using Hive; worked on HiveQL queries for required data extraction and join operations, wrote custom UDFs as required, and have good experience optimizing Hive queries.

•Experience importing and exporting data between HDFS and relational database systems using Sqoop, and loading it into partitioned Hive tables.

•Experience in Microsoft Azure/cloud services such as SQL Data Warehouse, Azure SQL Server, Azure Databricks, Azure Data Lake, Azure Blob Storage, Azure Data Factory, Azure Cosmos DB, SQL Server, SSIS, and SSRS.

•Strong experience in core Java, Scala, SQL, PL/SQL and Restful web services.

•Extensive knowledge of RDBMS platforms such as Oracle, Microsoft SQL Server, and MySQL.

•Extensive experience working on various databases and database script development using SQL and PL/SQL.

•Hands-on experience in writing MapReduce programs using Java to handle different data sets using Map and Reduce tasks.

•Developed custom Kafka producers and consumers for publishing to and subscribing to Kafka topics.

•Good understanding of distributed systems, HDFS architecture, Internal working details of MapReduce and Spark processing frameworks.

•Good experience in Python software development (libraries used: Beautiful Soup, NumPy, SciPy, matplotlib, python-twitter, Pandas DataFrames, network, urllib2, and MySQLdb for database connectivity).

•Knowledge of ETL methods for data extraction, transformation and loading in corporate-wide ETL Solutions and Data Warehouse tools for reporting and data analysis.

•Excellent understanding and knowledge of job workflow scheduling and locking tools/services like Oozie and Zookeeper.

•Hands on experience with data ingestion tools Kafka, Flume and workflow management tools Oozie.

•Used the Spark DataFrames API on the Cloudera platform to perform analytics on Hive data and used DataFrame operations to perform the required validations on the data.

•Good understanding and knowledge of NoSQL databases such as MongoDB, HBase, and Cassandra, as well as PostgreSQL.

•Worked with various file formats such as delimited text files, clickstream log files, Apache log files, Avro files, JSON files, and XML files. Proficient in using columnar file formats such as RC, ORC, and Parquet. Good understanding of compression techniques used in Hadoop processing, such as gzip, Snappy, and LZO.

•Experience with ETL concepts using Informatica PowerCenter and Ab Initio.

•Built and deployed modular data pipeline components such as Apache Airflow DAGs, AWS Glue jobs, and AWS Glue crawlers through a CI/CD process.

•Experience in developing custom UDFs for Pig and Hive to incorporate methods and functionality of Python/Java into Pig Latin and HQL (HiveQL) and Used UDFs from Piggybank UDF Repository.

•Hands on experience with Apache Airflow as a pipeline tool.

•Worked with NoSQL databases like HBase in creating HBase tables to load large sets of semi-structured data coming from various sources.

•Experience with Data Analytics, Data Reporting, Ad-hoc Reporting, Graphs, Scales, PivotTables and OLTP reporting.

•Worked with various programming languages using IDEs and tools such as Eclipse, NetBeans, IntelliJ, PuTTY, and Git.

•Proficient in data modeling techniques and concepts to support data consumers.

•Used AWS ECS, ECR, and Fargate to scale and organize containerized workloads.

•Worked on GitLab for implementing Continuous Integration and Continuous Deployment (CI/CD) pipelines, streamlining the process of building, testing, and deploying code changes.
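
The Hive-to-Spark conversion work mentioned above (see the DataFrames bullet) follows the general pattern sketched below. This is a minimal illustrative PySpark sketch rather than a project artifact: the database, table, and column names (sales_db.orders, order_date, region, amount) are hypothetical placeholders, and it assumes a Spark build with Hive support enabled.

# Minimal sketch: rewriting a HiveQL aggregation as Spark DataFrame transformations.
# Hypothetical original HiveQL:
#   SELECT region, SUM(amount) AS total_amount
#   FROM sales_db.orders
#   WHERE order_date >= '2023-01-01'
#   GROUP BY region;
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("hive-to-dataframe-example")
    .enableHiveSupport()   # read tables registered in the Hive metastore
    .getOrCreate()
)

orders = spark.table("sales_db.orders")           # placeholder Hive table

totals = (
    orders
    .filter(F.col("order_date") >= "2023-01-01")  # WHERE clause
    .groupBy("region")                            # GROUP BY
    .agg(F.sum("amount").alias("total_amount"))   # SUM(...) AS total_amount
)

totals.show()

The same transformations can be expressed almost line-for-line in Scala; the DataFrame API keeps the query readable while letting Spark optimize the physical plan.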

Technical Skills:

Languages

Shell scripting, SQL, PL/SQL, Python, R, PySpark, Pig, HiveQL, Scala, Regular Expressions

Hadoop Distribution

Cloudera CDH, Hortonworks HDP, Apache, AWS

Big Data Ecosystem

HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, Zookeeper, Kafka, Cassandra, Apache Spark, Spark Streaming, HBase, Impala

Python Libraries/Packages

NumPy, SciPy, Boto, Pickle, PySide, PyTables, DataFrames, Pandas, Matplotlib, SQLAlchemy, httplib2, urllib2, Beautiful Soup, PyQuery

Databases

Oracle 10g/11g/12c, SQL Server, MySQL, Cassandra, Teradata, PostgreSQL, MS Access, Snowflake, NoSQL databases (HBase, MongoDB), DynamoDB, RDS, T-SQL

Cloud Technologies

Amazon Web Services (AWS), Microsoft Azure

Version Control

Git, GitHub, GitLab

IDE & Tools, Design

Eclipse, Visual Studio, NetBeans, JUnit, CI/CD, SQL Developer, MySQL Workbench, Tableau

Operating Systems

Windows 98/2000/XP/7/10, macOS, Unix, Linux

Data Engineer/Big Data Tools / Cloud / ETL/ Visualization / Other Tools

Databricks, Hadoop Distributed File System (HDFS), Hive, Pig, Sqoop, MapReduce, Spring Boot, Flume, YARN, Hortonworks, Cloudera, MLlib, Oozie, Zookeeper, AWS, Azure Databricks, Azure Data Explorer, Azure Cosmos DB, Azure HDInsight, Linux, Bash shell, Unix, Tableau, Power BI, SAS, Crystal Reports, dashboard design, Glue, Athena, Lake Formation, Airflow, nf-core, Nextflow, AWS Batch

Project Experience:

Client: Merck Pharma, Boston, MA Feb 2022 – Present

Role: Senior Big Data Engineer

Responsibilities:

•Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.

•Developed automated regression scripts in Python for validation of the ETL process across multiple databases, including AWS Redshift, Oracle, MongoDB, DynamoDB, Snowflake, PostgreSQL, RDS, T-SQL, and SQL Server.

•Experience in Python, SQL, YAML, and Spark programming (PySpark).

•Applied knowledge of modern software delivery methods such as TDD, BDD, and CI/CD, along with Infrastructure as Code (IaC).

•Involved as the primary on-site ETL developer during the analysis, planning, design, development, and implementation stages of projects using IBM InfoSphere software (QualityStage v9.1, Web Service, Information Analyzer, ProfileStage).

•Utilized Python, including Pandas, NumPy, PySpark, and Impala, for data manipulation and analysis, resulting in streamlined processes and data-driven recommendations.

•Developed interactive dashboards and reports using Power BI, Tableau, and Qlik Sense to visualize complex datasets and communicate insights to stakeholders effectively.

•Conducted data analysis and extraction on Hadoop, contributing to improved data accessibility and accuracy.

•Collaborated with cross-functional teams to implement data integration solutions, leveraging tools like Apache NiFi, Azure Data Factory, and Pentaho.

•Applied domain knowledge in genomics and life sciences to enhance data processing capabilities.

•Created and optimized Spark and EMR jobs for efficient data handling and analysis.

•Worked extensively on NF-Core, Airflow, and Nextflow to streamline workflow processes.

•Designed and implemented data management pipelines in Airflow for effective data organization (see the Airflow sketch after this list).

•Demonstrated proficiency in AWS services, including Glue, Batch, Athena, and EMR Serverless.

•Conducted comprehensive analysis and interpretation of genomic data to extract meaningful insights.

•Collaborated with cross-functional teams to integrate bioinformatics solutions into broader projects.

•Utilized Python, Scala, and SQL for data manipulation and processing tasks.

•Implemented and maintained data workflows using NF-Core and Nextflow.

•Experience in Data modeling tools such as Erwin and Toad Data Modeler.

•Worked in an AWS environment for the development and deployment of custom Hadoop applications. Experience with AWS data processing, analytics, and storage services such as Simple Storage Service (S3), Glue, Athena, and Lake Formation.

•Worked on GitLab to implement Continuous Integration and Continuous Deployment (CI/CD) pipelines, streamlining the process of building, testing, and deploying code changes.

•Developed Spark programs with Python and applied functional programming principles to process complex structured data sets.

•Experience with Cognos Framework Manager.

•Developed internal APIs using Node.js and used MongoDB for fetching the schema. Worked on Node.js for developing server-side web applications. Implemented views and templates with Django's view controller and the Jinja templating language to create a user-friendly website interface.

•Working experience with the Amazon Web Services (AWS) cloud platform, including services such as EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, Security Groups, EC2 Container Service (ECS), CodeCommit, CodePipeline, CodeBuild, CodeDeploy, Redshift, CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS, SES, and Fargate.

•Reduced access time by refactoring data models, streamlining queries, and implementing a Redis cache to support Snowflake.

•Strong experience working with Elastic MapReduce (EMR) and setting up environments on Amazon AWS EC2 instances.

•Prepared data mapping documents (DMD) and designed the ETL jobs based on the DMD with the required tables in the dev environment.

•Generated metadata and created Talend ETL jobs and mappings to load the data warehouse and data lake.

•Designed and developed a real-time stream processing application using Spark, Kafka, Scala, and Hive to perform streaming ETL and apply machine learning.

•Filtered and cleaned data using Scala code and SQL queries.

•Experience in data processing tasks such as collecting, aggregating, and moving data using Apache Kafka.

•Used Kafka to load data into HDFS and move data back to S3 after processing.

•Worked with Hadoop infrastructure to store data in HDFS and used Spark/Hive SQL to migrate the underlying SQL codebase to AWS.

•Used a test-driven approach for developing the application, implemented unit tests using the Python unittest framework, and developed isomorphic ReactJS and Redux-driven API client applications.

•Converted Hive/SQL queries into Spark transformations using Spark RDDs and PySpark.

•Analyzed SQL scripts and designed the solution for implementation using PySpark.

•Exported tables from Teradata to HDFS using Sqoop and built tables in Hive.

•Implemented the installation and configuration of a multi-node cluster in the cloud using Amazon Web Services (AWS) EC2 instances.

•Used Spark SQL to load JSON data, created schema RDDs, loaded them into Hive tables, and handled structured data using Spark SQL.

•Designed and developed the architecture for a data services ecosystem spanning relational, NoSQL, and big data technologies. Extracted large volumes of data from Amazon Redshift, AWS, and Elasticsearch using SQL queries to create reports.

•Used Talend for Big Data Integration using Spark and Hadoop.

•Used Kafka and Kafka brokers to initiate the Spark context and process live streaming information with RDDs, and used Kafka to load data into HDFS and NoSQL databases.

•Used Kafka features such as distribution, partitioning, and the replicated commit log for messaging systems by maintaining feeds, and created applications that monitor consumer lag within Apache Kafka clusters.

•Collected data from AWS S3 buckets in near real time using Spark Streaming, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.

•Worked on SQL Server components: SSIS (SQL Server Integration Services), SSAS (Analysis Services), and SSRS (Reporting Services). Used Informatica, SSIS, SPSS, and SAS to extract, transform, and load source data from transaction systems.

•Developed reusable objects such as PL/SQL program units and libraries, database procedures, functions, and triggers to be used by the team, satisfying the business rules.

•Developed APIs in Python with SQLAlchemy for ORM along with MongoDB, documented the APIs in Swagger, and deployed the application via Jenkins. Developed RESTful APIs using Python Flask and SQLAlchemy data models, and ensured code quality by writing unit tests with Pytest.

•Experience in Microsoft Azure/cloud services such as SQL Data Warehouse, Azure SQL Server, Azure Databricks, Azure Data Lake, Azure Blob Storage, Azure Data Factory, Azure Cosmos DB, SQL Server, SSIS, and SSRS.

•Experience with Data Analytics, Data Reporting, Ad-hoc Reporting, Graphs, Scales, PivotTables and OLTP reporting.
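
As referenced in the Airflow bullet above, a data management pipeline in Airflow is expressed as a DAG of tasks. The sketch below is a minimal, hypothetical example (the DAG id, schedule, and task bodies are placeholders, not the actual project pipeline); in practice each task would hand off to a Spark/EMR, AWS Glue, or AWS Batch job rather than do the work inside the Python callable.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(**context):
    # placeholder: e.g. stage raw input files from S3
    print("extracting raw data")

def transform(**context):
    # placeholder: e.g. trigger a Spark/Glue job that cleans and aggregates the data
    print("transforming data")

def load(**context):
    # placeholder: e.g. publish curated tables for downstream consumers
    print("loading curated data")

with DAG(
    dag_id="data_management_pipeline",   # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load

Versioning DAG files in GitLab and deploying them through the CI/CD pipeline keeps the workflow definitions reviewable alongside the rest of the codebase.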

Environment: Python, Django, Flask, Hadoop, Spark, Scala, HBase, Hive, UNIX, Erwin, TOAD, MS SQL Server database, XML files, AWS, Cassandra, MongoDB, Kafka, IBM InfoSphere DataStage, PL/SQL, Oracle 12c, flat files, Autosys, MS Access database.

Client: Chewy, Dania Beach, FL Mar 2020 – Jan 2022

Role: Big Data Engineer

Responsibilities:

•Worked in an Agile environment and used the Rally tool to maintain user stories and tasks.

•Implemented ad-hoc analysis solutions using Azure Data Lake Analytics/Store and HDInsight.

•Implemented Apache Sentry to restrict the access on the hive tables on a group level.

•Good exposure to MapReduce programming using Java, Pig Latin scripting, distributed applications, and HDFS.

•Experienced in using the Tidal Enterprise Scheduler and Oozie operational services for coordinating the cluster and scheduling workflows.

•Developed microservices by creating REST APIs and used them to access data from different suppliers and to gather network traffic data from servers. Wrote and executed various MySQL database queries from Python using the Python-MySQL connector and the MySQLdb package.

•Designed and implemented Kafka topics by configuring them in the new Kafka cluster across all environments.

•Created multiple dashboards in Tableau for multiple business needs.

•Experience in Data modeling tools such as Erwin and Toad Data Modeler.

•Implemented partitioning, dynamic partitions, and buckets in Hive for efficient data access.

•Architected and implemented medium to large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).

•Employed the Avro format for all data ingestion for faster operation and lower space utilization.

•Working experience with the Amazon Web Services (AWS) cloud platform, including services such as EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, Security Groups, EC2 Container Service (ECS), ECR, CodeCommit, CodePipeline, CodeBuild, CodeDeploy, Redshift, CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS, SES, Fargate, Glue, Athena, and Lake Formation.

•Designed SSIS packages to extract, transform, and load (ETL) existing data into SQL Server from different environments for the SSAS (OLAP) cubes.

•Involved in the development of REST web services for sending and receiving data from external interfaces in JSON format. Wrote and executed various MySQL database queries from Python using the Python-MySQL connector and the MySQLdb package.

•Developed visualizations and dashboards using Power BI.

•Implemented a Composite server for data virtualization needs and created multiple views for restricted data access using a REST API.

•Exported the analyzed data to relational databases using Sqoop for visualization and to generate reports for the BI team using Tableau.

•Installed a Kerberos-secured Kafka cluster (without encryption) in Dev and Prod, and set up Kafka ACLs on it.

•Developed Apache Spark applications for data processing from various streaming sources.

•Strong knowledge of the architecture and components of Tealeaf, and efficient in working with Spark Core and Spark SQL. Designed and developed RDD seeds using Scala and Cascading. Streamed data into Spark Streaming using Kafka (see the streaming sketch after this list).

•Exposure to Spark, Spark Streaming, Spark MLlib, Snowflake, and Scala, and created the DataFrames handled in Spark with Scala.

•Developed data pipelines using Spark, Hive, Pig, Python, Impala, and HBase to ingest customer data.

•Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala.

•Queried and analyzed data from Cassandra for quick searching, sorting and grouping through CQL.

•Joined various tables in Cassandra using Spark and Scala and ran analytics on top of them.

•Applied advanced Spark techniques such as text analytics and processing using in-memory processing.

•Implemented Apache Drill on Hadoop to join data from SQL and NoSQL databases and store it in Hadoop.

•Brought data from various sources into Hadoop and Cassandra using Kafka.

•Migrated on-premises data (Oracle, SQL Server, DB2, MongoDB) to Azure Data Lake Storage (ADLS) using Azure Data Factory (ADF v1/v2).

•Worked on GitLab to implement Continuous Integration and Continuous Deployment (CI/CD) pipelines, streamlining the process of building, testing, and deploying code changes.

•Worked with SQL Server Reporting Services (SSRS); created and formatted cross-tab, conditional, drill-down, top-N, summary, form, OLAP, sub-reports, ad-hoc, parameterized, interactive, and custom reports.

•Created action filters, parameters, and calculated sets for preparing dashboards and worksheets using Power BI.

•Designed and developed Oracle PL/SQL and shell scripts for data import/export, data conversion, and data cleansing.

•Developed RESTful APIs using Python Flask and SQLAlchemy data models, and ensured code quality by writing unit tests with Pytest. Contributed to migrating data from an Oracle database to Apache Cassandra (NoSQL) using the SSTable loader.

•Developed Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation across multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
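
The Kafka-to-Spark streaming work referenced above follows the pattern sketched below. This is a minimal PySpark Structured Streaming sketch with hypothetical broker, topic, schema, and path names; it also assumes the spark-sql-kafka connector package is available on the cluster.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-stream-example").getOrCreate()

# Hypothetical schema for the JSON messages on the topic.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("sku", StringType()),
    StructField("price", DoubleType()),
])

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "order-events")                 # placeholder topic
    .load()
)

# Kafka delivers bytes; cast the value to string and parse the JSON payload.
events = (
    raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Persist the parsed stream; checkpointing makes the query restartable.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "/data/curated/order_events")                  # placeholder path
    .option("checkpointLocation", "/data/checkpoints/order_events")
    .outputMode("append")
    .start()
)
query.awaitTermination()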

Environment: Python, Pytest, Flask, SQLAlchemy, MapReduce, HDFS, Hive, Pig, Impala, Kafka, Cassandra, Spark, Scala, Solr, Azure (SQL, Databricks, Data Lake, Data Storage, HDInsight), Java, SQL, Tableau, Zookeeper, Sqoop, Teradata, Power BI, CentOS, Pentaho.

Client: Qualcomm, San Diego, CA Nov 2017 – Feb 2020

Role: Data Engineer

Responsibilities:

•Developed a Python utility to validate HDFS tables against source tables (see the validation sketch after this list).

•Conduct systems design, feasibility and cost studies and recommend cost-effective cloud solutions such as Amazon Web Services (AWS).

•Implemented code in Python to retrieve and manipulate data.

•Designed ETL Process using Informatica to load data from Flat Files, and Excel Files to target Oracle Data Warehouse database.

•Worked on GitLab to implement Continuous Integration and Continuous Deployment (CI/CD) pipelines, streamlining the process of building, testing, and deploying code changes.

•Configured AWS Identity and Access Management (IAM) groups and users for improved login authentication.

•Loaded data into S3 buckets using AWS Glue and PySpark.

•Involved in the web application penetration testing process and web crawling to detect and exploit SQL injection vulnerabilities. Wrote an automated Python test script to store machine-detection alarms in the Amazon cloud when a pump experiences overloading.

•Automated all the jobs for pulling data from the FTP server and loading it into Hive tables using Oozie workflows.

•Responsible for developing Python wrapper scripts that extract a specific date range using Sqoop by passing the custom properties required for the workflow.

•Involved in filtering data stored in S3 buckets using Elasticsearch and loaded data into Hive external tables.

•Designed and developed UDFs to extend the functionality of both Pig and Hive.

•Imported and exported data between MySQL and HDFS using Sqoop on a regular basis.

•Developed a shell script to create staging and landing tables with the same schema as the source and to generate the properties used by Oozie jobs.

•Used Python and Django for backend development, Bootstrap and Angular for frontend connectivity, and MongoDB for the database. Developed Django ORM module queries that pre-load data to reduce the number of database queries needed to retrieve the same amount of data.

•Experience in Microsoft Azure/cloud services such as SQL Data Warehouse, Azure SQL Server, Azure Databricks, Azure Data Lake, Azure Blob Storage, Azure Data Factory, Azure Cosmos DB, SQL Server, SSIS, and SSRS.

•Worked with NoSQL databases like HBase, creating HBase tables to load large sets of semi-structured data coming from various sources.

•Developed Oozie workflows for executing Sqoop and Hive actions

•Built various graphs for business decision making using Python matplotlib library.
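
The HDFS validation utility mentioned in the first bullet of this section follows the general shape sketched below: compare a Hive table loaded into HDFS against its source table. This is a simplified, hypothetical sketch (the table names, JDBC URL, and count-only check are placeholders); a fuller version would also compare checksums or column-level aggregates and requires the appropriate JDBC driver on the classpath.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hdfs-table-validation")
    .enableHiveSupport()
    .getOrCreate()
)

def validate_counts(hive_table: str, jdbc_url: str, source_table: str) -> bool:
    """Return True when the Hive table and the source table have equal row counts."""
    hive_count = spark.table(hive_table).count()
    source_count = (
        spark.read.format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", source_table)
        .load()
        .count()
    )
    print(f"{hive_table}: hive={hive_count} source={source_count}")
    return hive_count == source_count

if __name__ == "__main__":
    ok = validate_counts(
        "staging.customer",                      # placeholder Hive table
        "jdbc:oracle:thin:@//dbhost:1521/ORCL",  # placeholder source JDBC URL
        "CRM.CUSTOMER",                          # placeholder source table
    )
    print("MATCH" if ok else "MISMATCH")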

Environment: Python, Django, HDFS, Spark, Hive, Sqoop, AWS, Oozie, ETL, Pig, Oracle 10g, MySQL, NoSQL, HBase, Windows.

Client: Accel Frontline Technologies, India Oct 2016 – Aug 2017

Role: Hadoop Developer

Responsibilities:

•Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java and Scala for data cleaning and preprocessing.

•Installed and configured Hive, wrote Hive UDFs, and used MapReduce and JUnit for unit testing.

•Queried and analyzed data from DataStax Cassandra for quick searching, sorting, and grouping.

•Experienced in working with various data sources such as Teradata and Oracle. Successfully loaded files from Teradata into HDFS, and loaded data from HDFS into Hive and Impala.

•Used the YARN architecture and MapReduce in the development cluster for a POC.

•Supported MapReduce programs running on the cluster. Involved in loading data from the UNIX file system to HDFS.

•Experienced in installing, configuring and using Hadoop Ecosystem components.

•Designed and implemented a product search service using Apache Solr/Lucene.

•Involved in implementing and integrating various NoSQL databases such as HBase and Cassandra.

•Experienced in importing and exporting data into HDFS and Hive using Sqoop.

•Participated in development/implementation of Cloudera Hadoop environment.

•Monitored Python scripts running as daemons in the UNIX/Linux background to collect trigger and feed-arrival information. Created a Python/MySQL backend for data entry from Flash.

•Implemented Kafka consumers to move data from Kafka partitions into Cassandra for near-real-time analysis (see the consumer sketch after this list).

•Worked on installing the cluster, commissioning and decommissioning DataNodes, NameNode recovery, capacity planning, and slot configuration.
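
The Kafka-to-Cassandra bullet above corresponds to the pattern sketched below, using the kafka-python and cassandra-driver libraries. The broker, topic, keyspace, table, and column names are hypothetical placeholders, and a production consumer would add batching, error handling, and offset management.

import json
from kafka import KafkaConsumer
from cassandra.cluster import Cluster

# Consume JSON messages from a (placeholder) topic.
consumer = KafkaConsumer(
    "sensor-events",
    bootstrap_servers=["broker:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)

# Connect to a (placeholder) Cassandra keyspace and prepare the insert once.
session = Cluster(["cassandra-node"]).connect("analytics")
insert = session.prepare(
    "INSERT INTO sensor_events (event_id, device, reading) VALUES (?, ?, ?)"
)

for message in consumer:
    event = message.value
    session.execute(insert, (event["event_id"], event["device"], event["reading"]))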

Environment: Python, CDH, MapReduce, Scala, Kafka, Spark, Solr, HDFS, Hive, Pig, Impala, Cassandra, Java, SQL, Tableau, Zookeeper, Pentaho, Sqoop, Teradata, CentOS.

Client: Pennant Technologies, India Dec 2014 – Oct 2016

Role: Data Engineer

Responsibilities:

•Responsible for gathering requirements from Business Analysts and identifying the data sources required for the requests.

•Wrote SAS Programs to convert Excel data into Teradata tables.

•Worked on importing/exporting large amounts of data from files to Teradata and vice versa.

•Created multi-set tables and volatile tables using existing tables and collected statistics on tables to improve performance.

•Wrote Python scripts to parse files and load the data into the database, used Python to extract weekly information from the files, and developed Python scripts to clean the raw data.

•Worked on data pre-processing and cleaning to perform feature engineering, and applied data imputation techniques for missing values in the dataset using Python (see the sketch after this list).

•Developed Teradata SQL scripts using OLAP functions such as RANK and RANK OVER to improve query performance when pulling data from large tables.

•Designed and developed weekly, monthly reports related to the Logistics and manufacturing departments using Teradata SQL.
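
The cleaning and imputation work described above generally looks like the pandas sketch below. The input file and column handling are hypothetical placeholders, and median/mode imputation is only one of several strategies that could have been applied.

import pandas as pd

df = pd.read_csv("weekly_extract.csv")   # placeholder input file

# Basic cleaning: normalize column names and drop exact duplicate rows.
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
df = df.drop_duplicates()

# Simple imputation: median for numeric columns, mode for everything else.
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].median())
    elif not df[col].mode().empty:
        df[col] = df[col].fillna(df[col].mode().iloc[0])

df.to_csv("weekly_extract_clean.csv", index=False)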

Environment: Teradata, BTEQ, VBA, Python, SAS, FLOAD, MLOAD, UNIX, SQL, Windows XP, Business Objects.

Qualification:

Bachelor of Technology in Civil Engineering from Kakatiya Institute of Technology and Science, Warangal, in 2014.


