Big Data Software Developer

Location:

Elk Grove Village, IL

Posted:

February 17, 2025

Resume:

Name: Chaitanya Kurumella

Contact: 484-***-****

Email: ******************.***@*****.***

LinkedIn: linkedin.com/in/chaitanya-kurumella-24041b10a/

Summary:

●Around 12 years of professional experience as a software developer designing, developing, deploying, and supporting large-scale distributed systems.

●Around 8 years of extensive experience as a Data Engineer and Big Data Developer specializing in the Big Data ecosystem: data ingestion, modelling, analysis, integration, and processing.

●Extensive experience in providing solutions for Big Data using Hadoop, Spark, HDFS, MapReduce, YARN, Kafka, Pig, Hive, Sqoop, HBase, Oozie, Zookeeper, Cloudera Manager, and Hortonworks.

●Strong experience working with Amazon cloud services like EMR, Redshift, DynamoDB, Lambda, Athena, Glue, S3, API Gateway, RDS, CloudWatch for efficient processing of Big Data.

●Hands-on experience building PySpark, Spark Java, and Scala applications for batch and stream processing involving transformations, actions, and Spark SQL queries on RDDs and DataFrames.

●Strong experience writing, troubleshooting, and optimizing Spark scripts using Python, Scala.

●Experienced in using Kafka as a distributed publisher-subscriber messaging system.

●Strong knowledge of performance tuning for Hive queries and troubleshooting issues related to joins and memory exceptions in Hive.

●Exceptionally good understanding of partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive.

●Experience in importing and exporting data between HDFS and Relational Databases using Sqoop.

●Experience in real-time analytics with Spark Streaming and Kafka, and in implementing batch processing using Hadoop, MapReduce, Pig, and Hive.

●Experienced in building highly scalable Big Data solutions using NoSQL databases like Cassandra, MongoDB, and HBase by integrating them with a Hadoop cluster.

●Extensive work on ETL consisting of data transformation, data sourcing, mapping, conversion and loading data from heterogeneous systems like flat files, Excel, Oracle, Teradata, MSSQL Server.

●Experience building ETL production pipelines using Informatica PowerCenter, SSIS, SSAS, and SSRS.

●Proficient at writing MapReduce jobs and UDFs to gather, analyse, transform, and deliver data as per business requirements, and at optimizing existing algorithms for best results.

●Experience in working with Data warehousing concepts like Star Schema, Snowflake Schema, DataMarts, Kimball Methodology used in Relational and Multidimensional data modelling.

●Strong experience leveraging different file formats like Avro, ORC, Parquet, JSON and Flat files.

●Sound knowledge of normalization and denormalization techniques for OLAP and OLTP systems.

●Good experience with version control tools: Git, GitHub, and Bitbucket.

●Experience with Jira, Confluence, and Rally for project management, and with the Oozie and Airflow scheduling tools.

●Monitor portfolio performance, attribution, and benchmarks, making adjustments as needed to achieve targeted returns.

●Conduct thorough research and analysis of financial markets, macroeconomic trends, and industry sectors to provide actionable insights for trading decisions.

●Monitor news, economic indicators, and geopolitical events to identify potential market opportunities and risks.

●Good experience with Quantexa platform updates, AWS services, and best practices to enhance data engineering capabilities and contribute to continuous improvement initiatives.

●Integrate data and workflows into the Quantexa platform, ensuring seamless connectivity and data synchronization for advanced analytics and entity resolution.

●Configure and manage connectors, APIs, and data loaders within the Quantexa environment

●Strong scripting skills in Python, Scala, and UNIX shell.

●Involved in writing Python and Java APIs for AWS Lambda functions to manage AWS services.

●Experience in design, development and testing of Distributed Client/Server and Database applications using Java, Spring, Hibernate, JSP, JDBC, REST services on Apache Tomcat Servers.

●Hands-on working experience with RESTful APIs, API life cycle management, and consuming RESTful services.

●Good working experience with Agile/Scrum methodologies, including participating in Scrum calls on project analysis and development.

●Worked with Google Cloud Platform (GCP) services like Compute Engine, Cloud Functions, Cloud DNS, Cloud Storage, and Cloud Deployment Manager, and with the SaaS, PaaS, and IaaS concepts of cloud computing as implemented on GCP.

Technical Skills:

Programming Languages: Python, Scala, SQL, Java, C/C++, Shell Scripting

Web Technologies: HTML, CSS, XML, AJAX, JSP, Servlets, JavaScript

Big Data Stack: Hadoop, Spark, MapReduce, Hive, Pig, YARN, Sqoop, Flume, Oozie, Kafka, Impala, Storm

Cloud Platform: Amazon Web Services (AWS), Google Cloud Platform (GCP)

Relational databases: Oracle, MySQL, SQL Server, DB2, PostgreSQL, Teradata, Snowflake

NoSQL databases: MongoDB, Cassandra, HBase

Version Control Systems: Bitbucket, Git, SVN, GitHub

IDEs: PyCharm, IntelliJ IDEA, Jupyter Notebooks, Google Colab, Eclipse

Operating Systems: Unix, Linux, Windows

Professional experience:

Client: Arbella Insurance, Massachusetts Sep 2023 – Present

Senior Data Engineer

Responsibilities:

●Worked on building the data pipelines (ELT/ETL scripts), extracting the data from different sources (DB2, AWS S3 files), transforming and loading the data to the Data Warehouse (AWS Redshift).

●Worked on adding a REST API layer to ML models built using Python and Flask, and deploying the models in an AWS Elastic Beanstalk environment using Docker containers.

●Ability to debug and troubleshoot Terraform deployments.

●Developed and added a few analytical dashboards using Looker.

●Designed and implemented scalable and reliable data pipelines using Kubernetes and Apache Spark for real-time data processing

●Utilized Kubernetes API and Helm charts to automate deployment and monitoring of data applications

●Implemented data security measures in Kubernetes clusters to ensure data privacy and compliance with regulations

●Developed an AWS TDM (Test Data Management) strategy for the organization, reducing the time and effort required for data provisioning and ensuring data privacy and security.

●Implemented AWS data quality checks using AWS DQ (Data Quality) tools, detecting and resolving data issues in real-time, resulting in improved data accuracy and consistency.

●Built aggregate and de-normalized tables and populated them using ETL to improve Looker analytical dashboard performance and to help data scientists and analysts speed up ML model training and analysis.

●Created new dashboards, reports, scheduled searches, and alerts using Splunk.

●Integrated PagerDuty with Splunk to generate incidents from Splunk.

●Developed custom Jenkins jobs/pipelines that contained Bash shell scripts utilizing the AWS CLI to automate infrastructure provisioning.

●Developed a user-eligibility library using Python to accommodate the partner filters and exclude these users from receiving the credit products.

●Conducted performance testing and troubleshooting of Kubernetes clusters to identify and resolve bottlenecks

●Used Maven to manage project dependencies and build automated deployment processes to deploy data pipelines and applications into production environments.

●Developed and maintained data warehouse using Snowflake to enable ad-hoc querying and reporting for business users

●Good understanding of other AWS services like S3, EC2, IAM, and RDS. Experience with orchestration and data pipeline services like AWS Step Functions, Data Pipeline, and Glue.

●Hands-on experience working with AWS services like Lambda, Athena, DynamoDB, Step Functions, SNS, SQS, S3, and IAM.

●Integrated Lambda with SQS, and DynamoDB with Step Functions, to iterate through a list of messages and update their status in a DynamoDB table.

●Utilized JBoss Data Virtualization (JDV) to create a virtualized data layer, providing a unified view of disparate data sources to support

●Automated data ingestion and processing using JBoss Drools and Apache Airflow, reducing manual effort and increasing efficiency

●Utilized WebLogic JDBC data sources to connect and integrate external databases, enabling seamless data integration and business continuity.

●Implemented security best practices for WebLogic environments, including SSL configuration, dynamic user authentication, and authorization, to ensure data confidentiality and integrity

●Collaborated with development teams to troubleshoot and resolve production issues related to WebLogic, ensuring minimal downtime and maximum availability

●Designed and implemented a highly scalable and automated data pipeline using Ansible for data ingestion, transformation, and loading.

●Developed and maintained Ansible playbooks to deploy and manage big data clusters on cloud platforms like AWS, Google Cloud, and Azure.

●Implemented complex data integration processes using Ansible to connect and transfer data between different systems and databases.

●Utilized Ansible's dynamic inventory feature to dynamically manage and provision resources in the data pipeline based on workload demands.

●Automated deployments of data warehouse applications and BI tools using Ansible, reducing deployment time by 50%.

●Designed disaster recovery processes for data applications running on Kubernetes to ensure minimal downtime

●Implemented logging and monitoring solutions, such as Prometheus and ELK stack, for Kubernetes clusters to track performance and detect issues

●Built data pipelines to aggregate user clickstream session data using the Spark Streaming module, which reads the clickstream data from Kinesis streams and stores the aggregate results in S3, from where the data is eventually loaded into the AWS Redshift warehouse.

●Worked on supporting and building the infrastructure for the core module of Credit Sesame, i.e. Approval Odds, which started with batch ETL, moved to micro-batches, and was then converted to real-time predictions.

●Implemented an end-to-end data solution on Azure using Azure Databricks, ADF, DW, and Power BI.

●Designed a robust data modelling environment using Databricks on Azure, enabling consumers to easily operate highly descriptive Notebooks in a fully governed environment

●Migrated large data sets to Databricks (Spark): created and administered clusters, loaded data, configured data pipelines, and loaded data from ADLS Gen2 into Databricks using ADF pipelines.

●Extensive hands-on experience writing notebooks in Databricks using Python/Spark SQL for complex data aggregations, transformations, and schema operations. Good familiarity with Databricks Delta and DataFrame concepts.

●Developed AWS Lambda serverless scripts to handle ad-hoc requests.

●Performed cost optimization that reduced infrastructure costs.

●Configure and manage connectors, APIs, and data loaders within the Quantexa environment

●Knowledge of and experience using Python with NumPy, Pandas, scikit-learn, ONNX, and machine learning.

●Worked on building the data pipelines using PySpark (AWS EMR), processing the data files present in S3 and loading it to Redshift.

●Other activities include supporting and keeping the data pipelines active, working with Product Managers, Analysts, and Data Scientists and addressing their requests, unit testing, load testing, and SQL optimization on the DB2 server.

Environment: Groovy, Python, Flask, NumPy, Pandas, DB2, Cassandra, AWS EMR, Spark, AWS Kinesis, AWS Redshift, AWS EC2, AWS S3, AWS Elastic Beanstalk, AWS Lambda, AWS Data Pipeline, Quantexa, AWS CloudWatch, Docker, Shell scripts, Looker.

Client: Dun & Bradstreet, Florham Park, NJ Sep 2022 – Aug 2023

Sr Data Engineer

Responsibilities:

●Responsible for building scalable distributed data solutions using Hadoop.

●Involved in Agile Development process (Scrum and Sprint planning).

●Handled Hadoop cluster installations in Windows environment.

●Migrated the on-premises environment to GCP (Google Cloud Platform).

●Experience building and deploying cloud infrastructure using Terraform

●Migrated data warehouses to Snowflake Data warehouse.

●Defined virtual warehouse sizing in Snowflake for different types of workloads.

●Demonstrated knowledge of AWS, Azure, Google Cloud Platform, and other cloud providers

●Involved in porting the existing on-premises Hive code to BigQuery on GCP (Google Cloud Platform).

●Ability to design, develop, and implement Terraform scripts for infrastructure automation.

●Proven understanding of the principles of Infrastructure as Code (IaC).

●Involved in migrating an Oracle SQL ETL to run on Google Cloud Platform using Cloud Dataproc and BigQuery, with Cloud Pub/Sub triggering the Apache Airflow jobs.

●Extracted data from data lakes, EDW to relational databases for analysing and getting more meaningful insights using SQL Queries in DB2 DB and PySpark.

●Developed PySpark script to merge static and dynamic files and cleanse the data.

●Created PySpark procedures, functions, and packages to load data.

●Designed, developed, and maintained data integration programs in a Hadoop and RDBMS environment with both traditional and non-traditional source systems.

●Developed MapReduce programs to parse the raw data, populate staging tables and store the refined data in partitioned tables in the EDW.

●Good experience with Quantexa platform updates, AWS services, and best practices to enhance data engineering capabilities and contribute to continuous improvement initiatives.

●Integrate data and workflows into the Quantexa platform, ensuring seamless connectivity and data synchronization for advanced analytics and entity resolution.

●Wrote Sqoop Scripts for importing and exporting data from RDBMS (DB2) to HDFS.

●Set up a data lake in Google Cloud using Google Cloud Storage, BigQuery, and Bigtable.

●Developed scripts in BigQuery and connected it to reporting tools.

●Designed workflows using Airflow to automate the services developed for Change data capture.

●Carried out data transformation and cleansing using SQL queries and PySpark.

●Used Kafka and Spark Streaming to ingest real-time or near-real-time data into HDFS.

●Worked on downloading BigQuery data into Spark DataFrames for advanced ETL capabilities.

●Worked on PySpark APIs for data transformations.

●Built reports for monitoring data loads into GCP and driving reliability at the site level.

●Participated in daily stand-ups, bi-weekly scrums, and PI planning.

Environment: Hadoop 3.3, GCP, BigQuery, Bigtable, Spark 3.0, PySpark, Sqoop 1.4.7, ETL, HDFS, Snowflake DW, DB2, MapReduce, Kafka 2.8 and Agile process.

Client: Honeywell, Charlotte, NC Aug 2019 – Dec 2021

Sr Data Engineer

Responsibilities:

●Extensive experience in working with AWS cloud Platform (EC2, S3, EMR, Redshift, Lambda and Glue).

●Working knowledge of Spark RDD, DataFrame API, Dataset API, Data Source API, Spark SQL, and Spark Streaming.

●Developed Spark applications using Python and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.

●Worked with Spark for improving performance and optimization of the existing algorithms in Hadoop.

●Used SparkContext, Spark SQL, Spark MLlib, DataFrames, pair RDDs, and Spark on YARN.

●Used Spark Streaming APIs to perform transformations and actions on the fly for building a common learner data model, which gets the data from Kafka in real time and persists it to Cassandra.

●Developed a Kafka consumer API in Python for consuming data from Kafka topics.

●Consumed Extensible Markup Language (XML) messages using Kafka and processed the XML file using Spark Streaming to capture User Interface (UI) updates.

●Developed a preprocessing job using Spark DataFrames to flatten JSON documents into flat files.

●Loaded DStream data into Spark RDDs and performed in-memory computation to generate the output response.

●Experienced in writing live Real-time Processing and core jobs using Spark Streaming with Kafka as a Data pipeline system.

●Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for data sets processing and storage.

●Experienced in Maintaining the Hadoop cluster on AWS EMR.

●Loaded data into S3 buckets using AWS Glue and PySpark. Involved in filtering data stored in S3 buckets using Elasticsearch and loaded data into Hive external tables.

●Configured Snowpipe to pull data from S3 buckets into Snowflake tables.

●Stored incoming data in the Snowflake staging area.

●Created numerous ODI interfaces and loaded into Snowflake DB.

●Worked on Amazon Redshift to consolidate all data warehouses into one data warehouse.

●Good understanding of Cassandra architecture, replication strategy, gossip, snitches etc.

●Designed column families in Cassandra, ingested data from RDBMS, performed data transformations, and then exported the transformed data to Cassandra as per the business requirement.

●Used the Spark Cassandra Connector to load data to and from Cassandra.

●Configured Kafka from scratch, including managers and brokers.

●Experienced in creating data models for clients' transactional logs; analyzed the data from Cassandra tables for quick searching, sorting, and grouping using the Cassandra Query Language.

●Tested cluster performance using the cassandra-stress tool to measure and improve reads and writes.

●Used HiveQL to analyse partitioned and bucketed data; executed Hive queries on Parquet tables.

●Used Apache Kafka to aggregate web log data from multiple servers and make them available in downstream systems for Data analysis and engineering type of roles.

●Worked on implementing Kafka security and boosting its performance.

●Experience using the Avro, Parquet, RCFile, and JSON file formats; developed UDFs in Hive.

●Developed Custom UDF in Python and used UDFs for sorting and preparing the data.

●Worked on custom loaders and storage classes in Pig to handle several data formats like JSON, XML, and CSV, and generated bags for processing using Pig.

●Developed Oozie coordinators to schedule Hive scripts to create Data pipelines.

●Wrote several MapReduce jobs using PySpark and NumPy, and used Jenkins for continuous integration.

●On cluster and testing of HDFS, Hive, Pig and MapReduce to access cluster for new users.

●Continuously monitored and managed the Hadoop cluster through Cloudera Manager.

Environment: Spark, Spark Streaming, Spark SQL, AWS EMR, MapR, HDFS, Hive, Pig, Apache Kafka, Sqoop, Python, PySpark, Shell scripting, Linux, MySQL, Oracle Enterprise DB, SOLR, Jenkins, Eclipse, Oracle, Git, Oozie, Tableau, SOAP, Cassandra and Agile Methodologies.

Client: W3 Softtech, India Sep 2016 – July 2019

Data Engineer

Responsibilities:

●Participated in requirement grooming meetings to understand functional requirements from a business perspective and provide estimates for converting those requirements into software solutions (designing, developing, and delivering code to IT/UAT/PROD, and validating and managing data pipelines from multiple applications, under a fast-paced Agile development methodology using sprints with the JIRA management tool).

●Responsible for checking data in DynamoDB tables and checking that EC2 instances are up and running for the DEV, QA, CERT, and PROD environments in AWS.

●Analyzed existing data flows and created high-level/low-level technical design documents for business stakeholders to confirm that the technical design aligns with business requirements.

●Created and deployed Spark jobs in different environments, loading data into Cassandra (NoSQL), Hive, and HDFS. Secured the data by implementing encryption-based authentication/authorization.

●Implemented AWS solutions using EC2, S3, RDS, EBS, Elastic Load Balancer, Auto Scaling groups, optimized volumes, and EC2 instances, and created monitors, alarms, and notifications for EC2 hosts using CloudWatch.

●Developed code using Apache Spark and Scala with IntelliJ, NoSQL databases (Cassandra), Jenkins, Docker pipelines, GitHub, Kubernetes, the HDFS file system, Hive, Kafka for real-time streaming data, and Kibana for monitoring logs. Responsible for deployments to DEV, QA, PRE-PROD (CERT), and PROD using AWS.

●Scheduled Informatica Jobs through Autosys scheduling tool.

●Created quick filters and customized calculations in SOQL for SFDC queries; used Data Loader for ad hoc data loads into Salesforce.

●Extensively worked on Informatica PowerCenter mappings, mapping parameters, workflows, variables, and session parameters.

●Responsible for facilitating load data pipelines and benchmarking the developed product with the set performance standards.

●Used Debugger within the Mapping Designer to test the data flow between source and target and to troubleshoot the invalid mappings.

●Worked on SQL tools like TOAD and SQL Developer to run SQL Queries and validate the data.

●Studied the existing system and conducted reviews to provide a unified review of jobs.

●Involved in Onsite & Offshore coordination to ensure the deliverables.

●Involved in testing the database using complex SQL scripts and handling performance issues effectively.

Environment: Apache Spark 2.4.5, Scala 2.1.1, Cassandra, HDFS, Hive, GitHub, Jenkins, Kafka, SQL Server 2008, Salesforce Cloud, Visio, TOAD, PuTTY, Autosys Scheduler, UNIX, AWS, WinSCP, Salesforce Data Loader, SFDC Developer Console.

Client: Integra Micro Software Services, India June 2012 – Aug 2016

Role: Data Engineer

Responsibilities:

●Involved in implementing the project through several phases, namely data set analysis, data set preprocessing, user-generated data extraction, and modeling.

●Participated in data acquisition with the Data Engineer team to extract historical and real-time data using Sqoop, Pig, Flume, Hive, MapReduce, and HDFS.

●Wrote user-defined functions (UDFs) in Hive to manipulate strings, dates, and other data.

●Performed data cleaning, feature scaling, and feature engineering using the Pandas and NumPy packages in Python.

●Process improvement: analyzed error data of recurrent programs using Python and devised a new process to reduce the turnaround time for problem resolution by 60%.

●Worked on production data fixes by creating and testing SQL scripts.

●Performed deep dives into complex data sets to analyze trends using linear regression, logistic regression, and decision trees.

●Prepared reports using SQL and Excel to track the performance of websites and apps.

●Visualized data using Tableau to highlight abstract information.

●Applied clustering algorithms, i.e. hierarchical and K-means, using scikit-learn and SciPy.

●Performed data collection, data cleaning, data visualization, and feature engineering using Python libraries such as Pandas, NumPy, Matplotlib, and Seaborn.

●Optimized SQL queries for transforming raw data into MySQL with Informatica to prepare structured data for machine learning.

●Used Tableau for data visualization and interactive statistical analysis.

●Worked with Business Analysts to understand the user requirements, layout, and look of the interactive dashboard.

●Used SSIS to create ETL packages to validate, extract, transform, and load data into a data warehouse and data mart.

●Classified lifetime values based on the RFM model using an XGBoost classifier.

●Maintained and developed complex SQL queries, stored procedures, views, functions, and reports that meet customer requirements using Microsoft SQL Server.

●Participated in building machine learning models using Python.

Environment: Python, PL/SQL scripts, Oracle Apps, Excel, IBM SPSS, Tableau, Big Data, HDFS, Sqoop, Pig, Flume, Hive, MapReduce, SQL, Pandas, NumPy, Matplotlib, Seaborn, ETL, SSIS, SQL Server, Windows.

Education:

Wilmington University, Delaware

Master of Science in Information Systems Technology, 2023

Gandhi Institute of Technology and Management, Visakhapatnam

Bachelor of Science in Mechanical Engineering, 2012
