Sowmith Nagabhiru
**********@*****.***
LinkedIn- https://www.linkedin.com/in/sowmith-nagabhiru-7a9a6a261/
Senior Big Data Engineer
Professional Summary:
8+ years of IT experience in software analysis, design, development, testing, and implementation of Big Data, Hadoop, Cloud, SQL, and NoSQL technologies.
Good exposure to Apache Hadoop MapReduce programming, Pig scripting, distributed applications, and HDFS. Good knowledge of Hadoop cluster architecture and cluster monitoring.
Experienced with the Hadoop ecosystem and Big Data components including Apache Spark, Scala, Python, HDFS, MapReduce, and Kafka.
Worked on Snowflake schemas and data warehousing.
Good knowledge of database creation and maintenance of physical data models with Oracle, Teradata, Netezza, DB2, MongoDB, HBase, and SQL Server databases.
Good knowledge of job orchestration and coordination tools such as Oozie, ZooKeeper, and Airflow.
Utilized analytical applications such as SPSS, Rattle, and Python to identify trends and relationships in data, draw appropriate conclusions, and translate analytical findings into risk management and marketing strategies that drive value.
Have experience in Apache Spark, Spark Streaming, Spark SQL and NoSQL databases like HBase, Cassandra, and MongoDB.
Excellent experience in designing, developing, documenting, and testing ETL jobs and mappings in server and parallel jobs using DataStage to populate tables in data warehouses and data marts.
Experience with Hadoop distributions such as Cloudera and Hortonworks.
Deep understanding of MapReduce with Hadoop and Spark. Good working knowledge of the Big Data ecosystem: Hadoop (HDFS, Hive, Pig, Impala) and Spark (Spark SQL, Spark MLlib, Spark Streaming).
Capable of using AWS services such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs on AWS.
Established and executed a data quality governance framework, including an end-to-end process and data quality framework for assessing decisions that ensure the suitability of data for its intended purpose.
Integrated Kafka with Spark Streaming for real time data processing.
Skilled in data parsing, manipulation, and preparation, including describing and profiling data contents.
Extensive experience in text analytics, generating data visualizations using R and Python, and creating dashboards using tools like Tableau.
Experienced in writing complex SQL queries, including stored procedures, triggers, joins, and subqueries.
Interpret problems and provide solutions to business problems using data analysis, data mining, optimization tools, machine learning techniques, and statistics.
Experienced in dimensional modeling (star schema, snowflake schema), transactional modeling, and slowly changing dimensions (SCD).
Expertise in Creating, Debugging, Scheduling and Monitoring jobs using Airflow.
Worked on Microsoft Azure services such as HDInsight clusters, Blob Storage, ADLS, Data Factory, and Logic Apps; also completed a POC on Azure Databricks.
Experienced with JSON-based RESTful web services and XML-based SOAP web services; also worked on various applications using Python editors and IDEs such as Sublime Text and PyCharm.
Excellent performance in building and publishing customized interactive reports and dashboards with customized parameters and user filters, including tables, graphs, and listings, using Tableau.
Experience with Unix/Linux systems with scripting experience and building data pipelines.
Experience with Data Analytics, Data Reporting, Ad-hoc Reporting, Graphs, Scales, PivotTables and OLAP reporting.
Ability to independently multi-task, be a self-starter in a fast-paced environment, communicate fluidly and dynamically with the team, and perform continuous process improvements with out-of-the-box thinking.
Technical Skills:
Big Data: Cloudera Distribution, HDFS, YARN, DataNode, NameNode, ResourceManager, NodeManager, MapReduce, Pig, Sqoop, Kafka, HBase, Hive, Flume, Cassandra, Spark, Storm, Scala, Impala
Programming: Python, PySpark, Scala, Java, C, C++, Shell script, Perl script, SQL, PL/SQL
Databases: Snowflake (cloud), Teradata, IBM DB2, Oracle, SQL Server, MySQL, NoSQL
Cloud Technologies: AWS, Microsoft Azure
Frameworks: Django REST framework, MVC, Hortonworks
ETL/Reporting: Ab Initio, Informatica, Tableau
Tools: PyCharm, Eclipse, Visual Studio, SQL*Plus, SQL Developer, TOAD, SQL Navigator, Query Analyzer, SQL Server Management Studio, SQL Assistant, Postman
Machine Learning Techniques: Linear & Logistic Regression, Classification and Regression Trees, Random Forest, Association Rules, NLP and Clustering.
Database Modelling: Dimension Modeling, ER Modeling, Star Schema Modeling, Snowflake Modeling
Visualization/ Reporting: Tableau, ggplot2, matplotlib, SSRS and Power BI
Web/App Server: UNIX server, Apache Tomcat
Operating System: UNIX, Windows, Linux, Sun Solaris
Project Experience:
Client: Department of Veterans Affairs, Washington, D.C. Sep’2023 to Present
Role: Data Engineering Lead Developer
Responsibilities:
Evaluated big data technologies and prototyped solutions to improve the data processing architecture; performed data modeling, development, and administration of relational and NoSQL databases.
Used Python in Spark to extract data from Snowflake and upload it to Salesforce on a daily basis.
Designed a Kafka producer client using Confluent Kafka and produced events into a Kafka topic (see the producer sketch after this responsibilities list).
Worked on SQL Server components SSIS (SQL Server Integration Services), SSAS (Analysis Services), and SSRS (Reporting Services). Used Informatica, SSIS, SPSS, and SAS to extract, transform, and load source data from transaction systems.
Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
Worked with Hadoop infrastructure to store data in HDFS and used Spark / Hive SQL to migrate the underlying SQL codebase to AWS.
Worked with the Hadoop ecosystem and implemented Spark using Scala, utilizing the DataFrame and Spark SQL APIs for faster data processing.
Developed Spark Streaming jobs to consume data from Kafka topics of different source systems and push the data into HDFS locations (a streaming sketch follows this responsibilities list).
Extracted, transformed, and loaded data sources to generate CSV data files with Python and SQL queries.
Used Spark SQL to load JSON data, create schema RDDs, and load them into Hive tables; handled structured data using Spark SQL.
Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
Developed Spark programs with Python, applying principles of functional programming to process complex structured data sets.
Performed structural modifications using MapReduce and Hive, and analyzed data using visualization/reporting tools (Tableau).
Responsible for creating on-demand tables on S3 files using Lambda functions and AWS Glue with Python (see the Lambda/Glue sketch after this responsibilities list).
Subscribed to Kafka topics with the Kafka consumer client and processed events in real time using Spark.
Collected data using Spark Streaming from an AWS S3 bucket in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
As part of data migration, wrote many SQL scripts to reconcile data mismatches and worked on loading history data from Teradata SQL to Snowflake.
Implemented a generic, highly available ETL framework for bringing related data into Hadoop and Cassandra from various sources using Spark.
Experience with data analytics, data reporting, ad-hoc reporting, graphs, scales, PivotTables, and OLAP reporting.
Used Python to write an event-based service on AWS Lambda to deliver real-time data to One-Lake (a data lake solution in the Cap-One enterprise).
Utilized Apache Spark with Python to develop and execute Big Data analytics and machine learning applications; executed machine learning use cases under Spark ML and MLlib.
Joined various tables in Cassandra using Spark and Scala and ran analytics on top of them.
Utilized Spark, Scala, Hadoop, HBase, Cassandra, MongoDB, Kafka, and Spark Streaming, along with a broad variety of machine learning methods including classification, regression, and dimensionality reduction.
Exported Data into Snowflake by creating Staging Tables to load Data of different files from Amazon S3.
Developed reusable objects like PL/SQL program units and libraries, database procedures and functions, database triggers to be used by the team and satisfying the business rules.
Developed automated regression scripts in Python to validate ETL processes across multiple databases, including AWS Redshift, Oracle, MongoDB, and SQL Server (T-SQL).
Used Airflow for scheduling the Hive, Spark and MapReduce jobs.
Converted Hive/SQL queries into Spark transformations using Spark RDDs and PySpark.
Filtered and cleaned data using Scala code and SQL queries.
Created Snowpipe for continuous data loads from staged data residing on cloud gateway servers.
Troubleshot errors in HBase Shell/API, Pig, Hive, and MapReduce.
Implemented Installation and configuration of multi-node cluster on Cloud using Amazon Web Services (AWS) on EC2.
Responsible for analyzing large data sets and derive customer usage patterns by developing new MapReduce programs using Java.
Designed and Developed Real Time Stream Processing Application using Spark, Kafka, Scala and Hive to perform Streaming ETL and apply Machine Learning.
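The Confluent Kafka producer work above can be sketched roughly as follows; the broker address, topic name, and event payload are illustrative assumptions, not the actual client configuration.

```python
# Hypothetical Confluent Kafka producer sketch; broker, topic, and fields are placeholders.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "broker1:9092"})

def delivery_report(err, msg):
    # Called once per message to confirm delivery or surface an error.
    if err is not None:
        print(f"Delivery failed for key {msg.key()}: {err}")

def publish_event(event: dict, topic: str = "source-events") -> None:
    producer.produce(
        topic,
        key=str(event.get("id")),
        value=json.dumps(event).encode("utf-8"),
        callback=delivery_report,
    )
    producer.poll(0)      # serve delivery callbacks

publish_event({"id": 1, "status": "NEW"})
producer.flush()          # block until all queued messages are sent
```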
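A minimal PySpark sketch of the Kafka-to-HDFS streaming consumption described above, assuming Structured Streaming and placeholder broker, topic, and HDFS paths:

```python
# Consume a Kafka topic and land raw messages in HDFS as Parquet.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "source-events")
    .option("startingOffsets", "latest")
    .load()
    .select(col("key").cast("string"), col("value").cast("string"), "timestamp")
)

query = (
    events.writeStream.format("parquet")
    .option("path", "hdfs:///data/raw/source_events")
    .option("checkpointLocation", "hdfs:///checkpoints/source_events")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```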
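One hedged way to realize the on-demand S3 table creation with Lambda and AWS Glue is to trigger a Glue crawler from an S3 event; the crawler name and bucket layout below are hypothetical.

```python
# Lambda handler sketch: new S3 object -> refresh Glue catalog via a crawler.
import boto3

glue = boto3.client("glue")

CRAWLER_NAME = "s3_ondemand_tables"   # assumed crawler configured over the S3 prefix

def lambda_handler(event, context):
    # S3 put events arrive as a list of records; log the keys for traceability.
    keys = [r["s3"]["object"]["key"] for r in event.get("Records", [])]
    print(f"New objects: {keys}")
    try:
        glue.start_crawler(Name=CRAWLER_NAME)
    except glue.exceptions.CrawlerRunningException:
        # Crawler is already processing a previous batch; safe to skip.
        pass
    return {"started_crawler": CRAWLER_NAME, "objects": keys}
```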
Environment: Hadoop, Spark, Scala, DataStage, MapReduce, Snowflake, HBase, Hive, Python, PL/SQL, AWS, EC2, S3, Lambda, Auto Scaling, CloudWatch, CloudFormation, IBM InfoSphere, Oracle 12c, flat files, TOAD, MS SQL Server, XML files, Cassandra, MongoDB, Kafka, MS Access, Autosys, UNIX, Erwin.
Client: Credit Suisse, New York, NY Jan’2022 to Aug’2023
Role: Senior Big Data Engineer
Responsibilities:
Developed Pig scripts for data analysis and perform transformation.
Implemented Cloud Security and Data Loss Protection.
Developed an ETL system to replace an existing data pipeline built on DTS using Python 2.7 and SQL server.
Used Scala to store streaming data to HDFS and to implement Spark for faster processing of data (30% faster).
Conducted performance tuning and optimization of SQL queries and Spark jobs to improve overall system efficiency (an example tuning sketch follows this responsibilities list).
Built data ingestion frameworks to automate the extraction and loading of data from diverse sources into data lakes and warehouses.
Implemented data quality checks and validation mechanisms to ensure accuracy and consistency of data across different stages of processing (see the validation sketch after this responsibilities list).
Designed and developed batch and streaming data processing pipelines using Apache Spark and Kafka, reducing processing time by 30%.
Developed and maintained ETL processes to extract, transform, and load data from various sources into data warehouses.
Implemented real-time data streaming solutions using Kafka and Spark Streaming to enable timely insights for business operations.
Optimized performance and scalability of data processing jobs by fine-tuning algorithms and leveraging parallel processing techniques.
Conducted code reviews and provided mentorship to junior team members to foster professional growth and maintain code quality standards.
Worked closely with stakeholders to gather requirements, define project scope and ensure alignment with business objectives.
Implemented monitoring and alerting systems to proactively identify and address issues in data pipelines ensuring high availability and reliability.
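A rough example of the Spark job tuning mentioned above: right-sizing shuffle partitions and broadcasting a small dimension table to avoid a shuffle-heavy join. The table names, partition count, and output path are assumptions.

```python
# Tuned join sketch: fewer shuffles via broadcast, shuffle partitions sized to the cluster.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = (
    SparkSession.builder.appName("tuned-join")
    .config("spark.sql.shuffle.partitions", "200")   # tuned per cluster size
    .getOrCreate()
)

facts = spark.table("warehouse.transactions")        # large fact table
dims = spark.table("warehouse.merchant_dim")         # small dimension table

joined = facts.join(broadcast(dims), on="merchant_id", how="left")
joined.write.mode("overwrite").parquet("hdfs:///data/curated/transactions_enriched")
```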
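The data quality validation referenced above could look roughly like the gate below; the key column and thresholds are placeholders rather than the production rules.

```python
# Simple PySpark data-quality gate: row count, null keys, duplicate keys.
from pyspark.sql import DataFrame
from pyspark.sql.functions import col

def run_quality_checks(df: DataFrame, key_col: str = "customer_id") -> dict:
    total = df.count()
    results = {
        "row_count": total,
        "null_keys": df.filter(col(key_col).isNull()).count(),
        "duplicate_keys": total - df.dropDuplicates([key_col]).count(),
    }
    # Fail fast if the batch violates basic expectations.
    if total == 0 or results["null_keys"] > 0:
        raise ValueError(f"Data quality check failed: {results}")
    return results
```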
Environment: Python, Java, Scala, Hadoop, Spark, Hive, Kafka, HBase, AWS, Azure, Google Cloud Platform, SQL, NoSQL (MongoDB, Cassandra), Apache Airflow, Informatica, Talend, Docker, Kubernetes.
Client: Macy's, New York, NY Feb’2019 to Dec’2022
Role: Big Data Engineer
Responsibilities:
Configured Oozie workflows to run multiple Hive and Pig jobs that run independently based on time and data availability.
Developed a NiFi workflow to pick up data from an SFTP server and send it to a Kafka broker.
Used Hue for running Hive queries. Created daily partitions in Hive to improve performance.
Migrated data into the RV Data Pipeline using Databricks, Spark SQL, and Scala.
Used Databricks for encrypting data using server-side encryption.
Used Airflow to monitor and schedule the workloads.
Involved in creating HDInsight clusters in the Microsoft Azure Portal; also created Event Hubs and Azure SQL databases.
Configured Spark Streaming to consume ongoing data from Kafka and store the streaming data in HDFS.
Created Spark vectorized pandas user-defined functions (UDFs) for data manipulation and wrangling (a pandas UDF sketch follows this responsibilities list).
Developed and designed data integration and migration solutions in Azure.
Set up Azure infrastructure such as storage accounts, integration runtimes, service principal IDs, and app registrations to enable scalable and optimized support for business users' analytical requirements in Azure.
Worked on clustered Hadoop on Windows Azure using HDInsight and the Hortonworks Data Platform for Windows.
Participated in the development, improvement, and maintenance of Snowflake database applications.
Worked with Spark to improve performance and optimize existing algorithms in Hadoop using Spark Context, Spark SQL, PySpark, Impala, Tealeaf, pair RDDs, DevOps, and Spark on YARN.
Created and maintained optimal data pipeline architecture in Microsoft Azure using Data Factory and Azure Databricks.
Worked with PowerShell and UNIX scripts for file transfer, emailing and other file related tasks.
Implemented IoT streaming with Databricks Delta tables and Delta Lake to enable ACID transaction logging (see the Delta Lake sketch after this responsibilities list).
Exposed transformed data on the Azure Databricks platform in Parquet format for efficient data storage.
Delivered denormalized data from the produced layer in the data lake to Power BI consumers for modeling and visualization.
Extracted and updated the data into HDFS using Sqoop import and export.
Utilized Ansible playbooks for code pipeline deployment.
Used Delta Lake, an open-source data storage layer that delivers reliability to data lakes.
Created a custom logging framework for ELT pipeline logging using append variables in Data Factory.
Enabled monitoring and Azure Log Analytics to alert the support team on usage and statistics of the daily runs.
Took proof-of-concept project ideas from the business and led, developed, and created production pipelines that deliver business value using Azure Data Factory.
Installed Kafka producers on different servers and scheduled them to produce data every 10 seconds. Implemented data quality in the ETL tool Talend; good knowledge of data warehousing. Developed Apache Spark applications for data processing from various streaming sources.
Built the logical and physical data models for Snowflake as per the required changes.
Implemented continuous integration/continuous delivery (CI/CD) best practices using Azure DevOps, ensuring code versioning.
Developed Hive UDFs to incorporate external business logic into Hive scripts and developed joined data set scripts using Hive join operations.
Responsible for managing data coming from different sources through Kafka.
Good exposure to MapReduce programming using Java, Pig Latin scripting, distributed applications, and HDFS.
Wrote PySpark and Spark SQL transformations in Azure Databricks to perform complex transformations for business rule implementation (a business-rule sketch follows this responsibilities list).
Implemented Kafka producers, created custom partitions, configured brokers, and implemented high-level consumers to build the data platform.
Developed various Oracle SQL scripts, PL/SQL packages, procedures, functions, and Java code for data processing.
Strong knowledge of the architecture and components of Tealeaf; efficient in working with Spark Core and Spark SQL. Designed and developed RDD seeds using Scala and Cascading. Streamed data to Spark Streaming using Kafka.
Worked with NoSQL databases such as HBase, creating HBase tables to load large sets of semi-structured data coming from various sources.
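The vectorized pandas UDF work above can be illustrated with the sketch below; the column name and cleanup rule are made up for illustration.

```python
# Vectorized (pandas) UDF sketch for string cleanup in PySpark.
import pandas as pd
from pyspark.sql.functions import pandas_udf, col

@pandas_udf("string")
def normalize_sku(sku: pd.Series) -> pd.Series:
    # Runs on whole pandas batches, not row by row.
    return sku.str.strip().str.upper().str.replace(r"[^A-Z0-9-]", "", regex=True)

# df is assumed to be an existing DataFrame with a raw 'sku' column:
# cleaned = df.withColumn("sku", normalize_sku(col("sku")))
```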
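A hedged Delta Lake sketch of the IoT streaming bullet above: append readings to a Delta table and upsert late-arriving corrections with an ACID MERGE. The storage path and column names are assumptions.

```python
# Delta Lake append + MERGE upsert sketch (Databricks / delta-spark).
from delta.tables import DeltaTable

DELTA_PATH = "abfss://lake@storageacct.dfs.core.windows.net/delta/iot_readings"

# Append the current micro-batch (readings_df assumed to exist):
# readings_df.write.format("delta").mode("append").save(DELTA_PATH)

def upsert_corrections(spark, corrections_df):
    target = DeltaTable.forPath(spark, DELTA_PATH)
    (
        target.alias("t")
        .merge(corrections_df.alias("s"),
               "t.device_id = s.device_id AND t.event_ts = s.event_ts")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )
```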
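A small example of the business-rule style PySpark/Spark SQL transformations mentioned above; the rule itself (order tiering) and the column names are illustrative assumptions.

```python
# Business-rule transformation sketch in PySpark, with a Spark SQL equivalent.
from pyspark.sql.functions import when, col

def apply_order_tier(orders_df):
    return orders_df.withColumn(
        "order_tier",
        when(col("order_total") >= 1000, "PLATINUM")
        .when(col("order_total") >= 250, "GOLD")
        .otherwise("STANDARD"),
    )

# Equivalent Spark SQL against a temp view:
# orders_df.createOrReplaceTempView("orders")
# spark.sql("SELECT *, CASE WHEN order_total >= 1000 THEN 'PLATINUM' "
#           "WHEN order_total >= 250 THEN 'GOLD' ELSE 'STANDARD' END AS order_tier FROM orders")
```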
Environment: Hadoop, Spark, MapReduce, Kafka, Scala, Java, Azure Data Factory, Data Lake, Databricks, Azure DevOps, PySpark, Agile, Power BI, Snowflake, Python, R, PL/SQL, Oracle 12c, SQL, NoSQL, HBase, Scaled Agile team environment
Client: British Petroleum, Chicago, IL Nov’2017 to Jan’2019
Role: Big Data Engineer
Responsibilities:
Built the Oozie pipeline that performs several actions: moving files, Sqooping data from the source Teradata or SQL databases, exporting it into Hive staging tables, performing aggregations per business requirements, and loading the results into the main tables.
Ran Apache Hadoop, CDH, and MapR distributions, as well as Elastic MapReduce (EMR), on EC2.
Applied Apache Kafka to combine live streaming with batch processing to generate reports.
Used AWS Data Pipeline to schedule an Amazon EMR cluster to clean and process web server logs stored in Amazon S3 bucket.
Configured ZooKeeper to coordinate the servers in clusters and maintain data consistency, which is important for decision making in the process.
Worked on different data formats such as JSON and XML and applied machine learning algorithms in Python.
Application development using Hadoop Ecosystems such as Spark, Kafka, HDFS, HIVE, Oozie and Sqoop.
Involved in supporting Amazon AWS and RDS to host static/media files and the database in the Amazon cloud.
Automated RabbitMQ cluster installations and configuration using Python/Bash.
Designed and Developed Real Time Data Ingestion frameworks to fetch data from Kafka to Hadoop.
Used partitioning and bucketing in Hive to optimize queries (see the table-layout sketch after this responsibilities list).
Implemented real time system with Kafka and Zookeeper.
Used the Python library Beautiful Soup for web scraping to extract data for building graphs (a scraping sketch follows this responsibilities list).
Continuous monitoring and managing the Hadoop cluster through Cloudera Manager.
Scheduled and executed workflows in Oozie to run Hive and Pig jobs.
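The partitioning/bucketing layout referenced above can be illustrated roughly as follows, expressed here as Spark SQL DDL with Hive support enabled; the database, table, columns, and bucket count are placeholders.

```python
# Hive-style partitioned and bucketed table layout, issued through Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().appName("hive-layout").getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.web_logs (
        user_id   STRING,
        url       STRING,
        status    INT
    )
    PARTITIONED BY (log_date STRING)
    CLUSTERED BY (user_id) INTO 32 BUCKETS
    STORED AS ORC
""")

# Loading one day at a time keeps partition pruning effective:
# spark.sql("INSERT INTO analytics.web_logs PARTITION (log_date='2018-06-01') SELECT ...")
```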
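A minimal sketch of the Beautiful Soup scraping mentioned above; the URL and the table being parsed are hypothetical.

```python
# requests + Beautiful Soup: pull a simple HTML table into (label, value) pairs.
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/metrics", timeout=30)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
rows = []
for tr in soup.select("table#daily-metrics tr")[1:]:   # skip header row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) >= 2:
        rows.append((cells[0], float(cells[1])))

# 'rows' now holds (label, value) pairs ready for plotting.
print(rows[:5])
```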
Environment: Hadoop, HDFS, MapReduce, Hive, Kafka, Spark, AWS, Apache Airflow, Python, Scala, ETL workflows, PL/SQL, SQL Server, Tableau, Pig
Client: Maveric Systems, Chennai, India Apr’2015 to Aug’2017
Role: Data Engineer
Responsibilities:
Used Hive to implement data warehouse and stored data into HDFS. Stored data into Hadoop clusters which are set up in AWS EMR.
Wrote a Pig script that picks up data from one HDFS path, performs aggregation, and loads the output into another path, which later populates another domain table. Converted this script into a JAR and passed it as a parameter in the Oozie workflow.
Continuous monitoring and managing the Hadoop cluster through Cloudera Manager.
Performed Data Preparation by using Pig Latin to get the right data format needed.
Utilized the clinical data to generate features describing different illnesses using LDA topic modelling (an LDA sketch follows this responsibilities list).
Performed data visualization and designed dashboards with Tableau, generating complex reports including charts, summaries, and graphs to interpret the findings for the team and stakeholders.
Used Hive and created Hive tables and involved in data loading and writing Hive UDFs.
Used Sqoop to import data into HDFS and Hive from other data systems.
Used Git for version control with data engineering and data science colleagues.
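A hedged sketch of the LDA topic modelling step above, shown here with scikit-learn; the sample documents, topic count, and vectorizer settings are assumptions.

```python
# LDA topic modelling over clinical notes: document-term matrix -> topic mixtures.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "patient reports chest pain and shortness of breath",
    "follow up visit for type 2 diabetes management",
    # ... remaining clinical notes
]

vectorizer = CountVectorizer(stop_words="english", max_features=5000)
doc_term = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=10, random_state=42)
topic_weights = lda.fit_transform(doc_term)   # per-document topic mixture, usable as features

# Top words per topic, for labelling the illness themes:
terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-8:][::-1]]
    print(f"Topic {idx}: {', '.join(top)}")
```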
Environment: Ubuntu, Hadoop, Spark, PySpark, NiFi, Talend, Spark SQL, Spark MLlib, Pig, Python, Tableau, GitHub, AWS EMR/EC2/S3, and OpenCV.
Client: Nakshatra IT solutions, Hyderabad, India Sep’2014 to Mar’2015
Role: Data Analyst
Responsibilities:
Gathered requirements from Business and documented for project development.
Prepared ETL standards, Naming conventions and wrote ETL flow documentation for Stage, ODS and Mart.
Prepared and maintained documentation for on-going projects.
Worked with Informatica Power Center for data processing and loading files.
Extensively worked with Informatica transformations.
Worked with the SQL*Loader tool to bulk load data into the database.
Developed mappings using Informatica to load data from sources such as Relational tables, Sequential files into the target system.
Created data maps in Informatica to extract data from Sequential files.
Coordinated design reviews, ETL code reviews with teammates.
Interacted with key users and assisted them with various data issues, understood data needs and assisted them with Data analysis.
Collect and link metadata from diverse sources, including relational databases and flat files.
Performed Unit, Integration and System testing of various jobs.
Extensively worked on UNIX Shell Scripting for file transfer and error logging.
Environment: Informatica Power Center, Oracle 10g, SQL Server, SQL*Loader, UNIX Shell Scripting, ESP job scheduler