SUBRAMANIAN VENKATARAMAN
******@*****.***
Saskatoon, SK S7V 0Y3
PROFESSIONAL SUMMARY
Dynamic Big Data expert specializing in data architecture, analytics, machine learning, generative AI, and ETL workflows. Proficient in Hadoop, Spark, Scala, Kafka, Python, and NoSQL databases, with a proven record of designing and deploying scalable data solutions. Recognized for collaborative teamwork, adaptability to evolving project requirements, and a results-oriented approach that ensures project success. Proactive and goal-driven, with strong time-management and problem-solving abilities, reliability, and a quick grasp of new skills. Dedicated to applying these strengths to strengthen team performance and support organizational growth.
SKILLS
Data Engineering, Data Analytics, Data Science, Generative AI
Java, Scala, Python, R Programming, R-Cloud
Hadoop Ecosystem, Spark, Apache Kafka, Hive, Impala, HBase, Cloudera Distribution
ETL Development, Data Pipelines, AML, RESTful APIs
MongoDB, Cassandra, Oracle
NLP, LLM, Llama 2, RAG, Prompt Engineering
Vector DBs: Pinecone, ChromaDB, Replicate DB
R Shiny Dashboard, Power BI
Scikit-learn, TensorFlow, Keras, PyTorch
CNN, RNN, LSTM
AWS Glue, AWS Redshift, Apache Airflow, PyCharm
EXPERIENCE
Sr. Big Data Engineer
Citibank, Oct 2019 - Aug 2024
• Developed ETL Data Pipelines using Spark, Scala, Java, PySpark, Kafka, Oracle, and Hive to deliver an end-to-end transformation solution for Mortgage, Personal Loan, and Credit Card data, including reusable methods for reading and writing Oracle data, parsing AVRO files, and loading data into Hive staging and target environments.
• Enhanced Multi-threading Models in Java to trigger Spark-Submit commands with dynamic input parameters, enabling efficient status tracking for each execution.
• Designed and Implemented Kafka Producer and Consumer modules to manage upstream message intake within the reconciliation framework, ensuring smooth communication of execution results to the original sender.
• Implemented a Kafka Messaging Service for data reconciliation that reads, validates, and parses JSON messages from Kafka and triggers Spark-Submit commands through the Livy REST endpoint (the first sketch after this role's technology list illustrates the pattern).
• Led the Porting of the Reconciliation Framework into Spark Services, collaborating with the Spark Services Infrastructure Team for design, development, and implementation, utilizing Spark, Scala, Java, Kafka, and Hive to trigger jobs from user-generated Kafka messages.
• Developed an Event-driven Publisher-Subscriber Model using Kafka Streams to trigger the reconciliation framework in various environments (SIT, DEV, UAT, PROD).
• The model dynamically selects Spark-Submit configurations and triggers comparisons based on real-time parameters, sending summary messages and error notifications to users based on predefined event statuses.
• Optimized Performance for Scala Applications in UAT by tuning memory and resource allocation to resolve bottlenecks, manage heavy workloads, and adjust executor configurations for better resource utilization.
• Developed a Dynamic SQL Query Tool using Scala, Oracle, and Hive to facilitate data reconciliation between balance, domain, and date columns, with the resulting queries stored as views for streamlined comparison.
• Integrated Dynamic Queries with Arcadia BI Tool to enhance business intelligence capabilities by enabling real-time comparison result display based on filter conditions.
• Created a Tool in Scala to convert AVRO-format data to Parquet based on partition keys, leveraging the Parquet format's columnar efficiency to optimize query performance (the second sketch after this role's technology list illustrates the conversion).
• Developed a REST API model to fetch and display mortgage, credit card, and personal loan reconciliation results from HBase.
• Developed a Data Transfer Tool using shell scripts to move data between production and UAT/SIT environments, utilizing 'distcp' commands to copy and validate data and simplify ingestion for subsequent transformation and loading.
• Created a 'FileWatcher' Tool in shell script to monitor file drops in Unix directories, parse and validate the files, and trigger Spark-Submit jobs, automating the data workflow.
• Developed Utilities in shell scripts for SQL query execution, data transfer, Spark log generation, and email attachment handling, demonstrating proficiency across diverse data-related tasks.
• Developed PySpark Code for Logging Modules to capture user-friendly, error, and validation messages for data pipeline processes, enabling easier monitoring and debugging of ETL processes.
• Coordinated with Business Teams for requirement analysis, design, and delivery, facilitating task allocation, JIRA status tracking, and regular status updates to clients, ensuring timely project progress.
• Technologies: Spark, PySpark, Scala, Java, Kafka, Hive, Impala, ETL, HBase, Starburst, Cloudera, Hadoop, HDFS, Oracle, SQL Server, Arcadia BI Tool, Unix, Shell Scripting, Bitbucket.
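A minimal sketch of the Kafka-to-Livy trigger pattern described above, written in Python with kafka-python and requests (the production framework was Java/Scala); the topic, broker, endpoint, and message fields are hypothetical placeholders:

    # Sketch of the Kafka-to-Livy trigger pattern; names are placeholders.
    import json
    import requests
    from kafka import KafkaConsumer

    LIVY_URL = "http://livy-host:8998/batches"   # assumed Livy batch endpoint

    consumer = KafkaConsumer(
        "recon-requests",                        # hypothetical topic name
        bootstrap_servers="broker:9092",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )

    for msg in consumer:
        req = msg.value                          # e.g. {"file": ..., "args": [...]}
        # Validate the incoming request before triggering a job.
        if not req.get("file"):
            continue                             # skip malformed messages
        payload = {"file": req["file"], "args": req.get("args", [])}
        resp = requests.post(LIVY_URL, json=payload)
        print("Livy batch submitted:", resp.json().get("id"))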
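A minimal PySpark sketch of the AVRO-to-Parquet conversion described above (the original tool was written in Scala); the paths and partition column are assumptions:

    # Sketch of the Avro-to-Parquet conversion; paths and partition column
    # are hypothetical placeholders, not the original tool's configuration.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("avro-to-parquet")
             .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.5.0")
             .getOrCreate())

    df = spark.read.format("avro").load("/data/staging/recon/*.avro")
    # Write Parquet partitioned on the key for faster predicate-pushdown
    # scans from Hive/Impala.
    (df.write.mode("overwrite")
       .partitionBy("business_date")            # assumed partition column
       .parquet("/data/curated/recon_parquet"))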
Sr. Machine Learning Engineering Lead
Capital One - TX, Jan 2019 - Jul 2019
• Project: Linear Regression Model for Auto Loan Default Prediction.
• Developed Python and AWS Lambda scripts for inserting customer records into the customer agreement table as part of the approval workflow, utilizing AWS-S3, Scala, and Quantum Framework (an in-house tool at Capital One), along with ScalaTest and Jenkins for automated testing.
• Optimized the Linear Regression, Random Forest, Gradient Boosting, and KNN models that predict vehicle loan outstanding balances, loan severity, and probability of default, leveraging AWS EC2 to reduce memory consumption and improve execution speed (see the sketch after this role's technology list).
• Utilized AWS Glue to source and transform data from multiple customer-related files (including car loans, credit history, and market data) into a staging area, applying joins and loading into AWS Redshift for further analysis.
• Validated the presence of customer car loan records and verified approval statuses in the S3 bucket using Python, and reconciled record counts in source data files against customer PDF files stored in S3.
• Developed a Jenkins pipeline for automating the application onboarding process to production, including build preparation, validation, and deployment processes.
• Conducted data analysis and traceability for the Front Book Recovery Model, which forecasts payment schedules, auction payments, and deficiency amounts, ensuring proper data lineage and forecasting accuracy.
• Technologies: Python, AWS Lambda, AWS S3, AWS Redshift, AWS Athena, AWS Glue, R, AWS EC2, Jenkins.
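A minimal scikit-learn sketch of the default-prediction modeling described above; logistic regression stands in for the linear model since the target here is default probability, and the data file and feature names are hypothetical:

    # Sketch of the loan-default model comparison; the CSV and feature
    # names are hypothetical, not Capital One data.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score

    df = pd.read_csv("auto_loans.csv")                 # assumed input file
    X = df[["balance", "credit_score", "ltv", "term_months"]]  # assumed features
    y = df["defaulted"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # Compare candidate models on probability of default.
    for model in (LogisticRegression(max_iter=1000),
                  RandomForestClassifier(n_estimators=200),
                  GradientBoostingClassifier()):
        model.fit(X_train, y_train)
        auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
        print(type(model).__name__, round(auc, 3))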
Scala Developer
Walmart, Bentonville, AR., Aug 2018 - Jan 2019
• Contributed to the Bookkeeping project, responsible for maintaining and enhancing Kafka pipelines that process data from Point of Sale (POS) systems to the General Ledger.
• Developed Scala Akka Microservices utilizing Futures for handling successful and error messages after data transmission, ensuring data is written into Cassandra based on the outcome.
• Designed and implemented Scala Akka Microservices for DB2 connection retries in the event of failures, improving system resilience and minimizing downtime.
• Configured Typesafe Config to manage and tune the number of retry attempts for DB connections, enhancing the system's fault tolerance (the sketch after this role's technology list illustrates the pattern).
• Actively participated in Scrum meetings, collaborating on user story creation, technical story development, and retrospective analysis to optimize team processes and deliverables.
• Technologies: Scala, Play Framework, Microservices, Kafka, Cassandra, DB2
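The retry behavior above was implemented with Scala Akka and Typesafe Config; the Python sketch below only illustrates the configurable-retry pattern, with hypothetical names standing in for the config file:

    # Illustration of the configurable DB-retry pattern; the original was
    # Scala Akka with Typesafe Config. All names here are hypothetical.
    import time

    RETRY_CONFIG = {"max_attempts": 5, "backoff_seconds": 2}  # stand-in config

    def with_retries(operation, cfg=RETRY_CONFIG):
        """Run `operation`, retrying on failure up to the configured limit."""
        for attempt in range(1, cfg["max_attempts"] + 1):
            try:
                return operation()
            except ConnectionError:
                if attempt == cfg["max_attempts"]:
                    raise                               # give up on final attempt
                time.sleep(cfg["backoff_seconds"] * attempt)  # linear backoff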
Data Engineer
Hayward Industries, NC, Jan 2018 - Jun 2018
• Contributed to this project for a swimming pool installation and maintenance company, serving approximately 10,000 clients across the United States.
• Developed analytical reports using Microsoft Power BI and SQL Server (hosted in Azure Cloud) to gain insights into the company's customer base and device installations, identifying key demographic trends and business development opportunities.
• Created Python modules utilizing NLTK (Natural Language Toolkit) and part-of-speech (POS) tagging to capture failure types from problem descriptions, enabling effective categorization of problem and device types for root-cause analysis (see the sketch after this role's technology list).
• Technologies: R-Studio, R, Python, NLTK, SQL Server, Azure Cloud, Microsoft Power BI
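A minimal sketch of the NLTK part-of-speech approach described above; the example sentence and the noun-only heuristic are illustrative:

    # Sketch of POS tagging to pull candidate failure/device terms out of
    # free-text problem descriptions; the heuristic here is illustrative.
    import nltk
    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    def failure_terms(description):
        """Return candidate failure/device terms: the nouns in the text."""
        tokens = nltk.word_tokenize(description)
        tagged = nltk.pos_tag(tokens)
        return [word for word, tag in tagged if tag.startswith("NN")]

    print(failure_terms("Pump motor overheats and the filter valve leaks"))
    # -> e.g. ['Pump', 'motor', 'filter', 'valve']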
Data Engineer
Barclays, Whippany, NJ, Nov 2016 - Jul 2017
• Contributed to the Comprehensive Capital Analysis and Review (CCAR), a regulatory framework for nine-quarter capital fund projections that identifies and mitigates risks associated with maintaining capital funds, supporting strategic decision-making for capital funding.
• Led the report generation process within Hive and Oracle databases, ensuring accurate reporting based on various account codes.
• Developed a Data Compare Tool using Scala and HashMap to compare records across various sources, including CSV, Excel, and Hive tables, ensuring data integrity and consistency (see the sketch after this role's technology list).
• Created Avro files for defining data structure and developed Hive tables using Avro files and a partition strategy for optimized storage and query performance.
• Designed and implemented ETL pipelines in Spark (Java/Scala) to load data from CSV files into Hive tables, ensuring the accuracy of data loads in alignment with source files.
• Built Scala-based tools to query Oracle databases and extract results into CSV/Excel formats for analyzing impacts in Moniker SQL, tailored to support business flow and upcoming data file changes.
• Developed tools in Scala to automate the verification of Fixed Form Reports against target expected reports, streamlining the validation process for complex matrices.
• Created R-based tools to generate SQL queries dynamically in real time, supporting functional reporting requirements and querying Hive/Oracle tables for comprehensive data analysis.
• Technologies: Cloudera, Hadoop, Unix, Spark, Scala, Hive, Oracle, R, Python, Java, Eclipse, IntelliJ IDEA, Agile, Maven, SBT, REST API, Postman, Bitbucket
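A minimal Python sketch of the keyed record-comparison approach described above (the original tool used Scala HashMaps); the file and key names are hypothetical:

    # Sketch of keyed record comparison across extracts; the original tool
    # was Scala with HashMap. File and key names are hypothetical.
    import csv

    def load_keyed(path, key_field):
        """Index rows of a CSV by a key column, mirroring the HashMap lookup."""
        with open(path, newline="") as fh:
            return {row[key_field]: row for row in csv.DictReader(fh)}

    source = load_keyed("source_extract.csv", "account_code")
    target = load_keyed("hive_extract.csv", "account_code")

    # Report keys missing on either side, then field-level mismatches.
    for key in source.keys() | target.keys():
        if key not in source or key not in target:
            print("missing:", key)
        elif source[key] != target[key]:
            print("mismatch:", key)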
DATA SCIENCE PROJECTS DEVELOPED AND POSTED IN THE PUBLIC DOMAIN
a. Authored and published a book on Generative AI and Prompt Engineering on Amazon
• Title: Crafting Effective Prompts: A Guide to Prompt Engineering.
• The book covers OpenAI, LLMs, Llama 2, LangChain, the Transformer model, vector stores, and HuggingFace, showing how to apply them through prompt-engineering principles and prompt-design techniques, with solved programming examples written in Python.
• GitHub URL for the source code examples discussed in the book: https://github.com/vsubu1/CraftingEffectivePrompts
b. Project: Twitter Analytix
• Description: This project performs real-time sentiment analysis of the 2016 US presidential nominees by reading tweets from Twitter in real time. It ingests tweets with Spark Streaming and Scala, performs the analysis in R to generate the sentiment analytics report, uses Cassandra, MongoDB, and AWS for data management, and displays the reports in an R-Shiny dashboard.
• Developed a real-time sentiment analysis tool that displays continuously updating line graphs of per-second positive and negative public sentiment for data captured from US locations, using Spark/Scala, Cassandra, R programming, and MongoDB hosted in mLab on AWS.
• Installed MongoDB in mLab, running on AWS, and created the database and relations.
• Developed R code to read Twitter data from Cassandra, perform sentiment analysis with R libraries, aggregate the results per second, and load them into MongoDB (a Python sketch of this aggregation follows this project's technology list).
• Developed an analytical application using R-Shiny, connected to an S3 instance, to read and display analytics from the ShinyApps.io server.
• Continuous line graphs of per-second positive and negative sentiment counts are generated for each candidate from data ingested by Spark Streaming into Cassandra, processed by R for sentiment analysis, and uploaded to MongoDB in the cloud.
• URL1: Real Time Sentiment Analysis on US Presidential Candidates - 2016
• URL2: YouTube Presentation on Real Time Sentiment Analysis on US Presidential Candidates - 2016
• Technologies: Spark/Scala, R Programming, Cassandra, MongoDB installed in the AWS cloud, R-Shiny Dashboard, Sentiment Analysis, Machine Learning, NLP, PyCharm
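A minimal Python sketch of the per-second sentiment aggregation described above (the original was implemented in R); the file, field, and collection names are hypothetical:

    # Sketch of per-second sentiment aggregation into MongoDB; all names
    # here are hypothetical placeholders for the original R workflow.
    import pandas as pd
    from pymongo import MongoClient

    # Assumed input: one row per tweet with a timestamp and sentiment score.
    tweets = pd.read_csv("scored_tweets.csv", parse_dates=["created_at"])
    tweets["positive"] = tweets["score"] > 0

    # Aggregate positive/negative counts for each second, as in the dashboard.
    grouped = tweets.set_index("created_at").resample("1s")["positive"]
    per_second = (grouped.agg(["sum", "count"])
                  .rename(columns={"sum": "pos", "count": "total"}))
    per_second["neg"] = per_second["total"] - per_second["pos"]

    client = MongoClient("mongodb://user:pass@host:27017")  # assumed mLab URI
    client["twitter_analytix"]["per_second"].insert_many(
        per_second.reset_index().to_dict("records"))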
c. Project: Presidential Job Approval Rating Analysis through Social Media
• Description: Presidential approval ratings help measure re-election chances, predict the midterm performance of the party in power, and gauge public approval of the administration's agenda and performance. The project captures public sentiment from Twitter and compares it against the results of major polling organizations such as Gallup, Rasmussen Reports, Fox News, NBC News, and Investor's Business Daily (IBD/TIPP) to see whether Twitter sentiment correlates with scientific polling.
• Developed Spark/Scala code to download tweets from Twitter and save them into Cassandra.
• Developed training and test sets for sentiment analysis using a Naïve Bayes classifier.
• Developed Python code implementing a Naïve Bayes classifier for sentiment analysis of tweets (see the sketch following this project's technology list).
• Data from polling companies was downloaded manually, and correlation graphs were generated.
• Developed R code with the R-Shiny dashboard to display the analytical graphs.
• The Twitter sentiment results are normalized to a 1-100% scale to match the approval-rating range.
• URL: Presidential Job Approval Rating Analysis Through Social Media
• Technologies: Spark/Scala, Python, Sentiment Analysis, Machine Learning, NLP, Naïve Bayes Classifier, PyCharm, TensorFlow/Keras, Scikit-Learn, PyTorch, DNNs, GANs, GNNs, and Time Series
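A minimal scikit-learn sketch of the Python Naïve Bayes tweet classifier described above; the tiny inline training set is illustrative only:

    # Sketch of a Naive Bayes tweet-sentiment classifier; the inline
    # training data is illustrative, not the project's corpus.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train_tweets = ["great job on the economy", "terrible policy decision",
                    "love this speech", "worst press conference ever"]
    train_labels = ["positive", "negative", "positive", "negative"]

    # Bag-of-words features feeding a multinomial Naive Bayes model.
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(train_tweets, train_labels)

    print(model.predict(["what a great speech"]))  # expected: ['positive']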
d. Project: Traffic Monitoring Analytix
• Description: This project studies the impact and root causes of traffic congestion in five selected US cities. It examines the reasons for high and low traffic by reviewing population, population density, average household income, and commute time to work, framing a set of hypotheses and testing them against actual results.
• Downloaded tweets for the five selected cities to capture public sentiment.
• Developed R code for Twitter sentiment analysis and the R-Shiny dashboard, and loaded sentiment results into MongoDB.
• Generated correlation graphs, location-based graphs using R-Leaflet, and Twitter sentiment analysis reports.
• Performed sentiment analysis on traffic-related tweets and grouped them into emotional categories.
• Installed MongoDB in mLab, hosted on AWS, and created the database and relations.
• Developed an analytical application using R-Shiny, connected to an S3 instance, to read and display analytics from the ShinyApps.io server.
• URL: https://harpanalytics.shinyapps.io/TrafixAnalytix/
• Environment: R-Programming, R-Shiny Dashboard, Mongo DB installed in AWS cloud, Sentiment Analysis
e. Predicting disease through Machine learning models
• Description: This project presents a data-driven approach to identifying the factors responsible for causing disease. By analyzing independent attributes such as weight, height, cholesterol level, glucose level, smoking, and alcohol use against a dependent attribute, disease, we aim to find the relationship between the disease and the remaining attributes.
• The feature engineering process involves data cleansing, data exploration, categorical variable creation, and removal of unknown/null values. Heat-map generation helps identify the attributes associated with a higher probability of death. After applying machine learning algorithms, we evaluate the confusion matrix and accuracy score.
• Prediction: We used Gaussian Naïve Bayes, Gradient Boosting, K-Nearest Neighbors, and Random Forest models (see the sketch following this project's URL).
• The major factors driving the disease are age, height, weight, and high and low blood pressure, as they have high importance scores; this data-driven approach helps plan how to control these factors in daily routines.
• URL: Predicting Disease with machine learning models
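A minimal scikit-learn sketch of the workflow described above; the CSV and column names are hypothetical stand-ins for the project's dataset:

    # Sketch of the disease-prediction workflow; the file and column names
    # are hypothetical placeholders for the project's dataset.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, confusion_matrix

    df = pd.read_csv("health_records.csv").dropna()  # remove unknown/null values
    X = df[["age", "height", "weight", "ap_hi", "ap_lo", "cholesterol", "gluc"]]
    y = df["disease"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    model = RandomForestClassifier(n_estimators=200).fit(X_train, y_train)
    pred = model.predict(X_test)
    print(confusion_matrix(y_test, pred))
    print("accuracy:", accuracy_score(y_test, pred))
    # Feature importances surface the dominant risk factors noted above.
    print(sorted(zip(model.feature_importances_, X.columns), reverse=True))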
f. Predicting COVID-19 deaths through machine learning models
• Description: This project begins with feature engineering: data cleansing, exploration of all attributes against the number of deaths, categorical variable creation, removal of unknown/null values, and heat-map generation to identify the attributes that carry a higher probability of death. After applying machine learning algorithms, we evaluate the confusion matrix and accuracy score.
• Prediction: We used Naïve Bayes, Logistic Regression, and Random Forest models, of which Random Forest produced the highest accuracy score, 94.02%, in predicting patient deaths for the given set of attributes.
• URL: Predicting COVID-19 deaths with machine learning models
g. Deep learning
• Developed and deployed machine learning models using TensorFlow and PyTorch frameworks, ensuring optimal performance and scalability.
• Applied deep learning concepts such as CNNs, RNNs, and LSTMs to solve complex real-world problems in various domains.
• Utilized expertise in computer vision, system architecture, and algorithm design to enhance project outcomes and streamline processes (see the sketch below).
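A minimal Keras sketch of one of the LSTM architectures mentioned above; the shapes and layer sizes are illustrative, not taken from a specific project:

    # Sketch of a small LSTM sequence classifier (e.g. for sentiment or
    # time series); vocabulary size and layer widths are illustrative.
    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Embedding(input_dim=10000, output_dim=64),
        layers.LSTM(64),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    model.summary()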
OTHER PROFESSIONAL EXPERIENCE
• Sr. Consultant Zenosys Consulting - NJ 08/2015 - 12/2015
• Sr. Consultant Starwood Hotels - USA 04/2015 - 08/2015
• Expert Systems Analyst Allscripts Healthcare LLC - NY 12/2012 - 05/2014
• Sr. Test Analyst Encore Software Services - CA 10/2011 - 10/2012
• Senior Consultant Capco - New York City, NY 09/2010 - 09/2011
• Test Automation Engineer Geeksoft LLC - NJ 07/2010 - 08/2010
• Test Automation Consultant Credit Suisse - New York City, NY 06/2009 - 07/2010
• Test Automation Engineer JetBlue Airways 09/2008 - 05/2009
• Test Engineer Wachovia - Charlotte, NC 05/2008 - 09/2008
• Assistant Vice President Merrill Lynch India and US 07/2000 - 02/2008 (worked in India and the USA for Merrill Lynch through various organizations)
• Senior Software Professional India Comnet Intl. (P) Ltd - India 06/1999 - 06/2000
• Member Technical Staff India Comnet Intl. (P) Ltd - India 04/1997 - 12/1998
• Systems Analyst Railway Products (India) Ltd - Hosur, India 06/1994 - 07/1996
EDUCATION
Master of Science: Analytics
Harrisburg University of Science and Technology, Harrisburg, USA
Master of Computer Applications
Annamalai University, India
Bachelor of Science: Physics
Bharathidasan University, India
WEBSITE, PORTFOLIO AND PROFILES
• LinkedIn Profile
• GitHub Projects
CERTIFICATIONS
• AWS Certified AI ML Practitioner, valid till 12/31/27
• AWS Certified Cloud Practitioner, valid till 01/31/28