Tanmay Singh
480-***-**** *******************@*****.*** LinkedIn GitHub
Professional Summary
Data Engineer and Data Analyst with over 5 years of experience designing, building, and optimizing scalable data solutions, proficient in SQL, Python, AWS, GCP, data warehousing, data modeling, and ETL/ELT pipelines. Skilled in Big Data technologies, including the Hadoop and Spark frameworks, and associated tools such as HDFS and Apache Airflow. Experienced in optimizing Data Build Tool (DBT) projects in a Snowflake environment by implementing incremental models, leveraging partitioning, and tuning queries to reduce runtime and cloud data warehousing costs. Strong background in database design, data modeling, data cleansing, and ETL processes, with a deep understanding of both RDBMS (SQL Server, MySQL) and NoSQL (MongoDB, HBase) technologies for designing and implementing solutions that meet diverse data needs.
Experience
M&T Bank, USA Data Engineer Sep 2024 – Present
Migrated the data aggregation layer from legacy services to Snowflake using Data Build Tool (DBT) models, delivering up to 70% cost savings and improved query performance.
Spearheaded the development and maintenance of Kafka-based real-time data pipelines, improving data throughput and system uptime by 40%.
Created analytical Tableau dashboards that give business users actionable insights from data, enhancing decision-making.
Designed and implemented AWS Data Pipeline workflows to automate ETL, processing 5 TB+ of data daily, resulting in a 25% improvement in data processing efficiency and reduced manual intervention.
Leveraged PySpark to process 10 TB of data daily within the Apache Spark ecosystem, accelerating data processing by 25% and using DataFrames and Spark SQL to streamline data manipulation and analysis.
Improved ETL workflows through rigorous testing and debugging of SQL scripts and Python code, resulting in a 50% increase in data processing efficiency and ensuring seamless data integration with downstream systems.
Capgemini Pvt Ltd, India Data Engineer – HSBC UK (Application Data Management) Oct 2020 – May 2023
Engineered robust data pipelines using Python and PySpark for transformation, validation, and quality assurance, ensuring accurate computation of customer credit insights based on HSBC’s proprietary risk models.
Collaborated with cross-functional teams to optimize data storage and retrieval by refining BigQuery queries and indexing strategies, reducing batch pipeline processing time by 40%.
Led the design and development of scalable ETL pipelines using Apache Airflow for HSBC’s customer credit history application, ensuring reliable and timely data processing.
Managed Apache ZooKeeper clusters to coordinate distributed systems, reducing downtime by 30% and ensuring seamless synchronization across 100+ nodes in the data processing pipeline.
Created and deployed Apache Flink applications for processing high-volume streaming data, achieving a 25% improvement in processing speed compared to traditional batch processing methods.
Dixon Technology, India Data Analyst Jan 2019 – Sep 2020
Implemented data cleaning routines for large datasets using Python and SQL, working closely with data scientists and analysts to ensure data accuracy, leading to a 15% reduction in data cleaning time and allowing more time for advanced data analysis.
Performed data cleaning and wrangling with NumPy, Pandas, and datetime in Python to merge datasets, convert data types, remove duplicates and null values, and check for outliers.
Enhanced query performance on Amazon Redshift by 25% by applying Redshift-specific optimizations such as sort keys and distribution styles, resulting in significantly faster data analysis.
Directed the end-to-end development of interactive Power BI reports, employing advanced DAX and Power Query modeling to surface actionable insights and drive informed decision-making.
Analyzed A/B test results using statistical methods such as t-tests and confidence intervals to identify statistically significant differences, enabling data-driven decisions that improved conversion rates by 15%.
Instituted reusable SQL templates and interactive Excel reports, fostering a data-driven culture that increased independent data analysis and boosted team efficiency by 20%.
Skills
Programming Languages: Python, SQL, Spark SQL
Big Data Ecosystem: Apache Spark, Apache Kafka, Apache Nessie, Apache Flink, Hadoop, Hive, HDFS, ZooKeeper
Cloud: AWS (EC2, S3, Lambda, Glue, Athena, Kinesis, Redshift), GCP (Google BigQuery), Azure
Packages: NumPy, Pandas, Matplotlib, Seaborn, PySpark
ML & Statistical Methods: Predictive Modeling, Decision Trees, Clustering, Regression
Visualizations: Tableau, Power BI, Excel
ETL and Tools: SSIS, Informatica PowerCenter, Data Pipelines, Data Build Tool (DBT), Apache Airflow, Jenkins
Version Control & Database: GitHub, Git, SQL Server, PostgreSQL, DynamoDB, MySQL, Snowflake
Data Skills: Data Warehousing, Data Mining, Data Manipulation, Statistical Analysis, Data Modeling, Data Processing
Education
Master of Science in Data Science, Analytics, and Engineering Arizona State University (GPA: 3.4) May 2025
Bachelor of Technology in Computer Science and Engineering Manipal University, Jaipur, India (CGPA: 7.3) May 2020