Yashwanth Reddy
**.************@*****.***
Data Engineer
SUMMARY:
Around 6 years of industry experience as a Data Engineer with a solid understanding of Data Analysis, evaluating data sources, and a strong foundation in Data Warehouse/Data Mart design, ETL, BI, OLAP, and Client/Server applications.
Extensive experience in Data Analysis, Migration, Cleansing, Transformation, Integration, and Data Import/Export using ETL tools such as Informatica PowerCenter. Skilled in testing and writing SQL statements, including Stored Procedures, Functions, Triggers, and Packages.
Proficient in all aspects of Data Analysis, Data Modeling, Development, and Visualization for various business applications across multiple database platforms.
Proficient in AWS, Snowflake, Teradata, SQL Server, and Oracle.
Experienced in Data Analysis focusing on business processes and tools using Excel, Teradata SQL Assistant, Query Analyzer, and SQL Developer.
Extensive experience with cloud technologies across Azure and AWS; proficient in Azure Data Lake Storage, Azure Data Factory, Azure SQL, Azure Data Warehouse, Azure Synapse Analytics, Azure HDInsight, Databricks, AWS EMR, Redshift, S3, EC2, and additional AWS services such as Glue, Athena, IAM, Step Functions, and RDS.
Involved in all phases of the SDLC, including Analysis, Development, Testing, Production, Support, and User Training.
Experienced in Data Remediation, Data Profiling, Data Cleansing, and Data Validation.
Thorough understanding of Data Warehouse/Datamart/Database modeling concepts, including Star and Snowflake schemas.
Extensive experience with MS Office (Word, Excel, Access, PowerPoint), MS Visio, and Office 365.
Involved in troubleshooting, performance tuning of reports, and resolving issues within Tableau Server and Tableau Desktop. Expertise in the installation, configuration, and administration of Tableau Server in multi-server and multi-tier environments.
Advanced Excel skills, including Functions, Charts, Pivot Tables, Data Validation, and importing/exporting data with other databases and applications.
Experienced in the Software Development Lifecycle (SDLC) using SCRUM and Agile methodologies.
Technical Skills:
Big Data Ecosystem: HDFS, YARN, MapReduce, Spark, Kafka, Kafka Connect, Hive, Airflow, StreamSets, Sqoop, HBase, Flume, Pig, Ambari, Oozie, ZooKeeper, NiFi, Sentry
Hadoop Distributions: Apache Hadoop 2.x/1.x, Cloudera CDP, Hortonworks HDP, ADW
Cloud Environment: Amazon Web Services (AWS), Microsoft Azure, GCP
AWS: EC2, ELB, VPC, RDS, AMI, IAM, EMR, S3, Redshift, Lambda, Kinesis, Glue, CloudWatch, Terraform, Amazon Aurora, SNS, SQS, EBS, Route 53, AWS Management Console, AWS CLI
Microsoft Azure: Databricks, Data Lake, Blob Storage, Azure Data Factory, Azure Synapse, SQL Database, SQL Data Warehouse, Cosmos DB, Azure Active Directory
Databases: MySQL, Oracle, Teradata, MS SQL Server, PostgreSQL, DB2, Snowflake
NoSQL Databases: Cassandra, MongoDB, DynamoDB, Cosmos DB
Operating Systems: Linux, Unix, Windows, macOS
Software/Tools: Microsoft Excel, Eclipse, Shell Scripting, ArcGIS, Linux, PyCharm, Vi/Vim, Sublime Text, Visual Studio, Postman
Reporting Tools/ETL Tools: Informatica, Salesforce, Talend, SSIS, SSRS, SSAS, ER Studio, Tableau, Power BI, Arcadia, Veeva CRM, DataStage, Pentaho, Ignio AIOps
Programming Languages: Python (pandas, SciPy, NumPy, scikit-learn, statsmodels, Matplotlib, Plotly, Seaborn, Keras), PySpark, Bash, T-SQL/SQL, PL/SQL, C, C++, Java, J2EE, JavaScript, HiveQL, Scala, UNIX Shell Scripting, HTML, CSS, JSON, and XML
Version Control: Git, SVN, Bitbucket
Development Tools: Eclipse, NetBeans, IntelliJ, Hue, Microsoft Office Suite (Word, Excel, PowerPoint, Access)
Citi Bank, Plano, TX Aug 2022 – Current
Sr. Data Engineer
Responsibilities:
Developed and fine-tuned Spark applications using PySpark and Spark-SQL to handle data extraction, transformation, and aggregation from different file formats, improving data integration and analytics capabilities (see the illustrative sketch at the end of this section).
Analyzed and documented the flow of data through systems, ensuring accuracy from source to destination.
Developed comprehensive data flow documentation and updated data mappings as API and system requirements evolved.
Managed real-time data integration from AWS S3 and Kafka into HDFS using PySpark, ensuring data quality and governance.
Created data models and Power BI reports, providing stakeholders with insights into market trends and performance metrics.
Resolved data discrepancies by collaborating with upstream system owners and negotiating conflicts in data definitions.
Conducted data validation and quality checks using Snowflake and custom scripts, maintaining compliance and data standards.
Managed large-scale data migration projects, transferring data from on-premises servers to Snowflake. This involved designing ETL pipelines using Python, Spark, and Informatica to ensure efficient and accurate data transfer.
Integrated external government datasets with Citi Bank’s client data, performing data enrichment to fill gaps in records and enhance the overall quality and usability of the data.
Developed PySpark applications for data extraction, transformation, and aggregation from various formats, optimizing data processing and improving analytical performance.
Actively maintained and updated comprehensive documentation for data flows, mapping API structures accurately and adjusting as requirements evolved.
Built dynamic Power BI dashboards to present data stored in Snowflake, providing stakeholders with clear, actionable insights and supporting data-driven decision-making.
Conducted data validation and quality checks to uphold governance standards, resolving discrepancies promptly and maintaining data accuracy.
Leveraged real-time data streaming technologies, such as Spark Streaming and Kafka, to maintain an up-to-date data model in HDFS and improve responsiveness.
Applied financial transaction management expertise to Oracle EBS General Ledger, overseeing bank accounts, transaction records, and financial reports.
Collaborated with analysts and stakeholders to outline data needs, translating these into effective Snowflake schemas and DBT models for accurate, accessible data.
Integrated machine learning models into real-time data processing workflows using Apache Kafka and Spark Streaming.
Designed and implemented data ingestion pipelines to integrate Salesforce Data Cloud with Snowflake for customer data processing and segmentation.
Developed real-time ETL workflows in Azure Data Factory and Informatica to synchronize Salesforce Marketing Cloud with structured and unstructured financial transaction data.
Utilized Salesforce API and Kafka to stream customer insights from Salesforce Data Cloud into a centralized data warehouse for predictive analytics.
Created dynamic, well-organized, accurate, and insightful Power BI reports and dashboards to support strategic decisions.
Conducted thorough data quality checks and validations in Snowflake with SQL and custom scripts to maintain data integrity and compliance.
Utilized Jira and Jenkins for project management and CI/CD, ensuring adherence to best practices in data governance and cataloging.
Developed intricate ETL pipelines in DataStage, overseeing job scheduling and monitoring with Airflow to improve data transformation and loading efficiency.
Environment: Python, HDFS, Hadoop, Hive, Tableau, Snowflake, AWS, Cassandra, Jira, MS Excel, MS PowerPoint.
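A minimal, illustrative PySpark sketch of the extract-transform-aggregate pattern described above; this is not project code, and the bucket, paths, and column names (txn_id, account_id, amount) are hypothetical placeholders.

```python
# Illustrative sketch only: a minimal PySpark extract-transform-aggregate job.
# All paths and column names below are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("txn_aggregation_sketch").getOrCreate()

# Extract: read raw transaction files (CSV shown; JSON/Parquet follow the same pattern)
raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("s3a://example-bucket/raw/transactions/"))   # hypothetical path

# Transform: basic cleansing and typing
clean = (raw
         .dropDuplicates(["txn_id"])                      # hypothetical key column
         .withColumn("txn_date", F.to_date("txn_date"))
         .filter(F.col("amount").isNotNull()))

# Aggregate: daily totals per account
daily = (clean.groupBy("account_id", "txn_date")
         .agg(F.sum("amount").alias("daily_amount"),
              F.count("*").alias("txn_count")))

# Load: write the aggregate to HDFS as Parquet for downstream consumers
daily.write.mode("overwrite").parquet("hdfs:///data/curated/daily_txn/")
```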
Aganitha, Hyderabad Jan 2020 – Aug 2021
Data Engineer
Responsibilities:
Streamlined interactions with PostgreSQL databases through ORM layers, simplifying CRUD operations via Python for easier data management and API endpoint creation.
Built data integration pipelines in Azure Databricks and optimized data workflows for healthcare analytics.
Automated ETL processes and applied data transformation techniques, enhancing data quality for business intelligence solutions.
Leveraged Python’s asynchronous capabilities for efficient handling of concurrent requests and automatically generated API documentation to enhance developer usability.
Managed PostgreSQL as the main data storage for various data types, including banking operations and customer transactions, with strong SQL querying capabilities.
Developed complex distributed systems for data handling, analytics, and pipeline construction, ensuring high performance and scalability.
Created interactive reports and dashboards in Power BI to visualize and analyze healthcare data trends, supporting data-driven decision-making.
Expertly created Databricks notebooks to extract data from sources like DB2 and Teradata, performing data cleansing, wrangling, and loading into Azure SQL DB (see the illustrative sketch at the end of this section).
Worked closely with the DBA team to manage, optimize, and extend the data environment, executing complex SQL queries and ensuring the performance of PostgreSQL databases.
Developed data-driven applications and streamlined CRUD operations using Python, improving data management processes and the usability of APIs.
Built Databricks notebooks for data extraction, cleansing, and loading into Azure SQL DB, automating tasks to improve efficiency and ensure data integrity.
Created Power BI dashboards for analyzing healthcare data trends, providing actionable insights to support key business decisions.
Built Azure Databricks pipelines to process and transform Salesforce CRM and Salesforce Marketing Cloud data for a healthcare analytics platform.
Integrated Salesforce Data Cloud with PostgreSQL databases, ensuring real-time updates for patient engagement and outreach programs.
Developed custom ETL workflows in SSIS and Snowpipe to load Salesforce marketing data into Snowflake for in-depth analytics.
Led efforts in ETL operations, implementing data validation, cleansing processes, and ensuring seamless data flows into Snowflake using SSIS and Snowpipe.
Participated in extensive data profiling and cleansing tasks, leveraging data quality tools to ensure high levels of accuracy and reliability.
Environment: Azure SQL Database, Azure Database Migration Service (DMS), Oracle EBS, Snowflake, AWS, Windows XP, UNIX, SQL, Python, MS Office, Tableau.
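A minimal, illustrative sketch of the Databricks-style cleanse-and-load pattern described above (relational source to Azure SQL DB over JDBC); hostnames, credentials, and table/column names are hypothetical placeholders, not actual project values.

```python
# Illustrative sketch only: JDBC extract, cleanse, and load from a notebook.
# Hostnames, credentials, and table/column names are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("claims_load_sketch").getOrCreate()

# Extract: read a source table over JDBC (DB2 shown; Teradata follows the same pattern)
src = (spark.read.format("jdbc")
       .option("url", "jdbc:db2://example-host:50000/CLAIMSDB")   # hypothetical
       .option("dbtable", "STAGE.CLAIMS")
       .option("user", "etl_user")
       .option("password", "***")
       .load())

# Cleanse/wrangle: trim strings, standardize dates, drop obviously bad rows
clean = (src
         .withColumn("member_id", F.trim("member_id"))
         .withColumn("service_date", F.to_date("service_date"))
         .filter(F.col("claim_amount") > 0))

# Load: write the cleansed data to Azure SQL Database over JDBC
(clean.write.format("jdbc")
 .option("url", "jdbc:sqlserver://example.database.windows.net:1433;database=analytics")
 .option("dbtable", "dbo.claims_clean")
 .option("user", "loader")
 .option("password", "***")
 .mode("append")
 .save())
```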
Mindtree, Hyderabad Jan 2018 – Dec 2019
Jr Data Engineer
Responsibilities:
Created a data pipeline using various AWS services, including S3, Kinesis Data Firehose, Kinesis Data Streams, SNS, SQS, Athena, and Snowflake.
Integrated AWS CloudTrail with AWS CloudWatch, creating custom alarms and notifications for critical events, allowing for real-time response to potential issues and ensuring the integrity of data processing pipelines.
Successfully utilized React for building interactive user interfaces and leveraging Node.js for server-side development, API integrations, and real-time data processing.
Worked on Python libraries and frameworks such as Flask, Django, and NumPy, enabling rapid development of web applications, data analysis, and scientific computing tasks.
Converted SQL Server stored procedures to Redshift PostgreSQL and embedded them in a Python pandas framework.
Implemented Continuous Integration/Continuous Deployment (CI/CD) pipelines using tools like Jenkins or AWS CodePipeline to automate the deployment process of Python applications on AWS.
Integrated more than 20 data sources and developed Python modules to handle different file formats such as TXT, CSV, Excel, HTML, JSON, and XML (see the illustrative sketch at the end of this section).
Generated custom SQL queries and Hive queries based on requirements.
Managed JIRA dashboards for team projects, including Sprint reports, daily burndown reports, velocity charts, tracking, email communication to senior management on daily accomplishments, and Epic completion status for releases.
Collaborated with Chairside QA to identify test scenarios for data consumption, order consumption, order schedules, and results; performed data mapping, creating Data Mapping documents in Excel.
Worked with both structured and unstructured data from multiple sources and automated the cleaning process using Python scripts; improved fraud prediction performance by using random forest and gradient boosting for feature selection and implemented a logistic regression model, all with Python's scikit-learn.
Used Python to preprocess data and uncover insights, prepared Python and shell scripts to automate administrative tasks, and created Data Mapping documents to capture the source-to-target data flow and any transformation and business logic involved.
Built pivot tables and charts in Excel for daily, weekly, and monthly reports; communicated with customers to establish business, data, and systems requirements; created queries against the central data repository for reporting and for quantitative and qualitative data analysis; and presented the final analysis to customers.
Environment: Python (scikit-learn, pandas, NumPy), Excel, Windows XP, SQL Server, UNIX, Windows, SQL, MS Office, Tableau 9.1.
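A minimal, illustrative sketch of a multi-format loader like the Python modules described above; the helper name load_file and the example path are hypothetical, not taken from project code.

```python
# Illustrative sketch only: load mixed file formats into pandas by extension.
# File paths and the helper name are hypothetical placeholders.
import json
from pathlib import Path

import pandas as pd

def load_file(path: str) -> pd.DataFrame:
    """Load TXT/CSV/Excel/HTML/JSON/XML into a DataFrame based on file extension."""
    suffix = Path(path).suffix.lower()
    if suffix in {".txt", ".csv"}:
        return pd.read_csv(path)              # delimited text
    if suffix in {".xls", ".xlsx"}:
        return pd.read_excel(path)
    if suffix in {".html", ".htm"}:
        return pd.read_html(path)[0]          # first table on the page
    if suffix == ".json":
        with open(path) as f:
            return pd.json_normalize(json.load(f))
    if suffix == ".xml":
        return pd.read_xml(path)
    raise ValueError(f"Unsupported file format: {suffix}")

# Usage (hypothetical path):
# df = load_file("data/source_feeds/transactions.json")
```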
CERTIFICATIONS: SnowPro Core Certification.
Education
Master of Science in Data Science
Lewis University, Romeoville, IL
August 2021 – May 2023.