Krishna Chaitanya Vyda
Sr Data Engineer
Phone: 901-***-****
Email: ***********************@*****.***
Summary:
•Data Engineer with over 7 years of experience across diverse industries including telecommunications, healthcare, banking, and general IT sectors.
•Specializes in architecting and implementing robust data solutions tailored for enterprise-scale applications.
•Expertise in Databricks and Unity Catalog, from managing datasets and user access controls to building end-to-end data pipelines and scheduling jobs.
•Proficient in cluster and job management in Databricks; used Databricks for ETL, end-to-end pipeline development, and CI/CD services.
•Expertise includes designing and optimizing complex Data Lake architectures, Data Warehousing solutions, and ETL (Extract, Transform, Load) pipelines.
•Skilled in building scalable systems that support seamless integration and processing of large and varied datasets.
•Proficient with a wide range of cloud services across AWS and Azure, including AWS EC2, S3, RDS, Lambda, Redshift, Athena, EMR, Kinesis, and DynamoDB, as well as Azure Data Lake Storage, Azure SQL Data Warehouse, Azure Data Factory, and Azure Databricks for efficient data storage, processing, and orchestration.
•Experienced in designing end-to-end data pipelines on platforms like Databricks, ensuring efficient ETL processes for handling large and diverse datasets.
•Expertise extends to optimizing ETL/ELT workflows through advanced techniques like incremental data loading and data deduplication.
•Skilled in managing and automating data workflows using Apache Airflow for scheduling and orchestrating ETL processes on AWS; proficient in ensuring scalability, reliability, and performance of data processing workflows.
•Designed and implemented scalable data pipelines using Scaler to process high-volume data, improving data ingestion performance by 30%.
•Implemented robust data validation and quality checks using Great Expectations, custom scripts, or unit testing frameworks in Python (a minimal validation sketch follows this summary).
•Specialization in Apache Spark, Scala, and Kafka for processing complex data tasks across distributed computing environments.
•Successfully implemented scalable solutions supporting real-time data streaming, batch processing, and machine learning model training.
•Extensive hands-on experience with Big Data Analytics tools including Hadoop MapReduce, HDFS, HBase, Zookeeper, Hive, Sqoop, Pig, Flume, Cassandra, Spark, Oozie, and Airflow.
•Proficient in installing, configuring, and optimizing these components to support data processing and analytics needs.
•Strong skills in system analysis, E-R and Dimensional Data Modeling, Database Design, and RDBMS implementation.
•Developed ETL processes to extract data from APIs and web sources using libraries such as Requests and Scrapy.
•Expert in designing efficient data models that cater to both relational and NoSQL databases, ensuring optimal performance and scalability.
•Proficient in managing and extracting actionable insights from diverse databases, employing CI/CD pipelines with tools like Azure DevOps and AWS CodePipeline.
•Ensures rigorous testing methodologies and frameworks are applied to maintain software quality and robustness throughout the SDLC, following Agile and Waterfall methodologies.
•Skilled in version control tools like Bitbucket and Git for managing software changes effectively. Experienced in enhancing team workflows and project outcomes through advanced branching and merging techniques, promoting collaboration and code quality management.
•Demonstrated ability in developing and deploying data processing applications with Docker and Kubernetes, optimizing deployment flexibility and scalability across hybrid cloud and on-premises environments.
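As referenced in the data-validation bullet above, the following is a minimal, illustrative Great Expectations check, assuming the classic pandas-dataset API (great_expectations.from_pandas); the DataFrame contents and column names are hypothetical.

    # Illustrative data-quality check with Great Expectations' classic pandas API.
    # Column names and sample values are hypothetical.
    import pandas as pd
    import great_expectations as ge

    df = pd.DataFrame({"customer_id": ["C1", "C2", None], "amount": [10.5, 20.0, 15.25]})

    gdf = ge.from_pandas(df)                                # wrap the frame as a validating dataset
    gdf.expect_column_values_to_not_be_null("customer_id")  # flag missing identifiers
    gdf.expect_column_values_to_be_between("amount", min_value=0, max_value=1_000_000)

    results = gdf.validate()
    print(results)  # overall success flag plus per-expectation results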
Technical Skills:
Programming Languages:
Python, Scala, Java, Spark
Cloud Services:
AWS Athena, AWS Glue, AWS SageMaker, ADF, Snowflake, Azure SQL, Data Warehouse, Databricks, Data Pipelines, CI/CD, Docker, Kubernetes, Redshift, EC2, S3, Lambda, EMR, CloudWatch, Data Lake, Airflow, Oozie, APIs, RSS, OLTP, OLAP, Unity Catalog.
ETL Tools:
Databricks, Informatica, Talend, Apache
Data Visualization:
SAS, Power BI, Tableau.
Databases:
Azure SQL, Oracle SQL, AWS RDS, MongoDB, Cassandra
Hadoop/Bigdata:
HDFS, MapReduce, Sqoop, Hive, PIG, HBASE, Zookeeper, FLUME, AWS, Cloudera
Operating System:
Windows, Linux, Unix, macOS
Orchestration:
MLflow, Apache Airflow, Dataflow
Data/Stream Processing:
Apache Storm, Apache Kafka, Apache Spark, Sqoop.
Messaging Technologies:
Kafka Cluster
EDUCATION:
Masters in
Professional Experience:
Client: Truist Financial, Charlotte, North Carolina July 2023 - Present
Role: Sr. AWS Data Engineer
Project Overview: Fraud Detection
The project focused on leveraging machine learning to detect and prevent SIM swap and port-out fraud in telecommunications at Verizon. By analyzing customer-agent interaction data, we built models using techniques like Random Forest and XGBoost to identify fraudulent behavior and generate fraud scores for review. We implemented a real-time monitoring system with AWS SageMaker and Lambda, ensuring continuous model updates to effectively combat fraud and reduce financial losses.
Responsibilities:
Designed and implemented a robust Data Lake infrastructure on AWS Cloud, leveraging services like S3, EMR, Redshift, Athena, Glue, EC2, and DynamoDB. Supported diverse data tasks including analysis, processing, storage, and reporting for data related to customer usage patterns, billing insights, and network reliability metrics.
Scheduled Databricks jobs to ensure proper data flow and implemented triggers. Designed clusters based on defined policies and managed separate clusters for different pipelines.
Created Kafka clusters integrated with Databricks on AWS to build streaming data pipelines from source to destination, ensuring reliable data flow.
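A minimal sketch of such a Kafka-to-Databricks streaming pipeline using PySpark Structured Streaming; the broker address, topic name, schema, and output paths are illustrative placeholders, and the Kafka connector is assumed to be available on the cluster.

    # Sketch: read a Kafka topic and append the parsed events to a Delta table.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("kafka_stream_sketch").getOrCreate()

    # Hypothetical schema for customer-interaction events.
    event_schema = StructType([
        StructField("customer_id", StringType()),
        StructField("event_type", StringType()),
        StructField("event_ts", TimestampType()),
    ])

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
           .option("subscribe", "customer-events")             # placeholder topic
           .option("startingOffsets", "latest")
           .load())

    events = (raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
              .select("e.*"))

    (events.writeStream
           .format("delta")
           .option("checkpointLocation", "/mnt/checkpoints/customer_events")  # placeholder
           .outputMode("append")
           .start("/mnt/delta/customer_events"))                              # placeholder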
Performed ETL using PySpark, developing Spark code and running Spark jobs in Databricks.
Led the migration of raw telecom data related to customer usage and billing to AWS S3, orchestrating refined data processing using AWS EMR. Ensured scalable handling of data to optimize customer service and operational efficiencies.
Leveraged Scaler's distributed architecture to handle large datasets, ensuring efficient parallel processing and reducing data processing time by 40%.
Containerized MLflow with Docker and published the images to Amazon Elastic Container Registry (ECR) for deployment.
Registered all datasets in Unity Catalog to ensure central management of databases.
Utilized Unity Catalog’s data lineage feature to trace how data flows and transforms through various stages of pipelines, providing transparency for debugging, audits, and impact analysis.
Implemented data quality checks or constraints at the dataset level to ensure high-quality data is available for downstream users.
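One way such dataset-level checks can be expressed on Databricks is through Delta table constraints; a minimal sketch, with placeholder catalog, schema, table, and column names:

    # Sketch: enforce basic quality rules directly on a Delta table.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # provided automatically on Databricks

    # Reject nulls in the key column and negative billed amounts (names are placeholders).
    spark.sql("ALTER TABLE main.billing.usage_events ALTER COLUMN customer_id SET NOT NULL")
    spark.sql("""
        ALTER TABLE main.billing.usage_events
        ADD CONSTRAINT non_negative_amount CHECK (billed_amount >= 0)
    """)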
Used MLflow to log key information about model runs, such as hyperparameters, metrics, and the final model artifact.
Used MLflow's Model Registry to register, manage, and version machine learning models in Databricks.
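A minimal MLflow tracking and registry sketch of the kind of logging described above; the experiment path, model name, and the scikit-learn classifier are illustrative stand-ins rather than the project's actual model.

    # Sketch: log hyperparameters, a metric, and the model, then register it.
    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=42)

    mlflow.set_experiment("/Shared/fraud-detection-sketch")  # placeholder experiment path

    with mlflow.start_run() as run:
        params = {"n_estimators": 100, "max_depth": 6}
        model = RandomForestClassifier(**params).fit(X, y)

        mlflow.log_params(params)                            # hyperparameters
        mlflow.log_metric("train_accuracy", model.score(X, y))
        mlflow.sklearn.log_model(model, artifact_path="model")

    # Register the logged model under a placeholder name in the Model Registry.
    mlflow.register_model(f"runs:/{run.info.run_id}/model", "fraud_score_model")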
Utilized serverless architecture with API Gateway, Lambda, and DynamoDB for real-time processing of telecom customer requests and network reliability events. Deployed AWS Lambda functions triggered by data events from S3, enhancing operational agility and responsiveness.
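A sketch of an S3-triggered Lambda handler of the sort described above, recording event metadata in DynamoDB; the table name and item attributes are hypothetical.

    # Sketch: S3 event -> Lambda -> DynamoDB item per uploaded object.
    import json
    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("customer_requests")  # placeholder table name

    def lambda_handler(event, context):
        records = event.get("Records", [])
        for record in records:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            table.put_item(Item={
                "request_id": key,       # placeholder partition key
                "source_bucket": bucket,
                "status": "RECEIVED",
            })
        return {"statusCode": 200, "body": json.dumps({"processed": len(records)})}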
Created optimized external tables with partitions using Hive, AWS Athena, and Redshift to efficiently manage and query telecom usage and billing data. Facilitated quick access to critical customer and network insights.
Built and maintained Python-based ETL pipelines that ingest data from multiple sources (e.g., APIs, S3 buckets) into the data warehouse. Used Pandas and PySpark for data transformation and validation before loading into Redshift.
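A minimal sketch of one such Python ingestion step: pull JSON from an API, validate and reshape it with Pandas, and stage it to S3 ahead of a Redshift COPY. The endpoint, bucket, and column names are placeholders.

    # Sketch: API -> Pandas validation/transformation -> S3 staging.
    import io

    import boto3
    import pandas as pd
    import requests

    API_URL = "https://api.example.com/usage"  # placeholder endpoint
    BUCKET = "telecom-staging-bucket"          # placeholder bucket

    resp = requests.get(API_URL, timeout=30)
    resp.raise_for_status()
    df = pd.DataFrame(resp.json())

    # Basic validation and cleanup before staging.
    df = df.dropna(subset=["customer_id"])
    df["usage_gb"] = pd.to_numeric(df["usage_gb"], errors="coerce").fillna(0)

    buf = io.StringIO()
    df.to_csv(buf, index=False)
    boto3.client("s3").put_object(Bucket=BUCKET, Key="staging/usage.csv", Body=buf.getvalue())
    # A Redshift COPY from the staged object would typically follow this step.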
Provided technical support during the setup, deployment, and troubleshooting of telecom-specific data pipelines related to customer interactions and billing processes. Ensured reliable data flow and system stability for customer service improvements and network monitoring.
Developed and optimized algorithms for analyzing usage patterns and optimizing service delivery based on billing data insights. Supported resource allocation strategies and operational efficiency improvements.
Designed and implemented ETL jobs using Databricks to extract and load data into Data Lake or Data Mart within Redshift. Enhanced data integration and availability for network optimization and customer experience enhancements.
Monitored Databricks pipelines and resolved performance issues and failures.
Responsible for cluster management, configuration, and optimization of existing clusters.
Developed a real-time data streaming application using Apache Kafka and Python, enabling near-instantaneous ingestion and processing of log data for analytics.
Implemented data visualization tools to illustrate customer usage patterns. Supported informed decision-making and strategic planning initiatives for customer service improvements.
Automated ETL workflows using AWS Lambda, S3, EMR, Glue, and Redshift for seamless telecom data processing and integration. Improved efficiency and real-time availability of customer insights for network management and analysis.
Developed custom input formats in Spark jobs to efficiently handle data formats related to customer usage, billing, and network reliability. Ensured accurate processing and actionable insights from diverse telecom data sources.
Designed robust data pipelines to integrate and process telecom customer data from structured and unstructured sources. Supported comprehensive analytics on customer interactions.
Implemented data governance frameworks to maintain data quality and integrity across distributed telecom databases like Cassandra and MongoDB. Ensured compliance with regulatory standards and customer data protection.
Implemented data processing with real-time systems using Amazon Kinesis, providing timely insights into customer interactions and network performance. Enabled proactive responses to service issues and customer requests.
Developed real-time streaming applications using PySpark, Apache Flink, Kafka, and Hadoop clusters for continuous monitoring and analysis of telecom customer interactions and network reliability. Supported rapid response to service disruptions and customer needs.
Automated build and deployment of ETL jobs using Python and Boto3 libraries, reducing manual effort and accelerating telecom data processing and analytics for customer service enhancements.
Organized and scheduled complex ETL pipelines using Apache Airflow, ensuring reliable and efficient data workflows for telecom customer usage analytics.
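An illustrative Airflow 2.x DAG skeleton for the kind of daily ETL schedule described above; the DAG ID, task callables, and retry settings are placeholders.

    # Sketch: three-step extract -> transform -> load DAG on a daily schedule.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        pass  # pull raw usage data from the source systems

    def transform():
        pass  # clean, deduplicate, and aggregate

    def load():
        pass  # write the curated output to the warehouse

    with DAG(
        dag_id="telecom_usage_etl_sketch",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
        default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        transform_task = PythonOperator(task_id="transform", python_callable=transform)
        load_task = PythonOperator(task_id="load", python_callable=load)
        extract_task >> transform_task >> load_task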
Utilized Apache Spark and Hadoop frameworks for efficient processing of extensive telecom datasets related to billing behaviors and network reliability. Enabled timely insights for service improvements and operational optimizations.
Designed serverless application CI/CD pipelines using Lambda application model, facilitating agile deployment of telecom data applications for enhanced customer service delivery.
Created and executed SQL scripts to validate customer data flow and ensure data consistency and accuracy in billing and network performance monitoring systems.
Managed database activities including indexing, performance tuning, and backup operations for telecom-specific databases. Ensured optimal performance and reliability of data systems supporting customer management and network operations.
Applied fact and dimension modeling, transactional modeling, and slowly changing dimension (SCD) techniques to telecom customer data. Supported comprehensive reporting and analytics for network performance metrics and customer service evaluations.
Leveraged tools like Informatica, Alteryx, and Airbyte for efficient telecom data processing, cleansing, and integration operations. Improved data quality management and operational efficiency across customer and billing data workflows.
Collaborated cross-functionally, managed application versions using Git with GitHub. Reviewed team-developed components and contributed to design documents and test cases for telecom data projects, ensuring alignment with customer service and operational requirements.
Closed defects raised by QA and managed release processes for telecom-specific modules. Ensured high software quality and reliability for telecom data applications and systems supporting customer service improvements.
Client: Ascension Health, St. Louis, Missouri February 2022 – June 2023
Role: AWS Data Engineer
Project Overview: "Real-Time Healthcare Insights and Monitoring"
In this project, patient data was processed in real-time to ensure quick access to essential information for healthcare providers. Operational monitoring allowed for continuous oversight of claims processing and system performance, ensuring smooth operations. Healthcare analytics focused on analyzing patient data to identify trends, predict health risks, and optimize treatment plans, ultimately improving patient care and decision-making efficiency.
Responsibilities:
Developed ETL processes using AWS Glue for seamless migration of data from S3 and various file formats to AWS Redshift. Ensured data integrity and performance optimization.
Leveraged AWS Glue catalog and crawlers to extract and organize data from S3. Conducted SQL operations efficiently using AWS Athena for data analysis.
Built an ETL framework with Spark and Python to process and load standardized data into Hive and HBase tables. Enhanced data processing efficiency and scalability.
Designed and implemented data transformation logic and feature extraction using PySpark in a distributed environment for a machine learning pipeline, using Databricks on AWS as the development environment.
Managed ingestion of structured and semi-structured data into HDFS using Sqoop and Spark jobs. Ensured efficient handling of large data volumes and diverse formats.
Automated execution of AWS Glue jobs using Lambda functions triggered by S3 events. Streamlined data processing workflows and improved operational efficiency.
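A sketch of such an S3-event-driven trigger: a Lambda handler that starts a Glue job for each new object. The Glue job name and argument keys are hypothetical.

    # Sketch: S3 event -> Lambda -> Glue job run per uploaded file.
    import boto3

    glue = boto3.client("glue")

    def lambda_handler(event, context):
        started = []
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            response = glue.start_job_run(
                JobName="claims_ingest_job",                          # placeholder Glue job
                Arguments={"--source_path": f"s3://{bucket}/{key}"},  # passed to the job script
            )
            started.append(response["JobRunId"])
        return {"started_runs": started}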
Implemented monitoring and alerting mechanisms for Lambda functions and Glue jobs with CloudWatch. Ensured proactive management and timely response to operational issues.
Wrote scalable PySpark code in distributed environments to handle diverse CSV file schemas. Improved data processing speed and flexibility.
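A small sketch of schema-tolerant CSV ingestion in PySpark, declaring an expected schema and dropping malformed rows; the path and columns are placeholders.

    # Sketch: read heterogeneous CSV files against an explicit schema.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import DoubleType, StringType, StructField, StructType

    spark = SparkSession.builder.appName("csv_schema_sketch").getOrCreate()

    expected = StructType([
        StructField("claim_id", StringType()),
        StructField("member_id", StringType()),
        StructField("amount", DoubleType()),
    ])

    claims = (spark.read
              .option("header", "true")
              .option("mode", "DROPMALFORMED")   # skip rows that do not fit the schema
              .schema(expected)
              .csv("s3://claims-landing/*.csv")) # placeholder path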
Integrated Scaler with cloud platforms (AWS/Azure/GCP) to enable seamless data storage and processing.
Developed procedures and architecture for data pipelines in Databricks, and used Unity Catalog to set up governance policies ensuring sensitive data is handled according to regulatory standards such as GDPR and HIPAA.
Scheduled Databricks jobs to regularly gather data from source databases into Databricks on AWS.
Developed MapReduce scripts for data parsing and processing from multiple sources. Stored parsed data efficiently in HBase and Hive for analysis.
Engineered data pipelines with Flume, Sqoop, Pig, and MapReduce for ingesting customer behavioral data into HDFS. Supported data-driven insights and decision-making for the data team.
Implemented Spark applications with Scala and Spark SQL to enhance data testing and processing capabilities across diverse data sources. Improved performance and scalability.
Managed data workflows with Apache Airflow, scheduling and monitoring ETL processes. Oversaw Relational and NoSQL databases, focusing on design, schema optimization, performance tuning, and troubleshooting. Ensured robust data management and system efficiency.
Engineered scalable data storage solutions with distributed columnar databases such as Apache Druid and Amazon Redshift. Optimized query performance and resource utilization for analytics.
Created real-time data processing systems using Apache Kafka. Enabled low-latency data ingestion and analysis for critical applications like fraud detection.
Developed data models to organize and structure data efficiently for storage, retrieval, and analysis purposes. Enhanced data accessibility and usability across the organization.
Containerized data processing applications with Docker and Kubernetes. Improved deployment flexibility and scalability in CI/CD workflows across hybrid on-premises and cloud environments.
Utilized Git and Bitbucket for collaborative development, code management, and version control in data engineering projects. Ensured efficient collaboration and code quality management.
Client: Lumen, Broomfield, CO. Feb 2019 – Sep 2021
Role: Azure Data Engineer
Responsibilities:
Utilized Azure Data Factory to ingest data from various sources, both relational and unstructured. Ensured data integration met business requirements effectively.
Created batch and real-time data processing solutions with ADF and Databricks. Leveraged stream analytics to handle continuous data flows.
Used MLflow to track model training runs and to deploy machine learning models that identify trends in the data and communicate them to users.
Collaborated with cross-functional teams to define data architecture and governance strategies using Scaler, ensuring data integrity and security.
Used Unity Catalog for data integration and data pipeline development from source to target databases.
Developed and managed data pipelines to optimize ETL/ELT jobs from multiple data sources. Ensured seamless data flow from SQL databases, Azure services, and data lakes.
Managed and scheduled data workflows with Airflow. Created DAGs in Python to streamline task execution.
Implemented data processing solutions with Azure Stream Analytics, Event Hub, and Service Bus Queue. Set up custom alerts in Azure Data Factory for monitoring.
Used Azure Databricks notebooks for data transformation and applied complex business logic with Spark SQL to process data. Deployed MLflow in Docker containers on Azure Kubernetes Service (AKS).
Performed thorough unit testing on ETL components. Validated data accuracy and integrity through end-to-end system testing to eliminate inconsistency in data transfer from source to destination.
Established automated testing frameworks like ADF monitoring and management for data pipelines. Ensured data quality and performance with comprehensive testing during CI/CD processes.
Automated deployment and configuration of data pipelines with CI/CD tools. Ensured consistent delivery of workflows across different environments.
Engaged in UAT to identify and fix critical issues. Ensured the final solution met user expectations and requirements.
Developed PySpark scripts to read parquet files from Azure Data Lake. Transformed data and loaded it into Azure SQL databases.
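A minimal sketch of that parquet-to-Azure-SQL load; the storage account, JDBC endpoint, table, and credentials are placeholders, and the SQL Server JDBC driver is assumed to be installed on the cluster.

    # Sketch: read curated parquet from ADLS Gen2 and append it to an Azure SQL table.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("adls_to_sql_sketch").getOrCreate()

    source = "abfss://curated@examplelake.dfs.core.windows.net/billing/"  # placeholder path
    df = (spark.read.parquet(source)
          .withColumn("billed_amount", col("billed_amount").cast("decimal(18,2)")))

    jdbc_url = "jdbc:sqlserver://example-server.database.windows.net:1433;database=analytics"

    (df.write
       .format("jdbc")
       .option("url", jdbc_url)
       .option("dbtable", "dbo.billing_curated")   # placeholder target table
       .option("user", "etl_user")                 # in practice, pulled from a secret scope
       .option("password", "<secret>")
       .mode("append")
       .save())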
Applied advanced SQL optimization techniques to improve query performance. Reduced execution times and enhanced system efficiency.
Designed ETL packages to load data from Excel and RDBMS tables. Applied transformations to prepare data for staging tables.
Developed mapping documents to align source and target data fields. Enhanced data traceability and audit compliance through accurate transformation processes.
Created and deployed Power BI dashboards for real-time insights. Supported data-driven decisions with interactive visualizations.
Worked with stakeholders and cross-functional teams to align technical solutions.
Worked with teams to define data models for NoSQL databases. Adapted to changing business needs with flexible schema designs.
Client: Electronic Arts, Hyderabad, India Feb 2018 – Jan 2019
Role: Software Engineer.
Responsibilities:
Designed advanced SQL scripts and optimized indexes to enhance query performance and efficiency. Developed complex queries to extract meaningful insights from large datasets.
Conducted detailed data mapping and rigorous cleansing to ensure data accuracy. Developed programs to automate and verify the data migration process.
Collaborated with stakeholders to understand data needs and implemented mappings. Ensured seamless data integration across different systems.
Integrated multiple data sources using techniques such as data merging and cleansing. Prepared data for further processing by standardizing formats and ensuring quality.
Created visually compelling Power BI dashboards and published them for easy access. Ensured that reports were user-friendly and met business requirements.
Established connections between Power BI and multiple databases for real-time data access. Enabled automatic query updates to ensure data freshness.
Leveraged Power Query to manipulate data models and clean data efficiently. Utilized pivot functions for accurate data representation.
Used MLflow for
Pulled data from various databases such as SQL Server and Oracle. Ensured seamless integration and accurate data extraction.
Fine-tuned SQL queries for better performance and efficiency. Provided support for data loading tasks to ensure smooth operations.
Handled multiple data formats to ensure compatibility and flexibility. Facilitated data processing and analysis by using appropriate formats.
Developed data pipelines with Talend to extract and transfer data. Migrated data efficiently to AWS S3 for storage and further processing.
Scheduled automated data storage in AWS S3 and utilized AWS Glue for data loading. Leveraged AWS Lambda to automate data processing tasks.
Orchestrated CI/CD for data infrastructure to maintain scalability and resilience. Ensured efficient delivery and management of dependable components like Spark, Kafka, and Hadoop.
Used tools like Apache Ambari to monitor database performance. Identified and resolved performance bottlenecks to enhance system reliability.
Conducted thorough testing phases, including integration and regression. Analyzed test results to ensure the reliability and quality of data processes.
Embraced agile methodology to expand and adapt data infrastructure. Ensured the infrastructure met the evolving needs of the business.