Mukesh Sai
******.*****@*****.*** 813-***-**** LinkedIn
PROFESSIONAL SUMMARY
Highly experienced and detail-oriented Data Engineer with 6+ years of hands-on experience in designing, building, and optimizing data pipelines and systems for large-scale data processing. Proficient in Python, SQL, and ETL tools, with a strong track record of transforming complex datasets into actionable insights. Adept at leveraging cloud platforms such as AWS and Azure to develop scalable data solutions. Committed to ensuring data integrity, performance optimization, and enabling data-driven decision-making.
• Extensive working experience with Scrum/Agile and Waterfall project execution methodologies.
• Experience with NoSQL databases such as HBase, Cassandra, and MongoDB, as well as SQL databases such as Teradata.
• Hands-on experience on Hadoop, HDFS, Hive, Sqoop, Pig, HBase, Oozie, Flume, Spark, MapReduce, Cassandra, Zookeeper, YARN, Kafka, Scala, PySpark, Airflow, Snowflake, SQL, Python.
• Expertise in AWS Glue, Redshift, Snowflake, Apache Airflow, Kafka, and Python (PySpark), ensuring secure, compliant (HIPAA, GDPR), and high-performance data architectures.
• Experience in dimensional data modeling (Star Schema, Snowflake Schema, fact and dimension tables) and in concepts such as Lambda Architecture and batch processing.
• Implemented real-time streaming ingestion with Kafka and Spark Streaming, loading data via Spark Streaming jobs written in Scala and Python (a minimal sketch follows this list).
• Experienced with CI/CD and containerization tools such as Jenkins, Docker, and Kubernetes.
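As a brief illustration of the streaming-ingestion pattern above, here is a minimal PySpark Structured Streaming sketch; the `events` topic, local broker address, schema, and output paths are illustrative assumptions, not an actual production pipeline.

```python
# Minimal Kafka-to-Spark streaming-ingestion sketch (assumed topic/paths).
# Requires the spark-sql-kafka package on the Spark classpath.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()

# Hypothetical event schema for the JSON payload.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("amount", DoubleType()),
])

# Read the Kafka topic as a streaming DataFrame and parse the JSON value.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Persist parsed events to Parquet, with checkpointing for fault tolerance.
query = (
    events.writeStream.format("parquet")
    .option("path", "/data/events")
    .option("checkpointLocation", "/chk/events")
    .start()
)
query.awaitTermination()
```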
TECHNICAL SKILLS
Programming Languages: Python, SAS, R, SQL, PostgreSQL, Scala, Java, C
Databases: MongoDB, Cassandra, Elasticsearch, DynamoDB, MySQL, SQL, Oracle DBMS, PostgreSQL
Big Data Ecosystem: Hadoop, MapReduce, Kafka, PySpark, Pig, Hive, YARN, Flume, Sqoop, Oozie, Talend
Tools/Frameworks/Utilities: MySQL, T-SQL, Jira, Confluence, Power BI, Tableau, Jupyter Notebook, Flask, Spring, NumPy, Pandas, GitHub, Bitbucket, PyCharm, SDLC, MapReduce, Spark, Snowflake, Airflow, Informatica, Teradata, Hadoop, HBase, Fivetran, Stitch, Rivery, Kimball, Inmon, Data Vault
Cloud: Azure, Azure Databricks, Azure Data Factory, Azure Cloud, Azure SQL Server, Azure Data Lake, Azure DevOps, Azure Storage, AWS S3, SageMaker, EMR, Redshift, Glue, Athena, Step Functions, RDS, CloudWatch, DynamoDB, Kinesis, Lambda, SQS
Machine Learning: Linear Regression, Logistic Regression, Decision Tree, Random Forest, K-Means Clustering, scikit-learn, Keras, TensorFlow
Operating Systems: UNIX, Linux, Windows
Data Visualization Tools: Tableau, Power BI, Excel
DevOps: Jenkins, Terraform, Pulumi, Docker, Kubernetes, Ansible, Chef
WORK EXPERIENCE
ChenMed, Miami, FL June 2023 – Present
Data Engineer
• Designed and developed high-performance data pipelines using Apache Spark, PySpark, and SQL, enabling seamless ingestion, transformation, and storage of large datasets.
• Migrated on-premises data to AWS services, including S3, Redshift, and Glue, streamlining ETL processes and improving data processing efficiency by 6%.
• Created and managed ELT workflows with Apache Airflow, integrating data from diverse sources such as APIs, databases, and cloud storage into centralized data warehouses (a minimal DAG sketch follows this list).
• Built and optimized complex SQL queries for transforming and loading data from MySQL, PostgreSQL, and NoSQL databases, ensuring high data accuracy and system reliability.
• Led the migration of legacy Oracle databases to AWS Redshift, leveraging advanced techniques like partitioning and indexing to achieve significant cost savings and query performance improvements.
• Automated data validation scripts and implemented error-handling mechanisms within ETL pipelines using Python, ensuring consistent data quality and significantly reducing manual intervention.
• Partnered with data scientists to deploy machine learning models using MLflow and Kubeflow, enabling efficient tracking, training, and deployment workflows.
• Optimized performance of Spark and SQL queries, reducing processing times by 5% and improving resource utilization for large-scale data operations.
• Implemented cloud infrastructure using AWS CloudFormation, containerized workflows with Docker, and automated deployment pipelines with Jenkins, enhancing CI/CD efficiency.
• Designed real-time data pipelines with Apache Kafka and Spark Streaming, ensuring low-latency data availability for mission-critical applications.
• Strengthened data governance by implementing data lineage and quality checks across pipelines, ensuring compliance with organizational standards and security protocols.
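The Airflow-based ELT workflow referenced in this role can be sketched as below; the DAG id, schedule, and task callables are illustrative placeholders, not the actual pipeline.

```python
# Minimal Airflow 2.x ELT DAG sketch (illustrative tasks and schedule).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    """Pull raw records from a source API or database (placeholder)."""
    ...


def load():
    """Land raw data in the warehouse staging area (placeholder)."""
    ...


def transform():
    """Run in-warehouse transformations on staged data (placeholder)."""
    ...


with DAG(
    dag_id="elt_daily",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+ style schedule argument
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_load = PythonOperator(task_id="load", python_callable=load)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    # ELT ordering: load raw data first, then transform in the warehouse.
    t_extract >> t_load >> t_transform
```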
Larsen & Toubro Infotech, Mumbai, India Jan 2019 - Jul 2022
Big Data Engineer
• Designed and implemented scalable big data pipelines using Apache Spark, Hadoop, and Kafka to process and analyze large datasets, ensuring high performance and fault tolerance.
• Optimized data ingestion and transformation workflows, reducing processing time by 4% through the use of PySpark and Hive for batch and stream processing.
• Developed custom data processing in Spark, leveraging RDDs, DataFrames, and Spark SQL to handle diverse data formats such as JSON, Parquet, and Avro (illustrated in the sketch after this list).
• Led the design and implementation of real-time data streaming solutions using Apache Kafka and Spark Streaming, enabling near-instantaneous processing of time-sensitive data for analytical needs.
• Collaborated closely with data scientists to develop machine learning pipelines using Spark MLlib, overseeing the training of models on large datasets and facilitating their deployment for real-time predictions.
• Successfully migrated legacy data processing systems to a Hadoop ecosystem, integrating HDFS for enhanced data access, reliability, and system performance.
• Optimized data storage and retrieval processes across HDFS and cloud-based data lakes (AWS S3), significantly improving scalability and data query performance.
• Automated data quality checks and validation processes using Apache Airflow, ensuring data accuracy and integrity across all pipeline stages.
• Worked on Azure HDInsight to implement distributed processing solutions, managed cluster performance, and fine-tuned resource allocation.
• Designed and implemented data warehousing solutions using Amazon Redshift, optimizing query performance through effective indexing and partitioning strategies.
• Created and fine-tuned complex SQL queries for data transformations, aggregations, and reporting, reducing execution time and improving data availability.
• Utilized Apache Hive and Impala to perform fast, SQL-like data querying on large datasets stored in HDFS, supporting critical business intelligence applications.
• Developed and managed ETL pipelines for ingesting structured and unstructured data, consolidating data from diverse sources into Hadoop for comprehensive processing and analysis.
• Collaborated with cross-functional teams to define data requirements, ensuring seamless data flow and integration across systems, and supporting analytics, reporting, and operational applications.
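A minimal PySpark illustration of the multi-format DataFrame processing described above; all paths, column names, and the join logic are hypothetical.

```python
# Reading JSON, Parquet, and Avro into DataFrames and querying via Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multi-format").getOrCreate()

clicks = spark.read.json("/raw/clicks/")           # semi-structured events
orders = spark.read.parquet("/curated/orders/")    # columnar batch data
users = spark.read.format("avro").load("/raw/users/")  # needs spark-avro package

# Register views so the aggregation can be expressed in Spark SQL.
clicks.createOrReplaceTempView("clicks")
orders.createOrReplaceTempView("orders")

daily = spark.sql("""
    SELECT o.order_date, COUNT(*) AS orders, SUM(o.total) AS revenue
    FROM orders o
    JOIN clicks c ON c.user_id = o.user_id
    GROUP BY o.order_date
""")

# Write the summarized output back to the curated zone as Parquet.
daily.write.mode("overwrite").parquet("/curated/daily_summary/")
```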
Digital Lync, Hyderabad, India May 2018 - Dec 2018
Python Developer
• Assisted in developing web applications, learning the development process through hands-on practice.
• Gained experience applying object-oriented programming (OOP) principles to develop maintainable and scalable solutions.
• Applied OOP principles in real-world applications, enhancing software structure and functionality.
• Utilized Git for version control and collaborated in writing automated tests to ensure code quality.
• Managed SQL databases, including PostgreSQL, for efficient data storage and retrieval across various projects.
• Supported cloud deployment processes, acquiring experience with AWS services such as Lambda.
• Collaborated with a team in an Agile environment, optimizing performance and automating workflows through CI/CD pipelines.
• Participated in code reviews, provided feedback, and contributed to the refinement of development best practices.
• Assisted in troubleshooting and debugging issues, ensuring the stability and performance of web applications.
• Worked closely with front-end and back-end developers to integrate APIs and improve the user experience (see the sketch below).
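A small Flask sketch of the kind of API endpoints this work involved; the routes and payload shapes are hypothetical, not the actual application.

```python
# Minimal Flask API sketch with a read and a create endpoint (illustrative).
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/api/items", methods=["GET"])
def list_items():
    # In the real application this would query SQL/PostgreSQL.
    return jsonify([{"id": 1, "name": "example"}])


@app.route("/api/items", methods=["POST"])
def create_item():
    payload = request.get_json(force=True)
    # Validation and persistence would happen here.
    return jsonify(payload), 201


if __name__ == "__main__":
    app.run(debug=True)
```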
PROJECTS
Online Voting System
• Developed a web-based voting platform where users can participate in various polls, with results displayed in real-time.
• Admins can create, manage, and monitor polls, while users can vote only once per poll. The system includes user authentication, poll expiration, and a dynamic real-time results feature (a sketch of the one-vote rule follows).
Technologies Used: Python, HTML, CSS, JavaScript, MySQL, Flask-Login/Django Authentication.
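A sketch of how the one-vote-per-poll rule might be enforced with Flask-Login and SQLAlchemy; the `Vote` model and route are illustrative assumptions, not the project's actual code.

```python
# One-vote-per-poll enforcement sketch (hypothetical model and route).
from flask import Flask, abort
from flask_login import LoginManager, current_user, login_required
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config["SECRET_KEY"] = "dev"  # placeholder
app.config["SQLALCHEMY_DATABASE_URI"] = "sqlite:///voting.db"  # MySQL in the project
db = SQLAlchemy(app)
login_manager = LoginManager(app)  # user_loader setup omitted for brevity


class Vote(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    poll_id = db.Column(db.Integer, nullable=False)
    choice_id = db.Column(db.Integer, nullable=False)
    user_id = db.Column(db.Integer, nullable=False)
    # A unique (poll, user) pair enforces one vote per poll at the database level.
    __table_args__ = (db.UniqueConstraint("poll_id", "user_id"),)


@app.route("/polls/<int:poll_id>/vote/<int:choice_id>", methods=["POST"])
@login_required
def vote(poll_id, choice_id):
    # Reject a second vote by the same user on the same poll.
    if Vote.query.filter_by(poll_id=poll_id, user_id=current_user.id).first():
        abort(403)
    db.session.add(Vote(poll_id=poll_id, choice_id=choice_id,
                        user_id=current_user.id))
    db.session.commit()
    return {"status": "vote recorded"}, 201
```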
EDUCATION
University of South Florida, Tampa, FL
Master of Science – Business Analytics and Information Systems
Relevant Coursework: Data Management System Design, Operating System Design, Data Structures and Algorithms, Advanced Data Management System Design, Data Mining, Data Analytics with R.
CERTIFICATIONS
Certified MERN Stack Developer – NxtWave Disruptive Technologies
Microsoft Power BI Data Analyst
Salesforce Administrator