Manisha Khadka
*****************@*****.***
Professional Summary:
● Over seven years of experience developing, executing, and managing data-intensive systems, with a focus on cloud platforms, big data technologies, and ETL procedures.
● Skilled at building and improving data pipelines with tools like PySpark, Snowflake, and Apache Airflow, ensuring excellent performance and scalability in data workflows.
● Capable of using cloud platforms such as AWS, Azure, and Databricks to provide scalable analytics, integration, and data processing solutions according to business requirements.
● Skilled in designing end-to-end data solutions on AWS and proficient in building scalable ETL processes with AWS Glue, Redshift, and S3 to maximize data transformation and analytics.
● Practical understanding of the Hadoop ecosystem, including Apache Spark, Hive, and HBase, for effectively processing and analyzing big datasets.
● Proven experience in designing and implementing data lakes and warehouses, using Snowflake and Azure Data Factory to provide seamless data ingestion and storage.
● In-depth knowledge of relational and NoSQL databases, such as PostgreSQL, MySQL, Oracle, and MongoDB, supporting both analytical and transactional workloads.
● Excellent understanding of data migration, especially the seamless transfer of on-premises databases to cloud platforms like AWS and Azure using tools such as Sqoop and Flume.
● Skilled in automating cloud infrastructure provisioning with Terraform and CloudFormation, ensuring reliable and efficient deployment pipelines.
● Advanced understanding of streaming data processing with Spark Streaming, Kinesis, and Apache Kafka to enable real-time analytics and decision-making.
● Experienced in CI/CD techniques, automating development, testing, and deployment processes using tools like Jenkins, GitLab, and Azure DevOps.
● Able to build data visualization and reporting solutions in Tableau, delivering interactive dashboards that help stakeholders derive actionable insights.
● Solid understanding of Agile project management techniques, utilizing Jira and Kanban to ensure effective teamwork and project completion.
● Expertise in creating, refining, and managing data models in Redshift and Snowflake, ensuring fast querying and analytics.
● Highly experienced in implementing data governance frameworks, guaranteeing data quality and compliance, and effectively resolving data-related issues.
Technical Skills:
Cloud Platforms and Services: AWS (S3, Glue, Redshift, EMR, Lambda, DynamoDB), Azure, GCP
Big Data Technologies: Hadoop (HDFS, Sqoop, Flume, Oozie), Spark (PySpark, Spark-SQL, Spark Streaming), Apache Kafka, Apache Hudi
ETL and Workflow Tools: Informatica PowerCenter, Apache Airflow, AWS Glue, Snowpipe
Programming Languages: Python, Scala, SQL
Databases: Snowflake, PostgreSQL, MySQL, Oracle, SQL Server, MongoDB
CI/CD and Automation Tools: Jenkins, GitLab, Azure DevOps, Terraform, CloudFormation
Data Visualization Tools: Tableau, Power BI, Kibana
Data Quality and Governance: Informatica Data Quality (IDQ)
Machine Learning Support: Pandas, NumPy
Containerization and Orchestration: Docker, Kubernetes
Monitoring and Logging: CloudWatch, Elasticsearch, Grafana, Kibana
Project Management Tools: Jira, Kanban
Education:
Master of Science, University of Texas at Arlington, Arlington, TX, 2024.
Fidelity - Durham, NC June 2021 - Present
Lead Data Engineer
● Used cloud platforms such as AWS, GCP, and Azure to design and deliver complete data solutions covering data ingestion, transformation, and visualization for large-scale business projects.
● Migrated key workloads from Teradata to Snowflake and redesigned ETL processes using Python and Snowpipe for greater scalability and performance.
● Supervised the implementation of big data technologies including Hadoop, Spark, and Hive to handle large datasets efficiently and unlock new analytical capabilities.
● Built real-time data pipelines with Kafka and Spark, ensuring high-speed, low-latency data processing and seamless system integrations.
● Designed and optimized database schemas for transactional and analytical workloads, using MySQL and PostgreSQL to integrate seamlessly with cloud-based data warehouses.
● Led the development of CI/CD pipelines with tools like Jenkins, GitLab, and Kubernetes, automating deployments for data engineering workflows and analytics applications.
● Designed and developed scalable Redshift-based data warehouses with an emphasis on workload management, query optimization, and partitioning for optimal performance.
● Migrated complex data pipelines from Azure Data Factory to GCP Dataflow, reducing costs while leveraging cloud-native technologies to increase productivity.
● Automated cloud infrastructure provisioning with Terraform and Ansible, enabling consistent deployments and minimizing manual intervention.
● Integrated Redshift and EMR with the AWS Glue Data Catalog to automate schema discovery for ETL processes and simplify metadata management.
● Conducted thorough testing and performance optimization of data pipelines to ensure scalability and reliability even at large data volumes.
● Collaborated with product managers and business analysts to translate business requirements into robust data architectures and delivered insights using Tableau and Power BI.
● Provided mentorship to junior engineers, sharing best practices for building efficient pipelines, writing optimized SQL queries, and leveraging cloud services.
● Established advanced monitoring and alerting systems with tools like CloudWatch, Elasticsearch, and Grafana, ensuring high availability and system stability.
Technologies used: AWS, GCP, Azure, Hadoop, Spark, Hive, Snowflake, Redshift, Kafka, Snowpipe, Python, Scala, PySpark, MySQL, PostgreSQL, Jenkins, GitLab, Kubernetes, Terraform, Ansible, Azure Data Factory, GCP Dataflow, AWS Glue, EMR, CloudWatch, Elasticsearch, Grafana, Tableau, Power BI.
Prudential Financial, Inc. - Newark, NJ August 2019 - March 2021
Senior Data Engineer
● Designed and implemented advanced data pipelines with Apache Airflow, automating ETL workflows for seamless scheduling, monitoring, and execution across diverse environments.
● Built scalable data engineering solutions using AWS services like S3, Glue, Redshift, and EMR, supporting both real-time and batch processing requirements.
● Migrated from an on-premises data architecture to Snowflake, improving query and storage performance to enable more sophisticated analytics and reporting capabilities.
● Developed real-time data ingestion frameworks with Apache Kafka and Spark Streaming, handling high-volume data processing for critical applications.
● Designed and managed secure, cost-efficient data lakes on AWS S3, incorporating security policies, versioning, and lifecycle management for compliance.
● Created complex Snowflake SQL scripts, UDFs, and stored procedures to streamline data transformations and integrate with business intelligence tools.
● Developed CI/CD pipelines for AWS Glue and Lambda using Jenkins and Terraform, enabling automated testing and deployment of ETL jobs and serverless functions.
● Optimized AWS Redshift clusters for performance and cost efficiency, improving data storage, query execution, and workload management.
● Built event-driven data processing architectures using AWS Lambda, S3, and DynamoDB, automating workflows triggered by real-time events.
● Ensured data accuracy and consistency by implementing advanced quality frameworks with Informatica Data Quality (IDQ) tools across all data pipelines.
● Collaborated with data scientists to preprocess and analyze large datasets using Pandas, NumPy, and Spark, supporting the development of effective machine learning models.
● Deployed scalable and reliable ETL applications in containerized environments using Docker and Kubernetes, ensuring fault tolerance across configurations.
● Used Apache Hudi for incremental data ingestion and processing in data lakes, enabling real-time analytics with efficient updates and deletes.
● Created interactive Kibana dashboards to track application logs, business data, and system performance, significantly reducing issue response times.
● Collaborated with business teams to deliver customized data solutions for operational insights, sales forecasting, and customer segmentation.
Technologies used: Apache Airflow, AWS (S3, Glue, Redshift, EMR, Lambda, DynamoDB), Snowflake, Apache Kafka, Spark Streaming, Apache Hudi, Pandas, NumPy, Informatica Data Quality (IDQ), Docker, Kubernetes, Jenkins, Terraform, Kibana.
Costco Wholesale Corporation - Issaquah, WA February 2017 – June 2019
Data Engineer
● Created and implemented reliable ETL processes with Informatica PowerCenter, ensuring data integrity and quality while integrating data from flat files, SQL Server, and Oracle into centralized databases.
● Automated data workflows and scheduling with Apache Airflow, minimizing manual interventions and ensuring reliable pipeline execution.
● Partnered with cross-functional teams to gather requirements, analyze data sources, and design data pipelines aligned with business goals.
● Wrote Python scripts for data cleaning, transformation, and analysis, enhancing data usability and supporting advanced analytical initiatives.
● Designed and optimized PostgreSQL database schemas using indexing and partitioning techniques to enhance query efficiency for analytical workloads.
● Built Hadoop-based data ingestion frameworks with Sqoop and Flume, enabling seamless integration between relational databases and distributed file systems (HDFS).
● Developed scalable Spark applications with PySpark and Spark-SQL to process and transform large datasets, significantly improving pipeline efficiency.
● Designed a secure and scalable data lake on AWS S3, enabling storage and access for structured and unstructured data with lifecycle management.
● Created and ran Oozie workflows to automate MapReduce, Hive, and Pig jobs, simplifying big data workflows.
● Optimized SQL queries, Informatica workflows, and Spark jobs to enhance resource utilization and reduce overall processing times.
● Built ETL pipelines with AWS Glue to integrate data into Redshift, supporting analytics and reporting for business stakeholders.
● Implemented data auditing and validation mechanisms to ensure data integrity at every stage of ETL workflows.
● Created interactive Tableau dashboards, empowering business teams with actionable insights into key performance indicators.
● Automated CI/CD processes for data engineering projects using Jenkins, ensuring smooth deployments across all environments.
● Monitored and analyzed system logs and performance using Kibana, reducing incident resolution times and enhancing system reliability.
Technologies used: Informatica PowerCenter, SQL Server, Oracle, PostgreSQL, AWS (S3, Glue, Redshift), Hadoop (HDFS, Sqoop, Flume, Oozie), Spark (PySpark, Spark-SQL), Apache Airflow, Tableau, Jenkins, Kibana, Python.