Manisha Khadka
*****************@*****.***
Professional Summary:
● Over seven years of experience developing, executing, and managing data-intensive systems, with a focus on cloud platforms, big data technologies, and ETL procedures.
● Skilled at building and improving data pipelines with tools like PySpark, Snowflake, and Apache Airflow, ensuring excellent performance and scalability in data workflows.
● Capable of using cloud platforms such as AWS, Azure, and Databricks to provide scalable analytics, integration, and data processing solutions according to business requirements.
● Skilled in designing end-to-end data solutions on AWS and proficient in building scalable ETL processes with AWS Glue, Redshift, and S3 to maximize data transformation and analytics.
● Practical understanding of the Hadoop ecosystem, including Apache Spark, Hive, and HBase, for effectively processing and analyzing big datasets.
● Proven experience in designing and implementing data lakes and warehouses, using Snowflake and Azure Data Factory to provide seamless data ingestion and storage.
● In-depth knowledge of relational and NoSQL databases, such as PostgreSQL, MySQL, Oracle, and MongoDB, supporting both analytical and transactional workloads.
● Excellent understanding of data migration, especially the seamless transfer of on-premises databases to cloud platforms like AWS and Azure using tools such as Sqoop and Flume.
● Skilled in automating cloud infrastructure provisioning with Terraform and CloudFormation, ensuring reliable and efficient deployment pipelines.
● Advanced understanding of streaming data processing with Spark Streaming, Kinesis, and Apache Kafka to enable real-time analytics and decision-making.
● Experienced in CI/CD techniques, automating development, testing, and deployment processes using tools like Jenkins, GitLab, and Azure DevOps.
● Able to build data visualization and reporting solutions in Tableau, delivering interactive dashboards that help stakeholders derive actionable insights.
● Solid understanding of Agile project management techniques, utilizing Jira and Kanban to ensure effective teamwork and project completion.
● Expertise in creating, refining, and managing data models in Redshift and Snowflake, ensuring fast querying and analytics.
● Highly experienced in implementing data governance frameworks, guaranteeing data quality and compliance, and effectively resolving data-related issues.
Technical Skills:
Cloud Platforms and Services: AWS (S3, Glue, Redshift, EMR, Lambda, DynamoDB), Azure, GCP
Big Data Technologies: Hadoop (HDFS, Sqoop, Flume, Oozie), Spark (PySpark, Spark-SQL, Spark Streaming), Apache Kafka, Apache Hudi
ETL and Workflow Tools: Informatica PowerCenter, Apache Airflow, AWS Glue, Snowpipe
Programming Languages: Python, Scala, SQL
Databases: Snowflake, PostgreSQL, MySQL, Oracle, SQL Server, MongoDB
CI/CD and Automation Tools: Jenkins, GitLab, Azure DevOps, Terraform, CloudFormation
Data Visualization Tools: Tableau, Power BI, Kibana
Data Quality and Governance: Informatica Data Quality (IDQ)
Machine Learning Support: Pandas, NumPy
Containerization and Orchestration: Docker, Kubernetes
Monitoring and Logging: CloudWatch, Elasticsearch, Grafana, Kibana
Project Management Tools: Jira, Kanban
Education:
Master of Science, University of Texas at Arlington, Arlington, TX, 2024.
Fidelity - Durham, NC June 2021 - Present
Lead Data Engineer
● Used cloud platforms such as AWS, GCP, and Azure to design and deliver complete data solutions covering data ingestion, transformation, and visualization for large-scale business projects.
● Migrated key workloads from Teradata to Snowflake and redesigned ETL processes using Python and Snowpipe for greater scalability and performance.
● Supervised the implementation of big data technologies including Hadoop, Spark, and Hive to handle large datasets efficiently and unlock new analytical capabilities.
● Built real-time data pipelines with Kafka and Spark, ensuring high-speed, low-latency data processing and seamless system integrations.
● Designed and optimized database schemas for transactional and analytical workloads, using MySQL and PostgreSQL to integrate seamlessly with cloud-based data warehouses.
● Led the development of CI/CD pipelines with tools like Jenkins, GitLab, and Kubernetes, automating deployments for data engineering workflows and analytics applications.
● Designed and developed scalable Redshift-based data warehouses with an emphasis on workload management, query optimization, and partitioning for optimal performance.
● Migrated complex data pipelines from Azure Data Factory to GCP Dataflow, reducing costs while leveraging cloud-native technologies to increase productivity.
● Automated cloud infrastructure provisioning with Terraform and Ansible, enabling consistent deployments and minimizing manual intervention.
● Integrated Redshift and EMR with the AWS Glue Data Catalog to automate schema discovery for ETL processes and simplify metadata management.
● Conducted thorough testing and performance optimization of data pipelines to ensure scalability and reliability even at large data volumes.
● Collaborated with product managers and business analysts to translate business requirements into robust data architectures and delivered insights using Tableau and Power BI.
● Provided mentorship to junior engineers, sharing best practices for building efficient pipelines, writing optimized SQL queries, and leveraging cloud services.
● Established advanced monitoring and alerting systems with tools like CloudWatch, Elasticsearch, and Grafana, ensuring high availability and system stability.
Technologies used: AWS, GCP, Azure, Hadoop, Spark, Hive, Snowflake, Redshift, Kafka, Snowpipe, Python, Scala, PySpark, MySQL, PostgreSQL, Jenkins, GitLab, Kubernetes, Terraform, Ansible, Azure Data Factory, GCP Dataflow, AWS Glue, EMR, CloudWatch, Elasticsearch, Grafana, Tableau, Power BI.
Prudential Financial, Inc. - Newark, NJ August 2019 - March 2021
Senior Data Engineer
● Designed and implemented advanced data pipelines with Apache Airflow, automating ETL workflows for seamless scheduling, monitoring, and execution across diverse environments.
● Built scalable data engineering solutions using AWS services like S3, Glue, Redshift, and EMR, supporting both real-time and batch processing requirements.
● Migrated from an on-premises data architecture to Snowflake, improving query and storage performance to enable more sophisticated analytics and reporting capabilities.
● Developed real-time data ingestion frameworks with Apache Kafka and Spark Streaming, handling high-volume data processing for critical applications.
● Designed and managed secure, cost-efficient data lakes on AWS S3, incorporating security policies, versioning, and lifecycle management for compliance.
● Created complex Snowflake SQL scripts, UDFs, and stored procedures to streamline data transformations and integrate with business intelligence tools.
● Developed CI/CD pipelines for AWS Glue and Lambda using Jenkins and Terraform, enabling automated testing and deployment of ETL jobs and serverless functions.
● Optimized AWS Redshift clusters for performance and cost efficiency, improving data storage, query execution, and workload management.
● Built event-driven data processing architectures using AWS Lambda, S3, and DynamoDB, automating workflows triggered by real-time events.
● Ensured data accuracy and consistency by implementing advanced quality frameworks with Informatica Data Quality (IDQ) tools across all data pipelines.
● Collaborated with data scientists to preprocess and analyze large datasets using Pandas, NumPy, and Spark, supporting the development of effective machine learning models.
● Deployed scalable and reliable ETL applications in containerized environments using Docker and Kubernetes, ensuring fault tolerance across configurations.
● Used Apache Hudi for incremental data ingestion and processing in data lakes, enabling real-time analytics with efficient updates and deletes.
● Created interactive Kibana dashboards to track application logs, business data, and system performance, significantly reducing issue response times.
● Collaborated with business teams to deliver customized data solutions for operational insights, sales forecasting, and customer segmentation.
Technologies used: Apache Airflow, AWS (S3, Glue, Redshift, EMR, Lambda, DynamoDB), Snowflake, Apache Kafka, Spark Streaming, Apache Hudi, Pandas, NumPy, Informatica Data Quality (IDQ), Docker, Kubernetes, Jenkins, Terraform, Kibana.
Costco Wholesale Corporation - Issaquah, WA February 2017 – June 2019
Data Engineer
● Created and implemented reliable ETL processes with Informatica PowerCenter, ensuring data integrity and quality while integrating data from flat files, SQL Server, and Oracle into centralized databases.
● Automated data workflows and scheduling with Apache Airflow, minimizing manual interventions and ensuring reliable pipeline execution.
● Partnered with cross-functional teams to gather requirements, analyze data sources, and design data pipelines aligned with business goals.
● Wrote Python scripts for data cleaning, transformation, and analysis, enhancing data usability and supporting advanced analytical initiatives.
● Designed and optimized PostgreSQL database schemas using indexing and partitioning techniques to enhance query efficiency for analytical workloads.
● Built Hadoop-based data ingestion frameworks with Sqoop and Flume, enabling seamless integration between relational databases and distributed file systems (HDFS).
● Developed scalable Spark applications with PySpark and Spark-SQL to process and transform large datasets, significantly improving pipeline efficiency.
● Designed a secure and scalable data lake on AWS S3, enabling storage and access for structured and unstructured data with lifecycle management.
● Created and ran Oozie workflows to automate MapReduce, Hive, and Pig jobs, simplifying big data workflows.
● Optimized SQL queries, Informatica workflows, and Spark jobs to enhance resource utilization and reduce overall processing times.
● Built ETL pipelines with AWS Glue to integrate data into Redshift, supporting analytics and reporting for business stakeholders.
● Implemented data auditing and validation mechanisms to ensure data integrity at every stage of ETL workflows.
● Created interactive Tableau dashboards, empowering business teams with actionable insights into key performance indicators.
● Automated CI/CD processes for data engineering projects using Jenkins, ensuring smooth deployments across all environments.
● Monitored and analyzed system logs and performance using Kibana, reducing incident resolution times and enhancing system reliability.
Technologies used: Informatica PowerCenter, SQL Server, Oracle, PostgreSQL, AWS (S3, Glue, Redshift), Hadoop (HDFS, Sqoop, Flume, Oozie), Spark (PySpark, Spark-SQL), Apache Airflow, Tableau, Jenkins, Kibana, Python.