Bibhusha Neupane
Email: **********@*****.***
Phone: 904-***-****
PROFESSIONAL SUMMARY:
7+ years of experience in developing databases, ETL pipelines, data models, reporting, and big data solutions.
Expertise in designing and implementing end-to-end data workflows on AWS, utilizing AWS Glue for managing complex ETL processes to ensure scalable, efficient, and reliable business operations.
Proficient in leveraging cloud platforms like Azure and Databricks for scalable, efficient data processing, storage, and analysis.
Designed and streamlined scalable ETL/ELT workflows in Snowflake, showcasing skills in data integration, migration, warehousing, and business intelligence.
Skilled in utilizing AWS services such as EC2, S3, Lambda, and EMR, with a strong focus on using Redshift for data migrations and analytics.
Used Python libraries, including Pandas, NumPy, and PySpark, to handle data transformation, manipulation, and real-time analytics to enhance workflows.
Employed Airflow’s Python-based architecture to develop dynamic, reusable data pipelines that integrated various tools and data sources, improving scalability and maintenance.
Applied Hadoop tools like Apache Spark and MapReduce for processing extensive datasets, used HBase for real-time storage, and leveraged Hive for SQL-like querying to optimize data management.
Designed relational database schemas in MySQL, PostgreSQL, and Oracle to support OLTP and OLAP functions, while crafting advanced SQL queries for data retrieval and transformation.
Managed MongoDB for schema-less, flexible storage, organizing collections to support real-time analytics and ensure smooth workflows and robust data management.
Created data ingestion workflows with Sqoop and Flume for transferring data between relational databases and Hadoop, while orchestrating tasks with Oozie and managing pipeline operations via Zookeeper.
Developed complex workflows using Apache Airflow for task automation, scheduling, and monitoring to enhance data pipeline efficiency.
Implemented distributed data processing using AWS EMR and automated ETL processes with AWS Glue for seamless integration of diverse data sources.
Automated infrastructure deployment with CloudFormation and Terraform, managing resources like EC2, VPCs, RDS, and IAM roles across multi-account environments.
Designed and optimized data models on Amazon Redshift, supporting high-performance analytics and scalable querying of large datasets.
Built scalable data processing applications using PySpark and Scala to analyze extensive datasets and improve real-time analytics performance.
Utilized Apache Kafka for reliable real-time data streaming and seamless integration across distributed systems.
Conducted thorough analysis of complex datasets to derive actionable insights, using data visualization tools and statistical techniques to present findings to stakeholders.
Leveraged MongoDB for flexible data modeling, enabling efficient storage and querying of unstructured data for scalable real-time analytics.
Managed scalable cloud infrastructures on AWS using services such as EC2, S3, and RDS to enable robust data storage and processing.
Developed and optimized data pipelines on AWS platforms like Glue, Lambda, and Redshift for seamless data integration and analysis.
Automated provisioning of cloud resources using Terraform, ensuring consistent deployment, efficient scaling, and reduced deployment times for data engineering workloads.
Applied CI/CD pipelines and version control tools like Git, GitLab, Azure DevOps, and Jenkins to ensure code quality and streamline deployments.
Designed and maintained CI/CD pipelines to streamline deployment workflows across multiple environments efficiently.
Collaborated on data engineering projects using Git and Bitbucket for version control, enhancing team-based development and ensuring code reliability.
Used Maven for build automation and dependency management, incorporating unit testing to ensure robust and reliable data processing applications.
Skilled in Agile project management with tools like Jira and Kanban boards, driving project visibility and improving team efficiency.
Developed comprehensive migration strategies, including risk assessments, timelines, and validation processes.
Delivered across the entire development lifecycle with expertise in performance tuning, system testing, data cleansing, profiling, mapping, and conversion.
TECHNICAL SKILLS:
Operating Systems
Unix, Linux, Mac OS, Windows, Ubuntu
Hadoop Ecosystem
HDFS, MapReduce, YARN, Oozie, Zookeeper, JobTracker, TaskTracker, NameNode, DataNode, Cloudera, AWS
Big Data Technologies
Hadoop, Spark, MapReduce, YARN, Hive, Pig, Sqoop, Oozie, Maven
Amazon Web Services
EC2, Lambda, EMR, Spark, Kinesis, S3, Boto3, Elastic Beanstalk, ECS, CloudWatch, ELB, VPC, ElastiCache, DynamoDB, Redshift, RDS, Athena
Cloud Computing Tools
Snowflake, AWS, Databricks, Amazon EC2, EMR, S3
NoSQL Databases and Methodologies
HBase, Cassandra, MongoDB, Apache Hadoop HBase; Agile, Scrum, Waterfall
Programming Languages
Python (Jupyter, PyCharm IDEs), R, Java, Scala, SQL, PL/SQL, SAS
Databases
Snowflake Cloud DB, AWS Redshift, AWS Athena, Oracle, MySQL, Teradata 12/14, SQL Server, PostgreSQL 9.3, Netezza, Amazon RDS
SQL Server Tools
SQL Server Management Studio, Enterprise Manager, Query Analyzer, Profiler, Export and Import (DTS)
PROFESSIONAL EXPERIENCE:
Client: Novartis Pharmaceuticals, East Hanover, NJ Jun 2021 - Present
Role: AWS Data Engineer Analyst
Responsibilities:
Designed and implemented ETL workflows using Informatica PowerCenter for extracting, transforming, and loading data from diverse sources into target databases, ensuring high data accuracy and performance.
Developed multiple ETL solutions by utilizing SQL scripting, Python, shell scripting, ETL tools, and scheduling systems to process various data sources efficiently.
Used EMR to process large-scale data across Hadoop clusters, leveraging EC2 for computation and S3 for storage, optimizing both performance and cost.
Built complex data pipelines with Apache Airflow, automating ETL task scheduling, execution, and monitoring for seamless data processing (a minimal DAG sketch follows this list of responsibilities).
Wrote Python scripts to perform data cleaning, transformation, and analysis, improving data quality and usability for analytics.
Automated repetitive data processing workflows using Python, significantly enhancing productivity and reducing manual workload.
Created data ingestion frameworks with Sqoop and Flume for seamless data transfers between Hadoop and relational databases, ensuring data availability.
Administered Hadoop clusters and used Zookeeper for configuration management to improve pipeline reliability, and automated complex workflows using Apache Airflow.
Designed and optimized data warehouse solutions in Redshift to support large-scale analytics, focusing on advanced queries and scalable architectures.
Used Informatica to integrate data from various sources, including SQL Server, Oracle, flat files, and cloud platforms, ensuring accurate data transformation for analytics.
Created and optimized complex data models in PostgreSQL, leveraging advanced indexing methods and partitioning to improve query performance and storage efficiency for large datasets.
Utilized AWS services such as EC2 for virtual machines, Lambda for serverless functions, and ECS/EKS for containerized application management.
Designed event-driven processes triggered by CloudWatch and S3, enabling real-time data workflows integrated with AWS services like DynamoDB and Redshift.
Deployed serverless architectures with AWS Lambda and Glue to reduce infrastructure costs and enhance the efficiency of data processing pipelines.
Worked with AWS storage solutions such as S3, RDS, DynamoDB, and EFS, and used Glue, EMR, and Redshift for data transformation, big data processing, and warehousing.
Configured Snowpipe to automate the ingestion of semi-structured data from S3 for near real-time analytics in scenarios like claims and enrollment tracking.
Built and managed data lakes on S3 to store both structured and unstructured data, implementing lifecycle policies and access controls for optimized data management.
Designed relational database schemas in MySQL, PostgreSQL, and Oracle to support OLTP/OLAP workloads and wrote advanced SQL queries for aggregation and data extraction.
Maintained MongoDB for flexible, schema-less data storage, designing collections for real-time applications and ensuring efficient workflows.
Utilized AWS infrastructure to store and manage terabytes of data for customer-focused business intelligence and reporting solutions.
Created CI/CD pipelines for transferring data from Azure to SQL DB, ensuring consistent deployment across environments.
Automated CI/CD workflows using Jenkins pipelines, Maven for Java builds, and test automation frameworks.
Built CI/CD pipelines with Jenkins, Docker, and Kubernetes to streamline deployments for data engineering projects.
Implemented Apache Hudi to manage incremental data ingestion and storage, supporting efficient updates and real-time processing within data lakes.
Designed interactive Kibana dashboards to track key performance metrics, enabling actionable insights through clear visualizations for stakeholders.
Streamlined infrastructure provisioning by employing Infrastructure as Code (IaC) tools to automate setup and enhance team productivity.
Used Maven for automating build processes and dependency management, simplifying project maintenance and configuration.
Applied unit testing frameworks to validate data workflows and applications, ensuring code quality and reducing production issues.
Analyzed large datasets from multiple sources (structured & semi-structured) to identify business insights and trends, leading to data-driven decision-making.
Developed interactive dashboards using Power BI / Tableau to visualize business KPIs and monitor real-time data performance.
Designed and optimized SQL queries for ad hoc reporting, data transformations, and performance tuning of analytical queries.
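A minimal sketch, assuming Airflow 2.x, of the kind of ETL DAG described in the Airflow bullet above; the DAG id, schedule, and extract/transform/load callables are hypothetical placeholders rather than the actual client pipeline.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(**context):
    # Placeholder: pull source data (e.g., from S3 or a relational source).
    ...


def transform(**context):
    # Placeholder: clean and reshape the extracted data (e.g., with Pandas or PySpark).
    ...


def load(**context):
    # Placeholder: write the transformed data to the target warehouse (e.g., Redshift).
    ...


default_args = {
    "owner": "data-engineering",        # hypothetical owner
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="etl_pipeline_example",      # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Linear dependency chain: extract -> transform -> load.
    extract_task >> transform_task >> load_task

Retries and a daily schedule are shown only to illustrate how scheduling and monitoring hooks attach to each task; production values would differ.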
Client: Verizon (Irving, TX) Jul 2019 - Apr 2021
Role: AWS Data Engineer Analyst
Responsibilities:
Developed and managed ETL pipelines using AWS Glue and PySpark, enabling the seamless ingestion, transformation, and storage of structured and unstructured data from diverse sources (a minimal Glue job sketch follows this list of responsibilities).
Integrated Informatica Data Quality (IDQ) tools to maintain high data quality standards across ETL processes, setting up data profiling, cleansing, and validation rules to prevent data issues.
Implemented robust data validation processes within AWS Glue ETL jobs, ensuring high data quality before ingestion into Redshift and other analytics platforms.
Implemented data auditing mechanisms using AWS Glue Data Catalog to track metadata, enabling data lineage and ensuring data integrity across multiple stages of the ETL process.
Configured and scheduled AWS Glue Crawlers to automatically catalog datasets stored in S3, ensuring updated metadata and schema discovery for dynamic ETL processes.
Documented metadata while designing OLTP and OLAP systems, creating tables and implementing naming conventions in Erwin for logical and physical data models.
Created and maintained Snowflake SQL scripts, User-Defined Functions (UDFs), and stored procedures for complex data transformations, aggregations, and analyses.
Built and automated ETL/ELT pipelines in Python, streamlining data integration and reducing manual tasks for better scalability.
Leveraged Python libraries like Pandas and NumPy to manipulate large datasets and perform complex calculations.
Utilized Apache Spark, MapReduce, and Hive to process and analyze large-scale datasets, along with HBase for real-time data storage.
Used Sqoop and Flume for seamless data ingestion from relational databases into Hadoop, orchestrating workflows with Oozie and ensuring high availability and fault tolerance with Zookeeper for reliable big data pipelines. Created MapReduce jobs for data ingestion, transformation, and aggregation.
Analyzed and monitored system logs and application performance using Kibana visualizations and Elasticsearch alerting, reducing response time for incident management.
Utilized Spark's in-memory computing capabilities to perform advanced procedures such as text analytics and large-scale data processing.
Designed and improved relational database schemas in MySQL, PostgreSQL, and Oracle to support both transactional (OLTP) and analytical (OLAP) processes, creating advanced SQL queries for data retrieval and transformation.
Wrote advanced PL/SQL stored procedures, functions, and triggers to automate complex business logic and data transformations, reducing repetitive tasks and improving the efficiency of database-driven processes within transactional systems.
Worked with S3 for storage, RDS and DynamoDB for databases, and AWS tools like Glue, EMR, and Redshift for ETL, big data processing, and data warehousing.
Developed complex stored procedures, views, and triggers for data integrity and security.
Utilized AWS services such as Lambda for serverless automation, S3 for data storage, and CloudWatch for monitoring ETL jobs and performance metrics. Deployed and configured SNS for notifications and alerts during data pipeline operations.
Optimized data storage solutions using Amazon S3 and Redshift, ensuring high-performance data warehousing and efficient resource utilization for big data analytics.
Performed fine-tuning of AWS Glue jobs, adjusting memory and execution times to ensure optimal data processing performance with minimal resource overhead.
Integrated various third-party data sources and APIs with AWS data pipelines, automating data retrieval, transformation, and loading into AWS environments.
Deployed CI/CD pipelines for AWS data engineering workflows using Jenkins, automating code deployment, testing, and updates to AWS Glue and Lambda functions.
Deployed and managed containerized applications using Kubernetes (K8s), orchestrating multi-container workloads for scalable and resilient environments.
Performed clustering analysis on customer data using Python & SQL, leading to targeted marketing strategies.
Built Power BI dashboards using historical sales data to predict future revenue trends.
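A minimal sketch, following the standard AWS Glue PySpark job skeleton, of the kind of Glue ETL job with a validation step described in the first bullet above; the catalog database, table name, key column, and S3 output path are hypothetical placeholders.

import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Resolve the job name passed in by Glue at runtime.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a dataset cataloged by a Glue Crawler (hypothetical database/table names).
source = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="events"
)

# Simple validation step: drop records missing the key field before loading.
cleaned_df = source.toDF().dropna(subset=["event_id"])
cleaned = DynamicFrame.fromDF(cleaned_df, glue_context, "cleaned")

# Write the validated data back to S3 as Parquet (hypothetical bucket/prefix).
glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/events/"},
    format="parquet",
)

job.commit()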
Client: Nationwide Insurance Aug 2017 - Jun 2019
Role: Big Data Developer Analyst
Responsibilities:
Designed and managed relational databases such as SQL Server, PostgreSQL, and Oracle, along with NoSQL databases like MongoDB and DynamoDB for versatile data management.
Consolidated data from diverse sources, including SQL Server, S3, flat files, and Kafka, into centralized Data Lakes and Data Warehouses.
Enhanced query performance by employing techniques like indexing, partitioning, and optimizing complex SQL queries for faster data retrieval.
Applied ETL/ELT best practices to efficiently extract, transform, and load structured and unstructured data across various systems.
Optimized Spark jobs to improve memory efficiency and reduced execution times by 30% through performance tuning (a tuning sketch follows this list of responsibilities).
Migrated legacy systems to Infrastructure-as-Code (IaC) setups, minimizing manual interventions and boosting deployment efficiency by over 50%.
Implemented data governance frameworks to maintain data accuracy, consistency, and completeness across platforms.
Preprocessed data for machine learning by applying techniques such as feature scaling, normalization, and imputing missing values.
Processed large datasets in distributed systems using tools like Apache Hadoop, Hive, Pig, HBase, Kafka, and MapReduce.
Designed scalable pipelines to process massive datasets effectively with Apache Spark and Hadoop frameworks.
Built batch processing workflows for large-scale data transformations using Apache Spark, Hive, and Hadoop.
Developed comprehensive data models for Data Warehousing and Data Lakes using Star and Snowflake schemas for optimal performance.
Automated data ingestion workflows by using Sqoop to load relational database content into HDFS.
Created custom ETL pipelines with tools like Informatica, AWS Glue, and PySpark to integrate and process data from multiple sources.
Utilized version control systems such as Git, Bitbucket, and GitLab to ensure collaborative and reliable data engineering project management.
Leveraged tools like JIRA and Confluence for documentation and project tracking, fostering team collaboration and transparency.
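A minimal PySpark sketch illustrating the kind of batch processing and performance tuning described in the Spark bullets above (broadcast join, caching, partitioned output); all paths, column names, and configuration values are hypothetical.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.functions import broadcast

spark = (
    SparkSession.builder
    .appName("batch-transform-example")             # hypothetical app name
    .config("spark.sql.shuffle.partitions", "200")  # tuned per data volume
    .getOrCreate()
)

# Hypothetical inputs: a large fact table and a small dimension table.
transactions = spark.read.parquet("s3://example-bucket/raw/transactions/")
customers = spark.read.parquet("s3://example-bucket/raw/customers/")

# Broadcast the small dimension table to avoid a shuffle-heavy join.
enriched = transactions.join(broadcast(customers), on="customer_id", how="left")

# Cache the intermediate result because it feeds multiple aggregations.
enriched.cache()

daily_totals = (
    enriched.groupBy("transaction_date")
    .agg(
        F.sum("amount").alias("total_amount"),
        F.countDistinct("customer_id").alias("distinct_customers"),
    )
)

# Write partitioned output so downstream queries can prune by date.
daily_totals.write.mode("overwrite").partitionBy("transaction_date").parquet(
    "s3://example-bucket/curated/daily_totals/"
)

The broadcast join and caching are the sort of changes that typically drive the execution-time improvements mentioned above; actual gains depend on data skew and cluster sizing.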
Education Details:
Master's Degree, University of the Cumberlands, Williamsburg, KY, 2020