Palistha Manandhar
Data Engineer Parker, CO, USA +1-720-***-**** ***************@*****.*** https://www.linkedin.com/in/palisthamndhar/
About
Experienced Data Engineer with over 6 years of expertise in designing, implementing, and optimizing scalable data pipelines and architectures. Proficient in big data technologies such as Apache Hadoop, Spark, and Kafka for managing large datasets efficiently. Skilled in ETL processes, data warehousing, and database management, with a strong command of SQL, Python, and Java. Adept at collaborating with cross-functional teams to deliver high-quality data solutions while ensuring data integrity, performance, and security. Demonstrated ability to develop and maintain data models, build data lakes, and support data-driven decision-making through robust data infrastructure. Seeking a lead data engineering role to deepen expertise in Spark, Hadoop, and Kafka and to drive innovative data strategies and impactful business intelligence. Committed to continuous learning and staying abreast of emerging trends in data engineering and analytics.
Experience
7-Eleven Irving, TX, USA Data Engineer Sep '22 - Present
Improved ETL workflows through performance tuning and query optimization, cutting data processing time.
Optimized Hadoop jobs, decreasing processing time and improving overall data processing efficiency.
Migrated legacy data from SQL databases to MongoDB, reducing data storage costs.
Architected MongoDB collections and indexing strategies, enhancing NoSQL query performance in cloud environments.
Collaborated with a cross-functional team to design and implement a scalable SQL-based data pipeline, resulting in a 30% increase in data processing efficiency.
Utilized Hadoop and Hive for distributed processing of large datasets, enhancing data analysis capabilities.
Leveraged SQL querying to filter, group, join, and aggregate data, ensuring data quality and integrity.
Utilized Python, SQL, and Spark on EC2 and EMR for structured data import/export across relational databases, S3, and RDS.
Designed, developed, and maintained data pipelines using PySpark to extract, transform, and load (ETL) data.
Monitored streaming pipeline quality with Apache Airflow and implemented reliability alerts.
Implemented Apache Airflow to coordinate workflows between middleware and EBI teams.
Led development of a real-time data analytics platform using Kafka, Spark Streaming, and Snowflake for rapid insights.
Automated data and content-transformation workflows with Airflow, maintaining data integrity.
Deployed containerized applications to AWS and Heroku with Docker and Kubernetes for high reliability and scaling.
Implemented a Kafka messaging system to improve data streaming, reducing data processing time.
Managed AWS EMR clusters to process large-scale datasets, reducing processing time.
Developed API endpoints that enhanced data security measures, reducing potential vulnerabilities.
Improved team collaboration by conducting Git training sessions, resulting in a 20% increase in code quality.
Merck Boston, MA, USA Data Engineer Jun '20 - Aug '22
Developed automated data cleansing processes using Python, increasing data accuracy.
Led the migration of an on-premises data warehouse to Azure Synapse Analytics, increasing processing speed and reducing operational costs by 25%.
Developed ETL workflows using Apache Airflow to ingest, clean, and transform clinical trial data for real-time analysis.
Utilized Azure Data Lake Storage to centralize unstructured data, enabling cross-functional teams to access patient health data more efficiently.
Created data pipelines in Databricks using PySpark to process massive datasets, improving data processing times by 50%.
Implemented Azure Blob Storage for archiving and storing large-scale healthcare datasets securely.
Led the migration of a legacy codebase to Git, improving code maintainability.
Designed and implemented Puppet modules to automate configuration management tasks, reducing deployment time.
Worked closely with cloud computing services such as AWS (S3, Redshift, EMR, Glue) and Azure to deploy scalable data solutions.
Crafted optimized SQL queries and indexing and partitioning strategies for analytics on large datasets.
Designed and implemented data processing pipelines on cloud platforms such as AWS and Azure, handling data ingestion, processing, transformation, and loading.
Leveraged Snowflake to design data warehouses that support real-time analysis of large datasets.
Led integration of Spring, Django, and Flask backends through RESTful APIs for enhanced performance.
Crafted solutions around Snowflake Data Warehouse, including test scripts for continuous integration.
Developed and optimized ETL pipelines using Python, pandas, and NumPy for processing and storing large volumes of data from diverse sources.
Managed Kubernetes-based microservices deployment, optimizing Java and Python backend performance.
Elevated real-time data processing with Kafka for seamless financial operations.
Built CI/CD pipelines with Jenkins and applied Apache Spark (PySpark) and Hadoop for big data processing.
Automated AWS data lake ETL, validating, transforming, and loading data from S3 using AWS cloud services.
Designed business intelligence architectures combining Redshift, Azure SQL Database, and Glue ETL to support informed decision-making.
Implemented Apache Airflow best practices to streamline workflow management, increasing productivity.
Developed an automated monitoring and alerting system for PostgreSQL databases, increasing system reliability by 20%.
Led a team in resolving a critical data outage, restoring data access within 2 hours and minimizing business impact.
Collaborated with cross-functional teams to design and deploy Terraform best practices, improving infrastructure stability.
Created Snowflake schemas and tables to support business intelligence reporting, increasing report generation speed.
Automated data loading processes into Amazon Redshift, reducing manual effort by 50% and improving data accuracy.
Utilized Power BI to create interactive dashboards and comprehensive reports, enabling stakeholders to gain actionable insights from complex datasets.
Ford Motor Dearborn, MI, USA Data Engineer Jan '18 - Apr '20
Implemented Python monitoring scripts to track ETL pipeline performance metrics, improving monitoring efficiency.
Designed and implemented database schema changes using SQL Workbench, improving data retrieval efficiency.
Managed container orchestration with Kubernetes for Docker workloads, improving system reliability.
Expertly leveraged Delta Live Tables Streaming and Delta Lake to perform real-time data processing, enhancing insights extraction from streaming sources.
Played a key role in maintaining and administering big data technologies like Hadoop, Hive, and Spark to ensure smooth operations and performance.
Exposed microservices as REST APIs using Spring Boot with Spring MVC.
Implemented JMS for asynchronous information exchange over reliable channels, using ActiveMQ as the message queue.
Performed optimization tuning on Apache Hive clusters, improving overall system performance.
Migrated legacy data systems to Azure cloud infrastructure, reducing maintenance costs.
Implemented Apache Atlas to provide metadata management and governance capabilities, resulting in a 30% increase in data visibility and lineage tracking.
Automated monitoring of Fivetran pipelines using custom scripts, reducing manual monitoring effort.
Established communication channels between data engineering and business teams, reducing data request turnaround time.
Education
Touro University
Bachelor of Science (BS) Cybersecurity and Networking
Projects
Customer Loyalty Program Analytics for 7-Eleven
Designed a data platform for 7-Eleven's customer loyalty program to analyze customer behavior and enhance retention strategies. Developed ETL processes using AWS Glue and Kinesis with Lambda to integrate data from loyalty cards, mobile apps, and in-store purchases into an AWS Redshift data warehouse. Employed SQL and Python for data analysis and created machine learning models to segment customers and predict churn. Implemented Tableau dashboards to visualize key metrics and inform marketing campaigns.
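For illustration, a minimal sketch of the Lambda step that consumed Kinesis loyalty events in this pipeline might look as follows; the event fields (member_id, store_id, channel, amount) are hypothetical, not the production schema:

```python
import base64
import json

def handler(event, context):
    """Normalize loyalty events delivered by a Kinesis trigger."""
    cleaned = []
    for record in event["Records"]:
        # Kinesis record payloads arrive base64-encoded
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        cleaned.append({
            "member_id": payload.get("member_id"),        # assumed field names
            "store_id": payload.get("store_id"),
            "channel": payload.get("channel", "in_store"),  # loyalty card, app, or POS
            "amount": float(payload.get("amount", 0)),
        })
    # In the real pipeline, these rows would be staged (e.g., to S3)
    # for a COPY into the Redshift warehouse.
    return {"processed": len(cleaned)}
```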
Merck: Drug Manufacturing Data Pipeline Optimization
Led the optimization of data pipelines using AWS Glue to process manufacturing data from multiple facilities, reducing data latency by 30%. Deployed Amazon Redshift as a scalable data warehouse for storing and analyzing large datasets related to drug production. Implemented SQL query optimization techniques to improve reporting efficiency for production metrics. Utilized AWS Lambda for event-driven processing, automating notifications and data updates across the system. Built Looker dashboards to provide visibility into key manufacturing KPIs, enabling better decision-making on production processes.
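As an illustration of the Glue-based processing described above, a minimal PySpark job sketch follows; the catalog database ("manufacturing"), table ("batch_records"), column mappings, and S3 staging path are placeholders, not the actual project configuration:

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw manufacturing records from the Glue Data Catalog (hypothetical names)
raw = glue_context.create_dynamic_frame.from_catalog(
    database="manufacturing", table_name="batch_records")

# Keep and rename only the fields needed for production reporting
mapped = ApplyMapping.apply(
    frame=raw,
    mappings=[("batch_id", "string", "batch_id", "string"),
              ("facility", "string", "facility_id", "string"),
              ("produced_at", "string", "produced_at", "timestamp"),
              ("units", "long", "units", "long")])

# Stage as Parquet on S3 for a subsequent Redshift COPY
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/staging/batches/"},
    format="parquet")

job.commit()
```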
Predictive Supply Chain Analytics for Ford Motor
Developed a predictive analytics solution for Ford Motor Company's supply chain optimization using Python and AWS SageMaker. Integrated data from ERP systems, suppliers, and logistics into a centralized data warehouse on AWS Redshift. Created robust ETL pipelines to process and cleanse data, ensuring high data quality and consistency. Enabled accurate demand forecasting and inventory management by leveraging advanced analytics and machine learning models on AWS SageMaker. Streamlined data migration processes, improving data accessibility and reliability across the organization, ultimately achieving significant operational cost reductions and enhanced decision-making capabilities.
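A minimal sketch of the kind of demand-forecasting model this project describes, shown with scikit-learn for brevity (on SageMaker the same model would run in a hosted training job); the input file and feature columns are assumptions:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical feature extract pulled from the Redshift warehouse
df = pd.read_csv("demand_features.csv")  # assumed columns: lag_1, lag_4, supplier_lead_days, demand

X = df[["lag_1", "lag_4", "supplier_lead_days"]]
y = df["demand"]

# Time-ordered split (no shuffling) so the test set is the most recent period
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False, test_size=0.2)

model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```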
Skills
Programming Languages (Python · SQL · Java · Scala)
Data Warehousing Solutions (Amazon Redshift · Google BigQuery · Snowflake)
ETL Tools (Apache NiFi · Talend · AWS Glue · Informatica)
Big Data Technologies (Hadoop · Apache Spark · Apache Flink)
Data Pipelines (Apache Airflow · Luigi · Prefect)
Cloud Platforms (AWS · Google Cloud Platform · Microsoft Azure)
Containerization (Docker · Kubernetes)
Data Lakes (Apache Hive · Delta Lake · Apache Hudi)
Databases (MySQL · PostgreSQL · MongoDB · Cassandra)
Version Control System (Git) · CI/CD Tools (Jenkins · CircleCI · Travis CI)
Data Modeling Tools (Erwin Data Modeler · IBM InfoSphere Data Architect)
Real-time Data Processing (Apache Kafka · RabbitMQ · Amazon Kinesis)
Monitoring and Logging (Prometheus · Grafana · ELK Stack)
Data Governance Tools (Apache Atlas · Collibra · Alation)
Data Integration Platforms (Fivetran · Stitch · Segment)
SQL Query Tools (DBeaver · SQL Workbench · Toad)
Data Visualization Tools (Tableau · Power BI · Looker)
Infrastructure as Code (Terraform · CloudFormation)
DevOps Tools (Ansible · Puppet · Chef) · Linux · Problem Solving · Statistical Analysis · Critical Thinking · Scripting · Communication