
AWS Data Engineer

Location:
Fort Worth, TX
Salary:
99999
Posted:
February 24, 2025


Sandesh Poudel

[**** Vista Mar Dr, Euless, Texas] [445-***-****] [*************@*****.***]

PROFESSIONAL SUMMARY

More than 7 years of professional experience delivering scalable data engineering solutions, including large-scale data processing, data modeling, and ETL/ELT pipelines.

Expertise in building large-scale data pipelines using AWS services such as Glue (ETL workflows), Redshift (data warehousing), S3 (data lakes), and Lambda (serverless solutions).

Experienced in building scalable ETL workflows for large-scale data processing and in using Snowflake for data warehousing, migration, and analytics.

Skilled in Scala, Python (Pandas, NumPy, PySpark), and advanced SQL for real-time analytics, data analysis, and manipulation.

In-depth understanding of the Hadoop ecosystem, including how to efficiently process and manage massive datasets using tools like Apache Spark, Hive, HBase, and MapReduce.

Highly skilled in workflow orchestration with Apache Airflow, automating, monitoring, and scaling complex data pipelines across many data sources.

Working knowledge of Apache Kafka for messaging and real-time data streaming, ensuring dependable data ingestion and integration across distributed systems.

Proficient in using Terraform and AWS CloudFormation to fully automate the provisioning of cloud infrastructure, enabling scalable and consistent deployments in multi-account scenarios.

Specialized in streamlining development processes and ensuring code quality using CI/CD pipelines with tools such as Git, GitLab, Jenkins, and Azure DevOps.

Fluent in Agile project management, utilizing Jira and Kanban to improve teamwork, output, and visibility.

Experienced in workflow scheduling and coordination tools like Oozie and Zookeeper, as well as data ingestion frameworks like Sqoop and Flume, to ensure dependable and effective data pipeline operations.

TECHNICAL SKILLS

Operating Systems: Unix, Linux, Mac OS, Windows, Ubuntu

Hadoop Ecosystem: HDFS, MapReduce, Yarn, Apache Spark, Hive, HBase, Zookeeper, Oozie, Sqoop, Flume, Cloudera

Big Data Technologies: Apache Spark, MapReduce, YARN, Hive, Pig, Sqoop, Oozie

Amazon Web Services: EC2, Lambda, EMR, Glue, S3, Redshift, RDS, DynamoDB, Kinesis, Elastic Beanstalk, ECS, CloudWatch, ELB, VPC, Athena

Cloud Computing Tools: Snowflake, AWS, Databricks, Azure, Amazon EC2, EMR, S3

NoSQL Databases: HBase, Cassandra, MongoDB

Methodologies: Agile, Scrum, Waterfall

Programming Languages: Python (Pandas, NumPy, PySpark), Scala, Java, SQL, PL/SQL

Databases: Snowflake, AWS Redshift, Oracle, Teradata 12/14, MySQL, SQL Server, PostgreSQL 9.3, Netezza, Amazon RDS

SQL Server Tools: SQL Server Management Studio, Enterprise Manager, Query Analyzer, Profiler, Export and Import (DTS)

Workflow Orchestration Tools: Apache Airflow, Oozie, Zookeeper

CI/CD and Version Control: Git, GitLab, Jenkins, Azure DevOps

Real-Time Streaming Tools: Apache Kafka

Infrastructure Automation: Terraform, AWS CloudFormation

Project Management Tools: Jira, Kanban

PROFESSIONAL EXPERIENCE

Highmark Health (Pittsburgh, PA)

Role: AWS Data Engineer

May 2022-Present

Responsibilities:

Designed and developed ETL pipelines using Informatica PowerCenter, Python, and Shell Scripting for data extraction, transformation, and loading, optimizing processing speed and data integrity.

Implemented scalable Big Data processing on AWS EMR with Hadoop clusters, utilizing EC2 for computation and S3 for storage, achieving cost-effective and efficient processing.

Automated data workflows with Apache Airflow, ensuring reliable scheduling, monitoring, and execution of complex ETL processes (a minimal DAG sketch follows this section).

Developed Python scripts for data cleaning, transformation, and analysis, improving data quality and automating repetitive data processing tasks.

Designed and optimized data warehouses using Amazon Redshift and PostgreSQL for large-scale analytics, focusing on query optimization and scalability.

Implemented data ingestion frameworks with Sqoop and Flume for seamless transfer between Hadoop and relational databases.

Built serverless data pipelines using AWS Lambda and Glue, minimizing infrastructure overhead and enabling real-time data processing.

Created and managed data lakes on AWS S3, facilitating structured and unstructured data storage with lifecycle policies and access controls.

Designed event-driven architectures using AWS CloudWatch, S3, and DynamoDB, enabling real-time data processing and seamless service integration.

Implemented CI/CD pipelines with Jenkins, Docker, and Kubernetes, streamlining deployments for data projects across multiple environments.

Managed schema-less databases like MongoDB and DynamoDB, optimizing collection design for real-time applications and analytics.

Utilized Snowflake and Redshift for data warehousing solutions, employing Snowpipe to automate semi-structured data ingestion for near real-time analytics.

Developed interactive dashboards in Kibana, enabling stakeholders to track performance metrics and gain actionable insights.

Enhanced infrastructure automation with Terraform and AWS CloudFormation, streamlining setup and configuration processes for scalable and reliable systems.

Environment: Informatica PowerCenter, Python, Shell Scripting, AWS (EMR, EC2, S3, Lambda, Glue, Redshift, CloudWatch, DynamoDB, CloudFormation), Hadoop Ecosystem (Hadoop, Sqoop, Flume), Apache Airflow, PostgreSQL, MongoDB, Snowflake, Jenkins, Docker, Kubernetes, Terraform, Kibana.
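
The following is an illustrative sketch only of the Airflow orchestration pattern noted above: it assumes Airflow 2.x, preconfigured boto3 credentials, and a hypothetical Glue job named clean_claims_data; it is not the production Highmark pipeline.

from datetime import datetime, timedelta

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator


def trigger_glue_job(**_):
    # Start the hypothetical Glue ETL job and return its run id for logging.
    glue = boto3.client("glue", region_name="us-east-1")
    response = glue.start_job_run(JobName="clean_claims_data")
    return response["JobRunId"]


with DAG(
    dag_id="daily_claims_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    # Single task for brevity; real pipelines chain validation and load steps.
    start_glue_job = PythonOperator(
        task_id="start_glue_job",
        python_callable=trigger_glue_job,
    )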

BOK Financial (Tulsa, OK)

Role: AWS Data Engineer

Jun 2019-Mar 2022

Responsibilities:

Developed and managed ETL pipelines using AWS Glue and PySpark, enabling the seamless ingestion, transformation, and storage of structured and unstructured data from diverse sources (an illustrative job sketch follows this section).

Configured and scheduled AWS Glue Crawlers to automatically catalog datasets stored in S3, ensuring updated metadata and schema discovery for dynamic ETL processes.

Implemented robust data validation processes within AWS Glue ETL jobs, ensuring high data quality before ingestion into Redshift and other analytics platforms.

Created and maintained Snowflake SQL scripts, User-Defined Functions (UDFs), and stored procedures for complex data transformations, aggregations, and analyses.

Designed and improved relational database schemas in MySQL, PostgreSQL, and Oracle to support both transactional (OLTP) and analytical (OLAP) processes, creating advanced SQL queries for data retrieval and transformation.

Built and automated ETL/ELT pipelines in Python, streamlining data integration and reducing manual tasks for better scalability.

Optimized data storage solutions using Amazon S3 and Redshift, ensuring high-performance data warehousing and efficient resource utilization for big data analytics.

Utilized AWS services such as Lambda for serverless automation, S3 for data storage, and CloudWatch for monitoring ETL jobs and performance metrics. Deployed and configured SNS for notifications and alerts during data pipeline operations.

Utilized Apache Spark for processing and analyzing large-scale datasets, leveraging in-memory computing capabilities for advanced text analytics and data transformations.

Integrated various third-party data sources and APIs with AWS data pipelines, automating data retrieval, transformation, and loading into AWS environments.

Deployed CI/CD pipelines for AWS data engineering workflows using Jenkins, automating code deployment, testing, and updates to AWS Glue and Lambda functions.

Performed fine-tuning of AWS Glue jobs, adjusting memory and execution times to ensure optimal data processing performance with minimal resource overhead.

Environment: AWS Glue, PySpark, AWS Lambda, Amazon S3, Amazon Redshift, Snowflake, MySQL, PostgreSQL, Oracle, AWS CloudWatch, Jenkins, Apache Spark.
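
The sketch below shows the general shape of the Glue/PySpark ETL jobs described in this role. It is a hedged example: the catalog database raw_db, table transactions, column names, and S3 output path are hypothetical placeholders, not actual BOK Financial assets.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the source table cataloged by a Glue Crawler and convert to a DataFrame.
source = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="transactions"
).toDF()

# Basic validation: drop rows missing keys, cast amounts, reject negative values.
clean = (
    source.dropna(subset=["transaction_id", "account_id"])
    .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
    .filter(F.col("amount") >= 0)
)

# Write partitioned Parquet to S3 for downstream loading into Redshift.
clean.write.mode("overwrite").partitionBy("transaction_date").parquet(
    "s3://example-curated-bucket/transactions/"
)

job.commit()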

Centene Corporation (St. Louis, MO)

Big Data Developer

Sep 2017-Apr 2019

Responsibilities:

Processed and analyzed large-scale datasets using Hadoop Distributed File System (HDFS) and Apache Hive for efficient querying and data summarization.

Developed batch and real-time data processing workflows using Apache Spark and Kafka, enabling seamless data ingestion and transformation (a streaming sketch follows this section).

Automated data ingestion pipelines using tools like Flume and Sqoop, ensuring the transfer of data between Hadoop and relational databases.

Built and maintained scalable distributed systems using Hadoop Ecosystem tools such as MapReduce, HBase, and Pig to handle diverse data processing needs.

Designed and implemented NoSQL database solutions using HBase, enabling efficient storage and retrieval for semi-structured and unstructured data.

Developed and fine-tuned Hive queries for ETL operations, optimizing query performance on large datasets.

Integrated Hadoop and cloud-based platforms like AWS S3 and EMR, enabling flexible and cost-efficient big data processing workflows.

Implemented job orchestration workflows using Apache Oozie to schedule and monitor Hadoop jobs, ensuring timely execution and minimal downtime.

Utilized Zookeeper for distributed application coordination, ensuring high availability and fault tolerance in big data applications.

Leveraged Python libraries for data preprocessing and validation, ensuring high data quality and consistency before loading into the data lake.

Developed data integration solutions using Apache NiFi to facilitate seamless data flow between systems in real time.

Collaborated with data analysts and scientists to understand requirements, design solutions, and deliver data pipelines tailored for machine learning and analytics.

Deployed and monitored Hadoop clusters in a production environment, managing resource allocation and troubleshooting issues to ensure uninterrupted data processing.

Environment: Hadoop Ecosystem (HDFS, Hive, MapReduce, HBase, Pig), Apache Spark, Apache Kafka, Flume, Sqoop, AWS S3, AWS EMR, Apache Oozie, Zookeeper, Apache NiFi, Python
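
As a hedged illustration of the Spark-plus-Kafka streaming pattern referenced in this role, the sketch below reads JSON events from a hypothetical Kafka topic member_events and lands them in HDFS as Parquet for Hive querying; broker addresses, schema fields, and paths are assumed placeholders.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("member-events-stream").getOrCreate()

# Expected shape of each JSON event on the hypothetical topic.
event_schema = StructType([
    StructField("member_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

# Ingest events from Kafka and parse the raw value bytes into typed columns.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "member_events")
    .load()
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Land the parsed stream in HDFS as Parquet for downstream Hive queries.
query = (
    events.writeStream.format("parquet")
    .option("path", "hdfs:///data/curated/member_events/")
    .option("checkpointLocation", "hdfs:///checkpoints/member_events/")
    .outputMode("append")
    .start()
)

query.awaitTermination()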

EDUCATION

Bachelor’s in Business Administration - Kist College and SS., Naxal, Kathmandu, 2021

Master’s in Data Analytics - Wright State University, Fairborn, Ohio, 2024


