Sri Lakshmi P
AWS DATA ENGINEER
Email: ****************@*****.*** Phone: +1-469-***-****
PROFESSIONAL SUMMARY:
●AWS Data Engineer with 7+ years of hands-on experience designing and delivering scalable, reliable, and secure data engineering solutions across modern cloud and big data ecosystems.
●Skilled in architecting and implementing complex end-to-end data pipelines, ensuring high availability, performance optimization, and compliance with strict industry standards.
●Adept at collaborating with diverse stakeholders, documenting technical workflows, and translating business requirements into optimized technical data solutions that drive decision-making.
●Implemented ETL and ELT strategies using AWS Glue, Redshift SQL, and Informatica, significantly improving governance, automation, and enterprise reporting efficiency.
●Built scalable production pipelines in Databricks with PySpark and Delta Lake, processing billions of records daily and enabling analytics teams with curated datasets.
●Designed advanced Snowflake dimensional models with star and snowflake schemas, applied clustering and micro-partitioning, and enabled continuous ingestion with Snowpipe and historical data access with Time Travel.
●Automated recurring ingestion, preprocessing, and data quality checks with Python, Pandas, NumPy, and Boto3, packaging reusable modules for logging, validation, and exception handling.
●Designed distributed Hadoop batch workflows, managed large-scale data zones in HDFS, and applied MapReduce, Hive, and Pig for complex high-volume aggregations and reporting.
●Migrated structured datasets into Hadoop with Sqoop, captured unstructured application data with Flume, coordinated dependencies with Oozie, and stabilized clusters with Zookeeper.
●Implemented high-performance HBase tables for low-latency queries and integrated AWS SQS and SNS for event-driven pipelines supporting real-time applications.
●Optimized Spark and Spark SQL pipelines with broadcast joins, partition pruning, and adaptive execution, reducing runtime by up to 35% on large datasets.
●Developed robust PySpark and Scala applications for both batch and streaming workloads, applying caching, checkpoints, and incremental processing techniques.
●Integrated Apache Kafka with Spark Structured Streaming to deliver near real-time ingestion pipelines with guaranteed exactly-once semantics and reduced latency.
●Orchestrated critical data dependencies with Apache Airflow, scheduling 100+ DAGs and enforcing SLAs for production-ready enterprise pipelines.
●Built and optimized AWS Glue ETL jobs with schema mapping and business rule enforcement, and processed terabyte-scale data on EMR clusters tuned for cost and performance.
●Designed and tuned Redshift clusters using WLM, distribution/sort keys, and Spectrum integration, improving query performance and concurrency across multiple workloads.
●Partnered with analysts and business users to perform in-depth data analysis, document lineage, and maintain reproducible workflows across projects and teams.
●Modeled and optimized relational schemas in MySQL, Oracle, and PostgreSQL for operational and BI reporting; implemented MongoDB and DynamoDB for flexible and low-latency workloads.
●Delivered governed cloud-native data lakes on AWS S3 with Glue Catalog, Athena, Step Functions, and Lambda, enabling reliable, cost-effective analytics and reporting.
●Applied proactive monitoring and optimization with CloudWatch dashboards, alarms, and S3 lifecycle policies to improve system reliability and reduce storage costs.
●Automated cloud infrastructure with reusable Terraform modules for consistent deployment of Redshift, Glue, EMR, and IAM resources across environments.
●Built enterprise-grade CI/CD pipelines with Jenkins, Git, and Bitbucket, integrating quality checks with PyTest, JUnit, and Maven to ensure consistent and reliable releases.
●Delivered solutions in Agile development environments with iterative sprints, retrospectives, and continuous improvement, demonstrating adaptability and rapid adoption of emerging frameworks.
●Strengthened data security and compliance by applying IAM best practices, encryption with KMS, fine-grained S3 access controls, and HIPAA-ready governance for regulated industries.
TECHNICAL SKILLS:
Big Data & Hadoop Ecosystem: Hadoop, HDFS, YARN, MapReduce, Hive, Pig, HBase, Sqoop, Flume, Oozie, Zookeeper, Impala, Spark (Core, SQL, Streaming, PySpark, Scala), Delta Lake, Kafka, Apache Airflow, Hue, HCatalog, NiFi, Cloudera, Kerberos
Cloud Platforms: Amazon Web Services (AWS), Azure, GCP
AWS Services: S3, Redshift, Glue, EMR, Lambda, Step Functions, Athena, Kinesis, DynamoDB, RDS, DMS, IAM, KMS, CloudWatch, CloudTrail, SQS, SNS, Lake Formation, QuickSight, SageMaker
ETL / ELT Tools: AWS Glue, Informatica PowerCenter, Databricks, Snowflake, Talend, dbt
Programming & Scripting: Python, SQL, Scala, HiveQL, Java, UNIX/Linux Shell Scripting
Data Modeling & Warehousing: Dimensional modeling (Star, Snowflake), ER modeling, Redshift, Snowflake, Oracle, PostgreSQL, MySQL, Teradata
NoSQL Databases: MongoDB, DynamoDB, Cassandra
Infrastructure as Code (IaC): Terraform, AWS CloudFormation
Containerization & Orchestration: Docker, Kubernetes, Amazon EKS
CI/CD & Version Control: Git, GitHub, Bitbucket, Jenkins, CodePipeline, Maven, GitHub Actions
Testing & Methodologies: PyTest, JUnit, Agile/Scrum, Sprint Planning, Retrospectives
Monitoring & Logging: AWS CloudWatch, ELK Stack, Prometheus, Grafana
Data Analysis & Libraries: Pandas, NumPy, Matplotlib, scikit-learn, Boto3, Seaborn, Plotly, SciPy, NLTK
PROFESSIONAL EXPERIENCE:
Client: Bremer Bank – St. Paul, MN | August 2023 – Present
Role: AWS Data Engineer
Responsibilities:
●Lead end-to-end AWS data engineering projects, from requirements and solution design through build, testing, release, documentation, SLAs, and governance.
●Implement a hybrid ETL approach with AWS Glue for extraction, cleansing, and schema mapping, and ELT with Redshift SQL for in-warehouse business rules, while modernizing legacy AWS Data Pipeline jobs into Step Functions and Airflow without disrupting SLAs.
●Build Databricks pipelines in PySpark on AWS for heavy transformations and feature engineering, tuning partitions, caching, and shuffle operations to reduce runtime and overall cost.
●Design Snowflake dimensional models with star and snowflake schemas, apply clustering and micro-partitioning, and operate Snowpipe with time travel for continuous data loads and audit accuracy.
●Schedule and maintain Informatica workflows that automate complex integrations across enterprise systems with deterministic data quality checks and validations.
●Author Python automation for ingestion, validation, reconciliation, and metadata updates using Pandas, NumPy, and Boto3, and package reusable libraries for logging, exception handling, and schema enforcement.
●Optimize Spark and Spark SQL with broadcast joins, adaptive execution, file sizing, and partition pruning, and deliver Scala-based jobs for windowed aggregations and incremental checkpoints.
●Manage raw, refined, and curated zones on S3 and HDFS with partitioning, compression, encryption, and retention policies to ensure analytics dependability and storage efficiency.
●Create Hive external tables on S3 and leverage MapReduce paths for long-horizon backfills and compliance-grade historical recalculations at scale.
●Offload relational systems with Sqoop and capture application logs/events with Flume, orchestrating workflows with Oozie and stabilizing distributed services with ZooKeeper.
●Operate HBase for low-latency lookups and decouple enterprise systems with SQS and SNS for event-driven processing and proactive alerts.
●Build and run Apache Airflow DAGs with retries, SLA sensors, task-level lineage, and on-failure notifications for clear operational visibility.
●Provision and tune EMR clusters with autoscaling and optimized spot/on-demand mix, right-sizing compute resources for cost and performance efficiency.
●Configure Glue crawlers and ETL jobs with schema mapping, deduplication, and business rule enforcement, publishing verified datasets to both Redshift and Snowflake for downstream analytics.
●Model and tune Amazon Redshift with workload queues, distribution/sort keys, Spectrum external tables, and disciplined vacuum and analyze cycles to improve concurrency and query speed.
●Integrate Apache Kafka with Spark Structured Streaming to land near real-time transaction feeds with exactly-once semantics via checkpoints and watermarking.
●Design DynamoDB tables and Streams to support serverless, low-latency access patterns and event triggers for downstream processing pipelines.
●Coordinate multi-step workflows with Step Functions sequencing Glue, Lambda, and EMR steps for resilient, observable, and cost-efficient orchestration.
●Execute low-downtime migrations with AWS DMS, validating change data capture into S3 and Redshift for continuous replication and consistency.
●Enable ad hoc analytics with Athena over Glue-cataloged S3 tables, eliminating cluster wait times and empowering true self-service queries.
●Profile data with SQL and Python and produce pipeline specifications, lineage diagrams, data dictionaries, and runbooks for ongoing support and audits.
●Design MySQL schemas with indexing/partitioning, optimize Oracle stored procedures for heavy reporting, and implement MongoDB collections with targeted indexes and TTL for governed retention.
●Enforce security with IAM roles and policies, KMS encryption, VPC endpoints, and fine-grained S3 controls, operate CloudWatch dashboards and alarms, and reduce storage spend with S3 lifecycle policies and Intelligent-Tiering.
●Provision reproducible infrastructure with Terraform modules and deliver safely with Jenkins and CodePipeline, managing code in Git and Bitbucket, standardizing builds with Maven, and releasing iteratively under Agile with PyTest and JUnit protecting critical paths.
Client: Quantum Health – Dublin, OH | November 2020 – August 2023
Role: AWS Data Engineer
Responsibilities:
●Designed secure, scalable AWS data pipelines for clinical analytics and claims reporting, enforcing HIPAA Privacy and Security Rules with clear SLAs, governance, and audit readiness.
●Implemented a hybrid ETL approach with AWS Glue for extraction, cleansing, and schema mapping, and ELT with Redshift SQL over S3-staged data to accelerate refresh cycles and simplify controls.
●Built Databricks PySpark notebooks on AWS with Delta Lake, applying ELT patterns for reliable ACID lakehouse writes and high-volume transformations and feature engineering.
●Developed Informatica workflows that automated complex source integrations with validation, deduplication, and reconciliation across core healthcare data feeds.
●Authored Python-based ELT scripts with PySpark and Pandas for transformation, aggregation, cleansing, and anomaly detection; packaged reusable modules with logging, schema enforcement, and exception handling.
●Created Lambda functions in Python and Glue workflows to orchestrate event-driven tasks, integrating with S3, Redshift, and Kinesis for serverless processing and near real-time enrichment.
●Optimized Spark and Spark SQL on EMR using adaptive query execution, broadcast joins, partitioning, and right-sized files to improve throughput and runtime stability on large transactional datasets.
●Provisioned and tuned EMR clusters with autoscaling and an optimized spot/on-demand mix, while managing Glue crawlers and ETL/ELT jobs with schema mapping, deduplication, and embedded data-quality gates.
●Operated Apache Airflow on MWAA for dependency-aware ELT, implementing retries, SLA monitoring, lineage annotations, and on-failure notifications for observability and recovery.
●Managed raw, refined, and curated zones on S3 and HDFS with partitioning, compression, encryption, and retention policies to safeguard PHI and support downstream analytics.
●Used Hive for interactive exploration, MapReduce for long-horizon backfills, and HBase for low-latency member and provider lookups that supported operational reporting.
●Ingested relational datasets with Sqoop and captured logs/events with Flume, scheduling legacy batches with Oozie and stabilizing distributed services via ZooKeeper.
●Delivered streaming ingestion pipelines by integrating Kafka via MSK and Kinesis with Spark Structured Streaming, using checkpoints and watermarking for exactly-once, near real-time processing.
●Profiled datasets with SQL and Python, performed anomaly detection, and produced pipeline specifications, lineage diagrams, and transformation documentation to satisfy audits and governance requirements.
●Designed and tuned relational schemas for MySQL and Oracle; implemented MongoDB and DynamoDB for semi-structured payloads with targeted indexes and TTL policies for governed retention.
●Built governed enterprise data lakes on S3 with Glue Catalog and Lake Formation, enabling row-level security, audited access, and curated publish layers for BI teams.
●Coordinated event-driven and multi-step ELT flows with SQS, SNS, Step Functions, and AWS Data Pipeline, sequencing Glue, Lambda, EMR, and streaming jobs end-to-end.
●Optimized Amazon Redshift using workload management settings, distribution and sort keys, Spectrum external tables, and disciplined vacuum and analyze cycles to raise concurrency and reduce query times.
●Provisioned repeatable AWS environments with Terraform modules for networking, IAM, EMR, Glue, Redshift, and secrets to improve consistency, scalability, and auditability.
●Implemented CI/CD pipelines with Jenkins and CodePipeline; managed code in Git and Bitbucket with Maven builds and Agile delivery practices; enforced quality with PyTest and JUnit on critical paths, while providing observability with CloudWatch metrics, dashboards, and QuickSight BI reporting.
Client: State Farm – St. Augustine, FL | July 2018 – November 2020
Role: Big Data Engineer
Responsibilities:
●Delivered analytics and reporting solutions as a Data Engineer, owning requirements, solution design, build, testing, release, documentation, and SLA-driven delivery on distributed platforms.
●Developed Python-based pipelines and utilities for ingestion, validation, cleansing, reconciliation, and metadata updates, packaging reusable modules for logging, exception handling, and schema enforcement.
●Wrote performant SQL for extraction, transformation, and analysis using window functions, joins, and aggregations; tuned queries with indexes, partitions, and statistics to reduce runtime.
●Designed and operated ETL workflows with clear dependencies and scheduling, implemented incremental loads and slowly changing dimensions, and automated data-quality gates before publish.
●Engineered Hadoop processing on multi-node clusters, managed HDFS layouts, replication, and retention, and tuned YARN resources for stable, high-throughput jobs.
●Built large-scale batch transformations with Hive and Spark SQL, used Pig and legacy MapReduce where appropriate, and optimized with partitioning, bucketing, and columnar formats.
●Ingested relational sources with Sqoop and captured application logs with Flume; coordinated distributed services using ZooKeeper for consistent, fault-tolerant operations.
●Implemented HBase for low-latency lookups on reference and session data, designing row keys and column families to support operational and analytical patterns.
●Developed PySpark and Scala jobs for heavy transformations, windowed calculations, and incremental checkpoints; applied caching, broadcast joins, and adaptive execution to improve throughput.
●Streamed events through Kafka, designed topics and retention policies, and built Spark Structured Streaming consumers to deliver near real-time feeds for dashboards and alerts.
●Modeled dimensional marts in Hive and Impala, loading via partitioned external tables and bulk inserts; optimized with Parquet or ORC, bucketing, and table statistics to reduce query latency.
●Designed and optimized Teradata star schemas with appropriate primary indexes and partitioning; maintained performance with collect stats and workload management rules to support concurrent users.
●Designed relational schemas and tuned SQL for MySQL and Oracle, including indexing, partitioning, and stored procedures to support staging, operational reporting, and downstream analytics.
●Established version control practices with Git and Bitbucket, implemented branching and pull-request reviews, and standardized builds and dependency management with Maven.
●Created unit and integration tests for Python and Spark using PyTest and custom data checks, embedding tests into the development workflow to protect critical paths.
●Produced clear technical documentation including pipeline specifications, data dictionaries, lineage diagrams, and runbooks, enabling rapid onboarding and reliable support handoffs.
EDUCATION:
Master’s in Computer Science from the University of North Texas, Denton, TX.