Sama Srujana
Email: ******************@*****.*** Mobile: 940-***-**** Location: Lewisville, TX
PROFESSIONAL SUMMARY
6+ years of experience in Data Engineering and Big Data technologies.
Proficient in setting up Change Data Capture (CDC) for relational and NoSQL databases, enabling reliable data hydration into data lakes.
Extensive hands-on experience with Apache Spark for both streaming and batch ETL transformations, utilizing Spark DataFrames and Spark SQL.
Strong command of AWS services, including S3, EMR, Glue Data Catalog, and Lambda functions, to enhance data processing and analytics workflows.
Demonstrated ability to optimize data pipeline performance and improve ETL processes through performance tuning, partitioning, and other big data optimization techniques.
Experienced in orchestrating CDC ingestion workflows using Apache Airflow, ensuring reliable and efficient data movement.
Proven track record of delivering actionable insights by transforming complex datasets into queryable formats for analytics.
SKILLS
Programming Languages: Java, Python, Scala
Frameworks & Libraries: Apache Spark (Spark SQL, Spark Streaming), Apache Airflow, Apache Hudi, Apache Griffin
Databases & Data Warehousing: Relational Databases, NoSQL Databases, Data Lakes
Big Data & Streaming: Apache Kafka, Apache Flink, CDC Tools (Debezium)
Cloud Platforms: AWS (S3, EMR, Glue Data Catalog, Lambda, Step Functions, AWS Batch)
DevOps & CI/CD Tools: Docker, Kubernetes, Jenkins, Git
Testing & QA: Unit Testing, Integration Testing, Data Quality Checks
Security & Compliance: Data Governance, Security Best Practices, Compliance Standards
Monitoring & Observability: AWS CloudWatch, Grafana, ELK Stack
Collaboration Tools: JIRA, Confluence, Slack
Documentation Tools: Markdown, Confluence
Operating Systems: Linux, Windows
WORK EXPERIENCE
Privia Health – Arlington, VA
Senior Data Engineer – Jan 2024 to Present
Spearheaded the implementation of Change Data Capture (CDC) processes across various databases, enabling efficient data hydration into the data lake and improving accessibility for analytics teams.
Engineered robust ETL pipelines using Apache Spark, successfully transforming and processing both streaming and batch data, resulting in a 30% reduction in processing time.
Optimized data workflows by leveraging AWS services, including S3 for storage and EMR for scalable data processing, enhancing overall system performance and reliability.
Collaborated with cross-functional teams to design and implement data models that support analytical needs, yielding actionable insights and business intelligence.
Automated data pipeline orchestration using Apache Airflow, reducing manual intervention and enhancing workflow efficiency by 40%.
Conducted performance tuning for Spark applications, achieving significant improvements in query execution times and resource utilization.
Developed Python-based Lambda functions to automate data ingestion processes, further streamlining data workflows and increasing operational efficiency.
Mentored junior engineers on best practices for data engineering and cloud technologies, fostering a culture of continuous learning and innovation.
Conducted regular data quality assessments to ensure the integrity and accuracy of data in the data lake, aligning with compliance standards.
Created detailed documentation of data engineering processes and workflows, facilitating knowledge transfer and team alignment.
Technologies Used: Java, Python, Apache Spark, AWS S3, AWS EMR, Apache Airflow, AWS Lambda, Spark SQL, Spark Streaming, Data Lakes, CDC Tools
Webster Bank – Stamford, CT
Big Data Engineer – May 2021 to Jul 2023
Optimized ETL processes utilizing Apache Spark and CDC methodologies, facilitating the movement of large datasets into a centralized data lake for enhanced analytics capabilities.
Developed and maintained data pipelines using Spark DataFrames and Spark Streaming, enabling real-time data processing and analytics for business users.
Implemented AWS Glue Data Catalog for metadata management, significantly improving data discoverability and governance across the organization.
Collaborated with data scientists and analysts to define data requirements and deliver high-quality datasets for machine learning and reporting purposes.
Automated data quality checks and validation processes, improving the accuracy of data by 25% and ensuring compliance with data governance policies.
Led the migration of legacy data systems to modern cloud-based architectures, resulting in a 50% reduction in operational costs and improved scalability.
Utilized Apache Airflow to orchestrate complex ETL workflows, ensuring timely and reliable data delivery for analytics applications.
Enhanced data retrieval performance by implementing indexing and partitioning strategies within the data lake environment.
Participated in cross-functional Agile ceremonies to deliver incremental improvements to data engineering practices and tools.
Provided training and support for team members on big data technologies and tools, fostering a collaborative team environment.
Technologies Used: Java, Python, Apache Spark, AWS Glue, S3, Apache Airflow, Data Lakes, CDC Tools, Spark Streaming, ETL Processes, Big Data Concepts
Ross Stores – Dublin, CA
Data Engineer – Mar 2019 to Apr 2021
Developed and maintained ETL processes for data integration from multiple sources into a centralized data warehouse, increasing data availability for business intelligence reporting.
Implemented data transformations using Apache Spark, enhancing data quality and ensuring compliance with business requirements.
Collaborated with data analysts to gather requirements and deliver high-performance data solutions tailored to their analytical needs.
Conducted performance optimization of existing data pipelines, resulting in a 35% improvement in processing times and reduced resource consumption.
Integrated data quality monitoring tools to ensure data accuracy and reliability, significantly reducing data-related errors in reporting.
Participated in the design and implementation of data models for analytics, improving data accessibility and usability for end-users.
Assisted in the migration of data processes to AWS, leveraging cloud technologies for improved scalability and performance.
Engaged in troubleshooting and resolving data-related issues, ensuring minimal disruption to business operations.
Documented data engineering processes and best practices to enhance team efficiency and knowledge sharing.
Contributed to the establishment of data governance frameworks to ensure data integrity and compliance across departments.
Technologies Used: Java, Python, Apache Spark, ETL Processes, Data Warehousing, AWS, Data Lakes, Data Quality Monitoring, Data Integration, Performance Optimization
CERTIFICATIONS
• Azure Data Engineer Associate (DP-203)
• AWS Certified Solutions Architect - Associate
• Databricks Certified Data Engineer Associate
EDUCATION
Master's in Information Systems
Trine University, GPA: 4.0