
Data Engineer Quality

Location:
Bloomington, IN
Salary:
90000
Posted:
September 18, 2024


Resume:

Sreyas Sawant

Carrollton, TX *****

463-***-**** | *************@*****.*** | linkedin.com/in/sreyas-sawant-5b7a8118b | github.com/sreyas0304

Summary

Data Engineer with proven expertise in designing and optimizing large-scale ETL pipelines, data modeling, and database management. Proficient in programming languages such as Python and SQL, with experience developing scalable data infrastructure on cloud services such as AWS and Snowflake. Skilled in managing data quality, collaborating with cross-functional teams, and thriving in fast-paced environments to solve complex business problems at scale.

Skills & Certifications

Languages: Python, R, Scala, SQL, NoSQL, Shell Scripting

Developer Tools: VS Code, Anaconda, Tableau, Power BI, Azure, AWS, GitHub, Docker

Databases: PostgreSQL, MySQL, Oracle, MongoDB, SQLite, Neo4j, Vector Databases

Technologies/Frameworks: REST APIs, Pandas, NumPy, Scikit-Learn, ETL, Hadoop, Hive, Airflow, Spark, Kafka, Databricks, CI/CD, Agile Methodologies, Snowflake, OpenAI, LangChain

AWS Services: EC2, RDS, Athena, Redshift, S3, Kinesis, EMR, Lambda, Glue, QuickSight, Step Functions, SQS

Certifications: AWS Data Engineer Associate (In Progress), AWS Cloud Practitioner Essentials, Big Data 101, Data Warehouse Essentials (Snowflake), Data Engineering Essentials (Snowflake), Data Analytics Essentials, Tableau Essentials Training, Lakehouse Fundamentals

Experience

CrowdDoing Jun 2024 – Present

Data Analytics Engineer San Francisco, US

• Architected and deployed an end-to-end pipeline using AWS S3, Docker, and Apache Airflow for scalable storage and processing of book data and metadata, contributing to the architecture and design of large-scale data pipelines.

• Managed data quality by writing automated scripts for data checks at various pipeline stages, leading to a 40% improvement in the accuracy of extracted context and quotes.

• Integrated PySpark for large-scale text extraction and Ollama for high-quality vector embeddings, improving the ability to analyze book content.

• Built a knowledge graph in Neo4j to store vector embeddings and relationships, and used OpenAI APIs with LangChain to develop RAG models for comprehensive contextual search, significantly enhancing the application's search functionality.
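The automated data-quality checks described above can be sketched as a small validation layer run between pipeline stages. This is a minimal, hypothetical illustration: the field names (book_id, extracted_text, page) and rules are assumptions, not the actual schema.

```python
# Hypothetical data-quality checks run between pipeline stages.
# Record fields and thresholds are illustrative only.

def validate_record(record: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the record passes."""
    problems = []
    if not record.get("book_id"):
        problems.append("missing book_id")
    text = record.get("extracted_text", "")
    if len(text.strip()) < 20:
        problems.append("extracted_text too short to be usable context")
    if record.get("page") is not None and record["page"] < 1:
        problems.append("page number out of range")
    return problems

def filter_valid(records: list[dict]):
    """Split a batch into clean records and rejects paired with their reasons."""
    clean, rejected = [], []
    for rec in records:
        issues = validate_record(rec)
        if issues:
            rejected.append((rec, issues))
        else:
            clean.append(rec)
    return clean, rejected
```

Rejected records can be routed to a quarantine location for inspection rather than silently dropped, which is what makes accuracy improvements measurable.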

Indiana University Jun 2023 – May 2024

Research Data Analyst Bloomington, US

• Designed and implemented a relational database to store and manage spectroscopy and DNA experiments data from animal tusk samples.

• Utilized Python and machine learning algorithms (K-means, DBSCAN, PCA) to cluster and analyze data, enabling efficient identification of elephant tusks for anti-poaching initiatives.
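The K-means approach above can be illustrated with a toy, dependency-free implementation. The 2-D points and fixed initial centroids below are made up for the sketch; the actual work used scikit-learn on spectroscopy and DNA features.

```python
# Toy K-means (assignment + update steps) in pure Python.
# Points and initial centroids are illustrative, not real tusk data.

def kmeans(points, centroids, iters=10):
    """Cluster 2-D points around the given initial centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters
```

With scikit-learn the same idea is `KMeans(n_clusters=k).fit(X)`, with DBSCAN as a density-based alternative when cluster counts are unknown.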

• Developed dashboards and visualizations in Tableau and ArcGIS to uncover geographical patterns, and built a machine learning model with 95% accuracy using Python Scikit-Learn for elephant ivory classification.

Education

Indiana University Aug. 2022 – May 2024

Master of Science in Computer Science Bloomington, US

Thakur College of Engineering and Technology Aug. 2018 – May 2022

Bachelor of Engineering in Computer Engineering Mumbai, IN

Projects

Real-time Streaming Data Pipeline Python, Kafka, Spark, Docker, AWS, Tableau June 2024

• Architected and optimized a real-time data pipeline using Apache Kafka and Spark, achieving a 50% increase in data processing speed. Utilized AWS S3, Glue, and Redshift for efficient data storage and transformation, enabling advanced analysis with Tableau.
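The Spark side of a Kafka pipeline like this typically performs a keyed, tumbling-window aggregation over the event stream. Below is a pure-Python sketch of that logic only; the event fields ("ts", "key") are hypothetical, and in Spark Structured Streaming the equivalent is a `groupBy(window(...), col("key")).count()`.

```python
from collections import defaultdict

# Pure-Python sketch of the keyed, tumbling-window aggregation a Spark
# Structured Streaming job would run over Kafka events.
# Event schema ({"ts": epoch_seconds, "key": str}) is illustrative.

def tumbling_window_counts(events, window_secs=60):
    """Count events per (key, window_start) over fixed non-overlapping windows."""
    counts = defaultdict(int)
    for e in events:
        window_start = (e["ts"] // window_secs) * window_secs
        counts[(e["key"], window_start)] += 1
    return dict(counts)
```

Aggregated windows would then land in S3 for Glue/Redshift and downstream Tableau analysis.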

End-to-End Weather Data Ingestion Pipeline Python, Airflow, AWS (S3, CodeBuild, Glue, Redshift) May 2024

• Implemented a scalable ETL pipeline for weather data ingestion using Apache Airflow and AWS, enhancing data access and visualization capabilities. Integrated CI/CD tooling (Git, AWS CodeBuild) for seamless deployment and workflow management.
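The transform step of a weather-ingestion DAG like this flattens a raw API payload into tabular rows ready for S3 and Redshift. The payload shape below is a hypothetical one, loosely modeled on common weather APIs, not the project's actual source.

```python
# Sketch of an Airflow-task transform: raw weather API JSON -> flat row.
# Payload keys ("name", "weather", "main", "dt") are assumed for illustration.

def kelvin_to_f(k: float) -> float:
    """Convert Kelvin to Fahrenheit, rounded to one decimal place."""
    return round((k - 273.15) * 9 / 5 + 32, 1)

def transform_weather(payload: dict) -> dict:
    """Flatten a nested API payload into a single load-ready record."""
    return {
        "city": payload["name"],
        "description": payload["weather"][0]["description"],
        "temp_f": kelvin_to_f(payload["main"]["temp"]),
        "feels_like_f": kelvin_to_f(payload["main"]["feels_like"]),
        "humidity_pct": payload["main"]["humidity"],
        "observed_at": payload["dt"],
    }
```

In Airflow this would sit between an extract task (API call) and a load task (write to S3, then COPY into Redshift), so each stage stays independently retryable.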

ETL and Analysis on Amazon’s Mobile Sales Orders SQL, Snowflake, Spark, AWS (S3, IAM) November 2023

• Executed end-to-end ETL process using Snowflake and SQL, implementing data flow from three regions, and performed trend analysis showing a 10% annual growth. Utilized Snowflake’s features to optimize data handling.


