Data Engineer Big

Location:
United States
Salary:
120000
Posted:
September 10, 2025

Resume:

Sandeep Danda **************@*****.***

Data Engineer 678-***-**** LinkedIn GitHub Dallas, TX

Results-driven Data Engineer with 4+ years of experience architecting and optimizing data pipelines, ETL/ELT workflows, and data solutions in Agile environments. Expertise in big data (Spark, Hadoop), cloud platforms (AWS, Azure, GCP), and programming (Python, SQL), with an emphasis on data accuracy, scalability, and real-time analytics. Proven record of delivering cross-functional data infrastructure in biotech, MarTech, healthcare, banking, and finance, using hybrid cloud to turn data into business insights.

Tools & Technical Skills

●Big Data Technologies: Spark, Hadoop, HDFS, YARN, Hive, Sqoop, Oozie, HBase, Flume, MapReduce, Kafka, Zookeeper, Databricks, Databricks Unity Catalog, Informatica, Airflow, Oracle GoldenGate.

●Programming Languages: Python, PySpark, SQL, Shell.

●SQL/NoSQL Databases: Snowflake, MySQL, Oracle DB, Microsoft SQL Server, PostgreSQL, MongoDB, HBase, DynamoDB, Cassandra.

●Version Control & Tools: Jenkins (CI/CD), GitHub, GitHub Actions, Jira, Confluence, Splunk, Docker, Kubernetes, Terraform, Salesforce (API, Data Loader, Salesforce DX), REST APIs, GraphQL, Power BI, Tableau, Kibana, New Relic.

●Cloud Platforms: AWS (EC2, Lambda, S3, Athena, OpenSearch, Elasticsearch, Redshift, RDS, Aurora, EMR, Glue, Batch, CloudWatch, SNS, SQS, MSK), GCP (Cloud Functions, BigQuery, Cloud Storage, Cloud Pub/Sub, Cloud Dataflow, Cloud Composer, Cloud Monitoring, Cloud Logging), Azure (Virtual Machines, Blob Storage, SQL Database, Synapse Analytics, Data Factory, Data Lake Storage, Event Hubs).

Professional Experience

Capital One (Plano, Texas) Big Data/Machine Learning Engineer, IT May 2024 - Present

●Migrated 500+ on-premises data pipelines to a cloud-native architecture, reducing manual operations by 70%. Implemented AES-256 encryption and used Snowflake Zero-Copy Cloning to streamline development and QA (see the Snowflake sketch after this list).

●Containerized PySpark jobs using Docker and deployed them on AWS EC2 for scalable execution. Leveraged AWS Lambda for serverless data transformations, storing results in PostgreSQL, and implemented AWS CloudWatch for monitoring and logging (see the Lambda sketch after this list).

●Designed and orchestrated scalable ETL/ELT workflows using Databricks (PySpark) and Apache Airflow, ingesting raw data into AWS S3 and loading it into Snowflake using JDBC connectors (see the Airflow DAG sketch after this list).

●Leveraged advanced Snowflake features (Snowpipe, Time Travel, clustering keys, materialized views, virtual warehouse tuning) to achieve 45% faster query execution and 30% lower compute costs (see the Snowflake sketch after this list).

●Engineered scalable PySpark/Databricks Auto Loader pipelines to ingest and transform semi-structured data (JSON, XML, CSV, Avro) into Delta Lake with schema enforcement, ACID compliance, Time Travel, and optimized tiered (bronze, silver, gold) storage maintained via VACUUM (see the Auto Loader sketch after this list).

●Designed and deployed scalable ETL pipelines using AWS Glue and EMR to process structured and semi-structured data, enabling seamless integration with S3, Redshift, and Athena for downstream analytics (see the Glue sketch after this list).

●Delivered GenAI solutions with Databricks AI (Mosaic AI, LLMOps, Model Serving) and RAG pipelines for real-time inferencing. Improved observability and operational insight with New Relic and Power BI dashboards.
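
A minimal sketch of the serverless-transform pattern from the Lambda bullet above, assuming a hypothetical S3-triggered event, a clean_events PostgreSQL table, and a PG_DSN environment variable; all names and the event shape are illustrative placeholders, not the pipeline's actual code.

import json
import os

import boto3
import psycopg2  # shipped in the deployment package or a Lambda layer

s3 = boto3.client("s3")

def handler(event, context):
    # Triggered by an S3 put; read the raw object named in the event.
    record = event["Records"][0]["s3"]
    body = s3.get_object(Bucket=record["bucket"]["name"],
                         Key=record["object"]["key"])["Body"].read()
    # Transform step: keep only records flagged valid (hypothetical rule).
    rows = [r for r in json.loads(body) if r.get("status") == "valid"]

    conn = psycopg2.connect(os.environ["PG_DSN"])  # DSN injected via Lambda config
    with conn, conn.cursor() as cur:
        cur.executemany(
            "INSERT INTO clean_events (id, payload) VALUES (%s, %s)",
            [(r["id"], json.dumps(r)) for r in rows],
        )
    conn.close()
    print(f"loaded {len(rows)} rows")  # surfaces in CloudWatch Logs
    return {"loaded": len(rows)}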
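A compact Airflow 2.x DAG sketch of the Databricks-to-Snowflake orchestration described above. Connection IDs, the cluster ID, the script path, and the COPY INTO target are hypothetical, and the load step is shown as a stage-based COPY INTO rather than the JDBC connector the bullet mentions, purely to keep the example short.

from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

with DAG("s3_to_snowflake", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False) as dag:
    # Run the PySpark transform on Databricks; it writes curated Parquet to S3.
    transform = DatabricksSubmitRunOperator(
        task_id="spark_transform",
        databricks_conn_id="databricks_default",
        json={"existing_cluster_id": "0101-abcdef",  # hypothetical cluster
              "spark_python_task": {"python_file": "dbfs:/jobs/transform.py"}},
    )
    # Load the curated files into Snowflake from an external stage over S3.
    load = SnowflakeOperator(
        task_id="load_snowflake",
        snowflake_conn_id="snowflake_default",
        sql="COPY INTO analytics.events FROM @curated_stage "
            "FILE_FORMAT = (TYPE = PARQUET)",
    )
    transform >> load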
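An illustrative sketch of the Snowflake features named in the bullets above (Zero-Copy Cloning, Snowpipe, clustering keys, materialized views), issued through snowflake-connector-python; database, schema, and credential values are placeholders.

import snowflake.connector  # snowflake-connector-python

conn = snowflake.connector.connect(account="...", user="...", password="...")
cur = conn.cursor()

# Zero-Copy Clone: a writable dev copy of prod, metadata-only, no storage duplication.
cur.execute("CREATE DATABASE dev_db CLONE prod_db")

# Snowpipe: continuous, auto-ingesting COPY from an external stage.
cur.execute("""
    CREATE PIPE prod_db.sales.orders_pipe AUTO_INGEST = TRUE AS
    COPY INTO prod_db.sales.orders FROM @prod_db.sales.raw_stage
    FILE_FORMAT = (TYPE = PARQUET)
""")

# Clustering key: co-locate micro-partitions by the dominant filter column.
cur.execute("ALTER TABLE prod_db.sales.orders CLUSTER BY (order_date)")

# Materialized view: precompute a hot aggregate to cut repeated scan cost.
cur.execute("""
    CREATE MATERIALIZED VIEW prod_db.sales.daily_rev AS
    SELECT order_date, SUM(amount) AS revenue
    FROM prod_db.sales.orders GROUP BY order_date
""")

cur.close()
conn.close()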
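A minimal Databricks Auto Loader sketch for the bronze-tier ingestion bullet above; it assumes the Databricks-provided spark session, and the S3 paths and table names are hypothetical.

# Runs on Databricks, where `spark` is predefined.
bronze = (spark.readStream.format("cloudFiles")
          .option("cloudFiles.format", "json")
          .option("cloudFiles.schemaLocation", "s3://lake/_schemas/events")
          .load("s3://lake/landing/events/"))

q = (bronze.writeStream.format("delta")
     .option("checkpointLocation", "s3://lake/_checkpoints/bronze_events")
     .trigger(availableNow=True)   # incremental batch-style run
     .toTable("bronze.events"))    # bronze tier of the medallion layout
q.awaitTermination()               # availableNow drains the backlog, then stops

# Reclaim storage from files outside the Delta retention window.
spark.sql("VACUUM bronze.events RETAIN 168 HOURS")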
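A skeletal AWS Glue job for the Glue/EMR bullet above, reading a hypothetical raw.events catalog table and writing partitioned Parquet for Athena; all names and paths are placeholders.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue = GlueContext(SparkContext.getOrCreate())
job = Job(glue)
job.init(args["JOB_NAME"], args)

# Read semi-structured JSON from S3 via the Glue Data Catalog.
src = glue.create_dynamic_frame.from_catalog(database="raw", table_name="events")

# Write back as partitioned Parquet that Athena and Redshift Spectrum can query.
glue.write_dynamic_frame.from_options(
    frame=src,
    connection_type="s3",
    connection_options={"path": "s3://lake/curated/events/", "partitionKeys": ["dt"]},
    format="parquet",
)
job.commit()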

Thermo Fisher Scientific (Carlsbad, CA) Data Engineer, Enterprise Data Platform Jun 2021 - Apr 2024

●Designed and deployed end-to-end ETL pipelines using Databricks and Airflow, ingesting and transforming large datasets into an AWS-hosted ELK stack and Kafka for real-time analytics. Tuned Elasticsearch clusters for optimized search performance.

●Integrated data from 20+ ERP systems, including SAP (PR1, C11, HANA), Microsoft Navision, Salesforce, and REST APIs using PySpark, SQL, and Databricks to enable a unified data platform supporting business-critical insights.

●Ingested and processed 5M+ streaming records per hour from Kafka, AWS SQS, and SNS using PySpark, performed real-time transformations, and loaded the curated data into MongoDB for downstream analytics and operational use cases (see the streaming sketch after this list).

●Enabled centralized data governance and access control by registering Delta Tables in Databricks Unity Catalog. Tuned Databricks clusters based on performance metrics, achieving a 40% reduction in pipeline execution time and a 30% reduction in compute costs.

●Partnered with Data Science and Marketing teams to ingest and transform data from Ad-Tech platforms (Google Ads, Facebook Ads, LinkedIn Ads, Eloqua), delivering interactive dashboards in Power BI and Tableau that contributed to a 30% increase in campaign-driven sales and supported $10B+ in annual revenue.

●Implemented CI/CD pipelines using GitHub Actions and Jenkins to automate the deployment of data transformation jobs, accelerating release cycles, enhancing version control, and reducing production deployment errors across environments.

●Streamlined backend systems through GraphQL integration, empowering clients to query only the data they need and eliminating the over-fetching and under-fetching common with traditional REST APIs (see the GraphQL sketch after this list).
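
A simplified Spark Structured Streaming sketch of the Kafka-to-MongoDB flow described above; the broker, topic, database, and collection names are hypothetical, and the per-batch MongoClient is a shortcut a production job would replace with a pooled connector.

from pymongo import MongoClient
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Requires the spark-sql-kafka-0-10 package on the Spark classpath.
spark = SparkSession.builder.appName("kafka_to_mongo").getOrCreate()

stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
          .option("subscribe", "orders")                     # hypothetical topic
          .load()
          .select(F.col("value").cast("string").alias("doc")))

def write_batch(df, batch_id):
    # Insert each micro-batch into MongoDB; collect() is fine at sketch scale.
    docs = [{"raw": row["doc"]} for row in df.collect()]
    if docs:
        MongoClient("mongodb://mongo:27017")["analytics"]["orders"].insert_many(docs)

(stream.writeStream
       .foreachBatch(write_batch)
       .option("checkpointLocation", "/tmp/chk/orders")
       .start()
       .awaitTermination())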
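A small illustration of the GraphQL point above: the client names exactly the fields it needs, so nothing is over- or under-fetched. The endpoint and schema here are hypothetical.

import requests

query = """
query ($id: ID!) {
  customer(id: $id) {
    name
    openOrders { id total }   # only the fields the dashboard actually uses
  }
}
"""
resp = requests.post(
    "https://api.example.com/graphql",
    json={"query": query, "variables": {"id": "42"}},
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["data"]["customer"])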

Education

●Master’s Degree in Data Science (Applied Mathematics and Statistics), State University of New York at Albany, Aug 2019 - May 2021.

●Bachelor’s Degree in Computer Science, Chalapathi Institute of Engineering and Technology, Jun 2015 - Apr 2019.

Certifications

●Databricks Certified Generative AI Engineer Associate.


