Sarat Chandra
Bigdata Cloud Engineer
Email: *************@*****.***
Contact: 972-***-****
Status: US Citizen
PROFESSIONAL SUMMARY:
12+ years of experience in IT, including analysis, design, DataStage, and Big Data development using Hadoop, AWS, Python, data lakes, and Scala; design and development of web applications using Java and Spring Boot; and database and data warehousing development using MySQL and Oracle
Certified in Airflow fundamentals and Airflow DAG authoring
Experience in creating applications using Spark with Python
Experienced in Apache Spark and developing data processing and analysis algorithms using Python.
Strong working experience in Big Data analytics, with hands-on experience installing, configuring, and using ecosystem components such as Hadoop MapReduce, HDFS, HBase, Zookeeper, Hive, Swagger, Splunk, Flume, Cassandra, Kafka, Spark, Oozie, and Airflow
Built data processing triggers for Amazon S3 using AWS Lambda functions written in Python (a minimal illustrative sketch follows this summary)
Worked on the Spark SQL and Spark Streaming modules of Spark and used Scala and Python to write code for all Spark use cases
Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
Expertise in preparing interactive data visualizations with Tableau from different data sources
Strong experience in analyzing large data sets by writing PySpark scripts and Hive queries.
Ran Apache Hadoop, CDH, and MapR distributions on EC2 via Elastic MapReduce (EMR)
Implemented advanced procedures such as text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Scala.
Implemented DWH solutions to ingest data and load target tables; adequately skilled with deployment tools, with hands-on experience in VPN, PuTTY, WinSCP, and CI/CD (Jenkins)
Experience in migrating databases to Snowflake multi-cluster warehouses, with an adequate understanding of Snowflake cloud technology and the use of Snowflake Clone.
Experience with the Snowflake cloud data warehouse and AWS S3 buckets for integrating data from multiple source systems, including loading nested JSON-formatted data into Snowflake tables.
Extensively worked on AWS services such as EC2, S3, EMR, FSx, Lambda, CloudWatch, RDS, Auto Scaling, CloudFormation, SQS, ECS, EFS, DynamoDB, Route53, Glue, etc.
Created service-based applications using Python to move data across AWS accounts by assuming IAM roles (as direct access is denied)
Expertise in Waterfall and Agile - SCRUM methodologies.
Experienced with code versioning and dependency management systems such as Git, SVN, and Maven.
Wrote code to create single-threaded, multi-threaded, or user-interface event-driven applications, both stand-alone and those that access servers or services.
Good experience using data modeling techniques to derive results from SQL and PL/SQL queries, and handling different file formats such as text, Avro, ORC, Parquet, Sequence, XML, and JSON files.
Experience working with different databases, such as Oracle, SQL Server, MySQL and writing stored procedures, functions, joins, and triggers for different Data Models.
Great team player and quick learner with effective communication, motivation, and organizational skills, combined with attention to detail and a focus on business improvements.
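The following is a minimal, hypothetical sketch of the S3-triggered AWS Lambda pattern referenced above; the bucket contents, the processed/ prefix, and the transformation are illustrative placeholders rather than details of any actual project.

# Illustrative Python Lambda handler for S3 object-created events (hypothetical names).
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Read each newly created S3 object and write a processed copy under a staging prefix."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Fetch the raw object and apply a trivial stand-in transformation.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        processed = body.upper()

        # Write the result under a separate prefix so the trigger does not fire recursively.
        s3.put_object(Bucket=bucket, Key=f"processed/{key}", Body=processed.encode("utf-8"))

    return {"statusCode": 200, "body": json.dumps("ok")}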
TECHNICAL SKILLS:
Big Data Technologies
HDFS, Hive, MapReduce, Hadoop distributions, Spark, Spark Streaming, Yarn, Zookeeper, Kafka, ETL (NiFi, Talend, etc.)
Scripting/Web Languages
PySpark, CSS3, XML, SQL, Shell/Unix, Perl, Python, Spark, Scala, HQL, Bash
Databases
Cassandra, MongoDB, Teradata, MS SQL Server 2012/16, Oracle 10g/11g/12c, TigerGraph
Operating Systems
Linux, Windows XP/7/8/10, Mac.
Software Life Cycle
SDLC, Waterfall and Agile models.
Utilities/Tools
Visio, Jenkins, Jira, IntelliJ, Splunk, Eclipse, Tomcat, NetBeans, JUnit, SQL, SVN, Log4j, SOAP UI, Jules, Swagger, Airsyn, GitHub, Bitbucket, AEP (Adobe Experience Platform), CJA (Customer Journey Analytics), Real-Time CDP (Customer Data Platform) B2B Edition, Label Studio
Data Visualization Tools
Tableau, SSRS, Cloud Health.
Cloud Services
AWS (EC2, S3, EMR, RDS, Lambda, CloudWatch, FSx, Auto Scaling, Redshift, CloudFormation, Secrets Manager, Glue, etc.), Azure (Databricks, Data Factory, Synapse Analytics), GCP/GCS (BigQuery, Dataproc, Dataflow, Pub/Sub, Cloud Composer), InfoWorks, Airflow
Certifications
Airflow Fundamentals, Airflow DAG authoring
PROFESSIONAL EXPERIENCE
Blue Acorn iCi, Charleston, SC Feb 2026 – current
Data Engineer
Responsibilities:
Migrated an existing data pipeline's source system within Apache Airflow to a new data source, ensuring seamless data flow and integrity across platforms (a minimal DAG sketch follows this entry).
Managed all subsequent column and schema changes, implementing robust data validation and quality checks to maintain data accuracy and reliability.
Resolved potential data inconsistencies and errors arising from the source change, ensuring high-quality data delivery for downstream analytics.
Collaborated with source system owners and data consumers to understand new data structures and implement necessary ETL adjustments, guaranteeing alignment with business requirements.
Worked on two ETL processes using JSON scripts to integrate files into AEP (Adobe Experience Platform) and CJA (Customer Journey Analytics).
Worked on lookup datasets and marketing metadata with Real-Time CDP (Customer Data Platform) B2B Edition, along with Journey Orchestration and Adobe Mix Modeler.
Involved in moving data from Databricks to AEP after transformations via ingestion jobs (CJA).
Worked with the Mosaic Component Library Tool (M-CST) to create a component library using an AI-powered tool that automatically analyzes screenshots/DOM data from a client's website, detects content sections, and generates a structured component inventory exported as a CSV.
The M-CST reviews the site and examines additional data about the client site, such as metadata, analytics data layer(s), image types, sizes, and usage, URL structures, and ADA compliance.
Manual audits take 20-40 hours per site; the tool reduced manual audit effort by 70%.
Trained a vision model on the content-section vocabulary, extracted data attributes via DOM parsing to provide more context for the report, and normalized results by using an LLM (GPT, Anthropic, Ollama) to map each detected block to the defined vocabulary and output.
Experience in model training, crawling and data capture, LLM integration, and CLI tooling.
Environment: Azure Databricks, AEP (Adobe Experience Platform), CJA (Customer Journey Analytics), Real-Time CDP (Customer Data Platform) B2B Edition, Label Studio, Spark, SQL, GitHub, Python, Airflow
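A minimal, hypothetical Airflow DAG illustrating the extract-validate-load pattern used when repointing a pipeline to a new source system, as described in the entry above; connection details, the schema contract, and the task logic are placeholders, not the client's actual pipeline.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical schema contract used to catch column/schema drift from the new source.
EXPECTED_COLUMNS = {"customer_id", "event_ts", "channel"}

def extract_from_new_source(**context):
    # Stand-in for reading from the migrated source (API, warehouse table, etc.).
    return [{"customer_id": 1, "event_ts": "2024-01-01T00:00:00Z", "channel": "web"}]

def validate_schema(ti, **context):
    rows = ti.xcom_pull(task_ids="extract_from_new_source")
    for row in rows:
        missing = EXPECTED_COLUMNS - row.keys()
        if missing:
            raise ValueError(f"Schema drift detected; missing columns: {missing}")

def load_downstream(ti, **context):
    rows = ti.xcom_pull(task_ids="extract_from_new_source")
    print(f"Loading {len(rows)} validated rows to the target dataset")

with DAG(
    dag_id="source_migration_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_from_new_source", python_callable=extract_from_new_source)
    validate = PythonOperator(task_id="validate_schema", python_callable=validate_schema)
    load = PythonOperator(task_id="load_downstream", python_callable=load_downstream)

    extract >> validate >> load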
JPMC, Dallas, TX Oct 2023 – May 2025
Scala/Spark Developer
Responsibilities:
Involved in requirement gathering and business analysis, and translated business requirements into technical designs in Hadoop and Big Data after interacting with the data modeling and SRE teams.
Created various data pipelines using Spark, Scala and SparkSQL for faster processing of data
Designed the number of partitions and the replication factor for Kafka topics based on business requirements, and worked on Spark transformations using Spark and Scala, initially done in Python (PySpark).
Created Data Quality Scripts using SQL and Hive to validate successful data load and quality of the data. Created various types of data visualizations using Python, Tableau and Power BI.
Extensively worked with Avro, ORC, Parquet, XML, CSV, and JSON files, converting data between formats; parsed semi-structured JSON data and converted it to Parquet using DataFrames in PySpark. Also configured and implemented various data quality checks at the column and table level after interacting with the DQCS team.
Developed a Python script to load CSV files into S3 buckets; created AWS S3 buckets, performed folder management in each bucket, and managed logs and objects within each bucket.
Involved in analyzing system failures, identifying root causes, and recommending courses of action; documented system processes and procedures for future reference.
Participated in Hadoop installation, commissioning, decommissioning, balancing, troubleshooting, monitoring, and debugging configuration of multiple nodes using the Hortonworks platform.
Involved in working with Spark on top of Yarn/MRv2 for interactive and Batch Analysis
Worked with a team to migrate from Legacy/On prem environment into AWS.
Stored and retrieved data from data-warehouses using Amazon Redshift.
Experienced in analyzing and optimizing RDDs by controlling partitions for the given data.
Used HiveQL to analyze the partitioned and bucketed data and compute various metrics for reporting
Worked with querying data using SparkSQL on top of Spark engine.
Worked on Airflow development and the scheduling and analysis of all Spark jobs/tasks.
Developed and executed a migration strategy to move Data Warehouse from an ICDW and HWX platform to Cassandra and TigerGraph.
Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/Text files) into Cassandra.
Assisted in creating and maintaining technical documentation for launching EMR clusters.
Assisted in Cluster maintenance, cluster monitoring, adding and removing cluster nodes and Installed and configured spark jobs for data cleaning and pre-processing.
Involved in file movements between HDFS and AWS S3, worked extensively with S3 buckets in AWS, and used the SQLAnalyzer tool for PERF (pre-prod) deployment.
Participated in the Automation and monitoring of complete AWS infrastructure with terraform.
Created data partitions on large data sets in S3 and DDL on partitioned data, worked on Swagger for filter criterion of various Data Quality Checks (SDQ – Statistical Data Quality checks)
Ingested data into Iceberg tables, implemented schema evolution capabilities using Apache Iceberg to allow seamless feature edits without requiring data rewrites, and worked with the time travel feature for data versioning and auditing purposes (a minimal sketch follows this entry).
Extensively used SSIS transformations such as Lookup, Derived Column, Data Conversion, Aggregate, Conditional Split, SQL Task, etc. Prepared the complete data mapping for all migrated jobs using SSIS, and worked extensively on developing ETL programs for supporting data extraction, transformation, and loading using Informatica PowerCenter.
Experienced in Informatica performance tuning involving source-level, target-level, and map-level bottlenecks, and converted all Hadoop jobs to run in EMR by configuring the cluster according to the data size. Built data pipelines to process structured/unstructured data for LLM training, reducing preparation time by 40%, and assisted in deploying LLM services on AWS/GCP with Docker and Kubernetes, reducing latency by 25% while Q&A accuracy improved by 30%.
Designed ETL workflows for embeddings and vector database integration, enabling real-time semantic search; assisted in the automation of synthetic data generation, boosting model performance by 15%.
Partnered with ML and product teams to productionize GenAI prototypes, cutting time-to-market by 2 months
Environment: HDFS, Hive, Scala, Swagger, DataStage, Spark, SQL, Terraform, Splunk, RDBMS, Python, data lake, Kerberos, Jira, Confluence, AWS (EC2, S3, EMR, Redshift, ECS, Glue), Ranger, Git, Kafka, CI/CD (Jenkins), Kubernetes, Airsyn, Data Comparator, Airflow, Teradata, Jules Deployment, TigerGraph, POSTMAN/SOAP UI, Cassandra, PostgreSQL, Prometheus
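A hedged sketch of the Apache Iceberg schema-evolution and time-travel usage mentioned in the entry above; the catalog, table, and column names are hypothetical, and a Spark session with an Iceberg catalog already configured is assumed.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-schema-evolution-sketch").getOrCreate()

# Schema evolution: add a new column without rewriting existing data files.
spark.sql("ALTER TABLE demo_catalog.campaigns.events ADD COLUMN campaign_segment STRING")

# New writes can populate the added column; earlier snapshots remain readable as written.
spark.sql(
    "INSERT INTO demo_catalog.campaigns.events "
    "VALUES (1001, 'email_open', current_timestamp(), 'loyalty')"
)

# Time travel for auditing: read the table as of an earlier snapshot timestamp (epoch millis).
historical = (
    spark.read.option("as-of-timestamp", "1714521600000")
    .format("iceberg")
    .load("demo_catalog.campaigns.events")
)
historical.show()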
CVS Health, Dallas, TX Jan 2022 – May 2023
Sr. Bigdata Engineer
Responsibilities:
Involved in requirement gathering and business analysis, and translated business requirements into technical designs in Hadoop and Big Data.
Developed Pyspark scripts to extract the data from the web server output files to load into GCS buckets within the GCP environment.
Designed the number of partitions and the replication factor for Kafka topics based on business requirements, and worked on migration with Spark transformations using Spark and Scala, initially done in Python (PySpark). Created various data pipelines using Spark, PySpark, and SQL for faster data processing.
Wrote Spark SQL and embedded the SQL in PySpark scripts to generate jar files for submission onto the Hadoop cluster, and worked with a team to migrate from a legacy/on-prem environment into GCS, the storage service within GCP.
Ingested data into Azure Data Lake and processed it in Azure Databricks.
Worked with the migration of legacy data to XML-based CDA format and validated CDA documents.
Extensively worked with Avro, Parquet, XML, and JSON files, converting data between formats; parsed semi-structured JSON data and converted it to Parquet using DataFrames in PySpark.
Developed a Python script to load CSV files into GCS buckets within GCP, performed folder management in each bucket, and managed logs and objects within each bucket.
Worked on migrating the data warehouse to the Adobe BigQuery tables in the RAW layer, which are further made available to consumer teams via BigQuery views set up on the tables.
Involved in analyzing system failures, identifying root causes, and recommending courses of action; documented system processes and procedures for future reference.
Created Dockerized backend cloud applications with exposed Application Program Interface (API) interfaces and deployed on Kubernetes.
Involved in the data migration efforts for ERP system implementation and developed SQL scripts to validate the migrated datasets to reduce the reconciliation time.
Experienced in writing live Real-time Processing using Spark Streaming with Kafka
Architected segmented VPCs with subnets and gateways and configured VPC endpoints.
Worked with querying data using SparkSQL on top of Spark engine and used Python and Shell scripting to build data pipelines and Created data partitions on large data sets and DDL on partitioned data.
Involved in the migration strategy to move data from hive on-prem to the cloud storage within the GCP environment using the Replicator Infoworks tool and orchestrated Spark jobs on Dataproc
Developed Airflow DAGs in Cloud Composer to orchestrate end-to-end pipelines, covering Pub/Sub, Dataflow, and BigQuery
Worked on Cloud Composer to transfer data from GCS to BigQuery and optimized BigQuery datasets (a minimal DAG sketch follows this entry).
Developed and executed a migration strategy to move Data Warehouse from Oracle platform to GCS, within the GCP environment using InfoWorks tool.
Involved in making the data files available to consumer teams via BigQuery views set up on the Adobe BigQuery tables in the raw layer using the InfoWorks ETL tool.
Led the development team in a de-identification project that handled PHI/PII features, masking (encryption and decryption) of PII extracts (identifier types), and sanitizing PII text and structured datasets, ensuring 100% compliance with HIPAA standards.
Assisted in Cluster maintenance, cluster monitoring, adding and removing cluster nodes for data cleaning and pre-processing and TIDAL automation.
Environment: Hive, Scala, DataStage, Spark, Cloudera, SQL, RDBMS, Python, data lake, Jira, Confluence, Shell scripting, GCS/GCP, Git, Kafka, CI/CD (Jenkins), Kubernetes, InfoWorks, Informatica, BigQuery, Replicator InfoWorks, Tidal automation, Cloud Composer, IAM (Service Accounts, Roles)
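A minimal Cloud Composer (Airflow) DAG sketch for the GCS-to-BigQuery loads described in the entry above; the project, bucket, dataset, and table names are hypothetical placeholders, and the apache-airflow-providers-google package shipped with Composer is assumed.

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

with DAG(
    dag_id="gcs_to_bigquery_load",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    # Load the day's exported files from GCS into a raw-layer BigQuery table.
    load_raw = GCSToBigQueryOperator(
        task_id="load_raw",
        bucket="example-landing-bucket",
        source_objects=["exports/{{ ds }}/*.parquet"],
        source_format="PARQUET",
        destination_project_dataset_table="example-project.raw_layer.web_events",
        write_disposition="WRITE_APPEND",
        autodetect=True,
    )

    # Refresh a curated table for consumer teams on top of the raw layer.
    refresh_curated = BigQueryInsertJobOperator(
        task_id="refresh_curated",
        configuration={
            "query": {
                "query": (
                    "CREATE OR REPLACE TABLE `example-project.curated.web_events_daily` AS "
                    "SELECT * FROM `example-project.raw_layer.web_events` "
                    "WHERE DATE(event_ts) = '{{ ds }}'"
                ),
                "useLegacySql": False,
            }
        },
    )

    load_raw >> refresh_curated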
NIKE, Portland, Oregon June 2021 – Dec 2021
Application Support Engineer
Responsibilities:
Developed action plans and processes, in coordination with management team, for integrating activities and optimizing department resources to meet major goals and objectives.
Facilitated application support, problem solving, issue resolution with internal and external resources & Resolved issues and determined options for issue resolution and risk mitigation.
Contributed and reviewed recommendations for technical solutions by coordinating compliance issue identification and remediation
Enforced architecture, governance, security, and global process standards to system changes and deployments by collaborating with other teams on integration needs/design
Contributed to database design and created critical-path, high-risk, advanced technical designs; was involved in approving proof-of-concept efforts and reviewed results, enforcing client and department architectural direction and ensuring a consistent technical approach within the department when deciding on engineering tools based on recommendations.
Allocated resources, and reviewed and approved performance test results based on performance monitoring and tuning, by managing and enforcing performance thresholds and standards.
Environment: Spark, DataStage, SQL, RDBMS, Python, data lake, Jira, Confluence, Shell scripting, AWS, Glue, Git, S3 instances, Lambda functions, EC2, EMR, CloudWatch, Auto Scaling, CloudFormation
AT&T, Dallas, TX May 2020 – May 2021
Bigdata Cloud Engineer
Projects: Data load from RDBMS sources to Hadoop systems; data visualizations using Python and Tableau
Environment: HDFS, Hive, Scala, Sqoop, DataStage, Spark, Tableau, Yarn, Cloudera, SQL, Terraform, Splunk, RDBMS, Python, Elasticsearch, data lake, Kerberos, Jira, Confluence, Shell/Perl scripting, Zookeeper, AWS (EC2, S3, EMR, Redshift, ECS, Glue, VPC, RDS, etc.), Ranger, Git, Kafka, CI/CD (Jenkins), Kubernetes, Azure Databricks, Notebooks, Delta Lake, HBase, SparkSQL
Trimble, Dallas, TX Mar 2019 – Mar 2020
Bigdata AWS Cloud Engineer
Projects: Data migration from HDFS to RDBMS (data stored in AWS S3)
Environment: Hadoop, HDFS, Hive, Core Java, Sqoop, NiFi, Spark, Scala, Cloudera (CDH3 to CDH4), Oracle, Elasticsearch, Kerberos, DataStage, SFTP, data lake, Impala, Jira, Wiki, Alteryx, Teradata, Shell/Perl scripting, Kafka, AWS EC2, S3, EMR, Oozie workflows
Alcons Lab, Fort Worth, TX Dec 2017 – Dec 2018
Bigdata Engineer
Projects: Migration of data from transactional source systems to the Redshift data warehouse using Spark and AWS EMR; troubleshooting production-level issues
Environment: Hadoop, HDFS, Data Warehouse, Pig, Hive, Spark, Scala, MapReduce, Java, Cloudera Hadoop, Dell Boomi, Sqoop, SSIS, SQL, Oozie, Grafana, Agile, AWS, MongoDB, NoSQL, REST API, Git, GitHub, SourceTree version control
T Rowe Price, Owings Mills, MD Jun 2016 – Nov 2017
Data/Scala Developer
Projects: Automation of all jobs for pulling data from an FTP server and loading it into Hive tables
Environment: Hadoop, MapReduce, HDFS, Hive, Java, SQL, Cloudera Manager, Teradata, PL/SQL, MySQL, HBase, DataStage, ETL (Informatica/SSIS), Spark RDDs, Scala, Python, MongoDB, Hue GUI, QlikView, Oozie
TMW Systems, Dallas, TX Sept 2011 - Dec 2015
Java SQL Developer
Projects: Java-based enterprise application focusing on backend integration, UI development, and robust testing, involving full SDLC and Java/J2EE development for an Oracle-backed system
Environment: Java, Struts, Spring Boot, Servlets, JSP, EJB, JavaScript, CRM, AJAX, SOAP/RESTful APIs, WebLogic, Oracle SQL, PL/SQL, TOAD, Microservices, Eclipse, HTML, UNIX, CSS, XML/XSL, Maven
EDUCATION: 2002 - 2006
Bachelor of Engineering, Electrical and Electronics Engineering, Osmania University, Hyderabad, India