TRIVENI G
Senior Data Engineer
Phone: +1-330-***-****
Email: ***************@*****.***
LinkedIn: http://www.linkedin.com/in/triveni-g-5363b5292
Professional Summary:
Data Engineer with 10+ years of experience building data platforms on GCP, AWS, and Hadoop.
Skilled at building ETL/ELT pipelines, real-time streaming, and data lakes that process 1-2 TB+ daily.
Migrated legacy Hadoop systems to cloud platforms (Dataproc, BigQuery, Snowflake, Redshift), cutting infrastructure costs by up to 30%.
Proficient in PySpark, Scala, Hive, Kafka, Airflow, Terraform, and Kubernetes, with a focus on automation and performance tuning.
Experienced with data warehouse modeling and tuning (BigQuery, Snowflake, Redshift) for fast analytics.
Orchestrate workflows with Airflow/Composer/Oozie, configuring SLAs, retries, and alerting to keep pipelines running reliably.
Build real-time ingestion with Kafka, Flume, and Pub/Sub for low-latency analytics.
Partnered with business and data science teams to deliver feature engineering pipelines, ML datasets, and BI dashboards.
Implemented data governance, lineage tracking, and compliance controls (GDPR, HIPAA, PCI DSS).
Good knowledge of writing advanced Spark applications in Scala and Python.
Experience working with Spark Streaming and Kafka for real-time data processing.
Expertise in developing production-ready Spark applications using the Spark Core, DataFrame, Spark SQL, Spark ML, and Spark Streaming APIs.
Strong experience troubleshooting failures in Spark applications and fine-tuning Spark applications and Hive queries for better performance.
Worked extensively on Hive for building complex data analytical applications.
Strong understanding of Hive concepts such as partitioning, bucketing, SerDes, UDFs, windowing/analytical functions, and file formats.
Strong experience writing complex MapReduce jobs, including custom InputFormats and custom RecordReaders.
Sound knowledge of map-side joins, reduce-side joins, shuffle and sort, distributed cache, compression techniques, and multiple Hadoop input/output formats.
Good experience working with AWS cloud services such as S3, EMR, Redshift, and Athena.
Strong experience with automating and monitoring data pipelines in both GCP Cloud and AWS Cloud.
Good understanding of cloud-native operational concepts such as security policies, IAM roles, and service accounts.
Worked on building real-time data workflows using Kafka, Spark Streaming, and HBase.
Solid experience working with CSV, text, Avro, Parquet, ORC, and JSON data formats.
Extensive experience in building ML pipelines using Jupyter notebooks with Spark and Python.
Solid interpersonal and analytical skills, with a strong ability to perform as part of a team.
Strong understanding of Software Development Lifecycle (SDLC) and various methodologies (Waterfall, Agile).
Technical Summary
Programming & Scripting: Python (3.x), Scala, Java (8/11), SQL, Shell Scripting
Big Data & Processing: Apache Spark (2.4/3.x), Hive, HBase, Impala, Sqoop, Flume, MapReduce
Data Warehousing & Analytics: BigQuery, Snowflake, Redshift, HiveQL, Oracle, MySQL
Cloud Platforms: Google Cloud Platform (Dataproc, Dataflow, Pub/Sub, BigQuery, GCS, Composer), AWS (S3, EMR, Athena, Redshift, Lambda, IAM)
Orchestration & Workflow: Apache Airflow (2.x), Cloud Composer, Oozie, Control-M
Streaming & Messaging: Kafka (2.x), Pub/Sub, Flume
DevOps & Infrastructure: Terraform, Jenkins, Docker, Kubernetes, Git, GitHub, GitLab CI/CD
Data Governance & Quality: Data Lineage, GDPR, HIPAA, PCI DSS, PyDeequ, Great Expectations
Monitoring & Logging: CloudWatch, Stackdriver, Prometheus, Grafana, ELK Stack
Tools & IDEs: IntelliJ, Eclipse, JIRA, Maven, Postman, Tableau, Power BI
Work Experience:
Client: Kroger Inc, OH Mar 2024 – Present
Role: Senior Big Data Developer
Responsibilities:
Built ETL pipelines using PySpark and Scala on GCP Dataproc to process 1.2 TB+ of daily clickstream data, reducing job runtimes by 40%.
Set up real-time ingestion pipelines from Adobe Analytics and Google Analytics using Pub/Sub + Spark Streaming, with end-to-end latency under 10 minutes.
Designed partitioned and clustered BigQuery tables, reducing query costs by 25% for 15+ downstream teams.
Used Terraform to automate Dataproc provisioning, cutting cluster spin-up time from 45 minutes to under 5.
Developed Airflow DAGs with SLA monitoring and Slack alerts, reducing pipeline failures by 30% (illustrative DAG sketch at the end of this section).
Helped data scientists by building feature stores in BigQuery for ML recommendation engines.
Migrated legacy Hive/Sqoop processes to Spark SQL + BigQuery, improving runtimes.
Led the migration from on-prem Hadoop to GCP, saving 20% on infrastructure costs.
Worked with BI teams to build self-service dashboards for campaign metrics.
Wrote a series of Spark applications for data cleansing, event enrichment, custom extractions, complex column transformations, aggregations, and joins.
Used the Spark DataFrame API to process structured and semi-structured files and load them into GCS buckets.
Used different Spark modules such as Spark Core, Spark SQL, Spark Streaming, Datasets, and DataFrames.
Created Hive external tables on top of datasets stored in GCS buckets and used Spark SQL to process them.
Built integrations for moving data out of BigQuery and loading processed data back into BigQuery.
Built real-time data pipelines by developing Kafka producers and Spark Streaming consumer applications.
Automated creation of Dataproc clusters and their termination once jobs finished.
Worked closely with data science team to operationalize machine learning models in our cloud infrastructure.
Worked with Jupyter notebooks for exploratory data analysis, building machine learning models and visualizing model results.
Worked with Infrastructure as Code to maintain and manage various cloud services.
Environment: PySpark 3.1, Scala 2.12, Java 11, GCP Dataproc 2.0, BigQuery, Airflow 2.2, Terraform, Kafka 2.6, Docker, Kubernetes, Git
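Illustrative sketch (assumed names, schedule, and webhook, not client code) of the Airflow DAG pattern referenced above: retries, a task-level SLA, and a Slack webhook alert on failure around a daily Dataproc PySpark submission.

from datetime import datetime, timedelta

import requests
from airflow import DAG
from airflow.operators.bash import BashOperator


def notify_slack(context):
    # Post a short failure message to a Slack incoming webhook (placeholder URL).
    ti = context["task_instance"]
    requests.post(
        "https://hooks.slack.com/services/PLACEHOLDER",
        json={"text": f"Pipeline failure: {ti.dag_id}.{ti.task_id} on {context['ds']}"},
        timeout=10,
    )


default_args = {
    "owner": "data-eng",
    "retries": 2,                          # retry transient failures
    "retry_delay": timedelta(minutes=10),
    "sla": timedelta(hours=2),             # flag an SLA miss if not done within 2h of the scheduled run
    "on_failure_callback": notify_slack,   # Slack alert when a task exhausts its retries
}

with DAG(
    dag_id="clickstream_daily_etl",
    start_date=datetime(2024, 3, 1),
    schedule_interval="0 4 * * *",         # daily at 04:00 UTC
    catchup=False,
    default_args=default_args,
) as dag:
    # Submit the PySpark job; shown as a gcloud call to keep the sketch simple.
    run_clickstream_etl = BashOperator(
        task_id="run_clickstream_etl",
        bash_command=(
            "gcloud dataproc batches submit pyspark "
            "gs://example-bucket/jobs/clickstream_etl.py --region=us-central1"
        ),
    )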
Client: Ascena Retail Group, Pataskala, Ohio Nov 2021 – Feb 2024
Role: Big Data Engineer
Responsibilities:
Processed 2 TB of daily clickstream data from web logs using S3, Spark, and Redshift for reporting (see the PySpark sketch after this section).
Built Hive/Presto tables that sped up business reporting, cutting query times by 35%.
Built ETL pipelines using Sqoop to move structured data from Oracle to Hadoop.
Used Oozie to automate job scheduling, handling dependencies and recovery.
Wrote Athena queries on S3 data to deliver near-real-time insight into customer experience.
Moved batch ETLs to Spark Structured Streaming, making data available in under an hour.
Wrote data validation scripts in Python that raised data quality to over 99% accuracy.
Optimized Spark jobs with partition pruning and caching, cutting costs by 25%.
Set up a data archival strategy on S3 with lifecycle policies to reduce storage costs.
Worked with analysts to build full customer-view dashboards in Tableau.
Developed multi-threaded Java-based input adaptors to ingest clickstream data daily from external sources such as FTP servers and S3 buckets.
Communicated effectively with business customers to gather project requirements.
Loaded data into HDFS from Teradata using Sqoop.
Improved performance by implementing dynamic partitioning and bucketing in Hive and by designing managed and external tables.
Wrote MapReduce programs to cleanse and pre-process data from various sources.
Created Hive generic UDFs to implement business logic and worked on incremental imports to Hive tables.
Collected and aggregated large amounts of log data using Flume and staged it in HDFS for further analysis.
Used Apache NiFi to automate and manage data flow between systems.
Environment: PySpark 2.4, Java 8, Hive, Sqoop, Oozie, AWS S3, Athena, Redshift, Oracle, Linux
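A minimal PySpark sketch of the clickstream batch pattern described above, assuming placeholder S3 buckets and a simple event schema: raw JSON logs are cleaned, de-duplicated, and written as date-partitioned Parquet so Athena and downstream Spark jobs can prune partitions.

import sys

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("clickstream_daily_batch").getOrCreate()

# Run date passed by the scheduler, e.g. "2023-06-01" (placeholder default).
run_date = sys.argv[1] if len(sys.argv) > 1 else "2023-06-01"

raw = spark.read.json(f"s3a://example-raw-bucket/clickstream/{run_date}/")

events = (
    raw.filter(F.col("user_id").isNotNull())                  # drop unattributable hits
       .withColumn("event_ts", F.to_timestamp("event_time"))  # normalize the timestamp
       .withColumn("event_date", F.to_date("event_ts"))
       .dropDuplicates(["event_id"])                          # de-dupe replayed events
)

(events.write
       .mode("overwrite")
       .partitionBy("event_date")                             # enables partition pruning in Athena/Spark
       .parquet("s3a://example-curated-bucket/clickstream_events/"))

spark.stop()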
Client: Bluecoat Systems, Sunnyvale, CA Apr 2020 – Oct 2021
Role: Big Data Engineer
Responsibilities:
Created Spark + Impala ETL pipelines to process customer data from over 20 sources.
Built data ingestion workflows with Flume/Kafka for near-real-time data.
Designed data warehouse models in Hive for churn analysis and campaign targeting.
Automated pipelines with Airflow, including retry policies and SLA monitoring.
Built aggregated datasets for BI dashboards, cutting query time by 45%.
Put in place data lineage tracking and a metadata repository for compliance reporting.
Tuned Hive queries by using partitioning and bucketing; improved job runtimes by 30%.
Created Python scripts for data quality checks and anomaly detection (sample check sketch after this section).
Worked with data scientists to build feature engineering datasets for ML.
Documented pipeline architecture and trained the offshore team.
Imported and exported data between Oracle/DB2 and HDFS using Sqoop.
Developed Pig Latin scripts to extract data from web server output files and load it into HDFS.
Designed and developed user-defined functions to provide custom Hive and Pig capabilities across application teams.
Created Hive external tables, loaded data into them, and queried the data using HQL.
Used Impala to expose data for further analysis and to transform files from various analytical formats into text files.
Implemented test scripts to support test-driven development and continuous integration.
Tuned the performance of Hive and Pig queries.
Environment: Spark 2.0, Hive, Impala, Flume, Kafka, Airflow, Python, Oracle
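A minimal sketch of the Python data quality checks mentioned above, assuming a hypothetical Hive table and column names: null-rate and duplicate-key checks that fail the job when thresholds are exceeded.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("customer_dq_checks").enableHiveSupport().getOrCreate()

df = spark.table("analytics.customer_events")    # hypothetical Hive table
total = df.count()

required_cols = ["customer_id", "event_type", "event_ts"]
failures = []

for col_name in required_cols:
    nulls = df.filter(F.col(col_name).isNull()).count()
    null_rate = nulls / total if total else 0.0
    if null_rate > 0.01:                          # more than 1% nulls fails the check
        failures.append(f"{col_name}: null rate {null_rate:.2%}")

# Duplicate-key check on the (customer_id, event_ts) business key.
dupes = total - df.dropDuplicates(["customer_id", "event_ts"]).count()
if dupes > 0:
    failures.append(f"duplicate (customer_id, event_ts) rows: {dupes}")

if failures:
    raise RuntimeError("Data quality checks failed: " + "; ".join(failures))

spark.stop()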
Client: Mastercard, MO Jan 2018 – Mar 2020
Role: Data Engineer
Responsibilities:
Developed fraud detection data feeds using Kafka + Spark Streaming for machine learning models (streaming sketch after this section).
Migrated legacy PL/SQL jobs to Hadoop/Spark pipelines, improving scalability and reducing runtimes.
Implemented data masking and encryption to maintain PCI DSS compliance.
Set up automated pipeline deployment with Jenkins, including regression testing.
Worked with risk teams to deliver daily fraud score dashboards in Tableau.
Optimized large Hive queries using partitioning, indexes, and ORC/Parquet formats.
Documented and enforced data governance rules for sensitive PII.
Tuned Spark clusters for low-latency transaction processing, achieving 99.9% SLA compliance.
Trained analysts on Hive/SQL for ad hoc financial reporting.
Environment: Java 8, Spark 1.6, Hive, Kafka, Sqoop, Jenkins, Oracle, Linux
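A hedged sketch of the fraud feed pattern described above: topic name, schema, broker, and sink paths are assumptions, and it is written against the Structured Streaming API for brevity rather than the DStream API the Spark 1.6 stack of that era would have used. It masks the card number with a one-way SHA-256 hash before landing the stream.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("fraud_feed").getOrCreate()

txn_schema = StructType([
    StructField("txn_id", StringType()),
    StructField("card_number", StringType()),
    StructField("merchant_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("txn_ts", StringType()),
])

# Requires the spark-sql-kafka connector package on the cluster.
raw = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker1:9092")    # placeholder broker
            .option("subscribe", "card-transactions")             # placeholder topic
            .load())

txns = (raw.selectExpr("CAST(value AS STRING) AS json")
           .select(F.from_json("json", txn_schema).alias("t"))
           .select("t.*")
           .withColumn("card_token", F.sha2("card_number", 256))  # one-way hash of the PAN
           .drop("card_number"))                                  # raw card number never persisted

query = (txns.writeStream
             .format("parquet")
             .option("path", "/data/fraud/transactions")           # placeholder sink
             .option("checkpointLocation", "/data/fraud/_chk")
             .start())

query.awaitTermination()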
Client: Advithri Technologies, Hyderabad, India Jun 2014 – Aug 2017
Role: Junior Data Engineer
Responsibilities:
Helped build ETL pipelines to load EMR and claims data into Hadoop clusters.
Migrated batch processes from SQL Server to Hive/Spark to improve scalability.
Built Sqoop jobs to load structured patient data from RDBMS into Hadoop.
Wrote HiveQL queries for HIPAA-compliant analytics reporting.
Implemented basic data quality scripts in Python for record validation.
Helped design HBase tables for storing semi-structured patient documents.
Worked with senior engineers to evaluate storage formats (Parquet vs. ORC).
Wrote automated shell scripts for log parsing and job monitoring (a Python analogue is sketched after this section).
Helped integrate HL7 feeds using Flume for streaming healthcare records.
Maintained documentation and supported production issues.
Environment: Java 7, Hadoop, Hive, Sqoop, Flume, HBase, SQL Server, Linux
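A small Python analogue of the job-monitoring shell scripts mentioned above, with an example log path and failure markers: it scans a job log for error lines and exits non-zero so a scheduler can flag the run.

import re
import sys
from pathlib import Path

FAILURE_MARKERS = re.compile(r"\b(ERROR|FAILED|Exception)\b")

def check_log(path: str) -> int:
    # Return the number of lines in the log file that match a failure marker.
    hits = 0
    for line in Path(path).read_text(errors="replace").splitlines():
        if FAILURE_MARKERS.search(line):
            hits += 1
            print(f"failure line: {line.strip()}")
    return hits

if __name__ == "__main__":
    log_file = sys.argv[1] if len(sys.argv) > 1 else "/var/log/etl/daily_load.log"  # example path
    failures = check_log(log_file)
    print(f"{failures} failure line(s) found in {log_file}")
    sys.exit(1 if failures else 0)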