
Data Engineer with 6 Years in Big Data & ML Ops

Location:
Allen, TX
Posted:
January 13, 2026


Thrishalini Vemula Contact: 210-***-****

Email: ***************@*****.***

Data Engineer

Background Summary:

Motivated IT professional with around 6 years of experience in the software industry.

Hands-on experience working with the Big Data/Hadoop ecosystem, including Apache Spark, MapReduce, Spark Streaming, PySpark, Hive, HDFS, AWS Kinesis, Airflow DAGs, and Oozie.

Engineered data pipelines supporting portfolio management systems for holdings, transactions, valuations, and performance.

Owned end-to-end ingestion and integration of new data sources, modeling datasets to support downstream analytics, reporting, and machine learning use cases.

Partnered with Analytics and Business stakeholders to gather requirements and deliver a scalable internal data mart enabling self-service reporting and advanced analytics.

Collaborated with Data Science teams to deploy, monitor, and maintain production ML models, implementing MLOps best practices, governance, and model observability.

Implemented and enforced data quality, observability, and governance standards across pipelines and enterprise data assets.

Hands-on experience engineering data pipelines for Wealth Management / Financial Services datasets including holdings, transactions, performance, and market data.

Expertise in Data Migration, Data Profiling, Data Ingestion, Data Cleansing, Transformation, Data Import, and Data Export using multiple ETL tools such as Informatica PowerCenter.

Working knowledge of the Spark RDD, DataFrame, Dataset, and Data Source APIs.

Integrated Snowflake with Azure Data Factory, AWS Glue, Databricks, Informatica, and Kafka connectors.

Hands-on experience in configuring workflows using Apache Airflow and the Oozie workflow engine to manage and schedule Hadoop jobs.
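
For illustration, a minimal sketch of the kind of Airflow DAG this involves, using Airflow 2-style imports; the DAG id, task names, and commands are hypothetical placeholders, not taken from an actual project:

```python
# Illustrative Airflow DAG: schedule a daily Spark ingestion followed by a Hive load.
# DAG id, task names, and commands are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_ingestion_example",      # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="spark_ingest",
        bash_command="spark-submit /opt/jobs/ingest.py --run-date {{ ds }}",
    )
    load_hive = BashOperator(
        task_id="load_hive",
        bash_command="hive -f /opt/jobs/load_partitions.hql",
    )
    ingest >> load_hive   # run the Hive load only after ingestion succeeds
```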

Experience with partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance. Experience with various file formats such as Avro, Parquet, ORC, JSON, and XML.
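
A minimal sketch of the partitioned, bucketed external-table pattern described above; the table name, columns, and HDFS location are hypothetical, and depending on the Spark/Hive versions this HiveQL DDL may be issued through the Hive CLI rather than spark.sql:

```python
# Illustrative HiveQL DDL for a partitioned, bucketed external table, run via PySpark.
# Table name, columns, and HDFS location are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive_ddl_example").enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS transactions_ext (
        txn_id      STRING,
        account_id  STRING,
        amount      DECIMAL(18,2)
    )
    PARTITIONED BY (txn_date STRING)           -- prune partitions on date filters
    CLUSTERED BY (account_id) INTO 16 BUCKETS  -- bucket for faster joins/sampling
    STORED AS PARQUET
    LOCATION '/data/warehouse/transactions_ext'
""")

# Register a daily partition after files land in the external location.
spark.sql("ALTER TABLE transactions_ext ADD IF NOT EXISTS PARTITION (txn_date='2024-01-01')")
```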

Proficient in designing and managing AWS Glue ETL workflows for data transformation, cleansing, and integration.

IT Technical Skills:

Languages: Python, Java, Shell Scripting

Big Data Technologies: Yarn, Spark SQL, Kafka, Presto, Hadoop, HDFS, Hive, Pig, HBase, Sqoop, Flume

Operating System: Windows, MacOS, Linux/Unix

BI Tools: SSIS, SSRS, SSAS.

Database Tools: Oracle 12c/11g/10g, MS Access, Microsoft SQL Server, Teradata, PostgreSQL, Netezza

Cloud Platform: AWS (Amazon Web Services), Microsoft Azure, GCP

Reporting Tools: Business Objects, Crystal Reports.

Tools & Software: TOAD, MS Office, BTEQ, Teradata SQL Assistant.

ETL Tools: Pentaho, Informatica PowerCenter, SAP Business Objects XIR3.1/XIR2, Web Intelligence.

Modeling Tools: IBM InfoSphere, SQL Power Architect, Oracle Designer, Erwin, ER/Studio, Sybase PowerDesigner.

Other Tools: TOAD, SQL*Plus, SQL*Loader, MS Project, MS Visio, and MS Office; have also worked with C++, UNIX, PL/SQL, etc.

Education:

Master's in Computer Science, University of Louisiana (Jan 2018 - Dec 2019)

Bachelor's in Computer Science, JNTUH (Sep 2013 - April 2017)

Work Experience:

Client: Wells Fargo, Dallas, TX June 2025 - Present

Role: Data Engineer

Key Contributions:

Involved in daily Scrum (Agile) meetings, sprint planning, and task estimation for user stories; participated in retrospectives and presented demos at the end of each sprint.

Responsible for cluster size estimation, monitoring, and troubleshooting of Spark Databricks clusters.

Designed and built scalable ETL/ELT pipelines using Azure Data Factory (ADF), integrating diverse on-prem and cloud data sources.

Developed and optimized data transformation workflows using Azure Databricks (PySpark, Spark SQL) for large-scale processing.

Built and maintained scalable data pipelines for Wealth Management datasets including client holdings, trades, transactions, portfolio performance, and market data.

Integrated market data (prices, benchmarks, corporate actions) with client portfolios for valuation and performance calculations.
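
A simplified PySpark sketch of this type of valuation join; the paths, schemas, and column names are assumptions for illustration only:

```python
# Illustrative valuation: join end-of-day prices onto client holdings and
# aggregate market value per portfolio. Paths and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("portfolio_valuation_example").getOrCreate()

holdings = spark.read.parquet("/data/silver/holdings")  # portfolio_id, security_id, quantity, as_of_date
prices = spark.read.parquet("/data/silver/prices")      # security_id, price_date, close_price

valued = (
    holdings.join(
        prices,
        (holdings.security_id == prices.security_id)
        & (holdings.as_of_date == prices.price_date),
        "left",
    )
    .withColumn("market_value", F.col("quantity") * F.col("close_price"))
)

portfolio_value = (
    valued.groupBy("portfolio_id", "as_of_date")
          .agg(F.sum("market_value").alias("total_market_value"))
)
portfolio_value.write.mode("overwrite").parquet("/data/gold/portfolio_valuations")
```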

Built scalable data pipelines on Databricks using Apache Spark (PySpark/SQL) for large-scale data processing.

Implemented ingestion frameworks leveraging Azure Event Hubs, Azure IoT Hub, and Azure Functions for real-time and batch data flows.
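
A minimal sketch of the Event Hubs side of such an ingestion path, using the azure-eventhub Python SDK; the connection string, hub name, and payload fields are placeholders:

```python
# Illustrative Azure Event Hubs producer for the real-time ingestion path.
# Connection string, hub name, and payload fields are hypothetical.
import json

from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    conn_str="<EVENT_HUBS_CONNECTION_STRING>",   # placeholder
    eventhub_name="trades-stream",               # hypothetical hub name
)

events = [{"trade_id": "T-1001", "symbol": "MSFT", "qty": 100}]

with producer:
    batch = producer.create_batch()
    for event in events:
        batch.add(EventData(json.dumps(event)))  # serialize each record as JSON
    producer.send_batch(batch)                   # single round trip for the whole batch
```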

Leveraged Microsoft Fabric (OneLake, Data Pipelines) along with existing Azure resources to unify data assets and streamline analytics workflows.

Implemented metadata-driven ETL frameworks ensuring lineage, parameterization, and environment-agnostic deployments.
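
Conceptually, a metadata-driven framework of this kind reads pipeline definitions from configuration rather than hard-coding them; a minimal sketch under assumed config values and storage paths:

```python
# Minimal metadata-driven pattern: each source is described by configuration,
# and one generic PySpark routine handles ingestion. Config values are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("metadata_driven_etl_example").getOrCreate()

# In practice this metadata would live in a control table or JSON/YAML file.
SOURCES = [
    {"name": "holdings", "format": "parquet",
     "path": "abfss://raw@account.dfs.core.windows.net/holdings",
     "target": "abfss://curated@account.dfs.core.windows.net/holdings"},
    {"name": "trades", "format": "csv",
     "path": "abfss://raw@account.dfs.core.windows.net/trades",
     "target": "abfss://curated@account.dfs.core.windows.net/trades"},
]

def ingest(source: dict) -> None:
    """Read one configured source and write it to its curated location."""
    reader = spark.read.format(source["format"])
    if source["format"] == "csv":
        reader = reader.option("header", "true")
    df = reader.load(source["path"])
    df.write.mode("overwrite").parquet(source["target"])

for src in SOURCES:
    ingest(src)   # same code path for every source; only the metadata differs
```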

Supported BI and analytics teams by exposing optimized semantic layers through Synapse SQL, Azure Analysis Services, or Fabric models.

Integrated curated datasets into Power BI for self-service analytics, ensuring performance through partitioning, aggregations, and DAX optimization.

Proficient in Git-based workflows (pull requests, branching, code reviews) within agile development environments.

Familiarity with orchestration tools (Apache Airflow, Prefect, Azure Data Factory) for building and managing complex data workflows.

Demonstrated ability to design and optimize data transformation processes, validation workflows, and workload management systems.

Managed Databricks clusters, autoscaling, and cost optimization strategies.

Built optimized Snowflake schemas (Star, Snowflake, 3NF) using clustering, micro-partitioning, and data distribution best practices.

Implemented Time Travel & Fail-safe strategies for recovery, auditing, and historical data tracking.
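
As a sketch of how Time Travel is exercised from Python, using the Snowflake connector; the connection parameters and table names are placeholders:

```python
# Illustrative Snowflake Time Travel queries via the Python connector.
# Connection parameters and table names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<ACCOUNT>", user="<USER>", password="<PASSWORD>",
    warehouse="ANALYTICS_WH", database="FINANCE", schema="CURATED",
)
cur = conn.cursor()

# Compare current rows with the state one hour ago (Time Travel AT OFFSET).
cur.execute("SELECT COUNT(*) FROM HOLDINGS AT(OFFSET => -3600)")
print("row count 1 hour ago:", cur.fetchone()[0])

# Recover an accidentally dropped table within the retention window.
cur.execute("UNDROP TABLE HOLDINGS_STAGE")

cur.close()
conn.close()
```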

Designed dimensional data models (star/snowflake) and fact/dimension tables in Amazon Redshift for analytics workloads.

Designed data ingestion workflows from on-prem and APIs into BigQuery and Cloud Storage.

Leveraged Pub/Sub for real-time streaming data pipelines integrated with Dataflow and BigQuery.
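
A minimal Apache Beam (Dataflow-compatible) sketch of this streaming pattern; the subscription, table, and schema below are hypothetical:

```python
# Illustrative streaming pipeline: Pub/Sub JSON messages parsed and appended to BigQuery.
# Subscription, table, and schema are hypothetical.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/events-sub")
        | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:analytics.events",
            schema="event_id:STRING,amount:FLOAT,event_ts:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```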

Automated data extraction, file transfers, and job executions using UNIX Shell/Perl scripting integrated with Informatica command-line utilities (pmcmd, pmrep).

Created Shell/Perl scripts for pre- and post-ETL validation, file monitoring, and automated error handling.

Developed parameterized and reusable Informatica mappings, mapplets, and workflows to streamline data movement across environments.

Designed and implemented scalable data warehousing solutions using platforms like Amazon Redshift, Google BigQuery, or Snowflake, optimizing storage and retrieval operations to support business intelligence needs.

Applied strong knowledge of database architecture, ETL pipelines, data lakes, and data warehousing concepts to support scalable data solutions.

Integrated curated datasets into Power BI, Azure Analysis Services, and Redshift for business intelligence and analytics reporting.

Implemented data classification and masking for PII and confidential financial data.

Migrated large-scale on-prem SQL and SSIS workloads to Azure cloud-native architecture with minimal downtime.

Collaborated with Data Science teams to deploy ML models using Azure Machine Learning (AML) with pipelines, model registry, and endpoints.

Built Azure Landing Zones with proper governance, networking, and security baselines for data workloads.

Proficient in data cataloging, data quality management, and governance platforms, ensuring data lineage and compliance.

Designed robust ETL pipelines using PySpark and Apache Airflow, enabling scalable data processing across multiple sources (S3, HDFS, SQL, APIs).

Environment: Cloudera, Eclipse, Hive, Impala, Spark, Apache Kafka, Flume, Scala, AWS, EC2, S3, DynamoDB, Auto Scaling, Lambda, NiFi, Snowflake, Java, Shell-scripting, SQL, GCP, Sqoop, Oozie, PL/SQL, Oracle 12c, SQL Server, HBase, BitBucket, Control-M, Python

Client: American Express, New York Oct 2023 - June 2025

Role: Data Engineer

Responsibilities:

Designed and implemented end-to-end data solutions (storage, integration, processing, visualization) in Azure.

Integrated Fabric-aligned Power BI workspace models and semantic layers to improve cross-team accessibility and consistency of analytical datasets.

Designed and implemented API-based data integrations to ingest and publish data across internal and external systems.
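
A minimal sketch of the ingestion half of such an API integration; the endpoint, auth header, and response fields are hypothetical:

```python
# Illustrative API ingestion: pull paginated records from an internal REST endpoint
# and stage them as JSON lines. URL, auth header, and fields are hypothetical.
import json

import requests

BASE_URL = "https://api.example.internal/v1/transactions"   # placeholder endpoint
HEADERS = {"Authorization": "Bearer <TOKEN>"}                # placeholder auth

def fetch_all(page_size: int = 500):
    """Yield records across pages until the API returns an empty page."""
    page = 1
    while True:
        resp = requests.get(BASE_URL, headers=HEADERS,
                            params={"page": page, "page_size": page_size},
                            timeout=30)
        resp.raise_for_status()
        records = resp.json().get("results", [])
        if not records:
            break
        yield from records
        page += 1

with open("transactions_stage.jsonl", "w") as fh:
    for rec in fetch_all():
        fh.write(json.dumps(rec) + "\n")
```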

Strong understanding of data governance, lineage, and security best practices in regulated financial services environments.

Optimized Azure Data Factory pipelines for performance and cost efficiency by fine-tuning data processing activities, optimizing resource utilization, and implementing caching and partitioning strategies.

Developed comprehensive documentation and provided training sessions to stakeholders and team members on Azure Data Factory best practices, usage guidelines, and troubleshooting techniques.

Designed and implemented data models in Cosmos DB, taking advantage of its support for multiple data models (e.g., SQL, MongoDB, Cassandra).
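
A minimal sketch against the Cosmos DB SQL API with the azure-cosmos Python SDK; the endpoint, key, database, container, and document shape are placeholders:

```python
# Illustrative Cosmos DB (SQL API) setup with the Python SDK.
# Endpoint, key, database, container, and partition key are placeholders.
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<KEY>")

database = client.create_database_if_not_exists(id="wealth")
container = database.create_container_if_not_exists(
    id="client_positions",
    partition_key=PartitionKey(path="/clientId"),   # partition on the client identifier
)

# Upsert a document; the shape is illustrative only.
container.upsert_item({
    "id": "pos-001",
    "clientId": "C-1001",
    "symbol": "AAPL",
    "quantity": 250,
})
```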

Collaborated with product, risk, and business stakeholders to translate Wealth Management requirements into data solutions.

Modeled and managed enterprise datasets in Azure Synapse Analytics, including dedicated SQL pools, serverless pools, and Spark workloads.

Leveraged Azure Data Lake Storage (ADLS Gen2) for hierarchical, secure, and cost-effective data storage following medallion architecture (Bronze/Silver/Gold).
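
A simplified PySpark sketch of the Bronze/Silver/Gold flow on ADLS Gen2; the container, paths, and columns are hypothetical, and Delta format is assumed to be available (as on Databricks):

```python
# Illustrative medallion flow: raw files land in Bronze, cleansed data in Silver,
# business aggregates in Gold. Paths and columns are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion_example").getOrCreate()

BASE = "abfss://lake@<storageaccount>.dfs.core.windows.net"

# Bronze: ingest raw CSV as-is.
bronze = spark.read.option("header", "true").csv(f"{BASE}/landing/trades/")
bronze.write.format("delta").mode("append").save(f"{BASE}/bronze/trades")

# Silver: basic cleansing and typing.
silver = (
    spark.read.format("delta").load(f"{BASE}/bronze/trades")
         .dropDuplicates(["trade_id"])
         .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
)
silver.write.format("delta").mode("overwrite").save(f"{BASE}/silver/trades")

# Gold: business-level aggregate for reporting.
gold = silver.groupBy("client_id").agg(F.sum("amount").alias("total_traded"))
gold.write.format("delta").mode("overwrite").save(f"{BASE}/gold/client_trading_summary")
```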

Designed star schemas, fact/dimension tables, and optimization strategies for high-performance analytics workloads.

Supported early adoption of unified analytics patterns later reflected in Microsoft Fabric by centralizing ingestion, transformation, and reporting workflows.

Designed streaming ingestion pipelines using Pub/Sub integrated with Dataflow for near real-time data ingestion into BigQuery.

Performed performance monitoring with SQL Profiler and Windows System Monitor. Involved in data modeling using Star and Snowflake schemas.

Collaborated with cross-functional teams to design and implement ETL pipelines in Snowflake, ensuring high data quality and consistency for business intelligence (BI) tools.

Extensive experience in building ETL processes and handling structured/unstructured data across diverse data sources.

Strong proficiency in SQL, including complex query optimization and usage of platforms like BigQuery and SQL Mesh (or similar).

Experience in developing and managing APIs and frameworks to streamline data integrations.

Designed and implemented scalable data warehousing solutions using platforms like Amazon Redshift, Google BigQuery, or Snowflake, optimizing storage and retrieval operations to support business intelligence needs.

Performed comprehensive data analysis using tools such as SQL, Excel, and Tableau to generate actionable business insights and support data-driven decision-making processes.

Implemented machine learning models for real-time streaming data using Azure Stream Analytics and Azure Event Hubs.

Environment: Cloudera, Eclipse, Hive, Impala, Spark, Apache Kafka, Scala, PySpark, AWS, EC2, S3, DynamoDB, Auto Scaling, Azure, Lambda, Snowflake, Talend, Data Lake, Cosmos DB, BigQuery, Looker, Terraform, Hadoop, Linux, Shell-scripting, SQL, GCP, Sqoop, Oozie, Java, PL/SQL, Oracle 12c, SQL Server, HBase, BitBucket, Python

Client: UHG, Minneapolis, Minnesota Nov 2021 - Oct 2023

Role: Data Engineer

Responsibilities:

Worked with third-party tools (e.g., Fivetran, Talend) to integrate Snowflake with external data sources, facilitating seamless data ingestion and transformation in a multi-cloud environment.

Implemented CI/CD pipelines for data workflows using Azure DevOps, Git, and automated deployment strategies.

Managed infrastructure-as-code deployments with ARM templates or Terraform for consistent environment provisioning.

Performed ETL workload optimization including partitioning, pushdown optimization, pipeline parallelism, and dependency controls.

Automated ETL validation, error handling, and reconciliation using Unix/Python scripts integrated with Informatica command-line utilities.
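
A minimal, self-contained sketch of the reconciliation idea; in practice the counts would come from queries against the source system and the warehouse, and the table names below are hypothetical:

```python
# Illustrative post-load reconciliation: compare source and target row counts
# and flag mismatches for follow-up. Table names and counts are hypothetical.
import sys

def reconcile(source_counts: dict, target_counts: dict) -> list:
    """Return a list of (table, source_count, target_count) tuples that do not match."""
    mismatches = []
    for table, src_count in source_counts.items():
        tgt_count = target_counts.get(table)
        if tgt_count != src_count:
            mismatches.append((table, src_count, tgt_count))
    return mismatches

# Hard-coded here to keep the sketch self-contained.
source = {"claims": 120_345, "members": 48_210}
target = {"claims": 120_345, "members": 48_207}

problems = reconcile(source, target)
for table, src, tgt in problems:
    print(f"RECONCILIATION MISMATCH: {table} source={src} target={tgt}")

sys.exit(1 if problems else 0)   # non-zero exit lets the scheduler flag the run
```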

Collaborated with cross-functional groups (BI, Finance, and operations teams) to ensure datasets were structured to support broad analytical use cases.

Built real-time dashboards and reporting solutions in conjunction with Snowflake data warehouse, utilizing BI tools such as Tableau and Power BI to empower data-driven decision-making for business leaders.

Created complex SQL queries and stored procedures within Snowflake to support analytical workflows, improving data accessibility and insights generation across multiple teams.

Monitored Snowflake usage and optimized queries for cost-effective operations, reducing compute costs by 20% through the use of query optimization techniques and optimizing storage allocation.

Troubleshot and resolved data-related issues, collaborating with cross-functional teams to identify root causes and implement corrective measures.

Successfully delivered data engineering projects on time and within budget, providing valuable insights and enabling data-driven decision-making for the organization.

Designed and developed Oracle PL/SQL and shell scripts for data import/export, data conversions, and data cleansing.

Environment: Hadoop, Cloudera, Flume, AWS, Azure, Oracle, Jenkins, Kubernetes, Python, Scala, Airflow, Data Lake, PySpark, HBase, HDFS, GCP, MapReduce, Eclipse, Kafka, YARN, Hive, Pig, Sqoop, Oozie, Tableau, Azure Data Factory, Databricks, HDInsight, PL/SQL, MySQL, TEZ

Client: Ascension Technologies, Austin, TX April 2019 - Oct 2021

Role: Data Engineer

Responsibilities:

Responsible for cluster size estimation, monitoring, and troubleshooting of Spark Databricks clusters.

Designed and implemented data models in Cosmos DB, taking advantage of its support for multiple data models (e.g., SQL, MongoDB, Cassandra).

Enforced security and compliance through Azure Active Directory (AAD), RBAC, Managed Identities, and Key Vault–based secret management.

Built data quality checks, validation rules, and observability dashboards using Azure Monitor, Log Analytics, and Databricks Expectations.
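
As a generic illustration of the kinds of validation rules involved (not the exact Databricks expectations used), a PySpark sketch with hypothetical paths, columns, and thresholds:

```python
# Illustrative data-quality checks: null checks, domain checks, and a freshness
# check on a PySpark DataFrame. Paths, columns, and allowed values are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq_checks_example").getOrCreate()
df = spark.read.format("delta").load("/mnt/silver/claims")   # assumed Delta path

checks = {
    "no_null_claim_id": df.filter(F.col("claim_id").isNull()).count() == 0,
    "valid_status_values": df.filter(~F.col("status").isin("OPEN", "CLOSED", "DENIED")).count() == 0,
    "load_date_present": df.agg(F.max("load_date")).first()[0] is not None,
}

failed = [name for name, passed in checks.items() if not passed]
if failed:
    # In the real pipelines these failures fed monitoring dashboards and alerts.
    raise ValueError(f"Data quality checks failed: {failed}")
```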

Implemented governance and cataloging solutions using Microsoft Purview for lineage, metadata management, and data classification.

Built ETL workflows on Spark for batch and real-time processing of structured and semi-structured data from multiple data lakes and sources.

Developed dashboards and visualizations to help business users analyze data and provide insights to senior management, with a focus on Microsoft products such as SQL Server Reporting Services (SSRS) and Power BI.

Delivered analytics by connecting Snowflake with Power BI, Tableau, Looker, and Sigma.

Experienced in Spark application development using Spark SQL on Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to gain insights into customer usage patterns.

Wrote HiveQL as per requirements, processed data in the Spark engine, and stored results in Hive tables.

Environment: Cloudera, Eclipse, Hive, Impala, Spark, Apache Kafka, Flume, Scala, AWS, EC2, S3, DynamoDB, Auto Scaling, Lambda, NiFi, Snowflake, Java, Shell-scripting, SQL, GCP, Sqoop, Oozie, PL/SQL, Oracle 12c, SQL Server, HBase, BitBucket, Control-M, Python


