
Data Engineer Big

Location:
Coppell, TX
Posted:
June 20, 2025


Resume:

Shireesha Pasuladhi

Data Engineer

Email: ******************.***@*****.*** Phone No: 682-***-****

PROFESSIONAL SUMMARY

Results-driven Data Engineer with 10+ years of experience in designing, building, and optimizing large-scale data ecosystems across on-premises and cloud environments. Proven expertise in leveraging modern big data technologies, including Apache Spark, Databricks, Hadoop, and SQL, to transform raw data into actionable insights that drive strategic decision-making.

Deep expertise in cloud-native data engineering, with hands-on experience in Azure Databricks, Azure Data Factory, AWS Glue, and Amazon EMR for scalable data processing and orchestration.

Proficient in building robust ETL/ELT pipelines using PySpark, Scala, and SQL for ingesting, transforming, and aggregating structured and semi-structured data across various formats (CSV, JSON, Parquet, Avro).

Adept at data modeling, data wrangling, and performance optimization for both OLAP and OLTP systems using SQL, T-SQL, PL/SQL, and NoSQL solutions (MongoDB, Cassandra).

Extensive experience in data lake and data warehouse architecture design and implementation using Delta Lake, Azure Synapse Analytics, Snowflake, and Amazon Redshift.

Strong statistical and analytical background, skilled in predictive analytics, text mining, and natural language processing (NLP) using Python, R, and relevant libraries such as Scikit-learn, NLTK, and TensorFlow.

Hands-on experience creating and managing Spark clusters in Azure HDInsight, configuring Databricks notebooks, and implementing ML pipelines for real-time and batch processing.

Skilled in leveraging Airflow, Databricks Workflows, and Azure Data Factory pipelines for end-to-end data orchestration and automation.

Experience developing scalable and reliable data lakehouses combining structured, semi-structured, and unstructured data to support advanced analytics and BI.

Expertise in data visualization and dashboard development using Tableau, Power BI, and Plotly, enabling effective communication of insights to business stakeholders.

Proficient in SSIS, SSRS, UNIX Shell Scripting, and Excel Pivot Tables for legacy system integration and reporting needs.

Strong collaborative skills, with a history of working closely with data scientists, business analysts, and engineering teams to gather requirements, validate data models, and deploy data solutions to production environments.

Demonstrated excellence in all phases of the data product lifecycle — from requirement analysis and prototyping to deployment and continuous improvement using Agile/Scrum methodologies.

TECHNICAL SKILLS

Big Data/Hadoop Ecosystem

Apache Spark, Databricks, HDFS, MapReduce, Hive, Sqoop, Oozie, ZooKeeper, Kafka, Flume, IntelliJ

Programming Languages

Python, Scala, R, SQL, PL/SQL, Linux Shell Scripts

Cloud Environment

AWS, Azure

Tools

Eclipse, PuTTY, WinSCP, NetBeans, QlikView, Power BI, Tableau

Methodologies

Agile/Scrum, Rational Unified Process, and Waterfall

Distributed Platforms

Cloudera, Hortonworks, MapR

Database

Oracle, MySQL, MS SQL Server, DB2, Teradata

Operating Systems

Linux, Unix, Windows Variants, Mac

PROFESSIONAL EXPERIENCE

Client: Mayo Clinic, Rochester, MN Apr 2023 – Present

Role: Data Engineer

Responsibilities:

Architected and implemented end-to-end data pipelines on GCP and Databricks, supporting large-scale ETL/ELT workloads with structured and unstructured data across diverse sources.

Developed Spark-based transformation jobs in PySpark and Scala, optimizing data ingestion and transformation performance across distributed systems.

Engineered data ingestion pipelines using Databricks Auto Loader, Delta Live Tables (DLT), and Delta Lake to enable real-time analytics and historical data preservation.
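
As a brief illustration of the ingestion pattern this bullet describes, below is a minimal Auto Loader sketch in PySpark; the cloud storage paths, JSON file format, and target table name are illustrative assumptions rather than details of the actual pipeline, and the cloudFiles source requires a Databricks runtime.

```python
# Minimal Databricks Auto Loader -> Delta Lake ingestion sketch (PySpark).
# Paths, schema location, and table name are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw_stream = (
    spark.readStream
    .format("cloudFiles")                        # Databricks Auto Loader source
    .option("cloudFiles.format", "json")         # incoming files assumed to be JSON
    .option("cloudFiles.schemaLocation", "/mnt/landing/_schemas/events")
    .load("/mnt/landing/events/")                # cloud storage landing path
)

(
    raw_stream.writeStream
    .format("delta")                             # persist to a Delta table
    .option("checkpointLocation", "/mnt/bronze/_checkpoints/events")
    .outputMode("append")
    .toTable("bronze.events")                    # managed Delta table
)
```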

Designed dimensional data models including Star Schema, Snowflake Schema, and implemented Slowly Changing Dimensions (SCD) Type I, II for enterprise data warehousing solutions.
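
A simplified two-step sketch of the SCD Type II pattern using the Delta Lake merge API follows; the table names, business key (customer_id), and tracked attribute (address) are hypothetical placeholders, and a production implementation would also restrict step 2 to only new or changed records.

```python
# SCD Type II upsert sketch against a Delta dimension table (names are illustrative).
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
updates = spark.table("staging.customer_updates")      # incoming changed records
dim = DeltaTable.forName(spark, "dw.dim_customer")     # existing dimension table

# Step 1: close out current rows whose tracked attributes have changed.
(
    dim.alias("d")
    .merge(updates.alias("u"), "d.customer_id = u.customer_id AND d.is_current = true")
    .whenMatchedUpdate(
        condition="d.address <> u.address",            # change-detection condition
        set={"is_current": "false", "end_date": "current_date()"},
    )
    .execute()
)

# Step 2: append the new versions as the current rows
# (a full implementation would first filter to changed or new keys only).
(
    updates
    .withColumn("effective_date", F.current_date())
    .withColumn("end_date", F.lit(None).cast("date"))
    .withColumn("is_current", F.lit(True))
    .write.format("delta").mode("append").saveAsTable("dw.dim_customer")
)
```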

Developed and maintained PL/SQL stored procedures, functions, views, and triggers with performance tuning via indexing, materialized views, and partitioning strategies in Oracle and PostgreSQL.

Automated complex ETL workflows using Apache Airflow and Databricks Workflows, improving efficiency and reducing manual data handling by over 45%.
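
For illustration, a minimal Airflow DAG skeleton of the kind described here; the DAG id, daily schedule, and task callables are placeholders, not the actual workflow.

```python
# Minimal Airflow DAG sketch: daily extract -> transform -> load chain.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(**context):
    ...  # pull source data (e.g., from an API or database)


def transform(**context):
    ...  # clean and reshape the extracted data


def load(**context):
    ...  # write results to the warehouse


with DAG(
    dag_id="daily_sales_etl",            # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load   # dependency chain
```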

Utilized Cloud Dataflow, Google Pub/Sub, and BigQuery for real-time event processing and analytics on streaming data sources.

Built custom GCP Cloud Functions in Python to handle data ingestion from GCS buckets and external APIs.
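
A hedged sketch of one such GCS-triggered Cloud Function is shown below; the event fields follow the standard GCS finalize trigger, while the BigQuery dataset and table names are hypothetical.

```python
# Sketch of a GCS-triggered Cloud Function that loads a new file into BigQuery.
# Dataset/table names and CSV settings are illustrative placeholders.
from google.cloud import bigquery


def load_new_object(event, context):
    """Background Cloud Function fired when a file lands in a GCS bucket."""
    uri = f"gs://{event['bucket']}/{event['name']}"

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,                       # infer schema from the file
    )
    load_job = client.load_table_from_uri(uri, "analytics.raw_events", job_config=job_config)
    load_job.result()                          # wait for the load to complete
```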

Applied machine learning models using MLflow, scikit-learn, Pandas, and NumPy to predict healthcare outcomes and enhance data-driven decision-making.
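
As an illustration of this kind of MLflow-tracked model, the sketch below trains a classifier on synthetic data; the experiment name, model choice, and logged metric are assumptions made for the example only.

```python
# MLflow experiment-tracking sketch with a scikit-learn classifier on synthetic data.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=42)   # stand-in dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

mlflow.set_experiment("readmission-risk")      # hypothetical experiment name

with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)

    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    mlflow.log_param("n_estimators", 200)
    mlflow.log_metric("auc", auc)
    mlflow.sklearn.log_model(model, "model")   # persist the trained model artifact
```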

Visualized large datasets using Matplotlib, Seaborn, and Tableau for effective communication of trends, patterns, and business insights.

Performed comprehensive data quality checks, profiling, cleansing, validation, and reconciliation using Great Expectations, PyDeequ, and dbt.
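
A small data-quality sketch follows, assuming the legacy (pre-1.0) Great Expectations dataset interface; the file and column names are placeholders.

```python
# Basic data-quality checks with the legacy Great Expectations PandasDataset API.
import great_expectations as ge
import pandas as pd

df = pd.read_parquet("encounters.parquet")     # hypothetical extract
gdf = ge.from_pandas(df)

# Expectations are registered on the dataset and evaluated together below.
gdf.expect_column_values_to_not_be_null("patient_id")
gdf.expect_column_values_to_be_between("length_of_stay_days", min_value=0, max_value=365)

results = gdf.validate()                       # run all registered expectations
print(results.success)
```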

Leveraged Confluence and Jira for agile project tracking, documentation, and sprint planning.

Designed and deployed containerized data services using Docker and version-controlled artifacts through GitHub Actions and CI/CD pipelines.

Integrated AWS services such as EC2, S3, and Lambda, and deployed serverless solutions across hybrid cloud environments.

Conducted advanced data migration, replication, and synchronization for cloud-to-cloud and on-premise-to-cloud transitions.

Environment: Databricks, Apache Spark, Delta Lake, Apache Airflow, Hadoop, Hive, GCP (BigQuery, GCS, Cloud Functions, Pub/Sub, Dataflow), AWS (EC2, S3, Lambda), Azure Data Factory, Python, PySpark, SQL, PL/SQL, Docker, GitHub, Jenkins, Tableau, Matplotlib, Seaborn, Pandas, NumPy, MLflow, dbt, Great Expectations, Oracle, PostgreSQL.

Client: Fifth Third Bank, Evansville, IN Jan 2022 – Mar 2023

Role: Data Engineer

Responsibilities:

Designed and executed Spark-based cluster jobs in Databricks to consolidate daily sales team metrics for executive-level reporting.

Scheduled and orchestrated ingestion of application analytics data into enterprise data warehouses using Delta Lake and Apache Airflow workflows.

Developed scalable data delivery pipelines in Python, enabling configurable and automated updates to customer-facing data stores.

Utilized GPU-accelerated cloud computing on AWS SageMaker and Databricks MLflow to automate machine learning and analytics pipelines.

Applied supervised learning algorithms such as Decision Trees, Linear and Logistic Regression, and advanced statistical modeling for predictive analytics.

Monitored and optimized Hadoop/Hive clusters, fine-tuning query performance and cluster resource allocation to meet SLA requirements.

Led impact analysis for system enhancements, optimized ETL strategies, and implemented robust data transformation processes using Spark Structured Streaming.

Designed and maintained ETL transformation layers supporting high-throughput data pipelines for structured and semi-structured data sources.

Developed Docker-based Continuous Integration/Continuous Delivery (CI/CD) pipelines integrated with GitHub Actions and deployed across AWS ECS.

Implemented Apache Airflow DAGs to automate batch and real-time data pipelines with dependency handling and dynamic scheduling logic.

Consolidated and analyzed multi-source data to derive business insights using PySpark, SQL, and data profiling frameworks like Great Expectations.

Engaged in full Agile software development lifecycle (SDLC), including sprint planning, code reviews, and post-deployment support in Scrum environments.

Documented technical specifications, data mapping documents, and test strategies to ensure full traceability and maintainability of data pipelines.

Facilitated cross-functional collaboration with business and engineering stakeholders to deliver reliable, reusable, and scalable data solutions.

Managed data ingestion, transformation, integration, and validation across data lakes, data warehouses, operational data stores, and MDM systems.

Evaluated existing systems for optimization opportunities and implemented proactive changes to improve data reliability and reduce latency.

Environment: Databricks, Apache Spark, Delta Lake, Apache Airflow, Hadoop, Hive, AWS (S3, EC2, SageMaker, Lambda), GCP (BigQuery, GCS, Pub/Sub), Azure Data Factory, Python, PySpark, SQL, MLflow, Docker, GitHub Actions, Tableau, Great Expectations, Oracle, PostgreSQL, Windows, Confluence, Jira

Client: Homesite Insurance, Boston, MA Sep 2019 – Dec 2021

Role: Data Engineer

Responsibilities:

Successfully migrated on-premises SQL Server databases to Azure Data Lake Storage, Azure SQL Database, Azure Synapse Analytics, and Databricks, while managing user roles, access control, and security policies.

Engineered scalable ETL/ELT pipelines using Ab Initio and PySpark on Azure Databricks, improving data ingestion and transformation efficiency by 27%.

Designed and implemented efficient data models applying normalization, denormalization, indexing, and partitioning techniques for high-performance query execution and data retrieval.

Developed and optimized complex T-SQL stored procedures, views, triggers, and functions in SQL Server to support advanced ETL workflows and analytical processing.

Built interactive and data-rich Power BI dashboards to visualize key metrics and KPIs, enabling real-time business intelligence and strategic decision-making.

Orchestrated data movement and transformation workflows using Azure Data Factory, and performed large-scale analytics in Azure Synapse Analytics with seamless integration to downstream BI tools.

Applied advanced Microsoft Excel capabilities including VLOOKUP, XLOOKUP, Power Query, and pivot tables to support data profiling, reconciliation, and ad-hoc analysis.

Utilized Git/GitHub for source control and implemented CI/CD pipelines to streamline deployment processes across Power BI, Azure Data Factory, and dbt workflows.

Collaborated cross-functionally with data analysts, architects, and business stakeholders to translate functional requirements into scalable and reusable data solutions.

Maintained comprehensive documentation for data mapping, ETL design, testing protocols, and environment configurations, ensuring transparency and audit readiness.

Environment: Databricks, PySpark, Azure Data Factory, Azure Synapse Analytics, Azure Data Lake, Azure SQL Database, SQL Server, Power BI, Ab Initio, dbt (early adoption), Git/GitHub, T-SQL, Power Query, VLOOKUP, XLOOKUP, Pivot Tables, CI/CD, Windows Server, Excel 2016/365

Client: Germin8, Mumbai, India Feb 2017 – May 2019

Role: Data Engineer

Responsibilities:

Independently developed end-to-end Big Data proof-of-concepts (POCs) leveraging Apache Spark, Hadoop, and Databricks, showcasing performance and scalability to stakeholders.

Architected enterprise-grade data solutions integrating Kafka, Spark Structured Streaming, and Delta Lake for real-time ingestion and transformation of high-volume datasets.
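
A minimal sketch of the Kafka-to-Delta streaming pattern described above; the broker address, topic, event schema, and storage paths are illustrative assumptions, and the Kafka source requires the spark-sql-kafka connector on the classpath.

```python
# Kafka -> Spark Structured Streaming -> Delta Lake sketch (names are placeholders).
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.getOrCreate()

event_schema = T.StructType([
    T.StructField("event_id", T.StringType()),
    T.StructField("user_id", T.StringType()),
    T.StructField("event_ts", T.TimestampType()),
])

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
    .option("subscribe", "user-events")                  # hypothetical topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

(
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "/data/checkpoints/user_events")
    .outputMode("append")
    .start("/data/delta/user_events")                    # Delta table path
)
```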

Resolved data integrity issues preventing EDI file loads, while delivering predictive analytics documents and MicroStrategy dashboards to empower BI teams.

Utilized Google Cloud Vision API to perform image classification, OCR, and label detection; integrated vision features into scalable applications for image-based analytics.
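
A brief sketch of label detection and OCR, assuming the 2.x google-cloud-vision client library; the image file is a placeholder.

```python
# Label detection and OCR with the Google Cloud Vision client library (2.x API).
from google.cloud import vision

client = vision.ImageAnnotatorClient()

with open("sample.jpg", "rb") as f:                # hypothetical local image
    image = vision.Image(content=f.read())

labels = client.label_detection(image=image)        # image classification labels
for label in labels.label_annotations:
    print(label.description, label.score)

ocr = client.text_detection(image=image)             # basic OCR on the same image
print(ocr.full_text_annotation.text)
```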

Applied text mining and natural language processing (NLP) techniques to transform unstructured text into structured formats using Python, enabling sentiment and behavioral analysis.

Developed QlikView data models pulling from diverse sources including DB2, Excel, CSV files, and Big Data platforms, improving analytical capabilities for business users.

Created Python-based utilities using pandas, SciPy, and NumPy for data wrangling, exploratory analysis, and transformation across structured and semi-structured datasets.

Implemented supervised classification algorithms such as Logistic Regression, Decision Trees, K-Nearest Neighbors, and Naive Bayes to support customer segmentation and predictions.
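
An illustrative baseline for this kind of supervised classification is sketched below; the feature columns and churn label are hypothetical, not the actual client dataset.

```python
# Supervised classification baseline with scikit-learn (column names are placeholders).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("customer_features.csv")            # hypothetical feature table
X = df[["recency_days", "frequency", "monetary_value"]]
y = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```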

Employed OpenCV for drone-based object detection and image processing as part of an experimental image analytics initiative.

Conducted comprehensive data profiling and unified data across sources to understand user behavior and support downstream analytics.

Built scalable Big Data ingestion and processing frameworks using Hadoop, Hive, and Spark to normalize large volumes of open and proprietary data.

Designed machine learning frameworks in Python, R, and MATLAB, and integrated outputs into BI tools such as MicroStrategy using R-script connectors.

Developed reusable object-oriented solutions and components for core data products, aligned with long-term enterprise architecture goals.

Utilized Teradata utilities like FastExport and MLOAD to migrate data between OLTP and OLAP environments.

Partnered with engineers and DBAs to optimize SQL queries, improve ETL reliability, and extract data from Oracle, Greenplum, and SQL Server systems.

Led data science analytics initiatives using TensorFlow, Spark MLlib, and PySpark to support predictive modeling and automation.

Coordinated and analyzed A/B tests to measure the performance of personalized recommendation systems.

Built and deployed interactive Tableau dashboards to visualize trends, KPIs, and user behavior insights.

Developed marketing performance evaluations based on customer segmentation and behavioral analysis.

Applied NLP techniques to analyze customer satisfaction feedback, identifying actionable themes to improve customer experience.

Environment: MATLAB, MongoDB, exploratory analysis, feature engineering, K-Means Clustering, Hierarchical Clustering, Machine Learning, Python, Spark (MLlib, PySpark), Tableau, MicroStrategy, Windows.

Client: SIBIA Analytics, Kolkata, India Jan 2015 – Jan 2017

Role: Data Analyst

Responsibilities:

Defined architecture blueprints and advised enterprise clients on cloud adoption strategies, data management best practices, and scalable analytics frameworks.

Delivered technical leadership by preparing software design proposals and ensuring alignment between business objectives and technical implementation.

Leveraged Azure Data Lake Store and Azure Cloud Services to enable scalable, secure storage and analytics for large volumes of structured and unstructured data.

Employed PySpark and SparkSQL within Hadoop ecosystems to accelerate data preparation and testing, improving ETL performance across data pipelines.

Designed and migrated data systems using a suite of big data tools including HDInsight, Apache Spark, Hive, Pig, Sqoop, and HDFS.

Implemented cloud-based analytical applications using Azure SQL, NoSQL databases, and Azure Data Warehouse technologies.

Tuned and optimized high-volume ETL mappings and data flows for both relational and non-relational data environments, improving performance by up to 30%.

Built robust, automated ETL pipelines in Azure Data Factory, orchestrating daily ingestion and transformation from various sources including on-prem databases and flat files.

Designed modular, pull-based ETL architectures to prepare cleansed and structured datasets for downstream consumption in Azure SQL and Power BI.

Integrated job orchestration using HDInsight and Azure Machine Learning (ML) to schedule and execute data models for advanced analytics.

Created interactive and executive-level dashboards using Power BI, enabling stakeholders to access KPIs and trends in real time.

Modernized legacy ingestion workflows by redesigning data pipelines for improved reliability, scalability, and fault tolerance.

Managed all stages of the software development lifecycle including requirement gathering, data modeling, ETL development, testing, deployment, and documentation.

Environment: Python, PySpark, Azure Data Factory, Azure Data Lake Store, Power BI, Azure SQL, Azure ML, SSIS, HDInsight, Azure Databricks, Oracle, Windows.


