Data Engineer
Rohit Ibrahimpatnam
Email: ******************@*****.***
Contact: +1-214-***-****
Professional Summary:
Data Engineer with 5 years of experience designing and developing scalable data pipelines across on-prem, AWS, and Azure environments.
Strong expertise in ETL/ELT development, data ingestion, transformation, and optimization using Python, SQL, Apache Spark, and PySpark.
Hands-on experience with Azure Data Factory, Azure Databricks, Azure Synapse Analytics, and ADLS Gen2, as well as AWS Glue, Amazon S3, and Amazon Redshift.
Proven ability to build and maintain incremental and CDC-based data pipelines, ensuring data accuracy, performance, and reliability.
Solid background in data modeling (star and snowflake schemas) to support analytics and business intelligence use cases.
Experienced in implementing data quality checks, validation frameworks, and reconciliation processes across enterprise datasets.
Strong understanding of cloud security and governance, including RBAC, IAM policies, and secrets management.
Skilled in performance tuning of Spark jobs and SQL queries to handle large-scale data processing efficiently.
Hands-on experience with CI/CD pipelines, version control, and production support, ensuring stable and maintainable data platforms.
Effective collaborator with cross-functional teams, translating business requirements into reliable data engineering solutions.
Technical Skills:
Programming Languages: Python (PySpark, Pandas), SQL
Data Engineering: ETL/ELT Development, Data Ingestion, Incremental & CDC Loads, Data Quality & Validation, Data Modeling (Star & Snowflake)
Big Data Processing: Apache Spark, PySpark
Cloud Platforms: Microsoft Azure, Amazon Web Services (AWS)
Azure Services: Azure Data Factory, Azure Databricks, Azure Data Lake Storage Gen2, Azure Synapse Analytics, Azure SQL Database, Azure Key Vault, Azure Active Directory, Azure Monitor, Log Analytics
AWS Services: AWS Glue, Amazon S3, Amazon Redshift, AWS IAM, Amazon CloudWatch
Databases & Storage: Relational Databases (Oracle, MySQL), Azure SQL Database, Amazon Redshift
Data Formats: Parquet, CSV, JSON
Security & Governance: RBAC, IAM Policies, Secrets Management
CI/CD & Version Control: Azure DevOps, Git, CI/CD Pipelines
Scheduling & Orchestration: Azure Data Factory Triggers, AWS Glue Triggers, Cron
Monitoring & Logging: Azure Monitor, Log Analytics, Amazon CloudWatch
Reporting & Analytics: Power BI (Data Consumption Support)
Operating Systems: Linux
Professional Experience:
Client: CMS, Dallas, TX Sept 2023 – Present
Role: Azure Data Engineer
Responsibilities:
Developed and maintained end-to-end data ingestion pipelines using Azure Data Factory, utilizing Self-Hosted Integration Runtime to extract data from on-prem SQL Server and flat files into Azure Data Lake Storage Gen2.
Implemented ETL/ELT processes using Azure Databricks (PySpark) to clean, transform, and standardize large healthcare datasets, applying Spark optimizations such as partitioning, broadcast joins, and caching.
Created and managed analytical tables and views in Azure Synapse Analytics (Dedicated and Serverless SQL Pools) to support reporting and downstream analytics workloads.
Implemented incremental data loads and CDC logic using watermark columns, control tables, and metadata stored in Azure SQL Database to improve pipeline efficiency and reduce processing time.
Performed data modeling in Synapse by building star and snowflake schemas aligned with business reporting requirements and healthcare metrics.
Implemented data quality checks within ADF and Databricks, including schema validation, duplicate detection, null checks, and reconciliation logic to ensure data accuracy and consistency.
Integrated Azure Key Vault with Azure Data Factory and Databricks to securely manage secrets, credentials, and service principals.
Applied role-based access control (RBAC) using Azure Active Directory and managed ADLS Gen2 ACLs to enforce secure access to sensitive healthcare data.
Configured pipeline orchestration using ADF triggers, parameters, and reusable datasets, enabling automated scheduling and dependency management.
Optimized Databricks Spark jobs and Synapse SQL queries by tuning file formats (Parquet), distribution strategies, indexing, and query execution plans.
Monitored pipeline executions and failures using Azure Monitor and Log Analytics, performing root-cause analysis and implementing fixes to improve stability.
Supported business intelligence and reporting teams by delivering curated datasets from Synapse for Power BI consumption.
Managed source control, build, and deployment of data pipelines using Azure DevOps, following CI/CD practices across development, test, and production environments.
Created and maintained technical documentation, data flow diagrams, and operational runbooks for production support.
Tech Stack: Azure Data Factory, Azure Databricks (PySpark/Python), ADLS Gen2, Azure Synapse Analytics, Azure SQL Database, Azure Key Vault, Azure DevOps, Power BI.
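The watermark-based incremental load described above can be sketched in plain Python (standing in for the PySpark/Azure SQL control-table implementation; table, column, and field names are illustrative, not from the actual CMS pipeline):

```python
from datetime import datetime

# Illustrative control-table entry: in the ADF/Databricks pipeline this
# watermark would be stored in Azure SQL Database, not an in-memory dict.
control = {"table": "patient_claims", "last_watermark": datetime(2024, 1, 1)}

def incremental_extract(rows, control):
    """Return only rows modified after the stored watermark, plus the new watermark."""
    wm = control["last_watermark"]
    delta = [r for r in rows if r["modified_at"] > wm]
    new_wm = max((r["modified_at"] for r in delta), default=wm)
    return delta, new_wm

rows = [
    {"id": 1, "modified_at": datetime(2023, 12, 30)},  # already loaded
    {"id": 2, "modified_at": datetime(2024, 1, 5)},    # new since watermark
    {"id": 3, "modified_at": datetime(2024, 1, 7)},    # new since watermark
]

delta, new_wm = incremental_extract(rows, control)
# Only after the load succeeds is the new watermark written back,
# so a failed run safely re-extracts the same delta on retry.
control["last_watermark"] = new_wm
```

The key design point is that the watermark advances only on success, which is what makes the pipeline restartable without full reloads.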
Client: T-Mobile (via Cognizant), India Jan 2022 – July 2023
Role: Data Engineer
Responsibilities:
Developed and maintained scalable data ingestion pipelines using AWS Glue, Python (PySpark), and JDBC connectors to extract data from Oracle and MySQL systems into Amazon S3.
Built ETL workflows using AWS Glue Spark jobs (PySpark) to cleanse, standardize, and enrich high-volume telecom datasets, applying Spark optimizations such as partitioning, caching, and optimized joins.
Designed and managed data lake storage structures in Amazon S3, organizing raw, processed, and curated layers using partitioned Parquet formats for efficient querying.
Implemented analytical data models in Amazon Redshift, creating fact and dimension tables using star and snowflake schemas to support reporting and performance analytics.
Developed incremental data processing logic using Glue job bookmarks and S3 partitioning to reduce full data reloads and improve processing efficiency.
Integrated Apache Spark (PySpark) for complex transformations, aggregations, and window functions on large structured and semi-structured datasets.
Enforced data quality by implementing schema validation, null checks, duplicate detection, and reconciliation logic within Glue and Spark jobs.
Orchestrated end-to-end data workflows using AWS Glue Triggers and Amazon CloudWatch Events, enabling automated scheduling and dependency management.
Implemented security controls using AWS IAM roles and policies, ensuring secure access to S3, Glue, and Redshift in accordance with enterprise standards.
Tuned Redshift performance by configuring distribution styles, sort keys, vacuum operations, and query optimization to support high-concurrency workloads.
Monitored ETL pipelines and cluster performance using Amazon CloudWatch logs and metrics, performing root-cause analysis for job failures and latency issues.
Collaborated with reporting teams to deliver curated datasets from Amazon Redshift for downstream BI and analytics consumption.
Managed source code, versioning, and deployment of data pipelines using Git and CI/CD pipelines, supporting development, test, and production environments.
Tech Stack: AWS Glue (PySpark/Python), Apache Spark, Amazon S3, Amazon Redshift, AWS IAM, Amazon CloudWatch, Git, CI/CD pipelines.
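The data-quality checks listed above (null checks, duplicate detection, record-count reconciliation) follow a pattern that can be sketched in plain Python; in the actual pipelines this logic ran inside Glue/Spark jobs, and the column names here are illustrative:

```python
def quality_report(rows, key, required):
    """Count duplicate keys and nulls in required columns across a batch."""
    seen, duplicates = set(), 0
    null_counts = {c: 0 for c in required}
    for r in rows:
        if r[key] in seen:
            duplicates += 1
        else:
            seen.add(r[key])
        for c in required:
            if r.get(c) is None:
                null_counts[c] += 1
    return {"records": len(rows), "duplicates": duplicates, "null_counts": null_counts}

def reconcile(source_count, target_count):
    """Source-to-target reconciliation: row counts must match exactly."""
    return source_count == target_count

rows = [
    {"msisdn": "111", "plan": "5G"},
    {"msisdn": "222", "plan": None},   # null in a required column
    {"msisdn": "111", "plan": "LTE"},  # duplicate key
]
report = quality_report(rows, key="msisdn", required=["plan"])
```

A batch that fails reconciliation or exceeds a null/duplicate threshold would typically be quarantined rather than promoted to the curated layer.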
Client: RCOM, Hyderabad, India Jan 2021 – Jan 2022
Role: Data Engineer
Responsibilities:
Developed batch data ingestion processes using Python (Pandas, PyODBC) and SQL to extract data from operational systems and load it into a centralized relational database.
Built and maintained ETL pipelines using Python scripts and SQL transformations, performing data cleansing, normalization, and business rule application on telecom datasets.
Designed and managed relational data models, creating fact and dimension tables to support reporting and analytical queries.
Implemented incremental and delta load logic using timestamp columns and surrogate keys to handle daily data refreshes efficiently.
Wrote and optimized complex SQL queries, including joins, CTEs, subqueries, and window functions, to support analytics and reporting needs.
Enforced data quality controls by implementing validation checks, record counts, duplicate detection, and reconciliation logic within ETL processes.
Performed database performance tuning by creating indexes, analyzing execution plans, and optimizing query logic.
Scheduled and monitored ETL jobs using cron and batch scheduling tools, ensuring reliable daily data processing.
Collaborated with reporting teams to deliver curated datasets aligned with business definitions and KPIs.
Documented data flows, transformation logic, and operational procedures to support ongoing maintenance and knowledge transfer.
Tech Stack: Python (Pandas), SQL, Relational Databases (Oracle/MySQL), ETL Scripts, Linux, Cron Scheduling.
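The window-function deduplication mentioned above (latest record per key, a common step in daily delta loads) can be sketched with Python's built-in sqlite3 module, which supports ROW_NUMBER() in SQLite 3.25+; the schema and data are illustrative, not from the actual RCOM systems:

```python
import sqlite3

# In-memory stand-in for the operational database; schema is illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE subscriber_events (
    subscriber_id INTEGER,
    event_type    TEXT,
    event_ts      TEXT
);
INSERT INTO subscriber_events VALUES
    (1, 'activate',    '2021-06-01'),
    (1, 'plan_change', '2021-08-15'),
    (2, 'activate',    '2021-07-10');
""")

# Keep only the latest event per subscriber using ROW_NUMBER() over a
# partition, the standard window-function pattern for deduplication.
latest = con.execute("""
WITH ranked AS (
    SELECT subscriber_id, event_type, event_ts,
           ROW_NUMBER() OVER (
               PARTITION BY subscriber_id ORDER BY event_ts DESC
           ) AS rn
    FROM subscriber_events
)
SELECT subscriber_id, event_type FROM ranked WHERE rn = 1
ORDER BY subscriber_id
""").fetchall()
```

The same CTE-plus-ROW_NUMBER() shape applies unchanged on Oracle or MySQL 8+, only the connection setup differs.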
Certification:
Microsoft Certified: Fabric Data Engineer Associate
Education:
Master's – University of North Texas, Denton, Texas, USA
Bachelor's – CVR College of Engineering, India