Sr. Data Engineer
Phone Number: 945-***-****
Email: *************************@*****.***
LinkedIn: https://www.linkedin.com/in/rakesh-reddy-donthireddy/
Professional Summary:
10+ years of experience as a Data Engineer, Big Data Engineer, ETL Developer, and Data Warehouse Developer, applying data mapping, validation, and statistical analysis to translate business requirements into actionable insights.
Proficient in Azure Services, Cloudera, Hadoop Ecosystem, Spark (PySpark/Scala), Python, Databricks, MapReduce, Tez, Hive, Redshift, Snowflake, relational databases, and visualization tools like Tableau and Power BI.
Well-versed in Azure services including Azure Databricks, Azure Stream Analytics, Azure Synapse Analytics, Azure Log Analytics, Azure Security Center, Azure Event Hubs, Azure HDInsight, Azure Logic Apps, triggers, Azure Cosmos DB, and DAX.
Worked extensively with Big Data technologies, including HDFS, MapReduce, YARN, Pig, Hive, Sqoop, Oozie, Zookeeper, Kafka, Cassandra (CDL), and Teradata, to build a robust and scalable big data infrastructure.
Developed scalable distributed data processing solutions using Spark Streaming, Kafka, and Azure Event Hubs, enabling real-time data ingestion and processing for enterprise analytics.
Integrated on-premises and cloud-based data sources with Azure Data Factory, applying data transformations to manage seamless data movement to Snowflake.
Established a scalable data ecosystem by integrating Azure Data Factory with Blob Storage, Azure SQL Database, Azure Databricks, Azure Synapse Analytics, and Databricks SQL Analytics.
Designed and implemented end-to-end data integration pipelines with Azure Data Factory, optimizing ETL/ELT workflows for efficient data movement and transformation.
Built data pipelines and transformations using Azure Data Lake Storage (ADLS Gen2), Azure SQL Data Warehouse, Azure DevOps, HDInsight, and PySpark with Databricks, ensuring structured data processing.
Developed and optimized data models and schemas with Apache Hive, Apache HBase, and Snowflake, ensuring efficient data retrieval, storage, and analytics.
Designed high-throughput data ingestion pipelines using Apache Kafka, enabling real-time data streaming for business intelligence applications.
Proficient in Kusto Query Language (KQL), writing and optimizing queries to extract meaningful insights and improve data accuracy.
Developed and automated data extraction and processing scripts using Python, improving overall process efficiency and data workflow automation for multiple projects.
Demonstrated strong expertise in Azure Data Explorer, leveraging it for data modeling, reporting, and creating dashboards to support business decision-making.
Created and managed interactive Power BI reports and dashboards, allowing for better data visualization and understanding of complex data sets.
Utilized Power Apps/Power Automate to design and implement automated workflows, streamlining processes and improving overall operational efficiency.
Integrated Azure Monitor, Log Analytics, and Prometheus to create real-time performance monitoring dashboards, ensuring data pipeline health and proactive incident management.
Developed and automated ETL/ELT workflows with Apache Spark, Apache Beam, and Apache Airflow, ensuring seamless data movement and transformation.
Applied data warehousing methodologies, including Slowly Changing Dimensions (SCD), Change Data Capture (CDC), surrogate key assignment, and data cleansing, improving historical data tracking in Star Schema models (see the sketch following this summary).
Implemented Unity Catalog for establishing a centralized metadata repository, improving data governance, asset discovery, and lineage tracking.
Integrated Collibra Data Governance for enforcing consistent data standards, compliance, and regulatory policies, ensuring organizational data integrity.
Enhanced data lifecycle management by designing data retention, archival, and purging strategies for compliance with GDPR, HIPAA, and enterprise governance policies.
Provided technical leadership and mentorship to junior engineers and cross-functional teams, conducting training sessions on best practices in data engineering, cloud architecture, and big data processing.
Designed and implemented CI/CD pipelines using Azure DevOps, Terraform, and GitHub Actions, automating deployment of data pipelines, infrastructure, and monitoring alerts.
Experienced in developing interactive dashboards and reports using Tableau, Power BI, and Excel, enabling data-driven insights and business intelligence.
Experienced in working within Agile (Scrum, Kanban) and Waterfall methodologies, actively participating in sprint planning, daily stand-ups, backlog grooming, and project lifecycle management for efficient software delivery.
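The SCD handling noted above can be illustrated with a minimal PySpark/Delta Lake sketch; the table and column names (dim_customer, stg_customer_changes, customer_id, address, change_date) are hypothetical, and a production SCD Type 2 load would add a second step to insert the new version of each changed row.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("scd2_sketch").getOrCreate()

# Close out the current version of changed rows and insert brand-new keys.
# Assumes Delta Lake tables; all object names are hypothetical placeholders.
spark.sql("""
    MERGE INTO dim_customer AS tgt
    USING stg_customer_changes AS src
      ON tgt.customer_id = src.customer_id AND tgt.is_current = true
    WHEN MATCHED AND tgt.address <> src.address THEN
      UPDATE SET is_current = false, end_date = src.change_date
    WHEN NOT MATCHED THEN
      INSERT (customer_id, address, effective_date, end_date, is_current)
      VALUES (src.customer_id, src.address, src.change_date, NULL, true)
""")
# A second INSERT of the changed source rows as new current records completes the Type 2 load.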
Technical Skills:
Azure Services: Azure Data Factory, Azure Event Hubs, Azure Data Lake Storage (ADLS Gen2), Azure Cosmos DB, Azure Active Directory, Azure Blob Storage, Azure Databricks, Azure Synapse Analytics, Databricks SQL Analytics, Azure Stream Analytics, Azure Purview, Azure Data Lake, Azure Monitor and Log Analytics, Logic Apps, Function Apps, Azure DevOps, Airflow, Azure Notebooks, Data Warehousing, Control-M, Trino, Alteryx.
Big Data Technologies: HDFS, MapReduce, YARN, Hive, Pig, Scala, Spark Streaming, Splunk, Sqoop, Zookeeper, Apache Spark, Apache Flink, Apache Kafka, MapR Control System, Oozie, Apache Superset, Cloudera Hue.
ETL Tools: Microsoft SQL Server Integration Services (SSIS), Informatica PowerCenter 8.1.0, IBM Information Server, Talend, OBIEE, Matillion, DAX, Tableau, Power BI, Coupa, Qlik Sense.
Methodologies: Agile, Scrum, Waterfall, Kanban.
Languages: PySpark, SQL, PL/SQL, Python, Pig Latin, Java, HiveQL, Scala.
Databases: MS SQL Server 2016/2014/2012, Oracle 11g/10g, MySQL, Teradata, PostgreSQL, HBase, MongoDB, Snowflake, Cosmos DB, Cassandra.
Hadoop Distributions: Cloudera, Hortonworks.
Operating Systems: Windows, Ubuntu, Linux, UNIX.
Version Control: GitLab, GitHub, Bitbucket, VisualSVN, Assembla.
IDE & Build Tools: Eclipse, Visual Studio, Sublime Text, Databricks, Jupyter Notebook, OpenShift, Kubernetes, DataOps, Custom Connector Development.
Education:
Master of Science, Missouri State University, Springfield, Missouri, May 2014
Bachelor of Computer Science, Lovely Professional University, Jalandhar, Punjab, May 2012
Work Experience:
Client: Citibank, NJ April 2022 - Present
Role: Data Engineer
Responsibilities:
Designed and developed scalable ETL frameworks in Azure Data Factory, ingesting and transforming structured and semi-structured data from Blob Storage, SQL Server, and Cosmos DB into Azure Synapse Analytics and ADLS Gen2.
Implemented a Medallion Architecture using Azure Data Lake Storage Gen2 and Azure Databricks to standardize data ingestion, refinement, and analytics; the layered approach enabled efficient data governance, scalability, and simplified reporting for business stakeholders (a minimal PySpark sketch follows this role).
Built reusable ingestion templates in ADF with support for parameterization, delta logic, and complex control flows; ensured automated error handling and alerting via Azure Monitor and Logic Apps.
Developed transformation logic using Azure Databricks (PySpark) to cleanse and enrich data across batch and real-time use cases, integrating outputs with Power BI and Azure Analysis Services (AAS) for analytics consumption.
Integrated Collibra Data Governance for enforcing consistent data standards, compliance, and regulatory policies, ensuring organizational data integrity.
Engineered data pipelines integrating Cosmos DB, Azure Data Explorer (ADX), and Synapse, leveraging Kusto Query Language (KQL) for high-performance data exploration and analytics.
Designed and optimized large-scale data ingestion workflows using Scope scripts in Cosmos DB, supporting telemetry pipelines and structured logging for cloud-native systems.
Integrated Enterprise Data Lake (EDL) pipelines with Cosmos and EventHub sources using Azure Data Factory, streamlining ingestion and transformation of streaming data feeds.
Developed optimized Scope queries for data summarization, flattening, and ingestion across high-velocity telemetry datasets in Cosmos environment.
Created technical documentation for NiFi data flows, workflow patterns, and troubleshooting playbooks to standardize maintenance and support procedures.
Maintained and enhanced legacy data pipelines using SQL, Liquibase, SSIS, AutoSys, and TFS to support critical business operations.
Designed and implemented scalable ETL solutions within data lake architecture using Hive, Dremio, PySpark, and Airflow.
Developed robust Python scripts for data transformation, orchestration, and ingestion across distributed systems.
Demonstrated strong ability to interpret and work with codebases in compiled languages such as Java and C#, contributing to efficient legacy system support.
Collaborated with data scientists and analysts to prepare and transform raw datasets into analysis-ready formats using Spark SQL and PySpark DataFrame APIs.
Managed and updated Erwin models for logical and physical data modeling of Consolidated Data Store (CDS), Actuarial Data Mart (ADM), and Reference DB to align with evolving user requirements.
Utilized TFS for version control and tracking environment-specific script deployments, ensuring consistency across database environments.
Led governance-aligned data engineering practices by integrating Azure Purview and Snowflake Access Controls to maintain regulatory compliance across pharma data domains.
Supported data governance and process control frameworks, collaborating with business SMEs and data stewards to ensure compliance, audit readiness, and traceability of pharma data pipelines.
Implemented RBAC policies, encryption, and audit logging using Azure and Snowflake features to ensure data security and regulatory compliance.
Utilized Azure Resource Manager (ARM) templates for repeatable infrastructure provisioning and leveraged Azure Import/Export Service for efficient bulk data transfers.
Utilized Shell scripting and Python to automate ETL maintenance, data validation, backups, and batch job orchestration.
Developed and maintained CI/CD pipelines using Jenkins, Azure DevOps, and Git, automating deployment of ETL pipelines and database schema changes.
Automated CI/CD deployment pipelines with Jenkins to streamline integration with PySpark-based data processing pipelines.
Applied data lifecycle management strategies, including archiving and purging via ADF, OpenShift, and stored procedures, ensuring performance tuning and compliance with retention policies.
Environment: Azure Data Lake Storage, Azure Data Factory, Azure Databricks, Azure SQL Database, Azure Synapse Analytics, Snowflake, Apache Spark, PySpark, Scala, Snowflake Time Travel, Snowflake Streams, Snowflake Tasks, Hadoop, Hive, Apache Kafka, Apache Airflow, Airflow DAGs, Azure Purview, Azure Virtual Network, Azure Cost Management, ARM Templates, DAX, Jenkins, Pentaho Data Integration (PDI/Kettle), Spoon Designer, CI/CD, JIRA, Git, OpenShift, Agile Methodologies.
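As referenced in the Medallion Architecture bullet above, the following is a minimal PySpark sketch of a Bronze-to-Silver cleansing step on Databricks; the ADLS Gen2 account, container paths, and column names are hypothetical placeholders, not the project's actual objects.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bronze_to_silver_sketch").getOrCreate()

# Hypothetical Bronze-layer Delta table landed by ADF ingestion.
bronze = spark.read.format("delta").load("abfss://bronze@acct.dfs.core.windows.net/transactions")

silver = (bronze
          .dropDuplicates(["txn_id"])                       # drop replayed events
          .filter(F.col("amount").isNotNull())              # basic quality gate
          .withColumn("txn_date", F.to_date("txn_ts"))      # derive partition column
          .withColumn("ingest_ts", F.current_timestamp()))  # audit column

(silver.write.format("delta")
       .mode("overwrite")
       .partitionBy("txn_date")
       .save("abfss://silver@acct.dfs.core.windows.net/transactions"))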
Client: TennCare, State of Tennessee, TN. Jun 2019 – March 2022
Role: Data Engineer
Responsibilities:
Designed and developed scalable ETL frameworks in Azure Data Factory, ingesting and transforming structured and semi-structured data from Blob Storage, SQL Server, and Cosmos DB into Azure Synapse Analytics and ADLS Gen2.
Implemented a Medallion Architecture (Bronze, Silver, Gold layers) using Azure Data Lake Storage Gen2 and Azure Databricks to standardize data ingestion, refinement, and analytics; the layered approach enabled efficient data governance, scalability, and simplified reporting for business stakeholders.
Integrated Collibra Data Governance for enforcing consistent data standards, compliance, and regulatory policies, ensuring organizational data integrity.
Built reusable ingestion templates in ADF with support for parameterization, delta logic, and complex control flows; ensured automated error handling and alerting via Azure Monitor and Logic Apps.
Developed transformation logic using Liquibase and Azure Databricks (PySpark) to cleanse and enrich data across batch and real-time use cases, integrating outputs with Power BI and Azure Analysis Services (AAS) for analytics consumption.
Engineered data pipelines integrating Cosmos DB, Azure Data Explorer (ADX), and Synapse, leveraging Kusto Query Language (KQL) for high-performance data exploration and analytics.
Maintained and enhanced legacy data pipelines using SQL, Liquibase, SSIS, AutoSys, and TFS to support critical business operations.
Designed and implemented scalable ETL solutions within data lake architecture using Hive, Dremio, PySpark, and Airflow.
Developed robust Python scripts for data transformation, orchestration, and ingestion across distributed systems.
Demonstrated strong ability to interpret and work with codebases in compiled languages such as Java and C#, contributing to efficient legacy system support.
Integrated Enterprise Data Lake (EDL) pipelines with Cosmos and EventHub sources using Azure Data Factory, streamlining ingestion and transformation of streaming data feeds.
Built and deployed machine learning-ready datasets, applying feature engineering techniques such as rolling averages, signal smoothing, and lag-based features for modeling activity and biometric trends (see the windowing sketch following this role).
Contributed to early development and evaluation of LLM/GenAI-based solutions for summarizing patient trends, interpreting sensor patterns, and auto-generating clinician-ready reports from raw biometric logs.
Created reusable Python libraries for data validation, signal transformation, and unit testing, improving reusability and modularity across internal analytics teams.
Participated in research projects focused on digital biomarker innovation, working cross-functionally with clinical scientists, informaticians, and biostatisticians to validate derived health indicators.
Ensured HIPAA and GDPR compliance through encryption, audit logs, and access control in both development and production pipelines.
Maintained clear technical documentation, reports, and white papers to present methodology, exploratory findings, and outcomes to both technical and non-technical stakeholders.
Implemented RBAC policies, encryption, and audit logging using Azure and Snowflake features to ensure data security and regulatory compliance.
Collaborated closely with business analysts to translate business needs into technical data models, documenting system architecture, data lineage, and reporting logic.
Applied Agile methodologies to streamline development workflows, ensuring iterative delivery, continuous integration, and cross-functional collaboration.
Environment: Azure Data Factory, Azure Synapse Analytics, Azure Data Lake Storage Gen2 (ADLS Gen2), Delta Lake, Blob Storage, Parquet, Azure Databricks, JSON, Avro, Azure SQL Database, Change Data Capture (CDC), Power BI, Apache Airflow, PySpark, Materialized Views, Query Caching, Azure Monitor, Azure NSGs, Load Balancers, Autoscaling, Microsoft Purview, Azure RBAC, Active Directory (AD), Azure Key Vault, Azure Log Analytics, Tableau, OpenShift, Agile.
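A minimal PySpark sketch of the rolling-average and lag-based feature engineering mentioned above; the source table and column names (biometric_readings, patient_id, heart_rate, reading_ts) are hypothetical.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("feature_eng_sketch").getOrCreate()
readings = spark.table("silver.biometric_readings")   # hypothetical source table

w_order = Window.partitionBy("patient_id").orderBy("reading_ts")
w_roll = w_order.rowsBetween(-6, 0)                    # trailing 7-reading window

features = (readings
            .withColumn("hr_7pt_avg", F.avg("heart_rate").over(w_roll))    # rolling average
            .withColumn("hr_lag_1", F.lag("heart_rate", 1).over(w_order))  # lag feature
            .withColumn("hr_delta", F.col("heart_rate") - F.col("hr_lag_1")))

features.write.mode("overwrite").saveAsTable("gold.biometric_features")    # hypothetical target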
Client: T. Rowe Price Group, Inc., MD Jan 2018 – May 2019
Role: Big Data Engineer
Responsibilities:
Designed and implemented scalable data ingestion pipelines using Apache Spark, Apache Kafka, and Apache Flink to collect structured and unstructured data from customer databases, IoT devices, and third-party sources.
Developed ETL workflows in Apache Spark, PySpark, and SQL to clean, transform, and load data into Hadoop Distributed File System (HDFS), Apache Hive, and Snowflake for large-scale processing and analytics.
Collaborated with business analysts and product owners to translate business requirements into technical specifications, data models, and implementation plans for enterprise-grade data pipelines.
Participated in agile ceremonies (daily stand-ups, sprint planning, retrospectives) while tracking delivery timelines and dependencies using JIRA and Azure Boards.
Designed and managed enterprise-scale data pipelines in Azure Data Factory, implementing dynamic control flows, parameterization, and reusable pipeline templates for ingestion and transformation.
Engineered Snowflake-based data warehouses with advanced features like Streams, Tasks, and Time Travel, ensuring version control, data auditing, and fault tolerance.
Optimized Spark SQL and Hive queries by leveraging Adaptive Query Execution (AQE), Dynamic Partition Pruning, Materialized Views, and Bucketing, improving query response times by 50% (see the configuration sketch following this role).
Provisioned and managed Azure Databricks and HDInsight clusters, ensuring seamless scalability and high availability for big data workloads.
Maintained and enhanced legacy data pipelines using SQL, Liquibase, SSIS, AutoSys, and TFS to support critical business operations.
Designed and implemented scalable ETL solutions within data lake architecture using Hive, Dremio, PySpark, and Airflow.
Developed robust Python scripts for data transformation, orchestration, and ingestion across distributed systems.
Demonstrated strong ability to interpret and work with codebases in compiled languages such as Java and C#, contributing to efficient legacy system support.
Created templatized ELT solutions using Azure Data Factory and Databricks, enabling reusable logic, consistent naming conventions, and scalable ingestion for sales and incentive data.
Developed and maintained detailed work instructions and repeatable process documentation, standardizing execution across engineering and operations teams.
Supported data governance and process control frameworks, collaborating with business SMEs and data stewards to ensure compliance, audit readiness, and traceability of pharma data pipelines.
Led governance-aligned data engineering practices by integrating Azure Purview and Snowflake Access Controls to maintain regulatory compliance across pharma data domains.
Developed Azure Functions and Logic Apps to automate file ingestion, transformation, and data validation processes from third-party vendors in the life sciences domain.
Supported multiple pharma and life sciences implementations, collaborating with business stakeholders to align data pipelines with regulatory and commercial needs.
Collaborated with data governance teams to implement Master Data Management (MDM) strategies, ensuring a single source of truth for HCP (healthcare professional) and account data.
Implemented CI/CD pipelines using Azure DevOps integrated with GitHub, automating build, test, and release workflows for ETL jobs and infrastructure deployment.
Delivered compliant data solutions in FDA-regulated environments, ensuring secure processing, auditing, and validation for critical sales and customer data workflows.
Environment: Apache Spark, PySpark, Apache Kafka, Apache Flink, Apache Hive, Hadoop Distributed File System (HDFS), Snowflake, Azure Data Lake Storage (ADLS Gen2), Delta Lake, Trino (Presto), Parquet, ORC, Avro, Apache Ranger, Microsoft Purview, Kafka Streams, Spark Structured Streaming, Power BI, Apache Atlas, Apache Airflow, Oozie, Azure Databricks, HDInsight, PostgreSQL, MySQL, SQL Server, Terraform, Kubernetes, Jenkins, GitHub Actions, Azure DevOps, YARN, Informatica, Talend, Coupa.
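A minimal sketch of the Spark 3.x tuning referenced above; the configuration keys are the standard AQE and dynamic partition pruning flags, while the table names and filter values shown are illustrative assumptions.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("aqe_tuning_sketch")
         .config("spark.sql.adaptive.enabled", "true")                        # Adaptive Query Execution
         .config("spark.sql.adaptive.coalescePartitions.enabled", "true")     # merge small shuffle partitions
         .config("spark.sql.adaptive.skewJoin.enabled", "true")               # split skewed join partitions
         .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
         .getOrCreate())

# Join a large partitioned fact table to a filtered dimension; dynamic partition
# pruning skips fact partitions at runtime. Table and column names are hypothetical.
fact = spark.table("sales_fact")
dim = spark.table("date_dim").filter("fiscal_year = 2019")
result = fact.join(dim, "date_key").groupBy("region").sum("net_amount")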
Client: Toyota Motors, Plano, TX Jan 2017 – Dec 2017
Role: Big Data Developer
Responsibilities:
Designed and deployed an Enterprise Data Lake to support analytics, data processing, storage, and reporting of large-scale, rapidly evolving datasets.
Gained expertise in the full Software Development Life Cycle (SDLC), including business requirements analysis, programming, database design, data warehousing, and business intelligence, applying both Star and Snowflake schema methodologies.
Developed and optimized T-SQL scripts, stored procedures, functions, table variables, indexes, common table expressions (CTEs), views, and temporary tables in SQL Server for efficient query execution (see the sketch following this role).
Led the design, development, and testing of ETL processes using Informatica PowerCenter for building robust data warehouse applications and test data management solutions.
Identified and addressed data quality issues at the field and column levels in source systems, improving data integrity through validation, cleansing, and error handling in ETL workflows.
Designed and developed ETL packages in SQL Server Integration Services (SSIS) to extract, transform, and load data from multiple sources, including SQL Server, Oracle, flat files, CSV, and XML.
Applied strong expertise in OLTP and OLAP data warehousing, data mining, data governance, and overall data management services, ensuring data quality and consistency.
Analyzed transactional database schemas and designed optimized star schema models to support business intelligence and reporting requirements.
Deployed SSIS packages into production environments, leveraging various package configurations for seamless migration and maintaining an environment-independent setup.
Developed and optimized SQL and PL/SQL scripts for database triggers, performance tuning, and efficient data processing.
Created complex T-SQL stored procedures, functions, triggers, cursors, and schema objects, enhancing database functionalities.
Designed and implemented Drill-through, Drill-down, Sub-reports, and Linked reports using SQL Server Reporting Services (SSRS) for advanced business insights.
Optimized SQL queries and enhanced SSIS package performance to improve report execution times and database consistency.
Applied Agile methodology using the Scrum framework, ensuring efficient project management and development through iterative sprints and continuous improvements.
Environment: SQL Server, T-SQL, PL/SQL, Informatica PowerCenter, SSIS, SSRS, OLTP, OLAP, Star Schema, Snowflake Schema, Data Warehousing, Data Mining, Data Governance, ETL, Stored Procedures, Functions, Indexing, Common Table Expressions (CTE), Views, Temporary Tables, Oracle, Flat Files, CSV, XML, Database Triggers, Performance Tuning, Drill-through Reports, Drill-down Reports, Sub-reports, Linked Reports, Agile, Scrum, TIBCO, Matillion, ETL Migration.
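A minimal sketch of the CTE-plus-MERGE incremental load pattern referenced above, submitted from Python via pyodbc purely to keep all examples in one language; the connection string, staging table, and fact table names are hypothetical.

import pyodbc

# Hypothetical SQL Server connection; adjust driver, server, and database as needed.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=sqlhost;DATABASE=dw;Trusted_Connection=yes"
)

upsert_sql = """
WITH latest AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY load_ts DESC) AS rn
    FROM stg.orders
)
MERGE dbo.fact_orders AS tgt
USING (SELECT * FROM latest WHERE rn = 1) AS src
   ON tgt.order_id = src.order_id
WHEN MATCHED THEN UPDATE SET tgt.amount = src.amount, tgt.updated_at = src.load_ts
WHEN NOT MATCHED THEN INSERT (order_id, amount, updated_at)
     VALUES (src.order_id, src.amount, src.load_ts);
"""

cursor = conn.cursor()
cursor.execute(upsert_sql)   # deduplicate the staging feed and upsert into the fact table
conn.commit()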
Client: American Airlines, Fort Worth, TX Jan 2016 - Dec 2016
Role: Hadoop Developer
Responsibilities:
Developed and maintained data ingestion pipelines using Apache Flume, Apache Kafka, and Apache Sqoop to import structured and unstructured data from relational and NoSQL databases into HDFS.
Designed and optimized ETL workflows using Apache NiFi and Apache Airflow to automate data movement across distributed systems.
Managed large-scale datasets in HDFS, optimizing replication, compression (Snappy, LZO, Gzip), and fault tolerance for efficient data retrieval.
Developed MapReduce programs and Spark (PySpark/Scala) applications to process large datasets, leveraging Data Frames, RDDs, and partitioning for performance optimization.
Built and optimized Hive queries using bucketing, partitioning, and indexing to reduce query latency and improve analytical performance.
Integrated Apache Tez to enhance Hive query execution and performance.
Developed real-time streaming solutions using Apache Kafka, Apache Flink, and Spark Streaming for processing IoT, log, and social media data (a Structured Streaming sketch follows this role).
Designed and managed workflow automation using Apache Oozie and Apache Airflow for scheduling ETL jobs and monitoring dependencies.
Integrated Hadoop with HBase and Cassandra, designing row-key and column-family structures for low-latency data retrieval.
Implemented data security best practices using Apache Ranger for access control and Kerberos authentication, ensuring compliance with governance policies.
Monitored Hadoop cluster performance using YARN, Ambari, and Cloudera Manager, tuning Spark jobs by optimizing memory allocation and shuffle partitions.
Worked in Agile Scrum environments, collaborating with data engineers, analysts, and business teams to define data requirements and optimize processing workflows.
Environment: Hadoop, HDFS, Apache Spark, PySpark, Scala, Hive, Impala, Apache Flink, Apache Kafka, Apache Sqoop, Apache Flume, Apache NiFi, MapReduce, HBase, Cassandra, MongoDB, Apache Oozie, Apache Airflow, Apache Ranger, Apache Atlas, Apache Tez, YARN, Ambari, Cloudera Manager, Kerberos, Git, Jenkins, Prometheus, SQL, Python, TIBCO, Matillion, ETL Migration.
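A minimal Spark Structured Streaming sketch of the Kafka ingestion referenced above; the broker address, topic, schema, and output paths are hypothetical, and the spark-sql-kafka connector is assumed to be on the classpath.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kafka_stream_sketch").getOrCreate()

# Hypothetical IoT event payload schema.
schema = StructType([
    StructField("device_id", StringType()),
    StructField("metric", DoubleType()),
    StructField("event_ts", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
       .option("subscribe", "iot_events")                  # hypothetical topic
       .load())

parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("e"))
          .select("e.*"))

query = (parsed.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/iot_events")              # hypothetical sink path
         .option("checkpointLocation", "hdfs:///chk/iot_events")  # required for fault tolerance
         .outputMode("append")
         .start())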
Client: Morgan Stanley, Chicago, IL Jun 2014 – Dec 2015
Role: Data Warehouse Developer
Responsibilities:
Created SQL Server Agent jobs, alerts, and scheduled DTS/SSIS packages to automate recurring data processing workflows.
Managed and updated Erwin models for logical and physical data modeling of Consolidated Data Store (CDS), Actuarial Data Mart (ADM), and Reference DB to align with evolving user requirements.
Utilized TFS for version control and tracking environment-specific script deployments, ensuring consistency across database environments.
Exported current data models from Erwin to PDF format and published them on SharePoint for seamless user access and collaboration.
Developed, administered, and maintained databases, including Consolidated Data Store, Reference Database, and Actuarial Data Mart, to support enterprise data needs.
Wrote and optimized triggers, stored procedures, and functions using Transact-SQL (T-SQL), ensuring efficient database operations and query performance.
Deployed scripts across multiple environments following Configuration Management and Playbook guidelines for controlled database updates.
Created and managed files and filegroups, optimizing table and index structures to enhance query performance and database efficiency.
Tracked, analyzed, and resolved database-related defects using Quality Center for effective issue management and resolution.
Maintained and administered database security by managing users, roles, and permissions within SQL Server to enforce access control policies.
Environment: SQL Server 2008/2012 Enterprise Edition, SSRS, SSIS, T-SQL, Windows Server 2003, PerformancePoint Server 2007, Oracle 10g, Visual Studio 2010.