Name: TEJA SREE
Phone No: 469-***-****
*****************@*****.***
Professional Summary
Data Engineer with 10 years of hands-on experience deploying robust data solutions across diverse industries including Banking, Retail, and Healthcare. Proficient in utilizing cloud platforms, particularly Microsoft Azure and AWS, to design and implement scalable architectures. Expertise includes end-to-end development of data pipelines using advanced tools like Apache Spark, Azure Data Factory, and Azure Synapse Analytics.
Skilled in designing scalable architectures using Azure services including HDInsight, Data Factory, Data Lake, Databricks, Machine Learning Studio, Event Hub, Logic Apps, API Management (APIM), DevOps, Stream Analytics, Cosmos DB, Synapse Analytics, IoT Hub and Terraform for comprehensive cloud-based solutions.
Experienced with AWS services such as S3, Redshift, and Lambda alongside Azure for comprehensive cloud-based solutions. Developed end-to-end data pipelines using Python, PySpark, SQL, and shell scripting across various platforms, ensuring efficient data processing workflows.
Expertise in optimizing Snowflake architecture, utilizing Snowflake SQL, Snowpipe, and Snowflake streams for efficient data management and scalable analytics.
Automated SSIS package execution using Azure Automation and Data Factory, streamlining ETL workflows and improving operational efficiency.
Proficient in real-time data processing with Apache Kafka, Spark Streaming, and Storm on Azure HDInsight for handling large-scale data streams in real-time applications.
Hands-on experience in developing and implementing Big Data Management Platforms utilizing Hadoop, HDFS, MapReduce, YARN, Spark, Hive, Pig, Oozie, NiFi, Airflow, Talend, and Sqoop on Azure HDInsight.
Comprehensive understanding of Hadoop architecture, including management of the NameNode, DataNode, ResourceManager, NodeManager, and Job History Server for Azure-based clusters.
Expertise with Hadoop distributions such as Hortonworks Data Platform (HDP) and Cloudera Distribution Including Apache Hadoop (CDH) for managing and analyzing extensive datasets on Azure.
Designed and managed Hadoop clusters on Azure HDInsight, leveraging Hive partitioning and tools like Apache Hue, Apache Zeppelin, and Apache Pig for analyzing extensive transactional data.
Experienced in migrating ETL tasks from Teradata to Snowflake, streamlining data warehouse operations for enhanced analytics on Azure Synapse Analytics.
Utilized Terraform to automate the deployment and scaling of Azure resources, including Azure VMs, Data Lake, and Event Hub, ensuring cost-effective and efficient cloud infrastructure.
Developed data quality audit systems using Azure Logic Apps, Azure Functions, Data Factory, and Python to monitor table health and ensure data integrity across datasets.
Streamlined weekly jobs on Azure Databricks clusters by fine-tuning Apache Spark configurations, reducing runtime by 30%.
Proficient in implementing cross-account data sharing using Azure Data Lake and SFTP connections, ensuring secure and compliant data exchanges between Azure environments.
Hands-on experience with NoSQL databases including Apache HBase, Apache Cassandra, MongoDB, and Azure Cosmos DB for scalable storage solutions.
Implemented real-time data processing using Apache Hudi on Azure VMs and AWS EC2, managing healthcare data streams with Azure Event Hubs and AWS Kinesis. Configured Apache Hudi to ingest data from various healthcare systems into Azure Data Lake and AWS S3, ensuring data integrity and enabling efficient querying and analytics.
Experience in coding MapReduce/YARN programs using Java, Scala, and Python, and building data pipelines with Big Data technologies on Azure HDInsight.
Proficient in data visualization tools like Tableau, Microsoft Power BI, and Azure Synapse Studio for creating insightful reports and dashboards.
Technical Skills
Microsoft Azure: HDInsight, Data Factory, Data Lake, Databricks, Machine Learning Studio, Event Hub, Logic Apps, API Management (APIM), DevOps, Azure Stream Analytics, Cosmos DB, Synapse Analytics, IoT Hub
Amazon Web Services (AWS): S3, Glue, Kinesis, MSK, Redshift, Lambda Functions, EMR, Athena, Databricks, SageMaker Data Science Studio, CloudWatch, DynamoDB, RDS, IAM Policies, CloudFormation Templates, Jupyter Notebooks
Big Data Technologies: Apache Hadoop (HDFS, MapReduce, YARN, Hive, Pig), Apache Spark (Spark Core, Spark SQL, Spark Streaming), Hudi, CDC, Oozie, NiFi, Airflow, and Sqoop
NoSQL Databases: Apache HBase, Apache Cassandra, MongoDB, Azure Cosmos DB, Amazon DynamoDB
Programming Languages: Python, Scala, Java
Data Visualization: Tableau, Microsoft Power BI, Looker
DevOps Tools: Jenkins, Terraform, Docker and Kubernetes
RDBMS: Microsoft SQL Server, Oracle, MySQL, and PostgreSQL
Network Protocols: TCP/IP, HTTP, FTP, and SOAP
Web Services: REST (JAX-RS), SOAP (JAX-WS)
Version Control Tools: Git, GitHub, Bitbucket
Other Tools: Apache Kafka, Apache Flume, Snowflake, Control-M, Informatica, Talend, SQL Server Integration Services (SSIS), InfluxDB, Grafana and Qlik
Professional Experience
Client: BOFA, Plano, Texas Jan 2022 – Present
Role: Senior Data Engineer
Responsibilities:
Created and managed data extraction pipelines in Azure Data Factory (ADF) to collect and integrate data from diverse sources such as transactional databases, APIs, and streaming platforms, ensuring compliance with banking regulations and data governance policies.
Implemented data ingestion processes using Azure services, including Azure Data Lake Store (ADLS), Azure SQL Database, Blob Storage, and Azure Synapse Analytics, supporting real-time analytics and reporting for banking operations, improving data access speed by 40%.
Integrated Snowflake as a Central Data Warehouse, establishing it as the primary storage solution for diverse data sources, enabling centralized data management and enhanced reporting capabilities.
Developed data ingestion pipelines to Snowflake using ADF to facilitate seamless data loading from Azure Blob Storage and ADLS, ensuring timely access to critical banking data for analytics.
Designed and developed ETL (Extract, Transform, Load) processes using Azure Data Factory to cleanse, validate, and transform collected banking data into standardized formats for regulatory reporting and analytical purposes, reducing data processing time by 30%.
Leveraged Azure HDInsight and Azure Data Lake Store to process structured and unstructured data from transaction logs, customer interactions, and external market feeds, enhancing the bank’s ability to derive actionable insights for risk management.
Provisioned Spark clusters on Azure Databricks, utilizing cross-account roles to access ADLS for efficient processing of large datasets, facilitating data-driven decision-making in risk assessment and customer segmentation, leading to a 25% improvement in predictive accuracy.
Built on-demand data warehouses using Azure Synapse and Snowflake to process high volumes of financial data, providing datasets for data scientists to conduct predictive analytics for credit scoring and fraud detection.
Programmed in Hive, Spark SQL, and Python within Azure Databricks to streamline data processing and build robust data pipelines that generate insights for improving customer services and optimizing product offerings, resulting in a 20% increase in customer satisfaction scores.
Orchestrated data pipelines using Azure Data Factory, managing the flow of data and scheduling regular processing tasks to ensure timely updates for risk management and regulatory compliance reporting, aligning with GDPR and PCI-DSS standards.
Stored raw banking data in optimized formats such as ORC and Parquet within Azure HDInsight, Azure Blobs, and Snowflake to facilitate efficient retrieval and processing for analytical purposes.
Imported data from various sources into HDFS using Sqoop, creating internal and external tables in Hive for data organization and analysis, supporting various banking applications including loan processing and customer analytics.
Utilized Snowflake’s capabilities for advanced analytics, enabling data scientists to execute complex queries and conduct detailed analysis directly on data stored in Snowflake.
Developed shell scripts and automated workflows for incremental loads, Sqoop, Hive, and Spark jobs using tools like Oozie and crontab schedulers to improve operational efficiency in data handling, resulting in a 15% reduction in manual processing time.
Optimized Hive performance through techniques like partitioning, bucketing, and implementing Hive SerDes for efficient data retrieval and processing, ensuring high availability of data for critical banking applications.
Developed Spark applications using Python (PySpark) for seamless data processing and analytics, leveraging Spark APIs to handle large-scale data in the Azure environment, ultimately improving reporting speed and accuracy by 35% (a representative PySpark sketch follows this role's Environment line).
Integrated Tableau dashboards with live data sources such as Snowflake and Azure SQL Database for real-time analytics and reporting.
Implemented Snowflake’s real-time data ingestion with Snowpipe, allowing for continuous loading of data into Snowflake and enhancing the bank’s operational reporting capabilities (see the Snowpipe sketch below).
Ensured data security and compliance in Snowflake by leveraging its security features, including role-based access control and encryption, aligning with banking regulations such as GDPR and PCI-DSS.
Environment: Microsoft Azure (Data Factory, Data Lake, Data Lake Analytics, Storage, Synapse, Blob Storage, SQL, Databricks, HDInsight), Snowflake, Spark, Spark SQL, ORC, Python, SQL, PySpark, Scala, Tableau, Parquet, HDFS, ETL, Hive.
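The following is a minimal, illustrative PySpark sketch of the Databricks-on-ADLS pattern described above (cross-account service-principal access plus a simple curation job); the storage account, secret scope, containers, and column names are hypothetical placeholders, not actual BofA resources.
```python
# Minimal PySpark sketch for Azure Databricks. Assumes a service principal has been granted
# access to the ADLS Gen2 account; the storage account, secret scope, containers, and column
# names are placeholders. `dbutils` is provided by the Databricks runtime.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("transactions-daily-load").getOrCreate()

storage_account = "examplebankadls"  # placeholder storage account name

# OAuth configuration so the cluster can read/write ADLS Gen2 via the service principal
spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net",
    dbutils.secrets.get(scope="bank-secrets", key="sp-client-id"),      # placeholder scope/key
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net",
    dbutils.secrets.get(scope="bank-secrets", key="sp-client-secret"),  # placeholder scope/key
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
    "https://login.microsoftonline.com/<tenant-id>/oauth2/token",       # placeholder tenant
)

raw_path = f"abfss://raw@{storage_account}.dfs.core.windows.net/transactions/"
curated_path = f"abfss://curated@{storage_account}.dfs.core.windows.net/transactions_daily/"

# Read raw transaction data, apply basic cleansing, and aggregate to daily totals
transactions = (
    spark.read.parquet(raw_path)
    .filter(F.col("amount").isNotNull())
    .withColumn("txn_date", F.to_date("txn_timestamp"))
)
daily_totals = (
    transactions.groupBy("txn_date", "account_id")
    .agg(F.sum("amount").alias("daily_amount"))
)
daily_totals.write.mode("overwrite").partitionBy("txn_date").parquet(curated_path)
```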
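A hedged sketch of the Snowpipe continuous-ingestion setup, driven from Python via the snowflake-connector-python package; the account, credentials, stage, notification integration, and table names are illustrative assumptions, and the storage/notification integrations are assumed to exist already.
```python
# Sketch of the Snowpipe setup executed through snowflake-connector-python.
# All object names and credentials below are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="example_account",   # placeholder
    user="etl_user",             # placeholder
    password="********",
    warehouse="LOAD_WH",
    database="BANK_DB",
    schema="STAGING",
)

ddl_statements = [
    # External stage pointing at the Blob container where ADF lands Parquet files
    """
    CREATE STAGE IF NOT EXISTS txn_stage
      URL = 'azure://examplebankstorage.blob.core.windows.net/transactions'
      STORAGE_INTEGRATION = azure_storage_int
      FILE_FORMAT = (TYPE = PARQUET)
    """,
    # Snowpipe with AUTO_INGEST so Event Grid notifications trigger continuous loads
    """
    CREATE PIPE IF NOT EXISTS txn_pipe
      AUTO_INGEST = TRUE
      INTEGRATION = 'TXN_NOTIFICATION_INT'
    AS
      COPY INTO STAGING.TRANSACTIONS
      FROM @txn_stage
      MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
    """,
]

cur = conn.cursor()
try:
    for stmt in ddl_statements:
        cur.execute(stmt)
finally:
    cur.close()
    conn.close()
```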
Client: Cisco Systems, Dallas, Texas Feb 2021 – Dec 2021
Role: Big Data Engineer
Responsibilities:
Utilized Azure Data Factory to build complex ETL workflows, enabling multi-source data integration, data cleansing, and transformation within scalable architectures tailored for retail analytics.
Created and managed intricate ETL workflows to extract data from various sources (databases, files, APIs), transform it according to business rules, and load it into target systems using Azure Data Factory integrated with Azure Synapse Analytics, Azure SQL Database, and Snowflake for centralized data warehousing.
Designed and implemented scalable data solutions on Azure, leveraging services such as Azure Blob Storage, Azure Data Lake Storage, Azure Data Factory, Azure Synapse Analytics, and Snowflake to address the complex data processing needs of the retail industry, improving inventory management and sales forecasting by 25%.
Integrated Azure features into data solutions, utilizing Azure Data Lake, Azure SQL Database, and Snowflake for enhanced analytics capabilities, allowing for more efficient processing of large datasets and improved reporting functionalities.
Orchestrated end-to-end data ingestion and processing workflows using Azure Data Factory and Azure Event Hubs, ensuring seamless integration of retail data streams from diverse sources while adhering to industry compliance standards.
Implemented Change Data Capture (CDC) techniques using Azure SQL Database and Azure Databricks, along with Snowflake for real-time synchronization of data across Azure environments, enhancing data accuracy and timeliness critical for dynamic retail operations (illustrated in the merge sketch at the end of this role).
Developed error handling and recovery mechanisms using Azure Functions and Azure Logic Apps to manage CDC process failures, ensuring minimal data loss and allowing for quick recovery of critical retail metrics and operational insights.
Utilized Azure Functions for serverless compute solutions, automating data processing tasks to optimize resource utilization and meet stringent performance requirements of retail data analytics.
Managed and orchestrated data workflows using Azure Databricks and Snowflake, ensuring reliability, scalability, and fault tolerance in processing large retail datasets.
Integrated Hadoop ecosystem tools such as HDFS, Hive, and Spark on Azure HDInsight to efficiently manage and process datasets, supporting retail analytics including customer behavior analysis and sales trends.
Executed data migration strategies from on-premises systems to Azure, ensuring seamless transitions while leveraging Snowflake’s advanced features for improved data processing and analytics capabilities.
Ensured the protection of sensitive data, including Personally Identifiable Information (PII), by implementing encryption protocols and stringent access controls using Azure Key Vault, Azure AD, Snowflake's security features, and Azure Security Center to maintain data confidentiality in compliance with Federal Data Privacy Laws.
Developed automated processes to detect and mask PII in data pipelines using Azure Data Factory, Azure Purview, and Snowflake, enhancing data privacy and regulatory compliance within retail applications (see the masking sketch at the end of this role).
Implemented Python and shell scripts for data processing and automation tasks within Azure Databricks and Snowflake, incorporating industry-specific business logic and regulatory compliance rules to ensure data integrity.
Collaborated closely with retail analysts and stakeholders to analyze requirements and design data solutions addressing unique challenges, such as inventory optimization and customer segmentation, resulting in a 30% improvement in inventory turnover rates.
Optimized Tableau performance by tuning queries, implementing data extracts, and leveraging Tableau Server for efficient data sharing.
Created interactive dashboards and visualizations using Power BI and Azure Synapse Analytics, alongside data from Snowflake, to provide actionable insights into retail data, supporting decision-making processes related to sales and marketing strategies, leading to a 20% increase in sales performance.
Documented technical specifications, architectural designs, and deployment procedures for retail-specific data solutions, ensuring alignment with regulatory requirements and industry best practices for data governance on Azure.
Actively contributed to Azure community forums and knowledge-sharing platforms, sharing insights and best practices with the broader cloud data engineering community, enhancing collaborative learning within the retail data landscape.
Environment: Microsoft Azure (Data Factory, SQL Database, Blob Storage, Data Lake Storage, Databricks, Event Hubs, HDInsight, Batch, Active Directory), Snowflake, IBM Cloud Pak, Python, Shell scripting, Tableau, HDFS, Hive, Spark, Oracle.
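A minimal sketch of the CDC upsert pattern on Azure Databricks using Delta Lake's MERGE; the change-feed path, target table, key column, and operation flag are assumed placeholders rather than the actual retail objects.
```python
# Sketch of the CDC upsert pattern on Azure Databricks using Delta Lake's MERGE.
# Paths, table names, and the CDC operation flag are placeholders.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("retail-cdc-merge").getOrCreate()

# Incremental change records landed by the upstream CDC extract (placeholder path/columns)
changes = (
    spark.read.parquet("abfss://cdc@exampleretailadls.dfs.core.windows.net/orders_changes/")
    .withColumn("_is_delete", F.col("operation") == "D")
)

target = DeltaTable.forName(spark, "retail.orders")  # assumed existing Delta table

(
    target.alias("t")
    .merge(changes.alias("s"), "t.order_id = s.order_id")
    .whenMatchedDelete(condition="s._is_delete = true")
    .whenMatchedUpdateAll(condition="s._is_delete = false")
    .whenNotMatchedInsertAll(condition="s._is_delete = false")
    .execute()
)
```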
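A minimal sketch of the PII-masking step as a PySpark job; the column list, paths, and salt are illustrative assumptions, and in practice the salt would come from Azure Key Vault and the PII column list from Azure Purview classifications rather than being hard-coded.
```python
# Sketch of a PySpark masking step. Column names, paths, and the salt are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pii-masking").getOrCreate()

PII_COLUMNS = ["customer_email", "phone_number", "loyalty_card_no"]  # placeholder columns


def mask_pii(df, pii_columns, salt):
    """Replace PII columns with salted SHA-256 digests so masked keys remain joinable."""
    for col_name in pii_columns:
        df = df.withColumn(
            col_name, F.sha2(F.concat_ws("|", F.lit(salt), F.col(col_name)), 256)
        )
    return df


customers = spark.read.parquet(
    "abfss://raw@exampleretailadls.dfs.core.windows.net/customers/"   # placeholder path
)
masked = mask_pii(customers, PII_COLUMNS, salt="example-salt")        # salt from Key Vault in practice
masked.write.mode("overwrite").parquet(
    "abfss://curated@exampleretailadls.dfs.core.windows.net/customers_masked/"
)
```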
Client: City of New Orleans, New Orleans, Louisiana Sep 2018 - Jan 2021
Role: Data Engineer
Responsibilities:
Designed and built scalable, secure data solutions using AWS managed services to support data ingestion, transformation and reporting tailored for healthcare data, including medical records, claims and operational data, resulting in a 30% reduction in data processing time.
Extracted, transformed, and loaded (ETL) large volumes of data from diverse healthcare systems into AWS data storage services using AWS Glue and Amazon S3, and loaded the processed data into Amazon Redshift as the primary data warehouse for advanced analytics and regulatory reporting, processing over 5 million patient records monthly.
Developed data pipelines in AWS Glue, leveraging jobs, crawlers and data catalog features to automate and optimize ETL workflows that load data directly into Amazon Redshift, significantly improving operational efficiency and patient data management (a sample Glue job sketch appears after this role's Environment line).
Orchestrated data workflows across multiple AWS services, including Amazon S3, Amazon RDS, Amazon EMR and Amazon Redshift, ensuring seamless integration and real-time access to patient records and clinical trial data, which improved data availability for healthcare professionals by 40%.
Utilized AWS EMR with PySpark and Spark SQL for large-scale data processing, enabling high-performance analysis on healthcare claims and treatment outcomes, and facilitating data movement to Amazon Redshift for comprehensive reporting.
Implemented data security and governance policies in compliance with healthcare regulations (e.g., HIPAA, GDPR) to ensure patient data privacy and compliance. Established robust access controls and encryption mechanisms using AWS IAM and AWS Key Management Service (KMS) to safeguard sensitive data.
Developed Spark applications to process healthcare data in various formats (e.g., ORC, Parquet) and stored it in Amazon S3, facilitating fast retrieval and aggregation of health-related metrics for analytics.
Used AWS Glue and AWS Database Migration Service (DMS) for incremental data loading from electronic health record (EHR) systems, ensuring timely updates to health data stored in Amazon Redshift.
Optimized performance of AWS EMR clusters for real-time processing of large healthcare datasets, ensuring efficient memory utilization and resource management, which decreased operational costs by 25%.
Automated data processing and transformation workflows using AWS Glue, SQL scripts and AWS CloudFormation templates for infrastructure deployment, reducing manual intervention and enhancing operational efficiency in health data reporting.
Implemented advanced data governance practices for healthcare data, ensuring compliance with HIPAA and GDPR, applying encryption and access control to sensitive health information in Amazon Redshift, thus maintaining data integrity and confidentiality.
Worked with healthcare data formats including HL7, FHIR, JSON, and CSV for processing and ingestion into Amazon RDS, Amazon S3 and Amazon Redshift, supporting various healthcare applications and analytics use cases.
Used Apache Airflow on Amazon MWAA for scheduling and managing data pipeline tasks related to patient records, lab results and treatment plans, ensuring consistent data availability and operational continuity (see the DAG sketch below).
Developed user-defined functions (UDFs) in PySpark for specific healthcare business logic, enhancing data processing capabilities for reporting on patient outcomes and operational metrics, contributing to improved care delivery (see the UDF sketch below).
Worked extensively with AWS CodePipeline and AWS CodeCommit for continuous integration and deployment (CI/CD) of data solutions, ensuring smooth releases of healthcare data applications into production environments while maintaining high standards of code quality and documentation.
Leveraged Amazon Redshift as a centralized data warehouse for healthcare analytics, enabling healthcare professionals to run complex queries and perform ad-hoc analysis on large datasets, leading to actionable insights into patient outcomes, operational metrics, and regulatory compliance.
Utilized Redshift's features such as concurrency scaling and data sharing to enhance collaborative analytics across departments, ensuring timely and accurate reporting that supports clinical decision-making and operational efficiency.
Environment: AWS (Amazon S3, AWS Glue, Amazon EMR, Amazon Redshift, AWS Lambda, AWS RDS, AWS Key Management Service, AWS IAM, Amazon MWAA), SQL Server Integration Services (SSIS), Shell scripting, Python, SQL, PySpark, Spark SQL, Scala, ORC, Parquet, Avro formats, Apache Airflow, Power BI.
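An illustrative AWS Glue PySpark job of the kind described above, reading a cataloged claims table and loading it into Redshift; the database, table, connection, and bucket names are placeholders, not the city's actual resources.
```python
# Sketch of a Glue PySpark job: catalog source -> light mapping -> Redshift sink.
# All names and paths are placeholders.
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Source: crawled claims data registered in the Glue Data Catalog (placeholder names)
claims = glue_context.create_dynamic_frame.from_catalog(
    database="healthcare_raw", table_name="claims"
)

# Light transformation: standardize column names and types before warehousing
mapped = ApplyMapping.apply(
    frame=claims,
    mappings=[
        ("claim_id", "string", "claim_id", "string"),
        ("patient_id", "string", "patient_id", "string"),
        ("claim_amount", "double", "claim_amount", "double"),
        ("service_date", "string", "service_date", "date"),
    ],
)

# Sink: Redshift via a pre-configured Glue connection (placeholder connection/bucket)
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=mapped,
    catalog_connection="redshift-connection",
    connection_options={"dbtable": "analytics.claims", "database": "healthdw"},
    redshift_tmp_dir="s3://example-etl-temp/claims/",
)

job.commit()
```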
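A hedged sketch of an Airflow DAG of the kind scheduled on Amazon MWAA; the operator choices, crawler and job names, and schedule are assumptions used for illustration.
```python
# Sketch of an MWAA-hosted DAG that crawls raw patient records, then runs a Glue load job.
# Job, crawler, and schedule values are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
from airflow.providers.amazon.aws.operators.glue_crawler import GlueCrawlerOperator

default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="patient_records_daily",
    start_date=datetime(2020, 1, 1),
    schedule_interval="0 5 * * *",   # daily at 05:00 UTC (placeholder schedule)
    catchup=False,
    default_args=default_args,
) as dag:

    crawl_raw = GlueCrawlerOperator(
        task_id="crawl_raw_patient_records",
        config={"Name": "raw-patient-records-crawler"},   # placeholder crawler name
    )

    load_claims = GlueJobOperator(
        task_id="load_claims_to_redshift",
        job_name="claims-to-redshift",                    # placeholder Glue job name
    )

    crawl_raw >> load_claims
```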
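A minimal PySpark UDF sketch carrying hypothetical healthcare business logic; the age-band rules, column names, and S3 paths are illustrative only, not the actual reporting rules.
```python
# Sketch of a PySpark UDF used in outcome reporting. Bands, columns, and paths are placeholders.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("patient-outcome-metrics").getOrCreate()


@F.udf(returnType=StringType())
def age_band(age):
    """Bucket patient age into reporting bands (assumed bands for illustration)."""
    if age is None:
        return "unknown"
    if age < 18:
        return "pediatric"
    if age < 65:
        return "adult"
    return "senior"


patients = spark.read.parquet("s3://example-health-lake/curated/patients/")  # placeholder path
report = (
    patients
    .withColumn("age_band", age_band(F.col("age")))
    .groupBy("age_band")
    .agg(F.countDistinct("patient_id").alias("patients"))
)
report.write.mode("overwrite").parquet("s3://example-health-lake/marts/patients_by_age_band/")
```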
Client: Innovative Software Solutions, Bengaluru, India Jul 2016 - Aug 2017
Role: Big Data Developer
Responsibilities:
Developed and maintained Hadoop-based data processing applications, leveraging technologies such as Hadoop MapReduce, HDFS, and Hive.
Designed and implemented data ingestion pipelines to load large volumes of data from diverse sources into Hadoop clusters using tools like Sqoop or Flume.
Implemented custom MapReduce jobs in Java to process and analyze structured and unstructured data, extracting meaningful insights and patterns (a Python Hadoop Streaming sketch of this pattern follows the Environment line below).
Utilized Hive for data querying and analysis, optimizing queries and leveraging Hive partitions and bucketing for improved performance.
Integrated and utilized other data processing frameworks and libraries such as Apache Spark, Pig, or Cascading to enhance data processing capabilities.
Worked with diverse data formats such as Avro, Parquet, or ORC, optimizing data storage and retrieval efficiency in Hadoop clusters.
Collaborated with Data Architects and Data Scientists to understand data requirements and design data models and schemas for efficient data processing.
Developed and maintained Oozie workflows to orchestrate and schedule Hadoop jobs, ensuring reliable and timely execution of data processing tasks.
Implemented data security measures in Hadoop clusters, including authentication, authorization, and data encryption, to ensure data privacy and compliance.
Optimized Hadoop cluster performance by tuning various parameters such as heap size, block size, and replication factor, based on workload characteristics.
Implemented data lineage and metadata management solutions using tools like Apache Atlas or custom-built systems, enabling data traceability and governance.
Developed and maintained monitoring and alerting mechanisms using tools like Nagios, Ganglia, or Ambari, ensuring the health and performance of Hadoop clusters.
Integrated Hadoop clusters with other data storage and processing systems, such as relational databases or cloud storage, for seamless data integration and analysis.
Implemented data backup and disaster recovery solutions for Hadoop clusters, ensuring data availability and business continuity in case of system failures.
Kept up to date with emerging technologies and trends in the big data ecosystem, continuously exploring new tools and frameworks to enhance data processing capabilities.
Environment: Hadoop, MapReduce, HDFS, Hive, Sqoop, Flume, Apache Spark, Pig, Cascading, Avro, Parquet, ORC, Java, Oozie, Apache Atlas, Nagios, Ganglia, Ambari
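The production MapReduce jobs at this client were written in Java; as a language-consistent illustration of the same map/reduce pattern, the following Hadoop Streaming sketch in Python drops malformed records and counts events per category. The field layout, paths, and launch command are assumptions.
```python
# Hadoop Streaming sketch (mapper + reducer in one file). Field positions and paths are
# placeholders. Example launch (illustrative only):
#   hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
#     -input /data/raw/events -output /data/out/event_counts \
#     -mapper "python mapreduce_events.py map" \
#     -reducer "python mapreduce_events.py reduce" \
#     -file mapreduce_events.py
import sys


def mapper():
    # Emit (category, 1) for each well-formed record; skip malformed lines (assumed 3+ columns)
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or not fields[2]:
            continue
        print(f"{fields[2]}\t1")


def reducer():
    # Input arrives sorted by key, so counts can be accumulated per consecutive key group
    current_key, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{count}")
            current_key, count = key, 0
        count += int(value)
    if current_key is not None:
        print(f"{current_key}\t{count}")


if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```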
Client: Bodhtree, Bengaluru, India May 2014 - Jun 2016
Role: Hadoop Developer
Responsibilities:
Configured, supported, and monitored Hadoop clusters using the Cloudera distribution.
Worked in an Agile Scrum development model, analyzing the Hadoop cluster and Big Data analytic tools including MapReduce, Pig, Hive, Flume, Oozie, and Sqoop.
Configured Hadoop MapReduce and HDFS, and developed MapReduce jobs in Java for data cleaning and preprocessing.
Developed custom MapReduce programs to analyze data and used Pig Latin scripts to filter out unwanted data.
Created Hive tables and wrote Hive queries that execute internally as MapReduce jobs.
Implemented partitioning, dynamic partitions, and bucketing in Hive to improve query performance.
Loaded and transformed datasets in a variety of structured and semi-structured formats.
Scheduled jobs to run automatically using the Oozie workflow engine.
Implemented HBase as a NoSQL database for storing and processing data in different formats.
Participated in testing and coordinated with business users during user acceptance testing.
Performed unit testing and delivered unit test plans and results documents.
Environment: Apache Hadoop, MapReduce, HDFS, Hive, Pig, Sqoop, Oozie, HBase, UNIX shell scripting, Zookeeper, Java, Eclipse.
Education
Master’s in Computer Science, UNT, Denton, Texas
Bachelor’s in Electrical and Electronics Engineering, VRSEC, India.