Azure Data Machine Learning

Location:
Dallas, TX
Salary:
70000
Posted:
February 25, 2025

Resume:

WORK EXPERIENCE:

USAA San Antonio, Texas, USA

Azure Data Engineer Aug 2024 – Present

Description: USAA (United Services Automobile Association) is a financial services group that provides a wide range of insurance, banking, investment, and retirement services. The role involves architecting and developing robust data pipelines with Azure Data Factory to ingest, cleanse, transform, and load data from diverse sources into data warehouses and data lakes, and utilizing Azure Databricks for large-scale data processing and machine learning tasks.

Responsibilities:

Design, implement, and optimize Azure Data Factory pipelines for seamless data integration and automation across various data sources and platforms. Build and maintain Azure Synapse Analytics and Apache Kafka solutions for real-time and batch data processing, enabling high-performance analytics and reporting.
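The real-time side of this work can be illustrated with a minimal PySpark Structured Streaming sketch that reads Kafka events into the lake; the broker address, topic name, event schema, and storage paths below are placeholders for illustration, not details from the engagement.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka_ingest_sketch").getOrCreate()

# Hypothetical schema for incoming transaction events.
event_schema = StructType([
    StructField("account_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", StringType()),
])

# Read the Kafka topic as a stream (broker and topic are placeholders).
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "transactions")
    .load()
)

# Parse the JSON payload into typed columns.
events = raw.select(
    from_json(col("value").cast("string"), event_schema).alias("e")
).select("e.*")

# Land the stream in the data lake as Parquet with checkpointing.
query = (
    events.writeStream.format("parquet")
    .option("path", "/mnt/datalake/landing/transactions")
    .option("checkpointLocation", "/mnt/datalake/checkpoints/transactions")
    .start()
)
query.awaitTermination()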

Develop and manage scalable data architectures using Azure Data Lake and Azure Blob Storage for large-scale data storage and secure data management. Write complex SQL queries and perform data transformations and analysis in Azure SQL Database and Azure SQL Data Warehouse to support business intelligence initiatives.

Utilize Azure Databricks and Apache Spark for big data processing and machine learning model integration, optimizing data workflows. Develop and deploy automated data pipelines and jobs through Azure DevOps CI/CD, ensuring smooth deployment of data models and pipelines.

Leverage Power BI and Tableau for data visualization, providing stakeholders with insights from processed data, and optimize data querying for reporting.

Leverage Apache Hive and Impala for data querying and analytics on large-scale data stored in Hadoop and Azure Data Lake. Use Snowflake for cloud data warehousing, implementing optimized schemas and handling complex data modeling to support analytical workloads.
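A minimal sketch of the Snowflake warehousing pattern, using the Snowflake Python connector; the connection parameters and the clustered fact table are hypothetical examples of the kind of optimized schema described above.

import snowflake.connector

# Connection parameters are placeholders.
conn = snowflake.connector.connect(
    account="example_account",
    user="etl_user",
    password="***",
    warehouse="ANALYTICS_WH",
    database="ANALYTICS",
    schema="MARTS",
)

cur = conn.cursor()
try:
    # Example of a clustered fact table supporting analytical workloads.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS fact_claims (
            claim_id      STRING,
            policy_key    NUMBER,
            date_key      NUMBER,
            claim_amount  NUMBER(12, 2)
        )
        CLUSTER BY (date_key)
    """)
finally:
    cur.close()
    conn.close()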

Automate data flow management and orchestration with Apache Airflow to ensure timely and accurate data delivery across the data ecosystem. Design and implement data quality checks and validation rules using PySpark and Azure Databricks for improved data accuracy.
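As a sketch of the data quality checks mentioned above, the following PySpark snippet applies simple null and uniqueness rules; the dataset path and column names are assumptions for illustration.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq_checks_sketch").getOrCreate()

# Placeholder path and key columns.
df = spark.read.parquet("/mnt/datalake/curated/policies")
required_columns = ["policy_id", "customer_id"]

# Rule 1: required columns must not contain nulls.
null_counts = (
    df.select([F.sum(F.col(c).isNull().cast("int")).alias(c) for c in required_columns])
    .collect()[0]
    .asDict()
)
null_failures = {c: n for c, n in null_counts.items() if n and n > 0}

# Rule 2: the business key must be unique.
duplicate_keys = df.groupBy("policy_id").count().filter(F.col("count") > 1).count()

if null_failures or duplicate_keys:
    raise ValueError(
        f"Data quality check failed: nulls={null_failures}, duplicate_keys={duplicate_keys}"
    )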

Collaborate with data scientists and analysts to integrate machine learning models and AI capabilities into Azure Machine Learning and Databricks environments for predictive analytics. Utilize Git for version control and Maven for build management to streamline the development and deployment of data engineering projects.

Design and implement scalable ETL pipelines using tools like Azure Data Factory, Apache NiFi, and Talend, ensuring efficient extraction, transformation, and loading of large volumes of data from multiple sources into centralized data repositories.

Utilize Sqoop for efficient data transfer between Hadoop and relational databases, implement Impala for high-performance SQL querying on large datasets, and manage real-time data ingestion and monitoring using Flume and Zookeeper to ensure seamless data flow and availability in distributed systems. Leverage Python for data manipulation, automation of ETL processes, and integration with APIs, enabling efficient data pipelines.
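The Python-based API integration could look roughly like the following sketch; the endpoint URL, key column, and output target are hypothetical.

import requests
import pandas as pd

# Placeholder endpoint; real sources and authentication are not specified here.
API_URL = "https://example.com/api/v1/rates"


def fetch_rates() -> pd.DataFrame:
    """Pull JSON records from a REST API and normalize them into a DataFrame."""
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()
    return pd.json_normalize(response.json())


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Light cleanup before loading to the lake or warehouse."""
    df = df.dropna(subset=["rate_id"])          # hypothetical key column
    df["load_ts"] = pd.Timestamp.now(tz="UTC")  # audit timestamp
    return df


if __name__ == "__main__":
    frame = transform(fetch_rates())
    frame.to_parquet("rates.parquet", index=False)  # placeholder target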

Expertise in working with SQL, MySQL, and PostgreSQL for relational database management, as well as Scala for big data processing, and MongoDB and Cassandra for handling NoSQL databases, ensuring efficient data storage, querying, and real-time analytics across diverse data environments.

ENVIRONMENT: Azure Data Factory, Azure Synapse Analytics, SQL, Azure Data Lake, Azure Blob Storage, Azure SQL Data Warehouse, Azure Databricks, Apache Spark, Azure DevOps, CI/CD, Power BI, Tableau, Azure Stream Analytics, Apache Kafka, Apache Hive, Impala, Hadoop, Snowflake, Apache Airflow, PySpark, Azure Machine Learning, Git, Maven, Apache NiFi, Talend, Sqoop, Flume, PostgreSQL, MongoDB, Cassandra.

Hallmark Financial Services, Inc. Fort Worth, Texas, USA

AWS Data Engineer Dec 2023 – July 2024

Description: Hallmark Financial Services, Inc. is a diversified specialty property and casualty insurance company that provides a broad range of insurance products, including commercial lines, personal lines, and specialty insurance coverage. The role involved designing and building data pipelines to move data from multiple sources to data warehouses and applying machine learning algorithms to make predictions.

Responsibilities:

Design, implement, and maintain scalable data pipelines using AWS Glue, AWS Lambda, and Amazon S3 for efficient data extraction, transformation, and loading (ETL) processes from various insurance data sources.
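A hedged sketch of an AWS Glue PySpark job matching this ETL pattern; the catalog database, table, key column, and S3 path are placeholders, not actual project resources.

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a Glue Data Catalog table (database and table names are placeholders).
claims = glue_context.create_dynamic_frame.from_catalog(
    database="insurance_raw", table_name="claims"
)

# Minimal transform: drop records missing the claim key.
cleaned = DynamicFrame.fromDF(
    claims.toDF().dropna(subset=["claim_id"]), glue_context, "cleaned_claims"
)

# Land curated data in S3 as Parquet (bucket and prefix are placeholders).
glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/claims/"},
    format="parquet",
)

job.commit()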

Build and manage real-time data streaming solutions using Amazon Kinesis and AWS Lambda to handle data ingestion and processing for real-time analytics and claims management.
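The Kinesis-triggered processing can be sketched as a small AWS Lambda handler; the event payload shape and routing logic are illustrative assumptions.

import base64
import json


def handler(event, context):
    """Lambda handler triggered by a Kinesis data stream.

    Decodes each record and routes claim events; the payload shape and
    routing logic are illustrative only.
    """
    processed = 0
    for record in event.get("Records", []):
        payload = base64.b64decode(record["kinesis"]["data"])
        claim_event = json.loads(payload)
        if claim_event.get("event_type") == "claim_submitted":
            # The real pipeline would write to a queue, table, or topic here.
            print(f"received claim {claim_event.get('claim_id')}")
        processed += 1
    return {"processed": processed}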

Develop and optimize data models and data warehouses using Amazon Redshift and Snowflake to support business intelligence, reporting, and analytics for underwriting, claims, and customer insights.

Leverage Amazon RDS, Amazon Aurora, Amazon DynamoDB, and Amazon Neptune to manage relational and NoSQL databases for policyholder data, transactions, and analytics.

Automate data workflows and orchestration using AWS Step Functions, AWS Data Pipeline, and Apache Airflow for efficient scheduling and management of data processing tasks.
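A minimal Apache Airflow DAG illustrating the kind of scheduling and orchestration described here; the DAG id, schedule, and task bodies are placeholders.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder task body; the real task would invoke Glue, Lambda, etc.
    print("extracting source data")


def load():
    print("loading to the warehouse")


with DAG(
    dag_id="daily_etl_sketch",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task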

Integrate third-party insurance data sources and external APIs using AWS API Gateway, AWS Lambda, and Amazon AppFlow to provide seamless data flow and system interoperability.

Implement data quality checks, validation, and cleansing processes using AWS Glue, PySpark, and AWS Lambda to ensure the accuracy and consistency of insurance data.

Develop and deploy machine learning models for predictive analytics, fraud detection, and risk assessment using Amazon SageMaker and AWS Lambda for real-time decision-making.
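Real-time scoring against a deployed model can be sketched as a Lambda handler calling a SageMaker endpoint via boto3; the endpoint name, feature layout, and threshold are hypothetical.

import boto3

runtime = boto3.client("sagemaker-runtime")


def handler(event, context):
    """Score a claim against a deployed SageMaker endpoint.

    The endpoint name, feature layout, and flag threshold are placeholders.
    """
    features = event["features"]  # e.g. [age, premium, prior_claims]
    body = ",".join(str(x) for x in features)
    response = runtime.invoke_endpoint(
        EndpointName="fraud-detection-endpoint",
        ContentType="text/csv",
        Body=body,
    )
    score = float(response["Body"].read().decode("utf-8").strip())
    return {"fraud_score": score, "flagged": score > 0.8}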

Design and implement end-to-end ETL solutions using AWS Glue and Apache Spark for transforming raw insurance data into structured formats suitable for analytics, ensuring efficient and automated data processing.

Implement ETL pipelines for data processing and transformation using Maven for build automation, while establishing CI/CD workflows with Jenkins and GitLab CI to streamline the deployment of data engineering solutions, ensuring efficient integration, testing, and delivery of data solutions.

Develop and manage big data solutions using Hadoop ecosystem components like HDFS, MapReduce, Hive, Pig, and HBase to process and analyze large volumes of insurance data, ensuring scalability, fault tolerance, and efficient data storage for policy management, claims processing, and underwriting.

Design and implement efficient ETL processes using AWS Glue, Apache NiFi, and Talend to extract, transform, and load large datasets from various insurance systems.

Utilize Sqoop to transfer data between Hadoop and relational databases, leverage Impala for fast SQL queries on big data stored in HDFS, manage real-time data flows using Flume for event collection, and ensure system coordination and high availability with Zookeeper for efficient data stream processing and monitoring in distributed environments.

ENVIRONMENT: AWS Glue, AWS Lambda, Amazon S3, Amazon Kinesis, Amazon Redshift, Snowflake, Amazon RDS, Amazon Aurora, Amazon DynamoDB, Apache Airflow, PySpark, Amazon SageMaker, Apache Spark, Maven, Jenkins, CI/CD, GitLab CI, Hadoop, HDFS, MapReduce, Hive, Pig, HBase, Apache NiFi, Talend, Impala, Flume, Sqoop.

Giant Eagle Bangalore, India

GCP Data Engineer Jun 2022 - July 2023

Description: Giant Eagle, Inc. is a regional supermarket chain providing a wide range of grocery products, fresh produce, meats, bakery items, household essentials, and pharmacy services. Monitored and analyzed cloud resource usage to optimize cost and efficiency, led cloud migration projects ensuring minimal downtime and data integrity, and developed and optimized BigQuery datasets for efficient data analysis and reporting.

Responsibilities:

Design and implement scalable data pipelines using Google Cloud Dataflow, Apache Beam, and Pub/Sub for real-time data ingestion and processing from various retail systems, ensuring seamless integration of sales, inventory, and customer data. Develop and manage data storage solutions using Google Cloud Storage and BigQuery to store, process, and analyze large volumes of transactional, inventory, and customer data for reporting.
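An illustrative Apache Beam (Python SDK) pipeline for the streaming ingestion described above; the Pub/Sub subscription, event fields, and BigQuery table are placeholder names.

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_event(message: bytes) -> dict:
    """Decode a Pub/Sub message carrying a sales event (field names are illustrative)."""
    event = json.loads(message.decode("utf-8"))
    return {
        "store_id": event.get("store_id"),
        "sku": event.get("sku"),
        "amount": float(event.get("amount", 0)),
    }


options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadSales" >> beam.io.ReadFromPubSub(
            subscription="projects/example-project/subscriptions/sales-events"
        )
        | "Parse" >> beam.Map(parse_event)
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "example-project:retail.sales_events",
            schema="store_id:STRING,sku:STRING,amount:FLOAT",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )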

Utilize Google Cloud Composer for orchestration of data workflows and automation of ETL processes across different data systems to improve efficiency and streamline data operations.

Integrate Apache Kafka with Google Cloud Pub/Sub to enable real-time event streaming and data pipelines for inventory updates, sales transactions, and customer activity, ensuring real-time insights and operational efficiency.

Implement Cloud Dataproc with Apache Spark to process large datasets efficiently for analyzing transaction history, customer segmentation, and product performance across various regions.

Automate and schedule data pipeline execution using Apache Airflow and Cloud Composer to ensure regular updates to data models, reducing manual intervention and ensuring data availability for business users.

Develop and manage efficient ETL workflows using Google Cloud Dataflow, Apache Beam, and Cloud Composer to automate the extraction, transformation, and loading of retail data from multiple sources.

Implement big data solutions using the Hadoop ecosystem, including HDFS for distributed storage, Hive for data warehousing, and MapReduce for large-scale data processing, enabling the efficient handling and analysis of transactional, inventory, and customer data in a retail environment.

Design and manage scalable data warehousing solutions using Snowflake to store and analyze large volumes of retail data, leveraging its capabilities for real-time data sharing and high-performance queries.

Design and optimize data storage solutions using SQL for structured transactional data, MongoDB for flexible, schema-less storage of customer profiles and product catalogs, and Cassandra for handling large-scale, high-velocity data such as real-time inventory updates and customer activity logs.

Leverage Sqoop to efficiently transfer data between relational databases and Hadoop ecosystems, use Impala for fast, low-latency SQL querying on big data stored in HDFS, and integrate Flume to capture and ingest real-time data from retail transaction systems.

ENVIRONMENT: Google Cloud Dataflow, Apache Beam, Pub/Sub, Google Cloud Storage, BigQuery, Google Cloud Composer, Apache Kafka, Apache Spark, Apache Airflow, Hadoop, HDFS, Hive, MapReduce, Snowflake, MongoDB, SQL, Cassandra, Sqoop, Impala, Flume.

CVS Health Bangalore, India

Data Engineer March 2021 - May 2022

Description: CVS Health Corporation is a healthcare company that operates across various segments, including retail pharmacies, health services, and insurance. The role involved designing, building, and maintaining scalable data infrastructure and pipelines to support various business operations, including integrating data from diverse sources, ensuring data quality, and enabling data analysis for insights.

Responsibilities:

Designed and implemented scalable data architectures to manage and optimize the flow of data across various systems within the healthcare environment, using tools such as Apache Spark, Apache Kafka, and custom ETL solutions to ensure seamless data integration and processing of patient, pharmacy, and healthcare data.

Utilized SQL Server, PostgreSQL, and cloud-based solutions such as Amazon Redshift and Google BigQuery to manage and store critical healthcare data, while designing business requirement collection approaches based on project scope and SDLC methodology. Wrote and executed various MySQL database queries from Python using the Python MySQL connector and MySQLdb package, and designed and deployed data pipelines using Data Lake, Databricks, and Apache Airflow.
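The MySQL-from-Python pattern can be sketched with the mysql-connector-python package; the connection details, table, and columns are assumed for illustration.

import mysql.connector

# Connection details are placeholders, not values from the engagement.
conn = mysql.connector.connect(
    host="db-host",
    user="etl_user",
    password="***",
    database="pharmacy",
)

cursor = conn.cursor(dictionary=True)
try:
    cursor.execute(
        "SELECT store_id, COUNT(*) AS fills "
        "FROM prescriptions "
        "WHERE fill_date >= %s "
        "GROUP BY store_id",
        ("2022-01-01",),
    )
    for row in cursor.fetchall():
        print(row["store_id"], row["fills"])
finally:
    cursor.close()
    conn.close()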

Utilized Jira for issue tracking and project management, and Jenkins for continuous integration and continuous deployment to ensure a streamlined and automated process for healthcare data applications.

Implemented Azure Data Lake, Azure Data Factory, and Azure Databricks to move and conform healthcare data from on-premises systems to the cloud, optimizing the processing and analytical capabilities of CVS Health and enabling real-time healthcare insights. Utilized Docker for managing application environments across multiple teams and applications.

Implemented continuous delivery (CI/CD) pipelines using Jenkins and Docker to manage custom healthcare data application images, improving the delivery and deployment of new features for CVS Health’s analytics and reporting platforms. Developed Databricks ETL pipelines using notebooks, Spark DataFrames, Spark SQL, and Python scripting to optimize the transformation of healthcare data for improved reporting and patient outcomes.
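A brief PySpark sketch of the Databricks-style ETL described above, mixing DataFrame reads with a Spark SQL aggregation; the paths and column names are illustrative assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("claims_etl_sketch").getOrCreate()

# Read raw claims data from the lake (path is a placeholder).
claims = spark.read.parquet("/mnt/lake/raw/claims")
claims.createOrReplaceTempView("claims_raw")

# Spark SQL transform of the kind described above; column names are assumed.
daily_summary = spark.sql("""
    SELECT member_id,
           DATE(service_date) AS service_day,
           SUM(billed_amount) AS total_billed
    FROM claims_raw
    GROUP BY member_id, DATE(service_date)
""")

# Write curated output back to the lake for reporting.
(
    daily_summary.write
    .mode("overwrite")
    .parquet("/mnt/lake/curated/claims_daily_summary")
)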

Developed Spark applications using Scala and Spark SQL for data extraction, transformation, and aggregation from multiple healthcare data sources, providing insights into customer usage patterns and driving data-driven decision-making for CVS Health’s healthcare services and patient management systems.

Environment: Apache Spark, Apache Kafka, SQL Server, PostgreSQL, Amazon Redshift, Python, Apache Airflow, Jira, Jenkins, Docker, Kibana, CI/CD, Python scripting, Scala, Azure Data Lake, Azure Data Factory.

TECHNICAL SKILLS:

Cloud Technologies: AWS, Azure, GCP.

AWS Ecosystem: S3, Athena, Glue, EMR, Redshift, Data Lake, Lambda, Kinesis.

Azure Ecosystem: Azure Data Lake, ADF, Databricks, Azure SQL

Google Cloud (GCP): BigQuery, Dataflow, Pub/Sub, Cloud Storage, Cloud Composer, Cloud Functions.

Databases: Oracle, MySQL, SQL Server, PostgreSQL, HBase, Snowflake, Cassandra, MongoDB.

Programming Languages: Java, HTML, Python, Hibernate, JDBC, JSON, CSS.

Scripting Languages: Python, Shell scripting (Bash).

Version Control and Build Tools: Git, CBT, Maven, SBT.

Hadoop Components / Big Data: HDFS, Hue, MapReduce, Pig, Hive, HCatalog, HBase, Sqoop, Impala, Zookeeper, Flume, Kafka, YARN, Cloudera Manager, Kerberos, PySpark, Airflow, Snowflake, Spark components.

Visualization & ETL Tools: Tableau, Power BI, Informatica, Talend.

Operating Systems: Windows, Linux

Methodologies: Agile (Scrum), Waterfall, UML, Design Patterns, SDLC.

Web Servers: Apache Tomcat, WebLogic.

Machine Learning Integration: Amazon SageMaker, Azure ML, GCP Vertex AI.

Workflow Automation: Terraform, CloudFormation, Ansible

DevOps & CI/CD: Jenkins, GitLab CI/CD Pipeline, Docker, Kubernetes

PROFILE SUMMARY:

Experienced Data Engineer with around 4 years of expertise in designing, developing, and managing data architectures and pipelines.

Experienced in using Git for version control and collaboration, alongside Maven for automating project builds.

Strong proficiency in working with AWS, Azure, GCP, and hybrid cloud environments to ensure seamless data integration.

Strong expertise in working with a variety of databases and languages, including SQL, MySQL, PostgreSQL, MongoDB, Cassandra, and Scala, to design, optimize, and manage data models, queries, and scalable data solutions across relational and NoSQL databases.

Skilled in designing and implementing efficient ETL processes using tools like Apache NiFi, Talend, Informatica, and AWS Glue to automate data extraction, transformation, and loading, ensuring seamless data flow and integration across systems and platforms.

Skilled in Python for data manipulation, automation of workflows, and integration with APIs and databases.

Solid experience with Hadoop, Spark, and Hive for big data processing and analytics. Familiar with Tableau and Power BI for data visualization, enabling business intelligence and actionable insights.

Experienced in managing and optimizing CI/CD pipelines using tools like Jenkins, Git, and Docker. Strong background in data warehousing concepts and technologies such as Redshift, Snowflake, and BigQuery.

Expertise in building and maintaining scalable data pipelines using Apache Kafka, Airflow, and AWS Lambda for real-time processing.

Knowledge of machine learning techniques and integration with data pipelines for predictive analytics.

Proficient in leveraging AWS services like S3, Redshift, Lambda, Glue, and EMR to build scalable, cost-effective data pipelines and perform large-scale data processing and analytics in the cloud.

Extensive experience with Azure services such as Azure Data Lake, Azure SQL Database, Azure Data Factory, and Azure Synapse Analytics to design and implement scalable data solutions.

Experienced in utilizing GCP services like BigQuery, Dataflow, Pub/Sub, and Cloud Storage to design and implement scalable data pipelines.

Proficient in using Sqoop, Impala, Zookeeper, Flume, Kafka, PySpark, Airflow, and Snowflake to design, orchestrate, and manage big data pipelines for efficient data ingestion, processing, and real-time analytics across distributed systems and cloud environments.

Saiprakash Bandi

Data Engineer

*************@*****.***

469-***-****

EDUCATION:

Master's degree from Texas A&M University, Texas, USA

Results-driven and detail-oriented Data Engineer with around 4 years of experience in designing, implementing, and optimizing data pipelines and solutions. Proficient in working with big data technologies, cloud platforms (AWS, Azure, GCP), SQL, Python, and ETL processes.


