Vivek Reddy Morthala
Mobile: 910-***-**** E-mail: ******************@*******.*** LinkedIn: MorthalaVivekReddy
PROFESSIONAL SUMMARY:
Data engineer with 8+ years of experience in the IT industry, specializing in Big Data and Data Engineering with a primary focus on the design, optimization, and management of large-scale data pipelines. Proven track record of improving data processing efficiency by ensuring high throughput and low latency while consistently meeting business requirements. Experienced in tackling complex data integration challenges and delivering comprehensive end-to-end data solutions encompassing data governance, seamless integration, and high-performance processing using PySpark, Airflow, Databricks, and Snowflake on cloud platforms such as AWS and Azure.
• Expert in Hadoop and Spark for building scalable, fault-tolerant data processing frameworks. Skilled in developing and optimizing Spark-based ETL jobs for large datasets, including performance tuning and cluster resource management for both real-time and batch processing solutions (see the sketch after this summary).
• Expertise in Databricks for building scalable, efficient data processing pipelines, leveraging its Apache Spark integration to optimize performance and manage large datasets effectively. Skilled in developing end-to-end machine learning models and streamlining workflows through Databricks' unified analytics environment.
• Proficient in utilizing Snowflake's cloud data platform to design and implement scalable, high-performance data architectures.
• Extensive experience with AWS services like EMR, EC2, S3, Lambda, Glue, RDS, Redshift, DynamoDB, SageMaker, Kinesis, CloudWatch, Athena, Aurora, Step Functions, SNS, SQS, CloudFormation, ECS, EKS, and Elastic Beanstalk. Skilled in designing and optimizing scalable, high-performance cloud-based data pipelines.
• Experienced with Azure services like Blob Storage, Data Lake Storage, Data Factory, Synapse Analytics, SQL Database, Databricks, Kubernetes Service, Event Hub, Cosmos DB, and Log Analytics. Skilled in building scalable, secure data pipelines using Terraform and ARM templates, with secure secret management via Key Vault and proactive monitoring using Azure Monitor.
• Deep expertise in SQL databases like Oracle, MySQL, PostgreSQL, DB2 for data modeling, querying, and optimization. Proficient in NoSQL tools like Hive, Cassandra, and MongoDB for handling large-scale, unstructured data and real-time processing.
• Experienced in CI/CD pipelines using Jenkins, Docker, and Kubernetes to automate and scale microservices-based data pipelines. Proficient in version control with Git, GitLab, GitHub, and SVN, ensuring code stability and efficient delivery throughout the SDLC.
• Experienced in building machine learning models with Scikit-Learn and TensorFlow for predictive analysis and anomaly detection, and in integrating ML models into data pipelines.
• Proficient in Apache Airflow, an open-source platform for orchestrating and automating complex workflows and data pipelines, enabling seamless scalability and effective error handling for efficiently managing data-driven processes.
• Managed Hadoop clusters using Cloudera and Hortonworks, configuring and optimizing clusters for large-scale data processing. Oversaw data replication, security, and resource allocation.
• Experienced in ETL tools like Informatica and Talend for designing and implementing data extraction and transformation processes. Designed data models using Erwin to support analytical and operational reporting, ensuring data consistency and integrity.
• Skilled in handling diverse data formats including CSV, Parquet, ORC, Avro, JSON, XML, and TXT. Built data ingestion frameworks to standardize and transform structured and semi-structured data while maintaining data quality and integrity.
• Developed interactive dashboards using Tableau, Power BI, and Looker to enable data-driven decision-making. Created real-time visualizations to track KPIs and business metrics, ensuring seamless integration with data warehouses and streams.
• Demonstrated strong documentation skills using JIRA, ServiceNow, Confluence for collaborative project documentation, SharePoint for secure document management, and Lucidchart for detailed data flow diagrams.
• Highly motivated and proactive team player with strong problem-solving skills and a passion for learning new technologies. Excellent communication and leadership skills with a commitment to delivering high-quality solutions aligned with business objectives.
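As a brief illustration of the Spark-based ETL work described above, here is a minimal PySpark sketch; the S3 paths, column names, and shuffle-partition value are hypothetical placeholders rather than details of any actual engagement.

```python
# Minimal PySpark ETL sketch (illustrative only); paths, columns, and the
# partition setting are hypothetical placeholders, not production values.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("sales_etl_example")
    .config("spark.sql.shuffle.partitions", "200")  # tuned per cluster size
    .getOrCreate()
)

# Extract: read raw CSV files landed in an example S3 bucket
raw = spark.read.option("header", True).csv("s3://example-bucket/raw/sales/")

# Transform: basic cleansing plus a daily revenue aggregate
daily = (
    raw.withColumn("amount", F.col("amount").cast("double"))
       .filter(F.col("amount").isNotNull())
       .groupBy("order_date")
       .agg(F.sum("amount").alias("daily_revenue"))
)

# Load: write columnar output partitioned by date for efficient querying
daily.write.mode("overwrite").partitionBy("order_date").parquet(
    "s3://example-bucket/curated/daily_revenue/"
)
```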
TECHNICAL SKILLS:
Big Data Tools: Apache Spark, Hive, Sqoop, Kafka, Flink, Snowflake
Cloud Technologies: AWS: EMR, EC2, S3, Lambda, Glue, RDS, Redshift, DynamoDB, Athena, Kinesis, CloudWatch, CloudTrail, Amazon SNS, Amazon SQS, AWS Step Functions; Azure: Blob Storage, Azure Data Lake Storage, Azure Data Factory, Synapse Analytics, Azure SQL Server, Databricks, Azure Kubernetes Service, Cosmos DB, Stream Analytics, Azure SQL Data Warehouse, Azure Event Bridge
Languages: Python, PySpark
Hadoop Distributions: Cloudera, Hortonworks
ETL Tools: Apache NiFi, Apache Airflow, Informatica, Talend, Google Cloud Factory, SQL Server Integration Services
Databases: SQL: Oracle, MySQL, PostgreSQL, DB2; NoSQL: Hive, Cassandra, MongoDB
BI and Visualization: Tableau, Power BI, Looker
Version Control Tools: Git, GitLab, GitHub, SVN
Methodologies: Software Development Life Cycle (SDLC); CI/CD Pipelines: Jenkins, Docker, Kubernetes; Project Management: JIRA, ServiceNow, Rally, Azure Dashboard
Machine Learning: Scikit-Learn, TensorFlow
PROFESSIONAL EXPERIENCE:
Sr. DATA ENGINEER
DXC Technology Jan 2024 – Present
Responsibilities:
• Designed secure, scalable AWS cloud architectures for real-time inventory tracking and multi-channel customer experiences, ensuring PCI-DSS compliance and cost optimization. Used Terraform, JIRA, and CloudWatch for automation, monitoring, and team collaboration.
• Gathered retail data requirements to improve sales, inventory, and customer insights. Built ETL pipelines using AWS Glue and Talend, integrating data from POS and CRM systems, and ensured smooth data flow with monitoring and documentation.
• Built ETL pipelines using AWS Glue, S3, Redshift, Lambda, DynamoDB, RDS, EMR, and EC2 to integrate retail data from POS and CRM systems, driving insights into sales, inventory, and customer behavior.
• Set up monitoring and logging with Amazon CloudWatch and AWS CloudTrail for efficient data flow and accurate analytics, automating workflows to support retail operations.
• Utilized Hadoop, PySpark, Hive, Snowflake, and Databricks to design and optimize large-scale data pipelines, improving data integration, transformation, and querying efficiency.
• Designed data pipelines for ingestion, integration, and aggregation using Kafka for real-time streaming, processing data from APIs and migrating Oracle and SQL Server to Amazon S3 for scalable analytics.
• Developed and executed on-premises-to-cloud migration strategies using AWS DataSync and CloudEndure, ensuring secure data transfer, optimized cloud infrastructure, and stable post-migration performance.
• Built OLAP cubes using AAS and Kyvos, enabling advanced data analysis and connected them to Excel and BI tools for interactive dashboard creation.
• Integrated Amazon DynamoDB into the retail project for high-performance, low-latency data storage, optimizing key-value and document-based data handling for inventory tracking and customer behavior analysis within the ETL pipeline.
• Automated ETL pipeline validation using Pytest and unittest, ensuring accurate data transformation and integration. Designed tests to detect anomalies and enforce business rule compliance.
• Implemented data quality checks using AWS Glue DataBrew and custom scripts to ensure consistency and integrity, applying profiling and error-handling measures to maintain high reliability standards.
• Implemented Log4j for logging and exception handling in ETL pipelines, configuring INFO, DEBUG, and ERROR log levels to track execution, capture errors, and ensure quick issue resolution for optimal system reliability and performance.
• Developed interactive dashboards and reports using Power BI, Tableau, and QuickSight, while creating secure APIs for data sharing. Queried datasets with Amazon Athena and Snowflake and built OLAP cubes for advanced multidimensional analysis and visualization.
• Used Parquet, ORC, and Avro for optimized data storage and retrieval, reducing storage costs and improving query performance in large-scale analytics workflows, while ensuring compatibility with big data tools.
• Implemented data encryption, row-level security, and column masking to protect sensitive information while ensuring workflows complied with GDPR and DPA privacy policies.
• Enforced access control using Apache Ranger, IAM Policies, and custom logic to ensure secure, role-based, and compliant data handling.
• Used Docker for containerization and Kubernetes for scalable, fault-tolerant deployments with automated workflows, secure networking, and performance monitoring using tools like Prometheus and Grafana.
• Implemented advanced scheduling with Apache Airflow to automate workflows, optimize task management, and ensure seamless execution of complex data pipelines across systems (see the sketch following this role's environment list).
• Streamlined CI/CD processes with tools like Jenkins, automating build, test, and deployment workflows. Ensured efficient and reliable application delivery across all environments.
• Used GitHub for version control to manage repositories and track code changes effectively. Enhanced collaboration and code quality through branching strategies and pull requests.
• Built and deployed machine learning models using TensorFlow, PyTorch, and scikit-learn, optimizing performance with preprocessing. Automated workflows with Kubeflow, deployed models via MLflow and SageMaker, and visualized insights using Matplotlib and Seaborn.
• Followed an Agile SDLC with sprint-based planning and iterative development, using Jira for task tracking and collaboration.
Environment: Glue, S3, Redshift, Lambda, DynamoDB, RDS, EMR, EC2, Terraform, JIRA, CloudWatch, AWS DataSync, CloudEndure, Amazon Athena, Hadoop, PySpark, Hive, Snowflake, Databricks, Kafka, AWS Glue DataBrew, Log4j, AAS, Kyvos, Power BI, Tableau, QuickSight, Parquet, ORC, Avro, AWS IAM Policies, Apache Ranger, GDPR, DPA Privacy Policies, Data Encryption, Row-Level Security, Column Masking, Docker, Kubernetes, Prometheus, Grafana, Apache Airflow, Jenkins, GitHub, TensorFlow, PyTorch, scikit-learn, Kubeflow, MLflow, SageMaker, Matplotlib, Seaborn, Agile SDLC.
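For the Airflow orchestration referenced above, the following is a minimal DAG sketch; the DAG id, schedule, and task callables are hypothetical stand-ins, shown only to indicate the general structure of extract-transform-load task dependencies, not the actual retail pipeline.

```python
# Minimal Airflow DAG sketch (illustrative only); all names and callables
# are hypothetical placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder for pulling POS/CRM extracts into S3
    print("extracting source data")


def transform():
    # Placeholder for a Glue/PySpark transformation step
    print("transforming data")


def load():
    # Placeholder for loading curated data into Redshift
    print("loading data warehouse")


default_args = {"retries": 2, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="retail_etl_example",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Linear dependency chain: extract -> transform -> load
    t_extract >> t_transform >> t_load
```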
DATA ENGINEER
Betis Group Inc Jan 2023 – Dec 2023
Responsibilities:
• Designed a scalable and fault-tolerant data pipeline using Azure to process large volumes of telecom data, ensuring high availability, low latency, and efficient handling of real-time and batch processing.
• Developed scalable ETL workflows using Azure Data Factory (ADF) and PySpark to efficiently extract, transform, and load structured and unstructured telecom data from diverse sources, ensuring high data quality and performance (see the sketch following this role's environment list).
• Integrated data from on-prem systems into Azure Blob Storage and processed it using Azure Synapse, Azure Data Lake, and Azure SQL to enhance data accessibility and processing efficiency. Utilized Azure Data Factory for automated data ingestion and transformation.
• Deployed real-time data streaming using Azure Event Hubs and Azure Stream Analytics, enabling low-latency data processing and improving network performance insights.
• Enhanced data processing using Azure Synapse to create scalable, high-performance analytical models and improve query performance for large datasets.
• Utilized Azure Data Factory to orchestrate data movement and transformation across various sources, ensuring streamlined data integration and consistency.
• Developed advanced data transformation and machine learning models using Azure Databricks, enabling efficient processing of complex datasets and improving analytical insights.
• Processed large-scale network logs and customer data using Hadoop and Spark-Scala for fast data aggregation and real-time analysis.
• Migrated large telecom datasets from Teradata to Snowflake using both batch and real-time processing methods, utilizing Snowflake's native data migration tools.
• Developed multi-dimensional models and built analytical cubes using Azure Analysis Services (AAS) and Kyvos to optimize data exploration and enable fast, interactive analysis, enhancing decision-making for telecom business operations.
• Optimized data consumption by designing efficient access patterns with Snowflake queries and enabling direct access to data through BI tools like Power BI and Tableau for seamless reporting and analysis.
• Implemented encryption, row-level security, and column masking to safeguard sensitive data, ensuring compliance with GDPR and DPA policies to maintain privacy and security across the system.
• Configured access control using Ranger and IAM policies to enforce secure, role-based data access, ensuring that users only had appropriate permissions to access sensitive data.
• Deployed data processing jobs using Docker and Kubernetes, enabling scalability and faster deployment cycles, which streamlined the management and execution of data workflows.
• Automated workflow execution using Airflow and Control-M, ensuring seamless orchestration and efficient operation of data pipelines. This streamlined scheduling and improved the reliability of data workflows.
• Used Bitbucket to manage the codebase and track changes in data pipelines, facilitating seamless team collaboration and ensuring version control for efficient development.
• Used Azure DevOps to automate CI/CD pipelines, enabling faster deployment, improving build quality, and enhancing development efficiency.
Environment: Azure Data Factory (ADF), PySpark, Azure Synapse, Hadoop, Spark-Scala, Snowflake, Azure Analysis Services (AAS), Kyvos, Hive, Power BI, Parquet, Encryption Tools, Ranger, IAM Policies, Docker, Kubernetes, Airflow, Control-M, Bitbucket, Azure DevOps, Azure Databricks.
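To illustrate the ADF/PySpark transformation work referenced above, here is a minimal PySpark sketch of a lake-to-curated-zone aggregation; the abfss paths, storage account, and column names are hypothetical placeholders rather than actual project details.

```python
# Minimal PySpark sketch of an ADLS-to-curated-zone transform (illustrative
# only); paths, the storage account, and columns are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("telecom_usage_example").getOrCreate()

# Read raw call-detail records landed in Azure Data Lake Storage Gen2
cdr = spark.read.parquet(
    "abfss://raw@exampleaccount.dfs.core.windows.net/telecom/cdr/"
)

# Aggregate usage per subscriber per day for downstream reporting
usage = (
    cdr.withColumn("call_date", F.to_date("call_start_ts"))
       .groupBy("subscriber_id", "call_date")
       .agg(
           F.count("*").alias("call_count"),
           F.sum("duration_sec").alias("total_duration_sec"),
       )
)

# Write curated output back to the lake in Parquet for Synapse/Power BI access
usage.write.mode("overwrite").parquet(
    "abfss://curated@exampleaccount.dfs.core.windows.net/telecom/daily_usage/"
)
```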
Data Analyst
Kaiser Permanente Jan 2020 – Sep 2022
Responsibilities:
• Consolidated electronic health records (EHR) and insurance data into centralized data lakes using Informatica and stored them in Hadoop HDFS for scalable processing.
• Processed large volumes of healthcare claims data with Spark to identify trends and ensure accurate reimbursements while reducing processing delays.
• Built analytical queries with Hive to gain insights into patient admission patterns, treatment outcomes, and resource allocation.
• Designed robust ETL workflows with Informatica for extracting, transforming, and loading clinical trial and prescription data into Hadoop systems.
• Used Spark Streaming for real-time healthcare transaction processing, enabling faster eligibility checks and claims submissions.
• Scheduled batch processing jobs with Oozie, streamlining periodic reporting and compliance data workflows across the Hadoop ecosystem.
• Automated data cleansing and validation using Informatica to ensure the reliability and consistency of healthcare datasets.
• Analyzed population health metrics using Hive, Pig, and Hadoop, identifying trends for proactive healthcare interventions.
• Deployed distributed storage systems with Hadoop HDFS to efficiently store unstructured healthcare datasets with high scalability.
• Developed predictive models on Spark for readmission risk predictions and disease outbreak forecasting, enhancing patient care strategies.
• Automated regulatory compliance report generation using Hive to meet standards like HIPAA while maintaining data integrity.
• Processed large-scale clinical trial data using Pig, enabling researchers to derive insights for evidence-based medicine.
• Aggregated diverse datasets, including lab results and pharmacy data, with Hive and Spark, creating unified healthcare records.
• Orchestrated complex ETL workflows with Oozie, ensuring accurate and timely processing of healthcare operational data.
• Enforced strong encryption and access controls using Informatica and Hadoop security to protect sensitive healthcare information.
• Used Pig scripts to detect fraudulent activities by analyzing historical claim patterns and identifying anomalies.
• Processed data adhering to interoperability standards like HL7 and FHIR with Hive and Spark, enabling seamless system integration.
• Transformed raw clinical data into analytical formats with Informatica and stored them in Hive tables for efficient querying.
• Improved query performance by leveraging Spark SQL and optimizing partitioned Hive tables for faster data retrieval (see the sketch following this role's environment list).
• Scheduled workflows with Oozie for time-sensitive healthcare data processing, ensuring critical deadlines were met seamlessly.
Environment: Informatica, Hadoop HDFS, Spark, Hive, Spark Streaming, Oozie, Pig, Hadoop security, Spark SQL, Hive tables, FHIR, HIPAA.
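As an illustration of the Hive and Spark SQL analysis referenced above, the following is a minimal Spark SQL sketch; the healthcare.claims table, its columns, and the date filter are hypothetical placeholders, not actual schema details from this engagement.

```python
# Minimal Spark SQL sketch for claims trend analysis (illustrative only);
# the Hive table name and columns are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("claims_trends_example")
    .enableHiveSupport()
    .getOrCreate()
)

# Monthly claim counts and average reimbursement from a partitioned Hive table
trends = spark.sql(
    """
    SELECT date_format(claim_date, 'yyyy-MM') AS claim_month,
           COUNT(*)                           AS claim_count,
           AVG(paid_amount)                   AS avg_paid_amount
    FROM healthcare.claims
    WHERE claim_date >= '2021-01-01'
    GROUP BY date_format(claim_date, 'yyyy-MM')
    ORDER BY claim_month
    """
)

trends.show(truncate=False)
```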
Python Developer
Bank of Montreal Sep 2017 – Jan 2020
Responsibilities:
• Built a centralized financial data system by handling ETL processes with SQL and performing data cleaning and transformation using Python's Pandas (see the sketch following this role's environment list).
• Predicted stock prices by training machine learning models with Python's Scikit-learn and accessing historical stock data via SQL queries.
• Created a personal finance tracker, using Python for automation and interaction and SQL to store and categorize expense data.
• Detected fraud by setting SQL triggers for suspicious activities and developing predictive models with Python's Scikit-learn.
• Developed an interactive budget planner with Python's Dash for visualization and SQL to manage budget and transaction data.
• Forecasted future revenues by building analytics models with Python's TensorFlow and feeding them historical data from SQL.
• Predicted credit scores with machine learning models built using Python and utilized SQL to fetch user financial datasets.
• Analyzed cash inflows and outflows, using Python for calculations and reports and SQL for data storage and retrieval.
• Automated tax computation with Python's NumPy for numerical operations, storing tax rules and brackets in SQL tables.
• Built a loan eligibility tool by training decision-making models with Python's Scikit-learn and storing profiles in SQL databases.
• Monitored financial KPIs dynamically through real-time dashboards built with Python's Matplotlib, retrieving data using SQL.
• Analyzed portfolio diversification with Python's financial libraries, retrieving asset allocation data stored in SQL.
• Automated profit and loss calculations using Python and queried financial data for statements through SQL databases.
• Tracked cryptocurrency trends by fetching live prices using Python APIs and storing historical price data in SQL for comparison.
• Generated financial reports automatically using Python's ReportLab, retrieving and formatting data directly from SQL databases.
Environment: Python, SQL, Pandas, NumPy, Scikit-learn, TensorFlow, Dash, Matplotlib, APIs (requests), BeautifulSoup, ReportLab, NLTK, SpaCy, SQL queries, ETL processes, SQL joins, SQL triggers, SQL tables.
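To illustrate the Python-plus-SQL pattern used throughout this role (referenced in the first bullet above), here is a minimal sketch combining pandas with a SQL query and a simple spend-anomaly flag; the SQLite database, transactions schema, and 2x threshold are hypothetical examples, not the bank's actual systems or rules.

```python
# Minimal Python/SQL sketch of expense tracking plus a simple anomaly flag
# (illustrative only); the SQLite schema and threshold are hypothetical.
import sqlite3

import pandas as pd

conn = sqlite3.connect("finance_example.db")

# Pull categorized transactions with a plain SQL query
transactions = pd.read_sql_query(
    "SELECT txn_date, category, amount FROM transactions", conn
)

# Monthly spend per category
transactions["txn_month"] = pd.to_datetime(transactions["txn_date"]).dt.to_period("M")
monthly = (
    transactions.groupby(["txn_month", "category"], as_index=False)["amount"].sum()
)

# Flag months where a category's spend exceeds 2x its overall average
avg_by_category = monthly.groupby("category")["amount"].transform("mean")
monthly["is_anomaly"] = monthly["amount"] > 2 * avg_by_category

print(monthly.sort_values(["txn_month", "category"]).to_string(index=False))
conn.close()
```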
Education:
Campbellsville University, Dec 2024
Master of Science in Computer Science, GPA 3.8