Data Engineer
Venkata Harish
Phone: +1-631-***-****
Email: ********************@*****.***
PROFESSIONAL SUMMARY:
Around 5 years of experience in IT with a strong background in data engineering, designing and implementing scalable data-driven solutions across healthcare and finance.
Expertise spans data engineering, Big Data analytics, data warehousing, visualization, and reporting, leveraging technologies such as Hadoop and Python to transform complex datasets into actionable insights that drive business success.
Skilled in configuring and leveraging Hadoop ecosystem components, including HDFS, MapReduce, Hive, YARN, Flume, HBase, Kafka, and Spark, to build efficient, high-performance data processing and analytics solutions.
Proficient in managing SQL and NoSQL databases like MongoDB, HBase, Cassandra, SQL Server, and PostgreSQL. Skilled in optimizing RDBMS structures, including tables, views, stored procedures, and transactions.
Skilled in managing scalable data pipelines using Apache Airflow, automating workflows, optimizing scheduling, and ensuring efficient data orchestration for seamless ETL processes.
Experienced in ETL/ELT pipeline development, optimizing data ingestion, transformation, and processing using Apache Spark, Airflow, AWS Glue, Azure Data Factory, and Snowflake for scalable analytics and compliance-driven solutions.
Developed and optimized Databricks ETL pipelines for real-time data processing, integrating data from AWS S3, Azure Blob Storage, Kafka, and Snowflake.
Hands-on experience with cloud services, including AWS (EC2, S3, RDS, VPC, IAM, Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SQS, Lambda, Glue, EMR) and Azure (Blob Storage, Data Factory, Virtual Machines, Key Vault), to build scalable and secure cloud-based solutions across projects for multiple clients.
Developed Spark applications using Spark-SQL in Databricks to extract, transform, and aggregate data from various formats, uncovering key insights into customer usage patterns and enabling data-driven decision-making (a representative sketch follows this summary).
Utilized Terraform and AWS CloudFormation, along with Azure Resource Manager, to automate the deployment and management of scalable cloud infrastructure, driving efficiency and scalability.
Created interactive dashboards with Amazon QuickSight, Tableau, and Power BI, providing real-time operational insights that empowered data-driven decision-making.
Skilled in implementing data governance frameworks, utilizing AWS (KMS, IAM, Secrets Manager, Lake Formation) and Azure (Key Vault, Active Directory) for secure credential management and compliance.
Implemented unit and integration testing using PyTest and unittest to ensure code reliability, enhance performance, and maintain high-quality standards in back-end development.
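Below is a minimal, illustrative PySpark sketch of the kind of Spark-SQL aggregation described in the Databricks bullet above; the storage path, table, and column names (usage_events, customer_id, duration_sec) are hypothetical placeholders rather than details from any client engagement.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Databricks a SparkSession is provided as `spark`; building one here
# keeps the sketch self-contained for local testing.
spark = SparkSession.builder.appName("usage-pattern-aggregation").getOrCreate()

# Hypothetical source: Parquet usage events landed in cloud storage.
events = spark.read.parquet("s3://example-bucket/usage_events/")  # placeholder path

# Aggregate usage per customer and month with Spark-SQL functions.
usage_summary = (
    events
    .withColumn("usage_month", F.date_trunc("month", F.col("event_ts")))
    .groupBy("customer_id", "usage_month")
    .agg(
        F.count("*").alias("event_count"),
        F.sum("duration_sec").alias("total_duration_sec"),
    )
)

# Register as a temp view so downstream Spark-SQL queries can consume it.
usage_summary.createOrReplaceTempView("customer_usage_summary")
spark.sql(
    "SELECT * FROM customer_usage_summary ORDER BY event_count DESC LIMIT 10"
).show()
```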
TECHNICAL SKILLS:
Big Data Ecosystem
Apache Spark, Apache Kafka, Apache Hadoop, HDFS, MapReduce, Hive, Pig, HBase, Oozie, Impala
Cloud Technologies
AWS (Glue, Lambda, SageMaker, Redshift, CloudWatch, CloudFormation, Step Functions, Lake Formation, Secrets Manager, CodePipeline, Fargate, EMR, SNS, SQS, IAM), Amazon S3, Amazon QuickSight, Azure (Data Factory, Databricks, Synapse Analytics, Data Lake Storage, Blob Storage, SQL Database, Kubernetes Service, Active Directory)
ETL & Data Streaming Tools
Apache NiFi, Apache Airflow
Data Management
PostgreSQL, AWS RDS, Snowflake, Informatica, AWS Redshift, MongoDB, HBase, Cassandra
Programming
Python, SQL, Scala, T-SQL, PySpark
Business Intelligence Tools
Tableau, Amazon QuickSight, Power BI, Grafana, Kibana
CI/CD
Jenkins, Docker, Amazon Elastic Kubernetes Service (EKS), AWS CodePipeline, Git, GitHub
Project Management Tools
JIRA, GitHub, Confluence, Agile
PROFESSIONAL EXPERIENCE:
U.S. Bancorp - Minneapolis, Minnesota April 2024 - Present
Data Engineer
Responsibilities:
Collaborated with cross-functional teams, including risk analysts, fraud detection specialists, and CRM managers, to gather and define data integration requirements for customer insights, compliance, and risk assessment.
Worked in an Agile environment to identify requirements for real-time transaction monitoring, fraud detection, and customer profiling using Big Data technologies and third-party APIs.
Designed scalable ETL pipelines using Azure Data Factory and Apache Hadoop ecosystem components (HDFS, MapReduce, Hive, Pig, HBase) to integrate customer data from core banking systems, CRM platforms, credit bureaus, and payment networks.
Developed data solutions with Informatica and Azure Synapse Analytics for large financial datasets, enabling real-time fraud detection and ensuring regulatory compliance.
Implemented data governance frameworks using Azure Data Lake Governance and Azure Active Directory to ensure GDPR and PCI-DSS compliance across banking datasets, providing secure and auditable data access control.
Deployed and managed big data processing frameworks like Apache Spark and Hadoop on Azure Virtual Machines to handle large-scale data transformations.
Built batch and streaming data pipelines using Apache Spark, PySpark, and Scala in Azure Databricks, processing high-volume financial transactions and customer interactions for risk assessments and behavioral segmentation.
Developed Spark-SQL applications to create DataFrames and Datasets, transforming raw banking data into actionable insights for credit scoring and fraud detection.
Implemented real-time data ingestion workflows with Azure Event Hubs and Apache Kafka, integrating structured and semi-structured data from various sources into a centralized data lake on Azure Data Lake Storage (see the streaming sketch at the end of this role).
Migrated legacy MapReduce programs to Spark transformations using Python and Scala, enhancing performance and scalability of financial data workflows.
Automated data orchestration using Apache Oozie workflows to manage interdependent Hadoop jobs, including Hive, Pig, Sqoop, and Java MapReduce tasks.
Created Python scripts for data extraction, ingestion, and transformation, leveraging APIs and file formats (Parquet, ORC, JSON, Text) for seamless integration with Azure Synapse Analytics and MongoDB.
Optimized Hive queries with partitioning, bucketing, and memory tuning techniques to improve query performance and data retrieval efficiency (see the table-layout sketch at the end of this role).
Conducted performance tuning of Spark applications using Spark Context, Spark-SQL, Pair RDDs, and Azure Resource Manager, ensuring scalability for large-scale banking analytics.
Monitored and enhanced Azure Logic Apps workflows with custom alerts and logging, improving reliability of banking data workflows.
Environment: Apache Spark (PySpark, Scala), Azure Databricks, Apache Hadoop (HDFS, MapReduce, Hive, Pig, HBase), Apache Kafka, Informatica, Azure (Data Factory, Data Lake Storage, Synapse Analytics, Data Lake Governance, Active Directory, DevOps), Python, SQL, Agile.
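A minimal sketch of the streaming ingestion pattern described in the Event Hubs/Kafka bullet for this role, assuming a hypothetical `transactions` topic with JSON payloads and an ADLS Gen2 landing path; Event Hubs can also be consumed through its Kafka-compatible endpoint, and the Spark Kafka connector package must be available on the cluster.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("txn-stream-ingest").getOrCreate()

# Hypothetical schema for transaction events published to Kafka.
txn_schema = StructType([
    StructField("txn_id", StringType()),
    StructField("account_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("txn_ts", TimestampType()),
])

# Read the raw stream; broker address and topic name are placeholders.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "transactions")
    .load()
)

# Kafka delivers bytes; parse the JSON value into typed columns.
txns = (
    raw.selectExpr("CAST(value AS STRING) AS json")
    .select(F.from_json("json", txn_schema).alias("t"))
    .select("t.*")
)

# Land the parsed stream in the data lake as Parquet, partitioned by date.
query = (
    txns.withColumn("txn_date", F.to_date("txn_ts"))
    .writeStream
    .format("parquet")
    .option("path", "abfss://datalake@exampleaccount.dfs.core.windows.net/raw/transactions/")
    .option("checkpointLocation", "abfss://datalake@exampleaccount.dfs.core.windows.net/checkpoints/transactions/")
    .partitionBy("txn_date")
    .start()
)
query.awaitTermination()
```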
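A second sketch for this role, covering the partitioning and bucketing layout from the Hive tuning bullet; it uses Spark's DataFrame writer (partitionBy/bucketBy) against placeholder database, table, and column names rather than hand-written Hive DDL.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# enableHiveSupport lets Spark SQL register tables in the Hive metastore.
spark = SparkSession.builder.appName("hive-layout-tuning").enableHiveSupport().getOrCreate()

# Hypothetical raw transactions table already registered in the metastore.
raw = spark.table("bank.transactions_raw").withColumn("txn_date", F.to_date("txn_ts"))

# Partition by txn_date so date-filtered queries prune whole directories,
# and bucket by account_id so joins/aggregations keyed on the account avoid
# a full shuffle. bucketBy requires saveAsTable (a metastore-managed table).
(
    raw.write
    .mode("overwrite")
    .format("parquet")
    .partitionBy("txn_date")
    .bucketBy(64, "account_id")
    .sortBy("account_id")
    .saveAsTable("bank.transactions_curated")
)

# A date-filtered aggregation now reads only the matching partition.
spark.sql("""
    SELECT account_id, SUM(amount) AS daily_total
    FROM bank.transactions_curated
    WHERE txn_date = DATE'2024-01-15'
    GROUP BY account_id
""").show()
```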
Wellcare - Tampa, FL Feb 2023 - March 2024
Senior Data Engineer
Responsibilities:
Collaborated with business stakeholders and cross-functional teams to define requirements for scalable ETL pipelines and align data ingestion strategies with healthcare system needs, ensuring seamless claims processing and compliance with HIPAA regulations and industry best practices.
Executed scalable ETL pipelines with Apache Airflow and AWS Glue, processing structured and semi-structured data from diverse healthcare sources to support analytics workflows (see the orchestration sketch at the end of this role).
Implemented Amazon S3 as a centralized data lake for raw claims storage, optimizing ingestion and preprocessing workflows for enhanced accessibility and performance.
Applied data governance frameworks using AWS Glue Data Catalog, IAM, and Snowflake role-based access controls to ensure HIPAA compliance.
Developed batch-processing pipelines with Apache Spark and PySpark on AWS EMR, transforming raw claims data into enriched datasets for downstream analytics.
Implemented Spark SQL and Delta Lake transformations in Databricks, enabling high-performance claims data processing and analytics for complex healthcare use cases.
Maintained centralized storage solutions with Amazon S3 for structured and unstructured claims data, optimizing data retrieval and integration.
Developed data governance frameworks using AWS Security Hub and AWS Config to ensure compliance and secure data management.
Built end-to-end data pipelines with PySpark and Python, integrating data from policy transactions and claims systems for downstream analytics.
Implemented serverless data processing workflows using AWS Lambda to transform incoming claims data and trigger automated processes (see the Lambda sketch at the end of this role).
Orchestrated ETL workflows using Apache Airflow and AWS Step Functions, managing dependencies and ensuring seamless execution of complex data pipelines.
Provided post-deployment support, troubleshooting data pipeline issues, and optimizing workflows to ensure high availability and performance consistency.
Secured sensitive claims data with AWS Secrets Manager, OAuth 2.0, and Snowflake access controls, maintaining HIPAA compliance and protecting patient information.
Tracked data lineage using Snowflake Information Schema, ensuring consistent reporting governance and transparency across healthcare analytics processes.
Managed CI/CD pipelines with Git, AWS CodePipeline, and Jenkins, streamlining deployment of data processing applications for rapid delivery and updates.
Environment: Spark, Databricks, Apache Airflow, Amazon Kinesis, Apache Kafka, Snowflake, AWS (S3, Redshift, RDS (PostgreSQL), EKS, CloudWatch), Terraform, AWS Step Functions, AWS Glue, IAM, AWS CloudFormation, Docker, Git, Jenkins, Tableau, Python, PySpark.
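A minimal sketch of the serverless claims-transformation pattern from the AWS Lambda bullet in this role, assuming an S3 ObjectCreated trigger, a hypothetical processed bucket, and a simplified claims JSON layout; names and fields are placeholders.

```python
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")

# Hypothetical destination bucket for transformed claims.
PROCESSED_BUCKET = "example-claims-processed"


def handler(event, context):
    """Triggered by S3 ObjectCreated events; normalizes raw claim records."""
    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Read the raw claim file (assumed to be a JSON array of claims).
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        claims = json.loads(body)

        # Light normalization: keep only expected fields, standardize casing.
        cleaned = [
            {
                "claim_id": c.get("claim_id"),
                "member_id": c.get("member_id"),
                "amount": float(c.get("amount", 0)),
                "status": str(c.get("status", "")).upper(),
            }
            for c in claims
        ]

        # Write the transformed payload to the processed bucket for
        # downstream batch jobs to pick up.
        s3.put_object(
            Bucket=PROCESSED_BUCKET,
            Key=f"normalized/{key}",
            Body=json.dumps(cleaned).encode("utf-8"),
        )
    return {"processed_files": len(records)}
```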
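And a condensed sketch of the Airflow-plus-Glue orchestration referenced in this role; the DAG id, Glue job name, and schedule are illustrative, and a production DAG would more likely use the Amazon provider's GlueJobOperator than raw boto3 calls.

```python
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator


def start_claims_glue_job(**_):
    """Kick off the (hypothetical) Glue job that enriches raw claims."""
    glue = boto3.client("glue")
    run = glue.start_job_run(JobName="claims-enrichment-job")  # placeholder name
    print(f"Started Glue run {run['JobRunId']}")


with DAG(
    dag_id="claims_etl_daily",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    tags=["claims", "etl"],
) as dag:
    validate_landing = PythonOperator(
        task_id="validate_raw_claims_landed",
        python_callable=lambda **_: print("raw claims present in S3 (placeholder check)"),
    )

    run_glue_job = PythonOperator(
        task_id="start_claims_glue_job",
        python_callable=start_claims_glue_job,
    )

    # Simple linear dependency: validate landing, then transform with Glue.
    validate_landing >> run_glue_job
```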
Equifax - Mumbai, India June 2019 - Nov 2022
Python Developer
Responsibilities:
Assisted in building scalable web applications using Python, Django, and Flask frameworks to enhance user engagement and improve functionality.
Supported the development of SOAP and RESTful APIs with Django REST Framework, optimizing performance and reducing latency for seamless data exchange.
Created intuitive website interfaces using HTML, XHTML, AJAX, CSS, and JavaScript, ensuring a user-friendly experience.
Contributed to crafting views and templates with Python and Django’s templating tools to deliver responsive and accessible front-end designs.
Participated in automating workflows with Python scripts to manage RDBMS platforms, execute SQL queries, and streamline data manipulation tasks.
Implemented Python-based GUI components to enhance front-end features like selection criteria for improved usability.
Assisted in constructing data tables with PyQt to manage customer and policy details, enabling efficient record updates and retrieval.
Participated in all stages of the software development lifecycle (SDLC), including design, development, testing, and implementation.
Developed complete front-end and back-end modules using Python on the Django web framework, implementing MVC architecture.
Developed internal tools using Python and Django with MongoDB as the database; wrote Python parsers to extract useful data from the design database and used the Parse kit (Enigma.io) framework to build parsers for ETL extraction.
Worked with data analysts and data warehouse DB admins to identify the tables needed to compile SDA containers, which were in turn used to produce FHIR resources.
Created internal diagnostic tools using Golang and AngularJS to assist with customer issues. Implemented Spark jobs using PySpark libraries for faster testing and processing of data.
Implemented responsive user interface and standards throughout the development and maintenance of the website using HTML, CSS, JavaScript, Bootstrap, jQuery.
Debugged and enhanced existing workflows, documented updates in Confluence, and maintained code in a Git repository on GitHub.
Contributed to setting up staging environments for thorough testing, collaborating with a global remote team to ensure successful project outcomes.
Integrated APIs and web services using Python (Requests, BeautifulSoup, Selenium) to collect and process external data (see the sketch at the end of this section).
Assisted in tracking sprint progress using Agile methodologies and JIRA, contributing to efficient team coordination and delivery.
Environment: Python, Django, HTML, CSS, AJAX, JavaScript, GitHub, MySQL, PostgreSQL, Apache Web Server, Linux.
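An illustrative sketch of the API and web data collection pattern from the Requests/BeautifulSoup bullet above; the endpoints, CSS selectors, and field names are placeholders, not actual client sources.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder endpoints; real sources would come from project configuration.
API_URL = "https://example.com/api/v1/reports"
PAGE_URL = "https://example.com/public/disclosures"


def fetch_report_json(session: requests.Session) -> list[dict]:
    """Pull structured records from a JSON API with basic error handling."""
    resp = session.get(API_URL, timeout=30)
    resp.raise_for_status()
    return resp.json().get("results", [])


def scrape_disclosure_titles(session: requests.Session) -> list[str]:
    """Parse an HTML page and collect disclosure titles from table cells."""
    resp = session.get(PAGE_URL, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    return [cell.get_text(strip=True) for cell in soup.select("table tr td.title")]


if __name__ == "__main__":
    with requests.Session() as session:
        records = fetch_report_json(session)
        titles = scrape_disclosure_titles(session)
        print(f"Fetched {len(records)} API records and {len(titles)} scraped titles")
```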