Data Engineer Senior

Location:
Cleveland, OH
Posted:
February 24, 2025

Venkata Subrahmanya

Senior Data Engineer

Email: **************@*****.*** Ph: +1-720-***-****

BACKGROUND SUMMARY:

• Data Engineer with 10+ years of experience in analysis, design, and development across Big Data environments, including Hadoop, Scala, PySpark, and HDFS, with additional expertise in Python.

• Implemented Big Data solutions using the Hadoop technology stack, including PySpark, Hive, and Sqoop, and optimized PySpark jobs to run on Kubernetes clusters for faster data processing.

• Architected and managed AWS environments, including VPCs, subnets, and security groups, with hands-on experience in legacy data migration projects such as Teradata to AWS Redshift and on-premises to AWS Cloud.

• Experience in optimizing search performance with ElasticSearch and managing key-value (KV) stores such as HBase.

• Configured and optimized Azure services, including Data Factory, SQL Database, CosmosDB, Stream Analytics, Databricks, Load Balancers, and Auto Scaling groups, ensuring high availability and scalability.

• Designed, built, and deployed applications utilizing the AWS stack (EC2, R53, S3, RDS, HSM, DynamoDB, SQS, IAM, and EMR), focusing on high availability, fault tolerance, and auto-scaling.

• Developed and maintained ETL workflows using Talend and Informatica, efficiently extracting, transforming, and loading data from various sources into data warehouses, and implemented data quality checks and cleansing routines.

• Proficient in SQL and NoSQL databases, including Oracle, MySQL, and MongoDB, with strong working experience in data modeling and developing complex SQL queries for data warehousing and integration solutions.

• Hands-on experience with CI/CD pipelines, setting up a Jenkins master and multiple slaves for continuous development and deployment, and converting Hive queries into Spark actions and transformations (see the sketch at the end of this summary).

• Implemented monitoring and alerting solutions using Azure Monitor and other third-party tools, proactively detecting and resolving issues in the data processing pipeline.

• Proficient in scripting languages including Python, Bash, and R, with experience in developing custom data processing solutions and automating data workflows.

• Involved in data warehousing and analytics projects using Hadoop, MapReduce, Hive, and other open-source tools/technologies.

• Provided support to data analysts in running Hive queries and building ETL processes.

• Created high-performance, scalable, and maintainable Java applications for complex business requirements and enhanced Java application performance and responsiveness with multithreading and concurrency features.

• Defined user stories and drove the agile board in JIRA during project execution, and participated in sprint demos and retrospectives.

• Strong hands-on experience with microservices frameworks such as Spring IO and Spring Boot, deploying them on cloud infrastructures such as AWS.

• Knowledge of High Availability (HA) and Disaster Recovery (DR) options in AWS and implemented data backup and disaster recovery solutions using AWS services such as EBS snapshots, S3 versioning, and Glacier storage.

• Experience in developing and optimizing complex ETL pipelines using various tools and technologies.

• Configured and managed network components to ensure secure and efficient communication between different parts of the data processing pipeline.

• Collaborated with cross-functional teams, including product managers, designers, and marketing teams, to define A/B testing objectives and success criteria.

• Expertise in data visualization and reporting, creating dashboards and reports using Tableau and PowerBI.
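
A minimal sketch of the kind of Hive-to-Spark conversion mentioned above, using the PySpark DataFrame API; the `orders` source table, column names, and `orders_totals` target are hypothetical placeholders, not details from any engagement below.

```python
# Hive query rewritten as PySpark DataFrame transformations (sketch).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("hive_to_spark_sketch")
         .enableHiveSupport()
         .getOrCreate())

# Roughly equivalent to:
#   SELECT customer_id, SUM(amount) AS total
#   FROM orders WHERE status = 'CLOSED' GROUP BY customer_id
orders = spark.table("orders")
totals = (orders
          .filter(F.col("status") == "CLOSED")
          .groupBy("customer_id")
          .agg(F.sum("amount").alias("total")))

totals.write.mode("overwrite").saveAsTable("orders_totals")
```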

TECHNICAL SKILLS:

Programming Languages: Python (PySpark, Pandas), T-SQL, Java, R, Scala, PL/SQL

Big Data Technologies: Spark (PySpark, Spark applications), Hadoop, MapReduce, Hive, Kafka, Snowflake, Apache Airflow

Database Tools: Snowflake, Azure Synapse Analytics, NoSQL (MongoDB, Cassandra, DynamoDB), SQL Server, MySQL, T-SQL, PostgreSQL, Oracle, DB2

Cloud Platforms: Azure (Databricks, Data Lake Storage, Data Factory, SQL Database, CosmosDB, Stream Analytics, Blob Storage), AWS (S3, Redshift, EMR, Lambda, Glue), GCP (BigQuery)

ETL Tools: Informatica PowerCenter, Talend, SSIS, SSAS, SSRS, AWS Glue

Data Visualization Tools: Power BI, Tableau

Version Control Systems, CI/CD: Git, GitHub, GitLab, Jenkins, Docker, DevOps

Data Quality and Governance: Data quality checks, metadata management, data governance policies

Collaboration: Jira

PROJECT EXPERIENCE:

Client: Centene, St. Louis, MO Feb 2023 – Present

Title: Senior Data Engineer

• Implemented data collection strategies using Spark Streaming to extract real-time data from AWS S3 buckets, enabling immediate data availability for analytics.

• Designed Kafka producer clients using Confluent Kafka to produce events into Kafka topics, ensuring reliable and scalable data ingestion (see the sketch at the end of this section).

• Managed Hadoop infrastructure for data storage in HDFS and utilized AWS Glue Crawlers to catalog metadata, enhancing data discovery and integration across the organization.

• Developed Python scripts and modules for ETL processes, ensuring high data quality and consistency. Created scalable ETL pipelines with AWS Glue, improving processing efficiency by 30%.

• Managed PostgreSQL databases, including installation, configuration, and performance tuning, resulting in optimal query execution and system performance.

• Designed and implemented NoSQL data models using MongoDB and Cassandra to efficiently manage semi-structured and unstructured data, supporting diverse application needs.

• Leveraged AWS Lambda for serverless computing, optimizing resource usage and enhancing the scalability of various applications, leading to cost savings and improved performance.

• Wrote SQL scripts for data migration and successfully loaded historical data from Teradata SQL to Snowflake, ensuring seamless data transfer and continuity.

• Used Python in Spark to extract data from Snowflake and upload it to Salesforce on a daily basis, ensuring up-to-date data for sales operations.

• Utilized AWS Machine Learning services to develop predictive models and conduct advanced analytics on AWS-stored data, driving data-driven decision-making.

• Implemented partitioning, caching, and tuning techniques to optimize Spark jobs for efficient data processing, improving performance and scalability.

• Developed job scheduling using Airflow for Hive, Spark, and MapReduce tasks, enhancing workflow automation and reliability.

• Implemented and maintained ElasticSearch clusters to improve search performance and reliability, reducing query response time by 30%.

• Managed ElasticSearch cluster health and scaling, ensuring high availability and fault tolerance.

• Executed machine learning use cases using Spark ML and MLlib, enabling advanced analytics and predictive modeling for big data applications.

• Designed and implemented HBase schemas for efficient data retrieval and storage, reducing latency and improving read/write performance.

• Integrated HBase with data processing pipelines, enabling seamless data ingestion and real-time analytics.

• Collaborated with cross-functional teams to gather requirements and translate them into technical specifications for Talend ETL jobs, ensuring alignment with business objectives.

• Developed reusable objects such as PL/SQL program units, database procedures, and functions, streamlining development processes and maintaining consistency in business rule implementations.

• Developed automation regression scripts for validating ETL processes across databases including AWS Redshift, Oracle, MongoDB, and SQL Server (T-SQL), ensuring data consistency and reliability.

• Developed Python-based Spark applications using PySpark API, leveraging Pandas and NumPy for data manipulation and analysis, enhancing data processing capabilities.

• Developed efficient Spark code in Scala with Spark-SQL/Streaming for accelerated data processing, leading to significant performance improvements.

• Created Tableau dashboards to visualize ETL performance metrics, providing insights for optimization.

• Integrated Tableau with Snowflake, enabling real-time analytics and reporting for stakeholders.

• Developed infrastructure-as-code scripts using Terraform to manage AWS resources, enabling automated and consistent environment setups.

• Implemented Terraform modules to provision and manage AWS S3, EC2, and RDS instances, ensuring infrastructure scalability and reliability.

• Established CI/CD pipelines using Jenkins to automate the deployment of ETL jobs, reducing manual intervention and deployment times.

• Integrated CI/CD workflows with GitLab to ensure seamless code integration and delivery, enhancing development efficiency.

• Utilized Git for version control to track changes and collaborate on ETL pipeline development, ensuring code integrity.

• Managed project repositories in GitLab, facilitating code reviews and ensuring compliance with development standards.

• Configured Jenkins jobs to automate the testing and deployment of Spark applications, improving release cycles.

• Integrated Jenkins with AWS services to orchestrate automated builds and deployments, enhancing operational efficiency.

• Containerized Spark applications using Docker to ensure consistent runtime environments and simplified deployment processes.

• Utilized Docker Compose to define and manage multi-container Docker applications, streamlining development workflows.

• Deployed containerized applications on Kubernetes clusters to achieve high availability and scalability of Spark jobs.

• Managed Kubernetes resources using Helm charts to automate the deployment and scaling of data processing applications.

Environment: AWS, AWS Lambda, Hadoop, Apache Spark, PySpark, Spark Streaming, AWS Redshift, Oracle, MongoDB, T-SQL, Snowflake, Apache Airflow, Talend, Python, Scala, PostgreSQL, Cassandra, Apache Kafka, Salesforce, SQL, NoSQL, Tableau, ETL, Git, GitHub Actions.
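
A minimal sketch of the kind of Confluent Kafka producer client described in this project; the broker address, topic name, and event payload are hypothetical placeholders rather than details of the actual engagement.

```python
# Confluent Kafka producer sketch; broker, topic, and payload are placeholders.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # assumed broker

def delivery_report(err, msg):
    # Invoked once per message to confirm delivery or surface errors.
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [partition {msg.partition()}]")

event = {"record_id": "12345", "status": "RECEIVED"}  # hypothetical event
producer.produce("ingest-events",
                 value=json.dumps(event).encode("utf-8"),
                 callback=delivery_report)
producer.flush()  # block until outstanding messages are delivered
```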

Client: FINRA, Rockville, MD July 2021 – Jan 2023

Title: Senior Data Engineer

• Leveraged expertise in Azure Data Factory for proficient data integration and transformation, optimizing processes for enhanced efficiency.

• Managed Azure Cosmos DB for globally distributed, highly available, and secure NoSQL databases, ensuring optimal performance and data integrity.

• Created end-to-end solutions for ETL transformation jobs involving Informatica workflows and mappings.

• Demonstrated extensive experience in ETL tools, including Teradata Utilities, Informatica, and Oracle, ensuring efficient and reliable data extraction, transformation, and loading processes.

• Automated ETL processes using PySpark DataFrame APIs, reducing manual intervention and ensuring data consistency and accuracy.

• Integrated Azure Databricks into end-to-end ETL pipelines, facilitating seamless data extraction, transformation, and loading.

• Implemented complex data transformations using Spark RDDs, DataFrames, and Spark SQL to meet specific business requirements.

• Developed real-time data processing applications using Spark Streaming, capable of handling high-velocity data streams.

• Developed and implemented data security and privacy solutions, including encryption and access control, to safeguard sensitive healthcare data stored in Azure.

• Enhanced search performance by implementing and maintaining ElasticSearch clusters, reducing query response time by 30%.

• Ensured high availability and fault tolerance by managing ElasticSearch cluster health and scaling.

• Designed and implemented PostgreSQL database schemas and table structures based on normalized data models and relational database principles.

• Created interactive and insightful dashboards and reports in Power BI, translating complex data sets into visually compelling insights for data-driven decision-making.

• Seamlessly integrated HBase with data processing pipelines, facilitating real-time analytics and data ingestion.

• Utilized Python, including the pandas and NumPy packages, along with Power BI to create various data visualizations, while also performing data cleaning, feature scaling, and feature engineering tasks.

• Developed machine learning models such as Logistic Regression, KNN, and Gradient Boosting with Pandas, NumPy, Seaborn, Matplotlib, and Scikit-learn in Python (see the sketch at the end of this section).

• Designed and coordinated with the Data Science team in implementing advanced analytical models in Hadoop Cluster over large datasets, contributing to efficient data workflows.

• Automated the provisioning of Azure resources using Terraform scripts, ensuring consistent and repeatable environment setups.

• Managed infrastructure changes using Terraform, enabling version-controlled and auditable infrastructure deployments.

• Implemented CI/CD pipelines with Jenkins for automated testing and deployment of ETL processes, reducing manual errors.

• Integrated CI/CD workflows with GitLab for continuous integration and delivery, enhancing the efficiency of development cycles.

• Leveraged Git for version control to manage code changes and collaborate on ETL development, ensuring code quality.

• Coordinated with teams using GitLab repositories, facilitating collaborative development and code reviews.

• Configured Jenkins pipelines to automate the testing and deployment of data integration jobs, improving release management.

• Automated deployments by integrating Jenkins with Azure and containerized ETL workflows with Docker for consistent environments across all stages.

• Utilized Docker to deploy scalable and reproducible environments for data processing applications.

• Deployed containerized data processing applications on Kubernetes clusters for enhanced scalability and reliability.

• Managed Kubernetes deployments using Helm to simplify the deployment and scaling of ETL pipelines.

Environment: Azure, Azure Data Factory, Azure CosmosDB, ETL, Informatica, PySpark, Azure HDInsight, Apache Spark, Hadoop, Spark-SQL, Scikit-learn, Pandas, NumPy, PostgreSQL, MySQL, Python, Scala, Power BI, SQL.
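
A minimal sketch of the kind of scikit-learn modeling described in this project (logistic regression with a train/test split); the input file and column names are hypothetical placeholders.

```python
# Logistic regression sketch with scikit-learn; file and column names are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df = pd.read_csv("training_data.csv")   # hypothetical prepared feature set
X = df.drop(columns=["label"])          # hypothetical feature columns
y = df["label"]                         # hypothetical binary target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```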

Client: Bank of America, Charlotte, NC Dec 2019 – Jun 2021

Title: Data Engineer

• Created Azure Data Factory (ADF) pipelines using PolyBase and Azure Blob Storage.

• Wrote Python scripts to automate script generation and performed data curation using Azure Databricks.

• Used Azure Databricks, PySpark, HDInsight, U-SQL, T-SQL, Spark SQL, Azure ADW, and Hive to load and transform data, performing ETL in Azure Databricks.

• Migrated an on-premises Oracle ETL process to Azure Synapse Analytics and utilized Databricks to perform ETL, enabling efficient data transformations and seamless integration with Azure Synapse Analytics.

• Developed PySpark applications in Databricks for large-scale data processing, ensuring optimal performance and reliability.

• Utilized ETL transformations to handle schema changes and accommodate evolving business requirements seamlessly.

• Wrote Python scripts to design and develop the ETL (Extract-Transform-Load) process that maps, transforms, and loads data to the target, and performed Python unit tests.

• Performed troubleshooting and deployed Python bug fixes for the main applications, keeping them efficiently maintained.

• Implemented error handling and logging mechanisms within Python scripts to ensure robustness and reliability.

• Utilized SparkSQL for executing SQL queries on distributed data, enabling seamless integration with traditional SQL-based ETL processes.

• Troubleshot and debugged PySpark applications, identifying and resolving issues related to data processing, performance, and system compatibility.

• Developed and maintained data pipelines using Pandas, integrating data from various sources and formats.

• Utilized advanced SQL features such as window functions and CTEs to solve intricate data analysis challenges (see the sketch at the end of this section).

• Applied advanced data modeling techniques in PowerBI, ensuring accurate representation of data relationships, and performed data transformations for enhanced visualization.

• Troubleshot and resolved issues related to Tableau dashboards, data connections, and performance bottlenecks.

• Collaborated with data engineers and database administrators to design and optimize data models and data infrastructure to support Tableau reporting needs.

Environment: Azure, Azure Data Factory, Azure Databricks, SQL, T-SQL, Hive, Apache Spark, PySpark, Python, ETL, SparkSQL, Power BI.
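
A minimal sketch of the kind of Spark SQL work described in this project, combining a CTE with a window function; the sample data, table name, and column names are hypothetical.

```python
# Spark SQL sketch combining a CTE and a window function; data is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("window_cte_sketch").getOrCreate()

# Register a tiny hypothetical transactions dataset as a temp view.
spark.createDataFrame(
    [("A1", "2021-01-10", 120.0),
     ("A1", "2021-02-03", 75.5),
     ("B2", "2021-01-22", 310.0)],
    ["account_id", "txn_date", "amount"],
).createOrReplaceTempView("transactions")

# Latest transaction per account via ROW_NUMBER() inside a CTE.
latest = spark.sql("""
    WITH ranked AS (
        SELECT account_id, txn_date, amount,
               ROW_NUMBER() OVER (PARTITION BY account_id
                                  ORDER BY txn_date DESC) AS rn
        FROM transactions
    )
    SELECT account_id, txn_date, amount FROM ranked WHERE rn = 1
""")
latest.show()
```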

Client: Mercury Insurance Group, Brea, CA Jan 2018 – Nov 2019

Title: Data Engineer

• Conducted in-depth data analysis using Excel, leveraging functions like VLOOKUP, HLOOKUP, and pivot tables to derive meaningful insights from large datasets, resulting in an increase in data processing efficiency.

• Managed and optimized data storage solutions on AWS, ensuring efficient retrieval and storage of datasets for analytical purposes, leading to a reduction in data access time.

• Validated and improved Python reports, identifying and fixing bugs to ensure accurate and reliable reporting, reducing report errors.

• Led the re-architecture and migration of on-premises SQL data warehouses to AWS cloud data platforms, resulting in a cost reduction and improved scalability.

• Developed data integration solutions with ETL tools such as Informatica PowerCenter and Teradata Utilities, reducing ETL processing time.

• Automated the extraction of various files, including flat and Excel files, from FTP and SFTP sources, streamlining data retrieval and increasing data processing speed (see the sketch at the end of this section).

• Designed and implemented data pipelines using PySpark, seamlessly integrating diverse data sources and formats, improving data pipeline reliability.

• Developed and maintained Tableau dashboards and visualizations, providing meaningful insights and analyses for informed decision-making, which improved business decision-making speed.

• Ensured data quality and consistency across multiple platforms by implementing robust validation and error-checking mechanisms, reducing data inconsistencies.

• Optimized data processing workflows to improve performance and reduce costs using AWS and PySpark, leading to a 35% reduction in operational costs and decreasing downtime.

Environment: AWS, PySpark, SQL, Python, Informatica, ETL, Tableau, Excel.
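
A minimal sketch of the kind of automated FTP file extraction and validation described in this project, using Python's standard ftplib and pandas; the host, credentials, file name, and column names are hypothetical placeholders.

```python
# FTP flat-file extraction and basic validation sketch; host, credentials,
# file name, and columns are hypothetical placeholders.
from ftplib import FTP
import pandas as pd

with FTP("ftp.example.com") as ftp:            # hypothetical host
    ftp.login(user="reports", passwd="***")     # placeholder credentials
    with open("daily_policies.csv", "wb") as fh:
        ftp.retrbinary("RETR daily_policies.csv", fh.write)

df = pd.read_csv("daily_policies.csv")

# Simple validation of the kind described above: required columns, non-null keys.
assert {"policy_id", "premium"}.issubset(df.columns)
assert df["policy_id"].notna().all()
```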

Client: ICICI Financial Services, Hyderabad, India Jun 2014 – Aug 2017

Title: Data Engineer

• Developed and maintained ETL pipelines using Informatica 6.1 to process structured and unstructured data from Flat Files and Oracle 9i, enhancing data accuracy and accessibility for analysis purposes.

• Designed dynamic and insightful dashboards and reports using Power BI, enabling real-time decision-making by visualizing data trends and analytics derived from multiple sources including Teradata.

• Managed data workflows and documentation using Jira and Confluence, ensuring project tracking and effective communication across the development team.

• Configured and managed AWS S3 buckets for secure data storage and retrieval, facilitating scalable and efficient data handling within cloud environments (see the sketch at the end of this section).

• Implemented data quality assurance practices using Data Flux and Quality Center 7.2, achieving a 20% improvement in data integrity and reliability across projects.

• Developed complex SQL queries and procedures using TOAD and PL/SQL to extract, analyze, and report data, supporting strategic business initiatives and operational improvements.

• Led cross-functional teams in the integration and consolidation of data systems, utilizing Oracle 9i and Teradata, to support enterprise-wide data analytics platforms.

• Spearheaded the migration of legacy systems to modern data platforms, employing tools like Informatica 6.1 and AWS S3, to enhance data processing capabilities and support future growth.

Environment: Informatica 6.1, Scala, Power BI, Visual Studio, Jupyter Notebook, Python, Jira, Confluence, SQL, TOAD, PL/SQL, Flat Files, Teradata.
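
A minimal sketch of the kind of S3 storage and retrieval configuration described in this project, using boto3; the bucket name and object keys are hypothetical placeholders.

```python
# S3 storage and retrieval sketch with boto3; bucket and keys are hypothetical.
import boto3

s3 = boto3.client("s3")

# Store a local extract in a (hypothetical) bucket.
s3.upload_file("monthly_extract.csv",
               "example-data-bucket",
               "extracts/monthly_extract.csv")

# Retrieve the same object later for downstream processing.
s3.download_file("example-data-bucket",
                 "extracts/monthly_extract.csv",
                 "monthly_extract_copy.csv")
```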


