Python, Azure, AWS, Kafka, Snowflake, DevOps, SQL, NoSQL

Location: Cincinnati, OH
Posted: April 19, 2024

Vinod Yadav Kannaboina

ad44o1@r.postjobfree.com

LinkedIn: www.linkedin.com/in/vinod-yadav401090

913-***-****

Skills Summary

10+ years of experience in data engineering, analysis, design, development, testing, customization, bug fixes, enhancement, support, and implementation using Python, Java, and Spark programming for Hadoop. Worked in AWS environments with services such as EMR, Athena, Glue, IAM, S3, CloudFormation, Redshift, RDS, Data Pipeline, and EC2.

Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.

Highly experienced in developing data marts and data warehouse designs using distributed SQL concepts, Presto SQL, Hive SQL, Python (Pandas, NumPy, SciPy, Matplotlib), and PySpark to handle growing data volumes.

Developed Python and PySpark applications for data analysis.

Developed highly scalable Python applications using Django.

Developed the PySpark code for AWS Glue jobs and for EMR.
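
Illustrative sketch of a Glue-style PySpark job (the database, table, and S3 path names below are hypothetical placeholders, not taken from this resume):

    import sys
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions

    # Resolve the job name supplied by the Glue runtime
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read a table registered in the Glue Data Catalog (placeholder names)
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="raw_orders"
    )

    # Rename/cast columns before writing out
    mapped = ApplyMapping.apply(
        frame=dyf,
        mappings=[("order_id", "string", "order_id", "string"),
                  ("amount", "string", "amount", "double")],
    )

    # Write the result to S3 as Parquet (placeholder bucket)
    glue_context.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://example-bucket/curated/orders/"},
        format="parquet",
    )
    job.commit()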

Extensive experience in IT data analytics projects; hands-on experience migrating on-premises ETLs to Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, and Composer.

Practical understanding of data modeling (dimensional and relational) concepts such as star schema modeling, snowflake schema modeling, and fact and dimension tables.

Extensive experience with Azure Data Factory, orchestrating cloud-based ETL processes and ensuring seamless data integration.

Adept at leveraging Selenium WebDriver to develop and maintain robust automated test scripts for web applications.

Proven proficiency in ETL development using SSIS, excelling in designing and optimizing data workflows for efficiency and data integrity.

Developed Python code to gather data from HBase and designed the solution for implementation using Spark.

Good understanding of Spark Architecture including Spark Core, Spark SQL, Data Frames, Spark Streaming, Driver Node, Worker Node, Stages, Executors and Tasks.

Skillful in designing modular test frameworks, integrating with CI/CD pipelines, and conducting comprehensive cross-browser and API testing.

Highly motivated to work on Python and Spark scripts and on performance tuning to improve efficiency and data quality.

Good experience in Linux Bash scripting and in following PEP guidelines (e.g., PEP 8) in Python.

Worked on Kafka for data streaming and ingestion.
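
A minimal sketch of this kind of Kafka ingestion producer, using the kafka-python client (the broker address and topic name are placeholders):

    import json
    from kafka import KafkaProducer

    # Producer for a hypothetical ingestion topic
    producer = KafkaProducer(
        bootstrap_servers=["localhost:9092"],
        value_serializer=lambda record: json.dumps(record).encode("utf-8"),
        acks="all",   # wait for all in-sync replicas for durability
        retries=3,
    )

    event = {"event_id": 123, "source": "orders", "amount": 42.5}
    producer.send("ingest.orders", value=event)
    producer.flush()  # block until buffered records are delivered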

Technical Skills:

Hadoop: Spark, Hive, Oozie, Sqoop, Kafka, HDFS, YARN, Zeppelin and HBase

AWS: EMR, Glue, Athena, DynamoDB, Redshift, RDS, Data Pipeline, Lake Formation, S3, IAM, CloudFormation, EC2, ELB/CLB, CloudWatch, SQS, AWS Lambda, SNS, CloudFront, EFS, Fargate, ECS, EBS, EKS

GCP: BigQuery, Dataproc, Dataflow, Cloud Functions, GCS

Azure: Azure Data Factory, Azure Databricks, Azure DevOps, Synapse, Azure SQL, Delta Lake, Data Lake, Event Hubs

Confluent Cloud: Kafka, ksqlDB

Operating systems: Amazon Linux 1 and 2, Custom AMIs based on Amazon Linux with encryption, Windows, CentOS, RHEL

Programming Languages: Java, Python, Scala, Spark, Glue ETL

Web: Servlets, JSP, Spring MVC, Spring Boot, Hibernate

Front End: HTML, XML, React JS, Node JS, JSF, AJAX, JavaScript, CSS, jQuery.

Databases: MSSQL, DynamoDB, HBase, Cassandra, Teradata, DBT, MongoDB, MySQL, SQL Server 2008, PostgreSQL

Version Control: Git, SVN, SourceTree

Scripting Languages: Shell scripting, PowerShell, Bash

DevOps Platforms: Docker, Jenkins, Kubernetes, Chef, Puppet, Ansible, Terraform.

Streaming Platforms: Kafka, Confluent Kafka

Logging/Analytics: Splunk

Work History:

Data Engineer Sep 2020-Present

Paycore – New York

Utilized AWS and Azure services such as Amazon EMR, ECS, Glue, Athena, Databricks, Data Factory, Azure Functions and Airflow, along with Jupyter notebooks, to address diverse data engineering challenges.

Contributed to the requirements, analysis, and design of the Airflow/Spark core engine architecture within the AWS environment.

Developed and optimized SQL queries, stored procedures, and data pipelines to extract, transform, and load (ETL) data into Snowflake from various source systems.
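
A hedged sketch of such a load-and-transform step using the Snowflake Python connector (account, credentials, stage, and table names are placeholders):

    import snowflake.connector

    # Connection parameters would normally come from a secrets manager
    conn = snowflake.connector.connect(
        account="xy12345", user="etl_user", password="***",
        warehouse="ETL_WH", database="ANALYTICS", schema="STAGING",
    )
    try:
        cur = conn.cursor()
        # Bulk-load files from an external stage into a staging table
        cur.execute("""
            COPY INTO staging.orders
            FROM @raw_orders_stage
            FILE_FORMAT = (TYPE = 'PARQUET')
            MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
        """)
        # Merge staged rows into the reporting table
        cur.execute("""
            MERGE INTO reporting.orders AS tgt
            USING staging.orders AS src
            ON tgt.order_id = src.order_id
            WHEN MATCHED THEN UPDATE SET tgt.amount = src.amount
            WHEN NOT MATCHED THEN INSERT (order_id, amount)
                VALUES (src.order_id, src.amount)
        """)
    finally:
        conn.close()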

Leveraged core Python and Django middleware to build a web application, and used PySpark and Python to create core engines for data validation and analysis.

Developed interactive and responsive web applications using React.js as the primary front-end framework.

Designed, developed, and maintained ETL (Extract, Transform, Load) processes using IBM DataStage, ensuring efficient and reliable data integration across various source and target systems.

Demonstrated proficiency in designing, implementing, and managing real-time data streaming pipelines using Apache Kafka, facilitating high-throughput, fault-tolerant, and scalable data ingestion, processing, and distribution across distributed systems.

Developed and maintained Python-based data analysis and machine learning pipelines to extract insights and drive decision-making processes.

Designed, implemented, and maintained infrastructure as code (IaC) using Terraform to automate the provisioning and management of cloud resources on AWS and Azure.

Implemented and optimized real-time data processing workflows on Snowflake, leveraging its unique architecture for seamless integration with streaming data sources.

Utilized Docker to containerize and deploy applications, improving portability, scalability, and efficiency.

Designed and implemented data warehouse solutions using Snowflake, including schema design, table structures, and storage optimization techniques to support analytics and reporting requirements.

Utilized Postman as a key tool for API development, testing, and documentation, enabling streamlined and efficient API workflows throughout the software development lifecycle.

Extensive experience in developing and optimizing T-SQL queries, stored procedures, functions, and triggers for Microsoft SQL Server databases, ensuring efficient data retrieval, manipulation, and management.

Implemented complex data transformations and business logic using DataStage's graphical interface and built-in functions, meeting business requirements and data quality standards.

Developed complex T-SQL queries and scripts to extract, transform, and load (ETL) data from various source systems into SQL Server databases for reporting and analysis.

Configured connections to various data sources and destinations within Airflow, including databases, cloud storage, and external APIs.

Implemented and managed data pipelines using Apache Airflow, orchestrating complex workflows, and automating data processing tasks.
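
A minimal example of the Airflow pattern referenced here (the DAG id, schedule, and task bodies are illustrative only):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        print("pull data from the source system")   # placeholder

    def load():
        print("load data into the warehouse")       # placeholder

    with DAG(
        dag_id="example_daily_etl",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        load_task = PythonOperator(task_id="load", python_callable=load)
        extract_task >> load_task   # run extract before load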

Configured and managed Kafka clusters, including broker configurations, topic configurations, and security settings, to ensure high availability, fault tolerance, and data durability.

Implemented stored procedures, functions, and triggers in T-SQL to encapsulate business logic and improve application performance and maintainability.

Proven expertise in Agile development methodologies, contributing actively to sprint planning, retrospective reviews, and daily stand-ups. Proficient in utilizing Agile tools like JIRA and Confluence to ensure efficient collaboration and prioritize tasks within cross-functional teams.

Developed and maintained ETL pipelines in Snowflake using Snowpipe, SnowSQL, or third-party integration tools, ensuring efficient data ingestion, transformation, and loading processes from various source systems.

Utilized Python libraries such as Pandas, NumPy, and SciPy for data manipulation, statistical analysis, and predictive modeling
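
For illustration, a small example of this Pandas/NumPy/SciPy workflow (the sales data is fabricated for the sketch):

    import numpy as np
    import pandas as pd
    from scipy import stats

    df = pd.DataFrame({
        "region": ["east", "west", "east", "west"],
        "sales": [120.0, 95.5, 130.2, 101.3],
    })

    # Aggregate with pandas, summarize with NumPy
    by_region = df.groupby("region")["sales"].agg(["mean", "sum"])
    overall_std = np.std(df["sales"].to_numpy(), ddof=1)

    # SciPy hypothesis test: do the two regions differ in mean sales?
    east = df.loc[df["region"] == "east", "sales"]
    west = df.loc[df["region"] == "west", "sales"]
    t_stat, p_value = stats.ttest_ind(east, west, equal_var=False)
    print(by_region, overall_std, p_value)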

Utilized PySpark extensively for large-scale data processing, analysis, and transformation tasks, handling datasets ranging from gigabytes to terabytes.
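
A representative PySpark transformation of this kind (paths, column names, and the aggregation logic are placeholders):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("orders-aggregation").getOrCreate()

    # Read partitioned Parquet data from S3 (placeholder path)
    orders = spark.read.parquet("s3://example-bucket/raw/orders/")

    # Filter, derive a column, and aggregate
    daily_revenue = (
        orders
        .filter(F.col("status") == "COMPLETED")
        .withColumn("order_date", F.to_date("order_ts"))
        .groupBy("order_date")
        .agg(F.sum("amount").alias("revenue"),
             F.countDistinct("customer_id").alias("customers"))
    )

    # Write the aggregate back out, partitioned by date
    daily_revenue.write.mode("overwrite").partitionBy("order_date").parquet(
        "s3://example-bucket/curated/daily_revenue/"
    )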

Designed and implemented data models and schemas in Snowflake using best practices to ensure efficient storage, querying, and analysis of data for analytical workloads.

Developed and maintained relational databases using Microsoft SQL Server, including database design, schema optimization, and query tuning.

Developed Terraform modules and templates to standardize infrastructure deployments and ensure consistency across environments.

Designed and optimized database schemas to leverage SQL Hyperscale's horizontal scaling capabilities, resulting in improved query performance and reduced latency.

Collaborated with cross-functional teams to gather requirements, prototype solutions, and deploy production-ready Python applications.

Collaborated with cross-functional teams to analyze business requirements, identify data sources, and design scalable data models and pipelines in Snowflake to meet business needs and objectives.

Integrated authentication and authorization mechanisms such as OAuth 2.0, JWT (JSON Web Tokens), or API keys into REST APIs to ensure secure access and protect sensitive data from unauthorized access.
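
A minimal JWT sketch using PyJWT to show the general pattern (the secret, claims, and helper names are hypothetical):

    import datetime
    import jwt  # PyJWT

    SECRET = "replace-with-a-real-secret"  # placeholder; normally loaded from a vault

    def issue_token(user_id: str) -> str:
        # Issue a short-lived token for an authenticated user
        payload = {
            "sub": user_id,
            "exp": datetime.datetime.utcnow() + datetime.timedelta(minutes=30),
        }
        return jwt.encode(payload, SECRET, algorithm="HS256")

    def verify_token(token: str) -> dict:
        # Validates signature and expiry; raises jwt.InvalidTokenError on failure
        return jwt.decode(token, SECRET, algorithms=["HS256"])

    claims = verify_token(issue_token("user-123"))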

Conducted testing and validation of REST APIs using tools like Postman, writing test cases to verify functionality, performance, and reliability of API endpoints across different scenarios and edge cases.

Optimized ETL jobs for performance by tuning DataStage configurations, optimizing SQL queries, and leveraging parallel processing capabilities to improve data processing throughput.

Demonstrated expertise in optimizing Snowflake configurations to ensure scalability, enabling the platform to handle growing datasets and increasing workloads without compromising performance.

Proficient in containerization with Docker, creating lightweight and portable application environments, streamlining development-to-deployment workflows, and improving consistency across different stages of the software development lifecycle.

Implemented Kubernetes clusters on cloud platforms such as AWS, Azure, ensuring high availability and fault tolerance.

Wrote complex SQL queries, including joins, subqueries, and aggregations, to extract and analyze data for reporting and analytics purposes.

Conducted performance tuning and optimization of Kafka clusters by monitoring key metrics, adjusting configurations, and scaling resources to meet throughput and latency requirements.

Designed and optimized database schemas, tables, indexes, and views using T-SQL, adhering to normalization principles and performance best practices.

Integrated DataStage with enterprise data warehouses, databases, and external systems using connectors and APIs to facilitate seamless data flow and synchronization.

Proficient in designing and implementing Extract, Transform, Load (ETL) processes using SQL Server Integration Services (SSIS).

Implemented data processing pipelines and ETL (Extract, Transform, Load) workflows using Scala and Apache Spark, handling large volumes of data for analytics and machine learning applications.

Developed and maintained database solutions using Transact-SQL (T-SQL) on Microsoft SQL Server, ensuring efficient data storage, retrieval, and manipulation for business applications.

Developed and optimized ETL (Extract, Transform, Load) pipelines on Azure Databricks to ingest, transform, and load large volumes of data from diverse sources into Azure data lakes and data warehouses.

Environment: Azure, AWS, REST API, ETL, Apache Kafka, Apache Spark, SQL Server, Amazon EMR, ECS, Glue, Athena, Databricks, Data Factory, Azure Functions, Apache Airflow, Python, Django, Docker, Spring Boot, PostgreSQL, Power BI, Snowflake.

Data Engineer March 2019-Sep 2020

Amazon Web Services – Dallas, TX

Developed PySpark scripts and Python pandas jobs to migrate data from on-prem data stores to S3 and to build the Glue Data Catalog.

Wrote Python scripts to build AWS Lambda functions for data analytics.
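
A sketch of the kind of Lambda handler this refers to, triggered by an S3 put event (the bucket contents and the metric logic are placeholders):

    import json
    import boto3

    s3 = boto3.client("s3")

    def lambda_handler(event, context):
        # Bucket and key come from the S3 event notification
        record = event["Records"][0]
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Read the new object and compute a trivial row count
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        rows = [line for line in body.splitlines() if line.strip()]

        return {
            "statusCode": 200,
            "body": json.dumps({"object": f"s3://{bucket}/{key}",
                                "row_count": len(rows)}),
        }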

Analyzed requirements and designed solutions to meet business needs; contributed to design proposals.

Collaborated with cross-functional teams to provide technical guidance, support, and training on AWS services and best practices, fostering a culture of knowledge sharing and continuous improvement.

Performance-tuned T-SQL queries and database configurations to improve query execution times and optimize resource utilization, enhancing application performance.

Collaborated with cross-functional teams to analyze business requirements, identify data sources, and design scalable data models and pipelines in Snowflake to meet business needs and objectives.

Created and managed collections, environments, and requests in Postman to organize and automate API testing and validation processes, improving test coverage and reliability.

Skilled in server-side development using Java and the Spring framework, demonstrating expertise in creating robust and modular applications. Proficient in building RESTful services, implementing business logic, and utilizing Spring features for efficient development.

Designed, developed, and maintained DAGs (Directed Acyclic Graphs) to schedule and monitor data workflows, ensuring data integrity and timely execution using Airflow.

Implemented automated testing frameworks for data engineering pipelines, utilizing JUnit and ScalaTest, along with Apache Airflow and PyTest, to ensure robust testing coverage. This initiative contributed to streamlined testing processes and reinforced the reliability of data workflows.
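
An illustrative PyTest-style validation for pipeline output (the loader and the checks are hypothetical examples, not the actual framework):

    import pandas as pd
    import pytest

    def load_pipeline_output() -> pd.DataFrame:
        # Placeholder for the real extract step; a small faked result set
        return pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})

    @pytest.fixture
    def output_df():
        return load_pipeline_output()

    def test_no_duplicate_keys(output_df):
        assert output_df["order_id"].is_unique

    def test_amounts_are_positive(output_df):
        assert (output_df["amount"] > 0).all()

    def test_expected_schema(output_df):
        assert list(output_df.columns) == ["order_id", "amount"]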

Designed and implemented a data warehousing solution using MS SQL Server Integration Services (SSIS) to consolidate and analyze data from multiple sources for business intelligence purposes.

Extensively worked on Jenkins and Hudson, installing, configuring, and maintaining them for continuous integration.

Designed and implemented CI/CD pipelines using tools like Jenkins, GitLab CI/CD, or Travis CI to automate the deployment process to Kubernetes clusters.

Implemented best practices for Terraform code organization, versioning, and documentation to facilitate collaboration and maintainability.

Developed and maintained backend services and data processing applications using Scala, leveraging its functional programming capabilities for high-performance and scalable solutions.

Implemented data validation checks, schema validation, and integration tests within the automated framework, leading to increased efficiency in identifying and rectifying issues during the development and deployment phases.

Integrated Snowflake with other cloud services and data platforms such as AWS S3, Azure Blob Storage, and Google Cloud Storage to build end-to-end data pipelines.

Improved the performance and efficiency of ETL jobs in Informatica under supervision and based on specifications.

Skilled in optimizing SQL and PL/SQL code for performance by analyzing execution plans, identifying bottlenecks, and applying indexing and query optimization techniques in Oracle databases.

Performance-tuned Snowflake databases by optimizing warehouse configurations, query execution plans, and indexing strategies to improve query performance and resource efficiency.

Applied ETL testing skills with proficiency in SSIS and Azure Data Factory, validating data transformations and ensuring seamless integration across diverse data sources.

Implemented data validation mechanisms within SSIS and Azure Data Factory pipelines to handle anomalies and errors gracefully.

Proficient in troubleshooting and debugging techniques for SSIS packages and Azure Data Factory pipelines.

Environment: AWS Glue, SSIS, S3, Microsoft Azure, Azure Data Factory, Informatica, Tableau, Spring Boot, SQL, MongoDB, AWS Lambda, API Gateway, DynamoDB, Amazon RDS, RESTful APIs, Snowflake.

Data Engineer Jan 2018-Feb 2019

DXC Technology - Richardson, TX

Analyzed, designed, and built modern data solutions using Azure PaaS services to support data visualization. Understood the current production state of the application and determined the impact of new implementations on existing business processes.

Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.

Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data from different source systems.

Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.

Strong knowledge and practical experience in JavaScript and TypeScript, showcasing expertise in developing scalable and event-driven applications.

Wrote UDFs in Scala and PySpark to meet specific business requirements.
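
A small PySpark UDF sketch in the spirit of this bullet (the business rule and column names are invented for illustration):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf-example").getOrCreate()

    # Hypothetical business rule: bucket customers by spend
    def spend_tier(amount):
        if amount is None:
            return "unknown"
        return "high" if amount >= 1000 else "standard"

    spend_tier_udf = udf(spend_tier, StringType())

    df = spark.createDataFrame([(1, 1500.0), (2, 200.0)], ["customer_id", "total_spend"])
    df.withColumn("tier", spend_tier_udf("total_spend")).show()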

Implemented event-driven architectures by integrating AWS Lambda with various AWS services such as S3, DynamoDB, SNS, and SQS.

Utilized strong SQL skills for data validation, executing queries to verify the accuracy and completeness of data in databases.

Expertise in designing and maintaining data warehousing solutions leveraging AWS Redshift, AWS Glue, Amazon RDS (Relational Database Service), and Amazon S3 for storage.

Proven testing experience in Node.js, including unit, integration, and system testing, ensuring the delivery of high-quality and reliable code in an agile software development environment.

Played a pivotal role in developing and executing SQL-based validation patterns as part of the overall testing strategy.

Hands-on experience developing SQL scripts for automation purposes.

Demonstrated ability to orchestrate and schedule ETL jobs using AWS Glue, optimizing data processing workflows and ensuring timely execution of data transformations, with a focus on maintaining data integrity.

Extensive hands-on experience in utilizing Selenium WebDriver for automated testing of web applications.

Integrated Selenium automation into Continuous Integration (CI) pipelines using tools like Jenkins.

Environment: Hadoop, Hive, Selenium WebDriver, Azure Data Storage, Data Lake, T-SQL, Spark SQL, Azure Databricks, JavaScript, TypeScript, PySpark, Python, Apache Kafka.

Data Engineer Jan 2016-Dec 2017

EDASSIST - Watertown, MA

Leveraged Glue Data Catalog for metadata management, enabling seamless integration with other AWS services such as Amazon Redshift, Athena, and S3.

Worked with highly structured and semi-structured data sets of 45 TB in size (135 TB with a replication factor of 3).

Responsible for building scalable distribution data solutions using Hadoop.

Implemented and optimized AWS infrastructure using services such as Amazon EC2, Amazon S3, Amazon RDS, Amazon VPC, and Amazon Route 53, aligning with best practices for security, reliability, and performance.

Implemented a responsive UI that scales depending on the device, platform, and browser, using Angular 4, HTML5, CSS3, and Bootstrap.

Worked with the team to help build out markup and CSS, specializing in large-scale CSS/SASS with a focus on reusability and modularity.

Designed and developed Micro services using REST framework and Spring Boot.

Used microservices communicating over synchronous protocols (HTTP and REST) alongside a SOAP-based approach.

Developed screens using JSP, JSF, DHTML, CSS, AJAX, JavaScript, jQuery, Spring MVC, Core Java, and XML.

Involved in web services design and development. Created and consumed web services using JSON, XML, and REST. Generated stubs using JAXB.

Implemented client-side Interface using React JS.

Worked with Confluent Kafka.

Responsible for moving data from Teradata, MS SQL Server, and DB2 to HDFS and the development cluster for validation and cleansing.

Extensive experience analyzing data and writing Hadoop MapReduce jobs using the Java API, Pig, and Hive.

Played a key role in identifying and reporting defects through systematic execution of automated test suites.

Orchestrated complex workflows using AWS Step Functions and Lambda to automate business processes and workflows.
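
For example, kicking off such a workflow from Python with boto3 might look like this (the state machine ARN and payload are placeholders):

    import json
    import boto3

    sfn = boto3.client("stepfunctions")

    # Start an execution of a hypothetical order-processing state machine
    response = sfn.start_execution(
        stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:order-workflow",
        name="order-20240419-001",  # execution names must be unique per state machine
        input=json.dumps({"order_id": 123, "action": "process"}),
    )
    print(response["executionArn"])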

Implemented MapReduce programs to handle semi-structured and structured data from log files. Stored the data in tabular formats using Hive tables and Hive SerDes.

Expanded automation skills to include API testing, using tools like Postman or Rest Assured, to ensure end-to-end testing coverage.

Extensive experience in testing within Agile development processes, including Scrum and Kanban methodologies.

Environment: MS SQL Server, HDFS, CI/CD, Hadoop, Hive, Selenium WebDriver, Azure Data Storage, Postman, Data Lake, T-SQL, Agile, Scrum, Spark SQL, Azure Databricks, JavaScript, TypeScript, PySpark, Python, Apache Kafka.

Data Engineer Jan 2014-Dec 2015

WebMD – New Jersey

Designed and developed Spark code using the Scala programming language and Spark SQL for high-speed data processing to meet critical business requirements.

Implemented RDD/Dataset/DataFrame transformations in Scala through SparkContext and HiveContext.

Imported Python libraries into the transformation logic to implement core functionality.

Wrote Spark SQL and embedded the SQL in Scala files to generate JAR files for submission to the Hadoop cluster.

Designed, developed, and deployed serverless applications using AWS Lambda, enabling efficient and cost-effective execution of code without managing servers.

Integrated Amazon Aurora with other AWS services such as AWS Lambda, AWS Glue, and Amazon S3 to build end-to-end data pipelines and analytical solutions.

Developed Unix shell scripts to perform Hadoop ETL functions such as running Sqoop jobs, creating external/internal Hive tables, and initiating HQL scripts and BigQuery jobs.

Designed, implemented, and managed scalable and cost-effective cloud solutions on Amazon Web Services (AWS) to meet business requirements and drive digital transformation initiatives.

Worked with Technical Program Management and Software Quality Assurance teams to develop means to measure and monitor overall project health.

Environment: MS SQL Server, HDFS, CI/CD, Hadoop, Hive, Selenium WebDriver, Azure Data Storage, Postman, Data Lake, T-SQL, Agile, Scrum, Spark SQL, Azure Databricks, JavaScript, TypeScript, PySpark, Python, Apache Kafka.


