AKHIL KUMAR
Sr Data Engineer
https://www.linkedin.com/in/akhil-kumar-257881246
**************@*****.*** 508-***-****
PROFESSIONAL SUMMARY
10+ years of diversified IT experience as a Senior Data Engineer, specializing in requirement gathering, design, development, testing, and maintenance of databases, cloud technologies, data pipelines, and data warehouse applications.
• Big Data Expertise: Extensive hands-on experience with Hadoop architecture and its components, Spark applications (RDD transformations, Spark Core, MLlib, Streaming, SQL), the Cloudera ecosystem (HDFS, YARN, Hive, Sqoop, Flume, HBase, Oozie, Kafka, Pig), data pipeline development, and data analysis with Hive SQL, Impala, Spark, and Spark SQL.
• Data Pipeline and ETL Skills: Proficient in building data pipelines using Python, PySpark, Hive SQL, Presto, BigQuery, and Apache Airflow. Experienced with Teradata utilities, Informatica client tools, Sqoop for data import/export, and Flume and NiFi for log file loading.
• Database Expertise: Well-versed in RDBMSs such as Oracle, MS SQL Server, MySQL, Teradata, DB2, Netezza, PostgreSQL, and MS Access; exposure to NoSQL databases such as MongoDB, HBase, DynamoDB, and Cassandra.
• Cloud Technologies: Hands-on experience with Azure (including Azure Data Factory, Data Lake Storage, Synapse Analytics, Cosmos DB), GCP (including BigQuery, GCS, Cloud Functions, Dataflow, Pub/Sub, Dataproc), and AWS (including EC2, Glue, Lambda, SNS, S3, RDS, CloudWatch, VPC, Elastic Beanstalk, Auto Scaling, Redshift).
• Web Development: Experience in developing web applications using Python, PySpark, Django, C++, XML, CSS, HTML, JavaScript, and jQuery.
• Reporting and Analysis: Proficient in developing business reports with Power BI, Tableau, and SQL Server Reporting Services (SSRS), analysis with SQL Server Analysis Services (SSAS), and ETL processes with SQL Server Integration Services (SSIS).
• Data Modeling and Analytics: Adept with data analysis packages such as NumPy, SciPy, Pandas, Beautiful Soup, Scikit-Learn, Matplotlib, and Seaborn in Python, and dplyr and tidyr in R.
• Programming Skills: Experience in core Java, J2EE, multithreading, JDBC, shell scripting, the Java Collections API, Servlets, and JSP.
• Version Control and CI/CD Tools: Well-versed with SVN, Git, SourceTree, and Bitbucket, with experience in Unix/Linux commands, scripting, and server deployments.
• Software Development Life Cycle (SDLC): Involved in all phases under Agile, Scrum, and Waterfall processes, with a focus on high availability, fault tolerance, and auto-scaling.
TECHNICAL STACK:
Programming Languages
Python, R, SQL, Java, .Net, HTML, CSS, Scala
Python Libraries
Requests, ReportLab, NumPy, SciPy, PyTables, cv2 (OpenCV), imageio, python-twitter, Matplotlib, httplib2, urllib2, Beautiful Soup, PySpark, Pytest, PyMongo, cx_Oracle, pyexcel, Boto3.
Frameworks & Development
Django, Flask, Pyramid, PyCharm, Sublime Text.
Web Technologies
Web Services: REST, SOAP, Microservices
Front-End: HTML, CSS, JavaScript
Architectures: MVW, MVC
DBMS
Oracle, PL/SQL, PostgreSQL, Teradata, IBM DB2, MySQL, Microsoft SQL Server, Azure SQL Database, Snowflake, MongoDB, Cassandra, DynamoDB, HBase.
Server Stacks
WAMP, LAMP.
Big Data Ecosystem Tools
Cloudera distribution, Hortonworks Ambari, HDFS, Map Reduce, YARN, Pig, Sqoop, HBase, Hive, Flume, Cassandra, Apache Spark, Oozie, Zookeeper, Hadoop, Scala, Impala, Kafka, Airflow, DBT, NiFi.
Reporting Tools
Power BI, SSIS, SSAS, SSRS, Tableau.
Containerization/orchestration Tools
Kubernetes, Docker, Docker Registry, Docker Hub, Docker Swarm.
Cloud Technologies
Amazon Web Services (EC2, S3, RDS, VPC, IAM, Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SES, SQS, Lambda, Step Functions, CloudFormation, EMR)
Google Cloud Platform (BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil, bq command-line utilities, Dataproc)
Microsoft Azure (Web application, App services, Storage, SQL Database, Virtual machines, Search, Notification Hub)
Data Modelling Techniques
Relational data modeling, ER/Studio, Erwin, Sybase PowerDesigner, Star Schema, Snowflake Schema, Fact and Dimension tables.
Streaming Frameworks
Kinesis, Kafka, Flume.
Version Control and CI/CD Tools
Concurrent Versions System (CVS), Subversion (SVN), Git, GitHub, Mercurial, Bitbucket, Docker, Kubernetes.
PROFESSIONAL EXPERIENCE:
NIKE, Portland
Sr. Data Engineer Sep 2023 - Present
Data Ingestion & Integration
• Developed robust data pipelines using AWS, Spark, Snowflake, and Databricks to support large-scale data ingestion and processing.
• Orchestrated ETL pipelines using Airflow and Databricks, automating batch and real-time data processing workflows.
• Created, debugged, scheduled, and monitored Airflow jobs for ETL batch processing, loading data into Snowflake for analytical workloads (see the sketch below).
• Designed and implemented the RCC (Raw, Cleansed, Curated) data framework to streamline data transformation and curation processes.
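A minimal sketch of the Airflow-to-Snowflake batch pattern referenced above, assuming the Airflow Snowflake provider package; the DAG ID, connection ID, stage, and table names are hypothetical, not the production code:

```python
# Illustrative only: a daily batch DAG that runs a transform step and loads the
# result into Snowflake. Connection IDs, table names, and paths are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator


def transform_orders(**context):
    # Placeholder for the PySpark/Databricks transformation step.
    print(f"Transforming orders for {context['ds']}")


with DAG(
    dag_id="orders_batch_load",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    transform = PythonOperator(task_id="transform_orders", python_callable=transform_orders)

    load_to_snowflake = SnowflakeOperator(
        task_id="load_to_snowflake",
        snowflake_conn_id="snowflake_default",
        sql="COPY INTO analytics.orders FROM @curated_stage/orders/ FILE_FORMAT = (TYPE = PARQUET)",
    )

    transform >> load_to_snowflake  # run the load only after the transform succeeds
```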
AWS & Cloud Technologies
• Created and managed a scalable data lake on AWS S3, implementing Insert, Upsert, and SCD Type 1 and Type 2 operations based on business requirements (a Type 2 merge is sketched below).
• Developed self-service BI tools enabling users to create reports and dashboards using AWS and Snowflake.
• Integrated AI-driven analytics to automate data insights and recommendations.
• Implemented path analysis models to study customer journeys and optimize user engagement.
• Designed drag-and-drop functionality in Power BI and Tableau to simplify report creation.
• Built resilient data solutions leveraging AWS services such as S3, Lambda, and EMR to handle large volumes of data.
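A minimal sketch of the SCD Type 2 upsert mentioned above, assuming Delta Lake on Databricks; the table paths, join key, and tracked column are hypothetical:

```python
# Illustrative SCD Type 2 merge with Delta Lake; names only show the shape of the operation.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

updates = spark.read.parquet("s3://lake/cleansed/customers/")            # incoming batch
target = DeltaTable.forPath(spark, "s3://lake/curated/customers_dim/")   # dimension table

(
    target.alias("t")
    .merge(updates.alias("s"), "t.customer_id = s.customer_id AND t.is_current = true")
    # Expire the current row when a tracked attribute has changed.
    .whenMatchedUpdate(
        condition="t.address <> s.address",
        set={"is_current": "false", "end_date": "current_date()"},
    )
    # Insert rows for keys not seen before.
    .whenNotMatchedInsert(
        values={
            "customer_id": "s.customer_id",
            "address": "s.address",
            "is_current": "true",
            "start_date": "current_date()",
            "end_date": "null",
        }
    )
    .execute()
)
# A second pass (or the staged-updates pattern) appends the new current version
# of each changed key; omitted here for brevity.
```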
Data Transformation & Processing
• Performed complex data transformations using Spark SQL to ensure data consistency and accuracy.
• Developed and orchestrated PySpark applications for incremental and batch data processing on AWS EMR and Databricks.
Data Architecture & Modeling
• Implemented dimensional data modeling with Star Schema and Snowflake Schema to support business reporting needs.
• Built a robust error-handler framework using PySpark that automatically generates error reports to improve data quality (see the sketch below).
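A minimal sketch of the error-handler pattern described above; the schema, validation rules, and S3 paths are hypothetical:

```python
# Illustrative error-handler: split a batch into valid and invalid rows, load the
# valid rows, and persist an error report for follow-up.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
orders = spark.read.parquet("s3://lake/raw/orders/")

# Validation rules for this batch (hypothetical business rules).
is_valid = F.col("order_id").isNotNull() & (F.col("amount") > 0)

valid = orders.filter(is_valid)
errors = (
    orders.filter(~is_valid)
    .withColumn(
        "error_reason",
        F.when(F.col("order_id").isNull(), "missing order_id").otherwise("non-positive amount"),
    )
    .withColumn("failed_at", F.current_timestamp())
)

valid.write.mode("append").parquet("s3://lake/cleansed/orders/")
errors.write.mode("append").parquet("s3://lake/errors/orders/")  # feeds the error report
```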
Data Warehousing & Management with Snowflake
• Developed Snowflake architectures for efficient data staging and consumption, ensuring scalability for future growth (see the sketch below).
• Integrated Snowflake with multiple data sources for centralized analytics and reporting.
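A minimal sketch of the staging-and-consumption flow using the Snowflake Python connector; the account, storage integration, stage, and table names are hypothetical:

```python
# Illustrative staging-and-load flow via the Snowflake Python connector.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345", user="etl_user", password="***",
    warehouse="ETL_WH", database="ANALYTICS", schema="CURATED",
)
cur = conn.cursor()
try:
    # Point an external stage at the curated S3 prefix, then load the consumption table.
    cur.execute(
        "CREATE STAGE IF NOT EXISTS curated_stage "
        "URL='s3://lake/curated/orders/' STORAGE_INTEGRATION = s3_int"
    )
    cur.execute(
        "COPY INTO orders FROM @curated_stage "
        "FILE_FORMAT = (TYPE = PARQUET) MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE"
    )
finally:
    cur.close()
    conn.close()
```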
Automation & CI/CD Pipelines
• Developed CI/CD pipelines using Jenkins and GitHub for automated testing and deployment.
• Implemented automated monitoring dashboards using Tableau, providing stakeholders real-time insights.
Agile Development & Collaboration
• Actively participated in Agile ceremonies to ensure alignment with business goals.
• Collaborated with cross-functional teams to build dashboards and visualizations using Cognos and Tableau.
Environments: AWS, Spark, Snowflake, Databricks, Airflow, Nike Data Foundation, AWS S3, AWS Lambda, AWS EMR, SparkSQL, PySpark, Star Schema, Snow-Flake Schema, AWS CloudWatch, Datadog, Tableau, Cognos, AWS SNS.
HUNTINGTON NATIONAL BANK, Columbus
Sr. Data Engineer Oct 2021 - Sep 2023
Google Cloud Platform (GCP) Development
• Developed and administered data storage solutions, leveraging BigQuery, Cloud Storage, and Cloud SQL.
• Conducted ETL and data engineering using Cloud Dataflow, Dataproc, and Google BigQuery.
• Architected and implemented data pipelines with Dataflow, Dataproc, and Pub/Sub, processing data from Pub/Sub topics into BigQuery (see the sketch below).
• Scheduled jobs using Airflow scripts written in Python, adding tasks to DAGs and defining dependencies between them.
• Built NLP-based query systems in GCP to enable natural-language data interpretation.
• Developed predictive models for user behavior analysis using Python and BigQuery ML.
• Created customized dashboards that dynamically visualize business trends.
• Hosted applications using Compute Engine, App Engine, Cloud SQL, Kubernetes Engine, and Cloud Storage.
• Ensured data security and access controls through GCP's IAM and the Cloud Security Command Center.
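A minimal sketch of a Pub/Sub-to-BigQuery consumer for the pipeline pattern above; the project, subscription, and table IDs are hypothetical:

```python
# Illustrative Pub/Sub-to-BigQuery consumer; IDs stand in for the real pipeline.
import json

from google.cloud import bigquery, pubsub_v1

PROJECT = "my-project"
SUBSCRIPTION = f"projects/{PROJECT}/subscriptions/events-sub"
TABLE = f"{PROJECT}.analytics.events"

bq = bigquery.Client(project=PROJECT)
subscriber = pubsub_v1.SubscriberClient()


def handle_message(message):
    row = json.loads(message.data)              # one event per message
    errors = bq.insert_rows_json(TABLE, [row])  # streaming insert into BigQuery
    if errors:
        message.nack()                          # let Pub/Sub redeliver on failure
    else:
        message.ack()


future = subscriber.subscribe(SUBSCRIPTION, callback=handle_message)
future.result()  # block and keep pulling messages
```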
Big Data Processing and Analysis
• Analyzed Hadoop clusters and utilized Big Data tools including Pig, Hive, HBase, Spark, and Sqoop.
• Developed streaming applications with PySpark to read data from Kafka and persist it in NoSQL databases such as HBase and Cassandra (see the sketch below).
• Utilized Pig and HiveQL for data profiling and aggregation; built Spark applications using Scala and Java.
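A minimal sketch of the Kafka-to-Cassandra streaming pattern above, assuming Spark Structured Streaming and the Spark-Cassandra connector; the brokers, topic, schema, keyspace, and table are hypothetical:

```python
# Illustrative structured-streaming job reading from Kafka and writing to Cassandra.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("kafka_to_cassandra").getOrCreate()

schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "user-events")
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

query = (
    events.writeStream
    .format("org.apache.spark.sql.cassandra")  # requires the Spark-Cassandra connector
    .option("keyspace", "analytics")
    .option("table", "user_events")
    .option("checkpointLocation", "/tmp/checkpoints/user_events")
    .start()
)
query.awaitTermination()
```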
Data Modeling and ETL Management
• Conducted data analysis and design activities, creating logical and physical data models and metadata repositories with ERWIN.
• Extensively used Informatica client tools for various ETL management tasks.
• Designed and implemented data pipelines using GCP services for efficient data processing.
CI/CD and API Management
• Orchestrated CI/CD pipelines in Jenkins, integrating with various tools through Groovy scripts.
• Managed API requests and organized responses using tools like Postman and Swagger, ensuring seamless API management.
Security and Monitoring
• Actively ensured data security by implementing access controls and utilizing GCP's security tools.
• Utilized GCP's monitoring tools to maintain data pipeline and storage solution integrity.
• Configured automated monitoring for data storage solutions using Cloud Monitoring.
Environments: Google Cloud Platform (GCP), BigQuery, Cloud Storage, Cloud SQL, Cloud Dataflow, Dataproc, Pub/Sub, Compute Engine, App Engine, Kubernetes Engine, IAM, Cloud Security Command Center, Hadoop, Pig, Hive, HBase, Spark, Sqoop, PySpark, Kafka.
UNITED HEALTH CARE, Minnetonka
Data Engineer Aug 2019 – Sep 2021
AWS Data Integration & Processing
• Worked on data ingestion, cleansing, and transformation using AWS Lambda, AWS Glue, and Step Functions.
• Implemented serverless architecture with API Gateway, Lambda, and DynamoDB; deployed Lambda code from S3.
• Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift (see the sketch below).
• Developed Glue ETL jobs for data processing, including transformations and loading data into S3, Redshift, and RDS.
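A minimal sketch of a Glue job of the kind described above, loading Parquet files from S3 into Redshift; the bucket, catalog connection, and table names are hypothetical:

```python
# Illustrative AWS Glue job moving Parquet files from S3 into Redshift.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw campaign files from S3 as a DynamicFrame.
campaigns = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://campaign-raw/parquet/"]},
    format="parquet",
)

# Write into Redshift through a Glue catalog connection, staging via S3.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=campaigns,
    catalog_connection="redshift-conn",
    connection_options={"dbtable": "public.campaigns", "database": "analytics"},
    redshift_tmp_dir="s3://campaign-tmp/redshift/",
)

job.commit()
```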
Monitoring & Notifications
• Configured CloudWatch for monitoring Lambda functions and Glue jobs.
Streaming & Real-Time Processing
• Worked with Spark Streaming and Kafka for real-time processing, combining batch and streaming data.
Big Data Technologies & Data Lakes
• Built data lakes and data pipelines using Hadoop, Cloudera, HDFS, MapReduce, Spark, and Hive.
• Optimized algorithms in Hadoop using Spark Context, Spark SQL, and Spark MLlib.
Data Warehousing and Management with Snowflake
• Leveraged Snowflake as a cloud-based data warehousing platform to manage large volumes of structured and semi-structured data.
• Integrated Snowflake with various data sources and systems for consolidated data analysis and reporting.
Database & Data Modeling
• Implemented dimensional data modeling with Star Schema and Snowflake Schema.
• Developed PL/SQL packages, database triggers, and user procedures for data management.
Automation & Deployment
• Developed Terraform scripts for automated AWS resource deployment, including EC2, S3, and IAM roles.
• Implemented CI/CD tools like Jenkins for automated deployment of Python code.
Additional Technologies & Techniques
• Collaborated with data analysts to gather requirements and design Power BI reports that address business needs.
• Used Parquet files and ORC format with PySpark and Spark Streaming.
• Automated Oozie workflows to manage jobs, including MapReduce and Hive.
Environments: AWS Lambda, AWS Glue, Step Functions, API Gateway, DynamoDB, S3, Redshift, RDS, AWS Kinesis, CloudWatch, Spark Streaming, Kafka, Hadoop, Cloudera, HDFS, MapReduce, Spark, Hive, Snowflake, PL/SQL, Terraform, EC2
WIPRO, Hyderabad, India
Azure Data Engineer Oct 2015 - Dec 2017
• Utilized Azure's ETL capabilities and Azure Data Factory (ADF) services to ingest data from various legacy data stores, including SAP HANA and SFTP servers, into Azure Data Lake Storage Gen2.
• Performed Extract, Transform, and Load (ETL) processes using Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
• Constructed and configured a virtual data center in the Azure cloud to host an Enterprise Data Warehouse, incorporating components such as virtual networks and security groups.
• Developed a framework for creating and managing snapshots in Azure Blob Storage and configured lifecycle policies for backing up data from Delta Lake.
• Created notebooks in Azure Databricks with Scala and Spark, utilizing Delta tables for data capture (see the sketch after this list).
• Developed enterprise-level solutions for batch processing using Apache Pig and incorporated streaming frameworks such as Spark Streaming and Apache Kafka.
• Automated the building of Azure infrastructure using Terraform and Azure Resource Manager (ARM) templates.
• Designed and developed Tableau visualizations, including dashboards with calculations and parameters.
• Implemented data governance policies and enforced data security through identity and access management controls.
• Managed API requests and organized responses using tools like Postman and Swagger for seamless API management.
• Responsible for the installation and maintenance of web servers, including Tomcat and Apache HTTP Server, in UNIX environments.
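A minimal PySpark sketch of the Delta-table pattern used in the Databricks notebooks above (the original notebooks were in Scala); the ADLS Gen2 paths and columns are hypothetical:

```python
# Illustrative Delta-table write/read pattern on Databricks with ADLS Gen2 paths.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Land a cleansed batch as a Delta table in ADLS Gen2.
batch = spark.read.parquet("abfss://raw@mylake.dfs.core.windows.net/sap/orders/")
(batch.withColumn("ingested_at", F.current_timestamp())
      .write.format("delta")
      .mode("append")
      .save("abfss://curated@mylake.dfs.core.windows.net/orders_delta/"))

# Consumers read the same Delta table, optionally as of an earlier version (time travel).
latest = spark.read.format("delta").load("abfss://curated@mylake.dfs.core.windows.net/orders_delta/")
earlier = (spark.read.format("delta")
           .option("versionAsOf", 1)
           .load("abfss://curated@mylake.dfs.core.windows.net/orders_delta/"))
```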
Environments: Azure, Azure Data Lake Storage (Gen2), Cloudera Hadoop HDFS, Azure Data Factory (ADF), Azure Blob Storage, Azure Databricks, Enterprise Data Warehouse, Virtual Network (VNet), Snowflake, MS Excel, SQL Server 2012.
CAPGEMINI, Hyderabad, India
Data Engineer May 2013 - Sep 2015
Big Data Processing with PySpark:
• Utilized PySpark to process large volumes of structured, semi-structured, and unstructured data across clusters.
• Leveraged Spark's distributed computing framework for parallel processing and ingested data using connectors and APIs from sources such as HDFS, Apache Kafka, and cloud storage.
Data Warehousing and Management with Snowflake:
• Leveraged Snowflake's cloud-based platform for efficient storage and management of large data volumes.
• Utilized Snowflake's scalable architecture, SQL-based querying capabilities, and built-in functionality for data transformations, aggregations, and manipulations.
ETL and Workflow Orchestration:
• Automated ETL processes using AWS Glue to extract, transform, and load data from various sources into target data stores.
• Leveraged Azkaban for workflow orchestration, job scheduling, and dependency management.
• Designed and developed ETL processes with AWS Glue to migrate data from sources such as S3 into AWS Redshift.
• Designed Redshift tables and columns for even data distribution across cluster nodes, keeping columnar database design considerations in mind (see the DDL sketch below).
• Created, modified, and executed DDL on AWS Redshift and Snowflake tables to load data.
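A minimal sketch of Redshift DDL executed from Python that reflects the distribution and sort-key considerations above; the cluster endpoint, table, and key choices are hypothetical:

```python
# Illustrative Redshift DDL executed from Python via psycopg2.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="etl_user", password="***",
)

ddl = """
CREATE TABLE IF NOT EXISTS public.sales (
    sale_id     BIGINT      NOT NULL,
    customer_id BIGINT      NOT NULL,
    sale_date   DATE        NOT NULL,
    amount      DECIMAL(12, 2)
)
DISTSTYLE KEY
DISTKEY (customer_id)      -- co-locate rows joined on customer_id on the same node
SORTKEY (sale_date);       -- date-range scans prune blocks efficiently
"""

with conn, conn.cursor() as cur:
    cur.execute(ddl)
conn.close()
```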
AWS and Cloud Development:
• Created AWS Lambda functions and API Gateway endpoints for data submission (see the sketch below).
• Launched EC2 instances on Linux/Ubuntu and configured them for specific applications.
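A minimal sketch of a Lambda handler behind an API Gateway data-submission endpoint; the landing bucket and payload shape are hypothetical:

```python
# Illustrative Lambda handler: persist a POSTed JSON body to S3 and return its key.
import json
import uuid

import boto3

s3 = boto3.client("s3")
BUCKET = "data-submissions"  # hypothetical landing bucket


def lambda_handler(event, context):
    record = json.loads(event.get("body") or "{}")
    key = f"submissions/{uuid.uuid4()}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(record))
    return {"statusCode": 201, "body": json.dumps({"key": key})}
```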
Environments: Python, PySpark, HDFS, Apache Kafka, ETL, AWS Lambda, API Gateway, EC2, AWS Glue, S3, Parquet, Redshift, Spring Boot, ORM, Hibernate, Snowflake, SQL, AWS, Linux, Ubuntu, Angular, HTTP.
EDUCATION:
Bachelor’s in Computer Science from JNTUH, 2013.
CERTIFICATIONS:
AWS Certified Solutions Architect Associate
Azure Certified Data Engineer Associate