
Vineeth

Sr. Data Engineer | Data Architecture & Governance | Azure, Python, SQL

Email: ****************@*****.***

Phone: (323) 577-5258

PROFESSIONAL SUMMARY:

Over 10 years of hands-on experience in data engineering and architecture, specializing in scalable data processing, modeling, and governance using Python, SQL, and cloud platforms (Azure, AWS). Experienced in large-scale Hadoop projects using Spark, Scala, and Python, covering the design, development, and implementation of data models for enterprise-level applications and systems.

Skilled in provisioning AWS EC2 instances, implementing Auto Scaling groups and Load Balancers within custom VPCs, and leveraging Lambda functions for event-triggered actions on DynamoDB.

Designed and developed ETL pipelines using Ab Initio for large-scale data integration and transformation.

Demonstrated expertise across major Hadoop ecosystem components such as HDFS, YARN, MapReduce, Hive, Impala, Pig, Sqoop, HBase, Spark, Spark SQL, Kafka, Spark Streaming, Flume, NiFi, Oozie, Zookeeper, and Hue.

Experienced in using distributed computing architectures and AWS services (e.g., EC2, S3 object storage, SNS, SQS, Lambda, Step Functions, AWS Glue, AWS Athena, EMR, and Elasticsearch) to design data pipelines.

Worked on AWS Data Pipeline to configure data loads from on-premises Hadoop to AWS S3.

Worked on Cloudera migration from CDH to CDP platform.

Expertise in creating, debugging, scheduling, and monitoring jobs using Apache Airflow DAGs.
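
For illustration, a minimal Airflow DAG of the kind described above; the DAG name, schedule, and the ingest_daily_load() callable are hypothetical placeholders rather than a specific production job.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest_daily_load():
    # Placeholder for the actual load logic (e.g., a Sqoop import or spark-submit call).
    print("running daily load")


with DAG(
    dag_id="daily_ingest",                      # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:
    load_task = PythonOperator(
        task_id="ingest_daily_load",
        python_callable=ingest_daily_load,
    )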

Created PySpark RDDs for data transformation and imported data from Hadoop HDFS into Hive using HiveQL.
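
A small sketch of this pattern, assuming a Hive-enabled SparkSession; the /data/raw/events path, record layout, and staging.events table are illustrative assumptions.

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("rdd_to_hive").enableHiveSupport().getOrCreate()

# RDD transformation: parse delimited records and keep only well-formed rows.
raw_rdd = spark.sparkContext.textFile("/data/raw/events")
rows = (
    raw_rdd.map(lambda line: line.split("\t"))
           .filter(lambda parts: len(parts) == 3)
           .map(lambda p: Row(event_id=p[0], event_type=p[1], amount=float(p[2])))
)

# Convert the RDD to a DataFrame and load it into a Hive-managed table.
events_df = spark.createDataFrame(rows)
events_df.write.mode("append").saveAsTable("staging.events")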

Used Python for data analytics and data wrangling, extracting and transforming data with Pandas, NumPy, and SciPy.
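
As a brief example of this kind of wrangling, assuming a hypothetical transactions.csv with store_id and amount columns:

import numpy as np
import pandas as pd

df = pd.read_csv("transactions.csv")
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df = df.dropna(subset=["amount"])

# NumPy-backed transformations and per-store aggregates.
df["log_amount"] = np.log1p(df["amount"])
summary = df.groupby("store_id")["amount"].agg(["count", "sum", "mean"])
print(summary.head())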

Experienced in using code versioning and build tools such as Maven, Jenkins, CloudBees, uDeploy, GitHub, and Bitbucket.

Involved in the CI/CD pipeline management for managing the weekly releases.

Followed TDD and BDD practices for software development and quality assurance.

Worked on Jira stories to manage tasks and improve individual performance.

Developed end-to-end test plans, created test cases for Ab Initio jobs, and used tools like JIRA for defect tracking, logging, and resolution during the data validation process.

Led technical design sessions, mentored junior engineers on Ab Initio best practices, and provided code reviews to enforce development standards.

TECHNICAL SKILLS:

Cloud Platforms

AWS (EC2, S3, RDS, Lambda, SQS, SNS, Data Pipeline, AWS Glue, IAM, CloudWatch, Redshift, Step Functions, CloudFormation, Kinesis, DynamoDB, CloudFront, EMR), GCP (BigQuery, Pub/Sub, Dataprep, Dataflow, Cloud Dataproc, Cloud Storage, Cloud Functions, IAM)

Data Integration

AWS Glue, ADF, Informatica, SSIS, Dataflow

Scripting languages

Python, PowerShell, Scala

Big Data Processing

Apache Spark, PySpark, Hadoop, MapReduce, Hive, Impala, Sqoop, Kafka

Databases

MySQL, PostgreSQL, SQL Server, DynamoDB, MongoDB, Oracle

Warehousing

RedShift, Snowflake, BigQuery

Version Control & CI/CD

Git (GitHub, Bitbucket), GitLab CI, Jenkins, Azure DevOps, CI/CD

Infrastructure As Code

Ansible, Terraform, CloudFormation

Containerization & Orchestration

Docker, Kubernetes

Monitoring & Logging

CloudWatch, ELK Stack, Splunk

Python Libraries

Pandas, NumPy, Scikit-Learn, PyTorch, Mahout

Project Management

Agile methodologies (Scrum, Kanban), Tools (JIRA, Confluence, SharePoint)

Operating Systems

macOS, Windows, Linux, UNIX

PROFESSIONAL EXPERIENCE:

Client: Macy's – Atlanta, GA Jul 2022 – Present

Role: Sr. Data Engineer

Domain: Retail Technology & E-commerce

Responsibilities:

Designed and developed scalable data pipelines on Azure Databricks using PySpark for real-time and batch retail data processing.

Built and automated ETL workflows using Azure Data Factory (ADF) to integrate structured and semi-structured data from internal and external retail systems.

Engineered cloud-native solutions for high-volume retail transaction data across multiple channels (in-store, online, mobile).

Created optimized SQL stored procedures and views for large-scale data transformation and business logic implementation.

Integrated Apache Kafka for streaming real-time inventory and clickstream data into the analytics ecosystem.
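
A hedged sketch of how such a Kafka ingest might look with Spark Structured Streaming; the broker address, topic name, event schema, and output paths are assumptions for illustration.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("clickstream_ingest").getOrCreate()

# Assumed clickstream event schema.
schema = StructType([
    StructField("session_id", StringType()),
    StructField("page", StringType()),
    StructField("event_time", TimestampType()),
])

clicks = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")    # hypothetical broker
    .option("subscribe", "clickstream")                    # hypothetical topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Land the stream in the bronze layer of the lake (paths are placeholders).
query = (
    clicks.writeStream.format("delta")
    .option("checkpointLocation", "/chk/clickstream")
    .start("/lake/bronze/clickstream")
)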

Modeled data warehouse schemas using Star and Snowflake schemas to support advanced analytics and reporting in Snowflake and SQL Server.

Developed Delta Lake architectures for maintaining consistent, versioned retail data in data lakes.
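
A minimal Delta Lake sketch of the versioned-upsert pattern; the table paths, join key, and snapshot source are hypothetical.

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("inventory_upsert").getOrCreate()

updates = spark.read.parquet("/lake/staging/inventory")           # hypothetical staging path
target = DeltaTable.forPath(spark, "/lake/silver/inventory")       # hypothetical Delta table

# Upsert the latest snapshot into the versioned Delta table.
(
    target.alias("t")
    .merge(updates.alias("s"), "t.sku = s.sku")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# Time travel keeps prior versions queryable for audits and reprocessing.
previous = spark.read.format("delta").option("versionAsOf", 0).load("/lake/silver/inventory")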

Implemented data quality checks and validations using PySpark and SQL to ensure accuracy in downstream dashboards.
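
An illustrative PySpark validation of the kind described; the curated.orders table, column names, and thresholds are assumptions rather than the production rules.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("dq_checks").getOrCreate()
orders = spark.table("curated.orders")            # hypothetical curated table

total = orders.count()
null_ids = orders.filter(col("order_id").isNull()).count()
negative_amounts = orders.filter(col("amount") < 0).count()

# Fail the pipeline run if basic integrity rules are violated.
if null_ids > 0 or negative_amounts / max(total, 1) > 0.01:
    raise ValueError(f"Data quality check failed: {null_ids} null ids, "
                     f"{negative_amounts} negative amounts out of {total} rows")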

Enabled self-service analytics by publishing curated datasets to Power BI and Tableau, empowering business users.

Automated infrastructure provisioning in Azure using Terraform (IaC), ensuring repeatable and secure deployments.

Built CI/CD pipelines for ETL code deployment using Azure DevOps, GitHub Actions, and Argo CD.

Supported ML Ops processes by preparing feature engineering pipelines and deploying models in Azure ML Studio.

Used Azure Key Vault and managed identities to securely access credentials and secrets in production environments.

Conducted data lineage tracking and metadata management using Purview for compliance and governance.

Led migration of on-prem legacy ETL processes to modern cloud-based solutions using Spark and ADF.

Designed scalable data lake storage architecture using Azure Data Lake Storage Gen2 for storing raw, curated, and aggregated data.

Worked closely with merchandising and marketing teams to define KPIs and build tailored data products for campaign performance tracking.

Designed alerting and monitoring dashboards using ELK Stack and Grafana for proactive incident detection.

Conducted code reviews, performance tuning, and optimization of Spark jobs to meet SLAs and cost efficiency.

Supported data masking and encryption strategies for handling PII and PCI-compliant retail customer data.

Implemented unit testing frameworks (like Pytest and dbt tests) to validate data pipeline logic during development.
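
A short pytest example of this style of validation; to_currency() is a hypothetical helper standing in for real pipeline logic.

import pytest


def to_currency(cents: int) -> float:
    """Convert integer cents to a rounded dollar amount."""
    return round(cents / 100.0, 2)


@pytest.mark.parametrize("cents,expected", [(0, 0.0), (199, 1.99), (100050, 1000.50)])
def test_to_currency(cents, expected):
    # Happy-path conversions across small and large values.
    assert to_currency(cents) == expected


def test_to_currency_rejects_strings():
    # Non-numeric input should surface as a TypeError, not silent corruption.
    with pytest.raises(TypeError):
        to_currency("199")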

Used Unix Shell scripting for workflow automation and system-level operations in cloud-based environments.

Collaborated in Agile/Scrum ceremonies and coordinated cross-functionally with DevOps, QA, and Product teams.

Led knowledge-sharing sessions and mentored junior data engineers on Azure, Spark, and CI/CD best practices.

Maintained documentation in Confluence, managed tasks in JIRA, and tracked project progress and stakeholder communications.

Environment: Azure Databricks, Azure Data Factory (ADF), Azure Data Lake Storage Gen2, Azure DevOps, Azure Key Vault, Azure AD, Apache Spark, Apache Kafka, Delta Lake, SQL Server, Snowflake, Power BI, Tableau, Terraform, GitHub Actions, Python, PySpark, SQL, Unix Shell Scripting, ELK Stack (Elasticsearch, Logstash, Kibana), Azure Machine Learning Studio, Azure Purview, JIRA, Confluence, Agile, Scrum.

Client: Truist Bank – Raleigh, NC Dec 2021 – Jun 2022

Role: Sr. Data Engineer

Domain: Banking Sector

Responsibilities:

Led the design, development, and implementation of robust and scalable data pipelines, leveraging technologies such as AWS Glue, AWS Data Pipeline, Apache Spark, and Apache Hadoop to process large volumes of data efficiently.

Architected and managed cloud infrastructure on AWS, including EC2 instances, S3 buckets, RDS databases, Lambda functions, and IAM roles, ensuring high availability, reliability, and security.

Utilized AWS services such as SQS and SNS for asynchronous messaging and event-driven architectures, enhancing system decoupling and fault tolerance.

Worked with CI/CD tools to enable continuous integration and deployment of ETL processes, enhancing development efficiency.

Worked with Ab Initio Metadata Hub for metadata management and data lineage tracking.

Integrated Ab Initio ETL workflows with Teradata, Oracle, and Snowflake for enterprise data processing.

Utilized Python for scripting and automation in data engineering workflows.

Hands-on experience with AWS services for cloud-based ETL solutions and Snowflake for scalable data warehousing.

Developed ETL (Extract, Transform, Load) workflows employing AWS Glue, Apache Spark, and SQL to extract, transform, and load data from diverse sources into data warehousing solutions like Redshift and MySQL.
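
A hedged sketch of a Glue-style PySpark ETL job loading into Redshift over JDBC; the S3 path, cluster endpoint, table name, and credentials are placeholders, not the actual configuration.

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Extract: read raw JSON from S3 as a Glue DynamicFrame (hypothetical bucket).
source = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-bucket/raw/payments/"]},
    format="json",
)

# Transform: deduplicate and drop malformed records.
cleaned = source.toDF().dropDuplicates(["payment_id"]).filter("amount is not null")

# Load: append into Redshift over JDBC (endpoint and credentials are placeholders).
cleaned.write.format("jdbc").options(
    url="jdbc:redshift://example-cluster:5439/analytics",
    dbtable="public.payments",
    user="etl_user",
    password="***",
    driver="com.amazon.redshift.jdbc42.Driver",
).mode("append").save()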

Developed and maintained data processing workflows using AWS Step Functions, orchestrating complex tasks across different AWS services.

Monitored and optimized system performance using CloudWatch metrics and alarms, identifying and resolving issues to ensure optimal data processing and resource utilization.

Applied advanced data analysis techniques with Python libraries like Pandas and NumPy to derive actionable insights from structured and unstructured data sets.

Designed and enforced data governance policies, including schema validation using XML, XSD, and XSLT, to ensure data quality and compliance with industry standards.

Designed and implemented Meta Programming components using PDL (Parameter Definition Language) to build reusable, configurable Ab Initio frameworks and accelerate ETL development lifecycle.

Created generic ETL frameworks using Ab Initio graphs, enhancing automation and reducing redundancy.

Managed relational and NoSQL databases such as MySQL and DynamoDB, optimizing schema designs and query performance for efficient data retrieval and storage.

Configured and managed containerized applications using Docker and orchestration tools like Kubernetes, ensuring seamless deployment and scalability.

Implemented Ansible for automating infrastructure provisioning and managing configuration, optimizing deployment procedures, and maintaining uniformity across different environments.

Collaborated with cross-functional Agile teams using tools like JIRA and Bitbucket to prioritize tasks, track progress, and deliver high-quality solutions within sprint cycles.

Implemented TDD, BDD, CI/CD pipelines using Jenkins and AWS services to automate testing, build, and deployment processes.

Developed and optimized big data processing workflows using tools like Apache Spark, Hive, HBase, MapReduce, and Sqoop to handle large-scale data sets efficiently.

Developed and implemented machine learning models using frameworks to extract insights and build predictive capabilities from data.

Configured and managed content delivery networks (CDNs) using AWS CloudFront to improve the performance and availability of web applications.

Defined infrastructure as code using AWS CloudFormation templates, enabling consistent and repeatable provisioning of resources.

Implemented real-time data streaming solutions using AWS Kinesis to process and analyze streaming data for timely decision-making.
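
A minimal boto3 sketch of publishing to a Kinesis stream; the stream name, region, and payload shape are illustrative assumptions.

import json

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Hypothetical transaction event pushed for downstream real-time analysis.
event = {"transaction_id": "t-123", "amount": 42.50, "channel": "online"}
kinesis.put_record(
    StreamName="transactions-stream",           # hypothetical stream
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["transaction_id"],
)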

Collaborated with DevOps teams to implement version control, code review, and automated testing practices using Git and Bitbucket.

Facilitated Agile ceremonies such as sprint planning, daily stand-ups, and retrospectives to foster collaboration and continuous improvement.

Environment: AWS, AWS Glue, IAM, CloudWatch, Pandas, NumPy, XML, MySQL, DynamoDB, Spark, AWS Step Functions, Kinesis, Docker, Kubernetes, Ansible, Redshift, Hadoop, Sqoop, Mahout, Bitbucket, Jenkins, Pig, Agile, Scrum, JIRA.

Client: Amtrak – Washington, D.C. Apr 2021 – Nov 2021

Role: Data Engineer

Domain: Railroad

Responsibilities:

Designed and implemented high-performance data pipelines using Apache Spark for large-scale data processing on Azure Databricks.

Used Azure Data Factory (ADF) to orchestrate data movement and ETL workflows across various data sources.

Leveraged Impala for querying data stored in Apache HDFS within the Hadoop ecosystem.

Designed and managed relational databases on SQL Server for structured data storage and analysis.

Implemented data pipelines for interaction with NoSQL databases like MongoDB and Snowflake for flexible data storage.

Utilized Pandas and NumPy libraries in Python for data manipulation, analysis, and preparation for machine learning models built with PyTorch.

Exported and imported data between text files, Excel, and SQL Server databases using BULK INSERT and the BCP utility.

Performed performance tuning, indexing, and query optimization.

Developed and optimized MapReduce jobs on Hadoop clusters for large-scale data processing tasks.

Utilized Git for version control and collaboration on data engineering projects within the team.

Managed the software development lifecycle using Azure DevOps for continuous data pipeline integration and deployment (CI/CD).

Automated cloud infrastructure provisioning and management in Azure using Terraform (IaC).

Integrated Apache Kafka for real-time data streaming and ingestion into data pipelines.

Created data visualizations and dashboards using Power BI to communicate data insights effectively.

Implemented and monitored the ELK Stack (Elasticsearch, Logstash, Kibana) for log management, analysis, and visualization.

Followed Agile methodologies like Scrum, collaborating effectively with cross-functional teams to deliver data-driven solutions.

Utilized JIRA for issue tracking and project management within the data engineering team.

Presented data findings and technical solutions to stakeholders clearly and concisely.

Environment: Spark, PowerShell, Azure, ADF, Databricks, Impala, SQL Server, MongoDB, Snowflake, PyTorch, Hadoop, MapReduce, GitHub, Azure DevOps, Terraform, Kafka, Power BI, Azure AD, ELK Stack, Agile, Scrum, JIRA.

Client: Transamerica – Plano, TX Apr 2018 – Mar 2021

Role: Data Engineer/SQL Server Developer

Domain: Banking and Insurance

Responsibilities:

Designed and implemented data pipelines for data ingestion, transformation, and loading (ETL) on Google Cloud Platform (GCP) using tools like Cloud Dataprep and Dataflow.
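
A small Apache Beam sketch of the kind of pipeline run on Dataflow; the bucket paths, field count, and parsing logic are placeholders.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Swap DirectRunner for DataflowRunner (with project/region options) to run on GCP.
options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://example-bucket/raw/policies.csv")   # hypothetical path
        | "Parse" >> beam.Map(lambda line: line.split(","))
        | "KeepValid" >> beam.Filter(lambda fields: len(fields) == 5)               # assumed field count
        | "Format" >> beam.Map(lambda fields: ",".join(fields))
        | "Write" >> beam.io.WriteToText("gs://example-bucket/clean/policies")      # hypothetical path
    )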

Built and managed big data solutions on platforms including Apache Hadoop (CDH), Apache Spark, and PySpark for scalable data processing.

Utilized Sqoop and Pub/Sub for secure data transfer between various data sources and cloud storage.

Developed and optimized data models for data warehouses like PostgreSQL using expertise in relational databases.

Utilized Python libraries such as NumPy, Pandas, and Scikit-Learn for data manipulation, analysis, and various machine-learning tasks.

Engineered data pipelines using Scala and PySpark to handle complex data transformations and computations within the Apache Spark ecosystem.

Implemented data integration solutions using tools like Informatica for seamless data flow between heterogeneous systems.

Extensively used SQL Server Profiler to debug the application execution flow and dynamic queries being executed against the database.

Protected database applications against SQL injection and other hacking attacks.

Developed and maintained RESTful APIs for data access and integration with external applications.

Created data visualizations and dashboards using Tableau to communicate data insights effectively.

Created and debugged SQL stored procedures.

Performed data exploration and analysis using Splunk for log management and security investigations.

Followed Agile methodologies like Scrum, collaborating effectively with cross-functional teams to deliver data-driven solutions.

Utilized GitLab CI for continuous data pipeline integration and deployment (CI/CD), ensuring automated testing and delivery.

Managed project dependencies and builds using Maven for efficient development workflows.

Documented data pipelines, architectures, and procedures with clarity for maintainability.

Presented data findings and technical solutions to stakeholders clearly and concisely.

Continuously monitored and optimized data pipelines for performance and scalability on the cloud platform.

Utilized Confluence for knowledge sharing and collaboration within the data engineering team.

Environment: GCP, CDH, BigQuery, Python, Scala, Spark, PySpark, Informatica, PostgreSQL, REST APIs, Tableau, Splunk, Agile, Scrum, GitLab CI, Maven, NumPy, Pandas, Scikit-Learn, Confluence.

Client: Value Labs – Hyderabad, India Aug 2014 – Mar 2018

Role: Big Data Engineer

Responsibilities:

Involved in requirement gathering and business analysis, translating business requirements into technical designs for Hadoop and Big Data solutions.

Involved in Sqoop implementation to load data from various RDBMS sources into Hadoop systems and vice versa.

Developed Python scripts to extract the data from the web server output files to load into HDFS.

Wrote a Python script that automates launching the EMR cluster and configuring the Hadoop applications.
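
A hedged sketch of that kind of automation with boto3; the cluster name, instance types, release label, and log bucket are assumptions.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Launch a transient cluster with Hadoop, Spark, and Hive installed.
response = emr.run_job_flow(
    Name="batch-processing-cluster",                     # hypothetical name
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}, {"Name": "Hive"}],
    LogUri="s3://example-bucket/emr-logs/",              # hypothetical log bucket
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster started:", response["JobFlowId"])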

Extensively worked with Avro and Parquet files, converting data between formats; parsed semi-structured JSON data and converted it to Parquet using DataFrames in PySpark.
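
A brief sketch of the JSON-to-Parquet conversion, assuming a hypothetical HDFS landing path and an inferred schema; the selected column names are illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json_to_parquet").getOrCreate()

# Read semi-structured JSON into a DataFrame, flatten the fields of interest,
# and persist as columnar Parquet for downstream Hive/Spark queries.
events = spark.read.json("hdfs:///landing/events/*.json")
events.select("id", "type", "payload.amount").write.mode("overwrite") \
      .parquet("hdfs:///warehouse/events_parquet")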

Analyzed system failures, identified root causes, and recommended corrective actions; documented system processes and procedures for future reference.

Involved in Configuring Hadoop cluster and load balancing across the nodes.

Involved in Hadoop installation, commissioning, decommissioning, balancing, troubleshooting, monitoring, debugging, and configuration of multiple nodes using the Hortonworks platform.

Involved in working with Spark on top of YARN/MRv2 for interactive and batch analysis.

Involved in managing and monitoring Hadoop cluster using Cloudera Manager.

Used Python and Shell scripting to build pipelines.

Developed data pipelines using Sqoop, HQL, Spark, and Kafka to ingest enterprise message delivery data into HDFS.

Developed workflows in Oozie and Airflow to automate loading data into HDFS and pre-processing it with Hive.

Integrated Hadoop into traditional ETL, accelerating the extraction, transformation, and loading of massive semi-structured and unstructured data, and loaded unstructured data into the Hadoop Distributed File System (HDFS).

Created Hive tables with dynamic and static partitioning, including buckets for efficiency; also created external tables in Hive for staging purposes.

Loaded Hive tables with data, wrote Hive queries that run on MapReduce, and created a customized BI tool for manager teams to perform query analytics using HiveQL.

Aggregated RDDs based on business requirements, converted the RDDs into DataFrames saved as temporary Hive tables for intermediate processing, and stored the results in HBase/Cassandra and RDBMSs.

Environment: Hadoop 3.0, Hive 2.1, J2EE, JDBC, Pig 0.16, HBase 1.1, Sqoop, NoSQL, Impala, Java, Spring, MVC, XML, Spark 1.9, PL/SQL, HDFS, JSON, Hibernate, Bootstrap, jQuery.

EDUCATION:

Vignan’s Lara Institute of Technology and Science, India June 2010 – July 2014

Bachelor of Technology in Electronics and Communication Engineering, GPA: 3.5


