
Data Engineer Big

Location:
Richardson, TX
Salary:
70/hr
Posted:
February 13, 2025

Resume:

VIGNESH

SENIOR DATA ENGINEER / CLOUD DATA ENGINEER

Contact:

+1-737-***-****

**************@*****.***

www.linkedin.com/in/vignesh-s-696a57192

PROFESSIONAL SUMMARY

Results-driven IT professional with over 9 years of experience in analysis, design, development, implementation, maintenance, and support of Big Data technologies. Proven expertise in deploying strategic methods to efficiently address Big Data processing requirements.

Big Data Technologies:

Skilled in large-scale Hadoop environment build and support, including design, configuration, installation, performance tuning, and monitoring.

Extensive experience in the Hadoop ecosystem, covering Apache Spark, Scala, Python, HDFS, MapReduce, Kafka, Hive, Flume, Oozie, and HBase.

Integrated Kafka with Spark Streaming for real-time data processing.

In-depth knowledge of the Hadoop and Spark ecosystems, including Hadoop 2.0 (HDFS, Hive, Impala), Spark SQL, Spark MLlib, and Spark Streaming.

Data Modeling and Databases:

Strong knowledge of creating and maintaining physical data models for databases such as Oracle, Teradata, Netezza, DB2, MongoDB, HBase, and SQL Server, with experience in Azure SQL Data Warehouse.

Experience with NoSQL databases like MongoDB, HBase, and Cassandra.

Experienced in fact and dimension modeling, including Star schema, Snowflake schema, transactional modeling, and Slowly Changing Dimensions (SCD).

ETL, Business Intelligence & Visualization:

Proficient in designing, developing, documenting, and testing ETL jobs and mappings using tools such as DataStage, Informatica, Pentaho, and SyncSort.

Hands-on experience in the analysis, design, development, testing, and implementation of Business Intelligence solutions.

Proficient in creating data visualizations using R, Python, and dashboards with tools like Tableau.

Published customized interactive reports and dashboards with various procedures and tools.

Cloud Computing and ML:

Successfully designed and implemented database solutions using Azure SQL Data Warehouse.

Extensive experience with Azure services beyond databases, including Azure Databricks, Azure Data Lake Storage, and Azure Data Factory.

Applied AWS concepts to enhance processing efficiency for Teradata Big Data Analytics, utilizing services such as Elastic MapReduce (EMR) and Elastic Compute Cloud (EC2).

Additional proficiency in AWS services like Amazon S3 for scalable storage, AWS Glue for ETL, and Amazon Redshift for high-performance data warehousing.

Utilized analytical applications like SPSS, Rattle, and Python for trend identification and relationship analysis.

Applied machine learning techniques, optimization tools, and statistics to interpret and solve business problems.

Professional Accomplishments:

Successfully executed data engineering projects, ensuring high-quality results within deadline-driven environments.

Applied performance tuning techniques to optimize data processing and retrieval, enhancing overall system efficiency.

Demonstrated adaptability in rapidly evolving technology landscapes, incorporating cutting-edge solutions into project workflows.

SKILLS

Big Data Technologies: Hadoop, HDFS, Hive, HBase, Flume, Yarn, Spark SQL, Kafka, Presto

Languages: Python, Scala, PL/SQL, SQL, T-SQL, UNIX, Shell Scripting

Cloud Platform: AWS, Azure, Snowflake

Python Libraries: NumPy, Matplotlib, NLTK, Statsmodels, Scikit-learn, SciPy

ETL Tools: Pentaho, Informatica PowerCenter, Teradata, Web Intelligence

Operating System: Windows, UNIX, Linux

Modeling Tools: Oracle Designer, Erwin, ER/Studio

BI Tools: SSIS, SSRS, SSAS

Database Tools: Oracle, Microsoft SQL Server, Teradata, dbt, PostgreSQL, Azure Synapse

Other Tools: TOAD, SQL*Plus, SQL*Loader, MS Project, MS Visio, MS Office

PROFESSIONAL EXPERIENCE

GCP Data Engineer

Walmart Bentonville, Arkansas Sept 2024 – Current

Designed and implemented scalable data pipelines using Google Cloud Platform (GCP) services, including Cloud DataProc, Google Cloud Storage (GCS), and BigQuery, to handle large-scale data processing and analytics.

Developed and scheduled complex ETL workflows using Apache Airflow and Astronomer, ensuring seamless integration with existing data infrastructure.

Wrote optimized PySpark scripts for processing massive datasets, achieving significant performance improvements in data processing tasks.
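
A minimal PySpark sketch of the kind of tuning such a job might use; the paths, table, and column names below are illustrative assumptions, not the actual pipeline:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .appName("sales-aggregation")
        .config("spark.sql.shuffle.partitions", "400")   # size shuffle parallelism to the data volume
        .config("spark.sql.adaptive.enabled", "true")    # let AQE coalesce skewed or undersized partitions
        .getOrCreate()
    )

    orders = spark.read.parquet("gs://example-bucket/orders/")      # large fact table
    stores = spark.read.parquet("gs://example-bucket/dim_stores/")  # small dimension table

    daily_sales = (
        orders
        .join(F.broadcast(stores), "store_id")           # broadcast the small side to avoid a shuffle join
        .groupBy("order_date", "region")
        .agg(F.sum("amount").alias("total_sales"))
    )

    daily_sales.write.mode("overwrite").partitionBy("order_date").parquet("gs://example-bucket/daily_sales/")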

Collaborated with cross-functional teams via Slack to address data-related challenges, enabling real-time communication and quicker resolution of issues.

Implemented version control and collaborative development practices using GitHub, ensuring code integrity and seamless team contributions.

Automated and orchestrated complex ETL pipelines using Apache Airflow, optimizing task dependencies and ensuring high availability across workflows.

Designed and implemented data extraction, transformation, and loading (ETL) processes to migrate data between on-premises systems and GCP environments, ensuring data accuracy and integrity.

Developed reusable and parameterized Python scripts to streamline ETL processes, reducing manual intervention and enhancing maintainability.
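
As an illustration of that parameterization pattern (the argument names and load logic are assumptions, not the actual scripts):

    import argparse

    from pyspark.sql import SparkSession

    def run(source_path: str, target_table: str, load_date: str) -> None:
        # Generic, parameter-driven load step that can be reused across pipelines.
        spark = SparkSession.builder.appName("parameterized-load").getOrCreate()
        df = spark.read.parquet(source_path).filter(f"load_date = '{load_date}'")
        df.write.mode("append").saveAsTable(target_table)

    if __name__ == "__main__":
        parser = argparse.ArgumentParser(description="Reusable, parameter-driven ETL step")
        parser.add_argument("--source-path", required=True)
        parser.add_argument("--target-table", required=True)
        parser.add_argument("--load-date", required=True)
        args = parser.parse_args()
        run(args.source_path, args.target_table, args.load_date)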

Utilized PySpark to build distributed data processing solutions, handling terabyte-scale datasets with efficient resource utilization.

Designed Airflow DAGs with robust monitoring, logging, and alerting capabilities, ensuring early detection and resolution of pipeline failures.

Implemented custom Python operators and hooks in Airflow to integrate workflows with third-party APIs and internal data systems.
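
A custom Airflow operator of this kind might look like the sketch below; the endpoint and payload are hypothetical, and a matching hook would normally encapsulate the connection details:

    import requests
    from airflow.models import BaseOperator

    class PublishMetricsOperator(BaseOperator):
        """Hypothetical operator that posts pipeline run metrics to an internal REST API."""

        def __init__(self, endpoint: str, payload: dict, **kwargs):
            super().__init__(**kwargs)
            self.endpoint = endpoint
            self.payload = payload

        def execute(self, context):
            # Push run metadata to the (assumed) internal metrics service.
            response = requests.post(self.endpoint, json=self.payload, timeout=30)
            response.raise_for_status()
            return response.json()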

Conducted unit and integration testing for PySpark jobs and Python scripts, ensuring high-quality code deployment to production environments.

Automated metadata-driven ETL workflows, enabling dynamic pipeline configurations and reducing manual setup effort.

Used Airflow task retries and SLA configurations to enhance workflow reliability during peak business periods.
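
A minimal sketch of how retries and SLAs can be declared on an Airflow DAG (the DAG id, schedule, and callable are assumptions):

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    default_args = {
        "owner": "data-eng",
        "retries": 3,                           # retry transient failures automatically
        "retry_delay": timedelta(minutes=10),
        "sla": timedelta(hours=2),              # flag tasks that run past the expected window
    }

    def load_daily_sales(**context):
        # Placeholder for the actual load logic.
        pass

    with DAG(
        dag_id="daily_sales_load",
        start_date=datetime(2024, 1, 1),
        schedule_interval="0 6 * * *",
        default_args=default_args,
        catchup=False,
    ) as dag:
        PythonOperator(task_id="load_daily_sales", python_callable=load_daily_sales)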

Enhanced reporting pipelines by integrating SQL-based aggregations with Airflow workflows, enabling faster generation of daily and ad-hoc reports.

Deployed a secure and efficient sandbox environment for testing and validating new data engineering solutions before production release.

Monitored and supported mission-critical workflows during the holiday season by providing on-call support, addressing issues proactively, and ensuring timely execution of all scheduled jobs in Airflow.

Created comprehensive documentation for workflows, pipelines, and troubleshooting steps, reducing onboarding time for new team members by 30%.

Analyzed data patterns and provided actionable insights to stakeholders, driving informed decision-making for business-critical processes.

Environment: GCP, DataProc, GCS, BigQuery, Apache Airflow, Astronomer, PySpark, Slack, GitHub, Sandbox Environments, GCP IAM, Cloud KMS, Data Loss Prevention (DLP) API.

Azure Data Engineer

UHG Optum Eden Prairie, MN Dec 2023 – Aug 2024

Designed and implemented data pipelines on Azure using Spark and Scala, leveraging Azure Databricks for scalable data processing and analytics, resulting in improved data reliability and performance.

Integrated Snowflake data warehouse with Azure ecosystem, including Azure Data Lake Storage and Azure SQL Database, ensuring seamless data ingestion, transformation, and querying capabilities for analytics and reporting purposes.

Spearheaded the design, development, and deployment of complex ETL solutions using Talend, managing a team of data engineers to deliver scalable data pipelines in Azure environments for enterprise-level clients.

Developed custom ETL solutions using Matillion for Azure, facilitating data extraction, transformation, and loading tasks from various sources into Snowflake data warehouse, ensuring data consistency and integrity.

Managed codebase and version control using GitHub, collaborating with cross-functional teams to review, test, and deploy data engineering solutions, ensuring code quality and maintainability.

Implemented data integration workflows using IBM DataStage on Azure, enabling seamless data movement between disparate systems and applications, ensuring data consistency and accuracy across the organization.

Designed and managed data pipelines that load and transform data into Snowflake, ensuring efficient data processing and reducing ETL times by 40%.

Utilized Databricks notebooks to perform complex data transformations and cleaning tasks using PySpark and SQL.
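
A short PySpark cleaning sketch of the sort such a notebook cell might contain (the claims DataFrame and column names are illustrative):

    from pyspark.sql import functions as F

    # 'claims' is assumed to be a DataFrame already loaded earlier in the notebook.
    cleaned = (
        claims
        .dropDuplicates(["claim_id"])                                     # remove duplicate claims
        .filter(F.col("claim_amount").isNotNull())                        # drop rows missing the key measure
        .withColumn("claim_date", F.to_date("claim_date", "yyyy-MM-dd"))  # normalize the date format
        .fillna({"adjustment_code": "NONE"})                              # standardize missing categorical values
    )

    cleaned.createOrReplaceTempView("claims_cleaned")  # expose the result to SQL cells in the same notebook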

Led the migration of on-premise data warehouses to Azure by leveraging Azure Synapse Analytics and Data Lake Storage, improving data accessibility, scalability, and cost-efficiency.

Enabled seamless integration of Snowflake with Azure services, such as Azure Data Factory and Azure Databricks, to create a robust data ecosystem.

Developed real-time data ingestion pipelines using Talend and Azure Event Hubs, enabling near real-time analytics and data availability for business intelligence applications.

Integrated dbt with Snowflake to automate and optimize SQL transformations, reducing manual coding efforts and improving the scalability of data operations.

Leveraged Databricks notebooks for collaborative data analysis and development, enabling seamless teamwork among data engineers, data scientists, and analysts.

Designed and optimized complex data processing workflows using Apache Airflow on Azure, orchestrating tasks such as data ingestion, transformation, model training, and deployment, resulting in improved workflow reliability and efficiency.

Developed and maintained data models using dbt to streamline data transformations and ensure consistency across multiple data sources.

Collaborated with data scientists to deploy machine learning models into production environments, integrating model inference pipelines with Snowflake data warehouse for real-time scoring and analysis of streaming data.

Integrated Talend with Azure services such as Azure Blob Storage, Azure Data Factory, and Azure Data Lake Storage to build seamless data pipelines, ensuring smooth data movement across the cloud environment.

Integrated Databricks with data lakes (e.g., Azure Data Lake Storage, AWS S3) and data warehouses (e.g., Snowflake, Delta Lake) to ensure seamless data flow and storage.

Conducted performance tuning and optimization of ETL processes using Spark and Scala, optimizing resource utilization, query performance, and data throughput, resulting in reduced processing time and cost savings.

Developed and maintained data models using dbt to streamline data transformations and ensure consistency across multiple data sources.

Environment: Azure Data Factory, Azure SQL, Blob Storage, Azure SQL Data Warehouse, Azure Gen 2 Storage, Python, PySpark, Unix Shell Scripting, Azure Machine Learning, GitHub, DataStage, Azure Data Lake Storage (ADLS), Matillion, Spark Streaming, Scala, Azure Stream Analytics, Azure Databricks, Azure Event Hubs, Azure Blob Storage, HDFS, Azure Logic Apps, Snowflake.

Azure Data Engineer

Vanguard Valley Forge, PA March 2023 – Nov 2023

Created ADF pipelines using Linked Services, Datasets, and Pipeline components, enabling seamless ETL processes from diverse sources such as Azure SQL, Blob storage, Azure SQL Data Warehouse, and write-back tools.

Configured ADF pipelines to handle incremental data loads, ensuring data consistency and reducing processing time.
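
ADF incremental loads are configured as pipeline activities rather than code, but the underlying high-watermark pattern can be sketched in PySpark; the server, table, and column names below are illustrative only:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("incremental-load").getOrCreate()

    # High-water mark from the previous successful run, e.g. read from a control table (value is illustrative).
    last_watermark = "2023-09-30 00:00:00"

    jdbc_url = "jdbc:sqlserver://example-server.database.windows.net;database=sales"  # assumed source

    # Pull only the rows modified since the last load.
    incremental = (
        spark.read.format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", f"(SELECT * FROM dbo.orders WHERE modified_at > '{last_watermark}') AS delta")
        .option("user", "etl_user")
        .option("password", "***")
        .load()
    )

    incremental.write.mode("append").parquet("abfss://curated@examplelake.dfs.core.windows.net/orders/")

    # Persist the new watermark for the next run.
    new_watermark = incremental.agg(F.max("modified_at")).first()[0]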

Led the automation of ETL workflows using Talend integrated with Azure Data Factory and Azure DevOps, enabling continuous integration and continuous deployment (CI/CD) for faster and more reliable data pipeline execution.

Developed production-level machine learning classification models using Python and Py-Spark to predict binary values within specific time frames.

Integrated Azure Machine Learning services to deploy and manage machine learning models seamlessly within Azure ecosystem.

Utilized Azure DevOps for version control and CI/CD pipeline integration.

Managed Azure Data Lake Storage (ADLS) and Data Lake Analytics, integrating them with other Azure services; proficient in U-SQL.

Implemented cloud-native solutions using Azure Service Fabric, ensuring reliable, distributed, and highly available microservices for complex data processing workloads.

Built real-time pipelines for streaming data using Kafka and Spark Streaming.
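
For illustration, a Structured Streaming variant of that Kafka pipeline might look as follows (broker, topic, and schema are assumptions):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

    spark = SparkSession.builder.appName("trade-events-stream").getOrCreate()

    event_schema = StructType([
        StructField("account_id", StringType()),
        StructField("amount", DoubleType()),
        StructField("event_time", TimestampType()),
    ])

    # Read the raw event stream from Kafka.
    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")
        .option("subscribe", "trade-events")
        .load()
        .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
        .select("e.*")
    )

    # Windowed aggregation with a watermark to bound late data.
    per_minute = (
        events.withWatermark("event_time", "5 minutes")
        .groupBy(F.window("event_time", "1 minute"), "account_id")
        .agg(F.sum("amount").alias("total_amount"))
    )

    query = (
        per_minute.writeStream.outputMode("update")
        .format("console")  # a production pipeline would target a durable sink
        .option("checkpointLocation", "/tmp/checkpoints/trade-events")
        .start()
    )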

Implemented operational data stores (ODS) to serve as a central repository for real-time transactional data, enabling near real-time analytics and decision-making capabilities for business stakeholders.

Automated the deployment of data models in dbt, integrating with CI/CD pipelines to enable continuous delivery and reduce deployment times.

Implemented log monitoring with Datadog Log Management, collecting logs from Azure services like Azure Storage and Azure Functions to troubleshoot issues and enhance visibility.

Set up automated alerts in Datadog for anomalies in Azure-based data processing jobs, reducing downtime and improving response times for incident resolution.

Led the migration of on-premises data (Oracle/SQL Server/DB2/MongoDB) to Azure Data Lake Storage (ADLS) using Azure Data Factory.

Leveraged Azure Databricks for big data analytics and Apache Spark job orchestration.

Used Apache Spark Data Frames, Spark-SQL, and Spark MLlib extensively, developing and designing POCs using Scala, Spark SQL, and MLlib libraries.

Integrated Azure Event Hubs for event streaming and Kafka-compatible endpoints.

Implemented Azure Data Factory Data Flows for efficient and scalable data migration.

Responsible for comprehensive data ingestion using HDFS commands and accumulating partitioned data in various storage formats.

Designed and implemented scalable ETL processes using Talend to extract, transform, and load data into Azure Data Lake, Azure SQL Database, and Azure Synapse Analytics from multiple sources, including on-premise systems and cloud platforms.

Used Azure Monitor and Azure Log Analytics for proactive incident detection and resolution.

Implemented data governance and data quality measures, designing models and processes to ensure data integrity.

Collaborated with data stewards to establish data quality rules and validation processes within Azure Data Catalog.

Designed 3NF data models for OLTP systems and dimensional data models using star and snowflake schemas.

Demonstrated expertise in integrating DevOps practices within the data engineering lifecycle, ensuring collaboration, automation, and continuous improvement.

Architected and implemented medium to large-scale BI solutions on Azure using Azure Data Platform services.

Used Azure Logic Apps for orchestration and automation of data workflows.

Utilized SQL Server Integration Services (SSIS) for efficient extraction, transformation, and loading of data from multiple sources into the target system.

Applied SSIS best practices to ensure optimal performance and reliability in data workflows.

Implemented Agile methodologies in data engineering projects, fostering iterative development, collaboration, and adaptability to changing requirements.

Automated recurring data ingestion and transformation processes on Azure through Talend’s job scheduler and triggers, reducing manual intervention and ensuring consistent data availability for analytics.

Applied various machine learning algorithms and statistical modeling to integrate structured data using traditional ETL tools and methodologies.

Used Azure Kubernetes Service (AKS) for containerized machine learning model deployment.

Led initiatives to optimize the performance of Talend data pipelines by implementing advanced partitioning, parallel processing, and resource management techniques, reducing ETL processing time by up to 40%.

Environment: Azure Data Factory, Azure SQL, Azure SQL Data Warehouse, Azure Gen 2 Storage, Python, PySpark, Azure Machine Learning, Azure DevOps, Azure Data Lake Storage (ADLS), Kafka, Spark Streaming, Azure Stream Analytics, Azure Databricks, Azure Event Hubs, Azure Blob Storage, HDFS, Azure Logic Apps, Azure Kubernetes Service (AKS).

Big Data Engineer

Blue Cross Blue Shield Chicago, IL Feb 2022 – Mar 2023

Introduced strategic partitioning techniques in HIVE for improved data access efficiency. For instance, partitioning data based on date or region enhances query performance by limiting the scan to specific partitions, reducing processing time.
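
For illustration, the partitioning pattern described above can be expressed in PySpark against a Hive table (the DataFrame, database, and column names are assumed):

    # Write claims data into a Hive table partitioned by date and region, so queries that
    # filter on those columns scan only the relevant partitions.
    (claims_df.write
        .mode("overwrite")
        .partitionBy("claim_date", "region")
        .format("parquet")
        .saveAsTable("analytics.claims_partitioned"))

    # Downstream queries then benefit from partition pruning:
    spark.sql("""
        SELECT region, COUNT(*) AS claim_count
        FROM analytics.claims_partitioned
        WHERE claim_date = '2022-06-01'
        GROUP BY region
    """).show()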

Actively engaged in real-time data processing using Spark Streaming and Kafka. This involved submitting and managing Spark jobs that consume streaming data from Kafka topics, enabling timely insights and actions based on live data.

Engineered Spark jobs to consume data from Kafka topics, implementing robust validation checks before pushing the cleansed data into HBase and Oracle databases. This process ensures data integrity and reliability in downstream applications.
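
A sketch of that validate-then-write step using foreachBatch; the streaming DataFrame, validation rules, and Oracle connection details are assumptions, and the HBase write (which would use a separate connector) is omitted:

    def validate_and_write(batch_df, batch_id):
        # Basic validation: keep only rows with a member id and a positive amount.
        valid = batch_df.filter("member_id IS NOT NULL AND claim_amount > 0")

        # Persist the cleansed batch to Oracle over JDBC.
        (valid.write.format("jdbc")
            .option("url", "jdbc:oracle:thin:@//example-host:1521/ORCLPDB")
            .option("dbtable", "CLAIMS_CLEANSED")
            .option("user", "etl_user")
            .option("password", "***")
            .mode("append")
            .save())

    # 'claims_stream' is assumed to be a streaming DataFrame already parsed from the Kafka topic.
    query = (claims_stream.writeStream
        .foreachBatch(validate_and_write)
        .option("checkpointLocation", "/tmp/checkpoints/claims")
        .start())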

Created and administered S3 buckets, establishing policies for efficient data storage and backup using AWS Glacier. This not only ensures secure and scalable storage but also facilitates reliable data recovery and archival.

Exported thoroughly analyzed data to relational databases, facilitating visualization and report generation using Tableau. This step ensures that business stakeholders have access to meaningful insights derived from the processed data.

Leveraged AWS Glue for Extract, Transform, Load (ETL) tasks, automating the preparation of data for analysis. This streamlines the data pipeline, making it more efficient and reducing the time required for data preprocessing.
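
A skeletal Glue PySpark job of that kind might look like the following; the catalog database, table, mappings, and bucket are assumptions:

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read the source table registered in the Glue Data Catalog.
    source = glue_context.create_dynamic_frame.from_catalog(database="claims_db", table_name="raw_claims")

    # Rename and cast columns as the transform step.
    mapped = ApplyMapping.apply(
        frame=source,
        mappings=[("claim_id", "string", "claim_id", "string"),
                  ("amt", "double", "claim_amount", "double")],
    )

    # Write the prepared data back to S3 in a columnar format for analysis.
    glue_context.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://example-bucket/curated/claims/"},
        format="parquet",
    )

    job.commit()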

Employed the AWS Glue Catalog for effective metadata management and schema discovery. This ensures a centralized repository of metadata, making it easier to understand, trace, and govern data structures.

Implemented the ELK (Elasticsearch, Logstash, Kibana) stack to aggregate, analyze, and visualize logs, enabling proactive monitoring and quicker resolution of incidents.

Utilized AWS Athena to execute SQL queries directly on data stored in S3, enabling ad-hoc analysis. This approach provides agility in querying vast datasets without the need for complex data movement.
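
A minimal boto3 sketch of submitting such an ad-hoc query (database, table, and output bucket are illustrative):

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # Submit an ad-hoc query against data catalogued over S3.
    response = athena.start_query_execution(
        QueryString="SELECT region, COUNT(*) AS claims FROM claims_db.raw_claims GROUP BY region",
        QueryExecutionContext={"Database": "claims_db"},
        ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
    )

    print(response["QueryExecutionId"])  # poll get_query_execution / get_query_results with this id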

Configured and deployed Hive and Spark applications on AWS EMR for large-scale data processing. This distributed processing approach enhances scalability, allowing efficient handling of big data workloads.

Selected and generated structured data, stored it in CSV files on AWS S3 using EC2 instances, and further organized it in AWS Redshift. This optimized storage and retrieval for analytical queries.

Established continuous integration and continuous delivery (CI/CD) pipelines using AWS CodePipeline. This automated the deployment process, ensuring a seamless and efficient release cycle.

Configured AWS CloudWatch for monitoring and logging, providing real-time insights into system behavior. This proactive approach helps identify and address performance issues promptly.

Configured AWS Lambda for serverless computing, allowing the execution of code without the need to provision or manage servers. This serverless architecture enhances scalability and reduces operational overhead.

Installed and configured Apache Airflow for orchestrating tasks related to S3 buckets and Snowflake data warehouse. This involved creating Directed Acyclic Graphs (DAGs) to automate and schedule workflows.

Collaborated with business analysts and subject matter experts across departments to gather detailed business requirements. This collaborative effort ensures that data engineering solutions align with the overarching business goals.

Authored Spark programs in Scala and Spark SQL to implement complex data transformations. This coding practice ensures that the data is processed and prepared according to specific business needs.

Employed Python within Spark for extracting data from Snowflake and uploading it to Salesforce on a daily basis. This automated data transfer ensures data consistency and timeliness.

Designed and developed complex dbt models to transform raw data into analytical datasets in Snowflake.

Created staging, intermediate, and final models using dbt's Jinja-based SQL templating.

Enhanced data governance by using dbt to track data lineage and dependencies across various data models.

Worked on designing ETL pipelines to retrieve datasets from MySQL and MongoDB into AWS S3 buckets. Managed bucket and object access permissions for secure and controlled data retrieval.

Architected and designed CI/CD for serverless applications using the AWS Serverless Application Model (Lambda). This serverless architecture ensures efficient scaling and reduces operational overhead.

Environment: Hive, Spark Streaming, Kafka, HBase, Oracle, AWS S3, AWS Glacier, Tableau, AWS Glue, AWS Glue Catalog, AWS Athena, AWS EMR, AWS Redshift, AWS CodePipeline, AWS CloudWatch, AWS Lambda, Apache Airflow, Python, Snowflake, Salesforce, MySQL, MongoDB

Data Engineer

Next Sphere Technologies Hyderabad Mar 2018 – Nov 2021

Spearheaded the implementation and management of ETL solutions, demonstrating expertise in automating operational processes for enhanced efficiency.

Developed advanced Spark Applications using Scala, showcasing proficiency in Apache Spark data processing projects that seamlessly handled diverse data sources, including RDBMS and streaming platforms.

Played a pivotal role in optimizing algorithmic performance within Hadoop by effectively utilizing Spark, emphasizing a commitment to continuous improvement and innovation.

Utilized SonarQube to monitor and enforce code quality standards, ensuring high-quality, maintainable, and secure code in data engineering projects.

Took charge of defining, deploying, and managing monitoring, metrics, and logging systems on AWS. Demonstrated expertise in security group management, high availability, fault tolerance, and auto scaling using Terraform templates.

Implemented Continuous Integration and Continuous Deployment methodologies with AWS Lambda and AWS CodePipeline.

Applied hands-on experience in leveraging big data solutions on AWS cloud services, encompassing EC2, S3, EMR, and DynamoDB.

Orchestrated the seamless migration of on-premise database structures to the Confidential Redshift data warehouse, showcasing adaptability to cloud-based technologies.

Exhibited a strong understanding of AWS components, particularly EC2 and S3, showcasing versatility in cloud computing environments.

Employed a structured approach to defining facts and dimensions, designing data marts using Ralph Kimball's Dimensional Data Mart modeling methodology with Erwin.

Developed a Kafka consumer API in Scala, demonstrating proficiency in real-time data consumption from Kafka topics.

Led the optimization and tuning efforts within the Redshift environment, resulting in queries performing up to 100 times faster, especially beneficial for Tableau and SAS Visual Analytics.

Environment: Scala, Apache Spark, Hadoop, ERD, Functional Diagrams, Data Flow Diagrams, Erwin, AWS Technologies (EC2, S3, EMR, DynamoDB, Terraform, AWS Lambda, AWS CodePipeline, AWS Redshift), Kafka, Tableau.

Hadoop Developer

Textron India Private Limited Bengaluru Aug 2016 – Mar 2018

Created processing pipelines, incorporating transformations, estimations, and evaluation of analytical models.

Imported and exported data into HDFS and Hive using Sqoop.

Utilized Spark's HiveContext for transformations and actions (map, flatMap, filter, reduce, reduceByKey).

Migrated Pig scripts to Hive, enabling transformations, joins, and pre-aggregations before storing data in HDFS.

Imported data from AWS S3 into Spark RDD, performing transformations and actions on RDDs.

Worked with different file formats such as SequenceFiles, XML files, and MapFiles using MapReduce programs.

Wrote Hive jobs to parse logs and structure them in tabular format for effective querying on log data.

Tuned SQL queries and Stored Procedures for speedy data extraction and troubleshooting in OLTP environments.

Implemented live Real-time Processing and core jobs using Spark Streaming with Kafka as a data pipeline system.

Developed Spark jobs using Scala on top of YARN/MRv2 for interactive and batch analysis.

Worked with various data sources such as Teradata and Oracle. Loaded files to HDFS from Teradata and loaded data from HDFS to Hive and Impala.

Utilized Oozie workflow engine for job scheduling.

Worked with Avro Data Serialization system to handle JSON data formats.

Implemented various performance optimizations, including using distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.

Performed pre-processing on datasets prior to training, including standardization and normalization.

Evaluated model accuracy by dividing data into training and test datasets and computing metrics using evaluators.

Environment: Hadoop, Hive, HDFS, Spark, Oozie, MapReduce, Scala, Python, PySpark, AWS, Oracle 10g, SQL, OLTP, Windows, MS Office

Database Engineer

SITEL India Private Limited Chennai May 2015 – Aug 2016

Employed best practices in creating database objects such as tables, views, indexes (clustered and non-clustered), stored procedures, and triggers.

Built ETL processes using SSIS, including pulling data from remote locations via FTP, transforming it, merging it into the data warehouse, and providing proper error handling and alerting.

Created triggers to keep track of changes to the fields of tables when changes are made.

Planned, designed, and implemented application database code objects, such as stored procedures and views.

Actively involved in extracting data from database tables and writing the data to text files.

Successfully managed the Extraction, Transformation, and Loading (ETL) process, pulling large volumes of data from various data sources using BCP.

Performed data migration (import and export via BCP) from text files to SQL Server.

Used Crystal Reports to generate ad-hoc reports.

Filtered bad data from the legacy system using complex T-SQL statements and implemented various constraints and triggers for data consistency.

Used SQL Profiler and Query Analyzer to optimize DTS package queries and stored procedures.

Environment: MS SQL Server, Windows Server, SSIS, SSRS, MS Visio, T-SQL, Business Intelligence studio, Management Studio

EDUCATION

Bachelor's in Computer Science, Saveetha University, 2012 – 2016

Master's in Information Systems, Central Michigan University, 2022 – 2023


