Data Engineer - SQL Server

Location:
Dallas, TX
Posted:
February 27, 2024

Viplav

Data Engineer

Contact: +1-617-***-****

Email: ad3ycn@r.postjobfree.com

PROFESSIONAL SUMMARY:

8+ years of experience in Analysis, Design, Development, and Implementation as a Data Engineer.

Hands-on experience in developing Spark applications using RDD transformations, Spark Core, Spark Streaming, PySpark, and Spark SQL.

Experience in the design and development of scalable systems using Hadoop technologies in various environments. Extensive experience in analyzing data with the Hadoop ecosystem, including HDFS, MapReduce, Hive, and Pig.

Good understanding of Spark architecture with Databricks and Structured Streaming; set up Databricks on AWS and Microsoft Azure, including Databricks workspaces for business analytics, and managed clusters.

Proficiency in multiple databases like MongoDB and Oracle.

Good experience scaling MongoDB across data centers and an in-depth understanding of MongoDB HA strategies, including replica sets.

Extensive experience in importing and exporting data with Sqoop between Hadoop and RDBMSs: MySQL, Oracle, MS SQL Server, DB2, Teradata, and PostgreSQL.

Hands-on experience working with multiple file formats such as JSON, Avro, and Parquet, leveraging Storm/Spark/Pig scripts to load data into NoSQL databases such as MongoDB and Cassandra (a brief PySpark sketch follows this summary).

Experience in using Python for Data Engineering and Modeling.

Experience in writing data cleansing scripts with Spark, MapReduce, and Pig.

Worked on moving data from on-premises systems to the AWS cloud using services such as S3, S3 Glacier, Kinesis, DynamoDB, and RDS. Familiar with the Hadoop and Big Data model.

Experience in developing Spark pipelines using Scala and Python.

Experience in writing distributed Scala code for efficient big data processing.

Solid experience in implementing large-scale data warehousing programs and end-to-end data integration solutions on Snowflake, AWS Redshift, and Informatica PowerCenter, integrated with multiple relational databases (MySQL, Teradata, Oracle, Sybase, SQL Server, DB2).

Expertise in querying RDBMSs such as Oracle, MySQL, and SQL Server using SQL to ensure data integrity.

Proficient with workflow orchestration tools such as Oozie, Airflow, AWS Data Pipeline, and Azure Data Factory, as well as CloudFormation and Terraform.

Experience in designing, developing, and scheduling reports and dashboards using Tableau and Power BI.

Experienced with version control tools like Git and GitLab, project management tools such as JIRA, and software development methodologies like Agile and Scrum.

Experience in building Docker images using Docker files and container-based deployments on Kubernetes.

Experience in applying various design and coding techniques to improve query performance on the Redshift database.

Experience with GCP Dataproc, GCS, Cloud Functions, BigQuery, and Azure Data Factory.

Experience in building efficient pipelines for moving data between GCP and Azure using Azure Data Factory.

Experience in developing Dockerfiles to containerize applications and deploy them on managed Kubernetes services (EKS and AKS).
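
The following is a minimal PySpark sketch in the spirit of the multi-format processing described above; the bucket paths and column names are hypothetical and used only for illustration.

# Minimal PySpark sketch: read JSON and Parquet inputs, apply a
# DataFrame/Spark SQL transformation, and write curated Parquet output.
# All paths and columns below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("multi-format-ingest").getOrCreate()

# Read two of the file formats mentioned above.
orders = spark.read.json("s3a://example-bucket/raw/orders/")           # hypothetical path
customers = spark.read.parquet("s3a://example-bucket/raw/customers/")  # hypothetical path

# Join and aggregate daily order amounts per customer.
daily_totals = (
    orders.join(customers, "customer_id")
    .groupBy("customer_id", F.to_date("order_ts").alias("order_date"))
    .agg(F.sum("amount").alias("daily_amount"))
)

# Persist the curated output as Parquet.
daily_totals.write.mode("overwrite").parquet("s3a://example-bucket/curated/daily_totals/")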

TECHNICAL SKILLS:

Databases: Snowflake, AWS RDS, DynamoDB, Teradata, Oracle, MySQL, Microsoft SQL Server, PostgreSQL

NoSQL Databases: MongoDB, Hadoop HBase, Apache Cassandra

Programming Languages: Python, SQL, Scala, JavaScript

Cloud Technologies: AWS, Azure

Data Formats: CSV, JSON

Querying Languages: SQL, NoSQL, PostgreSQL, MySQL, Microsoft SQL, PL/SQL

Integration Tools: Jenkins, Kubernetes, Docker

Scalable Data Tools: Hadoop, PySpark, Hive, Apache Spark, Kafka, Elasticsearch, Pig, Presto, MapReduce, Airflow, Sqoop, Impala, ZooKeeper

Operating Systems: Red Hat Linux, Unix, Windows, macOS

Reporting & Visualization: Tableau, QuickSight, Power BI, MS Excel

PROFESSIONAL EXPERIENCE:

Client: HP, Vancouver, WA May 2022 – Present

Role: Data Engineer

Responsibilities:

Identify, design, and implement internal process improvements: automating manual processes, optimizing data delivery, re-designing infrastructure for greater scalability, etc.

Deployed workflow orchestration and applied expertise in data modeling, ETL development, and data warehousing.

Developed Spark applications using Spark SQL and PySpark in Azure Databricks to extract, transform, aggregate, and analyze data from multiple file formats.

Developed data ingestion modules (both real-time and batch) to load data into various layers in S3, Redshift, and Snowflake using AWS Kinesis, AWS Glue, AWS Lambda, and AWS Step Functions.

Optimized and tuned the Redshift environment, enabling queries to perform up to 100x faster for Tableau and SAS Visual Analytics.

Managed deployments on EKS-managed Kubernetes, set up multi-node clusters, and deployed containerized applications.

Developed and maintained Terraform modules and scripts for creating, updating, and deleting infrastructure resources.

Monitored end-to-end infrastructure using CloudWatch integrated with SNS for notification.

Automated the resulting scripts and workflows using Apache Airflow and shell scripting to ensure daily execution in production (see the sketch at the end of this section).

Implemented Continuous Integration and Continuous Delivery processes using GitLab along with Python and shell scripts to automate routine jobs, including synchronizing installers, configuration modules, packages, and requirements for the applications.
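
As a reference for the Airflow automation above, here is a minimal sketch of a daily DAG that chains two shell scripts; the DAG id, schedule, and script paths are assumptions, not the actual production workflow.

# Hypothetical Airflow 2.x DAG: daily run of two shell scripts, as a sketch
# of the automation described above.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="daily_ingestion",            # hypothetical DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 6 * * *",       # once a day at 06:00
    catchup=False,
    default_args=default_args,
) as dag:

    # Trailing space after the .sh path keeps Airflow from treating it as a Jinja template file.
    extract = BashOperator(
        task_id="run_extract_script",
        bash_command="/opt/scripts/extract_to_s3.sh ",   # hypothetical script
    )

    load_redshift = BashOperator(
        task_id="load_redshift",
        bash_command="/opt/scripts/load_redshift.sh ",   # hypothetical script
    )

    extract >> load_redshift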

Environment: Amazon Redshift, AWS Data Pipeline, Kinesis, Tableau, EC2, EKS, CloudWatch, Glue, S3, Unix shell scripting, Spark, GitHub, Azure Databricks, Terraform, Kubernetes, Airflow, JIRA.

Client: Epsilon, Dallas, TX Aug 2021 – Apr 2022

Role: Cloud Data Engineer

Responsibilities:

Designed and implemented highly performant data ingestion pipelines from multiple sources using Azure Data Factory and Azure Databricks.

Built ETL solutions in Databricks by executing code in notebooks against data in the Data Lake and Delta Lake and loading data into Azure DW, following the bronze, silver, and gold layer architecture (see the sketch at the end of this section).

Integrated the end-to-end data pipeline to take data from source systems to target data repositories.

Created complex ETL pipelines in Azure Data Factory using mapping data flows with multiple input/output, schema modifier, and row modifier transformations written with Scala expressions.

Developed dynamic Data Factory pipelines using parameters and triggered them as needed via events such as file availability on Blob Storage, on a schedule, and through Logic Apps.

Developed custom email notifications in Azure Data Factory pipelines using Logic Apps and standard notifications using Azure Monitor.

Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL activity.

Developed Spark applications in Python (PySpark) on a distributed environment to load large numbers of CSV files.

Created Databricks notebooks using SQL and Python and automated them using jobs.

Created multiple dashboards using SQL as a source and reduced performance issues in the PySpark dashboards with live connections to the data source.

Maintained and troubleshot Terraform deployments, including fixing errors, resolving conflicts, and managing state files.

Transformed and analyzed data using PySpark and Hive based on ETL mappings.

Developed PySpark programs, created DataFrames, and worked on transformations.
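
As an illustration of the bronze-to-silver Delta Lake step mentioned above, here is a minimal PySpark sketch in Databricks notebook style; the mount paths, table name, and columns are assumptions.

# Hypothetical bronze -> silver Delta Lake step (Databricks notebook style;
# Delta Lake support is assumed to be available on the cluster).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()  # already provided in a Databricks notebook

# Read the raw (bronze) Delta table from a hypothetical data lake mount.
bronze = spark.read.format("delta").load("/mnt/datalake/bronze/transactions")

# Basic cleansing and enrichment for the silver layer.
silver = (
    bronze.dropDuplicates(["transaction_id"])
    .filter(F.col("amount").isNotNull())
    .withColumn("ingest_date", F.to_date("ingest_ts"))
)

# Write the cleansed data to the silver Delta path.
silver.write.format("delta").mode("overwrite").save("/mnt/datalake/silver/transactions")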

Environment: Azure Logic Apps, Scala, Databricks, Oracle, Power BI, Unix shell scripting, Azure Data Factory, Airflow, Stored procedures, Git, JIRA, SFTP, Adobe, Blob storage.

Client: Mars, McLean, VA Jan 2020 – Jul 2021

Role: Data Engineer

Responsibilities:

Created ElastiCache clusters for the database systems to ensure quick access to frequently requested data. Created backups of database systems using the S3, EBS, and RDS services of AWS.

Wrote PySpark and Spark SQL transformations in Azure Databricks to implement complex business rules.

Integrated and automated data workloads into the Snowflake warehouse and ensured ETL/ELT jobs succeeded and loaded data correctly into Snowflake.

Redesigned views in Snowflake to improve performance.

Created Athena data sources on S3 buckets for ad hoc querying and business dashboarding using the QuickSight and Tableau reporting tools.

Designed and developed Security Framework to provide fine grained access to objects in AWS S3 using AWS Lambda, DynamoDB.

Built data ingestion system with Kafka, Cassandra, and Zookeeper for event processing in AWS.

Created a PySpark framework to bring data from DB2 to Amazon S3.

Developed Python code for task dependencies, SLA watchers, and time sensors for each job, using Airflow for workflow management and automation.

Imported and managed multiple corporate applications into GitHub code management repo.

Worked with AWS services such as EMR and EC2 for fast and efficient processing of Big Data.

Involved in the configuration and development of a Hadoop environment on the AWS cloud using Lambda, S3, EC2, EMR (Elastic MapReduce), and Redshift.

Used Spark Streaming APIs to perform the necessary transformations and actions on the fly to build the common learner data model, which receives data from Kafka in near real time (sketched below).
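
The bullet above is sketched below as a minimal Structured Streaming job that reads from Kafka and lands data in S3; the broker, topic, schema, and paths are assumptions, and the spark-sql-kafka connector is assumed to be on the cluster.

# Hypothetical Structured Streaming sketch: Kafka -> Parquet on S3.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("learner-stream").getOrCreate()

# Assumed event schema for the Kafka payload.
event_schema = StructType([
    StructField("learner_id", StringType()),
    StructField("event_type", StringType()),
    StructField("score", DoubleType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
    .option("subscribe", "learner-events")              # hypothetical topic
    .load()
)

# Parse the Kafka value payload into typed columns.
events = (
    raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Continuously append the parsed events to a Parquet sink.
query = (
    events.writeStream.format("parquet")
    .option("path", "s3a://example-bucket/learner-model/")      # hypothetical sink
    .option("checkpointLocation", "s3a://example-bucket/chk/")  # hypothetical checkpoint
    .start()
)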

Environment: Linux, Redshift, RDS, Snowflake, Tableau, EC2, AWS Data Pipeline, Hadoop, Kubernetes, S3, Unix shell scripting, Spark, GitHub, Scala, Kafka.

Client: Magellan Health, Bangalore, India Sep 2017 – Aug 2019

Role: Big Data Engineer

Responsibilities:

Involved in requirements gathering and analysis: understanding the client's requirements and the flow of the application.

Migrated an in-house database to the AWS cloud and designed, built, and deployed multiple applications utilizing the AWS stack (including S3, EC2, RDS, and Athena), focusing on high availability and auto-scaling.

Created an AWS RDS MySQL DB cluster and connected to the database through an Amazon RDS MySQL DB Instance using the Amazon RDS Console.

Created external tables with partitions using Hive, AWS Athena.

Used Cloning and Time Travel features in Snowflake to ensure maintenance and availability of historical data.

Developed stored procedures/views in Snowflake and used Talend for loading Dimensions and Facts.

Implemented AWS Step Functions to automate and orchestrate Amazon SageMaker tasks such as publishing data to S3, training the ML model, and deploying it for prediction.

Integrated Apache Airflow with AWS to monitor multi-stage ML workflows with the tasks running on Amazon SageMaker.

Optimized data processing workflows by identifying and addressing performance bottlenecks in Bash scripts and Unix/Linux command line tools.

Worked with EMR and set up the Hadoop environment on AWS EC2 instances.

Analyzed existing SQL scripts and redesigned them using PySpark SQL for faster performance.

Created pods, deployments, services, and replication controllers in Kubernetes.

Used CloudWatch Events rules, Lambda functions, and Step Functions to trigger Glue jobs and orchestrate the data pipeline (see the sketch at the end of this section).

Installed and configured Apache Airflow for workflow management and created workflows in Python.

Implemented the AWS Elastic Container Service (ECS) scheduler to automate application deployments in the cloud using Docker automation techniques.
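
The Glue-triggering pattern referenced above can be sketched as a small Lambda handler that Step Functions or a CloudWatch Events rule invokes; the Glue job name is hypothetical.

# Hypothetical Lambda handler that starts a Glue job run.
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    """Start the nightly Glue ETL job and return its run id."""
    response = glue.start_job_run(JobName="nightly-etl-job")  # hypothetical Glue job name
    return {"JobRunId": response["JobRunId"]}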

Environment: Spark, Scala, AWS, ETL, EC2, Hadoop, Python, Snowflake, HDFS, Hive, Amazon SageMaker, Tableau, PySpark, AWS Glue, SQL, Kubernetes.

Client: Visa, Bangalore, India Jun 2016 – Aug 2017

Role: Big Data Developer

Responsibilities:

Designed and Developed ETL data pipelines using Scala and Spark.

Automated workflows using Oozie job scheduling tool.

Tested Big Data Hadoop (HDFS, Hive, Sqoop and Flume), Master Data Management (MDM) and Tableau Reports.

Built data platforms using Azure stack and Spark SQL.

Performed end-to-end delivery of PySpark ETL pipelines on Azure Databricks to transform data, orchestrated via Azure Data Factory (ADF), scheduled through Azure Automation accounts, and triggered using the Tidal Scheduler.

Managed and maintained NoSQL databases such as MongoDB, Cassandra, and DynamoDB.

Conducted data cleansing on unstructured datasets by applying Informatica Data Quality to identify potential errors and improve data integrity and quality.

Created Informatica PowerCenter mappings to execute the data transformation tasks needed to load database tables for later analysis.

Implemented ingestion pipelines to migrate ETL to Hadoop using Spark Streaming and Oozie workflows; loaded unstructured data into the Hadoop Distributed File System (HDFS).

Transformed batch data from several tables containing tens of thousands of records from SQL Server, MySQL, PostgreSQL, and CSV file datasets into DataFrames using PySpark (see the sketch at the end of this section).

Re-coded existing SQL in Redshift to reduce run times and meet SLAs.

Integrated Hadoop into traditional ETL, accelerating the extraction, transformation, and loading of massive semi-structured and unstructured datasets; used HDFS to store the unstructured data.

Developed NiFi workflows to pick up data from REST API servers, the data lake, and SFTP servers and send it to Kafka brokers.

Responsible for extracting data loads from Teradata into the Hadoop environment and creating Hive tables.

Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and PySpark.
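
As a sketch of the batch JDBC loads mentioned above, the following PySpark snippet reads a SQL Server table and a set of CSV files into DataFrames; connection details, table names, and paths are placeholders, and the SQL Server JDBC driver is assumed to be on the classpath.

# Hypothetical PySpark batch load from JDBC and CSV sources.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-batch-load").getOrCreate()

# Read a SQL Server table over JDBC (placeholder connection details).
sqlserver_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://host:1433;databaseName=sales")
    .option("dbtable", "dbo.orders")
    .option("user", "etl_user")
    .option("password", "****")
    .load()
)

# Read CSV datasets from a placeholder landing path.
csv_df = spark.read.option("header", "true").csv("/data/landing/orders/*.csv")

# Register both sources as temp views for downstream Spark SQL transformations.
sqlserver_df.createOrReplaceTempView("orders_sqlserver")
csv_df.createOrReplaceTempView("orders_csv")

# Example downstream check: row counts per source.
row_counts = spark.sql("""
    SELECT 'sqlserver' AS source, COUNT(*) AS n FROM orders_sqlserver
    UNION ALL
    SELECT 'csv' AS source, COUNT(*) AS n FROM orders_csv
""")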

Environment: Python, Hadoop, HDFS, Hive, HBase, Databricks, Oozie, Airflow, Apache Spark, Spark SQL, Kafka, Linux, Tableau, Cassandra, Teradata, Sqoop.

Client: Wipro, Hyderabad, India Nov 2015 – May 2016

Role: Hadoop Engineer

Responsibilities:

Worked on both external and managed Hive tables for optimized performance (see the sketch at the end of this section).

Worked with UNIX/Linux commands and scripting and deployed applications on the servers.

Loaded and transformed large sets of structured and semi-structured data.

Performed data migration from an RDBMS to a NoSQL database and was involved in deploying data to various data systems.

Collaborated with Data Services and business stakeholders to structure business requirements and developed analytical insights.

Involved in designing the end-to-end data pipeline to ingest data into the data lake.

Optimized MapReduce mapper and reducer code and Pig scripts, and performed performance tuning and analysis.

Developed Sqoop jobs to import and store massive volumes of data in HDFS and Hive.

Responsible to manage data coming from different sources.

Used shell scripting and Python.
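
The external vs. managed Hive table pattern noted in the first bullet is sketched below using PySpark with Hive support; the database names, columns, and HDFS path are assumptions.

# Hypothetical external vs. managed Hive tables via PySpark with Hive support.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hive-tables")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("CREATE DATABASE IF NOT EXISTS staging")
spark.sql("CREATE DATABASE IF NOT EXISTS curated")

# External table: Hive tracks only metadata; the data stays at the Sqoop-imported HDFS path.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS staging.customers_ext (
        customer_id STRING,
        customer_name STRING,
        city STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 'hdfs:///data/sqoop/customers'
""")

# Managed table: Hive owns both metadata and data, useful for curated output.
(
    spark.table("staging.customers_ext")
    .filter("customer_id IS NOT NULL")
    .write.mode("overwrite")
    .saveAsTable("curated.customers")
)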

Environment: Hortonworks, Hive, HDFS, Yarn, Shell Scripting, ETL, OLTP, Metadata, Pig, Spark


