
Data Engineer Azure SQL

Location:
Ashburn, VA
Posted:
September 05, 2023

Resume:

DATA ENGINEER

VAISHNAVI

Email: adzhmn@r.postjobfree.com

Phone: +1-571-***-****

PROFESSIONAL SUMMARY:

•Dynamic and motivated IT professional with around 8 years of experience in ETL, Big Data, Python/Scala, relational database management systems (RDBMS), and enterprise-level cloud-based computing and applications.

•Solid Big Data analytics work experience, including installing, configuring, and using components such as Hadoop MapReduce, HDFS, HBase, Zookeeper, Hive, Sqoop, Pig, Flume, Cassandra, Kafka, and Spark.

•Experience in data architecture including data ingestion pipeline design, Hadoop information architecture, data modeling, data mining, and advanced data processing.

•Excellent knowledge of Hadoop Architecture and various components such as HDFS, MapReduce, Hadoop GEN2 Federation, High Availability, YARN architecture, workload management, schedulers, scalability, and distributed platform architectures.

•Expertise in developing Spark applications using the PySpark and Spark Streaming APIs in Python and deploying them on YARN in client and cluster modes (a brief PySpark sketch follows this summary).

•Detailed exposure to Azure tools such as Azure Data Lake, Azure Databricks, Azure Data Factory, HDInsight, and Azure SQL.

•Designed and implemented database solutions in Azure SQL Data Warehouse and Azure SQL.

•Experience importing and exporting data between HDFS and relational database systems using Sqoop. Hands-on experience with VPN, PuTTY, WinSCP, and CI/CD (Jenkins).

•Solid understanding and experience with large-scale data warehousing and E2E data integration solutions on Snowflake Cloud and AWS Redshift.

•Created Azure SQL databases, performed monitoring and restores of Azure SQL databases, and migrated Microsoft SQL Server to Azure SQL Database.

•Exposure to Cloudera installation on Azure cloud instances.

•Good knowledge of using job scheduling tools like Oozie, Airflow, Autosys and Cronos.

•Experience building Docker images to run Airflow in a local environment for testing ingestion and ETL pipelines.

•Experienced in securing Hadoop clusters with Kerberos and integrating with LDAP/AD at the enterprise level.

•Improved continuous integration workflow, project testing, and deployments with Jenkins.

•Customized the dashboards and managed user and group permissions on Identity and Access Management (IAM) in AWS.

•Hands-on experience writing SQL, creating databases, tables, and joins; designing Oracle databases and developing applications, with strong knowledge of PL/SQL.

•Experience using dbt to streamline raw data into structured formats for analysis.

•Ensured consistent and error-free data transformations with dbt models.

•Well versed in Big Data cloud services such as EMR, EC2, Athena, S3, Glue, DynamoDB, and Redshift.

•Experience in all the phases of Data warehouse life cycle involving Requirement Analysis, Design, Coding, Testing, and Deployment.

•Good knowledge of NoSQL databases and of writing applications on HBase.

•Experience with UNIX commands and shell scripting.

•Extensive experience working with AWS cloud services and AWS SDKs, including AWS API Gateway, Lambda, S3, IAM, and EC2.
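A minimal PySpark sketch of the kind of batch cleansing job referenced in the PySpark bullet above; the paths, columns, and application name are illustrative assumptions, not details from any specific engagement.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, trim

# Hypothetical job; typically submitted with
# spark-submit --master yarn --deploy-mode cluster cleanse.py
spark = SparkSession.builder.appName("customer-cleansing").getOrCreate()

# Read raw landing data (placeholder HDFS path and columns).
raw = spark.read.option("header", True).csv("hdfs:///data/landing/customers.csv")

cleaned = (
    raw.withColumn("email", trim(col("email")))          # basic cleansing
       .filter(col("customer_id").isNotNull())           # validation
       .dropDuplicates(["customer_id"])                  # de-duplication
)

cleaned.write.mode("overwrite").parquet("hdfs:///data/curated/customers")
spark.stop()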

SKILL SET

Big Data Ecosystem

HDFS, Hive, MapReduce, Hadoop distribution, and HBase, Spark, Spark Streaming, Yarn, Zookeeper, Kafka, Pig, Sqoop, Flume, Oozie

Hadoop Distributions

Apache Hadoop 2.x/1.x, Cloudera CDP, Hortonworks HDP, Amazon EMR (EMR, EC2, EBS, RDS, S3, Athena, AppFlow, Glue, Elasticsearch, Lambda, DynamoDB, Redshift, ECS), Azure HDInsight (Databricks, Data Lake, Blob Storage, Data Factory ADF, SQL DB, SQL DWH, Cosmos DB, Azure AD)

Programming Languages

Python, R, Scala, Spark SQL, SQL, T-SQL, HiveQL, PL/SQL, UNIX shell Scripting, Pig Latin, C#

Cloud

AWS, Microsoft Azure, GCP

SQL Databases

MySQL, Teradata, Oracle, MS SQL Server, PostgreSQL, DB2

NoSQL Databases

HBase, Cassandra, MongoDB, DynamoDB, and Cosmos DB

DevOps Tools

Jenkins, Docker, Maven

PROFESSIONAL WORK EXPERIENCE:

Senior Data Engineer (Feb 2022 – Present)

Lincoln Financial Group, PA

Key Responsibilities:

•Involved in writing Spark applications using Python to perform various data cleansing, validation, transformation, and summarization activities according to the requirement.

•Developed multiple POCs using PySpark and deployed them on the YARN cluster, compared the performance of Spark with Hive and SQL/Teradata, and developed code to read multiple data formats on HDFS using PySpark.

•Assisted with Hadoop infrastructure upgrades, configuration, and maintenance, such as Pig, Hive, and HBase.

•Created data flows using AWS AppFlow to move data from sources such as Salesforce to S3.

•Improved the performance and optimization of existing Hadoop jobs using Spark Context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.

•Developed a Spark job in Python that indexes data into Elasticsearch from external Hive tables stored in HDFS.

•Worked on the design and implementation of a Hadoop cluster as well as a variety of Big Data analytical tools such as Pig, Hive, HBase, Oozie, Zookeeper, Sqoop, Flume, Spark, and Impala.

•Used Git to track changes and collaborate effectively on dbt models.

•Created reusable dbt models for consistent data transformations.

•Used dbt's incremental processing to speed up data refreshes by handling only changed records.

•Generated ETL scripts to transform, flatten and enrich the data from source to target using AWS Glue and created event driven ETL pipelines with AWS Glue.

•Used AWS Glue catalog with crawler to get the data from S3 and perform SQL query operations.

•Imported the data from various sources like HDFS/HBase into Spark RDD and developed a data pipeline using Kafka and Storm to store data into HDFS.

•Used Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS (see the streaming sketch after this role's Environment line).

•Configured Oozie workflow to run multiple Hive and Pig jobs which run independently with time and data availability.

•Imported and exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.

•Developed Python code defining tasks, dependencies, SLA watchers, and time sensors for each job, for workflow management and automation in Airflow (a minimal DAG sketch also follows this role's Environment line).

Environment: AWS EMR, AWS Glue, AWS Appflow, Redshift, Hadoop, HDFS, Teradata, Python, SQL, Oracle, Hive, HBase, Oozie, Zookeeper, Flume, Spark, Impala, Apache Airflow, Sqoop, Excel.
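A minimal sketch of the Kafka-to-HDFS streaming pattern mentioned above, written here with the Structured Streaming API as one possible implementation; broker, topic, and path names are placeholders, not project values.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Requires the spark-sql-kafka package on the classpath when submitted.
spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
    .option("subscribe", "policy-events")                # placeholder topic
    .option("startingOffsets", "latest")
    .load()
    .select(col("value").cast("string").alias("payload"), col("timestamp"))
)

query = (
    events.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/raw/policy_events")            # placeholder path
    .option("checkpointLocation", "hdfs:///checkpoints/policy")  # for recovery
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()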
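And a minimal Airflow sketch of the task, dependency, SLA-watcher, and time-sensor pattern from the workflow bullet above; the DAG, task, and schedule names are hypothetical.

from datetime import datetime, time, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.time_sensor import TimeSensor

def ingest():
    print("ingest source files")        # placeholder task logic

def transform():
    print("run transformations")        # placeholder task logic

default_args = {
    "owner": "data-engineering",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    "sla": timedelta(hours=1),          # SLA watcher: flag tasks that run long
}

with DAG(
    dag_id="daily_ingestion",           # hypothetical DAG name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    wait_for_window = TimeSensor(task_id="wait_for_window",
                                 target_time=time(hour=2))   # hold until 02:00
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    wait_for_window >> ingest_task >> transform_task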

Role: Senior Data Engineer (May 2021 - Jan 2022)

Client: CVS Health, RI

Key Responsibilities:

•Experience with the complete SDLC process, including code reviews, source code management, and build process.

•Experience building and architecting multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP, and coordinating tasks within the team.

•Experience loading data into BigQuery using Google Dataproc, GCS buckets, Hive, Spark, Scala, Python, gsutil, and shell scripts.

•Developed Terraform scripts and deployed them via Cloud Deployment Manager to spin up resources such as cloud virtual networks.

•Used Spark SQL to load JSON data and create schema RDDs, loaded them into Hive tables, and handled structured data using Spark SQL (a brief sketch follows this role's Environment line).

•Managed model dependencies and minimized errors in dbt.

•Used dbt tests to verify data quality and accuracy, enhancing integrity.

•Auto-generated dbt documentation for easy sharing and understanding.

•Used dbt to handle complex data transformations by breaking them down.

•Experience creating complex schemas and tables for analysis using Hive.

•Set up the required Hadoop environments for clusters to run MapReduce jobs.

•Experience building ETL systems using Python and the in-memory computing framework Apache Spark, and scheduling and maintaining data pipelines at regular intervals in Apache Airflow.

•Invoked Python scripts for data transformations on large data sets in Azure Stream Analytics and Azure Databricks.

•Worked with HBase and MySQL for optimizing data, and with SequenceFile, Avro, and Parquet file formats.

•Created data frames in Azure Databricks using Apache Spark to perform business analysis.

•Optimized and implemented data storage formats (e.g., Parquet, Delta Lake) in Databricks with effective partitioning strategies.

•Experience in building some complex data pipelines using PL/SQL scripts, Cloud REST APIs, Python scripts, GCP Composer, and GCP Dataflow.

•Monitored BigQuery, DataProc, and cloud Dataflow jobs via Stackdriver for all environments and opened an SSH tunnel to Google DataProc to access the Yarn manager to monitor Spark jobs.

•Submitted Spark jobs using gsutil and spark-submit for execution on the Dataproc cluster.

•Experience performing data wrangling to clean, transform and reshape the data utilizing pandas library.

•Used digital signage software to display real-time metrics such as data transfer rates, system uptime, and system response times, mainly to monitor the progress of the migration in real time.

•Experience analyzing data using SQL, Scala, Python, Apache Spark and presented analytical reports to management and technical teams.

•Created firewall rules to access Google DataProc from other machines.

•Used Python to ingest data into BigQuery (a minimal load sketch also follows this role's Environment line).

•Involved in migrating existing on-premises Hive code to GCP (Google Cloud Platform) BigQuery.

•Monitored Hadoop cluster connectivity and security using tools such as Zookeeper and Hive.

•Managed and reviewed Hadoop log files.

•Followed the Agile (Scrum) development methodology to develop the application.

•Involved in managing the backup and disaster recovery for Hadoop data.

Environment: Google Cloud Platform (GCP), BigQuery, Google DataProc, GCS (Google Cloud Storage), Hive, Spark, Scala, Python, Gsutil, Shell Script, Terraform, Azure Stream Analytics, Azure Databricks, HBase, MySQL, Parquet, Delta Lake, PL/SQL, Cloud REST APIs, GCP Composer, GCP Dataflow, Stackdriver, Apache Airflow, Pandas, Agile (Scrum) development methodology, Zookeeper.
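A minimal sketch of the Spark SQL JSON-to-Hive pattern referenced above; the bucket, columns, and table names are illustrative only.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("json-to-hive")
    .enableHiveSupport()                 # needed to write Hive tables
    .getOrCreate()
)

# Spark infers the schema from the JSON documents (placeholder GCS path).
claims = spark.read.json("gs://example-bucket/raw/claims/*.json")

claims.createOrReplaceTempView("claims_raw")
curated = spark.sql("""
    SELECT claim_id,
           member_id,
           CAST(amount AS DECIMAL(12, 2)) AS amount,
           status
    FROM claims_raw
    WHERE status IS NOT NULL
""")

# Persist the structured result as a partitioned Hive table.
(curated.write
    .mode("overwrite")
    .partitionBy("status")
    .saveAsTable("analytics.claims_curated"))   # placeholder database.table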
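And a minimal Python-to-BigQuery ingestion sketch using the google-cloud-bigquery client; project, dataset, and path names are placeholders.

from google.cloud import bigquery

client = bigquery.Client()               # uses application default credentials

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(
    "gs://example-bucket/curated/claims/*.parquet",   # placeholder GCS path
    "example-project.analytics.claims",               # placeholder table id
    job_config=job_config,
)
load_job.result()                        # block until the load completes

table = client.get_table("example-project.analytics.claims")
print(f"Loaded table now has {table.num_rows} rows")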

Senior Data Engineer (June 2020 – April 2021)

Smithfield Foods, Smithfield, VA

Key Responsibilities:

•Used Power BI, Power Pivot to develop data analysis prototype, and used Power View and Power Map to visualize reports.

•Worked on Azure Data Factory to integrate both on-premises (MySQL, Cassandra) and cloud (Blob Storage, Azure SQL DB) data and applied transformations to load it into Azure Synapse.

•Managed, Configured, and scheduled resources across the cluster using Azure Kubernetes Service.

•Monitored Spark cluster using Log Analytics and Ambari Web UI.

•Transitioned log storage from Cassandra to Azure SQL Data Warehouse and improved query performance.

•Involved in developing data ingestion pipelines on Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL and worked with Cosmos DB (SQL API and Mongo API).

•Developed dashboards and visualizations to help business users analyze data as well as providing data insight to upper management with a focus on Microsoft products like SQL Server Reporting Services (SSRS) and Power BI.

•Performed the migration of large data sets to Databricks (Spark): created and administered clusters, loaded data, configured data pipelines, and loaded data from ADLS Gen2 into Databricks using ADF pipelines.

•Created Databricks notebooks to streamline and curate data for various business use cases and mounted Blob Storage in Databricks (a minimal mount-and-curate sketch follows this role's Environment line).

•Utilized Azure Logic Apps to build workflows to schedule and automate batch jobs by integrating apps, ADF pipelines, and other services like HTTP requests, email triggers etc.

•Worked extensively on Azure data factory including data transformations, Integration Runtimes, Azure Key Vaults, Triggers and migrating data factory pipelines to higher environments using ARM Templates.

•Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming for streaming analytics in Databricks.

Environment: Azure SQL DW, Databricks, Azure Synapse, Cosmos DB, ADF, SSRS, Power BI, Azure Data Lake, ARM, Azure HDInsight, Blob Storage, Apache Spark.
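A minimal Databricks notebook sketch of the mount-and-curate pattern described above; the storage account, container, secret scope, and table names are placeholders (dbutils and spark are provided by the Databricks notebook runtime).

storage_account = "examplestorage"       # placeholder storage account
container = "raw"                        # placeholder container
mount_point = "/mnt/raw"

# Mount the Blob Storage container once, pulling the key from a secret scope.
if not any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
    dbutils.fs.mount(
        source=f"wasbs://{container}@{storage_account}.blob.core.windows.net",
        mount_point=mount_point,
        extra_configs={
            f"fs.azure.account.key.{storage_account}.blob.core.windows.net":
                dbutils.secrets.get(scope="example-scope", key="storage-key")
        },
    )

# Build a DataFrame over the mounted data and curate it for reporting.
orders = spark.read.format("delta").load(f"{mount_point}/orders")
daily_sales = (
    orders.groupBy("order_date")
          .sum("amount")
          .withColumnRenamed("sum(amount)", "total_amount")
)
daily_sales.write.mode("overwrite").saveAsTable("curated.daily_sales")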

Senior Data Engineer (Jan 2019 – May 2020)

FedEx, Pittsburgh, PA

Key Responsibilities:

•Designed and developed scalable and cost-effective architecture in AWS Big Data services for data life cycle, collection, ingestion, storage, processing, and visualization.

•Used Spark SQL with Scala for creating data frames and performed transformations on data frames.

•Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.

•Created Airflow production job status reports for stakeholders.

•Developed Spark scripts and Python functions that perform transformations and actions on data sets.

•Worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially done in Python (PySpark).

•Extracted data from multiple systems and sources using Python and loaded the data into AWS EMR.

•Converted all Hadoop jobs to run in EMR by configuring the cluster according to the data size.

•Responsible for creating on-demand tables over S3 files using Lambda functions and AWS Glue with Python and PySpark (a minimal sketch follows this role's Environment line).

•Created monitors, alarms, notifications and logs for Lambda functions, Glue jobs, EC2 hosts using CloudWatch and used AWS Glue for data transformation, validation, and cleansing.

•Performed quality checks on existing code to improve performance.

•Implemented partitioning, dynamic partitions, and buckets in Hive to increase performance and organize data logically (see the Hive DDL sketch after this role's Environment line).

•Worked in the BI team in Big Data Hadoop cluster implementation and data integration in developing large-scale system software.

•Created and modified shell scripts for scheduling various data cleansing scripts and ETL loading processes.

•Designed data warehouses/databases and data models, building SQL objects such as tables, views, user-defined/table-valued functions, stored procedures, triggers, and indexes.

•Designed and implemented scalable cloud data and analytics solutions for various public and private cloud platforms using Azure.

•Designed and implemented distributed data processing pipelines using Apache Spark, Hive, Python, Airflow DAGs and other tools and languages in Hadoop Ecosystem.

Environment: AWS EMR, AWS Glue, AWS Appflow, ETL, Redshift, Hadoop, HDFS, Teradata, Python, SQL, Oracle, Hive, HBase, Oozie, Zookeeper, Flume, Spark, Impala, Apache Airflow, Sqoop, GitHub, Agile.
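A minimal sketch of the on-demand S3 table pattern referenced above: a Lambda handler that triggers an AWS Glue crawler and an ETL job. The crawler, job, and bucket names are hypothetical.

import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    """Invoked on demand (e.g., by an S3 put event) to catalog new files."""
    # The crawler points at the S3 prefix and creates/updates the catalog table.
    glue.start_crawler(Name="example-shipments-crawler")    # placeholder name

    # Optionally run a Glue ETL job once the data is cataloged.
    response = glue.start_job_run(
        JobName="example-shipments-etl",                    # placeholder job
        Arguments={"--source_prefix": "s3://example-bucket/landing/shipments/"},
    )
    return {"glue_job_run_id": response["JobRunId"]}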
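And a minimal sketch of the Hive partitioning and bucketing pattern from the bullets above, issued through Spark SQL in Python; the database, table, and column names are illustrative only.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-partitioning")
    .enableHiveSupport()
    .getOrCreate()
)

# Allow dynamic partitions so partition values come from the data itself.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

# Partition by ship_date and bucket by tracking_id for faster scans and joins.
spark.sql("""
    CREATE TABLE IF NOT EXISTS logistics.shipments_part (
        tracking_id STRING,
        origin      STRING,
        destination STRING,
        weight_kg   DOUBLE
    )
    PARTITIONED BY (ship_date STRING)
    CLUSTERED BY (tracking_id) INTO 32 BUCKETS
    STORED AS PARQUET
""")

# Dynamic-partition insert: each distinct ship_date lands in its own partition.
spark.sql("""
    INSERT OVERWRITE TABLE logistics.shipments_part PARTITION (ship_date)
    SELECT tracking_id, origin, destination, weight_kg, ship_date
    FROM logistics.shipments_staging
""")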

Role: Data Engineer (SQL/SSIS/ETL)

Client: Anthem Health – Woodland Hills, CA (June 2017 – Dec 2018)

Key Responsibilities:

•Worked on the final SQL database architecture design and ETL process.

•Created a workflow containing 30 packages to migrate data from old legacy systems to CUP.

•Developed stored procedures that interact with the previous general search for different criteria.

•Fixed ETL errors and performed manual loads using SQL stored procedures.

•Designed a client platform using SQL Server Integration Services (SSIS) to integrate new metrics at the client site.

•Applied transformations to various incoming data to make changes as needed.

•Performed performance tuning to improve execution time by selecting appropriate transformations.

•Created files, forms, tables, and databases to support business processes and analysis teams.

•Created and scheduled SQL Agent jobs and maintenance schedules.

•Recompiled stored procedures after updates, flushing cached parameters with the stored procedure's recompile option, to eliminate performance overhead and optimize the execution plans of T-SQL queries.

•Wrote SQL queries against persistent tables using nested types, dynamic SQL, and join clauses.

•Created and monitored indexes, reducing processing time from minutes to seconds.

Environment: MS SQL Server 2012, SSIS, SSAS, SSRS, DTS, TCAD V2R5, Sun SPARC 3000 (hardware), SaaS, Microsoft Office Suite, Visual Studio .NET, Hadoop 2.5.2, Crystal Reports 2008/XI/10.0/9.0/8.5.

Role: Data Engineer (Sep 2016 – May 2017)

Client: Capitol Federal Savings Bank, Topeka, Kansas

Key Responsibilities:

•Collaborated in SDLC, designing databases using T-SQL.

•Crafted complex T-SQL queries, triggers, and optimized procedures.

•Managed data with Talend ETL and Unix scripting, across various databases.

•Expert in Talend; designed transformations, managed anomalies, and executed migrations.

•Implemented Power BI models, custom roles, and automation, training users.

•Developed and tuned Talend ETL packages for seamless integration.

•Collaborated cross-functionally to meet project data goals.

•Scheduled and optimized jobs with SQL Server Agent and Talend.

•Enhanced Spark job performance and built machine learning models using Python (a brief tuning sketch follows this role's Environment line).

•Created data pipelines, managed workflows via Apache Airflow.

•Troubleshot, debugged using shell scripting.

Environment: SQL, T-SQL, Triggers, ETL, UNIX, Talend, Power BI, Python, Data Pipelines, Shell Scripting.
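A minimal sketch of the kind of Spark performance tuning referenced above (shuffle-partition sizing, caching, and broadcast joins); the table and column names are placeholders.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = (
    SparkSession.builder
    .appName("spark-tuning")
    .config("spark.sql.shuffle.partitions", "200")    # sized to data volume
    .getOrCreate()
)

transactions = spark.table("staging.transactions")    # large fact table
accounts = spark.table("staging.accounts")            # small dimension table

# Cache a DataFrame reused by several downstream aggregations.
transactions = transactions.cache()

# Broadcast the small table to avoid shuffling the large one during the join.
enriched = transactions.join(broadcast(accounts), on="account_id", how="left")

daily = enriched.groupBy("posting_date").count()
daily.write.mode("overwrite").saveAsTable("curated.daily_txn_counts")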


