Data Engineer Azure

Location:
Charlotte, NC
Salary:
$75
Posted:
November 21, 2023

Resume:

SEEMA

Email: ad1cgb@r.postjobfree.com PH: 803-***-****

Sr. Data Engineer | https://www.linkedin.com/in/seema-m-620962278

PROFESSIONAL SUMMARY:

8+ years of Full Software Development Life Cycle (SDLC) experience in Software, System Analysis, Design, Development, Testing, Deployment, Maintenance, Enhancements, Re-Engineering, Migration, Troubleshooting and Support of multi-tiered web applications in high performing environments.

Hands-on experience in the Hadoop Framework and its ecosystem, including HDFS architecture, MapReduce programming, Hive, Pig, Sqoop, HBase, Zookeeper, Couchbase, Storm, Solr, Oozie, Spark, Scala, Flume, and Kafka.

Good experience in installing, configuring, and administering Hadoop clusters of the major Hadoop distributions Hortonworks and Cloudera.

Hands-on experience on Google Cloud Platform (GCP) across its Big Data products: BigQuery, Cloud Dataproc, Google Cloud Storage, and Composer (Airflow as a service).

Expert in working with Cloud Pub/Sub to replicate data in real time from source systems to GCP BigQuery.

Experienced with GCP's compute and networking services, such as virtual machines (VMs), Kubernetes clusters, and serverless options like Cloud Functions and App Engine.

Experience in batch processing and writing programs using Apache Spark for real-time analytics and streaming of data.

Experience in migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.

Good understanding of Zookeeper and Kafka for monitoring and managing Hadoop jobs, and used Cloudera CDH4 and CDH5 for monitoring and managing the Hadoop cluster.

Experience in Analysis, Design, and Development of Big Data solutions in Scala, Spark, Hadoop, Pig, and HDFS environments.

Good experience with design, coding, debugging operations, reporting, and data analysis utilizing Python, and using Python libraries to speed up development.

Experience with Airflow, Spark, Scala, Python, and PySpark.

Expertise in deploying Hadoop Architecture and various components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, and Map Reduce/ Yarn concepts.

Experience in loading the data into Spark RDD and performing in-memory data computation to generate the output responses.

Hands on experience in programming using Python, Scala, Java and SQL.

Good working experience on Spark (Spark Core Component, Spark SQL, Spark Streaming, Spark MLlib, Spark GraphX, SparkR) with Scala and Kafka.

Understanding of Spark Architecture including Spark SQL, Data Frames, Spark Streaming, experience in analyzing Data with Spark while using Scala.

Hands-on experience using Amazon Web Services such as S3, VPC, EC2, Auto Scaling, Redshift, DynamoDB, Route 53, RDS, Glacier, and EMR.

Knowledge of various scripting languages like Linux/Unix shell scripting and Python, continuous integration, automated deployment (CI/CD), and management using Jenkins.

Experience in moving raw data between different systems using Apache NiFi.

Experience in data center migration, Azure Data Factory (ADF) V2, database management, and Azure Data Platform services: Azure Data Lake Store (ADLS), Data Factory (ADF), Data Lake Analytics, Stream Analytics, Azure SQL DW, and HDInsight/Databricks.

Experience in developing and designing POCs using Scala and deployed on the Yarn cluster, compared the performance of Spark with Hive and SQL/ Teradata.

Strong understanding of RDD operations in Apache Spark i.e., Transformations, Actions, Persistence (Caching), Accumulators, Broadcast Variables, and Optimizing Broadcasts.

In-depth understanding of Apache Spark job execution components such as the DAG, lineage graph, DAG scheduler, task scheduler, stages, and tasks.

Experience in migrating Python Machine learning modules to scalable, high-performance, and fault-tolerant distributed systems like Apache Spark.

Expertise in developing multiple Spark jobs in Scala for data cleaning, pre-processing and aggregating.

Expertise in working with the Hive data warehouse tool: creating tables, distributing data by implementing partitioning and bucketing, and writing and optimizing HiveQL queries.

Excellent understanding and knowledge of job workflow scheduling and locking tools/ services like Oozie and Zookeeper.

Knowledge of ETL methods for data extraction, transformation and loading in corporate-wide ETL Solutions and Data Warehouse tools for reporting and data analysis.

Comprehensive knowledge in debugging, optimizing, and performance tuning of DB2, Oracle, and MySQL databases.

Good experience with Informatica, AWS Glue for designing ETL Jobs for Processing of data.

Experience with Azure transformation projects, implementing ETL and data movement solutions using Azure Data Factory (ADF) and SSIS.

Good understanding and knowledge of NoSQL databases like MongoDB, HBase and Cassandra.

Experience in Monitoring Hadoop clusters using tools like Nagios, CloudWatch, and Cloudera Manager.

Experience with operating systems such as Windows, Linux, RedHat, and UNIX.

Experienced and skilled Agile Developer with a strong record of excellent teamwork and successful coding.

Strong Problem Solving and Analytical skills and abilities to make Balanced Independent decisions.

Exceptional ability to quickly master new concepts; capable of working in a group as well as independently, with excellent communication skills.

TECHNICAL SKILLS:

Operating Systems: Linux, UNIX, Windows

GCP: Dataproc, Cloud Composer, BigQuery, GCS, Pub/Sub, Dataflow

Development Methodologies: Agile/Scrum, Waterfall

IDEs: Eclipse, NetBeans, IntelliJ

Big Data Platforms: Hortonworks, Cloudera CDH4, CDH5

Programming Languages: Python, Scala, SQL; Python Libraries: NumPy, SciPy, Pandas, NLTK, Matplotlib, Beautiful Soup, TextBlob

Hadoop Components: HDFS, Sqoop, Hive, Pig, MapReduce, YARN, Impala, Hue, Zookeeper

Spark Modules: Spark Core Component, Spark SQL, Spark Streaming, Spark MLlib, Spark GraphX, SparkR

RDBMS/NOSQL Databases: Oracle, MYSQL, Microsoft SQL Server, HBase, MongoDB, Cassandra

Cloud Technologies: Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP)

AWS Services: EC2, IAM, S3, Auto Scaling, CloudWatch, Route 53, EMR, Redshift, Glue

Azure Services: Azure SQL DW, HDInsight, Databricks, Azure Data Factory

DevOps Tools: Ant, Maven, Jenkins, Git, GitHub

Visualization and Reporting: Power BI, Tableau

PROFESSIONAL EXPERIENCE:

Merck Pharma, Charlotte, NC January 2022 to Present

Role: GCP Data Engineer

Responsibilities:

Involved in all phases of Software Development Life Cycle (SDLC) such as requirements gathering, modelling, analysis, design, development, and testing.

Built multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP, and coordinated tasks among the team.

Developed and deployed Spark and Scala code on a Hadoop cluster running on GCP.

Developed and deployed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.

Developed HIVE UDFs to incorporate external business logic into Hive script and developed join data set scripts using HIVE join operations.

Used Google Cloud Functions with Python to load data into BigQuery upon the arrival of CSV files in a GCS bucket.
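
For illustration, a minimal sketch of such a GCS-triggered Cloud Function using the google-cloud-bigquery client; the project, dataset, and table names below are hypothetical:

```python
# Illustrative sketch only: a GCS-triggered Cloud Function that loads a newly
# arrived CSV file into BigQuery. Project/dataset/table names are hypothetical.
from google.cloud import bigquery

def load_csv_to_bq(event, context):
    """Triggered by an object-finalize event on a GCS bucket."""
    if not event["name"].endswith(".csv"):
        return  # ignore non-CSV objects
    uri = f"gs://{event['bucket']}/{event['name']}"

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,                      # infer schema from the file
        write_disposition="WRITE_APPEND",
    )
    load_job = client.load_table_from_uri(
        uri, "my_project.staging.events", job_config=job_config
    )
    load_job.result()  # wait for the load job to finish
```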

Processed and loaded bounded and unbounded data from Google Pub/Sub topics to BigQuery using Cloud Dataflow with Python.
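
As an illustration, a streaming Apache Beam pipeline of this shape (run on Dataflow) could read from a Pub/Sub subscription and write to BigQuery; the subscription, table, and schema below are hypothetical:

```python
# Illustrative Beam sketch: streaming Pub/Sub -> BigQuery. Names are hypothetical.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)  # add --runner=DataflowRunner etc. when launching

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/events-sub")
        | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:analytics.events",
            schema="event_id:STRING,event_ts:TIMESTAMP,payload:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)
    )
```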

Migrated an Oracle SQL ETL to run on Google Cloud Platform using Cloud Dataproc and BigQuery, with Cloud Pub/Sub triggering the Airflow jobs.

Worked with Hive, Spark SQL, and BigQuery through Python client libraries, building interoperable and faster programs for analytics platforms.

Created various Hive external tables and staging tables and joined the tables as per the requirement. Implemented static partitioning, dynamic partitioning, and bucketing.
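
One possible shape of such DDL, sketched here through spark.sql with Hive support; the database, table, column names, and storage location are hypothetical:

```python
# Illustrative sketch: Hive external table with partitioning and bucketing,
# plus a dynamic-partition insert. All names and paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS staging.events (
        event_id STRING,
        customer_id STRING,
        amount DOUBLE
    )
    PARTITIONED BY (event_date STRING)
    CLUSTERED BY (customer_id) INTO 32 BUCKETS
    STORED AS ORC
    LOCATION '/data/staging/events'
""")

# Allow dynamic partitions, then load from a raw table partition by partition.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
    INSERT OVERWRITE TABLE staging.events PARTITION (event_date)
    SELECT event_id, customer_id, amount, event_date
    FROM raw.events
""")
```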

Worked with Spark to optimize existing algorithms in Hadoop using SparkContext, Spark SQL, PySpark, pair RDDs, and Spark on YARN.

Involved in migrating the on-premises Hadoop system to GCP (Google Cloud Platform).

Migrated previously written cron jobs to Airflow/Composer in GCP.

Involved in migrating the client's data warehouse architecture from on-premises to the MS Azure cloud.

Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for huge volumes of data.

Worked with NoSQL databases like HBase in creating HBase tables to load large sets of semi-structured data coming from various sources.

Experience in GCP Dataproc, Dataflow, Pub/Sub, GCS, Cloud Functions, BigQuery, Stackdriver/Cloud Logging, IAM, and Data Studio for reporting.

Created pipelines in ADF using linked services to extract, transform and load data from multiple sources like Azure SQL, Blob storage and Azure SQL Data warehouse.

Implemented Cluster for NoSQL tool HBase as a part of POC to address HBase limitations.

Used Spark DataFrame operations to perform the required validations on the data and to run analytics on the Hive data.
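
A minimal PySpark sketch of the kind of validation this refers to, assuming a hypothetical Hive table and column names:

```python
# Illustrative sketch: basic DataFrame validations over a Hive table.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.table("staging.claims")  # hypothetical Hive table

# Reject rows with missing keys or non-positive amounts.
valid = df.filter(F.col("claim_id").isNotNull() & (F.col("amount") > 0))

# Flag duplicate claim IDs for review.
dupes = valid.groupBy("claim_id").count().filter(F.col("count") > 1)

# Simple analytics on the cleansed data.
summary = valid.groupBy("claim_type").agg(
    F.count("*").alias("claims"),
    F.sum("amount").alias("total_amount"),
)
summary.show()
```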

Developed Apache Spark applications by using Spark for data processing from various streaming sources.

Design and implement end-to-end data solutions (storage, integration, processing, and visualization) in Azure.

Worked on the architecture and components of Spark; efficient in working with Spark Core and Spark SQL.

Used Kafka and Kafka brokers to initiate the Spark context and process live streaming data.

Experience in GCP Dataproc, GCS, Cloud functions, Cloud SQL & BigQuery.

Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Cloud Storage, and BigQuery.

Worked on GCP POC to migrate data and applications from On-Prem to Google Cloud.

Exposure to IAM roles in GCP.

Developed custom Kafka producers and consumers for publishing to and subscribing from different Kafka topics.
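
As a sketch, using the kafka-python package (one common choice; the broker address and topic names are hypothetical):

```python
# Illustrative sketch with kafka-python; broker and topic names are hypothetical.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="broker:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("claims-events", {"claim_id": "C-100", "status": "RECEIVED"})
producer.flush()

consumer = KafkaConsumer(
    "claims-events",
    bootstrap_servers="broker:9092",
    group_id="claims-etl",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.value)  # hand off to downstream processing here
```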

Migrated MapReduce jobs to Spark jobs to achieve better performance.

Developed Python, shell/Perl, and PowerShell scripts for automation and performed component unit testing using the Azure Emulator.

Worked on designing the MapReduce and YARN flow, writing MapReduce scripts, performance tuning, and debugging.

Used Hue for running Hive queries. Created partitions by day in Hive to improve performance.

Implemented ad-hoc analysis solutions using Azure Data Lake Analytics/Store and HDInsight.

Developed Oozie workflows to run multiple Hive, Pig, Sqoop, and Spark jobs.

Implemented ETL and data movement solutions using Azure Data Factory (ADF) and SSIS.

Developed Airflow DAGs in Python by importing the Airflow libraries.
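
A minimal DAG sketch of this kind (Airflow 2.x style; the DAG ID, schedule, and callables are hypothetical):

```python
# Illustrative Airflow 2.x DAG sketch; IDs and callables are hypothetical.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pull data from the source system")


def load():
    print("load data into the warehouse")


with DAG(
    dag_id="daily_ingest",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task
```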

Worked on auto scaling the instances to design cost effective, fault tolerant and highly reliable systems.

Environment: MS Azure, GCP, Azure SQL, Blob Storage, Azure Data Factory (ADF), HDInsight, Python, OLAP, Hadoop (HDFS, MapReduce), YARN, Spark, Spark Context, Spark SQL, PySpark, Pair RDDs, Spark DataFrames, Spark YARN, Hive, Pig, HBase, Oozie, Hue, Sqoop, Flume, Oracle, NiFi, Kafka.

UBS Weehawken, NJ August 2019 to December 2021

Role: Azure/GCP Data Engineer

Responsibilities:

Analyzed stories, participated in grooming sessions and point estimation for development according to agile methodology.

Experience in moving data between GCP and Azure using Azure Data Factory.

Developed Spark jobs using Scala and Python on top of Yarn/ MRv2 for interactive and Batch Analysis.

Worked with Spark to improve performance and optimize the existing algorithms in Hadoop using SparkContext, Spark SQL, PySpark, pair RDDs, Spark on YARN, and Spark MLlib.

Involved in migrating the existing on-premises Hive code to GCP BigQuery.

Involved in implementing data movement from on-premises to the cloud in MS Azure.

Evaluated the performance of Apache Spark in analyzing genomic data.

Developed a NiFi workflow to pick up data from an SFTP server and send it to a Kafka broker.

Used Python (NumPy, SciPy, Pandas, NLTK, Matplotlib, Beautiful Soup, and TextBlob) to complete the ETL process on clinical data for future NLP analysis.
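
A small pandas/TextBlob sketch in that spirit, with hypothetical file and column names standing in for the actual clinical data:

```python
# Illustrative text-ETL sketch; file and column names are hypothetical.
import pandas as pd
from textblob import TextBlob

df = pd.read_csv("clinical_notes.csv")            # extract
df["note"] = df["note"].fillna("").str.strip()    # transform: basic cleansing
df = df[df["note"] != ""]

# Lightweight NLP features for later analysis.
df["polarity"] = df["note"].apply(lambda t: TextBlob(t).sentiment.polarity)
df["word_count"] = df["note"].str.split().str.len()

df.to_parquet("clinical_notes_clean.parquet", index=False)  # load
```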

Used REST APIs with Python to ingest data from external sites into BigQuery.
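
One minimal way such an ingest could look, assuming a hypothetical JSON endpoint and BigQuery table, using the google-cloud-bigquery streaming insert API:

```python
# Illustrative sketch: pull JSON from a REST API and stream it into BigQuery.
# The endpoint URL and table name are hypothetical.
import requests
from google.cloud import bigquery

resp = requests.get("https://api.example.com/v1/records", timeout=30)
resp.raise_for_status()
rows = resp.json()  # expected: a list of dicts matching the table schema

client = bigquery.Client()
errors = client.insert_rows_json("my_project.ingest.api_records", rows)
if errors:
    raise RuntimeError(f"BigQuery insert errors: {errors}")
```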

Migrated data from traditional database systems to Azure SQL databases.

Used different machine learning modules for running the Spark jobs on a daily and weekly basis.

Migrated complex MapReduce programs into Spark RDD transformations and actions.

Involved in migrating the on-premises Hadoop system to GCP (Google Cloud Platform).

Hands-on experience in migrating on-premises ETLs to Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, and Composer.

Worked on Azure Databricks to run Spark Python notebooks through ADF pipelines.

Designed and developed Oozie workflows to orchestrate Hive scripts and Sqoop jobs.

Involved in developing the Spark framework to provide structure to data on the fly and process it using Spark Core APIs, DataFrames, Spark SQL, and Scala. Evaluated and improved application performance with Spark.

Recreated existing application logic and functionality in the Azure Data Lake, Data Factory, and Azure SQL Database.

Used Google Cloud Functions with Python to load data into BigQuery upon the arrival of CSV files in a GCS bucket.

Processed and loaded bounded and unbounded data from Google Pub/Sub topics to BigQuery using Cloud Dataflow with Python.

Migrated previously written cron jobs to Airflow/Composer in GCP.

Developed Spark scripts to import large files from Azure and imported data from different sources such as HDFS/HBase into Spark RDDs.

Worked with Hive, Spark SQL, and BigQuery through Python client libraries, building interoperable and faster programs for analytics platforms.

Worked extensively in Python and built a custom ingest framework.

Designed, built, and delivered the operational and management tools, frameworks, and processes for the Azure Data Lake, and drove the implementations with the Azure Data Lake Cloud Operations team.

Involved in working with Spark on top of YARN for interactive and batch analysis.

Implemented Spark SQL to update queries based on the business requirements.

Used Spark API over Hortonworks Hadoop YARN to perform analytics on data in HDFS.

Designed column families in Cassandra, ingested data from RDBMS, performed transformations, and exported the data to Cassandra.

Successfully migrated Data Pipeline jobs from Oozie to Airflow.

Developed Airflow DAGs in Python by importing the Airflow libraries.

Developed Scala scripts and UDFs using DataFrames/SQL/Datasets and RDDs in Spark for data aggregation and queries, and wrote data back into the OLTP system through Sqoop.

Created Azure Stream Analytics jobs to replicate the real-time data and load it into Azure SQL Data Warehouse.

Implemented real-time streaming ingestion using Kafka and Spark Streaming.
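
By way of illustration, a Structured Streaming variant of such an ingestion might look like the sketch below (the bullet above mentions Spark Streaming; broker, topic, and paths here are hypothetical, and the Spark Kafka connector package must be on the classpath):

```python
# Illustrative Structured Streaming sketch: Kafka -> Parquet on HDFS.
# Broker, topic, and output paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka_ingest").getOrCreate()

stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "claims-events")
    .option("startingOffsets", "latest")
    .load()
)

parsed = stream.selectExpr(
    "CAST(key AS STRING) AS key",
    "CAST(value AS STRING) AS value",
    "timestamp",
)

query = (
    parsed.writeStream.format("parquet")
    .option("path", "/data/streams/claims")
    .option("checkpointLocation", "/checkpoints/claims")
    .start()
)
query.awaitTermination()
```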

Developed visualizations and dashboards using Power BI.

Created dashboards for analyzing POS data using Power BI.

Environment: Big Data, Spark, Python, GCP, Google BigQuery, Cloud Dataproc, NumPy, SciPy, Pandas, NLTK, Matplotlib, Beautiful Soup, TextBlob, MS Azure, Azure SQL, Azure Data Lake, Azure Data Factory, Scala, YARN, Spark Context, Spark SQL, PySpark, Pair RDDs, Spark YARN, Spark MLlib, NiFi, Kafka, Oozie, OLAP, Power BI.

Avon Technologies Pvt Ltd, Hyderabad, India January 2017 to May 2019

Role: Data Engineer

Responsibilities:

Administered, provisioned, patched, and maintained Cloudera Hadoop clusters on Linux.

Worked on analyzing Hadoop stack and different big data analytic tools including Pig and Hive, HBase database, and Sqoop.

Utilized Python Matplotlib and scikit-learn modules to generate basic prototype visualizations, later recreated using visualization tools such as Tableau.

Wrote multiple MapReduce programs to extract, transform, and aggregate data from more than 20 sources having multiple file formats, including XML, JSON, CSV, and other compressed file formats.
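
As a rough illustration of the map/reduce pattern involved (the actual programs were MapReduce jobs; this is a Hadoop Streaming-style Python sketch over CSV input with hypothetical field positions):

```python
# Illustrative Hadoop Streaming-style mapper and reducer in Python.
# Input format and field positions are hypothetical.
import sys

def mapper():
    for line in sys.stdin:
        fields = line.rstrip("\n").split(",")
        if len(fields) < 3:
            continue                      # skip malformed rows
        source, amount = fields[0], fields[2]
        print(f"{source}\t{amount}")      # emit key<TAB>value

def reducer():
    current_key, total = None, 0.0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{total}")
            current_key, total = key, 0.0
        total += float(value)
    if current_key is not None:
        print(f"{current_key}\t{total}")

if __name__ == "__main__":
    # Pass "map" or "reduce" as the first argument when wiring into streaming.
    (mapper if sys.argv[1:] == ["map"] else reducer)()
```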

Worked in AWS environment for development and deployment of Custom Hadoop Applications.

Created Oozie workflows for Hadoop-based jobs including Sqoop, Hive, and Pig.

Involved in file movements between HDFS and AWS S3.

Created Hive External tables and loaded the data into tables and query data using HQL.

Performed data validation on the data ingested using MapReduce by building a custom model to filter all the invalid data and cleanse the data.

Handled importing data from various sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from MySQL into HDFS using Sqoop.

Transferred the data using the Informatica tool from AWS S3 to AWS Redshift.

Wrote HiveQL queries, configuring the number of reducers and mappers needed for the query output.

Transferred data between Pig scripts and Hive using HCatalog, and transferred relational database tables using Sqoop.

Involved in Extract Transfer and Load (ETL) process using SSIS and generated reports using SSRS.

Analyzed the data by performing Hive queries (HiveQL) and running Pig Scripts (Pig Latin).

Worked with Toad for DB2 in the production region; Oracle and Teradata are the sources from which the claims are extracted.

Monitored database space usage and performance, optimized the performance of the database and applications, and resolved on-call issues for many production databases.

Provided cluster coordination services through Zookeeper. Installed and configured Hive and wrote Hive UDFs.

Environment: Hadoop, Pig, Hive, HBase, Sqoop, Python, Matplotlib, scikit-learn, Tableau, HDFS, MapReduce, AWS, S3, Redshift, MySQL, OLAP, Oozie, HQL, SSIS, SSRS, MS SQL Server, HiveQL.

Grapesoft Solutions Hyderabad, India July 2014 to December 2016

Role: Data Engineer

Responsibilities:

Worked with the BI team to gather report requirements and used Sqoop to export data into HDFS and Hive.

Involved in the below phases of Analytics using R, Python, and Jupyter Notebook.

Data collection and treatment: Analyzed existing internal data and external data, worked on entry errors, classification errors and defined criteria for missing values.

Data Mining: Used cluster analysis to identify customer segments, decision trees to separate profitable and non-profitable customers, and Market Basket Analysis for customer purchasing behavior and part/product association.
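
A compact scikit-learn sketch of the clustering and decision-tree pieces; the data file, feature names, and label column are hypothetical:

```python
# Illustrative scikit-learn sketch for customer segmentation and profitability
# classification. File and column names are hypothetical.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

customers = pd.read_csv("customers.csv")
features = customers[["annual_spend", "visit_frequency", "avg_basket_size"]]

# Cluster analysis for customer segments.
scaled = StandardScaler().fit_transform(features)
customers["segment"] = KMeans(n_clusters=4, random_state=42).fit_predict(scaled)

# Decision tree separating profitable vs. non-profitable customers.
tree = DecisionTreeClassifier(max_depth=4, random_state=42)
tree.fit(features, customers["is_profitable"])
print(tree.score(features, customers["is_profitable"]))
```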

Developed multiple MapReduce jobs in Java for data cleaning and pre-processing.

Assisted with data capacity planning and node forecasting.

Installed, Configured and managed Flume Infrastructure.

Administered Pig, Hive, and HBase, installing updates, patches, and upgrades.

Worked closely with the claims processing team to obtain patterns in filing of fraudulent claims.

Developed MapReduce programs to extract and transform the data sets, with results exported back to RDBMS using Sqoop.

Observed patterns in fraudulent claims using text mining in R and Hive.

Exported the required information to RDBMS using Sqoop to make the data available for the claims processing team to assist in processing claims based on the data.

Developed MapReduce programs to parse the raw data, populate staging tables, and store the refined data in partitioned tables in the EDW.

Created tables in Hive and loaded the structured data (resulting from MapReduce jobs) into them.

Developed many queries using HiveQL and extracted the required information.

Created Hive queries that helped market analysts spot emerging trends by comparing fresh data with EDW reference tables and historical metrics.

Responsible for importing the data (mostly log files) from various sources into HDFS using Flume.

Enabled speedy reviews and first-mover advantages by using Oozie to automate data loading into the Hadoop Distributed File System and Pig to pre-process the data.

Managed and reviewed Hadoop log files.

Tested raw data and executed performance scripts.

Environment: HDFS, Pig, Hive, MapReduce, Linux, HBase, Flume, Sqoop, R, VMware, Eclipse, Cloudera, and Python.


