Girish
************@*****.***
PROFESSIONAL SUMMARY:
Data Engineer with 8+ years of experience designing and executing solutions for complex business problems. Migrated on-premises infrastructure to Azure using big data technologies and the Azure tech stack while ensuring data quality and integrity.
Proficient in leveraging Azure Databricks and Spark for distributed data processing and transformation tasks, enhancing efficiency and scalability of data operations.
Skilled in ensuring data quality and integrity through robust validation, cleansing, and transformation operations that deliver accurate and reliable insights.
Implemented Azure Policy integrated with Azure Key Vault to manage encryption and access control policies, ensuring data security and compliance with business quality rules.
Automated the configuration and implementation of robust business quality rules using Azure DevOps Pipelines, ensuring consistent application across diverse environments.
Adept at designing cloud-based data warehouse solutions using Snowflake on Azure, effectively optimizing schemas, tables, and views for efficient data storage and retrieval.
Extensive experience with Snowflake Multi-Cluster Warehouses, enabling scalable compute resources and significantly improving query performance for large-scale data analysis.
Demonstrated proficiency in utilizing Snowflake Clone and Time Travel features, facilitating efficient data copying and seamless access to historical data for comprehensive analysis and reporting.
Utilized Azure Resource Manager templates as Infrastructure-as-Code to define and manage infrastructure configurations, facilitating automated provisioning and deployment of resources necessary for implementing business quality rules.
Actively contributes to the development, improvement, and maintenance of Snowflake database applications, ensuring high performance and reliability of the system.
Good knowledge of and experience with AWS (EMR, EC2, S3, and Glacier).
Proven ability to build logical and physical data models for Snowflake, adapting to changing requirements and optimizing data structures for efficient data organization and access.
Used SageMaker to support checkpointing for AWS containers and for a subset of built-in algorithms without requiring script changes.
Good experience in software development with Python and IDEs such as PyCharm, Sublime Text, and Jupyter Notebook.
Expertise in defining roles and privileges necessary for accessing different database objects, ensuring data security and appropriate access controls.
Possesses in-depth knowledge of Snowflake Database, Schema, and Table structures, allowing for effective management and utilization of Snowflake's powerful capabilities.
Collaborated closely with data analysts and stakeholders to implement effective data models and structures, ensuring alignment with business needs and objectives.
Demonstrated strong expertise in optimizing Spark jobs and leveraging Azure Synapse Analytics for efficient processing and analytics of large-scale data.
Successfully achieved performance optimization and capacity planning goals, ensuring scalability and efficiency in data processing workflows.
Developed CI/CD frameworks for seamless deployment of data pipelines, worked closely with DevOps teams to automate the pipeline deployment process.
Experience in developing data analytic solutions using AWS services like S3, EMR, Kinesis, Lambda, Glue, Redshift, CloudWatch, Data Pipeline, and Step Functions.
Proficient in scripting languages like Python and Scala, enabling efficient data manipulation and processing tasks.
Skilled in utilizing Hive, Spark SQL, Kafka, and Spark Streaming for ETL tasks and real-time data processing, enabling timely and accurate insights.
Possess solid working experience in Hadoop, HDFS, MapReduce, Hive, Tez, Python, and PySpark, facilitating comprehensive data processing capabilities.
Demonstrated hands-on experience in developing large-scale data pipelines using Spark and Hive, ensuring efficient data flow and processing.
Utilized Apache Sqoop to import and export data from HDFS and Hive, enabling seamless integration with relational database systems.
Proficient in setting up workflows using Apache Oozie workflow engine, effectively managing and scheduling Hadoop jobs for streamlined data processing.
Optimized query performance in Hive using advanced techniques like bucketing and partitioning, delivering improved query execution times.
Highly proficient in Agile methodologies, leveraging tools like JIRA for efficient project management and reporting.
TECHNICAL SUMMARY
Operating Systems
Windows, UNIX, Linux
Big Data Technologies
Hadoop, MapReduce, HDFS, HBase, Zookeeper, Hive, Sqoop, Oozie, Kafka, YARN, Ranger, Impala, PySpark.
Database
SQL, Oracle, MySQL, Snowflake, HBase.
Programming Languages
Python, Scala, JavaScript, SQL, T-SQL, PL/SQL
Scripting
PowerShell 3.0/2.0, UNIX Shell Scripting
Web Technologies
HTML, CSS, JavaScript, XML, JSP, REST, SOAP
Build Automation Tools
Ant, Maven
Version Control
Git, GitHub
Analytical Tools
Tableau, Elasticsearch, Splunk, Power BI
Azure Services
Azure Data Factory, Azure Databricks, Azure Synapse Analytics, Logic Apps, Function Apps, Azure DevOps
AWS Tools
Kinesis, EC2, EMR, S3, Lambda, Glue, Redshift, SNS, SQS, Step Functions
PROFESSIONAL EXPERIENCE
GlaxoSmithKline, Warren, NJ July 2021 – Present
Azure Data Engineer
Designed and implemented secure and scalable data pipelines for HealthCare applications using Azure Data Factory, ensuring efficient extraction, transformation, and loading (ETL) processes.
Performed data aggregation and analysis on large-scale datasets using Apache Spark, Scala, and Hive, resulting in improved insights for the business.
Experience in managing Hadoop clusters using the Cloudera Manager tool.
Utilized big data ecosystems such as Hadoop, Spark, and Cloudera to load and transform large sets of structured, semi-structured, and unstructured data.
Utilized Hive queries and Spark SQL to analyze and process data, meeting specific business requirements and simulating MapReduce functionalities.
Built ETL jobs using the PySpark API with Jupyter notebooks on an on-premises cluster for specific transformation needs, with HDFS as the data storage system.
Developed and maintained data processing workflows using Azure Databricks and Spark to perform complex data transformations and aggregations for analytics and reporting.
Built scalable and optimized Snowflake schemas, tables, and views to support complex analytics queries and reporting requirements.
Utilized Azure Event Hubs and Azure Functions to enable real-time data streaming and processing of patient health records, facilitating timely insights for outbreak detection and patient monitoring.
Implemented robust data governance practices by defining and enforcing data quality rules, data lineage, and data cataloging within the Azure ecosystem, ensuring compliance with regulatory requirements.
Utilized Azure DevOps Pipelines to automate the configuration and implementation of business quality rules, ensuring consistent application across different environments.
Leveraged Azure Resource Manager (ARM) templates as Infrastructure-as-Code to define and manage infrastructure configurations, enabling automated provisioning and deployment of resources required for implementing business quality rules.
Leveraged Azure Data Lake Storage as a data lake for storing raw and processed data, implementing data partitioning and data retention strategies.
Utilized Azure Blob Storage for efficient storage and retrieval of data files, implementing compression and encryption techniques to optimize storage costs and data security.
Integrated Azure Data Factory with Azure Logic Apps for orchestrating complex data workflows and triggering actions based on specific events.
Utilized SageMaker to support checkpoints for frameworks and built-in algorithms.
Implemented data replication and synchronization strategies between Snowflake and other data platforms using Azure Data Factory and Change Data Capture techniques.
Implemented advanced analytics and machine learning workflows using Azure Machine Learning and Snowflake, enabling predictive analytics and data-driven insights.
Designed and implemented data archiving and data retention strategies using Azure Blob Storage and Snowflake's Time Travel feature.
Developed an interactive customer analytics dashboard leveraging Power BI and Azure Analysis Services to provide insights into customer behavior, segmentation, and profitability.
Designed and implemented customized monitoring and alerting solutions utilizing Azure Monitor and Snowflake Query.
Orchestrated Apache NiFi data pipelines to integrate and consolidate healthcare data from various sources, such as electronic health records systems, medical imaging devices, and wearable health trackers, enabling a holistic view of patient health and treatment outcomes.
Designed and implemented Apache NiFi data ingestion workflows for healthcare organizations, enabling real-time acquisition and processing of medical data from various sources, including hospital information systems, medical devices, and remote patient monitoring systems.
Implemented secure data transmission protocols within Apache NiFi, including encryption and authentication mechanisms, to safeguard sensitive patient health information during ingestion and transfer, ensuring compliance with healthcare regulations.
Successfully integrated Snowflake with Power BI and Azure Analysis Services, enabling the creation of interactive dashboards and reports that empower business users with self-service analytics capabilities.
Streamlined data pipelines and optimized Spark jobs in Azure Databricks, leveraging performance enhancements such as Spark configuration tuning, caching, and data partitioning techniques to ensure efficient data processing for HealthCare operations.
Collaborated seamlessly with cross-functional teams, including data scientists, data analysts, and business stakeholders, to gather and comprehend data requirements, delivering scalable and reliable data solutions tailored specifically to the HealthCare industry.
Developed and maintained data pipelines using Kafka to ingest, transform, and process customer behavioral data for analysis.
Utilized Git as a version control tool to maintain the code repository, ensuring better code management and tracking of changes.
Environment: Apache Spark, Hive, Apache NiFi, Cloudera, HBase, Kafka, MapReduce, HDFS, Snowflake, Azure Databricks, Data Factory, Logic Apps, Function Apps, MySQL, RDBMS, Python, PySpark, Scala, shell scripting, Power BI, JIRA, Git, Jenkins.
Credit-Suisse, Raleigh, NC January 2019 – June 2021
Azure Data Engineer
Initiated the design and implementation of scalable data ingestion pipelines using Azure Data Factory, ingesting data from SQL databases, CSV files, and REST APIs, which reduced ingestion time and increased data accuracy.
Developed, maintained, and provided the team with various Azure DevOps-related tools, such as deployment tools, staged virtual environments, and provisioning scripts.
Used Shared Image Gallery to store the created images and built Azure pipelines in Azure DevOps to implement all these services in Azure.
Experienced in working on DevOps/Agile operations processes and tools, including code review, unit test automation, build and release automation, environment management, and incident and change management.
Developed and optimized data processing workflows using Azure Databricks, resulting in improved data processing times and a reduction in errors.
Designed and created various forms for the finance team, sales team, and account managers, implementing the required business logic. Worked with non-technical teams to help them use the application.
Implemented advanced data validation techniques, including outlier detection and anomaly detection, improving the accuracy of data transformations.
Collaborated with cross-functional stakeholders to design and implement optimized Snowflake schemas, tables, and views that supported complex analytics queries over big data.
Ran analytics on power plant data using the PySpark API with Jupyter notebooks on an on-premises cluster for specific transformation needs.
Leveraged Terraform to deploy and manage cloud-based data platforms, such as Azure Blob Storage, for scalable data storage.
Designed and implemented Terraform scripts to create and manage data pipelines using tools like Apache Kafka and Apache Spark for real-time streaming and batch data processing in insurance applications.
Created an enterprise data warehouse, built on two primary systems, that integrates critical business and financial data to develop KPIs and analytical reporting supporting the business's exceptional growth.
Utilized Terraform to define and deploy infrastructure resources for insurance-specific analytics and reporting systems, ensuring high availability, scalability, and disaster recovery capabilities.
Actively participated in the insurance industry's community forums and user groups related to Apache NiFi, staying updated on the latest advancements, best practices, and emerging trends in data engineering for insurance applications.
Integrated Azure Policy with Azure Key Vault for managing and enforcing encryption and access control policies, ensuring data security and compliance with business quality rules.
Architected and optimized Snowflake schemas, tables, and views to streamline data storage and retrieval for analytics and reporting needs, resulting in an increase in data query speed.
Collaborated cross-functionally with data analysts and stakeholders to identify business requirements and implemented tailored data models and structures in Snowflake, resulting in an increase in user satisfaction with the platform.
Developed data models in Power BI for financial requirements and published them to the Power BI Service.
Developed efficient Spark jobs to perform complex big data transformations, aggregations, and machine learning tasks on large datasets, leading to a reduction in processing time.
Developed and implemented an automated data pipeline using Azure Logic Apps and Azure Functions, reducing manual effort and improving data accuracy.
Well experienced in designing solutions for high-volume data stream ingestion, processing, and low-latency data provisioning using Hadoop ecosystem tools such as NiFi and Hive and NoSQL databases such as MongoDB.
Implemented AWS Step Functions to automate and orchestrate Amazon SageMaker-related tasks such as publishing data to S3, training the ML model, and deploying it for prediction.
Developed custom Apache NiFi processors to handle data transformations and enrichment specific to insurance data, such as policy information, claims data, and customer demographics, ensuring data quality and consistency.
Leveraged Azure Synapse Analytics to integrate big data processing and analytics capabilities, enabling seamless data exploration and insights generation.
Configured event-based triggers and scheduling mechanisms to automate data pipelines and workflows using Azure Logic Apps, Apache Airflow, and Azure Functions.
Optimized data processing and storage layers by implementing partitioning, indexing, and caching strategies in Snowflake and Azure services, resulting in a reduction in query execution time.
Conducted performance tuning and capacity planning exercises to ensure the scalability and efficiency of the data infrastructure.
Collaborated with DevOps engineers to automate CI/CD and test-driven development pipelines using Azure as per the client's requirement, resulting in a more efficient workflow.
Played a key role in executing Hive scripts through Hive on Spark and Spark SQL for data processing tasks.
Collaborated closely on ETL activities, ensuring data integrity, and verifying the stability of data pipelines.
Applied hands-on expertise in utilizing Kafka and Spark Streaming to handle streaming data for specific use cases.
Designed and implemented real-time data processing solutions using Kafka and Spark Streaming, enabling the efficient processing and analysis of large-scale streaming data.
Developed Spark core and Spark SQL scripts using Scala to enhance data processing speed and efficiency.
Utilized JIRA to report on projects and created sub-tasks for Development, QA, and Partner validation.
Proficient in Agile methodologies, participating in a wide range of Agile ceremonies, including daily stand-ups and internationally coordinated PI Planning sessions.
Environment: Azure Databricks, Data Factory, Azure Synapse Analytics, Snowflake, Kafka, Logic Apps, Function Apps, MS SQL, Oracle, MongoDB, HDFS, MapReduce, YARN, Spark, Hive, SQL, Python, Scala, PySpark, shell scripting, Git, JIRA, Jenkins, ADF Pipeline, Power BI.
Optum, Eden Prairie, MN March 2017 – December 2018
AWS Data Engineer
Responsible for designing, implementing, and testing batch/streaming data pipelines using AWS services.
Extensively worked on PySpark to read data from S3 Data Lake to preprocess and store it back in S3 to create tables using Athena.
Deployed AWS Lambda functions and other dependencies into AWS using EMR Spin for Data Lake jobs.
Created AWS Lambda functions and assigned IAM roles to schedule Python scripts using CloudWatch triggers to support business needs (SNS, SQS).
As a Hadoop admin, monitored cluster health status daily, tuned performance-related configuration parameters, and backed up configuration XML files.
Conducted ETL data integration and transformations using AWS Glue Spark scripts.
Created partitioned tables in Athena, designed a data warehouse using Athena external tables, and used Athena queries for data analysis.
Worked with different source and destination file formats like Parquet, ORC, Avro, and JSON.
Developed ETL Processes in AWS Glue to migrate data from external sources and S3 files into AWS Redshift.
Worked on Ingesting data by going through cleansing and transformations and leveraging AWS Lambda, AWS Glue and Step Functions.
Worked with commercial software and healthcare platforms to meet specific business requirements.
Developed a Python script to call REST APIs and extract data to AWS S3.
Performed performance tuning of long-running Spark Streaming applications using the Spark UI, implementing features like fault tolerance and failover.
Created and ran jobs on the AWS cloud to extract, transform, and load data into AWS Redshift using AWS Glue, with S3 for data storage and AWS Lambda to trigger the jobs.
Collected data from the Health Information Network system for reporting purposes.
Implemented Workload Management (WLM) in Redshift to prioritize basic dashboard queries over more complex longer-running ad hoc queries. This allowed for a more reliable and faster reporting interface, giving sub-second query response for basic queries.
Developed a dashboard and story in Tableau showing the benchmarks and a summary of the model's measures.
Environment: AWS, S3, Redshift, Athena, Lambda, Glue, EMR, Step Functions, Spark, CloudWatch, HBase, Kafka, Impala, Hive, Python, YARN, shell scripting, Git, Jenkins, Bitbucket, Tableau, Agile, and JIRA.
EXL Services, India May 2014 – July 2016
Application Development Analyst
Expert in designing ETL data flows using SSIS, creating mappings/workflows to extract data from SQL Server, and performing data migration and transformation from Access/Excel sheets using SQL Server SSIS.
Efficient in Dimensional Data Modeling for Data Mart design, identifying Facts and Dimensions, and developing fact tables and dimension tables using Slowly Changing Dimensions (SCD).
Experience in Error and Event Handling: Precedence Constraints, Breakpoints, Checkpoints, and Logging.
Experienced in Building Cubes and Dimensions with different Architectures and Data Sources for Business Intelligence and writing MDX scripts.
Thorough knowledge of Features, Attributes, Hierarchies, Structure, Star and Snowflake Schemas of Data Marts.
Good working knowledge of developing SSAS Cubes, Aggregations, KPIs, Measures, Cube Partitioning, and Data Mining Models, and of deploying and processing SSAS objects.
Experience in creating Ad hoc reports and reports with complex formulas, and in querying the database for Business Intelligence.
Expertise in developing Parameterized, Chart, Linked, Dashboard, Scorecard, and Graph reports on SSAS Cubes using Drill-down, Drill-through, and Cascading reports in SSRS.
Flexible, enthusiastic, and project-oriented team player with excellent written, verbal communication and leadership skills to develop creative solutions for challenging client needs.
Environment: MS SQL Server 2016, Visual Studio, SSIS, SharePoint, MS Access, Git, Team Foundation Server.
EDUCATION & CERTIFICATIONS
Master of Science in Computer Science, University of North Texas, TX
Bachelor of Science in Computer Science, SRM Institute of Science and Technology, India
Certified Azure Data Engineer - Data Engineering on Microsoft Azure (DP-203)