Data Engineer Azure

Location: Macomb, IL
Posted: March 19, 2025

Ajith

Contact No: +1-779-***-****

Mail ID: *************@*****.***

Sr. Data Engineer

PROFESSIONAL SUMMARY:

Sr. Data Engineer with over 10 years of experience designing, developing, and managing large-scale data architectures. Specialized in cloud technologies, data integration, and CI/CD practices with an emphasis on Informatica Intelligent Cloud Services (IICS) and AWS. Extensive expertise in developing and maintaining data pipelines, enabling efficient data governance, and ensuring data integrity. Proven ability to deliver scalable solutions in enterprise environments, leveraging S3, SQL, Jenkins, and other modern tools.

Proficient in a wide array of technologies including AWS (Amazon Web Services), PL/SQL, Apache NiFi, Hive Query Language (HiveQL), Oracle Database, Unix shell scripting, and Pig Latin.

Skilled in developing and maintaining data pipelines for ingestion, transformation, and management, ensuring data governance and integrity throughout the process.

Experienced in real-time streaming and processing using tools like Apache NiFi and Spark Streaming, enabling the creation of real-time dashboards and analytics.

Profound experience in working with various AWS services including RDS (Relational Database Service), Aurora, Redshift, DynamoDB, and S3 (Simple Storage Service), utilizing them for data storage, migration, and management.

Knowledgeable in Azure Cloud technologies such as Azure Data Factory, Azure Databricks, Azure SQL Database, Azure Data Lake Storage, Azure Synapse Analytics (formerly Azure SQL Data Warehouse), Azure Cosmos DB, and Azure Data Explorer (ADX), facilitating seamless integration and migration of data across cloud platforms.

Hands-on experience in setting up and managing CI/CD pipelines using Jenkins, ensuring smooth deployment and delivery of data solutions.

Expertise in visualizations using tools like Tableau, Power BI, and Domo, enabling effective communication of insights derived from data analysis.

Designed and implemented data pipelines leveraging GCP services such as BigQuery, Cloud Dataflow, Pub/Sub, and Cloud Storage to handle large-scale data processing and analytics.

Used BigQuery to perform complex SQL queries on multi-terabyte datasets, enabling advanced analytics and reporting.
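
A minimal sketch of that kind of BigQuery query execution using the google-cloud-bigquery Python client; the project, dataset, table, and column names here are hypothetical, not the actual objects.

    from google.cloud import bigquery

    client = bigquery.Client(project="example-project")  # hypothetical project id

    sql = """
        SELECT customer_id, SUM(amount) AS total_spend
        FROM `example-project.sales.transactions`   -- hypothetical table
        GROUP BY customer_id
        ORDER BY total_spend DESC
        LIMIT 100
    """

    # query() submits the job; result() waits for completion and returns rows.
    for row in client.query(sql).result():
        print(row.customer_id, row.total_spend)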

Developed real-time data streaming applications using Pub/Sub and Cloud Dataflow to support time-sensitive decision-making processes.

Implemented robust data lake solutions on GCP using Cloud Storage, ensuring scalability and high availability for structured and unstructured data.

Leveraged Cloud Composer (Apache Airflow) for orchestrating complex ETL workflows and automating data pipeline management.

Well-versed in SDLC methodologies including Waterfall and Agile Scrum, with experience in iterative algorithms and requirements gathering for business analysis.

Led data analysis efforts, including user behavior analysis and integration with analytics platforms such as SAS Analytics and Tableau, for comprehensive insights.

Strong experience working with various AWS databases and services including ElastiCache, Redis, EC2, AWS CloudFormation, and Route 53, enabling scalable and reliable data storage, computation, and infrastructure management in cloud environments.

Experienced in utilizing Snowflake for cloud-based data warehousing, enabling scalable and efficient storage, processing, and analysis of large datasets.

Proficient in Hadoop ecosystem technologies including HDFS, MapReduce, Spark, Scala, Yarn, Kafka, PIG, HIVE, Sqoop, Flume, Oozie, and Impala.

Experienced in working with distributed databases such as HBase and Teradata, leveraging their capabilities for efficient data storage and retrieval.

Skilled in data integration using tools such as Informatica, ensuring seamless flow of information across systems.

Expertise in database management, reengineering processes, and optimizing performance through thorough understanding of relational and NoSQL databases, including MongoDB.

Managed data workflows and orchestrated complex tasks using Airflow operators, ensuring efficient data pipeline operations.

Experience in leveraging Python libraries for data manipulation, analysis, and automation, enhancing productivity and flexibility in data engineering tasks.

Skilled in IBM DataStage for ETL processes, facilitating the extraction, transformation, and loading of data from various sources into target systems with high efficiency and reliability.

TECHNICAL SKILLS:

Programming Languages: Python, R, Java, Scala, SQL

Big Data Technologies: Hadoop, MapReduce, HDFS, Sqoop, PIG, Hive, HBase, Kafka, Yarn, Apache Spark, Apache NiFi

Web Technologies: XML, JSON

Data Integration Tools: Informatica Intelligent Cloud Services (IICS), Informatica PowerCenter

Python Libraries/Packages: Boto3, Pandas, Matplotlib, HTTPLib2, Urllib2, Beautiful Soup, NumPy, Neo4j

Cloud Platforms: Amazon Web Services (AWS), Microsoft Azure

AWS Services: Amazon EC2, Amazon S3, Amazon DynamoDB, AWS Glue, Amazon SimpleDB, Amazon MQ, Amazon ECS, AWS Lambda, Amazon RDS, Amazon CloudWatch, Amazon SQS, AWS Identity and Access Management (IAM)

Azure Services: Azure Data Factory, Azure SQL Database, Azure Databricks, Azure Data Lake Storage, Azure Synapse Analytics (formerly Azure Data Warehouse), Azure Cosmos DB, Azure Data Explorer (ADX), Azure Kubernetes Service (AKS), Azure Functions, Azure Blob Storage, Azure Queue Storage, Azure Event Hubs, Azure Stream Analytics, Azure DevOps, Azure Active Directory (AAD), Azure Key Vault.

GCP Services: BigQuery, Cloud Dataflow, Pub/Sub, Cloud Storage, Cloud Composer, Vertex AI, AutoML, Cloud Functions, Google Kubernetes Engine (GKE), Data Catalog, Cloud Spanner, Cloud SQL, Cloud Monitoring, Cloud Logging, IAM, Cloud Run.

Databases/Servers: MySQL, Redis, PostgreSQL, Neo4j, Aurora DB, DynamoDB, MongoDB, DW-BI, Snowflake, SQL Server, DBT, Cassandra, Redshift, IBM DB2.

ETL Tools: DataStage, Informatica, Redshift, PostgreSQL, Snowflake, SQL Server Integration Services (SSIS), Ab Initio

Data Visualization: Power BI, Tableau, Superset

Build and CI Tools: Jenkins, Airflow, Apache Oozie

Web Services/Protocols: TCP/IP, UDP, FTP, HTTP/HTTPS, SOAP, REST, RESTful APIs

Automation: Terraform, Git, BitBucket

IDE: PyCharm, Jupyter, PyStudio, Sublime Text, Visual Studio Code

SDLC/Testing Methodologies: Agile, Waterfall, Scrum, TDD

Statistical Analysis Skills: A/B Testing, Time Series Analysis

PROFESSIONAL EXPERIENCE:

Truist Bank, Charlotte, NC February 2023 to Present

Sr. Data Engineer

Responsibilities:

Designed and implemented robust data pipelines utilizing Snowflake, DataStage, and AWS services such as S3, Athena, and Glue, ensuring efficient data governance and integrity within a secure VPC environment.

Leveraged Apache NiFi and HiveQL for real-time streaming and processing of financial data, enabling the creation of real-time dashboards and analytics to support decision-making processes.

Architected and implemented a comprehensive Security Framework leveraging AWS Lambda and DynamoDB to enable fine-grained access control to sensitive financial data stored in AWS S3.

Conducted end-to-end architecture and implementation assessments of various AWS services, including EMR, Redshift, and S3, to optimize data processing efficiency and cost-effectiveness.

Applied machine learning algorithms using Python to forecast customer behavior and financial trends, leveraging technologies like Kinesis Firehose and S3 data lake for scalable data storage and processing.

Utilized AWS EMR for large-scale data transformation and movement between AWS data stores and databases, including S3 and DynamoDB, ensuring data consistency and reliability.

Designed and implemented scalable data architectures utilizing GCP services such as BigQuery, Cloud Dataflow, and Cloud Storage to optimize performance and cost efficiency.

Developed and maintained end-to-end data pipelines for processing rebates data, ensuring compliance and accuracy in reporting.

Leveraged Cloud Composer to orchestrate complex ETL workflows, reducing manual intervention and improving data pipeline reliability.

Applied cost optimization techniques on GCP services, reducing operational expenses by 20%.

Collaborated with business stakeholders to align data strategies with rebate processing requirements, ensuring accurate and timely data delivery.

Worked on various POCs related to graph databases (Neo4j) and BI solutions to improve the current analytics platform.

Designed and implemented data pipelines using Snowflake, DataStage, and GCP services such as BigQuery and Cloud Dataflow to handle large-scale financial datasets.

Leveraged Pub/Sub for real-time streaming and integrated with Cloud Dataflow to enable dynamic dashboards for decision-making.

Designed and developed scalable data pipelines using IICS to integrate data from disparate sources into AWS S3 for centralized storage.

Implemented data profiling and quality checks using Informatica tools to ensure the accuracy of enterprise data.

Automated CI/CD workflows using Jenkins, enabling seamless deployment of pipeline updates and reducing delivery times by 30%.

Worked extensively on SQL scripting to optimize query performance for reporting and analytics platforms.

Collaborated with cross-functional teams to design ETL workflows for migrating data to the AWS ecosystem, achieving scalability and resilience.

Migrated legacy ETL processes to GCP, optimizing performance and reducing processing time by 30%.

Developed CI/CD pipelines for deploying GCP resources using Terraform and Jenkins, ensuring efficient and consistent infrastructure management.

Enhanced data security by configuring IAM roles and policies, ensuring compliance with financial regulations.

Developed Lambda functions with Boto3 for automating tasks such as deregistering unused AMIs across all application regions, optimizing EC2 resource costs and enhancing operational efficiency.
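
As a hedged illustration of that Lambda/Boto3 automation, the sketch below lists self-owned AMIs and deregisters those older than an assumed retention window; the 90-day cutoff and handler wiring are assumptions rather than the actual implementation.

    import boto3
    from datetime import datetime, timedelta, timezone

    def lambda_handler(event, context):
        ec2 = boto3.client("ec2")
        cutoff = datetime.now(timezone.utc) - timedelta(days=90)  # assumed retention window
        for image in ec2.describe_images(Owners=["self"])["Images"]:
            # CreationDate is an ISO-8601 string such as "2023-01-15T13:29:50.000Z".
            created = datetime.fromisoformat(image["CreationDate"].replace("Z", "+00:00"))
            if created < cutoff:
                ec2.deregister_image(ImageId=image["ImageId"])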

Monitored and resolved issues within data pipelines and systems using AWS CloudWatch and other monitoring tools, ensuring smooth data flow, and minimizing downtime for critical financial systems.

Implemented data quality checks and validations within DBT pipelines to ensure the accuracy and completeness of transformed financial data before it is consumed by advanced analytics and reporting.

Designed and managed ETL workflows utilizing AWS Step Functions and Apache Airflow, ensuring efficient data flow and timely processing of financial transactions and reporting.
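
A minimal Airflow DAG sketch of the orchestration pattern described above; the DAG id, schedule, and task callables are placeholders rather than the actual workflow.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        ...  # pull data from the source system

    def load():
        ...  # write transformed data to the warehouse

    with DAG(
        dag_id="daily_financial_etl",       # hypothetical name
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        load_task = PythonOperator(task_id="load", python_callable=load)
        extract_task >> load_task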

Used reactive programming to build web applications that automatically update values as they are added to the database.

Utilized Excel Pivot Tables for complex data analysis and visualization, enabling the extraction of actionable insights from large datasets.

Optimized big data processing on Cloudera’s Hadoop platform, enabling real-time analytics and reducing data query response times by 50%, which improved decision-making capabilities across the bank.

Implemented CI/CD pipelines and automated testing frameworks to streamline the software development lifecycle, ensuring rapid and reliable delivery of high-quality data engineering solutions.

Conducted troubleshooting of DataStage job failures, providing comprehensive root cause analysis and implementing corrective actions to maintain optimal data processing efficiency.

Leveraged Databricks Delta Tables, Delta Lake, and Delta Live Tables to design and implement robust ETL pipelines, ensuring data accuracy, integrity, and real-time processing capabilities, thereby supporting advanced analytics and business intelligence.

Managed large datasets and integrated diverse data sources across platforms, ensuring data integrity and supporting strategic decision-making.

Streamlined ETL workflows using Ab Initio GDE, reducing data processing time by 40% and enhancing integration across financial systems.

Developed new DataStage jobs to facilitate efficient data loading processes into the data warehouse, conducting thorough testing for accuracy and performance to meet regulatory requirements.

Implemented DBT for data modeling, analysis, and transformation, ensuring the generation of high-quality, version-controlled SQL code to support financial reporting and regulatory compliance.

Managed and optimized DB2 databases, ensuring data integrity, performance, and security for critical financial systems, and executed database queries and provided support for data-related issues.

Used data from several sources to create interactive dashboards and reports in Power BI, enabling data-driven decision-making and offering actionable insights.

Successfully led cross-functional teams in state government environments, fostering collaboration and achieving project goals through effective delegation and communication.

Environment: AWS (Amazon Web Services), Snowflake, Apache NiFi, DataStage, AWS Glue, GCP, Amazon S3, Amazon Athena, Apache Hive (HiveQL), AWS Lambda, AWS CloudWatch, Python, Java, SQL, Scikit-learn, Apache Airflow, AWS Step Functions, Amazon DynamoDB, Amazon Redshift, Neo4j, IBM DB2, Boto3, Git

State Street, New York, NY April 2021 to January 2023

Sr. Data Engineer

Responsibilities:

Led the architecture and design of a comprehensive data warehouse solution using Hadoop, Apache Spark, and Azure Synapse Analytics, ensuring alignment with business objectives and compliance requirements.

Implemented end-to-end data pipelines for efficient data processing and analysis, utilizing Azure Data Factory, Apache NiFi, and Python scripts for data ingestion, transformation, and loading.

Developed MapReduce applications in Hadoop for processing large datasets and implemented compression techniques to optimize job performance and reduce storage costs.

Orchestrated data workflows using Oozie actions, including Hive, Shell, and Java tasks, to schedule and monitor data processing jobs in the Hadoop cluster.

Designed and optimized ETL processes using Ab Initio’s Graphical Development Environment (GDE) to streamline data integration from diverse financial systems, resulting in a 40% reduction in processing time and improved data accuracy for critical reporting and analytics.

Built and deployed cloud-native data solutions on GCP, focusing on data lakes and cloud databases such as BigQuery and Cloud SQL.

Built and managed enterprise-level ETL pipelines using Informatica Cloud and AWS Glue, supporting end-to-end data ingestion and transformation.

Led efforts to implement secure and cost-effective AWS S3 data lakes, ensuring compliance with organizational standards.

Enhanced CI/CD practices by incorporating Jenkins pipelines for seamless deployment, monitoring, and rollback capabilities.

Provided mentorship to junior engineers on best practices for designing workflows in IICS and optimizing SQL-based solutions.

Designed and developed custom Python scripts for data validation and transformation, integrating them into existing pipelines.

Migrated legacy ETL pipelines to GCP using Data Fusion, improving data processing efficiency.

Designed data models using Erwin to support rebate calculations and financial reporting.

Orchestrated data workflows using Cloud Composer and automated data pipeline deployment with Terraform.

Implemented streaming solutions using Kafka and Pub/Sub for real-time rebate data processing.

Built end-to-end data pipelines leveraging GCP services like BigQuery, Cloud Storage, and Cloud Composer to streamline data integration and transformation.

Orchestrated workflows using Cloud Composer to automate ETL processes and improve reliability and scalability.

Designed and deployed Vertex AI pipelines for predictive analytics, enabling actionable insights for business strategies.

Collaborated with cross-functional teams to migrate on-premises data warehouses to GCP, reducing operational costs and improving data accessibility.

Implemented and managed Cloudera-based data processing solutions, enabling efficient analysis and reporting on large-scale financial datasets, resulting in improved decision-making capabilities and regulatory compliance.

Enhanced data quality by implementing robust validation and cleansing processes using Ab Initio, reducing data errors by 25% and ensuring compliance with financial regulations.

Designed and implemented modern data solutions on Azure PaaS services such as Azure Databricks and Azure Data Lake Storage, facilitating real-time and batch processing of streaming and historical data.

Utilized Azure Cosmos DB as a globally distributed NoSQL database for handling large volumes of semi-structured and unstructured data, ensuring high availability, low latency, and scalability for critical applications.

Developed and managed data processing frameworks within the Medallion architecture, incorporating Apache Spark and Apache Flink for real-time and batch processing of streaming data.

Strong knowledge of JavaScript, including closures, Promises, inheritance, and AJAX, with experience in both object-oriented programming and functional reactive programming.

Used Python scripts to load data from CSV files into Neo4j graph databases.
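
A simple sketch of that CSV-to-Neo4j load using the official neo4j Python driver; the connection details, CSV columns, and Cypher data model are hypothetical.

    import csv
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))  # assumed instance

    with driver.session() as session, open("people.csv", newline="") as f:
        for row in csv.DictReader(f):
            # MERGE keeps the load idempotent if the script is re-run.
            session.run(
                "MERGE (p:Person {id: $id}) SET p.name = $name",
                id=row["id"],
                name=row["name"],
            )
    driver.close()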

Built robust data pipelines for moving data across various systems, including import/export of data from Teradata, Oracle, and Azure Data Lake Storage to Snowflake, ensuring data integrity and reliability.

Implemented proof of concepts for SOAP and REST APIs, enhancing data accessibility and interoperability, and utilized Azure Databricks for backend data processing and analytics.

Leveraged Azure Synapse Analytics for on-demand querying and analysis of data stored in Azure Data Lake Storage, eliminating the need for managing dedicated resources and enhancing agility in analytics tasks.

Designed infrastructure and systems on Azure for scalability, resiliency, and recovery using Infrastructure as Code (IaC) principles with Azure Resource Manager (ARM) templates.

Implemented end-to-end data integration solutions leveraging IBM Infosphere DataStage deployed on Azure virtual machines, orchestrating complex ETL processes to extract, transform, and load data from diverse sources into Azure data services such as Azure Data Lake Storage and Azure SQL Database.

Collaborated with cross-functional teams to gather requirements, perform business analysis, and deliver data solutions that meet stakeholder needs and drive business outcomes.

Implemented continuous data processing with Delta Live Tables, enabling real-time data ingestion and transformation for up-to-date business insights.

Developed robust data models in Power BI, incorporating relationships between different datasets to create meaningful visualizations and support complex business analytics.

Ensured data quality and governance standards are met across all stages of the data lifecycle, implementing data lineage and metadata management using Azure Data Catalog.

Environment: Apache NiFi, SOAP (Simple Object Access Protocol), Azure (Microsoft Azure cloud platform), GCP, Hadoop, Apache Spark, Azure Databricks, Azure Synapse Analytics, Azure Data Factory, Azure Data Lake Storage, Azure Cosmos DB, Azure Event Hubs, Azure Functions, Azure SQL Database, Azure Blob Storage, Azure Virtual Networks (VNets), Azure Active Directory, Terraform (for infrastructure provisioning), Visual Studio Team Services (VSTS), Docker, DataStage, Snowflake, Kubernetes

Cigna, Bloomfield, CT August 2019 to March 2021

Sr. Data Engineer

Responsibilities:

Orchestrated the deployment, automation, and maintenance of AWS cloud-based production systems, ensuring high availability, performance, scalability, and security of critical applications.

Established Infrastructure as Code (IaC) on the AWS cloud platform using CloudFormation templates, configuring and integrating AWS services to meet specific business requirements.

Implemented security measures on Unix systems, including user authentication, access controls, and firewall configurations, to protect against unauthorized access and mitigate security threats, ensuring compliance with industry standards and regulations.

Conducted monthly critical patch management for Windows servers using AWS Patch Manager, ensuring ongoing security and compliance.

Implemented Databricks Delta Tables, Delta Lake, and Delta Live Tables to streamline data operations, ensuring efficient data storage, seamless batch and streaming processing, and simplified data management, resulting in improved productivity and data reliability.

Designed data pipelines to integrate rebate data from multiple sources into GCP Cloud Storage and processed it using Dataflow.

Managed cost-efficient data processing strategies using GCP cost monitoring and optimization techniques.

Developed real-time analytics solutions leveraging BigQuery for rebate trend analysis.

Automated infrastructure provisioning using Terraform and implemented CI/CD pipelines via Jenkins.

Led the successful migration of data from on-premises SQL Server to AWS Cloud (Redshift), resulting in substantial cost savings for the organization.

Developed and maintained ETL processes using IBM InfoSphere DataStage, enabling seamless data integration across diverse sources and systems.

Utilized SQL queries to monitor database loads, optimize table structures, and improve query performance in conjunction with ETL processes.

Led the migration of enterprise data from AWS and Azure to GCP, ensuring minimal downtime and data integrity.

Developed robust data lake solutions using GCP’s Cloud Storage and optimized queries in BigQuery for reporting and analytics.

Implemented real-time data ingestion and processing using Pub/Sub and Dataflow, supporting critical healthcare analytics.
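
For illustration, a minimal publisher using the google-cloud-pubsub client; the project id, topic name, and payload are assumptions, and the downstream Dataflow consumer is omitted.

    import json
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("example-project", "claims-events")  # assumed names

    record = {"claim_id": "123", "status": "received"}  # stand-in message
    future = publisher.publish(topic_path, json.dumps(record).encode("utf-8"))
    print(future.result())  # blocks until Pub/Sub returns the message id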

Automated infrastructure provisioning on GCP with Terraform and managed deployments via Jenkins pipelines.

Leveraged Amazon MSK's monitoring and alerting capabilities to maintain high availability and performance of Kafka clusters for critical streaming data workflows.

Utilized AWS SageMaker for ML model creation and optimization, enhancing data-driven decision-making capabilities.

Designed, implemented, and optimized Snowflake cloud data warehousing solutions, including data modeling, schema design, and performance tuning.
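
A minimal sketch of querying Snowflake with the snowflake-connector-python package; the account, credentials, warehouse, and table names are placeholders.

    import snowflake.connector

    conn = snowflake.connector.connect(
        account="example_account",   # hypothetical account identifier
        user="etl_user",
        password="***",
        warehouse="ANALYTICS_WH",
        database="FINANCE",
        schema="PUBLIC",
    )
    try:
        cur = conn.cursor()
        cur.execute("SELECT region, SUM(balance) FROM accounts GROUP BY region")
        for region, total in cur.fetchall():
            print(region, total)
    finally:
        conn.close()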

Developed sophisticated data integration solutions using Ab Initio, ensuring smooth data transfer and transformation across platforms and systems.

Automated AWS EMR cluster provisioning for monthly data processing tasks, optimizing costs and leveraging AWS EMR capabilities effectively.

Optimized Cypher queries in Neo4j to extract insights from complex graph structures, enhancing decision-making capabilities.

Orchestrated data flows and performed real-time transformations using Apache NiFi, ensuring data security, scalability, and reliability.

Utilized PySpark and Spark SQL to import, transform, and analyze data from AWS S3, and developed PySpark scripts for data aggregation, queries, and integration with OLTP systems, optimizing data processing workflows.
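
A short PySpark sketch of that S3 import-and-aggregate pattern; the bucket paths, view name, and columns are assumptions.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-aggregation").getOrCreate()

    claims = spark.read.parquet("s3a://example-bucket/claims/")  # assumed input path
    claims.createOrReplaceTempView("claims")

    daily_totals = spark.sql(
        "SELECT claim_date, COUNT(*) AS claim_count, SUM(amount) AS total_amount "
        "FROM claims GROUP BY claim_date"
    )
    daily_totals.write.mode("overwrite").parquet("s3a://example-bucket/aggregates/daily/")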

Designed and implemented DW-BI solutions using Informatica, SAP BusinessObjects, and Tableau, enabling data-driven decision-making.

Leveraged Java programming for optimizing data processing algorithms, implementing custom transformations, and integrating disparate data systems within complex architectures.

Utilized PySpark for handling large datasets, leveraging in-memory capabilities, efficient joins, and transformations for data ingestion and processing.

Environment: AWS, AWS Lambda, AWS EMR, Snowflake, SQL, Java, ETL, Redshift, GCP, Informatica, SageMaker, Apache NiFi, DevOps, Jenkins, Ab Initio, Cloudera (CDH3), HDFS, Pig 0.15.0, DW-BI, Oracle 8i/9i/10g, Hive 2.2.0, Kafka, Neo4j, Sqoop, Shell Scripting, Spark 1.8, Linux CentOS, MapReduce, Python 2, Eclipse 4.6.

Abbvie Healthcare, North Chicago, IL February 2017 to July 2019

Data Engineer

Responsibilities:

Developed an end-to-end data pipeline solution in the AWS cloud that reads data from Redshift, performs transformations using AWS Glue (Spark), and writes data to Neo4j, triggered by SQS events from upstream data.

Orchestrated Glue jobs using Step Functions, used RDS and DynamoDB for audit purposes, and wrote Lambda functions to invoke Step Functions and Python code for RDS database updates and queries.

Developed Glue jobs with Python code to query data from Aurora DB and create tables in Redshift.
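
A hedged outline of a Glue (PySpark) job along those lines, reading from an Aurora MySQL source and writing to Redshift; the Glue connection names, table names, and temporary S3 directory are assumptions, not the actual job.

    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read from Aurora through a Glue connection (assumed connection/table names).
    source = glue_context.create_dynamic_frame.from_options(
        connection_type="mysql",
        connection_options={
            "useConnectionProperties": "true",
            "connectionName": "aurora-connection",
            "dbtable": "orders",
        },
    )

    # Write into Redshift, staging through a temporary S3 location.
    glue_context.write_dynamic_frame.from_jdbc_conf(
        frame=source,
        catalog_connection="redshift-connection",
        connection_options={"dbtable": "public.orders", "database": "analytics"},
        redshift_tmp_dir="s3://example-bucket/temp/",
    )
    job.commit()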

Worked with PySpark classes, traits, and objects using DataFrames and RDDs in Spark for data aggregation, queries, and writing data back into the Neo4j graph database.

Streamlined data integration processes by implementing Informatica IICS workflows to process healthcare data efficiently.

Utilized AWS S3 to store and manage large datasets, reducing operational costs by 15% through optimized storage policies.

Developed and scheduled ETL processes using SQL and Python to enable real-time reporting and analytics for critical business functions.

Led automation efforts by integrating CI/CD workflows into AWS environments, improving deployment cycles and reducing errors.

Collaborated with stakeholders to define business requirements, ensuring alignment with technical solutions and deliverables.

Proficient in diagnosing and resolving Unix system issues, including hardware failures, software conflicts, and network connectivity problems, using diagnostic tools and log analysis techniques, minimizing downtime and improving system reliability.

Created and managed S3 buckets in AWS, imported data from S3 into Spark RDDs, and performed transformations on them.

Created AWS resources such as S3 buckets, Glue jobs and connections, IAM roles, Lambda functions, security groups, Step Functions, SQS queues, and DynamoDB tables using Terraform.

Implemented Kafka for real-time data streaming and messaging, enabling efficient and scalable data ingestion and processing across distributed systems.

Orchestrated IBM DataStage to streamline ETL processes, ensuring seamless extraction, transformation, and loading of data across heterogeneous environments.

Leveraged Snowflake for cloud-based data warehousing solutions, enabling scalable and high-performance analytics and querying of large datasets.

Designed and implemented a robust data pipeline to efficiently extract, transform, and load (ETL) data from Amazon S3 storage to AWS Glue, ensuring seamless integration and transformation of raw data into a structured format suitable for analysis and reporting purposes.

Performed data modeling using Erwin for rebate data management systems.

Designed and optimized SQL queries to support rebate calculations and business reporting.

Created interactive dashboards using Power BI for rebate tracking and forecasting.

Created an Amazon DynamoDB table and items pertaining to Step Functions executions for auditing purposes.
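
A small Boto3 sketch of writing one audit item per Step Functions execution; the table name and attribute layout are hypothetical.

    import boto3
    from datetime import datetime, timezone

    table = boto3.resource("dynamodb").Table("etl_execution_audit")  # assumed table name

    def record_execution(execution_arn, status):
        table.put_item(
            Item={
                "execution_arn": execution_arn,
                "status": status,
                "recorded_at": datetime.now(timezone.utc).isoformat(),
            }
        )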

Deployed recommendation systems using machine learning to personalize user recommendations based on historical interactions and behavior patterns, optimizing models for accuracy and engagement metrics across domains.

Designed, deployed, and maintained PostgreSQL databases, ensuring data integrity, security, and high availability to support mission-critical applications and business operations.

Deployed resources using GitLab CI/CD pipelines defined in YAML files.

Utilized Ab Initio to develop and manage efficient ETL processes, reducing job execution time by 40% for datasets exceeding 10 million records through advanced parallel processing and data transformation techniques.

Created Python classes and modules and worked with setup and wheel files for packaging.

Worked on complex Redshift SQL queries, stored procedures, and views.

Worked with different file formats including ORC, Avro, Parquet, CSV, and JSON.

Configured the GitLab CI runner and CI/CD variables for the GitLab repository to execute pipelines.

Worked in an Agile environment with strong knowledge of Scrum methodologies, including planning, breaking work down into stories, and effort estimation.

Environment: AWS Services, SQL, PostgreSQL, Hadoop, Spark, GCP, Terraform, GitHub, Python, Agile-Scrum, Parquet, Avro, PySpark, Jenkins, Ctrl-M.

Zensar Technologies, Bangalore, India November 2013 to December 2016

Data Analyst

Responsibilities:

Performed Column Mapping, Data Mapping and Maintained Data Models and Data Dictionaries.

Built system to perform real-time data processing using Spark streaming and Kafka.
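
For illustration, a Kafka-to-Spark sketch using the current Structured Streaming API (the original work may have used the older DStream API); the broker address and topic name are assumptions.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")  # assumed broker
        .option("subscribe", "events")                      # assumed topic
        .load()
    )

    # Count records per raw payload value as a stand-in transformation.
    counts = (
        events.select(F.col("value").cast("string").alias("payload"))
        .groupBy("payload")
        .count()
    )

    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()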

Involved in retrieving multi-million records for data loads using SSIS and by querying against Heterogeneous Data Sources like SQL Server, Oracle, Text files and some Legacy systems.

Expertise in using different SSIS transformations such as Lookup, Derived Column, Merge Join, Fuzzy Lookup, For Loop, Foreach Loop, Conditional Split, Union All, and Script Component.

Transferred data from various data sources/business systems including MS Excel, MS Access, and Flat Files to SQL Server using SSIS/DTS packages using various features.

Involved in Performance tuning of ETL transformations, data validations and stored procedures.

Strong experience in designing and implementing ETL packages using SSIS for integrating data using OLE DB connection from heterogeneous sources.

Architected and deployed scalable data pipelines using GCP services such as Cloud Dataflow, Pub/Sub, and BigQuery.

Enhanced ETL workflows with Cloud Composer and optimized data transformations for large-scale clinical datasets.

Designed monitoring solutions using GCP’s Operations Suite, ensuring system reliability and performance.

Designed and developed a system to collect data from multiple portals using Kafka and then process it using Spark.

Extensively worked on UNIX Shell Scripting for splitting group of files to various small files.

Analyzed existing SQL scripts and redesigned them using PySpark SQL.


