Praveen CH
Sr Data Engineer
***************@*****.***/ 618-***-****/ linkedin.com/in/praveen-ch-ab1832345
Professional Summary:
Senior Data Engineer with 6+ years of extensive experience in designing, building, and maintaining scalable data architecture and solutions.
Expertise in data pipeline development, data modeling, and data integration for large-scale analytics platforms, ensuring efficient data processing and storage.
Proficient in a wide range of technologies, including SQL, Python, Hadoop, Spark, and cloud platforms like AWS, Azure, and GCP, ensuring high-performance and scalable solutions.
Led advanced data engineering projects focused on enhancing security, scalability, and efficiency in large-scale environments.
Skilled in Python, Scala, Java, and Shell scripting within UNIX/Linux environments, enabling automation and optimizing complex workflows.
Extensive experience with AWS services, such as EMR, EC2, RDS, S3, Lambda, Glue, and Redshift, ensuring highly available and scalable cloud-based data solutions.
In-depth expertise in architecting complex data ingestion strategies and managing Hadoop infrastructures, excelling in data modeling, mining, and optimizing ETL processes.
Proficient in Spark technologies including RDDs, DataFrames, Spark SQL, MLlib, and GraphX, delivering high efficiency in big data processing.
Solid background in ETL methodologies, utilizing tools such as Informatica PowerCenter, Microsoft Integration Services, and SnowSQL for efficient data transformation and loading.
Developed scalable microservices with Spring Boot, enabling real-time data processing and seamless integration within large data ecosystems.
Designed logical and physical data models using Star Schema and Snowflake Schema to support robust reporting and analytics needs.
Expert in working with SQL Server and NoSQL databases such as MongoDB and DynamoDB, and experienced in writing complex PL/SQL queries for effective data extraction.
Proficient in data visualization using tools like Tableau and Power BI, providing actionable insights and enhancing decision-making.
Experienced in designing secure API endpoints with JWT, OAuth2, and API keys, ensuring secure and flexible data access.
Strong advocate of Agile and Scrum methodologies, with proficiency in Test-Driven Development (TDD) and CI/CD pipelines using tools like Jenkins, Docker, Concourse, and Bitbucket.
Leveraged cutting-edge technologies like Apache Kafka and Apache Storm for real-time data streaming and analytics, ensuring timely insights and decision-making.
Implemented robust data governance frameworks, ensuring compliance with regulations like GDPR and CCPA, maintaining data integrity and security.
Skilled in integrating machine learning models and AI algorithms into data processing pipelines using platforms like TensorFlow and PyTorch, providing predictive insights to enhance business decisions.
Experienced in building real-time data processing feeds using Spark Structured Streaming and Kafka, optimizing the performance and scalability of data pipelines (illustrative sketch below).
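Illustrative sketch (not client code): a minimal PySpark example of the Kafka plus Structured Streaming pattern referenced above, with watermarking for late-arriving events; the broker address, topic, schema, and output paths are assumed placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StructType, StringType, TimestampType, DoubleType

spark = SparkSession.builder.appName("kafka-structured-streaming").getOrCreate()

# Placeholder schema for incoming JSON events.
schema = (StructType()
          .add("event_id", StringType())
          .add("event_time", TimestampType())
          .add("amount", DoubleType()))

# Read a Kafka topic as a stream; broker and topic names are hypothetical.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "events")
       .option("startingOffsets", "latest")
       .load())

events = (raw
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Watermark tolerates events arriving up to 10 minutes late before windows are finalized.
agg = (events
       .withWatermark("event_time", "10 minutes")
       .groupBy(window(col("event_time"), "5 minutes"))
       .count())

query = (agg.writeStream
         .outputMode("append")
         .format("parquet")
         .option("path", "/tmp/stream_out")           # placeholder sink path
         .option("checkpointLocation", "/tmp/chkpt")  # placeholder checkpoint location
         .start())
```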
Technical Skills:
Programming Languages: Python, Scala, Java, GoLang
Scripting: Shell scripting within UNIX/Linux environments
Cloud Platforms: AWS, Azure, Google Cloud Platform
Big Data Technologies: Hadoop ecosystem (HDFS, MapReduce, Hive, Pig, Oozie, Flume, Cassandra), Spark (Scala integration, PySpark, RDDs, DataFrames, Spark SQL, Spark MLlib, Spark GraphX)
ETL Tools: Microsoft Integration Services, Informatica PowerCenter, SnowSQL, Talend
Data Modeling: Star Schema, Snowflake Schema
Databases: SQL Server, NoSQL (DynamoDB, MongoDB), Oracle (PL/SQL)
Data Visualization Tools: Tableau, Power BI
Infrastructure as Code: Terraform
API Development: JWT, OAuth2, API keys
CI/CD Tools: Jenkins, Docker, Concourse, Bitbucket, Bamboo
Version Control Systems: Git, SVN
Testing Tools: Apache JMeter, QuerySurge, Talend Data Quality
Network Protocols: DNS, TCP/IP, VPN
Real-Time Data Streaming: Apache Kafka, Apache Storm
Machine Learning & AI Platforms: TensorFlow, PyTorch
Methodologies: Agile, Scrum, Test-Driven Development (TDD)
Professional Experience:
Client: Centene Corporation, St Louis, MO September 2023 to Present
Role: Sr. Data Engineer
Responsibilities:
Delivered FDA-regulated US state reports by monitoring inbound data and managing data solutions on Azure Cloud as part of the Teradata Retirement initiative, which replicated data to Google Cloud Platform (GCP).
Built Azure Synapse pipelines to ingest inbound state report data in CSV format through APIs, followed by processing using SQL pools.
Utilized Synapse Spark pools for data cleansing and transformation, enabling real-time data processing by integrating Apache Kafka with Synapse Analytics and using Spark Streaming.
Developed Azure Data Factory (ADF) ingestion pipelines to fetch data from Google AdWords API, running every hour, and performed data transformation using Databricks.
Migrated on-premises databases to Azure Data Lake Store using ADF and built a metadata-driven ingestion framework in Databricks that runs every 15 minutes to capture database events.
Implemented Delta Live Tables (DLT) in Databricks for building interconnected live analytics jobs, with auto-processing using the cloud files pattern and Databricks Auto Loader.
Built streaming live tables using DLT, ensuring that the latest data is available for analytics, and implemented data quality checks using constraints in DLT scripts.
Leveraged table properties (TBLPROPERTIES) to document changes and define Medallion Architecture layers (Bronze, Silver, Gold) within Databricks.
Automated schema evolution for JSON feeds using Databricks Auto Loader, enabling schema inference and routing unknown columns to a rescue column for seamless downstream processing (illustrative sketch below, after this role's Environment line).
Designed Spark applications using Spark SQL on Databricks to extract, transform, and aggregate data from various file formats for insights into customer usage patterns.
Used Spark Structured Streaming in Databricks to process claim events from Azure Stream Analytics, implementing watermarking to handle late-arriving data.
Built robust ADF pipelines using Linked Services, connecting to sources like Azure SQL, Blob Storage, and performed ETL operations to Azure Data Lake Analytics using T-SQL and Spark SQL.
Migrated Hive and HDFS data using ADF Copy Activity, connecting to on-prem HDFS via Self-hosted Integration Runtime and WebHDFS connector.
Managed and optimized Databricks Spark clusters, estimated cluster size, and tuned performance using repartition, coalesce, broadcast joins, and job cluster scheduling.
Designed and maintained data solutions on GCP, including BigQuery, Cloud Storage, and Cloud SQL, and implemented data pipelines using Dataflow, Dataproc, and Pub/Sub, with monitoring via Stackdriver and Cloud Monitoring.
Environment: Azure Data Factory, Azure Databricks, Azure Synapse, Azure DW, Spark, PostgreSQL, DBeaver, BigQuery, Cloud Storage, Cloud SQL, Dataflow, Dataproc, Pub/Sub, IAM, Python, Looker, MongoDB, SharePoint.
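Illustrative sketch (not client code) of the Databricks Auto Loader pattern referenced in this role: cloudFiles ingestion with schema inference, schema evolution, and a rescued-data column. Storage paths and table names are assumed placeholders, and `spark` refers to the Databricks notebook session.

```python
# Assumes a Databricks runtime where the cloudFiles (Auto Loader) source is available.
landing_path = "abfss://landing@storageacct.dfs.core.windows.net/json-feeds/"   # placeholder
schema_path  = "abfss://meta@storageacct.dfs.core.windows.net/schemas/json-feeds/"  # placeholder

stream = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "json")
          .option("cloudFiles.schemaLocation", schema_path)           # schema inference and tracking
          .option("cloudFiles.schemaEvolutionMode", "addNewColumns")  # evolve schema as new fields arrive
          .option("cloudFiles.rescuedDataColumn", "_rescued_data")    # unknown/invalid columns land here
          .load(landing_path))

(stream.writeStream
 .option("checkpointLocation",
         "abfss://meta@storageacct.dfs.core.windows.net/chkpt/json-feeds/")  # placeholder
 .trigger(availableNow=True)  # requires a recent runtime; trigger(once=True) on older ones
 .toTable("bronze.json_feed_events"))  # hypothetical Bronze-layer table
```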
Client: Broadridge, Newark, NJ April 2022 to July 2023
Role: Azure Data Engineer
Responsibilities:
Developed Full Data Pipeline (FDPL) frameworks in Databricks using PySpark, creating configurable, metadata-driven pipelines to build SILVER and GOLD layer tables.
Sized Databricks clusters for development, integrated version control with Git using Azure DevOps, and built ingestion pipelines via Azure Data Factory (ADF) and Azure Web Apps.
Migrated ETL jobs from Databricks to Azure Synapse Spark pools, ingesting structured data from Oracle using Synapse Pipelines and Oracle SQL connectors.
Designed and implemented star schema for customer analytics and snowflake schema for financial reporting, improving performance and data normalization.
Created Synapse Spark notebooks to standardize vendor files from SFTP and store the cleaned data in Azure Blob Storage.
Built Fact and Dimension models in Azure Synapse Dedicated SQL Pools using 24-hour delta records for KPI dashboards.
Developed Synapse Dataflows to read shared internal data sources and load data into SQL Pools (MPP), integrated with Power BI for reporting.
Migrated legacy batch jobs from on-prem systems (handling XML, CSV, XLS) to Azure Data Lake using Synapse Pipelines.
Built a data delivery framework on Databricks Spark for caching, vendor integrations, and real-time access via Power BI and Snowflake.
Migrated Data Marts from Teradata to Snowflake using SnowSQL, External Stages, and COPY INTO, automated with shell scripts (illustrative sketch below, after this role's Environment line).
Created streaming pipelines for Clickstream data by integrating Databricks, Azure Event Hubs, and Azure Stream Analytics.
Built Delta Live Tables (DLT) in Databricks, using MERGE for SCD Type-2, and applied OPTIMIZE with Z-ORDER and VACUUM for data lifecycle management.
Enabled Git integration on Databricks, used widgets for dynamic parameterization, and implemented remote job recovery routines.
Used Python SDK to trigger Databricks and ADF pipelines, and optimized PySpark jobs with custom Spark configurations.
Developed Shell scripts for ingesting data from SFTP to HDFS and applied data quality checks using DBT and Great Expectations.
Environment: Azure Data Factory, Azure Databricks, Spark, Kafka, Log Analytics, Azure DevOps (Git & CI/CD), HDP, Hive, Sqoop, Oracle GoldenGate, Teradata, Google Campaign Manager, Python, shell scripting, Snowflake, Azure Web App, Azure App Service, Azure Data Hub.
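Illustrative sketch (not client code) of the Teradata-to-Snowflake load pattern referenced in this role, using the Snowflake Python connector to run COPY INTO from an external stage; the account, credentials, stage, and table names are assumed placeholders.

```python
import snowflake.connector

# Placeholder connection details; in practice these come from a secrets store.
conn = snowflake.connector.connect(
    account="my_account",
    user="etl_user",
    password="********",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="STAGING",
)

# Load exported Teradata extracts staged in cloud storage into a Snowflake table.
copy_sql = """
    COPY INTO STAGING.SALES_DM
    FROM @EXT_AZURE_STAGE/sales_dm/
    FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1)
    ON_ERROR = 'ABORT_STATEMENT'
"""

try:
    cur = conn.cursor()
    cur.execute(copy_sql)
    for row in cur:
        print(row)  # per-file load status returned by COPY INTO
finally:
    conn.close()
```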
Client: HSBC, Hyderabad, India December 2019 to November 2021
Role: Data Engineer
Responsibilities:
Developed and managed ETL workflows using AWS Step Functions, orchestrating complex data pipelines with high reliability.
Wrote advanced SQL scripts for data migration and transformation, including Teradata to Snowflake conversion.
Designed and executed big data solutions using AWS Lambda, Amazon Redshift, S3, and EC2 for scalable processing and storage.
Migrated on-premises applications to AWS EMR, maintaining and monitoring Hadoop clusters for batch processing.
Built robust ETL pipelines to ingest and transform data from multiple sources into Amazon S3 and Redshift.
Utilized Spark SQL and PySpark to process large JSON datasets, create Schema RDDs, and manage structured data.
Converted legacy Hive/SQL logic into optimized PySpark transformations using RDDs and DataFrames (illustrative sketch below, after this role's Environment line).
Managed YARN clusters for efficient job scheduling, resource allocation, and tuning for high-performance computing.
Deployed and administered HDFS clusters, enabling distributed data storage for high-volume processing.
Imported relational data using Sqoop from RDBMS to HDFS and implemented streaming into HBase with PySpark.
Created and managed Hive tables, optimizing HDFS storage with partitioning and bucketing techniques.
Built real-time streaming pipelines using Amazon Kinesis, Firehose, and Spark Structured Streaming for time-sensitive analytics.
Developed full-stack tools and APIs in Python using frameworks such as Django and Flask, and tested them with PyUnit.
Utilized MongoDB for handling JSON-like documents, implemented ingestion pipelines, and optimized using indexing and projections.
Containerized applications and services using Docker, streamlining development workflows and environment management.
Environment: Python, Pandas, Microsoft SQL Server, Snowflake, MongoDB, GitHub, Jenkins, Django, HTML5, CSS, Bootstrap, JSON, JavaScript, AJAX, RESTful APIs, MySQL, Docker, AWS (EC2, S3), PyUnit.
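Illustrative sketch (not client code) of the Hive-to-PySpark conversion pattern referenced in this role; the database, table, and column names are assumed placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("hive-to-pyspark")
         .enableHiveSupport()
         .getOrCreate())

# Original Hive logic (roughly):
#   SELECT c.region, SUM(t.amount) AS total_amount
#   FROM transactions t JOIN customers c ON t.customer_id = c.customer_id
#   WHERE t.txn_date >= '2021-01-01'
#   GROUP BY c.region;
txns = spark.table("edw.transactions").where(F.col("txn_date") >= "2021-01-01")
custs = spark.table("edw.customers")

result = (txns.join(custs, "customer_id")
          .groupBy("region")
          .agg(F.sum("amount").alias("total_amount")))

# Write the aggregate back to S3 as Parquet for downstream use (path is a placeholder).
result.write.mode("overwrite").parquet("s3a://analytics-bucket/agg/region_totals/")
```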
Client: RX Sense, Hyderabad, India June 2018 to December 2019
Role: Data Engineer
Responsibilities:
Built Databricks jobs to implement data quality checks and transform sales data received from front-end applications.
Developed Databricks jobs for processing real-time sales data using Spark Structured Streaming.
Created ETL jobs to read from SFTP accounts, extract files, and process them into Databricks DBFS using PySpark.
Built Data Factory pipelines using Auto Loader for data ingestion from Oracle and Salesforce, landing the data in Azure Data Lake Storage Gen1 (ADLS Gen1).
Developed interdependent jobs to capture KPI metrics for sales funnels and feed the results to Power BI dashboards.
Built streaming jobs using Spark Streaming to process transactional data in real time and store it in Cassandra for transaction verification via machine learning models.
Developed Spark jobs to convert Oracle materialized views, optimizing data extraction for Tableau dashboards. Automated end-to-end execution using EMR and Step Functions.
Created NiFi pipelines to transform raw JSON data into categorized datasets, migrating workflows from On-prem NiFi to EC2-based NiFi.
Developed Spark jobs in Scala on YARN for interactive and batch processing, and UNIX shell scripts for file loading into HDFS.
Used Spark SQL for faster querying of large datasets, and Sqoop to offload data from EDW to Hadoop clusters.
Configured CRON jobs for scheduled RAW layer data extraction and used Oozie for scheduling Sqoop and Hive jobs to build the data model in Hive.
Built Sqoop jobs to incrementally load data from Teradata and Oracle into HDFS using the last updated timestamp for efficient data transfer.
Worked with HCatalog to manage schemas between the Pig and Hive frameworks, ensuring smooth integration with mainframe fixed-width data.
Designed partitioning and bucketing schemes in Hive for efficient data access and created external Hive tables for optimized data analysis (illustrative sketch below, after this role's Environment line).
Developed batch Spark jobs using Spark SQL to perform data transformations and update master data in Cassandra based on business requirements.
Migrated Oracle DataMart to Redshift using AWS Database Migration Service and optimized EMR workloads for cost-effective and scalable data analysis.
Environment: AWS, S3, Sqoop, Kafka, Spark, Spark SQL, Hive, Linux, Oozie, Java, Scala, Eclipse, Tableau, UNIX Shell Scripting, PuTTY.
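Illustrative sketch (not client code) of the Hive partitioning and bucketing pattern referenced in this role; the database, table, and column names are assumed placeholders.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partition-bucket")
         .enableHiveSupport()
         .getOrCreate())

sales = spark.table("raw.sales_events")  # hypothetical source table

# Partition by ingestion date and bucket by customer_id so date-bounded scans
# and joins on customer_id touch fewer files.
(sales.write
 .mode("overwrite")
 .format("parquet")
 .partitionBy("ingest_date")
 .bucketBy(16, "customer_id")
 .sortBy("customer_id")
 .saveAsTable("curated.sales_events"))  # hypothetical curated Hive table
```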