Bharti Rao
Senior Data Engineer
*************@*****.*** +1-817-***-****
https://www.linkedin.com/in/bharti-rao-20948b209/
PROFESSIONAL SUMMARY:
8+ years of overall IT experience across a variety of industries, including hands-on experience as a Data Engineer in Big Data analytics and development.
Experience in collecting, processing, and aggregating large amounts of streaming data using Kafka and Spark Streaming.
Proficient in ingesting, cleansing, and transforming data using AWS services such as AWS Lambda, AWS Glue, and Step Functions, ensuring data consistency and quality.
Worked with EMR clusters for processing Big Data across a Hadoop cluster, utilizing Spark Streaming for near real-time data ingestion, transformation, and aggregation.
Developed ETL pipelines using Informatica, Ab Initio, Talend, and Apache NiFi to efficiently move and transform data across multiple platforms and databases.
Integrated machine learning models into data pipelines using PySpark and AWS Lambda, enabling real-time predictions and automated decision-making in production environments.
Hands-on experience in handling streaming data using Apache Kafka, integrating Kafka topics with Spark Streaming for real-time processing and storing transformed data in AWS S3 and HDFS.
Experience using Snowflake, Jira, Confluence, Databricks, and Alteryx across various projects.
Applied data science techniques such as feature engineering, regression modeling, and predictive analytics using Python (Pandas, NumPy, Scikit-learn), contributing to data-driven decision-making.
Extensive experience in cloud platforms, particularly AWS and Azure, leveraging services like AWS Glue, EMR, Redshift, Athena, Azure Data Factory (ADF), and Azure Data Lake for scalable data solutions.
Designed and implemented Data Lake architectures using AWS S3, Azure Data Lake, and OneLake, optimizing structured and unstructured data storage for analysis.
Optimized relational databases such as MySQL, SQL Server, and Oracle 12c, improving query performance through indexing and tuning.
Well-versed in data security, data profiling, and data modeling techniques. Comfortable working in Agile environments with knowledge of DevOps tools and CI/CD pipelines.
Strong expertise in data warehousing and NoSQL technologies including DynamoDB, MongoDB, Cassandra, and Cosmos DB for low-latency data access and flexible schema modeling.
Proficient with Power BI and Excel for reporting and data visualization. Skilled in handling multiple data file formats like JSON and XML.
Optimized Spark and Snowflake jobs by tuning queries, indexing, and caching mechanisms, resulting in improved efficiency in data processing and analytics.
Worked with Docker for containerized application deployment, managing snapshots, directories, and running containers efficiently.
Managed security groups, auto-scaling, and infrastructure deployment using Terraform templates, integrating AWS Lambda, CI/CD pipelines, and AWS CodePipeline for continuous deployment.
Built interactive dashboards and reports using Tableau, visualizing data insights and trends for business intelligence and decision-making.
Experience in full data pipeline development, from ingestion to transformation, storage, and reporting, ensuring end-to-end data lifecycle management.
Worked closely with Data Architects, Business Analysts, and DevOps teams to define data requirements, implement scalable solutions, and follow Agile and Scrum methodologies.
Skilled in working with a diverse technology stack including Hadoop, Spark, Databricks, Airflow, Kafka, AWS, Azure, and Ab Initio to build scalable and efficient data engineering solutions.
Sourced data from the Data Lake into Databricks across different environment phases and performed Delta Lake operations; strong data engineering experience with Spark and Azure Databricks, running notebooks through ADF.
Developed and implemented historical and incremental loads using Databricks and Delta Lake, orchestrated through ADF pipelines (see the illustrative sketch following this summary).
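Illustrative sketch (hypothetical example, not project code): a minimal PySpark incremental (upsert) load into a Delta Lake table of the kind described above, run from a Databricks notebook orchestrated by ADF. Paths, key columns, and table names are assumed placeholders.

from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("incremental-load").getOrCreate()

# Incremental batch extracted from the Data Lake (assumed path)
increment = spark.read.parquet("abfss://lake@examplestorage.dfs.core.windows.net/stage/customers/")

# Target Delta table (assumed mount path)
target = DeltaTable.forPath(spark, "/mnt/delta/customers")

# Upsert: update existing customer rows, insert new ones
(target.alias("t")
 .merge(increment.alias("s"), "t.customer_id = s.customer_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())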
Technical Skills:
Cloud Platforms
AWS (EC2, S3, EBS, ELB, RDS, SNS, SQS, VPC, Lambda, CloudFormation, CloudWatch, ELK Stack, Redshift, Data Pipelines, Glue), Azure (Data Factory, Databricks, Data Lake, Synapse), Google Cloud Platform
Programming
Python, PySpark, Shell Scripting, PowerShell, Scala
Databases
MySQL, MongoDB, DynamoDB, Cassandra, SQL Server, Oracle 12c, Cosmos DB, HBase
Big Data Technologies
Hadoop, Hive, MapReduce, HDFS, Cloudera, Spark, Spark Streaming, Sqoop, Flume, YARN, Informatica, Talend, Fivetran, Control-M, Oozie
DevOps Tools
Docker, Kubernetes, Ansible, GIT, Bitbucket, Jira, Bamboo, Maven, Jenkins
Data Visualization and Reporting
Power BI, Tableau, Iris Studio, Kensho
Collaboration Tools and Operating Systems
SharePoint, Windows 10
Web/Application Servers
JBOSS, WebLogic
PROFESSIONAL EXPERIENCE:
Client: StoneEagle May 2021 – May 2025
Role: Senior Data Engineer
Responsibilities:
Ingested data through cleansing and transformation steps, leveraging AWS Lambda, AWS Glue, and Step Functions.
Created monitors, alarms, notifications and logs for Lambda functions, Glue Jobs, EC2 hosts using CloudWatch.
Worked on EMR clusters of AWS for processing Big Data across a Hadoop Cluster of virtual servers.
Developed various Mappings with the collection of all Sources, Targets, and Transformations using Informatica Designer.
Performed advanced data profiling, quality checks, and data modeling for Snowflake-based data warehousing solutions.
Worked extensively with JSON/XML file formats for ingestion pipelines and schema validation.
Leveraged AI-powered data transformation techniques (text classification, sentiment analysis, clustering) using pre-trained models in SageMaker and integrated into Spark-based data workflows.
Integrated ML models hosted in SageMaker with streaming data via Kafka and Spark Streaming, enabling real-time classification and fraud detection pipelines.
Built extract/load/transform (ELT) processes in the Snowflake data warehouse using dbt to manage and store data from internal and external sources.
Hands on experience in integrating Snowflake and dbt (data build tool) to transform data.
Created and maintained the tables/views in Snowflake data warehouse for downstream consumers.
Worked on Docker container snapshots, attaching to running containers, removing images, and managing directory structures and containers.
Built CI/CD pipelines for ML workflows using AWS CodePipeline, Terraform, and Git for continuous integration of updated models and retraining triggers.
Improved data search speeds in MySQL and MongoDB through better indexing and performance tuning.
Leveraged Jira and Confluence for sprint planning, documentation, and collaboration with cross-functional teams.
Built Power BI dashboards for insights derived from Snowflake and Databricks pipelines.
Utilized Alteryx for quick data wrangling and transformation tasks where applicable.
Collected data from AWS S3 buckets in near real time using Spark Streaming, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS; data engineering using Spark, Python, and PySpark (see the illustrative sketch following this section).
Experience deploying Hadoop in VM, AWS cloud, and physical server environments.
Monitored Hadoop cluster connectivity, security, and file system management.
Replaced and appended data in the Hive database by pulling it from multiple source data marts into HDFS using Sqoop.
Extensive hands-on experience writing notebooks in Databricks using Python/Spark SQL for complex data aggregations, transformations, and schema operations; good familiarity with Databricks Delta and DataFrame concepts.
Created pipelines to load data from Lake to Databricks and Databricks to Azure SQL DB.
Designed and implemented data integration solutions using Ab Initio tools to ensure efficient and accurate processing of large volumes of data.
Ensured data security compliance for sensitive datasets using encryption and IAM policies in AWS.
Built feature engineering workflows in PySpark and deployed on AWS EMR and Glue, enabling efficient transformation of raw data into ML-ready datasets used in churn prediction and forecasting models.
Developed and implemented data quality checks and controls using Ab Initio's data quality features to ensure the accuracy, completeness, and consistency of data.
Created and maintained metadata management and lineage solutions using Ab Initio's metadata capabilities.
Successfully implemented a proof of concept (POC) in development databases to validate requirements and benchmark the ETL loads.
Supported continuous storage in AWS using Elastic Block Storage, S3, and Glacier; created volumes and configured snapshots for EC2 instances.
Worked on ETL migration by developing and deploying AWS Lambda functions to generate a serverless data pipeline, with outputs registered in the Glue Catalog and queryable from Athena.
Managed security groups on AWS, focusing on high availability, fault tolerance, and auto scaling using Terraform templates, along with continuous integration and continuous deployment using AWS Lambda and AWS CodePipeline. Used Pandas in Python for cleansing and validating source data.
Once transformed, the data was moved to a Spark cluster and published live to the application using Spark Streaming and Kafka.
Environment: AWS (EC2, S3, EBS, ELB, RDS, SNS, SQS, VPC, Lambda, CloudFormation, CloudWatch, ELK Stack, DynamoDB, Kinesis), Alteryx, Power BI, Bitbucket, Ansible, Python, PySpark, MySQL, MongoDB, Snowflake, Shell Scripting, PowerShell, GIT, Jira, JBOSS, Bamboo, Docker, WebLogic, Maven, Unix/Linux, Hadoop, Hive.
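Illustrative sketch (hypothetical example, not project code) of the near-real-time S3-to-HDFS ingestion described above, using PySpark Structured Streaming; bucket names, schema, and paths are assumed placeholders.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("learner-data-ingest").getOrCreate()

# Assumed schema for the incoming learner events
schema = StructType([
    StructField("learner_id", StringType()),
    StructField("event_type", StringType()),
    StructField("score", DoubleType()),
    StructField("event_ts", TimestampType()),
])

# Pick up new JSON files landing in S3 in near real time (assumed bucket/prefix)
raw = (spark.readStream
       .schema(schema)
       .json("s3a://example-bucket/incoming/learner-events/"))

# On-the-fly transformation and aggregation into the common learner data model
learner_model = (raw
                 .withWatermark("event_ts", "10 minutes")
                 .groupBy(F.window("event_ts", "5 minutes"), "learner_id")
                 .agg(F.avg("score").alias("avg_score"),
                      F.count("*").alias("event_count")))

# Persist the aggregated model to HDFS as Parquet (assumed paths)
query = (learner_model.writeStream
         .outputMode("append")
         .format("parquet")
         .option("path", "hdfs:///data/learner_model/")
         .option("checkpointLocation", "hdfs:///checkpoints/learner_model/")
         .start())
query.awaitTermination()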
Client: Veritex Community Bank Dec 2019 – April 2021
Role: Senior Data Engineer
Responsibilities:
Collaborated with Business Analysts, Engineers across departments to gather business requirements, and identify workable items for development.
Selected and generated data into CSV files, stored them in AWS S3 using AWS EC2, and then structured and loaded the data into AWS Redshift.
Hands-on experience working with AWS EMR, EC2, S3, Redshift, DynamoDB, Lambda, Athena, and Glue.
Worked on data migration from DataStage to the AWS Snowflake environment using dbt.
Designed and developed ELT jobs using dbt to achieve the best performance.
Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, RDDs, and memory optimization.
Worked with the Hadoop ecosystem and implemented Spark using Scala, utilizing DataFrames and the Spark SQL API for faster data processing.
Built and optimized data pipelines with Databricks, Snowflake, and SQL to support data analytics and visualization in Power BI.
Actively contributed to data modeling sessions and metadata-driven architecture discussions.
Performed in-depth data profiling to identify anomalies and ensure data consistency across systems.
Utilized Jira for ticket management and Agile sprint tracking, with Confluence used for documentation.
Subscribed to Kafka topics with the Kafka consumer client and processed events in real time using Spark.
Developed Spark Streaming job to consume the data from the Kafka topic of different source systems and push the data into S3.
Designed and Developed Real Time Stream Processing Application using Spark, Kafka, Scala and Hive to perform Streaming ETL and apply Machine Learning.
Used Alteryx Designer for data blending, cleansing, and enrichment from diverse sources.
Managed ingestion and parsing of files in JSON and XML format across AWS S3 and Redshift environments.
Collected data from AWS S3 buckets in near real time using Spark Streaming, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
Converted Hive/SQL queries into Spark transformations using Spark RDDs and PySpark.
Hands-on experience working with the Snowflake database.
Worked on loading data into Snowflake from S3; worked on Databricks and with Delta tables (see the illustrative sketch following this section).
Worked on performance tuning of Spark and Snowflake jobs.
Used Python to write an event-based service with AWS Lambda to deliver real-time data to One-Lake (a Data Lake solution in the Cap-One Enterprise).
Boosted the performance of regression models by applying polynomial transformations and feature selection, and used those models to select stocks.
Generated report on predictive analytics using Python and Tableau including visualizing model performance and prediction results.
Utilized Agile and Scrum methodology for team and project management.
Used Git for version control with colleagues.
Environment: AWS Redshift, AWS S3, AWS Data Pipelines, Alteryx, AWS Glue, Snowflake, Hadoop, YARN, SQL Server, Spark, Spark Streaming, Scala, Kinesis, Python, PySpark, Hive, Linux, Sqoop, Informatica, Power BI, Talend, Cassandra, Oozie, Control-M, Fivetran, EMR, EC2, RDS, DynamoDB, Oracle 12c.
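Illustrative sketch (hypothetical example, not project code) of loading S3 data into Snowflake with the Python connector, as described above; account, credentials, stage, and table names are assumed placeholders.

import snowflake.connector

# Assumed connection parameters
conn = snowflake.connector.connect(
    account="example_account",
    user="etl_user",
    password="********",
    warehouse="ETL_WH",
    database="ANALYTICS",
    schema="STAGING",
)

try:
    cur = conn.cursor()
    # Copy Parquet files landed in S3 (via an assumed external stage) into a staging table
    cur.execute("""
        COPY INTO STAGING.TRANSACTIONS
        FROM @S3_TXN_STAGE/streaming/transactions/
        FILE_FORMAT = (TYPE = PARQUET)
        MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
    """)
    print(cur.fetchall())  # per-file load results
finally:
    conn.close()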
Client: Broadridge Financial June 2017 – Nov 2019
Role: Data Engineer
Responsibilities:
Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data to and from sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse, and a write-back tool (see the illustrative sketch following this section).
Used Azure Data Factory, the SQL API, and the MongoDB API to integrate data from MongoDB, MS SQL, and cloud sources (Blob, Azure SQL DB, Cosmos DB).
Involved in complex data modeling and profiling activities using SQL and Azure tools to ensure high data quality.
Created and managed reports in Excel and Power BI to monitor pipeline health and data quality metrics.
Automated ingestion and transformation pipelines for semi-structured formats like JSON and XML.
Used Jira and Confluence as part of Agile ceremonies and collaborative delivery.
Strong experience leading multiple Azure Big Data and data transformation implementations in the Banking and Financial Services, High Tech, and Utilities industries.
Aided in the design and development of the logical and physical data models, business rules and data mapping for the Enterprise Data Warehouse system.
Implemented large Lambda architectures using Azure Data platform capabilities like Azure Data Lake, Azure Data Factory, HDInsight, Azure SQL Server, Azure ML and Power BI.
Designed end to end scalable architecture to solve business problems using various Azure Components like HDInsight, Data Factory, Data Lake, Storage and Machine Learning Studio.
Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL activity.
Designed complex SSIS Packages for Extract, Transform and Load (ETL) with data from different sources.
Developed PL/SQL triggers and master tables for automatic creation of primary keys.
Responsible for resolving the issues and troubleshooting related to performance of Hadoop cluster.
Involved in designing and developing tables in HBase and storing aggregated data from Hive Table.
Used Apache Spark DataFrames, Spark SQL, and Spark MLlib extensively, designing and developing POCs using Scala, Spark SQL, and MLlib libraries.
Environment: Azure Data Factory, Azure Databricks, Azure Data Lake, Azure Synapse, PySpark, Hadoop, HDFS, Hive, HBase, Sqoop, Flume, Blob, Cosmos DB, MapReduce, Cloudera, SQL, Apache Kafka, Azure, Python, Power BI, Unix, Cassandra, SQL Server.
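Illustrative sketch (hypothetical example, not project code) of the Blob-to-Azure-SQL processing described above, run as a PySpark job; storage account, container, JDBC URL, column names, and table names are assumed placeholders.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("blob-to-azure-sql").getOrCreate()

# Read semi-structured JSON landed in Blob Storage (assumed account/container)
trades = spark.read.json("wasbs://landing@examplestorage.blob.core.windows.net/trades/")

# Aggregate daily totals per instrument (assumed column names)
daily = (trades
         .withColumn("trade_date", F.to_date("trade_ts"))
         .groupBy("trade_date", "instrument_id")
         .agg(F.sum("quantity").alias("total_qty"),
              F.sum("notional").alias("total_notional")))

# Write the aggregates to Azure SQL DB over JDBC (assumed server/table/credentials)
(daily.write
 .format("jdbc")
 .option("url", "jdbc:sqlserver://example-server.database.windows.net:1433;database=analytics")
 .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
 .option("dbtable", "dbo.daily_trade_summary")
 .option("user", "etl_user")
 .option("password", "********")
 .mode("append")
 .save())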
EDUCATIONAL DETAILS:
Bachelor of Commerce - 2006 - Berhampur University