Manisha.L
+1-469-***-**** ****************@*****.***
PROFESSIONAL SUMMARY
Over 5 years of experience designing and developing data-intensive applications using the Hadoop Ecosystem, Big Data, Cloud Data Engineering, and Data Warehousing.
Proficient in Big Data tools and frameworks like Hadoop, Spark, Hive, Sqoop, Kafka, Cassandra, and
MongoDB, with expertise in Scala and Python programming.
Skilled in creating data pipelines using AWS services such as S3, Redshift, Glue, Lambda, DynamoDB, and Step Functions, as well as GCP and Azure ecosystems.
Designed and implemented ETL processes using Informatica Intelligent Cloud Services (IICS) and Azure Data Factory for seamless data integration.
Strong experience in SQL for database design, data mining, and creating objects like Views, Triggers, and Procedures to optimize performance.
Expertise in real-time data streaming using Kafka and Spark, with hands-on experience in Delta Lake for efficient data processing.
Proficient in Tableau for data visualization and analytics on large datasets, generating actionable insights for business teams.
Skilled in Linux shell scripting for automation and managing containerized environments with Kubernetes
and Docker for CI/CD pipelines.
Experienced in data modeling with Erwin (Conceptual, Logical, Physical) and advanced analytics using
Python libraries like Pandas, PySpark, and Scikit-learn.
Deep understanding of normalization, denormalization, and performance tuning in relational and dimensional databases.
SKILLS
Programming Languages: Python, R, C++, C#, SQL, Scala, SAS, YAML, Java, JavaScript, MATLAB, HTML5
Database Management Systems (DBMS): MySQL, PostgreSQL, NoSQL, Oracle, SQL Server, T-SQL, MongoDB, RDS, Cassandra, Elasticsearch, OLTP, RDBMSs, DeltaStream
Data Warehousing: Amazon Redshift, Google BigQuery, Snowflake, Apache Hive, Teradata
ETL Tools: Apache Spark, Apache Airflow, Talend, Informatica, Apache NiFi, Data Build Tool (DBT), Dataproc, Dataflow, Fivetran, Stitch, Rivery, Airbyte
Big Data Technologies: Apache Hadoop, Apache Kafka, Apache HBase, Apache Flink, Apache Storm, Apache Iceberg, Apache Druid, Databricks, EMR, Kinesis, Cloudera
Cloud Platforms: AWS, Azure, GCP, Amazon S3, Azure Data Lake Storage, GCS
Version Control Systems: Git, GitHub
Data Visualization: Tableau, Power BI, Crystal Reports, OBIEE, Qlik Sense, Alteryx, Cognos
Machine Learning/AI: TensorFlow, PyTorch, TAMR
Operating System: Linux/Unix, Windows, macOS
Containerization and Orchestration: Docker, Kubernetes
Tools & IDE: Git, IntelliJ, Visual Studio Code, Jupyter Notebook, PyCharm, ER Studio, JIRA, Confluence
Hadoop Ecosystem: HDFS, YARN, MapReduce
Monitoring and Logging: Prometheus, Grafana, ELK Stack
EXPERIENCE
Data Engineer
CHG HealthCare – Salt Lake City, UT
03/2023 - Present
Participated in all phases of software development, including requirements gathering and business analysis.
Devised PL/SQL Stored Procedures, Functions, Triggers, Views, and packages.
Designed data models for AWS Lambda applications and analytical reports.
Built a full-service catalog system using Elasticsearch, Logstash, Kibana, Kinesis, and CloudWatch with the effective use of MapReduce.
Utilized Indexing, Aggregation, and Materialized views to optimize query performance.
Implemented Python and Scala code for data processing and analytics, leveraging built-in libraries.
Utilized various Spark transformations, including mapToPair, filter, flatMap, groupByKey, sortByKey, join, cogroup, union, repartition, coalesce, distinct, intersection, mapPartitions, and mapPartitionsWithIndex, along with actions, to cleanse input data.
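A minimal PySpark sketch of this cleansing pattern (paths, fields, and partition counts are hypothetical; mapToPair is approximated with map returning key-value pairs, as in the Python API):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cleanse-input").getOrCreate()
sc = spark.sparkContext

raw = sc.textFile("s3a://example-bucket/raw/events.csv")               # hypothetical input path
records = (raw.filter(lambda line: line and not line.startswith("#"))  # drop blanks and comments
              .map(lambda line: line.split(","))                       # tokenize rows
              .filter(lambda cols: len(cols) == 3)                     # keep well-formed rows
              .map(lambda cols: (cols[0], float(cols[2]))))            # key by id, cast amount

totals = (records.repartition(8)                                       # rebalance before the shuffle
                 .groupByKey()
                 .mapValues(sum)
                 .sortByKey())

totals.saveAsTextFile("s3a://example-bucket/clean/totals")             # action triggers execution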
Developed and managed scalable gRPC and RESTful APIs using Flask or Django to ensure efficient, secure communication across distributed systems.
Developed PySpark code used to compare data between HDFS and S3.
Developed data transition programs from DynamoDB to AWS Redshift (ETL Process) utilizing AWS Lambda to create functions in Python for specific events based on use cases.
Created scripts to read CSV, JSON, and parquet files from S3 buckets using Python, executed SQL operations, and loaded data into AWS S3, DynamoDB, and Snowflake, utilizing AWS Glue with the crawler.
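An illustrative Python sketch of the S3-to-DynamoDB portion of this flow (bucket, table, and column names are hypothetical; the Glue crawler and Snowflake load would sit alongside this step):

import boto3
import pandas as pd

# Read a Parquet file from S3 (requires pyarrow and s3fs in the job environment).
df = pd.read_parquet("s3://example-bucket/landing/orders.parquet")

# SQL-style filter and projection before loading downstream.
recent = df.loc[df["order_date"] >= "2023-01-01", ["order_id", "customer_id", "amount"]]

# Batch-write the result into a DynamoDB table.
table = boto3.resource("dynamodb").Table("orders_recent")
with table.batch_writer() as batch:
    for row in recent.to_dict(orient="records"):
        batch.put_item(Item={k: str(v) for k, v in row.items()})  # values stringified for simplicity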
Designed, developed, and optimized data transformation workflows and models using DBT (Data Build Tool), ensuring efficient and scalable data pipelines.
Developed and optimized complex data pipelines using Snowpark for efficient data processing and analytics.
Performed data wrangling in Alteryx by efficiently joining multiple entities using various join tools, ensuring data consistency and accuracy for analytical workflows.
Designed the Staging and Operational Data Store (ODS) environment for the enterprise data warehouse (Snowflake), including dimension and fact table design following Kimball's star schema approach.
Unit tested data between Redshift and Snowflake.
Implemented scalable data storage solutions optimized for handling security-related datasets in a SaaS
product context.
Trained in QlikView and Splunk reporting and dashboard development.
Implemented DBT workflows to optimize data modeling, enhance pipeline performance, and seamlessly integrate DBT into the data engineering workflow through cross-functional collaboration.
Employed bash shell scripts and UNIX utilities for data processing and automation tasks.
Implemented Infrastructure as Code (IaC) using Terraform to automate the provisioning and management of AWS resources, ensuring consistency, scalability, and rapid deployment of cloud infrastructure.
Utilized data science algorithms and MLOps techniques to optimize data processing and analysis.
Developed predictive models and converted SAS programs into Python to enhance efficiency and scalability.
Developed and implemented advanced algorithms in MATLAB to analyze and visualize complex data sets, ensuring efficient data processing.
Reviewed system specifications related to DataStage ETL and developed functions in AWS Lambda for event-driven processing.
Developed and maintained ETL pipelines to process and analyze geospatial data using tools like GDAL
and PostGIS, ensuring efficient data integration and retrieval.
Wrote reports using Tableau Desktop to extract data for analysis using filters based on the business use case.
Data Engineer
Zensar Technologies, India
01/2020 – 08/2022
Utilized AWS services to design and develop scalable, high-performance enterprise data warehouse and business intelligence solutions, enhancing decision-making capabilities.
Developed Scala scripts and User Defined Functions (UDFs) using data frames/SQL and Resilient Distributed Datasets (RDD) in Spark for data aggregation, querying, and writing back into the S3 bucket.
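Shown below in PySpark for brevity (the original work was in Scala), a hedged sketch of the UDF, aggregation, and S3 write-back pattern; bucket paths and column names are hypothetical:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("aggregate-and-write").getOrCreate()

# Hypothetical UDF that normalizes a region code before aggregation.
normalize_region = F.udf(lambda r: (r or "UNKNOWN").strip().upper(), StringType())

df = spark.read.parquet("s3a://example-bucket/sales/")                 # hypothetical input
result = (df.withColumn("region", normalize_region("region"))
            .groupBy("region")
            .agg(F.sum("amount").alias("total_amount")))

result.write.mode("overwrite").parquet("s3a://example-bucket/aggregates/sales_by_region")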
Executed data cleansing and data mining operations.
Programmed, compiled, and executed programs using Apache Spark in Scala for ETL jobs with ingested data.
Crafted Spark application programs for data validation, cleansing, transformation, and custom aggregation, employing Spark engine and Spark SQL for data analysis, provided to data scientists for further analysis.
Automated ingestion processes using Python and Scala, pulling data from various sources such as API, AWS S3, Teradata, and Snowflake.
Designed and developed Spark workflows using Scala for data extraction from AWS S3 bucket and Snowflake, applying transformations.
Designed and implemented ETL pipelines between various Relational Databases and Data Warehouse using Apache Airflow.
Implemented continuous integration and continuous deployment (CI/CD) pipelines, integrating data engineering workflows seamlessly into DevOps practices for efficient software delivery.
Developed Custom ETL Solutions and Real-Time data ingestion pipelines to move data in and out of
Hadoop using Python and shell Script.
Utilized GCP Dataproc, GCS, Cloud Functions, and BigQuery for data processing.
Worked on Data Extraction, aggregations, and consolidation of Adobe data within AWS Glue using PySpark.
Implemented Spark RDD transformations to map business analysis and applied actions on top of transformations.
Installed and configured Apache Airflow, automating resulting scripts to ensure daily execution in production.
Created Directed Acyclic Graphs (DAG) utilizing Email Operator, Bash Operator, and Spark Livy Operator for execution in EC2.
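A minimal Airflow DAG sketch of this pattern (DAG id, schedule, script path, and email address are hypothetical; the Livy task is omitted since it depends on cluster-specific connection settings):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.email import EmailOperator

with DAG(
    dag_id="daily_ingest",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_ingest = BashOperator(
        task_id="run_ingest",
        bash_command="python /opt/jobs/ingest.py",  # hypothetical job script
    )
    notify = EmailOperator(
        task_id="notify_team",
        to="data-team@example.com",
        subject="daily_ingest finished",
        html_content="Ingestion DAG completed successfully.",
    )
    run_ingest >> notify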
Developed scripts to read CSV, JSON, and parquet files from S3 buckets in Python and load them into
AWS S3, DynamoDB, and Snowflake.
Ingested real-time data streams into the Spark Streaming platform, saving data to HDFS and Hive through GCP.
Implemented AWS Lambda functions to execute scripts in response to events in Amazon DynamoDB table or S3 bucket or HTTP requests using Amazon API Gateway.
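A sketch of an event-driven handler of this kind (the bucket prefix and downstream step are illustrative assumptions):

import json
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Invoked by an S3 ObjectCreated notification (API Gateway events carry a different payload).
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        # Hypothetical downstream step: stage the object under a processed/ prefix.
        s3.put_object(Bucket=bucket, Key=f"processed/{key}", Body=body)
    return {"statusCode": 200, "body": json.dumps("ok")}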
Worked on Snowflake schemas and data warehousing, processing batch and streaming data load pipelines using Snowpipe and Matillion from the Confidential data lake (AWS S3 bucket).
Profiled structured, unstructured, and semi-structured data across various sources to identify patterns and implemented data quality metrics using necessary queries or Python scripts based on the source.
Demonstrated proficiency in the Microsoft Suite (PowerPoint, Excel, etc.) to efficiently create presentations and streamline data analysis processes.
EDUCATION
Master of Science in Data Science
University of North Texas