Ambalika
Data Engineer
Email: *******************@*****.*** Phone: 346-***-****
LinkedIn: linkedin.com/in/ambalika-r-325abb33a/
Professional Summary:
12+ years of experience in Data Engineering and Data Pipeline Design, Development, and Implementation as a Data Engineer/Data Developer and Data Modeler.
Experience in the Software Development Life Cycle (SDLC), including requirements analysis, design specification, and testing, in both Waterfall and Agile methodologies.
Experience in data transformation, source-to-target data mapping, and data cleansing procedures.
Experience with the Spark ecosystem, including the Spark Core, Spark SQL, and Spark Streaming modules.
Experience in writing data analysis scripts using the Python, PySpark, and Spark APIs.
Experience with the Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and with Spark SQL to manipulate DataFrames in Scala.
Experience in developing Kafka producers and consumers for streaming millions of events per minute using PySpark, Python, and Spark Streaming.
Experience with Hadoop ecosystem components: MapReduce, HDFS, YARN/MRv2, Pig, Hive, HBase, Spark, Kafka, Sqoop, Flume, Avro, Solr, and ZooKeeper.
Experience in developing SSIS/DTS packages to extract, transform, and load (ETL) data into data warehouses/data marts from heterogeneous sources.
Experience in using Tableau Desktop to extract data for analysis using filters based on the business use case.
Experience with the Azure cloud: migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Azure Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
Experience in moving data between GCP and Azure using Azure Data Factory (ADF).
Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Amazon Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SES, SQS, Lambda, EMR and other services of the AWS family.
Experience in OLTP/OLAP system study, analysis, and E-R modeling, developing database schemas such as star and snowflake schemas used in relational, dimensional, and multidimensional modeling.
Experience in designing star and snowflake schemas for data warehouse and ODS architectures.
Experienced in setting up workflows using Apache Airflow for managing and scheduling Hadoop jobs.
Experience in source code and build management with Git and Enterprise GitHub, along with Jenkins and Artifactory.
Experienced working on NoSQL databases like MongoDB and HBase.
Experience with Unix/Linux systems, Bash scripting, and building data pipelines.
Experience in all phases of the software development life cycle under Agile, Scrum, and Waterfall management processes.
Excellent analytical and communication skills that help in understanding business logic and building strong relationships between stakeholders and team members.
Technical Skills:
Databases: Snowflake, AWS RDS, Teradata, Oracle, MySQL, Microsoft SQL Server, PostgreSQL.
NoSQL Databases: MongoDB, HBase, Apache Cassandra.
Programming Languages: Python, SQL, Scala, MATLAB.
Cloud Technologies: Azure, AWS, GCP.
Data Formats: CSV, JSON.
Querying Languages: SQL, NoSQL, PostgreSQL, MySQL, Microsoft SQL Server.
Integration Tools: Jenkins.
Scalable Data Tools: Hadoop, Hive, Apache Spark, Pig, MapReduce, Sqoop.
Operating Systems: Red Hat Linux, Unix, Windows, macOS.
Reporting & Visualization: Tableau, Matplotlib.
Professional Experience:
Client: TCF Bank, Wayzata, MN. Dec 2024 – Present
Role: Data Engineer
Responsibilities:
Gathered, analyzed, and translated business requirements into technical requirements, and communicated with other departments to collect client business requirements and assess available data.
Developed Spark jobs using Scala/PySpark and Spark SQL for faster data processing.
Developed Scala scripts using both DataFrames/Spark SQL/Datasets and RDD/MapReduce in Spark for data aggregation and queries, writing data back into the OLTP system through Sqoop.
Developed Spark applications in PySpark on a distributed environment to load large numbers of CSV files with different schemas into Hive ORC tables.
Designed and developed end-to-end ETL processes from various source systems to the staging area and from staging to data marts, including data loads.
Created data connections and published them on Tableau Server for use with operational and monitoring dashboards.
Created graphical reports, tabular reports, scatter plots, geographical maps, dashboards, and parameters on Tableau and Microsoft Power BI.
Developed Python applications to push data from big data tables to archive tables, ensuring data models do not exhibit performance-specific issues.
Developed Python code for tasks, dependencies, SLA watchers, and time sensors for each job for workflow management and automation using Airflow.
Developed Hive queries to pre-process the data required for running the business process.
Provided ad-hoc report metrics and dashboards to drive key business decisions using Tableau and Confidential tools.
Implemented multiple modules in microservices to expose data through Restful APIs.
Prepared the data warehouse using star/snowflake schema concepts in Snowflake using SnowSQL.
Worked on building data warehouse structures, creating fact, dimension, and aggregate tables through dimensional modeling with star and snowflake schemas.
Created Spark clusters and configured high-concurrency clusters using Azure Databricks to speed up the preparation of high-quality data.
Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, Spark SQL, and U-SQL in Azure Data Lake Analytics; ingested data into Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure SQL Data Warehouse) and processed it in Azure Databricks.
Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL activity.
Used Apache Kafka to send message objects to client queues and topics.
Used Kafka for the publish/subscribe pattern in applications dealing with messaging.
Used Git for version control and source code management.
Created Airflow DAGs to sync files from Box, analyze data quality, and alert on missing files.
Implemented a CI/CD pipeline using Jenkins and Airflow for containers on Docker and Kubernetes.
Designed and documented operational problems following standards and procedures using Jira.
Worked on designing, building, deploying, and maintaining MongoDB.
Implemented SQL, PL/SQL stored procedures.
Utilized Agile and Scrum methodology to help manage and organize a team of developers with regular code review sessions.
Actively involved in code review and bug fixing for improving the performance.
Environment: Spark, Scala, Python, PySpark, MapReduce, ETL, Tableau, Power BI, Azure, Git, Apache Kafka, Snowflake, Star schema, Apache Airflow, CI/CD, Jenkins, Docker, Kubernetes, Jira, MongoDB, SQL, Agile and Windows.
Client: LPL Financial, San Diego, CA. Jun 2022 – Sep 2023
Role: Data Engineer
Responsibilities:
Involved in full Software Development Life Cycle (SDLC) - Business Requirements Analysis, preparation of Technical Design documents, Data Analysis, Logical and Physical database design, Coding, Testing, Implementing, and deploying to business users.
Involved in developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
Involved in developing Spark code using Scala and Spark-SQL for faster testing and processing of data.
Built scalable and robust data pipelines for the Business Partners Analytical Platform to automate their reporting dashboards using Spark SQL and PySpark, and scheduled the pipelines.
Developed ETL pipelines into and out of the data warehouse using a combination of Python and Snowflake, writing SQL queries against Snowflake.
Created reports using Power BI for SharePoint list items.
Developed Tableau reports on business information to examine patterns in the business.
Developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
Developed Python-based API (RESTful Web Service) to track revenue and perform revenue analysis.
Installed and configured Apache Airflow for workflow management and created workflows in Python.
Wrote Python DAGs in Airflow that orchestrate end-to-end data pipelines for multiple applications.
Worked on PySpark APIs for data transformations.
Designed, set up, maintained, and administered Azure SQL Database, Azure Analysis Services, Azure SQL Data Warehouse, and Azure Data Factory.
Worked on configuration of Azure services such as API Management, Azure Cosmos DB, App Service, Key Vault, Container Registry, and App Registrations.
Configured Azure cloud services for endpoint deployment.
Involved in designing and developing Azure Data Factory (ADF) extensively for ingesting data from different source systems like relational and non-relational to meet business functional requirements.
Configured pipelines to receive real-time data from Apache Kafka and store the streaming data in HDFS using Kafka Connect.
Optimized Hive tables using techniques like partitioning and bucketing to provide better performance with HiveQL queries.
Worked extensively with Sqoop for importing and exporting the data from HDFS to Relational Database system and vice-versa.
Utilized Kubernetes and Docker for the runtime environment of the CI/CD system to build, test, and deploy during production.
Used tools like Jira, GitHub to update the documentation and code.
Worked on SQL queries in dimensional data warehouses and relational data warehouses. Performed Data Analysis and Data Profiling using Complex SQL queries on various systems.
Developed a NoSQL database using CRUD operations, indexing, replication, and sharding in MongoDB.
Followed agile methodology for the entire project.
Actively participated and provided constructive, insightful feedback during weekly iterative review meetings to track progress for each iteration and identify issues.
Environment: Spark, Scala, Python, PySpark, MapReduce, ETL, Tableau, Power BI, Azure, Git, Snowflake, star schema, Apache Airflow, CI/CD, Jenkins, Apache Kafka, Docker, Kubernetes, Jira, MongoDB, SQL, Agile and Windows.
Client: TJX, Framingham, MA. Dec 2019 – May 2022
Role: Data Engineer
Responsibilities:
Worked with the business users to gather, define business requirements and analyze the possible technical solutions.
Developed Spark jobs to clean data obtained from various feeds to make it suitable for ingestion into Hive tables for analysis.
Developed Custom Input Formats in Spark jobs to handle custom file formats.
Developed multiple Spark jobs in Scala & Python for data cleaning and preprocessing.
Used PySpark jobs to run on Kubernetes Cluster for faster data processing.
Optimized Map/Reduce Jobs to use HDFS efficiently by using various compression mechanisms.
Design and develop Tableau visualizations which include preparing Dashboards using calculations, parameters, calculated fields, groups, sets and hierarchies.
Performed Data Integration, Extraction, Transformation, and Load (ETL) Processes.
Automated workflows and CI/CD using Airflow.
Worked in the Snowflake environment to remove redundancy and loaded real-time data from various data sources into HDFS using Kafka.
Used Scala to convert Hive / SQL queries into RDD transformations in Apache Spark.
Installed and configured Apache Airflow for the S3 bucket and Snowflake data warehouse, and created DAGs to run in Airflow.
Worked on Snowflake schemas and data warehousing, and processed batch and streaming data load pipelines using Snowpipe and Matillion from the Confidential data lake in an AWS S3 bucket.
Worked with Airflow to schedule ETL jobs, and with Glue and Athena to extract data from the AWS data warehouse.
Designed and developed Security Framework to provide fine grained access to objects in AWS S3 using AWS Lambda, DynamoDB.
Used the AWS Glue Data Catalog with crawlers to get data from S3 and performed SQL query operations using AWS Athena.
Wrote PySpark jobs in AWS Glue to merge data from multiple tables and utilized crawlers to populate the AWS Glue Data Catalog with metadata table definitions.
Worked on designing, building, deploying, and maintaining MongoDB.
Implemented SQL, PL/SQL stored procedures.
Involved in story-driven agile development methodology and actively participated in daily scrum meetings.
Environment: Spark, Scala, PySpark, ETL, Tableau, MapReduce, Snowflake, Airflow, MongoDB, SQL, Agile and Windows.
Company: Byteridge Software Private Limited, India. Jan 2016 – Nov 2019
Role: Data Engineer
Responsibilities:
Participated in daily agile stand-up meetings, updating the internal Dev team on project statuses, and collaborated through Palantir Foundry for real-time data management and decision-making processes.
Designed and developed a BI web app for performance analytics.
Converted raw data to serialized formats such as Avro and Parquet to reduce data processing time and increase data transfer efficiency across the network.
Wrote Shell scripts to orchestrate execution of other scripts and move the data files within and outside of HDFS.
Created Spark applications using PySpark and Spark SQL for extracting, transforming, and aggregating data from multiple file formats, uncovering valuable insights into customer usage patterns.
Designed Python-based notebooks for automated weekly, monthly, quarterly reporting ETL.
Implemented AWS fully managed Kafka streaming to send data streams from the company APIs to a Spark cluster in Databricks on AWS, using Redshift, Glue, and Lambda/Python.
Migrated various Hive UDFs and queries into Spark SQL for faster requests.
Designed the backend database and AWS cloud infrastructure for maintaining company proprietary data.
Orchestrated Airflow workflows in a hybrid cloud environment, from the local on-premises server to the cloud.
Wrote Shell FTP scripts for migrating data to AWS S3.
Analyzed large amounts of data sets to determine optimal way to aggregate and report on them.
Used the Oozie workflow engine to manage interdependent Hadoop jobs and automate several types of Hadoop jobs such as Java MapReduce, Hive, Pig, and Sqoop.
Produced scripts for doing transformations using Scala/Java.
Developed and implemented Hive custom UDFs involving date functions.
Installed, configured, and monitored Apache Airflow cluster.
Involved in converting HiveQL into Spark transformations using Spark RDD and through Scala programming.
Developed an API to write XML documents from a database. Utilized XML and XSL Transformation for dynamic web-content and database connectivity.
Used various AWS services including S3, EC2, AWS Glue, Athena, Redshift, EMR, SNS, SQS, DMS, and Kinesis.
Wrote simple SQL scripts on the final database to prepare data for visualization with Tableau.
Developed DAG data pipelines for onboarding and change management of datasets.
Involved in Apache Spark development using Scala for batch and real-time data processing.
Programmed Flume and HiveQL scripts to extract, transform, and load the data into database.
Used Kafka for publish-subscribe messaging as a distributed commit log.
Created Airflow Scheduling scripts in Python to automate data pipeline and data transfer.
Created a benchmark between Hive and HBase for fast ingestion.
Configured AWS Lambda to trigger parallel cron job schedules for scraping and transforming data.
Used Cloudera Manager for installation and management of a multi-node Hadoop cluster.
Scheduled jobs using Control-M.
Environment: Hadoop, Hive, Sqoop, Apache Spark, Kafka, Redshift, Databricks, Airflow, Python, Scala, Cloudera Manager, Shell, Glue, Tableau and Windows.
Company: 3i Infotech Ltd., India. Nov 2011 – Dec 2015
Role: Data Engineer
Responsibilities:
Collaborated with Business Analysts, SMEs across departments to gather business requirements, and identify workable items for further development.
Designed and developed Spark jobs with Scala to implement end-to-end data pipelines for batch processing.
Used Spark and Spark SQL to read Parquet data and create tables in Hive using the Scala API.
Developed complex Python and Scala code that is maintainable, easy to use, and satisfies application requirements for data processing and analytics using built-in libraries.
Developed Spark code in Python and the Spark SQL environment for faster testing and processing of data, loading the data into Spark RDDs and performing in-memory computation to generate output with lower memory usage.
Designed, developed, tested, and maintained Tableau functional reports based on user requirements.
Developed ETL jobs using Spark SQL, RDDs, and DataFrames.
Worked on the Scala code base related to Apache Spark, performing actions and transformations on RDDs, DataFrames, and Datasets using Spark SQL and Spark Streaming contexts.
Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark using Scala.
Worked on migrating MapReduce programs into Spark transformations using Scala.
Worked with different data feeds such as JSON, CSV, and XML, and implemented the data lake concept.
Analyzed the SQL scripts and designed the solution to implement using PySpark.
Used SQL queries and other tools to perform data analysis and profiling.
Followed agile methodology and involved in daily SCRUM meetings, sprint planning, showcases and retrospective.
Environment: Spark, Scala, Hadoop, Python, Pyspark, MapReduce, ETL, HDFS, Hive, HBase, SQL, Agile and Windows.
Education:
B.E., Bachelor of Engineering (Computer Science & Engineering) Anna University, India.
References: Will be provided upon request.