ANKITHA SAJJAN
Location: Austin, TX
Email: ****************@*****.*** Contact: 913-***-****
PROFESSIONAL SUMMARY:
●Around 5 years of experience as a Data Engineer working with Python, the Apache Hadoop ecosystem (HDFS, MapReduce, Hive, Sqoop, Oozie, HBase, Spark with Scala, Kafka), big data analytics, AWS, GCP, and Azure.
●Excellent understanding of Hadoop architecture and its components, such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and the MapReduce programming paradigm.
●Hands-on experience in ingesting data into Data Warehouse using various data loading techniques.
●Experience with advanced Python skills such as object-oriented programming (OOP).
●Working experience with Hadoop cluster architecture and data sources; involved in HDFS maintenance and loading of structured and unstructured data.
●Experience in managing and reviewing Hadoop log files and working with NoSQL databases such as HBase and MongoDB.
●Experienced in distributed computing architectures and in building pipelines with AWS products such as EC2, Redshift, EMR, Elasticsearch, EBS, and S3, using Hadoop, Python, Spark, MapReduce, SQL, and Cassandra to solve big data problems.
●Capable of using AWS utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs on Amazon Web Services (AWS).
●Well-versed with Data Migration, Data Conversions, Data Extraction/Transformation/Loading (ETL).
●Good experience using Apache NiFi to automate data movement between various Hadoop systems.
●Experience in importing and exporting data with Sqoop between HDFS and relational database systems.
●Experience in writing complex SQL, PL/SQL, T-SQL queries, implementing procedures, functions, and improving the performance of databases.
●Good understanding of Web Services, SOAP, REST API, and WSDL.
●Experience in cluster automation using shell scripting and in designing data platforms on AWS and GCP for data analysis and machine learning.
●Implemented NoSQL database solutions (MongoDB/Cassandra/DynamoDB) for handling high-volume, semi-structured data.
●Good experience working with analysis tools such as Power BI and Tableau.
●Experience in developing CI/CD (Continuous Integration and Continuous Deployment) pipelines and automation using Jenkins, Git, Docker, and Kubernetes for ML model deployment.
●Worked with several Python libraries, including NumPy, Pandas, Matplotlib, Scikit-learn, and PySpark.
TECHNICAL SKILLS:
Big Data Technologies
Hadoop, HDFS, MapReduce, Hive, Sqoop, HBase, Apache NiFi, Apache Kafka, Apache Spark, PySpark
Programming Languages
Python, Scala, SQL, PL/SQL, T-SQL, C#, JavaScript
Cloud Platforms
AWS (S3, Lambda, Glue, Redshift, Athena, EMR, CloudWatch, EC2), GCP (BigQuery, Dataflow, Cloud Storage, Cloud SQL, Pub/Sub), Azure (Data Factory, Synapse Analytics, Databricks, ADLS)
Databases & Data Warehousing
Amazon Redshift, Snowflake, PostgreSQL, BigQuery, MongoDB, MySQL, Oracle, SQL Server
ETL & Data Pipelines
Apache Airflow, DBT, Informatica, Talend, Apache NiFi, ADF
Data Visualization
Tableau, Power BI
DevOps & CI/CD
Git, Jenkins, Docker, Kubernetes, Terraform
Machine Learning & Analytics
Pandas, NumPy, Matplotlib, Spark MLlib
WORK EXPERIENCE:
Nordstrom, Austin, TX Oct 2023 – Present
Data Engineer
Responsibilities:
Developed and optimized SQL queries, stored procedures, and views for data extraction, transformation, and loading (ETL/ELT).
Migrated Teradata workloads to GCP BigQuery; performed R&D on authentication, successfully fetched the data from BigQuery, and populated the reports.
Designed and implemented ETL processes with DBT, transforming raw data into analytical models in Snowflake/Redshift.
Visualized the data in various charts to identify trends, patterns, and correlations using tools such as Power BI; responsible for migrating SSRS reports to Power BI.
Created data models following Star and Snowflake schema designs, improving query performance and reporting efficiency.
Developed data transformation scripts in Python (Pandas, NumPy) to clean and preprocess large datasets (a minimal sketch follows this role).
Scheduled and automated SSIS package execution using SQL Server Agent for daily data loads.
Designed and developed SSIS packages to extract, transform, and load (ETL) data from diverse sources into SQL Server.
Optimized SSIS package performance by tuning data flow tasks and leveraging parallel execution.
Leveraged Tableau and Power BI to create interactive dashboards for business intelligence and data visualization.
Responsible for web application upgrades and built libraries using JavaScript.
Performed SQL query optimization and performance tuning using indexes.
Migrated SSIS packages from the 2012 to the 2015 version; worked on 300+ packages using generated scripts and deployed them using Python and T-SQL.
Built incremental and snapshot-based ETL workflows to manage Slowly Changing Dimensions (SCD) in DBT.
Used Apache Spark and PySpark for large-scale data processing and performance optimization.
Automated data ingestion from APIs, JSON, and flat files using Python and SQL-based pipelines.
Collaborated with stakeholders to define data governance policies, ensuring compliance with security and privacy standards.
Implemented CI/CD pipelines for ETL workflows using Git, Docker, and Kubernetes, ensuring seamless deployment.
Optimized Snowflake warehouse performance using clustering keys, materialized views, and caching strategies.
Developed unit tests for data pipelines using Great Expectations and PyTest to validate data quality.
Performed data lineage tracking and impact analysis using dbt documentation and data cataloging tools.
Environment: SQL, Amazon Redshift, Snowflake, Trino (PrestoSQL), PostgreSQL, BigQuery, DBT, Apache Airflow, Apache NiFi, Python (Pandas, NumPy), AWS (S3, Lambda, Glue, Redshift, Athena), GCP (BigQuery, Dataflow), Azure Data Factory, Apache Spark, PySpark, Apache Kafka, Tableau, Power BI, Terraform, Git, Docker, Kubernetes, Great Expectations, PyTest.
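Illustrative sketch: a minimal example of the kind of Pandas cleaning step and PyTest data-quality check described above; the dataset, column names, and validation rules are hypothetical placeholders, not the actual project code.

import pandas as pd

def clean_orders(raw: pd.DataFrame) -> pd.DataFrame:
    # Standardize column names, drop duplicate orders, and coerce types.
    df = raw.copy()
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    df = df.drop_duplicates(subset=["order_id"])
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce").fillna(0.0)
    return df.dropna(subset=["order_date"])

# PyTest check (run with `pytest`): verifies duplicates and unparseable dates are removed.
def test_clean_orders_removes_duplicates_and_bad_dates():
    raw = pd.DataFrame({
        "Order ID": [1, 1, 2],
        "Order Date": ["2024-01-01", "2024-01-01", "not-a-date"],
        "Amount": ["10.5", "10.5", None],
    })
    out = clean_orders(raw)
    assert out["order_id"].is_unique
    assert out["order_date"].notna().all()

A check like this can run in CI before transformed data is published to a warehouse such as Snowflake or Redshift.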
Murphy Gas, Irving, TX Aug 2022 – Sept 2023
Data Engineer
Responsibilities:
•Implemented Hadoop framework to capture user navigation across the application to validate the user interface and provide analytic feedback/result to the UI team.
•Loaded data into the cluster from dynamically generated files using Flume and from relational database management systems using Sqoop.
•Built a real-time streaming pipeline utilizing Kafka, Spark Streaming, and Redshift (a minimal sketch follows this role).
•Developed logical and physical data flow models for Informatica ETL applications.
•Worked on creating custom Docker container images, and on tagging and pushing those images.
•Wrote Hive queries for data analysis to meet the business requirements.
•Implemented and analyzed SQL query performance issues in databases.
•Responsible for the design and development of Spark SQL scripts based on functional requirements and specifications.
•Designed SSIS data flows with transformations such as Lookup, Merge Join, Conditional Split, and Aggregation.
•Hands-on experience loading data from the UNIX file system to HDFS.
•Loaded and transformed large sets of structured, semi-structured, and unstructured data from HBase through Sqoop and placed them in HDFS for further processing.
•Managed and scheduled jobs on a Hadoop cluster using Oozie.
•Involved in creating Hive tables, loading data, and running Hive queries on that data.
•Extensive working knowledge of partitioned tables, UDFs, performance tuning, compression-related properties, and the Thrift server in Hive.
•Worked on Microsoft Azure services such as HDInsight clusters, Blob Storage, ADLS, Data Factory, and Logic Apps, and completed a POC on Azure Databricks.
•Monitored Azure Synapse Analytics using Dynamic Management views to identify the performance bottlenecks.
•Implemented SSIS logging, checkpoints, and error handling to improve package reliability and monitoring.
•Developed Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
•Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
•Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data from sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back to source systems.
•Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process the data using the SQL activity.
•Designed end-to-end scalable architectures to solve business problems using Azure components such as HDInsight, Data Factory, Data Lake, Storage, and Machine Learning Studio.
•Worked independently on a migration project from Azure PaaS to Google Cloud, converting the ETLs through Databricks and SSIS.
•Scheduled jobs through Databricks Jobs and scheduled the deployed SQL Server packages with SQL Server Agent.
•Worked on Databricks, creating multiple notebooks per requirements and scheduling jobs using Apache Airflow.
Environment: Hadoop, Flume, Sqoop, Kafka, Spark Streaming, Redshift, Informatica, Docker, Hive, SQL, Spark SQL, HDFS, HBase, Oozie, Microsoft Azure (HDInsight, BLOB, ADLS, Data Factory, Logic Apps, Synapse Analytics, Data Lake, Azure Storage, Azure SQL, Azure DW, Azure Databricks), Google Cloud Platform (GCP), Databricks, SSIS, Apache Airflow, Machine Learning Studio.
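Illustrative sketch: a minimal PySpark Structured Streaming job of the kind described above (Kafka into Spark Streaming, staged for a Redshift load); the broker, topic, schema, and storage paths are placeholder assumptions, not the actual pipeline.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

# Hypothetical schema for incoming transaction events
schema = StructType([
    StructField("store_id", StringType()),
    StructField("sku", StringType()),
    StructField("amount", DoubleType()),
])

# Read the Kafka topic as a streaming DataFrame (broker and topic names are placeholders)
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "transactions")
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Stage micro-batches as Parquet in S3 for a downstream Redshift COPY (paths are placeholders)
query = (
    events.writeStream.format("parquet")
    .option("path", "s3a://staging-bucket/transactions/")
    .option("checkpointLocation", "s3a://staging-bucket/checkpoints/transactions/")
    .start()
)
query.awaitTermination()

Running this requires the spark-sql-kafka connector on the Spark classpath; the production sink and schema would differ.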
RMSI Pvt. Ltd., Hyderabad, India March 2020 – July 2021
Software Engineer
Responsibilities:
Designed robust, reusable, and scalable data-driven solutions and data pipeline frameworks to automate the ingestion, processing, and delivery of both structured and unstructured batch and real-time streaming data using Python.
Applied transformations to data loaded into Spark DataFrames and performed in-memory computation to generate the output response.
Converted Hive/SQL queries into Spark transformations using Spark RDDs and PySpark.
Troubleshot and tuned Spark applications and Hive scripts to achieve optimal performance.
Hands-on experience developing UDFs, DataFrames, and SQL queries in Spark SQL (a minimal sketch follows this role).
Created and modified data ingestion pipelines using Kafka and Sqoop to ingest database tables and streaming data into HDFS for analysis.
Finalized naming standards for data elements and ETL jobs and created a data dictionary for metadata management.
Developed ETL workflows in Python to process data in HDFS and HBase, ingesting it with Flume.
Analyzed large and critical datasets using HDFS, HBase, Hive, HQL, Pig, Sqoop, and Zookeeper.
Developed multiple POCs using Spark and Scala, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL.
Used Amazon Elastic Compute Cloud (EC2) infrastructure for computational tasks and Simple Storage Service (S3) as the storage mechanism.
Used AWS utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs on AWS.
Maintained AWS Data Pipeline as a web service to process and move data between Amazon S3, Amazon EMR, and Amazon RDS resources.
Worked on Data cleaning, pre-processing, and modelling using Spark and Python.
Implemented secure, real-time, data-driven REST APIs for data consumption using AWS (API Gateway, Route 53, Certificate Manager, CloudWatch, Kinesis), Swagger, Okta, and Snowflake.
Developed automation scripts to transfer data from on-premises clusters to Google Cloud Platform (GCP).
Developed AWS CloudWatch Dashboards for monitoring API Performance.
Environment: Python, Spark, PySpark, Scala, Hive, HQL, Pig, SQL, Hadoop, HDFS, HBase, Sqoop, Zookeeper, Flume, Kafka, Yarn, Snowflake, Amazon RDS, Data Lake, AWS (EC2, S3, EMR, CloudWatch, RDS, Data Pipeline, Kinesis, Route 53, Certificate Manager, API Gateway), Google Cloud Platform (GCP), Databricks, AWS Kinesis, REST API, Swagger, Okta, Splunk, Apache Spark, Spark SQL, Spark Data Frames.
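Illustrative sketch: a minimal example of rewriting a Hive/SQL aggregation as a Spark SQL query with a Python UDF, as described above; the table, columns, and sample rows are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-sketch").getOrCreate()

# Hypothetical customer data standing in for a Hive table
df = spark.createDataFrame(
    [("C001", "new york"), ("C002", None)],
    ["customer_id", "city"],
)

# Simple UDF that normalizes city names before aggregation
@F.udf(returnType=StringType())
def normalize_city(city):
    return city.strip().title() if city else "Unknown"

# Apply the UDF and expose the result as a temp view so the original SQL can run unchanged
df.withColumn("city", normalize_city("city")).createOrReplaceTempView("customers")

# The Hive-style aggregation expressed as a Spark SQL statement
spark.sql("SELECT city, COUNT(*) AS customer_count FROM customers GROUP BY city").show()

The same aggregation can also be expressed directly with DataFrame groupBy/agg calls when SQL is not required.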
EDUCATION DETAILS:
Master of Science in Computer Science at the University of Central Missouri Aug 2021 – July 2022