Deepika
Sr Data Engineer
********@*****.***
Professional Summary:
Around 9 years of experience as a Data Engineer with excellent knowledge of data warehouse environments, data lakes, Oracle, Teradata, SQL Server, and AWS services, along with data visualization tools like Tableau.
Hands-on experience in Apache Hadoop technologies like Hadoop Distributed File System (HDFS), the MapReduce framework, Hive, Pig, Sqoop, Oozie, HBase, Spark, Scala, and Python.
Hands-on experience in developing Spark applications using Spark tools like RDD transformations, Spark core, Spark streaming, and Spark SQL.
Experience in the evaluation, design, development, and deployment of additional technologies and automation for managed services on Lambda, EC2, EMR, Kinesis, SQS, SNS, CloudWatch, Data Pipeline, Redshift, DynamoDB, AWS Glue, Athena, Aurora DB, and RDS.
Extensive working experience in Normalization and De-Normalization techniques for both OLTP and OLAP systems in creating Database Objects like Tables, Constraints (Primary key, Foreign Key, Unique, Default), Indexes.
Hands-on experience in developing and deploying enterprise-based applications using major Hadoop ecosystem components like MapReduce, YARN, Hive, HBase, Flume, Sqoop, Spark SQL, and Kafka.
Experience in writing Sub Queries, Stored Procedures, Triggers, Cursors, and Functions on MySQL, PostgreSQL, and Oracle.
Strong experience in writing and understanding complex Hive queries (HQL).
Expertise in writing SQL queries, dynamic queries, sub-queries, and complex joins for creating complex Stored Procedures, Triggers, User-defined Functions, Views, and Cursors, along with extensive experience in advanced SQL queries and PL/SQL stored procedures.
Strong knowledge of Amazon Kinesis, AWS Lambda, Amazon Simple Queue Service (Amazon SQS), Amazon Simple Notification Service (Amazon SNS), and Amazon Simple Workflow Service (Amazon SWF).
Hands-on experience in creating and developing DBT models for ELT transformations.
Experience in implementing big data solutions using Azure Data Lake, Data Factory, HDInsight, Databricks, Delta Lake, and Synapse.
Experience in writing complex queries and stored procedures in Teradata and other MPP systems (Redshift and Netezza).
Strong experience in using Impala to extract and load data from different data sources while applying complex business logic.
Hands-on experience in implementing Dashboards, Data Visualizations and Analytics using Tableau Desktop.
Sound knowledge of database architecture for OLTP and OLAP applications, Data Analysis and ETL processes and data modeling and dimension modeling.
Experience in tasks like Project tracking, Mentoring, Version Controls, Software Change Request management, Project Deliveries / Quality Control and Migration.
Experience in using analytic data warehouses like Snowflake.
Experience in Data Modeling with expertise in creating Star & Snowflake Schemas, FACT and Dimensions Tables, Physical and Logical Data Modeling.
Experience in data integration and building data pipelines in Airflow using Python scripting.
Experience in migrating to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse, granting database access, and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
Experience in managing MongoDB environment from availability, performance, and Scalability perspectives.
Proficiency in SQL across several dialects, including MySQL, PostgreSQL, Redshift, SQL Server, and Oracle.
Good experience in automating the scheduled jobs in Control-M and Airflow.
Excellent Communication and presentation skills along with good experience in communicating and working with various stakeholders.
Good knowledge of Google Cloud Platform (GCP) with services like BigQuery, Cloud SQL, Dataflow, and Cloud Storage.
Gained brief exposure to client-server web application environments, web development, automation development and testing, CI/CD, database design and development, and writing complex SQL queries.
Team Player, able to work independently with minimum supervision, innovative & efficient, good at debugging, and strong desire to keep pace with the latest technologies.
Technical Skills:
Databases: Snowflake, AWS RDS, Teradata, Oracle, MySQL, Microsoft SQL, PostgreSQL
NoSQL Databases: MongoDB, Hadoop HBase, Apache Cassandra
Programming Skills: Python, Scala, Java, Shell Scripting, SQL
Cloud Technologies: Amazon Web Services (AWS), Azure, Google Cloud Platform (GCP)
Data Formats: CSV, JSON, Parquet, Avro
Querying Languages: SQL, NoSQL, PostgreSQL, MySQL, Microsoft SQL, PL/SQL
Integration Tools: Jenkins
Scalable Data Tools: Hadoop, Hive, Apache Spark, Pig, Map Reduce, Sqoop, Impala
Operating Systems: Linux, UNIX, Windows, macOS
Reporting & Visualization: Tableau, Looker, Excel, Informatica, Power BI
Professional Experience:
Client: USAA, Dallas, TX Mar 2022 - Present
Role: Sr. Data Engineer
Responsibilities:
Developed Scala scripts and UDFs using both data frames/SQL and RDD in Spark for data aggregation, queries, and writing back into the S3 bucket.
Created and developed DBT models to apply ELT transformations on Snowflake.
Designed and Developed Spark workflows using Scala for data pulled from AWS S3 bucket and Snowflake applying transformations on it.
Created tables in Redshift and loaded data from Spark for business reporting.
Implemented Data Quality Checks, Validations, and ETL Transformations using Apache PySpark.
Developed MapReduce programs for pre-processing and cleansing the data in HDFS obtained from heterogeneous data sources to make it suitable for ingestion into Hive schema for analysis.
Involved in setting up Apache Airflow service in GCP.
Involved in developing Pig Scripts for change data capture and delta record processing between newly arrived data and already existing data in HDFS.
Wrote reports using Tableau Desktop to extract data for analysis using filters based on the business use case.
Implemented a proof of concept deploying this product in AWS S3 bucket and Snowflake.
Utilized AWS services with a focus on big data architecture, analytics, enterprise data warehouse, and business intelligence solutions to ensure optimal architecture, scalability, flexibility, availability, and performance, and to provide meaningful and valuable information for better decision-making.
Wrote, compiled, and executed programs as necessary using Apache Spark in Scala to perform ETL jobs with ingested data.
Automated resulting scripts and workflow using Apache Airflow and shell scripting to ensure daily execution in production.
Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation and used Spark engine, Spark SQL for data analysis, and provided to the data scientists for further analysis.
Prepared scripts to automate the ingestion process using Python and Scala as needed through various sources such as API, AWS S3, Teradata, and Snowflake.
Implemented a Continuous Delivery pipeline with Docker and GitHub.
Created HIVE Queries to process large sets of structured, semi-structured, and unstructured data and store them in Managed and External tables.
Created scripts to read CSV, JSON, and parquet files from S3 buckets in Python and load them into AWS S3, DynamoDB, and Snowflake.
Migrated data from AWS S3 bucket to Snowflake by writing custom read/write Snowflake utility function using Scala.
Created AWS Lambda functions, provisioned EC2 instances in the AWS environment, implemented security groups, and administered Amazon VPCs.
Used Apache Airflow in a GCP Composer environment to build a data pipeline, using various Airflow operators such as the Hadoop operator, Bash operator, Python callable operator, and branching operator.
Worked with building data warehouse structures, and creating facts, dimensions, aggregate tables, by dimensional modeling, Star and Snowflake schemas.
Worked on Snowflake schemas and data warehousing, and processed batch and streaming data load pipelines using Snowpipe.
Profiled structured, unstructured, and semi-structured data across various sources to identify patterns in data, and implemented data quality metrics using the necessary queries or Python scripts based on the source.
Worked on building dashboards in Tableau with ODBC connections from GCP BigQuery and the Presto SQL engine.
Worked on designing, building, deploying, and maintaining MongoDB.
Followed the Agile Scrum methodology for all the work performed.
Environment: Spark, Scala, AWS, ETL, Hadoop, Python, Snowflake, HDFS, Hive, MapReduce, PySpark, Pig, Docker, GitHub, Teradata, JSON, DBT, PostgreSQL, MongoDB, SQL, Agile.
Client: Capital One, McLean, Virginia Nov 2020 – Feb 2022
Role: Data Engineer
Responsibilities:
Designed and set up Enterprise Data Lake to provide support for various use cases including Analytics, processing, storing and reporting of voluminous, rapidly changing data.
Responsible for maintaining quality reference data in source by performing operations such as cleaning, transformation and ensuring Integrity in a relational environment by working closely with the stakeholders & solution architect.
Created a shell script to run DataStage jobs from UNIX and scheduled the script through a scheduling tool.
Designed and developed Security Framework to provide fine grained access to objects in AWS S3 using AWS Lambda, DynamoDB.
Set up and worked on Kerberos authentication principals to establish secure network communication on cluster and testing of HDFS, Hive, Pig and MapReduce to access cluster for new users.
Performed end-to-end architecture and implementation assessment of various AWS services like Amazon EMR, Redshift, and S3.
Implemented machine learning algorithms in Python to predict the quantity a user might want to order for a specific item so that it can be suggested automatically, using Kinesis Firehose and an S3 data lake.
Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
Used the Spark SQL Scala and Python interfaces, which automatically convert RDD case classes to schema RDDs.
Imported data from different sources like HDFS/HBase into Spark RDDs and performed computations using PySpark to generate the output response.
Created Lambda functions with Boto3 to deregister unused AMIs in all application regions to reduce the cost of EC2 resources.
Imported and exported databases using SQL Server Integration Services (SSIS) and Data Transformation Services (DTS) packages.
Coded Teradata BTEQ scripts to load and transform data and to fix defects such as SCD Type 2 date chaining and duplicate records.
Developed reusable framework to be leveraged for future migrations that automates ETL from RDBMS systems to the Data Lake utilizing Spark Data Sources and Hive data objects.
Conducted Data blending, Data preparation using Alteryx and SQL for Tableau consumption and publishing data sources to Tableau server.
Developed Kibana dashboards based on Logstash data and integrated different source and target systems into Elasticsearch for near real-time log analysis and monitoring of end-to-end transactions.
Implemented AWS Step Functions to automate and orchestrate Amazon SageMaker-related tasks such as publishing data to S3, training the ML model, and deploying it for prediction.
Integrated Apache Airflow with AWS to monitor multi-stage ML workflows with the tasks running on Amazon SageMaker.
Environment: AWS EMR, S3, RDS, Redshift, Lambda, Boto3, DynamoDB, Amazon SageMaker, Apache Spark, HBase, Apache Kafka, Hive, Sqoop, MapReduce, Snowflake, Apache Pig, Python, Tableau.
Client: Lowe's, Mooresville, North Carolina Jan 2020 – Sep 2020
Role: Data Engineer
Responsibilities:
Developed MapReduce jobs in both Pig and Hive for data cleaning and pre-processing.
Imported Legacy data from SQL Server and Teradata into Amazon S3.
Created consumption views on top of metrics to reduce the running time for complex queries.
Drove collection, cleaning, processing, and analysis of new and existing data sources, including oversight for defining and reporting data quality and consistency metrics.
Wrote scripts in Python.
Handled errors and debugged production issues in real time to minimize outages.
Enhanced the applications based on changes in the requirements.
Participated in production support, monitoring IT/Business alerts globally.
Used AWS data pipeline for Data Extraction, Transformation, validation, and loading from heterogeneous data sources.
Primarily used Python and PySpark Lambda functions to create on-demand tables on S3 files and to set up steps that run when files are modified.
Exported Data into Snowflake by creating Staging Tables to load Data of different files from Amazon S3.
Implemented several ETL flows to transfer data from multiple sources to S3 and Redshift.
Imported and exported data using Sqoop between HDFS and relational database systems (MySQL).
Installed and configured Hadoop MapReduce, HDFS and developed multiple MapReduce jobs in Java for data cleansing and preprocessing.
Compared the data in a leaf-level process from various databases when data transformation or data loading took place.
Developed Spark code using Python and Spark-SQL/streaming for faster testing and processing of data.
Closely involved in scheduling daily and monthly jobs with preconditions and postconditions based on the requirement.
Integrated Apache Airflow with AWS to monitor multi-stage ML workflows with the tasks running on Amazon SageMaker.
Implemented AWS Step Functions to automate and orchestrate Amazon SageMaker-related tasks such as publishing data to S3, training the ML model, and deploying it for production.
Generated various kinds of reports using Power BI and Tableau based on client specifications.
Created dashboards and reports in Tableau, and monitored server activities, user activity, and customized views.
Monitored the daily, weekly, and monthly jobs and provided support in case of failures or issues.
Worked on analyzing Hadoop Cluster and different big data analytic tools.
Used Cassandra CQL and Java APIs to retrieve data from the Cassandra table.
Worked with various HDFS file formats like Avro, Sequence File, JSON, and various compression formats like Snappy, bzip2.
Worked on data streaming processes with Kafka, Apache Spark, and Hive.
Developed and Configured Kafka brokers to pipeline server logs data into Spark streaming.
Developed Spark scripts by using Scala shell commands as per the requirement.
Imported data from Cassandra databases and stored it in AWS.
Managed Hadoop clusters using Cloudera. Performed extract, transform, and load (ETL) of data from multiple sources like flat files, XML files, and databases.
Used the Spark SQL Scala and Python interfaces, which automatically convert RDD case classes to schema RDDs.
Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs.
Used Amazon CLI for data transfers to and from Amazon S3 buckets.
Executed Hadoop/Spark jobs on AWS EMR using programs and data stored in S3 Buckets.
Implemented the workflows using the Apache Oozie framework to automate tasks.
Implemented Spark RDD transformations and actions to implement the business analysis using PySpark.
Devised PL/SQL Stored Procedures, Functions, Triggers, Views, and packages. Made use of Indexing, Aggregation and Materialized views to optimize query performance.
Used Spark Streaming APIs to perform the necessary transformations and actions on data received from Kafka.
Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
Environment: HDFS, MapReduce, Snowflake, Pig, Hive, Java, Kafka, Python, Spark, PL/SQL, AWS, Scala, SQL Server, Cassandra, Oozie, PySpark.
Client: Symbioun Solutions Pvt. Ltd India Feb 2016 – Jul 2018
Role: Big Data Engineer
Responsibilities:
Achieved near real-time reporting by adopting an event-based processing approach instead of micro-batching for data coming from Kafka.
Used Scala to convert Hive/SQL queries into RDD transformations in Apache Spark.
Implemented Spark solutions to generate reports, fetch and load data in Hive.
Implemented Spark applications using Scala and Python, utilizing DataFrames and the Spark SQL API for faster processing of data.
Responsible for wide-ranging data ingestion using Sqoop and HDFS commands. Accumulated partitioned data in various storage formats such as text, JSON, and Parquet. Loaded data from the Linux file system into HDFS.
Implemented Copy activity, and Custom Azure Data Factory Pipeline Activities.
Transformed business problems into Big Data solutions and defined the Big Data strategy and roadmap. Installed, configured, and maintained data pipelines.
Recreated existing SQL Server objects in Snowflake.
Converted SQL Server mapping logic to SnowSQL queries.
Primarily involved in Data Migration using SQL, Azure SQL, Azure Storage, and Azure Data Factory.
Implemented Databricks Delta tables and Delta Lake to enable ACID transaction logging.
Designed the business requirement collection approach based on the project scope and SDLC methodology.
Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data between different sources and targets such as Azure SQL, Blob storage, Azure SQL Data Warehouse, and a write-back tool.
Designed and maintained reports in Power BI built on top of Azure Synapse.
Developed various Mappings with the collection of all Sources, Targets, and Transformations using Informatica Designer.
Extracted the raw data from Microsoft Dynamics CRM to staging tables using Informatica Cloud.
Tuned the existing Informatica mappings to maximize performance and reduce run times.
Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats to uncover insights into customer usage patterns.
Migrated on-premises data (Oracle, SQL Server, DB2, MongoDB) to Azure Data Lake Store (ADLS) using Azure Data Factory (ADF V1/V2).
Worked on data that combined unstructured and structured data from multiple sources and automated the cleaning using Python scripts.
Architected and implemented medium to large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
Tackled highly imbalanced fraud datasets using under-sampling with ensemble methods, oversampling, and cost-sensitive algorithms.
Created action filters, parameters, and calculated sets for preparing dashboards and worksheets using Power BI.
Developed SQL reports that met client expectations for the application. Installed, configured, tested, monitored, upgraded, and tuned new and existing PostgreSQL databases.
Developed visualizations and dashboards using Power BI.
Implemented ad-hoc analysis solutions using Azure Data Lake Analytics/Store and HDInsight.
Created and maintained SQL Server scheduled jobs, executing stored procedures for the purpose of extracting data from Oracle into SQL Server. Extensively used Tableau for customer marketing data visualization.
Worked on analyzing Hadoop clusters and different big data analytic tools including Pig, and Hive.
Performed all necessary day-to-day Git support for different projects. Responsible for the design and maintenance of the Git repositories and the access control strategies.
Environment: Hadoop, Hive, Pig, Spark, Zookeeper, Kafka, Flume, Impala, Sqoop, Informatica, Snowflake, Azure, Azure Data Factory, Azure Synapse, Azure Databricks, HDInsight, Azure Data Lake, Power BI, PL/SQL, Oracle, PostgreSQL, SQL Server, DB2, MongoDB, Python, YARN, Git.
Client: Helical IT Solutions, Hyderabad, India. Jun 2013 – Jan 2016
Role: Hadoop Developer
Responsibilities:
Analyzed data using Hive, Pig, and HBase queries.
Worked on the entire ETL process.
Created a Qlik data analytics dashboard for drill-down analysis to identify the root cause of network problems and provide recommendations.
Used Regular expressions for filtering data from Logs for further data Analytics.
Responsible for running Hadoop streaming jobs to process terabytes of XML data.
Loaded and transformed large sets of structured, semi-structured, and unstructured data using Hadoop/Big Data concepts.
Involved in loading data from UNIX file system to HDFS.
Responsible for creating Hive tables, loading data and writing Hive queries.
Handled importing data from various data sources, performed transformations using Hive, MapReduce, and Apache Spark, and loaded data into HDFS.
Extracted data from Oracle Database into HDFS using Sqoop.
Exported the patterns analyzed back to Teradata using Sqoop.
Installed Oozie workflow engine to run multiple Hive and Pig jobs which run independently with time and data availability.
Worked with Spark machine learning libraries and streaming log files.
Proficient in Apache Spark and Scala.
Wrote and debugged code in Java and Scala for user-defined functions for Pig, Hive, and MapReduce jobs.
Used Flume to collect, aggregate, and store the call trace data logs, PM Counters, Site configuration data from different sources for different vendors for LTE/UMTS.
Education:
Bachelor's in Computer Science Engineering from GITAM University