Kavya P
Big Data Engineer
**********@*****.***
LinkedIn: https://www.linkedin.com/in/kavya-8678456k/
PROFESSIONAL SUMMARY
• Dynamic and motivated IT professional with around 7 years of experience as a Big Data Engineer, with expertise in designing data-intensive applications using the Hadoop ecosystem, Big Data analytics, cloud data engineering, Data Warehouse / Data Mart, data visualization, reporting, and data quality solutions.
• In-depth knowledge of Hadoop architecture and its components such as YARN, HDFS, NameNode, DataNode, JobTracker, ApplicationMaster, ResourceManager, TaskTracker, and the MapReduce programming paradigm.
• Extensive experience in Hadoop-led development of enterprise-level solutions utilizing Hadoop components such as Apache Spark, MapReduce, HDFS, Sqoop, Pig, Hive, HBase, Oozie, Flume, NiFi, Kafka, ZooKeeper, and YARN.
• Profound experience in data ingestion and data processing (transformations, enrichment, and aggregations). Strong knowledge of distributed systems architecture and parallel processing, with an in-depth understanding of the MapReduce programming paradigm and the Spark execution framework.
• Experienced in using Spark to improve the performance of existing Hadoop algorithms through SparkContext, Spark SQL, the DataFrame API, Spark Streaming, MLlib, and pair RDDs; worked specifically with PySpark and Scala.
• Handled ingestion of data from different sources into HDFS using Sqoop and Flume, performed transformations using Hive and MapReduce, and loaded the results into HDFS. Managed Sqoop jobs with incremental loads to populate Hive external tables. Experienced in importing streaming data into HDFS using Flume sources and sinks, and in transforming the data using Flume interceptors.
• Experience with the Oozie workflow scheduler to manage Hadoop jobs as a Directed Acyclic Graph (DAG) of actions with control flows.
• Implemented security requirements for Hadoop and integrated with Kerberos authentication infrastructure: KDC server setup, creating realm/domain, managing.
• Experience with partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance. Experience with different file formats such as Avro, Parquet, ORC, JSON, and XML.
• Expertise in creating, debugging, scheduling, and monitoring jobs using Airflow and Oozie. Experienced with the most common Airflow operators: PythonOperator, BashOperator, GoogleCloudStorageDownloadOperator, GoogleCloudStorageObjectSensor, and GoogleCloudStorageToS3Operator.
• Hands-on experience in handling database issues and connections with SQL and NoSQL databases such as MongoDB, HBase, Cassandra, SQL Server, and PostgreSQL. Created Java apps to handle data in MongoDB and HBase. Used Phoenix to create a SQL layer on HBase.
• Experience in designing and creating RDBMS Tables, Views, User Created Data Types, Indexes, Stored Procedures, Cursors, Triggers and Transactions.
• Expert in designing ETL data flows by creating mappings/workflows to extract data from SQL Server, and in data migration and transformation from Oracle/Access/Excel sheets using SQL Server SSIS.
• Expert in designing parallel jobs using various stages such as Join, Merge, Lookup, Remove Duplicates, Filter, Dataset, Lookup File Set, Complex Flat File, Modify, Aggregator, and XML.
• Extensive experience in IT data analytics projects, with hands-on experience migrating on-premise ETLs to Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, and Cloud Composer.
• Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premise databases to Azure Data Lake Store using Azure Data Factory. Experience developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
• Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Amazon Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SES, SQS, Lambda, EMR and other services of the AWS family.
• Created and configured new batch jobs in Denodo Scheduler with email notification capabilities, implemented cluster settings for multiple Denodo nodes, and set up load balancing to improve performance.
• Instantiated, created, and maintained CI/CD (continuous integration and deployment) pipelines and applied automation to environments and applications.
• Worked with various automation tools such as Git, Terraform, and Ansible.
• Experienced in fact dimensional modeling (Star schema, Snowflake schema), transactional modeling, and SCD (Slowly Changing Dimension).
• Experienced with JSON-based RESTful web services and XML/QML-based SOAP web services; worked on various applications using Python IDEs such as Sublime Text and PyCharm.
• Developed web-based applications using Python, XML, CSS3, HTML5.
• Experience in using various Python packages such as ggplot2, pandas, NumPy, seaborn, SciPy, Matplotlib, scikit-learn, and Beautiful Soup.
TECHNICAL SKILLS
• Big Data Technologies: Hadoop, MapReduce, HDFS, Spark 2.3, Sqoop, Pig, Hive, HBase, Oozie, Flume, NiFi, Kafka, ZooKeeper, YARN.
• ETL Tools: Informatica, Talend
• Databases: Oracle, MySQL, SQL Server, PostgreSQL, Teradata.
• Programming Languages: Python, PySpark, Scala, Java, PL/SQL, SQL.
• Cloud Technologies: AWS, Google Cloud Platform, Microsoft Azure.
• Tools: PyCharm, Eclipse, Visual Studio, SQL*Plus, SQL Developer, SQL Navigator, SSIS, SSRS, SSAS.
• Build Tools: Maven, Jenkins
• Versioning tools: Git, GitHub
• Operating Systems: Windows, Linux.
• Database Modeling: Dimensional Modeling, ER Modeling, Star Schema Modeling, Snowflake Modeling
• Monitoring & Scheduling Tool: Apache Airflow, Oozie
• Visualization/ Reporting: Tableau, SSRS and Power BI
• Machine Learning Techniques: Linear & Logistic Regression, Classification and Regression Trees, Random Forest, Association Rules, NLP, and Clustering.
PROFESSIONAL EXPERIENCE
Pinterest – USA May 2022 – Present
Data Engineer
Roles & Responsibilities:
• Designed and set up an Enterprise Data Lake to support various use cases including analytics, processing, storing, and reporting of voluminous, rapidly changing data.
• Responsible for maintaining quality reference data in source by performing operations such as cleaning, transformation and ensuring Integrity in a relational environment by working closely with the stakeholders & solution architect.
• Designed and developed a security framework to provide fine-grained access to objects in AWS S3 using AWS Lambda and DynamoDB.
• Worked on Kerberos authentication principals to establish secure network communication on the cluster, and tested HDFS, Hive, Pig, and MapReduce cluster access for new users.
• Performed end-to-end architecture and implementation assessment of various AWS services such as Amazon EMR, Redshift, and S3.
• Implemented machine learning algorithms in Python to predict the quantity a user might want to order for a specific item, enabling automatic suggestions, using Kinesis Firehose and an S3 data lake.
• Expertise in using AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
• Expertise in using the Spark SQL Scala and Python interface, which automatically converts RDD case classes to schema RDD.
• Experience in importing data from different sources such as HDFS/HBase into Spark RDDs and performing computations using PySpark to generate the output response.
• Created Lambda functions with Boto3 to deregister unused AMIs in all application regions to reduce the cost for EC2 resources.
• Worked on importing and exporting databases using SQL Server Integration Services (SSIS) and Data Transformation Services (DTS packages).
• Coded Teradata BTEQ scripts to load and transform data and to fix defects such as SCD Type 2 date chaining and duplicate cleanup.
• Developed reusable framework to be leveraged for future migrations that automates ETL from RDBMS systems to the Data Lake utilizing Spark Data Sources and Hive data objects.
• Conducted Data blending, Data preparation using Alteryx and SQL for Tableau consumption and publishing data sources to Tableau server.
• Developed Kibana dashboards based on Logstash data and integrated different source and target systems into Elasticsearch for near-real-time log analysis and monitoring of end-to-end transactions.
• Implemented AWS Step Functions to automate and orchestrate Amazon SageMaker tasks such as publishing data to S3, training the ML model, and deploying it for prediction.
• Integrated Apache Airflow with AWS to monitor multi-stage ML workflows with the tasks running on Amazon SageMaker.
Environment: AWS EMR, S3, RDS, Redshift, Lambda, Boto3, DynamoDB, Amazon SageMaker, Apache Spark, HBase, Apache Kafka, Hive, Sqoop, MapReduce, Snowflake, Apache Pig, Python, SSRS, Tableau.
Thales - USA July 2021 to May 2022
Data Engineer
Roles & Responsibilities:
• Experience in migrating an entire Oracle database to BigQuery and using Power BI for reporting.
• Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.
• Experience in GCP Dataproc, GCS, Cloud Functions, and BigQuery.
• Experience in moving data between GCP and Azure using Azure Data Factory.
• Experience in building Power BI reports on Azure Analysis Services for better performance.
• Experience in using the Cloud Shell SDK in GCP to configure the Dataproc, Storage, and BigQuery services.
• Coordinated with the team and developed a framework to generate daily ad hoc reports and extracts of enterprise data from BigQuery.
• Expertise in designing and developing ETL mappings, Procedures and schedules following the standard development life cycle.
• Wrote custom Spark programs to track job performance over time.
• Coordinated with the Data Science team in designing and implementing advanced analytical models on the Hadoop cluster over large datasets.
• Wrote Hive SQL scripts to create complex tables with performance features such as partitioning, clustering, and skewing.
• Experience in downloading BigQuery data into pandas or Spark DataFrames for advanced ETL capabilities.
• Worked with Google Data Catalog and other Google Cloud APIs for monitoring, query, and billing-related analysis of BigQuery usage.
• Worked on creating a POC for utilizing ML models and Cloud ML for table quality analysis for the batch process.
• Knowledge of Cloud Dataflow and Apache Beam.
• Experience in using Cloud Shell for various tasks and deploying services.
• Experience creating BigQuery authorized views for row-level security or for exposing data to other teams.
• Expertise in designing and deploying Hadoop clusters and different Big Data analytic tools including Pig, Hive, Sqoop, and Apache Spark with the Cloudera distribution.
Environment: Google Cloud Storage, BigQuery, Dataproc, Composer, Cloud SQL, Cloud Functions, Cloud Pub/Sub, PySpark, Apache Spark, HBase, Apache Kafka, Hive, Sqoop, Flume, Apache Oozie, ZooKeeper, ETL, UDF, MapReduce, Snowflake, Apache Pig, Python, Java, SSRS.
Indeed – USA May 2020 to July 2021
Big Data Engineer
Roles & Responsibilities:
• Worked on Azure Data Factory to integrate data from both on-prem (MySQL, Cassandra) and cloud (Blob Storage, Azure SQL DB) sources and applied transformations to load the data back into Azure Synapse.
• Managed, configured, and scheduled resources across the cluster using Azure Kubernetes Service.
• Monitored the Spark cluster using Log Analytics and the Ambari Web UI. Transitioned log storage from Cassandra to Azure SQL Data Warehouse and improved query performance.
• Involved in developing data ingestion pipelines on an Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL. Also worked with Cosmos DB (SQL API and Mongo API).
• Developed dashboards and visualizations to help business users analyze data as well as providing data insight to upper management with a focus on Microsoft products like SQL Server Reporting Services (SSRS) and Power BI.
• Performed the migration of large data sets to Databricks (Spark): created and administered clusters, loaded data, configured data pipelines, and loaded data from ADLS Gen2 into Databricks using ADF ETL pipelines.
• Created various pipelines to load data from Azure Data Lake into a staging SQL DB and then into Azure SQL DB.
• Created Databricks notebooks to streamline and curate the data for various business use cases and mounted Blob Storage on Databricks.
• Utilized Azure Logic Apps to build workflows to schedule and automate batch jobs by integrating apps, ADF pipelines, and other services like HTTP requests, email triggers etc.
• Worked extensively on Azure Data Factory, including data transformations, Integration Runtimes, Azure Key Vault, triggers, and migrating Data Factory pipelines to higher environments using ARM templates.
• Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming to perform streaming analytics in Databricks.
Environment: Azure SQL DW, Databricks, Azure Synapse, Cosmos DB, ADF, SSRS, Power BI, Azure Data Lake, ARM, Azure HDInsight, Blob Storage, Apache Spark.
Tera Soft Solutions - India June 2015 to May 2019
Hadoop Developer
Roles & Responsibilities:
• Expertise in performing data transformations such as filtering, sorting, and aggregation using Pig.
• Created Sqoop jobs to import data from SQL, Oracle, and Teradata into HDFS.
• Created Hive tables to push the data to MongoDB.
• Experience in writing complex aggregate queries in MongoDB for report generation.
• Developed scripts to run scheduled batch cycles using Oozie and present data for reports.
• Worked on a POC for building a movie recommendation engine based on Fandango ticket sales data using Scala and the Spark machine learning library.
• Developed a big data ingestion framework to process multi-TB data, including data quality checks and transformation, stored the output in efficient formats such as Parquet, and loaded it into Amazon S3 using the Spark Scala API.
• Implemented automation, traceability, and transparency for every step of the process to build trust in data and streamline data science efforts using Python, Java, Hadoop streaming, Apache Spark, Spark SQL, Scala, Hive, and Pig.
• Performed data validation and transformation using Python and Hadoop Streaming. Loaded data from different sources such as Teradata and DB2 into HDFS using Sqoop and then into partitioned Hive tables.
• Developed bash scripts to fetch the TLOG file from the FTP server and process it for loading into Hive tables.
• Automated workflows using shell scripts and Control-M jobs to pull data from various databases into Hadoop Data Lake.
• Extensively used DB2 Database to support the SQL.
• Involved in story-driven Agile development methodology and actively participated in daily scrum meetings.
• Used INSERT OVERWRITE to refresh the Hive data with HBase data daily, and used Sqoop to load data from DB2 into the HBase environment.
• Converted Hive/SQL queries into Spark transformations using Spark RDDs and Scala; good experience using Spark Shell and Spark Streaming.
• Created Hive, Phoenix, HBase tables and HBase integrated Hive tables as per the design using ORC file format and Snappy compression.
• Developed Oozie workflows for daily incremental loads, which get data from Teradata and import it into Hive tables.
• Created Sqoop jobs and Pig and Hive scripts for data ingestion from relational databases to compare with historical data.
• Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL, and a variety of portfolios.
• Developed Pig scripts to transform the data into a structured format, automated through Oozie coordinators.
Environment: Hadoop, HDFS, Spark, Hive, Pig, Sqoop, Oozie, DB2, Java, Python, Oracle, SQL, Splunk, Unix, Shell Scripting.
EDUCATION
Bachelor of Technology in Computer Science & Engineering, GITAM, India.