
Data engineer

Location:
Sault Ste. Marie, ON, Canada
Salary:
110
Posted:
June 28, 2023

Contact this candidate

Resume:

Candidate Name: Naga Saketh Rentala

EMAIL: *****************@*****.***

MOBILE: 249-***-****

PROFESSIONAL SUMMARY:

5+ years of overall IT experience in Data Engineering, Data Pipeline Design, Development, and Implementation across various industries, including Retail, Financial, and Technology.

Experience in designing Data Marts following Star Schema and Snowflake Schema methodologies.

In-depth understanding of Hadoop architecture and its components, such as the Resource Manager, Application Manager, Name Node, Edge Node, and Data Node.

Good understanding of Spark architecture, including Spark Core, Spark SQL, DataFrames, Spark Streaming, driver and worker nodes, stages, executors and tasks, deployment modes, the execution hierarchy, fault tolerance, and collections.

Experienced in working with big data technologies such as Spark SQL, Spark RDDs (Resilient Distributed Datasets), and Spark Streaming using Scala and PySpark.

Expertise in working with Hadoop ecosystem tools, such as writing Pig scripts and validating data in Hive with HQL queries; in-depth understanding of MapReduce concepts.

Worked with orchestration tools such as Apache Airflow, developing workflows in Python to orchestrate and manage data ingestion processes for batch and streaming use cases.

Expertise in creating complex DAGs (Directed Acyclic Graphs) to manage data pipeline orchestration at scale.

Experience in working with public cloud services such as AWS (Amazon Web Services); created and managed IAM (Identity and Access Management) roles and access policies at the enterprise level to control data access within the organization, adhering to enterprise data governance rules.

Very strong experience in creating and managing AWS S3 (Simple Storage Service) buckets: wrote bucket access policies to control data access and managed data retention timelines with S3 lifecycle policies.

Experienced in building cloud-native solutions for serverless data streaming use cases with real-time data ingestion patterns leveraging AWS Lambda. Experienced in working with Amazon SQS (Simple Queue Service) to consume real-time streaming data by configuring Lambda triggers.
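
For illustration, a minimal sketch of this Lambda/SQS consumption pattern, assuming the SQS queue is wired up as a Lambda event source (the handler body is a placeholder, not project code):

import json

def lambda_handler(event, context):
    # Each SQS-triggered invocation delivers a batch of records; the message payload is in "body".
    for record in event.get("Records", []):
        payload = json.loads(record["body"])
        # Placeholder: downstream processing of the streaming payload would go here.
        print(payload)
    return {"statusCode": 200}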

Experienced in designing and deploying solutions on the Microsoft Azure platform using Data Factory, Data Lake Store, Data Lake Analytics, Azure Databricks, and integration solutions.

Experienced in extracting and loading data from different relational databases such as Oracle, Teradata, and DB2 into Azure Data Lake Storage using Azure Data Factory.

Extensive experience using Azure cloud components such as Blob Storage accounts, Key Vault, ADLS Gen2, and Azure Databricks.

Expertise in working with Python libraries such as PySpark to create Spark DataFrames and consume raw data received in various file formats such as CSV, TSV, Parquet, ORC, JSON, and Avro.

Experience working with Databricks notebooks; leveraged the boto3 library to access AWS resources such as S3 object storage with partitioned data, and performed data analysis using the Spark DataFrame API and automated built-in notebook workflows.

Experience working with relational databases like MySQL, PostgreSQL, and Snowflake. Hands-on experience in developing views on top of SQL tables to control and manage access to relational data.

Good experience with ETL tools like Informatica and Visualization tools like Power BI and Tableau.

Experience in leveraging complex SQL Functions such as Aggregation and Window functions during data analysis and data validation.

Extensive experience working on complex IT projects, including the onsite-offshore model, utilizing both Waterfall and Agile methodologies.

TECHNICAL SKILLS:

Big Data

Apache Spark, Apache Airflow, PySpark, Scala, Apache Hadoop, MapReduce, Pig, Hive, HDFS, Databricks.

Cloud services

Azure Data Factory (ADF), Azure Data Lake Storage (ADLS), AWS (Amazon Web Services), EMR (Elastic MapReduce), S3 (Simple Storage Service), Lambda (serverless), ECS (Elastic Container Service), SNS (Simple Notification Service), SQS (Simple Queue Service), Amazon Redshift, AWS Glue.

Data Warehousing/BI Tools

Informatica, Tableau, Power BI

Operating Systems

macOS, Windows, Linux.

Programming languages

Python, Scala, Bash, Groovy, PL/SQL, HQL, Java, C.

Databases

MySQL, PostgreSQL, AWS RedShift, Azure SQL DB, Snowflake, Oracle, MongoDB.

Monitoring

AWS CloudWatch, Splunk, PagerDuty.

Data Analysis

Databricks, Jupyter Notebooks, Microsoft Excel.

Container and CI/CD

Docker, Jenkins.

Project Management

JIRA, Confluence

Version Control

Git, GitHub, GitLab, Bitbucket.

PROFESSIONAL EXPERIENCE:

Cigna, Ontario, Canada Mar 2021 - Present

Sr Data Engineer

Description: The team works closely with Data/Application Architects and other stakeholders to design and maintain advanced data pipelines. I was responsible for developing and supporting advanced reports and data pipelines that provide accurate and timely data for internal and external clients.

Responsibilities:

Create Spark clusters and manage the all-purpose and job clusters in Databricks hosted on the Azure cloud service.

Mount Azure Data Lake containers to Databricks and create service principals, access keys, and tokens to access the Azure Data Lake Gen2 storage account.
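
For illustration, a minimal sketch of this mount pattern, assuming a Key Vault-backed secret scope named kv-scope and placeholder tenant, container, and storage-account values (these names are illustrative, not from the project):

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get("kv-scope", "sp-client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("kv-scope", "sp-client-secret"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

# Mount the raw container of the ADLS Gen2 account onto DBFS using the service principal credentials.
dbutils.fs.mount(
    source="abfss://raw@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/raw",
    extra_configs=configs,
)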

Import raw data such as CSV and JSON files into Azure Data Lake Gen2, performing data ingestion by writing PySpark to extract the flat files.

Construct data transformations by writing PySpark in Databricks to rename, drop, clean, validate, and reformat the data into Parquet files and load them into an Azure Blob Storage container.
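
For illustration, a minimal sketch of this transformation step in a Databricks notebook; the paths and column names (cust_id, unused_col) are placeholders, not the project's actual schema:

from pyspark.sql import functions as F

# Read raw CSV files landed in the mounted Data Lake container.
raw_df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/mnt/raw/sales/*.csv"))

# Rename, drop, validate, and enrich columns before persisting.
clean_df = (raw_df
            .withColumnRenamed("cust_id", "customer_id")
            .drop("unused_col")
            .filter(F.col("customer_id").isNotNull())
            .withColumn("load_date", F.current_date()))

# Write the curated output to the mounted Blob storage container as Parquet.
clean_df.write.mode("overwrite").parquet("/mnt/curated/sales/")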

Develop Azure linked services to connect on-premises Oracle Database, SQL Server, and Apache Hive sources with Azure datasets in the cloud.

Build ETL data pipelines in Azure Data Factory (ADF) to manage and process more than 1B rows into Azure SQL Data Warehouse.

Configured input and output bindings of an Azure Function with an Azure Cosmos DB collection to read and write data from the container whenever the function executes.

Connected the Databricks notebooks with Airflow to schedule and monitor the ETL process.

Implement text preprocessing by removing stop words, punctuation, and digits, and by stemming and lemmatizing each token in sentences using NLTK and spaCy.
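
For illustration, a minimal sketch of this preprocessing step, assuming the NLTK stopwords corpus and the spaCy en_core_web_sm model are already installed (the model and corpus choices are assumptions, not from the project):

import spacy
from nltk.corpus import stopwords          # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer

nlp = spacy.load("en_core_web_sm")
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text):
    # Tokenize and lemmatize with spaCy, dropping stop words, punctuation, and digits.
    doc = nlp(text.lower())
    lemmas = [tok.lemma_ for tok in doc
              if tok.text not in stop_words and not tok.is_punct and not tok.is_digit]
    # Optionally stem the lemmas with NLTK's PorterStemmer.
    return [stemmer.stem(tok) for tok in lemmas]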

Train NLP question-answering models using BERT transfer learning to answer domain questions and expedite the named-entity recognition process.

Launch NLP dashboards using Dash, Plotly, and Power BI and maintain them on the server, saving 20% of the time in sprint review meetings.

Provided user management and support by administering epics, user stories, and tasks in Jira using Agile methodology, and logged process flow documents in Confluence.

Environment: Azure HDInsight, Databricks, Data Lake, Cosmos DB, MySQL, Azure SQL, Snowflake, Cassandra, Teradata, Ambari, Power BI, Azure, Blob Storage, Data Factory, Data Storage Explorer, Scala, Hadoop 2.x (HDFS, MapReduce, YARN), Spark, Git, PySpark, Airflow, Hive, HBase

Wayfair, Boston, Massachusetts / Hyderabad, India Sep 2019 - Jan 2021

Sr Big Data Engineer

Description: Wayfair is one of the world’s largest e-commerce companies selling furniture and home goods. The project leverages the largest data set for products sold in the home space; our team treats data as an asset and determines how to maximize its business value and extend our competitive advantage.

Responsibilities:

Worked on business requirements to design an aggregate data model for a complex data ingestion use case, combining multiple streams of credit card data into a single stream.

Designed and developed a DAG (Directed Acyclic Graph) orchestration model, based on the architecture capabilities within the organization, to streamline data ingestion in Apache Airflow.
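
For illustration, a minimal sketch of such an ingestion DAG in the Airflow 1.10 style used on this project; the DAG id, task names, and schedule are placeholders:

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.10-style import

def ingest(**context):
    # Placeholder callable; the real tasks submitted Spark work to EMR.
    pass

default_args = {"owner": "data-eng", "retries": 1}

with DAG(dag_id="card_data_ingestion",
         default_args=default_args,
         start_date=datetime(2020, 1, 1),
         schedule_interval="@daily",
         catchup=False) as dag:

    land_raw = PythonOperator(task_id="land_raw_files",
                              python_callable=ingest,
                              provide_context=True)
    build_aggregate = PythonOperator(task_id="build_aggregate_model",
                                     python_callable=ingest,
                                     provide_context=True)

    land_raw >> build_aggregate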

Created AWS S3 buckets to allow a file-drop mechanism for raw data files into specified object key locations, and configured bucket access policies to control which human and machine IAM roles can access the raw data inside the S3 buckets.

Developed PySpark code to consume raw CSV data received from several business streams into S3 buckets, leveraging the boto3 library and the machine IAM role attached to the EMR (Elastic MapReduce) and Glue clusters.

Implemented the Spark DataFrame API and Spark SQL extensively to perform basic and complex data transformations that consolidate the raw data into a common aggregate data model (schema).
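
For illustration, a minimal sketch of this consumption-and-consolidation step on EMR; the bucket, prefix, and column names are placeholders, and credentials come from the cluster's instance-profile IAM role rather than keys in code:

import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("card_stream_consolidation").getOrCreate()

# List newly dropped CSV objects with boto3 (the machine IAM role on the cluster supplies credentials).
s3 = boto3.client("s3")
objects = s3.list_objects_v2(Bucket="raw-card-data", Prefix="incoming/").get("Contents", [])
paths = ["s3://raw-card-data/" + obj["Key"] for obj in objects]

# Read the raw CSV drops and map them onto the common aggregate schema with Spark SQL.
raw_df = spark.read.option("header", "true").csv(paths)
raw_df.createOrReplaceTempView("raw_cards")

aggregate_df = spark.sql("""
    SELECT card_id,
           merchant,
           CAST(amount AS DOUBLE) AS amount,
           TO_DATE(txn_ts)        AS txn_date
    FROM raw_cards
    WHERE card_id IS NOT NULL
""")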

Extensively worked in Python to develop Apache Airflow DAGs. Further, dockerized the local application development process by creating custom Docker images based on Linux with Python 3.7 and Airflow 1.10 environment dependencies, making development and deployment seamless.

Configured the Airflow connection to the AWS EMR (Elastic MapReduce) cluster through a YAML configuration; provided HTTPS proxies, VPCs, and subnets to restrict and secure EMR IP addresses within the organization firewall; and developed bash bootstrap scripts to initialize the EMR cluster with the necessary environment and runtime configurations.

Leveraged the boto3 client to access the outbound S3 bucket data location, and partitioned and versioned every new data load to maintain a historical record of data processing events in the enterprise data lake.

Further enabled the connection to the Snowflake data warehouse, leveraging the Spark Snowflake connector and JDBC driver libraries to write the processed data into an aggregate relational schema for all enterprise operational use cases.

Performed data ingestion into the Snowflake warehouse by writing data to stage and destination tables, and developed Snowflake views on top of them to perform entry-level deduplication.
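
For illustration, a minimal sketch of the Spark-to-Snowflake write described above using the Spark Snowflake connector; the account, warehouse, credential, and table names are placeholders:

# Snowflake connection options; in practice the credentials came from a secrets store, not literals.
sf_options = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": "<etl_user>",
    "sfPassword": "<secret>",
    "sfDatabase": "ENTERPRISE_DB",
    "sfSchema": "AGGREGATE",
    "sfWarehouse": "ETL_WH",
}

# Append the processed DataFrame into the stage table; views downstream handle deduplication.
(aggregate_df.write
    .format("net.snowflake.spark.snowflake")
    .options(**sf_options)
    .option("dbtable", "CARD_TRANSACTIONS_STAGE")
    .mode("append")
    .save())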

Developed CI/CD pipelines in Groovy by onboarding GitHub repos to Jenkins; maintained separate Git repos for the data processing code and the pipeline orchestration code to decouple interdependencies, better maintain the code, and provide scope for migration in the near future.

Developed unit test cases for the PySpark data processing code leveraging pytest, unittest, unittest.mock, and PyUnit.
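
For illustration, a minimal pytest sketch of this style of unit test; the column names and the filter under test are placeholders, not the project's actual transformation code:

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # A local SparkSession lets transformation logic be unit tested without a cluster.
    return SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()

def test_null_card_ids_are_dropped(spark):
    df = spark.createDataFrame([("c1", 10.0), (None, 5.0)], ["card_id", "amount"])
    result = df.filter(df.card_id.isNotNull())
    assert result.count() == 1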

Onboarded repos to SonarQube, implemented the coverage API to measure code coverage in unit testing, performed static application security testing (SAST), and generated coverage.xml reports.

Extensively worked on data pipeline design, development, processing, and deployment, leveraging AWS resources (S3, Redshift, ECS, EMR, and Glue) to orchestrate and process data ingestion streams.

Hands-on experience working with Databricks notebooks to access processed and raw sources integrated into the cloud data lake, performing data analysis using the Spark DataFrame API and Spark SQL.

Developed SQL stored procedures, functions, views, indexes, and triggers to perform data validation on the processed relational data, along with database testing.

Integrated real-time monitoring for the data ingestion process by migrating logs to AWS CloudWatch and configuring CloudWatch alarms and alerts.
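
For illustration, a minimal boto3 sketch of configuring such an alarm; the metric namespace, alarm name, and SNS topic ARN are placeholders:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm whenever the ingestion pipeline reports one or more failed runs in a 5-minute window.
cloudwatch.put_metric_alarm(
    AlarmName="ingestion-failure-alarm",
    Namespace="DataPipelines",
    MetricName="FailedRuns",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:pipeline-alerts"],
)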

Environment: Apache Airflow, Apache Spark, Apache Hadoop, Python 3.7, PySpark 2.4.4, Spark SQL, Coverage, Behave, boto3, unittest, PyUnit, YAML, Bash, Docker, Jenkins, AWS EMR (Elastic MapReduce), AWS Glue, AWS Redshift, AWS ECS (Elastic Container Service), AWS CloudWatch, AWS S3 (Simple Storage Service), Snowflake, Databricks, Splunk.

Evoke Technologies, Hyderabad, India May 2018 - Aug 2019

ETL Developer

Responsibilities:

• Created HBase tables to store variable data formats of PII data coming from different portfolios.

• Designed data models to industry standards up to 3NF (OLTP) and denormalized (OLAP) data marts with Star and Snowflake schemas.

• Monitored the ETL support queue in ServiceNow; responsible for quick incident/request analysis and addressing incidents for closure.

• Responsible for monitoring ~1,500 ETL jobs and troubleshooting and fixing failures.

• Leveraged R Shiny to develop a user-friendly interface to view driver behavior.

• Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.

• Used PySpark and Spark SQL to read Parquet data and create tables in Hive using the Python API.

• Implemented PySpark jobs in Python, utilizing DataFrames and the PySpark SQL API for faster processing of data.
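
For illustration, a minimal sketch of reading Parquet from S3 and registering it as a Hive table through the Python API; the bucket, database, and table names are placeholders and assume the Hive database already exists:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("parquet_to_hive")
         .enableHiveSupport()
         .getOrCreate())

# Read partitioned Parquet data staged in S3 and register it as a managed Hive table.
events_df = spark.read.parquet("s3://raw-events-bucket/events/")
events_df.write.mode("overwrite").saveAsTable("analytics.events")

# The DataFrame API and Spark SQL can then be used interchangeably for downstream processing.
daily_counts = spark.sql(
    "SELECT event_date, COUNT(*) AS cnt FROM analytics.events GROUP BY event_date"
)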

• Developed Python scripts and UDFs using both DataFrames/SQL/Datasets and RDD/MapReduce in Spark 1.6 for data aggregation and queries, writing data back into the OLTP system through Sqoop.

• Handled large datasets using partitions, PySpark in-memory capabilities, broadcasts, and effective and efficient joins and transformations during the ingestion process itself.

• Processed schema-oriented and non-schema-oriented data using Python and Spark.

• Involved in writing live real-time processing and core jobs using Spark Streaming with Kafka as a data pipeline system.
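
For illustration, a minimal Structured Streaming sketch of consuming from Kafka with Spark (the original jobs used the Spark Streaming API of that era; the broker, topic, and path names here are placeholders, and the spark-sql-kafka package is assumed to be available):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("kafka_stream").getOrCreate()

# Subscribe to the Kafka topic carrying the real-time events.
stream_df = (spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "broker1:9092")
             .option("subscribe", "transactions")
             .load())

# Kafka delivers key/value as binary; cast the value to string before downstream parsing.
events = stream_df.select(F.col("value").cast("string").alias("payload"))

# Continuously persist the stream with checkpointing for fault tolerance.
query = (events.writeStream
         .format("parquet")
         .option("path", "/data/streaming/transactions/")
         .option("checkpointLocation", "/data/checkpoints/transactions/")
         .start())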

• Worked with Snowflake’s stored procedures, using them with corresponding DDL statements and the JavaScript API to easily wrap and execute numerous SQL queries.

Environment: Cloudera (CDH3), AWS, Snowflake, HDFS, Pig 0.15.0, Hive 2.2.0, Kafka, Sqoop, Shell Scripting, Spark 1.8, Linux (CentOS), MapReduce, Python 2, Eclipse 4.6, Informatica PowerCenter, Teradata, Oracle, Microsoft SQL Server, Flat files, ActiveBatch, UDeploy, GitLab, Jenkins

Education:

Bachelor of Technology: JNTUH, Telangana, India


