Data Engineering Azure

Location: Chennai, Tamil Nadu, India
Salary: $70
Posted: January 18, 2024

Resume:

Gayathri Goukanapally

+1-309-***-****

ad2v0w@r.postjobfree.com

Sr. Cloud Data Engineer

Professional Summary:

Over 10 years of experience in the IT sector, with a strong focus on Big Data and Cloud technologies.

Designing, architecting, and building solutions that ingest, process, and analyze large, disparate data sets with cloud big data technologies to meet business needs.

Hands-on experience with Spark, Snowflake DW, and Tableau for ETL, Data Modelling, Data Warehousing, and Business Intelligence.

Uncovered insights by developing Spark apps utilizing Scala and Spark SQL for data extraction, transformation, and aggregation from a variety of data file types.

Using Azure Data Factory (ADF V1/V2), migrated on-premises data (Oracle/SQL Server/DB2/MongoDB) to Azure Data Lake Store (ADLS).

Designed and developed Databricks notebooks using PySpark.
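
As an illustration of the kind of Databricks/ADLS notebook logic described above, here is a minimal PySpark sketch; the ADLS container, paths, column names, and table name are hypothetical placeholders, not the actual project assets.

    # Minimal PySpark sketch of a Databricks notebook cell (paths and names are illustrative).
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("adls-ingest").getOrCreate()  # provided automatically in Databricks

    # Read raw data landed in ADLS (hypothetical container/path).
    raw = spark.read.parquet("abfss://raw@mydatalake.dfs.core.windows.net/sales/2024/")

    # Basic cleansing and aggregation before publishing a curated table.
    curated = (
        raw.dropDuplicates(["order_id"])
           .withColumn("order_date", F.to_date("order_ts"))
           .groupBy("order_date", "region")
           .agg(F.sum("amount").alias("total_amount"))
    )

    # Persist as a managed table for downstream reporting (hypothetical name).
    curated.write.mode("overwrite").saveAsTable("curated.sales_daily")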

Configured and deployed Azure Automation scripts for applications that use the Azure stack, including Blob Storage, Azure Data Lake, Azure Data Factory, Azure SQL, and related utilities, with an emphasis on automating the conversion of Hive/SQL queries into Spark transformations using Spark RDDs.

Practical experience with Azure Data Factory, including creating and managing pipelines to make data integration smoother and more resilient from start to finish.

Expertise in utilizing pipelines to process data, including pipeline parameters, activities, activity parameters, string manipulation functions, and manual, window-based, and event-based job scheduling.

Designed and developed data pipelines using PySpark and Kafka to load semi-structured and structured data from various sources into Redshift and PostgreSQL RDS.
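
A minimal sketch of such a pipeline using Spark Structured Streaming; the Kafka broker, topic, schema, JDBC URL, table, and credentials below are hypothetical placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("kafka-to-postgres").getOrCreate()

    schema = StructType([
        StructField("event_id", StringType()),
        StructField("customer_id", StringType()),
        StructField("amount", DoubleType()),
    ])

    # Read semi-structured JSON events from a Kafka topic (placeholder broker/topic).
    events = (
        spark.readStream.format("kafka")
             .option("kafka.bootstrap.servers", "broker:9092")
             .option("subscribe", "transactions")
             .load()
             .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
             .select("e.*")
    )

    # Write each micro-batch to PostgreSQL over JDBC (placeholder URL/credentials).
    def write_batch(batch_df, batch_id):
        (batch_df.write.format("jdbc")
            .option("url", "jdbc:postgresql://rds-host:5432/analytics")
            .option("dbtable", "public.transactions")
            .option("user", "etl_user")
            .option("password", "***")
            .mode("append")
            .save())

    events.writeStream.foreachBatch(write_batch).option("checkpointLocation", "/tmp/ckpt").start()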

Wrote PySpark code and utilized AWS services including EC2 and S3 for data processing and storage, as well as Hadoop cluster maintenance on AWS EMR.

Good understanding of Amazon Web Services (AWS) technologies such as Athena, EMR, and EC2, which enable Teradata big data analytics to be processed quickly and efficiently.

Management and administration of AWS services: CLI, EC2, CloudTrail, IAM, VPC, S3, ELB, Glacier, Route 53, and Trusted Advisor.

Experience in on-premises and cloud-based implementation and end-to-end data warehouse development (Azure, AWS).

Practical expertise with Hadoop MapReduce, HDFS, Hive, Pig, HBase, ZooKeeper, Sqoop, Oozie, Cassandra, Flume, and Avro.

Experience in storing log and JSON data in HDFS and processing the data with Hive/Pig.

Expertise in writing sophisticated Scala code that works with JSON formats.

Worked with Spark in Scala and Python to create RDDs and execute operations such as transformations and actions.

Expertise in memory management and in tuning Scala programs for MapReduce, Impala, and Spark Streaming workloads.

Working knowledge of importing and exporting data between HDFS and RDBMS using Sqoop.

Demonstrated experience building and implementing Spark applications using PySpark to evaluate Spark's performance against Hive and SQL.

Expertise in translating complicated RDBMS queries (Oracle, MySQL, and Teradata) into Hive query language.

Understanding of ETL methodologies for data extraction, transformation, and loading in corporate-wide ETL solutions, as well as Data Warehouse tools for reporting and data analysis.

Experience conducting data analysis and data profiling on a variety of source systems, including Oracle and Teradata, utilizing sophisticated SQL.

Created Python scripts to automate data sampling and guarantee data integrity by verifying completeness, duplicates, correctness, and consistency.
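
A simplified sketch of the kind of pandas-based sampling and integrity checks described above; the file, column names, and sampling fraction are illustrative assumptions.

    import pandas as pd

    def validate(df: pd.DataFrame, key_cols, required_cols):
        """Return a dict of basic data-quality metrics for a DataFrame plus a spot-check sample."""
        sample = df.sample(frac=0.1, random_state=42)  # 10% sample for manual spot checks
        return {
            "row_count": len(df),
            "duplicate_keys": int(df.duplicated(subset=key_cols).sum()),
            "null_counts": df[required_cols].isna().sum().to_dict(),
            "sample_rows": len(sample),
        }

    # Example usage with a hypothetical extract.
    orders = pd.read_csv("orders_extract.csv")
    report = validate(orders, key_cols=["order_id"], required_cols=["order_id", "customer_id", "amount"])
    print(report)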

Extensive expertise building Python programs utilizing various libraries such as Pandas, Scikit-learn, NumPy, SciPy, Matplotlib, and others.

Used Java and J2EE best practices to avoid creating unneeded objects.

Using Java and associated tools, implemented, maintained, tested, and debugged various application programs for administrative and web systems.

Working knowledge of logging Java applications using the Log4j framework.

Created Tableau dashboards to visualize important data.

Excellent interpersonal, perceptive, analytical, and leadership abilities; fast learner with ability to understand and apply new concepts.

Technical Skills:

Hadoop/Spark Ecosystem

Hadoop, MapReduce, Pig, Hive/Impala, YARN, Kafka, Flume, Oozie, ZooKeeper, Spark, Airflow, MongoDB, Cassandra, HBase, and Storm.

Programming Languages

Python, Spark, Scala, JDBC, JSON, HTML, CSS, SQL, R, Shell Scripting

Script Languages

JavaScript, jQuery

Databases

Oracle, SQL Server, MySQL, Cassandra, Teradata, PostgreSQL, MS Access, Snowflake, NoSQL, HBase, MongoDB.

Cloud Platforms

AWS, Azure, GCP

Distributed Messaging System

Apache Kafka

Data Visualization Tools

Tableau, Power BI, Excel, ETL

Batch Processing

Hive, MapReduce, Pig, Spark

Operating System

Microsoft Windows

Reporting Tools/ETL Tools

Informatica PowerCenter, Tableau, Pentaho, SSIS, SSRS, Power BI

GCP Services

BigQuery, Cloud Storage, Dataflow, Dataproc, Pub/Sub, and others.

Professional Experience:

Client: Credit One, Las Vegas, NV

Role: Sr. Data Engineer Sep 2022 to Present

Responsibilities:

Running Spark SQL operations on JSON data, converting it into a tabular structure with DataFrames, and writing the results to Hive and HDFS.
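
A minimal PySpark sketch of that JSON-to-Hive flow; the HDFS paths, view, and table names are hypothetical placeholders.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("json-to-hive")
             .enableHiveSupport()
             .getOrCreate())

    # Load raw JSON events into a DataFrame (tabular structure).
    events = spark.read.json("hdfs:///data/raw/events/")

    # Run Spark SQL over the DataFrame.
    events.createOrReplaceTempView("events")
    daily = spark.sql("""
        SELECT event_date, event_type, COUNT(*) AS cnt
        FROM events
        GROUP BY event_date, event_type
    """)

    # Persist both to a Hive table and as Parquet files on HDFS.
    daily.write.mode("overwrite").saveAsTable("analytics.daily_events")
    daily.write.mode("overwrite").parquet("hdfs:///data/curated/daily_events/")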

Developing shell scripts for data ingestion and validation with different parameters, as well as writing custom shell scripts to invoke Spark jobs.

Tuned performance of Informatica mappings and sessions for improving the process and making it efficient after eliminating bottlenecks.

Familiarity with data science tools and frameworks, such as R, Python, and TensorFlow, and their integration with Hadoop.

Expertise in designing and implementing ETL workflows using workflow management tools, such as Apache Airflow and Oozie.

Familiarity with data visualization tools, such as Tableau and Power BI, and their integration with ETL processes.

Knowledge of machine learning algorithms and their implementation in Hadoop.

Expertise in developing Hadoop-based real-time streaming applications, using tools such as Kafka and Storm.

Strong debugging and troubleshooting skills for Hadoop-based applications.

Worked on complex SQL queries and PL/SQL procedures and converted them to ETL tasks.

Worked with PowerShell and UNIX scripts for file transfer, emailing, and other file-related tasks.

Created a risk-based machine learning model (logistic regression, random forest, SVM, etc.) to predict which customers are more likely to become delinquent based on historical performance data and to rank-order them.
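
A simplified scikit-learn sketch of such a risk model, not the actual one: the input file, feature names, and target column below are hypothetical.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression  # random forest / SVM are drop-in alternatives
    from sklearn.metrics import precision_score, recall_score
    from sklearn.model_selection import train_test_split

    # Historical performance data (hypothetical extract and feature names).
    df = pd.read_csv("customer_history.csv")
    X = df[["utilization", "payment_ratio", "tenure_months", "late_payments_12m"]]
    y = df["delinquent_next_cycle"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)

    # Evaluate with precision/recall and rank-order customers by predicted risk.
    preds = model.predict(X_test)
    print("precision:", precision_score(y_test, preds), "recall:", recall_score(y_test, preds))
    ranked = X_test.assign(risk=model.predict_proba(X_test)[:, 1]).sort_values("risk", ascending=False)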

Evaluated model output using the confusion matrix (precision, recall), and worked with Teradata resources and utilities (BTEQ, FastLoad, MultiLoad, FastExport, and TPump).

Developed a monthly report using Python to code the payment results of customers and make suggestions to the manager.

Ingestion and processing of Comcast set-top box clickstream events in real time with Spark 2.x, Spark Streaming, Databricks, Apache Storm, Kafka, and the Apache Ignite in-memory data grid (distributed cache).

Used various DML and DDL commands and constructs for data retrieval and manipulation, such as SELECT, INSERT, UPDATE, subqueries, inner joins, outer joins, UNION, and advanced SQL.

Using Informatica PowerCenter 9.6.1, extracted, transformed, and loaded data into the Netezza data warehouse from various sources such as Oracle and flat files.

Constructed data pipelines to pull data from SQL Server and Hive, landed the data in AWS S3, and loaded it into Snowflake after transformation.

Performed data analysis on large relational datasets using optimized diverse SQL queries.

Developed queries to create, modify, update, and delete data in the Oracle database and to analyze the data.

Environment: Scala 2.12.8, Python 3.7.2, PySpark, Spark 2.4, Spark MLlib, Spark SQL, TensorFlow 1.9, NumPy 1.15.2, Keras 2.2.4, Power BI, Spark Streaming, Hive, Kafka, ORC, Avro, Parquet, HBase, HDFS, Informatica, AWS

Client: Kaiser Permanente, Pleasanton, CA

Role: Sr. Cloud Data Engineer Feb 2021 to Aug 2022

Responsibilities:

Implemented a serverless architecture using API Gateway, Lambda, and DynamoDB, and deployed AWS Lambda code from Amazon S3 buckets. Created a Lambda function and configured it to receive events from an S3 bucket.
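
A minimal sketch of a Lambda handler wired to S3 ObjectCreated events in that pattern; the DynamoDB table name and item attributes are hypothetical.

    import urllib.parse

    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("ingest_audit")  # hypothetical DynamoDB table


    def lambda_handler(event, context):
        """Triggered by S3 ObjectCreated events; records each new object in DynamoDB."""
        records = event.get("Records", [])
        for record in records:
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            table.put_item(Item={
                "object_key": key,
                "bucket": bucket,
                "size_bytes": record["s3"]["object"].get("size", 0),
            })
        return {"processed": len(records)}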

Designed the data models used in data-intensive AWS Lambda applications aimed at complex analysis, creating analytical reports for end-to-end traceability, lineage, and definition of key business elements from Aurora.

Wrote code that optimizes the performance of AWS services used by application teams and provided code-level application security for clients (IAM roles, credentials, encryption, etc.).

Created AWS Lambda functions using Python for deployment management in AWS; designed and implemented public-facing websites on Amazon Web Services and integrated them with other application infrastructure.

Implemented Lambda, EC2, VPC, RDS, S3, IAM, and CloudWatch services and integrated them with Service Catalog.

Performed regular monitoring activities on Unix/Linux servers (log verification, CPU usage, memory, load, and disk space checks) to ensure application availability and performance using CloudWatch and AWS X-Ray. Implemented the AWS X-Ray service inside Confidential, which allows development teams to visually detect node and edge latency distribution directly from the service map.

Designed and developed ETL processes in AWS Glue to migrate data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift.
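
A condensed sketch of such a Glue job, assuming the standard Glue PySpark boilerplate; the catalog database, table, Glue connection, column mappings, and temp directory are hypothetical placeholders.

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Source: Parquet/ORC/text files cataloged from S3 (placeholder database/table).
    src = glue_context.create_dynamic_frame.from_catalog(database="raw_db", table_name="orders_s3")

    # Simple column mapping/rename before loading.
    mapped = ApplyMapping.apply(frame=src, mappings=[
        ("order_id", "string", "order_id", "string"),
        ("amount", "double", "order_amount", "double"),
    ])

    # Target: Redshift via a Glue connection (placeholder connection and temp dir).
    glue_context.write_dynamic_frame.from_jdbc_conf(
        frame=mapped,
        catalog_connection="redshift-conn",
        connection_options={"dbtable": "analytics.orders", "database": "dw"},
        redshift_tmp_dir="s3://my-temp-bucket/glue/",
    )
    job.commit()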

Utilized Python libraries such as Boto3 and NumPy for AWS work.

Used Amazon EMR for MapReduce jobs and tested locally using Jenkins.

Created external tables with partitions using Hive, AWS Athena, and Redshift.
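
As an illustration of the partitioned external tables mentioned above, a hedged sketch that submits the DDL to Athena with boto3; the bucket, database, table, columns, and result location are hypothetical.

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    ddl = """
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.web_events (
        event_id string,
        user_id  string,
        amount   double
    )
    PARTITIONED BY (event_date string)
    STORED AS PARQUET
    LOCATION 's3://my-data-bucket/web_events/'
    """

    athena.start_query_execution(
        QueryString=ddl,
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    # New partitions are registered afterwards with MSCK REPAIR TABLE or ALTER TABLE ... ADD PARTITION.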

Developed the PySpark code for AWS Glue jobs and for EMR.

Good understanding of other AWS services such as S3, EC2, IAM, and RDS; experience with orchestration and data pipeline services such as AWS Step Functions, Data Pipeline, and Glue.

Experience in writing SAM templates to deploy serverless applications on the AWS cloud.

Hands-on experience working with AWS services such as Lambda, Athena, DynamoDB, Step Functions, SNS, SQS, S3, and IAM.

Designed and developed ETL jobs in AWS Glue to extract data from S3 objects and load it into a data mart in Redshift.

Responsible for designing logical and physical data models for various data sources on Redshift.

Experienced with event-driven and scheduled AWS Lambda functions to trigger various AWS resources.

Integrated Lambda with SQS and DynamoDB using Step Functions to iterate through lists of messages and update their status in a DynamoDB table.
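
A minimal sketch of the Lambda side of that integration, iterating SQS records and updating a status attribute in DynamoDB; the table, key, and attribute names are hypothetical.

    import json

    import boto3

    table = boto3.resource("dynamodb").Table("message_status")  # hypothetical table


    def lambda_handler(event, context):
        """Invoked with a batch of SQS messages; marks each message as processed in DynamoDB."""
        records = event.get("Records", [])
        for record in records:
            body = json.loads(record["body"])
            table.update_item(
                Key={"message_id": body["message_id"]},
                UpdateExpression="SET #s = :s",
                ExpressionAttributeNames={"#s": "status"},
                ExpressionAttributeValues={":s": "PROCESSED"},
            )
        return {"updated": len(records)}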

Environment: AWS EC2, S3, EBS, ELB, EMR, Lambda, RDS, SNS, SQS, Scala, VPC, IAM, CloudFormation, CloudWatch, ELK Stack, Bitbucket, Python, Shell Scripting, GIT, Jira, Unix/Linux, AWS X-Ray, DynamoDB, Kinesis.

Client: TIAA, TX

Role: Big Data Developer July 2019 to Jan 2021

Responsibilities:

Designed and set up an Enterprise Data Lake to support various use cases, including storing, processing, analytics, and reporting of voluminous, rapidly changing data using various AWS services.

Integrated large-scale/complex systems effectively, while producing solutions that extracted data from multiple sources, such as Enrollments, Carriers, HIX applications, and customer databases.

Performed initial assessment of current and future technology needs, while eliciting end-user manual procedures to make recommendations that improve functionality.

Expertise in building CI/CD in the AWS environment using AWS CodeCommit, CodeBuild, CodeDeploy, and CodePipeline, and experience using AWS CloudFormation, API Gateway, and AWS Lambda to automate and secure the infrastructure on AWS.

Created Python/SQL scripts in Databricks notebooks to transform data from Redshift tables into Snowflake via S3 buckets.

Built and managed servers, firewall rules, storage, and server authentication on OpenStack and AWS.

Designed and implemented ETL programs/PLSQL objects and carried out performance tuning on multiple long-running programs using Pentaho ETL.

Developed Databricks ETL pipelines using notebooks, Spark DataFrames, Spark SQL, and Python scripting.

Leveraged Tableau to quickly visualize raw data in the requirements stage itself and suggest trends, patterns of interest. This greatly helped our users to fine-tune the end dashboard’s requirements to provide deeper insight using visual analytics.

Data Lake and Object Storage: used Terraform to provision and configure object storage services such as Amazon S3 and Azure Blob Storage, which are commonly used in data engineering for storing large volumes of raw or processed data; defined storage buckets, access policies, encryption settings, and lifecycle rules to automate the creation and management of these resources.

Provided FTP administration activities such as automation, scheduling, and troubleshooting on UNIX/Linux, AS/400, and Windows platforms.

Data automation and migration.

Implemented SQLAlchemy, a Python library providing full access to SQL from Python.
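
A short SQLAlchemy sketch of that usage; the connection string, table, and query are placeholders.

    from sqlalchemy import create_engine, text

    # Hypothetical PostgreSQL-style connection string.
    engine = create_engine("postgresql+psycopg2://etl_user:***@db-host:5432/analytics")

    with engine.connect() as conn:
        rows = conn.execute(
            text("SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id")
        )
        for customer_id, total in rows:
            print(customer_id, total)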

Created database objects such as tables, views, materialized views, procedures, and packages using Oracle tools such as Toad, PL/SQL Developer, and SQL*Plus.

Used different data warehousing concepts to build a data warehouse for internal departments of the organization.

Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Cloud Storage, and BigQuery.

Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.
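
A minimal Airflow DAG sketch for such a GCP ETL job, using BashOperator around the bq CLI to keep it simple; the project, dataset, bucket, and schedule below are hypothetical assumptions.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="gcp_daily_etl",
        start_date=datetime(2021, 1, 1),
        schedule_interval="0 2 * * *",  # run daily at 02:00
        catchup=False,
    ) as dag:
        # Load raw Parquet files from Cloud Storage into BigQuery (hypothetical project/dataset/bucket).
        load_raw = BashOperator(
            task_id="load_raw_to_bq",
            bash_command=(
                "bq load --source_format=PARQUET "
                "my_project:raw.events gs://my-bucket/events/*.parquet"
            ),
        )

        # Rebuild a simple daily mart table from the raw data.
        build_mart = BashOperator(
            task_id="build_daily_mart",
            bash_command=(
                "bq query --use_legacy_sql=false "
                "'CREATE OR REPLACE TABLE my_project.mart.daily_events AS "
                "SELECT event_date, COUNT(*) AS cnt FROM my_project.raw.events GROUP BY event_date'"
            ),
        )

        load_raw >> build_mart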

Worked with a team of developers transitioning the current in-house data warehouse solution to GCP.

Created and enhanced data solutions enabling seamless delivery of data and responsible for collecting, parsing, managing, and analyzing large sets of data.

Led the design of the logical data model, implemented the physical database structure, and constructed and implemented operational data stores and data marts.

Applied agile methodology for design and development; prepared project plans and test plans.

Implemented data structures using best practices in data modeling, ETL/ELT processes, SQL and python.

Established scalable, efficient, automated processes for large scale data analyses.

Identified, designed, and implemented internal process improvements: automating manual processes, optimizing data delivery, re-designing infrastructure for greater scalability.

Improved the data quality and reliability of the ETL pipelines using innovative solutions (monitoring and failure detection).

Worked in Agile development (scrum, Kanban) incorporating Continuous Integration and Continuous Delivery, utilizing technologies such as GIT, SVN.

Environment: AWS Glue, S3, IAM, EC2, RDS, Tableau, Scala, Automation, Redshift, Azure, Lambda, Boto3, GCP, CI/CD, DynamoDB, Apache Spark, Terraform, Kinesis, Athena, Hive, Sqoop, Python, Snowflake.

Client: Cyient Ltd, Hyderabad, India

Role: Big Data Developer Mar 2017 to Mar 2019

Responsibilities:

Designed and developed ETL processes with PySpark in AWS Glue to migrate data from S3 and generate reports.

Involved in writing and scheduling Databricks jobs using Airflow.

Authored Spark jobs for data filtering and transformation through PySpark DataFrames in both AWS Glue and Databricks.

Used the AWS Glue catalog with Athena to query data in S3 with SQL.

Data Security: Implemented data encryption and access control measures in Scala for sensitive data, ensuring compliance with data security and privacy regulations.

Wrote various data normalization jobs for new data ingested into S3.

Designed and developed ETL processes with PySpark in AWS Glue to migrate data from external sources and S3 files into AWS Redshift.

Authored Spark jobs for data filtering and transformation through PySpark DataFrames.

Environment: AWS, EC2, S3, Kafka, Spark, Spark SQL, PostgreSQL, Shell Script, Scala.

Client: Kelton Tech, Hyderabad, India

Role: Business Intelligence Developer April 2013 to Feb 2017

Responsibilities:

Involved in analysis, design, and development of projects to consume transactional data, consolidate it into a data warehouse, and create reports and dashboards; wrote complex stored procedures to create datasets for reports and dashboards.

Utilized Power Query to transform and clean data from various sources for optimal performance in Power BI.

Wrote complex MDX and DAX queries to create calculated measures and custom KPIs.

Used SSIS and SQL procedures to transfer data from OLTP databases to the staging area and finally into the data warehouse.

Attained expertise in Business Intelligence and Data Visualization tools: Tableau, MicroStrategy.

Environment: MS SQL Server 2016, Power BI, SQL Profiler, VS 2015/17, TFS 2012/15, MS Office 2010, SSIS/SSRS/SSAS.


