Data Engineer Azure

Posted:

October 23, 2024

Resume:

CAREER OBJECTIVE

Results-driven Data Engineer with 6+ years of extensive experience in designing, building and maintaining scalable data infrastructure on cloud platforms, including Azure, AWS and Google Cloud (GCP). Adept at leveraging cloud-native services to optimize ETL processes, data pipelines and storage solutions, ensuring high performance and reliability.

PROFILE SUMMARY

Results-driven Data Engineer with 6+ years of experience in designing and implementing robust data solutions to drive business insights and enhance data-driven decision-making.

Designed, built and managed ELT data pipelines using Airflow, Python and GCP solutions.

Experience in building data pipelines using AWS services such as S3, Redshift, Glue, AWS Lambda functions, CloudWatch, SNS and DynamoDB. Familiarity with front-end technologies such as HTML, CSS, JavaScript and ReactJS.

Experience using Azure Marketplace to search for, deploy and purchase a wide range of applications and services.

Experience using various Hadoop distributions (Cloudera, MapR, Hortonworks, Azure) to fully implement and leverage new Hadoop features. Developed batch processing solutions using Azure Data Factory and Azure Databricks.

Utilized Python libraries like Scikit-Learn, TensorFlow, PyTorch and Keras to build machine learning models, focusing on data preparation and feature extraction within data workflows.

Expert in creating various Kafka producers and consumers for seamless data streaming with AWS services.

Highly skilled in using visualization tools such as Tableau, Matplotlib and ggplot2 to create dashboards.

Experience with Azure service models such as PaaS and IaaS, and with storage services such as Blob (page and block) and SQL Azure. Well experienced in deployment, configuration management and virtualization.

Experience working with various SDLC methodologies like Agile Scrum, RUP and Waterfall model.

Experience in implementing Azure data solutions and provisioning storage accounts, Azure Data Factory, SQL Server, SQL Databases, SQL Data Warehouse, Azure Databricks and Azure Cosmos DB.

Implemented production scheduling jobs using Control-M and Airflow.

Experience with the Snowflake cloud data warehouse and AWS S3 buckets for integrating data from multiple source systems, including loading nested JSON-formatted data into Snowflake tables. Experience in migration and deployment of applications with upgraded application and hardware versions, using MSBuild, batch scripts, IIS and Jenkins administration.

Skilled in the design and management of enterprise IT frameworks including databases, ERP solutions, middleware, user interfaces, networking components and overall infrastructure.

Hands-on experience implementing, building and deploying CI/CD pipelines, often including tracking multiple deployments across pipeline stages (Dev, Test/QA, staging and production).

Skilled in setting up and administering Windows and Linux servers to support data applications and workflows, with a focus on maximizing uptime and maintaining security in data engineering environments.

Experience using Cisco Cloud Center to securely deploy and manage applications across multiple data center, private cloud and public cloud environments. Designed and implemented a scalable data architecture on AWS using Kubernetes, Terraform and Snowflake, enabling seamless data integration and processing across multiple data sources.

EDUCATION

Master’s from University of North Texas, Denton, USA

TECHNICAL SKILLS

Languages

Python (NumPy, Pandas, Scikit-learn, Matplotlib, Seaborn), R, SQL, MATLAB, Spark, Java, Perl

Statistical Methods

Hypothesis Testing, ANOVA, Principal Component Analysis (PCA), Time Series, Correlation (Chi-square test, covariance), Multivariate Analysis, Bayes' Law.

ML Frameworks

TensorFlow, Keras, PyTorch, Scikit-learn, XGBoost, CNN, RNN, LSTM.

Cloud Platform

AWS (EC2, Lambda, ECS, S3, Glue, EBS, DynamoDB, Redshift, Aurora, CloudWatch), Azure, GCP

Data Visualization

Tableau, Python (Matplotlib, Seaborn), R (ggplot2), Power BI, Alteryx, QlikView, D3.js.

Hadoop Ecosystem

Hadoop, Spark, Snowflake, MapReduce, HiveQL, HDFS, Sqoop, PigLatin

Data Sources

PostgreSQL, MS SQL Server, MongoDB, MySQL, HBase, Amazon Redshift, Teradata.

Operating Systems

UNIX Shell Scripting (via PuTTY client), Linux, Windows, MacOS.

Other Tools and Technologies

TensorFlow, Keras, NLTK, SpaCy, Gensim, MS Office Suite, GitHub, Airflow, Datadog, dbt, SSIS

WORK EXPERIENCE

Client: Liberty Mutual Insurance, Plano, Texas, USA Oct 2023 - Present

Role: AWS Data Engineer

Description: Liberty Mutual Insurance Company is an American diversified global insurer. I develop and maintain scalable data models and architectures that support both current and future analytical needs. This involves understanding business requirements and translating them into efficient data structures.

Responsibilities:

Designed and developed a Java API (Commerce API) that provides functionality to connect to Cassandra through Java services.

Developed ETL solutions using dbt, Matillion and SSIS alongside Teradata to pull data from sources such as databases, APIs, flat files and JSON, ensuring efficient data transformation and integration.

Employed Alteryx for data cleaning and transformation, ensuring high-quality datasets for analysis.

Worked on Lambda functions that aggregate data from incoming events and store the results in Amazon DynamoDB.

Participated in the development and implementation of Cloudera Hadoop solutions.

Developed Spark Streaming programs to process near-real-time data from Kafka using both stateless and stateful transformations.

Created PySpark and Spark SQL scripts to analyze customer behavior patterns. Skilled in writing advanced database queries using T-SQL, PL/SQL and similar SQL dialects to support efficient data extraction, transformation and analysis.

Automated and monitored AWS infrastructure with Terraform for high availability and reliability, reducing infrastructure management time by 90% and improving system uptime.

Involved in the entire lifecycle of the projects including Design, Development, Deployment, Testing, Implementation and support.

Proficient in utilizing scripting languages including Perl and Shell to automate data workflows.

Developed triggers, stored procedures, functions and packages using cursors for the project in PL/SQL.

Built and configured Jenkins slaves for parallel job execution. Installed and configured Jenkins for continuous integration and performed continuous deployments.

Implemented automated data pipelines for data migration, ensuring a smooth and reliable transition to the cloud environment.

Used Docker for managing the application environments.

Built Power BI dashboards to provide real-time insights, supporting strategic project decisions.

Developed Kibana dashboards based on Logstash data and integrated different source and target systems into Elasticsearch for near-real-time log analysis and monitoring of end-to-end transactions.

Conducted query optimization and performance tuning tasks such as query profiling, indexing and utilizing Snowflake's automatic clustering to improve query response times and reduce costs.

Performed continuous integration/continuous delivery (CI/CD) on the Jenkins build pipeline and fixed build failures.

Hands-on experience working with AWS cloud services such as EMR, S3, Glue and Redshift.

Environment: API, AWS, Cassandra, DynamoDB, CI/CD, Cloudera, Teradata, Docker, Elasticsearch, EMR, ETL, dbt, Java, Jenkins, Power BI, Kafka, Lambda, PL/SQL, PySpark, Redshift, S3, SNS, Alteryx, Snowflake, Spark SQL, Spark Streaming, SQL.

Client: CenterPoint Energy, Houston, Texas, USA Nov 2022 - Sep 2023

Role: Azure Data Engineer

Description: CenterPoint Energy, Inc. is an American utility company. I continuously optimized data pipelines for performance, reliability and scalability. This involved tuning database queries, optimizing data storage strategies and implementing caching mechanisms.

Responsibilities:

Developed custom multi-threaded Java-based ingestion jobs as well as Sqoop jobs for ingesting data from FTP servers and data warehouses.

Executed Extract, Transform and Load (ETL) operations, extracting data from source systems and loading it into Azure Data Storage services.

Achieved 70% faster EMR cluster launch and configuration, optimized Hadoop job processing by 60%, improved system stability and utilized Boto3 for seamless file writing to S3 bucket.

Used native integration with Azure Active Directory (Azure AD) and other Azure services to build modern data warehouse, machine learning and real-time analytics solutions.

Built and optimized both serverless and traditional data solutions, using Scala for primary data processing tasks. Applied C# for application logic requirements and T-SQL for complex database queries, ensuring smooth and efficient data operations.

Wrote Spark Streaming applications to consume data from Kafka topics and write the processed streams to HBase.

Migrated large data sets to Databricks (Spark): created and administered clusters, configured data pipelines and loaded data from ADLS Gen2 into Databricks using ADF pipelines.

Used Spark DataFrames, Spark SQL and Spark MLlib extensively.

Created data tables using PyQt to display customer and policy information and to add, delete and update customer records.

Used Python-based GUI components for front-end functionality such as selection criteria.

Designed and implemented Infrastructure as code using Terraform, enabling automated provisioning and scaling of cloud resources on Azure.

Involved in various phases of Software Development Lifecycle (SDLC) of the application, like gathering requirements, design, development, deployment and analysis of the application.

Used T-SQL for MS SQL Server and ANSI SQL extensively on disparate databases.

Involved in monitoring and scheduling the pipelines using Triggers in Azure Data Factory.

Used Jira for ticketing and tracking issues and Jenkins for continuous integration and continuous deployment.

Controlled and granted database access and migrated on-premises databases to Azure Data Lake Store using Azure Data Factory.

Deployed models as a Python package, as an API for backend integration and as services in a microservices architecture with a Kubernetes orchestration layer for the Docker containers.

Skilled in monitoring servers using Nagios, CloudWatch and the ELK Stack (Elasticsearch and Kibana).

Implemented performance tuning techniques in Azure Data Factory and Azure Synapse Analytics.

Conducted performance tuning and optimization of the Snowflake data warehouse, resulting in improved query execution times and reduced operational costs.

Used Bitbucket as source control to push code and Bamboo as the deployment tool to build the CI/CD pipeline.

Environment: API, Azure, Azure Synapse Analytics, Azure Data Factory, Azure Data Lake, Bitbucket, CI/CD, Scala, Docker, EMR, ETL, HBase, Java, Jenkins, Jira, Kafka, Kubernetes, Python, S3, Snowflake, Spark, SQL, Sqoop

Client: HonorHealth, Mumbai, India Sep 2020 - Jul 2022

Role: GCP Data Engineer

Description: HonorHealth is an operator of a network of community healthcare facilities intended to improve the health and well-being of patients. I designed, developed and maintained data pipelines to efficiently move data from various sources into the data ecosystem. This involved working with tools such as Apache Kafka, Apache NiFi and custom ETL (Extract, Transform, Load) scripts.

Responsibilities:

Developed workflows using Apache Oozie for executing MapReduce jobs and Hive queries to process healthcare data.

Created Data Studio reports to analyze billing and service usage, contributing to cost optimization and efficiency in healthcare operations.

Utilized Dataproc and BigQuery to develop and maintain cloud-based solutions for managing large healthcare datasets.

Implemented a real-time data ingestion pipeline using Java/Scala and Apache Kafka, enabling near-real-time analysis of critical healthcare metrics.

Extracted Twitter feeds related to healthcare topics using the Python Twitter library to gain insights into patient sentiment.

Experienced in Google Cloud components, Google Container Builder, GCP client libraries and Cloud SDKs.

Developed and configured databases and backend applications to manage large datasets effectively using Pandas and SQL.

Coordinated with the team to develop frameworks for generating daily ad hoc reports and extracts from enterprise data stored in BigQuery.

Used Sqoop to ingest raw healthcare data into Google Cloud Storage by deploying Cloud Dataproc clusters.

Processed and loaded data from Google Pub/Sub topics into BigQuery using Cloud Dataflow with Python.

Migrated existing cron jobs to Airflow/Cloud Composer for improved scheduling and orchestration of healthcare data workflows.

Environment: Airflow, Apache, BigQuery, Blob, GCP, Hive, Java, Kafka, MapR, Oozie, Pandas, PySpark, Python, Scala, SDK, Services, Spark, SQL, Sqoop, VPC

Client: Bajaj Electricals, Mumbai, India Jun 2018 - Aug 2020

Role: Hadoop Developer

Description: Bajaj Electricals Ltd is an Indian consumer electrical equipment manufacturing company. I implemented the processes and tools to ensure the accuracy, completeness and consistency of data throughout its lifecycle. This included data profiling, cleansing and validation procedures.

Responsibilities:

Created ETL jobs for populating data into a Hadoop Data Lake from various source systems such as ODS, flat files, and Parquet files, ensuring seamless data integration and transformation.

Developed and managed workflows using Apache Oozie for executing and scheduling Hadoop jobs, enhancing the efficiency of data processing pipelines.

Wrote Kafka producers to stream data from external REST APIs into Kafka topics feeding Hadoop, facilitating real-time data ingestion for analytics.

Built and optimized Spark jobs within the Hadoop ecosystem using PySpark, performing table-to-table operations to process both structured and unstructured data.

Improved performance and optimization of existing algorithms in Spark using the Hadoop framework, employing Spark SQL, DataFrames, and pair RDDs.

Actively participated in all phases of the Software Development Life Cycle (SDLC) from implementation to deployment, focusing on Hadoop-based solutions.

Wrote and executed SQL queries against data stored in Hadoop using tools like Hive and Spark SQL for data analysis.

Developed and orchestrated data pipelines using Apache Airflow for ETL-related jobs within the Hadoop ecosystem, ensuring timely data availability.

Successfully completed a proof of concept (POC) for migrating on-premises data sources to Hadoop, demonstrating scalability and performance improvements.

Instantiated and maintained CI/CD (Continuous Integration & Deployment) pipelines for Hadoop-based applications, applying automation using tools like Git, Terraform, and Ansible.

Environment: Hadoop, ETL, Git, JS, Kafka, Kubernetes, MySQL, Oozie, PySpark, Python, Scala, CI/CD, Snowflake, Spark, SQL

TEJA SREE MANDADI

Data Engineer

Phone: 945-***-****

Email ID: **************@*****.***
