Harshavardhan Reddy Yaramala
Data Engineer

CONTACT
***********@*****.***

EDUCATION
Master's in Computer Science, Wright State University, USA

TECHNICAL SKILLS
Big Data Technologies: Spark, Cloudera, Hive, Impala, HBase, Oozie, Kafka, Databricks, Airflow
Search and BI Tools: Power BI, Data Studio, Tableau
Software Development Life Cycle (SDLC): Agile, Waterfall
Project Management: Microsoft Teams, JIRA
Languages: SQL, Python, Scala
Cloud: Azure, AWS, GCP
Databases: Oracle, MySQL, SQL Server, Teradata
IDEs and Notebooks: Eclipse, IntelliJ, PyCharm, Jupyter, Databricks notebooks
Database Design and Data Warehousing: Database Design, Data Warehouse Design
Agile Methodology: Scrum, Agile, Iterative Development
Data Formats: JSON, Parquet, Avro, XML, CSV
Web Technologies: JDBC, JSP, Servlets, Struts (Tomcat, JBoss)
Operating Systems: Windows, Linux, Unix
Others: DevOps, ETL, Data Modeling & Database Design, Data Quality & Data Governance, Machine Learning & Analytics
PROFESSIONAL SUMMARY
Seasoned Data Engineer with 5+ years of experience designing, developing, and optimizing data pipelines and architectures. Adept at integrating and managing large-scale datasets, ensuring data quality, and leveraging advanced analytics to drive business insights.
• Results-driven Data Engineer with 5+ years of experience designing and implementing robust data solutions that drive business insights and enhance data-driven decision-making.
• Experience with Azure Marketplace for searching, deploying, and purchasing a wide range of applications and services. Experience in Big Data analytics and data manipulation using Hadoop ecosystem tools: MapReduce, YARN/MRv2, Pig, Hive, HDFS, HBase, Spark, Kafka, Flume, Sqoop, Oozie, Avro, ZooKeeper, AWS, and Spark integration with Cassandra.
• Practical experience using KStreams, KTables, and GlobalKTables in Apache Kafka and Confluent environments for Kafka stream processing.
• Developed batch processing solutions using Azure Data Factory and Azure Databricks.
• Good experience working with pytest, PyMock, and Selenium WebDriver frameworks for testing front-end and back-end components.
• Experience in automating day-to-day activities by using Windows PowerShell.
• Strong expertise in developing complex Oracle queries and database architectures, using PL/SQL to build stored procedures, functions, and triggers.
• Experienced with dimensional modeling, data migration, data cleansing, data profiling, and ETL processes for data warehouses.
• Experience with the Snowflake cloud data warehouse and AWS S3 buckets for integrating data from multiple source systems, including loading nested JSON-formatted data into Snowflake tables (a loading sketch appears at the end of this summary). In-depth knowledge of data sharing in Snowflake and experienced with Snowflake database, schema, and table structures.
• Experienced with story grooming, sprint planning, daily stand-ups, and methodologies such as Agile and SAFe. Proficient in building CI/CD pipelines in Jenkins using pipeline syntax and Groovy libraries.
• Worked with Windows Azure IaaS: Virtual Networks, Virtual Machines, Cloud Services, Resource Groups, ExpressRoute, Traffic Manager, VPN, Load Balancing, Application Gateways, and Auto-Scaling. Created Dockerfiles and Docker containers, built images, and hosted them in Artifactory.
• Proficient with Spark Core, Spark SQL, Spark MLlib, Spark GraphX, and Spark Streaming for processing and transforming complex data using in-memory computing, with jobs written in Scala.
• Hands-on experience interacting with REST APIs built on a microservices architecture to retrieve data from different sources.
• Implemented Big Data solutions using the Hadoop technology stack, including PySpark, Hive, Sqoop, Avro, and Thrift. Worked on production support, investigating logs and hot fixes, and used Splunk for log monitoring along with AWS CloudWatch.
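Below is a minimal sketch of the S3-to-Snowflake nested JSON loading referenced above, assuming a hypothetical external stage (@s3_events_stage) and illustrative table and column names; it shows the general COPY INTO plus LATERAL FLATTEN pattern, not the exact production code.

# Minimal sketch (hypothetical stage, table, and column names): load nested JSON
# staged in S3 into a Snowflake VARIANT column, then flatten it into a relational table.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="etl_user", password="***",   # placeholder credentials
    warehouse="ETL_WH", database="RAW_DB", schema="PUBLIC",
)
cur = conn.cursor()

# Copy raw JSON files from the external S3 stage into a VARIANT column.
cur.execute("""
    COPY INTO raw_events (payload)
    FROM @s3_events_stage
    FILE_FORMAT = (TYPE = 'JSON')
""")

# Flatten the nested items array into a relational target table.
cur.execute("""
    INSERT INTO events_flat (event_id, item_sku, item_qty)
    SELECT payload:event_id::STRING,
           f.value:sku::STRING,
           f.value:qty::NUMBER
    FROM raw_events,
         LATERAL FLATTEN(input => payload:items) f
""")
conn.close()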
PROFESSIONAL EXPERIENCE

Client: Huntington Bancshares, Columbus, Ohio, USA                                Jan 2023 - Present
Role: Azure Data Engineer
Description: Huntington Bancshares, Inc. operates as a bank holding company. I develop, construct, test, and maintain architectures such as databases and large-scale processing systems, and create and manage ETL (Extract, Transform, Load) processes to ensure data is processed and made available for analysis.
Responsibilities:
• Analyzed data using SQL, Python, Apache Spark and presented analytical reports to management and technical teams.
• Installed and automated applications using the configuration management tools Puppet and Chef.
• Provided high availability for IaaS VMs and PaaS role instances for access from other services in the VNet with Azure Internal Load Balancer. Used Kafka capabilities such as partitioning, replication, and the distributed commit-log model for messaging, maintaining feeds and loading data from REST endpoints into Kafka (a producer sketch appears at the end of this role).
• Developed tools using Python, shell scripting, and XML to automate tasks. Converted and parsed data formats using PySpark DataFrames, reducing time spent on data conversion and parsing by 40% (see the conversion sketch at the end of this role).
• Involved in building database models, APIs, and views using Python to deliver interactive web-based solutions.
• Designed and implemented infrastructure as code using Terraform, enabling automated provisioning and scaling of cloud resources on Azure. Used a continuous delivery pipeline to deploy microservices, including provisioning Azure environments, and developed modules using Python and shell scripting.
• Implemented Synapse integration with Azure Databricks notebooks, cutting development work roughly in half, and improved Synapse load performance by implementing a dynamic partition switch.
• Implemented automated data pipelines for data migration, ensuring a smooth and reliable transition to the cloud environment. Wrote Python scripts to import and export data in CSV and Excel formats across environments, and created a Celery action invoked via a REST API call.
• Developed JSON definitions for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL activity, and created UNIX shell scripts for database connectivity and parallel query execution.
• Worked on Jenkins pipelines to run various steps, including unit tests, integration tests, and static analysis tools.
• Implemented Airflow for workflow automation and task scheduling and created DAGs (a sample DAG sketch appears at the end of this role).
• Worked on Big Data integration and analytics based on Hadoop, Solr, PySpark, Kafka, Storm, and webMethods. Involved in requirement gathering, business analysis, and technical design for Hadoop and Big Data projects.
• Handled importing of data from various data sources, performed transformations, loaded data into HDFS, and extracted data from SQL into HDFS using Sqoop.
• Responsible for building and testing applications. Experienced in handling database issues and connections with SQL and NoSQL databases such as MongoDB, installing and configuring Python packages (Teradata, MySQL connector, PyMongo, SQLAlchemy). Wrote queries in MySQL and native SQL.
• Designed and built scalable data pipelines to ingest, translate, and analyze large data sets. Created pipelines to load data using ADF. Spearheaded HBase setup and used Spark and Spark SQL to develop faster data pipelines, resulting in a 60% reduction in processing time and improved data accuracy.
• Designed and configured databases, back-end applications, and programs. Managed large datasets using Pandas DataFrames and SQL. Scripted simulation hardware tests using the Simics simulator.
• Developed analytical components using Scala, Spark, Apache Mesos, and Spark Streaming. Installed Hadoop, MapReduce, and HDFS, and developed multiple MapReduce jobs in Pig and Hive for data cleaning and pre-processing.
• Worked with CI/CD tools such as Jenkins and Docker on the DevOps team, setting up the end-to-end application process with continuous deployment for lower environments and continuous delivery (with approvals) for higher environments.
• Designed and developed ETL pipelines for real-time data integration and transformation using Kubernetes and Docker.
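Sketch of the REST-to-Kafka loading mentioned earlier in this role, assuming a hypothetical endpoint URL, topic name, and record fields; the kafka-python client is used here purely for illustration.

# Minimal sketch (hypothetical endpoint, topic, and fields): pull records from a
# REST endpoint and publish them to a Kafka topic with kafka-python.
import json
import requests
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["broker1:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

resp = requests.get("https://api.example.com/v1/transactions", timeout=30)
resp.raise_for_status()

for record in resp.json():
    # Key by an assumed account_id field so related events land on the same partition.
    producer.send("transactions", key=str(record.get("account_id")).encode(), value=record)

producer.flush()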
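Sketch of the PySpark-based format conversion mentioned earlier in this role, with hypothetical storage paths and column names, assuming raw CSV converted to cleaned, partitioned Parquet.

# Minimal sketch (hypothetical paths and columns): convert raw CSV into
# cleaned, partitioned Parquet using PySpark DataFrames.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("format-conversion").getOrCreate()

raw = spark.read.option("header", True).csv("abfss://raw@storageacct.dfs.core.windows.net/input/")

cleaned = (
    raw.withColumn("event_ts", F.to_timestamp("event_ts"))
       .withColumn("event_date", F.to_date("event_ts"))
       .withColumn("amount", F.col("amount").cast("double"))
       .dropDuplicates(["record_id"])
)

# Partitioned Parquet output for downstream consumers.
cleaned.write.mode("overwrite").partitionBy("event_date").parquet(
    "abfss://curated@storageacct.dfs.core.windows.net/output/"
)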
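Sketch of an Airflow DAG like those mentioned earlier in this role, with hypothetical task IDs and script paths; it only illustrates the extract-then-transform wiring and a daily schedule.

# Minimal sketch (hypothetical task names and paths): a small Airflow DAG that
# runs an extract step followed by a Spark transform, scheduled daily.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {"retries": 2, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="daily_ingest",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    extract = BashOperator(
        task_id="extract_source",
        bash_command="python /opt/jobs/extract_source.py --date {{ ds }}",
    )
    transform = BashOperator(
        task_id="spark_transform",
        bash_command="spark-submit /opt/jobs/transform.py --date {{ ds }}",
    )
    extract >> transform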
Environment: Analytics, Apache, API, Azure, CI/CD, Data Factory, Docker, ETL, HBase, HDFS, Hive, IaaS, Java, Jenkins, JS, Kafka, Kubernetes, MapR, MySQL, PaaS, Pandas, Pig, Python, Scala, Spark, SQL, Sqoop, Storm, Teradata

Client: Medpace, Cincinnati, Ohio, USA                                Oct 2021 - Dec 2022
Role: AWS Data Engineer
Description: Medpace is a scientifically driven, global, full-service clinical contract research organization (CRO) providing Phase I-IV clinical development services to the biotechnology, pharmaceutical, and medical device industries. Implemented data ingestion processes from various data sources, ensuring data is accurately integrated into the system.
Responsibilities:
• Designed and developed a Java API (Commerce API) providing functionality to connect to Cassandra through Java services. Created clusters to classify control and test groups.
• Automated periodic AWS infrastructure deployments using Docker images and CloudFormation templates (a CloudFormation sketch appears at the end of this role).
• Converted SAS code to Python for predictive models using Pandas, NumPy, and scikit-learn.
• Involved in data validation and reporting using Power BI. Staged API and Kafka data (in JSON format) into Snowflake, flattening it for different functional services.
• Enhanced the platform by adding Python XML SOAP request/response handlers to add accounts, modify trades, and apply security updates.
• Set up the base Python application structure with the create-python-app package, SSRS, and PySpark. Developed PySpark modules for data ingestion and analytics, loading data from Parquet, Avro, and JSON files and from database tables.
• Provisioned high availability of AWS EC2 instances, migrated legacy systems to AWS, and developed Terraform plugins, modules, and templates for automating AWS infrastructure.
• Experience creating Kubernetes replication controllers, clusters, and label services to deploy microservices in Docker.
• Worked on data management disciplines including data integration, modeling, and other areas directly relevant to business intelligence and business analytics development.
• Created data tables using PyQt to display customer and policy information and to add, delete, and update customer records.
• Used Python to write data into JSON files for testing Django websites, and created scripts for data modeling and data import/export. Used Docker to manage application environments.
• Wrote AWS Lambda functions in Spark with cross-functional dependencies, generating custom libraries for delivering the Lambda functions in the cloud. Raw data ingestion triggered a Lambda function that put refined data into ADLS. Created job flows using Airflow in Python and automated the jobs; Airflow runs on a separate stack for developing DAGs and executes jobs on an EMR or EC2 cluster.
• Consulted leadership and stakeholders to share design recommendations, identify product and technical requirements, resolve technical problems, and suggest Big Data analytical solutions.
• Involved in loading and transforming large sets of structured, semi-structured, and unstructured data and analyzed them by running Hive queries. Processed image data through the Hadoop distributed system using MapReduce and stored the results in HDFS.
• Stored various configurations in the NoSQL database MongoDB and manipulated them using PyMongo (a PyMongo sketch appears at the end of this role).
• Well versed in various aspects of the ETL processes used to load and update the Oracle data warehouse.
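Sketch of the periodic CloudFormation-based deployment mentioned earlier in this role, with a hypothetical stack name, template path, and parameter; it uses boto3's CloudFormation client.

# Minimal sketch (hypothetical stack, template, and parameter names): deploy a
# CloudFormation stack with boto3 and wait for it to finish creating.
import boto3

cf = boto3.client("cloudformation", region_name="us-east-2")

with open("templates/data_platform.yaml") as fh:
    template_body = fh.read()

cf.create_stack(
    StackName="data-platform-dev",
    TemplateBody=template_body,
    Parameters=[{"ParameterKey": "Environment", "ParameterValue": "dev"}],
    Capabilities=["CAPABILITY_NAMED_IAM"],
)

# Block until the stack finishes creating.
cf.get_waiter("stack_create_complete").wait(StackName="data-platform-dev")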
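Sketch of the PyMongo-based config handling mentioned earlier in this role, assuming hypothetical database, collection, and document field names.

# Minimal sketch (hypothetical database/collection/fields): read and update a
# pipeline configuration document in MongoDB with PyMongo.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
configs = client["etl_metadata"]["pipeline_configs"]

# Fetch the active config for one pipeline (assumes it exists).
cfg = configs.find_one({"pipeline": "claims_ingest", "active": True})

# Bump the batch size and record who changed it.
configs.update_one(
    {"_id": cfg["_id"]},
    {"$set": {"batch_size": 5000, "updated_by": "etl_user"}},
)
client.close()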
Environment: Airflow, API, AWS, Cassandra, CI/CD, CloudFormation, Cluster, Docker, EC2, EMR, ETL, GCP, HDFS, Hive, Java, JS, Kafka, Kubernetes, Data Lake, Lambda, MapR, MongoDB, Oracle, Pandas, Pig, Python, SAS, Scala, Selenium, Snowflake, Spark, SQL

Client: Tata AIA Life Insurance, Mumbai, India                                Sep 2020 - Jul 2021
Role: Data Engineer
Description: Tata AIA Life Insurance Company Limited operates as an insurance company. It offers protection, wealth, savings, child, health and retirement solutions. I developed and managed data warehousing solutions to consolidate data from different sources.
Responsibilities:
• Instantiated, created, and maintained CI/CD (continuous integration and deployment) pipelines and applied automation to environments and applications. Worked with the Java Message Service (JMS) API to develop a message-oriented middleware (MOM) layer for handling asynchronous requests.
• Implemented navigation rules for the application and page outcomes, and wrote controllers using annotations.
• Created complex stored procedures, Slowly Changing Dimension Type 2 logic, triggers, functions, tables, views, and other T-SQL code, using SQL joins to ensure efficient data retrieval.
• Ensured data quality and accuracy with custom SQL and Hive scripts and created data visualizations using Python and Tableau for improved insights and decision-making.
• Installed, configured, administered, and monitored Azure IaaS and PaaS resources, Azure AD, Azure VMs, and networking (VNets, Load Balancers, Application Gateway, Traffic Manager, etc.).
• Configured Spark Streaming to consume ongoing information from Kafka and store the stream data in DBFS (a streaming sketch appears at the end of this role).
• Expertise in creating and developing applications for the Android operating system using Android Studio, Eclipse IDE, SQLite, Java, XML, the Android SDK, and the ADT plugin.
• Extensively used Databricks Spark and Jupyter notebooks for data analytics.
• Extracted and analyzed data from various sources; performed data wrangling and cleanup using Python/Pandas. Applied ML techniques for predictions using statistical libraries in Python. Used Elasticsearch and Kibana to index and visualize real-time analytics results (an indexing sketch appears at the end of this role), enabling stakeholders to gain actionable insights quickly.
• Deployed models as a Python package, as an API for backend integration, and as services in a microservices architecture with a Kubernetes orchestration layer for the Docker containers. Using Azure cluster services, ingested a large volume and variety of data from diverse source systems into Azure Data Lake Gen2 with Azure Data Factory V2.
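Sketch of the Kafka-to-DBFS streaming flow mentioned earlier in this role, with hypothetical broker, topic, and path names, using Spark Structured Streaming's Kafka source.

# Minimal sketch (hypothetical topic and paths): consume a Kafka topic with
# Spark Structured Streaming and persist it to DBFS as Parquet.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-to-dbfs").getOrCreate()

stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "policy_events")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; cast the value to string for downstream parsing.
events = stream.select(F.col("value").cast("string").alias("json_payload"),
                       F.col("timestamp"))

query = (
    events.writeStream.format("parquet")
    .option("path", "dbfs:/mnt/raw/policy_events/")
    .option("checkpointLocation", "dbfs:/mnt/checkpoints/policy_events/")
    .outputMode("append")
    .start()
)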
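Sketch of pushing prediction results into Elasticsearch for Kibana dashboards, as mentioned earlier in this role, with a hypothetical index name and document shape, using the elasticsearch-py 8.x client.

# Minimal sketch (hypothetical index and fields): index one scoring result so
# Kibana visualizations can pick it up in near real time.
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

doc = {
    "policy_id": "P-1001",
    "lapse_probability": 0.12,
    "scored_at": datetime.now(timezone.utc).isoformat(),
}
es.index(index="policy-scores", document=doc)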
Environment: Airflow, Analytics, Apache, API, Azure, Blob Storage, CI/CD, Cluster, Databricks, Docker, Elasticsearch, ETL, Data Factory, Gateway, Git, HDFS, Hive, Java, JS, Kafka, Kubernetes, Data Lake, MapR, Oozie, Oracle, Pandas, Python, SDK, Spark, SQL, Sqoop, Tableau, Teradata.

Client: Jio, Mumbai, India                                May 2019 - Aug 2020
Role: Data Engineer
Description: Jio is an Indian telecommunications company and a subsidiary of Jio Platforms. Streamlined data operations and integration processes across cloud platforms to enhance efficiency and reliability.
Responsibilities:
• Controlled and granted database access and migrated on-premises databases to Azure Data Lake Store using Azure Data Factory. Developed remote integrations with third-party platforms using RESTful web services.
• Involved in various phases of the Software Development Lifecycle (SDLC) of the application, including requirements gathering, design, development, deployment, and analysis. Conducted performance tuning and optimization of Kubernetes and Docker deployments to improve overall system performance.
• Analyzed existing systems and proposed process and system improvements for adopting modern scheduling tools like Airflow, and migrated legacy systems into an enterprise data lake built on Azure Cloud.
• Responsible for loading data from the BDW Oracle database and Teradata into HDFS using Sqoop. Implemented AJAX, JSON, and JavaScript to create interactive web screens.
• Created data pipelines and maintained the integrity of the data passing through them. Generated KPIs to reflect data passage.
• Built data pipelines using Python and Apache Airflow for ETL jobs inserting data into Oracle.
• Executed the validation process through Simics.
• Developed workflows using Oozie for running MapReduce jobs and Hive queries.
• Integrated Azure Data Factory with Blob Storage to move data through Databricks for processing and then to Azure Data Lake Storage and Azure SQL Data Warehouse.
• Instantiated, created, and maintained CI/CD (continuous integration and deployment) pipelines and applied automation to environments and applications. Worked with various automation tools, including Git, Terraform, and Ansible.
• Analyzed SQL scripts and redesigned them using Spark SQL for faster performance (see the Spark SQL sketch below).
• Performed ETL to move data from source systems to destination systems and worked on the data warehouse. Involved in database migration methodologies and integration conversion solutions to convert legacy ETL processes into an Azure Synapse compatible architecture.
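Sketch of a SQL script redesigned as Spark SQL, as referenced above, using hypothetical table, column, and path names over a temporary view.

# Minimal sketch (hypothetical tables and paths): run the rewritten query as
# Spark SQL over a temporary view and persist the aggregate for downstream loads.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-rewrite").getOrCreate()

txns = spark.read.parquet("dbfs:/mnt/curated/transactions/")
txns.createOrReplaceTempView("transactions")

daily_totals = spark.sql("""
    SELECT account_id,
           to_date(event_ts) AS event_date,
           SUM(amount)       AS daily_amount,
           COUNT(*)          AS txn_count
    FROM transactions
    GROUP BY account_id, to_date(event_ts)
""")

daily_totals.write.mode("overwrite").parquet("dbfs:/mnt/marts/daily_totals/")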
Environment: Apache, API, Azure, Blob Storage, Cloudera, Docker, ETL, Data Factory, Git, HDFS, HTML, JS, Kafka, Kubernetes, Data Lake, Linux, MySQL, Oracle, Pandas, Python, Services, Snowflake, Spark, SQL, Sqoop, Tableau, Teradata.