HARSHITH Senior Data Engineer ********.****@*****.*** https://www.linkedin.com/in/harshith45/ +1-575-***-****
Professional Summary:
With over 10 years of hands-on experience, I have refined my expertise in crafting and executing comprehensive data solutions and architectures, with a specialization in Data Engineering.
My proficiency spans various domains:
Skilled in utilizing GCP services such as Compute Engine, Cloud Load Balancing, Cloud Storage, Cloud Dataproc, Cloud Functions, Cloud Pub/Sub, Cloud Shell, Cloud SQL, BigQuery, Cloud Data Fusion, Cloud Dataflow, Stackdriver Monitoring, and Cloud Deployment Manager.
Hands-on experience in data analysis using HiveQL, custom MapReduce programs (Java/Python), and complex HiveQL queries for data extraction, along with developing Hive User Defined Functions (UDFs) as needed. Proven track record of using the Snowflake database and Python for development and optimization tasks, delivering efficient, well-structured database code.
Profound understanding and practical application of Hadoop/Big Data technologies, encompassing storage, querying, processing, and analysis.
Proficient in Python and Apache Beam for data validation and processing in Google Cloud Dataflow.
Implemented custom-built input adapters using Spark, Hive, and Sqoop to ingest data for analytics from various sources (Snowflake, MS SQL, MongoDB) into HDFS. Imported data from web servers and Teradata using Sqoop, Flume, and Spark Streaming API.
Expertise in data analysis using HiveQL, HBase, and custom MapReduce programs.
Experienced in designing and implementing migration strategies for traditional systems on Azure, utilizing services like Azure SQL Database and Azure Data Factory.
Hands-on experience in GCP, particularly in BigQuery, Cloud Dataflow, and Dataproc.
Developed complex data mappings and executed Proof of Concepts (POC) for transitioning MapReduce jobs into Spark transformations.
Skilled in developing Apache Spark jobs using Python and Spark SQL for efficient data processing.
Strong understanding of statistics and experience in developing machine learning models, including on Databricks.
Proficient in Python, Scala, and core computer science concepts like data structures and algorithms.
Experienced in requirement analysis, application development, application migration, and maintenance using Software Development Lifecycle (SDLC) and Python/Java technologies.
Developed applications for data processing tasks using databases including Teradata, Snowflake & Postgres.
Built ETL pipelines, visualizations, and analytics-based solutions using AWS, Azure Databricks, and related big data frameworks.
Well-versed in Hadoop distributions like Cloudera, Hortonworks, and AWS EMR.
Experienced in data importation, transformation, and migration using Sqoop, Hive, Pig, and Spark.
Strong understanding of Data Modeling, encompassing relational, dimensional, and star/snowflake schemas.
Experienced in both Waterfall and Agile methodologies, covering the complete Software Development Life Cycle.
Proficient in generating knowledge reports using Tableau, Power BI, and Qlik based on business specifications.
Technical Skills
Big Data Technologies: HDFS, YARN, MapReduce, Hive, Pig, Impala, Sqoop, Storm, Flume, Spark, Apache Kafka, Zookeeper, Ambari, Oozie, MongoDB, Cassandra, Mahout, Puppet, Avro, Parquet, Snappy, Falcon.
NO SQL Databases: Postgres, HBase, Cassandra, MongoDB, Amazon DynamoDB, Redis
Hadoop Distributions: Cloudera (CDH3, CDH4, and CDH5), Hortonworks, MapR, and Apache.
Languages: Scala, Python, R, XML, XHTML, HTML, AJAX, CSS, SQL, PL/SQL, HiveQL, Unix, Shell Scripting
Source Code Control: GitHub, CVS, SVN, ClearCase
Cloud Computing Tools: Amazon AWS (S3, EMR, EC2, Lambda, VPC, Route 53, CloudWatch, CloudFront), Microsoft Azure, GCP
Databases: Teradata, Snowflake, Microsoft SQL Server, MySQL, DB2
DB languages: MySQL, PL/SQL, PostgreSQL & Oracle
Build Tools: Jenkins, Maven, Ant, Log4j
Business Intelligence Tools: Tableau, Power BI
Development Tools: Eclipse, IntelliJ, Microsoft SQL Studio, Toad, NetBeans
ETL Tools: Talend, Pentaho, Informatica, Ab Initio, SSIS
Development Methodologies: Agile, Scrum, Waterfall, V model, Spiral, UML
Professional Experience
Client: Edward Jones St. Louis, MO November 2022 to Present
Role: SR Cloud Engineer
Description: As a Senior GCP Data Engineer at Edward Jones, I lead GCP solution implementations, optimize data warehousing with BigQuery, Dataproc, and Storage, and conduct real-time analytics using Snowflake and Looker to drive informed decision-making.
Responsibilities:
Specializing in big data solutions like BigQuery, Cloud Dataproc, Google Cloud Storage, and Composer (using Airflow), and proficient in configuring GCP services via Cloud Shell SDK.
Designed and implemented a highly scalable and secure data management framework in GCP, proficiently utilizing Snowflake database. Defined data models and ensured seamless integration with on-premises sources.
Designed and implemented the GCP Organizations setup, Project setup, IAM access and GCP Service Account setup for development, QA and production support teams.
Designed and implemented data solutions on Google Cloud Platform (GCP) using BigQuery, DataProc, DataFlow, and Cloud Storage to optimize data processing pipelines.
Developed and optimized large-scale ETL workflows leveraging PySpark, Apache Spark, Hive, and SQL to process and analyze structured and unstructured data.
Engineered scalable data solutions across platforms such as Hadoop, Teradata, and SQL Server, ensuring seamless data integration and accessibility.
Migrated on-premises ETLs to GCP using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, and Composer.
Processed and loaded inbound and outbound data from Google Pub/Sub topics to BigQuery using Cloud Dataflow with Python.
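A minimal sketch of this Pub/Sub-to-BigQuery streaming pattern with the Apache Beam Python SDK; the topic, table, and schema names below are illustrative placeholders, not the production values.

    # Illustrative Apache Beam pipeline: Pub/Sub -> parse -> BigQuery (runner flags are passed via the CLI).
    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions


    def parse_message(message: bytes) -> dict:
        """Decode a Pub/Sub payload into a BigQuery-ready row (hypothetical schema)."""
        record = json.loads(message.decode("utf-8"))
        return {"event_id": record["id"], "payload": json.dumps(record)}


    def run():
        options = PipelineOptions(streaming=True)
        with beam.Pipeline(options=options) as pipeline:
            (
                pipeline
                | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                    topic="projects/example-project/topics/inbound-events")
                | "ParseJson" >> beam.Map(parse_message)
                | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                    "example-project:raw_layer.inbound_events",
                    schema="event_id:STRING,payload:STRING",
                    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                )
            )


    if __name__ == "__main__":
        run()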
Working with Cloud Data Catalog and other Google Cloud APIs for monitoring, query, and billing-related analysis of BigQuery usage.
Configured and maintained Prometheus for monitoring hundreds of microservices, improving alert accuracy and system observability.
Established and maintained a scalable EDW environment, consolidating disparate datasets into a single source of truth for strategic analytics.
Spearheaded the migration of legacy build pipelines to Jenkins, increasing deployment frequency by 40% and reducing build times.
Managed end-to-end CI/CD workflows using GitLab CI, automating builds, tests, and deployment processes across multiple projects, improving team productivity and operational efficiency.
Designed and implemented NoSQL data models in Google Cloud Bigtable, tailoring structures to specific application requirements.
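As a hedged illustration of the kind of wide-row Bigtable model this involved (the project, instance, table, column family, and row-key scheme below are hypothetical):

    # Illustrative Google Cloud Bigtable write; all identifiers are placeholders.
    import datetime

    from google.cloud import bigtable

    client = bigtable.Client(project="example-project", admin=True)
    instance = client.instance("example-instance")
    table = instance.table("account_activity")

    # Row key clusters an account's activity by date so related reads stay contiguous.
    row_key = b"acct#12345#2024-01-01"
    row = table.direct_row(row_key)
    now = datetime.datetime.utcnow()
    row.set_cell("metrics", b"login_count", b"17", timestamp=now)
    row.set_cell("metrics", b"last_channel", b"mobile", timestamp=now)
    row.commit()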
Optimized ETL processes with SSIS and Spark, showcasing strong coding skills for database components and enhancing scalability and efficiency of OLTP and OLAP workflows within the GCP ecosystem.
Proficient in writing complex SQL queries for data analysis, manipulation, and reporting tasks within GCP environments.
Collaborated closely with business stakeholders to streamline data refresh intervals, enhancing user experience and reducing time-to-insight for ad hoc reporting.
Documented the inventory of modules, infrastructure, storage, components of existing On-Prem data warehouse for analysis and identifying the suitable technologies/strategies required for Google Cloud Migration.
Developed a POC for project migration from the on-prem Hadoop MapR system to Snowflake.
Tailored the framework for efficient ETL patterns, incorporating post-ingestion transformations. Integrated seamlessly with DLP/DTG blocking service using on-prem Encryption APIs to ensure data security.
Performed Data analysis and Data profiling using complex SQL on various sources systems including Oracle and Teradata.
Led the upgrade of the entire batch data processing framework to the latest Spark version, ensuring a smooth transition without dependency issues or vulnerabilities.
Designed and implemented robust CI/CD pipelines that support scalable cloud applications, ensuring seamless code integration and deployment practices.
Developed automation scripts to streamline build processes, significantly reducing manual efforts and minimizing errors in production environments.
Implemented enterprise-wide containerization guidelines that reduced environment discrepancies, ensuring consistent operations across development, testing, and production.
Led the design and deployment of Docker containerization strategies that improved application scalability and reliability across multiple cloud environments.
Proficient in Spark programming for data processing and large-scale data transformations, along with hands-on experience in SQL and Python.
Boosted code coverage for Scala applications from 40% to 91% using the ScalaTest FunSuite library. Developed robust data pipelines for an enterprise data warehouse in Google Cloud, processing 500 GB to 1 TB of data daily.
Integrated and optimized data processing workflows using Google Cloud Dataflow for scalable, parallelized data transformations and analysis across various data sources and formats.
Mastered Kubernetes orchestration to manage containerized applications, achieving 99.9% uptime and enabling efficient microservices architecture management.
Used a user-friendly CLI to configure connections, specify source/target schemas, and schedule recurring data jobs with minimal coding overhead.
Leveraged MapReduce-based parallelism under the hood, dividing data transfers into parallel tasks to minimize execution time and improve scalability.
Configured support for multiple databases, authentication methods, and compression options, allowing flexible deployment in diverse big data environments.
Utilized Stackdriver for comprehensive monitoring, logging, and diagnostics across GCP environments, leading to a reduction in incident response times.
Created PySpark scripts utilizing DataFrames/Spark SQL and RDD in Spark for efficient data aggregation and queries. Conducted Proof of Concept (POC) to validate Delta Lake compatibility with Hive and Spark clusters deployed on Google Dataproc.
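A simplified sketch of the DataFrame/Spark SQL aggregation pattern used in these scripts; the bucket paths and column names are illustrative assumptions.

    # Illustrative PySpark aggregation; input path and column names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("daily-aggregation").getOrCreate()

    # Read partitioned source data (e.g., landed in a GCS bucket).
    events = spark.read.parquet("gs://example-bucket/raw/events/")

    # DataFrame API aggregation.
    daily_totals = (
        events
        .withColumn("event_date", F.to_date("event_ts"))
        .groupBy("event_date", "account_id")
        .agg(F.count("*").alias("event_count"), F.sum("amount").alias("total_amount"))
    )

    # Equivalent Spark SQL path for query-oriented consumers.
    events.createOrReplaceTempView("events")
    daily_totals_sql = spark.sql("""
        SELECT to_date(event_ts) AS event_date, account_id,
               COUNT(*) AS event_count, SUM(amount) AS total_amount
        FROM events
        GROUP BY to_date(event_ts), account_id
    """)

    daily_totals.write.mode("overwrite").partitionBy("event_date").parquet(
        "gs://example-bucket/curated/daily_totals/")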
Implemented incremental data loading every 15 minutes into BigQuery's raw and UDM layers using Google Dataproc, GCS buckets, Hive, Spark, Scala, gsutil, and shell scripting.
Environment: GCP, BigQuery, GCS Bucket, G-Cloud Functions, SSIS, Cloud Dataflow, Cloud Data Fusion, Cloud Shell, Cloud Composer, Gsutil, Dataproc, Snowflake, VM Instances, Airflow, Jenkins, Jira, Git, Gitlab, Cloud SQL, MySQL, Postgres, DBeaver, Scala, Spark, Hive, Spark-SQL.
Client: Merck Pharma, Branchburg, NJ March 2019 to October 2022
Role: SR Data Engineer
Description: During my role as a GCP Data Engineer at Merck Pharma, I focused on improving our data models and overseeing the transition of our Hadoop systems to the Google Cloud Platform (GCP). I developed Python-based Spark jobs to transform our data effectively and integrated Snowflake for seamless analytics. Additionally, I utilized Tableau for data visualization and Apache Airflow to manage complex data pipelines from start to finish.
Responsibilities:
Developed Spark programs to parse raw data, populate staging tables, and store refined data in partitioned tables within the Enterprise Data Warehouse.
Wrote SQL scripts for data mismatch analysis and managed history data loading during data migration from Teradata SQL to Snowflake.
Leveraged cloud and GPU computing technologies, such as GCP, for automated machine learning and analytics pipelines.
Orchestrated a Jenkins-based CI/CD pipeline that supported multi-branch deployment strategies, enhancing collaboration and code quality for a team of 12 developers.
Customized GitLab CI pipelines with dynamic job configurations and environment-specific parameters, resulting in a reduction in deployment failures.
Strong experience in Business and Data Analysis, Data Profiling, Data Migration, Data Integration, Data governance and Metadata Management, Master Data Management and Configuration Management.
Worked on POC to check various cloud offerings including Google Cloud Platform (GCP).
Compared self-hosted Hadoop with GCP's Dataproc and explored Bigtable (managed HBase) use cases and performance evaluation.
Accelerated real-time insights by creating and managing operational data marts that updated critical metrics for agile decision-making in sales, inventory, and customer service.
Spearheaded data quality initiatives within ODMs, improving data reliability through continuous monitoring and governance protocols.
Integrated robust data governance practices (including dimensional modeling and schema design), enabling consistent, high-quality data ingestion from multiple on-premises and cloud sources.
Enabled executive-level analytics by implementing star/snowflake schemas, facilitating meaningful trend analysis and predictive modeling across departments.
Developed and maintained CI/CD pipelines specifically for cloud-native applications, ensuring zero-downtime deployments and continuous integration across global teams.
Automated code merges and quality checks using Jenkins, integrating static and dynamic analysis tools that preemptively resolved potential deployment issues.
Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.
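A minimal sketch of one such Airflow DAG using Google provider operators; the project, bucket, and table names are placeholders, and import paths vary with the Airflow and provider versions in use.

    # Illustrative Airflow DAG: land GCS files into BigQuery, then build a curated table.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
    from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

    with DAG(
        dag_id="daily_etl_pipeline",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:

        load_raw = GCSToBigQueryOperator(
            task_id="load_raw_files",
            bucket="example-landing-bucket",
            source_objects=["exports/*.csv"],
            destination_project_dataset_table="example-project.raw.events",
            source_format="CSV",
            write_disposition="WRITE_APPEND",
        )

        build_curated = BigQueryInsertJobOperator(
            task_id="build_curated_table",
            configuration={
                "query": {
                    "query": "SELECT * FROM `example-project.raw.events` WHERE event_date = '{{ ds }}'",
                    "destinationTable": {
                        "projectId": "example-project",
                        "datasetId": "curated",
                        "tableId": "events_{{ ds_nodash }}",
                    },
                    "writeDisposition": "WRITE_TRUNCATE",
                    "useLegacySql": False,
                }
            },
        )

        load_raw >> build_curated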
Designed various Jenkins jobs to continuously integrate the processes and executed CI/CD pipeline using Jenkins.
Involved in setting up the Apache Airflow service in GCP.
Proficiently built Power BI reports leveraging Azure Analysis Services for enhanced performance. Created streaming applications with PySpark to read data from Kafka and persist it in NoSQL databases like HBase and Cassandra.
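A simplified sketch of that Kafka-to-NoSQL streaming pattern with PySpark Structured Streaming; the broker, topic, keyspace, and table names are hypothetical, and the Cassandra sink assumes the spark-cassandra-connector package is on the classpath.

    # Illustrative Structured Streaming job: Kafka -> parse JSON -> Cassandra (per micro-batch).
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("kafka-to-cassandra").getOrCreate()

    schema = StructType([
        StructField("order_id", StringType()),
        StructField("customer_id", StringType()),
        StructField("amount", DoubleType()),
    ])

    stream = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")
        .option("subscribe", "orders")
        .load()
        .select(F.from_json(F.col("value").cast("string"), schema).alias("order"))
        .select("order.*")
    )

    def write_batch(batch_df, batch_id):
        # Persist each micro-batch to Cassandra via the connector.
        (batch_df.write.format("org.apache.spark.sql.cassandra")
            .options(keyspace="sales", table="orders")
            .mode("append")
            .save())

    query = (stream.writeStream
             .foreachBatch(write_batch)
             .option("checkpointLocation", "/tmp/checkpoints/orders")
             .start())
    query.awaitTermination()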
Imported data from different relational data sources, such as RDBMS and Teradata, into HDFS using Sqoop.
Played a pivotal role in implementing Big Data Hadoop clusters and integrating data for large-scale system software development.
Engineered Docker images optimized for security and performance, which became the standard for all production deployments in the organization.
Designed and administered Kubernetes clusters with auto-scaling and complex networking configurations to support a resilient infrastructure as a service (IaaS) platform.
Successfully migrated an entire Oracle database to BigQuery and utilized Power BI for reporting. Constructed data pipelines in Google Cloud Platform's Apache Airflow for ETL tasks, leveraging various Airflow operators.
Experienced in implementing Continuous Delivery pipelines with Maven, Ant, Jenkins, and GCP; experienced with Hadoop 2.6.4 and Hadoop 3.1.5.
Developed multi-cloud strategies to make better use of GCP (for its PaaS offerings).
Experienced in migrating legacy systems to GCP, storing data files in Google Cloud Storage buckets on a daily basis and using Dataproc and BigQuery to develop and maintain the GCP cloud environment.
Implemented a Prometheus-based monitoring framework capable of handling over 10,000 metrics per second, significantly enhancing the observability of high-load systems.
Created Hive queries enabling market analysts to identify emerging trends by comparing fresh data with EDW reference tables and historical metrics.
Designed and developed data pipelines for integrated data analytics, utilizing Hive, Spark, Sqoop, and MySQL.
Environment: GCP, GCS, Pub/Sub, Airflow, Dataproc, Looker, Hadoop, Spark, Sqoop, Data Warehouse, Kafka, Cloud Functions, BigQuery, PySpark, Spark SQL, NoSQL.
Client: Fifth Third Bank, Evansville, IN July 2017 to February 2019
Role: SR Data Engineer
Description: As a Senior AWS Data Engineer at Fifth Third Bank, I managed Hadoop ecosystems, developed data pipelines in PySpark and Spark SQL, and optimized data processing with AWS EMR and Glue. Led migration of on-premises Hadoop to AWS, implemented ETL processes, and ensured metadata governance for data integrity.
Responsibilities:
Engineered data pipelines in Apache Airflow on Google Cloud Platform (GCP) to efficiently handle ETL tasks using a diverse set of Airflow operators.
Explored Spark to optimize performance and refine existing algorithms on Hadoop, leveraging Spark's robust features including Spark Context, Spark SQL, DataFrames, and Spark YARN.
Utilized Spark Streaming to seamlessly ingest data into an in-house ingestion platform.
Provided technical support and troubleshooting for Python-based implementations in Athena and Teradata, resolving issues promptly to maintain continuous operation of real-time data analytics pipelines.
Implemented a one-time data migration of multistate-level data from SQL Server to Snowflake using Python and SnowSQL.
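A hedged sketch of the bulk-load step using the Snowflake Python connector; the account, credentials, file paths, and table name are placeholders (the actual migration also relied on SnowSQL).

    # Illustrative one-time load into Snowflake; all connection values are placeholders.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="example_account",
        user="etl_user",
        password="********",
        warehouse="LOAD_WH",
        database="MIGRATION_DB",
        schema="PUBLIC",
    )

    try:
        cur = conn.cursor()
        # Stage CSV extracts exported from SQL Server into the table stage, then bulk-load.
        cur.execute("PUT file:///data/exports/state_data_*.csv @%STATE_DATA AUTO_COMPRESS=TRUE")
        cur.execute(
            "COPY INTO STATE_DATA "
            "FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '\"' SKIP_HEADER = 1) "
            "ON_ERROR = 'ABORT_STATEMENT'"
        )
        cur.execute("SELECT COUNT(*) FROM STATE_DATA")
        print("Rows loaded:", cur.fetchone()[0])
    finally:
        conn.close()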
Designed and implemented data processing systems on GCP using services such as BigQuery, Dataflow, and Dataproc.
Leveraged SQL skills across various relational and NoSQL databases, and collaborated seamlessly with ETL tools such as Talend, Informatica, and Apache NiFi.
Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators; experienced with GCP Dataproc, GCS, Cloud Functions, and BigQuery.
Built data transformation pipelines using GCP services such as Dataflow and Apache Beam to cleanse, normalize, and enrich data.
Implemented CI/CD pipelines using Jenkins and GitLab to automate build, testing, and deployment processes for data-driven applications.
Assisted with the migration and upgrade of Jira instances, ensuring a seamless transition with minimal disruption to projects. Facilitated communication between development, operations, and business teams through Jira's collaboration features.
Developed Spark-based ETL pipelines and integrated Jenkins for CI/CD automation in cloud deployments.
Designed and implemented data processing workflows using Dataflow and BigQuery, improving efficiency.
Automated infrastructure deployment using Terraform, ensuring repeatable cloud architecture provisioning. Worked on migrating data pipelines from AWS EMR to GCP, reducing processing costs by 25%.
Environment: GCP, Dataproc, Apache Beam, Airflow, Hadoop, Hive, Teradata, SAS, Spark, EMR, S3, Python, Sqoop, Snowflake, Spark SQL, SQL.
Client: Grapesoft Solutions, Hyderabad, India December 2015 to April 2017
Role: Data Engineer
Description: During my tenure at Grapesoft Solutions as an Azure Data Engineer, I specialized in integrating Azure services like Data Factory to gather data from various sources, managed Spark clusters using Azure Kubernetes, and improved query performance by migrating data to Azure SQL Data Warehouse. I also developed dashboards using Power BI for insightful data visualization.
Responsibilities:
Seamlessly integrated Azure Data Factory to efficiently gather data from diverse sources, encompassing on-premises systems like MySQL and Cassandra, as well as cloud sources such as Blob storage and Azure SQL DB.
Strategically designed and configured relational servers and databases on the Azure Cloud, meticulously assessing both current and future business requirements.
Played a pivotal role in the seamless migration of data from on-premises SQL servers to Cloud databases, specifically Azure Synapse Analytics (DW) and Azure SQL DB.
Developed ETL jobs to efficiently load, serve, and transport data into buckets, facilitating the transfer of S3 data to the Data Warehouse.
Leveraged Kusto Explorer for log analytics and enhanced query response times, crafting alerts using Kusto query language.
Efficiently worked with Azure BLOB and Data Lake storage, seamlessly loading data into Azure SQL Synapse Analytics (DW).
Addressed complex business queries involving multiple tables from different databases by crafting both correlated and non-correlated sub-queries.
Efficiently moved large datasets between relational databases (Oracle, MySQL, Teradata) and HDFS for processing and analytics.
Supported both bulk loading and incremental imports and exports, ensuring timely, up-to-date ingestion without reprocessing entire datasets.
Integrated seamlessly with the Hadoop ecosystem, enabling direct imports into Hive, HBase, or HDFS for further data transformation.
Designed and implemented business intelligence solutions using SQL Server Data Tools 2015 and 2017 versions, effectively loading data into both SQL and Azure Cloud databases.
Conducted comprehensive analysis of data quality and enforced business rules at all stages of the data extraction, transformation, and loading process.
Crafted insightful reports in Tableau for data visualization and thoroughly tested native Drill, Impala, and Spark connectors for data exploration.
Developed diverse Python scripts for vulnerability assessment, including SQL injection, permission checks, and performance analysis.
Extensive experience working with the Azure cloud platform, including implementing data solutions with Azure Data Factory, Databricks, and Azure Data Lake.
Proficiently orchestrated the import of data from various sources into HDFS using Sqoop, executed transformations using Hive and MapReduce, and subsequently loaded data into HDFS for further processing.
Environment: Microsoft Azure Cloud, Apache Flume, HDFS, Hive, HBase, Pig, Jenkins, Power BI, Databricks, Spark, Scala, Hadoop, DBT, SQL, Oracle, UNIX, Postgres, PL/SQL, Talend Open Studio, Informatica.
Client: Couth InfoTech Pvt. Ltd, Hyderabad, India June 2014 to August 2015
Role: Hadoop Developer
Description: During my time as a Hadoop Developer at Couth InfoTech, my work centered around optimizing data workflows. I utilized Apache Flume for seamless log data loading into HDFS, integrated Kafka for real-time data processing, and employed Solr for advanced search functionalities. Additionally, I managed Hadoop clusters to process distributed data using tools like Hive, Pig, and Impala.
Responsibilities:
Effectively utilized Apache Flume to streamline the loading of log data from diverse sources directly into HDFS for efficient data management.
Applied a range of data transformations, encompassing Lookup, Aggregate, Sort, Multicasting, Conditional Split, Derived Column, and more.
Developed Mappings, Sessions, and Workflows to efficiently extract, validate, and transform data in compliance with business rules using Informatica.
Tailored target tables based on reporting team requirements and formulated Extraction, Transformation, and Loading (ETL) processes utilizing Talend.
Utilized Netezza SQL scripts to seamlessly transfer data between Netezza tables.
Scheduled Talend Jobs using Job Conductor, a scheduling tool within the Talend ecosystem.
Took charge of querying, stored procedure creation, crafting complex queries, and leveraging T-SQL joins to address varied reporting operations and handle ad-hoc data requests.
Prioritized performance monitoring and optimized indexes using tools like Performance Monitor, SQL Profiler, Database Tuning Advisor, and Index Tuning Wizard.
Acted as the primary contact for resolving locking, blocking, and performance-related issues. Authored scripts and devised indexing strategies for migrating data to Amazon Redshift from SQL Server and MySQL databases.
Utilized AWS Data Pipeline to configure seamless data loads from S3 into Redshift.
Employed JSON schema to define table and column mappings from S3 data to Redshift, devising indexing and data distribution strategies optimized for sub-second query response.
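A minimal sketch of how such a jsonpaths-driven COPY can be issued from Python; the cluster endpoint, bucket, IAM role, and jsonpaths file below are illustrative placeholders.

    # Illustrative Redshift load mapping S3 JSON data to table columns via a jsonpaths file.
    import psycopg2

    conn = psycopg2.connect(
        host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="analytics",
        user="etl_user",
        password="********",
    )

    copy_sql = """
        COPY sales.orders
        FROM 's3://example-bucket/exports/orders/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-load'
        FORMAT AS JSON 's3://example-bucket/jsonpaths/orders_jsonpaths.json'
        TIMEFORMAT 'auto';
    """

    with conn, conn.cursor() as cur:
        cur.execute(copy_sql)  # column mapping comes from the jsonpaths file
    conn.close()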
Environment: Hadoop, HDFS, Hive, Pig, HBase, Big Data, Oozie, ODM, Zookeeper, MapReduce, Cassandra, Scala, Linux, NoSQL, MySQL Workbench, Java, Eclipse, Oracle 10g, SQL.
EDUCATION
K.L. University, Vijayawada, India JUN 2010 - MAY 2014
Bachelor’s in Computer Science