BACKGROUND SUMMARY
Overall ** plus years of experience importing and exporting data between HDFS and relational database systems using Sqoop and loading it into partitioned Hive tables.
Developed custom Kafka producers and consumers for publishing messages to and consuming messages from Kafka topics (a minimal sketch appears at the end of this summary).
Developed and maintained data integration pipelines tailored to various use cases, ensuring efficient data flow and processing from diverse sources to downstream analytics platforms.
Implemented custom Kafka producers and consumers optimized for real-time data ingestion, processing, and analysis.
Implemented data integration solutions leveraging Google Cloud Platform (GCP) services such as Cloud Dataflow for scalable data processing, Cloud Pub/Sub for messaging, and BigQuery for analytics.
Utilized Google Cloud Storage for storing and managing data, and Cloud Bigtable for high-performance NoSQL database needs.
Designed and implemented robust data integration pipelines, orchestrating the movement and transformation of data between various systems and platforms, ensuring data consistency, integrity, and availability throughout the pipeline.
Played a key role in platform engineering, architecting, and building scalable and reliable data platforms, leveraging technologies like Hadoop, Spark, Kafka, and cloud-native solutions on GCP.
Leveraged Google BigQuery for ad-hoc and batch analytics, performing complex queries and aggregations on large datasets to derive actionable insights.
Designed and optimized data models in BigQuery to support efficient querying and analysis, ensuring optimal performance for reporting and visualization needs.
Good working experience with Spark (Spark Streaming, Spark SQL) using Scala and Kafka. Worked on reading multiple data formats from HDFS using Scala.
Extensive knowledge of writing Hadoop jobs for data analysis per business requirements using Hive; worked on HiveQL queries for data extraction and join operations, wrote custom UDFs as required, and have good experience optimizing Hive queries.
Excellent programming skills with experience in Java, C, SQL, and Python.
Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Scala (see the DataFrame sketch at the end of this summary).
Experience using Kafka and Kafka brokers to initiate the Spark context and process live streaming data.
Hands-on experience writing MapReduce programs in Java to handle different data sets using Map and Reduce tasks.
Good understanding of distributed systems, HDFS architecture, Internal working details of MapReduce and Spark processing frameworks.
Knowledge of ETL methods for data extraction, transformation and loading in corporate-wide ETL Solutions.
Good understanding and knowledge of NoSQL databases such as MongoDB, HBase, and Cassandra, as well as relational databases such as PostgreSQL.
Strong experience in core Java, Scala, SQL, PL/SQL, and Restful web services.
Experience developing custom UDFs for Pig and Hive to incorporate Python/Java methods and functionality into Pig Latin and HiveQL, and used UDFs from the Piggybank UDF repository.
Experience in Microsoft Azure/Cloud Services like SQL Data Warehouse, Azure SQL Server, Azure Databricks, Azure Data Lake, Azure Blob Storage, Azure Data Factory.
Experience extracting data from MongoDB through Sqoop, placing it in HDFS, and processing it.
Worked with NoSQL databases like HBase in creating HBase tables to load large sets of semi-structured data coming from various sources.
Experience with ETL concepts using Informatica PowerCenter and Ab Initio.
Hands-on experience working with DevOps pipelines using Jenkins and AWS CI/CD toolsets.
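A minimal sketch of the custom Kafka producer/consumer pattern described in this summary, assuming the kafka-python client library; the broker address, topic name, and payload are hypothetical.

```python
# Sketch only: kafka-python client assumed; broker, topic, and payload are placeholders.
import json
from kafka import KafkaProducer, KafkaConsumer

BROKERS = ["localhost:9092"]     # hypothetical broker list
TOPIC = "clickstream-events"     # hypothetical topic

# Producer: serialize dicts to JSON and publish to the topic.
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"user_id": 42, "action": "page_view"})
producer.flush()

# Consumer: subscribe to the same topic and deserialize each message.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKERS,
    group_id="ingestion-workers",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:         # loops until interrupted
    print(message.value)         # hand off to downstream processing here
```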
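A short illustration of rewriting a HiveQL aggregation as Spark DataFrame transformations, as referenced above. The original work used Scala; this sketch uses PySpark for consistency, and the table, columns, and filter shown are hypothetical.

```python
# Sketch only: hypothetical sales.orders table and columns.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("hive-to-dataframe-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# HiveQL: SELECT region, SUM(amount) FROM sales.orders
#         WHERE order_date >= '2022-01-01' GROUP BY region
orders = spark.table("sales.orders")
summary = (
    orders
    .filter(F.col("order_date") >= "2022-01-01")
    .groupBy("region")
    .agg(F.sum("amount").alias("total_amount"))
)
summary.write.mode("overwrite").saveAsTable("sales.region_totals")
```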
Technical Skills
Languages: Python, Scala, PL/SQL, SQL, T-SQL, UNIX, Shell Scripting
Big Data Technologies: Hadoop, HDFS, Hive, Pig, HBase, Sqoop, Flume, YARN, Spark SQL, Kafka, Presto
Operating Systems: Windows, z/OS, UNIX, Linux
BI Tools: SSIS, SSRS, SSAS
Modeling Tools: IBM InfoSphere, SQL Power Architect, Oracle Designer, Erwin, ER/Studio, Sybase PowerDesigner
Database Tools: Oracle 12c/11g/10g, MS Access, Microsoft SQL Server, Teradata, SQL, Netezza
Cloud Platforms: AWS (Amazon Web Services), Microsoft Azure, GCP (Cloud Functions, Google Cloud Dataproc, Google Cloud Storage, Google Cloud SQL, Google Cloud Composer, Google Cloud Dataflow, Google Cloud Dataprep, Google Cloud AI Platform), CI/CD on GCP
Reporting Tools: Business Objects, Crystal Reports
Windows Tools: Perfmon, Event Viewer, Task Manager
Tools & Software: TOAD, MS Office, BTEQ, Teradata SQL Assistant
ETL Tools: Pentaho, Informatica PowerCenter, SAP BusinessObjects XI R3.1/XI R2, Web Intelligence
Other Tools: SQL*Plus, SQL*Loader, MS Project, MS Visio, C++, UNIX, PL/SQL
Work Experience
Client: Kaiser Permanente, Oakland, California Mar 2022 to Present
Role: Data Engineer
Responsibilities:
Utilized Google Cloud Functions for serverless computing tasks, integrated with Cloud Storage triggers to handle data processing in response to events.
Leveraged Google Cloud Storage (GCS Bucket) for storing and querying large datasets, including data ingestion from various sources using REST APIs with Python.
Worked in a CI/CD environment using tools like JIRA and Google Cloud Build, automating build, test, and deployment processes on GCP.
Experienced with Cloud Shell, a browser-based shell for managing resources and running commands directly from the Google Cloud Console.
Utilized Google Cloud SQL to export result sets from Hive to MySQL, managing relational databases on GCP.
Utilized Google Cloud Functions for running small, single-purpose functions in response to events, for specific data processing tasks or integration points.
Automated data pipelines and batch jobs using SAS macros and stored processes, improving data processing efficiency and reducing manual intervention.
Designed and implemented RESTful APIs using Cloud Endpoints to facilitate seamless data exchange between microservices, reducing integration time.
Involved in data ingestion into HDFS using Sqoop for full load and Flume for incremental load from various sources like web servers, RDBMS, and Data APIs.
Built infrastructure for optimal extraction, transformation, and loading of data from diverse sources using SQL and big data technologies such as Hadoop, Hive, and Azure Data Lake Storage.
Installed Oozie workflow engines to run multiple Hive and Pig jobs independently based on time and data availability.
Utilized Kubernetes and Docker for the runtime environment of the CI/CD system to build, test, and deploy.
Implemented workflows using the Apache Oozie framework to automate tasks.
Migrated tables from RDBMS into Hive tables using SQOOP and generated visualizations using Tableau.
Built pipelines to move hashed and un-hashed data from XML files to Data Lake.
Developed Spark scripts using Python on Azure HDInsight for Data Aggregation and Validation, verifying performance over MR jobs.
Experience in building multiple Data pipelines, end-to-end ETL, and ELT processes for Data ingestion and transformation in GCP, coordinating tasks among the team.
Leveraged SAS Enterprise Guide for advanced data preparation, cleansing, and integration, enabling accurate and consistent reporting across business units.
Collaborated with BI teams to deliver analytical datasets in SAS, supporting downstream visualization tools such as Tableau and Power BI.
Architected and deployed API Gateway solutions to manage, secure, and monitor API traffic across multiple GCP services, implementing rate limiting and authentication policies.
Created ETL mappings with Talend Integration Suite to pull data from sources, apply transformations, and load data into the target database.
Utilized REST APIs with Python to ingest data from various sites into BigQuery (see the ingestion sketch following this list).
Developed programs with Python and Apache Beam, executing them in Cloud Dataflow for data validation between raw source files and BigQuery tables (a validation sketch also follows this list).
Wrote SQL queries to create and replace materialized views in BigQuery.
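A hedged sketch of the REST-to-BigQuery ingestion described above, assuming the requests and google-cloud-bigquery libraries; the endpoint URL, project, and table ID are placeholders.

```python
# Sketch only: endpoint, project, and table are hypothetical.
import requests
from google.cloud import bigquery

API_URL = "https://example.com/api/v1/events"   # hypothetical endpoint
TABLE_ID = "my-project.raw_zone.events"         # hypothetical table

def ingest_once() -> None:
    # Pull a batch of records from the REST endpoint.
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()
    rows = response.json()                      # expected: list of dicts

    # Stream the records into BigQuery.
    client = bigquery.Client()
    errors = client.insert_rows_json(TABLE_ID, rows)
    if errors:
        raise RuntimeError(f"BigQuery insert errors: {errors}")

if __name__ == "__main__":
    ingest_once()
```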
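A minimal Apache Beam sketch, runnable on Cloud Dataflow, of the raw-file versus BigQuery row-count validation described above; the bucket paths, table name, and pipeline options are hypothetical.

```python
# Sketch only: paths, table, and options are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

RAW_FILES = "gs://raw-bucket/events/*.csv"      # hypothetical source files
BQ_TABLE = "my-project:raw_zone.events"         # hypothetical table

def run() -> None:
    options = PipelineOptions(runner="DataflowRunner",
                              project="my-project",
                              region="us-central1",
                              temp_location="gs://raw-bucket/tmp")
    with beam.Pipeline(options=options) as p:
        # Count records in the raw files.
        raw_count = (
            p
            | "ReadRawFiles" >> beam.io.ReadFromText(RAW_FILES, skip_header_lines=1)
            | "CountRaw" >> beam.combiners.Count.Globally()
            | "LabelRaw" >> beam.Map(lambda n: ("raw_files", n))
        )
        # Count rows already loaded into BigQuery.
        bq_count = (
            p
            | "ReadBigQuery" >> beam.io.ReadFromBigQuery(table=BQ_TABLE)
            | "CountBQ" >> beam.combiners.Count.Globally()
            | "LabelBQ" >> beam.Map(lambda n: ("bigquery", n))
        )
        # Write both counts side by side for comparison downstream.
        (
            (raw_count, bq_count)
            | "Merge" >> beam.Flatten()
            | "Format" >> beam.Map(lambda kv: f"{kv[0]}={kv[1]}")
            | "WriteCounts" >> beam.io.WriteToText("gs://raw-bucket/validation/counts")
        )

if __name__ == "__main__":
    run()
```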
Environment: Hadoop, Cloudera, Flume, HBase, HDFS, MapReduce, Kafka, YARN, Hive, Pig, Sqoop, Oozie, Tableau, GCP (BigQuery, Google Cloud Storage/GCS buckets, Cloud Functions, Cloud Dataflow, Cloud Shell, Cloud SQL, Cloud Dataproc, Cloud Bigtable, Cloud Composer, Cloud Dataprep, AI Platform)
Client: Walmart, Bentonville, AR Jul 2020 to Feb 2022
Role: Data Engineer
Responsibilities:
Implemented real-time data ingestion applications using Flume and Kafka, facilitating continuous storage in Google Cloud Platform (GCP) using services such as BigQuery, Cloud Storage, and Cloud Dataflow.
Developed Spark programs with Scala for batch processing, incorporating functional programming principles to cleanse and process data before loading into Hive or HBase for further analysis.
Designed and developed data pipelines for automated ingestion of data from various sources into Google Cloud Storage (GCS), building Hive/Impala tables for efficient data processing and analysis in GCP ecosystems.
Utilized GCP services such as BigQuery, Cloud Storage, and Cloud Dataflow for scalable data processing and analytics across multiple business domains.
Designed and deployed multi-tier applications on GCP, ensuring high availability, fault tolerance, and auto-scaling to support workloads.
Implemented data governance and compliance checks within SAS pipelines to ensure alignment with GDPR and HIPAA standards.
Migrated legacy SAS jobs into AWS/Azure cloud-based pipelines, re-engineering code to integrate with Spark, Databricks, and SQL-based data lakes.
Responsible for importing data from PostgreSQL to GCS and Hive using Sqoop.
Developed stored procedures/views in Snowflake and used in Talend for loading Dimensions and Facts.
Created Sqoop jobs, Hive scripts for data ingestion from relational databases to compare with historical data.
Implemented Hive tables on GCS to store data processed by Apache Spark on GCP in Parquet format.
Developed tools using Scala and Akka to scrub numerous files in Google Cloud Storage, eliminating unwanted characters and performing other activities.
Leveraged Apigee API Management platform to create developer portals, analytics dashboards, and monetization strategies for data service APIs.
Developed Java MapReduce programs for the analysis of sample log files stored in GCP clusters.
Installed and configured Hadoop MapReduce, HDFS, developed multiple MapReduce jobs in Java and Scala for data cleaning and preprocessing.
Designed and Developed Oracle PL/SQL and Shell Scripts, Data Import/Export, Data Conversions, and Data Cleansing.
Created on-demand tables on GCS files using Lambda Functions and AWS Glue using Python and PySpark.
Developed Spark scripts using Scala Shell commands and PySpark for faster testing and processing of data.
Developed Hive UDFs using Java as per business requirements.
Built scalable API backends using Cloud Run and Cloud SQL, enabling real-time data access while maintaining sub-100ms response times under peak loads.
Created data pipelines for different events to load data from DynamoDB to GCS and then into HDFS location.
Performed real-time event processing data from multiple servers in the organization using Apache Storm by integrating with Apache Kafka.
Developed a Spark job in Java to index data into Elasticsearch from external Hive tables in HDFS.
Developed PySpark and Spark SQL code to process data in Apache Spark on Google Cloud Dataproc, performing the necessary transformations based on the source-to-target mappings (STMs) developed (see the sketch following this list).
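A sketch of the Parquet-on-GCS Hive table and Dataproc Spark SQL transformation pattern described above, assuming PySpark with Hive support enabled; the bucket paths, database, and columns are hypothetical.

```python
# Sketch only: buckets, database, and columns are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("gcs-parquet-hive-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# Read raw data landed in Cloud Storage, apply a simple transformation,
# and persist it as a partitioned, Parquet-backed Hive table on GCS.
raw = spark.read.json("gs://landing-bucket/sales/*.json")
cleaned = (
    raw
    .withColumn("sale_date", F.to_date("sale_ts"))
    .filter(F.col("amount") > 0)
)
(
    cleaned.write
    .mode("overwrite")
    .format("parquet")
    .partitionBy("sale_date")
    .option("path", "gs://warehouse-bucket/sales_cleaned")
    .saveAsTable("analytics.sales_cleaned")
)
```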
Environment: Cloudera, Hive, Impala, Spark, Apache Kafka, Flume, Scala, GCP (BigQuery, Cloud Storage, Cloud Dataflow, Cloud Dataproc, Cloud Bigtable, Cloud SQL, Cloud Composer, Cloud Dataprep, AI Platform), AWS (EC2, DynamoDB, Auto Scaling, Lambda), NiFi, Snowflake, Java, shell scripting, SQL, Sqoop, Oozie, PL/SQL, Oracle 12c, SQL Server, HBase, Bitbucket, Control-M, Python
Client: Citizens Bank, Providence, RI Nov 2018 to June 2020
Role: Data Engineer
Responsibilities:
Developed data mapping, transformation, and cleansing rules for data management involving both OLTP and OLAP on Google Cloud Platform (GCP).
Utilized Google Cloud Dataproc to transform and move large volumes of data into and out of various GCP data stores and databases, such as Google Cloud Storage (GCS) and Google Cloud Bigtable.
Managed Apache Hadoop distributions, including Cloudera and MapR, on Google Cloud Dataproc to optimize data latency through parallel processing.
Created comprehensive API documentation and developer resources using Cloud Endpoints Portal, increasing API adoption.
Implemented machine learning algorithms in Python for data processing on GCP, working with different data formats such as JSON and XML.
Executed reporting tasks in PySpark, Google Cloud Dataproc notebooks, and Jupyter, along with querying using Google Cloud Composer (Apache Airflow), Presto, and BigQuery.
Configured and managed Google Cloud Composer for workflow management, developing workflows in Python for efficient data processing (a minimal DAG sketch follows this list).
Designed and optimized GraphQL APIs on top of BigQuery to provide flexible, efficient data querying capabilities for data science teams.
Implemented UNIX shell scripting for automation tasks and optimized database performance through table defragmentation, partitioning, compression, and indexing on Google Cloud SQL.
Established AWS CI/CD data pipelines and migrated to GCP Cloud Build for continuous integration and deployment, alongside setting up a data lake using GCS, Cloud Dataproc, and Cloud Functions.
Developed reusable PL/SQL program units, database procedures, and functions on Google Cloud SQL to enforce business rules and enhance team productivity.
Leveraged Google Cloud Dataflow and Cloud Dataprep for ETL processes, replacing SQL Server Integration Services (SSIS) for data extraction, transformation, and loading from multiple sources.
Implemented machine learning models in R and Python for business forecasting and customer churn prediction, utilizing Google Cloud AI Platform for model deployment and integration with GCP databases.
Expedited model creation in Python using libraries like pandas, NumPy, scikit-learn, and Plotly, with integration into Google BigQuery and scheduled updates using Google Cloud Scheduler.
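A minimal Cloud Composer (Apache Airflow) DAG sketch for the Python workflows described above; the DAG id, schedule, and task bodies are hypothetical.

```python
# Sketch only: DAG id, schedule, and task callables are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from source systems")        # placeholder task body

def load_to_bigquery():
    print("load validated data into BigQuery")    # placeholder task body

with DAG(
    dag_id="daily_ingest_sketch",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load_to_bigquery",
                               python_callable=load_to_bigquery)
    # Run the load only after extraction succeeds.
    extract_task >> load_task
```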
Environment: MapReduce, Spark, Hive, Pig, Sqoop, HBase, Oozie, Impala, GCP (Google Cloud Dataproc, Google Cloud Storage, Google Cloud Bigtable, Google Cloud SQL, Google Cloud Composer, Google Cloud Dataflow, Google Cloud Dataprep, Google Cloud AI Platform, Cloud Build, Cloud Functions, BigQuery), Kafka, JSON, XML, PL/SQL, SQL, HDFS, Unix, Python, PySpark
Client: Optum Inc, Eden Prairie, Minnesota Feb 2017 to Oct 2018
Role: Data Engineer
Responsibilities:
Developed, optimized, and maintained ETL/ELT pipelines on Google Cloud Platform (GCP) using Cloud Dataflow, Apache Beam, and Cloud Composer to handle structured and unstructured data at scale.
Managed and optimized large data sets using Google Cloud Storage (GCS), leveraging Cloud Storage buckets for both structured and unstructured data and ensuring efficient data retrieval and access.
Architected BigQuery data models and optimized them for high-performance querying, cost-efficient storage, and seamless data ingestion from various sources. Developed partitioned and clustered tables to enhance query performance and reduce costs (see the table-definition sketch following this list).
Implemented real-time data processing systems using Google Cloud Pub/Sub and Cloud Dataflow, enabling real-time analytics, event-driven data processing, and monitoring for multiple applications (a streaming pipeline sketch also follows this list).
Designed and developed robust data integration solutions using Google Cloud Data Fusion and Cloud Functions to ingest data from multiple on-premises and cloud-based systems (such as databases, APIs, and flat files) into GCP environments.
Automated data processing workflows with Cloud Composer (Apache Airflow) to schedule and manage complex data pipelines, reducing manual intervention and improving operational efficiency.
Applied Google Cloud IAM (Identity and Access Management) to enforce role-based access controls for both data and services, ensuring compliance with data governance and security policies.
Implemented data validation checks and logging frameworks within data pipelines to ensure high data quality and pipeline health. Utilized Cloud Logging for detailed error tracking and Stackdriver Monitoring for performance monitoring.
Worked closely with Data Scientists, Analysts, and DevOps teams to ensure smooth data flow, streamline data access, and prepare data models for analytics, machine learning, and reporting purposes.
Optimized cloud resource utilization and managed the costs of GCP services (e.g., BigQuery, Dataflow, Storage) by employing efficient data partitioning, pruning unused resources, and using BigQuery cost controls and Cloud Monitoring.
Developed detailed technical documentation, maintained pipeline architecture diagrams, and conducted knowledge-sharing sessions with team members to promote best practices in data engineering.
Assisted in the creation of datasets, data models, and views that were used by BI teams and analysts in tools like Looker and Google Data Studio for actionable business insights.
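A sketch of defining a partitioned, clustered BigQuery table with the google-cloud-bigquery client, as described above; the project, dataset, schema, and clustering fields are hypothetical.

```python
# Sketch only: project, dataset, schema, and clustering fields are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

schema = [
    bigquery.SchemaField("event_ts", "TIMESTAMP"),
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("event_type", "STRING"),
    bigquery.SchemaField("amount", "NUMERIC"),
]

# Partition by day on the event timestamp and cluster on common filter columns
# to reduce bytes scanned per query.
table = bigquery.Table("my-project.analytics.events", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_ts",
)
table.clustering_fields = ["customer_id", "event_type"]

client.create_table(table, exists_ok=True)
```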
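A streaming sketch of the Pub/Sub-to-Dataflow flow described above, using the Apache Beam Python SDK with a BigQuery sink; the subscription, table, and pipeline options are hypothetical, and the target table is assumed to already exist.

```python
# Sketch only: subscription, table, and options are placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

SUBSCRIPTION = "projects/my-project/subscriptions/events-sub"  # placeholder
BQ_TABLE = "my-project:analytics.events_stream"                # placeholder

def run() -> None:
    options = PipelineOptions(runner="DataflowRunner",
                              project="my-project",
                              region="us-central1",
                              temp_location="gs://staging-bucket/tmp",
                              streaming=True)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadPubSub" >> beam.io.ReadFromPubSub(subscription=SUBSCRIPTION)
            | "Parse" >> beam.Map(lambda b: json.loads(b.decode("utf-8")))
            | "WriteBQ" >> beam.io.WriteToBigQuery(
                BQ_TABLE,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )

if __name__ == "__main__":
    run()
```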
Environment: GCP (Google Cloud Dataproc, Google Cloud Storage, Google Cloud SQL, Google Cloud Composer, Google Cloud Dataflow, Google Cloud Dataprep, Google Cloud AI Platform, BigQuery), Kafka, JSON, XML, PL/SQL, SQL, HDFS, Unix, Python, PySpark
Client: Brio Technologies Pvt Ltd, Hyderabad, India June 2013 to Dec 2015
Role: Data Engineer
Responsibilities:
Led the migration of SQL Server and Oracle databases to the Google Cloud Platform (GCP) environment.
Responsible for ETL design, including identifying source systems, designing source-to-target relationships, data cleansing, and ensuring data quality, as well as creating source specifications and ETL design documents.
Executed data migration using Google Cloud Database Migration Service (DMS) to seamlessly transfer data to GCP.
Developed pipelines to efficiently move hashed and unhashed data from XML files into a data lake on Google Cloud.
Extensively utilized Spark-SQL context to create data frames and datasets for preprocessing model data within the GCP environment.
Designed row keys in HBase to store text and JSON as key values in sorted order within HBase tables.
Scheduled and generated routine reports based on key performance indicators (KPIs) within the GCP environment.
Designed and developed DataStage jobs for data cleansing, transformation, and loading into the Data Warehouse, encapsulating job flows with sequencers.
Contributed to the development and deployment of custom Hadoop applications within the GCP environment.
Implemented NiFi workflows to automate the daily pickup of multiple files from FTP locations and transfer them to Hadoop Distributed File System (HDFS) within GCP.
Environment: Linux, Erwin, SQL Server, HTML, Google Cloud Platform (GCP), Oracle, Toad, Microsoft Excel, Power BI.
Education
Master's in Cybersecurity with a Specialization in Data Analytics
Bachelor’s from JNTU, Hyderabad