
Pravalika

Sr. Data Engineer

Email ID: ad5anr@r.postjobfree.com

Contact No: +1-980-***-****

LinkedIn URL: https://www.linkedin.com/in/pravalika-rao-3a9967289/

PROFESSIONAL SUMMARY

●Senior Data Engineer with 8+ years of combined experience in data analysis, data engineering, Big Data implementations, and Spark technologies.

●Experience in Big Data ecosystems using Hadoop, Pig, Hive, HDFS, MapReduce, Sqoop, Storm, Spark, Airflow, Snowflake, Teradata, Flume, Kafka, Yarn, Oozie, and Zookeeper.

●Extensive exposure to Big Data technologies and the Hadoop ecosystem, with an in-depth understanding of MapReduce and Hadoop infrastructure.

●Expertise in writing end-to-end Data processing Jobs to analyze data using MapReduce, Spark, and Hive.

●Experience with the Apache Spark ecosystem using Spark Core, Spark SQL, DataFrames, and RDDs, with knowledge of Spark MLlib.

●Experienced in data manipulation using Python for loading and extraction, as well as Python libraries such as NumPy, SciPy, and Pandas for data analysis and numerical computations.

●Solid experience in designing and operationalizing large-scale data and analytics solutions on the Snowflake Data Warehouse.

●Developed ETL pipelines into and out of the data warehouse using a combination of Python and SnowSQL (a minimal sketch appears at the end of this summary).

●Used GCP tools such as Cloud Build, Cloud Source Repositories, and Deployment Manager for DevOps and continuous integration/continuous deployment (CI/CD), automating the deployment of applications and infrastructure.

●Experience in extracting files from MongoDB through Sqoop, placing them in HDFS, and processing them.

●Worked with NoSQL databases like HBase in creating HBase tables to load large sets of semi-structured data coming from various sources.

●Implemented a cluster for the NoSQL tool HBase as part of a POC to address HBase limitations.

●Experience in building Power BI reports on Azure Analysis Services for better performance compared to DirectQuery against GCP BigQuery.

●Keen on keeping up with the newer technology stack that the Google Cloud Platform adds.

●Experienced Big Data/Hadoop, data analysis, and data modeling professional with applied information technology expertise.

●Strong experience working with HDFS, MapReduce, Spark, Hive, Sqoop, Flume, Kafka, Oozie, Pig, and HBase, with IT experience in Big Data technologies, Spark, and database development.

●Have experience in Apache Spark, Spark Streaming, Spark SQL, and NoSQL databases like HBase, Cassandra, and MongoDB.

●Expertise in transforming business requirements into analytical models, designing algorithms, building models, and developing data mining, data acquisition, data preparation, data manipulation, feature engineering, machine learning, validation, visualization, and reporting solutions that scale across massive volumes of structured and unstructured data.

●Experience in using Hadoop distributions such as Cloudera and Hortonworks.

●Excellent experience in designing, developing, documenting, and testing ETL jobs and mappings in server and parallel jobs using DataStage to populate tables in data warehouses and data marts.

●Integrated Kafka with Spark Streaming for real-time data processing.

●Skilled in performing data parsing, data manipulation, and data preparation with methods including describing data contents.

●Strong experience in the Analysis, design, development, testing, and Implementation of Business Intelligence solutions using Data Warehouse/Data Mart Design, ETL, BI, Client/Server applications, and writing ETL scripts using Regular Expressions and custom tools (Informatica, Pentaho, and Sync Sort) to ETL data.

●Experienced in the Hadoop ecosystem and Big Data components including Apache Spark, Scala, Python, HDFS, MapReduce, and Kafka. Used Google Cloud (GCP) IoT Core to connect and manage Internet of Things (IoT) devices securely.

●Implemented Hadoop-based data warehouses and integrated Hadoop with Enterprise Data Warehouse systems.

●Hands-on experience with big data tools like Hadoop, Spark, Hive, Pig, Impala, PySpark, and Spark SQL.

●Good knowledge in Database Creation and maintenance of physical data models with Oracle, Teradata, Netezza, DB2, MongoDB, HBase, and SQL Server databases.

●Deep understanding of MapReduce with Hadoop and Spark.

●Good knowledge of Big Data ecosystems like Hadoop 2.0 (HDFS, Hive, Pig, Impala), and Spark (SparkSQL, Spark MLLib, Spark Streaming).

●Experienced in writing complex SQL queries with joins and subqueries, as well as stored procedures and triggers.

●Built and supported large-scale Hadoop environments, including design, configuration, installation, performance tuning, and monitoring.

●Good experience in the programming languages Python and Scala, along with experience in data analytics, data reporting, ad-hoc reporting, graphs, scales, pivot tables, and OLAP reporting.

●Ability to work with managers and executives to understand the business objectives and deliver as per the business needs and a firm believer in teamwork.
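
Illustrative sketch of the Python + SnowSQL pipeline style referenced above, using the snowflake-connector-python package; the account, warehouse, stage, file, and table names are placeholders, not details from any engagement.

import snowflake.connector

def load_daily_extract(local_file: str) -> None:
    """Upload a local CSV to an internal stage and COPY it into a staging table."""
    conn = snowflake.connector.connect(
        account="my_account",     # placeholder account identifier
        user="etl_user",          # placeholder credentials
        password="***",
        warehouse="ETL_WH",
        database="ANALYTICS",
        schema="STAGING",
    )
    try:
        cur = conn.cursor()
        # PUT stages the local file; COPY INTO loads it into the target table.
        cur.execute(f"PUT file://{local_file} @daily_stage AUTO_COMPRESS=TRUE")
        cur.execute("""
            COPY INTO staging.daily_extract
            FROM @daily_stage
            FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
            ON_ERROR = 'ABORT_STATEMENT'
        """)
    finally:
        conn.close()

if __name__ == "__main__":
    load_daily_extract("/tmp/daily_extract.csv")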

TECHNICAL SKILLS

Big Data Ecosystem

HDFS, Yarn, MapReduce, Spark, Kafka, Kafka Connect, Hive, Airflow, Stream Sets, Sqoop, HBase, Flume, Pig, Ambari, Oozie, Zookeeper, Nifi, Sentry

Hadoop Distributions

Apache Hadoop 2.x/1.x, Cloudera CDP, Hortonworks HDP

Cloud Environment

Microsoft Azure, Google Cloud Platform (GCP), Snowflake

Databases

MySQL, Oracle, Teradata, MS SQL SERVER, PostgreSQL, DB2

NoSQL Database

DynamoDB, HBase

Microsoft Azure

Databricks, Data Lake, Blob Storage, Azure Data Factory, SQL Database, SQL Data Warehouse, Cosmos DB, Azure Active Directory

Software/Tools

Microsoft Excel, Statgraphics, Eclipse, Shell Scripting, ArcGIS, Linux, Jupyter Notebook, PyCharm, Vi/Vim, Sublime Text, Visual Studio, Postman, Ansible

Reporting Tools/ETL Tools

Informatica, Talend, SSIS, SSRS, SSAS, ER Studio, Tableau, Power BI, Arcadia, DataStage, Pentaho

Programming Languages

Python (Pandas, SciPy, NumPy, Scikit-Learn, Stats Models, Matplotlib, Plotly, Seaborn, Keras, TensorFlow, PyTorch), PySpark, T-SQL/SQL, PL/SQL, HiveQL, Scala, UNIX Shell Scripting

Version Control

Git, SVN, Bitbucket

Development Tools

Eclipse, NetBeans, IntelliJ, Hue, Microsoft Office

PROFESSIONAL EXPERIENCE

Client: Centene Corporation, CA April 2022 - Present

Role: Sr. Data Engineer

Project Objective:

Centene Corporation aims to improve healthcare outcomes and operational efficiency by leveraging Big Data analytics. This project focuses on building a data pipeline using Snowflake as the data warehouse, Apache Airflow for workflow orchestration, and Google Cloud Platform for data storage and processing. The project's primary goal is to provide healthcare insights that support data-driven decision-making.

Responsibilities:

•Led the design and implementation of end-to-end data integration pipelines, integrating healthcare data sources, including Electronic Health Records (EHR), claims data, and medical imaging data into Snowflake on GCP.

•Used the Cloud SDK in GCP Cloud Shell to configure services such as Dataproc, Cloud Storage, and BigQuery.

•Coordinated with the team and developed a framework to generate daily ad hoc reports and extracts from enterprise data in BigQuery.

•Designed and maintained a scalable data architecture in Snowflake, ensuring data quality, consistency, and compliance with healthcare regulations (e.g., HIPAA).

•Migrated an entire Oracle database to BigQuery and used Power BI for reporting.

•Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.

•Coordinated with the Data Science team in designing and implementing advanced analytical models in the Hadoop cluster over large datasets.

•Wrote Hive SQL scripts to create complex tables with performance optimizations such as partitioning, clustering, and skew handling.

•Built a program with the Apache Beam Python SDK and executed it on Cloud Dataflow to stream Pub/Sub messages into BigQuery tables (a sketch of this pattern follows this list).

•Downloaded BigQuery data into pandas or Spark DataFrames for advanced ETL capabilities (a short example follows the Environment line below).

•Worked with Google Data Catalog and other Google Cloud APIs for monitoring, query, and billing-related analysis of BigQuery usage.

•Created a POC utilizing ML models and Cloud ML for table quality analysis in the batch process.

•Knowledge of Cloud Dataflow and Apache Beam.

•Good knowledge of using Cloud Shell for various tasks and deploying services; experienced with GCP Dataproc, GCS, Cloud Functions, and BigQuery.

•Created BigQuery authorized views for row-level security and for exposing data to other teams.

•Developed data pipelines using Cloud Composer for orchestration, Cloud Dataflow for building scalable machine learning algorithms for clustering, and Cloud Dataprep for exploration.

•Expertise in designing and deploying Hadoop clusters and different Big Data analytic tools including Pig, Hive, SQOOP, and Apache Spark, with Cloudera Distribution.

•Oversaw system scalability to accommodate increasing volumes of healthcare data, ensuring uninterrupted data processing.

•Continuously optimized ETL processes, data storage, and query performance, leading to improvement in overall system efficiency.

•Automated ETL and ELT processes using tools such as Cloud Dataflow and Apache Beam, improving data transformation efficiency and accuracy.

•Created detailed documentation of the data pipeline architecture, data dictionaries, and workflow processes for knowledge sharing, and conducted training sessions for data analysts and stakeholders to enable effective use of the data analytics platform.

•Explored machine learning models for predictive healthcare analytics, contributing to [Specific Outcome, e.g., improved disease prediction accuracy]. Leveraged GCP's AI/ML services for model training and deployment.

•Established proactive monitoring and alerting systems, reducing downtime and ensuring data pipeline reliability. Scheduled routine maintenance activities to keep the system up-to-date and perform enhancements as required.
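
Hedged sketch of the Pub/Sub-to-BigQuery streaming pattern described in the responsibilities above, using the Apache Beam Python SDK on the Dataflow runner; the project, topic, bucket, and table names are illustrative placeholders.

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run():
    options = PipelineOptions(
        streaming=True,
        runner="DataflowRunner",    # use DirectRunner for local testing
        project="my-gcp-project",   # placeholder project id
        region="us-east1",
        temp_location="gs://my-bucket/tmp",
    )
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                topic="projects/my-gcp-project/topics/claims-events")
            | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                table="my-gcp-project:healthcare.claims_events",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            )
        )

if __name__ == "__main__":
    run()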

Environment: BigQuery, Oracle Database, Hive, Apache Spark, Cloud ML, GCP Dataproc, ETL processes, Python, SQL, Cloud Functions.
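
Short example of the BigQuery-to-pandas step mentioned above, using the google-cloud-bigquery client library; the project id, dataset, and query are placeholders, and to_dataframe() assumes pandas and db-dtypes are installed.

from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # placeholder project id

sql = """
    SELECT member_id, claim_date, claim_amount
    FROM `my-gcp-project.healthcare.claims`
    WHERE claim_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
"""

# Run the query and pull the result set down as a pandas DataFrame for further ETL.
claims_df = client.query(sql).to_dataframe()
print(claims_df.describe())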

Client: First Republic Bank Oct 2021 - March 2022

Role: Snowflake Developer

Project Objective: Detect and prevent fraudulent transactions in real time to protect First Republic Bank and its customers, enhance overall security measures by leveraging Big Data analytics on GCP, and improve customer trust and satisfaction by reducing the occurrence of fraudulent activities.

Responsibilities:

•Worked with a cross-functional team in the design and implementation of a comprehensive fraud detection and security enhancement project at First Republic Bank.

•Designed and deployed robust data ingestion and processing pipelines on Google Cloud Platform (GCP) to collect and analyze real-time transactional data from multiple sources.

•Wrote scripts in Hive SQL and Presto SQL, using Python integrations for both Spark and Presto, to create complex tables with performance optimizations such as partitioning, clustering, and skew handling.

•Designed and implemented a highly secure and compliant data warehousing solution on Snowflake within GCP’s infrastructure, ensuring strict adherence to financial regulations and data privacy laws.

•Developed and managed end-to-end ETL processes using tools such as Google Cloud Dataflow, Apache Beam, and Apache Airflow to facilitate the seamless movement of data between on-premises and cloud-based systems.

•Leveraged advanced data modeling techniques, including dimension modeling and slowly changing dimensions, to create a comprehensive data warehouse that supports risk analysis, fraud detection, and customer segmentation.

•Collaborated with the risk management and compliance teams to design data access controls and audit trails, ensuring all data handling and reporting adhered to banking industry standards and best practices.

•Implemented robust data encryption, masking, and anonymization techniques to protect sensitive customer and financial data, achieving compliance with data security requirements.

•Managed the performance tuning and optimization of Snowflake and GCP resources to handle complex analytics queries efficiently, reducing query response times.

•Conducted data lineage and impact analysis to track changes in source systems and assess the potential effects on reporting, ensuring data accuracy and consistency.

•Migrated previously written cron jobs to Airflow/Cloud Composer in GCP. Leveraged cloud and GPU computing technologies for automated machine learning and analytics pipelines.

•Worked with Confluence and Jira. Designed and implemented a configurable data delivery pipeline, built with Python, for scheduled updates to customer-facing data stores.

•Proficient in machine learning techniques (decision trees, linear/logistic regression) and statistical modeling.

•Compiled data from various sources to perform complex analysis for actionable results.

•Measured the efficiency of the Hadoop/Hive environment, ensuring SLAs were met. Optimized the TensorFlow model for efficiency.

•Analyzed the system for new enhancements/functionalities and performed impact analysis of the application for implementing ETL changes.

•Implemented a continuous delivery pipeline with Docker and GitHub. Built performant, scalable ETL processes to load, cleanse, and validate data.

•Wrote Python DAGs in Airflow that orchestrate end-to-end data pipelines for multiple applications (a minimal DAG sketch follows this list).

•Involved in setting up the Apache Airflow service in GCP and built data pipelines in Airflow for ETL-related jobs using different Airflow operators.

•Strong knowledge of .NET development using C# and .NET frameworks, building scalable and performant data-driven applications for data processing and analysis.

•Wrote Hive SQL scripts to create complex tables with performance optimizations such as partitioning, clustering, and skew handling.
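
Minimal sketch of an Airflow DAG of the kind described above, orchestrating an extract-transform-load sequence with standard operators; the DAG id, schedule, commands, bucket, and table names are illustrative assumptions.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def transform_transactions(**context):
    # Placeholder transform step; production jobs delegated heavy lifting to Dataflow/BigQuery.
    print("transforming staged transactions")

with DAG(
    dag_id="transactions_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="stage_raw_feed",
        bash_command="gsutil cp /var/feeds/transactions/*.json gs://raw-bucket/transactions/",
    )
    transform = PythonOperator(
        task_id="transform_transactions",
        python_callable=transform_transactions,
    )
    load = BashOperator(
        task_id="load_to_bigquery",
        bash_command=(
            "bq load --source_format=NEWLINE_DELIMITED_JSON "
            "analytics.transactions 'gs://raw-bucket/transactions/*.json'"
        ),
    )
    extract >> transform >> load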

Environment: Hive SQL/Presto SQL, ETL processes, Docker, GitHub, Python, .NET frameworks, Google Cloud Dataflow, Apache Beam, and Apache Airflow.

Client: Jio Fiber, Hyderabad Sept 2019 - Sept 2021

Role: Big Data Engineer

Project Objective: To leverage Microsoft Azure's cloud services and big data capabilities to enhance Jio Fiber's network performance, customer experience, and security.

Responsibilities:

•Created Python Databricks notebooks to handle large amounts of data, transformations, and computations.

•Analyzed, designed, and built modern data solutions using Azure PaaS services to support data visualization. Understood the current production state of the application and determined the impact of new implementations on existing business processes.

•Extracted data from on-premises and cloud storage and loaded it into Azure Data Lake using tools such as Databricks and Data Factory.

•Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, and to write data back.

•Developed Spark applications using Scala and Spark SQL to create dimension and fact tables, loading the processed data back to Azure Data Lake (a PySpark-style sketch follows this list).

•Moved data from Azure Data Lake to Azure SQL Data Warehouse using PolyBase, created external tables in the data warehouse with 4 compute nodes, and scheduled the loads.

•Designed new applications for high transaction processing and scalability to seamlessly support future modifications and the growing volume of data processed in the environment.

•Implemented solutions to run effectively in the cloud and improved the performance of big data processing and the high volume of data handled by the system to provide better customer support.

•Worked with business process managers as a subject matter expert in transforming vast amounts of data and creating business intelligence reports using big data technologies (Hive, Spark, Sqoop, and NiFi for ingestion of big data; Python/Bash scripting and Apache Airflow for scheduling jobs in GCP/Google's cloud-based environments).

•Migrated an Oracle SQL ETL to run on the Google Cloud Platform using Cloud Dataproc and BigQuery, with Cloud Pub/Sub for triggering the Airflow jobs.

•Worked with Presto, Hive, Spark SQL, and BigQuery through Python client libraries, building interoperable and faster programs for analytics platforms.

•Hands-on experience in using all the big data-related services in the Google Cloud Platform.

•Used Apache Airflow in the GCP Cloud Composer environment to build data pipelines, with various Airflow operators such as the Bash operator, Hadoop operators, Python callable operators, and branching operators.

•Moved data between BigQuery and Azure Data Warehouse using ADF, and created cubes on Azure Analysis Services (AAS) with complex DAX for memory optimization in reporting.
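
The Spark applications above were written in Scala; the following is a rough PySpark equivalent of the dimension/fact build, with placeholder ADLS paths and column names.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("usage_fact_build").getOrCreate()

# Raw events landed in the data lake by the ingestion pipelines (placeholder paths).
usage = spark.read.parquet("abfss://raw@datalake.dfs.core.windows.net/usage/")
subscribers = spark.read.parquet("abfss://raw@datalake.dfs.core.windows.net/subscribers/")

# Dimension: one row per subscriber.
dim_subscriber = (
    subscribers
    .select("subscriber_id", "plan_code", "city")
    .dropDuplicates(["subscriber_id"])
)

# Fact: daily usage aggregated per subscriber.
fact_daily_usage = (
    usage
    .withColumn("usage_date", F.to_date("event_ts"))
    .groupBy("subscriber_id", "usage_date")
    .agg(
        F.sum("bytes_downloaded").alias("total_bytes"),
        F.count(F.lit(1)).alias("session_count"),
    )
)

# Write the curated tables back to the lake for downstream warehouse loads.
dim_subscriber.write.mode("overwrite").parquet(
    "abfss://curated@datalake.dfs.core.windows.net/dim_subscriber/")
fact_daily_usage.write.mode("overwrite").partitionBy("usage_date").parquet(
    "abfss://curated@datalake.dfs.core.windows.net/fact_daily_usage/")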

Environment: Azure Databricks, Azure Data Lake, Hive, Spark, Sqoop, Google Cloud Platform, SQL, Cloud Dataproc, Scala, and Spark SQL.

Client: Nykaa, Hyderabad August 2017 - Sept 2019

Role: ETL Developer

Project Objective: By streamlining payment processing, enhancing payment security, enabling customer insights and personalization, providing real-time monitoring and alerts, and enforcing data quality and governance, the project aims to create a robust and efficient payment processing system, harness the power of customer data for informed decision-making, and ultimately improve the customer experience on the Nykaa platform.

Responsibilities:

•Designed and implemented complex data pipelines in Azure Data Factory, using activities like Copy Data, Data Flow, and Databricks to move, transform, and load data from sources like SQL Server, REST APIs, and flat files.

•Used Azure Databricks to process data, which was then ingested into Azure services such as Data Lake, Azure Data Lake Analytics, and Azure SQL Database.

•Used Terraform to provision data factories, storage accounts, and access to the key vault. Used ADF and PySpark with Databricks to create pipelines, data flows, and complex data transformations and manipulations.

•Created and maintained dimensional data models, star schemas, and snowflake schemas in Azure SQL Data Warehouse for Nykaa's reporting and analytics needs.

•Created linked services to Snowflake, Blob Storage, and SQL Database.

•Used DataFrames and Spark-SQL to develop Databricks PySpark scripts for transformations and loading the data to different targets.

•Created Python Databricks notebooks to handle large amounts of data, transformations, and computations.

•Also created pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks.

•Used Databricks extensively for accessing, processing, transforming, and analyzing large amounts of data.

•Working knowledge of Python programming, including a variety of packages such as NumPy, Matplotlib, SciPy, and Pandas.

•Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, aggregation, and analysis across multiple file formats.

•Built ETL pipelines to retrieve data from NoSQL databases and load aggregated data into the analytical platform. Managed data storing and processing using big data tools such as Hadoop, HDFS, Hive, and Spark.

•Utilized Python/PySpark in Databricks notebooks when creating ETL pipelines in Azure Data Factory (a parameterized notebook sketch follows this list).

•Used the combination of Azure Data Factory, T-SQL, Spark SQL, and Azure Data Lake Analytics, to extract, transform, and load data from source systems to Azure Data Storage services.

•Worked with Azure Blob, ADLS Gen-1, and Gen-2 as well as other data storage options. Experience in using Azure Data Factory to bulk import data from CSV, XML, and flat files.
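
Hedged sketch of a parameterized Databricks notebook of the kind invoked from an ADF pipeline; the widget names, paths, and columns are illustrative assumptions, and spark/dbutils are objects provided by the Databricks notebook runtime rather than imports.

from pyspark.sql import functions as F

# Parameters passed in from the ADF Databricks notebook activity (dbutils is
# injected by the Databricks runtime; widget names here are placeholders).
run_date = dbutils.widgets.get("run_date")        # e.g. "2019-06-01"
source_path = dbutils.widgets.get("source_path")  # e.g. "abfss://raw@lake.dfs.core.windows.net/orders"

orders = (
    spark.read
    .option("header", "true")
    .csv(f"{source_path}/orders_{run_date}.csv")
)

cleaned = (
    orders
    .filter(F.col("payment_status").isNotNull())
    .withColumn("order_amount", F.col("order_amount").cast("double"))
)

# Write the cleaned daily slice to the curated zone consumed by the warehouse load.
(
    cleaned.write
    .mode("overwrite")
    .parquet(f"abfss://curated@lake.dfs.core.windows.net/orders/dt={run_date}/")
)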

Environment: Azure, Terraform, Snowflake, Azure Databricks, Azure data lake, Azure Blob, SQL, T-SQL, Spark-SQL, Python, NumPy, Matplotlib, SciPy, Pandas, YAML.

Client: Tata Motors, Hyderabad July 2015 - August 2017

Role: Datastage Developer / Data Analyst

Project Objective: To leverage Microsoft Azure cloud services and Big Data analytics to transform the vehicle design and manufacturing process, optimizing efficiency, quality, and sustainability, ultimately delivering better-performing and more reliable vehicles to customers.

Responsibilities:

•Extensively used DataStage Designer and Teradata to build the ETL process, which pulls data from different sources such as flat files, DB2, and mainframe systems and applies grouping techniques in the job design.

•Developed master jobs (Job Sequencing) for controlling the flow of both parallel & server Jobs.

•Good knowledge of parameterizing variables rather than hardcoding them directly; used DataStage Director widely for monitoring job flow and processing speed.

•Based on this analysis, performed performance tuning to improve job processing speed.

•Developed AutoSys jobs for scheduling, including box jobs, command jobs, and file watcher jobs, and created ITG requests.

•Closely monitored schedules and investigated failures to complete all ETL/load processes within the SLA.

•Designed and developed SQL scripts and extensively used Teradata utilities such as BTEQ scripts, FastLoad, and MultiLoad to perform bulk database loads and updates (an illustrative wrapper sketch follows this list).

•After completing ETL activities, the corresponding load file was sent to the cube team for building cubes.

•Used Teradata Export utilities for reporting purposes. Created spec docs for automating the manual processes.

•Worked closely with onshore and business teams to resolve critical issues that occurred during the load process.
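
The scheduling in this project was handled by DataStage and AutoSys; purely as an illustration, here is a hedged Python wrapper around a BTEQ batch step of the kind described above. It assumes the Teradata BTEQ client is installed and on PATH, and the host, credentials, and table names are placeholders.

import subprocess
import textwrap

# Placeholder BTEQ script: log on, refresh a staging table, and exit with a status code.
BTEQ_SCRIPT = textwrap.dedent("""
    .LOGON tdprod/etl_user,etl_password;
    DATABASE sales_dm;

    DELETE FROM stg_daily_sales;
    INSERT INTO stg_daily_sales
    SELECT * FROM ld_daily_sales;

    .IF ERRORCODE <> 0 THEN .QUIT 8;
    .LOGOFF;
    .QUIT 0;
""")

def run_bteq(script: str) -> int:
    """Feed a BTEQ script to the bteq utility via stdin and return its exit code."""
    result = subprocess.run(["bteq"], input=script, text=True, capture_output=True)
    print(result.stdout)
    return result.returncode

if __name__ == "__main__":
    raise SystemExit(run_bteq(BTEQ_SCRIPT))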

Environment: DataStage 8.5, Linux 2.5, Oracle 10g, Teradata 13.1.1, TWS maestro, TOAD, SQL*Loader, SQL Plus, SQL, HPSM, Mercury ITG


