
Data Engineer Machine Learning

Location:
Dallas, TX, 75215
Posted:
November 10, 2023


Resume:

Prashanth K

Data Engineer

ad00yy@r.postjobfree.com 469-***-**** Dallas, Texas

Professional Summary:

Data Engineer with 8+ years of IT experience in developing, building, and maintaining database systems and constructing ETL data pipelines for continuous and automated data exchange.

Hands-on experience developing Spark applications using Spark tools such as RDD transformations, Spark Core, Spark MLlib, Spark Streaming, and Spark SQL.

Experience with the Hadoop ecosystem and Big Data components for developing and deploying enterprise applications using Apache Spark, Python, Spark SQL, Spark MLlib, HDFS, MapReduce, Kafka, Flume, Sqoop, Airflow, and Hive.

Exposure to developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analysing and transforming the data to gain insight into customer usage patterns.

Practical knowledge of data mining, data cleaning, data munging, and machine learning using Python, SQL, Hive, PySpark, and Spark SQL.

Familiarity with Extraction, Transformation and Loading (ETL) of data from various sources into data warehouses, as well as data processing such as collecting, aggregating, and moving data from various sources using Apache Flume and Kafka.

Hands-on experience with GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, the gsutil and bq command-line utilities, Dataproc, and Stackdriver.

Experience analyzing data using Python, R, SQL, Microsoft Excel, Hive, PySpark, and Spark SQL for data mining, data cleansing, data munging, and machine learning.

Knowledge of data processing using MapReduce, Spark, and Hive jobs for data analysis.

Proficiency in using PySpark and Hive to transform and analyse data according to ETL mappings.

Familiarity with creating and running Docker images comprising multiple microservices.

Ample knowledge of data architecture including data ingestion pipeline design, Hadoop/Spark architecture, data modelling, data mining, machine learning and advanced data processing.

Orchestrated multiple Spark application jobs and monitored their scheduling and triggering using the Apache Airflow scheduler.

Working knowledge of cloud services such as Amazon Web Services and Microsoft Azure.

Knowledge of SQL database migration to Azure Data Lake, Azure Data Lake Analytics, and Databricks, as well as controlling and granting database access.

Experience migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.

Hands-on expertise with AWS and Big Data technologies (DynamoDB, Kinesis, S3, Hive/Spark) to construct the infrastructure needed for effective data extraction, transformation, and loading from a range of data sources using NoSQL and SQL.

Ample experience with Amazon Web Services (Amazon EC2, Amazon S3, Amazon SimpleDB, Amazon RDS, Elastic Load Balancing, Elasticsearch, Amazon MQ, AWS Lambda, Amazon SQS, AWS Identity and Access Management, Amazon CloudWatch, Amazon EBS, and AWS CloudFormation).

Practical understanding of data modelling (dimensional and relational) concepts such as star-schema modelling and fact and dimension tables.

Experience in text analytics, generating data visualizations using Python, and creating dashboards, with expertise in developing reports and dashboards using Power BI and Tableau.

Proficient in SQL for querying, data extraction/transformation, and developing queries for a wide range of applications.

Seasoned practice in Machine Learning algorithms and Predictive Modelling such as Linear Regression, Logistic Regression, Naïve Bayes, Decision Tree, Random Forest, KNN, Neural Networks, and K-means Clustering.

Fluent programming experience with Python, SQL and R.

Client: IBM Jan 2022-Present

Location: Dallas, TX

Data Engineer

Responsibilities:

Developing production-ready software using Python and PySpark for data engineering use cases.

Working on a system that generates efficient, idiomatic Python and PySpark code for data engineering use cases and handles a variety of data from multiple sources.

Applying software engineering design principles along with scripting techniques to address readability and performance issues.

Designing solutions with Pandas and PySpark that can automatically assemble into larger programs performing complex operations.

Designing star schemas in BigQuery.

Building datasets and tables in BigQuery and loading data from Cloud Storage.
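
A minimal sketch of this kind of BigQuery load using the google-cloud-bigquery client; the project, dataset, table, and GCS URI below are illustrative placeholders.

    # Load a CSV file from Cloud Storage into a BigQuery table (illustrative names).
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # assumed project id

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    load_job = client.load_table_from_uri(
        "gs://my-bucket/exports/sales.csv",   # assumed source file
        "my-project.analytics.sales",         # assumed destination table
        job_config=job_config,
    )
    load_job.result()  # block until the load job completes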

Converting and modifying Hive queries for use in BigQuery and performing data cleaning on unstructured information using various tools.

Extracting, transforming, and loading data from source systems to Azure data storage services using a combination of Azure Data Factory, Spark SQL, and Azure Data Lake Analytics; ingesting the data into Azure Data Lake and processing it in Azure Databricks.

Ingesting high volumes of data from multiple sources into Azure Cosmos DB for ad hoc querying, and using the change feed to reliably and incrementally read inserts and updates made to Azure Cosmos DB containers.

Building and architecting multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP, and coordinating tasks among the team.

Writing Cosmos DB queries using SQL as a JSON query language to read data stored in Azure Cosmos DB.
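
A small sketch of this style of query using the azure-cosmos Python SDK; the account endpoint, key, database, container, and fields are assumptions for illustration.

    # Query Azure Cosmos DB with its SQL-like syntax (illustrative names and fields).
    from azure.cosmos import CosmosClient

    client = CosmosClient(
        url="https://my-account.documents.azure.com:443/",  # assumed endpoint
        credential="<account-key>",
    )
    container = client.get_database_client("sales").get_container_client("orders")

    query = "SELECT c.id, c.total FROM c WHERE c.status = @status"
    items = container.query_items(
        query=query,
        parameters=[{"name": "@status", "value": "shipped"}],
        enable_cross_partition_query=True,
    )
    for item in items:
        print(item["id"], item["total"])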

Contributing to the team by researching, designing, and analysing software programs and delivering solutions in Python and PySpark that handle a broad range of use cases with optimized performance.

Performance-testing the developed code, identifying areas of existing programs that require modification, and subsequently developing those modifications.

Performing transformations, cleaning, and filtering on imported data using the Spark DataFrame API, Hive, and MapReduce, and loading the final data into Hive.
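
A minimal PySpark sketch of this kind of DataFrame-based cleaning and loading; the input path, column names, and Hive table are illustrative assumptions.

    # Clean and filter an imported dataset, then load it into a Hive table.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, trim

    spark = SparkSession.builder.appName("clean_and_load").enableHiveSupport().getOrCreate()

    raw = spark.read.option("header", True).csv("hdfs:///landing/customers.csv")  # assumed path

    cleaned = (raw
               .dropDuplicates(["customer_id"])
               .filter(col("customer_id").isNotNull())
               .withColumn("name", trim(col("name"))))

    cleaned.write.mode("overwrite").saveAsTable("curated.customers")  # assumed table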

Using Spark tools to stream data from a variety of sources, including the cloud (AWS) and on-premises systems.

Creating a data pipeline for ingesting, aggregating, and loading consumer response data from an AWS S3 bucket into Hive external tables in HDFS.
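
A hedged sketch of this S3-to-Hive pattern in PySpark; the bucket, HDFS paths, columns, and table name are placeholders rather than the actual pipeline.

    # Read consumer response data from S3, land it in HDFS, and expose it as a
    # Hive external table (illustrative names throughout).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3_to_hive").enableHiveSupport().getOrCreate()

    responses = spark.read.json("s3a://consumer-bucket/responses/")  # assumed bucket
    responses.write.mode("overwrite").parquet("hdfs:///warehouse/responses/")

    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS responses_ext (
            user_id   STRING,
            survey_id STRING,
            score     INT
        )
        STORED AS PARQUET
        LOCATION 'hdfs:///warehouse/responses/'
    """)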

Using GitHub, CI/CD, pytest, and other technical practices to deliver and coordinate high-quality technical work.

Identifying and reporting issues with the code generator that is used to generate the Python and Pandas code.

Documenting technical requirements and code logic, collaborating with team members to modify the code and solve complex problems, and managing risks.

Client: Kroger

Location: Cincinnati, Ohio

Role: Big Data Engineer May 2020 - Jan 2022

Responsibilities:

Developed Spark applications using Python and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.

Performed sort, join, aggregation, filter, and other transformations on the datasets using Spark.

Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved the data in Parquet format in HDFS.
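
A minimal Structured Streaming sketch of such a Kafka-to-Parquet flow; the broker, topic, and paths are illustrative, and the original work used the RDD-based Spark Streaming API rather than Structured Streaming.

    # Stream records from Kafka and persist them as Parquet files in HDFS.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("kafka_to_parquet").getOrCreate()

    stream = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")  # assumed broker
              .option("subscribe", "orders")                     # assumed topic
              .load())

    # Kafka values arrive as bytes; cast to string before writing.
    parsed = stream.select(col("value").cast("string").alias("payload"))

    query = (parsed.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/orders/")
             .option("checkpointLocation", "hdfs:///checkpoints/orders/")
             .start())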

Implemented a generic, highly available ETL framework for bringing related data into Hadoop from various sources using Spark.

Involved in converting MapReduce programs into Spark transformations using PySpark RDDs.

Developed Spark programs in Python and applied programming principles to process complex and large unstructured and structured data frames at scale.

Developed Spark scripts by writing custom RDD transformations in Python and performing actions on RDDs.

Scheduled Apache Airflow DAGs to run multiple Hive and Spark jobs, which run independently based on time and data availability.
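
A minimal Airflow DAG sketch of this kind of scheduling; the DAG id, schedule, and the hive/spark-submit commands are placeholder assumptions, not the original jobs.

    # Run a Hive load followed by a Spark aggregation on a daily schedule.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="daily_hive_spark_jobs",   # assumed DAG name
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        hive_load = BashOperator(
            task_id="hive_load",
            bash_command="hive -f /jobs/load_staging.hql",
        )
        spark_aggregate = BashOperator(
            task_id="spark_aggregate",
            bash_command="spark-submit --master yarn /jobs/aggregate.py",
        )
        hive_load >> spark_aggregate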

Monitored task scheduling and triggering using the Apache Airflow scheduler to control the different ETL jobs.

Performed Spark performance tuning for several source system domains that were developed and integrated into a harmonized layer.

Explored Spark, leveraging SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN to improve the performance and optimization of existing Hadoop workloads.

Migrated data from Teradata to Hadoop and performed data preparation using Hive tables.

Created partitioned and bucketed tables in Hive and worked mainly with HiveQL to categorize data by subject area.
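
A short sketch of creating a partitioned, bucketed Hive table through Spark's Hive support; the table, columns, partition key, and bucket count are illustrative.

    # Create a Hive table partitioned by region and bucketed by order id.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales_by_region (
            order_id STRING,
            amount   DOUBLE
        )
        PARTITIONED BY (region STRING)
        CLUSTERED BY (order_id) INTO 16 BUCKETS
        STORED AS ORC
    """)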

Analysed large data sets to determine the optimal way to aggregate and report on them.

Worked regularly with Linux systems and RDBMS databases to ingest data using Sqoop.

Performed data analytics on the data lake using PySpark on the Databricks platform.

Worked on moving data from S3 buckets to Amazon Redshift using AWS Glue, with Lambda functions that triggered job runs and fetched data from different AWS services.
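
A hedged boto3 sketch of a Lambda handler that starts an AWS Glue job when a new object lands in S3; the Glue job name and argument key are assumptions for illustration.

    # Trigger a Glue job that loads the new S3 object into Redshift.
    import boto3

    glue = boto3.client("glue")

    def lambda_handler(event, context):
        # The handler is invoked by an S3 put event; pass the object's key to the job.
        key = event["Records"][0]["s3"]["object"]["key"]
        response = glue.start_job_run(
            JobName="s3_to_redshift_load",    # assumed Glue job name
            Arguments={"--source_key": key},  # assumed job argument
        )
        return {"JobRunId": response["JobRunId"]}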

Used AWS Athena to perform data analysis by writing SQL queries on data stored in Amazon S3, and to generate reports and explore data with business intelligence tools and SQL clients connected via ODBC or JDBC drivers.
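
An illustrative boto3 sketch of submitting an Athena query over data in S3; the database, table, and result bucket are placeholders.

    # Start an Athena query and print its execution id.
    import boto3

    athena = boto3.client("athena")

    run = athena.start_query_execution(
        QueryString="SELECT channel, COUNT(*) AS orders FROM clickstream GROUP BY channel",
        QueryExecutionContext={"Database": "analytics"},                       # assumed database
        ResultConfiguration={"OutputLocation": "s3://query-results/athena/"},  # assumed bucket
    )
    print(run["QueryExecutionId"])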

Responsible for creating on-demand tables on S3 files using Lambda Functions and AWS Glue using Python and PySpark.

Worked with Amazon Web Services (AWS), using EC2 for hosting and Elastic MapReduce (EMR) for data processing, with S3 as the storage mechanism.

Client: Kaiser Permanente May 2018 - May 2020

Location: Oakland, CA

Role: Big Data Engineer

Responsibilities:

Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs for data cleaning and pre-processing.

Ran data import and export jobs, such as copying data from RDBMS to HDFS using Sqoop, and developed Spark code and Spark SQL/Streaming for faster testing and processing of data.

Worked on querying data using Spark SQL on top of the Spark engine, implementing Spark RDDs using Python.

Built best-practice ETLs with Apache Spark to load and transform raw data into easy-to-use dimensional data for self-service reporting.

Used Spark SQL to load JSON data, create a SchemaRDD, and load it into Hive tables, handling structured data.
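
A minimal PySpark sketch of loading JSON and persisting it to a Hive table as described above; the input path and table name are illustrative assumptions.

    # Load JSON files and save them as a Hive table.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    claims = spark.read.json("hdfs:///landing/claims/*.json")       # assumed path
    claims.write.mode("overwrite").saveAsTable("warehouse.claims")  # assumed table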

Exported the analysed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.

Used Sqoop to process data, importing and exporting databases to and from HDFS, and was involved in processing large datasets.

Developed scripts to load data into Hive from HDFS and was involved in ingesting data into the data warehouse using various data loading techniques.

Designed and implemented data loading and aggregation frameworks and jobs able to handle hundreds of GBs of JSON files, using Spark and Airflow.

Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.

Worked on developing ETL pipelines over S3 Parquet files in the data lake using AWS Glue.

Responsible for expanding and optimizing the data and data pipeline architecture, as well as optimizing data flow and collection for cross-functional teams.

Built reports using Tableau to allow internal and external teams to visualize and extract insights from big data platforms.

Client: Navy Federal

Location: Vienna, Virginia

Role: Python Developer Oct 2015 - May 2018

Responsibilities:

Involved in developing the front end of the application using Python, HTML5, CSS3, AJAX, JSON, and jQuery.

Analysed system requirement specifications and interacted with clients during requirements definition.

Created entire application using Python, Django, MySQL and Linux.

Developed a fully automated continuous integration system using Git, Gerrit, Jenkins, MySQL and custom tools developed in Python and Bash.

Successfully migrated the Django database from SQLite to MySQL to PostgreSQL with complete data integrity.
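
A small sketch of the kind of settings change involved in pointing a Django project at PostgreSQL; the database name, user, and credentials are placeholders, and data is typically carried over with manage.py dumpdata / loaddata after switching backends.

    # settings.py fragment: switch the default database to PostgreSQL (illustrative values).
    DATABASES = {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            "NAME": "appdb",
            "USER": "app_user",
            "PASSWORD": "change-me",
            "HOST": "localhost",
            "PORT": "5432",
        }
    }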

Developed custom directives (elements, attributes, and classes) using AngularJS.

Used Python and Django to interface with the jQuery UI and manage the storage and deletion of content.

Used Python-based GUI components for front-end functionality such as selection criteria.

Created a test harness in Python to enable comprehensive testing.

Developed multi-threaded standalone app in Python to view Circuit parameters and performance.

Carried out various mathematical operations for calculation purposes using Python libraries.

Developed AI and machine learning algorithms such as classification, regression, and deep learning models using Python.
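
A minimal scikit-learn classification sketch representative of this kind of work; it uses a bundled example dataset rather than project data.

    # Train and evaluate a simple logistic regression classifier.
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = LogisticRegression(max_iter=5000)
    model.fit(X_train, y_train)
    print("accuracy:", accuracy_score(y_test, model.predict(X_test)))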

Created and wrote result reports in different formats such as TXT, CSV, and JSON.

Utilized PyUnit, the Python unit-testing framework, for all Python applications.

Used the Python library Beautiful Soup for web scraping to extract data for building graphs.
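
A minimal Beautiful Soup scraping sketch of the kind described above; the URL and CSS selector are illustrative.

    # Fetch a page and extract cell values from a table.
    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://example.com/metrics", timeout=10).text  # assumed URL
    soup = BeautifulSoup(html, "html.parser")

    values = [cell.get_text(strip=True) for cell in soup.select("table td.value")]
    print(values)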

Education:

Master’s in Computer Science from Southern Arkansas University, 2015.

Bachelor’s degree in Computer Science from JNTU, 2014.


