GOWTHAM R
Data Engineer
************@*****.***
Objective
To secure a Data Engineer position in a dynamic organization where I can leverage my 5+ years of experience building scalable data pipelines, optimizing data architecture, and working with cloud platforms (AWS, GCP, or Azure) to support data-driven decision-making. I aim to apply my skills in SQL, Python, ETL processes, and big data technologies like Hadoop and Spark to help the organization effectively manage and utilize its data resources, thereby driving business growth and operational efficiency.
Technical Skills
Big Data Ecosystem:
Hadoop MapReduce, Impala, HDFS, Hive, Pig, HBase, Flume, Storm, Sqoop, Oozie, Kafka, Spark, ZooKeeper.
Hadoop Distributions:
Apache Hadoop 2.x/1.x, Cloudera CDP, Hortonworks HDP, Amazon EMR; AWS services: EC2, EBS, RDS, S3, Athena, Glue, Elasticsearch, Lambda, DynamoDB, Redshift.
Programming Languages:
Python, R, Scala, C++, SAS, Java, SQL, HiveQL, PL/SQL, UNIX Shell Scripting, Pig Latin
Machine Learning:
Linear Regression, Logistic Regression, Decision Tree, Random Forest, SVM, XGBoost, Naïve Bayes, PCA, LDA, K-Means, KNN, Neural Network
Cloud Technologies:
AWS, Azure, GCP, Amazon S3, EMR, Redshift, Lambda, Athena, Cloud Composer, BigQuery
Databases:
Snowflake, MySQL, Teradata, Oracle, MS SQL Server, PostgreSQL, DB2
NoSQL Databases:
HBase, Cassandra, MongoDB, DynamoDB, Cosmos DB
Version Control:
Git, SVN, Bitbucket
Continuous Integration/Containerization:
Jenkins, Docker, Kubernetes
ETL/BI:
Informatica, SSIS, SSRS, SSAS, Tableau, Power BI, QlikView, Arcadia, Erwin, Matillion, Rivery
Operating System:
Mac OS, Windows 7/8/10, Unix, Linux, Ubuntu
Methodologies:
RAD, JAD, UML, System Development Life Cycle (SDLC), Jira, Confluence, Agile, Waterfall Model
Profile Summary
Over 5 years of diversified IT experience across end-to-end data analytics platforms (ETL, BI, Java), spanning Big Data/Hadoop and Java/J2EE development, Informatica, data modeling, and systems analysis in the Banking, Finance, Insurance, and Telecom domains.
Hands-on experience with the Hadoop framework and its ecosystem, including HDFS, MapReduce, Pig, Hive, Sqoop, Flume, and Spark; as a data architect, designed and maintained high-performance ELT/ETL processes.
Experience with Python libraries such as Requests, Pysftp, GnuPG, ReportLab, NumPy, SciPy, Matplotlib, httplib2, urllib2, and Pandas throughout the development lifecycle; able to work coherently in both GCP and Azure clouds in parallel.
Extensive experience developing data warehouse applications using Hadoop, Informatica, Oracle, Teradata, and MS SQL Server on UNIX and Windows platforms, along with extensive IT data analytics project experience; hands-on experience migrating on-premise ETLs to Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, and Cloud Composer.
Experience in analyzing data using HiveQL, Pig Latin, and custom MapReduce programs in Java, including custom UDFs.
Strong experience writing scripts with the Python, PySpark, and Spark APIs for data analysis (a minimal sketch follows this summary).
Extensively used Python libraries such as PySpark, Pytest, PyMongo, cx_Oracle, PyExcel, Boto3, Psycopg2, embedPy, NumPy, and Beautiful Soup; created Spark Streaming modules to stream data into the Data Lake.
Good understanding of Hadoop architecture and components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and MapReduce concepts. Experience on cloud platforms (Azure) relevant to the Azure Data Engineer role, with solid experience building ETL ingestion flows using Azure Data Factory and programming experience in Python and Scala.
Extensively used Informatica client tools: Source Analyzer, Warehouse Designer, Mapping Designer, Mapplet Designer, ETL transformations, Informatica Repository Manager, Informatica Server Manager, Workflow Manager, and Workflow Monitor.
Expertise in core Java, J2EE, multi-threading, JDBC, and shell scripting; proficient with Java APIs (Collections, Servlets, and JSP) for application development, with working experience in functional programming languages such as Scala and Java.
Worked closely with Dev and QA teams to review pre- and post-processed data and ensure data accuracy and integrity.
Experience in Java, J2EE, JDBC, Collections, Servlets, JSP, Struts, Spring, Hibernate, JSON, XML, REST, SOAP web services, Groovy, MVC, Eclipse, WebLogic, WebSphere, and Apache Tomcat servers.
Proficiency with Scala, Apache HBase, Hive, Pig, Sqoop, Zookeeper, Spark, Spark SQL, Spark Streaming, Kinesis, Airflow, Yarn, and Hadoop (HDFS, MapReduce). Designed, built, and managed ELT data pipelines leveraging Airflow, Python, and GCP solutions.
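As a minimal illustration of the PySpark/Spark SQL analysis scripts described above, the following sketch assumes a hypothetical HDFS path and transaction schema:

```python
# Minimal PySpark analysis sketch; paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transaction-analysis").getOrCreate()

# Read raw transactions from HDFS (hypothetical location/schema).
txns = spark.read.parquet("hdfs:///data/raw/transactions")

# Aggregate daily totals per account using the DataFrame API.
daily_totals = (
    txns.withColumn("txn_date", F.to_date("txn_ts"))
        .groupBy("account_id", "txn_date")
        .agg(F.sum("amount").alias("total_amount"),
             F.count("*").alias("txn_count"))
)

# The same analysis expressed through the Spark SQL interface.
txns.createOrReplaceTempView("transactions")
spark.sql("""
    SELECT account_id, to_date(txn_ts) AS txn_date,
           SUM(amount) AS total_amount, COUNT(*) AS txn_count
    FROM transactions
    GROUP BY account_id, to_date(txn_ts)
""").show(10)

# Persist curated output back to HDFS.
daily_totals.write.mode("overwrite").parquet("hdfs:///data/curated/daily_totals")
spark.stop()
```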
Professional Experience
Client: McKesson Corporation, Texas, USA Aug 2023 – Present
Role: Azure Data Engineer
Description: McKesson Corporation is one of the largest healthcare services and pharmaceutical distribution companies in the world. Developed and optimized data pipelines, automated workflows, and integrated backend systems to improve data processing efficiency and system reliability.
Responsibilities:
Designed and configured databases, back-end applications, and programs. Managed large datasets using Pandas DataFrames and SQL. Created pipelines to load data using Azure Data Factory (ADF).
Stored application configs in the NoSQL database MongoDB and manipulated them using PyMongo (a PyMongo sketch follows this section).
Implemented Synapse integration with Azure Databricks notebooks, reducing development work by roughly half, and improved Synapse load performance by implementing a dynamic partition switch.
Staged API and Kafka data (in JSON format) into Snowflake by flattening it for the different functional services. Deployed models as Python packages, as APIs for backend integration, and as services in a microservices architecture, with Kubernetes orchestrating the Docker containers.
Worked on big data integration and analytics based on Hadoop, SOLR, PySpark, Kafka, Storm, and webMethods. Spearheaded HBase setup and utilized Spark and Spark SQL to develop faster data pipelines, resulting in a 60% reduction in processing time and improved data accuracy.
Gathered security (equities, options, derivatives) data from different exchange feeds and stored historical data. Developed automated monitoring and alerting using Kubernetes and Docker, ensuring proactive identification and resolution of data pipeline issues. Created job flows in Airflow using Python and automated the jobs; Airflow runs on a separate stack for developing DAGs and submits jobs to an EMR or EC2 cluster.
Worked with CI/CD tools such as Jenkins and Docker on the DevOps team, setting up the application process end to end with continuous deployment for lower environments and continuous delivery, gated by approvals, for higher environments.
Implemented Apache Sqoop for efficiently transferring bulk data between Apache Hadoop and relational databases (Oracle) for product-level forecasting. Extracted data from Teradata into HDFS using Sqoop.
Strong at testing and debugging; tested applications and REST APIs using the Pytest, unittest, and Requests libraries.
Integrated Azure Data Factory with Blob Storage to move data through Databricks for processing and then to Azure Data Lake Storage and Azure SQL Data Warehouse. Experienced in developing Android applications using Android Studio, Eclipse IDE, SQLite, Java, XML, the Android SDK, and the ADT plugin.
Used Pig as an ETL tool for transformations, joins, and pre-aggregations before storing the data in HDFS, and assisted management by providing automation strategies, Selenium/Cucumber automation, and JIRA reports.
Utilized Zookeeper to manage synchronization, serialization, and coordination throughout the cluster after migrating from JMS Solace to Kinesis. Implemented data transformations and enrichment using Apache Spark Streaming to clean and structure the data for analysis.
Environment: Pandas, SQL, ADF, MongoDB, PyMongo, Synapse, Azure Databricks, Kafka, Kubernetes, HBase, Spark, Spark SQL, Snowflake, Hadoop, SOLR, PySpark, Storm, webMethods, Docker, Eclipse IDE, SQLite, Java, XML, Android SDK, ADT plugin.
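A minimal sketch of the PyMongo-based config handling referenced in this section; the connection URI, database, collection, and field names are hypothetical:

```python
# Minimal PyMongo config-store sketch; URI, database, and field names are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")       # hypothetical connection string
configs = client["pipeline_metadata"]["job_configs"]    # hypothetical db/collection

# Upsert a pipeline config document.
configs.update_one(
    {"job_name": "kafka_to_snowflake"},
    {"$set": {"batch_size": 5000, "enabled": True, "target": "SNOWFLAKE.STG"}},
    upsert=True,
)

# Read the config back and adjust a single field in place.
cfg = configs.find_one({"job_name": "kafka_to_snowflake"})
configs.update_one({"_id": cfg["_id"]}, {"$inc": {"batch_size": 1000}})
```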
Client: Toyota Motor North America, Texas, USA Nov 2022 - Jul 2023
Role: AWS Data Engineer
Description: Toyota Motor North America, Inc. (TMNA) is the North American subsidiary of Toyota Motor Corporation, one of the largest and most renowned automobile manufacturers in the world. Managed the design, implementation, and optimization of large-scale data processing solutions, ensuring secure and efficient data handling, storage, and analytics across various platforms.
Responsibilities:
Set up Kerberos authentication principals to establish secure network communication on the cluster and tested HDFS, Hive, Pig, and MapReduce access for new users.
Performed end-to-end architecture and implementation assessments of AWS services such as Amazon EMR, Redshift, and S3. Implemented machine learning algorithms in Python to predict the quantity a user might order for a specific item, so suggestions can be made automatically using Kinesis Firehose and an S3 data lake.
Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
Used the Spark SQL Scala and Python interfaces, which automatically convert RDDs of case classes to schema RDDs.
Imported data from sources such as HDFS and HBase into Spark RDDs and performed computations using PySpark to generate the output response. Created Lambda functions with Boto3 to deregister unused AMIs in all application regions and reduce EC2 costs (a Boto3 sketch follows this section).
Used Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS using Python and to NoSQL databases such as HBase and Cassandra. Used AWS Lambda to build serverless, event-driven applications and microservices for PaaS environments.
Designed and set up an enterprise Data Lake to support use cases including analytics, processing, storage, and reporting of voluminous, rapidly changing data. Designed and developed a security framework providing fine-grained access to objects in AWS S3 using AWS Lambda and DynamoDB.
Imported and exported databases using SQL Server Integration Services (SSIS) and Data Transformation Services (DTS) packages. Developed a reusable framework, to be leveraged for future migrations, that automates ETL from RDBMS systems to the Data Lake using Spark Data Sources and Hive data objects.
Conducted data blending and preparation using Alteryx and SQL for Tableau consumption and published data sources to Tableau Server. Developed Kibana dashboards based on Logstash data and integrated different source and target systems into Elasticsearch for near-real-time log analysis and end-to-end transaction monitoring.
Implemented AWS Step Functions to automate and orchestrate Amazon SageMaker tasks such as publishing data to S3, training ML models, and deploying them for prediction. Integrated Apache Airflow with AWS to monitor multi-stage ML workflows with tasks running on Amazon SageMaker.
Environment: AWS EMR, S3, RDS, Redshift, Lambda, Boto3, DynamoDB, Amazon SageMaker, Apache Spark, HBase, Apache Kafka, Hive, Sqoop, MapReduce, Snowflake, Apache Pig, Python, SSRS, Tableau
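A minimal sketch of the Boto3 Lambda that deregisters unused AMIs, as referenced above; the region list and the tag-based "retired" convention are illustrative assumptions:

```python
# Minimal Boto3 sketch of a Lambda that deregisters unused AMIs;
# the tag-based "retired" check and region list are simplifying assumptions.
import boto3

REGIONS = ["us-east-1", "us-west-2"]  # hypothetical application regions

def lambda_handler(event, context):
    deregistered = []
    for region in REGIONS:
        ec2 = boto3.client("ec2", region_name=region)
        # AMIs owned by this account and explicitly tagged as retired (assumed convention).
        images = ec2.describe_images(
            Owners=["self"],
            Filters=[{"Name": "tag:Status", "Values": ["retired"]}],
        )["Images"]
        # AMIs still referenced by existing instances are left untouched.
        in_use = {
            inst["ImageId"]
            for res in ec2.describe_instances()["Reservations"]
            for inst in res["Instances"]
        }
        for image in images:
            if image["ImageId"] not in in_use:
                ec2.deregister_image(ImageId=image["ImageId"])
                deregistered.append(image["ImageId"])
    return {"deregistered": deregistered}
```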
Client: DCB Bank, Mumbai, India Mar 2021 - Jul 2022
Role: GCP Data Engineer
Description: DCB Bank (Development Credit Bank) is a private sector scheduled commercial bank in India, with a rich history and a growing presence in the banking industry. Contributed to building and optimizing data pipelines, managing cloud migrations, and ensuring high availability of data solutions while collaborating across teams for effective reporting and analysis.
Responsibilities:
Experience administering and maintaining source control systems, including branching and merging strategies, with solutions such as Git (Bitbucket/GitLab) or Subversion.
Strong knowledge of data preparation, data wrangling, and data visualization.
Solid understanding of dimensional and relational concepts such as star-schema modeling, snowflake-schema modeling, and fact and dimension tables. Experience with IaC software (Terraform) for GCP modules.
Excellent communication and interpersonal skills and capable of learning new technologies very quickly.
Experience migrating platform code such as Spark onto appropriate GCP resources and building reliable data pipelines; highly experienced in developing data marts and warehousing designs.
Keen to stay current with the newer technology stack that Google Cloud Platform (GCP) adds.
Experience providing highly available and fault-tolerant applications using orchestration technologies such as Kubernetes on Google Cloud Platform. Examined and evaluated reporting requirements for various business units.
Experience handling the Python and Spark contexts when writing PySpark programs for ETL (a PySpark ETL sketch follows this section).
Experience with JIRA and Confluence, working in sprints using Agile methodology.
Experience using Sqoop to import and export data between RDBMS and HDFS/Hive.
Diverse experience in all phases of the software development life cycle (SDLC), especially analysis, design, development, testing, and deployment of applications.
Experience developing various reports and dashboards using visualizations in Power BI and Tableau.
Expert in developing SSIS packages to extract, transform, and load (ETL) data into data warehouses/data marts from heterogeneous sources.
Worked on migrating data from legacy data warehouses to the cloud (Snowflake and AWS) and developing the supporting infrastructure.
Strong knowledge of data preparation, data modeling, and data visualization using Power BI, with experience developing various analysis services using queries.
Utilized Kubernetes and Docker as the runtime environment for CI/CD system builds. Knowledge of various HDFS file formats such as Avro, ORC, CSV, and Parquet.
Environment: Git, data modeling, GCP, Spark, Kubernetes, Python, PySpark, ETL, JIRA, Sqoop, RDBMS, HDFS, Hive, Power BI, Tableau, Avro, ORC, CSV, Parquet, Snowflake, AWS, Docker
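A minimal sketch of the PySpark-on-GCP ETL pattern referenced above (one SparkSession/context per job, curated output written back to GCS); the bucket paths and columns are hypothetical and the GCS connector is assumed to be configured on the cluster:

```python
# Minimal PySpark-on-GCP ETL sketch; buckets, columns, and job name are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

def get_spark(app_name: str) -> SparkSession:
    # Create (or reuse) a single SparkSession; spark.sparkContext is derived from it,
    # so the Python and Spark contexts are managed in one place.
    return SparkSession.builder.appName(app_name).getOrCreate()

def run_etl():
    spark = get_spark("gcs-curation-job")
    # Extract: raw CSV landed in a GCS bucket.
    raw = spark.read.option("header", True).csv("gs://raw-landing-bucket/accounts/*.csv")
    # Transform: basic de-duplication, typing, and filtering.
    curated = (
        raw.dropDuplicates(["account_id"])
           .withColumn("opened_date", F.to_date("opened_date", "yyyy-MM-dd"))
           .filter(F.col("status").isNotNull())
    )
    # Load: write partitioned Parquet back to the curated zone.
    (curated.write.mode("overwrite")
            .partitionBy("status")
            .parquet("gs://curated-zone-bucket/accounts/"))
    spark.stop()

if __name__ == "__main__":
    run_etl()
```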
Client: New India Assurance, Mumbai, India Mar 2019 - Feb 2021
Role: Data Engineer
Description: New India Assurance Co. Ltd. is one of the largest general insurance companies in India. Involved in automating data pipelines, deploying microservices, and providing real-time insights through effective data processing and integration solutions.
Responsibilities:
Used Django Evolution and manual SQL modifications to modify Django models while retaining all data with the site in production mode; presented the project to faculty and industry experts, showcasing the pipeline's effectiveness in providing real-time insights for marketing and brand management.
Created job flows in Airflow using Python and automated the jobs; Airflow runs on a separate stack for developing DAGs and submits jobs to an EMR or EC2 cluster (a DAG sketch follows this section).
Stored different configs in the NoSQL database MongoDB and manipulated them using PyMongo.
Monitored and scheduled pipelines using triggers in Azure Data Factory. Staged API and Kafka data (in JSON format) into Snowflake by flattening it for the different functional services.
Used a continuous delivery pipeline to deploy microservices, including provisioning Azure environments, and developed modules using Python and shell scripting.
Developed JSON definitions for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL activity, and created UNIX shell scripts for database connectivity and parallel query execution.
Consulted leadership and stakeholders to share design recommendations, identify product and technical requirements, resolve technical problems, and suggest big-data-based analytical solutions.
Experience using different stage types such as Transformer, Aggregator, Merge, Join, Lookup, Sort, Remove Duplicates, Funnel, Filter, and Pivot for developing jobs. Developed a fully automated continuous integration system using Git, Jenkins, MySQL, and custom tools developed in Python and Bash.
Environment: Django, SQL, Airflow, EMR, EC2, MongoDB, PyMongo, Python, Azure Data Factory, Kafka, Snowflake, JSON, shell scripting, Aggregator, Merge, Join, Lookup, Sort, Funnel, Filter, Pivot stages, Git, Jenkins, MySQL
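A minimal Airflow DAG sketch of the job-flow automation described above; the DAG id, task names, and callables are hypothetical, and on the EMR stack these Python tasks would typically be swapped for EMR submission operators:

```python
# Minimal Airflow DAG sketch; DAG id, task names, and callables are hypothetical.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_configs(**context):
    # Placeholder: read job configs (e.g. from MongoDB) and share them via XCom.
    context["ti"].xcom_push(key="batch_size", value=5000)

def load_to_snowflake(**context):
    # Placeholder: flatten staged JSON and load it into Snowflake in batches.
    batch_size = context["ti"].xcom_pull(task_ids="extract_configs", key="batch_size")
    print(f"loading with batch_size={batch_size}")

default_args = {"owner": "data-eng", "retries": 1, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="daily_kafka_to_snowflake",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    extract = PythonOperator(task_id="extract_configs", python_callable=extract_configs)
    load = PythonOperator(task_id="load_to_snowflake", python_callable=load_to_snowflake)
    extract >> load
```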
Education:
Master's in Computer Science, Sacred Heart University, Connecticut, USA