Around 7 years of software development experience, including 4+ years with Big Data technologies such as Hadoop and its ecosystem components: Hive, Pig, Sqoop, HBase, NiFi, Kafka, and Spark.
Good exposure to AWS cloud services for big data development.
Experience working with various cloud platforms, including AWS and Azure.
Hands-on experience building scalable applications with AWS services such as Redshift and DynamoDB.
Developed various scalable big data applications in Azure HDInsight for ETL services.
Good hands-on knowledge of the Hadoop ecosystem and its components, such as MapReduce and HDFS.
Worked on installing, configuring, and administering Hadoop clusters for distributions such as HDP and CDP.
Hands-on experience with Microsoft Azure, including Azure Web Apps, App Services, Azure Storage, Azure Key Vault, and Azure Kubernetes Service.
Worked on building pipelines using Snowflake for extensive data aggregations.
Expert in working with the Hive data warehouse: creating tables, distributing data through partitioning and bucketing, and writing and optimizing HiveQL queries.
Experienced in using Spark to work with structured and semi-structured data in HDFS.
Knowledge of UNIX shell scripting.
Good understanding of ETL tools.
Experienced in developing Python and shell scripts to extract, load, and transform data; working knowledge of AWS Redshift.
Working knowledge of GCP tools such as BigQuery, Pub/Sub, Cloud SQL, and Cloud Functions.
Experience visualizing reporting data using tools such as Power BI and Google Analytics.
Experienced with data formats such as JSON, Parquet, Avro, and ORC.
Experience with Python as a scripting language.
Experience in using Apache Sqoop to import and export data.
Experienced in requirement analysis, application development, application migration, and maintenance using the Software Development Lifecycle (SDLC) and Python/Java technologies.
Experience working with NiFi.
Experience using and managing Git for change management and Jenkins as a build server.
Experience with Kafka as a messaging and collection framework.
Experience using streaming technologies.
Strong knowledge of Hadoop cluster installation, capacity planning, performance tuning, benchmarking, disaster recovery planning, and application deployment in production clusters.
Good exposure to databases such as MySQL, SQL Server, and PostgreSQL.
Experience using Agile software development methodologies to deliver solutions.
Automated tasks for Azure infrastructure and resources using CI/CD tools.
Managed Docker orchestration and containerization using Kubernetes.
Implemented Docker and Ansible buildouts; familiar with Docker, Docker image development, and Ansible playbooks.
Comprehensive knowledge of the Software Development Life Cycle, coupled with excellent communication skills.
Experience with the HBase and Apache Phoenix database engines.
Good exposure to PySpark scripts.
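As a minimal sketch of the Python ETL scripting mentioned above (the record layout and field names here are hypothetical, for illustration only):

```python
import csv
import io

def transform_row(row):
    """Normalize a raw record: trim whitespace and cast the amount field."""
    return {
        "id": row["id"].strip(),
        "amount": float(row["amount"]),
    }

def etl(raw_csv_text):
    """Extract rows from CSV text, transform each one, and return the result."""
    reader = csv.DictReader(io.StringIO(raw_csv_text))
    return [transform_row(r) for r in reader]

# Hypothetical input: a header line plus two data rows.
rows = etl("id,amount\n a1 ,10.5\na2,3.0\n")
```

In a real pipeline the same extract/transform/load shape would read from files or a database and write to a target table rather than returning a list.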
Databases: Oracle 11g/10g/9i, SQL Server 2012/2008 R2, Hadoop, Hive, MongoDB, MySQL, Netezza, SAS, U-SQL, Snowflake
Analytics & Reporting: Tableau, SSRS, Cognos, Power BI, Crystal Reports, Databricks
ETL & Cloud Platform: Informatica 10, SQL Server DTS, Visual Studio 2012/2010, Azure Data Lake & Data Factory, Pandas, AWS EC2, S3, Lambda, NumPy, Salesforce CRM, Apex Data Loader, SAP MDM, SSIS, BI Development Studio, Power BI, Performance Monitor, Power Pivot, Spark, Scala, Kafka, DevOps
Machine Learning: Linear Regression, Logistic Regression, LDA, PCA (Principal Component Analysis), K-Means Clustering, K-Nearest Neighbors (KNN), Decision Trees, AdaBoost, Gradient Boosting Trees, Neural Networks
Data Modeling: Erwin, ER Studio, MS Visio
Version Control Systems: SVN, Git, GitHub
Intralot, Sept 2022 – Present
Role: Big Data Engineer
Provided a solution using Hive and Sqoop (to import/export data) for faster data loads, replacing the traditional ETL process with HDFS for loading data into target tables.
Developed complex data cleaning and transformation logic using PySpark on AWS Glue to process unstructured data from S3 into analytics-ready datasets in Redshift.
Created serverless ETL workflows using AWS Glue, the Glue Data Catalog, S3, RDS, CloudWatch, and Lambda.
Developed the Pig UDFs to preprocess the data for analysis.
Implemented a pipeline to load XML data into HDFS using Storm and Flink.
Used Pig Latin and PySpark scripts to extract data from the output files, process it, and load it into HDFS.
Worked on creating custom datasets for downstream reporting.
Implemented partitioning, dynamic partitions, and bucketing in Hive.
Used Kafka as the messaging framework.
Implemented configuration and optimization techniques for Redshift clusters to maximize data processing performance and streamline query execution, resulting in high-performance data analytics capabilities.
Implemented AWS Athena for ad-hoc data analysis and querying of data stored in AWS S3.
Used Kafka in combination with Apache Storm and Hive for real-time analysis of streaming data.
Utilized AWS CloudWatch to monitor and manage resources, configure alarms, and gather metrics.
Configured Spark Streaming to receive real-time data from Kafka and store the streamed data in HDFS.
Created Databricks notebooks using SQL and Python.
Used data formats such as ORC, Avro, and Parquet.
Delivered the solution using Agile Methodology.
Used Spark for fast processing of data in Hive and HDFS.
Used Spark SQL for structured data processing via the DataFrame and Dataset APIs.
Developed multiple POCs using Spark and Scala, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL.
Used Kafka in conjunction with ZooKeeper for deployment management, monitoring ZooKeeper metrics alongside the Kafka clusters.
Developed Hive queries to process the data and generate the results in a tabular format.
Wrote Hive queries for data analysis to meet business requirements.
Used NiFi to expose data through a RESTful API.
Used HBase to load huge datasets and maintain Change Data Capture.
Worked on upgrading the cluster from HDP 2.6.5 to HDP 3.1.2 and from HDP 2.6.5 to CDP 7.1.2.
Good exposure to Hive 3 and the Hive Warehouse Connector (HWC) for Spark.
Worked on migration activities, migrating Pig scripts and Hive-on-MapReduce jobs to PySpark and Hive.
Worked with Data Science teams to modularize the PySpark code and provided support after production deployments.
Good exposure to data encryption on HDFS using Ranger KMS.
Worked on tag sync using Ranger and Atlas to capture metadata and PII information for data governance needs.
Environment: Hadoop 2, Hadoop 3, Java, Python, HDFS, Pig, Sqoop, HBase, Hive 2, Hive 3, Spark, Oozie, NiFi, Storm, shell scripting, Linux, and RDBMS.
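The Hive partitioning and bucketing used above distributes rows by hashing the clustering key modulo the bucket count (e.g., CLUSTERED BY (user_id) INTO 4 BUCKETS). A minimal pure-Python sketch of that distribution scheme (the key values are hypothetical, and CRC32 stands in for Hive's own hash function):

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # analogous to ... INTO 4 BUCKETS in the table DDL

def bucket_for(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    # Deterministic hash of the clustering key, modulo the bucket count.
    # (Hive uses its own hash; CRC32 here just illustrates the scheme.)
    return zlib.crc32(key.encode("utf-8")) % num_buckets

def distribute(rows, key_field="user_id"):
    """Group rows into buckets the way a bucketed Hive table distributes data."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_for(row[key_field])].append(row)
    return dict(buckets)

# Eight hypothetical rows spread across the four buckets.
rows = [{"user_id": f"u{i}"} for i in range(8)]
buckets = distribute(rows)
```

Because rows with the same key always hash to the same bucket, bucketed tables enable efficient sampling and bucket-wise map-side joins.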
Zoom Video Communications Inc., Dec 2021 – Aug 2022
Role: Big Data Engineer
Developed Spark jobs using Scala for faster real-time analytics and used Spark SQL for querying.
Developed storytelling dashboards in Tableau Desktop and published them to Tableau Server, allowing end users to understand the data on the fly using quick filters for on-demand information.
Managed Azure Active Directory and created new groups for users.
Created Azure Data Factory pipelines for applying transformations using Databricks Spark and then finally moved/loaded the transformed data into the Curated Data Model.
Configured a NoSQL database (Azure Cosmos DB) in an application for storing and fetching client-related data.
Used Snowpipe for continuous data ingestion from Azure Blob Storage.
Integrated data storage solutions with Spark, particularly with Azure Data Lake storage and Blob storage.
Used Git for version control with the Data Engineering team and Data Scientist colleagues.
Created Tableau dashboards using stacked bars, bar graphs, scatter plots, geographical maps, Gantt charts, etc., via the Show Me functionality, and built dashboards and stories as needed using Tableau Desktop and Tableau Server.
Created reusable ADF pipelines to extract data from HDFS, land it in Azure Data Lake Storage, and create Delta tables.
Created ARM (Azure Resource Manager) templates that deploy Azure resources across environments (Development, Testing, Staging, and Production).
Created Azure Data Factory (ADF) Streaming pipelines to consume data from Azure Event Hub using Spark and load into Azure Data Lake Storage (ADLS).
Used Oozie scripts for application deployment and Perforce as the secure versioning software.
Implemented partitioning, dynamic partitions, and buckets in Hive.
Responsibilities include gathering business requirements, developing a strategy for data cleansing and data migration, writing functional and technical specifications, creating source-to-target mapping, designing data profiling and data validation jobs in Informatica, and creating ETL jobs in Informatica.
The new Business Data Warehouse (BDW) improved query/report performance, reduced the time needed to develop reports, and established a self-service reporting model in Cognos for business users.
Performed statistical analysis using SQL, Python, R, and Excel.
Imported, cleaned, filtered, and analyzed data using tools such as SQL, Hive, and Pig.
Created session beans and controller servlets for handling HTTP requests from Talend.
Responsible for data extraction and data ingestion from different data sources into Hadoop Data Lake by creating ETL pipelines using Pig, and Hive.
Worked on AWS services such as SNS to send automated emails and messages via Boto3 after the nightly run.
Implemented multiple data quality checks (completeness, uniqueness, consistency) in Infogix.
Built APIs that allow customer service representatives to access the data and answer queries.
Developed several advanced MapReduce programs in Java as part of functional requirements for Big Data.
Designed changes to transform current Hadoop jobs to HBase.
Expertise in writing Hadoop jobs to analyze data using HiveQL queries, Pig Latin (a data flow language), and custom MapReduce programs in Java.
Expert in creating Hive UDFs in Java to analyze data efficiently.
Designed and Developed data mapping procedures ETL-Data Extraction, Data Analysis, and Loading process for integrating data using R programming.
Implemented CloudTrail to capture events related to API calls made to the AWS infrastructure.
Worked on developing tools that automate AWS server provisioning and application deployments, and implemented basic failover across regions through the AWS SDKs.
Wrote data ingestion systems to pull data from traditional RDBMS platforms such as Oracle and Teradata and store it in NoSQL databases such as Cassandra.
Involved in loading and transforming large sets of structured, semi-structured, and unstructured data, analyzing them by running Hive queries.
Processed image data through the Hadoop distributed system using Map and Reduce, then stored it in HDFS.
Rise Ahead, July 2018 – May 2021
Role: Data Engineer
Designed and implemented data transfer from and to Hadoop and AWS.
Configured Flume to ingest near-real-time data from various sources into an HTTP server, a stage in the StreamSets pipeline.
Analyzed and visualized near-real-time data in Impala and Solr through the Hue and Banana UIs to prepare reports.
Deployed a Spark/PySpark solution for ingesting, transforming, and wrangling data and applying business logic to data flowing from HDFS to Solr.
Acted as a lead resource and built the entire Hadoop platform from scratch.
Extracted the needed data from the server into HDFS and bulk-loaded the cleaned data into HBase.
Took a lead role in NoSQL column-family design, client access software, and HBase tuning during the migration from Oracle-based data stores.
Used AWS S3 to store large amounts of data in identical/similar repositories.
Designed, implemented, and deployed within a customer’s existing Hadoop/HBase cluster a series of custom parallel algorithms for various customer-defined metrics and unsupervised learning models.
Using the Spark/PySpark framework, enhanced and optimized product code to aggregate, group, and run data mining tasks.
Deployed an Apache Solr/Lucene search engine server to help speed up the search of financial documents.
Developed Hive queries for the analysts.
Created an e-mail notification service upon completion of the job for the team that requested the data.
Defined job workflows as per their dependencies in Oozie.
Delivered a POC of Flume to handle real-time log processing for attribution reports.
Environment: Apache Hadoop, HDFS, AWS, Spark/PySpark, Solr, Hive, DataStax HBase, Java, Flume, Cloudera CDH4, Oozie, Oracle, MySQL
Linkwell Tele Systems, Jan 2017 – June 2018
Role: Hadoop Developer/Data Engineer
Imported and exported data between HDFS and Oracle Database using Sqoop.
Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs.
Designed and implemented a MapReduce-based large-scale parallel relation-learning system.
Set up and benchmarked Hadoop/HBase clusters for internal use.
Involved in the review of functional and non-functional requirements.
Made extensive use of expressions, variables, and Row Count in SSIS packages.
Created SSIS packages to pull data from SQL Server and export it to Excel spreadsheets, and vice versa.
Loaded data from various sources, such as OLE DB and flat files, into the SQL Server database using SSIS packages, creating data mappings to load the data from source to destination.
Wrote MapReduce jobs using Pig Latin; involved in ETL, data integration, and migration.
Created Hive tables and worked on them using HiveQL; experienced in defining job flows.
Created batch jobs and configuration files to create automated processes using SSIS.
Designed and implemented data transfer from and to Hadoop and AWS.
Deployed and scheduled SSRS reports to generate daily, weekly, monthly, and quarterly reports.
Involved in creating Hive tables, loading the data, and writing Hive queries that run internally as MapReduce jobs.
Developed a custom file system plugin for Hadoop so it can access files on the Data Platform; the plugin allows Hadoop MapReduce programs, HBase, Pig, and Hive to work unmodified and access files directly.
Environment: Hadoop, MapReduce, AWS, Amazon S3, Pig, SQL Server, Hive, HBase, SSIS, SSRS, Report Builder, MS Office, Excel, flat files, T-SQL.