Data Engineer, AWS, Azure

Location:
Toronto, ON, Canada
Posted:
October 16, 2024

Resume:

Sr. Data Engineer

Name: Krishna K.

Email: *****************@*****.***

Ph No: 368-***-****

PROFESSIONAL SUMMARY

Dynamic IT professional with around 8 years of experience as a Sr. Data Engineer. Expertise in designing data-intensive applications on the Hadoop ecosystem, specializing in big data analytics, cloud data engineering, data warehousing, and data visualization. Proven track record of developing reporting solutions and ensuring data quality. Motivated to leverage technical skills to deliver impactful data solutions.

In-depth knowledge of Hadoop architecture and its components, including YARN, HDFS, NameNode, DataNode, JobTracker, ApplicationMaster, ResourceManager, TaskTracker, and the MapReduce programming paradigm.

Extensive experience in Hadoop-based development of enterprise-level solutions utilizing components such as Apache Spark, MapReduce, HDFS, Sqoop, Pig, Hive, HBase, Oozie, Flume, NiFi, Kafka, ZooKeeper, and YARN.

Implemented Spark Streaming with Kafka to pick up data from Kafka topics and feed it into Spark pipelines.

Created Kafka producer API to send live stream data into various Kafka topics.

Integrated Apache Spark with Kafka to perform web analytics, loading clickstream data from Kafka into HDFS, HBase, and Hive via Spark.
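
A minimal PySpark Structured Streaming sketch of this pattern, with the broker, topic, schema, and HDFS paths as illustrative assumptions:

# Consume clickstream events from Kafka and land them on HDFS as Parquet.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("clickstream-ingest").getOrCreate()

schema = StructType([
    StructField("user_id", StringType()),
    StructField("url", StringType()),
    StructField("event_time", TimestampType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   # assumed broker
    .option("subscribe", "clickstream")                   # assumed topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

query = (
    events.writeStream.format("parquet")
    .option("path", "hdfs:///data/clickstream")            # assumed landing path
    .option("checkpointLocation", "hdfs:///chk/clickstream")
    .start()
)
query.awaitTermination()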

Profound experience in data ingestion and data processing (transformations, enrichment, and aggregations). Strong knowledge of distributed-systems architecture and parallel processing, with an in-depth understanding of the MapReduce programming paradigm and the Spark execution framework.

Experienced in using Spark to improve the performance and optimization of existing algorithms in Hadoop via SparkContext, Spark SQL, the DataFrame API, Spark Streaming, MLlib, and pair RDDs; worked extensively with PySpark and Scala.

Handled ingestion of data from different sources into HDFS using Sqoop and Flume and performed transformations using Hive and MapReduce. Managed Sqoop jobs with incremental loads to populate Hive external tables. Experienced in importing streaming data into HDFS using Flume sources and sinks and transforming the data with Flume interceptors.

Strong experience in migrating other databases to Snowflake.

Experience with the Snowflake cloud data warehouse and AWS S3 buckets for integrating data from multiple source systems, including loading nested JSON-formatted data into Snowflake tables.
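
A minimal sketch of loading staged nested JSON into a Snowflake VARIANT column with the Python connector; the connection parameters, stage, and table names are assumptions:

# Load nested JSON files staged in S3 into a Snowflake table.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="etl_user", password="***",   # assumed credentials
    warehouse="ETL_WH", database="RAW", schema="PUBLIC",
)
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS raw_events (payload VARIANT)")
cur.execute("""
    COPY INTO raw_events
    FROM @s3_events_stage              -- external stage pointing at the S3 bucket
    FILE_FORMAT = (TYPE = 'JSON' STRIP_OUTER_ARRAY = TRUE)
""")
# Nested attributes can then be unpacked downstream with LATERAL FLATTEN in SQL.
cur.close()
conn.close()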

Designed dimensional models, data lake architecture, and Data Vault 2.0 on Snowflake, and used the Snowflake logical data warehouse for compute.

Built mappings with Update Strategy, Aggregator, Expression, and Joiner transformations and loaded the results into the data warehouse using Informatica BDM 10.2.2.

Experience using various packages such as pandas, NumPy, seaborn, SciPy, matplotlib, PyQt, and pytest in Python, and RCurl in R, for processing large datasets.

Extensive experience in migrating data from legacy systems into the cloud with AWS, Talend and Snowflake.

Experience with partitioning and bucketing concepts in Hive and with designing both managed and external Hive tables to optimize performance. Experience with file formats such as Avro, Parquet, ORC, JSON, and XML.
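
An illustrative PySpark snippet for creating a partitioned, bucketed managed table; the table, column, and path names are assumptions:

# Create a partitioned, bucketed Hive table to speed up pruning and joins.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
orders = spark.read.parquet("hdfs:///staging/orders")   # assumed source path

(orders.write
    .partitionBy("order_date")        # partition pruning on date predicates
    .bucketBy(32, "customer_id")      # co-locate rows for bucketed joins
    .sortBy("customer_id")
    .format("parquet")
    .mode("overwrite")
    .saveAsTable("analytics.orders"))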

Expertise in creating, debugging, scheduling, and monitoring jobs using Airflow and Oozie. Experienced with the most common Airflow operators, including PythonOperator, BashOperator, GoogleCloudStorageDownloadOperator, GoogleCloudStorageObjectSensor, and GoogleCloudStorageToS3Operator.
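
A small hedged Airflow DAG sketch showing PythonOperator and BashOperator wiring; the DAG id, schedule, and task logic are placeholders:

# Minimal two-task DAG: a Python extract step followed by a Bash load step.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def extract():
    print("pulling source files")   # placeholder for the real extract logic

with DAG(
    dag_id="daily_ingest",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    pull = PythonOperator(task_id="extract", python_callable=extract)
    load = BashOperator(task_id="load", bash_command="echo load step")
    pull >> load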

Followed Agile development methodology and managed project planning through JIRA and Confluence.

TECHNICAL SKILLS:

Hadoop Ecosystem

HDFS, SQL, YARN, MapReduce, Hive, Sqoop, Spark, ZooKeeper, Kafka, Storm, Flume, OpenShift, Gitflow, and Jenkins.

Programming Languages

Python, PySpark, Spark with Scala, JavaScript, Shell Scripting, Perl.

Big Data Platforms

Hortonworks, Cloudera

Cloud Platforms (GCP & AWS)

Google Cloud Platform (GCP): BigQuery, Dataproc, Pub/Sub, Cloud Dataflow, GKE; Amazon Web Services (AWS): EKS, EC2, S3, EMR, Kinesis, Athena, Glue, Step Functions, Lambda, Redshift, SNS, SQS, EBS, VPC, IAM; Microsoft Azure HDInsight; Azure Data Lake Storage; Snowflake

Cloud Platforms (Azure)

Azure (Azure Data Factory, Azure Databricks, Azure Synapse Analytics, Azure Data Lake, Azure SQL DB, Azure Blob Storage)

Operating Systems

Linux, Windows, UNIX

Databases

Netezza, MySQL, UDB, HBase, MongoDB, Cassandra, Snowflake, SSIS

Development Methods

Agile/Scrum, Waterfall

IDE’s

PyCharm, IntelliJ, Jupyter Notebook

EDUCATION:

Bachelor of Technology at Jawaharlal Nehru Technological University

PROFESSIONAL EXPERIENCE

Client: Carfax, London, ON Mar 2023 – Present

Data Engineer (AWS, Python)

Responsibilities:

Designed and set up an enterprise data lake to support various use cases, including analytics, processing, storage, and reporting of voluminous, rapidly changing data.

Responsible for maintaining quality reference data at the source by performing cleaning and transformation operations and ensuring integrity in a relational environment, working closely with stakeholders and the solution architect.

Designed and implemented data pipelines for data processing using Perl and GCS.

Designed and developed a security framework to provide fine-grained access to objects in AWS S3 using AWS Lambda and DynamoDB.
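
A hypothetical sketch of the idea: a Lambda handler that consults a DynamoDB permissions table before issuing a short-lived presigned URL for an S3 object (table, bucket, and attribute names are assumptions):

# Fine-grained S3 access: deny unless the (user, object) pair exists in DynamoDB.
import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")
PERMISSIONS_TABLE = "object_permissions"   # assumed table name
BUCKET = "secure-data-lake"                # assumed bucket name

def handler(event, context):
    user = event["requestContext"]["authorizer"]["principalId"]
    key = event["queryStringParameters"]["key"]

    item = dynamodb.Table(PERMISSIONS_TABLE).get_item(
        Key={"user_id": user, "object_key": key}
    ).get("Item")
    if not item:
        return {"statusCode": 403, "body": "access denied"}

    url = s3.generate_presigned_url(
        "get_object", Params={"Bucket": BUCKET, "Key": key}, ExpiresIn=300
    )
    return {"statusCode": 200, "body": url}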

Working knowledge of repository tools such as Git, Bitbucket, and GitHub.

Set up and worked on Kerberos authentication principals to establish secure network communication on the cluster and tested HDFS, Hive, Pig, and MapReduce cluster access for new users.

Extensively worked on the AWS CDK (Cloud Development Kit), deploying infrastructure ranging from a static website to complex, multi-stack applications spanning multiple AWS accounts and regions using Python/TypeScript.
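
A minimal AWS CDK v2 sketch in Python for the static-website case; the stack, bucket, and asset folder names are illustrative:

# Provision an S3 bucket configured for static website hosting and deploy assets.
from aws_cdk import App, RemovalPolicy, Stack
from aws_cdk import aws_s3 as s3
from aws_cdk import aws_s3_deployment as s3deploy
from constructs import Construct

class StaticSiteStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        bucket = s3.Bucket(
            self, "SiteBucket",
            website_index_document="index.html",
            removal_policy=RemovalPolicy.DESTROY,
        )
        s3deploy.BucketDeployment(
            self, "DeploySite",
            sources=[s3deploy.Source.asset("./site")],   # assumed local asset folder
            destination_bucket=bucket,
        )

app = App()
StaticSiteStack(app, "StaticSiteStack")
app.synth()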

Performed end-to-end architecture and implementation assessments of various AWS services such as Amazon EMR, Redshift, and S3.

Identified the test scope for each release and committed the stories for that release.

Implemented Informatica BDM mappings for extracting data from the DWH to the data lake.

Understood customer demand story design and documented acceptance criteria and test plans.

Involved in designing and reviewing test cases and test scenarios to confirm that business acceptance criteria were fully met.

Created a Python framework for AWS cloud automation, multiprocessing end-of-day and intraday uploads and extracts using the AWS SDK and CDK.
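
A rough sketch of the multiprocessing upload idea using boto3; the bucket name, local directory, and worker count are assumptions:

# Upload end-of-day extract files to S3 in parallel.
from multiprocessing import Pool
from pathlib import Path
import boto3

BUCKET = "eod-extracts"          # assumed bucket name

def upload(path: str) -> str:
    # Each worker creates its own client; boto3 clients should not be shared across forks.
    boto3.client("s3").upload_file(path, BUCKET, f"daily/{Path(path).name}")
    return path

if __name__ == "__main__":
    files = [str(p) for p in Path("/data/eod").glob("*.csv")]   # assumed local dir
    with Pool(processes=8) as pool:
        for done in pool.imap_unordered(upload, files):
            print("uploaded", done)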

Performed sanity checks and verified the systems prior to starting any testing.

Tested MapReduce programs that parse raw data, populate staging tables, and store the data in HDFS.

Involved in extensive data validation using SQL queries and back-end testing.

Used AWS Glue for data transformation, validation, and cleansing.
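
An illustrative Glue job skeleton for this kind of transformation and cleansing; the catalog database, table, column mappings, and output path are assumptions:

# Read from the Glue Data Catalog, drop nulls, remap columns, write Parquet to S3.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping, DropNullFields
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

raw = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="orders"            # assumed catalog entries
)
cleaned = DropNullFields.apply(frame=raw)
mapped = ApplyMapping.apply(
    frame=cleaned,
    mappings=[("order_id", "string", "order_id", "string"),
              ("amount", "double", "amount", "double")],
)
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://curated-bucket/orders/"},   # assumed output
    format="parquet",
)
job.commit()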

Implemented a Python/Java-based distributed random forest via PySpark and MLlib.
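
A minimal sketch of a distributed random forest with the Spark ML DataFrame API, assuming illustrative input paths and feature columns:

# Assemble features and train a random forest classifier on Spark.
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rf-train").getOrCreate()
df = spark.read.parquet("s3://ml-bucket/training/")      # assumed training data

assembler = VectorAssembler(
    inputCols=["feature_a", "feature_b", "feature_c"], outputCol="features"
)
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=100)
model = Pipeline(stages=[assembler, rf]).fit(df)
model.write().overwrite().save("s3://ml-bucket/models/rf")   # assumed model path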

Created Databricks notebooks with Delta-format tables and implemented a lakehouse architecture.

Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores.

Fixed Kafka- and ZooKeeper-related production issues across multiple clusters.

Experience in configuring Kafka producer and consumer microservices to stream the data to and from Kafka topics.

Worked with Python packages including NumPy, SciPy, pandas, matplotlib, and statistics packages to perform dataset manipulation, data mapping, data cleansing, and feature engineering. Built and analyzed datasets using R and Python.

Imported and exported databases using SQL Server Integration Services (SSIS) and Data Transformation Services (DTS) packages.

Built an end-to-end ETL pipeline from AWS S3 to the key-value store DynamoDB and to the Snowflake data warehouse for analytical queries on cloud data.
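
A hedged sketch of the S3-to-DynamoDB leg of such a pipeline using boto3; the bucket, key, and table names are assumptions:

# Read a JSON-lines extract from S3 and batch-write the records into DynamoDB.
import json
import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("customer_events")   # assumed table

body = s3.get_object(Bucket="etl-landing", Key="events/2024-10-01.jsonl")["Body"]
with table.batch_writer() as batch:
    for line in body.iter_lines():
        record = json.loads(line)
        batch.put_item(Item=record)   # each record must contain the table's key attributes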

Designed, Developed, and implemented ETL processes using IICS Data integration.

Created IICS connections for various cloud connectors in IICS administrator.

Developed, implemented, and enhanced the data warehouse data model and performed data loads using Oracle Data Integrator.

Worked on deployment to AWS EC2 instances with Postgres RDS and S3 file storage.

Experience in configuring and onboarding Axon and EDC, including business glossaries, dashboards, search, Axon maps, policies, and databases.

Developed Spark Python models for machine learning and predictive analytics in Hadoop.

Coded Teradata BTEQ scripts to load and transform data and fix defects such as SCD Type 2 date chaining and duplicate cleanup.

Developed reusable framework to be leveraged for future migrations that automates ETL from RDBMS systems to the Data Lake utilizing Spark Data Sources and Hive data objects.

Involved in setting up Hadoop configuration (Hadoop cluster, Hive connection) using Informatica BDM.

Conducted Data blending, Data preparation using Alteryx and SQL for Tableau consumption and publishing data sources to Tableau server.

Developed Kibana dashboards based on Logstash data and integrated different source and target systems into Elasticsearch for near-real-time log analysis and end-to-end transaction monitoring.

Implemented AWS Step Functions to automate and orchestrate Amazon SageMaker tasks such as publishing data to S3, training the ML model, and deploying it for prediction.

Backed up AWS Postgres to S3 via a daily job run on EMR using DataFrames.

Worked with scheduling tools such as Autosys, Control-M, Zena, and TAC.

Worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially implemented in Python (PySpark).

Integrated Apache Airflow with AWS to monitor multi-stage ML workflows with the tasks running on Amazon SageMaker.

Environment: AWS EMR, S3, RDS, Redshift, Lambda, Boto3, DynamoDB, Amazon SageMaker, Apache Spark, HBase, Apache Kafka, Hive, Sqoop, MapReduce, Snowflake, Kubernetes, Apache Pig, Python, SSRS, Tableau, Zena

Client: Capital One, Toronto, ON Aug 2022 – Feb 2023

Azure Data Engineer.

Responsibilities:

Highly involved in data architecture and application design using cloud and big data solutions on AWS and Microsoft Azure.

Led the effort to migrate legacy systems to a Microsoft Azure cloud-based solution, re-designing legacy application solutions with minimal changes to run on the cloud platform.

Designed various Jenkins jobs to continuously integrate the processes and executed CI/CD pipeline using Jenkins.

Built data pipelines using Azure services such as Data Factory to load data from legacy SQL Server into Azure databases, using Data Factory, API gateway services, SSIS packages, and custom .NET and Python code.

Worked on the Informatica PowerCenter tool: Source Analyzer, Data Warehousing Designer, Mapping & Mapplet Designer, and Transformation Designer.

Configured documents that allow Airflow to communicate with its PostgreSQL database.

Developed various machine learning models such as logistic regression, KNN, and gradient boosting with pandas, NumPy, seaborn, matplotlib, and scikit-learn in Python.

Built various pipelines to integrate Azure with AWS S3 and bring the data into an Azure database.

Built Azure WebJobs for product management teams to connect to different APIs and sources, extract the data, and load it into Azure Data Warehouse using Azure WebJobs and Functions.

Good experience working with Azure Blob and Data Lake storage and loading data into Azure Synapse Analytics (DW).

Used Apache Airflow in a GCP Composer environment to build data pipelines, using various Airflow operators such as the Bash operator, Hadoop operators, Python callables, and branching operators.

Worked on migration of data from on-prem SQL Server to cloud databases (Azure Synapse Analytics (DW) and Azure SQL DB).

Implemented a CI/CD pipeline with Docker, Jenkins (TFS plugin installed), Team Foundation Server (TFS), GitHub, and Azure Container Service: whenever a new TFS/GitHub branch is started, Jenkins, our continuous integration (CI) server, automatically attempts to build a new Docker container from it.

Set up Hadoop and Spark clusters for various POCs, specifically to load cookie-level data and real-time streams, integrating with other ecosystem components such as Hive, HBase, Spark, and HDFS/Data Lake/Blob Storage.

Set up a Spark cluster to process more than 2 TB of data and load it into SQL Server. In addition, built various Spark jobs to run data transformations and actions.
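
A sketch of bulk-writing a processed DataFrame into SQL Server over JDBC; the connection string, table, and credentials are placeholders, and the Microsoft JDBC driver is assumed to be on the classpath:

# Write curated output to SQL Server after heavy Spark processing.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bulk-load").getOrCreate()
df = spark.read.parquet("hdfs:///curated/events")        # assumed curated output

(df.write.format("jdbc")
    .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=analytics")
    .option("dbtable", "dbo.events")
    .option("user", "etl_user")
    .option("password", "***")
    .option("batchsize", 10000)       # larger batches reduce round trips
    .mode("append")
    .save())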

Converted SAS code to Python/Spark-based jobs in Cloud Dataproc/BigQuery on GCP.

Good experience logging defects in Jira and MS Azure DevOps.

Developed a Python script to integrate DDL changes between the on-prem Talend warehouse and Snowflake.

Implemented Azure Data Factory (ADF) extensively for ingesting data from different source systems, both relational and unstructured, to meet business functional requirements.

Wrote different APIs to connect with various media data feeds such as Prisma, DoubleClick, Twitter, Facebook, and Instagram to retrieve data using Azure WebJobs and Functions integration.

Built a trigger-based mechanism to reduce the cost of resources such as WebJobs and Data Factories using Azure Logic Apps and Functions.

Used Oracle Data Integrator (ODI) Designer to develop processes for extracting, cleansing, transforming, integrating, and loading data into the data warehouse database.

Worked on a PDF parser using Java and Python.

Predominantly worked on Azure Synapse Analytics to design end-to-end pipelines and Spark applications using notebooks, and loaded data into dedicated SQL pools.

Strong understanding of data modeling (relational and dimensional), data analysis, and data warehousing implementations on Windows and UNIX.

Extensively worked on relational databases such as PostgreSQL as well as MPP databases like Redshift.

Experience in custom transformation process design via Azure Data Factory and automation pipelines. Extensively used Azure services such as Azure Data Factory and Logic Apps for ETL, pushing data between databases, Blob Storage, HDInsight HDFS, and Hive tables.

Environment: Hadoop (HDFS, MapReduce), Databricks, Spark, Talend, Impala, Hive, PostgreSQL, Jenkins, NiFi, Scala, MongoDB, Cassandra, Python, Pig, Sqoop, Zena, Hibernate, Spring, Oozie, S3, Auto Scaling, Azure, Elasticsearch, DynamoDB, UNIX shell scripting, Tez.

Client: LTI, Toronto, ON Jun 2021 – Jul 2022

Data Engineer

Responsibilities:

Processed web server logs by developing multi-hop Flume agents using the Avro sink and loaded them into MongoDB for further analysis; also extracted files from MongoDB through Flume and processed them.

Expert knowledge of MongoDB, NoSQL data modeling, tuning, and disaster recovery backups; used it for distributed storage and processing via CRUD operations.

Extracted and restructured the data into MongoDB using import and export command line utility tool.

Experience in setting up fan-out workflows in Flume and designing a V-shaped architecture to take data from many sources and ingest it into a single sink.

Experience in creating, dropping, and altering tables at run time without blocking updates and queries, using HBase and Hive.

Created many PySpark and Spark SQL scripts in Synapse notebooks to perform data transformations per the given business requirements.

Assisted in developing test plans based on the test strategy. Created and executed test cases based on the test strategy and ETL mapping documents.

Wrote complex SQL queries against different databases for the data verification process.

Performed UAT testing

Wrote test cases in HP ALM. Defects identified in the testing environment were communicated to the developers using the defect-tracking tool HP ALM.

Worked with business team to test the reports developed in Cognos

Implemented performance tuning on the Spark applications in Synapse notebooks, improving overall performance to roughly five times that of the original jobs.
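
An illustrative tuning pattern of the kind applied here, not the exact change set; table and column names are assumptions:

# Broadcast the small dimension, repartition on the join key, and cache a
# reused intermediate to cut shuffle and recomputation cost.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", "200")    # right-size the shuffle

facts = spark.table("sales_fact").repartition(200, "product_id")
dims = spark.table("product_dim")                         # small lookup table

enriched = facts.join(broadcast(dims), "product_id").cache()
enriched.count()                      # materialize once, reuse below
enriched.groupBy("category").sum("amount") \
    .write.mode("overwrite").saveAsTable("sales_by_category")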

Effectively communicated testing activities and findings in oral and written formats.

Extensively tested several Cognos reports for data quality, fonts, headers, and cosmetic issues.

Experience working with different join patterns; implemented both map-side and reduce-side joins.

Worked with Tableau and Integrated Hive, Tableau Desktop reports and published to Tableau Server.

Developed MapReduce programs in Java to parse raw data and populate staging tables.

Experience in setting up the whole application stack and in setting up and debugging Logstash to send Apache logs to AWS Elasticsearch.

Evaluated Snowflake design considerations for any change in the application.

Built the logical and physical data models for Snowflake per the required changes.

Worked on Oracle databases, Snowflake, and Redshift.

Defined virtual warehouse sizing in Snowflake for different types of workloads.

Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.

Analyzed the SQL scripts and designed the solution to implement using Scala.

Used Spark SQL to load JSON data, create a schema RDD, and load it into Hive tables, handling structured data with Spark SQL.
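
A minimal sketch of this flow, assuming illustrative paths and table names:

# Load JSON with an inferred schema, expose it to Spark SQL, and persist it as a Hive table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

events = spark.read.json("hdfs:///landing/events/")       # schema inferred from JSON
events.createOrReplaceTempView("events_stg")

spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.events
    USING parquet
    AS SELECT * FROM events_stg
""")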

Developed a data pipeline using Spark, Hive, Pig, Python, Impala, and HBase to ingest customer data.

Customized reports in JIRA for different teams to represent agile board project status and key projects.

Created reports and dashboards in JIRA to track hours and project status to report to finance team.

Implemented Spark scripts using Scala and Spark SQL to access Hive tables in Spark for faster data processing.

Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics); ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.

Tested Apache Tez for building high performance batch and interactive data processing applications on Pig and Hive jobs.

Operated on several prototype OpenShift projects involving clustered container orchestration and management.

Involved in working with a PaaS solution, Red Hat OpenShift.

Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark context, Spark SQL, PostgreSQL, Scala, DataFrames, Impala, OpenShift, Talend, and pair RDDs.

Set up data pipelines using TDCH, Sqoop, and PySpark based on the size of the data loads.

Implemented real-time analytics on Cassandra data using the Thrift API.

Designed column families in Cassandra, ingested data from RDBMSs, performed transformations, and exported the data to Cassandra.

Led testing efforts in support of projects/programs across a large landscape of technologies (Unix, AngularJS, AWS, Sauce Labs, Cucumber JVM, MongoDB, GitHub, Bitbucket, SQL, NoSQL databases, APIs, Java, Jenkins).

Environment: Bitbucket, SQL, NoSQL, Java, API, PySpark, Spark SQL, JVM, Azure Data Lake, Scala, JIRA, JSON, Redshift, Talend, Tableau, ETL, RDBMS, T-SQL, Azure Data Factory.

Client: Providence, Hyderabad, India Jun 2018 – May 2021

Big Data Engineer

Responsibilities:

Implemented a generic, highly available ETL framework for bringing related data into Hadoop and Cassandra from various sources using Spark.

Experienced in using Platfora, a data visualization tool specific to Hadoop; created various Lenses and Vizboards for real-time visualization from Hive tables.

Queried and analyzed data from Cassandra for quick searching, sorting and grouping through CQL.

Implemented various Data Modeling techniques for Cassandra.

Good experience in big data integration using Informatica BDM and Talend big data integration.

Joined various tables in Cassandra using Spark and Scala and ran analytics on top of them.

Responsible for establishing Continuous Integration and Continuous Deployment (CI/CD) pipelines for Java business applications (APMS, CMGW, UMDE, PLUTO) using Maven, Git, Bitbucket, Jenkins, Bamboo, uDeploy, WebSphere Application Server, JIRA, Confluence, and JFrog Artifactory.

Participated in various upgrade and troubleshooting activities across the enterprise.

Knowledge in performance troubleshooting and tuning Hadoop clusters.

Applied advanced Spark procedures such as text analytics and processing using in-memory computation.

Implemented Apache Drill on Hadoop to join data from SQL and NoSQL databases and store it in Hadoop.

Created an architecture stack blueprint for data access with the NoSQL database Cassandra.

Brought data from various sources into Hadoop and Cassandra using Kafka.

Experienced in using Tidal enterprise scheduler and Oozie Operational Services for coordinating the cluster and scheduling workflows.

Applied Spark Streaming for real-time data transformation.

Created multiple dashboards in Tableau for multiple business needs.

Installed and configured Hive, wrote Hive UDFs, and used Piggy Bank, a repository of UDFs for Pig Latin.

Implemented partitioning, dynamic partitions, and buckets in Hive for efficient data access.

Exported the analyzed data to relational databases using Sqoop for visualization and to generate reports for the BI team using Tableau.

Implemented Composite Server for data virtualization needs and created multiple views for restricted data access using a REST API.

Devised and led the implementation of a next-generation architecture for more efficient data ingestion and processing.

Created and implemented various shell scripts for automating the jobs.

Implemented Apache Sentry to restrict access to Hive tables at the group level.

Employed the Avro format for all data ingestion for faster operation and lower space utilization.

Experienced in managing and reviewing Hadoop log files.

Worked in an Agile environment and used the Rally tool to maintain user stories and tasks.

Worked with enterprise data support teams to install Hadoop updates, patches, and version upgrades as required and fixed problems that arose after the upgrades.

Implemented test scripts to support test-driven development and continuous integration.

Used Spark for parallel data processing and better performance.

Environment: MapR 5.0.1, MapReduce, HDFS, Hive, Pig, Impala, Cassandra 5.04, Spark, Scala, Java, SQL, Tableau, ZooKeeper, Sqoop, Teradata, CentOS, Pentaho, MQFTE.

Client: Insight Software, Hyderabad, India Oct 2016 – May 2018

Data Engineer

Responsibilities:

Acted as a lead resource and built the entire Hadoop platform from scratch.

Evaluated the suitability of Hadoop and its ecosystem for the project, implementing and validating various proof-of-concept (POC) applications to eventually adopt them and benefit from the big data Hadoop initiative.

Estimated the software and hardware requirements for the NameNode and DataNodes and planned the cluster.

Extracted the needed data from the server into HDFS and bulk-loaded the cleaned data into HBase.

Led NoSQL column family design, client access software, and Cassandra tuning during the migration from Oracle-based data stores.

Designed, implemented, and deployed, within a customer's existing Hadoop/Cassandra cluster, a series of custom parallel algorithms for various customer-defined metrics and unsupervised learning models.

Using the Spark framework, enhanced and optimized product Spark code to aggregate, group, and run data-mining tasks.

Wrote queries using DataStax Cassandra CQL to create, alter, insert, and delete elements.

Wrote MapReduce programs and Hive UDFs in Java.

Used JUnit for unit testing MapReduce programs.

Deployed an Apache Lucene-based search engine server to help speed up the search of financial documents.

Developed Hive queries for the analysts.

Created an e-mail notification service that runs upon job completion for the team that requested the data.

Defined job workflows according to their dependencies in Oozie.

Played a key role in productionizing the application after testing by BI analysts.

Delivered a POC of Flume to handle real-time log processing for attribution reports.

Maintained system integrity of all Hadoop-related sub-components.

Environment: Apache Hadoop, HDFS, Spark, Hive, DataStax Cassandra, MapReduce, Pig, Java, Flume, Cloudera CDH4, Oozie, Oracle, MySQL, Amazon S3.
