
Data Engineer Big

Location:
Texas City, TX
Posted:
January 25, 2024



SAGAR K

Sr. Data Engineer

+1-469-***-****

ad23fv@r.postjobfree.com

www.linkedin.com/in/sagar-kusuma

PROFESSIONAL SUMMARY

Around 8 years of experience as a Data Engineer across various domains, building and deploying data-intensive projects for data acquisition, data visualization, and data mining with large datasets of structured and semi-structured data.

Experience in Big Data, AWS (S3, EC2, RDS, EMR, Redshift), Azure cloud platform, Hadoop, Hive, Spark, Sqoop, SQL, PLSQL, and Teradata.

Extensive work in ETL processes, including data sourcing, mapping, transformation, conversion, and loading.

Implemented a one-time data migration of multi-state data from SQL Server to Snowflake using Python and SnowSQL.
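
A minimal sketch of what such a one-time migration can look like with the SQL Server ODBC driver and the Snowflake Python connector; the connection parameters, table, and column names below are illustrative placeholders rather than the actual project values.

    # Hypothetical one-time load: pull rows from SQL Server and bulk-insert them into Snowflake.
    import pyodbc
    import snowflake.connector

    # Source: SQL Server (connection details are placeholders).
    src = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=src-host;DATABASE=StateDB;UID=etl;PWD=***")
    rows = src.cursor().execute("SELECT state_code, metric, value FROM dbo.state_metrics").fetchall()

    # Target: Snowflake (account, warehouse, and database names are placeholders).
    snow = snowflake.connector.connect(account="xy12345", user="ETL_USER", password="***",
                                       warehouse="LOAD_WH", database="ANALYTICS", schema="PUBLIC")
    cur = snow.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS STATE_METRICS (STATE_CODE STRING, METRIC STRING, VALUE FLOAT)")
    cur.executemany("INSERT INTO STATE_METRICS (STATE_CODE, METRIC, VALUE) VALUES (%s, %s, %s)",
                    [tuple(r) for r in rows])
    snow.commit()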

Proficient in SQL databases like MySQL, MS SQL, and PostgreSQL. Worked with UNIX/Linux including commands and shell scripting.

Proficient in all activities related to high-level design of ETL mappings for integrating data from multiple heterogeneous data sources (Excel, Flat File, Text format Data) using Informatica Power Center.

Setting up AWS and Azure Databricks accounts.

Hands-on experience creating data quality rules in the Ataccama tool to validate data quality scenarios.

Extensive experience in Amazon Web Services (AWS) Cloud services such as EC2, VPC, S3, IAM, EBS, RDS, ELB, Route53, OpsWorks, DynamoDB, Auto Scaling, CloudFront, CloudTrail, CloudWatch, CloudFormation, Elastic Beanstalk, AWS SNS, AWS SQS, AWS SES, AWS SWF, and AWS Direct Connect.

Hands-on experience in utilizing Hadoop ecosystem components such as Hadoop, Hive, Sqoop, HBase, Spark, Spark Streaming, Spark SQL, Zookeeper, Kafka, Flume, the MapReduce framework, YARN, and Scala.

Worked on Dimensional and Relational Data Modeling using Star and Snowflake Schemas, OLTP/OLAP system, and Conceptual, Logical, and Physical data modeling using Erwin.

Experience in working with Azure cloud platform services: HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, Synapse, SQL Database, Azure Analysis Services, and Azure SQL Data Warehouse.

Experienced in Spark Streaming and expert in building PySpark and Spark-Scala applications for interactive analysis, batch processing, and stream processing.

Experience in Hadoop ecosystem and Big Data components for developing and deploying enterprise applications using Apache Spark, Python, Spark SQL, Spark MLlib, HDFS, MapReduce, Kafka, Flume, Sqoop, Airflow, and Hive.

Experience developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to gain insights into customer usage patterns.
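
A minimal PySpark sketch of this style of multi-format extraction and aggregation; the mount paths, column names, and the specific usage metrics are illustrative assumptions, not the original project code.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("usage-patterns").getOrCreate()

    # Extract from multiple file formats (paths are placeholders).
    clicks = spark.read.json("/mnt/raw/clickstream/")        # semi-structured JSON
    orders = spark.read.parquet("/mnt/curated/orders/")      # columnar Parquet

    # Transform and aggregate to summarize customer usage patterns.
    usage = (clicks.join(orders, "customer_id", "left")
                   .groupBy("customer_id")
                   .agg(F.count("*").alias("events"),
                        F.countDistinct("session_id").alias("sessions"),
                        F.sum("order_amount").alias("total_spend")))

    usage.write.mode("overwrite").saveAsTable("analytics.customer_usage")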

Designed, built, and deployed a multitude of applications utilizing almost the entire AWS stack (including EC2, Route 53, S3, RDS, HSM, DynamoDB, SQS, IAM, and EMR), focusing on high availability, fault tolerance, and auto-scaling.

Strong hands-on experience with microservices frameworks such as Spring IO and Spring Boot, deploying them on cloud infrastructure such as AWS.

Proficient with the execution of Continuous Integration (CI), Continuous Delivery, and Continuous Deployment (CD) on various Java-based applications utilizing Jenkins, TeamCity, Azure DevOps, Maven, Git, Nexus, Docker, and Kubernetes.

Hands-on experience in deploying Kubernetes clusters on the cloud with master/worker architecture, and in writing numerous YAML files to create resources such as pods, deployments, auto-scaling, load balancers, labels, health checks, namespaces, and ConfigMaps.

Strong exposure to Kerberos, Azure AD, Sentry, and Ranger for maintaining authentication and authorization, and hands-on experience with visualization tools such as Tableau and Power BI.

Practical understanding of data modeling (dimensional and relational) concepts such as star-schema modeling, snowflake-schema modeling, and fact and dimension tables.

Wrote UNIX shell scripts for automating deployments and other routine tasks.

TECHNICAL SKILLS

Amazon AWS: EMR, EC2, EBS, RDS, S3, Athena, Glue, Elasticsearch, Lambda, SQS, DynamoDB, Redshift, ECS, QuickSight, Kinesis

Data Analytics Tools: ACL

Microsoft Azure: Databricks, Data Lake, Blob Storage, Azure Data Factory, SQL Database, SQL Data Warehouse, Azure Active Directory

Database Tools: MySQL, SQL, Oracle, Teradata

Hadoop/Big Data Technologies: Hadoop, MapReduce, Sqoop, Hive, Spark, Zookeeper, Cloudera Manager, Kafka, Flume

ETL Tools: Informatica, Power BI, Tableau, custom shell scripts, Talend, SSIS, SSRS, SSAS, ER Studio

NoSQL Databases: HBase, DynamoDB

Hadoop Distributions: Hortonworks, Cloudera

Build Tools: Maven

Programming & Scripting: Python, Scala, SQL, Shell Scripting

Version Control: Git, SVN, Bitbucket

Cluster Managers: Docker, Kubernetes

Development Methodologies: Agile, Waterfall

PROFESSIONAL EXPERIENCE

Vizient, Inc., Chicago, IL Dec 2022 to Present

Role: Data Engineer

Responsibilities:

Designed and implemented end-to-end data pipelines across AWS, GCP, and Azure platforms, ensuring seamless data flow and integration.

Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, DataFrames, and Azure Databricks.

Loaded streaming data from Kafka and Azure Event Hubs into Databricks for real-time analytics.
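
A minimal Structured Streaming sketch of this kind of Kafka-to-Databricks ingestion; the broker address, topic, schema fields, checkpoint path, and target table are illustrative assumptions.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StringType, DoubleType

    spark = SparkSession.builder.appName("kafka-to-delta").getOrCreate()

    schema = (StructType()
              .add("event_id", StringType())
              .add("member_id", StringType())
              .add("amount", DoubleType()))

    # Read the Kafka topic as a stream (broker and topic names are placeholders).
    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")
           .option("subscribe", "events")
           .load())

    # Kafka values arrive as bytes; parse the JSON payload into typed columns.
    events = (raw.selectExpr("CAST(value AS STRING) AS json")
                 .select(F.from_json("json", schema).alias("e"))
                 .select("e.*"))

    # Land the stream in a Delta table for downstream real-time analytics.
    (events.writeStream
           .format("delta")
           .option("checkpointLocation", "/mnt/checkpoints/events")
           .toTable("analytics.events_stream"))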

Optimized Spark jobs by tuning and debugging performance issues on Databricks clusters.

Established role-based access control policies and usage-based cost monitoring on Databricks.

Working with Hadoop Architecture and various components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, Secondary Name Node, and Map Reduce concepts.

Involved in converting Hive or SQL queries into Spark transformations using Python and Scala.
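
A minimal sketch of what such a conversion looks like in PySpark, shown side by side; the claims table and its columns are hypothetical examples, not the actual queries.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Original Hive/SQL-style aggregate...
    hive_style = spark.sql("""
        SELECT facility_id, COUNT(*) AS claim_count
        FROM claims
        WHERE status = 'PAID'
        GROUP BY facility_id
    """)

    # ...and the equivalent expressed as DataFrame transformations.
    converted = (spark.table("claims")
                      .filter(F.col("status") == "PAID")
                      .groupBy("facility_id")
                      .agg(F.count("*").alias("claim_count")))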

Built and maintained production-level data engineering pipelines to optimize ETL/ELT jobs from sources such as Azure SQL, Azure Data Factory, Blob Storage, Azure SQL Data Warehouse, and Azure Data Lake Analytics.

Designed and implemented streaming solutions using Kafka or Azure Stream Analytics.

Deployed several database environments, such as Cosmos DB for PostgreSQL and PostgreSQL Flexible Server with the pgvector extension, while choosing the right vector store for Retrieval-Augmented Generation (RAG) vector search.

Used the DataFrame API in Scala to organize a distributed collection of data into named columns.

Ensured data integrity and standardization through meticulous data preparation, including unit standardization and log transformation.

Implemented role-based access control, encryption, and masking for enterprise security and governance.

Prepared CI/CD (Continuous Integration & Delivery) using Concourse pipelines to deploy code to the DEV/STG/PRD environments.

Performed Exploratory Data analysis for several datasets as part of Phase 1 of the project.

Environment: Azure OpenAI, Azure SQL DB, Azure Notebooks, Azure Data Factory, Azure Functions, Power BI, Prompt flow, LangChain, PostgreSQL Flexible Server, pgvector extension, Cosmos DB, SharePoint, Scala, R, Python, Agile.

Pitney Bowes, Los Angeles, California May 2021 to Nov 2022

Role: Data Engineer

Responsibilities:

Developing production-ready software using Python and PySpark for data engineering use cases.

Working on developing a system that generates efficient, idiomatic Python and PySpark code for data engineering use cases and handles a variety of data from multiple sources.

Applying software engineering design principles along with scripting techniques to solve readability and performance issues.

Designing solutions using Pandas and PySpark that can be automatically assembled into larger programs performing complex operations.

Designing star schemas in BigQuery.

Building datasets and tables in BigQuery and loading data from Cloud Storage.

Converting and modifying Hive queries for use in BigQuery, and performing data cleaning on unstructured data using various tools.
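
A minimal sketch of loading Cloud Storage files into a BigQuery table with the google-cloud-bigquery client; the project, dataset, table, and bucket names are illustrative placeholders.

    from google.cloud import bigquery

    client = bigquery.Client()  # uses application-default credentials

    # Load Parquet files from Cloud Storage into a table (names below are placeholders).
    table_id = "my-project.retail_mart.shipments"
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    load_job = client.load_table_from_uri(
        "gs://my-bucket/curated/shipments/*.parquet", table_id, job_config=job_config
    )
    load_job.result()  # wait for the load to finish
    print(client.get_table(table_id).num_rows, "rows loaded")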

Extracting, transforming, and loading data from source systems to Azure data storage services using a combination of Azure Data Factory, Spark SQL, and Azure Data Lake Analytics. Ingested the data into Azure Data Lake and processed it in Azure Databricks.

Ingested high volumes of data from multiple sources into Azure Cosmos DB for ad-hoc querying, and used change feed functionality to reliably and incrementally read inserts and updates made to Azure Cosmos DB containers.

Experience in building and architecting multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP, and in coordinating tasks among the team.

Used Cosmos DB queries, with SQL as a JSON query language, to read the data stored in Azure Cosmos DB.
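
A minimal sketch of such a query through the azure-cosmos Python SDK; the endpoint, key, database, container, and document fields are illustrative placeholders.

    from azure.cosmos import CosmosClient

    # Endpoint, key, database, and container names are placeholders.
    client = CosmosClient("https://my-account.documents.azure.com:443/", credential="***")
    container = client.get_database_client("telemetry").get_container_client("responses")

    # SQL over the JSON documents stored in Cosmos DB.
    query = "SELECT c.deviceId, c.responseCode FROM c WHERE c.responseCode >= @code"
    items = container.query_items(
        query=query,
        parameters=[{"name": "@code", "value": 400}],
        enable_cross_partition_query=True,
    )
    for item in items:
        print(item["deviceId"], item["responseCode"])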

Contributing to the team by researching, designing, and analyzing software programs and coming up with solutions in Python and PySpark that can handle a broad range of use cases with optimized performance.

Performance-tested the developed code, identified areas of modification in existing programs, and subsequently developed those modifications.

Performing transformations, cleaning, and filtering on imported data using the Spark DataFrame API, Hive, and MapReduce, and loading the final data into Hive.

Used Spark tools to stream data from a variety of sources, including the cloud (AWS) and on-premises systems.

Created a data pipeline for ingesting, aggregating, and loading consumer response data from an AWS S3 bucket into Hive external tables in HDFS.
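
A minimal PySpark sketch of this S3-to-Hive flow; the bucket, HDFS paths, schema, and table names are illustrative assumptions.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Ingest consumer response data from S3 (bucket and prefix are placeholders).
    responses = spark.read.json("s3a://consumer-feedback/raw/responses/")

    # Aggregate before landing the data in HDFS.
    daily = (responses.withColumn("dt", F.to_date("submitted_at"))
                      .groupBy("dt", "survey_id")
                      .agg(F.count("*").alias("responses")))

    daily.write.mode("overwrite").parquet("hdfs:///warehouse/consumer/daily_responses/")

    # Expose the HDFS location through a Hive external table.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS consumer.daily_responses (
            dt DATE, survey_id STRING, responses BIGINT)
        STORED AS PARQUET
        LOCATION 'hdfs:///warehouse/consumer/daily_responses/'
    """)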

Using GitHub, CI/CD, Pytest, and other technical specifications to deliver and coordinate high-quality technical work.

Identifying and reporting issues with the code generator that is used to generate the Python and Pandas code.

Documenting technical requirements and code logic, and collaborating with team members to modify the code to solve complex problems and manage risks.

JC Penney, Plano Aug 2019 to Apr 2021

Role: Hadoop Developer

Responsibilities:

• Worked with the Hortonworks distribution platform; installed, configured, and managed a Hadoop cluster based on the business and team requirements.

• Applied knowledge of big data components such as HDFS, MapReduce, YARN, Hive, HBase, Druid, Sqoop, Pig, and Ambari.

• Involved in end-to-end implementation of ETL pipelines for high-volume analytics utilizing Python and SQL, as well as reviewing use cases before HDFS onboarding. Used Flume and Kafka for semi-structured data and Sqoop for existing relational databases to capture data and import it into HDFS.

• Interacted with the HDFS cluster via Cloudera Hue and Zeppelin notebooks. Configured and monitored resource use across the cluster using Cloudera Manager, Search, and Navigator.

• Improved Python scripts for existing modules. Converted ETL jobs to Pig scripts to apply transformations, joins, and aggregations and to load data into HDFS. Used partitioning and bucketing concepts in Hive and designed internal and external Hive tables to improve query performance.

• Created Apache NiFi jobs to get files from transaction systems into the raw zone of the data lake.

• Used Ambari and Hadoop streaming tasks to load, manage, and evaluate terabytes of log files.

• Switched from JMS to Kafka and used Zookeeper to manage synchronization, serialization, and coordination.

• Used Zookeeper to provide cluster-wide coordination, synchronization, and grouping services.

• Contributed to the development of rack topology scripts and Java MapReduce programs for parsing raw data.

• Used Sqoop to move data from traditional RDBMS systems to HDFS, ingesting Teradata, Oracle, and MySQL data. Identified required tables and views and exported them into Hive. Used Hive joins, partitioning, and bucketing techniques to run ad-hoc queries with faster data access.
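
A minimal HiveQL sketch of the partitioning and bucketing approach described above, issued here through PyHive against a HiveServer2 endpoint; the host, database, table, and column names are illustrative placeholders.

    from pyhive import hive

    conn = hive.Connection(host="hive-server", port=10000, database="sales")
    cur = conn.cursor()

    # Partition by business date and bucket by a high-cardinality key so queries
    # prune partitions and joins can exploit bucketing.
    cur.execute("""
        CREATE EXTERNAL TABLE IF NOT EXISTS transactions (
            txn_id STRING, store_id STRING, amount DOUBLE)
        PARTITIONED BY (txn_date STRING)
        CLUSTERED BY (store_id) INTO 32 BUCKETS
        STORED AS ORC
        LOCATION '/data/curated/transactions'
    """)

    # Load one day's partition from a staging table.
    cur.execute("""
        INSERT OVERWRITE TABLE transactions PARTITION (txn_date = '2020-06-01')
        SELECT txn_id, store_id, amount
        FROM staging_transactions_raw
        WHERE txn_date = '2020-06-01'
    """)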

Environment: Spark, Spark Streaming, Spark SQL, EMR, MapReduce, HDFS, Hive, Pig, PySpark, shell scripting, Linux, Kafka, Python, SQL, Java, Teradata, Oracle, MySQL, Tableau, SVN, JIRA

Tesco PLC, Bangalore, India Jun 2014 to Dec 2017

Role: Big Data Engineer

Responsibilities:

Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs for data cleaning and pre-processing.

Developed data import and export jobs to copy data from RDBMS to HDFS using Sqoop, and developed Spark code and Spark SQL/Streaming jobs for faster testing and processing of data.

Worked on querying data using Spark SQL on top of Spark Engine, implementing Spark RDDs using Python.

Built best-practice ETLs with Apache Spark to load and transform raw data into easy-to-use dimensional data for self-service reporting.

Used Spark SQL to load JSON data, create schema RDDs, and load them into Hive tables, handling structured data.
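
A minimal PySpark sketch of loading JSON through Spark SQL and persisting it to Hive; the input path, columns, and target table are illustrative assumptions.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Load JSON; Spark SQL infers a schema from the documents (path is a placeholder).
    products = spark.read.json("hdfs:///landing/catalog/products/")
    products.printSchema()

    # Register and persist the structured result as a Hive table.
    products.createOrReplaceTempView("products_raw")
    cleaned = spark.sql("""
        SELECT product_id, category, CAST(price AS DOUBLE) AS price
        FROM products_raw
        WHERE product_id IS NOT NULL
    """)
    cleaned.write.mode("overwrite").saveAsTable("retail.products")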

Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.

Developed scripts to load data into Hive from HDFS and was involved in ingesting data into the data warehouse using various data loading techniques.

Developed Mapping for Data Warehouse and Data Mart objects.

Designed and developed ETL jobs using the DataStage tool to load the data warehouse and data marts.

Designed and implemented data loading and aggregation frameworks and jobs able to handle hundreds of GBs of JSON files, using Spark and Airflow.
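
A minimal Airflow sketch of orchestrating such a Spark aggregation job, assuming Airflow with the Apache Spark provider installed; the DAG id, schedule, script path, and connection id are illustrative placeholders.

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

    # DAG id, schedule, script path, and connection id are placeholders.
    with DAG(
        dag_id="json_aggregation",
        start_date=datetime(2017, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        aggregate = SparkSubmitOperator(
            task_id="aggregate_json",
            application="/opt/jobs/aggregate_json.py",  # PySpark job that reads and aggregates the JSON files
            conn_id="spark_default",
            conf={"spark.sql.shuffle.partitions": "400"},
        )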

Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.

Worked on developing ETL pipelines over S3 Parquet files in the data lake using AWS Glue.
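
A minimal AWS Glue job sketch for this kind of Parquet ETL; it assumes the awsglue runtime available inside a Glue job, and the catalog database, table, column mappings, and S3 output path are illustrative placeholders.

    import sys

    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read Parquet files registered in the Glue Data Catalog.
    source = glue_context.create_dynamic_frame.from_catalog(database="datalake", table_name="events_parquet")

    # Light transformation: keep, rename, and retype a few columns.
    mapped = ApplyMapping.apply(frame=source, mappings=[
        ("event_id", "string", "event_id", "string"),
        ("event_ts", "string", "event_ts", "timestamp"),
        ("amount", "double", "amount", "double"),
    ])

    # Write the curated output back to S3 as Parquet.
    glue_context.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://curated-bucket/events/"},
        format="parquet",
    )
    job.commit()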

Responsible for expanding and optimizing data and data pipeline architecture, as well as optimizing data flow and collection for cross-functional teams.

Built reports using Tableau to allow internal and external teams to visualize and extract insights from big data platforms.


