
Data Engineer Big

Location:
Hyderabad, Telangana, India
Posted:
January 04, 2024


Resume:

Name: Kumar Syamala

Email: ad2fyr@r.postjobfree.com

PH: +1-201-***-****

Current Role: Sr. Data Engineer

LinkedIn: https://www.linkedin.com/in/kumar-reddy-60298622a/

Professional Summary

•8+ years of professional experience in Data Engineering, primarily using AWS, Azure, and the Hadoop ecosystem.

•Experience in the design, development, and implementation of Big Data applications using Hadoop architecture and Hadoop ecosystem frameworks and tools such as HDFS, MapReduce, YARN, Pig, Hive, Sqoop, Spark, Storm, HBase, Kafka, Flume, NiFi, Impala, Oozie, and Zookeeper.

•Strong experience using HDFS, MapReduce, Hive, Spark, Big Data, Java, SAP, T-SQL, Beam, Scala, Kafka, Databricks, PySpark, Sqoop, Oozie, and HBase.

•Deep knowledge of troubleshooting and tuning Spark applications and Hive scripts to achieve optimal performance.

•Experienced working with various Hadoop distributions (Cloudera, Hortonworks, MapR, Amazon EMR) to fully implement and leverage new Hadoop features.

•Experience in developing Spark applications using the Spark RDD, Spark SQL, and DataFrame APIs.

•Worked with real-time data processing and streaming techniques using Spark streaming and Kafka.

•Experience in moving data into and out of the HDFS and Relational Database Systems (RDBMS) using Apache Sqoop.

•Expertise in working with the Hive data warehouse infrastructure: creating tables, distributing data through partitioning and bucketing, and developing and tuning HQL queries.
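
A minimal PySpark sketch of the partitioning and bucketing pattern described above; the database, table, and column names are illustrative assumptions, and it presumes a Spark session with Hive support enabled:

```python
# Sketch: partitioned, bucketed Hive table managed from PySpark (names are placeholders).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partitioning-demo")
         .enableHiveSupport()
         .getOrCreate())

# Partition by load date, bucket by customer_id to speed up joins and sampling.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_db.orders (
        order_id     BIGINT,
        customer_id  BIGINT,
        amount       DECIMAL(18,2)
    )
    PARTITIONED BY (load_date STRING)
    CLUSTERED BY (customer_id) INTO 16 BUCKETS
    STORED AS ORC
""")

# Dynamic partition insert from a staging table.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
    INSERT OVERWRITE TABLE sales_db.orders PARTITION (load_date)
    SELECT order_id, customer_id, amount, load_date
    FROM sales_db.orders_staging
""")
```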

•Replaced existing MapReduce jobs and Hive scripts with Spark SQL and Spark data transformations for efficient data processing.

•Real-time experience loading data into AWS, Google Cloud Platform (GCP), and Snowflake data pipelines using AWS Glue, Informatica, and SnapLogic.

•Good experience with the Snowflake utility SnowSQL.

•Hands-on experience in workflow orchestration using Apache Airflow and the Oozie workflow engine for managing and scheduling Glue jobs.

•Hands-on experience with Snowflake merge procedures to perform DML operations in the Snowflake Data Warehouse.

•Experience working with Amazon Web Services (AWS) and its services such as Glue, Amazon MWAA, EC2, S3, RDS, EMR, VPC, IAM, Elastic Load Balancing, Lambda, Redshift, ElastiCache, CloudFront, CloudWatch, Athena, DMS, and Aurora.


•Experience developing Kafka producers and consumers for streaming millions of events per second.
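
A minimal sketch of a Kafka producer and consumer using the confluent-kafka Python client; the broker address, topic, and consumer group are placeholders, not values from any of the projects below:

```python
# Sketch: produce and consume JSON events with confluent-kafka (all names are placeholders).
import json
from confluent_kafka import Producer, Consumer

producer = Producer({"bootstrap.servers": "broker:9092"})
producer.produce("events", value=json.dumps({"event_id": 1, "type": "click"}).encode("utf-8"))
producer.flush()  # block until delivery is confirmed

consumer = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "event-consumers",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])
msg = consumer.poll(timeout=5.0)      # returns None if nothing arrives in time
if msg is not None and msg.error() is None:
    print(json.loads(msg.value()))
consumer.close()
```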

•Database design, modeling, migration, and development experience using stored procedures, triggers, cursors, constraints, and functions. Used MySQL, MS SQL Server, Microsoft Data Factory, IAM, DB2, and Oracle.

•Experience working with NoSQL database technologies, including MongoDB, Cosmos DB, Cassandra, RDS, and HBase.

•Solid experience in using various file formats like CSV, TSV, Parquet, ORC, JSON, and AVRO.

•Experience with software development tools such as JIRA, GitHub, and ServiceNow.

•Experience setting up the AWS data platform: AWS CloudFormation, development endpoints, infrastructure, AWS Glue, EMR, Jupyter/SageMaker notebooks, Redshift, S3, EC2 instances, and AWS Elastic Beanstalk.

•Experience migrating SQL databases to Azure Data Lake, Azure Databricks, Kafka, Data Factory, Talend, Azure Data Lake Analytics, Azure SQL Database, Beam, Teradata, Denodo, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.

•Experience in change implementation, monitoring, and troubleshooting of Snowflake databases and cluster-related issues.

•Strong experience with ETL and orchestration tools (e.g., Talend, Toad, HQL, T-SQL, Oozie, Airflow, Informatica).

•Experience working on healthcare data, financial data systems, employee payroll data, and retail data.

•Experienced in using Agile Scrum methodology.

TECHNICAL SKILLS:

•Programming languages: Python, PySpark, Scala, Shell Scripting, SQL, PL/SQL, Pig, HiveQL, SnowSQL.

•Big Data: Hadoop, Sqoop, Apache Spark, NiFi, Kafka, Snowflake, T-SQL, HQL, Beam, SAP, Toad, Cloudera, Hortonworks, PySpark, Java, Spark, Spark SQL

•Operating Systems: Windows (XP/7/8/10), UNIX, LINUX, UBUNTU

•Databases: Oracle, SQL Server, MySQL, DB2, Sybase, Netezza, Hive, Impala, Teradata, NoSQL databases (HBase, MongoDB).

•Cloud Technologies:

(a)AWS cloud services, i.e., EC2, S3, Glue, Athena, AWS Lambda, EventBridge, Glue Data Catalog, AWS DMS, DynamoDB, Redshift, Amazon MWAA, RDS, EMR, VPC, IAM, Elastic Load Balancing, ElastiCache, CloudWatch.

(b)MS Azure services, i.e., Azure Data Lake, Data Factory, Azure Databricks, Azure SQL, Azure SQL Data Warehouse

(c)Snowflake Data Warehouse: SnowSQL, Snowflake merge procedures.

•IDE Tools: Aginity for Hadoop, PyCharm, Toad, SQL Developer, SQL*Plus, Sublime Text, VI Editor, Eclipse, Visual Studio, DataGrip

•Git: GitHub, GitHub Actions, workflows.

•Data Visualization Tools: Tableau, Power BI, and OWERLEDGE.

•ETL/Data warehouse Tools: Informatica 9.6/9.1, SnapLogic, AWS Glue.

Professional Experience

Sr. Data Engineer

McKinsey & Company, Atlanta, GA March 2022 to Present

Roles & Responsibilities:

•Design and develop ETL Integration patterns using Python on Spark

•Develop a framework for converting existing SnapLogic mappings to PySpark jobs and SnowSQL procedures.

•Create Glue jobs using PySpark to bring data from Oracle, SAP HANA, and SAP S4 to AWS S3.

•Optimize PySpark jobs to run on the EKS cluster for fast data processing.

•Migrate on-prem Informatica and SnapLogic ETL processes to AWS and Snowflake.

•Implement CI/CD pipelines for code deployment.

•Build Airflow DAGs to run ETL jobs on schedule.
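
As an illustration of the scheduled DAGs mentioned above, a minimal Airflow 2.x sketch; the DAG id, cron schedule, and task callable are assumptions, not the actual production jobs:

```python
# Sketch: a daily ETL DAG (placeholder names, assuming Airflow 2.x).
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def run_etl():
    # placeholder for the real Glue/PySpark trigger logic
    print("running daily ETL")

with DAG(
    dag_id="daily_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 2 * * *",   # every day at 02:00
    catchup=False,
) as dag:
    etl_task = PythonOperator(task_id="run_etl", python_callable=run_etl)
```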

•Develop Kafka connectors to consume data from the upstream Confluent Kafka Cloud, provide data to downstream applications, and store data in Snowflake.

•Worked extensively on migrating all Informatica aggregation logic into SnowSQL logic.

•Used broadcast joins in Spark to join smaller datasets to larger datasets without shuffling data across nodes.
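
A short PySpark sketch of the broadcast-join pattern; the S3 paths and join key are placeholders. Broadcasting the small dimension table ships it to every executor, so the large table avoids a shuffle:

```python
# Sketch: broadcast join of a small dimension table against a large fact table.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-demo").getOrCreate()

large_df = spark.read.parquet("s3://bucket/fact_orders/")      # millions of rows
small_df = spark.read.parquet("s3://bucket/dim_customers/")    # small enough to fit in memory

joined = large_df.join(broadcast(small_df), on="customer_id", how="left")
joined.write.mode("overwrite").parquet("s3://bucket/enriched_orders/")
```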

•Build Snowflake connectors to read and write data between S3 and Snowflake.

•Developing Glue pipelines using Spark to push data from Snowflake to Elasticsearch and DocumentDB indexes to provide data to downstream APIs.

•Implemented Spark partitioning in AWS Glue jobs that read data from RDBMS sources to avoid job timeouts caused by heavy data loads.
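
A hedged sketch of a partitioned JDBC read as it might look inside a Glue/PySpark job; the JDBC URL, table, bounds, and partition count are illustrative assumptions:

```python
# Sketch: parallel JDBC read split on a numeric column so no single long-running
# connection has to pull the whole table (all connection details are placeholders).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-partitioned-read").getOrCreate()

orders = (spark.read.format("jdbc")
          .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCL")
          .option("dbtable", "SALES.ORDERS")
          .option("user", "etl_user")
          .option("password", "****")
          .option("partitionColumn", "ORDER_ID")   # numeric column to split on
          .option("lowerBound", "1")
          .option("upperBound", "50000000")
          .option("numPartitions", "20")           # 20 parallel JDBC connections
          .load())

orders.write.mode("overwrite").parquet("s3://bucket/raw/orders/")
```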

•Automated Scheduling of workflow using MWAA Airflow to ensure daily execution in production.

•Created Spark Streaming jobs using Python to read messages from Kafka and download JSON files from AWS S3 buckets.

•Implemented Glue jobs to flatten JSON and Excel files using Spark and Pandas, convert them to Parquet format, and store them in S3.
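
A small Pandas sketch of the flatten-to-Parquet step; file paths are placeholders, and writing directly to s3:// assumes s3fs (and openpyxl for Excel) are installed:

```python
# Sketch: flatten nested JSON and an Excel sheet into Parquet (paths are placeholders).
import json
import pandas as pd

# Flatten nested JSON records into dotted columns, e.g. customer.address.city.
with open("input.json") as f:
    records = json.load(f)
flat_json = pd.json_normalize(records)
flat_json.to_parquet("s3://bucket/curated/customers.parquet", index=False)

# Convert one Excel sheet to Parquet.
sheet = pd.read_excel("input.xlsx", sheet_name="Orders")
sheet.to_parquet("s3://bucket/curated/orders.parquet", index=False)
```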

•Manage production incidents using ServiceNow.

•Provide support to consumers on data-related queries.

•Worked on the creation of a framework that compares data from on-prem and cloud after migration and generates mismatch reports for auditing purposes.

•Working experience in running a Python back-end application on Elastic Beanstalk.

•Created various reports using Tableau and Power BI based on requirements from the BI team.

Environment: Spark, PySpark, AWS Glue, Snowflake, Oracle, RDS, Data Lake, EC2, Amazon MWAA, AWS Lambda, SnapLogic, Snowflake Web UI, SnowSQL, Shell Scripting, Informatica, Tableau, Kafka, DataGrip, SQL Server.

Sr. Data Engineer

Truist Bank, Charlotte, NC March 2021 to March 2022

Responsibilities:

•Developed applications using Spark to implement various aggregation and transformation functions of Spark RDD and Spark SQL.

•Performed ETL from multiple sources.

•Worked on Apache Spark, utilizing the Spark SQL and Spark Streaming components to support intraday and real-time data processing.

•Automated Scheduling of workflow using Apache Airflow and shell scripting to ensure daily execution in production.

•Installed and configured Apache Airflow and Terraform for the S3 bucket and Snowflake data warehouse, and created DAGs to run through Airflow.

•Developed Spark scripts and UDFs using Spark SQL queries for data flow, dashboards, KPI scorecards, Google Cloud Platform (GCP), data aggregation, querying, and writing data back into RDBMS through Sqoop.

•Created Spark Streaming jobs using Python to read messages from Kafka and download JSON files from AWS S3 buckets.

•Used Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS using Python and NoSQL databases such as HBase and Cassandra.
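
A minimal Structured Streaming sketch of the Kafka-to-HDFS flow; the broker, topic, and paths are placeholders, and it assumes the spark-sql-kafka connector package is on the classpath:

```python
# Sketch: stream Kafka messages into HDFS as Parquet (all names are placeholders).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "clickstream")
          .option("startingOffsets", "latest")
          .load()
          .select(col("value").cast("string").alias("payload")))

query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/clickstream/")
         .option("checkpointLocation", "hdfs:///checkpoints/clickstream/")
         .trigger(processingTime="1 minute")
         .start())
query.awaitTermination()
```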

•Experience moving data between GCP and Azure using Azure Data Factory and Microsoft Data Factory.

•Developed AWS strategy, planning, and configuration of S3, security groups, IAM, Amazon MWAA, Spark optimization, Collibra, Talend, EC2, EMR, Jenkins, and Redshift.

•Developed and implemented ETL pipelines on S3 parquet files in a data lake using AWS Glue.

•Used the AWS Glue Data Catalog with a crawler to get data from S3 and perform SQL query operations.
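
A short boto3 sketch of the crawler-plus-Athena pattern; the crawler name, database, table, and result bucket are hypothetical:

```python
# Sketch: run an existing Glue crawler, then query the cataloged table via Athena.
import boto3

glue = boto3.client("glue", region_name="us-east-1")
athena = boto3.client("athena", region_name="us-east-1")

# Crawler registers s3://bucket/raw/orders/ in the Glue Data Catalog.
glue.start_crawler(Name="orders-raw-crawler")

# Query the crawled table; results land in the configured output location.
response = athena.start_query_execution(
    QueryString="SELECT order_status, COUNT(*) AS cnt FROM orders GROUP BY order_status",
    QueryExecutionContext={"Database": "raw_db"},
    ResultConfiguration={"OutputLocation": "s3://bucket/athena-results/"},
)
print(response["QueryExecutionId"])
```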

•Performed end-to-end architecture and implementation evaluations of different AWS services such as Amazon EMR, Redshift, S3, Athena, Glue, and Kinesis.

•Conducted data cleansing for unstructured datasets by applying Informatica Data Quality to identify potential errors and improve data integrity and data quality.

•Used shell commands to push environment and test files to AWS using Jenkins automated pipelines.

•Involved in file movements between HDFS and AWS S3 and extensively worked with S3 bucket in AWS.

•Load data into Amazon Redshift and use AWS CloudWatch to monitor AWS RDS instances within Confidential.

•Developed and executed a migration strategy to move the data warehouse and Google Cloud Platform (GCP) workloads from an Oracle platform to AWS Redshift. Developed PySpark code for AWS Glue jobs and for EMR.

•Experience with the Hadoop ecosystem (Spark, Sqoop, Hive, Flume, GCP, HBase, Git, Bitbucket, Hadoop, Kafka, Oozie).

•Was responsible for creating on-demand tables on S3 files using Lambda Functions and AWS Glue using Python and PySpark

•Experience in data ingestion techniques for batch and stream processing using AWS Batch, AWS Kinesis, and AWS Data Pipeline.

•Hands-on expertise with AWS databases such as RDS (Aurora), GCP, Denodo, Beam, AWS infrastructure, CI/CD, data analytics, Redshift, DynamoDB, and ElastiCache.

•Developed Scala scripts using both DataFrames/Spark SQL and Athena, S3, and Spark RDD/MapReduce for data aggregation, queries, and writing data back into the OLTP system through Sqoop.

•Used Spark API over Cloudera Hadoop Yarn to perform analytics on data in Hive.

•Created various reports using Tableau and Power BI based on requirements from the BI team.

•Created Snowpipe for continuous data loads from staged data residing on cloud gateway servers.
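
A hedged sketch of creating such a Snowpipe through the Snowflake Python connector; the connection parameters, stage, and table names are placeholders:

```python
# Sketch: define a Snowpipe that continuously loads staged JSON files (names are placeholders).
import snowflake.connector

conn = snowflake.connector.connect(
    user="ETL_USER", password="****", account="my_account",
    warehouse="LOAD_WH", database="ANALYTICS", schema="RAW",
)
cur = conn.cursor()

cur.execute("""
    CREATE PIPE IF NOT EXISTS raw.orders_pipe
      AUTO_INGEST = TRUE
      AS COPY INTO raw.orders
         FROM @raw.orders_stage
         FILE_FORMAT = (TYPE = 'JSON')
""")
cur.close()
conn.close()
```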

•Stage API or Kafka data (in JSON format) into Snowflake by flattening it for different services.

•Used the Snowflake Time Travel feature to access historical data, and was heavily involved in testing Snowflake to find the best possible way to use cloud resources.

•Experience with Snowflake Multi-Cluster Warehouses, working with both Maximized and Auto-scale functionalities while running the multi-cluster warehouses.

•Day-to-day responsibilities include developing ETL pipelines in and out of the data warehouse and developing major regulatory and financial reports using advanced SQL queries in Snowflake.

Environment: Apache Spark, Hadoop, PySpark, HDFS, Cloudera, Healthcare, AWS, GCP, CSV, JSON, IAM, Azure, Big Data, Azure S3, AWS Glue, AWS EMR, Redshift, Kafka, Snowflake, T-SQL, Toad, HQL, Docker, Cloud Computing, Data Management, Database Administration, CI/CD, Jenkins, Kubernetes, Flink, Microsoft Databricks, Collibra, Talend, Databricks, Data Pipeline, NiFi, Informatica, Tableau, Teradata, DB2, SQL Server, MongoDB, HBase, Cassandra, Snowflake Web UI, SnowSQL, Shell Scripting.

Sr. Data Engineer

Macy's, New York, NY September 2018 to February 2021

Responsibilities:

•Work on requirements gathering, analysis and designing of the systems.

•Developed Spark programs using Scala to compare the performance of Spark RDD and Spark SQL with Hive.

•Developed a Spark Streaming application to consume JSON messages from Jenkins, Denodo, T-SQL, Beam, CI/CD, Git, Bitbucket, and Kafka and perform transformations.

•Performed data analysis on healthcare data for Medicare, Medicaid, and commercial lines of business, writing complex SQL queries.

•Used the Spark API over Hortonworks Hadoop YARN and Terraform to perform analytics on data in Hive.

•Implemented Spark using Python and Spark SQL for faster testing and processing of data.

•Involved in developing a MapReduce framework that filters bad and unnecessary records.

•Ingested data from RDBMS and performed data transformations using Data Flow, Azure S3, GCP, and AWS Glue, and then exported the transformed data to Cassandra per the business requirement.
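
A brief sketch of the export-to-Cassandra step using the Spark Cassandra connector's DataFrame API; the keyspace, table, and source path are assumptions, and the connector JAR is expected to be on the classpath:

```python
# Sketch: write a transformed DataFrame to Cassandra (host, keyspace, and table are placeholders).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("export-to-cassandra")
         .config("spark.cassandra.connection.host", "cassandra-host")
         .getOrCreate())

transformed = spark.read.parquet("s3://bucket/curated/customer_orders/")

(transformed.write
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="sales", table="customer_orders")
    .mode("append")
    .save())
```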

•Involved in converting Hive/SQL, T-SQL, and HQL queries into Spark transformations (Beam, HQL, cloud computing, CloudFormation) using Spark RDDs with Python.

•Exported the analyzed data to the relational databases using Sqoop to further visualize and generate reports for the BI team.

•Migrated the computational code in HQL to PySpark.

•Worked with the Spark ecosystem using Scala and Hive queries on different data formats like text files and Parquet.

•Worked on migrating HiveQL into Impala to minimize query response time.

•Responsible for migrating the code base to Amazon EMR, AWS S3, JSON, Flink, CSV, Airflow, Athena, and Spark optimization, and evaluated Amazon ecosystem components like Redshift and S3.

•Collected log data from web servers and integrated it into HDFS using Flume.

• Developed Python scripts to clean the raw data.

•Imported data from AWS S3 into Spark RDDs (Data Pipelines, Athena, Google Cloud Platform (GCP)) and performed transformations and actions on the RDDs.

•Used AWS services like EC2 and S3 for small data set processing and storage.

•Implemented NiFi flow topologies to perform cleansing operations before moving data into HDFS.

•Worked on different file formats (ORC, Parquet, Avro) and different compression codecs (GZIP, Snappy, LZO). Created applications using Kafka to monitor consumer lag within Apache Kafka clusters.

•Worked on importing and exporting data into HDFS and Hive using Sqoop and Hudi, and built analytics on Hive tables using HiveContext in Spark jobs.
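
A short PySpark sketch of a Hudi upsert like the one described above; the table name, key fields, and paths are illustrative, and the Hudi Spark bundle is assumed to be available on the cluster:

```python
# Sketch: upsert incremental records into a Hudi table (all names and paths are placeholders).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert-demo").getOrCreate()

incremental_df = spark.read.parquet("hdfs:///data/staging/orders_delta/")  # changed rows

hudi_options = {
    "hoodie.table.name": "orders_hudi",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.partitionpath.field": "load_date",
    "hoodie.datasource.write.operation": "upsert",
}

(incremental_df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("hdfs:///data/hudi/orders/"))
```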

•Extracted the data from Teradata into HDFS/Dashboards using Spark Streaming.

•Developed workflow in Oozie to automate the tasks of loading the data into HDFS.

•Worked in Agile environment using Scrum methodology.

Environment: Hadoop, Hive, MapReduce, Sqoop, Kafka, Spark, Big Data, Git, Bitbucket, Data Management, Database Administration, Data Analytics, IAM, Flink, Azure, AWS, Google Cloud Platform (GCP), Data Transformation, YARN, Pig, PySpark, Cassandra, Apache Airflow, NiFi, Solr, Shell Scripting, HBase, Scala, AWS Redshift, S3, Agile methodologies, Terraform, Hortonworks, SOAP, Python, Teradata, MySQL.

Data Engineer

CBRE, Dallas, TX March 2017 to August 2018

Responsibilities:

•Worked with data transfer from on-premises SQL Servers to cloud databases (Azure Synapse Analytics (DW) and Azure SQL DB).

•Created pipelines in Azure Data Factory using Linked Services, Datasets, and Pipelines to extract, transform, and load data from a variety of sources including Azure SQL, Blob Storage, AWS infrastructure, Airflow, AWS S3, microservices, Azure SQL Data Warehouse, and write-back tools.

•Used a blend of Azure Data Factory, T-SQL, ETL, Talend, KPI scorecards, dashboards, Data Flow, Airflow, Spark SQL, and U-SQL (Azure Data Lake Analytics) to gather, convert, and load data from source systems to Azure data storage services.

•Processed structured and semi-structured data in Spark clusters using CI/CD, Spark SQL, and the DataFrames API.

•Have good experience working with Azure Blob and Data Lake Storage and loading data into Azure SQL Synapse Analytics (DW).

•Optimized Apache Spark clusters using Delta Lake.
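
A minimal Delta Lake sketch of this kind of optimization; the storage paths and columns are placeholders, and OPTIMIZE/ZORDER assume a Databricks runtime (or a Delta version that supports them):

```python
# Sketch: write a partitioned Delta table, then compact small files and cluster
# rows by a frequently filtered column (all paths and column names are placeholders).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-optimize-demo").getOrCreate()

df = spark.read.parquet("abfss://raw@storageacct.dfs.core.windows.net/orders/")
(df.write.format("delta")
   .mode("overwrite")
   .partitionBy("load_date")
   .save("abfss://curated@storageacct.dfs.core.windows.net/orders_delta/"))

# Compact small files and co-locate rows frequently filtered by customer_id.
spark.sql("""
    OPTIMIZE delta.`abfss://curated@storageacct.dfs.core.windows.net/orders_delta/`
    ZORDER BY (customer_id)
""")
```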

•Created Spark applications with Azure Data Factory and Spark SQL for data extraction, transformation, and aggregation from various file formats to analyze the data and reveal insights into consumer usage patterns.

•Ingested data into one or more Azure services (Azure Data Lake, Data Transformation, AWS infrastructure, Bitbucket, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.

•Generated the Physical Data Model from the Logical Data Model using the Compare and Merge Utility in ER/Studio and worked with the naming standards utility.

•Monitored the SQL scripts and modified them for improved performance using PySpark SQL.

•Used Kubernetes as the runtime environment of the CI/CD system to build, test, and deploy.

•Used Kubernetes to manage Docker orchestration and containerization, cloud computing, deployment, and scaling.

Environment: Azure Cloud, Azure Data Factory (ADF v2), Azure Function Apps, AWS Infrastructure, IAM, Jenkins, Azure Data Lake, Blob Storage, SQL Server, Windows Remote Desktop, Azure PowerShell, Data Transformation, Databricks, Hudi, CSV, Terraform, Microservices, Python, Kubernetes, Azure SQL Server, Azure Data Warehouse.

Data Analyst

Amigos Software Solutions, Hyderabad, India March 2015 to November 2016

Responsibilities:

Analyzing Functional Specifications Based on Project Requirement.

2 years of Database Analyst experience across various domains including Banking, Financial Services, and Telecommunications.

Ingested data from various data sources into Hadoop HDFS/Hive tables using Sqoop.

Developing Hive Queries for the user requirement.

Worked on multiple POCs implementing a Data Lake for multiple data sources ranging from Team Center, SAP, Workday, SaaS, cloud computing, Git, Bitbucket, and machine logs.

•Developed Spark code using Python and Spark-SQL/Streaming for faster testing and processing of data.

•Planning, scheduling, and implementing Oracle to MS SQL Server migrations for AMAT in-house applications and tools.

•Worked for a financial services firm and one of the largest banking institutions in the United States, with operations in more than 60 countries worldwide. JPMorgan Chase & Co. is a leader in investment banking, financial services for customers, small business and commercial banking, financial transaction processing, asset management, and private equity.

Integrated Tableau with a Hadoop data source to build dashboards providing various insights on the organization's sales.

•Worked on Spark in building BI reports using Tableau. Tableau was integrated with Spark using Spark-SQL.

•Business/Data Analyst for various banking products; performed business analysis on Treasury, Loans, Credit Controls, and Deposit tracks along with other banking products.

•Used shell commands to push environment and test files to IAM and AWS using Jenkins automated pipelines.

•Involved in file movements between HDFS and AWS S3 and extensively worked with S3 bucket in AWS.

•Load data into Amazon Redshift and use AWS CloudWatch to collect and monitor AWS RDS instances within Confidential.

•Developed Spark jobs using Python on top of Yarn/MRv2 for interactive and Batch Analysis.

•Used the data modeling tool Embarcadero ER/Studio to create UML methodology, use case diagrams, DDLs (Data Definition Language), and logical and physical data models.

•Developed workflows in LiveCompare to analyze SAP data and reporting.

•Participated in daily scrum meetings and iterative development.

Environment: Hadoop, YARN, Tableau, Oracle, HDFS, Hive, CSV, Bitbucket, Healthcare, Cloud Computing, Git, Data Management, Sqoop, Microservices, Python Scripting, Informatica 10.5, Spark Optimization, AWS, AWS Glue, AWS S3, SQL, Spark, Spark SQL, Python, IAM, MS SQL Server PDW, Agile.

Hadoop Developer

Creator Technologies Pvt Ltd, Hyderabad, India August 2013 to February 2015

Responsibilities:

•Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs in Java for data cleansing and preprocessing.

•Extensive experience with Apache Hudi datasets for insert/bulk insert operations.

•Involved in loading data from the UNIX file system to HDFS; configured Hive and wrote Hive UDFs.

•Importing and exporting data into HDFS and Hive using Sqoop

•Used Cassandra CQL and Java APIs to retrieve data from Cassandra Table.

•Responsible for cluster maintenance, adding and removing cluster nodes, cluster monitoring, and troubleshooting.

•Worked hands-on with the ETL process. Handled importing of data from various data sources, performed transformations using Hive and MapReduce, and loaded data into HDFS.

•Extracted the data from Teradata into HDFS using Sqoop.

•Analyzed the data by performing Hive queries and running Pig scripts to know user behavior.

•Exported the analyzed patterns back into Teradata using Sqoop.

•Installed the Oozie workflow engine to run multiple Hive jobs.

Environment: Hadoop, Map Reduce, HDFS, UNIX, Hive, Sqoop, Cassandra, ETL, AWS, IAM, Hudi, Pig Script, Cloudera, Oozie.

Education: Bachelor of Technology in Computer Science - JNTU-H – 2013.


