
Data Web Services

Location:
Coppell, TX
Posted:
August 14, 2023


Kranthi Hadoop Developer.

Email: adyw2p@r.postjobfree.com Contact: 469-***-****

Over 6+ years of experience with Big Data ecosystems, including Hadoop, HDFS, Hive, Impala, Sqoop, Spark, Spark Streaming, Kafka, Oozie, Airflow, Storm, Flume, Snowflake, Teradata, and Zookeeper.

PROFESSIONAL SUMMARY

•Expertise in writing end-to-end data processing jobs to analyze data using MapReduce, Hive, Spark, Spark Streaming, Kafka, and other big data ecosystem tools.

•Experienced in Spark architecture, including Spark Core, Spark SQL, DataFrames, Spark Streaming, and Spark MLlib.

•Extensive knowledge of developing Spark Streaming jobs using RDDs (Resilient Distributed Datasets) with Scala, PySpark, and the Spark shell.

•Experienced in data manipulation using Python for loading and extraction, as well as with Python libraries such as NumPy, SciPy, and Pandas for data analysis and numerical computations.

•Extensive experience with partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance.

•Experienced in using Pig scripts to do transformations, event joins, filters, and pre-aggregations before storing the data into HDFS.

•Strong knowledge of Hive analytical functions, extending Hive functionality by writing custom UDFs.

•Expertise in writing MapReduce jobs in Python for processing large sets of structured, semi-structured, and unstructured data and storing them in HDFS.

•Good understanding of data modeling (Dimensional & Relational) concepts like Star-Schema Modeling, Snowflake Schema Modeling, Fact and Dimension tables.

•Hands-on experience working with Amazon Web Services (AWS), using Elastic MapReduce (EMR), Redshift, and EC2 for data processing.

•Hands-on experience with SQL and NoSQL databases such as Snowflake, HBase, Cassandra, and MongoDB.

•Hands-on experience in setting up workflows using Apache Airflow, Autosys, and the Oozie workflow engine for managing and scheduling Hadoop jobs.

•Strong experience in working with UNIX/LINUX environments, writing shell scripts.

•Worked with various formats of files like delimited text files, clickstream log files, Apache log files, Avro files, JSON files, XML Files.

Technical Expertise:

Hadoop: HDFS, Sqoop, Hive, Impala, MapReduce, Pig, Spark, Spark SQL, Spark Streaming, Kafka, Zookeeper, Flume, Oozie, Airflow

NoSQL: HBase, MongoDB, Cassandra

Languages: Java, Python, Scala, UNIX shell

Data Warehousing & ETL: Snowflake, Talend, AWS services

Web Services: XML, SOAP, REST

Databases: Oracle, DB2, SQL Server, MySQL, Teradata

Web Servers: WebLogic, WebSphere, Apache Tomcat

Modeling Tools: UML (Rational Rose), Rational ClearCase, Enterprise Architect, Microsoft Visio

Build Tools: Maven, Ant, SBT, dbt

Professional Experience:

Equifax, St. Louis March 2021 – Present

Role: Hadoop Developer/Data Engineer

Responsibilities:

Imported data in various formats (JSON, Sequence files, text, CSV, Avro, Parquet) into the HDFS cluster with compression enabled for optimization.

Ingested data from RDBMS sources such as Oracle and SQL Server into HDFS using Sqoop.

Developed data pipelines using Kafka, Spark, and Hive to ingest, transform, and analyze data.

Working on CDC (Change Data Capture) tables, using a Spark application to load data into dynamic-partition-enabled Hive tables.
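
A minimal PySpark sketch of this kind of CDC load into a dynamic-partition-enabled Hive table; the table, column, and path names are hypothetical, not the actual production objects:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("cdc-to-hive")
    .config("hive.exec.dynamic.partition", "true")
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .enableHiveSupport()
    .getOrCreate()
)

# Hypothetical CDC extract landed on HDFS.
cdc = spark.read.parquet("hdfs:///data/raw/cdc_events/")

# Keep only the latest change per record key.
w = Window.partitionBy("record_id").orderBy(F.col("updated_at").desc())
latest = (
    cdc.withColumn("rn", F.row_number().over(w))
       .filter("rn = 1")
       .drop("rn")
)

# Append into a Hive table partitioned by event_date; with dynamic
# partitioning enabled, Hive resolves partitions from the data itself.
# Column order must match the table, with the partition column last.
latest.write.mode("append").insertInto("warehouse.customer_cdc")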

Writing Spark jobs to compact files smaller than the HDFS block size, optimizing cluster storage consumption.
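
An illustrative shape of such a compaction job, assuming Parquet input; the paths and target file count are placeholders rather than production values:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-compaction").getOrCreate()

src = "hdfs:///data/events/2021/03/"            # directory with many small files
dst = "hdfs:///data/events_compacted/2021/03/"  # compacted output

df = spark.read.parquet(src)

# Reduce the number of output files; 16 is an arbitrary example that would
# normally be derived from total input size divided by the HDFS block size.
df.coalesce(16).write.mode("overwrite").parquet(dst)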

Building an internal Hadoop framework to load data into the data mart for snapshot and incremental tables, along with aggregations and transformations.

Building data processing pipelines based on Spark and various AWS services such as S3, EC2, EMR, SNS, SQS, Lambda, Redshift, Data Pipeline, Athena, and AWS Glue.

Designing and developing ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift.

Implementing Spark RDD transformations to map business logic and applying actions on top of those transformations.

Automated the resulting scripts and workflows using Apache Airflow and shell scripting to ensure daily execution in production.
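
A hedged sketch of the daily Airflow scheduling pattern; the DAG id, script paths, and tasks are placeholders, not the production workflow:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-eng",
    "retries": 1,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="daily_ingest_pipeline",
    default_args=default_args,
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Ingest step: run an existing shell/Sqoop script.
    ingest = BashOperator(
        task_id="sqoop_ingest",
        bash_command="bash /opt/pipelines/sqoop_ingest.sh ",
    )

    # Transform step: submit the Spark job that builds the curated tables.
    transform = BashOperator(
        task_id="spark_transform",
        bash_command="spark-submit /opt/pipelines/transform.py ",
    )

    ingest >> transform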

Designed, built, and managed ELT data pipelines, leveraging Airflow, Python, dbt, and AWS solutions.

Working on Snowflake modeling; highly proficient in data warehousing techniques for data cleansing, slowly changing dimensions (SCD), surrogate key assignment, and change data capture.

Extracting real-time feeds using Kafka and Spark Streaming, converting them to RDDs, processing the data as DataFrames, and saving it in Parquet format to HDFS.
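
A sketch of this Kafka-to-HDFS flow using Spark Structured Streaming (the DataFrame-based API rather than the DStream/RDD path described above); the broker, topic, and paths are assumptions:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "transactions")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; cast the value to string here and
# parse it (e.g. with from_json) in a real job.
events = raw.select(F.col("value").cast("string").alias("payload"))

query = (
    events.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/streams/transactions/")
    .option("checkpointLocation", "hdfs:///checkpoints/transactions/")
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()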

Investigating data quality issues and generating presentable narratives on possible biases caused by incomplete data.

Developing data pipelines using Flume, Pig, and Sqoop to ingest cargo data and customer histories into HDFS for analysis.

Developing Sqoop scripts to import/export data from relational sources and handling incremental loading of customer and transaction data by date.

Working with different File Formats like text file, Avro, ORC, and Parquet for Hive querying and processing.

Importing & exporting database using SQL Server Integrations Services (SSIS) and Data Transformation Services (DTS Packages).

Developing Talend jobs to populate claims data into the data warehouse using star, snowflake, and hybrid schemas.

Working on complex SQL queries and PL/SQL procedures and converting them into ETL tasks.

Environment: Hadoop, Spark, Scala, Python, Java, AWS, EMR, Lambda, S3, Athena, Glue, Redshift, Snowflake, Elastic Map Reduce, Hive, Impala, Sqoop, Oozie, Tableau, Airflow, DBT, SQL, SSIS, DynamoDB, NumPy, Pandas, Ansible, Git.

Dell, Austin, TX July 2018 – Feb 2021

Role: Hadoop Developer

Responsibilities:

Developed Spark applications in Python (PySpark) in a distributed environment to load a huge number of CSV files with different schemas into Hive ORC tables.
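
A minimal PySpark sketch of loading CSV files with an explicit schema into a Hive ORC table; the schema, path, and table name are placeholders:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DateType

spark = (
    SparkSession.builder
    .appName("csv-to-hive-orc")
    .enableHiveSupport()
    .getOrCreate()
)

# Explicit schema for one of the source feeds (illustrative columns).
schema = StructType([
    StructField("order_id", StringType(), True),
    StructField("customer_id", StringType(), True),
    StructField("order_date", DateType(), True),
])

df = (
    spark.read
    .option("header", "true")
    .schema(schema)
    .csv("hdfs:///landing/orders/*.csv")
)

# Persist as an ORC-backed Hive table.
df.write.mode("append").format("orc").saveAsTable("staging.orders_orc")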

Wrote Spark jobs to compact files smaller than the HDFS block size, optimizing cluster storage consumption.

Built an internal Hadoop framework to load data into the data mart for snapshot and incremental tables, along with aggregations and transformations.

Developed a fully customized framework using Python, shell scripts, Sqoop, and Hive.

Migrated data from Oracle to Data Lake using Sqoop, Spark and Talend (ETL Tool).

Extensively worked on developing strategies for ETL data from various sources into Data Warehouse and Data Marts using DataStage.

Modeled, lifted, and shifted custom SQL and transposed LookML into dbt to materialize incremental views.

Worked extensively in Python and built a custom ingest framework.

Involved in file movements between HDFS and AWS S3, and worked extensively with S3 buckets in AWS.

Converted all Hadoop jobs to run in EMR by configuring the cluster according to the data size.

Created S3 buckets, managed bucket policies, and utilized S3 and Glacier for storage and backup on AWS.

Developed PySpark code for AWS Glue jobs and EMR, and worked on developing ETL pipelines over S3 Parquet files in the data lake using AWS Glue.
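
A hedged sketch of a Glue job that reads Parquet from S3 and writes curated output back out; the bucket paths are placeholders and the job assumes the standard Glue runtime:

import sys

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

# Standard Glue job boilerplate.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read Parquet files from an S3 prefix (paths are illustrative).
source = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-datalake/raw/campaigns/"]},
    format="parquet",
)

# Write the curated output back to S3 as Parquet.
glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://example-datalake/curated/campaigns/"},
    format="parquet",
)

job.commit()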

Worked on ETL migration by developing and deploying AWS Lambda functions to build a serverless data pipeline that writes to the Glue Catalog and can be queried from Athena.
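
An illustrative Lambda handler that triggers an Athena query over Glue Catalog tables via boto3; the query, database, and S3 output location are assumptions:

import boto3

athena = boto3.client("athena")

def lambda_handler(event, context):
    # Start an Athena query against a Glue Catalog database; names and
    # locations here are placeholders, not the actual pipeline objects.
    response = athena.start_query_execution(
        QueryString="SELECT count(*) FROM campaign_events",
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )
    # Return the execution id so a downstream step can poll for the result.
    return {"query_execution_id": response["QueryExecutionId"]}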

Performed end-to-end architecture and implementation assessments of various AWS services such as Amazon EMR, Redshift, and S3.

Worked on Batch processing and Real-time data processing on Spark Streaming using Lambda architecture.

Worked extensively on creating data pipelines integrating Kafka with Spark Streaming applications, using Scala to write the applications.

Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.

Used Spark to process data before ingesting it into HBase; created both batch and real-time Spark jobs using Scala.

Used SBT to build the Scala project.

Worked on CDC (Change Data Capture) tables, using a Spark application to load data into dynamic-partition-enabled Hive tables.

Documented logical, physical, relational, and dimensional data models. Designed the Data Marts in dimensional data modeling using star and snowflake schemas.

Created Snowpipe for continuous data loading and applied transformation logic using SnowSQL.
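
An illustrative way to create such a pipe from Python using the Snowflake connector; the account, credentials, stage, and table names are assumptions:

import snowflake.connector

# Connection parameters are placeholders.
conn = snowflake.connector.connect(
    account="my_account",
    user="etl_user",
    password="********",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="RAW",
)

# CREATE PIPE wraps a COPY INTO statement; with AUTO_INGEST enabled,
# Snowpipe loads new files as they arrive on the external stage.
create_pipe = """
CREATE OR REPLACE PIPE RAW.CLAIMS_PIPE
  AUTO_INGEST = TRUE
AS
  COPY INTO RAW.CLAIMS
  FROM @RAW.CLAIMS_STAGE
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
"""

cur = conn.cursor()
try:
    cur.execute(create_pipe)
finally:
    cur.close()
    conn.close()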

Pushed data to Cassandra databases using the bulk load utility.

Created Airflow Scheduling scripts in Python.

Conducted ETL development in the Netezza environment using standard design methodologies.

Scheduled Hadoop jobs by writing scripts for the Autosys automated job control system.

Environment:

Hadoop, CDH, Spark, AWS, Lambda, Scala, Python, Airflow, DBT, Hive, Impala, Sqoop, Pig, Storm, Oozie, SQL, DataStage, Autosys, Tableau, Netezza, Cassandra, Bit bucket, Ansible, Kubernetes, Jenkins, JFrog, JIRA.

Nadsol Techno Labs Private Limited, Hyderabad, India March 2017 – June 2018

Role: Software Engineer.

Responsibilities:

Developed J2EE front-end and back-end components supporting business logic, integration, and persistence.

Used JSP with Spring Framework for developing User Interfaces.

Developed the front-end user interface using J2EE, Servlets, JDBC, HTML, DHTML, CSS, XML, XSL, XSLT and JavaScript as per Use Case Specification.

Integrated Security Web Services for authentication of users.

Used Hibernate Object/Relational mapping and persistence framework as well as a Data Access abstraction Layer.

Data Access Objects (DAO) framework is bundled as part of the Hibernate Database Layer.

Designed Data Mapping XML documents that are utilized by Hibernate, to call stored procedures.

Implemented web services to integrate different applications (internal and third-party components) using SOAP and RESTful services with Apache CXF.

Developed and published web-services using SOAP.

Developed efficient PL/SQL packages for data migration and involved in bulk loads, testing and reports generation.

Development of complex SQL queries and stored procedures to process and store the data

Used CVS version control to maintain the Source Code.

Environment:

Java, J2EE, JSPs, Struts, EJB, Spring, RESTful, SOAP, Apache CXF, WebSphere, PL/SQL, Hibernate, HTML, XML, Oracle 9i, Swing, JavaScript, CVS.

Education:

Master’s in Computer/Information Technology Management, Lindsey Wilson College, Columbia, Kentucky.

