
Sai Krupa | Email: ad39k2@r.postjobfree.com

Senior Data Engineer | Mobile: 281-***-****

Summary:

4+ years of IT experience in software analysis, design, development, testing, and implementation of Big Data, Hadoop, SQL, and NoSQL technologies.

Expert in big data technologies and cloud platforms, including AWS, Azure, and Snowflake Data Warehouse, with a comprehensive understanding of the Hadoop ecosystem (Hive, Pig, Oozie, HBase, and ZooKeeper).

Developed Pig Latin scripts to extract the data from the web server output files to load into HDFS.

Experience in AWS services such as EMR, EC2, S3, CloudFormation, Redshift, and DynamoDB, which provide fast and efficient processing of Big Data.

Strong knowledge of NoSQL databases such as HBase, MongoDB, and Cassandra.

Created and configured continuous delivery pipelines for deploying microservices using Jenkins.

Wrote multiple MapReduce programs in Python for data extraction, transformation, and aggregation from multiple file formats including XML, JSON, CSV, and other compressed formats; additional experience with EJB, Hibernate, Java Web Services (SOAP, REST), Java Threads, Sockets, Servlets, JSP, and JDBC.
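
Illustrative sketch (Python, hypothetical names): a minimal Hadoop Streaming mapper/reducer of the kind described above, aggregating record counts per key from CSV input; the field layout and key column are assumptions, not taken from any specific project here.

#!/usr/bin/env python3
# Minimal Hadoop Streaming sketch: mapper and reducer in one file,
# selected via the first CLI argument ("map" or "reduce").
# Assumes CSV rows where the first column is the grouping key.
import sys

def mapper():
    for line in sys.stdin:
        fields = line.rstrip("\n").split(",")
        if fields and fields[0]:
            # Emit key<TAB>1 so the Hadoop shuffle can sort and group by key.
            print(f"{fields[0]}\t1")

def reducer():
    # Hadoop delivers the mapper output sorted by key, so a running total per key works.
    current_key, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key == current_key:
            count += int(value)
        else:
            if current_key is not None:
                print(f"{current_key}\t{count}")
            current_key, count = key, int(value)
    if current_key is not None:
        print(f"{current_key}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()

It would be submitted through the Hadoop Streaming jar, roughly: hadoop jar hadoop-streaming.jar -input /data/in -output /data/out -mapper "python3 streaming_counts.py map" -reducer "python3 streaming_counts.py reduce" (paths and file name are placeholders).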

Used Pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn, and NLTK in Python for developing machine learning models, including linear regression.
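
Illustrative sketch (Python): a minimal pandas/NumPy/scikit-learn linear regression workflow of the kind referred to above, run on synthetic placeholder data rather than any real project dataset.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic example data: predict a numeric target from two numeric features.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "feature_a": rng.normal(size=500),
    "feature_b": rng.normal(size=500),
})
df["target"] = 3.0 * df["feature_a"] - 1.5 * df["feature_b"] + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    df[["feature_a", "feature_b"]], df["target"], test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
print("coefficients:", model.coef_, "R^2:", r2_score(y_test, model.predict(X_test)))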

Good experience using Sqoop for traditional RDBMS data pulls.

Hands-on experience with Hadoop, HDFS, MapReduce, and the Hadoop ecosystem (Pig, Hive, Oozie, Flume, and HBase).

Good experience with data transformation and storage: HDFS, MapReduce, Spark.

Hands-on experience developing Spark applications using RDD transformations, Spark Core, Spark Streaming, and Spark SQL.
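
Illustrative sketch (Python/PySpark): a small example of the RDD transformations and Spark SQL usage mentioned above; the input path and column names are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-and-sql-demo").getOrCreate()
sc = spark.sparkContext

# RDD transformations: parse CSV lines and aggregate counts per key.
lines = sc.textFile("hdfs:///data/events/*.csv")          # hypothetical path
pairs = (lines.map(lambda line: line.split(","))
              .filter(lambda f: len(f) > 1)
              .map(lambda f: (f[0], 1))
              .reduceByKey(lambda a, b: a + b))

# Spark SQL: register the result as a temporary view and query it.
df = pairs.toDF(["event_type", "cnt"])
df.createOrReplaceTempView("event_counts")
spark.sql("SELECT event_type, cnt FROM event_counts ORDER BY cnt DESC LIMIT 10").show()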

Strong understanding of Hadoop daemons and MapReduce concepts.

Expertise in Java and Scala

Seamlessly integrated IBM Watson, Solr, Elasticsearch, and OpenSearch technologies within multi-cloud environments, optimizing data flow and accessibility across platforms.

Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.

Worked with the YARN Queue Manager to allocate queue capacities for different service accounts.

Hands-on experience with Hortonworks and Cloudera Hadoop environments.

Led testing efforts in support of projects/programs across a large landscape of technologies (Unix, AWS, Cucumber JVM, MongoDB, GitHub, SQL, NoSQL databases, APIs, Java, Jenkins).

Excellent analytical and programming abilities in using technology to create flexible and maintainable solutions for complex development problems.

Technical Skills:

Languages/Tools: SQL, PL/SQL, Python, Scala, Java, C, C++

Big Data Technologies: Hadoop, HDFS, MapReduce, Hive, Pig, HBase, Sqoop, Oozie, ZooKeeper, Spark, Kafka, Storm, Solr, Impala, Airflow

Testing & Case Tools: JUnit, Log4j, Rational ClearCase, CVS, ANT, Maven, JBuilder

Build Tools: CVS, Subversion, Git, Ant, Maven, Gradle, Hudson, TeamCity, Jenkins, Chef, Puppet, Ansible, Docker

Scripting Languages: Shell (Bash), Perl, PowerShell, Ruby, Groovy

Monitoring Tools: Nagios, CloudWatch, JIRA, Bugzilla, Remedy

Databases: Oracle, MS SQL Server, DB2, MS Access, MySQL, Teradata, Cassandra, MongoDB, HBase

Cloud Environment: AWS, Azure, Snowflake

Operating Systems: Windows, Unix, Linux (Red Hat 5.x/6.x/7.x, SUSE Linux 10), Sun Solaris, Ubuntu, CentOS

Professional Experience:

Client: IT Induct, Inc., Coppell, TX Aug’2023 – Present

Role: Cloud Engineer

Responsibilities:

•Worked as a Cloud Data Engineer on Big Data Hadoop ecosystems such as HDFS, Hive, Spark, Databricks, Kafka, and YARN on AWS and Azure cloud services and cloud relational databases.

•Installed and configured a multi-node cluster in the cloud using Amazon Web Services (AWS) EC2.

•Developed Hive queries and used Sqoop to move data from RDBMS to the Hadoop staging area.

•Handled importing of data from various sources, performed transformations using Hive, and loaded data into the data lake.

•Extensive experience designing and implementing web applications in LAMP (Linux, Apache, MySQL, PHP) environments.

•Analyzed, designed, and built modern, scalable distributed data solutions using Hadoop and AWS cloud services.

•Hands-on with Amazon EC2, S3, Redshift, EMR, RDS, ELB, AWS CloudFormation, and other services in the AWS family.

•Performed an Informatica Intelligent Cloud Services (IICS) pilot project on AWS.

•Scheduled Spark/Scala jobs using Airflow in the Hadoop cluster and generated detailed design documentation for the source-to-target transformations.
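
Illustrative sketch (Python): a minimal Airflow DAG of the kind of Spark job scheduling described above, submitting a job via spark-submit from a BashOperator. It assumes Airflow 2.x; the DAG name, schedule, and script path are placeholders.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator  # assumes Airflow 2.x

default_args = {
    "owner": "data-eng",
    "retries": 1,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="nightly_source_to_target",           # hypothetical DAG name
    default_args=default_args,
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",                # run daily at 02:00
    catchup=False,
) as dag:
    submit_spark_job = BashOperator(
        task_id="spark_submit_transform",
        bash_command=(
            "spark-submit --master yarn --deploy-mode cluster "
            "/opt/jobs/source_to_target.py --run-date {{ ds }}"  # placeholder job script
        ),
    )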

•Designed and developed the architecture for a data services ecosystem spanning relational, NoSQL, and Big Data technologies. Extracted large datasets from Amazon Redshift and Elasticsearch on AWS using SQL queries to create reports.

•Good experience in object-oriented programming (OOP) concepts with C# and Python.

•Leveraged IBM Watson's AI and machine learning capabilities to develop cognitive computing solutions, enhancing decision-making processes and customer interactions.

•Administered and optimized Apache Solr instances for high-volume search applications, ensuring rapid response times and relevancy in search results.

•Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcasts, effective & efficient joins, transformations, and other operations

•Processed data stored in the data lake, created external tables using Hive, and developed reusable scripts to ingest and repair tables across the project.

•Developed Spark scripts in Python as per requirements.

•Solved performance issues in Spark with an understanding of grouping, joins, and aggregation.

•Experience in using the EMR cluster and various EC2 instance types based on requirements.

•Used the AWS Glue Data Catalog with crawlers to catalog data from S3 and performed SQL query operations.
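
Illustrative sketch (Python/boto3): the Glue-catalog-plus-crawler pattern this bullet refers to, starting a pre-created crawler over an S3 location and then querying the resulting catalog table through Athena. The crawler, database, table, and bucket names are all hypothetical.

import time
import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

# Kick off a (pre-created) crawler that catalogs s3://example-bucket/raw/ into the Glue Data Catalog.
glue.start_crawler(Name="raw-events-crawler")            # hypothetical crawler name

# Wait for the crawler to return to the READY state before querying.
while glue.get_crawler(Name="raw-events-crawler")["Crawler"]["State"] != "READY":
    time.sleep(10)

# Query the cataloged table through Athena.
query = athena.start_query_execution(
    QueryString="SELECT event_type, COUNT(*) AS cnt FROM raw_events GROUP BY event_type",
    QueryExecutionContext={"Database": "analytics_db"},               # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)

# Poll until the query completes, then fetch the results.
while True:
    state = athena.get_query_execution(QueryExecutionId=query["QueryExecutionId"])
    status = state["QueryExecution"]["Status"]["State"]
    if status in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if status == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query["QueryExecutionId"])
    print(rows["ResultSet"]["Rows"][:5])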

•Designed and developed MapReduce programs to analyze and evaluate multiple solutions.

•Experience in building frameworks in Python for Test Automation.
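
Illustrative sketch (Python): a small test-automation pattern of the kind the frameworks above would use, built on pytest fixtures and parametrization; the validation function under test is a made-up example.

import pytest

# Example function under test (hypothetical): basic record-validation logic.
def validate_record(record: dict) -> bool:
    return bool(record.get("id")) and record.get("amount", 0) >= 0

@pytest.fixture
def sample_record():
    return {"id": "ord-001", "amount": 42.5}

def test_valid_record_passes(sample_record):
    assert validate_record(sample_record)

@pytest.mark.parametrize("bad_record", [
    {},                                # missing id and amount
    {"id": "", "amount": 10},          # empty id
    {"id": "ord-002", "amount": -1},   # negative amount
])
def test_invalid_records_fail(bad_record):
    assert not validate_record(bad_record)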

•Worked on continuous integration (CI) tooling: Gradle, Maven, Ant, Jenkins, Git.

•Developed dataflows and processes using Spark SQL and Spark DataFrames.

•Worked on Hive Metastore backups and on partitioning and bucketing techniques in Hive to improve performance.
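
Illustrative sketch (Python/PySpark): partitioning and bucketing of the kind referenced above, expressed with Spark's native DataFrameWriter bucketing (a close analogue of Hive bucketing); the database, table, and column names are placeholders.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("partition-bucket-demo")
         .enableHiveSupport()          # persist table metadata in the Hive Metastore
         .getOrCreate())

# Hypothetical source of events already available as files.
events = spark.read.parquet("hdfs:///data/events_staging/")

# Partition by a low-cardinality column (event_date) so queries prune whole directories,
# and bucket by a high-cardinality join key (user_id) so joins avoid full shuffles.
(events.write
       .partitionBy("event_date")
       .bucketBy(32, "user_id")
       .sortBy("user_id")
       .format("parquet")
       .mode("overwrite")
       .saveAsTable("analytics.events_bucketed"))

# A partition-pruned aggregation over a single day.
spark.sql("""
    SELECT event_type, COUNT(*) AS cnt
    FROM analytics.events_bucketed
    WHERE event_date = '2024-03-01'
    GROUP BY event_type
""").show()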

•Worked in an Agile methodology and actively participated in stand-up calls, PI planning, and work reporting.

•Involved in requirements gathering and prepared design documents.

Environment: AWS, OOP, LAMP, Spark, Python, Sqoop, Hive, Hadoop, SQL, HBase, MapReduce, HDFS, Agile

Client: Hexaware Technologies, India Jun’2020 – Jul’2021

Role: Data Engineer

Responsibilities:

Responsible for architecting and implementing very large-scale data intelligence solutions around Snowflake Data Warehouse.

•Scheduled Spark/Scala jobs using Airflow in the Hadoop cluster and generated detailed design documentation for the source-to-target transformations.

•Involved in creating an HDInsight cluster in the Microsoft Azure portal; also created Event Hubs and Azure SQL databases.

•Developed a data pipeline using Event Hubs, Spark, Hive, Pig, and Azure SQL Database to ingest customer behavioral data and financial histories into the HDInsight cluster for analysis.

•Worked on clustered Hadoop for Windows Azure using HDInsight and the Hortonworks Data Platform for Windows.

•Created Lambda functions in Java that trigger and activate the pipeline, which transforms the data into a suitable format for loading into analytical tools.

•Used Pig to perform transformations, event joins, bot-traffic filtering, and some pre-aggregations before storing the data in the Azure database.

•Expertise with tools in the Hadoop ecosystem including Pig, Hive, HDFS, YARN, Oozie, and ZooKeeper, as well as Hadoop architecture and its components.

Responsible for migrating key systems from on-premises hosting to Azure cloud services; wrote SnowSQL queries against Snowflake.

Designed and maintained scalable Elasticsearch clusters, facilitating efficient storage, search, and analysis of large datasets in real-time.
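
Illustrative sketch (Python): indexing and searching documents with the Elasticsearch Python client, consistent with the cluster work described above; it assumes the 8.x client API, and the host URL, index name, and documents are placeholders.

from elasticsearch import Elasticsearch

# Connect to a cluster node (placeholder URL; authentication omitted for brevity).
es = Elasticsearch("http://localhost:9200")

# Index a document into a hypothetical "transactions" index.
es.index(
    index="transactions",
    id="txn-0001",
    document={"customer": "c-123", "amount": 250.0, "status": "settled"},
)

# Term search over the same index.
response = es.search(
    index="transactions",
    query={"term": {"status": "settled"}},
    size=10,
)
for hit in response["hits"]["hits"]:
    print(hit["_id"], hit["_source"])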

Transitioned search and analytics workloads to OpenSearch, leveraging its open-source framework for improved scalability and flexibility in search applications.

•Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.

•Experienced with Spark Streaming to ingest data into the Spark engine.

•Imported data from different sources such as Event Hubs and Cosmos DB into Spark RDDs.

•Responsible for writing SQL queries and procedures using DB2.

•Developed Spark code using Scala and Spark SQL/Streaming for faster testing and processing of data.

•Experience with different configuration files to access data from DB2 databases.

•Configured and deployed Azure Automation scripts for applications utilizing the Azure stack, including compute, blobs, ADF, Azure Data Lake, Azure Data Factory, Azure SQL, cloud services, and ARM templates, with a focus on automation; involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.

•Developed multiple POCs using Scala, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL/Teradata.

•Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
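
Illustrative sketch (Python/PySpark): one ingestion-and-processing step of the kind described above, as it might look in an Azure Databricks notebook: read raw files from an ADLS Gen2 path, apply a transformation, and write curated output. The storage account, containers, paths, and column names are placeholders, and credential setup (e.g., a service principal or mounted storage) is assumed to be configured separately.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Databricks a SparkSession already exists as `spark`; getOrCreate() reuses it elsewhere.
spark = SparkSession.builder.appName("adls-ingest-demo").getOrCreate()

# Hypothetical ADLS Gen2 locations (abfss://<container>@<account>.dfs.core.windows.net/<path>).
raw_path = "abfss://raw@examplestorageacct.dfs.core.windows.net/sales/2024/"
curated_path = "abfss://curated@examplestorageacct.dfs.core.windows.net/sales_daily/"

# Ingest raw CSV, derive a daily aggregate, and persist as Parquet partitioned by date.
raw_df = spark.read.option("header", "true").csv(raw_path)

daily = (raw_df
         .withColumn("amount", F.col("amount").cast("double"))
         .groupBy("sale_date", "region")
         .agg(F.sum("amount").alias("total_amount")))

daily.write.mode("overwrite").partitionBy("sale_date").parquet(curated_path)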

Built, created, and configured enterprise-level Snowflake environments; maintained, implemented, and monitored Snowflake environments.

•Worked extensively on the Spark SQL and Spark Streaming modules and used Scala to write code for all Spark use cases.

•Used the DataFrame API in Scala to work with distributed collections of data organized into named columns.

•Implemented Spark programs using PySpark, analyzed the SQL scripts, and designed the solutions.

•Analyzed the SQL scripts and designed the solutions for implementation using Scala.

Environment: Azure, Snowflake, Spark, Hive, Spark SQL, Kafka, Hortonworks, JBoss Drools, Pig, Oozie, HBase, Python, Scala, Maven, Jupyter Notebook, Visual Studio, NiFi, Unix Shell Scripting.

Client: Sasken Technologies, India Aug’2019 – May’2020

Role: Big Data Engineer

Responsibilities:

Worked with Hadoop infrastructure to store data in HDFS and used Spark/Hive SQL to migrate the underlying SQL codebase to AWS.

Experience in data processing such as collecting, aggregating, and moving data using Apache Kafka.

Used Kafka to load data into HDFS and move data back to S3 after data processing
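
Illustrative sketch (Python/PySpark): a minimal Structured Streaming job for the Kafka-to-HDFS loading pattern mentioned above; broker addresses, the topic, and output paths are placeholders, and the spark-sql-kafka connector is assumed to be on the classpath.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-to-hdfs-demo").getOrCreate()

# Read a stream of events from a Kafka topic (placeholder brokers/topic).
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
          .option("subscribe", "clickstream")
          .option("startingOffsets", "latest")
          .load())

# Kafka delivers key/value as binary; cast the value to string for downstream parsing.
parsed = events.select(
    F.col("key").cast("string").alias("key"),
    F.col("value").cast("string").alias("payload"),
    F.col("timestamp"),
)

# Continuously append the parsed records to HDFS as Parquet, with a checkpoint for recovery.
query = (parsed.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/clickstream/parquet/")
         .option("checkpointLocation", "hdfs:///checkpoints/clickstream/")
         .outputMode("append")
         .start())

query.awaitTermination()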

Expertise in installation, configuration, support, and management of Hadoop clusters using Apache, Cloudera (CDH3, CDH4), and Hortonworks distributions and on Amazon Web Services (AWS).

Prepared data mapping documents and designed the ETL jobs based on the DMD with the required tables in the dev environment.

Worked on processes to transfer/migrate data from AWS S3, relational databases, and flat files in various formats into common staging tables and then into meaningful data in Snowflake.

Used Talend for Big Data Integration using Spark and Hadoop.

Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.

Developed Spark programs with Python and applied functional programming principles to process complex structured data sets.

Worked with the Hadoop ecosystem, implemented Spark using Scala, and utilized DataFrames and the Spark SQL API for faster processing of data.

Wrote scripts against Oracle, SQL Server, and Netezza databases to extract data for reporting and analysis, and imported and cleansed high-volume data from various sources such as DB2, Oracle, and flat files onto SQL Server.

Integrated IBM Watson services with enterprise applications to implement natural language processing, sentiment analysis, and chatbot functionalities, significantly improving user engagement and satisfaction.

Experience with Data Analytics, Data Reporting, Ad-hoc Reporting, Graphs, Scales, PivotTables and OLTP reporting.

Generated metadata and created Talend ETL jobs and mappings to load the data warehouse and data lake.

Designed and Developed Real Time Stream Processing Application using Spark, Kafka, Scala and Hive to perform Streaming ETL and apply Machine Learning.

Processed location and segments data from S3 to Snowflake using Tasks, Streams, Pipes, and stored procedures.
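
Illustrative sketch (Python): the S3-to-Snowflake pattern described above, issuing Snowflake SQL (stage, pipe, stream, task) through the Python connector. All object names, columns, credentials, the storage integration, and the schedule are placeholders, and the raw and curated tables are assumed to exist.

import snowflake.connector

# Connection parameters are placeholders; in practice they come from a secrets manager.
conn = snowflake.connector.connect(
    account="xy12345", user="ETL_USER", password="***",
    warehouse="ETL_WH", database="ANALYTICS", schema="RAW",
)
cur = conn.cursor()

# External stage over the S3 landing area (storage integration assumed to exist).
cur.execute("""
    CREATE STAGE IF NOT EXISTS segments_stage
    URL = 's3://example-bucket/segments/'
    STORAGE_INTEGRATION = s3_int
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")

# Snowpipe to continuously copy staged files into the raw table.
cur.execute("""
    CREATE PIPE IF NOT EXISTS segments_pipe AUTO_INGEST = TRUE AS
    COPY INTO RAW.SEGMENTS_RAW FROM @segments_stage
""")

# Stream to capture new rows, and a task that loads them into the curated table.
cur.execute("CREATE STREAM IF NOT EXISTS segments_stream ON TABLE RAW.SEGMENTS_RAW")
cur.execute("""
    CREATE TASK IF NOT EXISTS load_segments_task
    WAREHOUSE = ETL_WH
    SCHEDULE = '5 MINUTE'
    WHEN SYSTEM$STREAM_HAS_DATA('SEGMENTS_STREAM')
    AS
    INSERT INTO ANALYTICS.CURATED.SEGMENTS (SEGMENT_ID, LOCATION, UPDATED_AT)
    SELECT SEGMENT_ID, LOCATION, UPDATED_AT
    FROM segments_stream
    WHERE METADATA$ACTION = 'INSERT'
""")
cur.execute("ALTER TASK load_segments_task RESUME")

cur.close()
conn.close()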

Responsible for the design and development of high-performance data architectures that support data warehousing, real-time ETL, and batch big-data processing.

Filtered and cleaned data using Scala code and SQL queries.

Converted Hive/SQL queries into Spark transformations using Spark RDDs and PySpark.

Analyzed SQL scripts and designed the solutions for implementation using PySpark.

Exported tables from Teradata to HDFS using Sqoop and built tables in Hive.

Used ZooKeeper to store offsets of messages consumed for a specific topic and partition by a specific consumer group in Kafka.

Environment: AWS, Snowflake, Hadoop, Spark, Scala, HBase, Hive, UNIX, Erwin, TOAD, MS SQL Server, XML files, Cassandra, MongoDB, Kafka, IBM InfoSphere DataStage, PL/SQL, Oracle 12c, flat files, Autosys, MS Access.

Client: Sasken Technologies, India Aug’2018 – Jul’2019

Role: Business Data Analyst - Intern

Responsibilities:

Involved in analyzing, planning, coding, debugging, testing and go-live phases of Big Data applications.

Designed and developed Use Cases, Activity Diagrams, Sequence Diagrams using Visio.

Performed technical analysis, validating, and enhancing the processes by incorporating vendors' requirements.

Worked with the business group to document functional requirements for migration project and process enhancements including new features, functionality, version upgrades, and reporting requirements.

Conducted workflow, process, and GAP analysis to derive requirements for existing systems enhancements.

Performed data analysis and used SQL extensively for analysis and business logic validation.

Prepared high-level test scenarios and helped the QA team in preparing and reviewing test cases.

Performed User Acceptance Testing (UAT) for Enhancement phase 1 and phase 2.

Environment: SQL, MS Office, MS Excel, Linux, Windows

Education:

New Jersey Institute of Technology Aug’ 2021 – Aug’2023

Master’s in Computer Science

During my master’s at NJIT, I specialized in Big Data applications and image processing, complemented by courses in Database Management Systems and Software Design. My academic pursuits included researching the "Social Spider Algorithm" for algorithm analysis and engaging in cloud computing projects using AWS technologies. Additionally, I developed AI projects focusing on Gaussian Maximum Likelihood and Multiple Object Tracking, showcasing my diverse technical skill set.

Jawaharlal Nehru Technological University Aug’ 2016 – Jun’2020

Bachelor’s in Computer Science and Engineering

Major Project:

Cyber Security Operations for Digital Transactions Using Machine Learning Algorithms, Hyderabad, January 15 – April 15, 2020

Minor Project:

E-Attendance Management System Using Java, The Institution of Electronics and Telecommunication Engineers, Hyderabad, June 13 – July 13, 2019

Links:

https://github.com/Krupa049

https://medium.com/@golisaikrupa.409

https://aiquest.blog/


