
Big Data Processing

Location:
William Penn Annex West, PA, 19107
Posted:
February 15, 2024

Snehitha Bobba

Email: ad3nmr@r.postjobfree.com

Mobile: +1-309-***-****

PROFESSIONAL SUMMARY:

5+ years of IT experience in designing, developing, and delivering software using a wide variety of technologies across all phases of the development life cycle. Expertise in Python and Big Data technologies as a developer, with proven project-based leadership, teamwork, and communication skills.

Over 4 years of strong experience with AWS, Snowflake, Spark, and Kafka.

Very strong functional programming and object-oriented concepts, with complete SDLC experience: requirements gathering, conceptual design, analysis, detailed design, development, mentoring, and system and user acceptance testing.

Hands-on development and implementation experience on a Big Data Management Platform (BMP).

Strong experience in microservice architecture and its design patterns.

Strong exposure to Snowflake and Redshift.

Strong understanding of Snowflake cost optimization.

Designed, deployed, and managed Snowflake data warehouses, ensuring optimal performance, scalability, and reliability.

Implemented ETL pipelines in Databricks, converting unstructured data into structured, analysis-ready datasets.

Created and managed Databricks notebooks in Python, Scala, and SQL for data visualization and analysis.

Scheduled and managed Databricks Jobs to automate recurring data processing operations and ensure on-time execution.

Implemented real-time data processing with Databricks Streaming, handling continuous data streams for immediate insights.

In-depth knowledge of container frameworks such as Kubernetes and Docker.

In-depth knowledge of the Spark Core and Spark SQL APIs.

Developed batch processing jobs using Spark (Java), MapReduce, and Hive.

Good knowledge and experience in Hadoop administration.

Experience in shell scripting for automation and monitoring.

Created Kafka producers and consumers to stream data in real time while ensuring fault-tolerant and efficient message delivery.

Implemented and managed a schema registry to ensure data consistency and compatibility between producers and consumers.

Developed and maintained Kafka topics and tuned partitioning strategies for effective parallel processing and data distribution.
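
A minimal sketch of the producer/consumer pattern described above, using the kafka-python client; the broker address, topic name, and payload are illustrative placeholders rather than the actual production setup.

    import json
    from kafka import KafkaProducer, KafkaConsumer

    # Producer: JSON-serialize records and require acks from all in-sync replicas
    # for fault-tolerant delivery. Broker and topic are placeholders.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        acks="all",
        retries=3,
    )
    producer.send("events", {"user_id": 42, "action": "click"})
    producer.flush()

    # Consumer: read the same topic as part of a consumer group.
    consumer = KafkaConsumer(
        "events",
        bootstrap_servers="localhost:9092",
        group_id="analytics",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
        auto_offset_reset="earliest",
    )
    for message in consumer:
        print(message.value)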

In-depth understanding of Snowflake Schema, Database, and Table structures.

Developed transformation logic through Snowpipe pipelines.

Developed and utilized Snowflake user-defined functions and stored procedures to encapsulate complex logic, enhancing code modularity and maintainability.
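
A hedged sketch of the stage/Snowpipe/UDF pattern referenced above, using the snowflake-connector-python client; the connection parameters, stage, storage integration, table, and function names are placeholders, not the actual environment.

    import snowflake.connector

    # Connection parameters are placeholders.
    conn = snowflake.connector.connect(
        account="my_account", user="etl_user", password="***",
        warehouse="ETL_WH", database="ANALYTICS", schema="PUBLIC",
    )
    cur = conn.cursor()

    # External stage pointing at the landing bucket (storage integration assumed to exist).
    cur.execute("""
        CREATE OR REPLACE STAGE orders_stage
        URL = 's3://landing-bucket/orders/'
        STORAGE_INTEGRATION = s3_int
    """)

    # Snowpipe: continuously load files that land in the stage into a raw table.
    cur.execute("""
        CREATE OR REPLACE PIPE orders_pipe AUTO_INGEST = TRUE AS
        COPY INTO ORDERS_RAW
        FROM @orders_stage
        FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
    """)

    # Small SQL UDF encapsulating a piece of transformation logic for reuse.
    cur.execute("""
        CREATE OR REPLACE FUNCTION clean_amount(raw_value STRING)
        RETURNS NUMBER(18,2)
        AS $$ TRY_TO_NUMBER(REPLACE(raw_value, ',', '')) $$
    """)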

Experience in migrating data to cloud data ecosystems.

Team player with good management, analytical, communication and interpersonal skills.

Technical professional with management skills, excellent business understanding and strong communication skills.

Developed a real-time data processing system, reducing the time to process and analyze data by 50%.

Designed and implemented a data archiving strategy that reduced storage costs by 30%.

ACADEMIC BACKGROUND

Texas A&M University, Commerce

Master’s Degree,

Information Systems

GPA: 3.6/4

Keshav Memorial Institute of Technology and Science

Bachelor's degree,

Information Technology

GPA: 3.2/4

Creative

Excellent analytical and logical skills to break problems down and resolve them decisively.

Quick Learning

Fast learner, open to new technologies.

Team Player

Good communication and interpersonal skills, with the ability to multi-task and work both independently and within a team.

Performance Tuning

Used cutting-edge tools and technology to improve system performance.

TECHNICAL SKILLS:

Big Data Ecosystems

Hadoop, MapReduce, HDFS, Hive, Spark, Kafka, Impala, Snowflake, Databricks.

Operating Systems

Windows, Linux, UNIX

Languages

Python, Java, C++, SQL

Shell scripting

UNIX Shell Script.

Frameworks

Flask, Apache Spark

Databases

MySQL, Aurora

IDEs

PyCharm, IntelliJ

PROFESSIONAL EXPERIENCE:

Target (Salt Lake City) Jan 2022 - Present

Snowflake Data Engineer

Designed and implemented ETL pipelines for ingesting and processing large volumes of data from various sources, resulting in a 25% increase in efficiency. Built and maintained data warehousing solutions using Snowflake, allowing for faster data access and improved reporting capabilities. Developed and optimized complex SQL queries and stored procedures to extract insights from large datasets.

Responsibilities:

Experience with Snowflake virtual warehouses and building Snowpipe.

Conducted performance tuning exercises, optimizing Snowflake configurations and query execution plans for faster and more efficient data retrieval.

Implemented query optimization techniques to enhance performance, taking advantage of Snowflake features such as automatic clustering.

Developed and maintained robust data models within Snowflake, optimizing for performance and efficiency in large-scale data warehousing environments.
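
An illustrative sketch of the kind of tuning statements this work involves; the table, column, and warehouse names are placeholders, and the statements would be run through a Snowflake worksheet or a Python connector cursor as in the earlier sketch.

    # Illustrative tuning statements (placeholders), not the actual production configuration.
    TUNING_SQL = [
        # Cluster a large fact table on its most common filter columns.
        "ALTER TABLE SALES_FACT CLUSTER BY (SALE_DATE, REGION)",
        # Right-size the virtual warehouse and let it auto-suspend to control cost.
        "ALTER WAREHOUSE ETL_WH SET WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 60",
        # Check clustering health to confirm the chosen key is effective.
        "SELECT SYSTEM$CLUSTERING_INFORMATION('SALES_FACT', '(SALE_DATE, REGION)')",
    ]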

Developed and optimized Spark jobs within Databricks for efficient distributed data processing.

Parameterized Databricks notebooks for reusability, allowing easy adaptation to different datasets and scenarios.

Developed new widgets within Databricks notebooks to improve user interaction and provide dynamic controls for data analysis.
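
A minimal Databricks notebook sketch of the widget-based parameterization mentioned above; the widget names, path, and date are hypothetical.

    # Databricks notebook cell: `dbutils`, `spark`, and `display` are provided by the
    # notebook runtime. Widget names, path, and date below are hypothetical.
    dbutils.widgets.text("input_path", "/mnt/raw/events/", "Input path")
    dbutils.widgets.dropdown("env", "dev", ["dev", "qa", "prod"], "Environment")
    dbutils.widgets.text("run_date", "2024-01-01", "Run date (YYYY-MM-DD)")

    input_path = dbutils.widgets.get("input_path")
    run_date = dbutils.widgets.get("run_date")

    # Read only the requested partition, so the same notebook adapts to any dataset/date.
    df = (spark.read.format("parquet")
          .load(input_path)
          .where(f"event_date = '{run_date}'"))
    display(df.limit(10))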

Utilized Databricks features for data sharing and collaboration, enabling seamless teamwork and knowledge sharing among team members.

Implemented complex SQL transformations within Databricks notebooks, optimizing queries for improved performance.

Good experience in Extracting, Loading, and Transforming (ETL) data sources for medium and large enterprise data warehousing.

Hands-on experience with clustering, cloning, data sharing, and metadata management in Snowflake.

Deep knowledge of Snowflake pricing and administration concepts.

Very good understanding of RDBMS concepts and the ability to write complex SQL/PL-SQL.

Good understanding of Snowflake caching mechanisms.

Configured and managed Apache Kafka clusters for optimal performance, scalability, and fault tolerance.

Implemented topic compaction to efficiently manage storage and retention of data in Kafka topics.
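
A sketch of creating a compacted topic as described, using the kafka-python admin client; the broker address, topic name, and sizing are placeholders.

    from kafka.admin import KafkaAdminClient, NewTopic

    # Broker address, topic name, and partition/replication counts are placeholders.
    admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
    admin.create_topics([
        NewTopic(
            name="customer-profile",
            num_partitions=12,                # parallelism for consumers
            replication_factor=3,             # fault tolerance
            topic_configs={
                "cleanup.policy": "compact",  # retain only the latest value per key
                "min.cleanable.dirty.ratio": "0.1",
            },
        )
    ])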

Created internal and external stages for loading data into Snowflake tables.

Built outlier-detection logic on top of Snowflake tables.

Built an analytical application on top of Snowflake tables with different drill-down levels.

Configured and optimized Databricks clusters to balance performance and cost, ensuring scalability based on workload demands.

Implemented Delta Lake for versioning and managing large data lakes within the Databricks environment, ensuring data integrity and reliability.
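
A minimal Delta Lake sketch of the versioning and upsert pattern mentioned above; it assumes a Spark session with Delta available (e.g. a Databricks cluster) and existing DataFrames, and the table path is a placeholder.

    # Assumes a Spark session with Delta Lake (e.g. a Databricks cluster) and existing
    # DataFrames `df` (full load) and `updates_df` (changes). Path is a placeholder.
    from delta.tables import DeltaTable

    path = "/mnt/datalake/golden/customers"

    # Each write creates a new table version, giving built-in history for audits.
    df.write.format("delta").mode("overwrite").save(path)

    # Upsert late-arriving changes while keeping the full version history.
    target = DeltaTable.forPath(spark, path)
    (target.alias("t")
     .merge(updates_df.alias("u"), "t.customer_id = u.customer_id")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())

    # Time travel: read an earlier version for audits or reprocessing.
    previous = spark.read.format("delta").option("versionAsOf", 0).load(path)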

Developed natural language processing (NLP) solutions using transformers, word embeddings, vector embeddings, and sentiment analysis for text classification, named entity recognition, and language translation.

Fine-tuned an existing model on company data for sentiment analysis of customer feedback.

Monitored fine-tuned models in production environments and iteratively fine-tuned them based on performance feedback and changing data distributions to maintain optimal performance over time.

Implemented techniques such as Word Embeddings to vectorize text data for natural language processing (NLP) tasks.

Applied tokenization to convert raw text into sequences of tokens used to train the model.
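
A small sketch of the tokenization and sentiment-classification flow described above, assuming the Hugging Face transformers library; the public model name stands in for the company's fine-tuned model, and the feedback strings are made up.

    # Assumes the Hugging Face `transformers` library (and PyTorch). The public model
    # below is a stand-in for the fine-tuned internal model; feedback strings are made up.
    from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

    model_name = "distilbert-base-uncased-finetuned-sst-2-english"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    feedback = ["The checkout flow was fast and easy.",
                "Support never replied to my ticket."]

    # Tokenization: raw text -> padded/truncated sequences of token ids.
    encoded = tokenizer(feedback, padding=True, truncation=True, return_tensors="pt")
    print(encoded["input_ids"][0][:10])

    # Classify the feedback with the sentiment model.
    classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
    print(classifier(feedback))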

Infor (Hyderabad) Jun 2018 - Feb 2021

Big Data Engineer

The goal of this project (CUSTOMER360) is to build a data lake capturing the granular data generated about customers from various sources, in order to conduct more advanced analysis of customer behavior and social trends. Access to this data via Spark also enables the client to generate reports and make business-driven decisions. Currently working on building a data mart by pulling data from the data lake into Redshift.

Responsibilities:

Built a framework in Python to load data from files and databases into the data lake using Spark. The data lake has three layers: Staging, Raw, and Golden Record.
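
A simplified sketch of such a loader; the bucket layout, JDBC URL, credentials, and table names are placeholders rather than the original framework.

    # Simplified loader sketch: file and JDBC sources -> Staging layer of the data lake.
    # Bucket layout, JDBC URL, credentials, and table names are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("datalake-loader").enableHiveSupport().getOrCreate()

    def load_file_to_staging(src_path: str, table: str) -> None:
        df = spark.read.option("header", "true").csv(src_path)
        df.write.mode("append").parquet(f"s3://datalake/staging/{table}/")

    def load_table_to_staging(jdbc_url: str, table: str) -> None:
        df = (spark.read.format("jdbc")
              .option("url", jdbc_url)
              .option("dbtable", table)
              .option("user", "etl").option("password", "***")
              .load())
        df.write.mode("append").parquet(f"s3://datalake/staging/{table}/")

    load_file_to_staging("s3://landing/customers/2021-02-01/", "customers")
    load_table_to_staging("jdbc:mysql://crm-host:3306/crm", "orders")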

Used Spark SQL with an external Hive metastore (MySQL instance) and processed Hive tables on a daily basis with incremental data, achieving batch updates.

Configured EMR clusters in a VPC with different subnets, routing tables, and ACLs for communication within the cluster, and closed all unnecessary ports.

Built a change data capture (CDC) model using Sqoop to import data from databases into the data lake.

Worked with Oozie to submit jobs to the cluster and manage job dependencies between stages.

Optimized Spark processing by analyzing executor memory and core usage in the Spark UI and YARN Resource Manager, cutting processing time in half.

Wrote Lambda functions to create transient EMR clusters for processing data, triggered by file-arrival and time-based events through AWS CloudWatch.
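
A hedged boto3 sketch of a Lambda handler that launches a transient EMR cluster when a CloudWatch event fires; the cluster sizing, release label, roles, subnet, and step script location are placeholders.

    # Lambda handler launching a transient EMR cluster on a CloudWatch event.
    # Release label, sizing, subnet, roles, and the step script path are placeholders.
    import boto3

    emr = boto3.client("emr")

    def lambda_handler(event, context):
        response = emr.run_job_flow(
            Name="transient-processing-cluster",
            ReleaseLabel="emr-5.30.0",
            Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
            Instances={
                "InstanceGroups": [
                    {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                    {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
                ],
                "KeepJobFlowAliveWhenNoSteps": False,  # terminate once all steps finish
                "Ec2SubnetId": "subnet-placeholder",
            },
            Steps=[{
                "Name": "process-incoming-files",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "s3://code-bucket/jobs/process.py"],
                },
            }],
            JobFlowRole="EMR_EC2_DefaultRole",
            ServiceRole="EMR_DefaultRole",
        )
        return response["JobFlowId"]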

Used columnar storage formats such as ORC and Parquet for efficient storage and better processing performance.

Used S3 as the storage layer for the data, so that transient clusters can be terminated after the data is processed.

Implemented s3-dist-cp to achieve greater speed when moving data between S3 and HDFS.
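
A sketch of submitting an s3-dist-cp step to a running EMR cluster via boto3; the cluster id and paths are placeholders.

    # Submit an s3-dist-cp step to copy data between S3 and HDFS in parallel.
    # Cluster id and paths are placeholders.
    import boto3

    emr = boto3.client("emr")
    emr.add_job_flow_steps(
        JobFlowId="j-PLACEHOLDER",
        Steps=[{
            "Name": "copy-s3-to-hdfs",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["s3-dist-cp",
                         "--src", "s3://datalake/raw/events/",
                         "--dest", "hdfs:///data/events/"],
            },
        }],
    )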

Extensively used AWS RDS instances to host the Hive metastore and the Hue database for user logins outside the cluster.

Implemented data-quality checks on the data and rejected records that did not satisfy the given conditions.
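
A small PySpark sketch of that kind of data-quality split; the rule, column names, and paths are hypothetical.

    # Data-quality split: records failing the rule go to a reject path for review.
    # Assumes an existing SparkSession `spark`; rule, columns, and paths are hypothetical.
    from pyspark.sql import functions as F

    df = spark.read.parquet("s3://datalake/staging/orders/")

    rule = F.col("customer_id").isNotNull() & (F.col("amount") >= 0)

    valid = df.filter(rule)
    rejected = df.filter(~rule).withColumn("reject_reason",
                                           F.lit("null id or negative amount"))

    valid.write.mode("append").parquet("s3://datalake/raw/orders/")
    rejected.write.mode("append").parquet("s3://datalake/rejects/orders/")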

Developed jobs with Spark Core, Spark SQL, Spark Streaming, and DataFrames, using transformations and actions.

Expert in Spark windowing functions for time-series data; implemented various UDFs.
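
A short sketch of a windowing function over time-series data plus a simple UDF; the events DataFrame and its columns are hypothetical.

    # Windowing over time-series data; `events` is an assumed DataFrame with
    # device_id, event_ts, and reading columns (hypothetical names).
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window
    from pyspark.sql.types import StringType

    w = Window.partitionBy("device_id").orderBy("event_ts")

    with_deltas = (events
        .withColumn("prev_reading", F.lag("reading").over(w))
        .withColumn("delta", F.col("reading") - F.col("prev_reading"))
        .withColumn("rolling_avg", F.avg("reading").over(w.rowsBetween(-6, 0))))

    # Simple UDF to bucket readings (built-ins are preferred when they exist).
    bucket = F.udf(lambda v: "high" if v is not None and v > 100 else "normal", StringType())
    with_deltas = with_deltas.withColumn("bucket", bucket(F.col("reading")))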

Designed Spark schemas and the data selection queries involved in the data ingestion process.

Extensive experience in troubleshooting and debugging Spark applications in testing and production environments.

Imported data from the Relational Databases and worked on Data Warehousing concepts such as star schema.

Used Bitbucket as the version control tool; extensive experience in branch release management and resolving Git sync and merge issues.

Good experience in writing shell scripts for job dependencies and clean-ups.

Good knowledge of debugging jobs by reviewing the corresponding logs to decide whether to rerun them or fix the error before the next run.

Involved in daily scrums.

Designed a framework to achieve inserts, updates, and deletes while writing data from Spark.
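
One way such a framework can work (a sketch under assumptions, not the original implementation): reconcile incoming change records against the existing snapshot by key and rewrite the result. The existing and changes DataFrames, the op/load_ts columns, and the output path are hypothetical.

    # Insert/update/delete reconciliation when writing from Spark. Assumes DataFrames
    # `existing` (current snapshot) and `changes` (incremental records carrying an
    # `op` column of I/U/D and a `load_ts`); names and paths are hypothetical.
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    key = "customer_id"

    # Keep only the latest change per key.
    latest = Window.partitionBy(key).orderBy(F.col("load_ts").desc())
    latest_changes = (changes
        .withColumn("rn", F.row_number().over(latest))
        .filter("rn = 1").drop("rn"))

    # Remove changed/deleted keys from the snapshot, then append inserts/updates.
    surviving = existing.join(latest_changes.select(key), on=key, how="left_anti")
    upserts = latest_changes.filter("op != 'D'").drop("op", "load_ts")

    new_snapshot = surviving.unionByName(upserts)
    new_snapshot.write.mode("overwrite").parquet("s3://datalake/golden/customers/")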

AWS Skills:

EC2, EMR, RDS, S3, VPC, Lambda, AWS Step Functions, SES, SNS, CloudWatch, AWS CLI, EBS, Redshift, Athena

Used these services across various applications.

Environment: Python, Apache Spark Core, Spark Streaming, Spark SQL, Kafka, Hive, MySQL, IntelliJ IDE, Git, Agile


