Sravya
Phone: 224-***-****
Data Engineer
Technical summary:
Overall 6 years of experience in Big Data technologies and data analysis.
4+ years of experience in developing applications that perform large-scale distributed data processing using Big Data ecosystem tools: Hadoop, Hive, Pig, Sqoop, HBase, Cassandra, Spark, Kafka, Oozie, ZooKeeper, Flume, YARN and Avro.
Excellent understanding/knowledge of Hadoop architecture and various components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node and the MapReduce programming paradigm, with good hands-on experience in PySpark and SQL queries.
Experience with AWS services including RDS, Networking, Route 53, IAM, S3, EC2, EBS and VPC, as well as administering AWS resources using the Console and CLI.
Hands-on experience on Google Cloud Platform (GCP) big data products: BigQuery, Cloud Dataproc, Google Cloud Storage and Composer (Airflow as a service).
Hands-on experience with major components in the Hadoop ecosystem including Hive, HBase, HBase & Hive integration, Sqoop and Flume, and knowledge of the MapReduce/HDFS framework.
Set up standards and processes for Hadoop based application design and implementation.
Worked on NoSQL databases including HBase, Cassandra and MongoDB.
Experience with Hortonworks and Cloudera Hadoop environments.
Experience in designing and testing highly scalable, mission-critical systems and Spark jobs in both Scala and PySpark, as well as Kafka.
Built a program with Python and Apache Beam and executed it in Cloud Dataflow to run data validation between raw source files and BigQuery tables.
Set up data in AWS using S3 buckets and configured instance backups to S3.
Good experience in analysis using Pig and Hive, and understanding of Sqoop and Puppet.
Expertise in database performance tuning and data modeling.
Experienced in providing security to Hadoop cluster with Kerberos and integration with LDAP/AD at Enterprise level.
Experience in developing pipelines in Spark using Scala and Python.
Involved in establishing best practices for Cassandra, migrating the application to the Cassandra database from the legacy platform for Choice, and upgrading to Cassandra 3.
Implemented a POC to migrate MapReduce jobs into Spark transformations using Python.
Experienced in developing MapReduce programs using Apache Hadoop for working with Big Data.
Good understanding of XML methodologies (XML, XSL, XSD) including Web Services and SOAP.
Used the Spark - Cassandra Connector to load data to and from Cassandra.
Hands-on experience in Apache Spark creating RDDs and DataFrames, applying transformations and actions, and converting RDDs to DataFrames.
Migrated various Hive UDFs and queries into Spark SQL for faster processing.
Experience in data processing such as collecting, aggregating and moving data from various sources using Apache Flume and Kafka.
Extensive knowledge of developing Spark Streaming jobs with RDDs (Resilient Distributed Datasets) using Scala, PySpark and Spark-Shell.
Installed and configured Apache Airflow for workflow management and created workflows in Python.
Experience in using the Stackdriver service and Dataproc clusters in GCP for accessing logs for debugging.
Experience in using Apache Kafka for log aggregation.
Developed a data pipeline using Kafka and Spark Streaming to store data into HDFS and performed the real-time analytics on the incoming data.
Experience in importing the real-time data to Hadoop using Kafka and implemented the Oozie job for daily imports.
Loaded data into EMR from various sources such as S3 and processed it using Hive scripts.
Explored various Spark modules and worked with DataFrames, RDDs and SparkContext.
Performed map-side joins on RDDs and imported data from different sources like HDFS/HBase into Spark RDDs.
Familiarity and experience with data warehousing and ETL tools. Good working Knowledge in OOA&OOD using UML and designing use cases.
Good understanding of Scrum methodologies, Test Driven Development and Continuous integration.
Experience in production support and application support by fixing bugs.
Used HP Quality Center for logging test cases and defects.
Major strengths include familiarity with multiple software systems, the ability to quickly learn new technologies and adapt to new environments, and being a self-motivated, focused team player and quick learner with excellent interpersonal, technical and communication skills.
Designed and implemented Hive and Pig UDFs using Java for evaluation, filtering, loading and storing of data.
Experience in GCP Dataproc, GCS, Cloud Functions, BigQuery, Azure Data Factory and Databricks.
Experience working with Java/J2EE, JDBC, ODBC, JSP, Java Eclipse, Java Beans, EJB and Servlets.
Expert in developing web page interfaces using JSP, Java Swing and HTML scripting languages.
Excellent understanding of Java Beans and the Hibernate framework to implement model logic to interact with RDBMS databases.
Experience in using IDEs like Eclipse, NetBeans and Maven.
Good knowledge of the Software Development Life Cycle (SDLC) and experience utilizing agile methodologies like Scrum.
Skill Set:
Hadoop Core Services: HDFS, MapReduce, Spark, YARN, Hive, Pig, Scala, Kafka, Flume, Tez, Impala, Oozie, Zookeeper
Hadoop Distributions: Hortonworks, Cloudera
NoSQL Databases: HBase, Cassandra, MongoDB
Cloud Computing Tools: AWS, GCP, Azure
Programming Languages: Java/J2EE, Python, SQL, Pig Latin, HiveQL, Unix Shell, PySpark Scripting
Java & J2EE Technologies: Core Java, Servlets, Hibernate, Spring, Struts, JMS, EJB
Application Servers: WebLogic, WebSphere, JBoss, Tomcat, Jetty
Databases: Oracle, MySQL, SQL
Operating Systems: Windows, Linux
Build Tools: Jenkins, Maven, ANT
Development Tools: Microsoft SQL Studio, Eclipse
Development Methodologies: Agile/Scrum, Waterfall
Education: Master's in Computer Science, USA, Dec 2022
Professional Experience:
Sr. Data Engineer
Fidelity Information Services Jan 2023 - Dec 2024
Remote (office: Cincinnati, OH)
Responsibilities:
Designed data pipelines and workflows that can efficiently move data from AWS S3 to Snowflake.
Defined data schemas and mapping between S3 objects and Snowflake tables.
Developed and maintained Apache Airflow Directed Acyclic Graphs (DAGs) to automate the data ingestion process.
Configured and scheduled tasks within Airflow to ensure timely and reliable data ingestion.
Applied necessary transformations to the data before loading it into Snowflake.
Used appropriate Snowflake commands and tools to load data from the staging area into Snowflake tables.
Monitored the execution of Airflow DAGs and tasks/jobs to ensure successful data ingestion.
Implemented error handling and retry mechanisms in Airflow workflows to manage failures.
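A minimal Airflow DAG sketch of this kind of scheduled ingestion with retries and failure alerts is shown below; the DAG id, task ids and Python callables are illustrative placeholders, not the production names.
```python
# Minimal sketch of a scheduled ingestion DAG with retries and failure alerts.
# DAG id, task ids and the Python callables are illustrative placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest_from_s3(**context):
    """Placeholder: pull new objects from the S3 landing prefix."""
    ...


def validate_load(**context):
    """Placeholder: run row-count and data-quality checks after the load."""
    ...


default_args = {
    "owner": "data-engineering",
    "retries": 3,                          # retry failed tasks automatically
    "retry_delay": timedelta(minutes=10),  # wait between retries
    "email_on_failure": True,              # alert on failures
}

with DAG(
    dag_id="s3_to_snowflake_ingestion",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 6 * * *",         # run daily at 06:00
    catchup=False,
    default_args=default_args,
) as dag:
    ingest = PythonOperator(task_id="ingest_from_s3", python_callable=ingest_from_s3)
    validate = PythonOperator(task_id="validate_load", python_callable=validate_load)

    ingest >> validate                     # validation runs only after a successful ingest
```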
Implemented data quality checks to validate data quality during and after ingestion.
Ensured that data loaded into Snowflake correctly by checking accuracy against the formats and standards expected by the client.
Documented the architecture, configurations, and workflows in confluence pages that are related to data ingestion.
Provided regular reports on the status of data ingestion processes, including any issues or anomalies, to the scrum master and the whole team daily.
Worked closely with data analysts and the business intelligence team to align data ingestion processes with organizational needs.
Provided support and troubleshooting for issues related to data ingestion and integration in team meetings.
Involved in PI planning and suggested tools and technologies that could enhance the data ingestion process.
Involved in Setting up task dependencies in Airflow to ensure that each step is executed in the correct order.
Implemented mechanisms to monitor new or updated files in S3; used AWS Lambda with S3 triggers to detect changes and initiate processing.
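A short sketch of such an S3-triggered Lambda handler follows; the downstream action that hands the file off to the ingestion pipeline is only a placeholder.
```python
# Sketch of an AWS Lambda handler fired by an S3 ObjectCreated event notification.
# The downstream hand-off to the ingestion pipeline is a placeholder.
import urllib.parse


def lambda_handler(event, context):
    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        print(f"New or updated object detected: s3://{bucket}/{key}")
        # Placeholder: mark the file as pending so the ingestion DAG picks it up
        # on its next run (e.g., write a row to a control table or send an SQS message).

    return {"status": "ok", "processed": len(records)}
```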
Wrote and managed transformation scripts that can be executed as part of the DAG workflow.
Used Snowflake's bulk loading features, such as the COPY command, for efficient data ingestion; Airflow triggers these commands via the SnowflakeOperator.
Utilized Snowflake’s staging area (internal or external stages) to initially load data before moving it into the target tables.
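A sketch of this stage-then-COPY pattern triggered from Airflow is shown below; the stage, table and file-format details are illustrative assumptions rather than the actual project objects.
```python
# Sketch of bulk loading from an external stage into a raw-layer table with COPY INTO,
# triggered from Airflow via the SnowflakeOperator (declared inside a DAG context,
# as in the ingestion DAG sketch above). Stage, table and format names are placeholders.
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

copy_orders = SnowflakeOperator(
    task_id="copy_orders_from_stage",
    snowflake_conn_id="snowflake_default",
    sql="""
        COPY INTO raw.orders
        FROM @raw.orders_s3_stage/daily/
        FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1)
        ON_ERROR = 'ABORT_STATEMENT';
    """,
)
```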
Used Snowflake SQL to query the data and to create databases, schemas, tables and views; also ensured that querying and warehouse usage were cost-efficient by selecting the appropriate warehouse.
Used Airflow's built-in monitoring and logging features to track the status of DAGs and tasks, and set up alerts for task failures or delays.
Worked closely with clients daily to understand new business requirements and communicated them to the team.
Created tables in the raw, processed and target layers in the DEV environment; also created views for all tables in the target layer for the BI team to use and generate reports for analysis.
Worked on different file formats like CSV, JSON, Parquet, Avro, etc.
Worked on multiple source systems like Oracle, Postgres, etc.
Developed Python scripts to define Airflow DAGs and tasks to configure workflows.
Used GitHub to record changes made to files in the repository; each commit has a unique ID and includes a message describing the changes.
Used feature branches to allow developers to work on different features or fixes in isolation from the main codebase; these branches can be merged later into the main branch.
Proposed changes to the codebase by raising a pull request. Other team members can review, comment on, and discuss the changes before merging them.
Used wikis to offer additional documentation and detailed project information, useful for comprehensive guides and reference material.
Environment: Hadoop (Cloudera), AWS S3, Hive, Python, Spark, Snowflake, Snowflake SQL, Apache Airflow, Oracle, Postgres, GIT, Tableau.
GCP Data Engineer
JP Morgan, India Oct 2020 to Jun 2021
Responsibilities:
Installed, configured and maintained Apache Hadoop cluster.
Used Sqoop to import data into HDFS/Hive from multiple relational databases, performed operations and exported the results back.
Involved in migrating the on-prem Hadoop system to GCP (Google Cloud Platform).
Extensively used Spark Streaming to analyze sales data in real time over regular window intervals from sources like Kafka.
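A minimal PySpark Structured Streaming sketch of this kind of windowed analysis over a Kafka topic is shown below; the broker address, topic name, record schema and window sizes are assumptions for illustration.
```python
# Sketch of a windowed Spark Structured Streaming job over a Kafka topic.
# Broker address, topic name, schema and window sizes are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("sales-stream").getOrCreate()

sale_schema = StructType([
    StructField("store_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "sales")
       .load())

# Kafka delivers bytes; parse the JSON value into typed columns.
sales = raw.select(F.from_json(F.col("value").cast("string"), sale_schema).alias("s")).select("s.*")

# Aggregate sales per store over 5-minute windows, tolerating 10 minutes of late data.
windowed = (sales
            .withWatermark("event_time", "10 minutes")
            .groupBy(F.window("event_time", "5 minutes"), "store_id")
            .agg(F.sum("amount").alias("total_amount")))

query = (windowed.writeStream
         .outputMode("update")
         .format("console")   # placeholder sink; HDFS/Hive sinks were used in practice
         .start())

query.awaitTermination()
```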
Performed Spark transformations and actions on large datasets. Implemented Spark SQL to perform complex data manipulations, and to work with large amounts of structured and semi-structured data stored in a cluster using Data Frames/Datasets.
Performed Spark join optimizations, troubleshot, monitored and wrote efficient code using Scala.
Used big data tools Spark (PySpark, Spark SQL) to conduct real-time analysis of insurance transactions.
Migrated previously written cron jobs to Airflow/Composer in GCP.
Created Hive tables based on business requirements. Wrote many Hive queries, UDFs and implemented concepts like Partitioning, Bucketing for efficient data access, Windowing operations and more.
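The sketch below illustrates the partitioning, bucketing and windowing ideas through Spark's Hive support; the database, table and column names are placeholders, not the actual business tables.
```python
# Sketch of a partitioned and bucketed Hive table plus a windowing query, run through
# Spark with Hive support enabled. Database, table and column names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-ddl").enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS sales.transactions (
        txn_id      STRING,
        customer_id STRING,
        amount      DOUBLE
    )
    PARTITIONED BY (txn_date STRING)            -- prune partitions by date
    CLUSTERED BY (customer_id) INTO 32 BUCKETS  -- bucketing for joins and sampling
    STORED AS ORC
""")

# Windowing example: rank each customer's transactions by amount within a day.
ranked = spark.sql("""
    SELECT txn_id, customer_id, amount,
           RANK() OVER (PARTITION BY customer_id, txn_date ORDER BY amount DESC) AS amt_rank
    FROM sales.transactions
    WHERE txn_date = '2021-01-15'
""")
ranked.show()
```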
Leveraged cloud and GPU computing technologies for automated machine learning and analytics pipelines, such as AWS, GCP.
Integrated Hive, Sqoop with HBase and performed transactional and analytical processing.
Configured, designed, implemented and monitored Kafka clusters and connectors. Wrote Kafka producers and consumers using Java.
Imported data from different sources like HDFS/HBase into Spark RDDs and performed computations using PySpark to generate the output response.
Involved in setting up the Apache Airflow service in GCP.
Implemented proof of concept (POC) for processing stream data using Kafka -> Spark -> HDFS.
Developed a data pipeline using Kafka, Spark, and Hive/ HDFS to ingest, transform and analyze data. Automated jobs using Oozie.
Generated Tableau dashboards and worksheets for large datasets.
Implemented custom interceptors for Flume to filter data, and defined channel selectors to multiplex the data into different sinks.
Implemented many Spark jobs and wrote Function definitions, Case and Object classes using Scala.
Used Spark SQL for Scala & Python interface that automatically converts RDD case classes to schema RDD.
Utilized Spark, Scala, Python for querying, preparing from big data sources.
Wrote pre-processing queries in python for internal spark jobs.
Involved in the process of Cassandra data modeling, performing data operations using CQL and Java.
Maintained and worked with the data pipeline that transfers and processes several terabytes of data using Spark, Scala, Python, Apache Kafka, Pig/Hive and Impala.
Working experience in Apache Hadoop and Spark frameworks including the Hadoop Distributed File System, MapReduce, PySpark and Spark SQL.
Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.
Performed data integration with a goal of moving more data effectively, efficiently and with high performance to assist in business-critical projects using Talend Data Integration.
Used SQL queries and other data analysis methods to assess the quality of the data.
Exported the aggregated data onto Oracle using Sqoop for reporting on the Tableau dashboard.
Involved in QA, test data creation, and unit testing activities.
Implemented security on Hadoop cluster using Kerberos.
Involved in design, development and testing phases of Software Development Life Cycle.
Followed Agile Scrum methodology to help manage and organize the team, with regular code review sessions.
Attended weekly meetings with technical collaborators and actively participated in code review sessions with senior and junior developers.
Environment: Hadoop (Cloudera), GCP, Spark, Hive, Python, PySpark, Kafka, Sqoop, Oozie, Java 8, Cassandra, Oracle 12c/11g, Impala, Scala, Talend Studio, Tableau.
Data Engineer
Wells Fargo, India Nov 2019 to Oct 2020
Responsibilities:
Involved in Requirement gathering phase to gather the requirements from the business users to continuously accommodate changing user requirements.
Responsible for managing data from multiple sources.
Designed and developed a security framework to provide fine-grained access to objects in AWS S3 using AWS Lambda and DynamoDB.
Designed Data Quality Framework to perform schema validation and data profiling on Spark (PySpark).
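A minimal PySpark sketch of the schema-validation and profiling idea follows; the expected schema, column names and the S3 path are hypothetical.
```python
# Sketch of schema validation and basic profiling on PySpark: compare an incoming
# DataFrame against an expected schema and count nulls per column. Names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("dq-checks").getOrCreate()

expected = StructType([
    StructField("account_id", StringType()),
    StructField("balance", DoubleType()),
])

df = spark.read.parquet("s3a://example-bucket/incoming/accounts/")  # placeholder path

# 1. Schema check: every expected column must be present with the expected type.
actual = {f.name: f.dataType for f in df.schema.fields}
bad_columns = [f.name for f in expected.fields
               if f.name not in actual or actual[f.name] != f.dataType]
if bad_columns:
    raise ValueError(f"Schema validation failed for columns: {bad_columns}")

# 2. Profiling check: null counts per expected column.
null_counts = df.select([
    F.sum(F.col(f.name).isNull().cast("int")).alias(f.name) for f in expected.fields
]).collect()[0].asDict()
print("Null counts:", null_counts)
```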
Developed Spark code in Python and SparkSQL environment for faster testing and processing of data and loading the data into Spark RDD and doing In-memory computation to generate the output response with less memory usage.
Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
Performed transformations using Python and Scala to analyze and gather the data in required format.
Performed end-to-end architecture and implementation assessment of various AWS services like Amazon EMR, Redshift and S3.
Created on-demand tables on S3 files using Lambda functions and AWS Glue with Python and Spark.
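A sketch of registering such an on-demand external table over an S3 prefix from a Lambda function via the Glue API is shown below; the database, table, columns and S3 location are illustrative assumptions.
```python
# Sketch of registering an external table over an S3 prefix with AWS Glue from Lambda.
# Database, table, columns and S3 location are illustrative placeholders.
import boto3

glue = boto3.client("glue")


def lambda_handler(event, context):
    glue.create_table(
        DatabaseName="analytics_db",
        TableInput={
            "Name": "clickstream_raw",
            "TableType": "EXTERNAL_TABLE",
            "StorageDescriptor": {
                "Columns": [
                    {"Name": "user_id", "Type": "string"},
                    {"Name": "url", "Type": "string"},
                    {"Name": "event_ts", "Type": "timestamp"},
                ],
                "Location": "s3://example-bucket/clickstream/raw/",
                # CSV-on-S3 layout read through the Hive text input format.
                "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
                "SerdeInfo": {
                    "SerializationLibrary": "org.apache.hadoop.hive.serde2.OpenCSVSerde",
                },
            },
        },
    )
    return {"status": "table registered"}
```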
Worked on PySpark APIs for data transformations.
Designed data and ETL pipeline using Python and Scala with Spark.
Developed highly complex Python and Scala code, which is maintainable, easy to use, and satisfies application requirements, data processing and analytics using inbuilt libraries.
Involved in designing and optimizing Spark SQL queries and DataFrames: importing data from data sources, performing transformations and read/write operations, and saving the results to output directories in HDFS/AWS S3.
Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
Developed Pig Latin scripts to extract the data from the web server output files to load into HDFS.
Built dashboards using Tableau to allow internal and external teams to visualize and extract insights from big data platforms.
Responsible for analyzing and cleansing raw data by performing Hive queries and running Pig scripts on data.
Created Hive tables, loaded data and wrote Hive queries that run internally as MapReduce jobs.
Worked on ETL Processing which consists of data transformation, data sourcing and mapping, Conversion, and loading.
Optimized MapReduce Jobs to use HDFS efficiently by using various compression mechanisms.
Used Amazon Elastic Cloud Compute (EC2) infrastructure for computational tasks and Simple Storage Service (S3) as storage mechanism.
Worked with building data warehouse structures, and creating facts, dimensions, aggregate tables, by dimensional modeling, Star and Snowflake schemas.
Worked on Docker containers, combining them with the workflow to keep it lightweight.
Written multiple MapReduce programs for data extraction, transformation and aggregation from multiple file formats including XML, JSON, CSV and other compressed file formats.
Worked on Kafka REST API to collect and load the data on Hadoop file system and used Sqoop to load the data from relational databases.
Worked on MongoDB using CRUD (Create, Read, Update and Delete), indexing, replication and sharding features.
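A short sketch of those CRUD and indexing operations with pymongo follows; the connection string, database and collection names are placeholders.
```python
# Sketch of MongoDB CRUD and indexing operations with pymongo.
# Connection string, database and collection names are placeholders.
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017/")
customers = client["retail_db"]["customers"]

# Create: insert a document.
customers.insert_one({"customer_id": "C1001", "name": "Alex", "tier": "gold"})

# Read: find a document by a field.
doc = customers.find_one({"customer_id": "C1001"})

# Update: modify a field on the matched document.
customers.update_one({"customer_id": "C1001"}, {"$set": {"tier": "platinum"}})

# Delete: remove the document.
customers.delete_one({"customer_id": "C1001"})

# Indexing: secondary unique index to speed up lookups by customer_id.
customers.create_index([("customer_id", ASCENDING)], unique=True)
```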
Worked on SQL queries in dimensional data warehouses and relational data warehouses. Performed Data Analysis and Data Profiling using Complex SQL queries on various systems.
Followed agile methodology for the entire project.
Environment: Spark, Scala, AWS, ETL, Kafka, Tableau, Hadoop, Python, Snowflake, HDFS, Hive, MapReduce, PySpark, Pig, Docker, Sqoop, Teradata, JSON, MongoDB, SQL, Agile and Windows
Hadoop Developer
Tata Consultancy Services, India Mar 2018 - Oct 2019
Responsibilities:
Created end-to-end Spark applications using Scala to perform various data cleansing, validation, transformation and summarization activities on user behavioral data.
Developed custom Input Adaptor utilizing the HDFS File system API to ingest click stream log files from FTP server to HDFS.
Developed end-to-end data pipeline using FTP Adaptor, Spark, Hive and Impala.
Implemented Spark and utilized SparkSQL heavily for faster development, and processing of data.
Explored Spark for improving the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames and pair RDDs.
Involved in converting Hive/SQL queries into Spark transformations using Spark with Scala.
Used Scala collection framework to store and process the complex consumer information.
Implemented a prototype to perform real-time streaming of data using Spark Streaming with Kafka.
Handled importing other enterprise data from different data sources into HDFS using Sqoop and performing transformations using Hive, Map Reduce and then loading data into HBase tables.
Exported the analyzed data to the relational databases using Sqoop, to further visualize and generate reports for the BI team.
Collected and aggregated large amounts of log data using Flume and staged the data in HDFS for further analysis.
Analyzed the data by performing Hive queries (Hive QL) and running Pig scripts (Pig Latin) to study customer behavior.
Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
Created components like Hive UDFs for missing functionality in HIVE for analytics.
Worked on various performance optimizations like using distributed cache for small datasets, Partition, Bucketing in Hive and Map Side joins.
Created, validated, and maintained scripts to load data using Sqoop manually.
Created Oozie workflows and coordinators to automate Sqoop jobs weekly and monthly.
Uploaded and processed more than 30 terabytes of data from various structured and unstructured sources into HDFS (AWS cloud) using Sqoop and Flume.
Used Oozie and Oozie coordinators to deploy end-to-end data processing pipelines and scheduling the workflows.
Continuously monitored and managed the Hadoop cluster.
Developed interactive shell scripts for scheduling various data cleansing and data loading processes.
Experience with data wrangling and creating workable datasets.
Environment: HDFS, Pig, Hive, Sqoop, Flume, Spark, Scala, MapReduce, Oozie, Oracle 11g, YARN, UNIX Shell Scripting, Agile Methodology.
Data Analyst
Conquerors Software Technologies Pvt Limited, Hyderabad April 2015 to June 2017
Responsibilities:
Identified business, functional and technical requirements through meetings, interviews and AD sessions.
Defined the ETL mapping specifications and designed the ETL process to source data from source systems and load it into DWH tables.
Designed the logical and physical schema for data marts, integrated legacy system data into the data marts, integrated DataStage metadata with Informatica metadata, and created ETL mappings and workflows.
Designed mappings and identified and resolved performance bottlenecks in source-to-target mappings.
Developed mappings using Source Qualifier, Expression, Filter, Lookup, Update Strategy, Sorter, Joiner, Normalizer and Router transformations.
Involved in writing, testing and implementing triggers, stored procedures and functions at the database level using PL/SQL.
Developed stored procedures to test ETL load per batch and provided a performance-optimized solution to eliminate duplicate records.
Provided the team with technical leadership on ETL design and development best practices.
Environment: Informatica PowerCenter v8.6.1, PowerExchange, IBM Rational Data Architect, MS SQL Server, Teradata, PL/SQL, IBM Control Center, TOAD, Microsoft.