
Data Engineer Big

Location:
Herndon, VA
Posted:
July 06, 2023


Name: Sree Kakani

Role: Data Engineer

Email: adx4vi@r.postjobfree.com

Phone: 202-***-****

Professional Overview:

8+ years of software development experience, including 5+ years of extensive experience in Data Engineering using Big Data/Spark technologies.

Good experience with programming languages Scala, Java, and Python.

Strong experience building data pipelines and deploying, monitoring, and maintaining them in production.

Strong experience working with large datasets and designing highly scalable and optimized data modelling and data integration pipelines.

Good understanding of distributed systems architecture and parallel computing paradigms.

Strong experience working with the Spark processing framework to perform large-scale data transformations, cleansing, and aggregations.

Good experience working with Spark Core, the Spark DataFrame API, Spark SQL, and the Spark Streaming API.

Strong experience fine-tuning long-running Spark applications and troubleshooting common failures.

Utilized various Spark features such as broadcast variables, accumulators, caching/persistence, and dynamic allocation.
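
A minimal, hypothetical PySpark sketch of this kind of broadcast-variable, accumulator, and caching usage (not taken from any project described here; names and data are illustrative):

# Sketch: broadcast a small lookup table, count misses with an accumulator,
# and cache the derived RDD for reuse.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-accumulator-demo").getOrCreate()
sc = spark.sparkContext

# Small lookup table shipped once to every executor via a broadcast variable
country_codes = sc.broadcast({"US": "United States", "IN": "India"})

# Accumulator to count records with unknown codes
unknown = sc.accumulator(0)

def resolve(code):
    name = country_codes.value.get(code)
    if name is None:
        unknown.add(1)
    return name or "UNKNOWN"

rdd = sc.parallelize(["US", "IN", "XX"]).map(resolve).cache()  # cache reused results
print(rdd.collect(), "unknown codes:", unknown.value)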

Worked on real time data integration using Kafka and Spark streaming.

Experience developing Kafka producers and consumers for streaming millions of events.

Strong experience working with various Hadoop ecosystem components such as HDFS, Hive, HBase, Sqoop, Oozie, Impala, YARN, and Hue.

Strong experience using Hive for creating centralized data warehouses and data modelling for efficient data access.

Strong experience creating partitioned and bucketed tables in Hive to improve the performance of large joins.
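
An illustrative HiveQL-style DDL for a partitioned, bucketed table, issued through spark.sql (assumes Hive support is enabled; the table and column names are hypothetical):

# Sketch: create a partitioned, bucketed Hive table from Spark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-ddl-demo").enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_partitioned (
        order_id BIGINT,
        customer_id BIGINT,
        amount DOUBLE
    )
    PARTITIONED BY (order_date STRING)          -- prune partitions on date filters
    CLUSTERED BY (customer_id) INTO 32 BUCKETS  -- co-locate join keys
    STORED AS ORC
""")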

Extensive experience utilizing AWS cloud services such as S3, EMR, Redshift, Athena, and the Glue metastore for building and managing data lakes natively on the cloud.

Hands on experience in importing and exporting data into HDFS and Hive using Sqoop.

Exposure to the NoSQL databases HBase and Cassandra.

Extensive experience in working with structured, semi-structured, and unstructured data by implementing complex MapReduce programs.

Experience with the design, development, and maintenance of ongoing metrics, reports, analyses, and dashboards using Tableau to drive key business decisions and communicate key concepts to stakeholders.

Experience using various Hadoop Distributions (Cloudera, Hortonworks, etc.) to fully implement and leverage new Hadoop features.

Good exposure to other cloud providers AWS and Azure and utilized Azure Databricks for learning and experimentation.

Strong experience as a Core Java developer building REST APIs and other integration applications.

Developed and executed change management strategies to drive user adoption of the MDM platform, conducting training sessions, providing ongoing support, and facilitating the transition to new data management practices.

Strong experience working with databases such as Oracle, DB2, Teradata, and MySQL, and proficiency in writing complex SQL queries.

Great team player and quick learner with effective communication, motivation, and organizational skills combined with attention to detail and business improvements.

Experienced in the complete SDLC, including requirements gathering, design, development, testing, and production support.

Managed end-to-end MDM projects, utilizing project management methodologies to ensure timely delivery, efficient resource allocation, and adherence to budget and scope.

TECHNICAL SKILLS:

Operating Systems: Windows Server 2003/2008 R2/2012 R2/2016, Linux

Programming: Python, Scala, Java, C++; scripting with Bash

Database Technologies: Amazon Redshift, Amazon RDS

Data Modeling: Toad, Podium, Talend, Informatica

Administration: Ambari, Cloudera Manager, Nagios

Analytics & Visualization: Tableau, Power BI, Kibana, QlikView, Pentaho

Data Warehouse: Teradata, Hive, Amazon Redshift, BigQuery, Azure Data Warehouse

ETL/Data Pipelines: DataStage, Apache Sqoop, Apache Camel, Flume, Apache Kafka, Apatar, Atom, Talend, Pentaho

Project: Agile processes, problem-solving skills, mentoring, requirements gathering

Communication: Very strong technical written and oral communication skills

Compute Engines: Apache Spark, Spark Streaming, Flink

File Systems/Formats: CSV, Parquet, Avro, ORC, JSON

Data Visualization: Pentaho, QlikView, Tableau, Informatica, Power BI

Data Query Engines: Impala, Tez, Spark SQL

Search Tools: Apache Lucene, Elasticsearch, Kibana, Apache Solr, Cloudera Search

Cluster: YARN, Puppet

Frameworks: Hive, Pig, Spark, Spark Streaming, Storm

Work Experience:

Client: Cigna, Hartford, CT Mar 2021 - Present

Role: Data Engineer

Responsibilities:

Developed Spark Applications to implement various data cleansing/validation and processing activity of large-scale datasets ingested from traditional data warehouse systems.

Migrated the existing on-premises applications and scripts from Java code to a cloud-based platform (Azure cloud storage).

Using PySpark to process and analyze large datasets in a distributed manner.

Writing PySpark scripts to perform complex data transformations and aggregations for improving performance.

Optimizing PySpark applications using techniques such as caching, broadcast variables, and partitioning.
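
A hypothetical tuning sketch of the techniques mentioned above: repartitioning on the join key, broadcasting the small side, and persisting a reused intermediate result (paths and column names are placeholders):

# Sketch: common PySpark optimizations for a large join reused downstream.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast
from pyspark import StorageLevel

spark = SparkSession.builder.appName("pyspark-tuning-demo").getOrCreate()

events = spark.read.parquet("s3://example-bucket/events/")        # placeholder path
dim    = spark.read.parquet("s3://example-bucket/dim_customer/")  # small dimension

enriched = (
    events.repartition(200, "customer_id")        # control shuffle partitioning
          .join(broadcast(dim), "customer_id")    # avoid shuffling the small side
          .persist(StorageLevel.MEMORY_AND_DISK)  # reuse across multiple actions
)

enriched.count()
enriched.write.mode("overwrite").parquet("s3://example-bucket/enriched/")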

Managing and monitoring Azure resources such as virtual machines, storage accounts, and network resources to ensure high availability and performance.

Integrating Azure services such as Azure Stream Analytics and Azure Machine Learning to build real-time data processing pipelines.

Designed and developed scalable big data solutions using Hadoop and Azure Cloud technologies.

Implemented data processing pipelines using Apache Spark.

Designed and developed data pipelines using Hive, Pig, and Spark to transform and analyze large datasets for real-time data processing.

Maintained and optimized Hadoop clusters for high availability and performance.

Worked with Databricks and Oozie jobs to automate and schedule Hadoop workflows, resulting in a 25% increase in operational efficiency.

Conducted performance tuning and capacity planning to optimize Hadoop clusters for various workloads.

Defining workflows using Oozie to automate Hadoop jobs and other big data applications.

Creating Oozie coordinators to schedule and manage multiple workflows.

Monitoring and troubleshooting Oozie workflows to ensure successful completion of jobs.

Collaborating with data scientists and analysts to understand their data needs and developing solutions to meet those needs using Hadoop and related technologies.

Writing complex Hive queries to extract data from large datasets for analysis and reporting.

Performed SQL Joins among Hive tables to get input for Spark batch process.
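
An illustrative sketch of joining Hive tables through Spark SQL to feed a batch process (database, table, and column names are hypothetical):

# Sketch: join Hive tables with Spark SQL and aggregate the result.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-join-demo").enableHiveSupport().getOrCreate()

claims = spark.sql("""
    SELECT c.claim_id, c.member_id, m.plan_code, c.amount
    FROM warehouse.claims c
    JOIN warehouse.members m
      ON c.member_id = m.member_id
    WHERE c.claim_date >= '2023-01-01'
""")

claims.groupBy("plan_code").sum("amount").show()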

Extended Hive core functionality by writing custom UDFs using Java.

Involved in migrating MapReduce jobs into Spark jobs and used Spark and the DataFrame API to load structured data into Spark clusters.

Staying up to date with the latest trends and advancements in Hadoop Ecosystem technologies and Cloud platforms, ensuring that the organization's data processing capabilities are modern and efficient.

Using GitHub for code management and version control, ensuring that code changes are tracked and documented.

Creating and managing Databricks workspaces to run Spark-based applications in a collaborative environment, and using Databricks notebooks to write and execute Spark code.

Building PySpark machine learning models using MLlib to classify data, make predictions, and recommend items.
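
A minimal, hypothetical classification sketch with Spark MLlib (pyspark.ml): feature assembly plus logistic regression in a Pipeline, on toy in-memory data:

# Sketch: assemble features and fit a logistic regression classifier.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

df = spark.createDataFrame(
    [(1.0, 2.0, 0.0), (2.0, 1.0, 1.0), (3.0, 4.0, 1.0)],
    ["f1", "f2", "label"],
)

assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

model = Pipeline(stages=[assembler, lr]).fit(df)
model.transform(df).select("label", "prediction").show()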

Tools Used: PL/SQL, Python, Azure Data Factory, Azure Blob storage, Azure Table storage, Azure SQL Server, Apache Hive, Apache Spark, MDM, Netezza, Teradata, Oracle 12c, SQL Server, Teradata SQL Assistant, Microsoft Word/Excel, Flask, AWS S3, AWS Redshift, Snowflake, AWS RDS, DynamoDB, Athena, Lambda, MongoDB, Pig, Sqoop, Tableau, Power BI, UNIX, Docker, and Kubernetes.

Client: Citibank, Tampa, FL Oct 2020 - March 2021

Role: Data Engineer

Responsibilities:

Worked with Spark to create structured data from a pool of unstructured data received.

Implemented advanced procedures such as text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Scala.

Implemented Spark using Scala and Spark SQL for faster testing and processing of data.

Converted Hive/SQL queries into Spark transformations using Spark RDDs and Scala.

Documented requirements, including the available code which should be implemented using Spark, Hive, HDFS and Elastic Search.

Maintained the ELK stack (Elasticsearch, Logstash, Kibana) and wrote Spark scripts using the Scala shell.

Implemented Spark using Scala and utilized DataFrames and Spark SQL API for faster processing of data.

Developed Spark Streaming applications to consume data from Kafka topics and insert the processed streams to HBase.
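
An illustrative Structured Streaming sketch of this pattern: read a Kafka topic and hand each micro-batch to a sink function (assumes the Spark Kafka source package is on the classpath; the HBase write itself would go through a connector and is only stubbed here; broker and topic names are hypothetical):

# Sketch: consume a Kafka topic and process each micro-batch.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-streaming-demo").getOrCreate()

stream = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker1:9092")
         .option("subscribe", "events-topic")
         .load()
         .select(col("key").cast("string"), col("value").cast("string"))
)

def write_batch(batch_df, batch_id):
    # Placeholder sink: in the real pipeline this batch would be written to HBase.
    batch_df.write.mode("append").parquet("/tmp/events_batches")

query = stream.writeStream.foreachBatch(write_batch).start()
query.awaitTermination()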

Provided a continuous discretized DStream of data with a high level of abstraction with Spark Structured Streaming.

Moved transformed data to Spark cluster where the data is set to go live on the application using Kafka.

Created a Kafka producer to connect to different external sources and bring the data to a Kafka broker.
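
A minimal producer sketch using the kafka-python client (the broker address, topic name, and payload are placeholders, not the actual integration):

# Sketch: publish a JSON-serialized event to a Kafka topic.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["broker1:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Send one event pulled from an external source (illustrative payload).
producer.send("external-events", {"source": "vendor-api", "status": "ok"})
producer.flush()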

Handled schema changes in data stream using Kafka.

Developed new Flume agents to extract data from Kafka.

Created a Kafka broker in structured streaming to get structured data by schema.

Analyzed and tuned Cassandra data model for multiple internal projects and worked with analysts to model Cassandra tables from business rules and enhance/optimize existing tables.

Designed and deployed new ELK clusters.

Created log monitors and generated visual representations of logs using ELK stack.

Implemented upgrade, backup, and restore procedures for CI/CD tools.

Played a key role in installing and configuring various Big Data ecosystem tools such as Elasticsearch, Logstash, Kibana, Kafka, and Cassandra.

Reviewed functional and non-functional requirements on the Hortonworks Hadoop project and collaborated with stakeholders and various cross-functional teams.

Customized Kibana for dashboards and reporting to provide visualization of log data and streaming data.

Developed Spark applications for the entire batch processing by using Scala.

Developed Spark scripts by using Scala shell commands as per the requirement.

Defined the Spark/Python (PySpark) ETL framework and best practices for development.

Installed and configured Tableau Desktop to connect to the Hortonworks Hive framework (database), which contained the bandwidth data from the locomotive, through the Hortonworks JDBC connector for further analytics of the data.

Used Git for versioning and set up Jenkins CI to manage CI/CD practices.

Built Jenkins jobs for CI/CD infrastructure from GitHub repository.

Developed Spark programs using Python to run in the EMR clusters.

Created User Defined Functions (UDFs) using Scala to automate some of the business logic in the applications.

Used Python for developing Lambda functions in AWS.
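
A hypothetical AWS Lambda handler in Python: triggered by an S3 event, it reads the object key and inspects the object before any downstream processing (the bucket and key come from the event payload; boto3 is available in the Lambda runtime):

# Sketch: S3-triggered Lambda that inspects the uploaded object.
import json
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    head = s3.head_object(Bucket=bucket, Key=key)
    size = head["ContentLength"]

    return {"statusCode": 200, "body": json.dumps({"object": key, "bytes": size})}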

Designed AWS Glue pipelines to ingest, process, and store data interacting with different services in AWS.
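
A sketch of a Glue PySpark job under the same idea: read a catalog table, drop rows with null ids, and write Parquet back to S3 (the database, table, column, and path names are assumptions; the awsglue modules are only available inside the Glue runtime):

# Sketch: minimal Glue ETL job reading from the Data Catalog and writing to S3.
import sys
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])  # standard Glue job argument
glue_context = GlueContext(SparkContext.getOrCreate())

dyf = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="orders"   # hypothetical catalog entries
)
df = dyf.toDF().dropna(subset=["order_id"])  # hypothetical column

df.write.mode("overwrite").parquet("s3://example-bucket/curated/orders/")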

Tools Used: Python, Teradata, Netezza, Oracle 12c, PySpark, SQL Server, UML, MS Visio, Oracle Designer, SQL Server 2012, Cassandra, Azure, Oracle SQL, Athena, SSRS, SSIS, AWS S3, AWS Redshift, AWS EMR, AWS RDS, DynamoDB, Lambda, Hive, HDFS, Sqoop, Scala, NoSQL (Cassandra), and Tableau.

Client: Valence Health Inc. - Dublin, OH Oct 2018 - Sep 2020

Role: Big Data Engineer

Responsibilities:

Experience in data ingestion from sources such as MySQL, Oracle, and CSV files.

Experience using Apache Sqoop to import and export data between HDFS and external RDBMS databases such as MySQL, as well as CSV files.

Implemented Spark using Scala and Spark SQL for faster testing and processing of data.

Worked with Spark and Scala mainly in Claims Invoice Ingestion Framework exploration for transition from Hadoop/MapReduce to Spark.

Wrote Spark jobs to transform the data, calculating and grouping the vendor payment status on HDFS and storing it in Hive tables/Kafka topics.
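
An illustrative batch transform in the spirit of that job: aggregate a vendor payment-status feed and persist the result as a Hive table (the paths, columns, and table names are hypothetical):

# Sketch: aggregate invoices by vendor and payment status, write to Hive.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("vendor-payments-demo").enableHiveSupport().getOrCreate()

invoices = spark.read.parquet("/data/claims/invoices/")  # placeholder HDFS path

summary = (
    invoices.groupBy("vendor_id", "payment_status")
            .agg(F.count("*").alias("invoice_count"),
                 F.sum("amount").alias("total_amount"))
)

summary.write.mode("overwrite").saveAsTable("analytics.vendor_payment_summary")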

Performed Spark transformations using DataFrames and Spark SQL.

Generated business reports from the data stored in Hive tables for display in dashboards.

Experienced with Spark for improving the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.

Implemented a Spark Streaming framework that processes data from Kafka and performs analytics on top of it.

Migrated the required data from Oracle and MySQL into HDFS using Sqoop and imported flat files of various formats into HDFS.

Worked in an Agile development approach.

Developed a strategy for full and incremental loads using Sqoop.

Implemented a POC to migrate MapReduce jobs into Spark RDD transformations.

Tools Used: Python, Teradata, Netezza, Oracle 12c, PySpark, SQL Server, UML, MS Visio, Oracle Designer, SQL Server 2012, Cassandra, Azure

Client: Health Cure, India Aug 2017 – Sep 2018

Role: Hadoop and Spark Developer

Responsibilities:

Consumed REST APIs and wrote source code to feed the Kafka program. Worked on various real-time and batch processing applications using Spark/Scala, Kafka, and Cassandra.

Built Spark applications to perform data enrichments and transformations using Spark Data frames with Cassandra lookups.

Used the DataStax Spark Cassandra Connector to extract and load data to/from Cassandra. Worked in a team to develop an ETL pipeline that extracted Parquet-serialized files from S3 and persisted them in HDFS.
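
A sketch of reading from and writing to Cassandra through the Spark Cassandra Connector's DataFrame API (assumes the connector package is on the classpath; keyspace, table, and host names are hypothetical):

# Sketch: load a Cassandra table, filter it, and write the result back.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("cassandra-demo")
    .config("spark.cassandra.connection.host", "cassandra-host")
    .getOrCreate()
)

lookup = (
    spark.read.format("org.apache.spark.sql.cassandra")
         .options(keyspace="claims_ks", table="members")
         .load()
)

(lookup.filter("active = true")
       .write.format("org.apache.spark.sql.cassandra")
       .options(keyspace="claims_ks", table="active_members")
       .mode("append")
       .save())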

Developed Spark application that uses Kafka Consumer and Broker libraries to connect to Apache Kafka and consume data from the topics and ingest them into Cassandra.

Developed applications involving Big Data technologies such as Hadoop, Spark, Map Reduce, Yarn, Hive, Pig, Kafka, Oozie, Sqoop, and Hue.

Worked on Apache Airflow, Apache Oozie, and Azkaban. Designed and implemented a data ingestion framework to load data into the data lake for analytical purposes.

Developed data pipelines using Hive, Pig, and MapReduce.

Wrote MapReduce jobs.

Administered clusters in the Hadoop ecosystem.

Installed and configured Hadoop clusters from major Hadoop distributions.

Designed the reporting application that uses Spark SQL to fetch data and generate reports on Hive.

Analyzed data using Hive and wrote User Defined Functions (UDFs).

Used AWS services like EC2 and S3 for small data sets processing and storage.

Executed Hadoop/Spark jobs on AWS EMR against data stored in S3 buckets.

Tools Used: Spark, Scala, AWS, DBeaver, Zeppelin, SSIS, Cassandra, Workspace, and C# scripting.

Client: Cube Solution Pvt. Ltd. - Bangalore, India Jan 2015 – July 2017

Role: Java Developer

Responsibilities:

Implemented a multi-threaded environment and used most of the interfaces under the Collections framework using Core Java concepts.

Involved in developing code using major Spring Framework concepts: Dependency Injection (DI) and Inversion of Control (IoC).

Used the Spring MVC framework to implement RESTful web services, reducing integration complexity and easing maintenance.

Used Bootstrap to create responsive web pages which can be displayed properly in different screen sizes.

Used GIT as version control tool to update work progress and attended daily Scrum sessions.

Developed Interactive web pages using Angular, HTML5, CSS and JavaScript.

Built REST web services with a backend server to handle requests sent from the front end.

Involved in stored procedures, user-defined functions, and views; implemented error handling in the stored procedures and SQL objects and modified existing stored procedures.

Functionality included writing code in HTML, CSS, JavaScript, JSON, and Bootstrap, with a MySQL database as the backend.

Extensive knowledge of microservices architecture and involved in developing backend APIs.

Involved in design and development of a user-friendly enterprise application using Java, Spring, Hibernate, Web services, Eclipse.

Developed and enhanced the application using Java and J2EE (Servlets, JSP, JDBC, JNDI, EJB), Web Services (RESTful Web Services), HTML, JSON, XML, Maven and MySQL DB.

Used Git as source control management, giving a huge speed advantage over centralized systems that must communicate with a server.

Tools Used: Java/J2EE, Core Java, Spring, Hibernate, Git, MySQL database, Maven, RESTful Web Services, HTML, HTML5, CSS, JavaScript, Bootstrap, JSON, XML, Microservices.
