
Data Engineer Big

Location: Lenexa, KS

Posted: April 22, 2024


PRANAVI ANJANA KALLEPALLY

SENIOR DATA ENGINEER

EMAIL: ad460c@r.postjobfree.com PH NO: 443-***-****

LINKEDIN: www.linkedin.com/in/pranavi-anjana-kallepally

PROFESSIONAL SUMMARY:

Over 9 years of overall IT experience as a Data Engineer across various industries, with the majority of that experience in Big Data analytics and development.

Experienced in designing and developing applications using Big Data technologies such as HDFS, MapReduce, Sqoop, Hive, PySpark and Spark SQL, HBase, Python, Databricks, Snowflake, S3 storage, and Airflow.

Implemented AWS solutions using AWS EC2, S3, RDS, EBS, Elastic Load Balancer, Auto Scaling groups, the AWS CLI, and CI/CD pipelines.

Good experience with AWS Elastic Block Store (EBS) and its different volume types, and with selecting the appropriate EBS volume type based on requirements.

Good experience working with Azure services such as Azure Data Factory (ADF), Azure Data Lake, Azure Blob Storage, Azure SQL Analytics, and HDInsight/Databricks.

Skilled in orchestrating complex workflows using GCP workflow automation tools such as Cloud Composer and Cloud Functions.

Experienced in developing and maintaining data quality processes using Informatica Data Quality for accurate and reliable data analysis.

Skilled in building scalable data processing systems in Python using frameworks such as Apache Spark (PySpark) and Dask, ensuring efficient handling of large datasets.

Expertise with Python, Scala, and Java in the design, development, administration, and support of large-scale distributed systems.

Experienced in transforming Hive/SQL queries into Spark transformations using Spark DataFrames and Python.

Excellent experience with Scala, Apache Spark, Spark Streaming, pattern matching, and MapReduce.

Experience with streaming frameworks such as Apache Storm for loading data from distributed messaging systems like Apache Kafka into HDFS.

Experienced in building high-performance and scalable solutions using Hadoop ecosystem tools such as Pig, Hive, Sqoop, Spark, and Kafka.

Good exposure to Apache Hadoop MapReduce programming, Pig scripting, distributed applications, and HDFS.

Proficient in big data ingestion and streaming tools like Flume, Sqoop, Kafka and Storm.

Developed sales reports in Power BI using on-premises data. Experienced in optimizing Power BI reports and dashboards for improved performance, ensuring efficient data processing and a smooth user experience.

Utilized Boomi's built-in connectors and APIs to integrate with a variety of cloud-based and on-premises applications and databases.

Extensive experience working with Oracle, DB2, SQL Server, and MySQL databases.

Proficient in T-SQL for managing and querying relational databases within the Microsoft SQL Server ecosystem.

Experienced in utilizing NoSQL technologies for real-time analytics, distributed computing, and handling massive datasets in Big Data environments.

Hands on experience in designing, implementing, and optimizing PostgreSQL databases for relational data modeling and management.

Hands-on experience in managing Snowflake accounts, users, roles, and security configurations to ensure data privacy and compliance.

Documented Tableau solutions, including data source connections, calculations, and dashboard functionalities, for knowledge sharing and future reference.

Evaluated and integrated Power BI custom visuals and extensions to enhance reporting capabilities and user experience.

Utilized Git, GitHub, and PowerShell for managing versioning of datasets and data pipelines, ensuring reproducibility and traceability in data engineering workflows.

Experienced in utilizing Jira dashboards and reporting tools to track project progress and drive continuous improvement initiatives.

TECHNICAL SKILLS:

Programming Languages/Frameworks: Python, Scala, Java

Big Data Technologies: HDFS, MapReduce, Sqoop, Hive, PySpark, Spark SQL, HBase

Clouds: AWS, Azure

Data Processing Systems/Frameworks: Apache Spark, Dask

Streaming Frameworks: Apache Kafka, Apache Storm, Spark Streaming

Database Management Systems: Oracle, DB2, SQL Server, MySQL, PostgreSQL

NoSQL Technologies: MongoDB, Cassandra, HBase

Data Warehousing Platforms: Snowflake, ADLS

BI & Visualization Tools: Tableau, Power BI

Workflow Orchestration Tools: Apache Airflow

Version Control Tools: Git, GitHub, SVN

ETL Tools: Informatica, SSIS, Alteryx, ADF, Airflow, Glue

Data Analysis & Visualization Libraries: Pandas, NumPy, Matplotlib, Seaborn, Plotly

Data Manipulation Languages: SQL, HiveQL, DAX, M (Power Query), PySpark

PROFESSIONAL EXPERIENCE:

Client: AMEX, New York, NY. Apr 2021 – Present

Role: Sr. Data Engineer

Responsibilities:

Managed Azure Data Lake Storage (ADLS) and Data Lake Analytics, with a solid understanding of how to integrate them with other Azure services.

Designed and developed a new solution for processing streaming data using Azure Stream Analytics and Azure Event Hubs.

Used Azure DevOps & Jenkins pipelines to build and deploy different resources (Code and Infrastructure) in Azure.

Implemented Informatica Data Quality to standardize data formats, values, and structures across disparate sources, ensuring consistency and compatibility for downstream processes.

Utilized version control systems to manage changes to Informatica mappings, workflows, and configurations, ensuring traceability, rollback capabilities, and collaboration among team members.

Skilled in integrating Python with distributed computing frameworks like Hadoop and Spark to handle big data processing tasks effectively.

Utilized Python's multiprocessing and multithreading capabilities for parallelizing data processing tasks, optimizing performance, and reducing processing time.

Implemented data archiving and purging strategies to manage data lifecycle in NoSQL databases efficiently.

Developed custom data migration scripts and tools to facilitate seamless migration from legacy systems to PostgreSQL.

Developed Kafka producers and consumers, HBase clients, Spark Streaming jobs, and Hadoop MapReduce jobs, along with components on HDFS and Hive.

Integrated the data warehouse setup and provided design and architecture suggestions for converting it to Hadoop using MapReduce, Hive, Sqoop, and Pig Latin.

Developed business-logic transformations in Spark Streaming using Kafka Direct Stream.

Developed Scala scripts and UDFs using both DataFrames and RDDs in Spark for data aggregation, queries, and writing data back into OLTP systems.

Developed a Spark Streaming script that consumes topics from Kafka and periodically pushes micro-batches of data to Spark for real-time processing (see the illustrative sketch at the end of this section).

Worked with complex SQL views, Stored Procedures, Triggers, and packages in large databases from various servers.

Collaborated with ETL developers and data engineers to integrate data models with PySpark-based ETL processes, ensuring seamless data flow and transformation.

Created custom Scala UDFs for Spark and Kafka procedures to cover functionality that was not working in the production environment.

Installed Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs in Scala for data cleaning and preprocessing.

Developed and maintained monitoring dashboards and alerts for proactive detection and resolution of issues within the Snowflake environment.

Developed automated workflows and scheduling mechanisms using Snowpipe and SnowSQL to streamline data ingestion processes, reducing manual intervention and improving operational efficiency.

Worked closely with data engineers to optimize data extraction and preparation processes for Tableau visualizations.

Provided expertise in DAX (Data Analysis Expressions) and M (Power Query) languages to optimize Power BI calculations and data transformations.

Managed Tableau user access and permissions, ensuring appropriate data access controls and security measures.

Utilized Git, GitHub, and PowerShell for managing versioning of datasets and data pipelines, ensuring reproducibility and traceability in data engineering workflows.

Experienced in utilizing Jira Service Management for handling data engineering-related incidents, problems, and service requests, ensuring prompt resolution and minimal downtime.

Environment: Azure (Data Lakes, Data Lake Analytics, Stream Analytics, Event Hubs, DevOps), Jenkins, Informatica, Python, Hadoop, Spark, NoSQL, PostgreSQL, Kafka, HBase, Hadoop MapReduce, Hive, Scala, SQL, Snowflake, Snowpipe, SnowSQL, Tableau, Power BI, DAX, M, Git, GitHub, Jira, PowerShell.
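A minimal, illustrative PySpark Structured Streaming sketch of the Kafka-to-Spark pattern described above; the broker address, topic name, payload schema, and console sink are placeholder assumptions, not details of the actual pipeline.

# Illustrative sketch only: consume a Kafka topic and process periodic micro-batches.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

# Hypothetical schema for the JSON payload carried on the Kafka topic.
payload_schema = StructType([
    StructField("event_id", StringType()),
    StructField("amount", DoubleType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
       .option("subscribe", "transactions")                   # placeholder topic
       .load())

# Parse the Kafka value bytes into typed columns.
parsed = (raw.select(from_json(col("value").cast("string"), payload_schema).alias("e"))
             .select("e.*"))

# Write each micro-batch to the console; a real job would write to HBase, Hive, or Snowflake.
query = (parsed.writeStream
         .outputMode("append")
         .format("console")
         .trigger(processingTime="30 seconds")
         .start())
query.awaitTermination()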

Client: Homesite, Boston, MA. July 2018 – Mar 2021

Role: Sr. Data Engineer

Responsibilities:

Designed and developed a security framework providing fine-grained access to objects in AWS S3 using AWS Lambda and DynamoDB.

Integrated Apache Airflow with AWS to monitor multi-stage ML workflows, with the tasks running on AWS SageMaker (see the simplified DAG sketch at the end of this section).

Implemented machine learning algorithms in Python to predict the quantity a user might want to order for a specific item, enabling automatic suggestions, using AWS Kinesis Data Firehose and an AWS S3 data lake.

Utilized Informatica performance tuning techniques to optimize data processing speed and resource utilization, maximizing efficiency and reducing processing times.

Integrated Informatica Data Quality to identify and rectify inconsistencies, errors, and duplicates within datasets, ensuring high-quality data for analysis and decision-making.

Responsible for data extraction and integration from different data sources into the Hadoop data lake by creating ETL pipelines using Spark, MapReduce, Pig, and Hive.

Responsible for running Hadoop streaming jobs to process terabytes of XML data, utilizing cluster coordination services through ZooKeeper.

Used the Spark SQL Scala and Python interfaces, which automatically convert RDDs of case classes to schema RDDs.

Experienced in importing and exporting data using stream processing platforms like Flume and Kafka.

Involved in converting Hive/SQL queries into Spark transformations using Spark RDD, Scala and Python.

Worked on database issues and connections with SQL and NoSQL databases like Apache HBase.

Imported and exported databases using SQL Server Integration Services (SSIS) and Data Transformation Services (DTS) packages.

Implemented database automation scripts and tools for routine maintenance tasks such as vacuuming and reindexing in PostgreSQL.

Implemented Kafka and Spark Structured Streaming for real-time data ingestion.

Designed and implemented Snowflake data pipelines using SnowSQL and Snowflake's native functionality for data ingestion and transformation.

Skilled in implementing Snowflake's data sharing features to securely share data across different regions and cloud providers.

Developed and maintained Tableau data extracts and data sources to support ongoing reporting needs.

Conducted Data blending, Data preparation using Alteryx and SQL for Tableau consumption and publishing data sources to Tableau server.

Designed and implemented Power BI deployment pipelines for automated testing, deployment, and version control.

Integrated Git with other tools and platforms used in the data engineering ecosystem, such as JIRA, Docker, or Jenkins, to streamline workflows and enhance productivity.

Utilized Jira's issue tracking capabilities to prioritize and manage data engineering tasks, ensuring timely resolution and delivery.

Environment: AWS (Lambda, DynamoDB, Kinesis Firehose, S3), Apache Airflow, Python, Informatica, ETL, Spark, MapReduce, Pig, Hive, Hadoop, ZooKeeper, Spark SQL, Scala, Flume, Kafka, SQL, NoSQL, HBase, SSIS, PostgreSQL, Snowflake, SnowSQL, Tableau, Alteryx, Power BI, Git, JIRA, Docker, Jenkins.
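A simplified, illustrative Airflow DAG sketch of a multi-stage ML workflow like the one monitored above; the DAG id, task names, and Python callables are hypothetical placeholders (the actual workflow ran its tasks on AWS SageMaker).

# Illustrative sketch only: a three-stage ML workflow orchestrated by Airflow.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_features(**_):
    # Placeholder: pull features from S3 / the data lake.
    pass


def train_model(**_):
    # Placeholder: submit a training job (e.g., to SageMaker).
    pass


def evaluate_model(**_):
    # Placeholder: compare metrics against the current production model.
    pass


with DAG(
    dag_id="ml_workflow_sketch",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_features", python_callable=extract_features)
    train = PythonOperator(task_id="train_model", python_callable=train_model)
    evaluate = PythonOperator(task_id="evaluate_model", python_callable=evaluate_model)

    extract >> train >> evaluate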

Client: Merck, Mumbai, India. Sept 2016 – May 2018

Role: Data Engineer

Responsibilities:

Created numerous pipelines in Azure Data Factory to pull data from disparate source systems using activities such as Move & Transform, Copy, Filter, ForEach, and Databricks.

Created pipelines in Azure using ADF (Azure Data Factory) to get the data from different source systems and transform the data by using many activities.

Designed and developed batch-processing and real-time processing solutions using ADF (Azure Data Factory), Azure Databricks clusters, and Azure Stream Analytics.

Performed data wrangling to clean, transform, and reshape data using the NumPy and Pandas Python libraries (see the illustrative sketch at the end of this section).

Implemented data imputation using various methods from Python's scikit-learn package.

Developed pipelines using Hive (HQL) to retrieve data from the Hadoop cluster and SQL to retrieve data from the Oracle database, and used ETL for data transformation.

Optimized numerous SQL statements and PL/SQL blocks by analyzing SQL execution plans, and created and modified triggers, SQL queries, and stored procedures for performance improvement.

Used Spark for interactive queries, processing of streaming data, and integration with NoSQL databases for high-volume data.

Developed stored procedures, triggers, and functions in PostgreSQL to automate data processing tasks.

Developed Spark Streaming applications to consume data from Kafka topics and insert the processed streams into HBase.

Developed Spark programs using the Scala API to compare the performance of Spark with Hive and SQL.

Skilled in integrating Python with distributed computing frameworks like Hadoop and Spark, leveraging Alteryx for ETL processes.

Designed and created Hive external tables using a shared metastore instead of Derby, with partitioning, dynamic partitioning, and bucketing, in the Hadoop cluster.

Extracted real-time data using Kafka and Spark Streaming by creating streams, converting them into RDDs, processing them, and storing the results in HBase.

Utilized Scala libraries for distributed computing to handle large-scale data processing tasks with speed and efficiency.

Developed and managed Tableau implementation plans for the stakeholders, ensuring timely completion and successful delivery according to stakeholder expectations.

Developed and maintained Power BI data flows to streamline data preparation processes and improve data consistency across reports.

Coordinated resources and processes to achieve Tableau implementation plans.

Developed custom Power BI connectors to integrate with proprietary and external data sources for comprehensive reporting.

Utilized Jira APIs for custom integrations and extending the platform's functionality to meet unique data engineering project requirements.

Environment: Azure (Data Factory, Databricks, Stream Analytics), NumPy, Pandas, Scikit-learn, Hive, HQL, SQL, PL/SQL, Spark, NoSQL, PostgreSQL, Hadoop, Kafka, HBase, ETL, Scala, Tableau, Power BI, Jira API.
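A small, illustrative Pandas/NumPy/scikit-learn sketch of the data wrangling and imputation steps described above; the input file and column names are hypothetical placeholders.

# Illustrative sketch only: clean, impute, and reshape a tabular dataset.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("sales.csv")  # hypothetical input file

# Clean: drop duplicates, normalize column names, coerce types.
df = df.drop_duplicates()
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# Impute missing numeric values with the column median.
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols] = SimpleImputer(strategy="median").fit_transform(df[numeric_cols])

# Reshape: one row per region, one column per month of total sales amount.
monthly = (df.assign(month=df["order_date"].dt.to_period("M"))
             .pivot_table(index="region", columns="month", values="amount", aggfunc="sum"))
print(monthly.head())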

Client: Amway Corp, Delhi, India. July 2014 – Aug 2016

Role: Python Developer

Responsibilities:

Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as AWS Simple Storage Service (S3) and DynamoDB.

Performed end-to-end architecture and implementation assessments of various AWS services such as EMR, Redshift, and S3.

Implemented predictive analytics and machine learning algorithms in Databricks to forecast key metrics, delivered as dashboards on AWS (S3/EC2).

Utilized Python to optimize SQL queries for data retrieval and manipulation, enhancing database performance and query efficiency.

Used Python tools such as Pandas, Matplotlib, Seaborn, NumPy, Scikit-learn and Plotly to perform data cleaning, feature selection, feature engineering, and extensive statistical analysis.

Conducted Exploratory Data Analysis (EDA) using Python libraries such as pandas, Matplotlib, Seaborn, and Plotly.

Created scalable data processing solutions utilizing Python libraries such as Pandas and NumPy, optimizing performance and resource utilization.

Used Python graphics packages such as Seaborn and Matplotlib to produce ROC curves for visual evaluation of model performance (see the illustrative sketch at the end of this section).

Analyzed and extracted data from various confidential databases using SQL queries, and developed ETL pipelines to extract, transform, and load data from various sources into NoSQL databases.

Installed, configured, and maintained Apache Hadoop clusters for application development, along with Hadoop tools such as Hive, Pig, HBase, and HDFS.

Experienced in Spark/Scala programming, with good knowledge of Spark architecture and its in-memory processing.

Experienced in integrating Hadoop with Kafka and in uploading clickstream data to HDFS.

Optimized Tableau performance by implementing efficient data models and calculations.

Published Power BI reports to the required organizations and made Power BI dashboards available in web clients and mobile apps.

Implemented Tableau best practices for dashboard design, layout, and user experience.

Maintained data flow documentation and performed object mapping and validation using Power BI tools.

Environment: AWS (EMR, S3, EC2, DynamoDB, Redshift), Python, Pandas, Matplotlib, Seaborn, NumPy, Scikit-learn, Plotly, SQL, EDA, NoSQL, Hadoop, Hive, Pig, HBase, HDFS, Spark/Scala, Kafka, Tableau, Power BI.
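A brief, illustrative sketch of producing an ROC curve with scikit-learn and Matplotlib, as mentioned above; the synthetic dataset and logistic regression model are assumptions for demonstration only.

# Illustrative sketch only: train a simple classifier and plot its ROC curve.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the real (confidential) dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, scores)
plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")  # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()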

EDUCATION: Jawaharlal Nehru Technology University, Hyderabad, TS, India

BTech in Computer Science and Engineering, June 2010 - May 2014


