Daniel Pallapati
Data Engineer
980-***-**** ***********@*****.*** LinkedIn
Professional Summary:
10+ years in the IT industry, including 6+ years in Data Engineering and 4+ years in Data Warehousing.
Results-oriented and highly skilled professional with experience in AWS, Snowflake, and Big Data technologies.
Strong experience writing scripts with the Python, PySpark, and Spark APIs for data analysis.
Extensive experience in working with HDFS, Sqoop, PySpark, Hive, MapReduce, and HBase for big data processing and analytics.
Expertise in Snowflake for creating and maintaining tables and views; proficient with Python libraries including PySpark, PyTest, PyMongo, PyExcel, Psycopg, NumPy, and Pandas.
Utilized AWS S3 for scalable and cost-effective data storage and retrieval.
Skilled in utilizing AWS EMR for big data processing, including technologies like Hadoop, Spark, Hive, MapReduce, and PySpark.
Adept in integrating AWS SNS and SQS for real-time event processing and messaging.
Skilled in utilizing AWS services such as CloudWatch, Kinesis, Route53 for effective monitoring, data streaming, DNS management, and network access control in cloud environments.
Proficient in managing user access and permissions to AWS resources using IAM.
Experienced in utilizing AWS Glue for ETL workflows, enabling efficient data extraction, transformation, and loading.
Strong knowledge of AWS CloudWatch for monitoring and managing AWS resources, setting up alarms, and collecting metrics.
Hands-on experience with GCP, including BigQuery, GCS buckets, Cloud Functions, Dataflow, Pub/Sub, Cloud Shell, the gsutil and bq command-line utilities, Data Fusion, and Stackdriver.
Experience migrating on-premises ETL workloads to GCP using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, and Composer.
Built data pipelines in Airflow on GCP for ETL jobs using a variety of Airflow operators.
Proficient in creating effective data pipelines and performing complex data manipulation using SnowSQL.
Implemented data pipelines using Pandas DataFrames, Spark DataFrames, and RDDs (see the sketch following this summary).
Skilled in designing roles, views, and implementing performance tuning techniques to enhance Snowflake system performance.
Extensive experience in the AWS cloud environment and the Hadoop ecosystem, designing, deploying, and operating highly available, scalable, and fault-tolerant systems.
Proficient in utilizing virtual warehouses, caching, and Snowpipe for real-time data ingestion and processing in Snowflake.
Strong knowledge of Snowflake's time travel feature for auditing and analyzing historical data.
Experienced in integrating data from diverse sources, including loading nested JSON data from AWS S3 buckets into tables in the Snowflake cloud data warehouse.
Highly proficient in Snowflake scripting to automate ETL processes, data transformations, and data pipelines.
Proficient in developing and optimizing Spark and Spark-Streaming applications for real-time data processing and analytics.
Skilled in scheduling and workflow management using IBM Tivoli, Control-M, Oozie, and Airflow for efficient job orchestration.
Strong database development skills in Teradata, Oracle, and SQL Server, including the development of stored procedures, triggers, and cursors.
Proficient in version control systems like Git, GitLab, and VSS for code repository management and collaboration.
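Illustrative sketch of the DataFrame pipelines and nested-JSON handling referenced above (a minimal PySpark example; bucket, field, and path names are hypothetical placeholders, not client data):

```python
# Minimal PySpark sketch: read nested JSON from S3, flatten one nested array
# field, and write Parquet back to S3 for downstream loading.
# All bucket and column names below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("nested-json-flatten").getOrCreate()

# Read raw nested JSON landed in S3.
raw = spark.read.json("s3a://example-bucket/raw/events/*.json")

# Unnest the hypothetical order.items array into one row per item.
flat = (
    raw
    .withColumn("item", explode(col("order.items")))
    .select(
        col("order.id").alias("order_id"),
        col("item.sku").alias("sku"),
        col("item.qty").cast("int").alias("qty"),
    )
)

# Persist the flattened result as Parquet for downstream consumers.
flat.write.mode("overwrite").parquet("s3a://example-bucket/curated/order_items/")
```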
TECHNICAL SKILLS:
Big Data Eco-system: HDFS, MapReduce, Spark, YARN, Hive, Pig, HBase, Sqoop, Flume, Kafka, Oozie, Zookeeper, NiFi, Impala
Hadoop Technologies: Apache Hadoop, Cloudera CDH4/CDH5, Hortonworks HDP
Programming Languages: Python, Scala, Shell Scripting, HiveQL
Machine Learning: Regression, Decision Tree, Clustering, Random Forest, Classification, SVM, NLP
Operating Systems: Windows, Linux (Ubuntu, CentOS)
NoSQL Databases: HBase, Cassandra, MongoDB
Databases: RDBMS, MySQL, Teradata, DB2, Oracle
Container/Cluster Managers: Docker, Kubernetes
BI Tools: Tableau, Power BI
Cloud: AWS, Azure, GCP
Web Development: HTML, XML, CSS
IDE Tools: Eclipse, Jupyter, Anaconda, PyCharm
Development Methodologies: Agile, Waterfall
Professional Experience
Client: AV Worx, Palm Beach, Florida May 2024 - Present
Azure Data Engineer
Project Description: Led a project at AV Worx to create customizable data pipelines that enhanced user capabilities and streamlined data processing. Engineered automated data workflows using RESTful APIs to feed data into Azure SQL Database and Azure Data Factory, eliminating manual data entry. Configured real-time API data integration, designed custom ETL scripts for CRM data synchronization, and integrated Microsoft Graph API through Azure API Management for unified data control. Implemented MSAL for OAuth 2.0 authentication and utilized Azure Key Vault for scalable API access. Developed and deployed dynamic DAGs in Apache Airflow, creating a user interface with Dash for management. Ensured workflow reliability with Azure Monitor and managed data migration using SQLAlchemy and pyodbc within Azure’s ecosystem.
Responsibilities:
Engineered automated data extraction and integration workflows from diverse external platforms using RESTful APIs, streamlining data pipelines into Azure SQL Database and Azure Data Factory, significantly reducing manual data handling.
Configured dynamic API calls for real-time data retrieval and parsing, facilitating continuous data integration with Azure SQL Database and ensuring data accuracy and easy accessibility.
Designed and executed tailored ETL scripts to synchronize data from CRM systems with Azure Data Factory pipelines, aligning data flows across CRM, ERP, and reporting tools.
Developed and managed Microsoft Graph API integrations via Azure API Management to centralize data control and retrieval for Teams channels, calendar events, and user profiles.
Implemented OAuth 2.0 authentication protocols using Microsoft Authentication Library (MSAL) to secure access to Microsoft 365 resources, integrating Azure Active Directory (Azure AD) for compliance and data protection.
Configured dynamic permission scopes and token management solutions with Azure Key Vault to enable scalable and secure API access, minimizing manual reconfiguration efforts.
Architected and deployed dynamic Directed Acyclic Graphs (DAGs) using Python and SQL in Apache Airflow, enhancing automation, task scheduling, and workflow efficiency (a sketch follows this section).
Built an interactive web interface using Dash and Plotly for user-driven DAG creation and management, empowering users and reducing manual coding dependency.
Deployed Apache Airflow server on Azure App Service, leveraging Azure’s scalability for robust workflow scheduling and execution.
Implemented comprehensive error handling and status tracking through Azure Monitor and Application Insights, improving reliability and facilitating rapid issue resolution.
Integrated workflows with Azure Functions to enable consistent scheduling and execution, streamlining operational setup.
Managed relational databases with SQLAlchemy and pyodbc to oversee structured data migration within Azure SQL Database, ensuring data consistency and seamless integration.
Utilized Azure SDK for cloud-based storage solutions, managing DAGs in Azure Blob Storage and optimizing configurations for seamless Apache Airflow deployment.
Created Entity-Relationship (ER) diagrams for workflow design and automation, enabling efficient deployment of automated workflows to Apache Airflow.
Environment: Azure Data Factory, Azure SQL Database, Azure Blob Storage, Azure App Service, Apache Airflow, Dash, Plotly, Azure API Management, Microsoft Graph API, Azure Active Directory, Azure Monitor, Application Insights, SQLAlchemy, pyodbc, Python, MSAL, OAuth 2.0, Azure Key Vault, Docker
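Illustrative sketch of the dynamic-DAG pattern referenced above (assumes Apache Airflow 2.x; the source configurations and the load_table callable are hypothetical placeholders, not the production code):

```python
# Minimal sketch of dynamic DAG generation in Apache Airflow 2.x.
# Source names, schedules, and the load_table callable are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical per-source configurations that drive DAG creation.
SOURCES = {
    "crm_contacts": {"schedule": "@hourly"},
    "erp_orders": {"schedule": "@daily"},
}

def load_table(source_name: str) -> None:
    """Placeholder task body: pull data for one source and land it in Azure SQL."""
    print(f"Loading {source_name} into Azure SQL Database")

def build_dag(source_name: str, schedule: str) -> DAG:
    """Create one ingestion DAG for a configured source."""
    with DAG(
        dag_id=f"ingest_{source_name}",
        start_date=datetime(2024, 1, 1),
        schedule_interval=schedule,
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id=f"load_{source_name}",
            python_callable=load_table,
            op_kwargs={"source_name": source_name},
        )
    return dag

# Register one DAG per configured source so Airflow discovers them at parse time.
for name, cfg in SOURCES.items():
    globals()[f"ingest_{name}"] = build_dag(name, cfg["schedule"])
```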
Client: Lucid Motors, Newark, California Jul 2021 – Apr 2024
GCP Data Engineer
Project Description: Led a project at Lucid Motors focused on enhancing automotive safety and user experience through advanced driver-assistance and autonomous driving capabilities. Leveraged Google Cloud Platform (GCP) for data engineering to build data pipelines for ingesting, cleansing, and transforming large volumes of vehicle sensor data, environmental data, and user feedback. Optimized data workflows and collaborated with cross-functional teams to develop and deploy predictive models using GCP’s AI and machine learning tools, such as AutoML and TensorFlow, resulting in improved accuracy and performance of driver-assistance systems.
Responsibilities:
Developed streaming pipelines using Apache Spark with Python.
Used Spark, SQL, and Python on the Databricks platform to develop streaming pipelines, encode and decode JSON objects, and build data pipelines, ensuring efficient data processing and transformation.
Built and architected multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP, coordinating tasks across the team.
Built data pipelines in Airflow on GCP for ETL jobs using different Airflow operators.
Worked with GCP Dataproc, GCS, Cloud Functions, and BigQuery.
Used the Cloud Shell SDK in GCP to configure Dataproc, Cloud Storage, and BigQuery services.
Participated in migrating the on-premises Hadoop system to GCP (Google Cloud Platform).
Migrated existing cron jobs to Airflow/Composer in GCP.
Extensive involvement in software solution development, employing agile methodologies to implement robust data engineering workflows on Databricks, ensuring smooth integration and deployment of solutions.
Designed pipelines with Apache Beam, Kubeflow, and Dataflow and orchestrated jobs on GCP (a sketch follows this section).
Used Google Dataflow to build and execute data processing pipelines on GCP.
Used Cloud Data Fusion to visually construct and execute ETL pipelines.
Created PL/SQL packages and Database Triggers and developed user procedures and prepared user manuals for the new programs.
Experience in moving data between GCP and Azure using Azure Data Factory.
Documented the inventory of modules, infrastructure, storage, and components of the existing on-premises data warehouse to analyze and identify the technologies and strategies required for the Google Cloud migration.
Wrote Hive SQL scripts to create complex tables with performance optimizations such as partitioning, clustering, and skew handling.
Developed code in Spark SQL for implementing Business logic with Python as a programming language.
Created Tableau reports with complex calculations and worked on ad-hoc reporting using Power BI.
Worked on Informatica tools such as PowerCenter, MDM, Repository Manager, and Workflow Monitor.
Analyzed the SQL scripts and designed the solution to implement using PySpark.
Managed containers with Docker by writing Dockerfiles, set up automated builds on Docker Hub, and installed and configured Kubernetes.
Created Kibana dashboards and consolidated several source and target systems into Elasticsearch for real-time, end-to-end transaction tracking and analysis.
Built visuals and dashboards using the Power BI reporting tool.
Used Spark for parallel data processing and improved performance.
Used Python for web scraping and data extraction.
Environment: GCP, Snowflake, RDS, Redshift, Lambda, Amazon SageMaker, Apache Spark, Databricks, HBase, Apache Kafka, Hive, Sqoop, Scala, MapReduce, Apache Pig, Python, Tableau, UNICA, Kibana, Power BI, Informatica, Terraform, Docker.
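Illustrative sketch of a Beam/Dataflow pipeline of the kind referenced above (project, bucket, table, and field names are hypothetical placeholders, not the actual Lucid Motors schema):

```python
# Minimal Apache Beam sketch: read CSV-like sensor records from GCS,
# parse them, and write rows to BigQuery.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_record(line: str) -> dict:
    """Turn one 'vehicle_id,ts,speed' line into a BigQuery row dict."""
    vehicle_id, ts, speed = line.split(",")
    return {"vehicle_id": vehicle_id, "event_ts": ts, "speed": float(speed)}

options = PipelineOptions(
    runner="DataflowRunner",          # or DirectRunner for local testing
    project="example-project",        # placeholder project id
    region="us-central1",
    temp_location="gs://example-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromGCS" >> beam.io.ReadFromText("gs://example-bucket/sensor/*.csv")
        | "Parse" >> beam.Map(parse_record)
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "example-project:telemetry.vehicle_speed",   # assumes table already exists
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )
```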
Client: Centene Corp, St. Louis, Missouri Oct 2018 – Jun 2021
AWS Snowflake Data Engineer
Project Description: Centene Corp seeks to optimize its healthcare provider network to improve access to quality care, enhance member satisfaction, and reduce healthcare costs. The objective of this project is to develop a real-time provider network optimization platform that leverages data analytics and machine learning to dynamically match members with the most appropriate and cost-effective providers based on their healthcare needs and preferences.
Responsibilities:
Utilized Spark, SQL, and Python on Databricks platform to design and implement data ingestion and storage solutions using AWS S3, Redshift, and Glue.
Created and configured workflows for data processing and ETL pipelines.
Implemented AWS Athena for ad-hoc data analysis and querying on S3 data.
Integrated AWS SNS and SQS for real-time event processing and messaging.
Utilized Scala to deploy comprehensive data engineering applications, integrating AWS S3, Redshift, and Glue to facilitate data extraction, transformation, and loading processes, ensuring the delivery of scalable and maintainable solutions.
Performed extraction, aggregation, and consolidation of Adobe data within AWS Glue using PySpark.
Utilized AWS CloudWatch for monitoring and managing resources, setting up alarms, and collecting metrics.
Designed and implemented data streaming solutions using AWS Kinesis for real-time data processing.
Successfully deployed data analytics and engineering resources on AWS using Terraform, configuring workflows for data processing and ETL pipelines, and integrating AWS Athena for ad-hoc data analysis and querying on S3 data.
Optimized DNS configurations and routing using AWS Route53 for efficient application and service deployment.
Loaded the transformed data into the AWS Redshift data warehouse for analysis.
Designed and implemented Snowflake stages to efficiently load data from various sources into Snowflake tables (a sketch follows this section).
Built Apache Airflow with AWS to analyze multi-stage machine learning processes with Amazon SageMaker tasks.
Designed and developed a security framework to provide fine-grained access to objects in AWS S3 using AWS Lambda.
Used AWS EMR to move large data sets (big data) into other platforms such as AWS data stores, Amazon S3, and Amazon DynamoDB.
Managed various Snowflake table types and optimized warehouses for performance.
Developed complex SnowSQL queries and partitioning techniques for efficient data retrieval.
Configured multi-cluster warehouses and defined access privileges for security.
Implemented caching mechanisms, Snowpipe for real-time data ingestion, and time travel for historical data tracking.
Used regular expressions and Snowflake scripting for automation of pipelines and transformations.
Developed data processing pipelines using Hadoop, including HDFS, Sqoop, Hive, MapReduce, and Spark.
Implemented Spark Streaming for real-time data processing and analytics.
Implemented scheduling and job automation using IBM Tivoli, Control-M, Oozie, and Airflow.
Designed and developed database solutions using Teradata, Oracle, and SQL Server.
Utilized Git, GitLab, and VSS for code repository management and collaboration.
Worked on Apache NiFi to decompress and move JSON files from local storage to HDFS.
Moved data from Teradata to a Hadoop cluster using TDCH/FastExport and Apache NiFi.
Environment: AWS, AWS S3, Redshift, EMR, SNS, SQS, Athena, Glue, CloudWatch, Kinesis, Route53, IAM, Sqoop, MySQL, HDFS, Apache Spark, Hive, Cloudera, Kafka, Zookeeper, Oozie, PySpark, Ambari, JIRA, IBM Tivoli, Control-M, Airflow, Teradata, Oracle, SQL.
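Illustrative sketch of loading staged JSON from S3 into Snowflake with the snowflake-connector-python package (account, stage, table, and column names are hypothetical placeholders, not client objects):

```python
# Minimal sketch of a COPY INTO load from an S3 external stage into Snowflake.
# Assumes RAW.PROVIDER_EVENTS_JSON has a single VARIANT column named payload;
# all object and credential names below are hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    account="example_account",
    user="example_user",
    password="***",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="RAW",
)

try:
    cur = conn.cursor()
    # Land the raw JSON into the VARIANT column from the named S3 stage.
    cur.execute("""
        COPY INTO RAW.PROVIDER_EVENTS_JSON
        FROM @RAW.S3_PROVIDER_STAGE
        FILE_FORMAT = (TYPE = 'JSON')
        ON_ERROR = 'CONTINUE'
    """)
    # Flatten the nested JSON into a typed, curated table.
    cur.execute("""
        INSERT INTO CURATED.PROVIDER_EVENTS (provider_id, member_id, event_ts)
        SELECT payload:provider_id::STRING,
               payload:member_id::STRING,
               payload:event_ts::TIMESTAMP_NTZ
        FROM RAW.PROVIDER_EVENTS_JSON
    """)
finally:
    conn.close()
```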
Client: Nissan, Irving, Texas Aug 2017 – Sept 2018
AWS Data Engineer
Project Description: Nissan is embarking on a data engineering project aimed at leveraging data-driven insights to optimize vehicle manufacturing processes and enhance operational efficiency across its global manufacturing facilities. The objective of this project is to develop a comprehensive data infrastructure and analytics platform that enables real-time monitoring, analysis, and optimization of key manufacturing parameters.
Responsibilities:
Explored Spark to improve the performance and optimization of existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, and Spark on YARN.
Handled file movements between HDFS and AWS S3, worked extensively with S3 buckets in AWS, and converted all Hadoop jobs to run on EMR by configuring the cluster according to the data size.
Wrote Spark applications for data validation, cleansing, transformations, and custom aggregations; imported data from different sources into Spark RDDs for processing; and developed custom aggregate functions using Spark SQL for interactive querying.
Collected data with Spark Streaming from an AWS S3 bucket in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
Created AWS Glue job for archiving data from Redshift tables to S3 (online to cold storage) as per data retention requirements and involved in managing S3 data layers and databases including Redshift and Postgres.
Involved in designing and deploying Amazon EC2, Amazon S3, Amazon RDS, Amazon Elastic Load Balancing, Amazon SWF, Amazon SQS, and other services of the AWS infrastructure.
Created a star schema for drill-down reporting and developed PySpark procedures, functions, and packages to load data.
Developed a Python script to load CSV files into S3 buckets (a sketch follows this section).
Administered AWS S3 buckets, performed folder management in each bucket, and managed logs and objects within each bucket.
Integrated Hadoop into traditional ETL, accelerating the extraction, transformation, and loading of massive structured and unstructured data.
Worked extensively on importing metadata into Hive, migrated existing tables and applications to Hive and the AWS cloud, and made the data available in Athena and Snowflake.
Extensively used Stash (Bitbucket) for code control and worked on AWS components such as Airflow, Elastic MapReduce (EMR), Athena, and Snowflake.
Designed scalable architectures on AWS to accommodate the growing volume and complexity of customer data, ensuring high availability and reliability of the insight platform.
Implemented data governance frameworks on AWS, including access control policies and encryption mechanisms, to ensure the security and privacy of customer data in compliance with regulatory standards.
Environment: Spark, AWS, EC2, EMR, Hive, SQL Workbench, Tableau, Kibana, Sqoop, Spark SQL, Spark Streaming, Scala, Python, Hadoop (Cloudera Stack), Informatica, Jenkins, Docker, Hue, Netezza, Kafka, HBase, HDFS, Pig, Oracle, ETL, AWS S3, AWS Glue, Git, Grafana.
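Illustrative sketch of the CSV-to-S3 loader referenced above, using boto3 (the bucket, prefix, and local directory are hypothetical placeholders):

```python
# Minimal sketch: upload every local .csv file to an S3 bucket with boto3.
# Bucket name, prefix, and source directory are hypothetical.
import os

import boto3

s3 = boto3.client("s3")

def upload_csv_files(local_dir: str, bucket: str, prefix: str) -> None:
    """Upload every .csv file under local_dir to s3://bucket/prefix/."""
    for name in os.listdir(local_dir):
        if not name.endswith(".csv"):
            continue
        local_path = os.path.join(local_dir, name)
        key = f"{prefix}/{name}"
        s3.upload_file(local_path, bucket, key)
        print(f"Uploaded {local_path} to s3://{bucket}/{key}")

if __name__ == "__main__":
    upload_csv_files("/data/exports", "example-manufacturing-bucket", "raw/csv")
```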
Client: Consumers Energy, Michigan Jan 2015 – Jul 2017
AWS Data Engineer
Project Description: Consumers Energy is embarking on a transformative data engineering project aimed at leveraging data-driven insights to enhance energy efficiency, optimize operations, and improve customer satisfaction. The objective of this project is to develop a comprehensive data infrastructure and analytics platform that enables real-time monitoring, analysis, and optimization of energy consumption across the company's customer base.
Responsibilities:
Populated a data lake on AWS S3 from various data sources using AWS Kinesis.
Used Amazon EMR to process big data with tools such as Hadoop, Spark, and Hive.
Authored AWS Lambda functions to run Java scripts in response to events in S3.
Configured, deployed, and maintained multi-node Dev and Test Kafka clusters.
Created AWS CloudFormation templates to provision infrastructure in the cloud.
Performed daily performance auditing of the PostgreSQL RDBMS, including SQL query profiling, log review, and database health metric collection.
Wrote Kafka producers to stream data from external REST APIs to Kafka topics (a sketch follows this section).
Developed Java MapReduce jobs for aggregation and user interest-matrix calculation.
Created and implemented AWS IAM user roles and policies to authenticate and control access.
Implemented optimizations in Spark nodes and improved the performance of the Spark Cluster.
Processed multiple terabytes of data stored in S3 using AWS Redshift and AWS Athena.
Developed ETL jobs in AWS Glue to extract data from S3 buckets and loaded it into the data mart in Amazon Redshift.
Implemented and maintained EMR, Redshift pipeline for data warehousing.
Used Spark, Spark SQL, and Spark Streaming for data analysis and processing.
Developed Spark applications in Java on a distributed environment to load large numbers of CSV files with differing schemas into Hive ORC tables.
Worked on reading and writing multiple data formats like JSON, ORC, Parquet on HDFS using Spark.
Environment: Spark, Hive, JSON, AWS, Hadoop, Java, MapReduce, ETL, Kafka.
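Illustrative sketch of a Kafka producer that streams REST API data to a topic, using the kafka-python package (the API URL, brokers, topic name, and polling interval are hypothetical placeholders):

```python
# Minimal sketch: poll an external REST API and publish each reading to Kafka.
# Brokers, URL, and topic below are hypothetical.
import json
import time

import requests
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["broker1:9092", "broker2:9092"],
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

def poll_and_publish(api_url: str, topic: str) -> None:
    """Fetch readings from the REST endpoint and send each one to Kafka."""
    response = requests.get(api_url, timeout=10)
    response.raise_for_status()
    for reading in response.json():
        producer.send(topic, value=reading)
    producer.flush()

if __name__ == "__main__":
    while True:
        poll_and_publish("https://api.example.com/meter-readings", "meter.readings")
        time.sleep(60)  # poll once a minute
```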
Client: Caresource, Dallas, Texas Sep 2013 – Dec 2014
Big Data Engineer / Hadoop Developer
Responsibilities:
Designed and implemented scalable, fault-tolerant big data solutions using Hadoop and related technologies such as HDFS, MapReduce, YARN, Hive, Pig, and Spark.
Configured and managed Hadoop clusters using tools such as Cloudera Manager, Ambari, and Hortonworks Data Platform.
Developed and maintained data pipelines using tools like Apache NiFi, Apache Kafka, and Apache Storm.
Built and maintained data warehousing solutions using Hive and Impala.
Optimized and improved the performance of Hadoop clusters by tuning parameters and implementing best practices.
Collaborated with data scientists, data analysts, and other team members to support data-driven decision-making.
Worked with big data processing and analysis frameworks such as Apache Spark, Storm, and Flink.
Used data integration and migration tools such as Apache NiFi, Apache Kafka, and Sqoop.
Used cluster management and orchestration tools such as Cloudera Manager, Ambari, and Hortonworks Data Platform.
Worked with different data sources such as HDFS, Hive, and Teradata for Spark to process the data (a sketch follows this section).
Used Kafka, a publish-subscribe messaging system, creating topics with producers and consumers to ingest data into the application for Spark processing, and configured ZooKeeper to coordinate and support the distributed applications with high throughput, availability, and low latency.
Configured Nginx to serve static web content, reducing the load on the application server.
Wrote SQL queries to perform CRUD operations on PostgreSQL, saving, updating, and deleting rows in tables using Play Slick.
Created and updated Jenkins jobs and pipelines to deploy the application to environments such as Development, QA, and Production.
Environment: Spark, Zookeeper, SQL, Scala, Jenkins, Kafka, HBase, HDFS, Hive, Teradata, NiFi, Storm, Flink, MapReduce, YARN, Pig.
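Illustrative sketch of combining Hive and Teradata sources in Spark, as referenced above (table names, JDBC URL, credentials, and output path are hypothetical placeholders; assumes the Teradata JDBC driver jar is on the Spark classpath):

```python
# Minimal PySpark sketch: read a Hive table and a Teradata table (via JDBC),
# join them, and write the result back to HDFS as Parquet.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("claims-enrichment")
    .enableHiveSupport()
    .getOrCreate()
)

# Hive source (hypothetical warehouse.claims table).
claims = spark.sql("SELECT claim_id, member_id, claim_amount FROM warehouse.claims")

# Teradata source over JDBC (hypothetical host, table, and credentials).
members = (
    spark.read.format("jdbc")
    .option("url", "jdbc:teradata://td-host/DATABASE=member_db")
    .option("driver", "com.teradata.jdbc.TeraDriver")
    .option("dbtable", "member_dim")
    .option("user", "etl_user")
    .option("password", "***")
    .load()
)

# Enrich claims with member attributes and persist to HDFS.
enriched = claims.join(members, on="member_id", how="left")
enriched.write.mode("overwrite").parquet("hdfs:///data/curated/claims_enriched")
```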