
Data Engineer Engineering

Location:
Dallas, TX
Posted:
February 18, 2025


Resume:

Praneeth V

Email: **.********@*****.*** Phone: 469-***-****

Sr. Cloud Data Engineer

Professional Summary

9+ years of experience in Big Data Engineering, Cloud Computing (AWS & Azure), Data Pipelines, and Real-time Analytics.

Hands-on expertise with the Hadoop ecosystem, including strong knowledge of Big Data technologies such as HDFS, Spark, YARN, Kafka, MapReduce, Apache Cassandra, HBase, Zookeeper, Hive, Oozie, Impala, Pig, and Flume.

Hands-on experience with AWS components such as EMR, EC2, S3, RDS, IAM, Auto Scaling, CloudWatch, SNS, Athena, Glue, Kinesis, Lambda, Redshift, and DynamoDB, including securing workloads in the AWS public cloud.

Skilled in setting up Kubernetes clusters using tools like kubeadm, kops, or managed Kubernetes services (e.g., Amazon EKS, Google GKE, Azure AKS).

Proven experience deploying software development solutions for a wide range of high-end clients, including Big Data Processing, Ingestion, Analytics, and Cloud Migration from On-Premises to AWS Cloud.

Expertise in Azure infrastructure management (Azure Web Roles, Worker Roles, SQL Azure, Azure Storage).

Strong experience with Informatica ETL, including PowerCenter Designer, Workflow Manager, Workflow Monitor, Informatica Server, and Repository Manager.

Good understanding of Spark architecture with Databricks and Structured Streaming, including setting up Databricks on AWS and Microsoft Azure, configuring Databricks workspaces for business analytics, managing clusters, and managing the machine learning lifecycle.

Experienced in designing and implementing scalable and efficient data warehousing solutions using Azure Synapse, including schema design, partitioning, and indexing strategies.

Experienced in building Snowpipe ingestion pipelines, with in-depth knowledge of data sharing in Snowflake and of Snowflake database, schema, and table structures.

Automated various business processes, reducing manual intervention and enhancing efficiency by leveraging BizTalk orchestration capabilities.

Experience and knowledge in NoSQL databases such as MongoDB, HBase, and Cassandra.

Performance tuning in Hive and Impala using methods including, but not limited to, dynamic partitioning, bucketing, indexing, file compression, and cost-based optimization.

Hands-on experience handling different file formats such as JSON, Avro, ORC, and Parquet.

Hands on experience with Spark using SQL, Python and Scala.

Knowledge on DevOps tools and techniques like Jenkins and Docker.

Evaluated Hortonworks NiFi (HDF 2.0) and recommended a solution to ingest data from multiple data sources into HDFS and Hive using NiFi.

Hands on experience in Spark architecture and its integrations like Spark SQL, Data Frames and Datasets API.

Designed and developed logical and physical data models that utilize concepts such as Star Schema, Snowflake Schema and Slowly Changing Dimensions.

Utilized Data flow and mapping data flow features in Azure Data Factory to perform complex data transformations, ensuring data quality and consistency.

Developed ETL pipelines using AWS Glue, Python, and PostgreSQL to extract, transform, and load data from various sources into PostgreSQL databases, enabling data integration and analysis.

Authored AWS Glue scripts to transfer data, run ETL jobs, and perform aggregations in PySpark.
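
A minimal sketch of the kind of Glue PySpark job described above, assuming hypothetical catalog database, table, and S3 path names; the real jobs and connection options would differ.

    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    # Resolve standard Glue job arguments.
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])

    sc = SparkContext()
    glue_context = GlueContext(sc)
    spark = glue_context.spark_session
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read a cataloged source table (hypothetical database/table names).
    source = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="raw_orders")

    # Simple aggregation in PySpark before loading.
    orders = source.toDF()
    daily_totals = (orders.groupBy("order_date")
                          .sum("amount")
                          .withColumnRenamed("sum(amount)", "total_amount"))

    # Write the result out; an S3 path is used here, a JDBC/PostgreSQL sink is similar.
    daily_totals.write.mode("overwrite").parquet("s3://example-bucket/curated/daily_totals/")

    job.commit()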

Hands-on experience handling database issues and connections with SQL and NoSQL databases such as MongoDB, HBase, and SQL Server. Created Java applications to handle data in MongoDB and HBase.

TECHNICAL SKILLS

Big Data Technologies: Hadoop MapReduce, HDFS, Sqoop, Pig, Hive, HBase, Oozie, Flume, NiFi, Kafka, ZooKeeper, YARN, Apache Spark, Mahout, Spark MLlib, Apache Druid.

Databases: Oracle, MySQL, SQL Server, Azure Synapse, MongoDB, Cassandra, DynamoDB, PostgreSQL, Cosmos.

Programming: Python, PySpark, Scala, Java, C, C++, Shell script, Perl script, SQL, Splunk.

Cloud Technologies: AWS, Microsoft Azure, GCP

Frameworks: Django REST framework, MVC, Hortonworks

Tools: Alteryx, PyCharm, Eclipse, Visual Studio, SQL*Plus, SQL Developer, TOAD, SQL Navigator, Query Analyzer, SQL Server Management Studio, SQL Assistant, Postman, NoSQL.

Versioning tools: SVN, Git, GitHub

Operating Systems: Windows 7/8/XP/2008/2012, Ubuntu Linux, macOS

Network Security: Kerberos

Database Modeling: Dimensional Modeling, ER Modeling, Star Schema Modeling, Snowflake Modeling

Monitoring Tool: Apache Airflow

Visualization/ Reporting: Tableau, ggplot2, matplotlib, SSRS and Power BI

Machine Learning Techniques: Linear & Logistic Regression, Classification and Regression Trees, Random Forest, Association Rules, NLP, and Clustering.

Educational Qualifications:

Bachelor’s in Computer Science and Engineering from Jawaharlal Nehru Technological University, Hyderabad, in 2015

Professional Experience

Sr. AWS Data Engineer September 2023 to Present

Kohls – Menomonee Falls, WI

Responsibilities:

Proficient in designing and implementing data models to support business requirements and improve data integrity and efficiency.

Integrated Kubernetes with CI/CD pipelines, automating the deployment process using tools like Jenkins, GitLab CI/CD, or CircleCI.

Implemented data structures using best practices in data modeling, ETL/ELT processes, SQL, and Python.

Experienced in integrating Azure Synapse with external data sources using PolyBase, enabling seamless querying and processing of data across on-premises and cloud environments.

Leveraged Azure Synapse Analytics to seamlessly integrate big data processing and analytics capabilities, empowering data exploration and insights generation.

Excellent experience with SSIS, creating ETL packages to validate, extract, transform, and load data into data warehouses, data marts, and Power BI.

Developed custom-built ETL solution, batch processing, and real-time data ingestion pipeline to move data in and out of the Hadoop cluster using PySpark and Shell Scripting.

Designed and implemented relational databases in MS SQL Server, optimizing performance and scalability.

Managed MS SQL Server instances, including installation, configuration, and upgrading of database software.

Designed and implemented ETL processes using SSIS to integrate data from various sources.

Built and maintained data models supporting data warehousing and analytics initiatives, ensuring alignment with business requirements for processing traffic violation data across multiple states.

Employed branching strategies in Azure DevOps to manage multiple development streams and ensure safe code deployments, minimizing downtime during production releases.

Fostered a culture of continuous learning and improvement within the data engineering team by staying updated with the latest trends in cloud-based data solutions and mentoring junior engineers on best practices in data pipeline development.

Ensured data quality and integrity by implementing rigorous testing frameworks for data pipelines and ETL processes, leading to a significant reduction in data discrepancies for traffic violation records.

Integrated Microsoft CDP with multiple data sources, ensuring seamless, accurate, and real-time data flow through Azure Dataverse and Power Platform.

Gathered report requirements and determined the best solution to provide the results in either a Reporting Services report or Power BI.

Created Power BI reports and dashboards, developing calculated columns and complex DAX measures.

Implemented robust security measures within AWS GovCloud, including encryption, access control, and multi-factor authentication, ensuring the protection of sensitive data and compliance with ITAR requirements.
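
As one illustration of the encryption controls mentioned above, the snippet below enables default server-side encryption on an S3 bucket with boto3; the bucket name and KMS key alias are placeholders, and the actual controls (IAM policies, MFA enforcement) went beyond this single setting.

    import boto3

    # Placeholder names; a GovCloud deployment would use its own region and key IDs.
    BUCKET = "example-sensitive-data-bucket"
    KMS_KEY_ID = "alias/example-data-key"

    s3 = boto3.client("s3", region_name="us-gov-west-1")

    # Enforce default SSE-KMS encryption for all new objects in the bucket.
    s3.put_bucket_encryption(
        Bucket=BUCKET,
        ServerSideEncryptionConfiguration={
            "Rules": [
                {
                    "ApplyServerSideEncryptionByDefault": {
                        "SSEAlgorithm": "aws:kms",
                        "KMSMasterKeyID": KMS_KEY_ID,
                    },
                    "BucketKeyEnabled": True,
                }
            ]
        },
    )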

Collaborated with cross-functional teams to design and implement high availability and disaster recovery solutions specific to AWS GovCloud, resulting in improved uptime for mission-critical applications.

Designed scalable and resilient AWS Connect architectures, leveraging services such as Amazon EC2, Amazon RDS, AWS Lambda, and Amazon S3 to ensure high availability and fault tolerance.

Engineered end-to-end data pipelines for processing large volumes of data in Azure Data Lake Storage.

Developed robust and scalable data pipelines using Azure Data Factory to automate ETL processes, ensuring seamless data integration from multiple sources.

Environment: Spark (RDD, PySpark, MLlib, Spark Streaming, Spark 1.6/2.0), AWS Glue, C#, Apache Kafka, Celonis, Amazon S3, Java, SQL, T-SQL, Angular, AWS Cloud, AWS GovCloud, Azure Data Lake, Terraform, Azure, Databricks, ETL, MS SQL Server, GCP, Kusto, NumPy, SciPy, pandas, scikit-learn, Seaborn, NLTK, EMR, EC2, Amazon RDS, data lake, Kubernetes, Docker, Python, Cloudera Stack, HBase, Hive, Impala, Pig, NiFi, Elasticsearch, Logstash, MicroStrategy, Apache Parquet, Apache Iceberg, BizTalk, Kibana, JAX-RS, Spring, Hibernate, Apache Airflow, Oozie, RESTful API, JSON, JAXB, XML, WSDL, MySQL, Talend, Cassandra, MongoDB, HDFS, ELK/Splunk, Athena, Tableau, Redshift, Scala, Snowflake.

Sr AWS Cloud Data Engineer July 2021 to August 2023

Sofi – San Francisco, CA

Responsibilities:

Led requirements gathering, system analysis, and testing effort estimation, ensuring comprehensive project planning and execution.

Demonstrated proficiency in migration strategies, solution orchestration and data model design.

Designed various system components including Sqoop, Hadoop processes, Spark, and FTP integrations.

Applied expertise in Snowflake to develop custom data models and semantic reporting layers, meeting diverse customer reporting needs.

Optimized Hive and Spark queries, utilizing techniques such as window functions and customized Hadoop parameters for enhanced performance.

Developed ETL processes using PySpark, leveraging the DataFrame API and Spark SQL API for efficient data processing.

Executed data transformations and actions using Spark, storing resulting data in HDFS and Snowflake databases.

Migrated on-premises applications to AWS, utilizing services like EC2, S3, and managing clusters on AWS EMR.

Proficiently employed Spark Streaming, Kafka, and Flume for real-time data analytics, configuring Spark Streaming to store data from Kafka into HDFS.
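
A minimal PySpark Structured Streaming sketch of the Kafka-to-HDFS flow described above; the broker, topic, and paths are placeholders, and the original work may equally have used the older DStream API.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

    # Read a Kafka topic as a streaming DataFrame (broker and topic are placeholders).
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")
              .option("subscribe", "clickstream")
              .load())

    # Kafka values arrive as bytes; cast to string before writing.
    parsed = events.select(col("key").cast("string"), col("value").cast("string"))

    # Land the stream in HDFS as Parquet, with checkpointing for fault tolerance.
    query = (parsed.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/clickstream/")
             .option("checkpointLocation", "hdfs:///checkpoints/clickstream/")
             .start())

    query.awaitTermination()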

Designed and implemented ETL processes in AWS Glue for migrating campaign data into AWS Redshift from various external sources.

Utilized Jira for issue tracking and Jenkins for continuous integration and deployment, ensuring efficient project management.

Enforced data cataloging and governance standards, promoting data quality and compliance efforts.

Developed DataStage jobs incorporating various stages for data processing and manipulation.

Leveraged Airflow for ETL batch processing, scheduling, and monitoring jobs, ensuring efficient data loading into Snowflake.
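
A simplified Airflow DAG along the lines described above, with a hypothetical extract step followed by a Snowflake load step; the DAG name, schedule, and task logic are illustrative only.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract_to_stage():
        # Placeholder: pull the day's files and push them to a Snowflake stage (e.g., via S3).
        pass

    def load_into_snowflake():
        # Placeholder: run COPY INTO against the staged files using a Snowflake connection.
        pass

    with DAG(
        dag_id="daily_snowflake_load",      # illustrative DAG name
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = PythonOperator(task_id="extract_to_stage", python_callable=extract_to_stage)
        load = PythonOperator(task_id="load_into_snowflake", python_callable=load_into_snowflake)

        extract >> load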

Collaborated with data stewards to ensure compliance and data governance in cloud-based ETL pipelines.

Employed PySpark for data extraction, filtering, and transformation within data pipelines, enhancing data processing efficiency.

Utilized Data Build Tool and AWS Lambda for ETL transformations, optimizing data processing workflows.

Developed Spark applications in Databricks using Spark SQL for data extraction, transformation, and aggregation, analyzing customer usage patterns.

Managed Spark Databricks cluster sizing, monitoring, and troubleshooting to ensure optimal performance.

Automated data loading processes using Unix Shell scripts, improving efficiency and reducing manual effort.

Implemented monitoring solutions using Ansible, Terraform, Docker, and Jenkins, ensuring robust system performance and stability.

Environment: Hadoop ecosystem (Hive, Sqoop, MapReduce), Apache Spark (PySpark, Spark SQL), AWS services (S3, Redshift, Glue, EMR, Lambda, SQS, CloudWatch), Databricks, Jenkins, Airflow, Ansible, Terraform, Oracle, Snowflake, Git/GitHub, Jira, Python, Unix Shell Scripting.

Sr. AWS Data Engineer October 2019 to June 2021

Syntel/Cuna Mutual – Madison, WI

Responsibilities:

Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and Spark on YARN.

Used HBase for applications requiring low-latency access to large volumes of data, such as social media analytics, fraud detection, and monitoring systems.

Worked with RDDs in Scala and used Sqoop for importing and exporting data between RDBMS and HDFS.

Collected data from AWS S3 buckets in near real time using Spark Streaming and performed the necessary transformations and aggregations on the fly to build the common learner data model.

Integrated AWS Connect with existing data pipelines, enabling seamless capture and analysis of call data for business intelligence and reporting purposes.

Created AWS Glue job for archiving data from Redshift tables to S3 (online to cold storage) as per data retention requirements and involved in managing S3 data layers and databases including Redshift and Postgres.

Developed comprehensive MongoDB database designs, including collections, indexes, and sharding strategies to optimize query performances and enhance scalability.
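
A brief PyMongo sketch of the indexing and shard-key setup this refers to; the connection string, database, collection, and field names are hypothetical.

    from pymongo import MongoClient, ASCENDING

    client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
    db = client["analytics"]
    events = db["call_events"]

    # Compound index to support the most common query pattern (illustrative fields).
    events.create_index([("customer_id", ASCENDING), ("event_time", ASCENDING)])

    # Sharding is enabled through the admin database on a sharded cluster;
    # the hashed shard key here is only an example.
    client.admin.command("enableSharding", "analytics")
    client.admin.command("shardCollection", "analytics.call_events",
                         key={"customer_id": "hashed"})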

Collaborated with cross-functional teams to design and implement high availability and disaster recovery solutions specific to AWS GovCloud, resulting in improved uptime for mission-critical applications.

Implemented ETL workflows on Databricks, integrating various data sources and transforming raw data into meaningful insights using Apache Spark libraries.

Implemented efficient data ingestion processes to bring structured and unstructured data into Data Lake Storage.

Optimized query performance and reduced costs by implementing Snowflake's clustering keys, materialized views, and caching techniques, achieving significant query speedups and cost savings in data processing.
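
An illustrative example of the Snowflake tuning described above, issued through the Snowflake Python connector; the account, credentials, table, and view names are placeholders.

    import snowflake.connector

    # Connection parameters are placeholders.
    conn = snowflake.connector.connect(
        account="xy12345", user="etl_user", password="...",
        warehouse="ETL_WH", database="ANALYTICS", schema="PUBLIC",
    )

    cur = conn.cursor()
    try:
        # Cluster a large fact table on the columns most queries filter by.
        cur.execute("ALTER TABLE fact_events CLUSTER BY (event_date, region)")

        # Precompute a frequently used aggregate as a materialized view.
        cur.execute("""
            CREATE MATERIALIZED VIEW IF NOT EXISTS mv_daily_events AS
            SELECT event_date, region, COUNT(*) AS event_count
            FROM fact_events
            GROUP BY event_date, region
        """)
    finally:
        cur.close()
        conn.close()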

Utilized Auto Loader in Databricks to progressively stream Cloud Files from Azure Data Lake Storage (ADLS), enabling structured data organization.
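
A condensed sketch of the Auto Loader pattern referenced above, assuming hypothetical ADLS container paths and a JSON source format.

    # Runs inside a Databricks notebook/job where `spark` is already defined.
    raw_path = "abfss://raw@examplestorage.dfs.core.windows.net/events/"        # placeholder source path
    bronze_path = "abfss://bronze@examplestorage.dfs.core.windows.net/events/"  # placeholder target
    checkpoint = "abfss://bronze@examplestorage.dfs.core.windows.net/_checkpoints/events/"

    # Auto Loader incrementally discovers new files via the cloudFiles source.
    stream = (spark.readStream
              .format("cloudFiles")
              .option("cloudFiles.format", "json")
              .option("cloudFiles.schemaLocation", checkpoint)
              .load(raw_path))

    # Append the incoming files to a Delta table in the bronze layer.
    (stream.writeStream
           .format("delta")
           .option("checkpointLocation", checkpoint)
           .outputMode("append")
           .start(bronze_path))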

Automated monthly and daily reports and pipelines using PySpark on Azure Databricks, collaborating directly with stakeholders.

Performed data quality analyses and applied business rules throughout the data extraction and loading process.

Gathered report requirements and identified the best solution using Reporting Services or Power BI, creating effective visualizations like Bar Chart, Clustered Column Chart, Waterfall Chart, Gauge, Pie Chart, and Tree Map.

Administered AWS S3 buckets, performed folder management in each bucket, and managed logs and objects within each bucket.

Environment: Spark, AWS, AWS GovCloud, Azure Data Lake, C#, EC2, EMR, Hive, Java, SQL Workbench, Tableau, Kibana, Sqoop, Spark SQL, Spark Streaming, Scala, Python, Hadoop (Cloudera Stack), Informatica, NVIDIA Clara, Jenkins, Docker, Hue, Netezza, Kafka, HBase, HDFS, Pig, Oracle, ETL, AWS S3, AWS Glue, Git, Grafana.

Azure Data Engineer December 2018 to September 2019

Sanofi – Bridgewater, NJ

Responsibilities:

Created Spark jobs by writing RDDs in Python, created DataFrames in Spark SQL to perform data analysis, and stored the results in Azure Data Lake.

Engineered robust data ingestion pipelines using Azure Data Factory to efficiently bring diverse data sources into Azure Data Lake Storage.

Implemented optimized data storage solutions within Azure Data Lake, including file formats, partitioning, and compression techniques, reducing storage costs and improving query performance.

Configured Spark Streaming to ingest real-time data from Apache Kafka and store it in HDFS using Scala.

Designed and implemented robust NoSQL data models specifically tailored for MongoDB, accommodating unstructured and semi-structured data while ensuring high performance and scalability.

Developed comprehensive MongoDB database designs, including collections, indexes, and sharding strategies to optimize query performances and enhance scalability.

Developed Spark applications using Kafka and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.

Designed batch processing jobs using Apache Spark to increase speed compared to that of MapReduce jobs.

Executed Spark SQL operations on JSON, transformed the data into a tabular structure using DataFrames, and wrote the data to Hive and HDFS.
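
A short PySpark sketch of the JSON-to-Hive flow described above; the input path, view, and table names are illustrative.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("json-to-hive")
             .enableHiveSupport()
             .getOrCreate())

    # Read semi-structured JSON into a tabular DataFrame (path is a placeholder).
    orders = spark.read.json("hdfs:///landing/orders/")

    # Run SQL over the data and persist the result as a Hive table.
    orders.createOrReplaceTempView("orders_raw")
    daily = spark.sql("""
        SELECT order_date, COUNT(*) AS order_count, SUM(amount) AS total_amount
        FROM orders_raw
        GROUP BY order_date
    """)

    daily.write.mode("overwrite").saveAsTable("analytics.daily_orders")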

Managed CI/CD pipelines via Jenkins, enabling automation, speeding up development and testing.

Developed PySpark and Scala code for Athena jobs to perform complex data transformations and analysis.

Wrote complex SQL scripts in the Amazon Redshift data warehouse for business analysis and reporting.

Configured Airflow to work with Apache Spark and Hadoop for large-scale data transformations and analytics.

Created complex SQL queries, custom dashboards, and automated reporting solutions within BigQuery to enable data-driven insights for business stakeholders, significantly improving data accessibility and decision-making across the organization.

Built Athena views and procedures for easy data access, optimizing code performance using NumPy and handling large, complex datasets with Pandas.

Converted SQL Server Stored Procedures to Amazon Redshift PostgreSQL and integrated them into Python Pandas framework.

Designed and implemented scalable, high-performance data warehouse solutions on Snowflake, utilizing its multi-cluster architecture and automatic scaling capabilities to optimize storage and compute resources for large datasets.

Worked with the Hive data warehouse infrastructure: creating tables, distributing data through partitioning and bucketing, and writing and optimizing HQL queries.

Created Hive tables as per requirements, whether internal or external, defined with appropriate static or dynamic partitions and bucketing for efficiency.
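
As an illustration of the partitioning and bucketing choices mentioned above, a hypothetical external Hive table definition and dynamic-partition insert issued through Spark SQL (database, table, and path names are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hive-ddl").enableHiveSupport().getOrCreate()

    # External table partitioned by date and bucketed on a join key.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS sales.transactions (
            txn_id      STRING,
            customer_id STRING,
            amount      DOUBLE
        )
        PARTITIONED BY (txn_date STRING)
        CLUSTERED BY (customer_id) INTO 32 BUCKETS
        STORED AS ORC
        LOCATION 'hdfs:///warehouse/sales/transactions'
    """)

    # Dynamic-partition insert from a staging table.
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    spark.sql("""
        INSERT OVERWRITE TABLE sales.transactions PARTITION (txn_date)
        SELECT txn_id, customer_id, amount, txn_date
        FROM sales.transactions_staging
    """)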

Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.

Worked on developing ETL processes to load data from multiple data sources to HDFS using Flume and performed structural modifications using Hive.

Provided technical solutions on MS Azure HDInsight, Hive, HBase, MongoDB, Telerik, Power BI, Spotfire, Tableau, Azure SQL Data Warehouse data migration techniques using BCP, Azure Data Factory, and fraud prediction using Azure Machine Learning.

Environment: Hadoop, Hive, Azure Data Lake, Kafka, Snowflake, Spark, Scala, HBase, Cassandra, JSON, XML, UNIX Shell Scripting, Cloudera, MapReduce, Power BI, ETL, MySQL, NoSQL

Big Data Engineer July 2015 to November 2018

Renee Systems Inc – Hyderabad

Responsibilities:

Collaborated with business users, product owners, and developers to analyze functional requirements.

Implemented Spark SQL queries that combine hive queries with Python programmatic data manipulations supported by RDDs and data frames.

Configured Spark Streaming to consume data from Kafka and store it in HDFS.

Extracted real-time feeds using Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved the results in HDFS.

Developed Spark scripts and UDFs using Spark SQL for data aggregation and querying, writing data back into the RDBMS through Sqoop.
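
A small sketch of a Spark SQL UDF used for aggregation-style cleanup as described above; the function, column, and table names are hypothetical, and the final write-back to the RDBMS would be a separate sqoop export step.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf, col, sum as sum_
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf-aggregation").enableHiveSupport().getOrCreate()

    # Example UDF: normalize free-text region codes before aggregating (illustrative logic).
    @udf(returnType=StringType())
    def normalize_region(value):
        return value.strip().upper() if value else "UNKNOWN"

    sales = spark.table("staging.sales")  # hypothetical Hive staging table

    aggregated = (sales
                  .withColumn("region", normalize_region(col("region")))
                  .groupBy("region")
                  .agg(sum_("amount").alias("total_amount")))

    # Write to HDFS; a separate `sqoop export` job would push this into the RDBMS.
    aggregated.write.mode("overwrite").csv("hdfs:///exports/sales_by_region", header=True)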

Installed and configured Pig, writing Pig Latin scripts and MapReduce jobs for data processing.

Worked on analyzing Hadoop clusters using different big data analytic tools including HBase database and Sqoop.

Worked on importing and exporting data from Oracle, and DB2 into HDFS and HIVE using Sqoop for analysis, visualization, and generating reports.

Created Hive tables and dynamically inserted data using partitioning and bucketing for EDW tables and historical metrics.

Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcast variables, efficient joins, and transformations during the ingestion process itself.

Created ETL packages with different data sources (SQL Server, Oracle, Flat files, Excel, DB2, and Teradata) and loaded the data into target tables by performing different kinds of transformations using SSIS.

Designed and developed data integration programs in a Hadoop environment with the NoSQL data store Cassandra for data access and analysis.

Created partitions and bucketing by state in Hive to handle structured data using Elasticsearch.

Performed Sqoop-based file transfers through HBase tables to process data into several NoSQL databases, including Cassandra and MongoDB.

Environment: Hadoop, MapReduce, HDFS, Hive, Python, Kafka, HBase, Sqoop, NoSQL, Spark 1.9, PL/SQL, Oracle, Cassandra, MongoDB, ETL, MySQL


