
Data Engineer Azure

Location:
Dallas, TX
Posted:
July 07, 2025

Viveka Gangaraboina

AWS Data Engineer

Email 469-***-**** LinkedIn

Professional Summary:

Over 7 years of professional IT experience in building and optimizing ETL/ELT pipelines, architecting data lakes/warehouses, and processing structured, semi-structured, and unstructured data using Big Data technologies and cloud platforms.

Proven ability to architect and implement Data Lake and Data Warehouse solutions from the ground up, enabling organizations to centralize their structured and unstructured data for improved accessibility, analytics, and compliance.

Proficient in advanced Python programming for data manipulation and analysis, building distributed computing solutions using Apache Spark and Scala, and creating end-to-end data processing pipelines with PySpark.

Extensive hands-on experience with Amazon Web Services (AWS), utilizing key services like EC2 for computing, S3 for object storage, RDS for relational databases, VPC and IAM for network and access control, Lambda for serverless computing, and Redshift, Athena, and EMR for large-scale data processing and querying.

Solid understanding of Microsoft Azure's data ecosystem, including Azure Data Lake Store for high-volume data storage, Azure SQL Data Warehouse for scalable analytics, Azure Data Factory for orchestration, and Azure Databricks for collaborative data science and engineering at scale.

Specialized in building efficient and reliable data pipelines on Databricks, optimizing the end-to-end ETL/ELT process by integrating multiple data sources, transforming data for downstream use, and ensuring minimal latency and high data integrity.

Strong programming skills in Python, using libraries such as Pandas, NumPy, PySpark, and Boto3 for data wrangling, automation, and developing scalable data engineering solutions.

Proficient in performance tuning and optimization of Informatica mappings, sessions, and workflows to improve processing time and resource utilization.

Deep expertise in ETL performance optimization, applying best practices such as incremental data loading, data deduplication, partitioning, and caching strategies to handle massive datasets efficiently and minimize processing time.

Developed and managed Apache Airflow DAGs to automate, schedule, and monitor ETL workflows, ensuring reliable and repeatable data movement and transformations in both batch and near real-time environments.
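
Illustrative sketch only (DAG, task, and schedule names are hypothetical, not taken from any client project): a minimal Airflow DAG showing the scheduling and retry settings this kind of ETL workflow typically uses.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def extract(**context):
        # Placeholder extract step: pull a daily batch from a source system.
        print("extracting batch for", context["ds"])


    def transform(**context):
        # Placeholder transform step: clean and reshape the extracted batch.
        print("transforming batch for", context["ds"])


    with DAG(
        dag_id="daily_etl",                      # hypothetical DAG name
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
        default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        transform_task = PythonOperator(task_id="transform", python_callable=transform)
        extract_task >> transform_task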

Strong technical background in Apache Spark and Scala, leveraging distributed computing frameworks to process high-volume datasets in parallel and improve job performance for data engineering and machine learning workflows.

Engineered real-time streaming solutions using Apache Kafka and Amazon Kinesis, enabling ingestion, transformation, and analytics of live data streams for applications requiring immediate insights, such as fraud detection and user behavior tracking.

Developed and deployed batch data pipelines to ingest, transform, and load data into Snowflake.

In-depth experience with the Big Data ecosystem, including tools and services such as HDFS, HBase, Zookeeper, Flume, Cassandra, and Oozie, allowing end-to-end management and orchestration of complex data workflows.

Strong command of both relational and non-relational databases, including MySQL, Oracle, MS SQL Server, as well as NoSQL systems like MongoDB, HBase, and Cassandra, enabling flexible data storage and retrieval strategies for diverse use cases.

Skilled in Data Modeling, proficient in creating Entity-Relationship (E-R) and Dimensional models that accurately represent business processes and support OLTP and OLAP operations across various domains.

Experience in building and managing CI/CD pipelines using tools like Azure DevOps and AWS CodePipeline, enabling continuous integration, testing, and deployment of data engineering solutions with minimal manual intervention.

Proficient in Infrastructure as Code (IaC) using Terraform, automating the provisioning, scaling, and management of cloud infrastructure resources on AWS and Azure to ensure consistency and reduce operational overhead.

Comprehensive knowledge of Software Development Life Cycle (SDLC) methodologies, with experience working in both Agile (Scrum, Kanban) and Waterfall environments, contributing to iterative delivery and long-term project planning.

Adept at using version control systems like Git and Bitbucket, implementing best practices in branching, merging, and pull requests to support collaborative development and maintain code quality across teams.

Technical Skills:

Hadoop/Big Data Technologies

Hadoop, MapReduce, HDFS, YARN, Oozie, Hive, Sqoop, Spark, NiFi, ZooKeeper, Cloudera Manager, Hortonworks, Kafka, Pig, HBase, Apache Airflow

Azure Cloud Services

Azure Data Factory, Azure Synapse Analytics, Data Lake, Blob Storage, HDInsight, Azure Databricks, Azure Data Analytics, Azure Functions

AWS Cloud Services

EC2, EMR, Redshift, S3, Databricks, Athena, Glue, AWS Kinesis, CloudWatch, SNS, SQS, SES, Lambda, DynamoDB

NoSQL Databases

HBase, DynamoDB, PostgreSQL, MongoDB, Redis

ETL/BI

Power BI, Tableau, Snowflake, Informatica, SSRS, SSAS, SSIS, Talend

Hadoop Distribution

Hortonworks, Cloudera

Programming & Scripting

Python, Scala, SQL, Shell Scripting, YAML, R

Operating systems

Linux (Ubuntu, RedHat), Windows (XP/7/8/10), macOS

Databases

Oracle, MySQL, Teradata, PostgreSQL, SQL Server, DB2, SQLite

Version Control

Bitbucket, GitLab, GitHub, Azure DevOps, SVN, Jenkins

Education:

Master's in Advanced Data Analytics – University of North Texas.

Experience:

Client: Silicon Valley Bank, Santa Clara, CA Apr 2023 to Present

Senior AWS Data Engineer

Designed and implemented Python-based ETL scripts to automate data ingestion, cleansing, transformation, and export tasks from various data sources.

Designed and implemented a scalable Data Lake on AWS using services like S3, EMR, SQS, Redshift, Athena, Glue, and more, supporting efficient data analysis, processing, storage, and reporting.

Leveraged libraries like Pandas, NumPy, SQLAlchemy, and Boto3 to streamline workflows and ensure high performance and scalability.

Retrieved and processed data from both structured (SQL, CSV) and unstructured (JSON, XML, log files) formats, enabling seamless integration into downstream analytics pipelines and dashboards.
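
Illustrative sketch only (bucket, key, and column names are placeholders): a minimal Pandas/Boto3 pattern for pulling structured CSV and semi-structured JSON out of S3 before downstream processing.

    import io
    import json

    import boto3
    import pandas as pd

    s3 = boto3.client("s3")

    # Read a structured CSV file from S3 into a DataFrame (bucket/key are placeholders).
    csv_obj = s3.get_object(Bucket="example-raw-bucket", Key="sales/2024-01-01.csv")
    sales = pd.read_csv(io.BytesIO(csv_obj["Body"].read()))

    # Read a semi-structured JSON file and flatten it into tabular form.
    json_obj = s3.get_object(Bucket="example-raw-bucket", Key="events/2024-01-01.json")
    events = pd.json_normalize(json.loads(json_obj["Body"].read()))

    # Simple cleansing step before handing off to downstream analytics.
    sales = sales.dropna(subset=["order_id"]).drop_duplicates()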

Managed users and their privileges across SQL Server and MongoDB environments by creating roles, configuring role hierarchies, and defining fine-grained access controls based on business and compliance requirements.

Implemented Role-Based Access Control (RBAC) policies to ensure that only authorized users could access specific datasets and administrative functions, thus reducing security risks and maintaining compliance with internal and external audit requirements.

Utilized Apache Spark and Hadoop frameworks for efficient processing of extensive datasets, enabling prompt insights for decision-making.

Automated deployment of EMR, S3, and Redshift using CloudFormation and Terraform, reducing manual setup.

Automated nightly builds to run ETL jobs using Python with the Boto3 library, significantly reducing manual effort.

Utilized Azure Databricks to create notebooks and implemented complex business logic to transform data using Spark SQL.

Integrated Oracle databases with Apache Kafka and Spark Streaming, enabling low-latency ingestion of transactional data into distributed analytics platforms.

Configured Amazon S3 buckets with appropriate permissions, encryption, and lifecycle rules for storing raw, intermediate, and processed data in a cost-effective and secure manner.

Developed event-driven, serverless functions using AWS Lambda, enabling automation of tasks like data transformation, file validation, and notifications in response to S3 uploads, database updates, and scheduled events.
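
Illustrative sketch only (bucket names and validation logic are placeholders): a simplified event-driven Lambda handler responding to S3 uploads.

    import json
    import urllib.parse

    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        # Each record in the event corresponds to one uploaded S3 object.
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

            obj = s3.get_object(Bucket=bucket, Key=key)
            body = obj["Body"].read()

            # Placeholder validation: reject empty files.
            if not body:
                raise ValueError(f"Empty file uploaded: s3://{bucket}/{key}")

            # Placeholder transformation: copy the validated file to a processed prefix.
            s3.put_object(
                Bucket="example-processed-bucket",
                Key=f"validated/{key}",
                Body=body,
            )

        return {"statusCode": 200, "body": json.dumps("ok")}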

Built and managed scalable ETL pipelines using AWS Glue, including schema detection, job scheduling, and data catalog integration. This facilitated metadata management and improved discoverability and governance of datasets.
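
Illustrative sketch only (database, table, and path names are placeholders): the general shape of a PySpark-based Glue job that reads from the Data Catalog and writes partitioned Parquet to S3.

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    # Resolve the job name passed in by the Glue runtime.
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])

    sc = SparkContext()
    glue_context = GlueContext(sc)
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read from a Data Catalog table populated by a crawler (names are placeholders).
    source = glue_context.create_dynamic_frame.from_catalog(
        database="example_raw_db", table_name="orders"
    )

    # Write the frame out as partitioned Parquet to a curated S3 location.
    glue_context.write_dynamic_frame.from_options(
        frame=source,
        connection_type="s3",
        connection_options={
            "path": "s3://example-curated-bucket/orders/",
            "partitionKeys": ["order_date"],   # placeholder partition column
        },
        format="parquet",
    )

    job.commit()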

Integrated Amazon Cognito to provide secure authentication, authorization, and user session management for applications, using federated identity providers and token-based access control.

Designed and deployed scalable data processing pipelines on Hadoop ecosystems, utilizing components like MapReduce, Hive, Sqoop, and Pig for batch processing, data ingestion, and transformation in large-scale data environments.

Orchestrated complex workflows using AWS Step Functions, chaining together multiple Lambda functions, Glue jobs, and external API calls to manage long-running and multi-stage data processes with retry and error-handling logic.

Designed, loaded, and maintained data warehouses on Amazon Redshift, creating optimized schemas, managing sort and distribution keys, and running performance-tuned SQL queries for business intelligence use cases.
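
Illustrative sketch only (psycopg2 and all table, column, and connection details are assumptions, not taken from the engagement): example Redshift DDL showing distribution and sort keys chosen for join and date-range performance.

    import psycopg2

    # Connection details are placeholders.
    conn = psycopg2.connect(
        host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="analytics",
        user="etl_user",
        password="***",
    )

    ddl = """
    CREATE TABLE IF NOT EXISTS fact_transactions (
        transaction_id BIGINT,
        customer_id    BIGINT,
        amount         DECIMAL(18, 2),
        txn_date       DATE
    )
    DISTSTYLE KEY
    DISTKEY (customer_id)  -- co-locate rows joined on customer_id
    SORTKEY (txn_date)     -- speed up date-range scans
    """

    # The connection context manager commits the transaction on success.
    with conn, conn.cursor() as cur:
        cur.execute(ddl)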

Implemented application and infrastructure monitoring with Amazon CloudWatch, setting up custom metrics, dashboards, and alarms to proactively detect and respond to failures or performance degradation.

Developed Apache Spark applications in Scala to process large-scale datasets from a variety of sources including RDBMS (PostgreSQL, MySQL) and real-time data streams via Kafka.

Leveraged tools like Informatica, Alteryx, Airbyte, and Fivetran for efficient data processing and cleansing operations.

Ensured fault tolerance, data lineage tracking, and job recoverability in Spark workflows by using checkpointing, structured streaming, and integration with distributed storage systems like HDFS and S3.

Participated actively in agile development processes including daily stand-ups, sprint planning, backlog grooming, and retrospectives, ensuring alignment with product goals and timely delivery of project milestones.

Client: SM Energy, Denver, CO Jan 2021 to Mar 2023

AWS Data Engineer

Automated infrastructure deployment with AWS CloudFormation, reducing manual intervention and human error, ensuring consistent, repeatable, and efficient provisioning of cloud resources.

Managed AWS production environments, overseeing deployment, scaling, and configuration of applications on S3, EC2, Lambda, and EMR, ensuring high availability, fault tolerance, and performance optimization.

Built real-time data ingestion pipelines using AWS Kinesis Data Streams and Firehose, and integrated Lambda for real-time data transformation, minimizing processing latency.
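
Illustrative sketch only (the field handling is a placeholder): a simplified Kinesis Data Firehose transformation Lambda of the kind described above, returning records in the format Firehose expects.

    import base64
    import json

    def handler(event, context):
        """Simplified Kinesis Data Firehose transformation Lambda."""
        output = []
        for record in event["records"]:
            payload = json.loads(base64.b64decode(record["data"]))

            # Placeholder transformation: normalize field names and drop nulls.
            cleaned = {k.lower(): v for k, v in payload.items() if v is not None}

            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(
                    json.dumps(cleaned).encode("utf-8")
                ).decode("utf-8"),
            })

        return {"records": output}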

Utilized External Tables in Snowflake to directly query data stored in S3, reducing the need for full data loads and enabling faster access to raw datasets.
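
Illustrative sketch only (stage, storage integration, table, and column names are assumptions): defining a Snowflake external table over S3 through the Python connector so raw files can be queried in place.

    import snowflake.connector

    # Connection details are placeholders.
    conn = snowflake.connector.connect(
        account="example_account", user="etl_user", password="***",
        warehouse="ETL_WH", database="ANALYTICS", schema="RAW",
    )
    cur = conn.cursor()

    # External stage pointing at the raw S3 prefix (integration name is a placeholder).
    cur.execute("""
        CREATE STAGE IF NOT EXISTS raw_stage
        URL = 's3://example-raw-bucket/events/'
        STORAGE_INTEGRATION = s3_int
    """)

    # External table exposing typed columns over the staged Parquet files.
    cur.execute("""
        CREATE OR REPLACE EXTERNAL TABLE ext_events (
            event_ts TIMESTAMP AS (VALUE:event_ts::TIMESTAMP),
            user_id  STRING    AS (VALUE:user_id::STRING)
        )
        LOCATION = @raw_stage
        FILE_FORMAT = (TYPE = PARQUET)
    """)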

Designed Tableau dashboards for business stakeholders, translating complex data into actionable visual insights and reports.

Configured comprehensive monitoring solutions with CloudWatch and SNS for proactive tracking of system metrics, automated alerts, and notifications, improving response time and reducing downtime.

Developed detailed monitoring strategies for AWS Lambda functions and ETL jobs by setting up log streams, alarms, and performance dashboards, improving troubleshooting and operational transparency.

Built a web UI with Python's Streamlit library and connected it to an ML model endpoint.

Developed PySpark (Python) code in a distributed environment to efficiently process large volumes of CSV files with varying schemas.

Designed scalable big data solutions using PySpark and Apache Spark, utilizing DataFrames and RDDs to perform efficient data transformations and large-scale processing tasks.
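
Illustrative sketch only (paths and column names are placeholders): aligning CSV drops with differing schemas in PySpark before writing a curated Parquet dataset.

    from functools import reduce

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv_batch_example").getOrCreate()

    # Each vendor drop may carry a slightly different schema (paths are placeholders).
    paths = ["s3://example-raw-bucket/vendor_a/", "s3://example-raw-bucket/vendor_b/"]

    frames = [
        spark.read.option("header", "true").option("inferSchema", "true").csv(p)
        for p in paths
    ]

    # Align differing schemas by column name, filling missing columns with nulls (Spark 3.1+).
    combined = reduce(
        lambda a, b: a.unionByName(b, allowMissingColumns=True), frames
    )

    # Simple cleansing before writing to a curated zone (column names are placeholders).
    cleaned = combined.dropDuplicates().na.drop(subset=["record_id"])
    cleaned.write.mode("overwrite").partitionBy("load_date").parquet(
        "s3://example-curated-bucket/vendor_data/"
    )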

Leveraged EMR auto-scaling to dynamically adjust compute resources based on data volume, ensuring cost-efficiency and maintaining high throughput during peak periods.

Developed PySpark scripts to automate data ingestion from external vendors to internal data platforms, reducing manual intervention and optimizing data pipeline performance.

Skilled in Oracle SQL for high-performance data extraction and tuning; built robust Oracle-to-Snowflake pipelines for seamless data transfer and analytics.

Integrated Kafka with Spark for real-time data collection, processing, and analysis, delivering insights quickly from multiple sources for decision-making.
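
Illustrative sketch only (brokers, topic, schema, and paths are placeholders): a minimal PySpark Structured Streaming read from Kafka with checkpointing for recoverability.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StringType, StructField, StructType, TimestampType

    spark = SparkSession.builder.appName("kafka_stream_example").getOrCreate()

    # Hypothetical event schema for the Kafka topic payload.
    schema = StructType([
        StructField("event_id", StringType()),
        StructField("event_ts", TimestampType()),
        StructField("status", StringType()),
    ])

    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder brokers
        .option("subscribe", "transactions")                 # placeholder topic
        .load()
    )

    # Parse the JSON value column into typed fields.
    events = raw.select(from_json(col("value").cast("string"), schema).alias("e")).select("e.*")

    # Write the parsed stream to a sink, checkpointing state for recoverability.
    query = (
        events.writeStream.format("parquet")
        .option("path", "s3://example-curated-bucket/stream/transactions/")
        .option("checkpointLocation", "s3://example-curated-bucket/checkpoints/transactions/")
        .start()
    )
    query.awaitTermination()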

Designed a batch processing pipeline to process daily data dumps from S3 into Snowflake via EMR and Airflow.

Utilized HiveQL to query and analyze extensive datasets, providing valuable insights that met business requirements and supported key decision-making processes.

Built high-speed data processing solutions using Spark SQL and PySpark for on-premises operations, enabling efficient data analysis and processing for mission-critical tasks.

Implemented Apache Airflow for managing and orchestrating ETL workflows across AWS and Hadoop, improving pipeline reliability, scheduling efficiency, and data management.

Automated application deployment with Jenkins and Git, creating a CI/CD pipeline that streamlined code releases, ensured consistent quality, and improved team collaboration.

Contributed to the migration of legacy systems to the AWS cloud, improving system performance, reducing operational costs, and enhancing scalability.

Client: Epic, Verona, WI Feb 2018 to Dec 2020

Data Engineer

Designed and developed ETL pipelines in AWS Glue to migrate data from external sources (S3, formatted files) into AWS Redshift.

Leveraged AWS Glue Crawlers and AWS Athena for data discovery, cataloging, and performing SQL operations on S3 data.

Integrated Power BI with AWS Athena and Presto for federated querying of semi-structured S3 data without replication, enabling quick exploratory analysis over raw zone datasets.

Built an ETL framework using Apache Spark with Python (PySpark) to load standardized data into Hive and HBase.

Created RAW and Standardized Hive tables, applying partitioning and bucketing for optimized querying and validation.
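
Illustrative sketch only (database, table, and column names are placeholders, and the PyHive client is an assumption): HiveQL for a standardized table with partitioning and bucketing, issued through a Hive connection.

    from pyhive import hive

    # Connection details are placeholders.
    conn = hive.connect(host="hive-server.example.internal", port=10000, username="etl_user")
    cur = conn.cursor()

    # Standardized table partitioned by load date and bucketed by customer_id.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS standardized.customer_orders (
            order_id     BIGINT,
            customer_id  BIGINT,
            amount       DECIMAL(18, 2)
        )
        PARTITIONED BY (load_date STRING)
        CLUSTERED BY (customer_id) INTO 32 BUCKETS
        STORED AS ORC
    """)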

Created ingestion pipelines using Sqoop, Flume, Pig, and MapReduce for structured and semi-structured data from legacy systems into HDFS, enabling behavioral analytics and advanced reporting.

Ingested large volumes of structured and semi-structured data into the Raw Data Zone (HDFS) using Sqoop and Spark.

Developed event-driven AWS Lambda functions to automatically initiate Glue jobs based on S3 triggers, streamlining the data ingestion process and ensuring near real-time processing capabilities.
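
Illustrative sketch only (the Glue job name and argument keys are placeholders): a Lambda handler that starts a Glue job run for each S3 object event it receives.

    import urllib.parse

    import boto3

    glue = boto3.client("glue")

    def handler(event, context):
        # Start a Glue job for each object landing in the raw bucket.
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

            glue.start_job_run(
                JobName="example-ingest-job",
                Arguments={
                    "--source_path": f"s3://{bucket}/{key}",
                },
            )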

Configured CloudWatch for comprehensive observability, setting up detailed logs, alerts, and notifications for both Lambda functions and Glue jobs, enabling proactive monitoring and issue resolution.

Processed millions of heterogeneous CSV files in distributed environments using PySpark, handling schema evolution and ensuring consistent data transformation across large-scale batch processing jobs.

Implemented MapReduce programs to parse, transform, and enrich raw datasets from various sources, storing cleansed data into NoSQL (HBase) and Hive warehouses for advanced querying and analytics.

Developed drill-through and drill-down reports in Power BI for detailed analysis of customer behavior and operational data ingested through Flume, Sqoop, and Spark pipelines.

Implemented Spark using Scala and Spark SQL, optimizing data testing and processing across multiple sources.

Developed real-time streaming applications using PySpark, Apache Flink, Kafka, and Hadoop distributed cluster.

Implemented data processing using Spark (Scala) and Spark SQL, improving performance and scalability across datasets.

Designed a batch processing pipeline to process daily data dumps from S3 into Snowflake via EMR and Airflow.

Developed big data analytics solutions using Apache Spark with Python, leveraging Spark DataFrames and RDDs for efficient data transformations and actions on AWS.

Optimized data processing and testing using Apache Spark with Scala and Spark SQL, enabling high-throughput operations across multiple data sources and reducing job runtimes significantly.

Designed, implemented, and managed ETL workflows using Apache Airflow, developing DAGs for job orchestration, dependency handling, retries, and SLA monitoring to ensure data pipeline reliability.

Managed both Relational and NoSQL databases, including schema design, query optimization, indexing, performance tuning, and troubleshooting to ensure efficient data storage and retrieval.

Engineered scalable and high-performance storage solutions using Apache Druid and Amazon Redshift, enhancing resource utilization and enabling sub-second query performance for large analytical workloads.

Utilized Apache Spark and Hadoop frameworks for efficient processing of extensive datasets, enabling prompt insights for decision-making.

Built real-time data processing systems using Apache Kafka and Apache Flink, facilitating low-latency ingestion and processing for time-sensitive applications such as fraud detection, real-time dashboards, and IoT analytics.

Designed normalized and denormalized data models to structure raw and processed data effectively, ensuring scalability, data consistency, and optimized access patterns for BI tools and analytical applications.

Containerized data processing applications using Docker and orchestrated deployments using Kubernetes, ensuring consistency across environments, enabling horizontal scaling, and supporting CI/CD integration for seamless releases.

Demonstrated strong version control practices using Git and Bitbucket, enabling effective collaboration across distributed teams, maintaining code quality, managing branches, and supporting peer reviews in large-scale data projects.


