Akhil V
Data Engineer
Phone: 469-***-**** Email: *******.*@*****.***
SUMMARY
Big Data:
•Data Engineer with 8+ years of IT experience in software development and building scalable data analytics pipelines using Big Data and Cloud technologies.
•Skilled in programming with Java, Scala, and Python; strong expertise in Spark application development using Scala, Python, and PySpark scripts.
•Experienced in analyzing large datasets with Hive queries, Spark SQL, Spark DataFrames, and Spark Streaming for both batch and real-time data pipelines.
•Proficient in Spark internals, the job execution lifecycle, troubleshooting failures, and fine-tuning long-running jobs for efficient performance and resource allocation.
•Extensive experience in Spark RDDs, DataFrames/Datasets, Spark SQL, Spark ML, and ETL workflows using Spark Core, Spark SQL, and Spark Streaming.
•Skilled in Hive with managed/external tables, partitioning, bucketing, and optimization techniques for faster query and join performance.
•Worked with Hadoop distributions such as Cloudera, Hortonworks, and AWS EMR, and tools such as Sqoop, Oozie, Impala, YARN, HDFS, and Hue.
•Experienced in Kafka for real-time data streaming, NoSQL databases (HBase, Cassandra, MongoDB), and integration tools such as Talend and NiFi.
•Hands-on expertise with various data formats (Avro, Parquet, ORC, JSON, text) and automation workflows using Oozie, Rundeck, and Airflow.
•Proficient in designing ETL processes in Impala, loading delta/transactional data into NoSQL systems, and delivering optimized reporting and transformation solutions.
Cloud:
•Strong experience using and integrating various AWS cloud services such as S3, EMR, Redshift, Athena, and the Glue metastore into data pipelines.
•Good experience working with AWS Databricks and notebooks for Spark application development.
•Good experience working with Snowflake as a cloud data warehouse and integrating it with Spark data pipelines.
•Good experience with IAM and security policies to manage and maintain a secure cloud infrastructure.
DevOps:
•Experience in building continuous integration and deployment pipelines using Jenkins.
•Expert in building containerized applications with Docker and Kubernetes.
•Expert in build tools such as Maven and SBT for Java and Scala application development.
•Experience with GitHub, GitLab, and SVN for source code management.
TECHNICAL SKILLS
Big Data: Hadoop, Sqoop, Hive, Spark, Pig, .Net, Kafka, HBase, Impala, Yarn
Database: Oracle, MongoDB, SQL Server, Teradata
Tools: Ambari, SQL Developer, TOAD, Erwin, Visio, TortoiseSVN
Programming Languages: Java, Scala, Python, Shell Scripting
Cloud: Azure, AWS, GCP
Version control: Git, SVN, CVS, GitLab
Tools: FileZilla, PuTTY, PL/SQL Developer, JUnit
IDE: Eclipse, IntelliJ IDEA, PyCharm
BI & Visualization Tools: Power BI, Tableau, Looker Studio
EDUCATION
Bachelor's in Computer Science, JNTUH
WORK EXPERIENCE
Client: MCG Health Location: Seattle, Washington
Designation: Sr. Data Engineer Duration: August 2022 - Present
Managed and deployed applications on AWS EC2 and Elastic Beanstalk, ensuring seamless scalability and performance.
Designed, built, and maintained ETL pipelines using AWS Glue, EMR, Kinesis, Kafka, and CloudWatch, processing large datasets and improving efficiency by 70%.
Extracted data from SFTP servers, Google Drive, email servers, and RQ4 (POS) using SnapLogic, storing it in AWS S3, with EMR pipelines processing 50+ GB daily.
Developed Python-based solutions with Spark, PySpark, and API integration, reducing workflow time by 60% and building real-time/large-scale pipelines.
Optimized SQL queries for AWS Redshift, MS SQL Server, and PostgreSQL, achieving 50% reduction in execution time.
Implemented AWS serverless architectures with Lambda, EventBridge, Step Functions, SNS, SQS, and S3, processing 10M+ events daily.
Integrated AWS Redshift Serverless with S3, Glue, Lambda, Athena, and Kinesis, ensuring 99.8% pipeline availability.
Architected SQL Server metadata-driven migration framework in Databricks, converting stored procedures and views to PySpark/Delta Lake pipelines, reducing migration time by 50%.
Built/deployed ML models with AWS AI/ML services, deep learning AMIs, and NLP, improving predictive accuracy by 35% with MLOps-enabled workflows.
Configured IAM roles/policies ensuring HIPAA and SOX compliance; secured data using AWS KMS for encryption at rest and in transit.
Engineered a Data Lakehouse with AWS S3, Delta Lake, and Databricks, storing 500+ TB and enabling scalable querying.
Built Kafka producers streaming 5M+ transactions daily; developed Spark Streaming applications consuming Kafka topics and writing to HBase with sub-second latency (sketched below).
Leveraged Hive for table creation, static and dynamic partitioning, and bucketing, and optimized Hive scripts for analytics; used Sqoop for Oracle integration handling 100+ GB daily.
Implemented Attunity for replication and integration with 99.9% data consistency; configured SQL Server transactional replication for HA/DR.
Built SnapLogic pipelines moving processed data from the data lake to Teradata EDW staging for enterprise analytics.
Optimized Spark and .NET in-memory dataset processing with broadcast variables, joins, and transformations, achieving a 40% performance improvement.
Developed front-end dashboards with HTML5, CSS3, JSP, JavaScript, jQuery, jQuery UI, and Bootstrap.
Delivered BI and analytics with Tableau and Power BI connecting directly to Athena and Redshift; developed Tableau executive dashboards with daily automated KPIs.
Built Databricks advanced analytics workflows with PySpark and MLflow for real-time model training, monitoring, and deployment.
Designed event-driven architectures with AWS SNS/SQS, enabling reliable queues and scaling distributed systems processing 2M+ messages daily.
Optimized AWS Athena queries on S3, reducing costs by 45% and supporting on-the-fly querying.
Environment: AWS EMR, Spark, Hive, S3, Athena, Sqoop, Kafka, HBase, Scala, Java, Redshift, Step Functions.
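Illustrative PySpark sketch of the Kafka-to-Spark Structured Streaming pattern referenced above (not the project's actual code): the broker address, topic, schema, and output paths are assumed placeholders, and a generic foreachBatch writer stands in for the HBase sink.

# Minimal sketch: consume a Kafka topic with Structured Streaming and land micro-batches.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-transactions-stream").getOrCreate()

# Assumed transaction schema; the real pipeline's schema would differ.
schema = StructType([
    StructField("txn_id", StringType()),
    StructField("account_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", StringType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
       .option("subscribe", "transactions")                # placeholder topic
       .option("startingOffsets", "latest")
       .load())

parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("t"))
          .select("t.*"))

def write_batch(batch_df, batch_id):
    # Placeholder sink: a real pipeline would write each micro-batch to HBase via a connector.
    batch_df.write.mode("append").parquet("s3://bucket/transactions/")

query = (parsed.writeStream
         .foreachBatch(write_batch)
         .option("checkpointLocation", "s3://bucket/checkpoints/transactions/")
         .start())
query.awaitTermination()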
Client: JPM Chase Location: Columbus, OH
Designation: Data Engineer Duration: January 2020 - July 2022
Created pipelines in Azure Data Factory v2 using activities like Move & Transform, Copy, Filter, For Each, and Databricks; supported optimal pipelines, data flows, and transformations with ADF and Spark.
Worked with Azure PaaS components: Data Factory, Databricks, Logic Apps, Application Insights, Data Lake, Data Lake Analytics, Virtual Machines, Geo-Replication, and App Services.
Built, deployed, and monitored batch and near real-time data pipelines to load structured/unstructured data into Azure Data Lake Storage.
Implemented Hive complex data types and multiple file formats like ORC and Parquet.
Involved in project design/development using Java, Scala, Go, Hadoop, Spring, Apache Spark, NiFi, and Airflow technologies.
Used Databricks Jobs API to schedule/orchestrate data workflows and Oozie workflow engine to manage interdependent Hadoop, Hive, Sqoop, and MapReduce jobs.
Automated SQL Server schema introspection and DDL conversion using Python in Databricks to accelerate migrations and reduce manual errors.
Designed interactive Power BI dashboards with drill-through capability on Azure Analysis Services; enabled BI reporting on Synapse and Data Lake with Spark SQL views and materialized tables.
Deployed Azure Resource Manager JSON Templates via PowerShell; worked with Azure SQL Database, Data Lake, Data Warehouse, Data Factory, and Analysis Services.
Proposed cost-optimized Azure architectures; developed right-sizing recommendations for Azure data infrastructure.
Implemented Pydantic in API integration pipelines for type safety and streamlined ETL; built unit and integration tests for Databricks pipelines using Pytest (sketched below).
Built Spark Streaming applications in Scala and Python to process Kafka data and optimized the Python streaming jobs.
Integrated SQLAlchemy ORM with PostgreSQL, Redshift, and SQL Server; developed and migrated SQL Server replication environments during upgrades.
Imported batch data from MySQL into HDFS using Sqoop; extracted API data with Java and Scala; converted Hive queries into Spark SQL for optimized runs.
Built custom Azure Databricks + Delta Lake pipelines for versioned processing and real-time streaming analytics; migrated on-prem HDFS data to Azure HDInsight.
Developed ETL processes in Jupyter notebooks using Databricks Spark; built microservices modules for application stats visualization.
Worked with Docker and Kubernetes for containerized deployments; diagnosed SQL Server replication issues with Replication Monitor and logs.
Collaborated with data engineers/analysts on Databricks solutions; created reusable .NET libraries for logging, exception handling, and caching.
Implemented NiFi pipelines to export HDFS data to AWS and Azure; built pipelines using Data Factory, API Gateway Services, SSIS, Talend, .NET, and Python.
Worked on Airflow automation to integrate clusters and orchestrate pipelines.
Environment: Hive, Sqoop, Linux, Cloudera, .Net, Scala, Kafka, HBase, Avro, Spark, Zookeeper, MySQL, Databricks, Python, Airflow.
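Illustrative Pytest sketch of the kind of unit test described above for Databricks/PySpark transformations (not the project's actual code): the transformation, column names, and tax rate are invented placeholders.

# Minimal sketch: unit-test a PySpark transformation with Pytest on a local SparkSession.
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

def add_amount_with_tax(df, rate=0.08):
    # Example transformation under test: add a tax-inclusive amount column.
    return df.withColumn("amount_with_tax", F.round(F.col("amount") * (1 + rate), 2))

@pytest.fixture(scope="session")
def spark():
    # A local SparkSession is enough for unit-level testing of transformations.
    return SparkSession.builder.master("local[2]").appName("unit-tests").getOrCreate()

def test_add_amount_with_tax(spark):
    df = spark.createDataFrame([("a", 100.0), ("b", 10.0)], ["id", "amount"])
    result = {row["id"]: row["amount_with_tax"] for row in add_amount_with_tax(df).collect()}
    assert result == {"a": 108.0, "b": 10.8}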
Client: Molina Healthcare Location: NYC, NY
Designation: Data Engineer Duration: February 2018 - December 2019
Imported and exported data into HDFS and Hive using Sqoop; created Hive dynamic partitions, tables, and buckets for time-series analytics and ad-hoc queries.
Developed Flume agents for capturing continuous streaming data from multiple sources and storing into HDFS.
Migrated Hive-based ETL jobs to Google Cloud Dataproc, reducing job run times and infrastructure costs.
Built data pipelines using Apache Beam and Cloud Dataflow for streaming transformations and enrichment from multiple sources.
Used Google Cloud Storage (GCS) for staging and BigQuery for analytical reporting and dashboarding.
Leveraged GCP Pub/Sub for near real-time event-driven data capture and distribution from external systems.
Deployed ingestion and processing jobs on Cloud Composer (Airflow) for orchestration and scheduling.
Tuned Hive queries with partitioning/bucketing to improve performance on-prem and in Dataproc Hive.
Processed Avro and JSON formats via GCS + Dataflow pipelines with schema validation and enforcement.
Used Stackdriver (Cloud Operations) for logging, monitoring, and debugging of long-running jobs.
Migrated bulk data into HBase using MapReduce; developed HBase data models for real-time analytics via Java API.
Built reusable Spark jobs on Dataproc for cleaning and aggregation, integrated into Cloud Composer workflows (sketched below).
Developed MapReduce programs, secondary sorting, and chained mappers for optimized large-scale processing.
Implemented custom HBase filters and counters for analytics across multiple tables.
Handled Avro data files in HDFS via Avro tools/MapReduce; worked with Pig loaders/storage classes for JSON, compressed CSV, etc.
Designed and optimized Hive joins (map joins, bucket map joins, sorted bucket map joins).
Integrated Spring schedulers with Oozie client beans to manage cron jobs.
Worked with .NET Core to build cross-platform, microservice-based applications.
Worked with the CDH distribution and Cloudera Manager to monitor and manage Hadoop clusters.
Actively participated in SDLC phases (scope, design, implement, deploy, test) and agile story-driven development with daily scrums.
Environment: Hadoop, Sqoop, Pig, HBase, Hive, Flume, Java 6, Eclipse, Apache Tomcat 7.0, Oracle, J2EE, .NET.
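Illustrative PySpark sketch of a reusable Dataproc cleaning/aggregation job as described above (not the project's actual code): bucket names, columns, and the cleaning/aggregation rules are assumed placeholders.

# Minimal sketch: clean raw JSON events from GCS, aggregate daily counts, write Parquet back to GCS.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataproc-clean-aggregate").getOrCreate()

# Read raw JSON events staged in GCS (placeholder bucket and path).
raw = spark.read.json("gs://example-raw-bucket/events/2019-10-01/")

# Basic cleaning: drop duplicate events and rows missing a type, derive a date column.
cleaned = (raw
           .dropDuplicates(["event_id"])
           .filter(F.col("event_type").isNotNull())
           .withColumn("event_date", F.to_date("event_ts")))

# Daily aggregation by event type.
daily_counts = (cleaned
                .groupBy("event_date", "event_type")
                .agg(F.count("*").alias("event_count")))

# Write partitioned Parquet for downstream Hive/BI loads (placeholder path).
(daily_counts.write
 .mode("overwrite")
 .partitionBy("event_date")
 .parquet("gs://example-curated-bucket/daily_event_counts/"))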
Client: Qualcomm Location: India
Designation: Java Developer Duration: September 2015 - December 2017
Maintained the client’s e-commerce site by resolving bugs, enhancing functionality, and creating modules; managed team progress, performance, and deliverables.
Rewrote and developed products using JavaScript and JSP for the frontend and Java, Spring MVC, Hibernate, and Oracle for the backend with a 5-member team.
Supervised development of a portal using Angular to validate microservices functionality.
Designed and implemented RESTful APIs with versioning, cache management, pagination, header handling, standard status codes, debugging, and documentation.
Implemented Spring MVC, Spring Boot, Spring Transactions, Spring JDBC Template, and JSON features.
Applied UI guidelines and standards in JavaScript during website development and maintenance.
Developed Data Access Objects with Hibernate for O/R mapping and database access.
Performed WebLogic administration for application deployment; created WAR/EAR files using Maven.
Ensured high availability and scalability using clustering environments at app server and database levels.
Improved performance using caching systems to optimize responses.
Monitored team performance, prepared weekly performance reports, and coordinated interviews for new hires.
Led a sub-team to develop system components for beneficiary processing.
Built system to send automated daily email summaries of database error logs.
Followed up on production deployments to ensure quality deliverables and compliance.
Environment: Java 1.4/1.7/1.8, J2EE, HTML, jQuery, JavaScript, Spring MVC 4, Spring Batch, Spring Core & Security, Hibernate, JPA, REST Web Services, Microservices, WebLogic, Oracle, Windows.