Multimodal Alignment and Fusion for Diabetic Retinopathy Detection using Deep Learning Approach

Kartina Diah Kesuma Wardhani; Shahreen Kasim; Wawan Yunanto; Deshinta Arrova Devi; Rohayanti Hassan; Sheeba Armoogum; Mohammad Syafwan Arshad

Authors

Kartina Diah Kesuma Wardhani Department of Informatics Engineering, Politeknik Caltex Riau, Umban Sari No.1, Umbansari, Kecamatan Rumbai, Pekanbaru 28265, Riau, Indonesia
Shahreen Kasim Soft Computing and Data Mining Centre (SCM), Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Persiaran Tun Dr. Ismail, Parit Raja, 86400, Johor Darul Ta’zim, Malaysia
Wawan Yunanto Department of Informatics Engineering, Politeknik Caltex Riau, Umban Sari No.1, Umbansari, Kecamatan Rumbai, Pekanbaru 28265, Riau, Indonesia
Deshinta Arrova Devi Center for Data Science and Sustainable Technologies, INTI International University, Nilai, Malaysia
Rohayanti Hassan Faculty of Computing, Universiti Teknologi Malaysia, Skudai, Johor Bahru, Malaysia
Sheeba Armoogum University of Mauritius, Reduit 80837, Mauritius
Mohammad Syafwan Arshad NILECRAFT GLOBAL SDN BHD PO2, Unit 10-15, 10th Floor, Menara MAA Lorong Api Api 1, Api Api Centre, 88000, Kota Kinabalu, Sabah, Malaysia

Keywords:

Multimodal alignment, late fusion, diabetic retinopathy, fundus, OCT scan, EHR

Abstract

This study proposes a label-driven multimodal alignment framework with late fusion for diabetic retinopathy (DR) detection, integrating fundus images, optical coherence tomography (OCT) scans, and structured electronic health records (EHR). Because patient-wise paired datasets are often unavailable, the proposed alignment strategy groups and matches samples across modalities by diagnostic label (DR/No_DR), so that heterogeneous sources remain semantically consistent. Each modality is modeled with an architecture suited to its data type (CNNs for fundus and OCT images, an ANN for EHR), and the resulting predictions are combined using four late fusion strategies: Simple Average, Weighted Average, Majority Voting, and Stacked Ensemble. Label-driven alignment allows multimodal integration to draw on coherent diagnostic cues from aligned class distributions, even without patient-level pairing. Experimental results, evaluated on accuracy, specificity, sensitivity, and F1-Score, show that while the unimodal CNN-OCT and ANN-EHR models achieved strong accuracy, the Simple Average and Weighted Average fusion methods attained the highest F1-Scores (0.999), indicating a strong precision–recall balance. Confusion matrix analysis further confirms high specificity and sensitivity, underscoring the ability of label-aligned multimodal fusion to exploit complementary diagnostic strengths. These findings suggest that label-driven alignment, coupled with averaging-based late fusion, improves predictive performance and enhances robustness and clinical applicability, offering a scalable and interpretable AI-assisted DR screening solution for real-world ophthalmology practice.
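As an illustration of the ideas summarized in the abstract, the following is a minimal sketch of label-driven alignment and three of the four late fusion strategies. It is not the authors' implementation: the function names, the random within-class matching, the 0.5 decision threshold, and the example weights are all assumptions made for this sketch, and the Stacked Ensemble is omitted because it requires a trained meta-learner.

```python
import numpy as np

def align_by_label(labels_a, labels_b, labels_c, seed=0):
    """Hypothetical label-driven alignment: match sample indices from three
    unpaired modalities so that every matched triple shares one label.
    Matching within a class is random here (an assumption of this sketch)."""
    rng = np.random.default_rng(seed)
    triples = []
    for cls in (0, 1):  # 0 = No_DR, 1 = DR
        idx = [np.flatnonzero(np.asarray(l) == cls)
               for l in (labels_a, labels_b, labels_c)]
        n = min(len(i) for i in idx)  # truncate to the smallest modality
        shuffled = [rng.permutation(i)[:n] for i in idx]
        triples.extend((int(a), int(b), int(c), cls)
                       for a, b, c in zip(*shuffled))
    return triples

def simple_average(probs):
    """probs has shape (n_models, n_samples) and holds P(DR) per model."""
    return np.mean(probs, axis=0)

def weighted_average(probs, weights):
    """Weights are normalized to sum to 1 before averaging."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return np.tensordot(w, np.asarray(probs, dtype=float), axes=1)

def majority_vote(probs, threshold=0.5):
    """Each model votes DR when its probability reaches the threshold."""
    votes = (np.asarray(probs) >= threshold).astype(int)
    return (votes.sum(axis=0) * 2 > votes.shape[0]).astype(int)

# Usage with made-up probabilities from three unimodal models
# (rows: fundus CNN, OCT CNN, EHR ANN; columns: two aligned samples).
probs = np.array([[0.9, 0.2],
                  [0.8, 0.6],
                  [0.4, 0.1]])
fused_avg = simple_average(probs)
fused_weighted = weighted_average(probs, weights=[1, 1, 2])
fused_vote = majority_vote(probs)
```

Averaging operates on probabilities and so preserves each model's confidence, whereas majority voting discards it after thresholding; this distinction may be one reason averaging-based fusion can behave differently from voting.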