Incomplete Modality Disentangled Representation for Ophthalmic Disease Grading and Diagnosis

1Department of Computer Science, University of Exeter, UK
2Department of Eye and Vision Sciences, University of Liverpool, UK
3Faculty of IT, Monash University, AU
4Center For AI And Data Science For Integrated Diagnostics, University of Pennsylvania, USA
Figure 1. (a) “Vanilla” denotes latent-subspace methods. (b) Our proposed IMDR strategy applies explicit mutual-information-minimization constraints to the Disentangle Extraction layer, guided by a joint distribution, to effectively decouple multimodal data. (c) Illustration of the intra-modality and inter-modality inter-channel distances between feature maps of the modality encoders. The definition of channel distance is detailed in Appendix B. “Single” denotes a model that trains each modality's encoder independently, providing ideal feature diversity without inter-modality interference. “:A” denotes the histogram mean. Lower inter-channel similarity indicates higher diversity.
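The exact channel-distance definition is given in Appendix B; purely as an illustration, the inter-channel similarity in Fig. 1(c) can be thought of as pairwise cosine similarity between flattened channel activations of two feature maps. The function below is a hypothetical sketch under that assumption, not the paper's code:

```python
import torch
import torch.nn.functional as F

def inter_channel_similarity(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarity between the channels of two (C, H, W) feature maps.

    Intra-modality: pass the same encoder's feature map twice.
    Inter-modality: pass the fundus encoder's map and the OCT encoder's map.
    Returns a (C, C) matrix whose entries can be histogrammed as in Fig. 1(c).
    """
    a = F.normalize(feat_a.flatten(1), dim=1)  # (C, H*W), unit norm per channel
    b = F.normalize(feat_b.flatten(1), dim=1)
    return a @ b.t()                           # (C, C) cosine similarities
```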

Abstract

Ophthalmologists typically require multimodal data sources to improve diagnostic accuracy in clinical decisions. However, due to medical device shortages, low-quality data, and data privacy concerns, missing data modalities are common in real-world scenarios. Existing deep learning methods tend to address this by learning an implicit latent subspace representation for different modality combinations. We identify two significant limitations of these methods: (1) implicit representation constraints that hinder the model's ability to capture modality-specific information, and (2) modality heterogeneity, which causes distribution gaps and redundancy in feature representations. To address these, we propose an Incomplete Modality Disentangled Representation (IMDR) strategy, which explicitly disentangles features into independent modal-common and modal-specific components under the guidance of mutual information, distilling informative knowledge that enables the model to reconstruct valuable missing semantics and produce robust multimodal representations. Furthermore, we introduce a joint proxy learning module that assists IMDR in eliminating intra-modality redundancy by exploiting proxies extracted from each class. Experiments on four ophthalmology multimodal datasets demonstrate that the proposed IMDR significantly outperforms state-of-the-art methods.
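Structurally, the disentanglement described above amounts to giving each modality encoder a modal-common branch and a modal-specific branch, with the common branches fused across modalities and an independence constraint keeping the two branches apart. The sketch below is only a minimal illustration of that layout under assumed module names (DisentangledEncoder, common_head, specific_head), not the authors' implementation:

```python
import torch
import torch.nn as nn

class DisentangledEncoder(nn.Module):
    """Sketch of a per-modality encoder that splits its features into a
    modal-common part and a modal-specific part (names are illustrative)."""
    def __init__(self, backbone: nn.Module, feat_dim: int, latent_dim: int):
        super().__init__()
        self.backbone = backbone                               # e.g. a ResNet-50 trunk
        self.common_head = nn.Linear(feat_dim, latent_dim)     # modal-common branch
        self.specific_head = nn.Linear(feat_dim, latent_dim)   # modal-specific branch

    def forward(self, x: torch.Tensor):
        e = self.backbone(x)                                   # single-modality feature e
        return self.common_head(e), self.specific_head(e)

# With one encoder per modality (fundus, OCT), the common parts can be fused to
# stand in for a missing modality, while the specific parts keep modality-unique
# cues; a mutual-information penalty keeps the two branches independent.
```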

Methodology


Figure 2. Overview of our proposed framework: (a) We train a teacher model on complete-modality data, then co-train it with a student model on incomplete inputs for knowledge distillation. The distillation is supervised by a feature loss L_Feat and a logit loss L_Logit. During teacher training, the encoders output the single-modality features e^f and e^O. We build a set of proxies for each modality, with each set representing a class. Positive proxies are selected via a similarity matrix between the proxies and the features e^f and e^O, and all proxies are optimized with the proxy loss L_Prox. The selected positive proxies, together with the features e^f and e^O, are then passed to the IMDR. (b) Details of the IMDR strategy: we estimate the distributions of the positive proxies from each modality and combine them to obtain the joint distribution P(ê | x^f, x^O). The modality-shared feature s is sampled from this distribution and guides the decoupling via an attention layer, supervised by the loss L_MI, which minimizes the mutual information between the extracted shared feature ŝ and the specific features R^f and R^O, as well as between R^f and R^O.
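The paper gives the exact formulations of the joint distribution P(ê | x^f, x^O), the MI constraint L_MI, and the distillation losses; the sketch below only shows one common way such pieces are realized, namely precision-weighted product-of-experts fusion of per-modality Gaussians, a CLUB-style upper bound on mutual information, and MSE/KL distillation terms. All names and design choices here are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def product_of_experts(mu_f, logvar_f, mu_o, logvar_o):
    """Fuse two Gaussian experts N(mu_f, var_f) and N(mu_o, var_o) into one joint
    Gaussian via precision weighting (one common choice for P(e_hat | x_f, x_o);
    a unit-Gaussian prior expert could also be added)."""
    prec_f, prec_o = torch.exp(-logvar_f), torch.exp(-logvar_o)
    var = 1.0 / (prec_f + prec_o)
    mu = var * (mu_f * prec_f + mu_o * prec_o)
    return mu, torch.log(var)

def sample_shared(mu, logvar):
    """Reparameterized sample of the modality-shared feature s."""
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

class CLUB(nn.Module):
    """Sample-based upper bound on I(s; r) (CLUB, Cheng et al., 2020), shown here
    only as one way an L_MI term could penalize dependence between the shared
    feature s and a specific feature r. The mu/logvar networks must additionally
    be trained to maximize log q(r | s) in a separate optimization step."""
    def __init__(self, dim_s, dim_r, hidden=256):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(dim_s, hidden), nn.ReLU(), nn.Linear(hidden, dim_r))
        self.logvar = nn.Sequential(nn.Linear(dim_s, hidden), nn.ReLU(), nn.Linear(hidden, dim_r))

    def forward(self, s, r):
        mu, logvar = self.mu(s), self.logvar(s)
        pos = -((r - mu) ** 2 / (2 * logvar.exp())).sum(-1)           # log q(r_i | s_i) for matched pairs
        neg = -((r.unsqueeze(0) - mu.unsqueeze(1)) ** 2
                / (2 * logvar.exp().unsqueeze(1))).sum(-1)            # log q(r_j | s_i) for all pairs
        return pos.mean() - neg.mean()                                # MI upper bound (to be minimized)

def distillation_losses(feat_t, feat_s, logit_t, logit_s, T=2.0):
    """Illustrative feature- and logit-level distillation terms (L_Feat, L_Logit)."""
    l_feat = F.mse_loss(feat_s, feat_t.detach())
    l_logit = F.kl_div(F.log_softmax(logit_s / T, dim=-1),
                       F.softmax(logit_t.detach() / T, dim=-1),
                       reduction="batchmean") * (T * T)
    return l_feat, l_logit

# An L_MI of this form would sum three CLUB terms: I(s_hat; R^f), I(s_hat; R^O), and I(R^f; R^O).
```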

Results

Figure 3. Under complete-modality conditions, we evaluate our IMDR model against other models using two baseline architectures: CNN and Transformer. Specifically, we use 2D/3D ResNet-50 as the CNN backbones, while Swin Transformer and UNETR serve as the Transformer backbones. Under missing-modality conditions, all models uniformly use 2D/3D ResNet-50 backbones and apply the same distillation method as outlined in our approach to ensure experimental consistency. Efficacy is assessed with four key metrics: ACC, AUC, F1, and Spec.
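For reference, the four metrics can be computed as follows for a binary grading task (a hypothetical sketch using scikit-learn, not the authors' evaluation script; multi-class settings would use macro or one-vs-rest averaging instead):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, confusion_matrix

def grading_metrics(y_true: np.ndarray, y_prob: np.ndarray, thr: float = 0.5) -> dict:
    """ACC, AUC, F1, and specificity (Spec) for binary labels and predicted probabilities."""
    y_pred = (y_prob >= thr).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "ACC": accuracy_score(y_true, y_pred),
        "AUC": roc_auc_score(y_true, y_prob),
        "F1": f1_score(y_true, y_pred),
        "Spec": tn / (tn + fp),  # specificity = TN / (TN + FP)
    }
```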
Figure 4. Comparison of performance across various missing rates under intra-modality incompleteness.
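How intra-modality incompleteness is simulated at a given missing rate is specified in the experimental setup; purely as an illustrative assumption, one plausible realization is to zero out a random fraction of B-scans in each OCT volume:

```python
import torch

def mask_oct_slices(x_oct: torch.Tensor, missing_rate: float) -> torch.Tensor:
    """Hypothetical sketch: simulate intra-modality incompleteness by dropping a
    random fraction (missing_rate) of B-scans from each OCT volume of shape
    (B, C, D, H, W). The paper's actual protocol may differ."""
    b, _, d, _, _ = x_oct.shape
    keep = (torch.rand(b, d, device=x_oct.device) >= missing_rate)  # per-slice keep mask
    return x_oct * keep.view(b, 1, d, 1, 1).to(x_oct.dtype)
```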

Visualization

Figure 5. Comparative visualization of attention maps under the inter-modality incompleteness setting: The first row corresponds to the Harvard-30k AMD dataset; the second row corresponds to the Harvard-30k Glaucoma dataset; the third row corresponds to the Harvard-30k DR dataset, and the fourth row corresponds to the GAMMA dataset. For each dataset, we select a representative disease stage.

BibTeX