TLDR
Fine-tuning multimodal large language models (MLLMs) leads to catastrophic forgetting.
We introduce EMT: Evaluating MulTimodality, an evaluation framework that treats each MLLM as an image classifier. We use EMT to evaluate LLaVA, Otter, LENS, and InstructBLIP on MNIST, CIFAR10, CIFAR100, and miniImageNet. We find that, although most of the tested MLLMs cannot match the performance of their base vision encoders, several notable observations stand out:
We examine the outputs of different models on different datasets, identify several issues that affect classification accuracy, and demonstrate one representative example of each issue:
IMG | Task & label | Model | Output |
---|---|---|---|
MNIST: 0 | LLaVA-7b-v0 | The number in the image is 8: | |
CIFAR10: horse | LENS | airplane, automobile, bird, cat, deer, dog, frog, horse, | |
CIFAR100: aquarium_fish | InstructBLIP-7b | a picture of a fish in a tank |
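To make the evaluation protocol concrete, below is a minimal sketch of an EMT-style classification loop. All helper names (`emt_prompt`, `emt_accuracy`, the `model` callable) are hypothetical, and the scoring rule (counting an answer correct only when the ground-truth class is the only class name mentioned) is an illustrative assumption rather than EMT's exact matching rule; it mainly shows why free-form outputs like the LENS and InstructBLIP examples above are hard to score.

```python
# Minimal sketch of an EMT-style evaluation loop.
# All names here are hypothetical; the actual EMT prompts and
# answer-matching rules are described in the paper.
from typing import Callable, Iterable, List, Tuple

def emt_prompt(class_names: List[str]) -> str:
    """Ask the MLLM to act as a closed-vocabulary image classifier."""
    return (
        "What is the main object in this image? "
        f"Answer with exactly one of: {', '.join(class_names)}."
    )

def emt_accuracy(
    model: Callable[[object, str], str],    # (image, prompt) -> free-form text
    dataset: Iterable[Tuple[object, str]],  # (image, ground-truth class name)
    class_names: List[str],
) -> float:
    """Count an answer as correct only if the ground-truth class is the single
    class name mentioned in the model's free-form output (an assumption made
    for illustration, not EMT's exact rule)."""
    prompt = emt_prompt(class_names)
    correct, total = 0, 0
    for image, label in dataset:
        answer = model(image, prompt).lower()
        mentioned = [c.lower() for c in class_names if c.lower() in answer]
        correct += int(mentioned == [label.lower()])
        total += 1
    return correct / max(total, 1)
```

Under such a rule, the LENS output above, which lists most of the CIFAR10 classes, would not be credited for horse.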
Our fine-tuning experiments show that:
From the classification curves, we observe that:
To see how fine-tuning affects the prediction outputs, we provide the following examples from 3-epoch fine-tuned LLaVA models:
| Image | Model | Output |
|---|---|---|
| (image) | CIFAR10-fine-tuned-LLaVA | The object is an airplane. |
| (image) | MNIST-fine-tuned-LLaVA | The airplane is 8. |
| (image) | CIFAR100-fine-tuned-LLaVA | The object is a(n) butterfly. |
| (image) | miniImageNet-fine-tuned-LLaVA | The object is a(n) aircraft carrier. |
The above examples show that:
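The classification curves referenced earlier can be illustrated, under the same assumptions, as a cross-dataset sweep over fine-tuning checkpoints: a model fine-tuned on one dataset is re-evaluated on all four datasets after each epoch, so forgetting appears as accuracy dropping on the datasets it was not fine-tuned on. The sketch below reuses the hypothetical `emt_accuracy` helper from above and is not the exact EMT fine-tuning pipeline.

```python
# Hypothetical sketch of producing cross-dataset "forgetting curves";
# reuses the emt_accuracy helper defined in the sketch above.
def forgetting_curves(checkpoints, datasets):
    """checkpoints: iterable of (epoch, model) pairs for a model fine-tuned
    on one dataset; datasets: {name: (examples, class_names)} for every
    benchmark. Returns {epoch: {dataset name: EMT accuracy}}."""
    curves = {}
    for epoch, model in checkpoints:
        curves[epoch] = {
            name: emt_accuracy(model, examples, class_names)
            for name, (examples, class_names) in datasets.items()
        }
    return curves
```

Plotting these per-dataset accuracies against the fine-tuning epoch gives curves of the kind referred to above.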
This work was completed while YZ and MC were interning at Cruise AI Research. We would like to thank Samuel Ainsworth and Yuning Chai from Cruise AI Research for mentoring this project. YZ shamelessly borrowed this HTML template from Brent Yi.
EMT also has another meaning: Emilia-tan Maji Tenshi (エミリアたん・マジ・天使).