Investigating the Catastrophic Forgetting in Multimodal Large Language Models

@ CPAL2024
Yuexiang Zhai1
Shengbang Tong2
Xiao Li3
Mu Cai4
Qing Qu3
Yong Jae Lee4,5
Yi Ma1
1UC Berkeley
2NYU
3UMichigan
4UW-Madison
5Cruise LLC

TLDR Fine-tuning multimodal large language models (MLLMs) leads to catastrophic forgetting.

Overview

We introduce EMT: Evaluating MulTimodality, an evaluation framework that treats each MLLM as an image classifier. We use EMT to evaluate LLaVA, Otter, LENS, and InstructBLIP on MNIST, CIFAR10, CIFAR100, and miniImageNet. We find that fine-tuning an MLLM on one of these datasets leads to catastrophic forgetting on the others.
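The idea behind EMT is simple: prompt the MLLM to name the object in an image, then check whether its free-form answer names exactly the ground-truth class. A minimal sketch of this scoring logic (the prompt wording and function names are our assumptions, not the authors' implementation; the substring match is deliberately simplistic):

```python
# Hypothetical EMT-style scorer: treat an MLLM's free-form answer as a
# classification and count it correct only if it mentions the ground-truth
# class and no other class from the label set.

def emt_prompt(labels: list[str]) -> str:
    """Build a classification prompt over a fixed label set (assumed wording)."""
    return ("What is the object in the image? "
            "Answer with a single class from: " + ", ".join(labels) + ".")

def emt_score(answer: str, labels: list[str], true_label: str) -> bool:
    """Return True iff the answer mentions the true label and nothing else.

    Underscored labels (e.g. "aquarium_fish") are matched against their
    space-separated form. Substring matching can over-trigger on short
    labels; a real evaluator would need more careful parsing.
    """
    answer = answer.lower()
    mentioned = [l for l in labels if l.lower().replace("_", " ") in answer]
    return mentioned == [true_label]
```

Under this rule, an answer that enumerates every class (as some models do) is marked wrong, since more than one label is mentioned.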

Evaluating Existing Open Source MLLMs

We present the classification accuracies of LLaVA, Otter, LENS, and InstructBLIP on MNIST, CIFAR10, CIFAR100, and miniImageNet below. We group the radial plots by the base CLIP ViT model.
ViT-L-14
ViT-H-14
ViT-g-14

Although most of the tested MLLMs fail to match the performance of their base vision encoders, we make several notable observations:

Examining the prediction outputs

We examine the outputs of different models on different datasets, identify several issues affecting classification accuracy, and present one representative example for each issue.

Task & label | Model | Output
MNIST: 0 | LLaVA-7b-v0 | The number in the image is 8:
CIFAR10: horse | LENS | airplane, automobile, bird, cat, deer, dog, frog, horse,
CIFAR100: aquarium_fish | InstructBLIP-7b | a picture of a fish in a tank

Fine-Tuning LLaVA

We use EMT to track accuracy changes during LLaVA fine-tuning, in two settings: (1) fine-tuning only the linear adapter layer (denoted linear); (2) fine-tuning both the linear adapter layer and the LLM with LoRA (denoted lora).
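In the lora setting, the frozen weight matrix W of each adapted layer is augmented by a scaled low-rank product: W' = W + (alpha / r) * B @ A, with B of shape (d_out, r) and A of shape (r, d_in). A toy pure-Python illustration of this update (real runs would use a library such as PEFT on the LLM; the matrix values here are made up):

```python
# Toy LoRA update: W' = W + (alpha / r) * (B @ A), with W itself frozen.
# Plain nested lists stand in for tensors to keep the sketch dependency-free.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_update(W, A, B, alpha, r):
    """Return the effective weight W + (alpha / r) * (B @ A).

    W: (d_out, d_in) frozen base weight
    B: (d_out, r) and A: (r, d_in) are the trainable low-rank factors.
    """
    BA = matmul(B, A)
    s = alpha / r
    return [[w + s * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]
```

Because only A and B (r * (d_in + d_out) parameters per layer) are trained, the lora setting updates the LLM far more cheaply than full fine-tuning, yet, as shown below, it still induces forgetting.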

Our fine-tuning experiments show that fine-tuning LLaVA on one dataset causes severe accuracy drops on the remaining datasets, i.e., catastrophic forgetting.

Next, we examine the fine-tuning process in more detail by plotting the accuracy curves.
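The curves report per-dataset accuracy over fine-tuning; the forgetting they reveal can be summarized as a simple per-dataset accuracy delta between the pre- and post-fine-tuning checkpoints. A minimal sketch (function name and inputs are hypothetical):

```python
# Hypothetical forgetting summary: per-dataset accuracy drop after
# fine-tuning. A positive value means the model forgot that dataset.

def forgetting(acc_before: dict, acc_after: dict) -> dict:
    """Map each dataset to its accuracy drop (before minus after)."""
    return {d: acc_before[d] - acc_after[d] for d in acc_before}
```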

From the classification curves, we observe that:

Examining the prediction outputs

To see how fine-tuning affects the prediction outputs, we provide the following examples from LLaVA models fine-tuned for 3 epochs:

Model | Output
CIFAR10-fine-tuned-LLaVA | The object is an airplane.
MNIST-fine-tuned-LLaVA | The airplane is 8.
CIFAR100-fine-tuned-LLaVA | The object is a(n) butterfly.
miniImageNet-fine-tuned-LLaVA | The object is a(n) aircraft carrier.

The above examples show that after fine-tuning on one dataset, LLaVA's outputs start to mix in labels and answer formats from that dataset, even on images from other datasets (e.g., "The airplane is 8.").

Acknowledgements

This work was completed while YZ and MC were interning at Cruise AI Research. We would like to thank Samuel Ainsworth and Yuning Chai from Cruise AI Research for mentoring this project. YZ shamelessly borrowed this HTML template from Brent Yi.

Disclaimers

Why we do not release codes

In case you like theory

Misc

EMT also has another meaning: Emilia-tan Maji Tenshi (エミリアたん・マジ・天使).