TLDR
Fine-tuning multimodal large language models (MLLMs) leads to catastrophic forgetting.
We introduce EMT: Evaluating MulTimodality, an evaluation framework that treats each MLLM as an image classifier. We use EMT to evaluate LLaVA, Otter, LENS, and InstructBLIP on MNIST, CIFAR10, CIFAR100, and miniImageNet. We find that, although most of the tested MLLMs cannot match the performance of their base vision encoders, several notable observations stand out:
We examine the outputs of different models on different datasets, identify several issues that affect classification accuracy, and demonstrate one representative example of each issue:
IMG | Task & label | Model | Output |
---|---|---|---|
MNIST: 0 | LLaVA-7b-v0 | The number in the image is 8: | |
CIFAR10: horse | LENS | airplane, automobile, bird, cat, deer, dog, frog, horse, | |
CIFAR100: aquarium_fish | InstructBLIP-7b | a picture of a fish in a tank |
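To make the evaluation protocol concrete, below is a minimal sketch of an EMT-style classification loop. All helper names (`emt_prompt`, `emt_accuracy`, the `model` callable) are hypothetical, and the scoring rule (counting an answer correct only when the ground-truth class is the only class name mentioned) is an illustrative assumption rather than EMT's exact matching rule; it mainly shows why free-form outputs like the LENS and InstructBLIP examples above are hard to score.

```python
# Minimal sketch of an EMT-style evaluation loop.
# All names here are hypothetical; the actual EMT prompts and
# answer-matching rules are described in the paper.
from typing import Callable, Iterable, List, Tuple

def emt_prompt(class_names: List[str]) -> str:
    """Ask the MLLM to act as a closed-vocabulary image classifier."""
    return (
        "What is the main object in this image? "
        f"Answer with exactly one of: {', '.join(class_names)}."
    )

def emt_accuracy(
    model: Callable[[object, str], str],    # (image, prompt) -> free-form text
    dataset: Iterable[Tuple[object, str]],  # (image, ground-truth class name)
    class_names: List[str],
) -> float:
    """Count an answer as correct only if the ground-truth class is the single
    class name mentioned in the model's free-form output (an assumption made
    for illustration, not EMT's exact rule)."""
    prompt = emt_prompt(class_names)
    correct, total = 0, 0
    for image, label in dataset:
        answer = model(image, prompt).lower()
        mentioned = [c.lower() for c in class_names if c.lower() in answer]
        correct += int(mentioned == [label.lower()])
        total += 1
    return correct / max(total, 1)
```

Under such a rule, the LENS output above, which lists most of the CIFAR10 classes, would not be credited for horse.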
Our fine-tuning experiments show that:
From the classification curves, we observe that:
To see how fine-tuning affects the prediction outputs, we provide the following examples from 3-epoch fine-tuned LLaVA models:
| Image | Model | Output |
|---|---|---|
| (image) | CIFAR10-fine-tuned-LLaVA | The object is an airplane. |
| (image) | MNIST-fine-tuned-LLaVA | The airplane is 8. |
| (image) | CIFAR100-fine-tuned-LLaVA | The object is a(n) butterfly. |
| (image) | miniImageNet-fine-tuned-LLaVA | The object is a(n) aircraft carrier. |
The above examples show that:
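The classification curves referenced earlier can be illustrated, under the same assumptions, as a cross-dataset sweep over fine-tuning checkpoints: a model fine-tuned on one dataset is re-evaluated on all four datasets after each epoch, so forgetting appears as accuracy dropping on the datasets it was not fine-tuned on. The sketch below reuses the hypothetical `emt_accuracy` helper from above and is not the exact EMT fine-tuning pipeline.

```python
# Hypothetical sketch of producing cross-dataset "forgetting curves";
# reuses the emt_accuracy helper defined in the sketch above.
def forgetting_curves(checkpoints, datasets):
    """checkpoints: iterable of (epoch, model) pairs for a model fine-tuned
    on one dataset; datasets: {name: (examples, class_names)} for every
    benchmark. Returns {epoch: {dataset name: EMT accuracy}}."""
    curves = {}
    for epoch, model in checkpoints:
        curves[epoch] = {
            name: emt_accuracy(model, examples, class_names)
            for name, (examples, class_names) in datasets.items()
        }
    return curves
```

Plotting these per-dataset accuracies against the fine-tuning epoch gives curves of the kind referred to above.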
This work was completed while YZ and MC were interning at Cruise AI Research. We would like to thank Samuel Ainsworth and Yuning Chai from Cruise AI Research for mentoring this project. YZ shamelessly borrowed this HTML template from Brent Yi.
EMT also has another meaning: Emilia-tan Maji Tenshi (エミリアたん・マジ・天使).