BLIP-2 as a classification model

I was wondering whether it is even possible to use the BLIP-2 model (Blip2ForConditionalGeneration) for classification-like tasks. I have not been able to find any thorough information on how to use this model with a classification head.

Also, if the answer is yes, which features should be extracted to train the classifier on? I can think of two possibilities:

  1. Use the last_hidden_state of the Q-Former and combine these features with the last_hidden_state of the vision model; or
  2. Use the pooled output of the Q-Former.
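As a rough sketch of the second option, assuming a PyTorch setup: in `transformers`, `Blip2Model.get_qformer_features(pixel_values=...)` returns the Q-Former outputs, whose `last_hidden_state` has shape (batch, num_query_tokens, hidden_size), typically (B, 32, 768). The random tensor below stands in for those features so the snippet runs without downloading a checkpoint; the mean pooling and linear head are illustrative choices, not an established recipe:

```python
import torch
import torch.nn as nn

# Stand-in for Q-Former features. In real code you would obtain them with
# something like:
#   model = Blip2Model.from_pretrained("Salesforce/blip2-opt-2.7b")
#   qformer_out = model.get_qformer_features(pixel_values=pixel_values)
#   features = qformer_out.last_hidden_state  # (B, num_query_tokens, hidden)
batch_size, num_query_tokens, hidden_size = 4, 32, 768
features = torch.randn(batch_size, num_query_tokens, hidden_size)

class QFormerClassifier(nn.Module):
    """Mean-pool the query-token features, then apply a linear head."""
    def __init__(self, hidden_size: int, num_classes: int):
        super().__init__()
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, qformer_hidden: torch.Tensor) -> torch.Tensor:
        pooled = qformer_hidden.mean(dim=1)  # (B, hidden_size)
        return self.head(pooled)             # (B, num_classes)

clf = QFormerClassifier(hidden_size, num_classes=5)
logits = clf(features)
print(logits.shape)  # torch.Size([4, 5])
```

You could equally feed the `pooler_output` (option 2) straight into the linear head and skip the pooling step; which works better is an empirical question.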

I feel like this is an interesting topic, about which I unfortunately was not able to find much information.
Any related tips would be really appreciated. Thanks!


Hey, any luck on that?

I use it for classifying whether an image matches a text or not: I take the embeddings of the image and the text, compute their cosine similarity, and then use a threshold to decide whether they match.
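The matching step described above can be sketched as follows. The random vectors stand in for image and text embeddings projected to a shared space, and the threshold value is an arbitrary placeholder; in practice it would be tuned on a validation set:

```python
import torch
import torch.nn.functional as F

# Stand-ins for image/text embeddings from the model (assumed to live in a
# shared embedding space). Real code would use the model's projected outputs.
image_emb = torch.randn(256)
text_emb = torch.randn(256)

def matches(image_emb: torch.Tensor, text_emb: torch.Tensor,
            threshold: float = 0.3) -> tuple[bool, float]:
    """Return (is_match, similarity); threshold=0.3 is a placeholder."""
    sim = F.cosine_similarity(image_emb.unsqueeze(0),
                              text_emb.unsqueeze(0)).item()
    return sim >= threshold, sim

is_match, score = matches(image_emb, text_emb)
print(is_match, score)
```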


BLIP-2 can handle this too, but for this kind of task there seems to be no need to limit yourself to BLIP-2 nowadays…