This idea arises from the exciting recent work Catching Out-of-Context Misinformation with Self-supervised Learning by Aneja et al. They propose a simple handcrafted rule: if two captions align with the same object(s) in an image but semantically convey different meanings, the image with its two captions is classified as having conflicting captions.
Here, both captions align with the same object(s) in the image, i.e., Donald Trump and Angela Merkel on the left and Obama, Dr. Fauci, and Melinda Gates on the right. If the two captions are semantically different from each other (right), they are considered out of context; otherwise not (left).
The objective during training is to obtain higher scores for aligned image-text pairs (i.e., if an image appeared with the text irrespective of the context) than misaligned image-text pairs (i.e., some randomly-chosen text which did not appear with the image). To this end, they formulate a scoring function to align objects in the image with the caption. Intuitively, an image-caption pair should have a high matching score if visual correspondences for the caption are present in the image, and a low score if the caption is unrelated to the image. The resulting Image-Text matching model obtained from training, as described above, provides an accurate representation of how likely a caption aligns with an image.
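The training objective can be sketched as a max-margin loss that pushes the score of an aligned image–caption pair above that of a randomly mismatched one. This is a minimal sketch under assumptions: the cosine-similarity scoring function and the toy 2-d embeddings are placeholders, whereas the paper's actual scoring function aligns per-object image regions with the caption.

```python
import numpy as np

def margin_loss(score_pos, score_neg, margin=1.0):
    """Hinge loss: require the aligned pair's score to exceed the
    mismatched pair's score by at least `margin`."""
    return max(0.0, margin - score_pos + score_neg)

def alignment_score(image_emb, text_emb):
    """Toy scoring function: cosine similarity between a pooled image
    embedding and a caption embedding (the paper instead scores
    individual object regions against the caption)."""
    return float(np.dot(image_emb, text_emb) /
                 (np.linalg.norm(image_emb) * np.linalg.norm(text_emb)))

# An aligned caption scores high, a random caption scores low,
# so the hinge loss is close to zero.
img = np.array([1.0, 0.0])
cap_pos = np.array([0.9, 0.1])   # caption that appeared with the image
cap_neg = np.array([0.0, 1.0])   # randomly sampled caption
loss = margin_loss(alignment_score(img, cap_pos),
                   alignment_score(img, cap_neg))
```

With these toy embeddings the aligned pair nearly saturates the margin, so the loss is near zero; a mismatched pair would be penalized instead.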
At test time they take as input an image and two captions, then use the trained image-text matching model as follows. They first pick the top k bounding boxes for each caption and check the overlap between each of the k boxes for caption 1 and each of the k boxes for caption 2. If the IoU for any such pair exceeds a threshold, they infer that the two captions ground to the same image region. Separately, they compute the textual similarity Ssim between the two captions using a pre-trained sentence-similarity model; if Ssim falls below a threshold, the two captions are semantically different. When both conditions hold, the image is classified as being used out of context.
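The test-time rule can be put together in a short sketch. This is a minimal illustration under assumptions: the `(x1, y1, x2, y2)` box format, the threshold values, and the externally computed sentence similarity `sent_sim` are placeholders, not the paper's exact configuration.

```python
def iou(box_a, box_b):
    """Intersection-over-union for boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def out_of_context(boxes1, boxes2, sent_sim,
                   iou_thresh=0.5, sim_thresh=0.5):
    """Flag the image as out of context when both captions ground to
    overlapping image regions (any cross-pair IoU above a threshold)
    but are semantically different (low sentence similarity)."""
    regions_overlap = any(iou(a, b) > iou_thresh
                          for a in boxes1 for b in boxes2)
    captions_differ = sent_sim < sim_thresh
    return regions_overlap and captions_differ
```

For example, two heavily overlapping boxes with dissimilar captions trigger the out-of-context label, while the same boxes with near-identical captions do not.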
I am curious how well multimodal networks trained to directly predict an alignment score, given an image and a caption as input, would work on this problem. This would avoid the need for intermediate object-detection predictions, so the network could learn to use image regions outside the object boxes for alignment with the caption. The advantage of the hand-designed alignment metric is not obvious to me. Moreover, this would also leave fewer hand-designed hyperparameters, such as the number k of top bounding boxes, at the inference stage.
Supervised Multimodal Bitransformers for Classifying Images and Text by Kiela et al. could potentially be used to learn to predict the image-caption alignment score directly. It builds on unimodally pretrained models that are simple and easy to adapt, i.e., it is straightforward to replace the text or image encoders with better alternatives and directly fine-tune, without requiring multimodal retraining. Furthermore, their method does not rely on a particular feature-extraction pipeline, since it does not require, e.g., region or bounding-box proposals, and it is modality-agnostic: it works for any sequence of dense vectors.
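The core bitransformer idea — project image features into the token-embedding space and let a single transformer attend over the joint sequence — can be sketched with placeholder numbers. This is only an illustration of the data flow: the dimensions, the random projection, and the random "features" stand in for BERT token embeddings, pooled CNN features, and a learned projection.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder dimensions (assumed for illustration): a 768-d
# transformer hidden size and 2048-d pooled image features.
d_model, d_image = 768, 2048
n_text_tokens, n_image_tokens = 12, 3

# Stand-ins for learned/pretrained components.
W_img = rng.standard_normal((d_image, d_model)) * 0.02   # learned projection
text_tokens = rng.standard_normal((n_text_tokens, d_model))   # token embeddings
image_feats = rng.standard_normal((n_image_tokens, d_image))  # pooled CNN features

# Project image features into the token-embedding space and concatenate,
# so one transformer can attend across both modalities; a classification
# head on this joint sequence would then output the alignment score.
image_tokens = image_feats @ W_img
joint_sequence = np.concatenate([image_tokens, text_tokens], axis=0)
```

No bounding boxes appear anywhere in this pipeline, which is exactly why it sidesteps the hand-designed components of the detection-based approach.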