Causal Interventions for Robust Visual Question Answering
Date of Award
6-1-2023
Document Type
Thesis
Degree Name
Master of Science in Computer Science
First Advisor
Raphael B. Alampay, PhD
Patricia Angela R. Abu, PhD
Abstract
Contemporary visual question answering (VQA) models have been shown to exhibit poor out-of-distribution (OOD) generalization because they tend to learn superficial statistical correlations from training data rather than more reliable underlying causal features. This can be addressed by widening the training distribution through data augmentation; however, despite recent advances in generative modelling and large foundation models, the application of these methods to data augmentation for robust VQA remains underexplored. This study proposes a novel approach to ensembling foundation models in order to generate OOD data points that widen the distribution of a training dataset. In particular, this study proposes a novel token sampling method that perturbs existing image captions into OOD captions, which are then used to steer a pretrained text-to-image model. The resulting images, paired with the original questions and answers, are used to finetune a VQA model that has only been trained on the original training dataset. This method is empirically shown to improve robustness: for a BLIP model pretrained on VQA v2.0, finetuning on the generated data reduces the accuracy drop by 7.59% on AQUA and by 1.43% on VizWiz.
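The sketch below illustrates the general shape of the augmentation pipeline described in the abstract, not the thesis's actual implementation. The thesis's novel token sampling method is not reproduced here; masked-token resampling with a BERT masked language model stands in for it, and Stable Diffusion stands in for the pretrained text-to-image model. Model names, function names, and parameters are illustrative assumptions.

```python
# Hypothetical sketch: perturb a caption toward OOD, then steer a pretrained
# text-to-image model with it. Not the thesis's method; MLM resampling and
# Stable Diffusion are assumed stand-ins for the actual components.
import random
import torch
from transformers import pipeline
from diffusers import StableDiffusionPipeline

# Masked language model used here only as a stand-in perturbation mechanism.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def perturb_caption(caption: str, num_tokens: int = 1) -> str:
    """Replace randomly chosen tokens with MLM samples to push the caption
    out of distribution (illustrative only, not the thesis's sampler)."""
    tokens = caption.split()
    for _ in range(num_tokens):
        idx = random.randrange(len(tokens))
        masked = tokens.copy()
        masked[idx] = fill_mask.tokenizer.mask_token
        candidates = fill_mask(" ".join(masked))
        # Take a lower-ranked candidate to encourage an OOD substitution.
        tokens[idx] = candidates[-1]["token_str"]
    return " ".join(tokens)

# Pretrained text-to-image model steered by the perturbed caption.
t2i = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

caption = "a man riding a horse on a beach"
ood_caption = perturb_caption(caption, num_tokens=2)
# The generated image is paired with the original question and answer
# to form an augmented training example for VQA finetuning.
ood_image = t2i(ood_caption).images[0]
ood_image.save("ood_sample.png")
```

In the thesis, examples produced this way would be added to the original training set and used to finetune the pretrained VQA model (e.g., BLIP), with the goal of reducing the OOD accuracy drop reported above.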
Recommended Citation
Ramos, Ryan Ceasar C. (2023). Causal Interventions for Robust Visual Question Answering. Archīum.ATENEO.
https://archium.ateneo.edu/theses-dissertations/873
