Conference paper, Year: 2021

On the hidden treasure of dialog in video question answering

Abstract

High-level understanding of stories in video, such as movies and TV shows, from raw data is extremely challenging. Modern video question answering (VideoQA) systems often use additional human-made sources like plot synopses, scripts, video descriptions or knowledge bases. In this work, we present a new approach to understanding the whole story without such external sources. The secret lies in the dialog: unlike any prior work, we treat dialog as a noisy source to be converted into a text description via dialog summarization, much like recent methods treat video. The input of each modality is encoded by transformers independently, and a simple fusion method combines all modalities, using soft temporal attention for localization over long inputs. Our model outperforms the state of the art on the KnowIT VQA dataset by a large margin, without using question-specific human annotation or human-made plot summaries. It even outperforms human evaluators who have never watched any whole episode before.
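The abstract mentions independent per-modality encoding, a simple fusion step, and soft temporal attention over long inputs. The sketch below illustrates only the general idea and is not the authors' implementation. It assumes each modality has already been encoded into a list of per-segment embeddings, that attention weights come from a softmax over question-segment dot products, and that fusion is a plain average of per-modality answer scores. All names (`soft_temporal_attention`, `fuse_and_answer`) and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def soft_temporal_attention(segments, query):
    """Pool segment embeddings using softmax weights from query-segment similarity."""
    weights = softmax([dot(seg, query) for seg in segments])
    dim = len(segments[0])
    pooled = [sum(w * seg[i] for w, seg in zip(weights, segments)) for i in range(dim)]
    return pooled, weights


def fuse_and_answer(modalities, query, answers):
    """Average per-modality answer scores (an assumed fusion rule) and pick the best."""
    totals = [0.0] * len(answers)
    for segments in modalities.values():
        pooled, _ = soft_temporal_attention(segments, query)
        for k, ans in enumerate(answers):
            totals[k] += dot(pooled, ans)
    scores = [t / len(modalities) for t in totals]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy inputs: 2-D embeddings standing in for transformer outputs (assumption).
modalities = {
    "video": [[1.0, 0.0], [0.0, 1.0]],
    "dialog_summary": [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]],
}
query = [0.0, 2.0]
answers = [[1.0, 0.0], [0.0, 1.0]]

best, scores = fuse_and_answer(modalities, query, answers)
print("selected answer index:", best)
```

In this toy setup the segments most similar to the question receive most of the attention weight, so the pooled representation of each modality is dominated by the relevant part of a long input before the modalities are combined.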
Main file
C119.iccv21.vqa.pdf (1.15 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-03530160, version 1 (17-01-2022)

Identifiers

  • HAL Id: hal-03530160, version 1

Cite

Deniz Engin, François Schnitzler, Ngoc Q K Duong, Yannis Avrithis. On the hidden treasure of dialog in video question answering. ICCV 2021 - IEEE/CVF International Conference on Computer Vision, Oct 2021, Virtual, France. pp.1-10. ⟨hal-03530160⟩
