EgoTV : Egocentric Task Verification from Natural Language Task Descriptions
ICCV 2023

Rishi Hazra1
Brian Chen2
Akshara Rai3
Nitin Kamra2
Ruta Desai3
Örebro University1
Meta Reality Labs Research2
Meta AI3

[Paper]
[GitHub]
[EgoTV Dataset]
[CTV Dataset]
 

TLDR: We introduce the synthetic EgoTV dataset and the real-world CTV dataset to advance research on egocentric agents. Given (1) a natural language task description and (2) an egocentric video of an agent performing the task, the objective is to determine whether the task was executed correctly according to the description. The datasets feature multi-step tasks of diverse complexity. We also propose a novel Neuro-Symbolic Grounding (NSG) approach that outperforms SOTA vision-language models in causal, temporal, and compositional reasoning.


Abstract

To enable progress towards egocentric agents capable of understanding everyday tasks specified in natural language, we propose a benchmark and a synthetic dataset called Egocentric Task Verification (EgoTV). The goal in EgoTV is to verify the execution of tasks from egocentric videos based on the natural language description of these tasks. EgoTV contains pairs of videos and their task descriptions for multi-step tasks -- these tasks contain multiple sub-task decompositions, state changes, object interactions, and sub-task ordering constraints. In addition, EgoTV also provides abstracted task descriptions that contain only partial details about ways to accomplish a task. Consequently, EgoTV requires causal, temporal, and compositional reasoning of video and language modalities, which is missing in existing datasets. We also find that existing vision-language models struggle at the all-round reasoning needed for task verification in EgoTV. Inspired by the needs of EgoTV, we propose a novel Neuro-Symbolic Grounding (NSG) approach that leverages symbolic representations to capture the compositional and temporal structure of tasks. We demonstrate NSG's capability towards task tracking and verification on our EgoTV dataset and a real-world dataset derived from CrossTask (CTV). We open-source the EgoTV and CTV datasets and the NSG model for future research on egocentric assistive agents.



EgoTV Dataset


Figure 1. EgoTV dataset. A positive example [Left] and a negative example [Right] from the train set along with illustrative examples from the test splits [Bottom] of EgoTV are shown. The test splits are focused on generalization to novel compositions of tasks, unseen sub-tasks or steps and scenes, and abstraction in NL task descriptions. The bounding boxes are solely for demonstration purposes and are not used during training/inference.



  • Benchmark: Determine if a task described in NL has been correctly executed by the agent in the egocentric video.
  • Tasks: actions heat, clean, slice, cool, place, and pick, each parameterized by a target object; the place action is additionally parameterized by a receptacle object.
  • Ordering Constraints: and, then, before/after
  • Example: heat_then_clean(apple) | NL description: "apple is heated, then cleaned in a sinkbasin." The task consists of two ordered sub-tasks, heat → clean, on the target object apple (a toy sketch of this representation follows this list).
  • Metrics: Complexity: the number of sub-tasks in a task, requiring compositional reasoning. Ordering: the number of ordering constraints in a task, requiring temporal reasoning. Performance is measured with F1-score and accuracy.
  • Generalization Splits: Novel Tasks: unseen compositions of seen sub-tasks. Novel Steps: unseen affordances. Novel Scenes: seen tasks in unseen kitchen scenes. Abstraction: abstracted task descriptions.
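
As a rough illustration of the benchmark setup, the sketch below shows how a task such as heat_then_clean(apple) could be represented as sub-tasks plus ordering constraints, and how a ground-truth verification label could be derived from a sequence of completed sub-tasks. The class and function names (TaskSpec, verify) and the "observed sub-task sequence" abstraction are hypothetical conveniences for exposition, not the dataset's actual API.

```python
# Hypothetical sketch of an EgoTV-style task specification; not the dataset's API.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class TaskSpec:
    """A multi-step task: sub-tasks plus pairwise ordering constraints."""
    target: str                                # e.g., "apple"
    subtasks: List[str]                        # e.g., ["heat", "clean"]
    # (a, b) means sub-task a must be completed before sub-task b ("then"/"before").
    ordering: List[Tuple[str, str]] = field(default_factory=list)

def verify(task: TaskSpec, observed: List[str]) -> bool:
    """Label a video positive iff every sub-task occurs and all ordering constraints hold.

    `observed` is a (hypothetical) temporally ordered list of sub-tasks completed
    in the video, e.g., ["pick", "heat", "clean"].
    """
    if not all(s in observed for s in task.subtasks):
        return False                           # a required sub-task never happened
    return all(observed.index(a) < observed.index(b) for a, b in task.ordering)

# heat_then_clean(apple): "apple is heated, then cleaned in a sinkbasin"
task = TaskSpec(target="apple", subtasks=["heat", "clean"], ordering=[("heat", "clean")])
print(verify(task, ["pick", "heat", "clean"]))   # True  -> positive sample
print(verify(task, ["pick", "clean", "heat"]))   # False -> negative sample (order violated)
```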


EgoTV Dataset Statistics


Figure 2. EgoTV dataset statistics.



  • 168 hours of video, 82 tasks, 1038 task-object combinations
  • Average video length: 84 seconds
  • 4.6 sub-tasks per task on average; each sub-task spans ~14 frames
  • ~2.4 ways, on average, to verify a task from its NL description


CTV Dataset


Figure 3. CrossTask Verification (CTV) dataset



We introduce the CrossTask Verification (CTV) dataset, built from videos in the CrossTask dataset, to evaluate task verification models on real-world videos. CTV thus complements the EgoTV dataset -- together they provide a solid test-bed for future research on task verification. CrossTask has 18 task classes, each with ~150 videos, from which we create ~2.7K samples.
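
As a rough illustration of how verification pairs can be derived from an instructional-video corpus (the negative-sampling strategy and helper names below are assumptions for exposition, not the paper's exact protocol), each video can be paired with its own task description as a positive sample and with a mismatched description as a negative sample:

```python
# Hypothetical sketch of turning a CrossTask-style corpus into verification samples.
import random
from typing import Dict, List, Tuple

def make_verification_samples(
    videos_by_task: Dict[str, List[str]],   # task description -> list of video ids
    negatives_per_video: int = 1,
    seed: int = 0,
) -> List[Tuple[str, str, int]]:
    """Return (video_id, task_description, label) triples; label 1 = matched, 0 = mismatched."""
    rng = random.Random(seed)
    tasks = list(videos_by_task)
    samples = []
    for task, vids in videos_by_task.items():
        other_tasks = [t for t in tasks if t != task]
        for vid in vids:
            samples.append((vid, task, 1))   # matched (positive) pair
            for wrong in rng.sample(other_tasks, k=min(negatives_per_video, len(other_tasks))):
                samples.append((vid, wrong, 0))   # mismatched (negative) pair
    return samples

# Toy usage with made-up task names and video ids.
corpus = {"make pancakes": ["v1", "v2"], "change a tire": ["v3"]}
for sample in make_verification_samples(corpus):
    print(sample)
```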


Neuro-Symbolic Grounding (NSG)

Figure 4. NSG Model (a) Semantic Parser and Query Encoder. (b) Video Aligner.



EgoTV requires visual grounding of task-relevant entities (actions, state changes, etc.) extracted from NL task descriptions in order to verify tasks in videos. To enable grounding that generalizes to novel compositions of tasks and actions, we propose Neuro-Symbolic Grounding (NSG). NSG consists of three modules: (a, left) a semantic parser, which converts task-relevant states from NL task descriptions into symbolic graphs; (a, right) query encoders, which output the probability that a node in the symbolic graph is grounded in a given video segment; and (b) a video aligner, which uses the query encoders to align these symbolic graphs with videos. NSG thus uses intermediate symbolic representations between NL task descriptions and the corresponding videos to achieve compositional generalization.
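
To make the alignment idea concrete, here is a highly simplified sketch (the shapes, the smoothing constant, and the dynamic-programming scheme are illustrative assumptions, not the released NSG implementation): query encoders score every symbolic-graph node against every video segment, and the video aligner searches for the best grounding of the nodes that respects their temporal order.

```python
# Illustrative order-aware alignment of symbolic nodes to video segments; not NSG's actual code.
import numpy as np

def ordered_alignment_score(probs: np.ndarray) -> float:
    """probs[i, t] = query-encoder probability that graph node i is grounded in segment t.

    Dynamic program over (node, segment): node i+1 must be grounded at or after
    the segment chosen for node i. Returns the best product of node probabilities,
    a proxy for "the video matches the (ordered) task description".
    """
    n_nodes, n_segments = probs.shape
    dp = np.full((n_nodes, n_segments), -np.inf)
    dp[0] = np.log(probs[0] + 1e-9)                      # place the first node anywhere
    for i in range(1, n_nodes):
        best_prev = np.maximum.accumulate(dp[i - 1])     # best placement of node i-1 up to segment t
        dp[i] = best_prev + np.log(probs[i] + 1e-9)      # then place node i at segment t
    return float(np.exp(dp[-1].max()))

# Toy example: two ordered nodes (heat -> clean) over four video segments.
probs = np.array([[0.1, 0.8, 0.2, 0.1],    # "heat(apple)" most likely in segment 1
                  [0.1, 0.1, 0.2, 0.9]])   # "clean(apple)" most likely in segment 3
print(ordered_alignment_score(probs))       # high score -> task description verified
```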


Results


Figure 5. [Left] Comparison of baselines with NSG on different data splits using F1-score. [Middle] F1-score of NSG vs. best-performing baseline for EgoTV tasks with varying complexity averaged over all splits. [Right] Confusion Matrix for NSG Queries on validation split (SQuery: StateQuery, RQuery: RelationQuery).



  • NSG learns to perform compositional and temporal reasoning on the EgoTV and CTV datasets.
  • NSG maintains consistent performance as task difficulty increases.
  • NSG learns to localize task-relevant entities without explicit supervision.


  • Why is a new benchmark necessary?


    EgoTV vs. existing video-language datasets: the EgoTV benchmark enables systematic investigation (diagnostics) of compositional, causal (e.g., effects of actions), and temporal (e.g., action ordering) reasoning in egocentric settings.



  • Reasoning: EgoTV focuses on compositional, causal, and temporal reasoning.
  • Observations: EgoTV is egocentric, unlike the fully-observable CLEVRER dataset.
  • Objective: EgoTV focuses on task verification, whereas ALFRED focuses on task execution.
  • Control: EgoTV provides systematic control and precise diagnostics across independent reasoning aspects, which Ego4D and EPIC-KITCHENS do not.


  • Paper

    R. Hazra, B. Chen, A. Rai, N. Kamra, R. Desai.
    EgoTV: Egocentric Task Verification from Natural Language Task Descriptions.
    ICCV, 2023.

    [Bibtex]