TLDR: We introduce the synthetic EgoTV and real-world CTV datasets for advancing egocentric agents. Given (1) a natural language task description and (2) an (egocentric) video of an agent performing the task, the objective is to determine whether the task was executed correctly based on the description. The datasets feature multi-step tasks of diverse complexity. We also propose a novel Neuro-Symbolic Grounding (NSG) approach that outperforms SOTA vision-language models in causal, temporal, and compositional reasoning.
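To make the input/output contract concrete, here is a minimal Python sketch of the verification interface; `VerificationSample`, `Verifier`, and `score` are illustrative placeholders of ours, not the released API or dataset schema:

```python
from dataclasses import dataclass
from typing import List, Protocol

import numpy as np

@dataclass
class VerificationSample:
    """One (description, video) pair with a binary ground-truth label."""
    task_description: str     # NL task description (possibly abstracted)
    frames: List[np.ndarray]  # egocentric RGB frames, each H x W x 3
    label: bool               # True iff the video executes the described task

class Verifier(Protocol):
    def score(self, description: str, frames: List[np.ndarray]) -> float:
        """Return P(video entails description)."""
        ...

def verify(model: Verifier, sample: VerificationSample,
           threshold: float = 0.5) -> bool:
    # Task verification is a binary decision over the two modalities.
    return model.score(sample.task_description, sample.frames) >= threshold
```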
To enable progress towards egocentric agents capable of understanding everyday tasks specified in natural language, we propose a benchmark and a synthetic dataset called Egocentric Task Verification (EgoTV). The goal in EgoTV is to verify the execution of tasks from egocentric videos based on the natural language description of these tasks. EgoTV contains pairs of videos and their task descriptions for multi-step tasks -- these tasks involve multiple sub-task decompositions, state changes, object interactions, and sub-task ordering constraints. In addition, EgoTV provides abstracted task descriptions that contain only partial details about how to accomplish a task. Consequently, EgoTV requires causal, temporal, and compositional reasoning over the video and language modalities, which is missing in existing datasets. We find that existing vision-language models struggle with the all-round reasoning needed for task verification in EgoTV. Inspired by the needs of EgoTV, we propose a novel Neuro-Symbolic Grounding (NSG) approach that leverages symbolic representations to capture the compositional and temporal structure of tasks. We demonstrate NSG's capability for task tracking and verification on our EgoTV dataset and on a real-world dataset derived from CrossTask (CTV). We open-source the EgoTV and CTV datasets and the NSG model for future research on egocentric assistive agents.
Figure 1. EgoTV dataset. A positive example [Left] and a negative example [Right] from the train set are shown, along with illustrative examples from the test splits [Bottom] of EgoTV. The test splits focus on generalization to novel compositions of tasks, unseen sub-tasks (steps) and scenes, and abstraction in NL task descriptions. The bounding boxes are solely for illustration and are not used during training/inference.

Figure 2. EgoTV dataset statistics.

Figure 3. CrossTask Verification (CTV) dataset.
We introduce the CrossTask Verification (CTV) dataset, which uses videos from the CrossTask dataset to evaluate task-verification models on real-world videos. CTV thus complements the EgoTV dataset -- together, CTV and EgoTV provide a solid test bed for future research on task verification. CrossTask has 18 task classes, each with ~150 videos, from which we create ~2.7K samples.
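For intuition, a common way to build such verification samples is to pair each video either with its own task description (positive) or with one drawn from a different task class (negative). The sketch below follows that recipe and yields one sample per video (~2.7K for 18 classes x ~150 videos), balanced in expectation; it is a hypothetical construction of ours, and the paper's exact negative-sampling procedure may differ:

```python
import random
from typing import Dict, List, Tuple

def build_verification_samples(
    videos_by_task: Dict[str, List[str]],  # task class -> video ids
    descriptions: Dict[str, str],          # task class -> NL description
    seed: int = 0,
) -> List[Tuple[str, str, bool]]:
    """For each video, emit one (video_id, description, label) sample:
    either the video's own description (positive) or a description drawn
    from a different task class (negative)."""
    rng = random.Random(seed)
    tasks = list(videos_by_task)
    samples = []
    for task, vids in videos_by_task.items():
        for vid in vids:
            if rng.random() < 0.5:
                samples.append((vid, descriptions[task], True))
            else:
                other = rng.choice([t for t in tasks if t != task])
                samples.append((vid, descriptions[other], False))
    return samples
```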
Figure 4. NSG model. (a) Semantic parser and query encoder. (b) Video aligner.
EgoTV requires visual grounding of task-relevant entities (actions, state changes, etc.) extracted from NL task descriptions in order to verify tasks in videos. To enable grounding that generalizes to novel compositions of tasks and actions, we propose Neuro-Symbolic Grounding (NSG). NSG consists of three modules: (a, left) a semantic parser, which converts task-relevant states from NL task descriptions into symbolic graphs; (a, right) query encoders, which estimate the probability that a node of the symbolic graph is grounded in a given video segment; and (b) a video aligner, which uses the query encoders to align these symbolic graphs with videos. NSG thus uses intermediate symbolic representations between NL task descriptions and the corresponding videos to achieve compositional generalization.
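For intuition about the aligner, the toy sketch below finds the best order-preserving assignment of parsed sub-task queries to video segments, given per-segment grounding log-probabilities of the kind the query encoders produce. This hard-max dynamic program is a simplification of ours (`align_score` is a hypothetical helper); NSG's actual parser, encoders, and aligner are learned modules:

```python
import math
from typing import List

def align_score(log_probs: List[List[float]]) -> float:
    """Best monotone alignment of n ordered sub-task queries to m video
    segments (n <= m). log_probs[i][j] = log P(query i is grounded in
    segment j). Queries must be matched to strictly increasing segment
    indices. Returns the total log-probability of the best alignment."""
    n, m = len(log_probs), len(log_probs[0])
    assert n <= m, "need at least as many segments as queries"
    NEG = -math.inf
    # dp[i][j]: best score grounding queries 0..i within segments 0..j
    dp = [[NEG] * m for _ in range(n)]
    for j in range(m):
        dp[0][j] = max(log_probs[0][: j + 1])
    for i in range(1, n):
        for j in range(i, m):
            skip = dp[i][j - 1] if j > i else NEG      # query i placed before j
            take = dp[i - 1][j - 1] + log_probs[i][j]  # query i placed at j
            dp[i][j] = max(skip, take)
    return dp[n - 1][m - 1]

# e.g., two queries ("knife picked up", then "apple sliced") over 3 segments:
# align_score([[-0.1, -2.0, -3.0], [-4.0, -0.2, -0.3]]) -> -0.3
```

The monotonicity of the assignment is what enforces sub-task ordering constraints; a video would then be verified when the best alignment score clears a threshold.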
Figure 5. [Left] Comparison of baselines with NSG on different data splits using F1-score. [Middle] F1-score of NSG vs. the best-performing baseline on EgoTV tasks of varying complexity, averaged over all splits. [Right] Confusion matrix for NSG queries on the validation split (SQuery: StateQuery; RQuery: RelationQuery).

EgoTV vs. existing video-language datasets. The EgoTV benchmark enables systematic (diagnostic) investigation of compositional, causal (e.g., effects of actions), and temporal (e.g., action ordering) reasoning in egocentric settings.

R. Hazra, B. Chen, A. Rai, N. Kamra, R. Desai