FAST-EQA

Efficient Embodied Question Answering with Global and Local Region Relevancy

Haochen Zhang*, Nirav Savaliya, Faizan Siddiqui, Enna Sachdeva

Honda Research Institute USA, *Carnegie Mellon University

๐Ÿ“„ Paperย ย ย ย ๐Ÿ’ปCode


Framework Overview

Overview

Embodied Question Answering (EQA) requires an agent to navigate a 3D environment and answer natural language questions using visual observations. This demands perception, spatial reasoning, semantic understanding, memory management, and efficient exploration.

Existing approaches often struggle with:

  • Inefficient exploration policies
  • Unbounded or noisy memory
  • Slow inference due to heavy vision-language reasoning
  • Difficulty handling multi-target questions

FAST-EQA introduces an efficient and scalable framework that combines:

  • Question-conditioned global exploration
  • Target-aware local search
  • Bounded visual memory
  • Chain-of-Thought reasoning for robust answering

The result is a faster, more accurate embodied QA agent that scales to long-horizon tasks.


Method Overview

Watch the Video

FAST-EQA operates in four main stages:

1. Question Parsing

Given a question, an LLM extracts:

  • Visual Targets (T) โ€“ objects or attributes to inspect
  • Relevant Regions (R) โ€“ areas likely to contain those targets

Example:
"Are the bedsheets in the master and guest bedrooms the same color?"

  • Targets: bedsheets (master bedroom), bedsheets (guest bedroom)
  • Relevant Regions: master bedroom, guest bedroom

This decomposition guides exploration from the start.


2. Global Relevance Exploration (GRE)

Global Exploration

Instead of standard frontier exploration, FAST-EQA prioritizes semantically meaningful frontiers.

Key ideas:

  • Detect narrow openings (doors, hallways) from occupancy maps
  • Treat them as region transitions
  • Rank frontiers based on proximity to relevant regions
  • Move toward regions likely to contain answer evidence

This allows the agent to efficiently reach important areas without exhaustively exploring the full map.


3. Local Relevance Exploration (LRE)

Once inside a relevant region:

  • The agent collects panoramic observations
  • Visual observations are evaluated for target relevance
  • A stopping condition checks whether enough evidence has been gathered

This prevents unnecessary wandering and reduces redundant observations.


4. Bounded Visual Memory

Visual Memory

To maintain efficiency, FAST-EQA uses a fixed-size memory per target.

Each observation is scored using:

  • CLIP similarity between image and target text
  • A generative VLM relevance score

Only the top-k most relevant observations per target are stored.

This ensures:

  • Constant memory size
  • Reduced noise
  • Fast inference
  • Improved reasoning quality

5. Chain-of-Thought Answering

Once exploration completes:

  • The stored memory is passed to a large vision-language model
  • A structured Chain-of-Thought prompt guides reasoning
  • The model produces the final answer

This improves:

  • Multi-step reasoning
  • Multi-target comparison
  • Interpretability

Experimental Results

Results

Inference Speed

Ablation

Links

Paper: https://arxiv.org/abs/2602.15813

Code: Coming Soon


Cite our work:

If you find this work useful, please consider citing:

@article{fasteqa2026,
  title={FAST-EQA: Efficient Embodied Question Answering with Global and Local Region Relevancy},
  author={Haochen Zhang and Nirav Savaliya and Faizan Siddiqui and Enna Sachdeva},
  booktitle = {Winter Conference on Applications of Computer Vision (WACV), 2026},
  year={2026}
}