
Abstract

While open-source vision-language models perform well on simple question-answering, they still struggle with complex questions that require both perceptual and reasoning capabilities. We propose LATTE, a family of vision-language models that have LeArned to Think wiTh vision spEcialists. By offloading perception to state-of-the-art vision models, our approach enables vision-language models to focus solely on reasoning over high-quality perceptual information. To train LATTE, we synthesize and filter a large dataset of 273K multi-modal reasoning traces over perceptual outputs of vision specialists. LATTE trained on this data achieves significant gains over baselines across 6 benchmarks covering both perception and reasoning abilities. Ablation studies reveal that the effectiveness of multi-modal reasoning traces depends on the data sources, formats, and quality of thoughts.

Figure 1. Example outputs of LATTE vs. SoTA multi-modal large language models. Our LATTE model answers challenging visual questions by reasoning over perceptual information produced by vision specialists: it generates a reasoning trace over the specialists' outputs and produces a final answer based on that reasoning.
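For illustration only, a LATTE-style trace can be viewed as an interleaving of thoughts, vision-specialist calls, and their observations, ending in a final answer. The schema below is an assumption for readability, not the exact format used by the model:

    # Hypothetical trace structure (field and tool names are illustrative, not the exact schema).
    example_trace = [
        {"thought": "The question asks how many mugs are on the table; localize them first.",
         "action": {"tool": "object_detector", "arguments": {"query": "mug"}}},
        {"observation": {"boxes": [[12, 40, 88, 120], [150, 44, 210, 130]]}},
        {"thought": "The detector returns two boxes, so there are two mugs.",
         "answer": "2"},
    ]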

Additional Examples

LATTE Traces

LATTE-traces Generation


Figure 2. Data Generation.

We illustrate our model-based data generation (top) and programmatic generation (bottom) pipelines. In model-based generation, we take existing image and QA pairs as inputs and prompt a large language model (i.e., GPT-4o) to generate either a LATTE-trace or a chain-of-thought (CoT) to answer the question. Then, we verify that the chains parse successfully and lead to correct final answers; if not, we convert them into the direct answer (Direct) format with ground-truth answers.
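As a rough sketch of this verification step (the helper function, the JSON chain format, and the answer-matching rule below are assumptions, not the released pipeline):

    import json

    # Keep a generated chain only if it parses and its final answer matches the
    # ground truth; otherwise fall back to the direct answer (Direct) format.
    def filter_generated_chain(question, ground_truth, raw_chain):
        try:
            chain = json.loads(raw_chain)        # parse the model-generated chain
            predicted = chain[-1]["answer"]      # final answer at the end of the chain
        except (json.JSONDecodeError, KeyError, IndexError, TypeError):
            chain, predicted = None, None        # malformed or unparsable chain
        if chain is not None and str(predicted).strip().lower() == str(ground_truth).strip().lower():
            return {"question": question, "trace": chain, "answer": ground_truth}
        return {"question": question, "trace": None, "answer": ground_truth}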

In programmatic generation, we first annotate images with human labelers and models, and then use the dense annotations to fill in manually written templates, generating QA pairs and the corresponding LATTE-traces with Python programs.
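A minimal sketch of this template-filling idea, assuming a simple annotation schema with labeled object boxes (the field names, the counting template, and the trace layout are illustrative):

    import random

    COUNT_TEMPLATE = "How many {label}s are in the image?"

    # Turn one dense annotation into a QA pair plus a LATTE-trace for a counting question.
    def make_counting_example(annotation):
        label = random.choice([obj["label"] for obj in annotation["objects"]])
        boxes = [obj["box"] for obj in annotation["objects"] if obj["label"] == label]
        count = len(boxes)
        trace = [
            {"action": {"tool": "object_detector", "arguments": {"query": label}}},
            {"observation": {"boxes": boxes}},
            {"thought": f"The detector finds {count} {label}(s).", "answer": str(count)},
        ]
        return {"image": annotation["image"],
                "question": COUNT_TEMPLATE.format(label=label),
                "trace": trace,
                "answer": str(count)}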

Data Distribution


Figure 3. Distribution of data formats and sources. We visualize the frequency of data formats (i.e., LATTE-pos/neg and CoT-pos/neg, where pos = correct final answers and neg = incorrect) in the original GPT-4o-generated data and in our training data (i.e., LATTE-trace, CoT, or Direct) across all data sources. We also highlight the LATTE-useless datasets (i.e., those where the percentage of CoT-pos exceeds LATTE-pos by more than 10 points, or LATTE-neg exceeds LATTE-pos by more than 10 points) vs. the LATTE-useful ones.
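Read as a simple threshold check over per-source accuracy percentages, the criterion above can be sketched as follows (percentages in [0, 100]; the 10-point margin comes from the caption, and the function name is ours):

    # A data source is "LATTE-useless" if plain CoT already beats LATTE-traces by a
    # wide margin, or if LATTE-traces mostly end in wrong answers on that source.
    def is_latte_useless(pct_cot_pos, pct_latte_pos, pct_latte_neg, margin=10):
        return (pct_cot_pos - pct_latte_pos > margin) or (pct_latte_neg - pct_latte_pos > margin)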

Experimental Results

We perform extensive experiments with small multi-modal models and 9 data recipes on 6 benchmarks to study the effectiveness of LATTE-traces in enabling models to reason with vision specialists on diverse vision-language tasks.

Takeaway 1: LATTE leads to substantial gains compared to vanilla instruction-tuning on both perception and reasoning benchmarks, whereas other distillation baselines result in smaller gains or even degradation on some perception tasks.

Table 1. LATTE vs. Baselines on Perception and Reasoning Benchmarks.


Figure 4. Performance of LATTE vs. Baselines across Training Data Scales.

Takeaway 2: Our method beats the vanilla instruction-tuning baseline on average across all benchmarks regardless of the base model and checkpoint, with significant gains of 10-16% on MM-Vet.


Table 2. LATTE vs. Vanilla IT with Different Models.


Figure 5. Qualitative analysis. Example outputs of VPD and LLaVA-CoT vs. LATTE on BLINK.

Takeaway 3: LATTE performs better on fine-grained perception tasks such as the counting questions in BLINK, whereas VPD and LLaVA-CoT tend to hallucinate and make perceptual errors.

Citation

@misc{ma2024tacolearningmultimodalaction,
      title={TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action}, 
      author={Zixian Ma and Jianguo Zhang and Zhiwei Liu and Jieyu Zhang and Juntao Tan and Manli Shu and Juan Carlos Niebles and Shelby Heinecke and Huan Wang and Caiming Xiong and Ranjay Krishna and Silvio Savarese},
      year={2024},
      eprint={2412.05479},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.05479}, 
}