
Abstract

While open-source vision-language models perform well on simple question-answering, they still struggle with complex questions that require both perceptual and reasoning capabilities. We propose LATTE, a family of vision-language models that have LeArned to Think wiTh vision spEcialists. By offloading perception to state-of-the-art vision models, our approach enables vision-language models to focus solely on reasoning over high-quality perceptual information. To train LATTE, we synthesize and filter a large dataset of 273K multi-modal reasoning traces over perceptual outputs of vision specialists. LATTE trained on this data achieves significant gains over baselines across 6 benchmarks covering both perception and reasoning abilities. Ablation studies reveal that the effectiveness of multi-modal reasoning traces depends on the data sources, formats, and quality of thoughts.

Figure 1. Example outputs of LATTE vs. SoTA multi-modal large language models. Our LATTE model answers challenging visual questions by reasoning over perceptual information produced by vision specialists: it generates a reasoning trace over the specialists' outputs and produces a final answer based on that reasoning.
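For illustration only, a LATTE-style trace can be viewed as an interleaving of thoughts, vision-specialist calls, and their observations, ending in a final answer. The schema below is an assumption for readability, not the exact format used by the model:

    # Hypothetical trace structure (field and tool names are illustrative, not the exact schema).
    example_trace = [
        {"thought": "The question asks how many mugs are on the table; localize them first.",
         "action": {"tool": "object_detector", "arguments": {"query": "mug"}}},
        {"observation": {"boxes": [[12, 40, 88, 120], [150, 44, 210, 130]]}},
        {"thought": "The detector returns two boxes, so there are two mugs.",
         "answer": "2"},
    ]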

Additional Examples

LATTE Traces

LATTE-traces Generation


Figure 2. Data Generation.

We illustrate our model-based data generation (top) and programmatic generation (bottom) pipelines. In model-based generation, we take existing image and QA pairs as inputs and prompt a large language model (i.e., GPT-4o) to generate either a LATTE-trace or a chain-of-thought (CoT) to answer the question. Then, we verify that the chains parse successfully and lead to correct final answers; if not, we convert them into the direct answer (Direct) format with ground-truth answers.
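As a rough sketch of this verification step (the helper function, the JSON chain format, and the answer-matching rule below are assumptions, not the released pipeline):

    import json

    # Keep a generated chain only if it parses and its final answer matches the
    # ground truth; otherwise fall back to the direct answer (Direct) format.
    def filter_generated_chain(question, ground_truth, raw_chain):
        try:
            chain = json.loads(raw_chain)        # parse the model-generated chain
            predicted = chain[-1]["answer"]      # final answer at the end of the chain
        except (json.JSONDecodeError, KeyError, IndexError, TypeError):
            chain, predicted = None, None        # malformed or unparsable chain
        if chain is not None and str(predicted).strip().lower() == str(ground_truth).strip().lower():
            return {"question": question, "trace": chain, "answer": ground_truth}
        return {"question": question, "trace": None, "answer": ground_truth}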

In programmatic generation, we first annotate images with human labelers and models, and then use the dense annotations to fill in manually written templates, generating QA pairs and the corresponding LATTE-traces with Python programs.
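A minimal sketch of this template-filling idea, assuming a simple annotation schema with labeled object boxes (the field names, the counting template, and the trace layout are illustrative):

    import random

    COUNT_TEMPLATE = "How many {label}s are in the image?"

    # Turn one dense annotation into a QA pair plus a LATTE-trace for a counting question.
    def make_counting_example(annotation):
        label = random.choice([obj["label"] for obj in annotation["objects"]])
        boxes = [obj["box"] for obj in annotation["objects"] if obj["label"] == label]
        count = len(boxes)
        trace = [
            {"action": {"tool": "object_detector", "arguments": {"query": label}}},
            {"observation": {"boxes": boxes}},
            {"thought": f"The detector finds {count} {label}(s).", "answer": str(count)},
        ]
        return {"image": annotation["image"],
                "question": COUNT_TEMPLATE.format(label=label),
                "trace": trace,
                "answer": str(count)}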

Data Distribution


Figure 3. Distribution of data formats and sources. We visualize the frequency of data formats (i.e., LATTE-pos/neg and CoT-pos/neg, where pos = correct final answers and neg = incorrect) in the original GPT-4o-generated data and in our training data (i.e., LATTE-trace, CoT, or Direct) across all data sources. We also highlight the LATTE-useless datasets (i.e., those where the percentage of CoT-pos exceeds LATTE-pos by more than 10 points, or LATTE-neg exceeds LATTE-pos by more than 10 points) vs. the LATTE-useful ones.
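Read as a simple threshold check over per-source accuracy percentages, the criterion above can be sketched as follows (percentages in [0, 100]; the 10-point margin comes from the caption, and the function name is ours):

    # A data source is "LATTE-useless" if plain CoT already beats LATTE-traces by a
    # wide margin, or if LATTE-traces mostly end in wrong answers on that source.
    def is_latte_useless(pct_cot_pos, pct_latte_pos, pct_latte_neg, margin=10):
        return (pct_cot_pos - pct_latte_pos > margin) or (pct_latte_neg - pct_latte_pos > margin)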

Experimental Results

We perform extensive experiments with small multi-modal models and 9 data recipes on 6 benchmarks to study the effectiveness of LATTE-traces in enabling models to reason with vision specialists on diverse vision-language tasks.

Takeaway 1: LATTE leads to substantial gains compared to vanilla instruction-tuning on both perception and reasoning benchmarks, whereas other distillation baselines result in smaller gains or even degradation on some perception tasks.

Table 1. LATTE vs. Baselines on Perception and Reasoning Benchmarks.


Figure 4. Performance of LATTE vs. Baselines across Training Data Scales.

Takeaway 2: Our method beats the vanilla instruction-tuning baseline on average across all benchmarks regardless of the base model and checkpoint, with significant gains of 10-16% on MM-Vet.


Table 2. LATTE vs. Vanilla IT with Different Models.


Figure 5. Qualitative analysis. Example outputs of VPD and LLaVA-CoT vs. LATTE on BLINK.

Takeaway 3: LATTE performs better on fine-grained perception tasks such as the counting questions in BLINK, whereas VPD and LLaVA-CoT tend to hallucinate and make perceptual errors.

Citation

@misc{ma2024tacolearningmultimodalaction,
      title={TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action}, 
      author={Zixian Ma and Jianguo Zhang and Zhiwei Liu and Jieyu Zhang and Juntao Tan and Manli Shu and Juan Carlos Niebles and Shelby Heinecke and Huan Wang and Caiming Xiong and Ranjay Krishna and Silvio Savarese},
      year={2024},
      eprint={2412.05479},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.05479}, 
}