Figure 2. Data Generation.
We illustrate our model-based data generation (top) and programmatic generation (bottom) pipelines. In
model-based generation, we take existing image and QA pairs as input and prompt a large language model (i.e., GPT-4o) to generate
either a LATTE-trace or a chain-of-thought (CoT) to answer the question. We then verify that the generated chains parse successfully and
lead to correct final answers; if not, we convert them into the direct-answer (Direct) format with ground-truth answers (see the sketch below).
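As a rough illustration of this verify-then-fallback step, the following Python sketch keeps a generated chain only if it parses and its final answer matches the ground truth, and otherwise converts the example to the Direct format. The helpers parse_trace and extract_final_answer are hypothetical stand-ins of our own; the paper does not show its actual pipeline code.

```python
# Minimal sketch of the verification/fallback logic, assuming toy
# helpers in place of the pipeline's real parsing utilities.

def parse_trace(chain: str) -> list[str]:
    """Toy parser: split a chain into steps; raise if nothing parses."""
    steps = [s.strip() for s in chain.split("\n") if s.strip()]
    if not steps:
        raise ValueError("chain failed to parse")
    return steps

def extract_final_answer(steps: list[str]) -> str:
    """Toy extractor: treat the last step as the final answer."""
    return steps[-1].removeprefix("Answer:").strip()

def verify_or_fallback(question: str, chain: str, gold: str) -> dict:
    """Keep the generated chain only if it parses and ends in the
    correct answer; otherwise fall back to the Direct format with
    the ground-truth answer."""
    try:
        steps = parse_trace(chain)
    except ValueError:
        return {"question": question, "format": "Direct", "answer": gold}
    if extract_final_answer(steps) == gold:
        return {"question": question, "format": "LATTE-trace", "steps": steps}
    return {"question": question, "format": "Direct", "answer": gold}

print(verify_or_fallback("What is 2+2?", "add 2 and 2\nAnswer: 4", "4"))
# {'question': 'What is 2+2?', 'format': 'LATTE-trace', 'steps': ['add 2 and 2', 'Answer: 4']}
```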
In programmatic generation,
we first annotate images with human labelers and models, and then use the dense annotations to fill in manually written templates,
generating QA pairs and the corresponding LATTE-traces with Python programs (as sketched below).
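To make the template-filling step concrete, here is a small Python sketch that turns dense object annotations into counting QA pairs. The template text and annotation schema are invented for illustration; the paper's actual templates are handwritten and more varied.

```python
# Sketch of template-based QA generation from dense annotations,
# assuming a hypothetical schema mapping object labels to boxes.

TEMPLATE_Q = "How many {label}s are in the image?"
TEMPLATE_A = "{count}"

def generate_counting_qa(annotations: dict) -> list[dict]:
    """Fill the counting template once per annotated object category."""
    examples = []
    for label, boxes in annotations["objects"].items():
        examples.append({
            "question": TEMPLATE_Q.format(label=label),
            "answer": TEMPLATE_A.format(count=len(boxes)),
        })
    return examples

dense = {"objects": {"cat": [(10, 20, 50, 60)],
                     "dog": [(5, 5, 40, 40), (60, 60, 90, 90)]}}
print(generate_counting_qa(dense))
# [{'question': 'How many cats are in the image?', 'answer': '1'},
#  {'question': 'How many dogs are in the image?', 'answer': '2'}]
```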
Data Distribution
Figure 3. Distribution of data formats and sources. We visualize the frequency of data formats (i.e., LATTE-pos/neg and CoT-pos/neg,
where pos = correct final answer and neg = incorrect) in the original GPT-4-generated data and in our training data (i.e., LATTE-trace,
CoT, or Direct) across all data sources. We also highlight the LATTE-useless datasets (i.e., those where the percentage of CoT-pos
exceeds that of LATTE-pos by more than 10 points, or the percentage of LATTE-neg exceeds that of LATTE-pos by more than 10 points)
vs. the LATTE-useful ones.
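Read as a predicate, the usefulness criterion amounts to the small check below; the function and the example percentages are ours, written only to pin down the two 10-point thresholds.

```python
# Sketch of the LATTE-useless criterion as we read it: a dataset is
# flagged useless when CoT beats LATTE by more than 10 percentage
# points, or LATTE-neg exceeds LATTE-pos by more than 10 points.

def is_latte_useless(pct_cot_pos: float, pct_latte_pos: float,
                     pct_latte_neg: float) -> bool:
    return (pct_cot_pos - pct_latte_pos > 10) or (pct_latte_neg - pct_latte_pos > 10)

print(is_latte_useless(55.0, 40.0, 30.0))  # True: CoT leads LATTE by 15 points
print(is_latte_useless(45.0, 42.0, 38.0))  # False: both gaps are within 10 points
```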