Benchmarking and Improving Detail Image Caption (2024)

Hongyuan Dong1*,Jiawen Li1*,Bohong Wu1,Jiacong Wang1,2,Yuan Zhang1,3,Haoyuan Guo1
1ByteDance Inc.  2School of Artificial Intelligence, University of Chinese Academy of Sciences
3School of Computer Science, Peking University
{donghongyuan.dousia, lijiawen.0818, bohongwu}@bytedance.com
wangjiacong20@mails.ucas.ac.cn, {zhangyuan.gump, guohaoyuan}@bytedance.com

Abstract

Image captioning has long been regarded as a fundamental task in visual understanding. Recently, however, few large vision-language model (LVLM) studies discuss models' image captioning performance because of outdated short-caption benchmarks and unreliable evaluation metrics. In this work, we propose to benchmark the detail image caption task by curating high-quality evaluation datasets annotated by human experts, GPT-4V and Gemini-1.5-Pro. We also design a more reliable caption evaluation metric called CAPTURE (CAPtion evaluation by exTracting and coUpling coRE information). CAPTURE extracts visual elements, e.g., objects, attributes and relations, from captions, and then matches these elements through three stages, achieving the highest consistency with expert judgements among rule-based and model-based caption metrics. The proposed benchmark and metric provide reliable evaluation for LVLMs' detail image captioning ability. Guided by this evaluation, we further explore unleashing LVLMs' detail caption capabilities by synthesizing high-quality data through a five-stage data construction pipeline. Our pipeline uses only a given LVLM itself and other open-source tools, without any human or GPT-4V annotation in the loop. Experiments show that the proposed data construction strategy significantly improves the quality of model-generated detail caption data for LVLMs with leading performance, and that the data quality can be further improved in a self-looping paradigm. All code and datasets will be publicly available at https://github.com/foundation-multimodal-models/CAPTURE.

* Equal contribution.  † Email correspondence.

1 Introduction

Image captioning has long been a fundamental task for assessing LVLMs' vision understanding capability[54, 34, 12, 15]. However, recent LVLM research evaluates visual understanding with a focus on QA benchmarks such as MME[16], MMBench[36], MMMU[60] and MM-Vet[59], which may suffer from instability caused by LVLMs' varying instruction following abilities[16]. Worse, human-defined queries may cover a limited scope of vision features[25] and introduce bias into performance evaluation[59]. The traditional image captioning task is considered unreliable for visual understanding evaluation because of outdated benchmarks and unstable evaluation metrics. Current image caption benchmarks consist of fairly short captions with limited visual information[32, 2], whereas SOTA LVLMs are capable of generating detail image captions encompassing a variety of fine-grained elements[9, 54, 34], only a few of which are covered by the provided ground truth captions. This mismatch leads to unsatisfying evaluation results. To this end, we propose to curate high-quality detail image caption evaluation datasets that provide reliable evaluation results for SOTA LVLMs. The evaluation datasets are annotated by human experts and the most capable LVLMs, GPT-4V[41] and Gemini-1.5-Pro[46], and are therefore of satisfactory quality for state-of-the-art (SOTA) LVLM evaluation.

Apart from benchmarks, existing caption evaluation metrics also suffer from poor consistency with human judgements. Traditional rule-based caption metrics such as BLEU[43], CIDER[53] and METEOR[4] compute n-gram matching scores between candidate and reference captions, which are extremely sensitive to caption writing style, resulting in unstable evaluation results[19]. Model-based evaluation metrics have been proposed to improve the reliability of image caption evaluation. However, representative model-based metrics either adopt outdated backbone models[3] or suffer from limited input text length[19, 48], leading to unsatisfying detail caption evaluation results.

To tackle the aforementioned problems, we propose CAPTURE, which adopts the SOTA text scene graph parser Factual[29] to extract visual elements, i.e., objects, attributes and relations, from captions. We match the elements extracted from candidate and ground truth captions through a stop words filtering module and a three-stage matching strategy. Compared with SPICE, CAPTURE adopts a T5-based language model as the parser rather than a PCFG parser, and we design a more capable three-stage core information coupling module to match the parsed results. As illustrated in Figure 1, CAPTURE produces satisfying consistency with human evaluation results, while other metrics do not. Experiments on both the GPT-4-annotated dataset and the human-annotated dataset show that CAPTURE achieves the highest consistency with human or GPT-4 experts, surpassing all traditional caption evaluation metrics and model-based metrics.

With CAPTURE providing reliable evaluation results, we further explore unleashing LVLMs' detail image caption capabilities in a divide-and-conquer paradigm with a given LVLM. No expert annotation is required in our data construction loop. The pipeline is illustrated in Figure 1. We adopt a divide-and-conquer strategy to synthesize high-quality detail image captions: an LVLM is instructed to generate both an overall caption for the image and local captions for salient objects segmented by SAM[23]. We adopt a novel phrase-level filtering strategy to suppress hallucinations, which extracts visual element phrases from captions and filters out those scored low by an open-vocabulary object detection model. Finally, the filtered overall caption and local captions are fed to an LLM to be merged into a high-quality detail image caption. Experiments show that our data construction pipeline produces significantly higher-quality detail captions, and that a simple-yet-effective self-looping strategy can further improve the data quality. Moreover, the synthesized data improves LVLMs' understanding capabilities effectively when incorporated into the training process.

To summarize, the contributions of this work are as follows:

(1) We release a 4,870-case detail image caption benchmark annotated by GPT-4V and Gemini-1.5-Pro for reliable evaluation, accompanied by three model-generated captions and corresponding GPT-4-annotated quality scores for expert judgement consistency evaluation.

(2) We propose a novel detail image caption evaluation metric, CAPTURE, which adopts a T5-based parser to extract visual elements from captions and computes the matching score via a three-stage matching module. Experiments indicate that CAPTURE achieves the highest consistency with human and GPT-4V&Gemini-1.5-Pro judgements among caption metrics, providing reliable detail caption evaluation results without expensive LLM API calls.

(3) We propose a five-stage detail image caption data construction pipeline, which uses a given LVLM and open-source vision and language tools to produce higher-quality detail caption data. Experiments show that our data construction pipeline improves detail caption data quality significantly, and that the data quality can be further improved by self-looping.

2 Related Work

Image caption evaluation.

Early image captioning benchmarks such as COCO[11] and NoCaps[2] consist of precisely annotated captions but contain limited visual information, which makes them outdated for recently released LVLMs with leading performance. Traditional caption evaluation metrics, such as BLEU[43], CIDER[53] and METEOR[4], compute n-gram matching scores and therefore suffer from instability caused by varying writing styles. The model-based metric SPICE[3] extracts visual elements from caption sentences and matches them to obtain evaluation results. CLIP-Score[19], MID[22] and PAC-S[48] borrow the pretrained CLIP[44] model to assess the quality of model-generated image captions. Although producing relatively reliable evaluation results, these metrics can hardly tackle detail caption evaluation because of their outdated backbone model (SPICE) or limited input text length (CLIP-Score).

Detail caption data construction.

A series of works seeks to construct detail caption data for LVLM training. ShareGPT4V[9] and ALLaVA[8] curate detail image caption data annotated by GPT-4V for model training. All-Seeing[55] leverages LLMs to imagine co-occurring visual elements for detail caption construction. GLaMM[45] and ASMv2[56] use open-source suites for dense caption generation, with a focus on the correspondence between local descriptions and image regions. Our proposed data construction pipeline adopts a divide-and-conquer strategy, unleashing LVLMs' detail caption ability by generating and merging local captions. A recent work, Monkey[28], also adopts a zoom-in-and-caption approach, but it uses an outdated local captioner and relies on ChatGPT for caption generation. Compared with Monkey, we use open-source LVLMs and LLMs to synthesize detail caption data and propose a phrase-level filtering strategy. Guided by the proposed benchmark, we also provide an in-depth analysis of the effectiveness of the detail caption construction pipeline.

3 Benchmarking Detail Image Caption

In this section, we elaborate on the construction of the expert judgement data and the workflow of the proposed detail image caption metric.

3.1 Detail Caption Evaluation Datasets

Table 1: Statistics of the detail image caption evaluation benchmarks.

Benchmark | Data source | Annt. expert | Img num | Ref num | Avg len | Uni. 2-gram
COCO_test | COCO[32] | Human | 5000 | 25,010 | 10.59 | 61,448
Nocaps_val | Openimages[24] | Human | 4500 | 45,000 | 11.49 | 116,969
DetailCaps-100 | COCO[32], SAM[23], LAION[49], CC[50], SBU[42] | Human | 100 | 100 | 175.96 | 10,858
DetailCaps-4870 | COCO[32], SAM[23], LAION[49], CC[50], SBU[42], Coyo[6], Flickr[57] | GPT-4V & Gemini-1.5-Pro | 4870 | 9740 | 122.06 | 377,184

To benchmark the detail image caption task reliably and to better evaluate the consistency between each image caption metric and expert evaluation, we construct two expert-annotated datasets for performance evaluation.

For the human evaluation dataset, we randomly sample 100 cases from ShareGPT4V-102k[9]. We first call GPT-4V to generate detail captions, and human experts then remove hallucinations and supplement omitted visual elements. The refined detail image captions are used as the ground truth for evaluation. We prompt three LVLMs with leading detail captioning performance, ShareCaptioner[9], CogVLM[54] and LLaVA-1.5[33], to generate candidate captions. Human experts are instructed to score each caption based on the precision and recall of three types of visual elements: objects, attributes and relations. The overall scores range in [0, 5] and are normalized to [0, 1] for fair expert judgement consistency evaluation of caption metrics.

We further curate a 4,870-case dataset annotated by GPT-4V and Gemini-1.5-Pro for detail caption evaluation. Besides the data sources used in the 100 human-annotated cases, we further incorporate pictures from COYO[6], LAION[49], CC[7] and Flickr[58] for diversity. Captions generated by ShareCaptioner, CogVLM and LLaVA-1.5, together with their annotated caption scores, are provided for each sample. We instruct text-only GPT-4[1] to compare model-generated captions with the GPT-4V and Gemini-1.5-Pro annotated references to obtain evaluation scores. We use text-only GPT-4 for evaluation because of its outstanding instruction following abilities. We refer to Appendix A for details about the prompts used for detail caption generation and GPT-4 evaluation.

We show the statistics of the curated expert judgement datasets in Table 1. Our detail caption evaluation benchmarks contain image samples from various sources, and the reference captions are significantly longer than those of previous benchmarks. It is worth noting that the DetailCaps-4870 benchmark contains 377,184 unique 2-grams in 9,740 reference captions, while Nocaps_val has only 116,969 unique 2-grams across 45,000 references.

3.2 CAPTURE Metric

The CAPTURE metric extracts and matches core visual elements instead of n-gram pieces to obtain evaluation results, suppressing the influence of varying writing styles. We elaborate on the design of CAPTURE in the following parts: visual elements extraction, stop words filtering and visual elements matching. We refer to Appendix B for implementation details of the CAPTURE metric.

Visual elements extraction.

The visual elements extraction module extracts objects, attributes and relations from caption sentences. We adopt the Factual parser[30], a T5-base model with leading performance in text scene graph parsing. Since the Factual parser is trained on a short-caption parsing dataset, we use the NLTK toolkit[5] to split each detail image caption into sentences, which are parsed separately. The parsing results are then lemmatized (with WordNet[39]), deduplicated and merged into the final parsing result.
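A minimal sketch of this extraction step is given below, assuming the Factual parser is available as a text2text model on the HuggingFace hub; the checkpoint name and the tuple output format are assumptions rather than the exact released interface.

```python
# Sketch of visual-element extraction: sentence-split a detail caption, parse each
# sentence with a (assumed) Factual checkpoint, then lemmatize, deduplicate and merge.
import nltk
from nltk.stem import WordNetLemmatizer
from transformers import pipeline

for pkg in ("punkt", "wordnet"):
    nltk.download(pkg, quiet=True)

parser = pipeline("text2text-generation", model="lizhuang144/flan-t5-base-VG-factual-sg")  # illustrative
lemmatize = WordNetLemmatizer().lemmatize

def extract_elements(caption: str):
    """Return (objects, attributes, relations) extracted from a detail caption."""
    objects, attributes, relations = set(), set(), set()
    for sentence in nltk.sent_tokenize(caption):
        graph = parser(sentence)[0]["generated_text"]
        # Assumed output format: tuples such as "( man , wear , hat )" joined by ") , (".
        for tup in graph.strip("() ").split(") , ("):
            parts = [lemmatize(p.strip()) for p in tup.split(",")]
            if len(parts) == 2:                          # (object, attribute)
                objects.add(parts[0]); attributes.add(parts[1])
            elif len(parts) == 3:                        # (subject, relation, object)
                objects.update((parts[0], parts[2])); relations.add(" ".join(parts))
    return objects, attributes, relations
```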

Stop words filtering.

The Factual parser may extract abstract nouns, for example "foreground" and "background", as object elements; these do not correspond to visual elements in the image and are not expected to participate in the matching process. To this end, we curate a stop word list to filter such abstract nouns out of the extracted object elements. We first apply LLaMA2-13B-chat[52] and the Factual parser to the ShareGPT4V-102k dataset for noun extraction, and collect words recalled by the Factual parser but omitted by LLaMA2-13B-chat. We compute the frequency of these words and task human experts with judging whether the most frequent ones have tangible meanings. Finally, 317 high-frequency words are included in the stop word list.
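The candidate mining step can be sketched roughly as follows, assuming the per-caption noun sets produced by both tools have already been computed; the helper name and top-k cutoff are illustrative.

```python
# Rough sketch of stop-word candidate mining: count words recalled only by the Factual
# parser across captions, then hand the most frequent candidates to human review.
from collections import Counter

def stopword_candidates(factual_nouns_per_caption, llm_nouns_per_caption, top_k=500):
    counter = Counter()
    for factual_nouns, llm_nouns in zip(factual_nouns_per_caption, llm_nouns_per_caption):
        for word in set(factual_nouns) - set(llm_nouns):     # recalled by Factual only
            counter[word] += 1
    # Human experts then keep only candidates without tangible visual meaning
    # (e.g., "foreground", "background"); 317 such words form the final list.
    return [word for word, _ in counter.most_common(top_k)]
```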

Visual elements matching.

In this part, we match the extracted visual elements to produce the evaluation result. We implement a three-stage matching strategy that is robust to varying writing styles; an illustration of the matching module is shown in Figure 2. We first match identical visual elements exactly, followed by a synonym matching module: phrases sharing one or more synonyms are considered matched, where WordNet[39] is employed to obtain the synonym set of each visual element. Phrases matched in the exact or synonym matching stage obtain a matching score of 1.0. To deal with the remaining unmatched elements, we further propose a soft matching module, which uses a Sentence BERT[14] model to compute soft matching scores. Specifically, we use Sentence BERT to encode the remaining object, attribute and relation phrases and compute the cosine similarity matrix between ground truth phrase embeddings and candidate ones. The maximum similarity score of each row and column, which lies in [0.0, 1.0), is then added to the exact and synonym matching scores. We then compute the precision, recall and F1 of visual elements based on the matching scores. CAPTURE computes the caption quality score as a weighted sum of the three F1 scores, as illustrated in Figure 2. We set the weights for the visual element types to Object:Attribute:Relation = 5:5:2 by default.
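A minimal sketch of the three matching stages and the weighted F1 score follows; the Sentence-BERT checkpoint and the exact way row and column maxima feed precision and recall are assumptions, not the released implementation.

```python
# Three-stage matching (exact -> synonym -> soft) and weighted-F1 scoring, assuming the
# element lists are already extracted and stop-word filtered.
from nltk.corpus import wordnet                      # requires nltk.download("wordnet")
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")    # stand-in for the Sentence BERT model

def synonyms(phrase):
    return {l.name().replace("_", " ") for s in wordnet.synsets(phrase.replace(" ", "_"))
            for l in s.lemmas()} | {phrase}

def precision_recall(cand, ref):
    """Exact and synonym matches score 1.0; leftovers get soft cosine scores."""
    n_cand, n_ref = len(cand), len(ref)
    if n_cand == 0 or n_ref == 0:
        return 0.0, 0.0
    cand, ref, hard = list(cand), list(ref), 0.0
    for c in cand[:]:                                # stages 1-2: exact / synonym matching
        hit = next((r for r in ref if c == r or synonyms(c) & synonyms(r)), None)
        if hit is not None:
            hard += 1.0
            cand.remove(c); ref.remove(hit)
    soft_p = soft_r = 0.0
    if cand and ref:                                 # stage 3: Sentence-BERT soft matching
        sim = util.cos_sim(encoder.encode(cand), encoder.encode(ref))
        soft_p = float(sim.max(dim=1).values.sum())  # best reference per candidate phrase
        soft_r = float(sim.max(dim=0).values.sum())  # best candidate per reference phrase
    return (hard + soft_p) / n_cand, (hard + soft_r) / n_ref

def capture_score(cand_elems, ref_elems, weights=(5, 5, 2)):   # object : attribute : relation
    """cand_elems / ref_elems: (objects, attributes, relations) tuples of phrase lists."""
    f1s = []
    for c, r in zip(cand_elems, ref_elems):
        p, rec = precision_recall(c, r)
        f1s.append(2 * p * rec / (p + rec) if p + rec else 0.0)
    return sum(w * f for w, f in zip(weights, f1s)) / sum(weights)
```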

4 Improving Detail Image Caption

In this section, we elaborate the design of the proposed detail caption synthesizing pipeline, and introduce how to improve LVLM training with constructed detail caption data.

4.1 Detail Caption Construction

We introduce the proposed divide-and-conquer detail caption construction pipeline in the following five stages.The pipeline is illustrated in the right part of Figure1.

Stage I: Overall caption generation.

We first instruct the given LVLM to generate an overall image caption as the skeleton for high-quality detail caption generation. The overall caption may suffer from hallucinations and omissions, and will be polished in the following stages.

Stage II: Salient visual elements detection.

To locate salient objects for local caption generation, we segment the image with SAM[23] and filter out masks with extremely large or small sizes. Then, we adopt a maximal rectangle algorithm to reduce overlap between the remaining masks. The resulting cropped bounding boxes are regarded as salient visual elements.
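A rough sketch of this stage is given below, assuming SAM's automatic mask generator and approximating the maximal-rectangle overlap reduction with a greedy IoU-based suppression; the thresholds and checkpoint path are illustrative.

```python
# Sketch of Stage II: SAM masks -> area filtering -> greedy overlap reduction -> boxes.
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_l"](checkpoint="sam_vit_l.pth")   # checkpoint path is illustrative
mask_generator = SamAutomaticMaskGenerator(sam)

def iou(a, b):
    """IoU of two XYWH boxes."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    return inter / (a[2] * a[3] + b[2] * b[3] - inter + 1e-8)

def salient_boxes(image, min_frac=0.01, max_frac=0.8, iou_thr=0.5):
    h, w = image.shape[:2]                                      # image: RGB uint8 array
    masks = mask_generator.generate(image)
    # Drop masks whose areas are extremely large or small relative to the image.
    masks = [m for m in masks if min_frac <= m["area"] / (h * w) <= max_frac]
    boxes = sorted((m["bbox"] for m in masks), key=lambda b: -b[2] * b[3])
    kept = []
    for box in boxes:                                           # greedy overlap reduction
        if all(iou(box, k) < iou_thr for k in kept):
            kept.append(box)
    return kept                                                 # each box is cropped and captioned locally
```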

Stage III: Local caption generation.

To produce complementary detailed visual information for the overall caption, we instruct the given LVLM to generate a local caption for each bounding box obtained in Stage II. We limit the length of local captions to no more than twenty words to suppress unexpected hallucinations.

Stage IV: Hallucination filtering.

We propose a novel phrase-level filtering strategy to suppress hallucinations while preserving the recalled visual elements. We first extract visual element phrases from both the overall caption and the local captions with the Factual parser, and filter out those scored lower than 0.01 by Owlv2[40], an open-vocabulary object detection model. Note that removing phrases may leave grammatical errors in the captions; these errors are corrected in the final stage.
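A hedged sketch of this filter is given below, assuming `image` is a PIL image and standard object-detection post-processing; the checkpoint name is illustrative (our pipeline uses the large ensemble variant).

```python
# Phrase-level hallucination filter: keep a phrase only if OWLv2 finds at least one
# detection for it with score >= 0.01.
import torch
from transformers import Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
detector = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

@torch.no_grad()
def keep_grounded_phrases(image, phrases, threshold=0.01):
    if not phrases:
        return []
    inputs = processor(text=[phrases], images=image, return_tensors="pt")
    outputs = detector(**inputs)
    target_sizes = torch.tensor([image.size[::-1]])          # (height, width)
    results = processor.post_process_object_detection(
        outputs=outputs, threshold=threshold, target_sizes=target_sizes)[0]
    detected = set(results["labels"].tolist())               # phrase indices with a box >= threshold
    return [p for i, p in enumerate(phrases) if i in detected]
```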

Stage V: Caption merging.

In this stage, an LLM is instructed to merge the local captions smoothly into the skeleton provided by the overall caption, rather than simply concatenating them.

With local captions providing supplementary visual information and the filtering module removing the accompanying hallucinations, the synthesized detail image caption captures more visual elements while hallucinations are suppressed. Visualized examples are shown in Appendix C.
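Putting the five stages together, the pipeline can be sketched structurally as follows, with the LVLM, LLM and helper components passed in as callables; all helper names and prompts below are illustrative.

```python
# Structural sketch of the five-stage detail caption construction pipeline.
def synthesize_detail_caption(image, lvlm_caption, llm_merge,
                              salient_boxes, crop, extract_phrases,
                              keep_grounded_phrases, drop_phrases):
    # Stage I: overall caption as the skeleton.
    overall = lvlm_caption(image, prompt="Describe the image in detail.")
    # Stage II: salient regions from filtered SAM masks.
    boxes = salient_boxes(image)
    # Stage III: short local captions (at most twenty words) for each cropped region.
    local = [lvlm_caption(crop(image, box),
                          prompt="Describe this image in no more than twenty words.")
             for box in boxes]
    # Stage IV: phrase-level hallucination filtering applied to every caption.
    def filter_caption(caption):
        kept = keep_grounded_phrases(image, extract_phrases(caption))
        return drop_phrases(caption, kept)        # remove phrases that were not kept
    overall, local = filter_caption(overall), [filter_caption(c) for c in local]
    # Stage V: an LLM merges the local captions smoothly into the overall caption.
    return llm_merge(overall, local)
```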

4.2 Improving LVLM Training with Synthesized Detail Caption Data

We further explore enhancing LVLMs' overall understanding performance with self-generated detail caption data. We synthesize detail caption data for images from the ShareGPT4V-102k dataset[9], and then select a proportion of the synthesized captions for model training. Samples with the largest number of visual elements extracted by the Factual parser are selected for their rich visual information. The selected data is incorporated into the SFT dataset to improve overall understanding performance.
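The selection step amounts to ranking by element count; a small sketch with an assumed sample format follows.

```python
# Select the synthesized captions richest in parsed visual elements for SFT.
def select_richest(samples, extract_elements, k=25_000):
    """samples: list of dicts with a "caption" field; extract_elements as sketched above."""
    def n_elements(caption):
        objects, attributes, relations = extract_elements(caption)
        return len(objects) + len(attributes) + len(relations)
    return sorted(samples, key=lambda s: n_elements(s["caption"]), reverse=True)[:k]
```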

5 Experiments

In this section, we introduce the experiment settings and show main experimental results to demonstrate the effectiveness of the proposed detail image caption metric and data construction pipeline.

5.1 Benchmarking Detail Image Caption

5.1.1 Experiment Settings

Datasets.

We conduct experiments on the two expert judgement datasets described in Section 3.1. Each sample in the two datasets contains expert-annotated reference detail captions and expert-annotated quality scores for three SOTA LVLM-generated captions. The statistics of the two datasets are shown in Table 1.

Evaluation protocol.

We evaluate the caption metrics' consistency with expert judgements using four measures: Pearson correlation coefficient (PCC) ρ, coefficient of determination R², Kendall's τ (Kd τ) and Sample τ (Sp τ). PCC reflects the linear correlation between the metric-evaluated scores and the expert-annotated ones. The coefficient of determination evaluates both the linear correlation and the deviation of metric-evaluated score values from expert judgement. Kd τ is computed as the proportion of correctly ordered score pairs among all partial order pairs. Sp τ computes Kd τ for each sample's caption scores independently and uses the average value as the final result. Sp τ's formulation fits LVLMs' caption evaluation process well and is therefore regarded as the most important measure for consistency evaluation.
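Under one plausible reading of these definitions, the four measures can be computed as follows; `per_sample` is assumed to hold, for each image, the metric scores and expert scores of its three candidate captions.

```python
# Consistency measures between a caption metric and expert judgements.
import numpy as np
from scipy.stats import pearsonr, kendalltau

def consistency(metric_scores, expert_scores, per_sample):
    """metric_scores / expert_scores: aligned arrays over all (image, caption) pairs.
    per_sample: list of (metric_scores, expert_scores) pairs, one per image."""
    pcc = pearsonr(metric_scores, expert_scores)[0]
    # Coefficient of determination of metric scores taken as predictions of expert scores.
    expert, metric = np.asarray(expert_scores), np.asarray(metric_scores)
    r2 = 1 - np.sum((expert - metric) ** 2) / np.sum((expert - expert.mean()) ** 2)
    kd_tau = kendalltau(metric_scores, expert_scores)[0]
    # Sample tau: Kendall's tau within each image's captions, averaged over images.
    sp_tau = np.mean([kendalltau(m, e)[0] for m, e in per_sample])
    return pcc, r2, kd_tau, sp_tau
```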

Baselines.

We compare CAPTURE with both rule-based and model-based caption metrics. BLEU-2[43], CIDER[53], ROUGE-L[31] and METEOR[4] are considered as representative rule-based metrics. For model-based metrics, we consider SPICE[3], CLIP-Score[19] and PAC-S[48]. SPICE is built on a PCFG text parser for information extraction, while CLIP-Score and PAC-S borrow the CLIP model to evaluate the alignment between images and text captions. We implement the CLIP-based metrics with OpenCLIP-L/14[21], and truncate the detail caption paragraph for alignment score computation due to the limited input length. We also evaluate the consistency between GPT4-Eval and human judgements on the DetailCaps-100 benchmark.

5.1.2 Main Results

Table 2: Consistency of caption evaluation metrics with expert judgements. Each cell reports PCC ρ (↑) / 1−R² (↓) / Kd τ (↑) / Sp τ (↑).

Metric | DetailCaps-100 | DetailCaps-4870 | Average
Rule-based metrics
BLEU [2002] | 0.2150 / 96.27 / 0.1623 / 0.2163 | 0.3099 / 24.86 / 0.2135 / 0.2812 | 0.2625 / 60.57 / 0.1879 / 0.2488
ROUGE [2004] | 0.2554 / 185.69 / 0.1905 / 0.3321 | 0.3291 / 90.42 / 0.2348 / 0.3303 | 0.2923 / 138.06 / 0.2127 / 0.3312
METEOR [2005] | 0.3643 / 384.58 / 0.2679 / 0.3529 | 0.4386 / 193.46 / 0.3165 / 0.4621 | 0.4015 / 289.02 / 0.2922 / 0.4075
CIDER [2015] | 0.0834 / 1.7e7 / 0.1159 / 0.0564 | 0.1213 / 2.27e7 / 0.0908 / 0.0948 | 0.1024 / 1.99e7 / 0.1034 / 0.0756
Model-based metrics
SPICE [2016] | 0.3580 / 126.60 / 0.2641 / 0.3819 | 0.5155 / 131.10 / 0.3818 / 0.5554 | 0.4368 / 128.85 / 0.3230 / 0.4687
CLIP-Score [2021] | 0.2532 / 48.81 / 0.1807 / 0.2928 | 0.4463 / 16.10 / 0.3039 / 0.4109 | 0.3498 / 32.46 / 0.2423 / 0.3519
PAC-S [2023] | 0.2584 / 62.82 / 0.1833 / 0.2843 | 0.2783 / 18.93 / 0.1795 / 0.2930 | 0.2684 / 40.88 / 0.1814 / 0.2887
CAPTURE | 0.4735 / 11.58 / 0.3688 / 0.6117 | 0.5366 / 4.82 / 0.3956 / 0.5737 | 0.5051 / 8.20 / 0.3822 / 0.5927
GPT4-Eval | 0.5157 / 44.44 / 0.4237 / 0.6120 | - | -

CAPTURE achieves the highest consistency with expert judgements.

As shown in Table 2, the proposed CAPTURE metric improves PCC ρ by 0.0683 (15.6% ↑), lowers 1−R² by 24.26 (74.7% ↓), and improves Kd τ by 0.0592 (18.3% ↑) and Sp τ by 0.1240 (26.4% ↑) over the previous best baselines. The advantages in PCC ρ, Kd τ and Sp τ indicate that the proposed metric performs best in linear correlation with expert judgement and in pair-wise ranking accuracy, showing promising prospects for LVLM-generated detail caption evaluation. Besides, CAPTURE also performs best on the 1−R² measure, indicating that it produces evaluation scores whose values align with expert judgements.

METEOR and SPICE perform the best among rule-based and model-based metrics, respectively.

We attribute METEOR's satisfying performance to its consideration of both precision and recall of n-grams. METEOR also adopts exact, synonym and Porter-stem matching strategies, improving its robustness to varying writing styles. For SPICE, its PCFG parser is more robust on long detail captions than CLIP-based metrics, which suffer from CLIP's limited input text length.

GPT4-Eval achieves the highest consistency with human evaluation on DetailCaps-100 dataset.

This result validates the practice of evaluating caption metrics' consistency against GPT4-Eval results on the larger DetailCaps-4870 dataset. It is also worth noting that CAPTURE's consistency performance is close to that of GPT4-Eval. Moreover, CAPTURE does not require calling expensive LLM APIs, demonstrating its promising prospects in detail caption evaluation.

5.1.3 Analysis

We verify the effectiveness of the design of the CAPTURE metric. Among the consistency measures, Sp τ is the closest to the real detail caption evaluation scenario, and we therefore focus on this measure in the analysis.

Stop words filtering improves sample-level evaluation consistency effectively.

Statistics show that when evaluating candidate captions on the DetailCaps-100 dataset, 28.43% of the extracted object phrases are detected and discarded by the stop words filtering module. As shown in Table 3, removing the stop words filtering module leads to drops in Sp τ on both the DetailCaps-100 and DetailCaps-4870 benchmarks. We attribute the fluctuation in the other consistency measures to the varying number of visual elements discarded by the stop words filtering module across samples.

Soft matching module improves evaluation consistency and the alignment of evaluation score values.

When the soft matching module is removed, CAPTURE suffers a 3.3% drop in Sp τ. It is also worth noting that the 1−R² score deteriorates most significantly. The soft matching strategy handles a variety of phrases with similar meanings, and thus makes up for the deficiency of the exact matching and synonym matching modules when tackling varying writing styles.

The default α, β, γ = 5, 5, 2 setting is a sweet spot for detail caption evaluation.

We vary the scale factor of relation elements γ from 0 (discarding the relation matching score) to 5 (weighting relation F1 equally with object F1 and attribute F1) to verify this judgement. Experiment results show that CAPTURE's performance drops with γ set to either 0 or 5, validating that α, β, γ = 5, 5, 2 is the most suitable setting for CAPTURE's evaluation.

Table 3: Ablation of CAPTURE's stop words filtering, soft matching and visual element weights. Each cell reports PCC ρ / 1−R² / Kd τ / Sp τ.

Metric | DetailCaps-100 | DetailCaps-4870 | Average
CAPTURE | 0.4735 / 11.58 / 0.3688 / 0.6117 | 0.5366 / 4.82 / 0.3956 / 0.5737 | 0.5051 / 8.20 / 0.3822 / 0.5927
- Stop words | 0.4830 / 13.23 / 0.3804 / 0.5947 | 0.5356 / 5.94 / 0.3940 / 0.5730 | 0.5093 / 9.58 / 0.3872 / 0.5838
- Soft matching | 0.4674 / 29.15 / 0.3488 / 0.5770 | 0.5426 / 18.65 / 0.3964 / 0.5685 | 0.5050 / 23.90 / 0.3726 / 0.5728
α, β, γ = 5, 5, 0 | 0.4654 / 9.21 / 0.3642 / 0.5947 | 0.5282 / 3.94 / 0.3941 / 0.5686 | 0.4968 / 6.58 / 0.3792 / 0.5816
α, β, γ = 5, 5, 5 | 0.4651 / 13.75 / 0.3556 / 0.6064 | 0.5270 / 5.65 / 0.3834 / 0.5659 | 0.4960 / 9.70 / 0.3695 / 0.5862

Table 4: Detail caption performance (CAPTURE score) of leading open-source LVLMs on DetailCaps-100 (DC_100) and DetailCaps-4870 (DC_4870).

LVLM | Language model | Detail caption data | Resolution | DC_100 | DC_4870
CogVLM [2023a] | Vicuna-7B | Human Annt. | 490² | 63.01 | 60.20
ShareCaptioner-7B [2023a] | Vicuna-7B | GPT-4V Annt. | 448² | 60.85 | 59.55
LLaVA-1.5-7B [2023a] | Vicuna-7B | Synthesized | 336² | 51.23 | 49.69
LLaVA-1.5-13B [2023a] | Vicuna-13B | Synthesized | 336² | 51.74 | 51.10
LLaVA-NEXT-7B [2024a] | Vicuna-7B | GPT-4V Annt. | 336²×{1-5} | 60.18 | 58.22
LLaVA-NEXT-13B [2024a] | Vicuna-13B | GPT-4V Annt. | 336²×{1-5} | 60.38 | 58.66
LLaVA-NEXT-34B [2024a] | Hermes-2-Yi-34B | GPT-4V Annt. | 336²×{1-5} | 60.60 | 58.88
Mini-Gemini-HD-7B [2024] | Vicuna-7B | GPT-4V Annt. | 336²×5 | 59.51 | 57.58
Mini-Gemini-HD-13B [2024] | Vicuna-13B | GPT-4V Annt. | 336²×5 | 60.51 | 58.39
Intern-XComposerV2 [2024] | Vicuna-7B | GPT-4V Annt. | 490² | 61.43 | 59.92
InternVL-V1.2-PLUS-40B [2023b] | Hermes-2-Yi-34B | GPT-4V Annt. | 448² | 61.61 | 60.75
InternVL-V1.5-26B [2024c] | InternLM-20B | GPT-4V Annt. | 448²×{1-41} | 65.62 | 62.36

5.1.4 Evaluating LVLMs with Leading Performance

With the DetailCaps benchmarks and CAPTURE evaluating LVLMs' detail captioning performance reliably, we review the detail caption capabilities of 12 open-source LVLMs with leading performance. The evaluation results on DetailCaps-100 and DetailCaps-4870 are shown in Table 4. Among all models, InternVL-V1.5[13] achieves the best detail image caption performance by a large margin. It can also be observed from the LLaVA-1.5, LLaVA-NEXT and Mini-Gemini[26] series that detail captioning ability improves consistently as model size increases. In addition, a common observation is that training with detail caption data generated by GPT-4V leads to better detail captioning performance. Among these LVLMs, CogVLM achieves the second highest CAPTURE score with high-quality human-refined detail image caption data.

5.2 Improving Detail Image Caption

5.2.1 Experiment Settings

We use the ShareGPT4V-102k dataset for detail caption data construction and implement two pipelines with different model sizes. For the 7B pipeline, we use SAM-ViT-L[23] for segmentation, LLaVA-1.5-7B for overall and local caption generation, OwlV2-large-ensemble[40] for hallucination filtering and LLaMA-2-7B-Chat for caption merging. For the 13B pipeline, we use SAM-ViT-H, LLaVA-1.5-13B and LLaMA-2-13B-Chat instead. We validate the effectiveness of the proposed data construction pipeline with four LVLMs with leading performance: LLaVA-1.5-7B, LLaVA-1.5-13B, LLaVA-NEXT-7B and Mini-Gemini-7B-HD.

Table 5: Quality of self-generated (Self) and pipeline-synthesized (Synthesized) detail captions. Each cell reports CAPTURE / Precision / Recall.

Caption | DetailCaps-100 | DetailCaps-4870 | Average
LLaVA-1.5-7B
Self | 51.23 / 65.24 / 43.31 | 51.27 / 64.61 / 43.79 | 51.25 / 64.92 / 43.55
Synthesized | 57.11 / 64.12 / 52.08 | 56.18 / 63.02 / 51.48 | 56.64 / 63.57 / 51.78
LLaVA-1.5-13B
Self | 51.76 / 65.01 / 44.10 | 51.45 / 65.04 / 43.91 | 51.61 / 65.03 / 44.00
Synthesized | 57.36 / 62.07 / 53.52 | 56.83 / 61.61 / 53.30 | 57.09 / 61.84 / 53.41
LLaVA-NEXT-7B
Self | 61.48 / 65.60 / 57.82 | 59.86 / 64.16 / 56.61 | 60.67 / 64.88 / 57.22
Synthesized | 62.24 / 64.49 / 60.07 | 60.10 / 62.36 / 58.60 | 61.17 / 63.42 / 59.34
Mini-Gemini-7B-HD
Self | 59.51 / 61.99 / 57.28 | 57.68 / 60.24 / 55.89 | 58.59 / 61.12 / 56.59
Synthesized | 60.44 / 60.98 / 59.78 | 58.64 / 58.76 / 59.17 | 59.54 / 59.87 / 59.48

5.2.2 Main Results

Our detail caption synthesizing pipeline improves LVLM-generated caption quality effectively.

As shown in Table 5, for LLaVA-1.5-7B and LLaVA-1.5-13B, the detail caption quality is improved by a large margin in terms of CAPTURE score. For more advanced LVLMs such as LLaVA-NEXT and Mini-Gemini-HD, the advantage of the proposed pipeline persists, demonstrating the effectiveness of our data synthesizing strategy. We attribute the smaller improvement for LLaVA-NEXT and Mini-Gemini-HD to the limited capabilities of the other vision and language tools, which become the bottleneck compared with LVLMs trained on expert-annotated detail caption data.

Our pipeline enhances recall of visual elements effectively with little precision drop.

As shown in Table 5, this tendency is observed across all four LVLMs, indicating that the divide-and-conquer strategy effectively improves the models' perception of detailed visual elements. Thanks to the hallucination filtering module, the drop in precision is limited, so an improvement in CAPTURE score is observed across all LVLMs.

5.2.3 Analysis

Table 6: Ablation of the hallucination filtering strategy and self-looping results. Each cell reports CAPTURE / Precision / Recall.

Caption | DetailCaps-100 | DetailCaps-4870 | Average
Ablation
Self | 51.23 / 65.24 / 43.31 | 51.27 / 64.61 / 43.79 | 51.25 / 64.92 / 43.55
Synthesized | 57.11 / 66.31 / 52.16 | 56.18 / 63.02 / 51.48 | 56.64 / 64.67 / 51.82
- filter | 56.78 / 65.16 / 53.26 | 56.01 / 62.80 / 51.51 | 56.39 / 63.98 / 52.38
vqa filter | 56.44 / 63.95 / 51.11 | 55.85 / 62.80 / 51.12 | 56.14 / 63.38 / 51.11
filter local | 56.75 / 63.87 / 51.61 | 56.24 / 62.74 / 51.90 | 56.50 / 63.30 / 51.75
Self-looping
Self | 51.23 / 65.24 / 43.31 | 51.27 / 64.61 / 43.79 | 51.25 / 64.92 / 43.55
loop1 | 51.91 / 63.48 / 45.02 | 52.52 / 63.76 / 45.81 | 52.22 / 63.62 / 45.42
loop2 | 52.50 / 63.43 / 45.66 | 52.53 / 62.57 / 46.42 | 52.52 / 63.00 / 46.04
loop3 | 52.89 / 62.45 / 46.86 | 52.86 / 61.83 / 47.25 | 52.88 / 62.14 / 47.05
loop4 | 54.02 / 62.24 / 48.45 | 54.38 / 61.68 / 49.50 | 54.20 / 61.96 / 48.98

Our phrase-level hallucination filtering strategy achieves the best performance.

As shown in Table 6, when the filtering module is removed (- filter), the CAPTURE score drops. We also compare our filtering strategy with the alternatives used in Monkey[28]. For VQA filtering, we use the LVLM to check whether a visual element phrase exists in the image. For local caption filtering, we filter out hallucinated local caption sentences rather than extracted phrases. Experiment results show that both alternatives lead to lower CAPTURE scores, demonstrating the effectiveness of the proposed phrase-level filtering strategy.

LVLM's detail caption ability can be improved via self-looping.

We adopt LLaVA-1.5-7B as the backbone LVLM and synthesize detail caption data for its training. In each loop, we rerun the SFT stage of LLaVA-1.5-7B from a pretrained checkpoint (without any SFT), with 25k self-annotated detail caption samples incorporated into the training data. Experiment results are shown in Table 6. The model's detail captioning ability keeps improving over the four loops, showing a promising self-evolving trend in detail captioning performance.

5.2.4 Improving LVLM Training with Synthesized Detail Caption Data

Experiment Settings.

We follow the LLaVA-1.5[33] pipeline for model training. The vision-language projector is trained with 558k short caption samples and a batch size of 128 during pretraining, and all parameters except the vision module are trained with 665k visual instruction tuning samples and a batch size of 256 during SFT. We train the model with the AdamW optimizer, using a 1e-4 pretraining learning rate and a 2e-5 SFT learning rate. We add 25k detail caption samples to the SFT stage for the 7B model and 50k for the 13B model due to its larger capacity. In our experiments, pretraining takes 24 GPU hours and SFT takes 88 GPU hours on Nvidia A100. We use MME[17], MMMU[60], MMStar[10], GQA[20], VizWiz[18], POPE[27] and the proposed DetailCaps benchmarks to evaluate natural-scene visual understanding. RefCOCOg[38] is a referring expression comprehension task that evaluates detailed understanding capability. OCRBench[37] and DocVQA[51] are selected to evaluate performance in text-heavy scenarios. For baselines, we report our reproduced results rather than the published ones for fair comparison.

Table 7: LVLM understanding performance after adding self-generated (Self) or synthesized (Syn) detail caption (DC) data to the SFT stage.

DC Data | MME_p | MME_c | MMMU_v | MMStar | GQA | VizWiz | POPE | RefCOCO_g | OCR_Bench | VQA_Doc | DC_100 | DC_4870 | Win
LLaVA-1.5-7B
Base | 1487.1 | 260.4 | 34.6 | 33.33 | 62.86 | 53.70 | 86.22 | 72.16 | 316 | 28.75 | 51.26 | 51.45 | -
+ Self 25k | 1499.3 | 258.6 | 36.3 | 33.40 | 62.64 | 54.90 | 86.84 | 72.75 | 316 | 29.26 | 51.49 | 51.83 | 10/12
+ Syn 25k | 1523.2 | 257.1 | 37.3 | 33.53 | 62.86 | 56.88 | 87.08 | 72.61 | 321 | 30.01 | 51.91 | 52.65 | 11/12
LLaVA-1.5-13B
Base | 1553.4 | 267.1 | 34.3 | 34.80 | 63.36 | 58.35 | 85.90 | 74.51 | 331 | 30.60 | 51.96 | 52.05 | -
+ Self 50k | 1543.4 | 286.8 | 34.3 | 35.40 | 63.53 | 59.08 | 86.17 | 74.63 | 331 | 30.66 | 52.62 | 52.85 | 11/12
+ Syn 50k | 1564.0 | 286.8 | 34.3 | 35.27 | 63.56 | 58.56 | 86.28 | 74.65 | 333 | 30.79 | 52.56 | 52.91 | 12/12

Synthesized detail caption data improves LVLM’s overall understanding performance effectively.

As shown in Table 7, even though we only add a small amount of synthesized high-quality detail caption data in the SFT stage (25k samples for the 7B model and 50k for the 13B model), performance improvements are observed across a series of visual understanding benchmarks, demonstrating the effectiveness of enhancing LVLMs' overall understanding capabilities with synthesized detail caption data.

Directly generated detail caption data also improves LVLM’s overall understanding performance.

As shown in Table 7, training with directly generated detail caption data also leads to an overall performance improvement. Although the improvement is smaller than that from synthesized detail caption data, this observation validates the importance of using detail caption data for model training, even when the data is generated directly by the model itself.

Models' benchmark scores correlate positively with their detail caption performance.

We observe a positive correlation between LVLMs’ benchmark scores (win rates) and their performance in detail caption tasks.This observation validates the importance of detail image captioning task and the feasibility of enhancing LVLM’s overall visual understanding abilities by improving its detail caption ability with synthesized high-quality caption data.

6 Limitations and Future Work

The proposed detail image caption evaluation metric achieves outstanding consistency with human evaluation on the curated benchmarks. However, although two powerful experts are adopted for evaluation dataset construction, the reference captions may not be perfect. Human refinement and more reference captions will be incorporated into the detail caption benchmark in our future work. For the data construction pipeline, we observe a diminishing effect as the backbone LVLM becomes stronger. For example, LVLMs like LLaVA-NEXT and Mini-Gemini use GPT-4V-annotated detail caption data for training, so the advantage of the proposed pipeline may be limited by the weaker capabilities of the other vision and language tools used in the pipeline. We will seek to further improve LVLMs' detail captioning abilities with more powerful and scalable vision and language suites in future work.

7 Conclusions

In this work, we analyze the shortcomings of existing image caption benchmarks for LVLM evaluation and curate high-quality expert-annotated datasets for detail caption evaluation. We also propose a novel detail image caption metric, CAPTURE, which extracts visual elements from detail captions and matches them through three stages to produce evaluation results. Experiments show that CAPTURE achieves the highest consistency with expert judgements, and ablation studies demonstrate the effectiveness of the stop words filtering module, the three-stage matching module and the default weighting of the different types of visual elements. Guided by the proposed evaluation, we further seek to unleash LVLMs' detail image captioning ability with a divide-and-conquer caption construction pipeline powered by open-source vision and language tools. Experiments show that the proposed pipeline improves LVLM-annotated detail caption data quality significantly, and that the quality can be further improved via self-looping. Ablation studies validate the effectiveness of the pipeline design.

References

  • Achiam etal. [2023]Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, FlorenciaLeoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, etal.Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023.
  • Agrawal etal. [2019]Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson.Nocaps: Novel object captioning at scale.In Proceedings of the IEEE/CVF international conference on computer vision, pages 8948–8957, 2019.
  • Anderson etal. [2016]Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould.Spice: Semantic propositional image caption evaluation.In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14, pages 382–398. Springer, 2016.
  • Banerjee and Lavie [2005]Satanjeev Banerjee and Alon Lavie.Meteor: An automatic metric for mt evaluation with improved correlation with human judgments.In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72, 2005.
  • Bird [2006]Steven Bird.Nltk: the natural language toolkit.In Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions, pages 69–72, 2006.
  • Byeon etal. [2022]Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim.Coyo-700m: Image-text pair dataset.https://github.com/kakaobrain/coyo-dataset, 2022.
  • Changpinyo etal. [2021]Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut.Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts.In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3558–3568, 2021.
  • Chen etal. [2024a]GuimingHardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, and Benyou Wang.Allava: Harnessing gpt4v-synthesized data for a lite vision-language model.arXiv preprint arXiv:2402.11684, 2024a.
  • Chen etal. [2023a]Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin.Sharegpt4v: Improving large multi-modal models with better captions.arXiv preprint arXiv:2311.12793, 2023a.
  • Chen etal. [2024b]Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, YuQiao, Dahua Lin, etal.Are we on the right way for evaluating large vision-language models?arXiv preprint arXiv:2403.20330, 2024b.
  • Chen etal. [2015]Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and CLawrence Zitnick.Microsoft coco captions: Data collection and evaluation server.arXiv preprint arXiv:1504.00325, 2015.
  • Chen etal. [2023b]Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, YuQiao, and Jifeng Dai.Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks.arXiv preprint arXiv:2312.14238, 2023b.
  • Chen etal. [2024c]Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, etal.How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.arXiv preprint arXiv:2404.16821, 2024c.
  • Devlin etal. [2018]Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.Bert: Pre-training of deep bidirectional transformers for language understanding.arXiv preprint arXiv:1810.04805, 2018.
  • Dong etal. [2024]Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, etal.Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model.arXiv preprint arXiv:2401.16420, 2024.
  • Fu etal. [2023a]Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, XuLin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, KeLi, Xing Sun, and Rongrong Ji.Mme: A comprehensive evaluation benchmark for multimodal large language models.ArXiv, abs/2306.13394, 2023a.URL https://api.semanticscholar.org/CorpusID:259243928.
  • Fu etal. [2023b]Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, XuLin, Jinrui Yang, Xiawu Zheng, KeLi, Xing Sun, Yunsheng Wu, and Rongrong Ji.Mme: A comprehensive evaluation benchmark for multimodal large language models.arXiv preprint arXiv:2306.13394, 2023b.
  • Gurari etal. [2018]Danna Gurari, Qing Li, AbigaleJ Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and JeffreyP Bigham.Vizwiz grand challenge: Answering visual questions from blind people.In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617, 2018.
  • Hessel etal. [2021]Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan LeBras, and Yejin Choi.CLIPScore: A reference-free evaluation metric for image captioning.In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7514–7528, 2021.
  • Hudson and Manning [2019]DrewA Hudson and ChristopherD Manning.Gqa: A new dataset for real-world visual reasoning and compositional question answering.In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019.
  • Karkkainen and Joo [2021]Kimmo Karkkainen and Jungseock Joo.Fairface: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation.In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 1548–1558, 2021.
  • Kim etal. [2022]Jin-Hwa Kim, Yunji Kim, Jiyoung Lee, KangMin Yoo, and Sang-Woo Lee.Mutual information divergence: A unified metric for multimodal generative models.Advances in Neural Information Processing Systems, 35:35072–35086, 2022.
  • Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
  • Krasin et al. [2017] Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Andreas Veit, et al. OpenImages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages, 2(3):18, 2017.
  • Li et al. [2023a] Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. ArXiv, abs/2307.16125, 2023a. URL https://api.semanticscholar.org/CorpusID:260334888.
  • Li et al. [2024] Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-gemini: Mining the potential of multi-modality vision language models. arXiv preprint arXiv:2403.18814, 2024.
  • Li et al. [2023b] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023b.
  • Li et al. [2023c] Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, and Xiang Bai. Monkey: Image resolution and text label are important things for large multi-modal models. arXiv preprint arXiv:2311.06607, 2023c.
  • Li et al. [2023d] Zhuang Li, Yuyang Chai, Terry Zhuo Yue, Lizhen Qu, Gholamreza Haffari, Fei Li, Donghong Ji, and Quan Hung Tran. Factual: A benchmark for faithful and consistent textual scene graph parsing. arXiv preprint arXiv:2305.17497, 2023d.
  • Li et al. [2023e] Zhuang Li, Yuyang Chai, Terry Yue Zhuo, Lizhen Qu, Gholamreza Haffari, Fei Li, Donghong Ji, and Quan Hung Tran. FACTUAL: A benchmark for faithful and consistent textual scene graph parsing. In Findings of the Association for Computational Linguistics: ACL 2023, pages 6377–6390, Toronto, Canada, July 2023e. Association for Computational Linguistics. URL https://aclanthology.org/2023.findings-acl.398.
  • Lin [2004] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004.
  • Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  • Liu et al. [2023a] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023a.
  • Liu et al. [2024a] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024a. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/.
  • Liu et al. [2024b] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36, 2024b.
  • Liu et al. [2023b] Yuanzhan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all-around player? ArXiv, abs/2307.06281, 2023b. URL https://api.semanticscholar.org/CorpusID:259837088.
  • Liu et al. [2023c] Yuliang Liu, Zhang Li, Hongliang Li, Wenwen Yu, Mingxin Huang, Dezhi Peng, Mingyu Liu, Mingrui Chen, Chunyuan Li, Lianwen Jin, et al. On the hidden mystery of ocr in large multimodal models. arXiv preprint arXiv:2305.07895, 2023c.
  • Mao et al. [2016] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, 2016.
  • Miller [1995] George A. Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41, 1995.
  • Minderer et al. [2024] Matthias Minderer, Alexey Gritsenko, and Neil Houlsby. Scaling open-vocabulary object detection. Advances in Neural Information Processing Systems, 36, 2024.
  • OpenAI [2023] OpenAI. Gpt-4v(ision) system card, 2023. URL https://cdn.openai.com/papers/GPTV_System_Card.pdf.
  • Ordonez et al. [2011] Vicente Ordonez, Girish Kulkarni, and Tamara Berg. Im2text: Describing images using 1 million captioned photographs. Advances in Neural Information Processing Systems, 24, 2011.
  • Papineni et al. [2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  • Rasheed et al. [2023] Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S. Khan. Glamm: Pixel grounding large multimodal model. arXiv preprint arXiv:2311.03356, 2023.
  • Reid et al. [2024] Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-Baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
  • Reimers and Gurevych [2019] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), November 2019.
  • Sarto et al. [2023] Sara Sarto, Manuele Barraco, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Positive-augmented contrastive learning for image and video captioning evaluation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6914–6924, 2023.
  • Schuhmann et al. [2021] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
  • Sharma et al. [2018] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
  • Tito et al. [2021] Rubèn Tito, Dimosthenis Karatzas, and Ernest Valveny. Document collection visual question answering. In Document Analysis and Recognition–ICDAR 2021: 16th International Conference, Lausanne, Switzerland, September 5–10, 2021, Proceedings, Part II 16, pages 778–792. Springer, 2021.
  • Touvron et al. [2023] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • Vedantam et al. [2015] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575, 2015.
  • Wang et al. [2023a] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, and Jie Tang. Cogvlm: Visual expert for pretrained language models. ArXiv, abs/2311.03079, 2023a. URL https://api.semanticscholar.org/CorpusID:265034288.
  • Wang et al. [2023b] Weiyun Wang, Min Shi, Qingyun Li, Wenhai Wang, Zhenhang Huang, Linjie Xing, Zhe Chen, Hao Li, Xizhou Zhu, Zhiguo Cao, et al. The all-seeing project: Towards panoptic visual recognition and understanding of the open world. In The Twelfth International Conference on Learning Representations, 2023b.
  • Wang et al. [2024] Weiyun Wang, Yiming Ren, Haowen Luo, Tiantong Li, Chenxiang Yan, Zhe Chen, Wenhai Wang, Qingyun Li, Lewei Lu, Xizhou Zhu, et al. The all-seeing project v2: Towards general relation comprehension of the open world. arXiv preprint arXiv:2402.19474, 2024.
  • Young et al. [2014a] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014a. doi: 10.1162/tacl_a_00166. URL https://aclanthology.org/Q14-1006.
  • Young et al. [2014b] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014b.
  • Yu et al. [2023] Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. ArXiv, abs/2308.02490, 2023. URL https://api.semanticscholar.org/CorpusID:260611572.
  • Yue et al. [2024] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of CVPR, 2024.

Appendix A Prompt Templates for Detail Caption Benchmark Curation

Prompt for GPT-4 evaluation score generation.

To verify the effect of the proposed CAPTURE metric on a larger evaluation set, we use GPT-4[1] instead of human annotators for evaluation. To better align with human preferences, we manually construct three in-context learning cases, as shown in Figure 3. Each case contains a ground truth caption and three candidate captions, together with the corresponding human evaluation results, including the relative ranking and the absolute scores, as references. Finally, the current ground truth and candidate captions to be evaluated are appended in the same format, prompting GPT-4 to output the corresponding evaluation results. We select the outputs of LLaVA-1.5[35], CogVLM[54] and ShareCaptioner[9] as the three candidate captions for evaluation.
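For illustration only, the sketch below shows how such an evaluation prompt might be assembled. The field names, wording and case structure are placeholders; the actual templates are the ones shown in Figure 3.

```python
# Illustrative sketch: assemble a GPT-4 evaluation prompt from three
# in-context cases plus the captions to be evaluated. All wording and
# dictionary keys below are placeholders, not the paper's exact template.

def build_evaluation_prompt(icl_cases, gt_caption, candidate_captions):
    blocks = []
    for case in icl_cases:  # three manually constructed cases
        blocks.append(
            "Ground truth caption:\n{gt}\n"
            "Candidate captions:\n{cands}\n"
            "Human evaluation (ranking and scores):\n{judgement}\n".format(
                gt=case["gt"],
                cands="\n".join(f"({i + 1}) {c}" for i, c in enumerate(case["candidates"])),
                judgement=case["judgement"],
            )
        )
    # The current captions are appended in the same format, leaving the
    # evaluation field for GPT-4 to fill in.
    blocks.append(
        "Ground truth caption:\n{gt}\n"
        "Candidate captions:\n{cands}\n"
        "Human evaluation (ranking and scores):".format(
            gt=gt_caption,
            cands="\n".join(f"({i + 1}) {c}" for i, c in enumerate(candidate_captions)),
        )
    )
    return "\n\n".join(blocks)
```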

[Figure 3: Prompt for GPT-4 evaluation score generation with three in-context learning cases.]
Prompt for detail caption generation.

When generating detail captions, we use multiple different prompts for GPT-4V[41] to obtain diverse captions, as shown in Figure 4. For Gemini-1.5-Pro[46], we find that the model is more likely to output short captions when the prompt does not specify the expected output length. We therefore use a single prompt with a word limit for generation.

[Figure 4: Prompt templates for detail caption generation with GPT-4V and Gemini-1.5-Pro.]

Appendix B Implementation Details for CAPTURE Metric

Core information extraction.

The core information extraction module extracts objects, attributes and relations from a given caption for the following matching modules. We adopt a SOTA text scene graph parser, the Factual parser[29], as the backbone model. The Factual parser is a T5-base model trained on a human-annotated scene graph parsing dataset. It takes a short caption paragraph as input and produces the objects, attributes and relations appearing in the caption. Since the Factual parser is trained on a short-caption parsing dataset, its performance deteriorates severely when given detail image captions. To solve this problem, we first use the NLTK toolkit[5] to cut a detail image caption into short paragraphs, and apply the Factual parser to each paragraph to obtain a list of parsing results. The parsing results are then merged into a larger scene graph based on the following rules: (1) all nouns and adjectives are lemmatized with WordNet[39]; (2) duplicated objects are merged into one element, as are their corresponding attributes; (3) attributes describing two or more merged objects are deduplicated; (4) duplicated relations are merged into one element. In this way, we obtain a large scene graph for each caption with duplicated elements removed. The scene graph is then used to compute the final matching score.
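A minimal sketch of this extraction step is given below, assuming the Factual parser is wrapped behind a `parse_short_caption` callable whose output format is our own simplification; NLTK handles sentence splitting and WordNet lemmatization.

```python
# Sketch of the extraction step. Requires nltk.download("punkt") and
# nltk.download("wordnet"); `parse_short_caption` is an assumed wrapper
# around the Factual parser, returning a dict per sentence.
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def lemma(word, pos="n"):
    return lemmatizer.lemmatize(word.lower(), pos=pos)

def extract_scene_graph(detail_caption, parse_short_caption):
    """Split a detail caption into sentences, parse each one, and merge the
    per-sentence scene graphs with duplicates removed (rules 1-4 above)."""
    objects, attributes, relations = set(), set(), set()
    for sentence in nltk.sent_tokenize(detail_caption):
        # Assumed output: {"objects": [...], "attributes": {obj: [...]},
        #                  "relations": [(subj, pred, obj), ...]}
        parsed = parse_short_caption(sentence)
        for obj in parsed["objects"]:
            objects.add(lemma(obj))                                 # rules (1)-(2)
        for obj, attrs in parsed["attributes"].items():
            for attr in attrs:
                attributes.add((lemma(obj), lemma(attr, pos="a")))  # rule (3)
        for subj, pred, obj in parsed["relations"]:
            relations.add((lemma(subj), pred, lemma(obj)))          # rule (4)
    return objects, attributes, relations
```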

Stop words filtering.

Although it yields relatively satisfying parsing results, the Factual parser struggles to discriminate concrete nouns from abstract ones, which are not expected to participate in the following matching process. For example, in the caption "Two white sheep are enjoying the moment", "sheep" refers to a perceptible element in the image, while "moment" has no tangible referent. We filter out abstract nouns with a stop word list: once an object in the parsing results appears in the stop word list, it does not participate in the object element matching process.

To construct the stop word list, we apply LLaMA2-13b-chat[52] and the Factual parser to the ShareGPT4V-102k dataset for noun extraction, respectively. We observe that LLaMA may omit a proportion of the objects appearing in a caption, but the concrete nouns it extracts demonstrate impressive precision. Based on this observation, we collect words recalled by the Factual parser but omitted by LLaMA, and compute their frequencies. Human experts then judge whether the most frequent words are concrete or abstract nouns. Finally, the 500 most frequent abstract nouns are curated as the stop word list.
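The bookkeeping behind this construction can be sketched as follows; the LLaMA and Factual noun-extraction calls are abstracted away as hypothetical callables (`llama_nouns_fn`, `factual_nouns_fn`), and only the frequency counting is shown.

```python
# Sketch of stop-word-list candidate collection. The two extraction
# functions are assumptions standing in for LLaMA2-13b-chat and the
# Factual parser; human experts make the final concrete/abstract call.
from collections import Counter

def collect_stop_word_candidates(captions, factual_nouns_fn, llama_nouns_fn, top_k=500):
    counter = Counter()
    for caption in captions:
        factual_nouns = set(factual_nouns_fn(caption))
        llama_nouns = set(llama_nouns_fn(caption))
        # Nouns recalled by the Factual parser but omitted by LLaMA.
        counter.update(factual_nouns - llama_nouns)
    # The most frequent candidates are then judged by human experts;
    # the curated abstract nouns form the stop word list.
    return [word for word, _ in counter.most_common(top_k)]
```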

It is also worth noting that, although it yields relatively satisfying parsing results, the Factual parser struggles with cross-sentence pronoun references. Given ambiguous pronoun references, it may generate objects that are not contained in the caption. To tackle this problem, we further check whether each parsed object appears in the caption, and filter out unmatched objects along with their corresponding attributes and relations.
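A minimal sketch of this appearance check is below; the plain substring test is a simplification of whatever matching the pipeline actually applies.

```python
# Drop parsed objects that never occur in the caption text (typically
# artifacts of ambiguous pronoun references), together with their
# attributes and relations. Substring matching is a simplification.
def filter_unmatched_objects(caption, objects, attributes, relations):
    text = caption.lower()
    kept = {obj for obj in objects if obj in text}
    attributes = {(obj, attr) for obj, attr in attributes if obj in kept}
    relations = {(s, p, o) for s, p, o in relations if s in kept and o in kept}
    return kept, attributes, relations
```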

Core information matching.

After extracting and filtering the core information of both the ground truth detail caption and the candidate caption, the extracted elements are matched to produce the final evaluation result. Intuitively, identical object, attribute or relation elements are matched. However, due to the diverse writing styles of LVLMs, the same element can be expressed in various ways, and an exact matching strategy fails to handle such cases. To solve this problem, we add a synonym matching module after exact matching to match elements with similar meanings. We employ WordNet to obtain the synonym sets of the candidate element and the ground truth element, and match them if their synonym sets overlap. Matched candidate objects, attributes and relations are formulated as:

$$cand_{type}^{match} = cand_{type}^{ex} \cup cand_{type}^{syn}, \tag{1}$$

where $type \in \{obj, attr, rel\}$. $cand_{type}^{ex}$ and $cand_{type}^{syn}$ stand for exactly matched and synonym-matched candidate phrases, respectively. Matched ground truth elements are formulated in the same way as $gt_{obj}^{match}$, $gt_{attr}^{match}$ and $gt_{rel}^{match}$.
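The exact and synonym matching stages can be sketched for a single element type as follows, using WordNet synsets; multi-word phrases and part-of-speech handling are simplified.

```python
# Sketch of exact + synonym matching for one element type.
# Requires nltk.download("wordnet"); single-word elements are assumed.
from nltk.corpus import wordnet as wn

def synonym_equal(a, b):
    """Two phrases are synonym-matched if their WordNet synonym sets overlap."""
    syn_a = {l.name().lower() for s in wn.synsets(a) for l in s.lemmas()} | {a.lower()}
    syn_b = {l.name().lower() for s in wn.synsets(b) for l in s.lemmas()} | {b.lower()}
    return bool(syn_a & syn_b)

def match_elements(cand_elems, gt_elems):
    cand_exact = {c for c in cand_elems if c in gt_elems}                    # exact matching
    cand_syn = {c for c in cand_elems - cand_exact
                if any(synonym_equal(c, g) for g in gt_elems)}               # synonym matching
    cand_match = cand_exact | cand_syn                                       # Eq. (1)
    gt_match = {g for g in gt_elems
                if g in cand_elems or any(synonym_equal(g, c) for c in cand_elems)}
    return cand_match, gt_match
```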

Exact matching and synonym matching handle most cases, but still fail to cover all core information extracted from captions written in diverse styles. To this end, we propose a soft matching strategy, which uses a Sentence BERT[47] model to encode the remaining object, attribute or relation phrases and computes a matching score in $[0, 1)$ for the remaining unmatched phrases. Let $cand_{type}^{rm}$ be the unmatched candidate phrases and $gt_{type}^{rm}$ the unmatched ground truth phrases; their similarity matrix $S_{type}^{rm} \in \mathbf{R}^{|cand_{type}^{rm}| \times |gt_{type}^{rm}|}$ is calculated as:

$$S_{type}^{rm} = \phi(cand_{type}^{rm}) \times \phi(gt_{type}^{rm})^{T}, \tag{2}$$

where $\phi(\cdot)$ denotes the Sentence BERT model. We further compute the matching scores of $cand_{type}^{rm}$ and $gt_{type}^{rm}$ as follows:

$$cand\_match_{type}^{rm}[i] = \max_{j=1,2,\dots,|gt_{type}^{rm}|} S_{type}^{rm}[i,j], \qquad gt\_match_{type}^{rm}[j] = \max_{i=1,2,\dots,|cand_{type}^{rm}|} S_{type}^{rm}[i,j]. \tag{3}$$

$cand\_match_{type}^{rm}$ and $gt\_match_{type}^{rm}$ are then used as a complement to the exactly matched and synonym-matched relations.
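A sketch of this soft matching stage is shown below, assuming the sentence-transformers package with normalized embeddings; the checkpoint name is a placeholder rather than the one used in the paper.

```python
# Sketch of the soft matching stage (Eqs. 2-3). The checkpoint name is an
# assumption; cosine similarity is obtained via normalized embeddings.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder checkpoint

def soft_match_scores(cand_rm, gt_rm):
    """Per-phrase soft matching scores for remaining unmatched candidate
    and ground truth phrases."""
    if not cand_rm or not gt_rm:
        return np.zeros(len(cand_rm)), np.zeros(len(gt_rm))
    cand_emb = encoder.encode(cand_rm, normalize_embeddings=True)
    gt_emb = encoder.encode(gt_rm, normalize_embeddings=True)
    sim = cand_emb @ gt_emb.T                  # Eq. (2): similarity matrix
    return sim.max(axis=1), sim.max(axis=0)    # Eq. (3): row/column maxima
```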

After obtaining the matching results, we compute precision and recall for each type of core information. Object precision and recall are computed as:

$$precision_{type} = \frac{|cand_{type}^{match}|}{|cand_{type}|}, \qquad recall_{type} = \frac{|gt_{type}^{match}|}{|gt_{type}|}. \tag{4}$$

Attribute precision and recall are computed in the same way. For relation elements, the candidate matching scores and ground truth matching scores are counted separately due to the introduction of soft matching:

$$precision_{type} = \frac{|cand_{type}^{match}| + \frac{\sum cand\_match_{type}^{rm}}{|cand\_match_{type}^{rm}|}}{|cand_{type}|}, \qquad recall_{type} = \frac{|gt_{type}^{match}| + \frac{\sum gt\_match_{type}^{rm}}{|gt\_match_{type}^{rm}|}}{|gt_{type}|}. \tag{5}$$

Finally, the CAPTURE metric takes the precision and recall of all three types of core information into consideration and produces the final evaluation result as:

$$\mathrm{CAPTURE} = \frac{\alpha F1_{obj} + \beta F1_{attr} + \gamma F1_{rel}}{\alpha + \beta + \gamma}, \tag{6}$$

where $\alpha$, $\beta$ and $\gamma$ are scale factors, and $F1_{type} = \frac{precision_{type} \cdot recall_{type}}{precision_{type} + recall_{type}}$ stands for the F1 score of each type of core information.
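Putting Eqs. (4)–(6) together, a sketch of the final score computation might look as follows; the default scale factors are placeholders rather than the values used in the paper, and the per-type statistics are assumed to come from the matching stages sketched above.

```python
# Sketch of the final CAPTURE score (Eqs. 4-6). Scale factors and the
# `stats` layout are assumptions for illustration.
def f1(precision, recall):
    # F1 as defined in the text: precision * recall / (precision + recall).
    return precision * recall / (precision + recall) if precision + recall > 0 else 0.0

def capture_score(stats, alpha=1.0, beta=1.0, gamma=1.0):
    """stats[t] holds n_cand_match, n_cand, n_gt_match, n_gt and, for
    relations only, the soft matching scores of the remaining phrases."""
    f1s = {}
    for t in ("obj", "attr", "rel"):
        s = stats[t]
        # Soft matching contributes only where scores are provided (relations).
        soft_cand = sum(s["cand_soft"]) / len(s["cand_soft"]) if s.get("cand_soft") else 0.0
        soft_gt = sum(s["gt_soft"]) / len(s["gt_soft"]) if s.get("gt_soft") else 0.0
        precision = (s["n_cand_match"] + soft_cand) / max(s["n_cand"], 1)  # Eq. (4)/(5)
        recall = (s["n_gt_match"] + soft_gt) / max(s["n_gt"], 1)
        f1s[t] = f1(precision, recall)
    return (alpha * f1s["obj"] + beta * f1s["attr"] + gamma * f1s["rel"]) / (alpha + beta + gamma)  # Eq. (6)
```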

Appendix C Visualized Examples for Improved Detail Caption Construction

Cases of detail caption construction.
[Figure 5: Visualized cases of detail caption construction.]

In Figure 5, we illustrate the effectiveness of the detail caption construction pipeline in Section 4.1 with three visualized cases. In the first case, the LVLM-generated caption incorrectly mentions people in the image (highlighted in red), while the caption produced by our pipeline correctly removes this description. In the other two cases, the synthesized captions complement the model-generated captions with additional visual information (highlighted in green), resulting in higher-quality detail image captions.
