The large scene video text dataset for scene video text spotting (LSVTD)

Research Tasks

End-to-end Video Text Spotting

2021-06-01 (v. 1)

Contact author

Baorui Zou

Hikvision Research Institute

(+86) 18826072052

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.


TASK 3 - End-to-end Video Text Spotting

This task aims to evaluate the end-to-end performance of video text spotting methods. It requires that words are correctly localised in every frame, correctly tracked over the video frames, and correctly recognized at the sequence level.


The evaluation also adopts the CLEAR-MOT [2] and VACE [3] evaluation frameworks (similar to Task 2). In this case, a predicted word is considered a true positive if its intersection over union (IoU) with a ground-truth word is larger than 0.5 and the word is recognized correctly. Specifically, MOTA, MOTP, and ATA are used as the evaluation metrics, with recognition performance taken into account.
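The frame-level true-positive test described above can be sketched as follows (a minimal illustration with hypothetical helper names; boxes are assumed to be axis-aligned `[x1, y1, x2, y2]` rectangles):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes [x1, y1, x2, y2]."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred_box, pred_text, gt_box, gt_text, iou_thresh=0.5):
    """A predicted word is a true positive if its IoU with the ground truth
    exceeds the threshold and the recognized word matches the transcription."""
    return iou(pred_box, gt_box) > iou_thresh and pred_text == gt_text
```

This only sketches the matching criterion for a single prediction; the full MOTA/MOTP/ATA computation additionally involves identity assignment across frames as defined in [2] and [3].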

We also propose a sequence-level evaluation protocol by adapting the metrics used in [1]. The adaptation is that a predicted text sequence is regarded as a true positive only if it satisfies two constraints:

1) spatial-temporal localisation constraint: for a predicted text sequence matched to a ground-truth sequence, we first compute the spatial intersection over union (IoU) between the predicted and ground-truth bounding boxes in each frame, and a predicted bounding box is a valid match if its IoU is larger than 0.5. On this basis, if the number of matched frames in the sequence exceeds half the size of the union of predicted and ground-truth frames, the predicted sequence is treated as a candidate true positive.

2) recognition constraint: for text sequences satisfying constraint 1), the recognition result must also match the text transcription of the corresponding ground-truth sequence.
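The two constraints above can be sketched as a single check (a hypothetical data layout is assumed: each sequence is a mapping from frame index to bounding box, with one transcription per sequence, and `iou_fn` is a standard box-IoU helper):

```python
def is_sequence_true_positive(pred_seq, pred_text, gt_seq, gt_text,
                              iou_fn, iou_thresh=0.5):
    """Sequence-level true-positive test under constraints 1) and 2).

    pred_seq, gt_seq: dicts mapping frame index -> bounding box.
    pred_text, gt_text: sequence-level transcriptions.
    """
    # Constraint 1): count frames where the predicted box matches the
    # ground-truth box with IoU above the threshold ...
    shared_frames = set(pred_seq) & set(gt_seq)
    matched = sum(
        1 for f in shared_frames
        if iou_fn(pred_seq[f], gt_seq[f]) > iou_thresh
    )
    # ... and require the matches to cover more than half of the union
    # of predicted and ground-truth frames.
    union_frames = set(pred_seq) | set(gt_seq)
    if matched <= len(union_frames) / 2:
        return False
    # Constraint 2): the sequence-level recognition result must match
    # the ground-truth transcription.
    return pred_text == gt_text
```

The function returns a candidate true positive only when both the temporal-coverage test and the transcription match succeed.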

Specifically, Recall, Precision, and F-score are used for the sequence-level evaluation (note that these differ from the detection metrics). We propose the sequence-level metrics to explicitly evaluate the performance of directly extracting text-sequence information from videos: in many real-world applications only the sequence-level information is needed, and correctly recognizing the word in every frame is somewhat redundant and unnecessary.
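Given the number of matched sequences under constraints 1) and 2), the sequence-level metrics follow the standard definitions. A minimal sketch (hypothetical function name; the counts themselves come from the matching step above):

```python
def sequence_prf(num_true_positive, num_pred_seqs, num_gt_seqs):
    """Standard Precision / Recall / F-score over text sequences.

    num_true_positive: predicted sequences satisfying constraints 1) and 2).
    num_pred_seqs:     total number of predicted sequences.
    num_gt_seqs:       total number of ground-truth sequences.
    """
    precision = num_true_positive / num_pred_seqs if num_pred_seqs else 0.0
    recall = num_true_positive / num_gt_seqs if num_gt_seqs else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall > 0 else 0.0)
    return precision, recall, f_score
```

For example, 8 true-positive sequences out of 10 predictions against 16 ground-truth sequences give a precision of 0.8 and a recall of 0.5.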


[1]. Cheng, Z.; Lu, J.; Niu, Y.; Pu, S.; Wu, F.; and Zhou, S. 2019. You Only Recognize Once: Towards Fast Video Text Spotting. In ACM Multimedia 2019.

[2]. K. Bernardin and R. Stiefelhagen, “Evaluating multiple object tracking performance: The CLEAR MOT metrics,” EURASIP Journal on Image and Video Processing, vol. 2008, May 2008.

[3]. R. Kasturi, D. Goldgof, P. Soundararajan, V. Manohar, J. Garofolo, R. Bowers, M. Boonstra, V. Korzhova, and J. Zhang, “Framework for performance evaluation of face, text, and vehicle detection and tracking in video: Data, metrics, and protocol,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 319–336, 2009.


