LSVT consists of 20,000 test images, 30,000 fully annotated training images, and 400,000 weakly annotated training images, whose annotations are referred to as partial labels. For most of the weakly labeled training images, only one transcription per image is provided, which we refer to as the `text-of-interest'. All the images were captured from streets and cover a large variety of complicated real-world scenarios, e.g., store fronts and landmarks, which makes the challenge extremely difficult and narrows the gap between research and real applications.

