FJU Test Collection
for Evaluation of Chinese OCR Text Retrieval

By Yuen-Hsien Tseng, Sep. 20, 2002

Introduction

The advent of the World Wide Web has revolutionized the way we publish, disseminate, and access digital information. The ease of accessing such resources has inspired more and more libraries and information providers to digitize their data for networked information services.

Future information is likely to be produced in fully digital form and will thus be readily accessible through the Internet. In contrast, retrospective paper materials require digitization and indexing before they can be easily accessed.

To digitize paper materials, they are first scanned into digital images. OCR (Optical Character Recognition) techniques are then applied to convert these images into digital texts. This approach has been shown to be the cheapest and fastest way to make retrospective data easily accessible. However, the conversion is not always perfect: the OCR process is error-prone, especially for low-quality prints. It is thus an academic task to evaluate the effectiveness of this approach in order to justify its cost advantage.

The test collection provided here originates from a multi-year digitization project of the Socio-Cultural Research Center (SCRC) at Fu Jen Catholic University (FJU). The source of the digital images is the newspaper clipping collection of the SCRC. The digitization project is mostly sponsored by the National Science Council, Taiwan, Republic of China.

Contents of the Test Collection

This collection consists mainly of a set of text files (no programs are included), including:

What can you do with this collection?

Download

To get this collection, click here. You will be presented with a form that asks for the following information:

Once your request is received, a reply message will be sent containing a username and password for downloading the test collection.

Request List

Those who have requested this test collection will be listed here for two reasons:

Here are the requests:

Acknowledgement