| The Use of Latent Semantic Indexing to Mitigate OCR Effects of Related Document Images
               Renato F. Bulcão-Neto (Innolution Sistemas de Informática Ltda., Brazil)
 
               José A. Camacho-Guerrero (Innolution Sistemas de Informática Ltda., Brazil)
 
               Márcio Dutra (Innolution Sistemas de Informática Ltda., Brazil)
 
               Álvaro Barreiro (University of A Coruña, Spain)
 
               Javier Parapar (University of A Coruña, Spain)
 
               Alessandra A. Macedo (Universidade de São Paulo, Brazil)
 
              Abstract: The widespread, multipurpose use of document images and the growing number of document image repositories have increased the demand for robust information retrieval mechanisms and systems. This paper presents an approach that supports the automatic generation of relationships among document images by exploiting Latent Semantic Indexing (LSI) and Optical Character Recognition (OCR). We developed the LinkDI (Linking of Document Images) service, which extracts and indexes the content of document images, computes its latent semantics, and defines relationships among images as hyperlinks. We evaluated LinkDI on document image repositories by comparing the quality of the relationships created among textual documents with that of the relationships created among their respective document images. Using the same document images, we ran further experiments to compare the performance of LinkDI with and without the LSI technique. Experimental results showed that LSI can mitigate the effects of common OCR misrecognition, which reinforces the feasibility of using LinkDI to relate OCR output with high degradation.
             
              Keywords: applied computing, document engineering, document image, experimentation, information retrieval, latent semantic, optical character recognition 
             Categories: H.3, H.3.3, H.5.4  |