Volume 16 / Issue 9

DOI: 10.3217/jucs-016-09-1190

LemmaGen: Multilingual Lemmatisation with Induced Ripple-Down Rules

Matjaž Juršič (Jožef Stefan Institute, Slovenia)

Igor Mozetič (Jožef Stefan Institute, Slovenia)

Tomaž Erjavec (Jožef Stefan Institute, Slovenia)

Nada Lavrač (Jožef Stefan Institute, Slovenia)

Abstract: Lemmatisation is the process of finding the normalised forms of words appearing in text. It is a useful preprocessing step for a number of language engineering and text mining tasks, and is especially important for languages with rich inflectional morphology. This paper presents a new lemmatisation system, LemmaGen, which was trained to generate accurate and efficient lemmatisers for twelve different languages. Its evaluation on the corresponding lexicons shows that LemmaGen outperforms the lemmatisers generated by two alternative approaches, RDR and CST, in terms of both accuracy and efficiency. To our knowledge, LemmaGen is the most efficient publicly available lemmatiser trained on large lexicons of multiple languages, and its learning engine can be retrained to generate lemmatisers for other languages.

Keywords: lemmatisation, natural language processing, ripple-down rules, rule induction

Categories: E.1, I.2.6, I.2.7