Volume 21 / Issue 13

DOI: 10.3217/jucs-021-13-1708


Cross-Language Source Code Re-Use Detection Using Latent Semantic Analysis

Enrique Flores (Universitat Politècnica de Valencia, Spain)

Alberto Barrón-Cedeño (Qatar Computing Research Institute, Qatar)

Lidia Moreno (Universitat Politècnica de Valencia, Spain)

Paolo Rosso (Universitat Politècnica de Valencia, Spain)

Abstract: Nowadays, the Internet is the main source of information: blogs, encyclopedias, discussion forums, source code repositories, and many other resources are available just one click away. The temptation to re-use these materials is very high. Even source code is easily available through a simple Web search, so there is a need to detect potential instances of source code re-use. Source code re-use detection has usually been approached by comparing source codes in their compiled version. When dealing with cross-language source code re-use, such traditional approaches can handle only the programming languages supported by the compiler. We assume that a source code is a piece of text, with its own syntax and structure, so we aim at applying models for free-text re-use detection to source code. In this paper we compare a Latent Semantic Analysis (LSA) approach with previously used text re-use detection models for measuring cross-language similarity in source code. The LSA-based approach shows slightly better results than the other models and is able to distinguish between re-used and related source codes with high performance.

Keywords: cross-language re-use detection, latent semantic analysis, plagiarism, source code

Categories: D.2.13, F.3.2, F.3.3, H.3.1, H.3.3, H.3.4, I.2.5, I.7.0, L.0.0, L.3.0