On the Use of Character n-grams as the only Intrinsic Evidence of Plagiarism
Rosso, Paolo; Bensalem, Imene; Chikhi, Salim. Springer-Verlag, 2019
Abstract: When a shift in writing style is noticed in a document, doubts arise about its originality. Based on this clue to plagiarism, the intrinsic approach to plagiarism detection identifies the stolen passages by analysing the writing style of the suspicious document, without comparing it to textual resources that may have served as sources for the plagiarist. Character n-grams are recognised as a successful approach to modelling text for writing style analysis. Although prior studies have investigated best practices for using character n-grams in authorship attribution and other problems, such investigations are still needed in the context of intrinsic plagiarism detection. Moreover, previous works have assumed that the ways of using character n-grams in authorship attribution carry over unchanged to intrinsic plagiarism detection. In this paper, we study the effect of character n-gram frequency and length on the performance of intrinsic plagiarism detection. Our experiments use two state-of-the-art methods and five large document collections from the PAN labs, written in English and Arabic. We demonstrate empirically that low- and high-frequency n-grams are not equally relevant for intrinsic plagiarism detection, and that their performance depends on the way they are exploited.
Acknowledgements: We are very grateful to the anonymous reviewers for their insightful suggestions and constructive comments that greatly improved the paper. This work has been partially supported by the Ecole Superieure de Comptabilite et de Finances de Constantine. The work of Paolo Rosso has been partially funded by the SomEMBED TIN2015-71147-C2-1-P research project (MINECO/FEDER). The work of Salim Chikhi has been partially funded by the CNEPRU/DGRSDT/B*07120140018 research project.
Citation: Bensalem, I.; Rosso, P.; Chikhi, S. (2019). On the Use of Character n-grams as the only Intrinsic Evidence of Plagiarism. Language Resources and Evaluation. 53(3):363-396.
Keywords: Character n-grams; Intrinsic plagiarism detection; Computer Languages and Systems (LENGUAJES Y SISTEMAS INFORMATICOS); Stylistic features; Writing style analysis
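
Illustration: the abstract refers to the frequency and length of character n-grams within a single suspicious document. The Python sketch below shows, under stated assumptions, what those two quantities could look like in practice. It is not the method evaluated in the paper (the abstract does not name the two methods used). The function names, the lowercasing step, and the `threshold` parameter are hypothetical choices made only for this example.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> list[str]:
    """Return all overlapping character n-grams of length n in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_profile(text: str, n: int = 3) -> Counter:
    """Count each character n-gram in the whole document.

    Lowercasing is an assumption of this sketch, not a step taken
    from the paper.
    """
    return Counter(char_ngrams(text.lower(), n))


def split_by_frequency(profile: Counter, threshold: int) -> tuple[set[str], set[str]]:
    """Split n-grams into a low- and a high-frequency class.

    `threshold` is a hypothetical cut-off: n-grams occurring at most
    `threshold` times in the document are treated as low-frequency.
    """
    low = {gram for gram, count in profile.items() if count <= threshold}
    high = {gram for gram, count in profile.items() if count > threshold}
    return low, high


def chunk_share(chunk: str, n: int, ngram_class: set[str]) -> float:
    """Fraction of a chunk's n-grams that fall in the given class."""
    grams = char_ngrams(chunk.lower(), n)
    if not grams:
        return 0.0
    return sum(gram in ngram_class for gram in grams) / len(grams)
```

A chunk whose share of document-level low-frequency n-grams is unusually high relative to the other chunks might be flagged as stylistically deviant. How such a signal is best exploited is the question the paper studies, and this sketch takes no position on it.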
