2 |
Corpus of academic Slovene KAS 2.0
|
|
Žagar, Aleš; Kavaš, Matic; Robnik-Šikonja, Marko; Erjavec, Tomaž; Fišer, Darja; Ljubešić, Nikola; Ferme, Marko; Borovič, Mladen; Boškovič, Borko; Ojsteršek, Milan; Hrovat, Goran. - : Faculty of Electrical Engineering and Computer Science, University of Maribor, 2022. : Faculty of Computer and Information Science, University of Ljubljana, 2022
|
|
Abstract:
The KAS corpus of Slovene academic writing consists of almost 65,000 BSc/BA, 16,000 MSc/MA and 1,600 PhD theses (82 thousand texts, 5 million pages or 1.5 billion tokens) written between 2000 and 2018 and gathered from the digital libraries of Slovene higher education institutions via the Slovene Open Science portal (http://openscience.si/). Each thesis comes with substantial metadata, and the corpus contains only its textual body, i.e. without front and back matter. The body is divided into chapters, then pages, then paragraphs, and then sentences. The sentence tokens are tagged with morphosyntactic descriptions (detailed part-of-speech tags) and the words are lemmatised. Compared with version 1.0, the KAS corpus of Slovene academic writing 2.0 is cleaner and is segmented into chapters. The metadata also contains more information about the research fields of each work. Both versions consist of the same number of BSc/BA, MSc/MA, and PhD theses; however, the processing was done from scratch for 2.0, so the numbers of e.g. pages and tokens differ. Note also that the new version contains neither links to the PNG pictures of individual pages nor annotated terms, both of which are present in version 1.0. Unlike 1.0, it is also not mounted on the CLARIN.SI concordancers. The new version is distributed in the canonical TEI encoding, JSON, and as plain text files. In the TEI format, chapter names are denoted with the tag. Each entry in the JSON files has a string ID and a list containing the names of chapters as its first element and their texts as its second element. Chapters without text are represented as an empty string. The plain text files contain only text bodies without segmentation information. References: Žagar, A., Kavaš, M., & Robnik Šikonja, M. (2021). Corpus KAS 2.0: cleaner and with new datasets. In Information Society - IS 2021: Proceedings of the 24th International Multiconference.
https://doi.org/10.5281/zenodo.5562228
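The JSON layout described in the abstract can be illustrated with a minimal sketch. The exact file structure is an assumption here: the sketch supposes a JSON object that maps each string ID to a two-element list, with chapter names first and chapter texts second. The ID `kas-0001`, the chapter names, the texts, and the helper names `chapters` and `non_empty_chapters` are all hypothetical and not taken from the corpus.

```python
import json

# Assumed layout (not confirmed by the corpus documentation): a JSON object
# mapping each thesis ID to [chapter_names, chapter_texts].
# The ID and chapter contents below are invented for illustration only.
raw = """
{
  "kas-0001": [
    ["Uvod", "Metode", "Priloge"],
    ["Besedilo uvoda.", "Besedilo metod.", ""]
  ]
}
"""
sample = json.loads(raw)


def chapters(entry):
    """Pair each chapter name with its text, in document order."""
    names, texts = entry
    return list(zip(names, texts))


def non_empty_chapters(entry):
    """Drop chapters whose text is the empty string (chapters without text)."""
    return [(name, text) for name, text in chapters(entry) if text != ""]


for thesis_id, entry in sample.items():
    for name, text in non_empty_chapters(entry):
        print(thesis_id, name, len(text))
```

If the distributed files instead store entries as a list of `[id, [names, texts]]` items, only the loop over `sample.items()` would need to change; the pairing logic stays the same.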
|
|
Keyword:
academic writing; BSc/BA theses; MSc/MA theses; PhD theses; TEI
|
|
URL: http://hdl.handle.net/11356/1448
|
|
BASE
|
|
5 |
Surveys in toponymy in Brazil: works produced in postgraduate stricto sensu ; Pesquisas em toponímia no Brasil: trabalhos produzidos na pós-graduação stricto sensu
|
|
|
|
In: Acta Scientiarum. Language and Culture; Vol 44 No 1 (2022): Jan.-June; e53282 ; Acta Scientiarum. Language and Culture; v. 44 n. 1 (2022): Jan.-June; e53282 ; 1983-4683 ; 1983-4675 (2022)
|
|
BASE
|
|
7 |
In search of safety: A qualitative study on how LGBT+ college students find safe spaces on college campuses
|
|
|
|
BASE
|
|
8 |
College Students’ Attitudes Toward Immigration within the United States
|
|
|
|
BASE
|
|
9 |
Improving the Accessibility of Arabic Electronic Theses and Dissertations (ETDs) with Metadata and Classification
|
|
|
|
BASE
|
|
10 |
Le complotisme « transnational » et le discours de haine : le cas de Chypre et de l’Italie
|
|
|
|
In: Mots. Les langages du politique, n 125, 1, 2021-02-15, pp.15-34 (2021)
|
|
BASE
|
|
14 |
An overview of studies within applied linguistics in Brazilian graduation programs between 2017 and 2020 ; Fotografias da pesquisa em linguística aplicada na pós-graduação brasileira entre 2017 e 2020
|
|
|
|
In: Entrepalavras; v. 10, n. 3 (10) (2020)
|
|
BASE
|
|
15 |
A Brave Space for Community: Bolstering K-12 Theatre Education for Diversity, Equity, and Inclusion ...
|
|
Loest, Tylor. - : Maryland Shared Open Access Repository, 2019
|
|
BASE
|
|
16 |
Monitoring Academic Studies of Turkish Lexicography: A Bibliometric Study of 84 Years
|
|
|
|
In: Lexikos; Vol. 29 (2019) ; 2224-0039 (2019)
|
|
BASE
|
|