1. Handling Cross- and Out-of-Domain Samples in Thai Word Segmentation ...
BASE
2. Robust Fragment-Based Framework for Cross-lingual Sentence Retrieval ...
3. Sentiment analysis for Urdu online reviews using deep learning models
In: 38 ; 8 ; 1 (2021)
4. Handling cross and out-of-domain samples in Thai word segmentation
In: 1003 ; 1016 (2021)
5. Linguistic features evaluation for hadith authenticity through automatic machine learning
6. Robust fragment-based framework for cross-lingual sentence retrieval
In: Findings of the Association for Computational Linguistics: EMNLP 2021 ; 935 ; 944 (2021)
7. Exploiting Tweet Sentiments in Altmetrics Large-Scale Data ...
8. Domain adaptation of Thai word segmentation models using stacked ensemble
In: 3841 ; 3847 (2020)
9. Native language identification of fluent and advanced non-native writers
In: 19 ; 4 ; 1 (2020)
Abstract:
This is an accepted manuscript of an article published by ACM in ACM Transactions on Asian and Low-Resource Language Information Processing in April 2020, available online: https://doi.org/10.1145/3383202
The accepted version of the publication may differ from the final published version.
Native Language Identification (NLI) aims to identify the native languages of authors by analyzing text samples they have written in a non-native language. Most existing studies investigate this task for educational applications such as second language acquisition and require learner corpora. This article performs NLI in the challenging context of user-generated content (UGC), where authors are fluent and advanced non-native speakers of a second language. Existing NLI studies with UGC (i) rely on content-specific or social-network features and may not generalize to other domains and datasets, (ii) are unable to capture variations in language-usage patterns within a text sample, and (iii) have no outlier-handling mechanism. Moreover, since a sizable number of people have acquired non-English second languages due to economic and immigration policies, there is a need to gauge the applicability of NLI with UGC to other languages. Unlike existing solutions, we define a topic-independent feature space, which makes our solution generalizable to other domains and datasets. Based on this feature space, we present a solution that mitigates the effect of outliers in the data and helps capture variations in language-usage patterns within a text sample. Specifically, we represent each text sample as a point set and identify the top-k stylistically similar text samples (SSTs) from the corpus. We then apply a probabilistic k-nearest-neighbors classifier to the identified top-k SSTs to predict the native languages of the authors. To conduct experiments, we create three new corpora, each written in a different language: English, French, and German. Our experimental studies show that our solution outperforms competitive methods and reports more than 80% accuracy across languages.
Research funded by Higher Education Commission, and Grants for Development of New Faculty Staff at Chulalongkorn University; Digital Economy Promotion Agency (# MP-62-0003); Thailand Research Funds (MRG6180266 and MRG6280175).
Published version
Keywords:
author profiling; forensic investigation; native language identification; stylometry; text classification
URL: https://doi.org/10.1145/3383202 http://hdl.handle.net/2436/623710
10. A scalable framework for stylometric analysis query processing