1 |
A comparative study of different features for efficient automatic hate speech detection
|
|
|
|
In: IPrA 2021 - 17th International Pragmatics Conference ; https://hal.archives-ouvertes.fr/hal-03115781 ; IPrA 2021 - 17th International Pragmatics Conference, Jun 2021, Winterthur, Switzerland (2021)
|
|
Abstract:
Commonly, Hate Speech (HS) is defined as any communication that disparages a person or a group on the basis of some characteristic (race, colour, ethnicity, gender, sexual orientation, nationality, etc.) (Nockeby, 2000). Due to the massive volume of user-generated content on social networks (around 500 million tweets per day), hate speech is continuously increasing on the web. Recent initiatives, such as SemEval2019 shared task 5, Hateval2019 (Basile et al., 2019), contribute to the development of automatic hate speech detection (HSD) systems by making annotated hateful corpora available. We focus our research on the automatic classification of hateful tweets, which is the first sub-task of Hateval2019. The best Hateval2019 HSD system was FERMI (Indurthi et al., 2019), with a macro-F1 score of 65.1% on the test corpus. This system used sentence embeddings from the Universal Sentence Encoder (USE) (Cer et al., 2018) as input to a Support Vector Machine classifier.
In this article, we study the impact of different features on an HSD system. We use a deep neural network (DNN) based classifier with USE. We investigate word-level features: a lexicon of hateful words (HFW), part of speech (POS), uppercase letters (UP), punctuation marks (PUNCT), the ratio of the number of times a word appears in hateful tweets to the total number of times that word appears (RatioHW), and emojis (EMO). We think that these features are relevant because they carry feelings. For instance, case (UP) and punctuation (PUNCT) can carry the intonation of a tweet and can be used to express hateful content. For HFW features, we tag each word of a tweet as hateful or not using the Hatebase lexicon (Hatebase.org) and associate a binary value with each word. For POS features, we use twpipe (Liu et al., 2018) to tag the words, and this information is coded as a one-hot vector. For emojis, we generate an embedding vector using the emoji2vec tool (Eisner et al., 2016).
The input of our neural network consists of the USE vector and our additional features. We used convolutional neural networks (CNN) as the binary classifier. We performed the experiments on the Hateval2019 corpus to study the influence of each proposed feature. Our baseline system, without the proposed features, achieves a macro-F1 score of 65.7% on the test corpus. Surprisingly, HFW degrades system performance, decreasing the macro-F1 by 14 points compared to the baseline. This may be because some words are hateful only in a particular context. UP, RatioHW and PUNCT slightly degrade the baseline system. The POS features do not change the baseline result and so are probably not correlated with hate speech. The best result is obtained using EMO features, with a macro-F1 of 66.0%. Emojis are widely used to convey emotions. In our system, they are modeled by a specific embedding vector. USE does not take emojis into account. Therefore, EMO features give USE additional information about the hateful content of tweets.
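The word-level features named in the abstract (HFW, UP, PUNCT, RatioHW) could be computed roughly as in the sketch below. This is an illustration only, not the authors' implementation: the abstract does not specify tokenization, case handling, or how the features are combined with the USE vector. The function names, the whitespace tokenizer, and the one-word placeholder lexicon (standing in for the Hatebase lexicon) are all assumptions.

```python
import string
from collections import Counter


def normalize(token):
    """Lowercase a token and strip surrounding punctuation (assumed preprocessing)."""
    return token.strip(string.punctuation).lower()


def ratio_hw(tweets, labels):
    """RatioHW: occurrences of a word in hateful tweets / all occurrences of that word.

    `labels` holds 1 for hateful tweets and 0 otherwise.
    """
    hateful, total = Counter(), Counter()
    for text, label in zip(tweets, labels):
        for token in text.split():  # whitespace tokenization is an assumption
            word = normalize(token)
            if not word:  # skip tokens made only of punctuation
                continue
            total[word] += 1
            if label == 1:
                hateful[word] += 1
    return {word: hateful[word] / total[word] for word in total}


def word_features(token, lexicon, ratios):
    """Return [HFW, UP, PUNCT, RatioHW] for one token."""
    core = token.strip(string.punctuation)
    word = core.lower()
    return [
        1.0 if word in lexicon else 0.0,  # HFW: binary lexicon membership
        1.0 if core.isupper() else 0.0,  # UP: token written in uppercase
        1.0 if any(ch in string.punctuation for ch in token) else 0.0,  # PUNCT
        ratios.get(word, 0.0),  # RatioHW: 0.0 for words unseen in training
    ]


# Toy data; "badword" is a placeholder for a hateful lexicon entry.
tweets = ["You are a BADWORD!!!", "what a nice day", "a badword again"]
labels = [1, 0, 1]
lexicon = {"badword"}

ratios = ratio_hw(tweets, labels)
features = [word_features(tok, lexicon, ratios) for tok in tweets[0].split()]
```

In this toy example, `features` holds one four-dimensional vector per token of the first tweet. How such per-token vectors are aligned with a sentence-level USE embedding is a design choice the abstract leaves open.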
|
|
Keyword:
[INFO.INFO-SI]Computer Science [cs]/Social and Information Networks [cs.SI]; [INFO.INFO-TT]Computer Science [cs]/Document and Text Processing; [INFO]Computer Science [cs]
|
|
URL: https://hal.archives-ouvertes.fr/hal-03115781/file/CFP___Offensive_language_on_social_media___International_Pragmatics_Conference_panel.pdf https://hal.archives-ouvertes.fr/hal-03115781/document https://hal.archives-ouvertes.fr/hal-03115781
|
|
BASE
|
|
|
|
2 |
Multiword Expression Features for Automatic Hate Speech Detection
|
|
|
|
In: NLDB 2021 - 26th International Conference on Natural Language & Information Systems ; https://hal.archives-ouvertes.fr/hal-03231047 ; NLDB 2021 - 26th International Conference on Natural Language & Information Systems, Jun 2021, Saarbrücken/Virtual, Germany ; http://nldb2021.sb.dfki.de/ (2021)
|
|
BASE
|
|
|
|
3 |
BERT-based Semantic Model for Rescoring N-best Speech Recognition List
|
|
|
|
In: INTERSPEECH 2021 ; https://hal.archives-ouvertes.fr/hal-03248881 ; INTERSPEECH 2021, Aug 2021, Brno, Czech Republic ; https://www.interspeech2021.org/ (2021)
|
|
BASE
|
|
|
|
4 |
Improving Automatic Hate Speech Detection with Multiword Expression Features ...
|
|
|
|
BASE
|
|
|
|
5 |
Introduction of semantic model to help speech recognition
|
|
|
|
In: TSD 2020 - Twenty-third International Conference on Text, Speech and Dialogue ; https://hal.archives-ouvertes.fr/hal-02862245 ; TSD 2020 - Twenty-third International Conference on Text, Speech and Dialogue, Sep 2020, Brno, Czech Republic (2020)
|
|
BASE
|
|
|
|
6 |
Introduction d’informations sémantiques dans un système de reconnaissance de la parole
|
|
|
|
In: Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 1 : Journées d'Études sur la Parole ; 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 1 : Journées d'Études sur la Parole ; https://hal.archives-ouvertes.fr/hal-02798559 ; 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 1 : Journées d'Études sur la Parole, 2020, Nancy, France. pp.362-369 (2020)
|
|
BASE
|
|
|
|
7 |
RNN Language Model Estimation for Out-of-Vocabulary Words
|
|
|
|
In: Lecture Notes in Artificial Intelligence ; https://hal.archives-ouvertes.fr/hal-03054936 ; Lecture Notes in Artificial Intelligence, Springer, In press, 12598, ⟨10.1007/978-3-030-66527-2_15⟩ (2020)
|
|
BASE
|
|
|
|
8 |
DNN-Based Semantic Model for Rescoring N-best Speech Recognition List ...
|
|
|
|
BASE
|
|
|
|
9 |
Dynamic Extension of ASR Lexicon Using Wikipedia Data
|
|
|
|
In: IEEE Workshop on Spoken and Language Technology (SLT) ; https://hal.archives-ouvertes.fr/hal-01874495 ; IEEE Workshop on Spoken and Language Technology (SLT), Dec 2018, Athens, Greece (2018)
|
|
BASE
|
|
|
|
10 |
Modelling Semantic Context of OOV Words in Large Vocabulary Continuous Speech Recognition
|
|
|
|
In: ISSN: 2329-9290 ; EISSN: 2329-9304 ; IEEE/ACM Transactions on Audio, Speech and Language Processing ; https://hal.inria.fr/hal-01461617 ; IEEE/ACM Transactions on Audio, Speech and Language Processing, Institute of Electrical and Electronics Engineers, 2017, 25 (3), pp.598 - 610. ⟨10.1109/TASLP.2017.2651361⟩ (2017)
|
|
BASE
|
|
|
|
11 |
Topic segmentation in ASR transcripts using bidirectional RNNs for change detection
|
|
|
|
In: ASRU 2017 - IEEE Automatic Speech Recognition and Understanding Workshop ; https://hal.archives-ouvertes.fr/hal-01599682 ; ASRU 2017 - IEEE Automatic Speech Recognition and Understanding Workshop, Dec 2017, Okinawa, Japan (2017)
|
|
BASE
|
|
|
|
12 |
Out-of-Vocabulary Word Probability Estimation using RNN Language Model
|
|
|
|
In: 8th Language & Technology Conference ; https://hal.archives-ouvertes.fr/hal-01623784 ; 8th Language & Technology Conference, Nov 2017, Poznan, Poland (2017)
|
|
BASE
|
|
|
|
13 |
How Diachronic Text Corpora Affect Context based Retrieval of OOV Proper Names for Audio News
|
|
|
|
In: LREC 2016 ; https://hal.archives-ouvertes.fr/hal-01331714 ; LREC 2016, May 2016, Portoroz, Slovenia (2016)
|
|
BASE
|
|
|
|
14 |
Improved Neural Bag-of-Words Model to Retrieve Out-of-Vocabulary Words in Speech Recognition
|
|
|
|
In: INTERSPEECH 2016 ; https://hal.archives-ouvertes.fr/hal-01384488 ; INTERSPEECH 2016, Sep 2016, San Francisco, United States. ⟨10.21437/Interspeech.2016-1219⟩ (2016)
|
|
BASE
|
|
|
|
15 |
Temporal and Lexical Context of Diachronic Text Documents for Automatic Out-Of-Vocabulary Proper Name Retrieval
|
|
|
|
In: Human Language Technology. Challenges for Computer Science and Linguistics ; https://hal.inria.fr/hal-01475080 ; Zygmunt Vetulani; Hans Uszkoreit; Marek Kubis Human Language Technology. Challenges for Computer Science and Linguistics, 9561, Springer, pp.41-54, 2016, Lecture Notes in Computer Science, 978-3-319-43808-5. ⟨10.1007/978-3-319-43808-5_4⟩ (2016)
|
|
BASE
|
|
|
|
16 |
Dynamic adjustment of language models for automatic speech recognition using word similarity
|
|
|
|
In: IEEE Workshop on Spoken Language Technology (SLT 2016) ; https://hal.archives-ouvertes.fr/hal-01384365 ; IEEE Workshop on Spoken Language Technology (SLT 2016), Dec 2016, San Diego, CA, United States ; http://www.slt2016.org/ (2016)
|
|
BASE
|
|
|
|
17 |
Document Level Semantic Context for Retrieving OOV Proper Names
|
|
|
|
In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) ; https://hal.archives-ouvertes.fr/hal-01331716 ; 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , Mar 2016, Shanghai, China. pp.6050-6054, ⟨10.1109/ICASSP.2016.7472839⟩ (2016)
|
|
BASE
|
|
|
|
18 |
OOV Proper Name Retrieval using Topic and Lexical Context Model
|
|
|
|
In: IEEE International Conference on Acoustics, Speech and Signal Processing ; https://hal.archives-ouvertes.fr/hal-01184963 ; IEEE International Conference on Acoustics, Speech and Signal Processing, 2015, Brisbane, Australia (2015)
|
|
BASE
|
|
|
|
19 |
Continuous Word Representation using Neural Networks for Proper Name Retrieval from Diachronic Documents
|
|
|
|
In: Interspeech 2015 ; https://hal.archives-ouvertes.fr/hal-01184951 ; Interspeech 2015, Sep 2015, Dresden, Germany (2015)
|
|
BASE
|
|
|
|
20 |
Neural Networks Revisited for Proper Name Retrieval from Diachronic Documents
|
|
|
|
In: proceedings of LTC2015 ; LTC Language & Technology Conference ; https://hal.archives-ouvertes.fr/hal-01240480 ; LTC Language & Technology Conference, Nov 2015, Poznan, Poland. pp.120-124 (2015)
|
|
BASE
|
|
|
|
|
|