1.
Homepage2Vec: Language-Agnostic Website Embedding and Classification ...
Abstract:
Currently, publicly available models for website classification do not offer an embedding method and have limited support for languages beyond English. We release a dataset of more than two million category-labeled websites in 92 languages collected from Curlie, the largest multilingual human-edited Web directory. The dataset contains 14 website categories aligned across languages. Alongside it, we introduce Homepage2Vec, a machine-learned pre-trained model for classifying and embedding websites based on their homepage in a language-agnostic way. Homepage2Vec, thanks to its feature set (textual content, metadata tags, and visual attributes) and recent progress in natural language representation, is language-independent by design and generates embedding-based representations. We show that Homepage2Vec correctly classifies websites with a macro-averaged F1-score of 0.90, with stable performance across low- as well as high-resource languages. Feature analysis shows that a small subset of efficiently computable ...
Published in Proc. of ICWSM 2022 ...
Keywords:
Artificial Intelligence (cs.AI); Computation and Language (cs.CL); FOS Computer and information sciences
URL: https://dx.doi.org/10.48550/arxiv.2201.03677 https://arxiv.org/abs/2201.03677
BASE
4.
Classifying Dyads for Militarized Conflict Analysis
In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (2021)
5.
Cognitive Network Topology and Optimization of the Mental Lexicon ...
6.
Linguistic effects on news headline success: Evidence from thousands of online field experiments (Registered Report Protocol)
In: PLoS One (2021)
7.
On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation ...
8.
On the limitations of cross-lingual encoders as exposed by reference-free machine translation evaluation
10.
Crosslingual Document Embedding as Reduced-Rank Ridge Regression ...
14.
Causal Effects of Brevity on Style and Success in Social Media ...
15.
Message Distortion in Information Cascades
In: http://infoscience.epfl.ch/record/270657 (2019)
17.
Reverse-Engineering Satire, or "Paper on Computational Humor Accepted despite Making Serious Advances"
In: http://infoscience.epfl.ch/record/271147 (2019)
18.
Why the World Reads Wikipedia: Beyond English Speakers
In: http://infoscience.epfl.ch/record/270302 (2019)
19.
Crosslingual Document Embedding as Reduced-Rank Ridge Regression
In: http://infoscience.epfl.ch/record/263893 (2019)
20.
Churn Intent Detection in Multilingual Chatbot Conversations and Social Media ...