
Search in the Catalogues and Directories

Page: 1 2 3
Hits 1 – 20 of 45

1
Topic Discovery via Latent Space Clustering of Pretrained Language Model Representations ...
Meng, Yu; Zhang, Yunyi; Huang, Jiaxin. - : arXiv, 2022
BASE
2
Semantic pattern discovery in open information extraction
Chauhan, Aabhas. - 2022
3
Text mining at multiple granularity: leveraging subwords, words, phrases, and sentences
Abstract: With the rapid digitization of information, large quantities of text-heavy data are constantly generated in many languages and across domains such as web documents, research papers, business reviews, news, and social posts. Efficiently and effectively searching, organizing, and extracting meaningful information from these massive unstructured corpora lays the foundation for many downstream text mining and natural language processing (NLP) tasks. Traditionally, NLP and text mining techniques are applied to raw text while treating individual words as the base semantic unit. However, the assumption that individual word tokens are the correct semantic granularity does not hold for many tasks and can lead to poor task performance. To address this, this work introduces techniques for identifying and utilizing text at different semantic granularities to solve a variety of text mining and NLP tasks. The general idea is to take a text object, such as a document, and decompose it into many levels of semantic granularity: sentences, phrases, words, or subword structures. Once the text is represented at different levels of semantic granularity, we demonstrate techniques that leverage the properly encoded text to solve a variety of NLP tasks. Specifically, this study focuses on three levels of semantic granularity: (1) subword segmentation, with an application to enriching word embeddings to address word sparsity; (2) phrase mining, with an application to phrase-based topic modeling; and (3) sentence-level granularity, for finding parallel cross-lingual data.
The first granularity we study is the subword level. We introduce a subword mining problem that aims to segment individual word tokens into smaller subword structures. The motivation is that individual words are often too coarse a granularity and need to be supplemented by a finer one. Operating on these fine-grained subwords addresses important problems in NLP, namely the long-tail data-sparsity problem, whereby most words in a corpus are infrequent, and the more severe out-of-vocabulary problem. To mine these subword structures effectively and efficiently, we propose an unsupervised segmentation algorithm based on a novel objective: transition entropy. We use ground-truth segmentations to assess the quality of the segmented words and further demonstrate the benefit of jointly leveraging words and subwords for distributed word representations.
The second granularity we study is the phrase level, with the phrase mining task of transforming raw unstructured text from a fine-grained sequence of words into a coarser-granularity sequence of single- and multi-word phrases. The motivation is that human language often contains idiomatic multi-word expressions, and fine-grained words fail to capture the right semantic granularity; proper phrasal segmentation can recover this appropriate granularity. To address this problem, we propose an unsupervised phrase mining algorithm based on frequent, significant, contiguous text patterns. We use human evaluation to assess the quality of the mined phrases and demonstrate the benefit of pre-mining phrases on a downstream topic-modeling task.
The third granularity we study is the sentence level. We motivate the need for sentence-level granularity to capture more complex, semantically complete spans of text. We introduce several downstream tasks that leverage sentence representations in conjunction with finer-grained units in a cross-lingual text mining task. We experimentally show how sentence-level data for cross-lingual embeddings can be used to identify cross-lingual document pairs and parallel sentences – data necessary for training machine translation models.
Access: U of Illinois only; the author requested U of Illinois access (open access after 2 years) in the Vireo ETD system.
Keyword: cross-lingual; data mining; embedding; nlp; phrases; sentences; subwords
URL: http://hdl.handle.net/2142/108161
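The subword chapter above proposes an unsupervised segmenter driven by a "transition entropy" objective, which the abstract does not define. As a hedged illustration only, the sketch below uses character-level branching entropy, a closely related standard idea, as a stand-in: cut a word wherever the distribution over next characters, given the prefix seen so far, becomes highly uncertain. The function name and threshold are made up for this example.

```python
from collections import Counter, defaultdict
from math import log2

def branching_entropy_segmenter(vocab, threshold=1.0):
    """Build a word segmenter from a vocabulary using character-level
    branching entropy (an assumed proxy for the thesis's transition-entropy
    objective, whose exact definition is not given in the abstract)."""
    # Count which characters follow every prefix seen in the vocabulary.
    successors = defaultdict(Counter)
    for word in vocab:
        for i in range(len(word)):
            successors[word[:i]][word[i]] += 1

    def entropy(counter):
        # Shannon entropy of the successor-character distribution.
        total = sum(counter.values())
        return -sum((c / total) * log2(c / total) for c in counter.values())

    def segment(word):
        # Cut before each position where the prefix's successor
        # distribution is highly uncertain (entropy above threshold).
        cuts = [0]
        for i in range(1, len(word)):
            if entropy(successors[word[:i]]) > threshold:
                cuts.append(i)
        cuts.append(len(word))
        return [word[a:b] for a, b in zip(cuts, cuts[1:])]

    return segment
```

With a toy vocabulary sharing the prefixes "un-" and "re-", the entropy spikes right after those prefixes, so `segment("unfair")` yields `["un", "fair"]`.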
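The phrase mining chapter describes mining "frequent, significant, contiguous text patterns." As a hedged sketch only, the code below scores contiguous bigrams by pointwise mutual information (one common significance measure, not necessarily the one used in the thesis) against a frequency cutoff, then greedily merges adjacent tokens that were mined as phrases; all names and thresholds are illustrative.

```python
from collections import Counter
from math import log

def mine_phrases(sentences, min_count=2, threshold=0.0):
    """Return contiguous bigrams that are both frequent (min_count) and
    significant (PMI above threshold) - a simple stand-in for the
    thesis's frequent-significant-pattern criterion."""
    unigrams, bigrams = Counter(), Counter()
    total = 0
    for toks in sentences:
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
        total += len(toks)
    phrases = {}
    for (a, b), n in bigrams.items():
        if n < min_count:
            continue
        # PMI: how much more often the pair co-occurs than chance predicts.
        pmi = log(n * total / (unigrams[a] * unigrams[b]))
        if pmi > threshold:
            phrases[(a, b)] = pmi
    return phrases

def chunk(tokens, phrases):
    """Greedy left-to-right pass that merges adjacent tokens whose
    bigram was mined as a phrase."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```

On a toy corpus where "new york" recurs, that bigram passes both thresholds and is merged into a single coarser-granularity token during chunking.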
4
Distantly-Supervised Named Entity Recognition with Noise-Robust Learning and Language Model Augmented Self-Training ...
5
ChemNER: Fine-Grained Chemistry Named Entity Recognition with Ontology-Guided Distant Supervision ...
6
Generation-Augmented Retrieval for Open-Domain Question Answering ...
7
Few-Shot Named Entity Recognition: An Empirical Baseline Study ...
8
Reader-Guided Passage Reranking for Open-Domain Question Answering ...
9
The Future is not One-dimensional: Complex Event Schema Induction by Graph Modeling for Event Prediction ...
10
Extract, Denoise and Enforce: Evaluating and Improving Concept Preservation for Text-to-Text Generation ...
Mao, Yuning; Ma, Wenchang; Lei, Deren. - : arXiv, 2021
11
Extract, Denoise and Enforce: Evaluating and Improving Concept Preservation for Text-to-Text Generation ...
12
Combating abuse on social media platforms using natural language processing
Seyler, Dominic. - 2021
13
Text Classification Using Label Names Only: A Language Model Self-Training Approach ...
Meng, Yu; Zhang, Yunyi; Huang, Jiaxin. - : arXiv, 2020
14
COVID-19 Literature Knowledge Graph Construction and Drug Repurposing Report Generation ...
Wang, Qingyun; Li, Manling; Wang, Xuan. - : arXiv, 2020
15
Constrained Abstractive Summarization: Preserving Factual Consistency with Constrained Generation ...
Mao, Yuning; Ren, Xiang; Ji, Heng. - : arXiv, 2020
16
Guiding Corpus-based Set Expansion by Auxiliary Sets Generation and Co-Expansion ...
Huang, Jiaxin; Xie, Yiqing; Meng, Yu. - : arXiv, 2020
17
Near-imperceptible Neural Linguistic Steganography via Self-Adjusting Arithmetic Coding ...
Shen, Jiaming; Ji, Heng; Han, Jiawei. - : arXiv, 2020
18
Cold-start universal information extraction
Huang, Lifu. - 2020
19
Cross-lingual entity extraction and linking for 300 languages
Pan, Xiaoman. - 2020
20
Text cube: construction, summarization and mining
Tao, Fangbo. - 2020


Results by source: Catalogues 2 · Bibliographies 0 · Linked Open Data catalogues 0 · Online resources 0 · Open access documents 43
© 2013 - 2024 Lin|gu|is|tik | Imprint | Privacy Policy