Catalogue search • Linguistik portal • Fachinformationsdienst (FID)

1	Punctuation and Parallel Corpus Based Word Embedding Model for Low-Resource Languages
	Yang Yuan; Xiao Li; Ya-Ting Yang
	In: Information ; Volume 11 ; Issue 1 (2019)
	Abstract: To overcome data sparseness when training word embeddings for low-resource languages, we propose a word embedding model based on punctuation and a parallel corpus. In particular, we generate the global word-pair co-occurrence matrix with a punctuation-based distance attenuation function and integrate it with intermediate word vectors generated from a small-scale bilingual parallel corpus to train the word embeddings. Experimental results show that, compared with several widely used baseline models such as GloVe and Word2vec, our model significantly improves word embedding performance for low-resource languages. Trained on a restricted-scale English-Chinese corpus, our model improves by 0.71 percentage points on the word analogy task and achieves the best results on all of the word similarity tasks.
	Keyword: distance attenuation function; GloVe; word alignment probability; word embedding; Word2vec
	URL: https://doi.org/10.3390/info11010024
	BASE
