61 | Differentiable subset pruning of transformer heads
In: Transactions of the Association for Computational Linguistics, 9 (2021)
Abstract: Multi-head attention, a collection of several attention mechanisms that independently attend to different parts of the input, is the key ingredient in the Transformer. Recent work has shown, however, that a large proportion of the heads in a Transformer's multi-head attention mechanism can be safely pruned away without significantly harming the performance of the model; such pruning leads to models that are noticeably smaller and faster in practice. Our work introduces a new head pruning technique that we term differentiable subset pruning. Intuitively, our method learns per-head importance variables and then enforces a user-specified hard constraint on the number of unpruned heads. The importance variables are learned via stochastic gradient descent. We conduct experiments on natural language inference and machine translation; we show that differentiable subset pruning performs comparably to or better than previous work while offering precise control of the sparsity level.
ISSN: 2307-387X
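The mechanism the abstract describes (learned per-head importance scores plus a hard constraint that exactly k heads survive) can be sketched in a few lines. This is a minimal illustration of the hard top-k selection step only, with hypothetical names, not the paper's implementation; in the paper the discrete choice is additionally relaxed so that gradients from the task loss can reach the importance variables:

```python
def topk_head_mask(importances, k):
    """Binary mask keeping exactly the k highest-scoring attention heads.

    `importances` plays the role of the learned per-head importance
    variables; the user-specified sparsity constraint is that exactly
    k mask entries are 1.
    """
    order = sorted(range(len(importances)), key=lambda i: importances[i], reverse=True)
    keep = set(order[:k])
    return [1.0 if i in keep else 0.0 for i in range(len(importances))]

# Each head's output is scaled by its mask entry, so the lowest-importance
# heads contribute nothing: here heads 1 and 3 are kept out of 4.
mask = topk_head_mask([0.3, 2.0, -1.0, 0.7], k=2)
print(mask)  # → [0.0, 1.0, 0.0, 1.0]
```

Because the mask comes from a hard top-k rather than from thresholding, the pruned model meets the requested head count exactly, which is the precise sparsity control the abstract refers to.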
URL: https://doi.org/10.3929/ethz-b-000528141
URL: https://hdl.handle.net/20.500.11850/528141
BASE

62 | Parameter space factorization for zero-shot learning across tasks and languages
In: Transactions of the Association for Computational Linguistics, 9 (2021)

63 | Disambiguatory Signals are Stronger in Word-initial Positions
In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (2021)

64 | Searching for More Efficient Dynamic Programs
In: Findings of the Association for Computational Linguistics: EMNLP 2021 (2021)

65 | How (Non-)Optimal is the Lexicon?
In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2021)

66 | A Bayesian Framework for Information-Theoretic Probing
In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (2021)

67 | Examining the Inductive Bias of Neural Language Models with Artificial Languages
In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (2021)

68 | On the Relationships Between the Grammatical Genders of Inanimate Nouns and Their Co-Occurring Adjectives and Verbs
In: Transactions of the Association for Computational Linguistics, 9 (2021)

71 | Do Syntactic Probes Probe Syntax? Experiments with Jabberwocky Probing ...

76 | Disambiguatory Signals are Stronger in Word-initial Positions ...

77 | Finding Concept-specific Biases in Form–Meaning Associations ...

78 | Backtranslation feedback improves user confidence in MT, not quality

79 | On the Relationships Between the Grammatical Genders of Inanimate Nouns and Their Co-Occurring Adjectives and Verbs ...

80 | Investigating Cross-Linguistic Adjective Ordering Tendencies with a Latent-Variable Model ...