### "[What is a Mixture of Experts (MoE)?](https://huggingface.co/blog/moe)"
<a href='https://ko-fi.com/S6S2UH2TC' target='_blank'><img height='38' style='border:0px;height:36px;' src='https://storage.ko-fi.com/cdn/kofi1.png?v=3' border='0' alt='Buy Me a Coffee at ko-fi.com' /></a>
<a href='https://discord.gg/KFS229xD' target='_blank'><img width='140' height='500' style='border:0px;height:36px;' src='https://i.ibb.co/tqwznYM/Discord-button.png' border='0' alt='Join Our Discord!' /></a>
### (From the MistralAI papers... click the quoted question above to navigate to it directly.)
The scale of a model is one of the most important axes for better model quality. Given a fixed compute budget, training a larger model for fewer steps is better than training a smaller model for more steps.
Mixture of Experts enables models to be pretrained with far less compute, which means you can dramatically scale up the model or dataset size within the same compute budget as a dense model. In particular, a MoE model should achieve the same quality as its dense counterpart much faster during pretraining.
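The compute savings come from sparse routing: a learned gate scores every expert for each input, but only the top-k experts actually run. A minimal sketch of that idea (the function and variable names here are illustrative, not this model's actual implementation):

```python
import numpy as np

def moe_forward(x, gate_W, experts, k=2):
    """Route input x to the top-k experts and mix their outputs."""
    logits = x @ gate_W                      # one routing score per expert
    topk = np.argsort(logits)[-k:]           # indices of the k best-scoring experts
    weights = np.exp(logits[topk] - logits[topk].max())
    weights = weights / weights.sum()        # softmax renormalized over top-k only
    # Only the selected experts are evaluated; the rest cost nothing.
    return sum(w * experts[i](x) for w, i in zip(weights, topk))

# Toy demo: 4 linear "experts" over a 16-dim input, routing to 2 of them.
rng = np.random.default_rng(0)
x = rng.standard_normal(16)
gate_W = rng.standard_normal((16, 4))        # router weights: 4 experts
experts = [(lambda W: (lambda v: v @ W))(rng.standard_normal((16, 16)))
           for _ in range(4)]
y = moe_forward(x, gate_W, experts, k=2)
```

With `k=2` out of 4 experts, each token pays roughly half the expert FLOPs of the equivalent dense model, which is why MoE pretraining stretches a fixed compute budget further.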