ecker committed
Commit b7c10b1
1 Parent(s): 94e677c

Update README.md

Files changed (1):
  1. README.md +7 -2
README.md CHANGED
@@ -72,6 +72,7 @@ Under `./models/experiments/` are some failed models, but are included to serve

* `config.dac-nar-len.yaml` / `nar-len-llama-9`: A DAC-based model, but a pure NAR model (+ an autoregressive length task; a rough sketch of the idea follows this list).
  + Originally thought to be bunk because inference tests had the audio drastically drop off into silence, but I suppose it was just some issue that eventually resolved itself.
+ + Addendum: I don't know what magic I did for that model, but I cannot recreate a decent EnCodec-backed model instead, despite the test trainer working fine.
  + Suffers from the same problems as the above model (terrible quality).
  + *Huge* performance gains, but it may well suffer from some specific quirks in the outputs, if it does get trained right.
 
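For reference, here is a minimal, hypothetical sketch of what a "pure NAR + autoregressive length task" decode could look like in spirit; `predict_length` / `predict_level` are made-up stand-ins, not the repo's actual API.

```python
import torch

def nar_len_inference(model, text_tokens, prompt_codes, n_levels=9):
    """Hypothetical decode loop for a pure NAR model with a length task.
    Nothing here mirrors the repo's real code; it only illustrates the idea."""
    # 1. The length task decides how many audio frames to emit (e.g. via a small
    #    autoregressive digit sequence or a dedicated length head).
    num_frames = model.predict_length(text_tokens, prompt_codes)

    # 2. Every frame of RVQ level 0 is filled in a single parallel pass.
    codes = torch.zeros(n_levels, num_frames, dtype=torch.long)
    codes[0] = model.predict_level(text_tokens, prompt_codes, codes, level=0)

    # 3. Each later RVQ level conditions on the levels below it, as in a normal NAR.
    for level in range(1, n_levels):
        codes[level] = model.predict_level(text_tokens, prompt_codes, codes, level=level)
    return codes
```
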
@@ -81,9 +82,13 @@ Under `./models/experiments/` are some failed models, but are included to serve
  + The model definitely needs to be retrained, as there are some errors for the additional tokens.
  + If these cannot be ironed out with more training, then I imagine an approach similar to speculative decoding, where the nth tokens are discarded if their confidence is low (a sketch of this idea follows after this list).
  + Greedy sampling might be beneficial for this instead, as the NAR does benefit greatly from low temperatures / greedy sampling.
+ + It seems that naively adjusting the "causal size" (the number of tokens to predict into the future, and in turn how many tokens are returned per step) introduces crackles at fixed intervals.
 
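To make the speculative-style idea above concrete, here is a minimal sketch assuming a step that predicts `causal_size` tokens at once; `accept_multi_token_step` and its threshold are hypothetical, not code from this repo.

```python
import torch

def accept_multi_token_step(logits, threshold=0.9, temperature=0.0):
    """Keep the look-ahead tokens from a multi-token step only while their
    confidence stays above a threshold, in the spirit of speculative decoding's
    accept/reject loop. Purely illustrative; not the project's sampler.

    logits: (causal_size, vocab) raw scores for the tokens predicted this step.
    Returns the accepted token ids; the first token is always kept.
    """
    probs = torch.softmax(logits, dim=-1)
    if temperature <= 0.0:
        tokens = probs.argmax(dim=-1)  # greedy, which the NAR seems to prefer anyway
    else:
        tempered = torch.softmax(logits / temperature, dim=-1)
        tokens = torch.multinomial(tempered, 1).squeeze(-1)

    # confidence = the (untempered) probability assigned to each chosen token
    confidence = probs.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)

    accepted = [tokens[0]]
    for tok, conf in zip(tokens[1:], confidence[1:]):
        if conf < threshold:  # low confidence: discard this token and everything after it
            break
        accepted.append(tok)
    return torch.stack(accepted)

# Toy usage: a step that predicts 4 tokens over a 1024-entry codebook.
print(accept_multi_token_step(torch.randn(4, 1024), threshold=0.5))
```
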
Some additional configurations have been explored, but the experiments have not been fruitful:
* Exotic wrappers like `BitNet` seemed to yield little gain in inferencing, somehow. The memory savings are pretty much unnecessary, as the models are already manageable at ~200M parameters. (A toy illustration of what such a wrapper does follows after this list.)
+ * Mamba / Mamba2-based models have shown that it's ***really*** hard to have an AR+NAR model. I really do not want to bother throwing compute at another ~~meme~~ arch where I can't easily make use of all the other tech I'd want to throw at it.
 
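For the curious, wrapping a model with something like `BitNet` boils down to swapping its linear layers for low-bit ones. A toy approximation of that idea (not the actual wrapper used here) might look like:

```python
import torch
import torch.nn.functional as F
from torch import nn

class ToyBitLinear(nn.Linear):
    """Toy stand-in for a BitNet-style linear layer: ternary {-1, 0, +1} weights
    with a per-tensor scale, trained through a straight-through estimator.
    Real BitNet also quantizes activations; this only sketches the weight side."""
    def forward(self, x):
        w = self.weight
        scale = w.abs().mean().clamp(min=1e-5)
        w_q = (w / scale).round().clamp(-1, 1) * scale  # ternary weights, rescaled
        w_ste = w + (w_q - w).detach()                  # quantized forward, full-precision grad
        return F.linear(x, w_ste, self.bias)

# e.g. layer = ToyBitLinear(1024, 1024); layer(torch.randn(2, 1024))
```
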
+ Some current "architectural features" are in use, but their effects need to be experimented with further (toy sketches of them follow after this list):
+ * It's still a mystery whether `split_classifier_heads` (each RVQ level gets its own output head) is truly helpful or not.
+ * `audio_embeddings_sum` is a similar mystery: does it matter whether each later RVQ level "sees" the past levels through summed embeddings, or is not doing so preferable?
+ * Disabling `unified_position_ids` seems to help quality more often than not, but I'm still unsure whether it's beneficial in practice.
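
To make those knobs concrete, here is a self-contained toy sketch of the options as I read them; the class and function names below are illustrative, not the project's actual implementation.

```python
import torch
from torch import nn

class ToyAudioEmbedding(nn.Module):
    """`audio_embeddings_sum`-style choice: the level being predicted either sees
    the sum of all lower RVQ levels' embeddings, or only the previous level."""
    def __init__(self, n_levels=8, codebook_size=1024, dim=512, sum_levels=True):
        super().__init__()
        self.sum_levels = sum_levels
        self.embs = nn.ModuleList(nn.Embedding(codebook_size, dim) for _ in range(n_levels))

    def forward(self, codes, level):
        # codes: (batch, time, n_levels) RVQ codes; level >= 1 is the level being predicted.
        if self.sum_levels:
            return sum(self.embs[l](codes[..., l]) for l in range(level))
        return self.embs[level - 1](codes[..., level - 1])

class ToyClassifierHeads(nn.Module):
    """`split_classifier_heads`-style choice: one output projection per RVQ level,
    versus a single head shared across levels."""
    def __init__(self, n_levels=8, codebook_size=1024, dim=512, split=True):
        super().__init__()
        self.split = split
        self.heads = nn.ModuleList(nn.Linear(dim, codebook_size) for _ in range(n_levels)) \
            if split else nn.Linear(dim, codebook_size)

    def forward(self, hidden, level):
        return self.heads[level](hidden) if self.split else self.heads(hidden)

def position_ids(segment_lengths, unified=True):
    """`unified_position_ids`-style choice: one running position index across the
    whole text+prompt+audio sequence, or restarting from 0 for each segment."""
    if unified:
        return torch.arange(sum(segment_lengths))
    return torch.cat([torch.arange(n) for n in segment_lengths])

# Toy usage: predict RVQ level 3 for 75 frames of audio.
codes = torch.randint(0, 1024, (1, 75, 8))
hidden = ToyAudioEmbedding()(codes, level=3)    # (1, 75, 512)
logits = ToyClassifierHeads()(hidden, level=3)  # (1, 75, 1024)
print(logits.shape, position_ids([32, 150, 75], unified=False).shape)
```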