ecker committed on
Commit
1d72170
1 Parent(s): 3fe76d3

Update README.md


actually the 44KHz DAC model seems fine for RVQ levels 0-3

Files changed (1)
README.md +14 -12
README.md CHANGED
@@ -49,6 +49,19 @@ This repo contains the following configurations:
  + ~~I don't think audio quality differs a non-trivial amount to warrant splitting the model.~~
  - From recent experiments, it does seem a NAR-only model is beneficial.
 
+ * `config.dac.yaml` / `ar+nar-dac-llama-9`: Utilizes [Descript-Audio-Codec](https://github.com/descriptinc/descript-audio-codec/) instead as the audio backend.
+   + This utilizes the 44KHz model (erroneously at 44,000 Hz instead of 44,100 Hz) at 9 RVQ levels (mostly trained at 8, with the 9th included later).
+   + Originally experimented with feeding 24KHz audio through the 44KHz model (naively assuming nothing would go wrong), but the artifacts in the output proved to be too much.
+   + Later experimented with the 24KHz model, but training would *always* diverge.
+   + *Heavily* benefits from inferencing only the first four RVQ levels; the levels afterwards introduce far too much noise into the final output (see the decoding sketch below).
+   + I imagine the nature of DAC itself amplifies errors in the remaining RVQ levels (either due to less resiliency to errors in the codes, or to each RVQ level affecting the final waveform more).
+   + Has not received as much training as the EnCodec-based models.
+     + Because of this, performance leaves more to be desired.
+   + Further experimentation is needed, but the next approach is unknown:
+     + Train a NAR-only model to help bolster the remaining RVQ levels (outputted utterances seem a bit sluggish).
+     + Continue training the AR+NAR to try and bolster the AR tasks (as they're quite lacking at the moment).
+     + Delve into other exotic features, such as utilizing DAC's decoding embeddings (which might not be necessary at all, since it seems *fine* at the moment).
+
  Some additional configurations have been explored, but experiments have not been fruitful:
 
  * Exotic wrappers like `BitNet` seemed to yield little gain in inferencing, somehow.
@@ -58,15 +71,4 @@ Some additional configurations have been explored, but experiments have not
  * A NAR only model has been experimented with, but seemed utterly useless in practice.
    + The underlying architecture will query the model for the duration, and then inference *all* RVQ levels in parallel (one level at a time; see the sketch below).
    + Despite working in the overfitting test trainer and showing decent training metrics, inferencing will have the model fall completely flat.
-   + I have zero ideas for which path to go with for further experimentation.
-
- * A [Descript-Audio-Codec](https://github.com/descriptinc/descript-audio-codec/) based model has been experimented with, but has not seemed fruitful.
-   + This model would make use of 16 layers instead of the default 12 layers. I feel the performance hit is negligible, even with the additional tokens-per-frame increase with DAC.
-   + This utilizes DAC's 44KHz model (erroneously at an actual 44,000 Hz instead of 44,100 Hz), as audio quantized through the 24KHz model will *always* diverge.
-   + I imagine that, with DAC leaving very little room for error (a testament to how "optimized" the codes are), it's ***really*** hard to model an LM with it.
-   + Output audio is rather crunchy and crusty from the later RVQ levels being too inaccurate.
-   + I'm not sure which path to take for further experimentation:
-     + Utilizing the original model's embeddings or last hidden state as the input embeddings for the prompt/response.
-       + I don't think this is the way to go; the gain seems negligible for the additional complexity.
-     + Training a dedicated NAR model in hopes of bolstering the later RVQ levels' performance, as the issues come from those levels.
-     + Utilizing an interleaved pattern instead, to make better use of attending to past tokens for all levels.
+   + I have zero ideas for which path to go with for further experimentation.
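
The "first four RVQ levels" note in the added DAC section is easy to reproduce; below is a minimal sketch of a truncated decode with the `descript-audio-codec` package, not code from this repo. The random `codes` tensor and its shape are stand-ins for actual model output, and this assumes `quantizer.from_codes` accepts fewer codebooks than the model was trained with (it iterates over however many are present).

```python
# Minimal sketch (not from this repo): decode DAC codes while keeping only
# RVQ levels 0-3, per the note above. `codes` is random stand-in data here;
# in practice it would be the TTS model's output, shaped [batch, level, frame].
import dac
import torch

model_path = dac.utils.download(model_type="44khz")  # the 44KHz DAC model
model = dac.DAC.load(model_path).eval()

codes = torch.randint(0, 1024, (1, 9, 512))  # 9 RVQ levels, 1024-entry codebooks

with torch.no_grad():
    kept = codes[:, :4, :]                      # keep only the first four levels
    z, _, _ = model.quantizer.from_codes(kept)  # codes -> continuous latent
    audio = model.decode(z)                     # [1, 1, samples] waveform
```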
 
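For the NAR-only experiment in the second hunk, here is a rough sketch of the inference flow it describes: query the model for a duration, then fill in every RVQ level one level at a time, with each level predicted for all frames at once. Every name here (`predict_duration`, the forward signature) is a hypothetical placeholder, not this repo's actual API.

```python
# Rough sketch of the NAR-only inference loop described above. The model
# and its methods are hypothetical placeholders, not this repo's code.
import torch

@torch.no_grad()
def nar_only_inference(model, text, prompt, n_levels: int = 8):
    # 1. Query the model for the number of output frames (the duration).
    n_frames = model.predict_duration(text, prompt)

    # 2. Inference all RVQ levels "in parallel (one level at a time)":
    #    each pass predicts every frame of one level non-autoregressively,
    #    conditioned on the levels filled in so far.
    codes = torch.zeros(1, n_levels, n_frames, dtype=torch.long)
    for level in range(n_levels):
        logits = model(text, prompt, resps=codes[:, :level, :], level=level)
        codes[:, level, :] = logits.argmax(dim=-1)  # greedy pick per frame
    return codes
```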