ecker committed
Commit c692566
1 Parent(s): 55cdbd9

Update README.md

Files changed (1): README.md (+4, -2)
README.md CHANGED
@@ -26,9 +26,11 @@ To reiterate, this is ***by no means*** complete. I am not passing this off as c
 + The current RVQ level is included as a token as well to help guide NAR tasks better.
 + This model received a few days of training on my 4xV100s, stepping up the duration window to *try* and better make the model inference for longer utterances.
 + Some sessions end up training the current duration window for a few epochs, but I don't know how much it affected things.
-+ However, it seems to *only* do well with long utterances. Short utterances fumble. I believe further training with a variety of durations should allow the AR to handle a variety of durations.
-  - I believe the "slowly stepping up the context length" only works for text, and not audio.
++ ~~However, it seems to *only* do well with long utterances. Short utterances fumble. I believe further training with a variety of durations should allow the AR to handle a variety of durations.~~
+  - ~~I believe the "slowly stepping up the context length" only works for text, and not audio.~~
+  - Addendum: Additional brief training for a variety of duration lengths seemed to have mostly fixed this issue.
 + Zero-shot performance leaves a bit to be desired, as it did not receive the special training prioritizing shuffling between speakers rather than the global pool of utterances.
+  - Addendum: Additional brief training for sampling based on speaker per "epoch" (per dataloader, not dataset) seemed to slightly improve it.
 + Testing showed that, despite also stepping up the prompt duration, it *really* likes three second prompts.
 + Definitely needs additional training.
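
For context on the bullets above: "the current RVQ level is included as a token" boils down to embedding the level index like any other token so one set of NAR weights can serve every quantizer level. A minimal sketch, assuming PyTorch, 8 RVQ levels, and a 1024-wide transformer (all illustrative assumptions, not the repo's actual code):

```python
import torch
import torch.nn as nn

N_RVQ_LEVELS = 8  # assumption: e.g. EnCodec's residual quantizer depth
D_MODEL = 1024    # assumption: transformer width

class RVQLevelConditioning(nn.Module):
    """Embed the current RVQ level and prepend it as a token."""
    def __init__(self, n_levels: int = N_RVQ_LEVELS, d_model: int = D_MODEL):
        super().__init__()
        # one learned embedding per RVQ level, same width as the other tokens
        self.level_emb = nn.Embedding(n_levels, d_model)

    def forward(self, seq: torch.Tensor, level: int) -> torch.Tensor:
        # seq: (batch, time, d_model), the already-embedded input sequence
        idx = torch.full((seq.shape[0], 1), level,
                         dtype=torch.long, device=seq.device)
        # prepend the level token so every position can attend to it
        return torch.cat([self.level_emb(idx), seq], dim=1)
```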
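"Stepping up the duration window" is a length curriculum: each stage of training only draws utterances whose duration falls inside the current range, and the range widens as training progresses. A hedged sketch; the stage boundaries and step counts are made up:

```python
import random

def duration_window(step: int) -> tuple[float, float]:
    # assumption: widen the allowed utterance duration every 10k steps
    stages = [(1.0, 4.0), (1.0, 8.0), (1.0, 12.0), (1.0, 24.0)]  # seconds
    return stages[min(step // 10_000, len(stages) - 1)]

def sample_batch(dataset: list[dict], step: int, batch_size: int = 8) -> list[dict]:
    # dataset: list of {"duration": seconds, ...} records (illustrative schema)
    lo, hi = duration_window(step)
    pool = [u for u in dataset if lo <= u["duration"] <= hi]
    return random.sample(pool, min(batch_size, len(pool)))
```

The addendum about brief training on "a variety of duration lengths" fits this picture: once the window spans the whole range again, short and long utterances are both represented in each batch.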
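The speaker-sampling addendum describes balancing by speaker rather than by utterance: build each dataloader "epoch" by grouping utterances per speaker and drawing a fixed number from every speaker, so frequent speakers stop dominating the pool. A sketch under that reading (the field names are assumptions):

```python
import random
from collections import defaultdict

def speaker_balanced_epoch(dataset: list[dict], per_speaker: int = 1) -> list[dict]:
    # dataset: list of {"speaker": str, ...} records (illustrative schema)
    groups: dict[str, list[dict]] = defaultdict(list)
    for utt in dataset:
        groups[utt["speaker"]].append(utt)
    epoch = []
    for utts in groups.values():
        # sample with replacement so speakers with few clips still contribute
        epoch.extend(random.choices(utts, k=per_speaker))
    random.shuffle(epoch)
    return epoch
```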
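And since testing suggested the model "*really* likes three second prompts," clamping the conditioning prompt at inference is nearly a one-liner. The 75 frames-per-second rate is an assumption (EnCodec at 24 kHz); substitute the actual codec's frame rate:

```python
FRAME_RATE = 75  # codec frames per second (assumed; depends on the codec)

def trim_prompt(codes: list[list[int]], seconds: float = 3.0) -> list[list[int]]:
    # codes: one list of frame codes per RVQ level
    n_frames = int(seconds * FRAME_RATE)
    return [level[:n_frames] for level in codes]
```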