mrq committed on
Commit
0c4f028
•
1 Parent(s): 2de4670
README.md CHANGED
@@ -7,3 +7,34 @@ This repo catalogs my weights for use with my [VALL-E](https://github.com/e-c-k-
 The model currently is in a *semi-usable* state, and I'm releasing them now in hopes that it also helps jumpstart anyone else that wants to use them.
 
 To reiterate, this is ***by no means*** complete. I am not passing this off as competitive.
+
+ ## Models
+
+ * `config.retnet.yaml` / `ar+nar-retnet-8`: The previously released weights.
+ + This configuration utilizes a RetNet (retention-based transformer) as the underlying architecture, for better or for worse, owing to a number of misleading interpretations drawn from comparisons.
+ + Prompt and response embeddings are summed (further RVQ levels get the previous RVQ levels' embeddings factored in); a sketch follows this list.
+ + Tokenizer is a homebrewed "naive" implementation.
+ + This model received the most training time, split between my 4070Ti, my 7900XTX, and a few rental rigs to progress training further, entirely at `bfloat16` with `prodigyopt` (and a few optimizer restarts).
+ + The later part of training aimed to shuffle between speakers rather than the global pool of utterances, to better focus on zero-shot performance. Due to this, I feel it achieved *decent* zero-shot performance.
+ + However, because the dataset was aggressively trimmed to under 12 seconds for memory savings during training, it suffers when inferencing non-short utterances. Additional training may fix this; the following models seemed to adapt well to longer utterances.
+ + Prior testing showed that longer prompt durations result in better utterances.
+
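+ Below is a minimal, hypothetical sketch of the summed-embedding scheme above; the names and shapes are illustrative guesses, not the actual VALL-E implementation:
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class SummedAudioEmbedding(nn.Module):
+     """Illustrative only: one embedding table per RVQ level, summed upward."""
+     def __init__(self, n_tokens: int = 1024, n_levels: int = 8, d_model: int = 1024):
+         super().__init__()
+         self.tables = nn.ModuleList(
+             nn.Embedding(n_tokens, d_model) for _ in range(n_levels)
+         )
+
+     def forward(self, codes: torch.Tensor) -> torch.Tensor:
+         # codes: [batch, seq_len, k] RVQ indices for levels 0..k-1; the
+         # input at level k folds in every previous level's embedding.
+         k = codes.shape[-1]
+         return sum(self.tables[i](codes[..., i]) for i in range(k))
+ ```
+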
+ * `config.llama.yaml` / `ar+nar-llama-8`: The most recently trained weights, after learning from my mistakes.
+ + This configuration utilizes Llama's attention-based transformer as the underlying architecture, making use of creature comforts like RoPE, GQA, and memory-efficient attention (trained under `xformers`, which shouldn't really affect things).
+ + Prompt and response embeddings are NOT summed (each RVQ level only attends to the current RVQ level); a sketch follows this list.
+ + Utilizes a HF tokenizer for "optimal" vocab.
+ + The current RVQ level is included as a token as well, to help better guide NAR tasks.
+ + This model received a few days of training on my 4xV100s, stepping up the duration window to *try* and make the model better at inferencing longer utterances.
+ + Some sessions ended up training the current duration window for a few epochs, but I don't know how much that affected things.
+ + However, it seems to *only* do well with long utterances; short utterances fumble. I believe further training with a variety of durations should allow the AR to handle a variety of durations.
+ - I believe "slowly stepping up the context length" only works for text, not audio.
+ + Zero-shot performance leaves a bit to be desired, as this model did not receive the special training that prioritizes shuffling between speakers rather than the global pool of utterances.
+ + Testing showed that, despite also stepping up the prompt duration, it *really* likes three-second prompts.
+ + It definitely needs additional training.
+
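+ A matching hypothetical sketch of the non-summed scheme: each RVQ level uses only its own embedding table, and the current level is marked with its own token (again, names are guesses, not the repo's actual code):
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class PerLevelAudioEmbedding(nn.Module):
+     """Illustrative only: one table per RVQ level, no summing across levels."""
+     def __init__(self, n_tokens: int = 1024, n_levels: int = 8, d_model: int = 1024):
+         super().__init__()
+         self.tables = nn.ModuleList(
+             nn.Embedding(n_tokens, d_model) for _ in range(n_levels)
+         )
+         # a learned marker embedding per RVQ level, prepended as a task hint
+         self.level_marker = nn.Embedding(n_levels, d_model)
+
+     def forward(self, codes: torch.Tensor, level: int) -> torch.Tensor:
+         # codes: [batch, seq_len] RVQ indices from level `level` only
+         x = self.tables[level](codes)
+         marker = self.level_marker.weight[level].expand(x.shape[0], 1, -1)
+         return torch.cat([marker, x], dim=1)
+ ```
+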
+ * `config.llama-split.yaml` / `ar-llama-1` + `nar-llama-8`: The above model, but split and trained a little bit more.
+ + This experiment is to see whether the AR and NAR benefit from being split up after enough pretraining, to un-"lobotomize" any penalties incurred from attending to two different tasks (the AR predicts the next token, while the NAR predicts the same token but at a different RVQ level); a decode sketch follows this list.
+ + I believe I trained each model for about an extra day over another audio-duration window, for similar training lengths.
+ + I don't think audio quality differs by a non-trivial enough amount to warrant splitting the model.
+
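+ A rough sketch of the two-stage decode that the AR/NAR split implies; `ar` and `nar` here are stand-in callables, not the actual API:
+
+ ```python
+ import torch
+
+ def decode(ar, nar, prompt: torch.Tensor, steps: int = 100, n_levels: int = 8) -> torch.Tensor:
+     # AR stage: extend the level-0 sequence one token at a time.
+     seq = prompt  # [batch, t] level-0 codes
+     for _ in range(steps):
+         logits = ar(seq)                     # [batch, t, vocab]
+         nxt = logits[:, -1:].argmax(dim=-1)  # greedy, for simplicity
+         seq = torch.cat([seq, nxt], dim=1)
+
+     # NAR stage: fill in levels 1..n_levels-1, one whole level per pass.
+     codes = [seq]
+     for level in range(1, n_levels):
+         stacked = torch.stack(codes, dim=-1)  # [batch, t, level]
+         codes.append(nar(stacked, level).argmax(dim=-1))
+     return torch.stack(codes, dim=-1)         # [batch, t, n_levels]
+ ```
+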
+ There's a bunch of additional configurations (between the underlying arch, embedding modes, interleaving, and even a NAR-"only" model) that have yet to be explored further, but current experiments showed they either aren't worth the additional performance penalties (interleaving; a toy sketch follows) or fall flat (NAR-"only", chunked interleaving).
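+
+ As a toy illustration of one reading of "interleaving" (flattening RVQ codes so a single AR predicts every level in sequence), note the cost: an 8-level clip becomes an 8x longer sequence:
+
+ ```python
+ import torch
+
+ def interleave(codes: torch.Tensor) -> torch.Tensor:
+     # codes: [seq_len, n_levels] -> [seq_len * n_levels], frame-major,
+     # so each frame's levels are emitted back-to-back.
+     return codes.reshape(-1)
+
+ def deinterleave(flat: torch.Tensor, n_levels: int) -> torch.Tensor:
+     return flat.reshape(-1, n_levels)
+ ```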
{old → model}/ckpt/ar+nar-retnet-8/fp32.pth RENAMED
File without changes
model/{config.split.yaml → config.llama-split.yaml} RENAMED
File without changes
model/{config.yaml → config.llama.yaml} RENAMED
File without changes
old/config.ar_nar.yaml → model/config.retnet.yaml RENAMED
@@ -1,97 +1,73 @@
-dataset:
-  training: []
-  validation: []
-  noise: []
-
-  speaker_name_getter: "lambda p: f'{p.parts[-3]}_{p.parts[-2]}'"
-
-  use_hdf5: True
-  use_metadata: True
-  hdf5_flag: r
-  validate: True
-
-  workers: 2
-  cache: True
-
-  phones_range: [4, 256]
-  duration_range: [1.0, 16.0]
-
-  random_utterance: 1.0
-  max_prompts: 3
-  prompt_duration: 6.0
-
-  sample_type: speaker
-
-  tasks_list: [ "tts" ] # , [ "tts", "tts-c", "ns", "sr", "tse", "cse", "nse", "tts"]
+sample_rate: 24_000
+audio_backend: vocos
+experimental: True
 
 models:
-  _prom_levels: 8
-  _max_levels: 8
-
-  _models:
-  - name: "ar+nar"
-    size: "full"
-    resp_levels: 8
-    prom_levels: 8
-    tasks: 8
-    arch_type: "retnet"
-    training: True
-    version: 2
+- name: "ar+nar"
+  size: "full"
+  resp_levels: 8
+  prom_levels: 8
+  tasks: 8
+  langs: 2
+  tones: 1
+  arch_type: retnet
+  training: False
+  version: 2
+  dropout: 0.1
+  audio_embedding_sums: True
+  interleave: False
+  experimental: False
+  capabilities: ["ar", "nar"]
 
 hyperparameters:
-  batch_size: 8
-  gradient_accumulation_steps: 32
-  gradient_clipping: 100
-
+  autotune: False
+  autotune_params:
+    start_profile_step: 1
+    end_profile_step: 50
+    num_tuning_micro_batch_sizes: 8
+
+  batch_size: 16
+  gradient_accumulation_steps: 8
+  gradient_clipping: 1.0
+  warmup_steps: 250
+
   optimizer: Prodigy
-  torch_optimizer: True
   learning_rate: 1.0
+  torch_optimizer: True
 
-  scheduler_type: ""
-  #scheduler_type: OneCycle
-  #scheduler_params:
-  #  cycle_first_step_size: 10_000
-  #  cycle_first_stair_count: 10_000
-
-  #  cycle_second_step_size: 15_000
-  #  cycle_second_stair_count: 15_000
-
-  #  decay_step_size: 5_000
-
-  #  cycle_min_lr: 2.5e-4 # 1.0e-5
-  #  cycle_max_lr: 2.5e-4 # 1.0e-4
-  #  decay_lr_rate: 0.0
-
-  #  cycle_min_mom: 0.90
-  #  cycle_max_mom: 0.99
-  #  decay_mom_rate: 0.0
+  scheduler: "" # ScheduleFree
+  torch_scheduler: True
 
 evaluation:
   batch_size: 16
-  frequency: 250
+  frequency: 1000
   size: 16
 
-  steps: 450
+  steps: 500
   ar_temperature: 0.95
   nar_temperature: 0.25
   load_disabled_engines: True
 
 trainer:
+  #no_logger: True
+  ddp: False
+  check_for_oom: False
   iterations: 1_000_000
 
   save_tag: step
   save_on_oom: True
   save_on_quit: True
-  save_frequency: 100
+  save_frequency: 500
   export_on_save: True
 
-  keep_last_checkpoints: 4
+  keep_last_checkpoints: 8
 
   aggressive_optimizations: False
   load_disabled_engines: False
+  gradient_checkpointing: True
 
   #load_state_dict: True
-  #strict_loading: False
+  strict_loading: False
   #load_tag: "9500"
   #load_states: False
   #restart_step_count: True
@@ -99,25 +75,66 @@ trainer:
   gc_mode: None # "global_step"
 
   weight_dtype: bfloat16
-  amp: False
+  amp: True
 
   backend: deepspeed
   deepspeed:
+    inferencing: True
     zero_optimization_level: 0
-    use_compression_training: True
+    use_compression_training: False
+
+    amp: False
 
-  activation_checkpointing: True
+  load_webui: False
 
 inference:
-  use_vocos: True
+  backend: deepspeed
+  audio_backend: "vocos"
   normalize: False
 
   weight_dtype: bfloat16
-  amp: False
-
-bitsandbytes:
-  enabled: False
-  injects: True
-  linear: True
-  embedding: True
-
+  amp: True
+
+optimizations:
+  injects: False
+  replace: True
+
+  linear: False
+  embedding: False
+  optimizers: True
+
+  bitsandbytes: False
+  dadaptation: False
+  bitnet: False
+  fp8: False
+
+dataset:
+  speaker_name_getter: "lambda p: f'{p.parts[-3]}_{p.parts[-2]}'"
+  speaker_group_getter: "lambda p: f'{p.parts[-3]}'"
+  speaker_languages:
+    ja: []
+
+  use_hdf5: True
+  use_metadata: True
+  hdf5_flag: r
+  validate: True
+
+  workers: 6
+  cache: True
+
+  duration_range: [3.0, 16.0]
+
+  random_utterance: 1.0
+  max_prompts: 1
+  prompt_duration_range: [3.0, 9.0]
+
+  max_resps: 1
+  p_resp_append: 0.25
+
+  sample_type: path # path # speaker
+
+  tasks_list: [ "tts" ] # , [ "tts", "tts-c", "ns", "sr", "tse", "cse", "nse", "tts"]
+
+  training: []
+  validation: []
+  noise: []
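
As a quick, hedged sketch (file paths taken from the renames above; the dataset layout below is a guess), the config can be inspected with plain PyYAML, and the `speaker_name_getter` lambda evaluated against a sample path:

```python
from pathlib import Path
import yaml

with open("model/config.retnet.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["models"][0]["arch_type"])  # -> "retnet"

# The getter strings in the dataset section are Python lambda sources.
getter = eval(cfg["dataset"]["speaker_name_getter"])
p = Path("data/LibriTTS/speaker123/book45/utt001.wav")
print(getter(p))  # -> "speaker123_book45"
```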
old/config.yaml DELETED
The diff for this file is too large to render. See raw diff