ecker committed
Commit d890d15 • 1 Parent(s): b2cc05f

updated README.md

README.md CHANGED
@@ -1,12 +1,39 @@
  ---
- title: Vall E
- emoji: 🔥
- colorFrom: yellow
  colorTo: purple
  sdk: gradio
  sdk_version: 3.41.2
  app_file: app.py
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: VALL-E
+ emoji: 🐻‍❄️
+ colorFrom: green
  colorTo: purple
  sdk: gradio
  sdk_version: 3.41.2
  app_file: app.py
  pinned: false
+ license: agpl-3.0
  ---

+ Recommended settings for stable speech:
+ * `NAR levels`: 7; fewer NAR levels reduce the quality of the final waveform (this may also be strictly because, when EnCodec is fed a sequence with fewer RVQ bin levels than it was initialized with, it sounds worse).
+ * `Temperature (AR)`: [0.85, 1.1]; it's ***really*** tough to find a one-size-fits-all value.
+ * `Temperature (NAR)`: [0.15, 0.85]; this is even harder to nail down. Too high and you'll hear artifacts from the NAR; too low and acoustic detail may not be recreated.
+ * `Dynamic Temperature`: checked; dynamic temperature seems to definitely help resolve issues with a model that is not strongly trained. Pairable with every other sampling technique.
+ * `Top P`: [0.85, 0.95] || 1; I feel this is cope.
+ * `Top K`: [768, 1024] || 0; I also feel this is cope.
+ * `Beam Width`: 0 || 16; beam search helps find potentially better candidates, but I'm not sure how much it helps in the realm of audio. Incompatible with mirostat.
+ * `Repetition Penalty`: 1.35; this and the length decay are, miraculously, what help stabilize output; I have my theories.
+ * `Repetition Penalty Length Decay`: 0.2; this helps avoid severely dampening the model's output when applying the repetition penalty.
+ * `Length Penalty`: 0; this should only be messed with if your output is consistently either too short or too long. The AR is trained decently enough to know when to emit a STOP token.
+ * `Mirostat (Tau)`: [2.0, 8.0]; the "surprise value" when performing mirostat sampling, which seems much more favorable than typical top-k/top-p or beam search sampling. The "best" values are still unknown.
+ * `Mirostat (Eta)`: [0.05, 0.3]; the "learning rate" (decay value?) applied each step of mirostat sampling.
+
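The mirostat settings above can be made concrete with a minimal sketch of the technique: truncate tokens whose "surprise" exceeds a running threshold, sample from the rest, then nudge the threshold toward the target tau. This is a generic illustration with assumed names (`mirostat_step`), not the actual sampler used by the web UI.

```python
import numpy as np

def mirostat_step(logits: np.ndarray, mu: float, tau: float, eta: float,
                  rng: np.random.Generator) -> tuple[int, float]:
    """One step of mirostat-style sampling: drop tokens whose surprise
    (-log2 p) exceeds mu, sample from the survivors, then move mu toward
    the target surprise tau by the observed error scaled by eta."""
    probs = np.exp(logits - logits.max())     # numerically stable softmax
    probs /= probs.sum()
    surprise = -np.log2(probs)
    keep = surprise <= mu
    if not keep.any():                        # always keep the most likely token
        keep[np.argmax(probs)] = True
    trunc = np.where(keep, probs, 0.0)
    trunc /= trunc.sum()
    token = int(rng.choice(len(trunc), p=trunc))
    mu -= eta * (surprise[token] - tau)       # feedback step toward tau
    return token, mu

# Usage: mu is conventionally initialized to 2*tau and carried across steps.
rng = np.random.default_rng(0)
mu = 2 * 4.0
token, mu = mirostat_step(np.zeros(256), mu, tau=4.0, eta=0.1, rng=rng)
```

The feedback loop is why tau behaves like a "surprise value" and eta like a "learning rate": eta only controls how quickly the cutoff adapts, which matches the wide workable range given above.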
27
+ This Space:
28
+ * houses experimental models and the necessary inferencing code for my [VALL-E](https://git.ecker.tech/mrq/vall-e) implementation. I hope to gain some critical feedback with the outputs.
29
+ * utilizes a T4 with a narcoleptic 5-minute sleep timer, as I do not have another system to (easily) host this myself with a 6800XT (or two) while I'm training off my 4070Ti and 7900XTX.
30
+
31
+ The model is:
32
+ * utilizing an RetNet for faster training/inferencing with conforming dimensionality (1024 dim, 4096 ffn dim, 16 heads, 12 layers) targetting the full eight RVQ-bins (albeit the model was originally trained at two then four).
33
+ * trained on ~12.5K hour dataset composed of LibriTTS-R, LibriLight (`small`+`medium`+`duplicated`), generously donated audiobooks, and vidya voice clip rips (including some Japanese kusoge gacha clips).
34
+ * a "monolothic" approach to sharing the retention-based transformer weights between AR and NAR tasks for no immediately discernable penalties (besides retraining).
35
+ * utilizing DeepSpeed to inference using its int8 quantized inferencing (allegedly), and Vocos for better output quality.
36
+ - I do need to add a toggle between different dtypes to gauge any perceptable quality/throughput gains/losses.
37
+ * currently still being trained, and any updates to it will be pushed back to this repo.
38
+
39
+ I am also currently training an experimental model with double the layers (24 layers instead) to gauge its performance. Depending on how well it performs, I may pivot to that too, but for now, I'm starting to doubt the time investment in training it.
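As a rough sanity check on the model size described above, the fp32 checkpoint added later in this commit implies a parameter count of about 110M, since a float32 weight takes 4 bytes (this ignores serialization overhead and any non-tensor state in the file, so it is only a ballpark figure):

```python
# Approximate parameter count from the fp32 checkpoint size; float32 weights
# are 4 bytes each, so bytes / 4 gives a rough estimate of the parameter count.
size_bytes = 441_076_031            # models/ckpt/ar+nar-retnet-8/fp32.pth
approx_params_m = size_bytes / 4 / 1e6
print(f"~{approx_params_m:.0f}M parameters")   # ~110M parameters
```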
app.py ADDED
@@ -0,0 +1,5 @@
+ import sys
+
+ sys.argv = sys.argv + "--dtype=int8 --device=cuda".split(" ")
+
+ from vall_e.webui import ui
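The `sys.argv` line in `app.py` works because flags appended before the import are visible to any argument parsing that runs at module import time. A minimal, self-contained stand-in for that pattern (the parser here is purely illustrative; the real flags are handled inside `vall_e` itself):

```python
import argparse
import sys

# Append flags before any parser runs, exactly as app.py does.
sys.argv = sys.argv + "--dtype=int8 --device=cuda".split(" ")

# Stand-in parser; vall_e's own argument handling plays this role on import.
parser = argparse.ArgumentParser()
parser.add_argument("--dtype", default="float32")
parser.add_argument("--device", default="cpu")
args, _unknown = parser.parse_known_args()

print(args.dtype, args.device)   # int8 cuda
```

`parse_known_args` is the forgiving choice here, since the host environment (e.g. the Spaces launcher) may pass its own arguments that the stand-in parser does not know about.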
models/ckpt/ar+nar-retnet-4/fp32.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:91781616d713a424cab977abb718888323d1a26461bef78c8065ac30d1258d2a
+ size 424338659
models/ckpt/ar+nar-retnet-8/fp32.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:82db6c749d682f84881f8ec8aa3b402b56ce63118867be46bf9dada34dc0ded5
+ size 441076031
models/ckpt/ar-retnet-4/fp32.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e042d05f14f21a166cd5f5c16b9c9c4ac9ce18af2a4c285c7f0d3ef3ea6729bf
+ size 418040575
models/ckpt/nar-retnet-4/fp32.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:18027cafe3c077cb8786a5665f04f732f4e3fcacff17844182f9383a1dca640f
+ size 422230719
models/config.ar_nar.yaml ADDED
@@ -0,0 +1,126 @@
+ dataset:
+   training: []
+   validation: []
+   noise: []
+
+   speaker_name_getter: "lambda p: f'{p.parts[-3]}_{p.parts[-2]}'"
+
+   use_hdf5: True
+   use_metadata: True
+   hdf5_flag: r
+   validate: True
+
+   workers: 2
+   cache: True
+
+   phones_range: [4, 256]
+   duration_range: [1.0, 16.0]
+
+   random_utterance: 1.0
+   max_prompts: 3
+   prompt_duration: 6.0
+
+   sample_type: speaker
+
+   tasks_list: [ "tts" ] # [ "tts", "tts-c", "ns", "sr", "tse", "cse", "nse", "tts" ]
+
+ models:
+   _prom_levels: 8
+   _max_levels: 8
+
+   _models:
+     - name: "ar+nar"
+       size: "full"
+       resp_levels: 8
+       prom_levels: 8
+       tasks: 8
+       langs: 2
+       arch_type: "retnet"
+       training: True
+       version: 3
+
+ hyperparameters:
+   batch_size: 8
+   gradient_accumulation_steps: 32
+   gradient_clipping: 100
+
+   optimizer: AdamW # Prodigy
+   torch_optimizer: True
+   learning_rate: 1.0e-4
+
+   scheduler_type: ""
+   #scheduler_type: OneCycle
+   #scheduler_params:
+   #  cycle_first_step_size: 10_000
+   #  cycle_first_stair_count: 10_000
+
+   #  cycle_second_step_size: 15_000
+   #  cycle_second_stair_count: 15_000
+
+   #  decay_step_size: 5_000
+
+   #  cycle_min_lr: 2.5e-4 # 1.0e-5
+   #  cycle_max_lr: 2.5e-4 # 1.0e-4
+   #  decay_lr_rate: 0.0
+
+   #  cycle_min_mom: 0.90
+   #  cycle_max_mom: 0.99
+   #  decay_mom_rate: 0.0
+
+ evaluation:
+   batch_size: 16
+   frequency: 250
+   size: 16
+
+   steps: 450
+   ar_temperature: 0.95
+   nar_temperature: 0.25
+   load_disabled_engines: True
+
+ trainer:
+   iterations: 1_000_000
+
+   save_tag: step
+   save_on_oom: True
+   save_on_quit: True
+   save_frequency: 100
+   export_on_save: True
+
+   keep_last_checkpoints: 4
+
+   aggressive_optimizations: False
+   load_disabled_engines: False
+
+   load_state_dict: True
+   #strict_loading: False
+   #load_tag: "9500"
+   #load_states: False
+   #restart_step_count: True
+
+   gc_mode: None # "global_step"
+
+   weight_dtype: bfloat16
+   amp: False
+
+   backend: deepspeed
+   deepspeed:
+     inferencing: True
+     zero_optimization_level: 0
+     use_compression_training: False
+
+     activation_checkpointing: True
+
+ inference:
+   backend: local
+   use_vocos: True
+   normalize: False
+
+   weight_dtype: bfloat16
+   amp: False
+
+ bitsandbytes:
+   enabled: False
+   injects: True
+   linear: True
+   embedding: True
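One detail worth calling out in the hyperparameters above: with gradient accumulation, the effective batch size per optimizer step is the product of the micro-batch size and the accumulation steps, not `batch_size` alone:

```python
# Effective batch size implied by the config above: each optimizer step
# accumulates gradients over 32 micro-batches of 8 samples.
batch_size = 8
gradient_accumulation_steps = 32
print(batch_size * gradient_accumulation_steps)   # 256
```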
models/config.yaml ADDED
@@ -0,0 +1,104 @@
+ dataset:
+   training: [
+   ]
+   validation: [
+   ]
+   noise: [
+   ]
+
+   speaker_name_getter: "lambda p: f'{p.parts[-3]}_{p.parts[-2]}'"
+
+   use_hdf5: True
+   use_metadata: True
+   hdf5_flag: r
+   validate: True
+
+   workers: 4
+   cache: True
+
+   phones_range: [4, 256]
+   duration_range: [1.0, 16.0]
+
+   random_utterance: 1.0
+   max_prompts: 3
+   prompt_duration: 3.0
+
+   sample_type: speaker
+
+   tasks_list: ["tts"] # ["tts", "ns", "sr", "tse", "cse", "nse", "tts"]
+
+ models:
+   _prom_levels: 4
+   _max_levels: 4
+
+   _models:
+     - name: "ar"
+       size: "full"
+       resp_levels: 1
+       prom_levels: 2
+       tasks: 8
+       arch_type: "retnet"
+       training: True
+     - name: "nar"
+       size: "full"
+       resp_levels: 3
+       prom_levels: 4
+       tasks: 8
+       arch_type: "retnet"
+       training: True
+
+ hyperparameters:
+   batch_size: 8
+   gradient_accumulation_steps: 1
+   gradient_clipping: 100
+
+   optimizer: AdamW
+   learning_rate: 1.0e-5
+
+   scheduler_type: ""
+
+ evaluation:
+   batch_size: 16
+   frequency: 500
+   size: 16
+
+   steps: 300
+   ar_temperature: 0.95
+   nar_temperature: 0.25
+   load_disabled_engines: True
+
+ trainer:
+   iterations: 1_000_000
+
+   save_tag: step
+   save_on_oom: True
+   save_on_quit: True
+   save_frequency: 100
+   export_on_save: True
+
+   keep_last_checkpoints: 4
+
+   load_state_dict: True
+
+   gc_mode: None # "global_step"
+
+   weight_dtype: bfloat16
+   amp: False
+
+   backend: deepspeed
+   deepspeed:
+     zero_optimization_level: 0
+     use_compression_training: True
+
+     activation_checkpointing: True
+
+ inference:
+   use_vocos: True
+   normalize: False
+
+ bitsandbytes:
+   enabled: False
+   injects: False
+   linear: False
+   embedding: False
packages.txt ADDED
@@ -0,0 +1 @@
+ espeak-ng
requirements.txt ADDED
@@ -0,0 +1,8 @@
+ --extra-index-url https://download.pytorch.org/whl/cu118
+ torch
+ torchaudio
+
+
+ deepspeed==0.10.0
+ vall_e @ git+https://git.ecker.tech/mrq/vall-e
+ transformers