juanfkurucz committed on
Commit a6b4f9e
1 Parent(s): b31c1f2

Update README.md

Files changed (1)
  1. README.md +74 -0
README.md CHANGED
@@ -1,3 +1,77 @@
  ---
+ language: en
+ thumbnail:
  license: mit
+ tags:
+ - question-answering
+ datasets:
+ - squad_v2
+ metrics:
+ - squad_v2
  ---
+
+ ## bert-large-uncased-wwm-squadv2-optimized-f16
+
+ This is an optimized model that uses [madlag/bert-large-uncased-wwm-squadv2-x2.63-f82.6-d16-hybrid-v1](https://huggingface.co/madlag/bert-large-uncased-wwm-squadv2-x2.63-f82.6-d16-hybrid-v1) as its base model, which was created with the [nn_pruning](https://github.com/huggingface/nn_pruning) Python library as a pruned version of [madlag/bert-large-uncased-whole-word-masking-finetuned-squadv2](https://huggingface.co/madlag/bert-large-uncased-whole-word-masking-finetuned-squadv2).
+
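+ As a reference point, the pruned base model can also be run directly with the standard `transformers` question-answering pipeline. The following is only a minimal sketch (not part of this repository's usage example), assuming the base checkpoint loads with the regular `transformers` API:
+
+ ```python
+ from transformers import pipeline
+
+ # Illustrative sketch only: run the pruned (non-ONNX) base checkpoint
+ # with the standard question-answering pipeline for comparison.
+ qa = pipeline(
+     "question-answering",
+     model="madlag/bert-large-uncased-wwm-squadv2-x2.63-f82.6-d16-hybrid-v1",
+ )
+
+ result = qa(
+     question="Who worked a little bit harder?",
+     context="The second little pig worked a little bit harder but he was somewhat lazy too.",
+ )
+ print(result["answer"], result["score"])
+ ```
+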
+ Our final optimized model weighs **579 MB**, has an inference latency of **18.184 ms** on a Tesla T4, and achieves a best F1 of **82.68%**. Below is a comparison against each base model:
+
+ | Model | Weight | Latency on Tesla T4 | Best F1 |
+ | -------- | ----- | --------- | --------- |
+ | [madlag/bert-large-uncased-whole-word-masking-finetuned-squadv2](https://huggingface.co/madlag/bert-large-uncased-whole-word-masking-finetuned-squadv2) | 1275 MB | 140.529 ms | 86.08% |
+ | [madlag/bert-large-uncased-wwm-squadv2-x2.63-f82.6-d16-hybrid-v1](https://huggingface.co/madlag/bert-large-uncased-wwm-squadv2-x2.63-f82.6-d16-hybrid-v1) | 1085 MB | 90.801 ms | 82.67% |
+ | Our optimized model | 579 MB | 18.184 ms | 82.68% |
+
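+ The exact benchmarking script is not included in this card. As a rough, hypothetical sketch, a per-query latency of this kind can be estimated by timing repeated `InferenceSession.run` calls, assuming the `sess` session and `inputs` feed dictionary are created as in the usage example below:
+
+ ```python
+ import time
+
+ # Hypothetical timing sketch, not the script used for the numbers above.
+ # Assumes `sess` (an onnxruntime InferenceSession) and `inputs` (the tokenized
+ # feed dictionary) are built exactly as in the usage example below.
+ for _ in range(10):  # warm-up runs so one-time initialization is excluded
+     sess.run(None, input_feed=inputs)
+
+ n_runs = 100
+ start = time.perf_counter()
+ for _ in range(n_runs):
+     sess.run(None, input_feed=inputs)
+ elapsed_ms = (time.perf_counter() - start) * 1000 / n_runs
+ print(f"Average latency: {elapsed_ms:.3f} ms")
+ ```
+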
+ ## Example Usage
+
+ ```python
+ import torch
+ from huggingface_hub import hf_hub_download
+ from onnxruntime import InferenceSession
+ from transformers import AutoTokenizer
+
+ MAX_SEQUENCE_LENGTH = 512
+
+ # Download the ONNX model from the Hub
+ model = hf_hub_download(
+     repo_id="tryolabs/bert-large-uncased-wwm-squadv2-optimized-f16", filename="model.onnx"
+ )
+
+ # Load the tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("tryolabs/bert-large-uncased-wwm-squadv2-optimized-f16")
+
+ question = "Who worked a little bit harder?"
+ context = "The first little pig was very lazy. He didn't want to work at all and he built his house out of straw. The second little pig worked a little bit harder but he was somewhat lazy too and he built his house out of sticks. Then, they sang and danced and played together the rest of the day."
+
+ # Tokenize the question/context pair into a feed dictionary of numpy arrays
+ inputs = dict(
+     tokenizer(
+         question, context, return_tensors="np", truncation=True, max_length=MAX_SEQUENCE_LENGTH
+     )
+ )
+
+ # Create the inference session (CPU execution)
+ sess = InferenceSession(
+     model, providers=["CPUExecutionProvider"]
+ )
+
+ # Run predictions
+ output = sess.run(None, input_feed=inputs)
+
+ # The model returns start and end logits for the answer span
+ answer_start_scores, answer_end_scores = torch.tensor(output[0]), torch.tensor(
+     output[1]
+ )
+
+ # Post-process predictions: pick the most likely start/end token positions
+ input_ids = inputs["input_ids"].tolist()[0]
+ answer_start = torch.argmax(answer_start_scores)
+ answer_end = torch.argmax(answer_end_scores) + 1
+ answer = tokenizer.convert_tokens_to_string(
+     tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
+ )
+
+ # Output prediction
+ print("Answer", answer)
+ ```
+
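+ The example above runs on CPU via onnxruntime's `CPUExecutionProvider`; on a machine with a CUDA-capable GPU and the `onnxruntime-gpu` package installed, the session can instead be created with `providers=["CUDAExecutionProvider"]` to run inference on the GPU.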