juanfkurucz committed on
Commit a6b4f9e
1 Parent(s): b31c1f2

Update README.md

Files changed (1)
  1. README.md +74 -0
README.md CHANGED
@@ -1,3 +1,77 @@
  ---
+ language: en
+ thumbnail:
  license: mit
+ tags:
+ - question-answering
+ datasets:
+ - squad_v2
+ metrics:
+ - squad_v2
  ---
+
+ ## bert-large-uncased-wwm-squadv2-optimized-f16
+
+ This is an optimized model that uses [madlag/bert-large-uncased-wwm-squadv2-x2.63-f82.6-d16-hybrid-v1](https://huggingface.co/madlag/bert-large-uncased-wwm-squadv2-x2.63-f82.6-d16-hybrid-v1) as its base model, which was created with the [nn_pruning](https://github.com/huggingface/nn_pruning) Python library as a pruned version of [madlag/bert-large-uncased-whole-word-masking-finetuned-squadv2](https://huggingface.co/madlag/bert-large-uncased-whole-word-masking-finetuned-squadv2).
+
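+ As a reference point, the pruned base model can also be run directly with the standard `transformers` question-answering pipeline. The following is only a minimal sketch (not part of this repository's usage example), assuming the base checkpoint loads with the regular `transformers` API:
+
+ ```python
+ from transformers import pipeline
+
+ # Illustrative sketch only: run the pruned (non-ONNX) base checkpoint
+ # with the standard question-answering pipeline for comparison.
+ qa = pipeline(
+     "question-answering",
+     model="madlag/bert-large-uncased-wwm-squadv2-x2.63-f82.6-d16-hybrid-v1",
+ )
+
+ result = qa(
+     question="Who worked a little bit harder?",
+     context="The second little pig worked a little bit harder but he was somewhat lazy too.",
+ )
+ print(result["answer"], result["score"])
+ ```
+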
+ Our final optimized model weighs **579 MB**, has an inference latency of **18.184 ms** on a Tesla T4, and achieves a best F1 of **82.68%**. Below is a comparison against each base model:
+
+ | Model | Weight | Latency on Tesla T4 | Best F1 |
+ | -------- | ----- | --------- | --------- |
+ | [madlag/bert-large-uncased-whole-word-masking-finetuned-squadv2](https://huggingface.co/madlag/bert-large-uncased-whole-word-masking-finetuned-squadv2) | 1275 MB | 140.529 ms | 86.08% |
+ | [madlag/bert-large-uncased-wwm-squadv2-x2.63-f82.6-d16-hybrid-v1](https://huggingface.co/madlag/bert-large-uncased-wwm-squadv2-x2.63-f82.6-d16-hybrid-v1) | 1085 MB | 90.801 ms | 82.67% |
+ | Our optimized model | 579 MB | 18.184 ms | 82.68% |
+
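+ The exact benchmarking script is not included in this card. As a rough, hypothetical sketch, a per-query latency of this kind can be estimated by timing repeated `InferenceSession.run` calls, assuming the `sess` session and `inputs` feed dictionary are created as in the usage example below:
+
+ ```python
+ import time
+
+ # Hypothetical timing sketch, not the script used for the numbers above.
+ # Assumes `sess` (an onnxruntime InferenceSession) and `inputs` (the tokenized
+ # feed dictionary) are built exactly as in the usage example below.
+ for _ in range(10):  # warm-up runs so one-time initialization is excluded
+     sess.run(None, input_feed=inputs)
+
+ n_runs = 100
+ start = time.perf_counter()
+ for _ in range(n_runs):
+     sess.run(None, input_feed=inputs)
+ elapsed_ms = (time.perf_counter() - start) * 1000 / n_runs
+ print(f"Average latency: {elapsed_ms:.3f} ms")
+ ```
+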
+ ## Example Usage
+
+ ```python
+ import torch
+ from huggingface_hub import hf_hub_download
+ from onnxruntime import InferenceSession
+ from transformers import AutoTokenizer
+
+ MAX_SEQUENCE_LENGTH = 512
+
+ # Download the ONNX model from the Hub
+ model = hf_hub_download(
+     repo_id="tryolabs/bert-large-uncased-wwm-squadv2-optimized-f16", filename="model.onnx"
+ )
+
+ # Load the tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("tryolabs/bert-large-uncased-wwm-squadv2-optimized-f16")
+
+ question = "Who worked a little bit harder?"
+ context = "The first little pig was very lazy. He didn't want to work at all and he built his house out of straw. The second little pig worked a little bit harder but he was somewhat lazy too and he built his house out of sticks. Then, they sang and danced and played together the rest of the day."
+
+ # Tokenize the question/context pair into a feed dictionary of numpy arrays
+ inputs = dict(
+     tokenizer(
+         question, context, return_tensors="np", truncation=True, max_length=MAX_SEQUENCE_LENGTH
+     )
+ )
+
+ # Create the inference session (CPU execution)
+ sess = InferenceSession(
+     model, providers=["CPUExecutionProvider"]
+ )
+
+ # Run predictions
+ output = sess.run(None, input_feed=inputs)
+
+ # The model returns start and end logits for the answer span
+ answer_start_scores, answer_end_scores = torch.tensor(output[0]), torch.tensor(
+     output[1]
+ )
+
+ # Post-process predictions: pick the most likely start/end token positions
+ input_ids = inputs["input_ids"].tolist()[0]
+ answer_start = torch.argmax(answer_start_scores)
+ answer_end = torch.argmax(answer_end_scores) + 1
+ answer = tokenizer.convert_tokens_to_string(
+     tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
+ )
+
+ # Output prediction
+ print("Answer", answer)
+ ```
+
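+ The example above runs on CPU via onnxruntime's `CPUExecutionProvider`; on a machine with a CUDA-capable GPU and the `onnxruntime-gpu` package installed, the session can instead be created with `providers=["CUDAExecutionProvider"]` to run inference on the GPU.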