MsBERT: A New Model for the Reconstruction of Lacunae in Hebrew Manuscripts

We present MsBERT (short for: Manuscript BERT), a new dedicated BERT model pretrained from the ground up to handle Hebrew manuscript text. MsBERT substantially outperforms all existing Hebrew BERT models at predicting missing words in fragmentary Hebrew manuscript transcriptions across multiple genres, as well as at distinguishing quoted passages from exegetical elaborations. We provide MsBERT for free download and unrestricted use, along with an interactive, user-friendly website that allows manuscript scholars to leverage the power of MsBERT in their scholarly work of reconstructing fragmentary Hebrew manuscripts.

You can try out the website here: https://msbert.dicta.org.il.

Sample usage:

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('dicta-il/MsBERT')
model = AutoModelForMaskedLM.from_pretrained('dicta-il/MsBERT')
model.eval()

text = '''讜讬爪驻讛讜 讝讛讘 [MASK] 诪专讻讘讜 讗专讙诪谉 专' [MASK] 讗' [MASK] 驻专讜讻转 讛住诪讜讻讛 诇讛 专' 讘讬讘讬 讗' 讝讜 [MASK] 砖讝讛讘讛 讚讜诪讛 诇讗专讙诪谉'''

with torch.no_grad():
    output = model(tokenizer.encode(text, return_tensors='pt'))

# the first [MASK] is token #4 (counting [CLS] as token #0)
top_2 = torch.topk(output.logits[0, 4, :], 2)[1]
print('\n'.join(tokenizer.convert_ids_to_tokens(top_2)))  # should print 讟讛讜专 / 住讙讜专
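The snippet above hardcodes the position of the first [MASK] token. In general you can locate every mask position programmatically by comparing the input ids against the tokenizer's mask id. Below is a minimal sketch of that indexing logic; it uses a dummy id tensor so it runs stand-alone, and the value `104` for the mask id is purely illustrative (with MsBERT, use `input_ids = tokenizer.encode(text, return_tensors='pt')` and `mask_id = tokenizer.mask_token_id`).

```python
import torch

# Illustrative only: a dummy id sequence standing in for tokenizer output.
# 104 is an assumed mask id -- check tokenizer.mask_token_id for the real value.
mask_id = 104
input_ids = torch.tensor([[2, 17, 104, 9, 104, 3]])  # two simulated [MASK] tokens

# Boolean comparison + nonzero() yields the index of every mask position,
# so the logits can be read off at each one instead of a hardcoded index.
mask_positions = (input_ids[0] == mask_id).nonzero(as_tuple=True)[0]
print(mask_positions.tolist())  # prints [2, 4]
```

With real model output you would then loop over `mask_positions` and apply `torch.topk(output.logits[0, pos, :], k)` at each position, as in the sample above.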

Citation

If you use MsBERT in your research, please cite "MsBERT: A New Model for the Reconstruction of Lacunae in Hebrew Manuscripts".

BibTeX:

@inproceedings{msbert-2024,
    title = "MsBERT: A New Model for the Reconstruction of Lacunae in Hebrew Manuscripts",
    author = "Shmidman, Avi and Shmidman, Ometz and Gershuni, Hillel and Koppel, Moshe",
    booktitle = "Proceedings of the 1st Machine Learning for Ancient Language Workshop (ML4AL 2024)",
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
}

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Model size: 184M params (F32, Safetensors)