---
license: cc-by-nc-4.0
pipeline_tag: fill-mask
widget:
- text: >-
The most trusted online bulk <mask> seller in the world -Consistent 90%+
purity -All shipments straight off the brick. 250-500g orders received a
portion of a stamped brick. At 1000g, full stamped bricks are shipped. -We
utilize the best packaging equipment available for the highest level of
stealth and security.
extra_gated_prompt: >-
DarkBERT is available for access upon request. Users may submit their request
using the form below, which includes the **name of the user**, the **user’s
institution**, the **user’s email address that matches the
institution** *(we especially emphasize this part; any non-academic addresses such as
gmail, tutanota, protonmail, etc. are automatically rejected, as they make it difficult
for us to verify your affiliation with the institution)*, and the
**purpose of usage** *(in as much detail as possible)*. By requesting and downloading DarkBERT, the user agrees to
the following: the user acknowledges that the use of this model is restricted
to research and/or academic purposes only. Access to the model will be granted
after the request is manually reviewed. A request may be declined if it does
not sufficiently describe research purposes that follow the ACM Code of Ethics
(https://www.acm.org/code-of-ethics). The information provided by the
requesting user will not be used in any way except for sending the model to
the user and keeping track of request history for DarkBERT. By requesting
the model, the user agrees to our collection of the provided information. This
model shall only be used for non-profit research purposes and in a manner
consistent with fair practice. Do not redistribute this model to others. The
user should indicate the source of this model (found at the bottom of the
page) when using or citing the model in their research or article.
extra_gated_fields:
Full Name: text
Affiliated Institution / Organization / University: text
E-mail (must match affiliation, generic domains such as gmail not allowed): text
Position (e.g. doctoral student, professor, researcher, etc.): text
Purpose of Usage (Please describe the purpose of usage in as much detail as possible): text
Country: text
I have read the conditions and agree to use this model for ethical, non-commercial use ONLY: checkbox
A request cannot be modified once submitted; I understand that requests with incomplete, insufficient, or inaccurate information will be rejected: checkbox
language:
- en
---
# DarkBERT
A BERT-like model pretrained on a Dark Web corpus, as described in "DarkBERT: A Language Model for the Dark Side of the Internet" (ACL 2023).
# Conditions
DarkBERT is available for access upon request. Users may
submit their request using the form below, which includes the **name of the
user**, the **user’s institution**, the **user’s email address that matches the
institution** (we especially emphasize this part; any non-academic addresses such as
gmail, tutanota, protonmail, etc. are automatically rejected, as they make it difficult
for us to verify your affiliation with the institution), and the **purpose of usage**.
By requesting and downloading DarkBERT, the user agrees to the following: the user acknowledges that the use of this
model is restricted to research and/or academic purposes only. Access to the
model will be granted after the request is manually reviewed. A request may be
declined if it does not sufficiently describe research purposes that follow
the ACM Code of Ethics (https://www.acm.org/code-of-ethics). The information
provided by the requesting user will not be used in any way except for sending
the model to the user and keeping track of request history for DarkBERT. By
requesting the model, the user agrees to our collection of the provided
information. This model shall only be used for non-profit research purposes
and in a manner consistent with fair practice. Do not redistribute this model
to others. The user should indicate the source of this model (found at the
bottom of the page) when using or citing the model in their research or
article.
## What is included?
- The preprocessed version of DarkBERT.
- Benchmark datasets in the `benchmark-dataset` directory.
## Sample Usage
```python
>>> from transformers import pipeline
>>> folder_dir = "DarkBERT"  # path to the local directory containing the downloaded model files
>>> unmasker = pipeline('fill-mask', model=folder_dir)
>>> unmasker("RagnarLocker, LockBit, and REvil are types of <mask>.")
[{'score': 0.4952353239059448, 'token': 25346, 'token_str': ' ransomware', 'sequence': 'RagnarLocker, LockBit, and REvil are types of ransomware.'},
{'score': 0.04661545157432556, 'token': 16886, 'token_str': ' malware', 'sequence': 'RagnarLocker, LockBit, and REvil are types of malware.'},
{'score': 0.04217657446861267, 'token': 28811, 'token_str': ' wallets', 'sequence': 'RagnarLocker, LockBit, and REvil are types of wallets.'},
{'score': 0.028982503339648247, 'token': 2196, 'token_str': ' drugs', 'sequence': 'RagnarLocker, LockBit, and REvil are types of drugs.'},
{'score': 0.020001502707600594, 'token': 11344, 'token_str': ' hackers', 'sequence': 'RagnarLocker, LockBit, and REvil are types of hackers.'}]
>>> from transformers import AutoModel, AutoTokenizer
>>> model = AutoModel.from_pretrained(folder_dir)
>>> tokenizer = AutoTokenizer.from_pretrained(folder_dir)
>>> text = "Recent research has suggested that there are clear differences in the language used in the Dark Web compared to that of the Surface Web."
>>> encoded = tokenizer(text, return_tensors="pt")
>>> output = model(**encoded)
>>> output[0].shape
torch.Size([1, 27, 768])
```
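Beyond fill-mask, the last hidden state shown above can be pooled into a fixed-size sentence embedding for downstream tasks. The sketch below is a minimal, hypothetical example of attention-mask-aware mean pooling (not part of the DarkBERT release); it runs on dummy tensors with the same shapes as the example output, so the gated model weights are not required to try it.

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings, ignoring padding positions.

    last_hidden_state: [batch, seq_len, hidden] (e.g. the `output[0]` above)
    attention_mask:    [batch, seq_len] tensor of 0/1 from the tokenizer
    """
    mask = attention_mask.unsqueeze(-1).float()     # [batch, seq_len, 1]
    summed = (last_hidden_state * mask).sum(dim=1)  # sum over non-padding tokens
    counts = mask.sum(dim=1).clamp(min=1e-9)        # avoid division by zero
    return summed / counts                          # [batch, hidden]

# Dummy tensors mimicking the shapes from the example above
hidden = torch.randn(1, 27, 768)
mask = torch.ones(1, 27, dtype=torch.long)
embedding = mean_pool(hidden, mask)
print(embedding.shape)  # torch.Size([1, 768])
```

With the real model, `mean_pool(output[0], encoded["attention_mask"])` would yield one 768-dimensional vector per input sentence.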
## Citation
If you use the DarkBERT model, please cite the following paper:
```
Youngjin Jin, Eugene Jang, Jian Cui, Jin-Woo Chung, Yongjae Lee, and Seungwon Shin. 2023. DarkBERT: A Language Model for the Dark Side of the Internet. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7515–7533, Toronto, Canada. Association for Computational Linguistics.
```