---
license: cc-by-nc-4.0
pipeline_tag: fill-mask
widget:
  - text: >-
      The most trusted online bulk <mask> seller in the world -Consistent 90%+
      purity -All shipments straight off the brick. 250-500g orders received a
      portion of a stamped brick. At 1000g, full stamped bricks are shipped. -We
      utilize the best packaging equipment available for the highest level of
      stealth and security.
extra_gated_prompt: >-
  DarkBERT is available for access upon request. Users may submit their request
  using the form below, which includes the **name of the user**, the **user’s
  institution**, the **user’s email address that matches the institution** *(we
  especially emphasize this part; any non-academic addresses such as gmail,
  tutanota, protonmail, etc. are automatically rejected, as they make it
  difficult for us to verify your affiliation with the institution)*, and the
  **purpose of usage** *(in as much detail as possible)*. By requesting and
  downloading DarkBERT, the user agrees to the following: the user acknowledges
  that the use of this model is restricted to research and/or academic purposes
  only. Access to the model will be granted after the request is manually
  reviewed. A request may be declined if it does not sufficiently describe
  research purposes that follow the ACM Code of Ethics
  (https://www.acm.org/code-of-ethics). The information provided by the
  requesting user will not be used in any way except for sending the model to
  the user and keeping track of request history for DarkBERT. By requesting the
  model, the user agrees to our collection of the provided information. This
  model shall only be used for non-profit research purposes and in a manner
  consistent with fair practice. Do not redistribute this model to others. The
  user should indicate the source of this model (found at the bottom of the
  page) when using or citing the model in their research or article.
extra_gated_fields:
  Full Name: text
  Affiliated Institution / Organization / University: text
  E-mail (must match affiliation, generic domains such as gmail not allowed): text
  Position (e.g. doctoral student, professor, researcher): text
  Purpose of Usage (Please describe the purpose of usage in as much detail as possible): text
  Country: text
  I have read the conditions and agree to use this model for ethical, non-commercial use ONLY: checkbox
  A request cannot be modified once submitted; I understand that requests with incomplete, insufficient, or inaccurate information will be rejected: checkbox
language:
  - en
---

# DarkBERT

A BERT-like language model pretrained on a Dark Web corpus, as described in "DarkBERT: A Language Model for the Dark Side of the Internet" (ACL 2023).

## Conditions

DarkBERT is available for access upon request. Users may submit their request using the form below, which includes the name of the user, the user’s institution, the user’s email address that matches the institution (we especially emphasize this part; any non-academic addresses such as gmail, tutanota, protonmail, etc. are automatically rejected, as they make it difficult for us to verify your affiliation with the institution), and the purpose of usage.

By requesting and downloading DarkBERT, the user agrees to the following: the user acknowledges that the use of this model is restricted to research and/or academic purposes only. Access to the model will be granted after the request is manually reviewed. A request may be declined if it does not sufficiently describe research purposes that follow the ACM Code of Ethics (https://www.acm.org/code-of-ethics). The information provided by the requesting user will not be used in any way except for sending the model to the user and keeping track of request history for DarkBERT. By requesting the model, the user agrees to our collection of the provided information.

This model shall only be used for non-profit research purposes and in a manner consistent with fair practice. Do not redistribute this model to others. The user should indicate the source of this model (found at the bottom of the page) when using or citing the model in their research or article.

## What is included?

- The preprocessed version of DarkBERT.
- Benchmark datasets in the `benchmark-dataset` directory.

## Sample Usage

```python
>>> from transformers import pipeline
>>> folder_dir = "DarkBERT"  # path to the downloaded model directory
>>> unmasker = pipeline('fill-mask', model=folder_dir)
>>> unmasker("RagnarLocker, LockBit, and REvil are types of <mask>.")
[{'score': 0.4952353239059448, 'token': 25346, 'token_str': ' ransomware', 'sequence': 'RagnarLocker, LockBit, and REvil are types of ransomware.'},
 {'score': 0.04661545157432556, 'token': 16886, 'token_str': ' malware', 'sequence': 'RagnarLocker, LockBit, and REvil are types of malware.'},
 {'score': 0.04217657446861267, 'token': 28811, 'token_str': ' wallets', 'sequence': 'RagnarLocker, LockBit, and REvil are types of wallets.'},
 {'score': 0.028982503339648247, 'token': 2196, 'token_str': ' drugs', 'sequence': 'RagnarLocker, LockBit, and REvil are types of drugs.'},
 {'score': 0.020001502707600594, 'token': 11344, 'token_str': ' hackers', 'sequence': 'RagnarLocker, LockBit, and REvil are types of hackers.'}]
```

```python
>>> from transformers import AutoModel, AutoTokenizer
>>> model = AutoModel.from_pretrained(folder_dir)
>>> tokenizer = AutoTokenizer.from_pretrained(folder_dir)
>>> text = "Recent research has suggested that there are clear differences in the language used in the Dark Web compared to that of the Surface Web."
>>> encoded = tokenizer(text, return_tensors="pt")
>>> output = model(**encoded)
>>> output[0].shape
torch.Size([1, 27, 768])
```
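The last hidden state above contains one 768-dimensional vector per token. For downstream uses such as classification or similarity search, a common step is to mean-pool these vectors into a single sentence embedding. A minimal sketch of that pooling step (using a dummy tensor of the same `[1, 27, 768]` shape for illustration, since the model is gated; with the real model, substitute `output[0]` and `encoded["attention_mask"]`):

```python
import torch

# Hidden states from the model: (batch, seq_len, hidden) — here a dummy
# tensor matching the [1, 27, 768] shape shown above.
hidden_states = torch.randn(1, 27, 768)

# Attention mask marking real (non-padding) tokens; all ones here.
attention_mask = torch.ones(1, 27)

# Mean-pool over the sequence dimension, ignoring padded positions.
mask = attention_mask.unsqueeze(-1)         # (1, 27, 1)
summed = (hidden_states * mask).sum(dim=1)  # (1, 768)
counts = mask.sum(dim=1).clamp(min=1)       # (1, 1), avoids divide-by-zero
sentence_embedding = summed / counts        # (1, 768)

print(sentence_embedding.shape)  # torch.Size([1, 768])
```

Masked mean pooling over the last hidden state is a widely used default for RoBERTa-family encoders; alternatives such as taking the first (`<s>`) token's vector work too, depending on the downstream task.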

## Citation

If you use the DarkBERT model, please cite the following paper:

Youngjin Jin, Eugene Jang, Jian Cui, Jin-Woo Chung, Yongjae Lee, and Seungwon Shin. 2023. DarkBERT: A Language Model for the Dark Side of the Internet. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7515–7533, Toronto, Canada. Association for Computational Linguistics.