---
license: cc-by-nc-4.0
pipeline_tag: fill-mask
widget:
- text: >-
The most trusted online bulk <mask> seller in the world -Consistent 90%+
purity -All shipments straight off the brick. 250-500g orders received a
portion of a stamped brick. At 1000g, full stamped bricks are shipped. -We
utilize the best packaging equipment available for the highest level of
stealth and security.
extra_gated_prompt: >-
DarkBERT is available for access upon request. Users may submit their request
using the form below, which includes the **name of the user**, the **user’s
institution**, the **user’s email address that matches the
institution** *(we especially emphasize this part; any non-academic addresses such as
gmail, tutanota, protonmail, etc. are automatically rejected, as they make it difficult
for us to verify your affiliation with the institution)*, and the
**purpose of usage** *(in as much detail as possible)*. By requesting and downloading DarkBERT, the user agrees to
the following: the user acknowledges that the use of this model is restricted
to research and/or academic purposes only. Access to the model will be granted
after the request is manually reviewed. A request may be declined if it does
not sufficiently describe research purposes that follow the ACM Code of Ethics
(https://www.acm.org/code-of-ethics). The information provided by the
requesting user will not be used in any way except for sending the model to
the user and keeping track of request history for DarkBERT. By requesting
the model, the user agrees to our collection of the provided information. This
model shall only be used for non-profit research purposes and in a manner
consistent with fair practice. Do not redistribute this model to others. The
user should indicate the source of this model (found at the bottom of the
page) when using or citing the model in their research or article.
extra_gated_fields:
Full Name: text
Affiliated Institution / Organization / University: text
E-mail (must match affiliation, generic domains such as gmail not allowed): text
Position (e.g. doctoral student, professor, researcher, etc.): text
Purpose of Usage (Please describe the purpose of usage in as much detail as possible): text
Country: text
I have read the conditions and agree to use this model for ethical, non-commercial use ONLY: checkbox
A request cannot be modified once submitted; I understand that requests with incomplete, insufficient, or inaccurate information will be rejected: checkbox
language:
- en
---
# DarkBERT
A BERT-like model pretrained on a Dark Web corpus, as described in "DarkBERT: A Language Model for the Dark Side of the Internet" (ACL 2023).
# Conditions
DarkBERT is available for access upon request. Users may
submit their request using the form below, which includes the **name of the
user**, the **user’s institution**, the **user’s email address that matches the
institution** (we especially emphasize this part; any non-academic addresses such as
gmail, tutanota, protonmail, etc. are automatically rejected, as they make it difficult
for us to verify your affiliation with the institution), and the **purpose of usage**.
By requesting and downloading DarkBERT, the user agrees to the following: the user acknowledges that the use of this
model is restricted to research and/or academic purposes only. Access to the
model will be granted after the request is manually reviewed. A request may be
declined if it does not sufficiently describe research purposes that follow
the ACM Code of Ethics (https://www.acm.org/code-of-ethics). The information
provided by the requesting user will not be used in any way except for sending
the model to the user and keeping track of request history for DarkBERT. By
requesting the model, the user agrees to our collection of the provided
information. This model shall only be used for non-profit research purposes
and in a manner consistent with fair practice. Do not redistribute this model
to others. The user should indicate the source of this model (found at the
bottom of the page) when using or citing the model in their research or
article.
## What is included?
- The preprocessed version of DarkBERT.
- Benchmark datasets in the `benchmark-dataset` directory.
## Sample Usage
```python
>>> from transformers import pipeline
>>> folder_dir = "DarkBERT"  # path to the local directory containing the downloaded model files
>>> unmasker = pipeline('fill-mask', model=folder_dir)
>>> unmasker("RagnarLocker, LockBit, and REvil are types of <mask>.")
[{'score': 0.4952353239059448, 'token': 25346, 'token_str': ' ransomware', 'sequence': 'RagnarLocker, LockBit, and REvil are types of ransomware.'},
{'score': 0.04661545157432556, 'token': 16886, 'token_str': ' malware', 'sequence': 'RagnarLocker, LockBit, and REvil are types of malware.'},
{'score': 0.04217657446861267, 'token': 28811, 'token_str': ' wallets', 'sequence': 'RagnarLocker, LockBit, and REvil are types of wallets.'},
{'score': 0.028982503339648247, 'token': 2196, 'token_str': ' drugs', 'sequence': 'RagnarLocker, LockBit, and REvil are types of drugs.'},
{'score': 0.020001502707600594, 'token': 11344, 'token_str': ' hackers', 'sequence': 'RagnarLocker, LockBit, and REvil are types of hackers.'}]
>>> from transformers import AutoModel, AutoTokenizer
>>> model = AutoModel.from_pretrained(folder_dir)
>>> tokenizer = AutoTokenizer.from_pretrained(folder_dir)
>>> text = "Recent research has suggested that there are clear differences in the language used in the Dark Web compared to that of the Surface Web."
>>> encoded = tokenizer(text, return_tensors="pt")
>>> output = model(**encoded)
>>> output[0].shape
torch.Size([1, 27, 768])
```
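Beyond fill-mask, the last hidden state shown above can be pooled into a fixed-size sentence embedding for downstream tasks. The sketch below is a minimal, hypothetical example of attention-mask-aware mean pooling (not part of the DarkBERT release); it runs on dummy tensors with the same shapes as the example output, so the gated model weights are not required to try it.

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings, ignoring padding positions.

    last_hidden_state: [batch, seq_len, hidden] (e.g. the `output[0]` above)
    attention_mask:    [batch, seq_len] tensor of 0/1 from the tokenizer
    """
    mask = attention_mask.unsqueeze(-1).float()     # [batch, seq_len, 1]
    summed = (last_hidden_state * mask).sum(dim=1)  # sum over non-padding tokens
    counts = mask.sum(dim=1).clamp(min=1e-9)        # avoid division by zero
    return summed / counts                          # [batch, hidden]

# Dummy tensors mimicking the shapes from the example above
hidden = torch.randn(1, 27, 768)
mask = torch.ones(1, 27, dtype=torch.long)
embedding = mean_pool(hidden, mask)
print(embedding.shape)  # torch.Size([1, 768])
```

With the real model, `mean_pool(output[0], encoded["attention_mask"])` would yield one 768-dimensional vector per input sentence.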
## Citation
If you use the DarkBERT model, please cite the following paper:
```
Youngjin Jin, Eugene Jang, Jian Cui, Jin-Woo Chung, Yongjae Lee, and Seungwon Shin. 2023. DarkBERT: A Language Model for the Dark Side of the Internet. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7515–7533, Toronto, Canada. Association for Computational Linguistics.
```