Fascinating work!

#1 - opened by tomaarsen (HF staff)

Hello!

Sentence Transformers maintainer here - this is fascinating work! The natural chemical compounds and their notations go well beyond what I'm familiar with, but it looks like the Spearman Cosine similarity is very high, and the t-SNE embeddings look quite nice!

I see that you have some plans to extend this further in the future. I wanted to point you to a potential direction for improvement: the tokenizer.
Each tokenizer tokenizes text differently, and the one that you're using (from MiniLM-L6-H384-uncased) is not aware of the natural compound notations. As a result, it uses multiple tokens to denote something that could best be represented with just one token, e.g. [C]. See an example here:

[Screenshot from https://huggingface.co/spaces/Xenova/the-tokenizer-playground: the compound notation is split into many sub-word tokens]
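To make that concrete, here is roughly what the current tokenizer does; the checkpoint name nreimers/MiniLM-L6-H384-uncased is my assumption for the MiniLM model above, and the SELFIES-style string is just a toy example:

```python
from transformers import AutoTokenizer

# Assumption: this is the uncased MiniLM-L6-H384 checkpoint referenced in the discussion.
tokenizer = AutoTokenizer.from_pretrained("nreimers/MiniLM-L6-H384-uncased")

# A toy SELFIES-style string: each bracketed symbol is one chemical unit,
# but the general-purpose tokenizer splits it into several sub-word tokens.
print(tokenizer.tokenize("[C][=C][O]"))
# likely something like: ['[', 'c', ']', '[', '=', 'c', ']', '[', 'o', ']']
```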

In short: it might make sense to 1) take an existing tokenizer trained on chemical compound notations or 2) train one yourself.
Do note that you'd likely not be able to use a pretrained model with your custom tokenizer, so you would have to train from random weights. With a much smaller tokenizer, you'll also get higher throughput/faster training, I suspect.
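For option 2, here is a minimal sketch of what a custom tokenizer could look like with the tokenizers library, assuming a word-level model where each bracketed symbol becomes exactly one token; the corpus strings are placeholders, not real training data:

```python
from tokenizers import Tokenizer, Regex, models, pre_tokenizers, trainers

# Placeholder mini-corpus of SELFIES strings; in practice this would be the training set.
corpus = ["[C][C][O]", "[C][=C][C][=C][C][=C][Ring1][=Branch1]"]

# Word-level model: every bracketed symbol (e.g. "[C]", "[Ring1]") maps to exactly one token.
tokenizer = Tokenizer(models.WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Split(Regex(r"\[[^\]]*\]"), behavior="isolated")

trainer = trainers.WordLevelTrainer(special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"])
tokenizer.train_from_iterator(corpus, trainer)

print(tokenizer.encode("[C][C][O]").tokens)  # ['[C]', '[C]', '[O]']
```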

Anyways, you're free to go this route or continue finetuning "ready to go" embedding models like MiniLM-L6-H384-uncased: clearly it's also working well.
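For completeness, the "ready to go" route is roughly the standard Sentence Transformers fine-tuning loop; the checkpoint, loss choice, and toy pairs/scores below are my assumptions rather than your actual setup:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Assumptions: checkpoint name, loss choice, and the toy pairs/similarity scores below.
model = SentenceTransformer("nreimers/MiniLM-L6-H384-uncased")

train_examples = [
    InputExample(texts=["[C][C][O]", "[C][O]"], label=0.8),   # placeholder similar pair
    InputExample(texts=["[C][C][O]", "[N][=O]"], label=0.1),  # placeholder dissimilar pair
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model)  # pairs with scores, matching a Spearman-cosine evaluation

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```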

  • Tom Aarsen

Hello!

Thank you so much for your feedback, I appreciate your recommendations a lot. Currently, I am trying to either adapt zpn's SELFIES tokenizer or train a custom tokenizer for this, since in chemistry molecules are usually represented with SMILES, which is known to be a bit messy to train a model on - SELFIES seems better due to its consistency. I plan to start testing them shortly, and will then proceed with training a base model from randomized weights along with a reduced vocabulary size.
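As a rough sketch of what I have in mind (the SMILES-to-SELFIES conversion uses the selfies package; the model dimensions and the placeholder vocabulary size are just assumptions for now):

```python
import selfies as sf
from transformers import BertConfig, BertForMaskedLM

# SMILES -> SELFIES using the selfies package (ethanol as a toy example).
print(sf.encoder("CCO"))  # expected: [C][C][O]

# A small BERT-style encoder initialised from scratch (random weights, no pretrained checkpoint).
# The dimensions roughly mirror MiniLM-L6-H384 and are assumptions; vocab_size would come from
# the custom SELFIES tokenizer once it is trained.
config = BertConfig(
    vocab_size=1024,  # placeholder for the custom tokenizer's (reduced) vocabulary size
    hidden_size=384,
    num_hidden_layers=6,
    num_attention_heads=12,
    intermediate_size=1536,
)
model = BertForMaskedLM(config)
```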

Thanks again for taking the time to engage with my work and for pointing me in this direction. I am relatively new to ML/AI, so I am excited to see the results!

  • G Bayu

Excellent! I think you're headed in the right direction then!
Your work reminds me somewhat of the Protein Similarity and Matryoshka Embeddings blogpost by @monsoon-nlp from a few months ago, except that it deals with proteins instead of chemical compounds. He also used Matryoshka Embeddings (blogpost, documentation), in case that strikes your fancy. In short: Matryoshka embeddings can be truncated on the fly with only a minor loss in performance, allowing for faster retrieval/clustering. This can be quite nice when your use case deals with a lot of data.
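A minimal sketch of how that could look with Sentence Transformers, assuming a MiniLM-style checkpoint and a cosine-similarity base loss (both placeholders, not your actual setup):

```python
from sentence_transformers import SentenceTransformer, losses

# Assumption: a MiniLM-style checkpoint and a cosine-similarity base loss.
model = SentenceTransformer("nreimers/MiniLM-L6-H384-uncased")
base_loss = losses.CosineSimilarityLoss(model)

# MatryoshkaLoss trains the same objective at several embedding sizes,
# so truncated vectors remain useful.
train_loss = losses.MatryoshkaLoss(model, base_loss, matryoshka_dims=[384, 256, 128, 64])

# After training, recent sentence-transformers versions can truncate on the fly:
small_model = SentenceTransformer("path-or-id-of-the-finetuned-model", truncate_dim=64)
embeddings = small_model.encode(["[C][C][O]", "[C][O]"])  # shape: (2, 64)
```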

  • Tom Aarsen

I didn't know about Matryoshka, but after reading both blogs a bit, I agree it would be nice for dealing with large chemical databases. I will read those blogs again and try experimenting with it once training the base model with the custom tokenizer looks good enough. Again, thank you!

  • G Bayu
gbyuvd changed discussion status to closed
