---
language:
- zh
tags:
- search
---

# Cross Language Search

## Search Classical CN with modern ZH

* In some cases, Classical Chinese feels like another language. I even trained two translation models to prove the point.
* That's why, when we want to sound savvy, we quote our ancestors. It's just like Westerners quoting Latin or Shakespeare; the difference is that we have a much bigger pool to choose from.
* This model helps you **find** text within **ancient Chinese** literature, but you can **search with modern Chinese**.

## Search the ancient with the modern (博古搜今)

```python
from unpackai.interp import CosineSearch
from sentence_transformers import SentenceTransformer
import pandas as pd
import numpy as np

TAG = "raynardj/xlsearch-cross-lang-search-zh-vs-classicical-cn"
encoder = SentenceTransformer(TAG)

# all_lines is a list of all your sentences.
# It can be a single book split into sentences, or many, many books.
all_lines = ["句子1","句子2",...]
vec = encoder.encode(all_lines, batch_size=32, show_progress_bar=True)

# cosine-distance searcher built over the sentence vectors
cosine = CosineSearch(vec)

def search(text):
    enc = encoder.encode(text)  # encode the search key
    order = cosine(enc)  # ranked indices into all_lines, used below to pick the top 5
    sentence_df = pd.DataFrame({"sentence": np.array(all_lines)[order[:5]]})
    return sentence_df
```

After splitting the *Records of the Grand Historian* (史记) into sentences, a search for "He is a very generous person" gives the following results:

```python
>>> search("他是一个很慷慨的人")
```

```
   sentence
0  季布者,楚人也。为气任侠,有名於楚。
1  董仲舒为人廉直。
2  大将军为人仁善退让,以和柔自媚於上,然天下未有称也。
3  勃为人木彊敦厚,高帝以为可属大事。
4  石奢者,楚昭王相也。坚直廉正,无所阿避。
```
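
The ranking step can also be approximated with plain NumPy. The block below is a minimal sketch, assuming that `CosineSearch` orders the stored vectors by cosine similarity to the query, which is how `search` uses its result. The function `cosine_rank` and the toy vectors are hypothetical and are not part of `unpackai` or of this model.

```python
import numpy as np

def cosine_rank(base: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Return the row indices of `base`, most similar to `query` first."""
    base = np.asarray(base, dtype=np.float64)
    query = np.asarray(query, dtype=np.float64)
    # Normalise rows and query so the dot product equals cosine similarity.
    # Assumption: no vector is all zeros (that would divide by zero).
    base_unit = base / np.linalg.norm(base, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    similarity = base_unit @ query_unit
    # Negate so argsort puts the highest similarity first
    return np.argsort(-similarity)

# Toy 2-D vectors standing in for sentence embeddings
toy_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_order = cosine_rank(toy_vectors, np.array([2.0, 0.1]))
```

Under that assumption, `cosine_rank(vec, enc)` could stand in for `cosine(enc)` inside `search`, with `vec` and `enc` being the embeddings produced by `encoder.encode`.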