Is it possible to tell a stemmer to ignore words of a specific language?

Question

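I am using the Cistem stemmer for German.

The documents I am stemming also contain English words.

So I would like to tell the German stemmer to ignore English words, and likewise tell my English stemmer to ignore German words.

Example:

My German text contains the English word "case". The German stemmer stems it to "cas", but it should stay "case". In other words, the English word "case" should be ignored.

Is this possible?

My code: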

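from nltk.stem.cistem import Cistem

stemmer = Cistem()
sl = []
for line in o:  # o: the opened input file (any iterable of text lines)
    sp = line.split()
    sl.append(sp)

# Note: segment() returns a (stem, removed_suffix) tuple; stem() returns only the stem
st = [[stemmer.segment(s) for s in l] for l in sl]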

Answer 1

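A good approach is to compare how often a given word occurs in a corpus of English documents with how often the same word occurs in a corpus of German documents.
For example, if a word w1 occurs more frequently in the German Wikipedia than in the English Wikipedia, it is probably a German word.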
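To make the idea concrete, here is a minimal sketch of the frequency comparison. The two one-sentence "corpora" are made-up stand-ins for large reference corpora such as the Wikipedia dumps, and the helper names `relative_frequencies` and `guess_language` are only illustrative.

```python
from collections import Counter


def relative_frequencies(text):
    """Map each lower-cased token to its share of all tokens in `text`."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


# Toy corpora (assumption): real use would need far larger reference texts.
corpus_en = "the case is closed and the case file is stored in the archive"
corpus_de = "der fall ist geschlossen und die akte liegt in dem archiv"

freq_en = relative_frequencies(corpus_en)
freq_de = relative_frequencies(corpus_de)


def guess_language(word):
    """Return 'EN' or 'DE' for the corpus where the word is relatively more frequent."""
    w = word.lower()
    p_en = freq_en.get(w, 0.0)
    p_de = freq_de.get(w, 0.0)
    if p_en == p_de:
        return "unknown"  # unseen in both corpora, or an exact tie
    return "EN" if p_en > p_de else "DE"


print(guess_language("case"))
print(guess_language("Akte"))
```

With corpora this small, words that exist in both languages (such as "in") are decided by tiny differences in relative frequency, so the result is only meaningful with much larger corpora.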

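Instead of downloading, parsing and counting word frequencies for both versions of Wikipedia, a more direct route is to use pretrained models, which include an indication of the word frequencies they encountered during training.

We can use the English and German models in spaCy: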

import spacy
nlpDE = spacy.load("de_core_news_md")
nlpEN = spacy.load('en_core_web_md')

# some test sentences in both languages:
sl = [ "Python is an interpreted, high-level, general-purpose programming language.",
"Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace.", 
"Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.", 
"Python ist eine universelle, üblicherweise interpretierte höhere Programmiersprache.",
" Wegen ihrer klaren und übersichtlichen Syntax gilt Python als einfach zu erlernen.",
"Python unterstützt mehrere Programmierparadigmen, z. B. die objektorientierte, die aspektorientierte und die funktionale Programmierung"]

#let's randomly shuffle this list of test sentences:
from random import shuffle
shuffle(sl)
s = " ".join(sl)

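# Our function, which compares the log-probabilities that the two
# vocabularies assign to a word.
# Note: depending on the spaCy version, Lexeme.prob may require the
# spacy-lookups-data package to return meaningful values.
def compare(word):
    prob_en = nlpEN.vocab[word].prob
    prob_de = nlpDE.vocab[word].prob
    if prob_en > prob_de:
        return 'EN'
    else:
        return 'DE'


doc = nlpEN(s)
print([(t, compare(t.text)) for t in doc if not t.is_punct])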

And the results of this approach on the sample data:

[(Python, 'EN'), (is, 'EN'), (an, 'DE'), (interpreted, 'EN'), (high, 'EN'), (level, 'EN'), 
(general, 'EN'), (purpose, 'EN'), (programming, 'EN'), (language, 'EN'), (Python, 'EN'), 
(unterstützt, 'DE'), (mehrere, 'DE'), (Programmierparadigmen, 'DE'), (z., 'DE'), (B., 'DE'), (die, 'DE'), 
(objektorientierte, 'DE'), (die, 'DE'), (aspektorientierte, 'DE'), (und, 'DE'), (die, 'DE'), (funktionale, 'DE'),
 (Programmierung, 'DE'), (Created, 'EN'), (by, 'EN'), (Guido, 'DE'), (van, 'DE'), (Rossum, 'DE'), (and, 'EN'),
 (first, 'EN'), (released, 'EN'), (in, 'DE'), (1991, 'EN'), (Python, 'EN'), ('s, 'EN'), (design, 'EN'), 
(philosophy, 'EN'), (emphasizes, 'EN'), (code, 'EN'), (readability, 'EN'), (with, 'EN'), (its, 'EN'), 
(notable, 'EN'), (use, 'EN'), (of, 'EN'), (significant, 'EN'), (whitespace, 'EN'), ( , 'EN'), (Wegen, 'DE'), 
(ihrer, 'DE'), (klaren, 'DE'), (und, 'DE'), (übersichtlichen, 'DE'), (Syntax, 'DE'), (gilt, 'DE'), (Python, 'EN'), 
(als, 'DE'), (einfach, 'DE'), (zu, 'DE'), (erlernen, 'DE'), (Python, 'EN'), (ist, 'DE'), (eine, 'DE'), 
(universelle, 'DE'), (üblicherweise, 'DE'), (interpretierte, 'DE'), (höhere, 'DE'), (Programmiersprache, 'DE'),
 (Its, 'EN'), (language, 'EN'), (constructs, 'EN'), (and, 'EN'), (object, 'EN'), (oriented, 'EN'), (approach, 'EN'), 
(aim, 'EN'), (to, 'EN'), (help, 'EN'), (programmers, 'EN'), (write, 'EN'), (clear, 'EN'), (logical, 'EN'),
 (code, 'EN'), (for, 'EN'), (small, 'EN'), (and, 'EN'), (large, 'EN'), (scale, 'EN'), (projects, 'EN')]
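Once each token has a language label, the label can decide which stemmer (if any) is applied. The following is a rough sketch of that routing step, not a tested pipeline: `stem_by_language` is a hypothetical helper, and `toy_classify` and `toy_german_stem` are crude stand-ins so the snippet runs without any models. In practice you would pass a classifier like `compare` from above, plus `Cistem().stem` for "DE" and an English stemmer for "EN".

```python
def stem_by_language(tokens, classify, stemmers):
    """Stem each token with the stemmer registered for its language label.

    Tokens whose label has no registered stemmer are returned unchanged.
    """
    result = []
    for token in tokens:
        stem = stemmers.get(classify(token))
        result.append(stem(token) if stem else token)
    return result


# Stand-in classifier (assumption): replace with a real language guess.
def toy_classify(word):
    return "EN" if word.lower() in {"case", "cases"} else "DE"


# Stand-in German stemmer (assumption): only strips a trailing "e".
def toy_german_stem(word):
    return word[:-1] if word.endswith("e") else word


tokens = ["Der", "case", "wurde", "geschlossen"]

# Only a German stemmer is registered, so English tokens are left alone.
print(stem_by_language(tokens, toy_classify, {"DE": toy_german_stem}))
# ['Der', 'case', 'wurd', 'geschlossen']
```

Registering only the German stemmer gives the "ignore English words" behaviour asked for; adding an "EN" entry to the dictionary would stem both languages with their own stemmer in a single pass.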

Answer 2

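A straightforward idea is to check, for each token, whether it belongs to a German vocabulary or to an English one. Based on that, you can decide whether to apply the stemmer to the token. For this you need a dictionary for each language. You could perhaps use the word lists in nltk, or check the number of occurrences or the frequency in a corpus, so that you can tell whether a word is used in the target language or, with some probability, is a loanword.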
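A small sketch of the vocabulary check, assuming you already have word lists for both languages. The two sets below are made-up examples, `english_only` and `stem_except` are hypothetical helper names, and `toy_stem` merely stands in for a real stemmer such as `Cistem().stem`.

```python
def english_only(english_vocab, german_vocab):
    """Words found in the English list but not in the German one."""
    return {w.lower() for w in english_vocab} - {w.lower() for w in german_vocab}


def stem_except(tokens, stem, ignore):
    """Apply `stem` to every token except those listed in `ignore`."""
    return [t if t.lower() in ignore else stem(t) for t in tokens]


# Made-up word lists (assumption): real ones would come from a dictionary or corpus.
english_vocab = {"case", "in", "hand"}
german_vocab = {"in", "hand", "fälle", "wurde"}

ignore = english_only(english_vocab, german_vocab)  # only "case" remains


# Stand-in stemmer (assumption): only strips a trailing "e".
def toy_stem(word):
    return word[:-1] if word.endswith("e") else word


print(stem_except(["Fälle", "case", "wurde"], toy_stem, ignore))
# ['Fäll', 'case', 'wurd']
```

Subtracting the German vocabulary keeps words that exist in both languages (here "in" and "hand") out of the ignore list, so they are still stemmed as German.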

python

nltk