Is there a function that allows me to determine whether a text talks about a predefined topic?

I want to write topic lists and check whether a review discusses one of the defined topics. It is important that I write the topic lists myself rather than use topic modelling to find possible topics.

I thought this was called dictionary analysis, but I couldn't find anything about it.

I have a DataFrame containing Amazon reviews:

import pandas as pd

df = pd.DataFrame({'User': ['UserA', 'UserB', 'UserC'],
                   'text': ['Example text where he talks about a phone and his charging cable',
                            'Example text where he talks about a car with some wheels',
                            'Example text where he talks about a plane']})

Now I want to define the topic lists:

phone = ['phone', 'cable', 'charge', 'charging', 'call', 'telephone']
car = ['car', 'wheel', 'steering', 'seat', 'roof', 'other car related words']
plane = ['plane', 'wings', 'turbine', 'fly']

For the first review, the result of the method should be 3/12 for the "phone" topic (3 words from the topic list out of 12 words in the review) and 0 for the other two topics.

For the second review the result should be 2/11 for the "car" topic and 0 for the others, and for the third review 1/8 for the "plane" topic and 0 for the others.

The result lists:

phone_results = [0.25, 0, 0]
car_results = [0, 0.18181818182, 0]
plane_results = [0, 0, 0.125]

Of course I would only use lowercased, stemmed review words, which makes the topics easier to define, but that is not the focus here.
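To make the expected scoring concrete, here is a minimal sketch of the calculation described above, assuming plain whitespace tokenization and no stemming (which is why the sketch lists 'wheels' instead of 'wheel'); the names topics and topic_score are only illustrative:

import pandas as pd

df = pd.DataFrame({'User': ['UserA', 'UserB', 'UserC'],
                   'text': ['Example text where he talks about a phone and his charging cable',
                            'Example text where he talks about a car with some wheels',
                            'Example text where he talks about a plane']})

topics = {
    'phone': ['phone', 'cable', 'charge', 'charging', 'call', 'telephone'],
    'car': ['car', 'wheels', 'steering', 'seat', 'roof'],
    'plane': ['plane', 'wings', 'turbine', 'fly'],
}

def topic_score(text, topic_words):
    # share of review words that appear in the topic list
    words = text.lower().split()
    return sum(word in topic_words for word in words) / len(words)

for name, topic_words in topics.items():
    df[name] = df['text'].apply(lambda t: topic_score(t, topic_words))

print(df[['phone', 'car', 'plane']])
# phone: [0.25, 0, 0], car: [0, 0.1818..., 0], plane: [0, 0, 0.125]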

Is there an existing method for this, or do I have to write one myself? Thanks in advance!

You could use a pretrained intent classification model from RASA-NLU.

NLP can go very deep, but for a ratio of known words you can probably do something more basic. For example:

word_map = {
    'phone': ['phone', 'cable', 'charge', 'charging', 'call', 'telephone'],
    'car': ['car', 'wheels', 'steering', 'seat', 'roof', 'other car related words'],
    'plane': ['plane', 'wings', 'turbine', 'fly']
}
sentences = [
     'Example text where he talks about a phone and his charging cable',
     'Example text where he talks about a car with some wheels',
     'Example text where he talks about a plane'
]

for sentence in sentences:
    print('==== %s ====' % sentence)
    words = sentence.split()
    for prefix in word_map:
        match_score = 0
        for word in words:
            if word in word_map[prefix]:
                match_score += 1
        print('Prefix: %s | MatchScore: %.2f' % (prefix, match_score / len(words)))

You'll get something like this:

==== Example text where he talks about a phone and his charging cable ====
Prefix: phone | MatchScore: 0.25
Prefix: car | MatchScore: 0.00
Prefix: plane | MatchScore: 0.00
==== Example text where he talks about a car with some wheels ====
Prefix: phone | MatchScore: 0.00
Prefix: car | MatchScore: 0.18
Prefix: plane | MatchScore: 0.00
==== Example text where he talks about a plane ====
Prefix: phone | MatchScore: 0.00
Prefix: car | MatchScore: 0.00
Prefix: plane | MatchScore: 0.12

This is of course a basic example. Words don't always end with a space; it can be a comma, a full stop and so on, so you need to account for that. There are also tenses: I can "phone" someone, or I "phoned" or was "phoning" them, but we also don't want words like "phonetic" sneaking in. So it gets quite tricky in the edge cases, but for a very basic working(!) example I would see whether you can do it in Python without a natural-language library, and then test it against your use case; if it falls short, you can build up from there.
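As a rough sketch of how those edge cases might be handled, you could anchor the keyword stem with word boundaries and allow only a small set of endings, so that "phoned" and "phoning" match but "phonetic" does not (the stem and suffix list here are just assumptions, not a complete solution):

import re

# assumed stem 'phon' plus a few endings, anchored with \b so 'phonetic' is rejected
pattern = re.compile(r'\bphon(?:e|es|ed|ing)\b', re.IGNORECASE)

for text in ['I will phone you.', 'She phoned him,', 'He was phoning all day', 'A phonetic transcription']:
    print(text, '->', bool(pattern.search(text)))
# -> True, True, True, False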

Besides that, you could also look at something like Rasa NLU or nltk.
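If you go the nltk route, a minimal sketch could be to stem both the review and the topic list before counting, which also takes care of "wheel" vs. "wheels" (this assumes the punkt tokenizer data has been downloaded):

import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# nltk.download('punkt')  # needed once for word_tokenize

stemmer = PorterStemmer()
car = ['car', 'wheel', 'steering', 'seat', 'roof']
car_stems = {stemmer.stem(w) for w in car}

text = 'Example text where he talks about a car with some wheels'
tokens = [stemmer.stem(t) for t in word_tokenize(text.lower())]

print(sum(t in car_stems for t in tokens) / len(tokens))  # 2/11, roughly 0.18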

I thought I'd give back to the community and post the code I ended up with, based on @David542's answer:

import pandas as pd
import numpy as np 
import re

i = 0
total_length = len(sentences)
print("Process started:")
s = 1
# iterate through the reviews
for sentence in sentences:
    # split a review text into single words
    words = sentence.split()
    # iterate through the topics, each one is a column in the table
    for column in dictio:
        # save the topic words in the pattern list
        pattern = list(dictio[column])
        # remove nan values
        clean_pattern = [x for x in pattern if str(x) != 'nan']
        match_score = 0
        # iterate through each entry of the topic list
        for search_words in clean_pattern:
            previous_word = ""
            # iterate through each word of the review
            for word in words:
                # when two consecutive words are searched for, the first branch is used
                if len(search_words.split()) > 1:
                    pattern2 = r"( " + re.escape(search_words.split()[0]) + r"([a-z]+|) " + re.escape(search_words.split()[1]) + r"([a-z]+|))"
                    # the spaces are important so that e.g. "bedtime" does not match "time"
                    if re.search(pattern2, " " + previous_word + " " + word, re.IGNORECASE):
                        match_score += 1
                        #print(pattern2, " match ", previous_word, " ", word)

                if len(search_words.split()) == 1:
                    pattern1 = r" " + re.escape(search_words) + r"([a-z]+|)"
                    if re.search(pattern1, " " + word, re.IGNORECASE):
                        match_score += 1
                        #print(pattern1, " match ", word)

                # save the word to be used as the previous word in the next iteration
                previous_word = word

        # a topic counts as mentioned as soon as at least one of its words matched
        result = 0
        if match_score > 0:
            result = 1
        df.at[i, column] = int(result)
    i += 1
    # status bar: print progress in 5% steps (integer check instead of the unreliable float modulo)
    percent = round(100 * s / total_length, 4)
    if percent % 5 == 0:
        print("Status: " + str(percent) + "%")
    s += 1

The texts I want to analyse are in the list of strings called sentences. The topics I want to look for in my texts are in the DataFrame dictio: each column is headed by the topic name and contains several rows of search terms. The analysis takes one or two consecutive words and looks in each string for those words with variable endings. If the regex matches, the original DataFrame df gets a "1" in the corresponding row of the column assigned to that topic. Unlike what I specified in my question, I do not compute word percentages, because I found they did not add value to my analysis. Punctuation should be removed from the strings, but no stemming is needed. If you have specific questions, please leave a comment and I will edit this code or reply to your comment.
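For completeness, the snippet above relies on three objects that are not shown: the review DataFrame df, the list sentences taken from its text column, and the topic DataFrame dictio. A minimal setup under those assumptions (the concrete topic words here are only examples) could look like this:

import numpy as np
import pandas as pd

# the reviews to analyse
df = pd.DataFrame({'User': ['UserA', 'UserB', 'UserC'],
                   'text': ['Example text where he talks about a phone and his charging cable',
                            'Example text where he talks about a car with some wheels',
                            'Example text where he talks about a plane']})
sentences = df['text'].tolist()

# one column per topic; the columns must have equal length, so shorter topic
# lists are padded with np.nan, which the loop above filters out again
dictio = pd.DataFrame({
    'phone': ['phone', 'charging cable', 'call', 'telephone'],
    'car': ['car', 'wheel', 'steering', np.nan],
    'plane': ['plane', 'wing', np.nan, np.nan],
})

# result columns, one per topic, filled in by the loop above
for topic in dictio.columns:
    df[topic] = 0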