I need to map words from a text dataset and score them quickly against another dataset

Question

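I have a text dataset of 2000 rows, each containing nearly a page of English words. I wrote a function that splits each row into words, scores each word against another dataset, and then takes the mean score for the row. For example: 'Im a programmer' >> [15,20,25] >> mean = 20. The problem is that the code is too slow: running it over the whole dataset takes nearly 30 minutes. Is there a way to make it faster? Here is what I tried: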

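import pandas as pd

def get_score(text):
    # one zero-valued entry per word, indexed by the word itself
    word_arr = pd.Series(0, index=text.split(), dtype='float64', name='Count')
    # left-join on the word index, then average the 'count' column
    # (this assumes `scoring` is a Series named 'count'); unmatched words score 0
    return pd.merge(word_arr,
                    scoring,
                    how="left",
                    left_index=True,
                    right_index=True,)['count'].fillna(0).mean()

# one merge per row, so 2000 merges in total
df['string'].apply(get_score)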

Here word_arr is a pandas Series of zeros indexed by the words of the row, and scoring is a large Series of scores indexed by word.

Answer 1

Since I can't test this on your data, I can only suggest an approach that uses simpler structures and methods:

import statistics
import pandas as pd

# sample data
df = pd.DataFrame(
    {'string': ['first page of text word_wo_score', 'second page of text']}
)
df
#                               string
# 0   first page of text word_wo_score
# 1                second page of text

scoring = pd.Series([0,2,3,4,1], index=['first', 'page', 'of', 'text', 'second'])

# convert scoring values to a dictionary
scoring_dict = scoring.to_dict()
scoring_dict
# {'first': 0, 'page': 2, 'of': 3, 'text': 4, 'second': 1}

# convert your column to list
txt = df['string'].to_list()
txt
# ['first page of text word_wo_score', 'second page of text']

# map each word in each sublist to dict and take the mean of each new sublist
# if a word does not exist in the dictionary, it gets a score of 0
[statistics.mean([scoring_dict.get(word, 0) for word in page.split()]) for page in txt]
# [1.8, 2.5]
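
To write the result back to the DataFrame, the same dictionary lookup can be wrapped in a small helper. This is only a sketch on the sample data above: the function name `mean_score`, its `default` parameter, and the `score` column are hypothetical names chosen for illustration, and it uses plain `sum`/`len` instead of `statistics.mean`.

```python
import pandas as pd

# same sample data as above
df = pd.DataFrame(
    {'string': ['first page of text word_wo_score', 'second page of text']}
)
scoring = pd.Series([0, 2, 3, 4, 1], index=['first', 'page', 'of', 'text', 'second'])
scoring_dict = scoring.to_dict()

def mean_score(text, scores, default=0):
    """Mean score of the words in `text`; unknown words count as `default`."""
    words = text.split()
    if not words:
        # an empty row has no words to average
        return float(default)
    return sum(scores.get(word, default) for word in words) / len(words)

# hypothetical output column name
df['score'] = [mean_score(page, scoring_dict) for page in df['string']]
```

Building the dictionary once and doing plain lookups avoids constructing a new pandas object for every row.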

Answer 2

This is the fastest solution I have found so far. It cut the run time from 30 minutes to roughly 5-8 minutes.

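def score_function(text):
    # keep only the words of this row that also appear in scoring's index
    # (despite its name, `ser` is an Index, not a Series)
    ser = pd.Series(index=text.split(' ')).index.intersection(scoring.index)
    values = scoring[ser]
    # note: this returns the sum of the matched scores, not the mean
    return values.sum()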

Here scoring is a pandas Series with the words as its index and the scores as its values.
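
A possible alternative that stays in pandas but avoids calling a function per row is to split every row at once, look up all words in one pass, and average per row. This is an untimed sketch on made-up sample data, not a measured improvement; it assumes `scoring` has a unique index (required when a Series is passed to `map`) and that words without a score should count as 0, as in the question. The names `words`, `word_scores`, and `row_means` are hypothetical.

```python
import pandas as pd

# made-up sample data; the second row repeats the word 'page'
df = pd.DataFrame(
    {'string': ['first page of text word_wo_score', 'second page of text page']}
)
scoring = pd.Series([0, 2, 3, 4, 1], index=['first', 'page', 'of', 'text', 'second'])

# one word per row, keeping the original row label as the index
words = df['string'].str.split().explode()

# look up every word in scoring; words without a score become NaN, then 0
word_scores = words.map(scoring).fillna(0)

# average the word scores back to one value per original row
row_means = word_scores.groupby(level=0).mean()
```

Unlike the intersection-based function, this keeps repeated words and returns the mean rather than the sum, which matches the 'Im a programmer' >> [15,20,25] >> mean = 20 example in the question.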

python

pandas

data-science