Pandas manipulation: matching data from other columns to one column, applied uniquely to all rows

Question

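I have a model that predicts 10 words for a given course, in order of likelihood, and I want the top 5 of those words that appear in the course's description.

This is the format of the data: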

course_name    course_title    course_description    predicted_word_10    predicted_word_9    predicted_word_8    predicted_word_7    predicted_word_6    predicted_word_5    predicted_word_4    predicted_word_3    predicted_word_2    predicted_word_1
Xmath 32    Precalculus    Polynomial and rational functions, exponential...    directed    scholars    approach    build    african    different    visual    cultures    placed    global
Xphilos 2    Morality    Introduction to ethical and political philosop...    make    presentation    weekly    european    ways    general    range    questions    liberal    speakers

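My idea is to iterate over each row, starting from predicted_word_1, until I have the first 5 predicted words that appear in the description. I want to save these words, in the order they appear, in additional columns description_word_1 ... description_word_5. (If fewer than 5 predicted words appear in the description, I plan to return NaN in the corresponding columns.)

To illustrate with an example: if a course's course_description is 'Polynomial and rational functions, exponential and logarithmic functions, trigonometry and trigonometric functions. Complex numbers, fundamental theorem of algebra, mathematical induction, binomial theorem, series, and sequences. ' and its first few predicted words are irrelevantword1, induction, exponential, logarithmic, irrelevantword2, polynomial, algebra...

I would want to return induction, exponential, logarithmic, polynomial, algebra, in that order, and do the same for the rest of the courses.

My attempt was to define an apply function that takes in a row and iterates from the first predicted word until it finds the first 5 that are in the description. The part I can't figure out is how to create the additional columns holding the correct words for each course. This code currently keeps only one course's words for all rows.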

def find_top_description_words(row):
    print(row['course_title'])
    description_words_index=1
    for i in range(num_words_per_course): 
        description = row.loc['course_description']
        word_i = row.loc['predicted_word_' + str(i+1)]
        if (word_i in description) & (description_words_index <=5) :
            print(description_words_index)
            row['description_word_' + str(description_words_index)] = word_i
            description_words_index += 1


df.apply(find_top_description_words,axis=1)

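The end goal of this manipulation is to keep both the model's top 10 predicted words and the top 5 predicted words that appear in the description, so that the dataframe looks like this: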

course_name course_title  course_description top_description_word_1 ... top_description_word_5 predicted_word_1 ... predicted_word_10

Any pointers would be appreciated. Thank you!

Answer 1

If I understand correctly:

Build, for each row, a list containing only its 10 predicted words:

pred_words_lists = df.apply(lambda x: list(x[3:].dropna())[::-1], axis = 1)

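Note that each row now holds a list of predicted words. The order is right: the first non-null predicted word comes first, the second comes second, and so on.

Now let's create a new DataFrame from those lists: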

pred_words_df = pd.DataFrame(pred_words_lists.tolist())
pred_words_df.columns = df.columns[:2:-1]

The final DataFrame:

final_df = df[['course_name', 'course_title', 'course_description']].join(pred_words_df.iloc[:,0:11])

Hope this works.

Edit
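To make the steps above concrete, here is a minimal sketch on a toy frame. The data is made up and shortened to three predicted words per course; it assumes, as in the question, that the first three columns are the course columns and that the predicted-word columns run from least to most likely. One caveat: because missing predictions are dropped before the lists are expanded, a NaN in a middle rank would shift the later words up by one column.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data) with the same column layout as in the question,
# shortened to three predicted words per course.
df = pd.DataFrame({
    "course_name": ["Xmath 32", "Xphilos 2"],
    "course_title": ["Precalculus", "Morality"],
    "course_description": ["Polynomial and rational functions",
                           "Introduction to ethical philosophy"],
    "predicted_word_3": ["build", np.nan],
    "predicted_word_2": ["rational", "ethical"],
    "predicted_word_1": ["functions", "weekly"],
})

# Per row: drop missing predictions, then reverse so the most likely word comes first.
pred_words_lists = df.apply(lambda x: list(x.iloc[3:].dropna())[::-1], axis=1)

# Expand the lists into columns; shorter lists are padded with missing values.
pred_words_df = pd.DataFrame(pred_words_lists.tolist(), index=df.index)
# Reversed predicted-word column names: predicted_word_1, predicted_word_2, ...
pred_words_df.columns = df.columns[:2:-1][:pred_words_df.shape[1]]

final_df = df[["course_name", "course_title", "course_description"]].join(pred_words_df)
print(final_df)
```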

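def common_elements(xx, yy):
    # Map each description word to its position in the description.
    temp = pd.Series(range(0, len(xx)), index= xx)
    # Look up the predicted words and keep the ones found, in description order.
    return list(temp.reindex(yy).sort_values()[0:10].dropna().index)

pred_words_lists = df.apply(lambda x: common_elements(x.iloc[2].replace(',','').split(), list(x.iloc[3:].dropna())), axis = 1)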

Does this meet your requirements?

Adapted solution (OP):
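For illustration, here is a minimal sketch of the lookup idea from the edit, run on a single description. The sample strings are made up, and the duplicate-dropping line is an addition: `reindex` raises an error when the index being reindexed contains repeated labels, which happens as soon as a description repeats a word. Note that this variant returns the matches in the order they occur in the description, not in order of prediction likelihood, and that matching is case-sensitive.

```python
import pandas as pd

def common_elements(description_words, predicted_words):
    # Position of each description word; keep only the first occurrence,
    # because reindex raises on duplicate index labels.
    positions = pd.Series(range(len(description_words)), index=description_words)
    positions = positions[~positions.index.duplicated()]
    # Look up each predicted word; words absent from the description become NaN.
    found = positions.reindex(predicted_words).dropna().sort_values()
    return list(found.index)

# Hypothetical sample data.
description = "Polynomial and rational functions, exponential and logarithmic functions"
predicted = ["irrelevantword1", "exponential", "logarithmic", "Polynomial"]

result = common_elements(description.replace(",", "").split(), predicted)
print(result)
```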

def get_sorted_descriptions_words(course_description, predicted_words, k):
    # Split the description into words, ignoring commas.
    description_words = course_description.replace(',','').split()
    predicted_words_list = list(predicted_words)
    # Map each predicted word to its position in the list of predictions.
    predicted_words = pd.Series(range(0, len(predicted_words_list)), index=predicted_words_list)
    # reindex needs unique labels, so keep the first occurrence of each word.
    predicted_words = predicted_words[~predicted_words.index.duplicated()]
    # Keep the description words that were predicted, sorted by prediction position.
    ordered_description = predicted_words.reindex(description_words).dropna().sort_values()
    ordered_description_list = pd.Series(ordered_description.index).unique()[:k]

    return ordered_description_list

k = 5  # number of description words to keep per course
df.apply(lambda x: get_sorted_descriptions_words(x['course_description'], x.filter(regex=r'predicted_word_.*'), k), axis=1)
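The call above yields one array of words per row. To get the separate top_description_word_1 ... top_description_word_5 columns the question asks for, one option is to return a `pd.Series` from the applied function, which `apply` expands into columns. The following is only a sketch under several assumptions that are not in the original post: the helper name `top_description_words` is made up, matching is case-insensitive on whitespace-separated tokens with surrounding punctuation stripped, the rank is parsed from the numeric suffix of each predicted_word_N column, and the data is a toy example.

```python
import numpy as np
import pandas as pd

K = 5  # number of description words to keep (taken from the question)

def top_description_words(row, k=K):
    # Lower-cased description tokens with surrounding punctuation stripped
    # (an assumption about how words should be matched).
    tokens = {w.strip(".,;:()'\"").lower() for w in str(row["course_description"]).split()}
    # Predicted-word columns ordered by rank: predicted_word_1, predicted_word_2, ...
    cols = sorted((c for c in row.index if c.startswith("predicted_word_")),
                  key=lambda c: int(c.rsplit("_", 1)[1]))
    hits = []
    for c in cols:
        word = row[c]
        if isinstance(word, str) and word.lower() in tokens and word not in hits:
            hits.append(word)
        if len(hits) == k:
            break
    # Pad with NaN when fewer than k predicted words occur in the description.
    hits += [np.nan] * (k - len(hits))
    return pd.Series(hits, index=[f"top_description_word_{i}" for i in range(1, k + 1)])

# Hypothetical toy data with seven predicted words per course.
words = [
    ["irrelevantword1", "induction", "exponential", "logarithmic",
     "irrelevantword2", "polynomial", "algebra"],
    ["make", "ethical", "weekly", "european", "ways", "general", "range"],
]
df = pd.DataFrame({
    "course_name": ["Xmath 32", "Xphilos 2"],
    "course_title": ["Precalculus", "Morality"],
    "course_description": [
        "Polynomial and rational functions, exponential and logarithmic functions. "
        "Fundamental theorem of algebra, mathematical induction.",
        "Introduction to ethical and political philosophy",
    ],
})
for rank in range(1, 8):
    df[f"predicted_word_{rank}"] = [w[rank - 1] for w in words]

# Returning a Series from the applied function makes apply produce one column per entry.
top_words = df.apply(top_description_words, axis=1)

# Column order requested in the question: course columns, top description words, predictions.
base = ["course_name", "course_title", "course_description"]
result = pd.concat([df[base], top_words, df.drop(columns=base)], axis=1)
print(result)
```

Returning a Series rather than assigning into `row` is the key difference from the attempt in the question: changes made to `row` inside the applied function are not written back to the DataFrame, whereas a returned Series becomes a row of the result.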
