list.remove removes certain elements but not all

Question

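I am trying to use .remove() to delete elements from lists (the lists are stored in a pandas DataFrame). The basic idea is that I iterate over all rows in the DataFrame, then over each element in the row (= a list), and check whether that particular element is a keeper or a "goner".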

import pandas as pd
from tqdm import tqdm

data=pd.read_csv('raw_output_v2.csv', names=['ID','Body'])
data['Body']=data['Body'].apply(eval)  # parse each stringified list back into a Python list
keyword_dict={}
for row in tqdm(data['Body'], desc="building dict"):
    for word in row:
        if word in keyword_dict:
            keyword_dict[word]+=1
        else:
            keyword_dict[word]=1 

new_df=remove_sparse_words_from_df(data, keyword_dict, cutoff=1_000_000)

The important part is:

def remove_sparse_words_from_df(df, term_freq, cutoff=1):
    i=0
    for row in tqdm(df['Body'],desc="cleaning df"):
        for word in row:
            if term_freq[word]<=cutoff:
                row.remove(word)
            else:
                continue
    return df

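I have uploaded a short sample csv to use here: https://pastebin.com/g25bHCC7.

My problem is: the remove_sparse_words_from_df function removes some of the words that are at or below the cutoff, but not all of them. Example: the word "clean" appears about 10k times in the original DataFrame (data); after running remove_sparse_words_from_df, about 2k occurrences still remain. The same happens with other words.

What am I missing?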

Answer 1

You are modifying the list (row.remove) while iterating over it (for word in row:). This is a well-known pitfall; one explanation of why it causes problems:

Modifying a sequence while iterating over it can cause undesired behavior due to the way the iterator is built. To avoid this problem, a simple solution is to iterate over a copy of the list... using the slice notation with default values list_1[:]
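The skipping can be reproduced on a tiny list. The sketch below is only an illustration: the function names and the sample list are made up for the demonstration and are not part of the question's code.

```python
def remove_while_iterating(items, unwanted):
    """Buggy variant: removes from the same list that is being iterated."""
    for item in items:
        if item == unwanted:
            # After this call every later element shifts one slot to the left,
            # but the iterator's internal index still moves forward by one,
            # so the element that slid into the freed slot is never visited.
            items.remove(item)
    return items


def remove_iterating_over_copy(items, unwanted):
    """Fixed variant: iterates over a shallow copy (items[:]), mutates the original."""
    for item in items[:]:
        if item == unwanted:
            items.remove(item)
    return items


buggy = remove_while_iterating(["a", "a", "b", "a", "a"], "a")
fixed = remove_iterating_over_copy(["a", "a", "b", "a", "a"], "a")
print(buggy)  # ['b', 'a', 'a']
print(fixed)  # ['b']
```

In the buggy variant two of the four "a" entries survive, which matches the pattern in the question where only part of the occurrences of "clean" were removed.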

    ...
    for row in tqdm(df['Body'],desc="cleaning df"):
        for word in row[:]:
            if term_freq[word]<=cutoff:
                row.remove(word)
    ...

With the cutoff set to 1_000_000:

                   ID Body
0  (1483785165, 2009)   []
1  (1538280431, 2010)   []
2  (1795044103, 2010)   []
...
...
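As an alternative to removing from a copy, each row can be rebuilt with a list comprehension, so no list is mutated while it is being iterated. This is a minimal sketch in plain Python rather than the asker's exact pipeline: count_terms, keep_frequent_words, and the sample rows are hypothetical names and data chosen for illustration.

```python
from collections import Counter


def count_terms(rows):
    """Count how often each word occurs across all rows."""
    counts = Counter()
    for row in rows:
        counts.update(row)
    return counts


def keep_frequent_words(rows, term_freq, cutoff=1):
    """Return new rows containing only words whose count is above the cutoff."""
    return [[word for word in row if term_freq[word] > cutoff] for row in rows]


# Made-up sample data standing in for the parsed 'Body' column.
rows = [["clean", "house", "clean"], ["house", "garden"], ["clean"]]
term_freq = count_terms(rows)
filtered = keep_frequent_words(rows, term_freq, cutoff=1)
print(filtered)  # [['clean', 'house', 'clean'], ['house'], ['clean']]
```

With the DataFrame from the question, the same comprehension could be applied per row through df['Body'].apply(...) and assigned back to the column. Building a new list also avoids calling row.remove repeatedly, where each call scans the list from the start.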

python

nlp

list

pandas