Iterating through a nested list of strings to get the first item
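I am trying to extract items from the gen column of a DataFrame (example below). My goal is to iterate over each row of gen and write the items that match a predefined list, genre_code, to a new DataFrame column.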
df = pd.DataFrame({'id': [620, 843, 986], 'tit': ['AAA', 'BBB', 'CCC'], 'gen': [['Romance', 'Satire', 'Fiction'], ['Science Fiction', 'Novel'], ['Mystery', 'Novel']]})
genre_code = ['Science Fiction', 'Mystery', 'Non-fiction']
So far, I have come up with the following:
new_gen = []
for i in df['gen']:
    for j in i:
        if j in genre_code:
            new_gen.append(j)
        else:
            new_gen.append('NA')
df['gen'] = new_gen
It does iterate over the column, but the length of the resulting new_gen does not match the number of rows in the original DataFrame.
/usr/local/lib/python3.7/dist-packages/pandas/core/internals/construction.py in sanitize_index(data, index)
746 if len(data) != len(index):
747 raise ValueError(
--> 748 "Length of values "
749 f"({len(data)}) "
750 "does not match length of index "
ValueError: Length of values (30004) does not match length of index (12841)
I know this must be something very basic, but could someone point out what I am missing?
If you want to filter the gen column based on your list, you can do this:
df["gen"] = df["gen"].apply(lambda x: [g for g in x if g in genre_code])
print(df)
Prints:
id tit gen
0 620 AAA []
1 843 BBB [Science Fiction]
2 986 CCC [Mystery]
P.S.: To speed up processing, you can convert genre_code to a set() beforehand:
genre_code = set(["Science Fiction", "Mystery", "Non-fiction"])
df["gen"] = df["gen"].apply(lambda x: [g for g in x if g in genre_code])
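The length mismatch in the question comes from the inner loop appending once per genre rather than once per row, so new_gen ends up with one entry per item across all lists. If the goal is a single value per row (the first matching genre, or 'NA' when nothing matches), a minimal sketch could look like the following. The helper name first_match and the column name first_gen are made up for illustration and are not part of the original post.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [620, 843, 986],
    "tit": ["AAA", "BBB", "CCC"],
    "gen": [
        ["Romance", "Satire", "Fiction"],
        ["Science Fiction", "Novel"],
        ["Mystery", "Novel"],
    ],
})
genre_code = {"Science Fiction", "Mystery", "Non-fiction"}


def first_match(genres, allowed, default="NA"):
    # Hypothetical helper: return the first genre found in `allowed`,
    # or `default` when the row has no matching genre.
    return next((g for g in genres if g in allowed), default)


# One result per row, so the new column always has the same length as df.
df["first_gen"] = df["gen"].apply(lambda genres: first_match(genres, genre_code))
print(df[["id", "first_gen"]])
```

Because apply calls the function once per row, the result lines up with the DataFrame index regardless of how many genres each row holds.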
I would convert the lists to strings and then use Series.str.findall to return the matching genre_code entries:
df['new_gen'] = df['gen'].astype(str).str.findall('|'.join(genre_code))
print(df)
id tit gen new_gen
0 620 AAA [Romance, Satire, Fiction] []
1 843 BBB [Science Fiction, Novel] [Science Fiction]
2 986 CCC [Mystery, Novel] [Mystery]
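One caveat with the regex approach: the joined pattern is interpreted as a regular expression and is matched against the stringified list, so a code containing special characters (such as + or parentheses) could misbehave, and a short code could match inside a longer genre name. A hedged variant that escapes each code with re.escape and tries longer codes first might look like this. It is a sketch under those assumptions, not a replacement for the list-based filter.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "id": [620, 843, 986],
    "tit": ["AAA", "BBB", "CCC"],
    "gen": [
        ["Romance", "Satire", "Fiction"],
        ["Science Fiction", "Novel"],
        ["Mystery", "Novel"],
    ],
})
genre_code = ["Science Fiction", "Mystery", "Non-fiction"]

# Escape regex metacharacters in each code, and put longer codes first so
# that a longer code is preferred over a shorter one it happens to contain.
pattern = "|".join(
    re.escape(code) for code in sorted(genre_code, key=len, reverse=True)
)

# astype(str) turns each list into its printed form, e.g. "['Mystery', 'Novel']".
df["new_gen"] = df["gen"].astype(str).str.findall(pattern)
print(df)
```

Substring matches are still possible with this version (a code such as 'Fiction' would be found inside 'Science Fiction'), so exact membership tests on the list elements remain the safer choice when the codes overlap.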