Iterating through a list to create new columns in a dataframe with occurrence counts
I have a list that looks like this:
NER_LIST = ["PERSON", "NORP", "FAC", "ORG", "GPE", "LOC", "PRODUCT", "EVENT", "WORK_OF_ART", "LAW", "LANGUAGE", "DATE", "TIME", "PERCENT", "MONEY", "QUANTITY", "ORDINAL", "CARDINAL"]
I have a dataframe with a column that contains different entities from the list above, like this:
[('James Whitey Bulger', 'PERSON'), ('Coast Guard', 'ORG')]
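What I want is to create one column for each item in NER_LIST that counts the occurrences of that entity type in the row's list, and to add these columns to my existing dataframe.
So I would end up with, for example, a PERSON column in my dataset that counts how many persons are mentioned in the text column of each row.
Can anyone help? Thanks in advance!
df['PERSON_count'] = df.Entities.str.count('PERSON') works, but because I am working with multiple lists, I would like a way to create these columns in my dataset automatically.
Edit: an example of one row, which is an article, looks like this: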
[('Today', 'DATE'), ('Monday', 'DATE'), ('136 days', 'DATE'), ('the year', 'DATE'), ('August 17 2017', 'DATE'), ('Spanish', 'NORP'), ('Barcelona', 'GPE'), ('13', 'CARDINAL'), ('120 14th', 'DATE'), ('early the next day', 'DATE'), ('Six', 'CARDINAL'), ('two', 'CARDINAL'), ('1915', 'DATE'), ('Cobb County', 'GPE'), ('Georgia', 'GPE'), ('Jewish', 'NORP'), ('Leo Frank', 'PERSON'), ('13-year-old', 'DATE'), ('Mary Phagan', 'PERSON'), ('Frank', 'PERSON'), ('Georgia', 'GPE'), ('1986', 'DATE'), ('1960', 'DATE'), ('Beatles', 'PERSON'), ('the Silver Beetles', 'ORG'), ('first', 'ORDINAL'), ('Hamburg', 'GPE'), ('Germany', 'GPE'), ('Jimmy Hoffa', 'PERSON'), ('Chicago', 'GPE'), ('five years', 'DATE'), ('Hoffa', 'PERSON'), ('1971', 'DATE'), ('Richard Nixon', 'PERSON'), ('1969', 'DATE'), ('Hurricane Camille', 'EVENT'), ('Mississippi', 'LOC'), ('256', 'CARDINAL'), ('three', 'CARDINAL'), ('Cuba', 'GPE'), ('1978', 'DATE'), ('first', 'ORDINAL'), ('trans-Atlantic', 'LOC'), ('Maxie Anderson Ben Abruzzo', 'PERSON'), ('Larry Newman', 'PERSON'), ('1982', 'DATE'), ('first', 'ORDINAL'), ('Hanover West Germany', 'LOC'), ('1983', 'DATE'), ('Ira Gershwin', 'PERSON'), ('Beverly Hills', 'GPE'), ('Calif', 'GPE'), ('age 86', 'DATE'), ('1987', 'DATE'), ('Rudolf Hess', 'PERSON'), ('Hitler', 'PERSON'), ('Spandau Prison', 'PERSON'), ('age 93', 'DATE'), ('1988', 'DATE'), ('Pakistani', 'NORP'), ('Mohammad Zia', 'PERSON'), ('Arnold Raphel', 'PERSON'), ('1998', 'DATE'), ('Bill Clinton', 'PERSON'), ('the White House', 'ORG'), ('Monica Lewinsky', 'PERSON'), ('Lewinsky', 'PERSON'), ('Kenneth Starr', 'PERSON'), ('1999', 'DATE'), ('more than 17', 'CARDINAL'), ('Turkey', 'GPE'), ('2018', 'DATE'), ('Donald Trump', 'PERSON')]
After our discussion, it appears that your list is actually a string:
I need to keep it as a string preferably, initially I thought the solution was to use something with str find.
NER_list should be the columns in the end.
This works: df['PERSON_count'] = df.Entities.str.count('PERSON') but is very manual.
I'd like to iterate over each item in the NER list and create these columns automatically.
So, after building a regex pattern, you can use str.findall to find all the items of the NER_LIST variable:
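import pandas as pd
from collections import Counter

# 'A' is the column holding the entity string (Entities in the question).
# Capture every quoted label, then count the labels per row, one column per NER_LIST item.
out = df['A'].str.findall(fr"'({'|'.join(NER_LIST)})'") \
             .apply(lambda x: pd.Series(Counter(x), index=NER_LIST)) \
             .fillna(0).astype(int)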
Output:
   PERSON  NORP  FAC  ORG  GPE  LOC  PRODUCT  EVENT  WORK_OF_ART  LAW  LANGUAGE  DATE  TIME  PERCENT  MONEY  QUANTITY  ORDINAL  CARDINAL
0      20     3    0    2   11    3        0      1            0    0         0    24     0        0      0         0        3         6
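Since the question asks for the counts to end up in the existing dataframe, a plain loop over NER_LIST is another option. The following is only a minimal sketch: it assumes the column is named Entities and holds the string form of the list (as in the question), it reuses the `_count` suffix from the question's own example, and the two sample rows are made up for illustration. The label is matched together with its quotes so that the same word appearing inside an entity's text is less likely to be counted.

```python
import pandas as pd

NER_LIST = ["PERSON", "NORP", "FAC", "ORG", "GPE", "LOC", "PRODUCT", "EVENT",
            "WORK_OF_ART", "LAW", "LANGUAGE", "DATE", "TIME", "PERCENT",
            "MONEY", "QUANTITY", "ORDINAL", "CARDINAL"]

# Illustrative data: each cell is a string, not a real Python list.
df = pd.DataFrame({"Entities": [
    "[('James Whitey Bulger', 'PERSON'), ('Coast Guard', 'ORG')]",
    "[('Today', 'DATE'), ('Monday', 'DATE'), ('Barcelona', 'GPE')]",
]})

for label in NER_LIST:
    # Count occurrences of the quoted label, e.g. 'PERSON', in each row's string.
    df[f"{label}_count"] = df["Entities"].str.count(f"'{label}'")

print(df[["PERSON_count", "ORG_count", "DATE_count", "GPE_count"]])
```

The `out` frame built with str.findall above could be attached in a similar way with df.join(out), since it shares the index of df.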
Old answer
>>> df
                                                   A
0  [(Today, DATE), (Monday, DATE), (136 days, DAT...
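Note that this older approach assumes each cell holds a real Python list of tuples. If the column is stored as a string, as turned out to be the case here, one possible way to get such a list is ast.literal_eval. This is a sketch under that assumption; the column name A_parsed is hypothetical and the sample row is taken from the question.

```python
import ast
import pandas as pd

# Illustrative data: the cell is the string form of a list of (text, label) tuples.
df = pd.DataFrame({"A": [
    "[('James Whitey Bulger', 'PERSON'), ('Coast Guard', 'ORG')]",
]})

# literal_eval safely evaluates Python literals (lists, tuples, strings), not arbitrary code.
df["A_parsed"] = df["A"].apply(ast.literal_eval)

print(type(df["A_parsed"].iloc[0]))
```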
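Reshape your dataframe:
# One row per (text, label) tuple, then move the label into the index.
out = df['A'].explode().apply(pd.Series) \
             .set_index(1, append=True).rename_axis(['row', 'ner'])
At this point, your dataframe looks like this: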
>>> out
                           0
row ner
0   DATE               Today
    DATE              Monday
    DATE            136 days
    DATE            the year
    DATE      August 17 2017
...                      ...
    DATE                1999
    CARDINAL    more than 17
    GPE               Turkey
    DATE                2018
    PERSON      Donald Trump
It is not clear which count you want. Is it this one?
>>> out.index.to_frame().reset_index(drop=True).value_counts()
row  ner
0    DATE        24
     PERSON      20
     GPE         11
     CARDINAL     6
     LOC          3
     NORP         3
     ORDINAL      3
     ORG          2
     EVENT        1
dtype: int64
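The count above is in long format, with one line per (row, label) pair. To get one column per NER_LIST item instead, including labels that never occur, the labels of each row can be counted with a Counter and the result reindexed. This is a sketch only: it assumes column A holds real lists of tuples, the names counts and result are made up for illustration, and the two sample rows are not from the original data.

```python
from collections import Counter

import pandas as pd

NER_LIST = ["PERSON", "NORP", "FAC", "ORG", "GPE", "LOC", "PRODUCT", "EVENT",
            "WORK_OF_ART", "LAW", "LANGUAGE", "DATE", "TIME", "PERCENT",
            "MONEY", "QUANTITY", "ORDINAL", "CARDINAL"]

# Illustrative data: each cell is a real list of (text, label) tuples.
df = pd.DataFrame({"A": [
    [("James Whitey Bulger", "PERSON"), ("Coast Guard", "ORG")],
    [("Today", "DATE"), ("Monday", "DATE"), ("Barcelona", "GPE")],
]})

# One Counter per row, keyed by label; labels absent from a row become NaN, then 0.
counts = pd.DataFrame(
    [Counter(label for _, label in row) for row in df["A"]],
    index=df.index,
).reindex(columns=NER_LIST).fillna(0).astype(int)

# Attach the count columns to the original dataframe.
result = df.join(counts)
print(result[["PERSON", "ORG", "DATE", "GPE"]])
```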