Iterating through a list to create new columns in a dataframe with occurrence counts

I have several lists like this one:

NER_LIST = ["PERSON", "NORP", "FAC", "ORG", "GPE", "LOC", "PRODUCT", "EVENT", "WORK_OF_ART", "LAW", "LANGUAGE", "DATE", "TIME", "PERCENT", "MONEY", "QUANTITY", "ORDINAL", "CARDINAL"]

I have a dataframe with a column that contains different entities from the list above, like this:

[('James Whitey Bulger', 'PERSON'), ('Coast Guard', 'ORG')]

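What I want is to create one column for each item in NER_LIST that counts how many times that entity label occurs in each row's list, and to add those columns to my existing dataframe.

So I would end up with a PERSON column in my dataset that counts, for each row, how many persons are mentioned in the text column.

Can anyone help? Thanks in advance!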

df['PERSON_count'] = df.Entities.str.count('PERSON') works, but because I am working with multiple lists, I would like a way to create these columns in my dataset automatically.

Edit: an example of one row is an article that looks like this:

[('Today', 'DATE'), ('Monday', 'DATE'), ('136 days', 'DATE'), ('the year', 'DATE'), ('August 17 2017', 'DATE'), ('Spanish', 'NORP'), ('Barcelona', 'GPE'), ('13', 'CARDINAL'), ('120 14th', 'DATE'), ('early the next day', 'DATE'), ('Six', 'CARDINAL'), ('two', 'CARDINAL'), ('1915', 'DATE'), ('Cobb County', 'GPE'), ('Georgia', 'GPE'), ('Jewish', 'NORP'), ('Leo Frank', 'PERSON'), ('age 13', 'DATE'), ('Mary Phagan', 'PERSON'), ('Frank', 'PERSON'), ('Georgia', 'GPE'), ('1986', 'DATE'), ('1960', 'DATE'), ('Beatles', 'PERSON'), ('the Silver Beetles', 'ORG'), ('first', 'ORDINAL'), ('Hamburg', 'GPE'), ('Germany', 'GPE'), ('Jimmy Hoffa', 'PERSON'), ('Chicago', 'GPE'), ('five years', 'DATE'), ('Hoffa', 'PERSON'), ('1971', 'DATE'), ('Richard Nixon', 'PERSON'), ('1969', 'DATE'), ('Hurricane Camille', 'EVENT'), ('Mississippi', 'LOC'), ('256', 'CARDINAL'), ('three', 'CARDINAL'), ('Cuba', 'GPE'), ('1978', 'DATE'), ('first', 'ORDINAL'), ('trans-Atlantic', 'LOC'), ('Maxie Anderson Ben Abruzzo', 'PERSON'), ('Larry Newman', 'PERSON'), ('1982', 'DATE'), ('first', 'ORDINAL'), ('Hanover West Germany', 'LOC'), ('1983', 'DATE'), ('Ira Gershwin', 'PERSON'), ('Beverly Hills', 'GPE'), ('Calif', 'GPE'), ('age 86', 'DATE'), ('1987', 'DATE'), ('Rudolf Hess', 'PERSON'), ('Hitler', 'PERSON'), ('Spandau Prison', 'PERSON'), ('age 93', 'DATE'), ('1988', 'DATE'), ('Pakistani', 'NORP'), ('Mohammad Zia', 'PERSON'), ('Arnold Raphel', 'PERSON'), ('1998', 'DATE'), ('Bill Clinton', 'PERSON'), ('the White House', 'ORG'), ('Monica Lewinsky', 'PERSON'), ('Lewinsky', 'PERSON'), ('Kenneth Starr', 'PERSON'), ('1999', 'DATE'), ('more than 17', 'CARDINAL'), ('Turkey', 'GPE'), ('2018', 'DATE'), ('Donald Trump', 'PERSON')]

Following our discussion, it appears that your list is actually a string:

I need to keep it as a string, preferably. Initially I thought the solution was to use something with str.find. NER_LIST should be the columns in the end.

This works: df['PERSON_count'] = df.Entities.str.count('PERSON') but is very manual.

I'd like to iterate over each item in NER_LIST and create these columns automatically.

So, after building a regex pattern from NER_LIST, you can use str.findall to find every label from that list in each row:

from collections import Counter
import pandas as pd

# Find every quoted label from NER_LIST in each row, then count the matches
out = df['A'].str.findall(fr"'({'|'.join(NER_LIST)})'") \
             .apply(lambda x: pd.Series(Counter(x), index=NER_LIST)) \
             .fillna(0).astype(int)

Output:

   PERSON  NORP  FAC  ORG  GPE  LOC  PRODUCT  EVENT  WORK_OF_ART  LAW  LANGUAGE  DATE  TIME  PERCENT  MONEY  QUANTITY  ORDINAL  CARDINAL
0      20     3    0    2   11    3        0      1            0    0         0    24     0        0      0         0        3         6
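The question also asks for the counts to be added to the existing dataframe. A minimal sketch of that step, assuming the column is named 'A' as above and using a made-up two-row frame; `DataFrame.join` aligns the counts on the row index. The dict comprehension shows the more literal "loop over NER_LIST" variant with `str.count`, which should give the same numbers when the label is matched together with its surrounding quotes.

```python
from collections import Counter
import pandas as pd

NER_LIST = ["PERSON", "NORP", "FAC", "ORG", "GPE", "LOC", "PRODUCT", "EVENT",
            "WORK_OF_ART", "LAW", "LANGUAGE", "DATE", "TIME", "PERCENT",
            "MONEY", "QUANTITY", "ORDINAL", "CARDINAL"]

# Hypothetical sample data: each cell is the *string* form of a list of tuples
df = pd.DataFrame({"A": [
    "[('James Whitey Bulger', 'PERSON'), ('Coast Guard', 'ORG')]",
    "[('Today', 'DATE'), ('Leo Frank', 'PERSON'), ('Frank', 'PERSON')]",
]})

# findall + Counter, as in the answer above
pattern = fr"'({'|'.join(NER_LIST)})'"
counts = (df["A"].str.findall(pattern)
                 .apply(lambda x: pd.Series(Counter(x), index=NER_LIST))
                 .fillna(0).astype(int))

# Loop variant: one str.count call per label, quotes included in the pattern
loop_counts = pd.DataFrame(
    {ner: df["A"].str.count(f"'{ner}'") for ner in NER_LIST}
)

# Attach the count columns to the original frame (aligned on the index)
result = df.join(counts)
```

Both variants work on the raw string, so an entity whose text is itself exactly a label (for example `('DATE', 'ORG')`) would be counted under both labels. If that matters for your data, parsing the string into real tuples first is the safer route.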

Old answer

>>> df
                                                   A
0  [(Today, DATE), (Monday, DATE), (136 days, DAT...

Reshape your dataframe:

out = df['A'].explode().apply(pd.Series) \
             .set_index(1, append=True).rename_axis(['row', 'ner'])

At this point, your dataframe looks like this:

>>> out
                           0
row ner                     
0   DATE               Today
    DATE              Monday
    DATE            136 days
    DATE            the year
    DATE      August 17 2017
...                      ...
    DATE                1999
    CARDINAL    more than 17
    GPE               Turkey
    DATE                2018
    PERSON      Donald Trump
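The reshaping step above expects each cell to hold a real Python list of tuples. If the column actually holds the string form of that list, one way to bridge the gap is to parse it first with `ast.literal_eval`. This is a sketch on a made-up one-row frame, not something taken from the original data:

```python
import ast
import pandas as pd

# Hypothetical sample: the cell is a string, not a list
df = pd.DataFrame({"A": ["[('Today', 'DATE'), ('Leo Frank', 'PERSON')]"]})

# Safely evaluate the string into a list of (text, label) tuples
parsed = df["A"].apply(ast.literal_eval)

# Same reshaping as above: one row per entity, label moved into the index
out = (parsed.explode().apply(pd.Series)
             .set_index(1, append=True).rename_axis(["row", "ner"]))
```

`ast.literal_eval` only accepts Python literals, so it is a safer choice than `eval` for data read from a file.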

It is not clear to me how you want to count. Is this what you are after?

>>> out.index.to_frame().reset_index(drop=True).value_counts()
row  ner     
0    DATE        24
     PERSON      20
     GPE         11
     CARDINAL     6
     LOC          3
     NORP         3
     ORDINAL      3
     ORG          2
     EVENT        1
dtype: int64
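If you want one column per label rather than this long format, the counts can be pivoted with `unstack` and padded with `reindex` so that labels that never occur still get a zero column. A minimal sketch on made-up data, assuming each cell already holds a list of tuples:

```python
import pandas as pd

NER_LIST = ["PERSON", "NORP", "FAC", "ORG", "GPE", "LOC", "PRODUCT", "EVENT",
            "WORK_OF_ART", "LAW", "LANGUAGE", "DATE", "TIME", "PERCENT",
            "MONEY", "QUANTITY", "ORDINAL", "CARDINAL"]

# Hypothetical sample data: two rows of (text, label) tuples
df = pd.DataFrame({"A": [
    [("Today", "DATE"), ("Leo Frank", "PERSON"), ("Frank", "PERSON")],
    [("Coast Guard", "ORG")],
]})

# Reshape as above: one row per entity, label in the index
out = (df["A"].explode().apply(pd.Series)
              .set_index(1, append=True).rename_axis(["row", "ner"]))

# Long-format counts per (row, ner) pair
long_counts = out.index.to_frame().reset_index(drop=True).value_counts()

# Pivot labels into columns and add zero columns for labels that never occur
wide = (long_counts.unstack("ner", fill_value=0)
                   .reindex(columns=NER_LIST, fill_value=0))

# Attach to the original frame; "row" holds the original index values
result = df.join(wide)
```

The `reindex` call also fixes the column order to match NER_LIST, which keeps the result consistent across dataframes that contain different subsets of labels.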