Find predecessor and successor of substring in Pandas dataframe

Question

I want to find the predecessor and successor of a substring in a Pandas dataframe.

Any idea how to do this in pandas or Python?

Answer 1

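You can use str.extractall:

# Capture every "<number> <word>" pair; each match becomes its own row
df1 = (df['Text'].str.extractall(r'(\d+)\s+([^\s]+)')
                 .droplevel(level=1)    # drop the per-row match counter
                 .reset_index()         # original row label becomes the 'index' column
                 .rename(columns={0: 'values', 1: 'columns'})
                 .astype({'values': int}))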

Intermediate result:

>>> df1
   index values columns
0      0     20    bats
1      0     10    cups
2      1     10    cups
3      1      5   balls
4      2      4    bags
5      2      6    cups
6      3     13    bats
7      3     14    bats

Now pivot and join:

df = df.join(df1.pivot_table(index='index', columns='columns',
                             values='values', aggfunc='sum', fill_value=0))

Output:

>>> df
   ID                                          Text  bags  balls  bats  cups
0   1            I bought 20 bats today and 10 cups     0      0    20    10
1   2                    I need 10 cups and 5 balls     0      5     0    10
2   3         I will buy 4 bags and 6 cups tomorrow     4      0     0     6
3   1  I bought 13 bats yesterday and 14 bats today     0      0    27     0

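Regular expression:

(\d+) matches one or more digits
\s+ followed by one or more whitespace characters
([^\s]+) captures everything up to the next whitespace character

(...) is a capturing group.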
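If it helps to see the pattern on its own, here is a small sketch using the standard re module on one of the sample sentences. PATTERN and the trailing-period sentence are just illustrations I made up. Note that [^\s]+ also swallows punctuation that is attached to the word, so you may want to strip it afterwards or use a stricter class.

```python
import re

# Same pattern as in the answer: digits, whitespace, then a run of non-whitespace
PATTERN = re.compile(r"(\d+)\s+([^\s]+)")

text = "I bought 20 bats today and 10 cups"
pairs = PATTERN.findall(text)
print(pairs)  # [('20', 'bats'), ('10', 'cups')]

# Punctuation glued to the word is captured too
print(PATTERN.findall("I need 5 balls."))  # [('5', 'balls.')]
```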
