Dataframe Replace Regex

Question

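I am currently trying to replace a set of str values in my Dataframe with int values in Python. The DataFrame has over 200 columns, such as Age_Range, Car_Year, Car_Count, Home_Value, Supermarket_Spend_Per_week, Household_Income, etc.

The answers (in the columns) start with a., b., c., d., e., f. and so on for the different responses.

For example: a. Under $20k, b. $20 to $30k, c. $30 to $50k, etc.

I have read through the wiki and know how to replace using word boundaries and the like. However, I want to replace any value starting with a with 1, any value starting with b with 2, and so on.

How would I write this for my Dataframe? Every regex function I have tried ends in invalid syntax.

I currently have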

income
h. No Answer
f. 0 to 0k
c.  to k
b.  to k
b.  to k
c.  to k
h. No Answer

and I would like to convert it to

income
8
5
3
2
2
3
7

Having the values as integers will make it easier for me to chart the results and look for relationships between columns.

Answer 1

This might be one way to achieve your goal:

>>> import re
>>> re.sub(r"^([abcdef])", lambda x: str(ord(x.group(0))-ord('a')), "b. US$blah blabh")
'1. US$blah blabh'

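What it does is "match either of the characters 'a' through 'f' at the beginning of a string and substitute it with the string representation of the offset of that character with respect to the letter 'a'". Repeat for each line of text. Note that the offset is zero-based, so 'a' becomes 0 and 'b' becomes 1; add 1 to get the numbering asked for in the question.

With some extra plumbing you can get rid of the rest of the input line; it is a bit unclear what you want as output.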
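A minimal sketch of that extra plumbing, assuming the wanted output is only the 1-based integer for the leading letter and that every answer has the form "letter, dot, text". The name `letter_to_code` is a hypothetical helper, not part of `re` or pandas, and the sample strings are taken from the question.

```python
import re

# Matches one lowercase letter followed by a dot at the start of the answer.
_LEADING_LETTER = re.compile(r"^\s*([a-z])\.")

def letter_to_code(answer):
    """Return 1 for 'a. ...', 2 for 'b. ...', and so on; None if nothing matches."""
    match = _LEADING_LETTER.match(answer)
    if match is None:
        return None
    # Offset from 'a' is zero-based, so add 1 for the numbering in the question.
    return ord(match.group(1)) - ord("a") + 1

codes = [letter_to_code(s) for s in ["h. No Answer", "f. 0 to 0k", "c.  to k"]]
print(codes)  # [8, 6, 3]
```

Returning None for unmatched input, rather than raising, is a design choice of this sketch; it leaves the decision about unexpected answers to the caller.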

Answer 2

No regex is needed here. Just create a lookup table and apply it to the DataFrame's column, keyed on the first character of each value in that column, e.g.:

df['income'] = df['income'].apply(lambda L, rep={c:i for i,c in enumerate('abcdefh', 1)}: rep[L[0]])

This gives you: [7, 6, 3, 2, 2, 3, 7]
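To see why "h" comes out as 7 rather than 8, it may help to look at the lookup table on its own. The sketch below rebuilds the same dict outside the lambda, using three sample strings from the question; nothing here needs pandas.

```python
# The same lookup table that the lambda holds in its default argument.
rep = {c: i for i, c in enumerate('abcdefh', 1)}
print(rep)  # {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'h': 7}

# Look up each answer by its first character.
answers = ["h. No Answer", "f. 0 to 0k", "c.  to k"]
print([rep[s[0]] for s in answers])  # [7, 6, 3]
```

Because 'g' is not in the string 'abcdefh', "h" is numbered 7, and any answer starting with "g" would raise a KeyError.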

To apply this to all columns, loop over the columns:

for column in df.columns:
    df[column] = df[column].apply(lambda L, rep={c:i for i,c in enumerate('abcdefh', 1)}: rep[L[0]])
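The loop above assumes every column holds lettered string answers; a numeric column would fail on `L[0]`. A hedged variant, assuming you can list the columns that hold such answers, uses the pandas string accessor and `Series.map` instead of `apply`. The helper name `encode_answers`, the full-alphabet table, and the `Car_Count` sample values are assumptions made for illustration.

```python
import string

import pandas as pd

# Full-alphabet table: 'a' -> 1, 'b' -> 2, ..., 'h' -> 8 (unlike 'abcdefh' above).
LETTER_CODES = {c: i for i, c in enumerate(string.ascii_lowercase, 1)}

def encode_answers(df, columns):
    """Return a copy of df with the named columns mapped to letter codes."""
    out = df.copy()
    for column in columns:
        # First non-space character of each answer; missing values stay missing.
        first = out[column].str.strip().str[0]
        # Values whose first character is not in the table become NaN.
        out[column] = first.map(LETTER_CODES)
    return out

df = pd.DataFrame({
    "income": ["h. No Answer", "f. 0 to 0k", "c.  to k"],
    "Car_Count": [1, 2, 0],  # made-up numeric column, left untouched
})
encoded = encode_answers(df, ["income"])
print(encoded["income"].tolist())  # [8, 6, 3]
```

If a column contains values that do not match the table, the result holds NaN for them and the column becomes float rather than int.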

python

regex

pandas