Pandas: how to count string values that are greater than N after converting them with to_numeric?
I have a monthly dataframe (df) whose values are already min-max ranges, as shown below:
Wind Jan Feb Nov Dec calib
West 0.1-25.5 2.8-65.3 1.3-61.3 0.9-35.3 50
North 0.2-28.3 3.1-66.4 1.0-67.7 1.9-40.1 60
South 0.3-29.5 2.5-49.4 1.9-63.4 0.3-33.0 60
East 20.5 1.1-41.1 0.9-40.3 nan 50
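I want to know how many times the monthly maximum wind speed is below the calib value. So I am trying to create a "speed below calib" (sbc) column, as follows.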
month_col = ['Jan', 'Feb', 'Nov', 'Dec']
df['sbc'] = (pd.to_numeric(df[month_col].str.extract(r"(?<=-)(\d+\.\d+)")) < df["calib"]).sum(axis=1)
The above code does not work; I get the error AttributeError: 'DataFrame' object has no attribute 'str'. How can I fix this?
.str only works on a Series, not on a DataFrame. You can stack/unstack:
month_numeric = (df[month_col].stack()
.str.extract(r"(?<=-)(\d+\.\d+)", expand=False)
.astype(float).unstack()
)
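The snippet above only builds the numeric table of monthly maxima; the comparison against calib still has to be done. A minimal sketch of the full computation follows. The DataFrame construction is reconstructed from the table in the question, and the `lt(..., axis=0)` step is an added completion rather than part of the original answer:

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the table in the question.
df = pd.DataFrame({
    "Wind": ["West", "North", "South", "East"],
    "Jan": ["0.1-25.5", "0.2-28.3", "0.3-29.5", "20.5"],
    "Feb": ["2.8-65.3", "3.1-66.4", "2.5-49.4", "1.1-41.1"],
    "Nov": ["1.3-61.3", "1.0-67.7", "1.9-63.4", "0.9-40.3"],
    "Dec": ["0.9-35.3", "1.9-40.1", "0.3-33.0", np.nan],
    "calib": [50, 60, 60, 50],
})
month_col = ["Jan", "Feb", "Nov", "Dec"]

# stack() turns the month columns into one long Series, so .str is available;
# unstack() restores the original row-by-month layout afterwards.
month_numeric = (
    df[month_col].stack()
    .str.extract(r"(?<=-)(\d+\.\d+)", expand=False)  # number after the hyphen
    .astype(float)
    .unstack()
)

# Compare every month against that row's calib (axis=0 aligns on the row index),
# then count the True values per row. NaN compares as False, so it is not counted.
df["sbc"] = month_numeric.lt(df["calib"], axis=0).sum(axis=1)
print(df[["Wind", "calib", "sbc"]])
```

Because the regex requires a hyphen before the number, the lone value 20.5 (East, Jan) becomes NaN and is not counted.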
You can use melt:
sbc = (df.melt(['Wind', 'calib'], var_name='month')
.assign(value=lambda x: x['value'].str.split('-').str[1].astype(float))
.query('value < calib').value_counts('Wind'))
df['sbc'] = df['Wind'].map(sbc)
Output:
>>> df
Wind Jan Feb Nov Dec calib sbc
0 West 0.1-25.5 2.8-65.3 1.3-61.3 0.9-35.3 50 2
1 North 0.2-28.3 3.1-66.4 1.0-67.7 1.9-40.1 60 2
2 South 0.3-29.5 2.5-49.4 1.9-63.4 0.3-33.0 60 3
3 East 20.5 1.1-41.1 0.9-40.3 NaN 50 2
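Note that East gets 2 here: its Jan entry, 20.5, contains no hyphen, so `.str.split('-').str[1]` yields NaN and the value is not counted. If a lone value should be treated as the maximum (an assumption about the intended semantics, which the question does not state), one possible variation is to take the last piece of the split instead of the second one:

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the table in the question.
df = pd.DataFrame({
    "Wind": ["West", "North", "South", "East"],
    "Jan": ["0.1-25.5", "0.2-28.3", "0.3-29.5", "20.5"],
    "Feb": ["2.8-65.3", "3.1-66.4", "2.5-49.4", "1.1-41.1"],
    "Nov": ["1.3-61.3", "1.0-67.7", "1.9-63.4", "0.9-40.3"],
    "Dec": ["0.9-35.3", "1.9-40.1", "0.3-33.0", np.nan],
    "calib": [50, 60, 60, 50],
})

sbc = (
    df.melt(["Wind", "calib"], var_name="month")
    # .str[-1] takes the last piece, so "20.5" stays 20.5 and "0.1-25.5" gives 25.5;
    # missing values remain NaN.
    .assign(value=lambda x: x["value"].str.split("-").str[-1].astype(float))
    .query("value < calib")["Wind"]
    .value_counts()  # counts indexed by region name
)
df["sbc"] = df["Wind"].map(sbc)
print(df[["Wind", "calib", "sbc"]])
```

With this variation East is counted three times (Jan, Feb, Nov); the other regions are unchanged.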
Step by step:
- Reshape your dataframe
>>> out = df.melt(['Wind', 'calib'], var_name='month')
Wind calib month value
0 West 50 Jan 0.1-25.5
1 North 60 Jan 0.2-28.3
2 South 60 Jan 0.3-29.5
3 East 50 Jan 20.5
4 West 50 Feb 2.8-65.3
5 North 60 Feb 3.1-66.4
6 South 60 Feb 2.5-49.4
7 East 50 Feb 1.1-41.1
8 West 50 Nov 1.3-61.3
9 North 60 Nov 1.0-67.7
10 South 60 Nov 1.9-63.4
11 East 50 Nov 0.9-40.3
12 West 50 Dec 0.9-35.3
13 North 60 Dec 1.9-40.1
14 South 60 Dec 0.3-33.0
15 East 50 Dec NaN
- Extract the maximum wind speed from the range
>>> out = out.assign(value=lambda x: x['value'].str.split('-').str[1].astype(float))
Wind calib month value
0 West 50 Jan 25.5
1 North 60 Jan 28.3
2 South 60 Jan 29.5
3 East 50 Jan NaN
4 West 50 Feb 65.3
5 North 60 Feb 66.4
6 South 60 Feb 49.4
7 East 50 Feb 41.1
8 West 50 Nov 61.3
9 North 60 Nov 67.7
10 South 60 Nov 63.4
11 East 50 Nov 40.3
12 West 50 Dec 35.3
13 North 60 Dec 40.1
14 South 60 Dec 33.0
15 East 50 Dec NaN
- Filter the rows and count
>>> out = out.query('value < calib').value_counts('Wind')
Wind
South 3
East 2
North 2
West 2
dtype: int64
Finally, map (merge) this series onto your original dataframe.
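One caveat with the mapping step: `value_counts` only lists regions that have at least one matching row, so a region whose maxima never fall below calib would get NaN from `map` rather than 0. A hedged sketch of the final step that guards against this (the `fillna(0).astype(int)` part is an addition, not part of the original answer):

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the table in the question.
df = pd.DataFrame({
    "Wind": ["West", "North", "South", "East"],
    "Jan": ["0.1-25.5", "0.2-28.3", "0.3-29.5", "20.5"],
    "Feb": ["2.8-65.3", "3.1-66.4", "2.5-49.4", "1.1-41.1"],
    "Nov": ["1.3-61.3", "1.0-67.7", "1.9-63.4", "0.9-40.3"],
    "Dec": ["0.9-35.3", "1.9-40.1", "0.3-33.0", np.nan],
    "calib": [50, 60, 60, 50],
})

# Same three steps as above: reshape, extract the maximum, filter and count.
out = df.melt(["Wind", "calib"], var_name="month")
out = out.assign(value=lambda x: x["value"].str.split("-").str[1].astype(float))
counts = out.query("value < calib")["Wind"].value_counts()

# Map the counts back by region name; regions absent from counts become 0.
df["sbc"] = df["Wind"].map(counts).fillna(0).astype(int)
print(df[["Wind", "calib", "sbc"]])
```

With the sample data every region has at least one match, so the result is the same as in the output shown earlier; the guard only matters when some region has none.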