如何将显示尺寸映射到 pandas 数据框列中的标准尺寸?
How to map display sizes to standard sizes in a pandas dataframe column?
我理解标题可能不是很清楚,但请听我说完。
我有一个 pandas 数据框列,其中包含约 850 个独特的显示尺寸,例如
1 320x480
2 480x320
3 382x215
4 676x320
5 694x320
6 1080x2123
7 2094x1080
8 1080x2020
我想匹配它们/将它们转换为最接近的标准显示尺寸(use-case 数据集中提供了大约 20 个)。
320x350
320x480
480x320
640x360
800x600
1024x768
1280x720
1280x800
1280x1024
1360x768
1366x768
1440x900
1536x864
1600x900
我尝试将高度和宽度分成单独的列并将它们四舍五入,但它仍然创建了很多 non-standard 显示尺寸(对于我的用例)。
我怎样才能做到这一点?
想法是通过 Series.str.split
, then use cross join by DataFrame.merge
, get differences and sum them , get rows with minimal diffs by DataFrameGroupBy.idxmin
and DataFrame.loc
, last join together with DataFrame.pop
将 columns/Series 转换为 DataFrame 以供使用和删除列:
df11 = df1['col'].str.split('x', expand=True).astype(int)
df22 = df2['col'].str.split('x', expand=True).astype(int)
df = df11.assign(a=1).merge(df22.assign(a=1), on='a')
df['diff'] = df['0_x'].sub(df['0_y']).abs() + df['1_x'].sub(df['1_y']).abs()
df = df.loc[df.groupby(['0_x','1_x'])['diff'].idxmin()]
df['a'] = df.pop('0_x').astype(str).str.cat(df.pop('0_y').astype(str), 'x')
df['b'] = df.pop('1_x').astype(str).str.cat(df.pop('1_y').astype(str), 'x')
print (df)
a diff b
1 320x320 0 480x480
28 382x320 197 215x350
16 480x480 0 320x320
45 676x640 76 320x360
59 694x640 94 320x360
106 1080x1280 1196 2020x1024
78 1080x1280 1299 2123x1024
97 2094x1600 674 1080x900
与欧几里得距离相似,样本数据相同输出:
df11 = df1['col'].str.split('x', expand=True).astype(int)
df22 = df2['col'].str.split('x', expand=True).astype(int)
df = df11.assign(a=1).merge(df22.assign(a=1), on='a')
df['diff'] = np.sqrt(df['0_x'].sub(df['0_y']) ** 2 + df['1_x'].sub(df['1_y']) ** 2)
df = df.loc[df.groupby(['0_x','1_x'])['diff'].idxmin()]
df['a'] = df.pop('0_x').astype(str).str.cat(df.pop('0_y').astype(str), 'x')
df['b'] = df.pop('1_x').astype(str).str.cat(df.pop('1_y').astype(str), 'x')
print (df)
a diff b
1 320x320 0.000000 480x480
30 382x480 143.627992 215x320
16 480x480 0.000000 320x320
45 676x640 53.814496 320x360
59 694x640 67.201190 320x360
106 1080x1280 1015.881883 2020x1024
78 1080x1280 1117.050133 2123x1024
97 2094x1600 525.771814 1080x900
另一个 numpy 解决方案:
df11 = df1['col'].str.split('x', expand=True).astype(int)
df22 = df2['col'].str.split('x', expand=True).astype(int)
a1 = np.sqrt(np.square(df11[0].to_numpy()[:, None] - df22[0].to_numpy()) +
np.square(df11[1].to_numpy()[:, None] - df22[1].to_numpy()))
df1['b1'] = df2['col'].to_numpy()[np.argmin(a1, axis=1)]
a2 = (np.abs(df11[0].to_numpy()[:, None] - df22[0].to_numpy()) +
np.abs(df11[1].to_numpy()[:, None] - df22[1].to_numpy()))
df1['b2'] = df2['col'].to_numpy()[np.argmin(a2, axis=1)]
print (df1)
col b1 b2
1 320x480 320x480 320x480
2 480x320 480x320 480x320
3 382x215 480x320 320x350
4 676x320 640x360 640x360
5 694x320 640x360 640x360
6 1080x2123 1280x1024 1280x1024
7 2094x1080 1600x900 1600x900
8 1080x2020 1280x1024 1280x1024
我理解标题可能不是很清楚,但请听我说完。
我有一个 pandas 数据框列,其中包含约 850 个独特的显示尺寸,例如
1 320x480
2 480x320
3 382x215
4 676x320
5 694x320
6 1080x2123
7 2094x1080
8 1080x2020
我想匹配它们/将它们转换为最接近的标准显示尺寸(use-case 数据集中提供了大约 20 个)。
320x350
320x480
480x320
640x360
800x600
1024x768
1280x720
1280x800
1280x1024
1360x768
1366x768
1440x900
1536x864
1600x900
我尝试将高度和宽度分成单独的列并将它们四舍五入,但它仍然创建了很多 non-standard 显示尺寸(对于我的用例)。
我怎样才能做到这一点?
想法是通过 Series.str.split
, then use cross join by DataFrame.merge
, get differences and sum them , get rows with minimal diffs by DataFrameGroupBy.idxmin
and DataFrame.loc
, last join together with DataFrame.pop
将 columns/Series 转换为 DataFrame 以供使用和删除列:
df11 = df1['col'].str.split('x', expand=True).astype(int)
df22 = df2['col'].str.split('x', expand=True).astype(int)
df = df11.assign(a=1).merge(df22.assign(a=1), on='a')
df['diff'] = df['0_x'].sub(df['0_y']).abs() + df['1_x'].sub(df['1_y']).abs()
df = df.loc[df.groupby(['0_x','1_x'])['diff'].idxmin()]
df['a'] = df.pop('0_x').astype(str).str.cat(df.pop('0_y').astype(str), 'x')
df['b'] = df.pop('1_x').astype(str).str.cat(df.pop('1_y').astype(str), 'x')
print (df)
a diff b
1 320x320 0 480x480
28 382x320 197 215x350
16 480x480 0 320x320
45 676x640 76 320x360
59 694x640 94 320x360
106 1080x1280 1196 2020x1024
78 1080x1280 1299 2123x1024
97 2094x1600 674 1080x900
与欧几里得距离相似,样本数据相同输出:
df11 = df1['col'].str.split('x', expand=True).astype(int)
df22 = df2['col'].str.split('x', expand=True).astype(int)
df = df11.assign(a=1).merge(df22.assign(a=1), on='a')
df['diff'] = np.sqrt(df['0_x'].sub(df['0_y']) ** 2 + df['1_x'].sub(df['1_y']) ** 2)
df = df.loc[df.groupby(['0_x','1_x'])['diff'].idxmin()]
df['a'] = df.pop('0_x').astype(str).str.cat(df.pop('0_y').astype(str), 'x')
df['b'] = df.pop('1_x').astype(str).str.cat(df.pop('1_y').astype(str), 'x')
print (df)
a diff b
1 320x320 0.000000 480x480
30 382x480 143.627992 215x320
16 480x480 0.000000 320x320
45 676x640 53.814496 320x360
59 694x640 67.201190 320x360
106 1080x1280 1015.881883 2020x1024
78 1080x1280 1117.050133 2123x1024
97 2094x1600 525.771814 1080x900
另一个 numpy 解决方案:
df11 = df1['col'].str.split('x', expand=True).astype(int)
df22 = df2['col'].str.split('x', expand=True).astype(int)
a1 = np.sqrt(np.square(df11[0].to_numpy()[:, None] - df22[0].to_numpy()) +
np.square(df11[1].to_numpy()[:, None] - df22[1].to_numpy()))
df1['b1'] = df2['col'].to_numpy()[np.argmin(a1, axis=1)]
a2 = (np.abs(df11[0].to_numpy()[:, None] - df22[0].to_numpy()) +
np.abs(df11[1].to_numpy()[:, None] - df22[1].to_numpy()))
df1['b2'] = df2['col'].to_numpy()[np.argmin(a2, axis=1)]
print (df1)
col b1 b2
1 320x480 320x480 320x480
2 480x320 480x320 480x320
3 382x215 480x320 320x350
4 676x320 640x360 640x360
5 694x320 640x360 640x360
6 1080x2123 1280x1024 1280x1024
7 2094x1080 1600x900 1600x900
8 1080x2020 1280x1024 1280x1024