Python: how to use string values for a custom sort?

I have a DataFrame like this:

   time_posted
0   5 days ago
1  an hour ago
2    a day ago
3  6 hours ago
4  4 hours ago

I tried df.sort_values(by='time_posted', ascending=True) and got this result:

   time_posted
4  4 hours ago
0   5 days ago
3  6 hours ago
2    a day ago
1  an hour ago

But I want to sort by how long ago each row was posted (most recent first), so that my DataFrame looks like this:

   time_posted
1  an hour ago
4  4 hours ago
3  6 hours ago
2    a day ago
0   5 days ago

One possible answer is the following.

Set up the sample data

import pandas as pd

#your dataframe
df = pd.DataFrame(dict(time_posted=['5 days ago', 'an hour ago', 'a day ago', '6 hours ago', '4 hours ago']))

Conversion function

You have to split the string and interpret the pieces separately (here x[0] is the value and x[1] is the unit).

def to_hours(s):
    x = s.split(' ')

    # 'a' / 'an' stand for a single unit
    if x[0].lower() in ['a', 'an']:
        a = 1
    else:
        a = float(x[0])

    x1 = x[1].lower()
    b = 1  # 1 hour
    if x1.startswith('day'):
        b = b * 24  # 1 day = 24 hours

    return a * b

Apply it

df['hours'] = df.time_posted.apply(to_hours)  # apply the hours conversion
df = df.sort_values('hours', ascending=True)[['time_posted']]  # sort, then drop the helper column

print(df)

Output:

   time_posted
1  an hour ago
4  4 hours ago
3  6 hours ago
2    a day ago
0   5 days ago
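
The function above only distinguishes hours and days. If the real column also contains other units, one option is a lookup table instead of an if-chain. This is only a sketch: the names MINUTES_PER_UNIT and to_minutes are made up for illustration, and the 'minute' and 'week' entries are assumptions about what the data might contain, since the question only shows hours and days.

```python
import pandas as pd

# Hypothetical lookup table (not from the question): minutes per unit.
MINUTES_PER_UNIT = {
    'minute': 1,
    'hour': 60,
    'day': 60 * 24,
    'week': 60 * 24 * 7,
}

def to_minutes(s):
    """Convert strings such as '5 days ago' or 'an hour ago' to minutes."""
    value, unit = s.lower().split()[:2]
    amount = 1.0 if value in ('a', 'an') else float(value)
    unit = unit.rstrip('s')  # 'hours' -> 'hour', 'days' -> 'day'
    return amount * MINUTES_PER_UNIT[unit]  # unknown units raise KeyError

df = pd.DataFrame(dict(time_posted=['5 days ago', 'an hour ago', 'a day ago',
                                    '6 hours ago', '4 hours ago']))

# Sort the converted values and reuse their index to reorder the frame,
# so no helper column has to be added and dropped again.
order = df['time_posted'].map(to_minutes).sort_values().index
result = df.loc[order]
print(result)
```

Raising KeyError on an unknown unit is deliberate here: a silent default of one hour would hide rows that were sorted wrongly.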

If you remove "ago" and replace "a"/"an" with 1, you can pass the values to pandas.to_timedelta:

(pd.to_timedelta(df['time_posted']
.str.replace(r'\ban?\b', '1', regex=True)
.str.replace(' ago', '', regex=False))
)

Output:

0   5 days 00:00:00
1   0 days 01:00:00
2   1 days 00:00:00
3   0 days 06:00:00
4   0 days 04:00:00
Name: time_posted, dtype: timedelta64[ns]

This lets you obtain the sort order:

idx = (pd.to_timedelta(df['time_posted']
 .str.replace(r'\ban?\b', '1', regex=True)
 .str.replace(' ago', '', regex=False))
 .sort_values()
 .index
)

df.loc[idx]

Output:

   time_posted
1  an hour ago
4  4 hours ago
3  6 hours ago
2    a day ago
0   5 days ago
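
As a possible variation, the same conversion can be passed to sort_values through its key argument, which avoids building the index by hand. This is a sketch, assuming pandas 1.1 or later (where key was added); the helper name as_timedelta is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(dict(time_posted=['5 days ago', 'an hour ago', 'a day ago',
                                    '6 hours ago', '4 hours ago']))

def as_timedelta(col):
    # Same cleanup as above: 'a'/'an' -> '1', then drop the trailing ' ago'.
    cleaned = (col.str.replace(r'\ban?\b', '1', regex=True)
                  .str.replace(' ago', '', regex=False))
    return pd.to_timedelta(cleaned)

# key receives the whole column as a Series and must return a Series
# of the same length; rows are ordered by the returned values.
sorted_df = df.sort_values('time_posted', key=as_timedelta)
print(sorted_df)
```

The original strings stay in the column; only the ordering is derived from the timedelta values.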