Sorting a pandas DataFrame with German umlauts

Question

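I have a DataFrame that I want to sort by one column using sort_values.

The problem is that some of the words start with a German umlaut.

For example, Österreich and Zürich.

These get sorted as Zürich, Österreich. They should be sorted as Österreich, Zürich.

Ö should come between N and O.

I have found out how to do this for lists in Python using locale and strxfrm. Can I do this directly in the pandas DataFrame?

Edit: Thank you. Stef's example worked quite well, but my real-life DataFrame contains numbers, and his version did not work for it, so I used alexey's idea. I did the following; maybe you can shorten it: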


import pandas as pd

df = pd.DataFrame({'location': ['Österreich', 'Zürich', 'Bern', 254345], 'code': ['ö', 'z', 'b', 'v']})

# keep the original index as a column for joining later
df = df.reset_index(drop=False)

# convert int to str
df['location'] = df['location'].astype(str)

# sort by location, taking umlauts into account
df_sort_index = df['location'].str.normalize('NFD').sort_values(ascending=True).reset_index(drop=False)

# drop location so we don't have it in both tables
df = df.drop('location', axis=1)

# inner join on index
new_df = pd.merge(df_sort_index, df, how='inner', on='index')

# drop the helper index column
new_df = new_df.drop('index', axis=1)

Answer 1

You can use the Unicode NFD normalization form:

>>> names = pd.Series(['Österreich', 'Ost', 'S', 'N'])
>>> names.str.normalize('NFD').sort_values()
3              N
1            Ost
0    Österreich
2              S
dtype: object

# use the sorted index to rearrange a DataFrame that shares the index of names
>>> df.loc[names.str.normalize('NFD').sort_values().index]

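This is not quite what you asked for, but for a truly correct ordering you need language knowledge (such as the locale you mentioned).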
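As a minimal sketch of the same NFD idea applied to a whole DataFrame: assuming pandas 1.1 or later, sort_values accepts a key argument that receives the column as a Series and returns the values to sort by. The astype(str) call is an addition here so that a mixed-type column (as in the question's edit) can be compared; the DataFrame itself is the one from the question.

```python
import pandas as pd

df = pd.DataFrame({'location': ['Österreich', 'Zürich', 'Bern', 254345],
                   'code': ['ö', 'z', 'b', 'v']})

# key gets the column as a Series and must return a Series of sort keys.
# Only the keys are converted and normalized; the stored values
# (including the integer) are left untouched.
sorted_df = df.sort_values(
    'location',
    key=lambda s: s.astype(str).str.normalize('NFD'),
).reset_index(drop=True)

print(sorted_df)
```

Because only the sort keys are normalized, no helper column or join is needed, and the other columns stay aligned with their rows.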

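NFD represents an umlaut with two code points: the base letter followed by a combining diaeresis. For example, Ö becomes O\xcc\x88 in UTF-8 (you can see the difference with names.str.normalize('NFD').str.encode('utf-8')).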
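The decomposition can be checked with the standard library alone. The following sketch uses unicodedata (not pandas) and a small list of place names chosen for illustration; it shows why the normalized strings sort next to their base letter when compared by code point.

```python
import unicodedata

word = '\u00d6sterreich'  # 'Österreich' written with a precomposed Ö
decomposed = unicodedata.normalize('NFD', word)

# NFD splits Ö into 'O' plus U+0308 (combining diaeresis),
# so the decomposed string is one code point longer.
print(len(word), len(decomposed))
print(decomposed.encode('utf-8')[:3])

places = ['Zürich', 'Österreich', 'Ost', 'N', 'S']

# Sorting on the NFD form compares the base letter 'O' first,
# so Österreich lands right after Ost instead of after Zürich.
by_nfd = sorted(places, key=lambda s: unicodedata.normalize('NFD', s))
print(by_nfd)
```

Note that this is still a code-point comparison: every word starting with a plain O sorts before every word starting with Ö, which is close to, but not the same as, real German collation.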

Answer 2

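You can use sorted with a locale-aware comparison function (in my example, setlocale returned 'German_Germany.1252') to sort the column values. The tricky part is sorting all the other columns accordingly. A somewhat hacky solution is to temporarily set the index to the column to be sorted, reindex on the properly sorted index values, and then reset the index.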

import functools
import locale

import pandas as pd

# use the user's default locale for collation
locale.setlocale(locale.LC_ALL, '')
df = pd.DataFrame({'location': ['Österreich','Zürich','Bern'],'code':['ö','z','b']})

df = df.set_index('location')
df = df.reindex(sorted(df.index, key=functools.cmp_to_key(locale.strcoll))).reset_index()

Output of print(df):

     location code
0        Bern    b
1  Österreich    ö
2      Zürich    z

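Update for mixed-type columns: if the column to be sorted has mixed types (e.g. strings and integers), you have two options:

a) Convert the column to strings and then sort as described above (the resulting column will contain only strings):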

locale.setlocale(locale.LC_ALL, '')
df = pd.DataFrame({'location': ['Österreich','Zürich','Bern', 254345],'code':['ö','z','b','v']})
df.location=df.location.astype(str)
df = df.set_index('location')
df = df.reindex(sorted(df.index, key=functools.cmp_to_key(locale.strcoll))).reset_index()
print(df.location.values)
# ['254345' 'Bern' 'Österreich' 'Zürich']

b) Sort on a copy of the column converted to strings (the resulting column will keep its mixed types):

locale.setlocale(locale.LC_ALL, '')
df = pd.DataFrame({'location': ['Österreich','Zürich','Bern', 254345],'code':['ö','z','b','v']})
df = df.set_index(df.location.astype(str))
df = df.reindex(sorted(df.index, key=functools.cmp_to_key(locale.strcoll))).reset_index(drop=True)
print(df.location.values)
# [254345 'Bern' 'Österreich' 'Zürich']
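A possible shorter variant of option b), sketched under a few assumptions: pandas 1.1 or later (for the key argument of sort_values), and a German locale installed under the name 'de_DE.UTF-8'. That name is a guess that varies by platform (on Windows it may look like 'German_Germany.1252'), and the helper sort_by_locale is hypothetical, not part of pandas. If the locale is not available, the sketch falls back to whatever collation is currently active, so the resulting order depends on the system.

```python
import locale

import pandas as pd


def sort_by_locale(df, column, locale_name='de_DE.UTF-8'):
    """Sort df by column using the collation rules of locale_name.

    locale_name is an assumption; if that locale is not installed,
    the currently active collation is used instead.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        try:
            locale.setlocale(locale.LC_COLLATE, locale_name)
        except locale.Error:
            pass  # locale not installed: keep the active collation
        # strxfrm turns each string into a key whose plain comparison
        # matches locale-aware comparison (like strcoll).
        return df.sort_values(
            column,
            key=lambda s: s.astype(str).map(locale.strxfrm),
        ).reset_index(drop=True)
    finally:
        # restore the collation setting that was active before the call
        locale.setlocale(locale.LC_COLLATE, previous)


df = pd.DataFrame({'location': ['Österreich', 'Zürich', 'Bern'],
                   'code': ['ö', 'z', 'b']})
sorted_df = sort_by_locale(df, 'location')
print(sorted_df)
```

Unlike the set_index/reindex approach, this does not require the values in the sorted column to be unique, and the column keeps its original types because only the sort keys are converted to strings.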
