numpy sort acting weirdly when sorting on a pandas DataFrame

Question

When I run data[genres].sum(), I get the following result:

Action        1891
Adult            9
Adventure     1313
Animation      314
Biography      394
Comedy        3922
Crime         1867
Drama         5697
Family         754
Fantasy        916
Film-Noir       40
History        358
Horror        1215
Music          371
Musical        260
Mystery       1009
News             1
Reality-TV       1
Romance       2441
Sci-Fi         897
Sport          288
Thriller      2832
War            512
Western        235
dtype: int64

But when I try to sort the sums using np.sort:

genre_count = np.sort(data[genres].sum())[::-1]
pd.DataFrame({'Genre Count': genre_count})

I get the following result:

Out[19]:
    Genre Count
0   5697
1   3922
2   2832
3   2441
4   1891
5   1867
6   1313
7   1215
8   1009
9   916
10  897
11  754
12  512
13  394
14  371
15  358
16  314
17  288
18  260
19  235
20  40
21  9
22  1
23  1

The expected result should look like this:

            Genre Count
Drama              5697
Comedy             3922
Thriller           2832
Romance            2441
Action             1891
Crime              1867
Adventure          1313
Horror             1215
Mystery            1009
Fantasy             916
Sci-Fi              897
Family              754
War                 512
Biography           394
Music               371
History             358
Animation           314
Sport               288
Musical             260
Western             235
Film-Noir            40
Adult                 9
News                  1
Reality-TV            1

numpy seems to be ignoring the genre column.

Can someone help me understand where I'm going wrong?

Answer 1

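data[genres].sum() returns a Series. The genre "column" is not actually a column; it is the index.

np.sort looks only at the values of a DataFrame or Series, not at the index, and it returns a new NumPy array holding the sorted values of data[genres].sum(). The index information is lost.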
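
To make that concrete, here is a minimal sketch. The three-entry Series `sums` is a hypothetical stand-in for data[genres].sum(), using a few of the counts from the question:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data[genres].sum(): values indexed by genre name
sums = pd.Series({"Action": 1891, "Adult": 9, "Drama": 5697})

# np.sort works on the underlying values and hands back a plain ndarray
sorted_values = np.sort(sums)[::-1]
print(type(sorted_values))
print(sorted_values)

# The genre names live in the index, which the ndarray does not carry
print(list(sums.index))
```

Building a DataFrame from `sorted_values` therefore gets a default 0, 1, 2, ... index, which matches the output shown in the question.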

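To sort data[genres].sum() and keep the index information, do the following:

genre_count = data[genres].sum()
# Series.sort() (in-place) no longer exists in current pandas; sort_values returns a new Series
genre_count = genre_count.sort_values(ascending=False)  # high to low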

You can then turn the sorted genre_count Series back into a DataFrame if needed:

pd.DataFrame({'Genre Count': genre_count})
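
Putting the steps together, a small end-to-end sketch might look like the following. The tiny `data` frame and `genres` list are made up for illustration (one 0/1 indicator column per genre is an assumption about the asker's data), so only the shape of the result carries over:

```python
import pandas as pd

# Hypothetical miniature of the question's data: one 0/1 column per genre
data = pd.DataFrame({
    "Action": [1, 0, 1],
    "Comedy": [0, 1, 0],
    "Drama": [1, 1, 1],
})
genres = ["Action", "Comedy", "Drama"]

# Sum each genre column, then sort the Series itself so the labels travel along
genre_count = data[genres].sum().sort_values(ascending=False)

# The Series index (genre names) becomes the DataFrame index
result = pd.DataFrame({"Genre Count": genre_count})
print(result)
```

Since the sort happens on the Series rather than on a bare array, each count stays attached to its genre label.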

Answer 2

data[genres].sum() returns a Series.

If you are using pandas 0.20 or later, the command changes slightly:

    genre_count = data[genres].sum()
    genre_count = genre_count.sort_values(ascending=False)

You can find the reference in the pandas documentation.
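
One detail that is easy to miss: sort_values returns a new Series and leaves the original untouched, so the result has to be assigned to a name. A quick sketch, with a hypothetical `counts` Series:

```python
import pandas as pd

# Hypothetical counts, deliberately out of order
counts = pd.Series({"War": 512, "Drama": 5697, "News": 1})

# sort_values does not modify counts; it returns a sorted copy
ordered = counts.sort_values(ascending=False)

print(list(counts.index))   # original order is unchanged
print(list(ordered.index))  # sorted high to low, labels kept with their values
```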
