How to lump together factor levels of a string type column into another in pydatatable?
I have a datatable Frame:
import datatable as dt
DT_X = dt.Frame({'variety': ['Caturra',
'Bourbon',
'Typica',
'Catuai',
'Hawaiian Kona',
'Yellow Bourbon',
'Mundo Novo',
'Catimor',
'SL14',
'SL28',
'Pacas',
'Gesha',
'Pacamara',
'SL34',
'Arusha',
'Peaberry',
'Mandheling',
'Sumatra',
'Blue Mountain',
'Ethiopian Yirgacheffe',
'Java',
'Ruiru 11',
'Ethiopian Heirlooms',
'Marigojipe',
'Moka Peaberry',
'Pache Comun',
'Sulawesi',
'Sumatra Lintong'],
'count': [256,
226,
211,
74,
44,
35,
33,
20,
17,
15,
13,
12,
8,
8,
6,
5,
3,
3,
2,
2,
2,
2,
1,
1,
1,
1,
1,
1]})
It can be viewed as:
Out[8]:
| variety count
-- + --------------------- -----
0 | Caturra 256
1 | Bourbon 226
2 | Typica 211
3 | Catuai 74
4 | Hawaiian Kona 44
5 | Yellow Bourbon 35
6 | Mundo Novo 33
7 | Catimor 20
8 | SL14 17
9 | SL28 15
10 | Pacas 13
11 | Gesha 12
12 | Pacamara 8
13 | SL34 8
14 | Arusha 6
15 | Peaberry 5
16 | Mandheling 3
17 | Sumatra 3
18 | Blue Mountain 2
19 | Ethiopian Yirgacheffe 2
20 | Java 2
21 | Ruiru 11 2
22 | Ethiopian Heirlooms 1
23 | Marigojipe 1
24 | Moka Peaberry 1
25 | Pache Comun 1
26 | Sulawesi 1
27 | Sumatra Lintong 1
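I would now like to keep the first 4 levels ('Caturra', 'Bourbon', 'Typica', 'Catuai') in the variety column and lump all remaining levels together as 'Others'.
The expected output is: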
Out[9]:
| variety count
-- + ------- -----
0 | Caturra 256
1 | Bourbon 226
2 | Typica 211
3 | Catuai 74
4 | Others 236
[5 rows x 2 columns]
Case 2:
I have a datatable Frame:
DT_X_1 = dt.Frame({'variety': ['Bourbon',
'Catimor',
'Ethiopian Yirgacheffe',
'Caturra',
'Bourbon',
'SL14',
'Caturra',
'Sumatra',
'Bourbon',
'Caturra',
'SL34',
'Hawaiian Kona',
'Caturra',
'Yellow Bourbon',
'Yellow Bourbon',
'Bourbon',
'SL28',
'Bourbon',
'Caturra',
'SL28',
'Bourbon',
'SL14',
'Caturra',
'Gesha',
'Bourbon',
'Catuai',
'Caturra',
'Bourbon',
'Bourbon',
'Hawaiian Kona']})
It can be viewed as:
Out[7]:
| variety
-- + ---------------------
0 | Bourbon
1 | Catimor
2 | Ethiopian Yirgacheffe
3 | Caturra
4 | Bourbon
5 | SL14
6 | Caturra
7 | Sumatra
8 | Bourbon
9 | Caturra
10 | SL34
11 | Hawaiian Kona
12 | Caturra
13 | Yellow Bourbon
14 | Yellow Bourbon
15 | Bourbon
16 | SL28
17 | Bourbon
18 | Caturra
19 | SL28
20 | Bourbon
21 | SL14
22 | Caturra
23 | Gesha
24 | Bourbon
25 | Catuai
26 | Caturra
27 | Bourbon
28 | Bourbon
29 | Hawaiian Kona
[30 rows x 1 column]
The variety column has 12 distinct values:
Out[8]:
| variety count
-- + --------------------- -----
0 | Bourbon 9
1 | Catimor 1
2 | Catuai 1
3 | Caturra 7
4 | Ethiopian Yirgacheffe 1
5 | Gesha 1
6 | Hawaiian Kona 2
7 | SL14 2
8 | SL28 2
9 | SL34 1
10 | Sumatra 1
11 | Yellow Bourbon 2
[12 rows x 2 columns]
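Here I would like to collapse the levels of the variety field from 12 down to the 2 most frequent ones, with every other level becoming 'Others'.
The expected output is: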
Out[13]:
| variety
-- + -------
0 | Bourbon
1 | Others
2 | Others
3 | Caturra
4 | Bourbon
5 | Others
6 | Caturra
7 | Others
8 | Bourbon
9 | Caturra
10 | Others
11 | Others
12 | Caturra
13 | Others
14 | Others
15 | Bourbon
16 | Others
17 | Bourbon
18 | Caturra
19 | Others
20 | Bourbon
21 | Others
22 | Caturra
23 | Others
24 | Bourbon
25 | Others
26 | Caturra
27 | Bourbon
28 | Bourbon
29 | Others
[30 rows x 1 column]
One approach is to first replace every variety value from row 4 onward with the string "Other", and then group by variety:
>>> from datatable import f, by, sum
>>> DT_X[4:, f.variety] = "Other"
>>> DT_X = DT_X[:, sum(f.count), by(f.variety)]
>>> DT_X
| variety count
-- + ------- -----
0 | Bourbon 226
1 | Catuai 74
2 | Caturra 256
3 | Other 236
4 | Typica 211
[5 rows x 2 columns]
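Another possibility, starting again from the original DT_X, is to split the table into two parts by rows, collapse the second part into a single row, and rbind it back onto the first part: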
>>> dt.rbind(DT_X[:4, :],
...          dt.Frame(variety=["Other"], count=[DT_X[4:, f.count].sum1()]))
| variety count
-- + ------- -----
0 | Caturra 256
1 | Bourbon 226
2 | Typica 211
3 | Catuai 74
4 | Other 236
[5 rows x 2 columns]
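As a sanity check that does not depend on datatable, the same lumping can be sketched in plain Python. The helper name `lump_top_n` and its `other_label` parameter are made up for this illustration, and the sketch assumes the rows are already ordered by descending count, as they are in DT_X.

```python
def lump_top_n(rows, n, other_label="Other"):
    """Keep the first n (level, count) pairs and sum the rest under other_label.

    Assumes rows are already ordered by descending count.
    """
    head = list(rows[:n])
    tail = rows[n:]
    if tail:  # only add the catch-all row when something was lumped
        head.append((other_label, sum(count for _, count in tail)))
    return head


# Same data as DT_X in the question.
varieties = [
    "Caturra", "Bourbon", "Typica", "Catuai", "Hawaiian Kona",
    "Yellow Bourbon", "Mundo Novo", "Catimor", "SL14", "SL28", "Pacas",
    "Gesha", "Pacamara", "SL34", "Arusha", "Peaberry", "Mandheling",
    "Sumatra", "Blue Mountain", "Ethiopian Yirgacheffe", "Java", "Ruiru 11",
    "Ethiopian Heirlooms", "Marigojipe", "Moka Peaberry", "Pache Comun",
    "Sulawesi", "Sumatra Lintong",
]
counts = [256, 226, 211, 74, 44, 35, 33, 20, 17, 15, 13, 12, 8, 8, 6, 5,
          3, 3, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1]

lumped = lump_top_n(list(zip(varieties, counts)), 4)
for level, count in lumped:
    print(level, count)
```

The last pair comes out as ("Other", 236), which agrees with the count shown in the datatable output above.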
Case 2
You have already created the table of counts by variety, so now you only need to sort it by count and select the 2 most frequent varieties:
>>> from datatable import by, sort, count, join, update, f, g
>>> counts = DT_X_1[:, count(), by(f.variety)]
>>> frequent = counts[-2:, :, sort(f.count)]
>>> frequent
| variety count
-- + ------- -----
0 | Caturra 7
1 | Bourbon 9
[2 rows x 2 columns]
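(Alternatively, you could filter by the count value.)
The next step is to join this table back onto the original one, so that we obtain an indicator of which values are "frequent". The join operation can be combined with update, so that in the same operation every value that found no match during the join is set to "others":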
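The count-threshold alternative can be sketched in plain Python, outside datatable. The threshold of 5 and the names `variety_counts`, `min_count` and `frequent_levels` are assumptions chosen for this illustration; note that a threshold, unlike "top 2", does not fix how many levels survive.

```python
# Counts per variety, copied from the table in the question.
variety_counts = {
    "Bourbon": 9, "Catimor": 1, "Catuai": 1, "Caturra": 7,
    "Ethiopian Yirgacheffe": 1, "Gesha": 1, "Hawaiian Kona": 2,
    "SL14": 2, "SL28": 2, "SL34": 1, "Sumatra": 1, "Yellow Bourbon": 2,
}

min_count = 5  # hypothetical cut-off; pick whatever suits the data

# Keep only the levels whose count reaches the cut-off.
frequent_levels = {
    level: n for level, n in variety_counts.items() if n >= min_count
}
print(frequent_levels)
```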
>>> frequent.key = "variety"
>>> DT_X_1[g.variety==None, update(variety="others"), join(frequent)]
>>> DT_X_1
| variety
-- + -------
0 | Bourbon
1 | others
2 | others
3 | Caturra
4 | Bourbon
5 | others
6 | Caturra
7 | others
8 | Bourbon
9 | Caturra
10 | others
11 | others
12 | Caturra
13 | others
14 | others
15 | Bourbon
16 | others
17 | Bourbon
18 | Caturra
19 | others
20 | Bourbon
21 | others
22 | Caturra
23 | others
24 | Bourbon
25 | others
26 | Caturra
27 | Bourbon
28 | Bourbon
29 | others
[30 rows x 1 column]
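For comparison, here is a library-independent sketch of the same Case 2 logic using only the standard library. The function name `lump_infrequent` is hypothetical, and `Counter.most_common` orders equal counts by first appearance, which may differ from how a sort in datatable orders ties.

```python
from collections import Counter


def lump_infrequent(values, n, other_label="others"):
    """Replace every value outside the n most common ones with other_label."""
    keep = {level for level, _ in Counter(values).most_common(n)}
    return [v if v in keep else other_label for v in values]


# Same data as DT_X_1 in the question.
varieties = [
    "Bourbon", "Catimor", "Ethiopian Yirgacheffe", "Caturra", "Bourbon",
    "SL14", "Caturra", "Sumatra", "Bourbon", "Caturra", "SL34",
    "Hawaiian Kona", "Caturra", "Yellow Bourbon", "Yellow Bourbon",
    "Bourbon", "SL28", "Bourbon", "Caturra", "SL28", "Bourbon", "SL14",
    "Caturra", "Gesha", "Bourbon", "Catuai", "Caturra", "Bourbon",
    "Bourbon", "Hawaiian Kona",
]

lumped = lump_infrequent(varieties, 2)
print(Counter(lumped))
```

This leaves 9 rows of Bourbon, 7 of Caturra and 14 of "others", which matches the frame printed above row for row.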