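Locating duplicated entries in a column of a dataframe?
In rows 11:13 and 14:16 you can see duplicated entries for 'm:' and 'n:' in the column 'C2_xsampa'. Every value in 'C2_xsampa' should occur with exactly two levels, Singleton or Geminate, but that is not the case for 'm:' and 'n:'. As a result, the averages computed for the numeric columns are wrong.
My question is: how can I filter out the duplicated rows? I have manually checked the parent dataset that these values are derived from, and everything looks fine there.
Previously I used subset() to correct 'real' typing errors.
Data: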
C2_xsampa Consonant Speaker C1.dn C2.dn V1.dn V2.dn total.dn
1 "d_d" Singleton 8.5 11.9 7.82 13.0 7.65 40.3
2 "d_d:" Geminate 9 11.6 11.9 11.4 7.46 42.3
3 "dZ" Singleton 8.31 7.79 7.47 14.9 9.81 40.0
4 "dZ:" Geminate 8.08 7.72 13.4 12.8 9.61 43.6
5 "g" Singleton 9 12.1 11.3 11.9 8.56 43.9
6 "g:" Geminate 8.69 11.3 11.1 12.7 10.2 45.3
7 "k" Singleton 9.5 12.3 14.4 9.71 6.97 43.4
8 "k:" Geminate 9 14.7 16.1 10.1 7.37 48.2
9 "l" Singleton 8.69 11.9 6.33 11.5 10.2 40.0
10 "l:" Geminate 8.81 11.3 10.0 10.0 11.5 42.8
11 "m" Singleton 8.36 13.6 9.11 11.1 9.20 43.0
12 "m:" Geminate 8.85 13.7 10.9 9.95 8.42 43.0
13 "m: " Geminate 14 14.6 12.4 5.66 5.01 37.7
14 "n" Singleton 8 15.1 4.44 11.6 8.99 40.2
15 "n:" Geminate 8.21 21.4 10.1 10.2 9.32 51.0
16 "n: " Geminate 11.3 32.0 10.4 8.09 7.94 58.5
17 "p" Singleton 8.4 11.2 11.9 7.98 6.53 37.7
18 "p:" Geminate 8.81 13.2 12.7 8.57 11.3 45.8
19 "t`" Singleton 9 12.9 10.5 8.69 9.20 41.3
20 "t`:" Geminate 9 13.1 13.1 8.39 10.6 45.2
Thanks.
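You can keep only the rows whose values in the two columns are unique across the dataset (pandas):
df = df.drop_duplicates(subset=['C2_xsampa', 'Consonant'])
To get the incorrect rows instead, index with the duplicate mask rather than its negation (~):
df[df.duplicated(subset=['C2_xsampa', 'Consonant'], keep=False)]
Edit: I just saw the R tag. I believe
distinct(select(df, C2_xsampa, Consonant))
would do it.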
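Note that distinct(select(...)) returns only the two selected columns, and that "m: " with a trailing space still counts as different from "m:". A minimal sketch of one way around both points, assuming {dplyr} is installed; the toy data frame df and the name deduped are made up for illustration and only mirror a few rows of the question's data:

```r
library(dplyr)

# Toy data (hypothetical): "m: " carries a trailing space, as in the question.
df <- data.frame(
  C2_xsampa = c("m", "m:", "m: ", "n"),
  Consonant = c("Singleton", "Geminate", "Geminate", "Singleton"),
  total.dn  = c(43.0, 43.0, 37.7, 40.2),
  stringsAsFactors = FALSE
)

deduped <- df %>%
  mutate(C2_xsampa = trimws(C2_xsampa)) %>%         # "m: " becomes "m:"
  distinct(C2_xsampa, Consonant, .keep_all = TRUE)  # first row per pair, all columns kept

deduped
```

This keeps the first row of each pair and silently discards the other measurement, which may or may not be what the analysis needs.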
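Some values of C2_xsampa seem to contain unnecessary whitespace. Here is a suggestion using {tidyverse}. First it trims leading and trailing whitespace with str_trim(), then it identifies duplicated rows by C2_xsampa and Consonant. You can use the dup column to filter the duplicated rows.
library(tidyverse)
dat1 <- dat %>%
  mutate(C2_xsampa = str_trim(C2_xsampa)) %>%  # "m: " becomes "m:"
  group_by(C2_xsampa, Consonant) %>%
  mutate(dup = n()) %>%                        # rows per (C2_xsampa, Consonant) pair
  ungroup()
dat1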
# # A tibble: 20 x 9
# C2_xsampa Consonant Speaker C1.dn C2.dn V1.dn V2.dn total.dn dup
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 d_d Singleton 8.5 11.9 7.82 13 7.65 40.3 1
# 2 d_d: Geminate 9 11.6 11.9 11.4 7.46 42.3 1
# 3 dZ Singleton 8.31 7.79 7.47 14.9 9.81 40 1
# 4 dZ: Geminate 8.08 7.72 13.4 12.8 9.61 43.6 1
# 5 g Singleton 9 12.1 11.3 11.9 8.56 43.9 1
# 6 g: Geminate 8.69 11.3 11.1 12.7 10.2 45.3 1
# 7 k Singleton 9.5 12.3 14.4 9.71 6.97 43.4 1
# 8 k: Geminate 9 14.7 16.1 10.1 7.37 48.2 1
# 9 l Singleton 8.69 11.9 6.33 11.5 10.2 40 1
# 10 l: Geminate 8.81 11.3 10 10 11.5 42.8 1
# 11 m Singleton 8.36 13.6 9.11 11.1 9.2 43 1
# 12 m: Geminate 8.85 13.7 10.9 9.95 8.42 43 2
# 13 m: Geminate 14 14.6 12.4 5.66 5.01 37.7 2
# 14 n Singleton 8 15.1 4.44 11.6 8.99 40.2 1
# 15 n: Geminate 8.21 21.4 10.1 10.2 9.32 51 2
# 16 n: Geminate 11.3 32 10.4 8.09 7.94 58.5 2
# 17 p Singleton 8.4 11.2 11.9 7.98 6.53 37.7 1
# 18 p: Geminate 8.81 13.2 12.7 8.57 11.3 45.8 1
# 19 t` Singleton 9 12.9 10.5 8.69 9.2 41.3 1
# 20 t`: Geminate 9 13.1 13.1 8.39 10.6 45.2 1
Here is the code for the dataset:
dat <- read.table(
text = '
C2_xsampa Consonant Speaker C1.dn C2.dn V1.dn V2.dn total.dn
1 "d_d" Singleton 8.5 11.9 7.82 13.0 7.65 40.3
2 "d_d:" Geminate 9 11.6 11.9 11.4 7.46 42.3
3 "dZ" Singleton 8.31 7.79 7.47 14.9 9.81 40.0
4 "dZ:" Geminate 8.08 7.72 13.4 12.8 9.61 43.6
5 "g" Singleton 9 12.1 11.3 11.9 8.56 43.9
6 "g:" Geminate 8.69 11.3 11.1 12.7 10.2 45.3
7 "k" Singleton 9.5 12.3 14.4 9.71 6.97 43.4
8 "k:" Geminate 9 14.7 16.1 10.1 7.37 48.2
9 "l" Singleton 8.69 11.9 6.33 11.5 10.2 40.0
10 "l:" Geminate 8.81 11.3 10.0 10.0 11.5 42.8
11 "m" Singleton 8.36 13.6 9.11 11.1 9.20 43.0
12 "m:" Geminate 8.85 13.7 10.9 9.95 8.42 43.0
13 "m: " Geminate 14 14.6 12.4 5.66 5.01 37.7
14 "n" Singleton 8 15.1 4.44 11.6 8.99 40.2
15 "n:" Geminate 8.21 21.4 10.1 10.2 9.32 51.0
16 "n: " Geminate 11.3 32.0 10.4 8.09 7.94 58.5
17 "p" Singleton 8.4 11.2 11.9 7.98 6.53 37.7
18 "p:" Geminate 8.81 13.2 12.7 8.57 11.3 45.8
19 "t`" Singleton 9 12.9 10.5 8.69 9.20 41.3
20 "t`:" Geminate 9 13.1 13.1 8.39 10.6 45.2',
header = TRUE
)
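Filtering on the dup column, as suggested in this answer, could look roughly like the following. This is only a sketch on a reduced toy data frame (dat_small, flagged, dup_rows and unique_rows are illustrative names, not part of the answer) and it assumes {dplyr} is installed; trimws() from base R stands in for str_trim().

```r
library(dplyr)

# Toy data (hypothetical): six rows mirroring the "m" and "n" entries above.
dat_small <- data.frame(
  C2_xsampa = c("m", "m:", "m: ", "n", "n:", "n: "),
  Consonant = c("Singleton", "Geminate", "Geminate",
                "Singleton", "Geminate", "Geminate"),
  total.dn  = c(43.0, 43.0, 37.7, 40.2, 51.0, 58.5),
  stringsAsFactors = FALSE
)

flagged <- dat_small %>%
  mutate(C2_xsampa = trimws(C2_xsampa)) %>%  # remove the trailing spaces
  group_by(C2_xsampa, Consonant) %>%
  mutate(dup = n()) %>%                      # size of each (C2_xsampa, Consonant) group
  ungroup()

# Rows that share their pair with at least one other row
dup_rows <- filter(flagged, dup > 1)

# Rows whose pair occurs exactly once
unique_rows <- filter(flagged, dup == 1)
```

dup_rows is the set to inspect by hand; whether those rows should then be dropped, merged, or averaged is a decision about the data, not something the code can settle.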
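My favorite approach is:
subset(dat, duplicated(trimws(C2_xsampa)) | duplicated(trimws(C2_xsampa), fromLast = TRUE))
The values are trimmed first because "m: " and "m:" otherwise compare as different strings, and fromLast = TRUE is used so that the first member of each duplicated pair is flagged as well.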
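A small base R sketch of this idea, on a made-up four-row data frame (dat_small, key and is_dup are illustrative names). It assumes the trailing space is the only difference between the entries, and it combines a forward and a backward pass of duplicated() so that both members of a pair are marked:

```r
# Toy data (hypothetical): "m: " has a trailing space, as in the question.
dat_small <- data.frame(
  C2_xsampa = c("m", "m:", "m: ", "n"),
  Consonant = c("Singleton", "Geminate", "Geminate", "Singleton"),
  stringsAsFactors = FALSE
)

# Without trimming, nothing is flagged: "m:" and "m: " are different strings.
raw_has_dups <- any(duplicated(dat_small$C2_xsampa))

# Trim, then flag duplicates scanning from both ends.
key    <- trimws(dat_small$C2_xsampa)
is_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)

dat_small[is_dup, ]
```

Scanning only forward would flag the second "m:" but not the first, which is why the fromLast = TRUE pass is OR-ed in.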