Locating duplicated entries in a column of a dataframe?

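In rows 11:13 and 14:16 you can see duplicated entries for 'm:' and 'n:' in the column 'C2_xsampa'. Every value in 'C2_xsampa' should have two levels, Singleton or Geminate, but that is not the case for 'm:' and 'n:'. This produces wrong means for the numeric columns.

My question is: how can I filter out the duplicated rows? I have manually checked the parent dataset from which these values are derived, and everything looks fine there.

Previously I used subset() to correct 'real' errors in the input.

Data: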

C2_xsampa Consonant Speaker C1.dn C2.dn V1.dn V2.dn total.dn
 1 "d_d"     Singleton    8.5  11.9   7.82 13.0   7.65     40.3
 2 "d_d:"    Geminate     9    11.6  11.9  11.4   7.46     42.3
 3 "dZ"      Singleton    8.31  7.79  7.47 14.9   9.81     40.0
 4 "dZ:"     Geminate     8.08  7.72 13.4  12.8   9.61     43.6
 5 "g"       Singleton    9    12.1  11.3  11.9   8.56     43.9
 6 "g:"      Geminate     8.69 11.3  11.1  12.7  10.2      45.3
 7 "k"       Singleton    9.5  12.3  14.4   9.71  6.97     43.4
 8 "k:"      Geminate     9    14.7  16.1  10.1   7.37     48.2
 9 "l"       Singleton    8.69 11.9   6.33 11.5  10.2      40.0
10 "l:"      Geminate     8.81 11.3  10.0  10.0  11.5      42.8
11 "m"       Singleton    8.36 13.6   9.11 11.1   9.20     43.0
12 "m:"      Geminate     8.85 13.7  10.9   9.95  8.42     43.0
13 "m: "     Geminate    14    14.6  12.4   5.66  5.01     37.7
14 "n"       Singleton    8    15.1   4.44 11.6   8.99     40.2
15 "n:"      Geminate     8.21 21.4  10.1  10.2   9.32     51.0
16 "n: "     Geminate    11.3  32.0  10.4   8.09  7.94     58.5
17 "p"       Singleton    8.4  11.2  11.9   7.98  6.53     37.7
18 "p:"      Geminate     8.81 13.2  12.7   8.57 11.3      45.8
19 "t`"      Singleton    9    12.9  10.5   8.69  9.20     41.3
20 "t`:"     Geminate     9    13.1  13.1   8.39 10.6      45.2

Thanks.

You can check whether the combination of the two columns is unique across the whole dataset:

df = df.drop_duplicates(subset=['C2_xsampa','Consonant'])

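To see the offending rows instead of dropping them, index with a boolean mask, e.g. df[df.duplicated(subset=['C2_xsampa','Consonant'], keep=False)].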

Edit: I just saw the r tag. I believe distinct(select(df, C2_xsampa, Consonant)) will do it.
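One caveat worth checking before relying on an exact-match comparison: in the data shown above, the extra entries are stored as "m: " and "n: " with a trailing space, so they may not compare equal to "m:" and "n:". The following is a minimal base R sketch of that effect. The vector `keys` is a hypothetical stand-in for the C2_xsampa column, not the real data.

```r
# Hypothetical stand-in for a few values of the C2_xsampa column
keys <- c("m", "m:", "m: ")

# String lengths expose the hidden trailing space in the third value
key_lengths <- nchar(keys)

# Exact matching treats "m:" and "m: " as different strings
n_unique_raw <- length(unique(keys))

# After trimming surrounding whitespace, the repeat becomes visible
n_unique_trimmed <- length(unique(trimws(keys)))
has_dup_trimmed  <- anyDuplicated(trimws(keys)) > 0
```

If `n_unique_raw` and `n_unique_trimmed` differ on the real column, whitespace is the likely cause of the apparent duplicates.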

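Some values of C2_xsampa seem to contain unnecessary symbols and spaces. Here is a suggestion using {tidyverse}. It first removes the symbols/spaces and then identifies duplicated rows by C2_xsampa and Consonant. You can use the dup column to filter the duplicated rows.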

library(tidyverse)
   
dat1 <- dat %>% 
  mutate(C2_xsampa = str_trim(C2_xsampa)) %>% 
  group_by(C2_xsampa, Consonant) %>% 
  mutate(dup = n()) %>%
  ungroup()

dat1

# # A tibble: 20 x 9
#    C2_xsampa Consonant Speaker C1.dn C2.dn V1.dn V2.dn total.dn   dup
#    <chr>     <chr>       <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl> <int>
#  1 d_d       Singleton    8.5  11.9   7.82 13     7.65     40.3     1
#  2 d_d:      Geminate     9    11.6  11.9  11.4   7.46     42.3     1
#  3 dZ        Singleton    8.31  7.79  7.47 14.9   9.81     40       1
#  4 dZ:       Geminate     8.08  7.72 13.4  12.8   9.61     43.6     1
#  5 g         Singleton    9    12.1  11.3  11.9   8.56     43.9     1
#  6 g:        Geminate     8.69 11.3  11.1  12.7  10.2      45.3     1
#  7 k         Singleton    9.5  12.3  14.4   9.71  6.97     43.4     1
#  8 k:        Geminate     9    14.7  16.1  10.1   7.37     48.2     1
#  9 l         Singleton    8.69 11.9   6.33 11.5  10.2      40       1
# 10 l:        Geminate     8.81 11.3  10    10    11.5      42.8     1
# 11 m         Singleton    8.36 13.6   9.11 11.1   9.2      43       1
# 12 m:        Geminate     8.85 13.7  10.9   9.95  8.42     43       2
# 13 m:        Geminate    14    14.6  12.4   5.66  5.01     37.7     2
# 14 n         Singleton    8    15.1   4.44 11.6   8.99     40.2     1
# 15 n:        Geminate     8.21 21.4  10.1  10.2   9.32     51       2
# 16 n:        Geminate    11.3  32    10.4   8.09  7.94     58.5     2
# 17 p         Singleton    8.4  11.2  11.9   7.98  6.53     37.7     1
# 18 p:        Geminate     8.81 13.2  12.7   8.57 11.3      45.8     1
# 19 t`        Singleton    9    12.9  10.5   8.69  9.2      41.3     1
# 20 t`:       Geminate     9    13.1  13.1   8.39 10.6      45.2     1
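The dup column can then drive the filtering step. The sketch below assumes {dplyr} is installed and uses a small hypothetical data frame `toy` (three rows copied from the data above) instead of the full `dat`. It also swaps in base R's trimws() for str_trim() to keep the dependencies small.

```r
library(dplyr)

# Toy stand-in for dat: rows 11 to 13 of the data, one key with a trailing space
toy <- data.frame(
  C2_xsampa = c("m", "m:", "m: "),
  Consonant = c("Singleton", "Geminate", "Geminate"),
  total.dn  = c(43.0, 43.0, 37.7),
  stringsAsFactors = FALSE
)

# Same idea as above: trim the key, then count rows per key combination
toy1 <- toy %>%
  mutate(C2_xsampa = trimws(C2_xsampa)) %>%
  group_by(C2_xsampa, Consonant) %>%
  mutate(dup = n()) %>%
  ungroup()

# Rows that share their key with at least one other row
dup_rows <- filter(toy1, dup > 1)

# Rows whose key occurs exactly once
unique_rows <- filter(toy1, dup == 1)
```

Which of the duplicated rows should be kept, dropped, or averaged is a decision about the data and is not settled by this sketch.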

Here is the code for the dataset:

dat <- read.table(
  text = '
  C2_xsampa Consonant Speaker C1.dn C2.dn V1.dn V2.dn total.dn
 1 "d_d"     Singleton    8.5  11.9   7.82 13.0   7.65     40.3
 2 "d_d:"    Geminate     9    11.6  11.9  11.4   7.46     42.3
 3 "dZ"      Singleton    8.31  7.79  7.47 14.9   9.81     40.0
 4 "dZ:"     Geminate     8.08  7.72 13.4  12.8   9.61     43.6
 5 "g"       Singleton    9    12.1  11.3  11.9   8.56     43.9
 6 "g:"      Geminate     8.69 11.3  11.1  12.7  10.2      45.3
 7 "k"       Singleton    9.5  12.3  14.4   9.71  6.97     43.4
 8 "k:"      Geminate     9    14.7  16.1  10.1   7.37     48.2
 9 "l"       Singleton    8.69 11.9   6.33 11.5  10.2      40.0
10 "l:"      Geminate     8.81 11.3  10.0  10.0  11.5      42.8
11 "m"       Singleton    8.36 13.6   9.11 11.1   9.20     43.0
12 "m:"      Geminate     8.85 13.7  10.9   9.95  8.42     43.0
13 "m: "     Geminate    14    14.6  12.4   5.66  5.01     37.7
14 "n"       Singleton    8    15.1   4.44 11.6   8.99     40.2
15 "n:"      Geminate     8.21 21.4  10.1  10.2   9.32     51.0
16 "n: "     Geminate    11.3  32.0  10.4   8.09  7.94     58.5
17 "p"       Singleton    8.4  11.2  11.9   7.98  6.53     37.7
18 "p:"      Geminate     8.81 13.2  12.7   8.57 11.3      45.8
19 "t`"      Singleton    9    12.9  10.5   8.69  9.20     41.3
20 "t`:"     Geminate     9    13.1  13.1   8.39 10.6      45.2',
header = TRUE
)

My favorite approach is:

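# Flag all copies of a repeated key (scan from both ends); trimws() drops stray trailing spaces
subset(dat, duplicated(trimws(C2_xsampa)) | duplicated(trimws(C2_xsampa), fromLast = TRUE))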
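The two scan directions of duplicated() can be compared on a small vector. This is a sketch with a made-up vector `x`, not the original data. It illustrates that a mask computed on rev(x) is in reversed order, so it has to be reversed back (or replaced by fromLast = TRUE) before it is combined with the forward mask.

```r
# Made-up example: the first two elements are copies of each other
x <- c("m:", "m:", "n", "p")

# Forward scan: marks the second and later copies only
fwd <- duplicated(x)

# Backward scan: marks every copy except the last one
bwd <- duplicated(x, fromLast = TRUE)

# Mask computed on the reversed vector, still in reversed order
bwd_unaligned <- duplicated(rev(x))

# Reversing the mask back aligns it with x again
bwd_aligned <- rev(duplicated(rev(x)))

# Combining both directions marks all copies of a repeated value
all_copies <- fwd | bwd
```

Here `bwd_unaligned` would point at "p" rather than the first "m:", which is why the aligned form is the one to combine with `fwd`.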