Computing the Levenshtein ratio of each element of a data.table against each value of a reference table and merging on the maximum ratio
I have a data.table dt with 3 columns:
- id
- name as a string
- threshold as a num
A sample is:
dt <- data.table(nid = c("n1","n2", "n3", "n4"), rname = c("apple", "pear", "banana", "kiwi"), maxr = c(0.5, 0.8, 0.7, 0.6))
nid | rname | maxr
n1 | apple | 0.5
n2 | pear | 0.8
n3 | banana | 0.7
n4 | kiwi | 0.6
I have a second table, dt.ref, with 2 columns:
- id
- name as a string
A sample is:
dt.ref <- data.table(cid = c("c1", "c2", "c3", "c4", "c5", "c6"), cname = c("apple", "maple", "peer", "dear", "bonobo", "kiwis"))
cid | cname
c1 | apple
c2 | maple
c3 | peer
c4 | dear
c5 | bonobo
c6 | kiwis
For each rname of dt, I would like to compute the Levenshtein ratio with each cname of dt.ref, such that:
Lr = 1 - (stringdist(cname, rname, method = "lv") / pmax(nchar(cname),nchar(rname)))
Then, for each rname of dt, I would like to find max(Lr) over cname and get the following data.table as output:
nid | rname | maxr | maxLr | cid
n1 | apple | 0.5 | 1 | c1
n2 | pear | 0.8 | 0.75 | c3
n2 | pear | 0.8 | 0.75 | c4
n3 | banana | 0.7 | 0.33 | c5
n4 | kiwi | 0.6 | 0.8 | c6
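Basically, we take dt and add 2 columns, the maximum Levenshtein ratio and the corresponding cid, knowing that ties are all added, one row each, as for n2.
I use data.table, but the solution can use dplyr or any other package.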
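As a quick check of the ratio defined above, here is a minimal sketch that evaluates it for "pear" against every cname. The helper name lev_ratio is only an illustration and does not come from any package; it assumes the stringdist package is installed.

```r
library(stringdist)

# Levenshtein ratio as defined in the question (lev_ratio is a hypothetical helper name)
lev_ratio <- function(a, b) {
  1 - stringdist(a, b, method = "lv") / pmax(nchar(a), nchar(b))
}

cname <- c("apple", "maple", "peer", "dear", "bonobo", "kiwis")

# Ratio of "pear" against every reference name
lr <- lev_ratio("pear", cname)

# "peer" and "dear" are each one substitution away from "pear" (length 4),
# so both give 1 - 1/4 = 0.75
lr[cname %in% c("peer", "dear")]
```

Since both names reach the same ratio, n2 should appear twice in the expected output, once for c3 and once for c4.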
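You could try something like this:
f1 <- function(x, y) {
  require(stringdist)
  require(matrixStats)
  # Pairwise Levenshtein distances: rows follow x, columns follow y
  dis <- stringdistmatrix(x, y, method = "lv")
  # Length of the longer string for every (x, y) pair
  mat <- sapply(nchar(y), function(i) pmax(i, nchar(x)))
  r <- 1 - dis / mat
  # For each row, the column indices that attain the row maximum (ties kept);
  # lapply always returns a list, even when every row has the same number of ties
  w <- lapply(seq_len(nrow(r)), function(i) which(r[i, ] == max(r[i, ])))
  m <- rowMaxs(r)
  list(m = m, w = w)
}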
r <- f1(dt[[2]], dt.ref[[2]])
r
$m
[1] 1.0000000 0.7500000 0.3333333 0.8000000
$w
$w[[1]]
[1] 1
$w[[2]]
[1] 3 4
$w[[3]]
[1] 5
$w[[4]]
[1] 6
dt[, maxLr := r$m]
#dtnew <- dt[rep(1:.N, sapply(r$w, length)),]
dtnew <- dt[rep(1:.N, lengths(r$w)),] # thanks to Frank
dtnew[, cid := dt.ref[unlist(r$w), 1]]
Result:
dtnew
nid rname maxr maxLr cid
1: n1 apple 0.5 1.0000000 c1
2: n2 pear 0.8 0.7500000 c3
3: n2 pear 0.8 0.7500000 c4
4: n3 banana 0.7 0.3333333 c5
5: n4 kiwi 0.6 0.8000000 c6
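For comparison, the same result can probably be reached with a cross join that stays inside data.table, without the distance matrix. This is only a sketch under the sample data from the question; the names allpairs and res are made up for illustration, and because it materializes nrow(dt) * nrow(dt.ref) rows it may not suit large tables.

```r
library(data.table)
library(stringdist)

dt <- data.table(nid = c("n1", "n2", "n3", "n4"),
                 rname = c("apple", "pear", "banana", "kiwi"),
                 maxr = c(0.5, 0.8, 0.7, 0.6))
dt.ref <- data.table(cid = c("c1", "c2", "c3", "c4", "c5", "c6"),
                     cname = c("apple", "maple", "peer", "dear", "bonobo", "kiwis"))

# Pair every row of dt with every row of dt.ref (a cross join)
allpairs <- dt[, as.list(dt.ref), by = .(nid, rname, maxr)]

# Levenshtein ratio for each pair, using the formula from the question
allpairs[, Lr := 1 - stringdist(cname, rname, method = "lv") /
                     pmax(nchar(cname), nchar(rname))]

# For each nid, keep every row that attains the maximum ratio (ties included)
res <- allpairs[, .SD[Lr == max(Lr)], by = nid]
res <- res[, .(nid, rname, maxr, maxLr = Lr, cid)]
res
```

On the sample data this gives the same five rows as dtnew, with n2 listed once for c3 and once for c4.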