R: look for an abbreviation in a full string
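I am looking for an efficient way in R to tell whether one string could be an abbreviation of another. The basic approach I am taking is to check whether the letters of the shorter string appear in the same order in the longer string. For example, if my shorter string is "abv" and my longer string is "abbreviation", I would want a positive result, whereas if my shorter string is "avb", I would want a negative result. I have a function that I put together, but it seems like a very inelegant solution, and I think I may be missing some regex magic. I have also looked at R's 'stringdist' function, but I haven't found anything that looks like it does this specifically. Here is my function: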
# This function computes whether one of the input strings (input strings x and y) could be an abbreviation of the other
# input strings should be all the same case, and probably devoid of anything but letters and numbers
abbrevFind = function(x, y) {
  # Compute the number of characters in each string
  len.x = nchar(x)
  len.y = nchar(y)
  # Find out which string is shorter, and therefore a possible abbreviation
  # split each string into its component characters
  if (len.x < len.y) {
    # Designate the abbreviation and the full string
    abv = substring(x, 1:len.x, 1:len.x)
    full = substring(y, 1:len.y, 1:len.y)
  } else if (len.x >= len.y) {
    abv = substring(y, 1:len.y, 1:len.y)
    full = substring(x, 1:len.x, 1:len.x)
  }
  # Get the number of letters in the abbreviation
  small = length(abv)
  # Set up old position, which will be a comparison criteria
  pos.old = 0
  # set up an empty vector which will hold the letter positions of already used letters
  letters = c()
  # Loop through each letter in the abbreviation
  for (i in 1:small) {
    # Get the position in the full string of the ith letter in the abbreviation
    pos = grep(abv[i], full)
    # Exclude positions which have already been used
    pos = pos[!pos %in% letters]
    # Get the earliest position (note that if the grep found no matches, the min function will return 'Inf' here)
    pos = min(pos)
    # Store that position
    letters[i] = pos
    # If there are no matches to the current letter, or the current letter's only match is earlier in the string than the last match
    # it is not a possible abbreviation. The loop breaks, and the function returns False
    # If the function makes it all the way through without breaking out of the loop, the function will return true
    if (is.infinite(pos) | pos <= pos.old) {abbreviation = F; break} else {abbreviation = T}
    # Set old position equal to the current position
    pos.old = pos
  }
  return(abbreviation)
}
Thanks for your help!
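How about something like this? You basically take each character of the abbreviation and allow any letters to match zero or more times between consecutive characters ([a-z]*?):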
f <- Vectorize(function(x, y) {
  xx <- strsplit(tolower(x), '')[[1]]
  grepl(paste0(xx, collapse = '[a-z]*?'), y)
  ## add this if you only want to consider letters in y
  # grepl(paste0(xx, collapse = sprintf('[%s]*?', tolower(y))), y)
}, vectorize.args = 'x')
f(c('ohb','hello','ob','ohc'), 'ohbother')
#   ohb hello    ob   ohc
#  TRUE FALSE  TRUE FALSE
f(c('abbrev','abb','abv', 'avb'), 'abbreviation')
# abbrev    abb    abv    avb
#   TRUE   TRUE   TRUE  FALSE
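To make the idea above concrete, here is a small sketch that builds the pattern for a single abbreviation by hand so you can see what grepl() actually receives. The variable names (abbr, chars, pattern) are only for illustration, and it assumes the abbreviation contains plain lowercase letters with no regex metacharacters.

```r
# Split the abbreviation into characters and join them with a lazy
# "any lowercase letters" gap, exactly as the collapse argument does above.
abbr <- "abv"
chars <- strsplit(tolower(abbr), "")[[1]]
pattern <- paste0(chars, collapse = "[a-z]*?")
pattern
# [1] "a[a-z]*?b[a-z]*?v"

# The letters a, b, v occur in this order in the full string.
grepl(pattern, "abbreviation")
# [1] TRUE

# With v and b swapped there is no b after the v, so no match.
grepl("a[a-z]*?v[a-z]*?b", "abbreviation")
# [1] FALSE
```

Since the characters are pasted into the pattern as-is, an abbreviation containing something like "." or "(" would be read as regex syntax, so you would probably want to strip or escape those first.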
Not as short an answer, but it uses recursion (recursion is elegant, right? :p)
# Just a library I prefer to use for regular expressions
library(stringr)
# Recursive function; abbr is a character vector of single letters, word is a string
checkAbbr <- function(abbr, word) {
  # Find the first letter of abbr in word and keep only what follows the match;
  # if the letter is not found, str_locate() returns NA and word becomes NA
  word <- substring(word, (str_locate(word, abbr[1])[, 1] + 1))
  abbr <- abbr[-1]
  # As long as abbr still has characters and the last letter was found, recurse
  if (!is.na(word) && length(abbr) > 0) {
    checkAbbr(abbr, word)
  } else {
    # word is NA exactly when some letter of abbr was not found in order,
    # so a non-NA word means abbr is a possible abbreviation
    return(!is.na(word))
  }
}
# Testing cases for abbreviation or not
checkAbbr(strsplit("abv", "")[[1]], "abbreviation") # TRUE
checkAbbr(strsplit("avb", "")[[1]], "abbreviation") # FALSE
checkAbbr(strsplit("z", "")[[1]], "abbreviation")   # FALSE
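For comparison, here is a minimal base-R sketch of the same in-order check with neither regular expressions nor recursion. The name isSubsequence is made up for illustration and does not come from either answer. It assumes both inputs are single strings in the same case.

```r
# Hypothetical helper: TRUE if every character of `abbr` appears in `word`
# in the same order (not necessarily adjacent).
isSubsequence <- function(abbr, word) {
  a <- strsplit(abbr, "")[[1]]
  w <- strsplit(word, "")[[1]]
  i <- 1L  # index of the next abbreviation character still to be found
  for (ch in w) {
    if (i > length(a)) break       # all abbreviation characters matched
    if (ch == a[i]) i <- i + 1L    # found the next one, move on
  }
  i > length(a)
}

isSubsequence("abv", "abbreviation")  # TRUE
isSubsequence("avb", "abbreviation")  # FALSE
isSubsequence("ba", "abab")           # TRUE
```

It walks the full string once and only ever looks forward from the last match, so repeated letters (as in the "abab" case) should not trip it up.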