Can I reasonably split these number strings?

Question

I have a bunch of strings like this:

x  <-  c("4/757.1%", "0/10%", "6/1060%", "0/0-%", "11/2055%")

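Each one is a fraction followed by that fraction's percentage value, mashed together somewhere along the way. So the first entry in the example means 4 out of 7, which is 57.1%. I can easily get the first number, the one before the / (e.g. with stringr::word(x, 1, sep = "/")), but the second number can be one or two characters long, so I'm struggling to come up with a way to extract it. I don't need the % value, since it is easy to recalculate once I have the two numbers.

Can anyone see a way of doing this?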

Answer 1

As you noted, once you have the fraction, the percentage can be recalculated. Can you use that fact to figure out where the split should be?

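library(stringr)  # provides word()

GuessSplit <- function(string) {

  tolerance <- 0.001 # How close should the fraction be to the stated percentage?
  numerator <- as.numeric(word(string, 1, sep = "/"))
  second.half <- word(string, 2, sep = "/")
  second.half <- strsplit(second.half, '')[[1]]

  # assuming they all end in percent signs
  possibilities <- length(second.half) - 1

  # stop one short so at least one character is left for the percentage
  for (position in seq_len(max(possibilities - 1, 0))) {

    denom.guess <- suppressWarnings(as.numeric(paste0(second.half[1:position], collapse='')))
    percent.guess <- suppressWarnings(as.numeric(paste0(second.half[(position+1):possibilities], collapse=''))) / 100

    value <- numerator / denom.guess

    # isTRUE() treats NA/NaN comparisons (e.g. "0/0-%") as "no match" instead of erroring
    if (isTRUE(abs(value - percent.guess) < tolerance)) {

      return(list(numerator=numerator, denominator=denom.guess))

    }
  }
  # falls through and returns NULL when no split matches
}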

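This needs a little love to handle the weirder cases, and it probably needs something more graceful for when it can't find an answer among the possibilities. I'm also not sure which return type is best. Maybe you only need the denominator, since the numerator is easy to get, but I figured a list containing both was the most general. I hope this is a reasonable start?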
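For the "can't find an answer" case, here is one possible sketch in base R, not a drop-in replacement for the function above. The name guess_denominator is hypothetical, and the assumption (mine, not the asker's) is that returning NA is an acceptable way to flag strings where no split reproduces the percentage. It returns only the denominator so it can be applied over the whole vector.

```r
# Hypothetical helper: returns the denominator, or NA when no split
# position reproduces the stated percentage within the tolerance.
guess_denominator <- function(string, tolerance = 0.001) {
  parts <- strsplit(string, "/", fixed = TRUE)[[1]]
  numerator <- as.numeric(parts[1])
  rest <- sub("%$", "", parts[2])   # drop the trailing percent sign
  n <- nchar(rest)
  if (n < 2) return(NA_real_)       # need room for a denominator and a percentage
  for (position in seq_len(n - 1)) {
    denom <- suppressWarnings(as.numeric(substr(rest, 1, position)))
    pct   <- suppressWarnings(as.numeric(substr(rest, position + 1, n))) / 100
    # isTRUE() turns NA/NaN comparisons into "no match"
    if (isTRUE(abs(numerator / denom - pct) < tolerance)) return(denom)
  }
  NA_real_
}

x <- c("4/757.1%", "0/10%", "6/1060%", "0/0-%", "11/2055%")
denominators <- vapply(x, guess_denominator, numeric(1), USE.NAMES = FALSE)
result <- data.frame(numerator = as.numeric(sub("/.*", "", x)),
                     denominator = denominators)
print(result)
```

With the example vector, the "0/0-%" entry comes back as NA and the others as 7, 1, 10 and 20.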

Answer 2

A rather ugly-looking solution that seems to do what you want:

x  <-  c("4/757.1%", "0/10%", "6/1060%", "0/0-%", "11/2055%")

split_perc <- function(x,signif_digits=1){
  x = gsub("%","",x)
  if(grepl("-",x)) return(list(NA,NA))
  index1 = gregexpr("/",x)[[1]][1]+1
  index2 = gregexpr("\\.",x)[[1]][1]-2
  if(index2==-3){index2=nchar(x)-1}

  found=FALSE
  indices = seq(index1,index2)
  k=1
  while(!found & k<=length(indices))
  {
    str1 =substr(x,1,indices[k])
    num1=as.numeric(strsplit(str1,"/")[[1]][1])
    num2 = as.numeric(strsplit(str1,"/")[[1]][2])
    value1 = round(num1/num2*100,signif_digits)
    value2 = round(as.numeric(substr(x,indices[k]+1,nchar(x))),signif_digits)
    if(value1==value2)
    {found=TRUE}
    else
    {k=k+1}
  }
  if(found)
    return(list(num1,num2))
  else
    return(list(NA,NA))
}

do.call(rbind,lapply(x,split_perc))

Output:

     [,1] [,2]
[1,] 4    7   
[2,] 0    1   
[3,] 6    10  
[4,] NA   NA  
[5,] 11   20

A few more examples:

y = c("11/2055.003%","11/2055.2%","40/7057.1%")
do.call(rbind,lapply(y,split_perc))

     [,1] [,2]
[1,] 11   20   # values are rounded to 1 decimal place by default, so a match is found.
[2,] NA   NA   # no match found since 55 != 55.2
[3,] 40   70
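Note that do.call(rbind, ...) on a list of lists gives a list-matrix, which is why the output prints without the usual numeric alignment. If you want plain numeric columns, something like the following should work. This is only a sketch: res is a hand-built stand-in for the real result, and the column names are my own choice.

```r
# Stand-in for the list-matrix returned by do.call(rbind, lapply(x, split_perc))
res <- do.call(rbind, list(list(4, 7), list(0, 1), list(NA, NA)))

# unlist() each column to get ordinary numeric vectors
df <- data.frame(numerator   = unlist(res[, 1]),
                 denominator = unlist(res[, 2]))
print(df)
```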

Answer 3

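A solution using tidyverse and stringr. We can define a function that splits the second number at every possible position, then compute the percentage for each split to see which one makes sense. dt2 is the data frame showing the best split position; the number you want is in column V3.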

library(tidyverse)
library(stringr)

x <- c("4/757.1%", "0/10%", "6/1060%", "0/0-%", "11/2055%")

dt <- str_split_fixed(x, pattern = "/", n = 2) %>%
  as.data.frame(stringsAsFactors = FALSE) %>%
  mutate(ID = 1:n()) %>%
  select(ID, V1, V2)

# Design a function to split the second column based on position
split_df <- function(position, dt){
  dt_temp <- dt %>%
    mutate(V3 = str_sub(V2, 1, position)) %>%
    mutate(V4 =  str_sub(V2, position + 1)) %>%
    mutate(Pos = position)

  return(dt_temp)
}

# Process the data
dt2 <- map_df(1:3, split_df, dt = dt) %>%
  # Remove % in V4
  mutate(V4 = str_replace(V4, "%", "")) %>%
  # Convert V1, V3 and V4 to numeric
  mutate_at(vars(V1, V3, V4), as.numeric) %>%
  # Calculate possible percentage
  mutate(V5 = V1/V3 * 100) %>%
  # Calculate the difference between V4 and V5
  mutate(V6 = abs(V4 - V5)) %>%
  # Select the smallest difference based on V6 for each group
  group_by(ID) %>%
  arrange(ID, V6) %>%
  slice(1)

# The best split is now in V3
dt2$V3
[1]  7  1 10  0 20
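One caveat: for "0/0-%" every candidate difference is NA, so the first row of that group is kept and a denominator of 0 is reported rather than a missing value. A quick sanity check is to recompute the percentage from the recovered pairs. The sketch below uses the numbers from the output above, typed in by hand; treating a zero denominator as NA is my assumption about what you would want.

```r
# Numerators and recovered denominators from the example, entered by hand
numer <- c(4, 0, 6, 0, 11)
denom <- c(7, 1, 10, 0, 20)

# Recompute the percentage; 0/0 gives NaN, which flags the "0/0-%" entry
pct <- round(numer / denom * 100, 1)

# Assumption: a zero denominator should be reported as missing
denom_clean <- ifelse(denom == 0, NA, denom)

print(data.frame(numer, denom_clean, pct))
```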

regex

string

r

stringr