Extract number between underscore in text

Question

I have files with names like

Hughson.George_54_4
Ifran.Dean_51_3
Houston.Amanda_49_6

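I want to create a data frame in which each row holds the information extracted from one file name, in the format Author, Volume, Issue.

I can extract the name and the volume, but I can't seem to get the issue number. Using the "stringr" package, I tried the following, which gives me _4 instead of just 4.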

[^a-z](?:[^_]+_){0}([^_ ]+$)

How can I fix this?

Answer 1

If it is always the last digit, we can extract it directly with base R

as.numeric(substring(str1, nchar(str1)))

or with sub

as.numeric(sub(".*_", "", str1))
#[1] 4 3 6
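
The two base R calls above only agree when the issue number is a single digit. As a minimal sketch, assuming a hypothetical file name with a two-digit issue (the `files` vector below is made up for illustration), the difference looks like this:

```r
# Hypothetical file names; the second one has a two-digit issue number
files <- c("Hughson.George_54_4", "Ifran.Dean_51_12")

# substring() starting at the last character keeps only one digit
last_char <- as.numeric(substring(files, nchar(files)))

# sub() removes everything up to and including the last underscore
after_underscore <- as.numeric(sub(".*_", "", files))

print(last_char)         # 4 2
print(after_underscore)  # 4 12
```

If issue numbers can exceed 9, the sub version is probably the safer choice.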

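If we need to split it into separate columns, one option is separate from the tidyverse, which splits a column into several columns at the delimiter (_); convert = TRUE makes sure the column types are converted.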

library(tidyverse)
# tibble() replaces the deprecated data_frame()
tibble(col1 = str1) %>%
    separate(col1, into = c("Author", "Volume", "Issue"), sep = "_", convert = TRUE)
# A tibble: 3 x 3
#  Author         Volume Issue
#  <chr>           <int> <int>
#1 Hughson.George     54     4
#2 Ifran.Dean         51     3
#3 Houston.Amanda     49     6

Data

str1 <- c("Hughson.George_54_4", "Ifran.Dean_51_3", "Houston.Amanda_49_6")

Answer 2

You are looking for:

read.table(text = string, sep ='_', col.names = c('Author', 'Volume', 'Issue'))

          Author Volume Issue
1 Hughson.George     54     4
2     Ifran.Dean     51     3
3 Houston.Amanda     49     6

where

string <- c("Hughson.George_54_4", "Ifran.Dean_51_3", "Houston.Amanda_49_6")

Edit: you are looking for (fill = TRUE allows rows of unequal length, adding blank fields as needed):

 read.table(text = string, sep ='_', fill=TRUE)
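
To illustrate what fill = TRUE changes, here is a small sketch. The `ragged` vector is a hypothetical input in which one file name lacks the issue field; it is not part of the original question.

```r
# Hypothetical input: the second name has no issue field
ragged <- c("Hughson.George_54_4", "Ifran.Dean_51", "Houston.Amanda_49_6")

# With fill = TRUE the short row is padded, so the missing Issue becomes NA
parsed <- read.table(text = ragged, sep = "_", fill = TRUE,
                     col.names = c("Author", "Volume", "Issue"))
print(parsed)
```

Without fill = TRUE, read.table should stop with an error on the short line instead of padding it.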

Answer 3

The [^a-z] part of your regular expression matches the _ before the last digit. Just use a pattern that matches only the trailing digits:

library(stringr)
x1 <- c("Hughson.George_54_4", "Ifran.Dean_51_3", "Houston.Amanda_49_6")

str_extract(x1,"([^_]+$)")
[1] "4" "3" "6"

# the backslash must be doubled inside an R string
str_extract(x1,"\\d+$")
[1] "4" "3" "6"
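
The effect of the leading [^a-z] can be checked directly. The sketch below uses base R's regmatches with perl = TRUE so that it runs without extra packages; `first_match` is a hypothetical helper name, and the lookbehind pattern is one possible alternative, not something from the question. stringr's str_extract should behave the same way on these patterns.

```r
files <- c("Hughson.George_54_4", "Ifran.Dean_51_3", "Houston.Amanda_49_6")

# Hypothetical helper: return the first match of a PCRE pattern in each string
first_match <- function(pattern, x) {
  regmatches(x, regexpr(pattern, x, perl = TRUE))
}

# Pattern from the question: the leading [^a-z] consumes the underscore
with_underscore <- first_match("[^a-z](?:[^_]+_){0}([^_ ]+$)", files)

# Lookbehind variant: require a preceding "_" without including it in the match
issue_only <- first_match("(?<=_)[^_]+$", files)

print(with_underscore)  # "_4" "_3" "_6"
print(issue_only)       # "4" "3" "6"
```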

Though your overall goal looks like a job for strsplit:

data.frame(do.call("rbind",strsplit(sub("\\."," ",x1),"_")))
              X1 X2 X3
1 Hughson George 54  4
2     Ifran Dean 51  3
3 Houston Amanda 49  6
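
The strsplit result above has default column names and character columns. A minimal sketch of one way to get the Author, Volume, Issue layout asked for in the question, with numeric columns, might look like this (the names `parts` and `result` are chosen here for illustration, and the author string is left unmodified):

```r
files <- c("Hughson.George_54_4", "Ifran.Dean_51_3", "Houston.Amanda_49_6")

# Split each name on "_" and stack the pieces into a character matrix
parts <- do.call(rbind, strsplit(files, "_", fixed = TRUE))

# Name the columns and convert the numeric fields explicitly
result <- data.frame(
  Author = parts[, 1],
  Volume = as.integer(parts[, 2]),
  Issue  = as.integer(parts[, 3]),
  stringsAsFactors = FALSE
)
str(result)
```

This assumes every file name contains exactly two underscores; rows with a different number of fields would need separate handling.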


r

special-characters

rstudio

stringr
