String split in R based on certain criteria
I am working with strings like the following:
ID Col1
------------------------------------------------------------------------------------
11 GLIPIZIDE 10 MG TAB 1 TABLET PO QAM
23 GLIPIZIDE 5 MG TAB 2 TABLETS PO BID
32 GLIPIZIDE TAB PO
12 GLIPIZIDE TAB PO PRN
343 PIOGLITAZONE [ACTOS] 45 MG TAB 1 TABLET PO DAILY #3 MONTHS SUPPLY REFILL X3
31 METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3
44 METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #400 TABLETS REFILL X3
34 METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3
38 METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3
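There are two things I want to accomplish:
1) Store the first word in a new column (Col2)
2) Search for the term "MG" and capture the number immediately before it,
together with "MG", and store that in a new column (Col3)
Continuing the example, the final output should look like this: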
Id Col2 Col3
---------------------------------
11 GLIPIZIDE 10 MG
23 GLIPIZIDE 5 MG
32 GLIPIZIDE
12 GLIPIZIDE
343 PIOGLITAZONE 45 MG
31 METFORMIN 500 MG
44 METFORMIN 500 MG
34 METFORMIN 500 MG
38 METFORMIN 500 MG
Any help with this would be much appreciated.
Data
dd <- read.table(header = TRUE, stringsAsFactors = FALSE, text="ID Col1
11 'GLIPIZIDE 10 MG TAB 1 TABLET PO QAM'
23 'GLIPIZIDE 5 MG TAB 2 TABLETS PO BID'
32 'GLIPIZIDE TAB PO'
12 'GLIPIZIDE TAB PO PRN'
343 'PIOGLITAZONE [ACTOS] 45 MG TAB 1 TABLET PO DAILY #3 MONTHS SUPPLY REFILL X3'
31 'METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3'
44 'METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #400 TABLETS REFILL X3'
34 'METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3'
38 'METFORMIN [GLUCOPHAGE XR] 500 MG TAB SR 24HR 2 TABLETS PO DAILY #200 TABLETS REFILL X3'")
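One option is to use two regular expressions: 1) capture the first word at the start of the string (^\w+) and 2) find a number followed by "mg" (\d+ mg). Inside an R string literal each backslash has to be doubled.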
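within(dd, {
  # "(pattern)|." keeps the captured text and drops every other character
  col1 <- gsub('(^\\w+)|.', '\\1', Col1, perl = TRUE)
  # (?i) makes the match case-insensitive, so "mg" also matches "MG"
  dose <- gsub('(?i)(\\d+ mg)|.', '\\1', Col1, perl = TRUE)
})[, c('col1','dose')]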
# col1 dose
# 1 GLIPIZIDE 10 MG
# 2 GLIPIZIDE 5 MG
# 3 GLIPIZIDE
# 4 GLIPIZIDE
# 5 PIOGLITAZONE 45 MG
# 6 METFORMIN 500 MG
# 7 METFORMIN 500 MG
# 8 METFORMIN 500 MG
# 9 METFORMIN 500 MG
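To see why the "(pattern)|." idiom works, here is a minimal sketch on a single string from the question. The variable names x, first_word and dose are made up for illustration. Every character either belongs to the captured group, which is replaced by itself, or is matched by "." and replaced by an empty backreference.

```r
# One sample string based on the question's data
x <- "PIOGLITAZONE [ACTOS] 45 MG TAB 1 TABLET PO DAILY"

# Branch 1 captures the leading word; branch 2 (".") consumes every other
# character. "\\1" is empty whenever the "." branch matched, so only the
# captured text survives the replacement.
first_word <- gsub("(^\\w+)|.", "\\1", x, perl = TRUE)

# Same idea for the dose; (?i) lets "mg" match "MG" as well.
dose <- gsub("(?i)(\\d+ mg)|.", "\\1", x, perl = TRUE)

first_word  # "PIOGLITAZONE"
dose        # "45 MG"
```

Note that "1 TABLET" is not picked up as a dose, because the digits must be followed by a space and "mg".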
Here is an approach using stringi.
library(stringi)
# backslashes must be doubled inside R string literals
ss <- stri_extract_all_regex(dd$Col1, "(?i)(^\\w+)|(\\d+ mg)", simplify = TRUE)
setNames(cbind(dd[1], ss), c("ID", "Col2", "Col3"))
# ID Col2 Col3
# 1 11 GLIPIZIDE 10 MG
# 2 23 GLIPIZIDE 5 MG
# 3 32 GLIPIZIDE
# 4 12 GLIPIZIDE
# 5 343 PIOGLITAZONE 45 MG
# 6 31 METFORMIN 500 MG
# 7 44 METFORMIN 500 MG
# 8 34 METFORMIN 500 MG
# 9 38 METFORMIN 500 MG
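If adding a package is not an option, roughly the same extraction can be sketched in base R with regexpr() and regmatches(). This is only an illustration, not part of the answers above: the helper name extract_first and the two-element col1 vector are assumptions. regmatches() returns only the strings that matched, so the helper fills in "" for rows without a match.

```r
# Two sample strings from the question: one with a dose, one without
col1 <- c("GLIPIZIDE 10 MG TAB 1 TABLET PO QAM", "GLIPIZIDE TAB PO")

# Hypothetical helper: first match of `pattern` in each element of `x`,
# or "" where the pattern does not match at all.
extract_first <- function(pattern, x) {
  m <- regexpr(pattern, x, perl = TRUE)  # match position, -1 if no match
  out <- rep("", length(x))
  out[m != -1] <- regmatches(x, m)       # regmatches() skips non-matches
  out
}

res <- data.frame(Col2 = extract_first("^\\w+", col1),
                  Col3 = extract_first("(?i)\\d+ mg", col1),
                  stringsAsFactors = FALSE)
res
```

Unlike the gsub() idiom, this keeps only the first match per string, which is enough here because each row contains at most one dose.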