Extract Strings from unstructured open-question [R]

Question

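I am trying to extract some strings from a variable that contains free text (so the variable is not structured at all). My goal is to extract the three answers that respondents gave to each of the three questions asked. Even though the content is messy, the variable can be broken down as follows:

NAME Surname, dd/mm/yyyy hh:mm:ss Q1 Q2 Q3

Some respondents did use markers, for example: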

SMITH John at 10:01:36 1° I agree blabla 2° My opinion about blabla 3° No

MAC DONALD Ronald at 02:26:02 1) I disagree but blabla 2) I'm 100% positive that blabla 3) yes

CURIE Mary at 11:00:56 - I cannot say that blabla - Definitely yes - not applicable

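As you can see, the markers (1°, 1), -), when they are used at all, differ from one respondent to the next. The answers themselves are very diverse.

Hence my question: is it possible to cope with these imperfections in the text, pull out each answer, and put it into its own variable? If not, would it be possible if every answer were structured the same way (I mean, using the same markers)?

Thank you very much for any clues you can provide.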

Answer 1

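If the data comes from a file, replace textConnection(Lines) below with your file name. We replace the separators with semicolons; if your input contains any semicolons, choose a different character instead.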
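Whether a semicolon is safe can be checked on the raw lines before any replacement is done. The following is only a sketch: the temporary file stands in for your own file, and pick_sep() and its list of candidate characters are hypothetical names introduced here for illustration, not part of the answer's code.

```r
# Sample input written to a temporary file so the sketch is self-contained;
# in practice `path` would be the name of your own file.
sample_lines <- c(
  "MAC DONALD Ronald at 02:26:02 1) I disagree but blabla 2) I'm 100% positive that blabla 3) yes",
  "CURIE Mary at 11:00:56 - I cannot say that blabla - Definitely yes - not applicable"
)
path <- tempfile(fileext = ".txt")
writeLines(sample_lines, path)

# Reading from a file: pass the file name instead of textConnection(Lines).
L <- readLines(path)

# pick_sep() is a hypothetical helper: it returns the first candidate
# character that does not occur anywhere in the input lines.
pick_sep <- function(lines, candidates = c(";", "|", "\t", "~")) {
  for (ch in candidates) {
    if (!any(grepl(ch, lines, fixed = TRUE))) return(ch)
  }
  stop("every candidate separator occurs in the input")
}

sep <- pick_sep(L)  # character to use as the replacement in gsub() and as sep= later
```

The chosen character would then be used both as the replacement in gsub and as the sep argument of count.fields and read.table.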

# input
Lines <- r"{
SMITH John at 10:01:36 1° I agree blabla 2° My opinion about blabla 3° No
MAC DONALD Ronald at 02:26:02 1) I disagree but blabla 2) I'm 100% positive that blabla 3) yes
CURIE Mary at 11:00:56 - I cannot say that blabla - Definitely yes - not applicable}"

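L <- readLines(textConnection(Lines))  # read it into separate lines
L <- L[nzchar(L)]  # drop empty lines; count.fields skips them, so indices would not line up

# replace separators with semicolon (backslashes doubled inside the R string);
# note that "-" also matches any hyphen inside an answer
L2 <- gsub("\\d+°|\\d+\\)|-", ";", L)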

# if we included all separators then this should emit character(0)
num <- count.fields(textConnection(L2), sep = ";", quote = "")
L2[num != 4]
## character(0)

# if there were some separators that were not included in the gsub 
# we will see the offending lines above and can add their separators to gsub pattern
# above and then rerun everything from the gsub line onwards with modified pattern

# finally if we have included all the separators we can read it in
read.table(text = L2, sep = ";", strip.white = TRUE, quote = "")

giving:

                             V1                       V2                            V3             V4
1        SMITH John at 10:01:36           I agree blabla       My opinion about blabla             No
2 MAC DONALD Ronald at 02:26:02    I disagree but blabla I'm 100% positive that blabla            yes
3        CURIE Mary at 11:00:56 I cannot say that blabla                Definitely yes not applicable
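Since the question asks for each answer in its own variable, one possible follow-up step is to name the columns and split V1 into a name and a time. This is a sketch under assumptions: it assumes V1 always looks like "<name> at hh:mm:ss" (with "at" optional and no date present, as in the sample), and the names DF, out, name, time, Q1, Q2 and Q3 are chosen here for illustration. The data frame is rebuilt by hand so the block runs on its own.

```r
# Result of the read.table() step, rebuilt here so the sketch is self-contained.
DF <- data.frame(
  V1 = c("SMITH John at 10:01:36", "MAC DONALD Ronald at 02:26:02", "CURIE Mary at 11:00:56"),
  V2 = c("I agree blabla", "I disagree but blabla", "I cannot say that blabla"),
  V3 = c("My opinion about blabla", "I'm 100% positive that blabla", "Definitely yes"),
  V4 = c("No", "yes", "not applicable"),
  stringsAsFactors = FALSE
)

# Extract the hh:mm:ss part; rows without a match keep NA.
m <- regexpr("\\d{2}:\\d{2}:\\d{2}", DF$V1)
time <- rep(NA_character_, nrow(DF))
time[m > 0] <- regmatches(DF$V1, m)

# Everything before the optional " at" and the time is taken as the name.
name <- trimws(sub("(\\s+at)?\\s*\\d{2}:\\d{2}:\\d{2}.*$", "", DF$V1))

out <- data.frame(name = name, time = time,
                  Q1 = DF$V2, Q2 = DF$V3, Q3 = DF$V4,
                  stringsAsFactors = FALSE)
```

If the real data also carries a dd/mm/yyyy date before the time, the pattern would need to be extended accordingly.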

