RStudio 中的拆分扬声器和对话
Split Speaker and Dialogue in RStudio
我有这样的文件:
President Dr. Norbert Lammert: I declare the session open.
I will now give the floor to Bundesminister Alexander Dobrindt.
(Applause of CDU/CSU and delegates of the SPD)
Alexander Dobrindt, Minister for Transport and Digital Infrastructure:
Ladies and Gentleman. We will today start the biggest investment in infrastructure that ever existed, with over 270 billion Euro, over 1 000 projects and a clear financing perspective.
(Volker Kauder [CDU/CSU]: Genau!)
(Applause of the CDU/CSU and the SPD)
当我阅读那些 .txt 文档时,我想创建第二列来指示演讲者姓名。
所以我尝试的是首先创建所有可能名称的列表并替换它们..
library(qdap)
members <- c("Alexander Dobrindt, Minister for Transport and Digital Infrastructure:","President Dr. Norbert Lammert:")
members_r <- c("@Alexander Dobrindt, Minister for Transport and Digital Infrastructure:","@President Dr. Norbert Lammert:")
prok <- scan(".txt", what = "character", sep = "\n")
prok <- mgsub(members,members_r,prok)
prok <- as.data.frame(prok)
prok$speaker <- grepl("@[^\@:]*:",prok$prok, ignore.case = T)
然后我的计划是如果 speaker == true 通过正则表达式获取 @ 和 : 之间的名称并向下应用它直到有不同的名称(显然删除所有 applause/shout 括号),但这是还有我不确定该怎么做的地方。
方法如下:
require (qdap)
#text is the document text
# remove round brackets and text b/w ()
a <- bracketX(text, "round")
names <- c("President Dr. Norbert Lammert","Alexander Dobrindt" )
searchString <- paste(names[1],names[2], sep = ".+")
# Get string from names[1] till names[2] with the help of searchString
string <- regmatches(a, regexpr(searchString, a))
# remove names[2] from string
string <- gsub(names[2],"",string)
当名字超过2个时,这段代码可以循环
这似乎有效
library(qdap)
members <- c("Alexander Dobrindt, Minister for Transport and Digital Infrastructure:","President Dr. Norbert Lammert:")
members_r <- c("@Alexander Dobrindt, Minister for Transport and Digital Infrastructure:","@President Dr. Norbert Lammert:")
testprok <- read.table("txt",header=FALSE,quote = "\"",comment.char="",sep="\t")
testprok$V1 <- mgsub(members,members_r,testprok$V1)
testprok$V2 <- ifelse(grepl("@[^\@:]*:",testprok$V1),testprok$V1,NA)
####function from
repeat.before = function(x) { # repeats the last non NA value. Keeps leading NA
ind = which(!is.na(x)) # get positions of nonmissing values
if(is.na(x[1])) # if it begins with a missing, add the
ind = c(1,ind) # first position to the indices
rep(x[ind], times = diff( # repeat the values at these indices
c(ind, length(x) + 1) )) # diffing the indices + length yields how often
} # they need to be repeated
testprok$V2 = repeat.before(testprok$V2)
这是一种严重依赖 dplyr
的方法。
首先,我在您的示例文本中添加了一个句子来说明为什么我们不能只使用冒号来识别说话人姓名。
sampleText <-
"President Dr. Norbert Lammert: I declare the session open.
I will now give the floor to Bundesminister Alexander Dobrindt.
(Applause of CDU/CSU and delegates of the SPD)
Alexander Dobrindt, Minister for Transport and Digital Infrastructure:
Ladies and Gentleman. We will today start the biggest investment in infrastructure that ever existed, with over 270 billion Euro, over 1 000 projects and a clear financing perspective.
(Volker Kauder [CDU/CSU]: Genau!)
(Applause of the CDU/CSU and the SPD)
This sentence right here: it is an example of a problem"
然后我拆分文本以模拟您正在阅读的格式(这还将每个演讲都放在列表的一部分中)。
splitText <- strsplit(sampleText, "\n")
然后,我将所有潜在的演讲者(冒号之前的任何内容)拉出
allSpeakers <- lapply(splitText, function(thisText){
grep(":", thisText, value = TRUE) %>%
gsub(":.*", "", .) %>%
gsub("\(", "", .)
}) %>%
unlist() %>%
unique()
这给了我们:
[1] "President Dr. Norbert Lammert"
[2] "Alexander Dobrindt, Minister for Transport and Digital Infrastructure"
[3] "Volker Kauder [CDU/CSU]"
[4] "This sentence right here"
显然,最后一个不是合法名称,因此应从我们的发言人名单中排除:
legitSpeakers <-
allSpeakers[-4]
现在,我们准备好完成演讲了。我在下面包含了逐步评论,而不是在此处用文字描述
speechText <- lapply(splitText, function(thisText){
# Remove applause and interjections (things in parentheses)
# along with any blank lines; though you could leave blanks if you want
cleanText <-
grep("(^\(.*\)$)|(^$)", thisText
, value = TRUE, invert = TRUE)
# Split each line by a semicolor
strsplit(cleanText, ":") %>%
lapply(function(x){
# Check if the first element is a legit speaker
if(x[1] %in% legitSpeakers){
# If so, set the speaker, and put the statement in a separate portion
# taking care to re-collapse any breaks caused by additional colons
out <- data.frame(speaker = x[1]
, text = paste(x[-1], collapse = ":"))
} else{
# If not a legit speaker, set speaker to NA and reset text as above
out <- data.frame(speaker = NA
, text = paste(x, collapse = ":"))
}
# Return whichever version we made above
return(out)
}) %>%
# Bind all of the rows together
bind_rows %>%
# Identify clusters of speech that go with a single speaker
mutate(speakingGroup = cumsum(!is.na(speaker))) %>%
# Group by those clusters
group_by(speakingGroup) %>%
# Collapse that speaking down into a single row
summarise(speaker = speaker[1]
, fullText = paste(text, collapse = "\n"))
})
这会产生
[[1]]
speakingGroup speaker fullText
1 President Dr. Norbert Lammert I declare the session open.\nI will now give the floor to Bundesminister Alexander Dobrindt.
2 Alexander Dobrindt, Minister for Transport and Digital Infrastructure Ladies and Gentleman. We will today start the biggest investment in infrastructure that ever existed, with over 270 billion Euro, over 1 000 projects and a clear financing perspective.\nThis sentence right here: it is an example of a problem
如果您希望将每一行文本分开,请将末尾的 summarise
替换为 mutate(speaker = speaker[1])
,您将得到每行语音一行,如下所示:
speaker text speakingGroup
President Dr. Norbert Lammert I declare the session open. 1
President Dr. Norbert Lammert I will now give the floor to Bundesminister Alexander Dobrindt. 1
Alexander Dobrindt, Minister for Transport and Digital Infrastructure 2
Alexander Dobrindt, Minister for Transport and Digital Infrastructure Ladies and Gentleman. We will today start the biggest investment in infrastructure that ever existed, with over 270 billion Euro, over 1 000 projects and a clear financing perspective. 2
Alexander Dobrindt, Minister for Transport and Digital Infrastructure This sentence right here: it is an example of a problem 2
我有这样的文件:
President Dr. Norbert Lammert: I declare the session open.
I will now give the floor to Bundesminister Alexander Dobrindt.
(Applause of CDU/CSU and delegates of the SPD)
Alexander Dobrindt, Minister for Transport and Digital Infrastructure:
Ladies and Gentleman. We will today start the biggest investment in infrastructure that ever existed, with over 270 billion Euro, over 1 000 projects and a clear financing perspective.
(Volker Kauder [CDU/CSU]: Genau!)
(Applause of the CDU/CSU and the SPD)
当我阅读那些 .txt 文档时,我想创建第二列来指示演讲者姓名。
所以我尝试的是首先创建所有可能名称的列表并替换它们..
library(qdap)
members <- c("Alexander Dobrindt, Minister for Transport and Digital Infrastructure:","President Dr. Norbert Lammert:")
members_r <- c("@Alexander Dobrindt, Minister for Transport and Digital Infrastructure:","@President Dr. Norbert Lammert:")
prok <- scan(".txt", what = "character", sep = "\n")
prok <- mgsub(members,members_r,prok)
prok <- as.data.frame(prok)
prok$speaker <- grepl("@[^\@:]*:",prok$prok, ignore.case = T)
然后我的计划是如果 speaker == true 通过正则表达式获取 @ 和 : 之间的名称并向下应用它直到有不同的名称(显然删除所有 applause/shout 括号),但这是还有我不确定该怎么做的地方。
方法如下:
require (qdap)
#text is the document text
# remove round brackets and text b/w ()
a <- bracketX(text, "round")
names <- c("President Dr. Norbert Lammert","Alexander Dobrindt" )
searchString <- paste(names[1],names[2], sep = ".+")
# Get string from names[1] till names[2] with the help of searchString
string <- regmatches(a, regexpr(searchString, a))
# remove names[2] from string
string <- gsub(names[2],"",string)
当名字超过2个时,这段代码可以循环
这似乎有效
library(qdap)
members <- c("Alexander Dobrindt, Minister for Transport and Digital Infrastructure:","President Dr. Norbert Lammert:")
members_r <- c("@Alexander Dobrindt, Minister for Transport and Digital Infrastructure:","@President Dr. Norbert Lammert:")
testprok <- read.table("txt",header=FALSE,quote = "\"",comment.char="",sep="\t")
testprok$V1 <- mgsub(members,members_r,testprok$V1)
testprok$V2 <- ifelse(grepl("@[^\@:]*:",testprok$V1),testprok$V1,NA)
####function from
repeat.before = function(x) { # repeats the last non NA value. Keeps leading NA
ind = which(!is.na(x)) # get positions of nonmissing values
if(is.na(x[1])) # if it begins with a missing, add the
ind = c(1,ind) # first position to the indices
rep(x[ind], times = diff( # repeat the values at these indices
c(ind, length(x) + 1) )) # diffing the indices + length yields how often
} # they need to be repeated
testprok$V2 = repeat.before(testprok$V2)
这是一种严重依赖 dplyr
的方法。
首先,我在您的示例文本中添加了一个句子来说明为什么我们不能只使用冒号来识别说话人姓名。
sampleText <-
"President Dr. Norbert Lammert: I declare the session open.
I will now give the floor to Bundesminister Alexander Dobrindt.
(Applause of CDU/CSU and delegates of the SPD)
Alexander Dobrindt, Minister for Transport and Digital Infrastructure:
Ladies and Gentleman. We will today start the biggest investment in infrastructure that ever existed, with over 270 billion Euro, over 1 000 projects and a clear financing perspective.
(Volker Kauder [CDU/CSU]: Genau!)
(Applause of the CDU/CSU and the SPD)
This sentence right here: it is an example of a problem"
然后我拆分文本以模拟您正在阅读的格式(这还将每个演讲都放在列表的一部分中)。
splitText <- strsplit(sampleText, "\n")
然后,我将所有潜在的演讲者(冒号之前的任何内容)拉出
allSpeakers <- lapply(splitText, function(thisText){
grep(":", thisText, value = TRUE) %>%
gsub(":.*", "", .) %>%
gsub("\(", "", .)
}) %>%
unlist() %>%
unique()
这给了我们:
[1] "President Dr. Norbert Lammert"
[2] "Alexander Dobrindt, Minister for Transport and Digital Infrastructure"
[3] "Volker Kauder [CDU/CSU]"
[4] "This sentence right here"
显然,最后一个不是合法名称,因此应从我们的发言人名单中排除:
legitSpeakers <-
allSpeakers[-4]
现在,我们准备好完成演讲了。我在下面包含了逐步评论,而不是在此处用文字描述
speechText <- lapply(splitText, function(thisText){
# Remove applause and interjections (things in parentheses)
# along with any blank lines; though you could leave blanks if you want
cleanText <-
grep("(^\(.*\)$)|(^$)", thisText
, value = TRUE, invert = TRUE)
# Split each line by a semicolor
strsplit(cleanText, ":") %>%
lapply(function(x){
# Check if the first element is a legit speaker
if(x[1] %in% legitSpeakers){
# If so, set the speaker, and put the statement in a separate portion
# taking care to re-collapse any breaks caused by additional colons
out <- data.frame(speaker = x[1]
, text = paste(x[-1], collapse = ":"))
} else{
# If not a legit speaker, set speaker to NA and reset text as above
out <- data.frame(speaker = NA
, text = paste(x, collapse = ":"))
}
# Return whichever version we made above
return(out)
}) %>%
# Bind all of the rows together
bind_rows %>%
# Identify clusters of speech that go with a single speaker
mutate(speakingGroup = cumsum(!is.na(speaker))) %>%
# Group by those clusters
group_by(speakingGroup) %>%
# Collapse that speaking down into a single row
summarise(speaker = speaker[1]
, fullText = paste(text, collapse = "\n"))
})
这会产生
[[1]]
speakingGroup speaker fullText
1 President Dr. Norbert Lammert I declare the session open.\nI will now give the floor to Bundesminister Alexander Dobrindt.
2 Alexander Dobrindt, Minister for Transport and Digital Infrastructure Ladies and Gentleman. We will today start the biggest investment in infrastructure that ever existed, with over 270 billion Euro, over 1 000 projects and a clear financing perspective.\nThis sentence right here: it is an example of a problem
如果您希望将每一行文本分开,请将末尾的 summarise
替换为 mutate(speaker = speaker[1])
,您将得到每行语音一行,如下所示:
speaker text speakingGroup
President Dr. Norbert Lammert I declare the session open. 1
President Dr. Norbert Lammert I will now give the floor to Bundesminister Alexander Dobrindt. 1
Alexander Dobrindt, Minister for Transport and Digital Infrastructure 2
Alexander Dobrindt, Minister for Transport and Digital Infrastructure Ladies and Gentleman. We will today start the biggest investment in infrastructure that ever existed, with over 270 billion Euro, over 1 000 projects and a clear financing perspective. 2
Alexander Dobrindt, Minister for Transport and Digital Infrastructure This sentence right here: it is an example of a problem 2