Count Number of Pages per AGENDA - Text Mining in R

Question

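I have to count the number of pages for each agenda item. I have already extracted the text from the PDF document into a data frame, so that each row of the data frame holds the text of one page. This is what my data looks like: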

mydf <- data.frame(text = c("AGENDA ITEM 1
        4", "This particular row contains a lot of text, really its all text present in one page", 
        "So ineffect, one page of text per row", "This is another page of text in this row", 
        "lets include another page for agenda 1", "AGENDA ITEM 2
        9",
        "now all the text in agenda 2 is included here","the 2nd page text of agenda 2", 
        "AGENDA ITEM 3
        12", "Now lets just add one row for this agenda, meaning it only has one page inside it"))

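In an agenda row, the number below the agenda title (still within the same row) is a page number. To count the pages per agenda, I just need to count the rows until the next agenda item appears. For the example above, the answer should be: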

AGENDA ITEM 1 = 4 Pages, AGENDA ITEM 2 = 2 Pages and AGENDA ITEM 3 = 1 Page.

How can I do this? I am quite new to text analysis. Thanks.

Answer 1

If the pattern "AGENDA ITEM ##" does not occur in your ordinary page text, you can use the following grep()-based approach. I hope this works for you.

# row numbers of the rows that match the agenda header pattern
# (the backslash must be doubled inside an R string)
start_rows <- grep("AGENDA ITEM \\d+", mydf$text)

# each agenda section ends one row before the next section starts,
# so drop the first start row and subtract 1;
# the final section ends at the last row
end_rows <- c(start_rows[-1] - 1,
              length(mydf$text))

# the header row itself is not a page, so end - start is the page count
end_rows - start_rows
#[1] 4 2 1
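
The caveat about the pattern turning up in ordinary page text can be softened by anchoring the pattern to the start of the row. The following is only a sketch under that assumption: `count_agenda_pages` and `pages_text` are hypothetical names, not part of the answer, and the grouping via cumsum() is a swapped-in alternative to the start/end row arithmetic above.

```r
# Sample data from the question, one page of text per element.
pages_text <- c(
  "AGENDA ITEM 1\n        4",
  "This particular row contains a lot of text, really its all text present in one page",
  "So ineffect, one page of text per row",
  "This is another page of text in this row",
  "lets include another page for agenda 1",
  "AGENDA ITEM 2\n        9",
  "now all the text in agenda 2 is included here",
  "the 2nd page text of agenda 2",
  "AGENDA ITEM 3\n        12",
  "Now lets just add one row for this agenda, meaning it only has one page inside it"
)

# Hypothetical helper: counts the rows that follow each agenda header.
count_agenda_pages <- function(text) {
  text <- as.character(text)
  # Assumption: a header row begins with "AGENDA ITEM <number>";
  # a mention in the middle of a page is ignored because of the ^ anchor.
  is_header <- grepl("^\\s*AGENDA ITEM \\d+", text)
  # Each header opens a new group; rows before the first header get group 0.
  group <- cumsum(is_header)
  keep <- group > 0
  # Count the non-header rows in each group.
  counts <- tapply(!is_header[keep], group[keep], sum)
  # Label each count with the first line of its header row.
  labels <- trimws(sub("\n.*", "", text[is_header]))
  setNames(as.integer(counts), labels)
}

# Counts are 4, 2 and 1, named by agenda title.
count_agenda_pages(pages_text)
```

With the data frame from the question, the call would be count_agenda_pages(mydf$text).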

Answer 2

You can use grep like this:

mydf <- data.frame(text = c("AGENDA ITEM 1
                            4", "This particular row contains a lot of text, really its all text present in one page", 
                            "So ineffect, one page of text per row", "This is another page of text in this row", 
                            "lets include another page for agenda 1", "AGENDA ITEM 2
                            9",
                            "now all the text in agenda 2 is included here","the 2nd page text of agenda 2", 
                            "AGENDA ITEM 3
                            12", "Now lets just add one row for this agenda, meaning it only has one page inside it"))

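lst <- as.character(mydf$text)
# rows that contain an agenda header
index <- grep(pattern = "AGENDA ITEM", lst)
# append the last row so diff() also covers the final agenda
index <- c(index, length(lst))

pages <- diff(index)
# every gap except the last also counts the agenda's own header row, so subtract 1;
# seq_len() avoids the precedence trap in 1:length(pages)-1, which parses as (1:n) - 1
not_last <- seq_len(length(pages) - 1)
pages[not_last] <- pages[not_last] - 1
pages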

[1] 4 2 1
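
The special handling of the last agenda in the code above can be avoided by appending a sentinel one past the last row, so that every gap contains exactly one header row. This is a minimal sketch of that variation, not the answerer's code; `page_text`, `header_rows` and `page_counts` are names made up for illustration.

```r
# Sample data from the question, one page of text per element.
page_text <- c(
  "AGENDA ITEM 1\n        4",
  "This particular row contains a lot of text, really its all text present in one page",
  "So ineffect, one page of text per row",
  "This is another page of text in this row",
  "lets include another page for agenda 1",
  "AGENDA ITEM 2\n        9",
  "now all the text in agenda 2 is included here",
  "the 2nd page text of agenda 2",
  "AGENDA ITEM 3\n        12",
  "Now lets just add one row for this agenda, meaning it only has one page inside it"
)

# Row numbers of the agenda headers: 1, 6, 9.
header_rows <- grep("AGENDA ITEM", page_text)

# Sentinel: pretend another header sits just after the last row (row 11).
# Each gap then spans one header row plus its pages, so subtract 1 everywhere.
page_counts <- diff(c(header_rows, length(page_text) + 1)) - 1
page_counts
```

Here the gaps are 5, 3 and 2 rows, which gives 4, 2 and 1 pages after the header row is subtracted.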


nlp

r

text-mining