Text Mining in R - Remove Rows from Text File Starting With Keywords
I am reading a text file into R as follows:
test<-readLines("D:/AAPL MSFT Earnings Calls/Test/Test.txt")
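The file was converted from a PDF and retains some header data that I want to remove. These lines start with words such as "Page," and "Market Cap,".
How can I remove every line in the TXT file that starts with one of these keywords? This is as opposed to removing lines that contain the word.
Using one of the answers below, I modified it slightly to read in my file: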
setwd("C:/Users/George/Google Drive/PhD/Strategic agility/Source Data/Peripherals Earnings Calls 2016")
text1<-readLines("test.txt")
text
library(purrr)
library(stringr)
text1 <- "foo
Page, bar
baz
Market Cap, qux"
text1 <- readLines(con = textConnection(file))
ignore_patterns <- c("^Page,", "^Market\\s+Cap,")
text1 %>% discard(~ any(str_detect(.x, ignore_patterns)))
text1
This is the output I get:
> text1
[1] "foo" "Page, bar" "baz" "Market Cap, qux"
What are the foo/baz/qux strings? Thanks.
library(purrr)
library(stringr)
file <- "foo
Page, bar
baz
Market Cap, qux"
test <- readLines(con = textConnection(file))
ignore_patterns <- c("^Page,", "^Market\\s+Cap,")
test %>% discard(~ any(str_detect(.x, ignore_patterns)))
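Note that discard() returns a new vector rather than changing its input, so the result has to be assigned for the filtering to persist. The foo/baz/qux lines appear to be placeholder sample text standing in for the contents of a real file. Below is a minimal base R sketch of the same idea, not the answer's own code: the names lines, is_header, cleaned and out_path are made up for illustration, and the output goes to a temporary file rather than any real path.

```r
# Placeholder lines standing in for the contents of a real file.
lines <- c("foo", "Page, bar", "baz", "Market Cap, qux")

# Note the doubled backslash: inside an R string, "\\s" is the regex \s.
ignore_patterns <- c("^Page,", "^Market\\s+Cap,")

# Test every line against each pattern, then OR the results together,
# giving TRUE for any line that matches at least one pattern.
is_header <- Reduce(`|`, lapply(ignore_patterns, grepl, x = lines))

# Subsetting returns a new vector; assign it, or `lines` stays unchanged.
cleaned <- lines[!is_header]

# Write the filtered lines to a (hypothetical) temporary output file.
out_path <- tempfile(fileext = ".txt")
writeLines(cleaned, out_path)
```

With real data, lines would come from readLines() on the actual file instead of the hard-coded vector.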
# Once the text has been read and stored in a data.frame,
# subset it as follows:
x <- grepl("^(Page|Market Cap)", df$id)  # df is your data.frame; 'id' is the
                                         # column that holds the unwanted keywords
df <- df[!x, ]                           # keep only the rows that do not match
^ anchors the match to the start of the string. So if a line starts with Page or (|) Market Cap, grepl returns TRUE.
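To illustrate the difference between "starts with" and "contains", here is a small sketch on a plain character vector, as returned by readLines(). The sample lines are invented for the example, and startsWith() is shown only as a possible regex-free alternative in base R.

```r
# Invented sample lines; the second one contains "Page," but does not start with it.
lines <- c("Page, 3 of 12",
           "See Page, 4 for details",
           "Market Cap, 2.1T",
           "baz")

# With ^ the pattern only matches at the start of each line.
starts_with_kw <- grepl("^(Page|Market Cap),", lines)

# Without ^ the pattern matches anywhere in the line.
contains_kw <- grepl("(Page|Market Cap),", lines)

# Keep only the lines that do not start with a keyword.
kept <- lines[!starts_with_kw]

# A regex-free way to express the same "starts with" test.
starts_with_alt <- startsWith(lines, "Page,") | startsWith(lines, "Market Cap,")
```

Here the line "See Page, 4 for details" is kept, because the keyword is not at the beginning; dropping the ^ would remove it as well.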