Clean tags from SEC Edgar filings in readtext and quanteda
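I am trying to read .txt files into R using readtext and quanteda. I parsed the files from the SEC Edgar database of publicly listed firm filings. An example of the .txt file is here and a more user friendly version is here for comparison (PG&E during the California wildfires).
My code is as follows, for a 1996 folder containing many .txt files: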
directory<-("D:")
text <- readtext(paste0(directory,"/1996/*.txt"))
corpus<-corpus(text)
dfm<-dfm(corpus,tolower=TRUE,stem=TRUE,remove=stopwords("english"),remove_punct=TRUE)
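I noticed that the dfm still contains a lot of useless tokens such as 'font-style' and 'italic', and at the end many more useless tokens such as '3eyn' and 'kq', which I assume are part of the .jpg section at the bottom of the .txt files.
The problem persists when I specify an encoding while reading the documents with readtext, for example: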
text<-readtext(paste0(directory,"/*.txt"),encoding="UTF-8")
text<-readtext(paste0(directory,"/*.txt"),encoding="ASCII")
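Any help on how to clean these files so that they look more like the user friendly version above (i.e. containing only the main text) is much appreciated.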
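The key here is to find the marker in the text that indicates the start of the text you want, and the marker that indicates where it ends. This can be a set of conditions separated by | in a regular expression. Nothing before the first marker is retained (by default), and you can remove the text following the ending marker by dropping it from the corpus using corpus_subset(). The actual patterns will no doubt need tweaking once you discover the variety of patterns in your actual data.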
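To see the mechanics offline before touching a real filing, here is a minimal sketch on a made-up string. The text in `txt` is hypothetical, and I am assuming a reasonably recent quanteda in which `as.character()` works on a corpus. corpus_segment() stores the matched marker in a docvar called `pattern`, which is what the corpus_subset() call filters on.

```r
library("quanteda")

# Hypothetical toy "filing": header junk, a start marker, the body, an end marker
txt <- c(doc1 = paste(
  "HEADER junk that we do not want.",
  "Item 8.01 Other Events. The body we want.",
  "SIGNATURES Signed by someone."
))
corp_toy <- corpus(txt)

# Split at either marker; by default each segment starts right after its marker
segs <- corpus_segment(corp_toy,
  pattern = "Item 8\\.01 Other Events\\.|SIGNATURES",
  valuetype = "regex"
)

# The marker that opened each segment is kept in the "pattern" docvar
print(docvars(segs, "pattern"))

# Keep only the segment that does not follow the end marker
body <- corpus_subset(segs, pattern != "SIGNATURES")
print(as.character(body))
```

The header text never shows up in `segs` because it sits before the first marker, so only the end-marker segment has to be dropped explicitly.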
Here is how I did it for your sample document:
library("quanteda")
## Package version: 2.0.0
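corp <- readtext::readtext("https://www.sec.gov/Archives/edgar/data/75488/000114036117038612/0001140361-17-038612.txt") %>%
  corpus()
# clean up text: strip tags and numeric character entities, then decode &amp;
corp <- gsub("<.*?>|&#\\d+;", "", corp)
corp <- gsub("&amp;", "&", corp)
# split at the start and end markers; text before the first marker is dropped
corp <- corpus_segment(corp,
  pattern = "Item 8\\.01 Other Events\\.|SIGNATURES",
  valuetype = "regex"
) %>%
  # discard the segment that follows the SIGNATURES marker
  corpus_subset(pattern != "SIGNATURES")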
print(corp, max_nchar = -1)
## Corpus consisting of 1 document and 1 docvar.
## 0001140361-17-038612.txt.1 :
## "Investigation of Northern California Fires Since October 8, 2017, several catastrophic wildfires have started and remain active in Northern California. The causes of these fires are being investigated by the California Department of Forestry and Fire Protection (Cal Fire), including the possible role of power lines and other facilities of Pacific Gas and Electric Companys (the Utility), a subsidiary of PG&E Corporation. It currently is unknown whether the Utility would have any liability associated with these fires. The Utility has approximately $800 million in liability insurance for potential losses that may result from these fires. If the amount of insurance is insufficient to cover the Utility's liability or if insurance is otherwise unavailable, PG&E Corporations and the Utilitys financial condition or results of operations could be materially affected."
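For the '3eyn' and 'kq' style junk from the image section, one option is to drop the embedded image documents before stripping tags. The sketch below is base R only and rests on an assumption about the file layout: that each attachment sits in its own <DOCUMENT> ... </DOCUMENT> block and that images are marked with <TYPE>GRAPHIC right after the opening tag. The `raw` string and the `clean_filing()` helper are made up for illustration, so check the layout against your own files.

```r
# Hypothetical miniature of a full-submission file: one HTML document
# followed by a uuencoded image document (layout is an assumption)
raw <- paste(
  "<DOCUMENT>", "<TYPE>8-K", "<TEXT>",
  "<p style=\"font-style: italic\">PG&amp;E Corporation&#8217;s report</p>",
  "</TEXT>", "</DOCUMENT>",
  "<DOCUMENT>", "<TYPE>GRAPHIC", "<TEXT>",
  "begin 644 logo.jpg", "M3EYN KQ", "end",
  "</TEXT>", "</DOCUMENT>",
  sep = "\n"
)

# Illustrative helper, not part of readtext or quanteda
clean_filing <- function(x) {
  # 1. drop whole <DOCUMENT> blocks whose type is GRAPHIC (encoded images)
  x <- gsub("(?s)<DOCUMENT>\\s*<TYPE>GRAPHIC.*?</DOCUMENT>", "", x, perl = TRUE)
  # 2. strip the remaining tags (including style attributes) and numeric entities
  x <- gsub("(?s)<.*?>|&#\\d+;", "", x, perl = TRUE)
  # 3. decode the ampersand entity
  x <- gsub("&amp;", "&", x, fixed = TRUE)
  # 4. collapse runs of whitespace and trim the ends
  trimws(gsub("\\s+", " ", x))
}

cleaned <- clean_filing(raw)
print(cleaned)
```

The order matters: encoded image data can itself contain < and > characters, so removing those blocks first keeps the tag regex from chewing through them unpredictably. The value left over from the <TYPE> line ("8-K" here) survives this step, but it sits before the start marker and so would be dropped by the corpus_segment() step shown earlier.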