Custom dictionaries in quanteda
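I need to run a LIWC (Linguistic Inquiry and Word Count) analysis, and I am using quanteda/quanteda.dictionaries. I need to "load" custom dictionaries: I saved each word list as a separate .txt file and load it with readLines (example with just one dictionary):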
autonomy = readLines("Dictionary/autonomy.txt", encoding = "UTF-8")
EODic<-quanteda::dictionary(list(autonomy=autonomy),encoding = "auto")
Here is the text I am testing with:
txt <- c("12th Battalion Productions is producing a fully holographic feature length production. Presenting a 3D audio-visual projection without a single cast member present, to give the illusion of live stage performance.")
Then I run it:
liwcalike(txt, EODic, what = "word")
and get this error:
Error in stri_replace_all_charclass(value, "\p{Z}", concatenator) :
invalid UTF-8 byte sequence detected; perhaps you should try calling stri_enc_toutf8()
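Clearly the problem lies with my txt file. I have quite a few dictionaries and would rather load them from files.
How can I fix this error? Specifying the encoding in readLines does not seem to help.
Here is the file: https://drive.google.com/file/d/12plgfJdMawmqTkcLWxD1BfWdaeHuPTXV/view?usp=sharing
Update: on a Mac, the easiest way to solve this is to open the .txt file in Word rather than TextEdit. Word offers different encoding options than the default TextEdit!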
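OK, the problem is not the encoding, since everything in the file you linked can be encoded entirely in the lower 128 ASCII characters. The problem is the empty strings produced by blank lines. There are also some leading spaces that need to be removed. Both are easy to fix with some subsetting and a few stringi cleaning operations.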
library("quanteda")
## Package version: 1.3.14
autonomy <- readLines("~/Downloads/risktaking.txt", encoding = "UTF-8")
head(autonomy, 15)
## [1] "adventuresome" " adventurous" " audacious" " bet"
## [5] " bold" " bold-spirited" " brash" " brave"
## [9] " chance" " chancy" " courageous" " danger"
## [13] "" "dangerous" " dare"
# strip leading or trailing whitespace
autonomy <- stringi::stri_trim_both(autonomy)
# get rid of empties
autonomy <- autonomy[autonomy != ""]
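The same two cleaning steps can be wrapped in a small helper, together with a quick check that the input really is valid UTF-8. This is only a minimal sketch in base R: `clean_wordlist()` is a hypothetical name, not a quanteda function, the input vector is a toy imitation of the file contents, and `trimws()`, `nzchar()` and `validUTF8()` stand in for the stringi calls used above.

```r
# Hypothetical helper (not part of quanteda): trim whitespace and drop empty entries
clean_wordlist <- function(x) {
  x <- trimws(x)   # strip leading and trailing whitespace
  x[nzchar(x)]     # keep only the non-empty strings
}

# Toy input imitating a few lines of the word list shown above
raw <- c("adventuresome", " adventurous", " danger", "", "dangerous", " dare")

# Check that every entry is valid UTF-8 before blaming the encoding
all(validUTF8(raw))

cleaned <- clean_wordlist(raw)
cleaned
```

If `all(validUTF8(...))` were to come back `FALSE` for a real file, the encoding would be worth a second look; for input like the one shown here, the blank entries and leading spaces are the only things to clean up.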
Now you can create the dictionary and apply the quanteda.dictionaries::liwcalike() function.
# now define the quanteda dictionary
EODic <- dictionary(list(autonomy = autonomy))
txt <- c("12th Battalion Productions is producing a fully holographic feature length production. Presenting a 3D audio-visual projection without a single cast member present, to give the illusion of live stage performance.")
library("quanteda.dictionaries")
liwcalike(txt, dictionary = EODic)
## docname Segment WC WPS Sixltr Dic autonomy AllPunc Period Comma Colon
## 1 text1 1 35 15.5 34.29 0 0 11.43 5.71 2.86 0
## SemiC QMark Exclam Dash Quote Apostro Parenth OtherP
## 1 0 0 0 2.86 0 0 0 8.57
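Since the question mentions having many dictionaries stored as separate files, the cleaning can be applied to a whole folder at once. The following is a rough sketch under stated assumptions: `read_wordlist()` is a hypothetical helper, the folder and its two toy files are created in a temporary directory purely for illustration, and each dictionary key is assumed to be the file name without its extension.

```r
# Hypothetical helper (not part of quanteda): read one word list and clean it
read_wordlist <- function(path) {
  words <- trimws(readLines(path, encoding = "UTF-8", warn = FALSE))
  words[nzchar(words)]   # drop the empty strings left by blank lines
}

# Stand-in for a real "Dictionary/" folder: two toy files in a temporary directory
dict_dir <- file.path(tempdir(), "Dictionary")
dir.create(dict_dir, showWarnings = FALSE)
writeLines(c("independent", " free", ""), file.path(dict_dir, "autonomy.txt"))
writeLines(c("bold", "", " dare"), file.path(dict_dir, "risktaking.txt"))

# One list element per file, named after the file without its extension
files <- list.files(dict_dir, pattern = "\\.txt$", full.names = TRUE)
dict_list <- lapply(files, read_wordlist)
names(dict_list) <- tools::file_path_sans_ext(basename(files))

str(dict_list)

# With quanteda loaded, the named list can then be passed on:
# EODic <- quanteda::dictionary(dict_list)
```

Each file then becomes one category in the `liwcalike()` output, in the same way the single `autonomy` column appears above.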