使文本文档成为数字列表
making a text document a numeric list
我正在尝试自动将大型语料库制作成数字列表。每行一个数字。例如我有以下数据:
Df.txt =
In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.
We love you Mr. Brown.
Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him.
so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home.
With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!
If you have an alternative argument, let's hear it! :)
首先,我使用命令 readLines
:
阅读文本
text <- readLines("Df.txt", encoding = "UTF-8")
其次,我将所有文本变成小写字母,并删除了不必要的间距:
## Lower cases input:
lower_text <- tolower(text)
## removing leading and trailing spaces:
Spaces_remove <- str_trim(lower_text)
从现在开始,我想为每一行分配一个数字,例如:
"In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”." = 1
"We love you Mr. Brown." = 2
...
"If you have an alternative argument, let's hear it! :)" = 6
有什么想法吗?
您已经有点将数字行 # 与矢量相关联(它按数字索引),但是……
text_input <- 'In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.
We love you Mr. Brown.
Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him.
so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home.
With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one\'s life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE\'s new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!
If you have an alternative argument, let\'s hear it! :)'
library(dplyr)
library(purrr)
library(stringi)
textConnection(text_input) %>%
readLines(encoding="UTF-8") %>%
stri_trans_tolower() %>%
stri_trim() -> corpus
# data frame with explicit line # column
df <- data_frame(line_number=1:length(corpus), text=corpus)
# list with an explicit line number field
lst <- map(1:length(corpus), ~list(line_number=., text=corpus[.]))
# implicit list numeric ids
as.list(corpus)
# explicit list numeric id's (but they're really string keys)
setNames(as.list(corpus), 1:length(corpus))
# named vector
set_names(corpus, 1:length(corpus))
有 过多 的 R 包可以显着减轻文本 processing/NLP 操作的负担。在他们之外做这项工作很可能是重新发明轮子。 CRAN NLP Task View 列出了其中的许多。
我正在尝试自动将大型语料库制作成数字列表。每行一个数字。例如我有以下数据:
Df.txt =
In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.
We love you Mr. Brown.
Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him.
so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home.
With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!
If you have an alternative argument, let's hear it! :)
首先,我使用命令 readLines
:
text <- readLines("Df.txt", encoding = "UTF-8")
其次,我将所有文本变成小写字母,并删除了不必要的间距:
## Lower cases input:
lower_text <- tolower(text)
## removing leading and trailing spaces:
Spaces_remove <- str_trim(lower_text)
从现在开始,我想为每一行分配一个数字,例如:
"In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”." = 1
"We love you Mr. Brown." = 2
...
"If you have an alternative argument, let's hear it! :)" = 6
有什么想法吗?
您已经有点将数字行 # 与矢量相关联(它按数字索引),但是……
text_input <- 'In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.
We love you Mr. Brown.
Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him.
so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home.
With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one\'s life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE\'s new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!
If you have an alternative argument, let\'s hear it! :)'
library(dplyr)
library(purrr)
library(stringi)
textConnection(text_input) %>%
readLines(encoding="UTF-8") %>%
stri_trans_tolower() %>%
stri_trim() -> corpus
# data frame with explicit line # column
df <- data_frame(line_number=1:length(corpus), text=corpus)
# list with an explicit line number field
lst <- map(1:length(corpus), ~list(line_number=., text=corpus[.]))
# implicit list numeric ids
as.list(corpus)
# explicit list numeric id's (but they're really string keys)
setNames(as.list(corpus), 1:length(corpus))
# named vector
set_names(corpus, 1:length(corpus))
有 过多 的 R 包可以显着减轻文本 processing/NLP 操作的负担。在他们之外做这项工作很可能是重新发明轮子。 CRAN NLP Task View 列出了其中的许多。