Manipulate strings from web-scraped data
I am trying to scrape data from a webpage and I have trouble manipulating the strings. If you visit the page, you'll see that the website is written in French. I am trying to get the data in tabular format at the bottom of the page. In French, thousands separators are either "." or spaces, which are used on the webpage.
Here is the code I use to scrape the values in the second column:
library(rvest)
link <- read_html("http://perspective.usherbrooke.ca/bilan/servlet/BMTendanceStatPays?langue=fr&codePays=NOR&codeTheme=1&codeStat=SP.POP.TOTL")
link %>%
html_nodes(".tableauBarreDroite") %>%
html_text() -> pop
head(pop)
[1] "3Â 581Â 239" "3Â 609Â 800" "3Â 638Â 918" "3Â 666Â 537" "3Â 694Â 339" "3Â 723Â 168"
The values in the pop vector contain the expected spaces and an unexpected Â. I tried the following to remove the spaces:
new.pop <- gsub(pattern = " ", replacement = "", x = pop)
head(new.pop)
[1] "3Â 581Â 239" "3Â 609Â 800" "3Â 638Â 918" "3Â 666Â 537" "3Â 694Â 339" "3Â 723Â 168"
The spaces are still present in the new.pop variable. I also tried removing tabs:
new.pop <- gsub(pattern = "\n", replacement = "", x = pop)
head(new.pop)
[1] "3Â 581Â 239" "3Â 609Â 800" "3Â 638Â 918" "3Â 666Â 537" "3Â 694Â 339" "3Â 723Â 168"
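As you can see, the spaces have not disappeared. Do you know how I should convert the pop vector to a numeric vector after removing the unwanted characters?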
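Just a hint, you should use this:
gsub(pattern="\\s",replacement="",x=pop) or
gsub(pattern=".\\s",replacement="@",x=pop)
because whitespace is a special character; note that the backslash has to be doubled inside an R string.
Best,
Robert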
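For the conversion to numeric asked about in the question, here is a minimal sketch. It assumes the separators are non-breaking spaces (U+00A0) rather than ordinary spaces: in UTF-8 that character is the byte pair C2 A0, which shows up as "Â " when read as Latin-1, and it would explain why a pattern of " " matches nothing. The sample vector below is hypothetical and only mimics the scraped values. Instead of trying to match the separator, the sketch drops everything that is not a digit, so it does not depend on whether \\s matches a non-breaking space in a given regex engine or locale.

```r
# Hypothetical sample mimicking the scraped values;
# "\u00a0" is a non-breaking space (an assumption about the page content).
pop <- c("3\u00a0581\u00a0239", "3\u00a0609\u00a0800", "3\u00a0638\u00a0918")

# An ordinary space does not match the non-breaking space.
grepl(" ", pop[1], fixed = TRUE)

# Keep only ASCII digits, then convert to numeric.
pop.num <- as.numeric(gsub("[^0-9]", "", pop))
pop.num
```

This only suits non-negative whole numbers such as these population counts; values with a decimal mark or a minus sign would need a pattern that keeps those characters.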