Manipulate strings from web-scraped data

I am trying to scrape data from a webpage and I am having trouble manipulating the strings. If you visit the page, you will see that the site is written in French. I am trying to get the data shown in tabular format at the bottom of the page. In French, the thousands separator is either a period or a space; this page uses spaces.

This is the code I used to scrape the values in the second column:

library(rvest)

link <- read_html("http://perspective.usherbrooke.ca/bilan/servlet/BMTendanceStatPays?langue=fr&codePays=NOR&codeTheme=1&codeStat=SP.POP.TOTL")

link %>%
   html_nodes(".tableauBarreDroite") %>%
   html_text() -> pop

head(pop)
[1] "3Â 581Â 239" "3Â 609Â 800" "3Â 638Â 918" "3Â 666Â 537" "3Â 694Â 339" "3Â 723Â 168"

The values in the pop vector contain the expected spaces and an unexpected Â. I tried the following to remove the spaces:

new.pop <- gsub(pattern = " ", replacement = "", x = pop)

head(new.pop)
[1] "3Â 581Â 239" "3Â 609Â 800" "3Â 638Â 918" "3Â 666Â 537" "3Â 694Â 339" "3Â 723Â 168"

The spaces are still present in the new.pop variable. I also tried removing newlines:

new.pop <- gsub(pattern = "\n", replacement = "", x = pop)

head(new.pop)
[1] "3Â 581Â 239" "3Â 609Â 800" "3Â 638Â 918" "3Â 666Â 537" "3Â 694Â 339" "3Â 723Â 168"

As you can see, the spaces are still there. Do you know how I should convert the pop vector to a numeric vector after removing the unwanted characters?

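Just a hint, you should use this:

gsub(pattern = "\\s", replacement = "", x = pop) or
gsub(pattern = ".\\s", replacement = "@", x = pop)

because whitespace is matched with a special character class, \s, and the backslash has to be doubled inside an R string.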
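To cover the conversion to numeric that the question asks about, here is a minimal sketch. It assumes the separator on the page is a non-breaking space (U+00A0) rather than an ordinary space, which would explain both the stray Â and why a literal " " never matches. The pop vector below is a hypothetical stand-in for the scraped values, and pop.num is a name introduced only for this example.

```r
# Hypothetical sample standing in for the scraped vector; "\u00a0" is a
# non-breaking space (an assumption about what the page uses as separator).
pop <- c("3\u00a0581\u00a0239", "3\u00a0609\u00a0800", "3\u00a0638\u00a0918")

# A literal " " is U+0020, a different character, so nothing is replaced.
unchanged <- gsub(" ", "", pop, fixed = TRUE)

# Remove every character that is not an ASCII digit, then convert.
pop.num <- as.numeric(gsub("[^0-9]", "", pop))
pop.num
```

Stripping everything except digits avoids having to know exactly which whitespace character or stray byte is in the string, but it only suits non-negative integers: it would also remove a decimal mark or a minus sign.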

Best,

Robert