as.numeric() producing NAs for what should be numbers

Question

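I am scraping the Illinois Department of Corrections website (https://www2.illinois.gov/idoc/facilities/Pages/Covid19Response.aspx) and trying to create a vector called Staff.Confirmed that holds the number of staff members who have contracted COVID-19. I run the following code: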


    library(rvest)

    url <- "https://www2.illinois.gov/idoc/facilities/Pages/Covid19Response.aspx"
    web <- read_html(url)
    # Select the "Staff Confirmed" cells and extract their text
    Staff.Confirmed.html <- html_nodes(web, '.soi-rteTableOddCol-1:nth-child(2)')
    Staff.Confirmed <- html_text(Staff.Confirmed.html)

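Printing Staff.Confirmed produces the following output:

    [1] "1"  "1"  "1"  "4"  "5"  "7"  "1"  "1"  "2"  "1"  "13" "3"  "4"  "2"  "4"  "2"  "0"  "8"  "8"  "1"  "79" "37" "0"  "1"  "1"

However, when I run

    Staff.Confirmed <- as.numeric(Staff.Confirmed)

I get the warning message "NAs introduced by coercion", and every number becomes NA. As far as I can tell, there are no spaces or anything else that should cause this. Has anyone else run into this before?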

I have tried running

    Staff.Confirmed <- gsub(pattern="^[0-9]",replacement="",Staff.Confirmed)
    Staff.Confirmed <- as.numeric(Staff.Confirmed)

but I still get the same error. Any help would be greatly appreciated. Thanks!

Answer 1

Each string has a Unicode character (U+200B, the zero-width space) hidden in its first position. Remove it and the conversion works:

    library(rvest)

    url <- "https://www2.illinois.gov/idoc/facilities/Pages/Covid19Response.aspx"
    web <- read_html(url)
    Staff.Confirmed.html <- html_nodes(web, '.soi-rteTableOddCol-1:nth-child(2)')
    # Drop the first character (the zero-width space) from each string
    Staff.Confirmed <- substr(html_text(Staff.Confirmed.html), 2, 999)
    Staff.Confirmed <- as.numeric(Staff.Confirmed)

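This is easy to spot in the Environment pane of RStudio, which displays the vector as follows:

    chr [1:25] "<U+200B>1" "<U+200B>1" "<U+200B>1" "<U+200B>4" ...

I was also able to confirm with nchar(html_text(Staff.Confirmed.html)) that each string is exactly 1 character too long.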
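
To make the diagnosis reproducible without hitting the website, here is a minimal sketch in base R. The sample vector `x` is made up for illustration (it is not the scraped data); it only mimics strings that start with U+200B. It also suggests why the `"^[0-9]"` attempt from the question changed nothing: the first character of each string is the hidden one, not a digit, so the anchored pattern never matches.

```r
# Hypothetical sample data: each value carries a leading zero-width space (U+200B)
x <- c("\u200B1", "\u200B13", "\u200B79")

# Diagnose: every string is one character longer than it looks
lens <- nchar(x)

# Inspect the code point of the first character (8203 is 0x200B)
first_codepoint <- utf8ToInt(x[1])[1]

# The anchored pattern "^[0-9]" finds no leading digit, so nothing is removed
unchanged <- gsub("^[0-9]", "", x)

# Remove the hidden character explicitly, wherever it occurs, then convert
clean <- gsub("\u200B", "", x, fixed = TRUE)
nums <- as.numeric(clean)
```

Unlike `substr(..., 2, 999)`, the `gsub()` call does not assume that the stray character is always present or always in the first position, which may be safer if the page markup changes.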

Answer 2

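A different approach, based on two principles:

1. It does not look for invalid characters. It looks for everything we want to keep and discards the rest.
2. It grabs the whole table in one gulp instead of going column by column.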

    library(rvest)

    # Fetch the data
    url <- "https://www2.illinois.gov/idoc/facilities/Pages/Covid19Response.aspx"
    web <- read_html(url)

    # Extract the table and keep only valid characters
    raw <- as.character(html_node(web, ".soi-rteTable-1")) # get the table and coerce it to character
    nchar(raw) # see how long it is
    # Inside a bracket expression "|" is a literal, so the classes are simply listed
    raw <- gsub("[^[:alnum:][:space:][:punct:]]", "", raw) # keep only valid characters
    nchar(raw) # the length is now smaller
    df <- html_table(minimal_html(raw), header = TRUE, trim = TRUE)
    df <- df[[1]] # html_table returns a list, so take the first element

    # Check whether the columns are numeric
    str(df)
    #> 'data.frame':   26 obs. of  5 variables:
    #>  $ Locations                         : chr  "Crossroads ATC" "Danville" "Dixon" "East Moline" ...
    #>  $ Staff Confirmed                   : int  1 1 1 4 5 7 1 1 2 1 ...
    #>  $ Staff Recovered                   : int  1 1 1 2 5 7 1 1 2 1 ...
    #>  $ Incarcerated Individuals Confirmed: int  3 0 0 28 0 4 0 0 15 0 ...
    #>  $ Incarcerated Individuals Recovered: int  3 0 0 1 0 4 0 0 15 0 ...

Voilà! One character column with the labels and 4 integer columns.
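
The "keep what we want, discard the rest" principle can also be tried in miniature on a single column, without any scraping. This is only a sketch under the assumption that the values are non-negative whole counts; the helper name `to_count` and the sample strings are hypothetical and not part of the answer above.

```r
# Hypothetical helper: keep only the digits 0-9, then convert to integer.
# Assumes non-negative whole numbers; signs and decimal points would be dropped too.
to_count <- function(x) {
  as.integer(gsub("[^0-9]", "", x))
}

# Made-up inputs with different kinds of stray characters:
# a zero-width space, ordinary spaces, and a non-breaking space
raw_values <- c("\u200B4", " 37 ", "8\u00A0")
counts <- to_count(raw_values)
```

Because the pattern lists only what is allowed, it does not matter which invisible character the page happens to contain.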
