as.numeric() producing NAs for what should be numbers
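I am scraping the Illinois Department of Corrections website (https://www2.illinois.gov/idoc/facilities/Pages/Covid19Response.aspx) and trying to create a vector named Staff.Confirmed that records the number of staff who have contracted COVID-19. I run the following code: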
library(rvest) # provides read_html(), html_nodes(), html_text()
url <- "https://www2.illinois.gov/idoc/facilities/Pages/Covid19Response.aspx"
web <- read_html(url)
Staff.Confirmed.html <- html_nodes(web,'.soi-rteTableOddCol-1:nth-child(2)')
Staff.Confirmed <- html_text(Staff.Confirmed.html)
Staff.Confirmed <- as.numeric(Staff.Confirmed)
When I call Staff.Confirmed, it produces the following output:
[1] "1" "1" "1" "4" "5" "7" "1" "1" "2" "1" "13" "3" "4" "2" "4" "2" "0" "8" "8" "1" "79" "37" "0" "1" "1"
However, when I run
Staff.Confirmed <- as.numeric(Staff.Confirmed)
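I get the warning message "NAs introduced by coercion". Every number becomes NA. As far as I can tell, there are no spaces or anything else that should cause this. Has anyone else run into this before?
I have tried running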
Staff.Confirmed <- gsub(pattern="^[0-9]",replacement="",Staff.Confirmed)
Staff.Confirmed <- as.numeric(Staff.Confirmed)
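but I still get the same error. Any help would be greatly appreciated! Thanks.
There is a Unicode character (U+200B, the zero-width space) hidden in the first position of each string. Remove it and it works: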
url <- "https://www2.illinois.gov/idoc/facilities/Pages/Covid19Response.aspx"
web <- read_html(url)
Staff.Confirmed.html <- html_nodes(web,'.soi-rteTableOddCol-1:nth-child(2)')
Staff.Confirmed <- substr(html_text(Staff.Confirmed.html), 2, 999)
Staff.Confirmed <- as.numeric(Staff.Confirmed)
This is easy to spot in RStudio's "Environment" pane. The pane shows the vector as follows:
chr [1:25] "<U+200B>1" "<U+200B>1" "<U+200B>1" "<U+200B>4" ...
I was also able to confirm with nchar(html_text(Staff.Confirmed.html)) that each string is exactly 1 character too long.
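For readers who want to reproduce the diagnosis without hitting the website, here is a minimal sketch. The vector x is a hypothetical stand-in for the scraped values (it is not taken from the page), and it swaps the position-based substr() for a gsub() that targets U+200B directly, so the character is removed wherever it occurs rather than only in the first position.

```r
# Hypothetical sample mimicking the scraped values:
# every string starts with a zero-width space (U+200B).
x <- paste0("\u200b", c("1", "13", "79"))

# Each string is one character longer than its visible digits.
nchar(x)

# Inspect the code points of the first string; 8203 is U+200B in decimal.
utf8ToInt(x[1])

# Remove every zero-width space (fixed = TRUE treats the pattern as a literal),
# then convert the cleaned strings to numbers.
clean <- gsub("\u200b", "", x, fixed = TRUE)
result <- as.numeric(clean)
result
```

Compared with substr(..., 2, 999), this does not assume the hidden character is always first or that no string exceeds 999 characters.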
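A new approach, based on two principles:
- This method does not look for invalid characters. It looks for everything we want to keep and discards the rest.
- We get the table in one big gulp rather than column by column.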
# Fetch the data
url <- "https://www2.illinois.gov/idoc/facilities/Pages/Covid19Response.aspx"
web <- read_html(url)
# extract table and keep only valid characters
raw <- as.character(html_node(web, ".soi-rteTable-1")) # get table and coerce it to character
nchar(raw) # see how long it is
raw <- gsub("[^[:alnum:]|[:space:]|[:punct:]]", "", raw) # keep only valid characters
nchar(raw) # length got reduced
df <- html_table(minimal_html(raw), header = TRUE, trim = TRUE)
df <- df[[1]] # html_table returns a list, so get the first element
# check if columns are numeric
str(df)
'data.frame': 26 obs. of 5 variables:
$ Locations : chr "Crossroads ATC" "Danville" "Dixon" "East Moline" ...
$ Staff Confirmed : int 1 1 1 4 5 7 1 1 2 1 ...
$ Staff Recovered : int 1 1 1 2 5 7 1 1 2 1 ...
$ Incarcerated Individuals Confirmed: int 3 0 0 28 0 4 0 0 15 0 ...
$ Incarcerated Individuals Recovered: int 3 0 0 1 0 4 0 0 15 0 ...
Voilà! One character column with the labels and 4 integer columns.
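The "keep only what we want" idea can also be tried offline on a small vector. The sketch below is a stricter variant, not the code above: instead of the POSIX classes (whose interpretation depends on the locale), it keeps only printable ASCII via a Perl-style regex. The sample vector x is hypothetical, and this whitelist would also delete legitimate non-ASCII letters, so it only suits data known to be plain ASCII.

```r
# Hypothetical sample: two counts and one label,
# each prefixed with a zero-width space (U+200B).
x <- paste0("\u200b", c("4", "28", "East Moline"))

# Whitelist: keep printable ASCII (space through tilde), drop everything else.
kept <- gsub("[^\\x20-\\x7E]", "", x, perl = TRUE)

# Number of characters dropped from each string.
nchar(x) - nchar(kept)

# The numeric entries now convert cleanly.
counts <- as.numeric(kept[1:2])
counts
```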