I need to filter data without losing information; the column is character and I can't filter it
library(XML)
library(dplyr)
library(rvest)
presid <- read_html("https://en.wikipedia.org/wiki/List_of_presidents_of_Peru") %>% # read the HTML page
  html_nodes("table") %>% # extract nodes which contain a table
  .[3] %>% # select the node which contains the relevant table
  html_table(header = NA,
             trim = T) # extract the table
t3 <- presid[[1]] # flatten data
t4 <- t3[unique(t3$N),] # attempt to eliminate duplicates
t5 <- subset(t4, !is.na(President)) # drop rows where President is NA
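I need to read this table and filter the data as well as possible, without losing a lot of information in the process.
The loss of rows is substantial: it goes from 98 rows in t3 to 72 rows in t4 and 63 rows in t5, when in fact I only need to go from 98 rows to 84 rows, which should be possible using column N.
I tried these expressions, without success: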
strsplit(as.character(t3$N), split = "(?<=[a-zA-Z])(?=[0-9])", perl = TRUE)
and
grep("[[:numeric:]]{2, }",N,value=T)
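The rows I need to filter out are those where column N holds a decimal value such as 0.5, 2.5, 6.5 or 6.6, plus the others ending in .5; in total there are 14 rows I must delete.
My data frame would then go from 98 rows down to 84.
I could filter by date instead, but I have not found much material that helps me with that.
Thanks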
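A note on the t4 step in the question: the sketch below shows one likely reason rows disappear there. It assumes a plain data.frame with default row names (a tibble may behave differently), and the toy values are hypothetical, not taken from the scraped table. unique() returns the distinct values of N, and a character index is matched against row names rather than treated as row positions.

```r
# Toy stand-in for t3 (hypothetical values, not the scraped table)
toy <- data.frame(
  N = c("1", "2", "2", "2.5", "3"),
  President = c("A", "B", "B again", "C", "D"),
  stringsAsFactors = FALSE
)

# unique() returns the distinct values of N, not row positions
idx <- unique(toy$N)

# A character index is matched against the row names ("1" to "5"),
# so "2.5" matches no row and produces a row of NAs
by_value <- toy[idx, ]

# duplicated() gives a logical mask, which selects rows by position
first_per_n <- toy[!duplicated(toy$N), ]
```

Here by_value contains an all-NA row, which is the kind of row the later !is.na(President) step then removes. Dropping duplicates of N is also not what the question actually asks for, since repeated whole numbers are meant to stay.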
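Since the data from the website has duplicated column names, we can use janitor::clean_names() to get clean column names, and then keep only those rows whose n column holds a whole number.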
library(rvest)
library(dplyr)
read_html("https://en.wikipedia.org/wiki/List_of_presidents_of_Peru") %>%
html_nodes("table") %>%
.[3] %>%
html_table(header = NA,trim = T) %>%
.[[1]] %>%
janitor::clean_names() %>%
filter(grepl('^\\d+$', n)) -> result
result
# A tibble: 85 x 10
# n president president_2 president_3 term_of_office term_of_office_2 title
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1 "" "" José de la R… 28 February 18… 23 June 1823 President of …
# 2 2 "" "" José Bernard… 16 August 1823 18 November 1823 President of …
# 3 2 "" "" José Bernard… 18 November 18… 10 February 1824 Constitutiona…
# 4 3 "" "" José de La M… 10 June 1827 7 June 1829 Constitutiona…
# 5 4 "" "" Agustín Gama… 7 June 1829 19 December 1829 Antonio Gutié…
# 6 4 "" "" Agustín Gama… 1 September 18… 19 December 1829 Provisional P…
# 7 4 "" "" Agustín Gama… 19 December 18… 19 December 1833 Constitutiona…
# 8 5 "" "" Luis José de… 21 December 18… 21 December 1833 Provisional P…
# 9 6 "" "" Felipe Salav… 25 February 18… 7 February 1836 Supreme Head …
#10 7 "" "" Agustín Gama… 20 January 183… 15 August 1839 Provisional P…
# … with 75 more rows, and 3 more variables: form_of_entry <chr>, vice_president <chr>,
# vice_president_2 <chr>
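The filter can also be checked without scraping anything. The following is a minimal sketch in base R on a hypothetical sample of n values (not the real column): the anchored pattern only matches strings made up entirely of digits, so anything containing a decimal point is rejected.

```r
# Hypothetical sample of the n column (character, as scraped)
n <- c("1", "2", "2", "0.5", "2.5", "3", "6.5", "6.6", "7")

# TRUE only where the whole string is one or more digits
is_whole <- grepl("^\\d+$", n)

kept <- n[is_whole]      # whole-number entries, repeats included
dropped <- n[!is_whole]  # entries with a decimal part
```

Repeated whole numbers such as the two "2" entries are kept, which matches the intent of removing only the decimal-numbered rows. Note that the backslash has to be doubled inside an R string literal.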