Dataframe doubling the total number of rows

Question

I am running the following code:

library(tidyverse)
library(rvest)
library(magrittr)
library(dplyr)
library(tidyr)
library(data.table)
library(zoo)

commits_url <- paste0("https://247sports.com/Season/2022-Football/Commits/?Page=", 1:7)

commits_school_gather <- map_df(commits_url, ~.x %>% read_html %>%
                                  html_nodes('div.status img') %>%
                                  html_attr('title') %>%
                                  matrix(ncol = 1, byrow = T) %>% 
                                  as.data.frame)

This should return 238 rows (at least as of 5:36 PM EST on March 5, 2021; keep that in mind for future use, since the number will change over time). When I run the code, it returns 476 rows, exactly double what I expect.

If you run commits_school_gather %>% head(10), it looks like this:

V1
Rutgers
Rutgers
Notre Dame
Notre Dame
Michigan
Michigan
Akron
Akron
Notre Dame
Notre Dame

I would like the output to look like this:

V1
Rutgers
Notre Dame
Michigan
Akron
Notre Dame

Answer 1

We can use rleid from data.table to keep only the first row of each run of consecutive identical values:

library(dplyr)
library(data.table)
commits_school_gather %>%
   filter(!duplicated(rleid(V1)))
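
To see what the rleid approach does, here is a small sketch on toy data. The data frame `doubled` and the extra pair of Notre Dame rows are made up for illustration: they stand for a hypothetical case where two consecutive commits go to the same school. In that case, deduplicating by run would merge the two commits into one, whereas keeping every other row would not (that alternative assumes every entry is repeated exactly twice in a row).

```r
library(dplyr)
library(data.table)

# Toy data mimicking the doubled output (hypothetical values, not scraped)
doubled <- data.frame(
  V1 = c("Rutgers", "Rutgers",
         "Notre Dame", "Notre Dame",
         "Notre Dame", "Notre Dame",
         "Michigan", "Michigan"),
  stringsAsFactors = FALSE
)

# rleid() gives every run of identical consecutive values its own id
run_id <- rleid(doubled$V1)

# Keep the first row of each run: the two Notre Dame commits collapse into one
by_run <- doubled %>% filter(!duplicated(rleid(V1)))

# Keep rows 1, 3, 5, ...: assumes each entry appears exactly twice in a row
by_position <- doubled %>% slice(seq(1, n(), by = 2))
```

If the scrape itself can be fixed so that each commit is selected only once, neither workaround is needed.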

Answer 2

The problem is that the CSS selector in the html_nodes() step matches two nodes under the 'div.status img' path.

Try this:

commits_school_gather <- map_df(commits_url, ~.x %>% read_html %>%
                              html_nodes('div.status img.jsonly') %>%
                              html_attr('title') %>%
                              matrix(ncol = 1, byrow = T) %>% 
                              as.data.frame)

Result

                   V1
1             Rutgers
2          Notre Dame
3            Michigan
4               Akron
5             Clemson
6              Oregon
7            Oklahoma
8             Arizona
9              Kansas
10  Mississippi State
...

Explaining the error:

See how the original css = 'div.status img' selects the nodes on lines 3 and 5. The alternative CSS restricts the query to the first node only.
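
The difference between the two selectors can be reproduced offline on a small HTML fragment. The markup below is an assumption made up for illustration, not the actual 247sports page source: it only mimics a structure where each div.status holds two img nodes with the same title, one of which carries the jsonly class.

```r
library(rvest)

# Hypothetical fragment: two <img> nodes per div.status, same title on both
page <- read_html('<html><body>
  <div class="status">
    <img class="jsonly" title="Rutgers" src="a.png">
    <img title="Rutgers" src="a.png">
  </div>
  <div class="status">
    <img class="jsonly" title="Notre Dame" src="b.png">
    <img title="Notre Dame" src="b.png">
  </div>
</body></html>')

# Broad selector: matches every <img> inside div.status, so each title appears twice
broad <- page %>% html_nodes("div.status img") %>% html_attr("title")

# Narrow selector: matches only the <img> with class "jsonly", one per div
narrow <- page %>% html_nodes("div.status img.jsonly") %>% html_attr("title")

length(broad)   # number of matches with the broad selector
length(narrow)  # number of matches with the narrow selector
```

Comparing length(broad) with length(narrow) on the real page is a quick way to check whether a selector is picking up more nodes than intended.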
