Not sure how to separate a column of data that I scraped
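I scraped the schedule data for the Albany women's basketball team from the ESPN website. The win/loss column is formatted like this: W 77-70, which means Albany won 77-70. I want to split it so that one column shows how many points Albany scored and another shows how many points the opponent scored.
Here is my code; I'm not sure what to do next: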
library(rvest)
library(stringr)
library(tidyr)
w.url <- "http://www.espn.com/womens-college-basketball/team/schedule/_/id/399"
webpage <- read_html(w.url)
w_table <- html_nodes(webpage, 'table')
w <- html_table(w_table)[[1]]
head(w)
w <- w[-(1:2), ]
names(w) <- c("Date", "Opponent", "Score", "Record")
head(w)
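You can first use the grepl function to trim out the rows that don't contain an actual result, then use a regular expression to extract the specific information: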
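# Keep only the rows whose Score contains a hyphen, i.e. an actual result
w <- w[grepl("-", w$Score),]
# Rewrite each score as "result,first,second"; backreferences need "\\" in R strings
gsub("^([A-Z])([0-9]+)-([0-9]+).*", "\\1,\\2,\\3", w$Score) %>%
  strsplit(., split = ",") %>%
  lapply(function(x){
    # The winner's points come first, so Albany's position depends on the result
    data.frame(
      result = x[1],
      oponent = ifelse(x[1] == "L", x[2], x[3]),
      albany = ifelse(x[1] == "W", x[2], x[3])
    )
  }) %>%
  do.call('rbind',.) %>%
  cbind(w,.) -> w2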
head(w2)
#         Date           Opponent  Score    Record result oponent albany
#3  Fri, Nov 9 @#22 South Florida L74-37 0-1 (0-0)      L      74     37
#4 Mon, Nov 12           @Cornell L48-34 0-2 (0-0)      L      48     34
#5 Wed, Nov 14        vsManhattan W60-54 1-2 (0-0)      W      54     60
#6 Sun, Nov 18           @Rutgers L65-39 1-3 (0-0)      L      65     39
#7 Wed, Nov 21          @Monmouth L64-56 1-4 (0-0)      L      64     56
#8 Sun, Nov 25       vsHoly Cross L56-50 1-5 (0-0)      L      56     50
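To see what each step does without scraping anything, here is a minimal base R sketch of the same idea on a small vector. The values in score are made up for illustration (two are copied from the output above, and "7:00 PM" is a hypothetical stand-in for a game with no result yet), and the scores are converted to integers, which the code above does not do.

```r
# Hypothetical sample values; the last one stands in for a game with no result yet
score <- c("L74-37", "W60-54", "7:00 PM")

# grepl() flags the entries that contain a hyphen, i.e. an actual final score
has_result <- grepl("-", score)
played <- score[has_result]

# gsub() rewrites each score as "result,first,second" using backreferences,
# and strsplit() breaks that string into its three pieces
parts <- strsplit(gsub("^([A-Z])([0-9]+)-([0-9]+).*", "\\1,\\2,\\3", played), ",")

# The winner's points are listed first, so pick Albany's number by result
result   <- sapply(parts, `[`, 1)
first    <- as.integer(sapply(parts, `[`, 2))
second   <- as.integer(sapply(parts, `[`, 3))
albany   <- ifelse(result == "W", first, second)
opponent <- ifelse(result == "W", second, first)

data.frame(score = played, result, albany, opponent)
```

Because ifelse() is vectorized, this variant needs no lapply() or rbind() step, and the integer columns can be summed or averaged directly.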
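This is how I did it. Basically, use sub to extract the winning or losing score depending on whether Albany won or lost. Whether Albany wins or loses, the winner's score is listed first, so the ifelse function is necessary. The '\\1' in the replacement refers to the digits captured by the parentheses.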
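w <- w[1:24, ]
# Albany's points: the first number after a win, the second number after a loss
w$Albany <- ifelse(substr(w$Score, 1, 1) == 'W', sub('W(\\d+)-\\d+', '\\1', w$Score), sub('L\\d+-(\\d+)', '\\1', w$Score))
# Opponent's points: the other number in each case
w$Opponent_Team <- ifelse(substr(w$Score, 1, 1) == 'W', sub('W\\d+-(\\d+)', '\\1', w$Score), sub('L(\\d+)-\\d+', '\\1', w$Score))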
head(w)
         Date           Opponent  Score    Record Albany Opponent_Team
3  Fri, Nov 9 @#22 South Florida L74-37 0-1 (0-0)     37            74
4 Mon, Nov 12           @Cornell L48-34 0-2 (0-0)     34            48
5 Wed, Nov 14        vsManhattan W60-54 1-2 (0-0)     60            54
6 Sun, Nov 18           @Rutgers L65-39 1-3 (0-0)     39            65
7 Wed, Nov 21          @Monmouth L64-56 1-4 (0-0)     56            64
8 Sun, Nov 25       vsHoly Cross L56-50 1-5 (0-0)     50            56
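If you would rather not hard-code the row range, one possible variation is to let the pattern itself decide which rows have a score. The sketch below is only an illustration: split_score is a hypothetical helper name, "Postponed" is a made-up example of a non-score entry, and it uses base R's regexec() and regmatches() instead of sub().

```r
# Hypothetical helper: split scores like "W60-54" into result and both point totals
split_score <- function(score) {
  # regexec() gives the full match plus one entry per capture group
  m <- regmatches(score, regexec("^([WL])(\\d+)-(\\d+)", score))
  # Entries that do not match come back as character(0); pad them with NA
  m <- lapply(m, function(x) if (length(x) == 4) x else rep(NA_character_, 4))
  m <- do.call(rbind, m)
  won <- m[, 2] == "W"
  first <- as.integer(m[, 3])
  second <- as.integer(m[, 4])
  # The winner's points are listed first, so swap the columns after a loss
  data.frame(
    Result = m[, 2],
    Albany = ifelse(won, first, second),
    Opponent_Team = ifelse(won, second, first)
  )
}

split_score(c("L74-37", "W60-54", "Postponed"))
```

For the two real scores this gives Albany 37 and 60 and Opponent_Team 74 and 54, while the unmatched entry becomes NA in every column. Assuming w has the Score column from the question, cbind(w, split_score(w$Score)) would attach the new columns.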