Separating strings contained in a data.frame and creating a new associated value based on the splits
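I have a large data set in which some rows contain multiple counties separated by commas. I have been trying to split them into separate rows and divide the monetary value associated with each string by the number of counties in that string. Besides the rows with county strings, there are also statewide values that must be split among every county in that state. To keep the example easy to reproduce, we will have to assume that Maryland has only three counties. Here is some code for a reproducible example: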
df1 <- data.frame(
  State = c("Maryland", "Maryland", "Maryland", "Washington", "Arizona", "California"),
  County = c("Baltimore,Montgomery,Frederick", "Statewide", "Baltimore, Carrol",
             "Douglas", "Washington", "San Bernadino,Orange"),
  Spending = c(15000, 20000, 10000, 5000, 2000, 34000)
)
print(df1)
       State                         County Spending
1   Maryland Baltimore,Montgomery,Frederick    15000
2   Maryland                      Statewide    20000
3   Maryland              Baltimore, Carrol    10000
4 Washington                        Douglas     5000
5    Arizona                     Washington     2000
6 California           San Bernadino,Orange    34000
I would like the output to look like the data.frame printed below.
df2 <- data.frame(
  State = c("Maryland", "Maryland", "Maryland", "Maryland",
            "Washington", "Arizona", "California", "California"),
  County = c("Baltimore", "Montgomery", "Frederick", "Carrol", "Douglas",
             "Washington", "San Bernadino", "Orange"),
  Spending = c(15000, 10000, 10000, 10000, 5000, 2000, 17000, 17000)
)
print(df2)
       State        County Spending
1   Maryland     Baltimore    15000
2   Maryland    Montgomery    10000
3   Maryland     Frederick    10000
4   Maryland        Carrol    10000
5 Washington       Douglas     5000
6    Arizona    Washington     2000
7 California San Bernadino    17000
8 California        Orange    17000
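The catch is that not every row has a county string, and a statewide value has to be split according to the number of counties that appear for that state.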
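I took the liberty of slightly modifying your data frame, because it did not work as posted (as a commenter pointed out). Is this what you are looking for?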
df1 <- data.frame(
  State = c("Maryland", "Maryland", "Washington", "Arizona", "California"),
  County = c("Baltimore,Montgomery,Frederick", "Statewide", "Douglas",
             "Washington", "San Bernadino,Orange"),
  Spending = c(15000, 15000, 6000, 2000, 34000)
)
library(dplyr)
library(stringr)
library(tidyr)
df1 %>%
  group_by(State) %>% # group by state
  mutate(Total_Spending = sum(Spending)) %>% # calculate total spending per state
  filter(County != "Statewide") %>% # drop the statewide rows, as they don't count as a county
  separate_rows(County, sep = ",") %>% # expand to one row per county
  mutate(Spending_PC = Total_Spending / n()) # calculate spending per county
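Edit, based on the new information from the OP:
I took roughly a five-step approach:
- Take out the statewide rows and deal with them later.
- For the remaining rows, count the number of counties in each row (by counting the commas in County) and divide the spending amount accordingly.
- Expand the rows to one row per county.
- Group by county and sum its spending.
- Join the statewide information back in and distribute it across the counties (adjusted for the number of counties in the state).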
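To make steps 2 and 3 concrete, here is a minimal sketch on a single row. The `toy` data frame and the name `expanded` are made up for illustration; the regex `",\\s*"` is my assumption about how to handle a space after the comma, as in the question's "Baltimore, Carrol" row.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Hypothetical one-row example: 10000 shared by two counties
toy <- data.frame(
  State = "Maryland",
  County = "Baltimore, Carrol",
  Spending = 10000
)

expanded <- toy %>%
  # one comma means two counties, so each county gets Spending / 2
  mutate(Spending_div = Spending / (str_count(County, ",") + 1)) %>%
  # split on the comma plus any whitespace that follows it
  separate_rows(County, sep = ",\\s*")

print(expanded)
```

Each of the two resulting rows carries half of the original amount in `Spending_div`, while `Spending` still holds the undivided value.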
statewide <- df1 %>%
  filter(County == "Statewide") %>%
  select(-County, Spending_State = Spending)

df1 %>%
  filter(County != "Statewide") %>% # drop the statewide rows
  # divide the spending of each row by its number of counties (number of "," + 1)
  mutate(Spending_div = Spending / (str_count(County, ",") + 1)) %>%
  separate_rows(County, sep = ",\\s*") %>% # expand rows; also drops a space after the comma
  # calculate spending per county (a county can appear in several rows)
  group_by(State, County) %>%
  summarize(Spending_County = sum(Spending_div)) %>%
  # join the statewide spending
  left_join(statewide, by = "State") %>%
  replace_na(list(Spending_State = 0)) %>% # states without a statewide row get 0
  # calculate the final value
  group_by(State) %>% # group to count the counties in each state
  mutate(Spending_County = Spending_County + (Spending_State / n()))
There is also a way to do this without extracting the statewide rows first, but in my opinion that is messier and less "tidy".
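For what it is worth, one possible shape of such a single-pipeline version is sketched below. This is only my guess at what it could look like, not a tested recommendation: the column names `share`, `statewide_total` and `n_counties` and the object `result` are invented for the example, and the data is the data frame from the question. Note that the sample data lists four distinct Maryland counties, so the statewide 20000 is split four ways here.

```r
library(dplyr)
library(stringr)
library(tidyr)

df1 <- data.frame(
  State = c("Maryland", "Maryland", "Maryland", "Washington", "Arizona", "California"),
  County = c("Baltimore,Montgomery,Frederick", "Statewide", "Baltimore, Carrol",
             "Douglas", "Washington", "San Bernadino,Orange"),
  Spending = c(15000, 20000, 10000, 5000, 2000, 34000)
)

result <- df1 %>%
  # per-county share of each row ("Statewide" has no comma, so it keeps its full amount)
  mutate(share = Spending / (str_count(County, ",") + 1)) %>%
  separate_rows(County, sep = ",\\s*") %>%
  group_by(State) %>%
  mutate(
    # total statewide money in this state (0 if there is no statewide row)
    statewide_total = sum(share[County == "Statewide"]),
    # number of distinct real counties seen in this state
    n_counties = n_distinct(County[County != "Statewide"])
  ) %>%
  filter(County != "Statewide") %>%
  group_by(State, County) %>%
  summarise(
    Spending = sum(share) + first(statewide_total) / first(n_counties),
    .groups = "drop"
  )

print(result)
```

A state that only has a "Statewide" row would disappear from `result`, since there is no county to assign the money to; that case would need its own handling.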