Separating Strings Contained in data.frame and creating a new associated value based on splits

Question

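I have a large data set in which some rows contain multiple counties separated by commas. I have been trying to split these into separate rows and to divide the dollar value associated with each string by the number of counties in that string. Besides the county strings, there are also statewide values that must be split among every county in that state. To keep the example easy to reproduce, we will have to assume that Maryland has only three counties. Here is some code for a reproducible example: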

> df1 <- data.frame(State = 
c("Maryland","Maryland","Maryland","Washington","Arizona","California"),
County = c("Baltimore,Montgomery,Frederick","Statewide","Baltimore, Carrol",
"Douglas","Washington","San Bernadino,Orange"),Spending = 
c(15000,20000,10000,5000,2000,34000))
> print(df1)
       State                         County Spending
1   Maryland Baltimore,Montgomery,Frederick    15000
2   Maryland                      Statewide    20000
3   Maryland              Baltimore, Carrol    10000
4 Washington                        Douglas     5000
5    Arizona                     Washington     2000
6 California           San Bernadino,Orange    34000

I would like the output to look like the data.frame printed below.

> df2 <- data.frame(State = c("Maryland","Maryland","Maryland","Maryland",
"Washington","Arizona","California","California"),County = 
c("Baltimore","Montgomery","Frederick","Carrol","Douglas",
"Washington","San Bernadino","Orange"),
Spending = c(15000,10000,10000,10000,5000,2000,17000,17000))
> print(df2)
       State        County Spending
1   Maryland     Baltimore    15000
2   Maryland    Montgomery    10000
3   Maryland     Frederick    10000
4   Maryland        Carrol    10000
5 Washington       Douglas     5000
6    Arizona    Washington     2000
7 California San Bernadino    17000
8 California        Orange    17000

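The tricky part is that not all rows have county strings, and the statewide values have to be divided according to the number of counties that appear for that state.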
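
To make the arithmetic concrete, here is a minimal base R sketch for Maryland only. The variable names (`md_counties`, `n_counties`, `statewide_share`) are made up for illustration, and the county strings and the 20000 statewide amount are copied from df1 above.

```r
# County strings for Maryland from df1, leaving out the "Statewide" row
md_counties <- c("Baltimore,Montgomery,Frederick", "Baltimore, Carrol")

# Split on commas (also dropping a space after the comma) and keep unique names
counties <- unique(unlist(strsplit(md_counties, ",\\s*")))
n_counties <- length(counties)  # Baltimore, Montgomery, Frederick, Carrol

# Each county's share of the statewide amount
statewide_share <- 20000 / n_counties
```

With four distinct counties, each one receives 5000 of the statewide amount, which is what the Maryland rows of df2 assume.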

Answer 1

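I took the liberty of slightly modifying your data frame, since it did not work (as the commenter pointed out). Is this what you are looking for?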

df1 <- data.frame(State = 
                c("Maryland","Maryland","Washington","Arizona","California"),County = 
                c("Baltimore,Montgomery,Frederick","Statewide","Douglas","Washington","San Bernadino,Orange"),Spending = c(15000, 15000,6000,2000,34000))

library(dplyr)
library(stringr)
library(tidyr)


df1 %>% 
  group_by(State) %>% # group by state
  mutate(Total_Spending = sum(Spending)) %>% # calculate total spending
  filter(County != "Statewide") %>% # delete rows for statewide as they don't count as a county
  separate_rows(County, sep = ",") %>% # expand rows
  mutate(Spending_PC = Total_Spending / n()) # calculate spending per county

Edit based on the OP's new information:

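I took roughly a five-step approach:

1. Take out the statewide rows and deal with them later.
2. With the rest, count the number of counties in each row (by counting the commas in County) and divide the spending amount accordingly.
3. Expand the rows.
4. Group by county and sum up its spending.
5. Join the statewide information and distribute it across the counties (adjusted for the number of counties in the state).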

statewide <- df1 %>% 
  filter(County == "Statewide") %>% 
  select(-County, Spending_State = Spending)

df1 %>% 
  filter(County != "Statewide") %>%  # drop the statewide rows 
  # divide the spending of each row by the number of counties (as counted by "," + 1)
  mutate(Spending_div = Spending / (str_count(County, ",") + 1)) %>% 
  separate_rows(County, sep = ",\\s*") %>% # expand rows; the regex also drops a space after the comma ("Baltimore, Carrol")
  # calculate spending per county (account for multiple rows per county)
  group_by(State, County) %>% 
  summarize(Spending_County = sum(Spending_div)) %>%  
  # join the statewide spending 
  left_join(statewide, by = "State") %>% 
  replace_na(list(Spending_State = 0)) %>% # replace non matched with 0 
  # calculate final value 
  group_by(State) %>% # group to get number of counties in each state to distribute the Spending_State
  mutate(Spending_County = Spending_County + (Spending_State / n()))
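
As a check, the same steps can be run against the data frame from the question. This is only a sketch under a few assumptions: the names `df_q`, `statewide_q` and `result` are mine, the separator is a regex that also swallows the space in "Baltimore, Carrol", the statewide rows are summed per state in case a state has more than one, and `.groups = "drop"` assumes dplyr 1.0 or later.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Data frame from the question
df_q <- data.frame(
  State = c("Maryland", "Maryland", "Maryland", "Washington", "Arizona", "California"),
  County = c("Baltimore,Montgomery,Frederick", "Statewide", "Baltimore, Carrol",
             "Douglas", "Washington", "San Bernadino,Orange"),
  Spending = c(15000, 20000, 10000, 5000, 2000, 34000)
)

# Step 1: set the statewide rows aside, one total per state
statewide_q <- df_q %>%
  filter(County == "Statewide") %>%
  group_by(State) %>%
  summarize(Spending_State = sum(Spending))

result <- df_q %>%
  filter(County != "Statewide") %>%
  # Step 2: divide each row's spending by its number of counties
  mutate(Spending_div = Spending / (str_count(County, ",") + 1)) %>%
  # Step 3: one row per county
  separate_rows(County, sep = ",\\s*") %>%
  # Step 4: sum spending per county
  group_by(State, County) %>%
  summarize(Spending_County = sum(Spending_div), .groups = "drop") %>%
  # Step 5: add an equal share of the statewide amount to each county
  left_join(statewide_q, by = "State") %>%
  replace_na(list(Spending_State = 0)) %>%
  group_by(State) %>%
  mutate(Spending = Spending_County + Spending_State / n()) %>%
  ungroup() %>%
  select(State, County, Spending)
```

The rows come out sorted by State and County rather than in the order of df2, but the values should agree: Baltimore ends up with 15000 and the other three Maryland counties with 10000 each.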

There is also a way to do this without extracting the statewide information first, but IMHO that is messier and less "tidy".
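
For what it's worth, here is one possible shape of that single-pipeline variant. It is my own sketch, not necessarily what was meant above: the statewide amount is carried along as an extra column (`Spending_State`) instead of being joined back in, and the result name `alt` is made up. It uses the modified df1 from this answer and assumes dplyr 1.0 or later for `.groups`.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Modified data frame from this answer
df1 <- data.frame(
  State = c("Maryland", "Maryland", "Washington", "Arizona", "California"),
  County = c("Baltimore,Montgomery,Frederick", "Statewide", "Douglas",
             "Washington", "San Bernadino,Orange"),
  Spending = c(15000, 15000, 6000, 2000, 34000)
)

alt <- df1 %>%
  group_by(State) %>%
  # carry each state's statewide total along (0 if the state has none)
  mutate(Spending_State = sum(Spending[County == "Statewide"])) %>%
  ungroup() %>%
  filter(County != "Statewide") %>%
  # divide each row's spending by its number of counties, then expand
  mutate(Spending_div = Spending / (str_count(County, ",") + 1)) %>%
  separate_rows(County, sep = ",\\s*") %>%
  group_by(State, County) %>%
  summarize(Spending_County = sum(Spending_div),
            Spending_State = first(Spending_State),
            .groups = "drop_last") %>%
  # still grouped by State here, so n() is the number of counties in the state
  mutate(Spending_County = Spending_County + Spending_State / n()) %>%
  ungroup()
```

Note that, as with the join version, a state that only has a "Statewide" row and no county rows would drop out of the result entirely.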

r

data-manipulation

dplyr