Separating strings contained in a data.frame and creating a new associated value based on the splits

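I have a large data set in which some rows contain multiple counties separated by commas. I have been trying to split these into separate rows and divide the monetary value associated with each string by the number of counties in that string. Besides the rows with county strings, there are also statewide values that must be split among every county in the state. To keep the example easy to reproduce, we will have to assume Maryland only has three counties. Here is some code for a reproducible example: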

> df1 <- data.frame(State = 
c("Maryland","Maryland","Maryland","Washington","Arizona","California"),
County = c("Baltimore,Montgomery,Frederick","Statewide","Baltimore, Carrol",
"Douglas","Washington","San Bernadino,Orange"),Spending = 
c(15000,20000,10000,5000,2000,34000))
> print(df1)
       State                         County Spending
1   Maryland Baltimore,Montgomery,Frederick    15000
2   Maryland                      Statewide    20000
3   Maryland              Baltimore, Carrol    10000
4 Washington                        Douglas     5000
5    Arizona                     Washington     2000
6 California           San Bernadino,Orange    34000

I would like the output to look like the data.frame printed below.

> df2 <- data.frame(State = c("Maryland","Maryland","Maryland","Maryland",
"Washington","Arizona","California","California"),County = 
c("Baltimore","Montgomery","Frederick","Carrol","Douglas",
"Washington","San Bernadino","Orange"),
Spending = c(15000,10000,10000,10000,5000,2000,17000,17000))
> print(df2)
       State        County Spending
1   Maryland     Baltimore    15000
2   Maryland    Montgomery    10000
3   Maryland     Frederick    10000
4   Maryland        Carrol    10000
5 Washington       Douglas     5000
6    Arizona    Washington     2000
7 California San Bernadino    17000
8 California        Orange    17000

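The tricky part is that not every row has a county string, and the statewide values have to be split according to the number of counties that appear within the state.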

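I took the liberty of modifying your data frame slightly, since it did not work as posted (as the commenter pointed out). Is this what you are looking for?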

df1 <- data.frame(
  State = c("Maryland", "Maryland", "Washington", "Arizona", "California"),
  County = c("Baltimore,Montgomery,Frederick", "Statewide", "Douglas",
             "Washington", "San Bernadino,Orange"),
  Spending = c(15000, 15000, 6000, 2000, 34000)
)

library(dplyr)
library(stringr)
library(tidyr)


df1 %>% 
  group_by(State) %>% # group by state
  mutate(Total_Spending = sum(Spending)) %>% # calculate total spending
  filter(County != "Statewide") %>% # delete rows for statewide as they don't count as a county
  separate_rows(County, sep = ",") %>% # expand rows
  mutate(Spending_PC = Total_Spending / n()) # calculate spending per county

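Edit based on new information from the OP

I took roughly a five-step approach:

  1. Take out the statewide rows and deal with them later.
  2. For the remaining rows, count the number of counties in each row (by counting the commas in County) and divide the spending amount accordingly.
  3. Expand the rows so that each county gets its own row.
  4. Group by county and sum up its spending.
  5. Join the statewide information back in and distribute it across the counties (adjusted for the number of counties in the state).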

    statewide <- df1 %>% 
      filter(County == "Statewide") %>% 
      select(-County, Spending_State = Spending)
    
    df1 %>% 
      filter(County != "Statewide") %>%  # drop the statewide rows 
      # divide the spending of each row by the number of counties (as counted by "," + 1)
      mutate(Spending_div = Spending / (str_count(County, ",") + 1)) %>% 
      separate_rows(County, sep = ",\\s*") %>% # expand rows; the regex also drops a space after the comma
      # calculate spending per county (account for multiple rows per county)
      group_by(State, County) %>% 
      summarize(Spending_County = sum(Spending_div)) %>%  
      # join the statewide spending 
      left_join(statewide, by = "State") %>% 
      replace_na(list(Spending_State = 0)) %>% # replace non matched with 0 
      # calculate final value 
      group_by(State) %>% # group to get number of counties in each state to distribute the Spending_State
      mutate(Spending_County = Spending_County + (Spending_State / n()))
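
As a rough check, the same five steps can be run on the data frame from the question, which includes the "Baltimore, Carrol" row that I left out above. This is only a sketch: the names `question_df`, `statewide_q` and `result_q` are made up for this illustration, and the `",\\s*"` separator is an assumption on my part so that the space after the comma does not end up in the county name.

```r
library(dplyr)
library(stringr)
library(tidyr)

# The data frame from the question, including the "Baltimore, Carrol" row
question_df <- data.frame(
  State = c("Maryland", "Maryland", "Maryland", "Washington", "Arizona", "California"),
  County = c("Baltimore,Montgomery,Frederick", "Statewide", "Baltimore, Carrol",
             "Douglas", "Washington", "San Bernadino,Orange"),
  Spending = c(15000, 20000, 10000, 5000, 2000, 34000)
)

# Step 1: set the statewide rows aside
statewide_q <- question_df %>%
  filter(County == "Statewide") %>%
  select(State, Spending_State = Spending)

result_q <- question_df %>%
  filter(County != "Statewide") %>%
  # Step 2: share of each row per county (number of commas + 1)
  mutate(Spending_div = Spending / (str_count(County, ",") + 1)) %>%
  # Step 3: one row per county; the regex also removes a space after the comma
  separate_rows(County, sep = ",\\s*") %>%
  # Step 4: a county can appear in several rows, so sum its shares
  group_by(State, County) %>%
  summarize(Spending_County = sum(Spending_div), .groups = "drop") %>%
  # Step 5: add an equal share of the statewide amount to each county
  left_join(statewide_q, by = "State") %>%
  replace_na(list(Spending_State = 0)) %>%
  group_by(State) %>%
  mutate(Spending_County = Spending_County + Spending_State / n()) %>%
  ungroup()

# Splitting should neither create nor lose money
stopifnot(isTRUE(all.equal(sum(result_q$Spending_County), sum(question_df$Spending))))

print(result_q)
```

The `stopifnot()` line is just a sanity check that the per-county amounts still add up to the original total. Note that the statewide amount is divided by the number of counties that actually appear in the data for that state, not by the real number of counties in the state.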
    

There is also a way to do this without extracting the statewide information first, but that is messier and less "tidy", IMHO.
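
For what it's worth, here is one way the single-pipeline variant might look. Treat it as a sketch under assumptions rather than a recommended solution: `df_alt`, `result_alt` and the helper column `is_statewide` are names invented for this example, and the data is the data frame from the question.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Data frame from the question (hypothetical name df_alt)
df_alt <- data.frame(
  State = c("Maryland", "Maryland", "Maryland", "Washington", "Arizona", "California"),
  County = c("Baltimore,Montgomery,Frederick", "Statewide", "Baltimore, Carrol",
             "Douglas", "Washington", "San Bernadino,Orange"),
  Spending = c(15000, 20000, 10000, 5000, 2000, 34000)
)

result_alt <- df_alt %>%
  mutate(
    is_statewide = County == "Statewide",
    # per-county share of ordinary rows; statewide rows are handled separately
    Spending_div = if_else(is_statewide, 0, Spending / (str_count(County, ",") + 1))
  ) %>%
  group_by(State) %>%
  # carry the statewide amount on every row of the state (0 if there is none)
  mutate(Spending_State = sum(Spending[is_statewide])) %>%
  ungroup() %>%
  filter(!is_statewide) %>%
  separate_rows(County, sep = ",\\s*") %>%
  group_by(State, County, Spending_State) %>%
  summarize(Spending_County = sum(Spending_div), .groups = "drop") %>%
  group_by(State) %>%
  # distribute the statewide amount over the counties seen in that state
  mutate(Spending_County = Spending_County + Spending_State / n()) %>%
  ungroup() %>%
  select(State, County, Spending_County)

print(result_alt)
```

The extra bookkeeping columns are what make this version feel messier to me. One caveat: a state that only has a "Statewide" row would drop out of the result entirely, because there are no county rows left to carry its amount.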