R: If name includes specific text, then group it

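I'm using Energy Performance Certificate data to identify the heating fuel types of buildings in an area. However, the descriptions are split into more than 60 different subsets of 9 main fuel types. I'd like to add another column for fuel type so that the buildings can be grouped by the 9 main fuel types.

An example of the relevant columns of the data: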

BuildingID <- c(1,2,3,4,5,6,7,8,9,10)

MainHeatDesc <- c("Boiler and radiators, mains gas", "Boiler and radiators, oil", "Room heaters, electric", "Room heaters, LPG", "Air source heat pump, underfloor heating, electric", "Air source heat pump, fan coil units, electric", "Ground source heat pump, mains gas", "Electric storage heaters", "Room heaters, wood logs", "Boilers and radiators, wood chips")

data <- data.frame(BuildingID, MainHeatDesc)

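This is a small example using a few of the subsets from the original data. In this example, I would like to create another column for the main fuel type, grouping the rows into: Mains gas, Oil, Electric, LPG and Wood.

The end result should look like this: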

# BuildingID            MainHeatDesc                             MainFuelType
#     1       Boiler and radiators, mains gas                      Mains gas
#     2       Boiler and radiators, oil                              Oil
#     3       Room heaters, electric                               Electric
#     4       Room heaters, LPG                                      LPG 
#     5       Air source heat pump, underfloor heating, electric   Electric
#     6       Air source heat pump, fan coil units, electric       Electric
#     7       Ground source heat pump, mains gas                   Mains gas
#     8       Electric storage heaters                             Electric
#     9       Room heaters, wood logs                                Wood
#    10       Boilers and radiators, wood chips                      Wood

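If anyone could help me, I would be very grateful. If you have any questions or need more information, please let me know.

Thanks!

One dplyr and stringr option could be:

library(dplyr)
library(stringr)

data %>%
 mutate(group = str_extract(MainHeatDesc, regex("\\bMains gas|\\bOil|\\bElectric|\\bLPG|\\bwood", ignore_case = TRUE)))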

   BuildingID                                       MainHeatDesc     group
1           1                    Boiler and radiators, mains gas mains gas
2           2                          Boiler and radiators, oil       oil
3           3                             Room heaters, electric  electric
4           4                                  Room heaters, LPG       LPG
5           5 Air source heat pump, underfloor heating, electric  electric
6           6     Air source heat pump, fan coil units, electric  electric
7           7                 Ground source heat pump, mains gas mains gas
8           8                           Electric storage heaters  Electric
9           9                            Room heaters, wood logs      wood
10         10                  Boilers and radiators, wood chips      wood

If you have many patterns, you can build the regular expression like this:

x <- paste(paste0("\\b", c("Mains gas", "Oil", "Electric", "LPG", "wood"), "\\b"), collapse = "|")

data %>%
 mutate(group = str_extract(MainHeatDesc, regex(x, ignore_case = TRUE)))

If you want to match your expected output more closely, you can map each extracted value onto a vector of canonical labels. A lookup with match is used here because str_replace recycles a replacement vector by row position, not by pattern:

y <- c("Mains gas", "Oil", "Electric", "LPG", "Wood")

data %>%
 mutate(group = str_extract(MainHeatDesc, regex(x, ignore_case = TRUE)),
        group = y[match(tolower(group), tolower(y))])

   BuildingID                                       MainHeatDesc     group
1           1                    Boiler and radiators, mains gas Mains gas
2           2                          Boiler and radiators, oil       Oil
3           3                             Room heaters, electric  Electric
4           4                                  Room heaters, LPG       LPG
5           5 Air source heat pump, underfloor heating, electric  Electric
6           6     Air source heat pump, fan coil units, electric  Electric
7           7                 Ground source heat pump, mains gas Mains gas
8           8                           Electric storage heaters  Electric
9           9                            Room heaters, wood logs      Wood
10         10                  Boilers and radiators, wood chips      Wood

Similar logic to @tmfmnk's answer, but using sub in base R:

types <- c('Mains Gas', 'Oil', 'Electric', 'LPG', 'Wood')
data$MainFuelType <- sub(paste0(".*(?i)(", paste0("\\b", types, "\\b", 
                        collapse = "|"), ").*"), "\\1", data$MainHeatDesc)

data
#   BuildingID                                       MainHeatDesc MainFuelType
#1           1                    Boiler and radiators, mains gas    mains gas
#2           2                          Boiler and radiators, oil          oil
#3           3                             Room heaters, electric     electric
#4           4                                  Room heaters, LPG          LPG
#5           5 Air source heat pump, underfloor heating, electric     electric
#6           6     Air source heat pump, fan coil units, electric     electric
#7           7                 Ground source heat pump, mains gas    mains gas
#8           8                           Electric storage heaters     Electric
#9           9                            Room heaters, wood logs         wood
#10         10                  Boilers and radiators, wood chips         wood

The dynamically generated regular expression looks like this:

paste0(".*(?i)(", paste0("\\b", types, "\\b", collapse = "|"), ").*")
#[1] ".*(?i)(\\bMains Gas\\b|\\bOil\\b|\\bElectric\\b|\\bLPG\\b|\\bWood\\b).*"

where (?i) makes the match case-insensitive.
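As a small illustration of why the \\b word boundaries matter for this data: "Boiler" contains the letters "oil", so an unanchored pattern would classify every boiler as oil-fired. The sketch below uses only base R, and the vector name desc is just for illustration:

```r
desc <- c("Boiler and radiators, mains gas", "Boiler and radiators, oil")

# Without a word boundary, "oil" also matches inside "Boiler"
plain <- grepl("oil", desc, ignore.case = TRUE)

# "\\b" in R source is the regex \b, so only the standalone word "oil" matches
bounded <- grepl("\\boil\\b", desc, ignore.case = TRUE)

print(plain)
print(bounded)
```

Here plain is TRUE for both descriptions, while bounded is TRUE only for the second one.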

Another approach is to use nested ifelse statements with grepl, which tests whether a regular expression pattern matches:

data$MainFuelType <- ifelse(grepl("mains gas", data$MainHeatDesc), "Mains gas",
                        ifelse(grepl("\\boil", data$MainHeatDesc), "Oil",
                               ifelse(grepl("(e|E)lectric", data$MainHeatDesc), "Electric",
                                      ifelse(grepl("LPG", data$MainHeatDesc), "LPG", "Wood"))))

Result:

data
   BuildingID                                       MainHeatDesc MainFuelType
1           1                    Boiler and radiators, mains gas    Mains gas
2           2                          Boiler and radiators, oil          Oil
3           3                             Room heaters, electric     Electric
4           4                                  Room heaters, LPG          LPG
5           5 Air source heat pump, underfloor heating, electric     Electric
6           6     Air source heat pump, fan coil units, electric     Electric
7           7                 Ground source heat pump, mains gas    Mains gas
8           8                           Electric storage heaters     Electric
9           9                            Room heaters, wood logs         Wood
10         10                  Boilers and radiators, wood chips         Wood
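With 9 fuel types, the nested ifelse gets hard to read, and its final branch labels anything unmatched as "Wood". One possible alternative, sketched below in base R, keeps the patterns in a named vector and leaves unmatched rows as NA. The names fuel_patterns and classify_fuel are hypothetical helpers, and the "Community scheme" row is a made-up description added only to show the no-match case:

```r
# Label = regex; earlier entries win when several patterns match
fuel_patterns <- c(
  "Mains gas" = "\\bmains gas\\b",
  "Oil"       = "\\boil\\b",
  "Electric"  = "\\belectric\\b",
  "LPG"       = "\\bLPG\\b",
  "Wood"      = "\\bwood\\b"
)

# Hypothetical helper: returns the first matching label, or NA if none match
classify_fuel <- function(desc, patterns) {
  fuel <- rep(NA_character_, length(desc))
  for (label in names(patterns)) {
    # Only fill rows that have not been labelled yet
    hit <- is.na(fuel) & grepl(patterns[[label]], desc, ignore.case = TRUE)
    fuel[hit] <- label
  }
  fuel
}

desc <- c("Boiler and radiators, mains gas",
          "Boiler and radiators, oil",
          "Room heaters, electric",
          "Electric storage heaters",
          "Room heaters, wood logs",
          "Community scheme")  # made-up row with no fuel keyword

result <- classify_fuel(desc, fuel_patterns)
print(result)
```

Applied to the question's data, this would be data$MainFuelType <- classify_fuel(data$MainHeatDesc, fuel_patterns). Rows left as NA can then be inspected to decide which patterns are still missing.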