R data manipulations

Question

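Many thanks in advance for your help...
I have histograms of the height distribution of some items for different countries and regions (see the table below), and I need to do the following:

Convert the histogram data back into "raw" data vectors (i.e. repeat each Height value Count times). For example:
England north: 12,12,12,12,12,11,11,11,10,8
England south: 12,12,10,10,7,7,7,7,7
and so on...
Compute the Wasserstein distance between every pair of vectors - transport::wasserstein1d(vectorA, vectorB).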

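Country	Region	Height	Count
England	north	12	5
England	north	11	3
England	north	10	1
England	north	8	1
England	south	12	2
England	south	10	2
England	south	7	5
France	East	11	3
France	East	10	1
France	East	8	1
France	South	12	2
France	South	11	3
France	South	10	1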

Answer 1

Here is an approach using combn -

library(dplyr)
library(tidyr)

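# df: the table from the question, with columns Country, Region, Height, Count
df %>%
  uncount(Count) %>%                    # repeat each row Count times
  split(.[c('Country', 'Region')]) %>%  # one data frame per Country/Region pair
  Filter(nrow, .) -> list_df            # drop combinations that have no rows

# Wasserstein distance for every pair of regions, one result row per pair
do.call(rbind, combn(seq_along(list_df), 2, function(x) {
  data.frame(region1 = paste0(list_df[[x[1]]]$Country[1], list_df[[x[1]]]$Region[1]),
             region2 = paste0(list_df[[x[2]]]$Country[1], list_df[[x[2]]]$Region[1]),
             result = transport::wasserstein1d(list_df[[x[1]]]$Height,
                                               list_df[[x[2]]]$Height))
}, simplify = FALSE))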

#       region1      region2 result
#1   FranceEast Englandnorth  0.900
#2   FranceEast Englandsouth  1.867
#3   FranceEast  FranceSouth  0.967
#4 Englandnorth Englandsouth  2.322
#5 Englandnorth  FranceSouth  0.400
#6 Englandsouth  FranceSouth  2.389
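
If you would rather see the expansion step without tidyr, here is a minimal base R sketch of the same idea. It assumes the column names Country, Region, Height and Count from the question; the names expanded and raw_vectors are made up for this illustration.

```r
# Toy data retyped from the question's table (column names are an assumption)
df <- data.frame(
  Country = rep(c("England", "France"), times = c(7, 6)),
  Region  = rep(c("north", "south", "East", "South"), times = c(4, 3, 3, 3)),
  Height  = c(12, 11, 10, 8, 12, 10, 7, 11, 10, 8, 12, 11, 10),
  Count   = c(5, 3, 1, 1, 2, 2, 5, 3, 1, 1, 2, 3, 1)
)

# Repeat each row index Count times, then index the data frame with the result
expanded <- df[rep(seq_len(nrow(df)), times = df$Count), ]

# One raw Height vector per Country/Region combination
raw_vectors <- split(expanded$Height, paste(expanded$Country, expanded$Region))

raw_vectors[["England north"]]
# 12 12 12 12 12 11 11 11 10 8
```

Each element of raw_vectors can then be passed to transport::wasserstein1d() in the same way as the Height columns of list_df.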

Answer 2

Here is how I would generate a distance matrix:

library(dplyr)

your_data %>%
  mutate(full_region = paste(Country, Region)) %>%
  group_by(full_region) %>%
  summarize(points = list(rep(Height, Count))) %>%
  (\(df) matrix(df$points, dimnames = list(df$full_region))) %>%
  usedist::dist_make(\(a, b) transport::wasserstein1d(a[[1]], b[[1]]))

Returns:

              England north England south France East
England south     2.3222222                          
France East       0.9000000     1.8666667            
France South      0.4000000     2.3888889   0.9666667

Data used:

your_data <- structure(list(Country = c("England", "England", "England", "England", "England", "England", "England", "France", "France", "France", "France", "France", "France"), Region = c("north", "north", "north", "north", "south", "south", "south", "East", "East", "East", "South", "South", "South"), Height = c(12L, 11L, 10L, 8L, 12L, 10L, 7L, 11L, 10L, 8L, 12L, 11L, 10L), Count = c(5L, 3L, 1L, 1L, 2L, 2L, 5L, 3L, 1L, 1L, 2L, 3L, 1L)), row.names = c(NA, -13L), class = c("tbl_df", "tbl", "data.frame"))
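
As a sanity check on the numbers, the 1-Wasserstein distance between two one-dimensional samples equals the area between their empirical CDFs, so it can be reproduced in base R without the transport package. The sketch below is only a cross-check under that assumption (p = 1, unweighted samples); w1_ecdf is a hypothetical helper, not part of any package.

```r
# Area between the two empirical CDFs, i.e. the 1-Wasserstein distance in 1D
w1_ecdf <- function(a, b) {
  grid <- sort(unique(c(a, b)))          # every point where either ECDF jumps
  gap  <- abs(ecdf(a)(grid) - ecdf(b)(grid))
  # Both ECDFs are constant between consecutive grid points,
  # so the area is a sum of rectangles (height = gap, width = spacing)
  sum(gap[-length(grid)] * diff(grid))
}

# Raw vectors expanded by hand from the data
england_north <- rep(c(12, 11, 10, 8), times = c(5, 3, 1, 1))
england_south <- rep(c(12, 10, 7),     times = c(2, 2, 5))
france_south  <- rep(c(12, 11, 10),    times = c(2, 3, 1))

w1_ecdf(england_north, france_south)   # about 0.4
w1_ecdf(england_north, england_south)  # about 2.3222
```

Both values agree with the corresponding entries of the distance matrix, which suggests that the expansion and the pairing are doing what was intended.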
