Variables in one column, values in another: how to add a column for each variable

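I think I have a (hopefully) small problem, but searching has not turned up anything helpful. I am having trouble with data I pull through the OECD package: the dataset I get stores all variables in a single column. The dataset is in long format, which is fine, but I want each variable in its own column. Currently the dataset looks like this: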

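As you can see, the column "VAR" contains several variables: "B11", "B12", ... 11 variables in total. All variables are measured for many countries (column "COU"). What I want to do is add new columns to the dataset, one for each variable currently stored in "VAR", each holding the corresponding values from the "obsValue" column.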

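That way I could see the value of B11 for, e.g., Afghanistan 1999 in one row and 2000 in another, with the 1999 value of B12 in the same row as the 1999 value of B11, and so on. I hope my goal is becoming clear; if not, please do not hesitate to ask.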

Here is the code to reproduce the head of the dataset:

dput(head(MIG,20)) 

structure(list(CO2 = c("AFG", "AFG", "AFG", "AFG", "AFG", "AFG", 
"AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", 
"AFG", "AFG", "AFG", "AFG", "AFG"), VAR = c("B11", "B11", "B11", 
"B11", "B11", "B11", "B11", "B11", "B11", "B11", "B11", "B11", 
"B11", "B11", "B11", "B11", "B12", "B12", "B12", "B12"), GEN = c("WMN", 
"WMN", "WMN", "WMN", "WMN", "WMN", "WMN", "WMN", "WMN", "WMN", 
"WMN", "WMN", "WMN", "WMN", "WMN", "WMN", "WMN", "WMN", "WMN", 
"WMN"), COU = c("AUS", "AUS", "AUS", "AUS", "AUS", "AUS", "AUS", 
"AUS", "AUS", "AUS", "AUS", "AUS", "AUS", "AUS", "AUS", "AUS", 
"AUS", "AUS", "AUS", "AUS"), TIME_FORMAT = c("P1Y", "P1Y", "P1Y", 
"P1Y", "P1Y", "P1Y", "P1Y", "P1Y", "P1Y", "P1Y", "P1Y", "P1Y", 
"P1Y", "P1Y", "P1Y", "P1Y", "P1Y", "P1Y", "P1Y", "P1Y"), obsTime = c("1999", 
"2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", 
"2008", "2009", "2010", "2011", "2012", "2013", "2014", "1999", 
"2000", "2001", "2004"), obsValue = c(434, 398, 225, 345, 544, 
726, 1099, 1607, 1377, 1018, 946, 873, 1131, 903, 1230, 2939, 
0, 0, 2, 24), OBS_STATUS = c(NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_), migrants = c(434, 398, 225, 345, 
544, 726, 1099, 1607, 1377, 1018, 946, 873, 1131, 903, 1230, 
2939, 0, 0, 2, 24)), .Names = c("CO2", "VAR", "GEN", "COU", "TIME_FORMAT", 
"obsTime", "obsValue", "OBS_STATUS", "migrants"), row.names = c(NA, 
-20L), class = c("tbl_df", "tbl", "data.frame"))

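Here is my entire code, including my two attempts to solve the problem myself. Neither worked: they either just copy the "obsValue" column or give me a column showing TRUE or FALSE. Be aware that R will need quite some time to load the dataset.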

library(OECD)
library(plyr)
library(dplyr)

search_dataset("migration")
MIG<- get_dataset("MIG")
get_data_structure("MIG")

MIG$migrants <- if(MIG$VAR == "B11")MIG$migrants<-MIG$obsValue else MIG$migrants<-NA


MIG_long <- mutate(MIG,migrants=VAR=="B11")
if(MIG_long$migrants==T)MIG_long$migrants<-MIG_long$obsValue else MIG_long$migrants<-NA
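A note on why both attempts behave this way: if() expects a single TRUE or FALSE, not one value per row. Older versions of R used only the first element of the condition (with a warning), which is why the whole "obsValue" column gets copied; since R 4.2.0 a condition of length greater than one is an error. The row-by-row version of the first attempt is the vectorized ifelse(). Below is a minimal sketch on a toy data frame; the name toy is hypothetical and the values are taken from the dput() output above.

```r
# Toy data standing in for MIG (hypothetical name; values from the dput() output)
toy <- data.frame(
  VAR      = c("B11", "B11", "B12", "B12"),
  obsValue = c(434, 398, 0, 2)
)

# ifelse() evaluates the test for every row:
# keep obsValue where VAR is "B11", otherwise fill in NA
toy$migrants <- ifelse(toy$VAR == "B11", toy$obsValue, NA)

print(toy)
```

This produces one column for one variable, but B11 and B12 still sit in different rows, so it does not by itself give the wide layout described above.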

I hope this question is not too basic for you and that you can work with my explanation. If you have any questions, please ask.

Best wishes, Marcel

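You can use spread() from tidyr to spread the values of obsValue into columns named after the entries of VAR. If you do want one row per year, as @atiretoo highlighted, simply drop the migrants column first. Because migrants is a copy of obsValue, leaving it in would keep every row distinct, so the B11 and B12 values for the same year would never end up in the same row.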

library(tidyr)
library(dplyr)

MIG %>% 
  select(-migrants) %>%
  spread(VAR, obsValue)

     CO2 obsTime   B11   B12
   (chr)   (chr) (dbl) (dbl)
1    AFG    1999   434     0
2    AFG    2000   398     0
3    AFG    2001   225     2
4    AFG    2002   345    NA
5    AFG    2003   544    NA
6    AFG    2004   726    24
7    AFG    2005  1099    NA
8    AFG    2006  1607    NA
9    AFG    2007  1377    NA
10   AFG    2008  1018    NA
11   AFG    2009   946    NA
12   AFG    2010   873    NA
13   AFG    2011  1131    NA
14   AFG    2012   903    NA
15   AFG    2013  1230    NA
16   AFG    2014  2939    NA
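Since tidyr 1.0.0, spread() is superseded by pivot_wider(), which does the same reshaping with named arguments. A minimal sketch of the equivalent call follows, assuming a recent tidyr is installed; mig_small is a hypothetical stand-in for MIG, built from a few rows of the dput() output in the question so the example runs without downloading anything.

```r
library(tidyr)
library(dplyr)

# Hypothetical small stand-in for MIG (values from the question's dput() output)
mig_small <- tibble(
  CO2      = "AFG",
  VAR      = c("B11", "B11", "B11", "B12", "B12", "B12"),
  COU      = "AUS",
  obsTime  = c("1999", "2000", "2001", "1999", "2000", "2001"),
  obsValue = c(434, 398, 225, 0, 0, 2)
)
# Mimic the extra column from the question, a copy of obsValue
mig_small$migrants <- mig_small$obsValue

mig_wide <- mig_small %>%
  select(-migrants) %>%                                  # drop the copy so rows can line up
  pivot_wider(names_from = VAR, values_from = obsValue)  # one column per VAR entry

print(mig_wide)
```

Every column other than VAR and obsValue acts as an identifier, so rows are matched on CO2, COU and obsTime here. Year and variable combinations that are missing from the long data come out as NA, as in the output above.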