How to replace an NA with an equivalent value from elsewhere in a dataset?

Question

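I tried looking for a similar question but couldn't find one. If you know of one, please let me know!

I have been working on a project that studies cereal staples.

Here is a subset of my dataset: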

                nutrient.component.      grain nutrients
1                Beta-carotene (μg) White Rice      0.00
2                Beta-carotene (μg) Brown Rice        NA
3                      Calcium (mg) White Rice     28.00
4                      Calcium (mg) Brown Rice     23.00
5                 Carbohydrates (g) White Rice     80.00
6                 Carbohydrates (g) Brown Rice     77.00
7                       Copper (mg) White Rice      0.22
8                       Copper (mg) Brown Rice        NA
9                       Energy (kJ) White Rice   1528.00
10                      Energy (kJ) Brown Rice   1549.00
11                          Fat (g) White Rice      0.66
12                          Fat (g) Brown Rice      2.92
13                        Fiber (g) White Rice      1.30
14                        Fiber (g) Brown Rice      3.50
15           Folate Total (B9) (μg) White Rice      8.00
16           Folate Total (B9) (μg) Brown Rice     20.00
17                        Iron (mg) White Rice      0.80
18                        Iron (mg) Brown Rice      1.47
19           Lutein+zeaxanthin (μg) White Rice      0.00
20           Lutein+zeaxanthin (μg) Brown Rice        NA
21                   Magnesium (mg) White Rice     25.00
22                   Magnesium (mg) Brown Rice    143.00
23                   Manganese (mg) White Rice      1.09
24                   Manganese (mg) Brown Rice      3.74
25  Monounsaturated fatty acids (g) White Rice      0.21
26  Monounsaturated fatty acids (g) Brown Rice      1.05
27                 Niacin (B3) (mg) White Rice      1.60
28                 Niacin (B3) (mg) Brown Rice      5.09
29       Pantothenic acid (B5) (mg) White Rice      1.01
30       Pantothenic acid (B5) (mg) Brown Rice      1.49
31                  Phosphorus (mg) White Rice    115.00
32                  Phosphorus (mg) Brown Rice    333.00
33  Polyunsaturated fatty acids (g) White Rice      0.18
34  Polyunsaturated fatty acids (g) Brown Rice      1.04
35                   Potassium (mg) White Rice    115.00
36                   Potassium (mg) Brown Rice    223.00
37                      Protein (g) White Rice      7.10
38                      Protein (g) Brown Rice      7.90
39              Riboflavin (B2)(mg) White Rice      0.05
40              Riboflavin (B2)(mg) Brown Rice      0.09
41        Saturated fatty acids (g) White Rice      0.18
42        Saturated fatty acids (g) Brown Rice      0.58
43                    Selenium (μg) White Rice     15.10
44                    Selenium (μg) Brown Rice        NA
45                      Sodium (mg) White Rice      5.00
46                      Sodium (mg) Brown Rice      7.00
47                        Sugar (g) White Rice      0.12
48                        Sugar (g) Brown Rice      0.85
49                 Thiamin (B1)(mg) White Rice      0.07
50                 Thiamin (B1)(mg) Brown Rice      0.40
51                   Vitamin A (IU) White Rice      0.00
52                   Vitamin A (IU) Brown Rice      0.00
53                  Vitamin B6 (mg) White Rice      0.16
54                  Vitamin B6 (mg) Brown Rice      0.51
55                   Vitamin C (mg) White Rice      0.00
56                   Vitamin C (mg) Brown Rice      0.00
57 Vitamin E, alpha-tocopherol (mg) White Rice      0.11
58 Vitamin E, alpha-tocopherol (mg) Brown Rice      0.59
59                  Vitamin K1 (μg) White Rice      0.10
60                  Vitamin K1 (μg) Brown Rice      1.90
61                        Water (g) White Rice     12.00
62                        Water (g) Brown Rice     10.00
63                        Zinc (mg) White Rice      1.09
64                        Zinc (mg) Brown Rice      2.02

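Brown Rice has four NA values.
Based on this table, I think it is fair to assume that the missing Brown Rice values would be very close to the equivalent White Rice values, and that mirroring the White Rice values would be more accurate than converting the NAs to zero.

My question is: other than looking up and typing in the White Rice equivalent for each Brown Rice nutrient by hand, how can code replace an NA with the equivalent White Rice value? For example, I would like the NA for Copper in Brown Rice to become the same value as Copper in White Rice (0.22). Would it be better to replace the NAs with zeros first? If I do that, though, I end up with six nutrients whose value is zero instead of four whose value is NA. I am trying to work out the right way to think about solving this in code. Any insight would be much appreciated.

Thanks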

Answer 1

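I am assuming your dataset is of class data.frame and that it is named dat.

I believe the code below does it. It splits the data frame into a list of pieces with 2 rows or 1 row each (the last row in the example has no Brown Rice counterpart). It then checks whether each piece has 2 rows and whether the Brown Rice nutrient value is NA. If so, it assigns the White Rice value. Finally, the resulting list is bound back into a data.frame.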

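# Split into one small data frame per nutrient
sp <- split(dat, dat$nutrient.component.)
res <- lapply(sp, function(x) {
    brown <- x$grain == "Brown Rice"
    white <- x$grain == "White Rice"
    # && short-circuits, so one-row pieces never reach the is.na() test
    if (nrow(x) == 2 && is.na(x$nutrients[brown]))
        x$nutrients[brown] <- x$nutrients[white]   # copy the value, keep the grain label
    x
})

rm(sp)   # tidy up

# Bind the pieces back into a single data.frame
res <- do.call(rbind, res)
res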
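
The same idea can also be written without splitting the data, by looking up each row's White Rice value and copying it only where Brown Rice is NA. This is just a sketch in base R; the small `dat` below is a toy subset built from values in the question, and the helper names `white`, `lookup` and `fill_here` are made up for illustration.

```r
# Toy subset of the question's data (values copied from the question)
dat <- data.frame(
  nutrient.component. = c("Calcium (mg)", "Calcium (mg)", "Copper (mg)", "Copper (mg)"),
  grain               = c("White Rice", "Brown Rice", "White Rice", "Brown Rice"),
  nutrients           = c(28, 23, 0.22, NA),
  stringsAsFactors    = FALSE
)

# Rows that hold the White Rice reference values
white <- dat[dat$grain == "White Rice", ]

# For every row of dat, the White Rice value of the same nutrient
# (NA where a nutrient has no White Rice row)
lookup <- white$nutrients[match(dat$nutrient.component., white$nutrient.component.)]

# Only touch Brown Rice rows whose value is missing
fill_here <- is.na(dat$nutrients) & dat$grain == "Brown Rice"
dat$nutrients[fill_here] <- lookup[fill_here]

dat
```

Unlike the split version, this keeps the original row order and row names, and a nutrient that has no White Rice row is simply left as NA.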

Answer 2

The zoo package has some useful functions for dealing with NA:

library(data.table)
setDT(DT)[, nutrients := zoo::na.aggregate(nutrients), by = nutrient.component][]

                  nutrient.component      grain nutrients
 1:               Beta-carotene (µg) White Rice      0.00
 2:               Beta-carotene (µg) Brown Rice      0.00
 3:                     Calcium (mg) White Rice     28.00
 4:                     Calcium (mg) Brown Rice     23.00
 5:                Carbohydrates (g) White Rice     80.00
 6:                Carbohydrates (g) Brown Rice     77.00
 7:                      Copper (mg) White Rice      0.22
 8:                      Copper (mg) Brown Rice      0.22
 9:                      Energy (kJ) White Rice   1528.00
10:                      Energy (kJ) Brown Rice   1549.00
11:                          Fat (g) White Rice      0.66
12:                          Fat (g) Brown Rice      2.92
13:                        Fiber (g) White Rice      1.30
14:                        Fiber (g) Brown Rice      3.50
15:           Folate Total (B9) (µg) White Rice      8.00
16:           Folate Total (B9) (µg) Brown Rice     20.00
17:                        Iron (mg) White Rice      0.80
18:                        Iron (mg) Brown Rice      1.47
19:           Lutein+zeaxanthin (µg) White Rice      0.00
20:           Lutein+zeaxanthin (µg) Brown Rice      0.00
...

Note rows 2, 8 and 20.

data.table is used here because it updates DT in place, which avoids copying the whole table and saves memory and time.
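
One thing worth knowing about na.aggregate: by default it replaces each NA with the mean of the non-NA values in its group. With exactly two grains per nutrient, that mean is simply the White Rice value. The sketch below reproduces that behaviour in base R with `ave()` so it runs without extra packages. It is an illustration only: the third grain, "Wild Rice", and its value 0.30 are invented to show what happens once a group has more than two rows, and `fill_with_group_mean` is a made-up helper name.

```r
# Copper values from the question plus one INVENTED third grain
dat <- data.frame(
  nutrient.component. = rep("Copper (mg)", 3),
  grain               = c("White Rice", "Brown Rice", "Wild Rice"),  # "Wild Rice" is hypothetical
  nutrients           = c(0.22, NA, 0.30),                           # 0.30 is hypothetical
  stringsAsFactors    = FALSE
)

# Replace each NA with the mean of the non-NA values in the same group
fill_with_group_mean <- function(v) {
  v[is.na(v)] <- mean(v, na.rm = TRUE)
  v
}

# ave() applies the function within each nutrient and keeps the row order
dat$nutrients <- ave(dat$nutrients, dat$nutrient.component.,
                     FUN = fill_with_group_mean)
dat
```

Here the Brown Rice NA becomes the average of the other two grains rather than the White Rice value, so the group-mean approach matches the question's intent only while each nutrient has just the two rice rows.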

Answer 3

Assuming the data frame holding your input data is called dt, we can use the fill function from the tidyr package to do this. dt2 is the final output.

library(tidyr)

dt2 <- dt %>% fill(nutrients)

dt2
  nutrient.component.      grain nutrients
1  Beta-carotene (µg) White Rice      0.00
2  Beta-carotene (µg) Brown Rice      0.00
3        Calcium (mg) White Rice     28.00
4        Calcium (mg) Brown Rice     23.00
5   Carbohydrates (g) White Rice     80.00
6   Carbohydrates (g) Brown Rice     77.00
7         Copper (mg) White Rice      0.22
8         Copper (mg) Brown Rice      0.22
...

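By default, fill imputes each NA from the nearest preceding non-NA row. So it is important to make sure that each Brown Rice record sits on the row immediately after its corresponding White Rice record.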
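
If the row order cannot be trusted, one way to enforce it is to sort first and fill within each nutrient. The following is a sketch that assumes dplyr is available alongside tidyr; the deliberately shuffled `dt` is a toy subset built from values in the question.

```r
library(dplyr)
library(tidyr)

# Toy subset of the question's data, deliberately out of order
dt <- data.frame(
  nutrient.component. = c("Copper (mg)", "Calcium (mg)", "Copper (mg)", "Calcium (mg)"),
  grain               = c("Brown Rice", "Brown Rice", "White Rice", "White Rice"),
  nutrients           = c(NA, 23, 0.22, 28),
  stringsAsFactors    = FALSE
)

dt2 <- dt %>%
  # desc(grain) puts "White Rice" ahead of "Brown Rice" within each nutrient
  arrange(nutrient.component., desc(grain)) %>%
  # grouping stops fill() from carrying a value across nutrients
  group_by(nutrient.component.) %>%
  fill(nutrients) %>%
  ungroup()

dt2
```

Grouping also means that a nutrient whose White Rice value is itself missing stays NA instead of silently inheriting the value from the nutrient on the row above.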


r

missing-data

dataframe

na

imputation