R - Identify which columns contain currency data ($)

Question

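I have a very large data set in which some columns are formatted as currency, some are numeric, and some are character. When the data is read in, all of the currency columns are identified as factors, and I need to convert them to numeric. The data set is too wide to identify the columns by hand. I am trying to find a programmatic way to determine whether a column contains currency data (for example, values that start with "$") and then pass that list of columns on to be cleaned.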

name <- c('john','carl', 'hank')
salary <- c('$23,456.33','$45,677.43','$76,234.88')
emp_data <- data.frame(name,salary)

clean <- function(ttt){
  as.numeric(gsub('[^a-zA-Z0-9.]','', ttt))
}
sapply(emp_data, clean)

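The problem in this example is that sapply applies the function to every column, so the name column is replaced with NA. I need a way to programmatically identify only the columns that need the cleaning function applied, which in this example is salary.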

Answer 1

With the dplyr and stringr packages, you can use mutate_if to identify the columns that contain any string starting with $ and then convert them accordingly.

library(dplyr)
library(stringr)

emp_data %>%
  mutate_if(~any(str_detect(., '^\\$'), na.rm = TRUE),
            ~as.numeric(str_replace_all(., '[$,]', '')))
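In dplyr 1.0.0 and later, mutate_if is superseded by across() combined with where(). The following is a minimal sketch of the same idea in that style, assuming dplyr 1.0.0 or newer is installed. The helper name is_dollar_col is hypothetical and introduced only for illustration, and the example data is rebuilt so the block runs on its own.

```r
library(dplyr)
library(stringr)

# Rebuild the example data so this block is self-contained
emp_data <- data.frame(
  name   = c("john", "carl", "hank"),
  salary = c("$23,456.33", "$45,677.43", "$76,234.88")
)

# Hypothetical helper: TRUE when any value in the column starts with "$".
# as.character() lets it work on factor columns as well as character ones.
is_dollar_col <- function(x) {
  any(str_detect(as.character(x), "^\\$"), na.rm = TRUE)
}

# Apply the cleaning step only to the columns the helper selects
cleaned <- emp_data %>%
  mutate(across(where(is_dollar_col),
                ~ as.numeric(str_replace_all(as.character(.x), "[$,]", ""))))

str(cleaned)
```

Keeping the detection logic in a named function makes it easy to test on a single column before running it across a wide data frame.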

Answer 2

A base R option is to use startsWith to detect the dollar columns and gsub to remove "$" and "," from those columns.

doll_cols <- sapply(emp_data, function(x) any(startsWith(as.character(x), '$')))
emp_data[doll_cols] <- lapply(emp_data[doll_cols], 
                              function(x) as.numeric(gsub('\\$|,', '', x)))
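For reuse across several data frames, the two base R steps can be wrapped in one function. This is only a sketch: the name clean_dollar_cols is hypothetical, and it assumes that a single value starting with "$" is enough to treat the whole column as currency.

```r
# Rebuild the example data so this block is self-contained
emp_data <- data.frame(
  name   = c("john", "carl", "hank"),
  salary = c("$23,456.33", "$45,677.43", "$76,234.88")
)

# Hypothetical wrapper: detect "$" columns, then strip "$" and "," from them
clean_dollar_cols <- function(df) {
  # One TRUE/FALSE per column; as.character() covers factor columns
  is_dollar <- vapply(
    df,
    function(x) any(startsWith(as.character(x), "$"), na.rm = TRUE),
    logical(1)
  )
  # Only touch the data frame when at least one column qualifies
  if (any(is_dollar)) {
    df[is_dollar] <- lapply(
      df[is_dollar],
      function(x) as.numeric(gsub("[$,]", "", as.character(x)))
    )
  }
  df
}

result <- clean_dollar_cols(emp_data)
str(result)
```

vapply is used instead of sapply so that the detection step always returns a plain logical vector, even for a data frame with no columns.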

Answer 3

Take advantage of the powerful parsers that the readr package provides out of the box:

my_parser <- function(col) {
  # Try first with parse_number that handles currencies automatically quite well
  res <- suppressWarnings(readr::parse_number(col))
  if (is.null(attr(res, "problems", exact = TRUE))) {
    res
  } else {
    # If parse_number fails, fall back on parse_guess
    readr::parse_guess(col)
    # Alternatively, we could simply return col without further parsing attempt
  }
}

library(dplyr)

emp_data %>% 
  mutate(foo = "USD13.4",
         bar = "£37") %>% 
  mutate_all(my_parser)

#   name   salary  foo bar
# 1 john 23456.33 13.4  37
# 2 carl 45677.43 13.4  37
# 3 hank 76234.88 13.4  37
