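Can't categorize data using R
I'm trying to categorize my data so that I can build a logistic regression model. I'm new to R and learning it for my studies. I used this code, which I have seen in several examples, but nothing is recoded and the values stay the same. There is no error either.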
ds <- read.csv("adult.csv")
colnames(ds)<- c("age","workclass","responsenum","education","education_years","marital_status","occupation","familyrole", "race","sex", "capital_gain", "capital_loss", "hours_per_week","country", "income")
ds$workclass <- as.character(ds$workclass)
ds$workclass[ds$workclass == "Without-pay" | ds$workclass == "Never-worked"] <- "Jobless"
ds$workclass[ds$workclass == "State-gov" | ds$workclass == "Local-gov"] <- "govt"
ds$workclass[ds$workclass == "Self-emp-inc" | ds$workclass == "Self-emp-not-inc"] <- "Self-employed"
When I call table() afterwards, I still get the old names.
Does anyone know what is going wrong?
Output of dput(head(ds)):
structure(list(age = c(50L, 38L, 53L, 28L, 37L, 49L), workclass = c(" Self-emp-not-inc",
" Private", " Private", " Private", " Private", " Private"),
responsenum = c(83311L, 215646L, 234721L, 338409L, 284582L,
160187L), education = c(" Bachelors", " HS-grad", " 11th",
" Bachelors", " Masters", " 9th"), education_years = c(13L,
9L, 7L, 13L, 14L, 5L), marital_status = c(" Married-civ-spouse",
" Divorced", " Married-civ-spouse", " Married-civ-spouse",
" Married-civ-spouse", " Married-spouse-absent"), occupation = c(" Exec-managerial",
" Handlers-cleaners", " Handlers-cleaners", " Prof-specialty",
" Exec-managerial", " Other-service"), familyrole = c(" Husband",
" Not-in-family", " Husband", " Wife", " Wife", " Not-in-family"
), race = c(" White", " White", " Black", " Black", " White",
" Black"), sex = c(" Male", " Male", " Male", " Female",
" Female", " Female"), capital_gain = c(0L, 0L, 0L, 0L, 0L,
0L), capital_loss = c(0L, 0L, 0L, 0L, 0L, 0L), hours_per_week = c(13L,
40L, 40L, 40L, 40L, 16L), country = c(" United-States", " United-States",
" United-States", " Cuba", " United-States", " Jamaica"),
income = c(" <=50K", " <=50K", " <=50K", " <=50K", " <=50K",
" <=50K")), row.names = c(NA, 6L), class = "data.frame")
Your data has leading spaces, so " Self-emp-not-inc" will never match "Self-emp-not-inc".
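To make the mismatch concrete, here is a minimal sketch using a toy vector (the name workclass_demo is made up for illustration; the values are copied from the dput() output above). It shows the failed comparison and a couple of ways to spot the hidden space.

```r
# Toy vector with the same leading spaces as the workclass column above
workclass_demo <- c(" Self-emp-not-inc", " Private", " Private")

# Exact comparison fails because " Private" is not the same string as "Private"
any(workclass_demo == "Private")

# Ways to reveal the hidden whitespace
unique(workclass_demo)       # printed with quotes, so the leading space is visible
nchar(workclass_demo[2])     # 8, one more than nchar("Private")

# After trimming, the comparison behaves as intended
any(trimws(workclass_demo) == "Private")
```

Because a comparison that matches nothing simply selects zero elements, the assignment silently does nothing, which is why no error was reported.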
Some thoughts:
You can trim leading/trailing whitespace from all string-like columns.
str(ds, list.len = 4)
# 'data.frame': 6 obs. of 15 variables:
# $ X39 : int 50 38 53 28 37 49
# $ State.gov : chr " Self-emp-not-inc" " Private" " Private" " Private" ...
# $ X77516 : int 83311 215646 234721 338409 284582 160187
# $ Bachelors : chr " Bachelors" " HS-grad" " 11th" " Bachelors" ...
# [list output truncated]
ischr <- sapply(ds, is.character)
ischr
# X39 State.gov X77516 Bachelors X13 Never.married Adm.clerical Not.in.family
# FALSE TRUE FALSE TRUE FALSE TRUE TRUE TRUE
# White Male X2174 X0 X40 United.States X..50K
# TRUE TRUE FALSE FALSE FALSE TRUE TRUE
ds[ischr] <- lapply(ds[ischr], trimws)
str(ds, list.len = 4)
# 'data.frame': 6 obs. of 15 variables:
# $ X39 : int 50 38 53 28 37 49
# $ State.gov : chr "Self-emp-not-inc" "Private" "Private" "Private" ...
# $ X77516 : int 83311 215646 234721 338409 284582 160187
# $ Bachelors : chr "Bachelors" "HS-grad" "11th" "Bachelors" ...
# [list output truncated]
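As a possible alternative to trimming afterwards, read.csv() accepts strip.white = TRUE, which drops the padding while the file is read. The sketch below uses an inline stand-in for adult.csv (two rows shortened to four fields, so the default V1..V4 column names are an artifact of the example). The column names in the str() output above (X39, State.gov, ...) also suggest the file may have no header row; if that is the case, header = FALSE would keep the first record from being consumed as column names, but that is an assumption about the file.

```r
# Inline stand-in for adult.csv: two records, shortened to four fields
csv_text <- "39, State-gov, 77516, Bachelors
50, Self-emp-not-inc, 83311, Bachelors"

# Default behaviour keeps the space that follows each comma
raw <- read.csv(text = csv_text, header = FALSE)

# strip.white = TRUE removes leading/trailing whitespace from unquoted fields
trimmed <- read.csv(text = csv_text, header = FALSE, strip.white = TRUE)

raw$V2      # values still carry the leading space
trimmed$V2  # values are clean and match "State-gov", "Self-emp-not-inc"
```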
Alternatively, you can add the leading space to all of your patterns, for example:
ds$workclass[ds$workclass == " Without-pay" | ds$workclass == " Never-worked"] <- " Jobless"
ds$workclass[ds$workclass == " State-gov" | ds$workclass == " Local-gov"] <- " govt"
ds$workclass[ds$workclass == " Self-emp-inc" | ds$workclass == " Self-emp-not-inc"] <- " Self-employed"
(I don't think this is the best approach, so I'll assume you went with trimws as in the first suggestion.)
You can use %in% to simplify some of your | conditions:
ds$workclass[ds$workclass %in% c("Without-pay", "Never-worked")] <- "Jobless"
ds$workclass[ds$workclass %in% c("State-gov", "Local-gov")] <- "govt"
ds$workclass[ds$workclass %in% c("Self-emp-inc", "Self-emp-not-inc")] <- "Self-employed"
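Putting the trimming and the %in% recoding together, a small end-to-end check might look like the following. The data frame demo is a made-up stand-in with a single column, not the real dataset; table() at the end is how you would confirm that the old names are gone.

```r
# Made-up stand-in for the real data, with the same leading spaces
demo <- data.frame(
  workclass = c(" State-gov", " Self-emp-not-inc", " Private",
                " Local-gov", " Never-worked"),
  stringsAsFactors = FALSE
)

# Step 1: remove the padding so the patterns can match
demo$workclass <- trimws(demo$workclass)

# Step 2: recode with %in%
demo$workclass[demo$workclass %in% c("Without-pay", "Never-worked")] <- "Jobless"
demo$workclass[demo$workclass %in% c("State-gov", "Local-gov")] <- "govt"
demo$workclass[demo$workclass %in% c("Self-emp-inc", "Self-emp-not-inc")] <- "Self-employed"

# Step 3: verify that only the new categories (plus untouched ones) remain
table(demo$workclass)

# Side note: %in% returns FALSE for missing values, whereas == returns NA
c("Private", NA) %in% "Private"
```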
You can also create a dictionary of sorts that translates some values into others. For example,
translations <- read.csv(header = TRUE, text = "
src,tgt
Without-pay,Jobless
Never-worked,Jobless
State-gov,govt
Local-gov,govt
Self-emp-inc,Self-employed
Self-emp-not-inc,Self-employed")
ds$State.gov
# [1] "Self-emp-not-inc" "Private" "Private" "Private" "Private" "Private"
ifelse(ds$State.gov %in% translations$src, translations$tgt[ match(ds$State.gov, translations$src) ], ds$State.gov)
# [1] "Self-employed" "Private" "Private" "Private" "Private" "Private"
ds$State.gov <- ifelse(ds$State.gov %in% translations$src,
translations$tgt[ match(ds$State.gov, translations$src) ],
ds$State.gov)
(This technique can also be implemented as a merge or dplyr::*_join operation, but for now I think that may be more complicated than you need.)
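For the curious, a rough sketch of the merge variant is below. The data frame people and the helper column row_id are made up for illustration. Since merge() sorts by the key column by default, the sketch records the original row order and restores it at the end.

```r
# Same dictionary as above, built directly as a data frame
translations <- data.frame(
  src = c("Without-pay", "Never-worked", "State-gov", "Local-gov",
          "Self-emp-inc", "Self-emp-not-inc"),
  tgt = c("Jobless", "Jobless", "govt", "govt",
          "Self-employed", "Self-employed"),
  stringsAsFactors = FALSE
)

# Made-up, already-trimmed stand-in for the real data
people <- data.frame(
  age = c(50L, 38L, 53L),
  workclass = c("Self-emp-not-inc", "Private", "State-gov"),
  stringsAsFactors = FALSE
)

# Remember the original row order, because merge() may reorder rows
people$row_id <- seq_len(nrow(people))

# Left join: all.x = TRUE keeps rows that have no entry in the dictionary
merged <- merge(people, translations,
                by.x = "workclass", by.y = "src", all.x = TRUE)

# Use the translation where one exists, otherwise keep the original value
merged$workclass <- ifelse(is.na(merged$tgt), merged$workclass, merged$tgt)

# Restore the original order and drop the helper columns
merged <- merged[order(merged$row_id), c("age", "workclass")]
rownames(merged) <- NULL
merged
```

The extra bookkeeping (row order, NA handling, dropping helper columns) is what makes this heavier than the match()-based lookup shown earlier.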
The main advantage (in my opinion) of this dictionary-like translation is that it is the easiest to read, understand, and maintain.