SAS Programming: How to replace missing values in multiple columns using one column?

Question

Background

I have a large dataset in SAS with 17 variables, of which 4 are numeric and 13 are character/string. The original dataset I am using can be found here: https://www.kaggle.com/austinreese/craigslist-carstrucks-data.

cylinders
condition
drive
paint_color
type
manufacturer
title_status
model
fuel
transmission
description
region
state
price (num)
posting_date (num)
odometer (num)
year (num)

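After applying specific filters to the numeric columns, none of the numeric variables has missing values. However, the remaining 13 char/string variables still have anywhere from a few thousand to several hundred thousand missing values.

Request

Similar to the Towards Data Science blog post shown here (https://towardsdatascience.com/end-to-end-data-science-project-predicting-used-car-prices-using-regression-1b12386c69c8), specifically under the Feature Engineering section, how can I write the equivalent SAS code, in which I use regular expressions on the description column to fill in missing values of the other string/char columns with categorical values, such as cylinders, condition, drive, paint_color, and so on?

Here is the Python code from the blog post.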

import re

manufacturer = '(gmc | hyundai | toyota | mitsubishi | ford | chevrolet | ram | buick | jeep | dodge | subaru | nissan | audi | rover  | lexus \
| honda | chrysler | mini | pontiac | mercedes-benz | cadillac | bmw | kia | volvo | volkswagen | jaguar | acura | saturn | mazda | \
mercury | lincoln | infiniti | ferrari | fiat | tesla | land rover | harley-davidson | datsun | alfa-romeo | morgan | aston-martin | porche \
| hennessey)'
condition = '(excellent | good | fair | like new | salvage | new)'
fuel = '(gas | hybrid | diesel |electric)'
title_status = '(clean | lien | rebuilt | salvage | missing | parts only)'
transmission = '(automatic | manual)'
drive = '(4x4 | awd | fwd | rwd | 4wd)'
size = '(mid-size | full-size | compact | sub-compact)'
type_ = '(sedan | truck | SUV | mini-van | wagon | hatchback | coupe | pickup | convertible | van | bus | offroad)'
paint_color = '(red | grey | blue | white | custom | silver | brown | black | purple | green | orange | yellow)'
cylinders = '(\s[1-9] cylinders? |\s1[0-6]? cylinders?)'

keys =    ['manufacturer', 'condition', 'fuel', 'title_status', 'transmission', 'drive','size', 'type', 'paint_color' , 'cylinders']
columns = [ manufacturer,   condition,   fuel,  title_status, transmission ,drive, size, type_, paint_color,   cylinders]

for i,column in zip(keys,columns):
    database[i] = database[i].fillna(
      database['description'].str.extract(column, flags=re.IGNORECASE, expand=False)).str.lower()

database.drop('description', axis=1, inplace= True)

What is the equivalent SAS code for the Python code shown above?

Answer 1

It is basically just doing a series of word searches.

A simplified example in SAS:

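data want;
    set have;
    /* Candidate values to search for in the description text. */
    array _fuel(4) $ 8 _temporary_ ("gas", "hybrid", "diesel", "electric");

    /* Only fill rows where fuel is missing, so existing values are kept. */
    if missing(fuel) then do i = 1 to dim(_fuel);
        /* 'i' ignores case; 't' trims trailing blanks from the arguments. */
        if find(description, _fuel(i), 'it') > 0 then fuel = _fuel(i);
        /* Multiple matches are not handled: the last one found is kept. */
    end;

    drop i;
run;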

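You can extend this by creating an array for each variable and then looping over your lists. I think you could also replace the loop with the regex functions in SAS, but regular expressions require too much thinking, so someone else will have to provide that answer.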
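
For reference, the regex route mentioned above could look roughly like the following. This is a minimal sketch rather than a tested drop-in: it assumes the dataset names have and want from the example above, covers only three of the columns, and uses \b word boundaries with no spaces around the alternatives, which is a deliberate departure from the blog's patterns (those contain literal spaces). The re_* variable names are hypothetical helpers introduced for illustration.

```sas
data want;
    set have;

    /* Compile each pattern once on the first row and retain the IDs. */
    /* The re_* names are hypothetical helpers, not dataset columns.   */
    retain re_fuel re_transmission re_drive;
    if _n_ = 1 then do;
        re_fuel         = prxparse('/\b(gas|hybrid|diesel|electric)\b/i');
        re_transmission = prxparse('/\b(automatic|manual)\b/i');
        re_drive        = prxparse('/\b(4x4|awd|fwd|rwd|4wd)\b/i');
    end;

    /* Fill a column only when it is missing, as fillna does in pandas. */
    /* PRXMATCH returns the match position (0 if none); PRXPOSN then    */
    /* returns the text captured by group 1 of that match.              */
    if missing(fuel) and prxmatch(re_fuel, description) > 0 then
        fuel = lowcase(prxposn(re_fuel, 1, description));

    if missing(transmission) and prxmatch(re_transmission, description) > 0 then
        transmission = lowcase(prxposn(re_transmission, 1, description));

    if missing(drive) and prxmatch(re_drive, description) > 0 then
        drive = lowcase(prxposn(re_drive, 1, description));

    /* Remove the helper pattern IDs from the output dataset. */
    drop re_fuel re_transmission re_drive;
run;
```

Unlike the FIND loop, this keeps the leftmost match in the description rather than the last array entry that matches. The remaining columns would follow the same pattern. Note that the Python code also lowercases the existing values and drops description afterwards, which this sketch does not do.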
