Assigning values in a DataFrame with a for loop
I have a DataFrame like this:
  manufacturer  cylinders     description
0 toyota        5 cylinders   toyota, gmc 10 years old.
1 NaN           NaN           gmc, Motor runs and drives good.
2 NaN           NaN           Motor old, in pieces. 4 cylinders
3 NaN           12 cylinders  2 owner 0 rust. Cadillac.
And these sets of keywords:
manufacturer = ['gmc', 'toyota', 'cadillac']
cylinders = ['12 cylinders', '4 cylinders', '5 cylinders']
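I want to write a program that reads the description and fills in each column with the correct information based on these keywords. Ideally, the result would look like this: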
  manufacturer  cylinders     description
0 toyota        5 cylinders   toyota, gmc 10 years old.
1 gmc           NaN           gmc, Motor runs and drives good.
2 NaN           4 cylinders   Motor old, in pieces. 4 cylinders
3 cadillac      12 cylinders  2 owner 0 rust. Cadillac.
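I have tried everything, but nothing seems to work. Below is my attempt at adding the keyword to a single column. I need to extend it to multiple columns, and it also overwrites values that are not NaN (for example, it changes "toyota" to "gmc"), which I don't want.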
import re

keyword = ['gmc', 'toyota', 'cadillac']
bag_of_words = []
for i, description in enumerate(test3['description']):
    bag_of_words = re.findall(r"""[A-Za-z\-]+""", test3["description"][i])
    for word in bag_of_words:
        if word.lower() in keyword:
            test3.loc[i, 'manufacturer'] = word.lower()
Any idea how to solve this? Thanks.
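There is no need for a for loop. You can use pandas' vectorized functions instead.
- You can combine fillna() with .str.extract() from the pandas library. In essence, you replace the NaN values with information extracted from the description column.
- You can pass the flag flags=re.IGNORECASE to ignore case when matching.
- Finally, you also need to pass expand=False so that str.extract() returns a Series. By default it returns a DataFrame, and calling fillna() on a Series with a DataFrame as the fill value raises an error.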
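To make the last point concrete, here is a minimal sketch of what expand changes. The Series s and the names as_frame and as_series are made up for illustration; only the pattern comes from the question.

```python
import re
import pandas as pd

# Two hypothetical descriptions: one with a keyword, one without.
s = pd.Series(["toyota, gmc 10 years old.", "Motor old, in pieces."])
pattern = r"(gmc|toyota|cadillac)"

# Default (expand=True): a one-column DataFrame, even with a single group.
as_frame = s.str.extract(pattern, flags=re.IGNORECASE)

# expand=False with a single group: a Series, which fillna() accepts.
as_series = s.str.extract(pattern, flags=re.IGNORECASE, expand=False)

print(type(as_frame).__name__)   # DataFrame
print(type(as_series).__name__)  # Series
```

Rows with no match come back as NaN, so they leave the target column unchanged when passed to fillna().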
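import pandas
import re

keyword = ['gmc', 'toyota', 'cadillac']

# Fill only the missing manufacturers with the first keyword found in the description.
df['manufacturer'] = df['manufacturer'].fillna(
    df['description'].str.extract('(gmc|toyota|cadillac)', flags=re.IGNORECASE, expand=False))
# Fill only the missing cylinder values; the raw string keeps \d and \s intact.
df['cylinders'] = df['cylinders'].fillna(
    df['description'].str.extract(r'(\d+\s+cylinders?)', flags=re.IGNORECASE, expand=False))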
df
Out[1]:
  manufacturer  cylinders     description
0 toyota        5 cylinders   toyota, gmc 10 years old.
1 gmc           NaN           gmc, Motor runs and drives good.
2 NaN           4 cylinders   Motor old, in pieces. 4 cylinders
3 Cadillac      12 cylinders  2 owner 0 rust. Cadillac.
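For anyone who wants to reproduce this from scratch, the following is a self-contained sketch that rebuilds the sample data and derives the patterns from the keyword lists in the question instead of hard-coding them. The helper fill_from_description and the keywords dict are hypothetical names introduced here, not part of pandas. The loop runs once per column, not once per row, so the row-wise work stays vectorized.

```python
import re
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "manufacturer": ["toyota", np.nan, np.nan, np.nan],
    "cylinders": ["5 cylinders", np.nan, np.nan, "12 cylinders"],
    "description": [
        "toyota, gmc 10 years old.",
        "gmc, Motor runs and drives good.",
        "Motor old, in pieces. 4 cylinders",
        "2 owner 0 rust. Cadillac.",
    ],
})

# Hypothetical mapping: target column -> keywords to look for.
keywords = {
    "manufacturer": ["gmc", "toyota", "cadillac"],
    "cylinders": ["12 cylinders", "4 cylinders", "5 cylinders"],
}

def fill_from_description(frame, keywords, source="description"):
    """Return a copy where NaN cells are filled with the first keyword
    found in the source column; existing values are left untouched."""
    out = frame.copy()
    for column, words in keywords.items():
        # One capture group with the escaped keywords as alternatives.
        pattern = "(" + "|".join(re.escape(w) for w in words) + ")"
        found = out[source].str.extract(pattern, flags=re.IGNORECASE, expand=False)
        out[column] = out[column].fillna(found)
    return out

result = fill_from_description(df, keywords)
print(result)
```

Row 0 keeps "toyota" even though "gmc" also appears in its description, because fillna() only touches missing values.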
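If you need lowercase output, you can append .str.lower() or .str.casefold() to the end of each column's statement above. casefold() works like lower() but handles symbols and other languages more reliably.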
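As a rough sketch of that lowercase variant, using a cut-down two-row frame made up for illustration, and a small example of where casefold() and lower() differ:

```python
import re
import pandas as pd

# Hypothetical two-row sample: one existing value, one missing.
df = pd.DataFrame({
    "manufacturer": ["toyota", None],
    "description": ["toyota, gmc 10 years old.", "2 owner 0 rust. Cadillac."],
})

extracted = df["description"].str.extract(
    r"(gmc|toyota|cadillac)", flags=re.IGNORECASE, expand=False
)
# .str.lower() runs after fillna(), so both old and new values are normalized.
df["manufacturer"] = df["manufacturer"].fillna(extracted).str.lower()

print(df["manufacturer"].tolist())  # ['toyota', 'cadillac']

# casefold() is more aggressive than lower() for some non-ASCII text.
print("Straße".lower())     # straße
print("Straße".casefold())  # strasse
```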