Remove all non-letter characters and separate into new columns
My DataFrame has a column named "amenities". This is what one record looks like:
print(df["amenities"][0])
{"Wireless Internet",Kitchen,"Free parking on premises",Breakfast,Heating,Washer,Dryer,"Smoke detector","Carbon monoxide detector",Essentials,Shampoo,Hangers,"Hair dryer","Laptop friendly workspace","translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"}
I want to remove the special characters and then split the values so that each amenity gets its own column:
Room  Amenity1           Amenity2  Amenity3      Amenity4
1     Wireless Internet  Kitchen   Free Parking  Breakfast
This is what I did:
import re
df['amenities'] = df['amenities'].map(lambda x: re.sub(r'\W+', ' ', x))
Wireless Internet Air conditioning Pool Kitchen Free parking on premises Gym Hot tub Indoor fireplace Heating Family kid friendly Suitable for events Washer Dryer Smoke detector Carbon monoxide detector First aid kit Fire extinguisher Essentials Shampoo Lock on bedroom door 24 hour check in Hangers Hair dryer Iron Laptop friendly workspace
This cleans up the string, but now I don't know how to split it into separate columns, because "Wireless Internet" should be one column, not two.
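In general, prefer a list comprehension over map with a lambda. Comprehensions are more readable and usually accomplish the same thing. You could do it like this: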
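import re
sc_sub = re.compile(r'\W+')  # compile once; the raw string keeps \W from being read as a string escape
# Replace each run of non-word characters with one space, as in the question, so words do not run together
df['amenities'] = [sc_sub.sub(' ', amenity).strip() for amenity in df['amenities']]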
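Note that stripping every non-word character also throws away the commas and quotes that mark where one amenity ends and the next begins, which is why "Wireless Internet" can no longer be told apart from two separate items. One possible alternative, offered as a minimal sketch rather than the definitive approach, is to parse the original string before cleaning it. The sketch assumes each value has the `{"a b",c,...}` shape shown in the question; the helper name `split_amenities`, the sample data, and the `Amenity1..N` column names are illustrative assumptions, not part of the original post.

```python
import csv

import pandas as pd


def split_amenities(raw):
    """Split one '{"a b",c,...}' string into a list of amenity names."""
    inner = raw.strip().strip("{}")
    if not inner:
        return []
    # csv.reader honours double quotes, so spaces and commas inside a
    # quoted amenity stay within a single field.
    return next(csv.reader([inner]))


# Small stand-in for the real data (assumed shape, taken from the question).
df = pd.DataFrame({
    "amenities": [
        '{"Wireless Internet",Kitchen,"Free parking on premises",Breakfast}',
        '{Heating,"Hair dryer"}',
    ]
})

# One row per record, one column per amenity; shorter rows are padded
# with missing values.
wide = pd.DataFrame(
    [split_amenities(s) for s in df["amenities"]],
    index=df.index,
)
wide.columns = [f"Amenity{i + 1}" for i in range(wide.shape[1])]

result = df.join(wide)
print(result)
```

Rows with fewer amenities than the longest row end up with missing values in the trailing columns. Placeholder entries such as "translation missing: en.hosting_amenity_49" would come through as ordinary amenities, so you may want to filter them out inside the helper.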