How to get one-hot encoding of specific words in a text in Pandas?
Suppose I have a DataFrame and a list of words, i.e.

import pandas as pd

toxic = ['bad','horrible','disguisting']
df = pd.DataFrame({'text':['You look horrible','You are good','you are bad and disguisting']})
main = pd.concat([df, pd.DataFrame(columns=toxic)]).fillna(0)
samp = main['text'].str.split().apply(lambda x: [i for i in toxic if i in x])
for i, j in enumerate(samp):
    for k in j:
        main.loc[i, k] = 1
This results in:
bad disguisting horrible text
0 0 0 1 You look horrible
1 0 0 0 You are good
2 1 1 0 you are bad and disguisting
This is a bit faster than get_dummies, but Python-level for loops over a pandas DataFrame do not scale well when the data is huge.
I tried str.get_dummies, but it one-hot encodes every word in the Series, which makes it somewhat slow.
pd.concat([df, main['text'].str.get_dummies(' ')[toxic]], axis=1)
text bad horrible disguisting
0 You look horrible 0 1 0
1 You are good 0 0 0
2 you are bad and disguisting 1 0 1
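One pitfall with `str.get_dummies(' ')[toxic]`: in recent pandas versions, selecting a vocabulary column that never occurs in the text raises a KeyError. A hedged sketch of a safer variant using reindex (with hypothetical data where two toxic words are absent):

```python
import pandas as pd

toxic = ['bad', 'horrible', 'disguisting']
# 'bad' and 'disguisting' never appear below, so [toxic] would raise KeyError
df = pd.DataFrame({'text': ['You look horrible', 'You are good']})

# reindex fills missing vocabulary columns with 0 instead of raising
dummies = df['text'].str.get_dummies(' ').reindex(columns=toxic, fill_value=0)
print(dummies)
```

This keeps the column order identical to the toxic list regardless of which words actually occur.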
If I try the same approach with sklearn:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(toxic)
main['text'].str.split().apply(le.transform)
This raises ValueError: y contains new labels. Is there a way to ignore the error in sklearn?
How do I improve the speed of this implementation? Is there any other fast approach?
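For what it's worth, one way to sidestep the ValueError is to filter each token list down to labels the encoder has already seen before calling transform. A sketch, assuming we only care about the toxic vocabulary:

```python
import pandas as pd
from sklearn import preprocessing

toxic = ['bad', 'horrible', 'disguisting']
df = pd.DataFrame({'text': ['You look horrible', 'You are good',
                            'you are bad and disguisting']})

le = preprocessing.LabelEncoder()
le.fit(toxic)
known = set(le.classes_)  # classes_ is sorted: bad=0, disguisting=1, horrible=2

# Drop unseen tokens before transform, so "y contains new labels" never fires
codes = df['text'].str.split().apply(
    lambda tokens: le.transform([t for t in tokens if t in known]))
print(codes)
```

Note this yields ragged arrays of integer codes per row, not a one-hot matrix, so it answers the "ignore the error" part rather than the encoding itself.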
Use sklearn.feature_extraction.text.CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(vocabulary=toxic)
r = pd.SparseDataFrame(cv.fit_transform(df['text']),
                       df.index,
                       cv.get_feature_names(),
                       default_fill_value=0)
Result:
In [127]: r
Out[127]:
bad horrible disguisting
0 0 1 0
1 0 0 0
2 1 0 1
In [128]: type(r)
Out[128]: pandas.core.sparse.frame.SparseDataFrame
In [129]: r.info()
<class 'pandas.core.sparse.frame.SparseDataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
bad 3 non-null int64
horrible 3 non-null int64
disguisting 3 non-null int64
dtypes: int64(3)
memory usage: 104.0 bytes
In [130]: r.memory_usage()
Out[130]:
Index 80
bad 8 # <--- NOTE: it's using 8 bytes (1x int64) instead of 24 bytes for three values (3x8)
horrible 8
disguisting 8
dtype: int64
Merging the SparseDataFrame with the original DataFrame:
In [137]: r2 = df.join(r)
In [138]: r2
Out[138]:
text bad horrible disguisting
0 You look horrible 0 1 0
1 You are good 0 0 0
2 you are bad and disguisting 1 0 1
In [139]: r2.memory_usage()
Out[139]:
Index 80
text 24
bad 8
horrible 8
disguisting 8
dtype: int64
In [140]: type(r2)
Out[140]: pandas.core.frame.DataFrame
In [141]: type(r2['horrible'])
Out[141]: pandas.core.sparse.series.SparseSeries
In [142]: type(r2['text'])
Out[142]: pandas.core.series.Series
PS In older Pandas versions, sparse columns lost their sparseness (became dense) after joining a SparseDataFrame with a regular DataFrame; now we can mix regular Series (columns) and SparseSeries - a very nice feature!
The accepted answer is deprecated; see the release notes:
SparseSeries and SparseDataFrame were removed in pandas 1.0.0. This migration guide is present to aid in migrating from previous versions.
Pandas 1.0.5 solution:

r = pd.DataFrame.sparse.from_spmatrix(cv.fit_transform(df['text']),
                                      df.index,
                                      cv.get_feature_names())
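Putting the 1.0.5 answer together as a runnable end-to-end sketch (note: newer scikit-learn renamed get_feature_names to get_feature_names_out, so the snippet falls back accordingly):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

toxic = ['bad', 'horrible', 'disguisting']
df = pd.DataFrame({'text': ['You look horrible', 'You are good',
                            'you are bad and disguisting']})

# Restricting the vocabulary keeps the output to just the toxic columns
cv = CountVectorizer(vocabulary=toxic)
m = cv.fit_transform(df['text'])

# Handle both old and new scikit-learn APIs
names = (cv.get_feature_names_out() if hasattr(cv, 'get_feature_names_out')
         else cv.get_feature_names())

# Sparse columns survive the join with the dense 'text' column
r = pd.DataFrame.sparse.from_spmatrix(m, df.index, names)
out = df.join(r)
print(out)
```

The columns come back in vocabulary order (bad, horrible, disguisting), and the encoded columns stay sparse after the join.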