ValueError: Length of values does not match length of index in nested loop
I am trying to remove the stopwords from every row of a column. The column holds one list per row because I already word_tokenized it with nltk, so each row is now a list of tokens. I tried to remove the stopwords with this nested list comprehension, but it raises ValueError: Length of values does not match length of index. How do I fix this?
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
data = pd.read_csv(r"D:/python projects/read_files/spam.csv",
encoding = "latin-1")
data = data[['v1','v2']]
data = data.rename(columns = {'v1': 'label', 'v2': 'text'})
stopwords = set(stopwords.words('english'))
data['text'] = data['text'].str.lower()
data['new'] = [word_tokenize(row) for row in data['text']]
data['new'] = [word for new in data['new'] for word in new if word not in stopwords]
My text data:
data['text'].head(5)
Out[92]:
0 go until jurong point, crazy.. available only ...
1 ok lar... joking wif u oni...
2 free entry in 2 a wkly comp to win fa cup fina...
3 u dun say so early hor... u c already then say...
4 nah i don't think he goes to usf, he lives aro...
Name: text, dtype: object
After I word_tokenized it with nltk:
data['new'].head(5)
Out[89]:
0 [go, until, jurong, point, ,, crazy.., availab...
1 [ok, lar, ..., joking, wif, u, oni, ...]
2 [free, entry, in, 2, a, wkly, comp, to, win, f...
3 [u, dun, say, so, early, hor, ..., u, c, alrea...
4 [nah, i, do, n't, think, he, goes, to, usf, ,,...
Name: new, dtype: object
Traceback:
runfile('D:/python projects/NLP_nltk_first.py', wdir='D:/python projects')
Traceback (most recent call last):
File "D:\python projects\NLP_nltk_first.py", line 36, in <module>
data['new'] = [new for new in data['new'] for word in new if word not in stopwords]
File "C:\Users\Ramadhina\Anaconda3\lib\site-packages\pandas\core\frame.py", line 3487, in __setitem__
self._set_item(key, value)
File "C:\Users\Ramadhina\Anaconda3\lib\site-packages\pandas\core\frame.py", line 3564, in _set_item
value = self._sanitize_column(key, value)
File "C:\Users\Ramadhina\Anaconda3\lib\site-packages\pandas\core\frame.py", line 3749, in _sanitize_column
value = sanitize_index(value, self.index, copy=False)
File "C:\Users\Ramadhina\Anaconda3\lib\site-packages\pandas\core\internals\construction.py", line 612, in sanitize_index
raise ValueError("Length of values does not match length of index")
ValueError: Length of values does not match length of index
Read the error message carefully:
ValueError: Length of values does not match length of index
The "values" in this case are whatever sits on the right-hand side of the =:
values = [word for new in data['new'] for word in new if word not in stopwords]
The "index" in this case is the DataFrame's row index:
index = data.index
The index always has the same number of rows as the DataFrame itself.
The problem is that the values are too long for the index, i.e. there are more values than rows in the DataFrame: the nested comprehension flattens every token of every row into one long list. If you inspect your code this should be obvious. If you still don't see the problem, try the following:
data['text_tokenized'] = [word_tokenize(row) for row in data['text']]
values = [word for new in data['text_tokenized'] for word in new if word not in stopwords]
print('N rows:', data.shape[0])
print('N new values:', len(values))
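To see the mismatch in miniature, here is a hypothetical toy frame (two rows, not your data) that raises the same error:
import pandas as pd
toy = pd.DataFrame({'new': [['a', 'b'], ['c']]})  # 2 rows
flat = [w for row in toy['new'] for w in row]     # ['a', 'b', 'c'], 3 values
toy['new'] = flat  # ValueError: 3 values vs. 2 index entries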
As for how to fix the problem, that depends entirely on what you are trying to achieve. One option is to "explode" the data (note also the use of .map instead of a list comprehension):
data['text_tokenized'] = data['text'].map(word_tokenize)
# Flatten the token lists without a nested list comprehension
tokens_flat = data['text_tokenized'].explode()
# Join your labels w/ your flattened tokens, if desired
data_flat = data[['label']].join(tokens_flat)
# Add a 2nd index level to track token appearance order,
# might make your life easier
data_flat['token_id'] = data_flat.groupby(level=0).cumcount()
data_flat = data_flat.set_index('token_id', append=True)
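If your end goal is simply the stopword removal, here is a minimal sketch building on the frames above. It assumes the stop-word set from your code, renamed to stop_words so it no longer shadows the imported nltk.corpus.stopwords module:
# Assumption: stop_words replaces the question's `stopwords` variable
stop_words = set(stopwords.words('english'))
# Filter the exploded tokens...
data_flat = data_flat[~data_flat['text_tokenized'].isin(stop_words)]
# ...or keep one filtered token list per row, so the result has
# the same length as the DataFrame index and can be assigned back:
data['new'] = data['text_tokenized'].map(
    lambda tokens: [w for w in tokens if w not in stop_words])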
As an unrelated tip, you can make the CSV read more efficient by loading only the columns you need, like this:
data = pd.read_csv(r"D:/python projects/read_files/spam.csv",
encoding="latin-1",
usecols=["v1", "v2"])