Import Txt file with two mixed columns
I want to import a txt file like the following:
0 @switchfoot http://twitpic.com/2y1zl - Awww that's a bummer. You shoulda got David Carr of Third Day to do it. ;D
0 is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!
0 @Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds
4 my whole body feels itchy and like its on fire
4 @nationwideclass no it's not behaving at all. i'm mad. why am i here? because I can't see you all over there.
0 @Kwesidei not the whole crew
The desired return is a numpy.array with two columns: sentiment='0' or '4', and tw='string'. But it keeps giving me an error. Could someone please help?
Train_tw=np.genfromtxt("classified_tweets0.txt",dtype=(int,str),names=['sentiment','tw'])
The error from this expression is
ValueError: mismatch in size of old and new data-descriptor
If I use dtype=None, I get
ValueError: Some errors were detected !
Line #2 (got 22 columns instead of 20)
Line #3 (got 19 columns instead of 20)
Line #4 (got 11 columns instead of 20)
Line #5 (got 22 columns instead of 20)
Line #6 (got 6 columns instead of 20)
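With the default 'white space' delimiter, genfromtxt splits each line into 20, 22, etc. fields. The spaces inside the tweet text count as delimiters just like the first one.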
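A quick way to see this without genfromtxt is to count whitespace-separated tokens per line. This is only a rough sketch: the two strings are copied from the sample data in the question, and plain str.split() is used as an approximation of the default whitespace splitting.

```python
# Two lines copied from the sample data in the question.
lines = [
    "0 @switchfoot http://twitpic.com/2y1zl - Awww that's a bummer. "
    "You shoulda got David Carr of Third Day to do it. ;D",
    "0 @Kwesidei not the whole crew",
]

# str.split() with no argument splits on any run of whitespace,
# so every word of the tweet becomes its own field.
counts = [len(line.split()) for line in lines]
print(counts)
```

The first line yields 20 tokens and the last one 6, which matches the "got 6 columns instead of 20" message in the error.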
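One option is to edit the file and replace the first space with some unique delimiter. Another option is to use the field-width version of the delimiter. After some experimenting, this load looks reasonable (this is Py3, so I am using the Unicode string dtype).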
In [32]: np.genfromtxt("stack42754603.txt",dtype='int,U100',delimiter=[2,100],names=['sentiment','tw'])
Out[32]:
array([ (0, "@switchfoot http://twitpic.com/2y1zl - Awww that's a bummer. You shoulda got David Carr of Third D"),
(0, "is upset that he can't update his Facebook by texting it... and might cry as a result School today "),
(0, '@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds\n'),
(4, 'my whole body feels itchy and like its on fire\n'),
(4, "@nationwideclass no it's not behaving at all. i'm mad. why am i here? because I can't see you all o"),
(0, '@Kwesidei not the whole crew')],
dtype=[('sentiment', '<i4'), ('tw', '<U100')])
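Note that the fixed-width load cuts each tweet at 100 characters (the first row ends at "Third D"). A variation on the first option, sketched here, avoids editing the file: split each line on its first whitespace only and build the structured array by hand. The in-memory `sample` stands in for the real file, and the names `rows`, `width`, and `train_tw` are illustrative, not from the original post.

```python
import io
import numpy as np

# Stand-in for open("classified_tweets0.txt"); two lines from the question.
sample = io.StringIO(
    "0 @Kwesidei not the whole crew\n"
    "4 my whole body feels itchy and like its on fire\n"
)

rows = []
for line in sample:
    line = line.rstrip("\n")
    if not line.strip():
        continue  # skip blank lines
    # maxsplit=1: only the first run of whitespace separates the fields
    label, text = line.split(None, 1)
    rows.append((int(label), text))

# Size the string field to the longest tweet so nothing is truncated.
width = max(len(text) for _, text in rows)
train_tw = np.array(rows, dtype=[("sentiment", "i4"), ("tw", f"U{width}")])

print(train_tw["sentiment"])
print(train_tw["tw"][1])
```

With a real file, replacing `sample` with an opened file object should behave the same way, assuming every non-blank line starts with the label followed by a space.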