Import Txt file with two mixed columns

Question

I want to import a txt file like the following:

0 @switchfoot http://twitpic.com/2y1zl - Awww  that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D
0 is upset that he can't update his Facebook by texting it... and might cry as a result  School today also. Blah!
0 @Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds
4 my whole body feels itchy and like its on fire 
4 @nationwideclass no  it's not behaving at all. i'm mad. why am i here? because I can't see you all over there. 
0 @Kwesidei not the whole crew

The desired result is a numpy.array with two columns: sentiment ('0' or '4') and tw (a string). But it keeps giving me an error. Can anyone help?

Train_tw=np.genfromtxt("classified_tweets0.txt",dtype=(int,str),names=['sentiment','tw'])

Answer 1

Your expression raises this error:

ValueError: mismatch in size of old and new data-descriptor

If I use dtype=None instead, I get

ValueError: Some errors were detected !
    Line #2 (got 22 columns instead of 20)
    Line #3 (got 19 columns instead of 20)
    Line #4 (got 11 columns instead of 20)
    Line #5 (got 22 columns instead of 20)
    Line #6 (got 6 columns instead of 20)

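With the default 'white space' delimiter, genfromtxt splits each line into 20, 22, etc. fields. The spaces inside the tweet text count as delimiters just like the first space after the label, so the rows end up with different numbers of columns.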
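To see where those column counts come from, here is a minimal sketch that splits the sample lines from the question on whitespace and counts the pieces. It uses plain str.split() as a stand-in for the default whitespace splitting; the variable names are only illustrative.

```python
# The six sample lines from the question, one string per line.
sample_lines = [
    "0 @switchfoot http://twitpic.com/2y1zl - Awww  that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D",
    "0 is upset that he can't update his Facebook by texting it... and might cry as a result  School today also. Blah!",
    "0 @Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds",
    "4 my whole body feels itchy and like its on fire ",
    "4 @nationwideclass no  it's not behaving at all. i'm mad. why am i here? because I can't see you all over there. ",
    "0 @Kwesidei not the whole crew",
]

# str.split() with no argument splits on any run of whitespace,
# so every word of the tweet becomes its own field.
field_counts = [len(line.split()) for line in sample_lines]
print(field_counts)  # [20, 22, 19, 11, 22, 6]
```

The first line sets the expected count (20), and every later line with a different count is reported in the error message.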

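One option is to edit the file and replace the first space with some unique delimiter. Another is to use the field-width form of delimiter, where delimiter=[2,100] reads 2 characters for the first field and 100 for the second. After some experimentation, this load looks reasonable (this is Py3, so I'm using the Unicode string dtype). Note that tweets longer than 100 characters are cut off at that width.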

In [32]: np.genfromtxt("stack42754603.txt",dtype='int,U100',delimiter=[2,100],names=['sentiment','tw'])
Out[32]: 
array([ (0, "@switchfoot http://twitpic.com/2y1zl - Awww  that's a bummer.  You shoulda got David Carr of Third D"),
       (0, "is upset that he can't update his Facebook by texting it... and might cry as a result  School today "),
       (0, '@Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds\n'),
       (4, 'my whole body feels itchy and like its on fire\n'),
       (4, "@nationwideclass no  it's not behaving at all. i'm mad. why am i here? because I can't see you all o"),
       (0, '@Kwesidei not the whole crew')], 
      dtype=[('sentiment', '<i4'), ('tw', '<U100')])
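If truncating at a fixed width is not acceptable, another approach is to skip genfromtxt and split each line on its first space yourself. The following is a minimal sketch under that assumption; parse_tweets is a hypothetical helper, not part of numpy, and the two lines in demo_lines stand in for the real file.

```python
import numpy as np

def parse_tweets(lines):
    """Split each line on its first space into (label, text).

    Hypothetical helper for illustration; `lines` can be any iterable
    of strings, such as an open file object.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        label, _, text = line.partition(" ")  # split on the first space only
        rows.append((int(label), text.strip()))
    # Size the string field to the longest tweet so nothing is truncated.
    width = max([len(text) for _, text in rows] + [1])
    return np.array(rows, dtype=[("sentiment", "i4"), ("tw", f"U{width}")])

# Two lines from the question, used in place of the file.
demo_lines = [
    "0 @Kwesidei not the whole crew\n",
    "4 my whole body feels itchy and like its on fire \n",
]
train_tw = parse_tweets(demo_lines)
print(train_tw["sentiment"])
print(train_tw["tw"])
```

With the real data, the same function could be called on an open file, for example `with open("classified_tweets0.txt", encoding="utf-8") as f: train_tw = parse_tweets(f)`. The file encoding here is an assumption.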
