How to use Stanford LexParser for Chinese text?

Question

I can't seem to get the input encoding right for Stanford NLP's LexParser.

How do I parse Chinese text with the Stanford LexParser?

I did the following to download the tool:

$ wget http://nlp.stanford.edu/software/stanford-parser-full-2015-04-20.zip
$ unzip stanford-parser-full-2015-04-20.zip 
$ cd stanford-parser-full-2015-04-20/

My input text is in UTF-8:

$ echo "应有尽有 的 丰富 选择 定 将 为 您 的 旅程 增添 无数 的 赏心 乐事 。" > input.txt

$ echo "应有尽有#VV 的#DEC 丰富#JJ 选择#NN 定#VV 将#AD 为#P 您#PN 的#DEG 旅程#NN 增添#VV 无数#CD 的#DEG 赏心#NN 乐事#NN  。#PUNCT" > pos-input.txt

According to README.txt, this is what the Chinese parser was trained on:

Chinese: There are Chinese grammars trained just on mainland material from Xinhua and more mixed material from the LDC Chinese Treebank. The default input encoding is GB18030.

So I first tried it with the UTF-8 file:

$ bash lexparser-lang.sh Chinese 80 edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz parsed input.txt
Loading parser from serialized file edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz ...  done [1.0 sec].
Parsing file: input.txt
Parsing [sent. 1 len. 16]: 应有尽有 的1�7 丰富 选择 宄1�7 射1�7 丄1�7 悄1�7 的1�7 旅程 增添 无数 的1�7 赏心 乐事 〄1�7
Parsed file: input.txt [1 sentences].
Parsed 16 words in 1 sentences (21.00 wds/sec; 1.31 sents/sec).

That didn't seem to work. The parser produced this file, input.txt.parsed.80.stp:

[out]:

$ cat input.txt.parsed.80.stp 
(FRAG (NR 应有尽有) (NR 的1�7) (NT 丰富) (NT 选择) (NN 宄1�7) (NN 射1�7) (NN 丄1�7) (NN 悄1�7) (NR 的1�7) (NT 旅程) (NT 增添) (NN 无数) (NN 的1�7) (NR 赏心) (NR 乐事) (VV 〄1�7))
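
One plausible reading of this output, given the README's note that the default input encoding is GB18030, is that the parser decoded the UTF-8 bytes as GB18030. The following is a minimal Python 3 sketch of that kind of mismatch. It is illustrative only: it does not call the parser, and the exact garbled characters a terminal shows may differ.

```python
# Illustrative only: simulate reading UTF-8 bytes with the wrong codec.
sentence = "应有尽有 的 丰富 选择"
utf8_bytes = sentence.encode("utf-8")

# Decoding UTF-8 bytes as GB18030 does not recover the original characters;
# errors="replace" substitutes U+FFFD for byte sequences GB18030 cannot decode.
garbled = utf8_bytes.decode("gb18030", errors="replace")

# Decoding with the codec that actually produced the bytes round-trips cleanly.
restored = utf8_bytes.decode("utf-8")

print(garbled)
print(restored)
```

The takeaway is that the bytes on disk are intact; only the codec chosen to interpret them is wrong.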

Then I tried encoding the sentence as GB18030:

$ bash lexparser-lang.sh Chinese 80 edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz parsed input-gb18030.txt
Loading parser from serialized file edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz ...  done [1.0 sec].
Parsing file: input-gb18030.txt
Parsing [sent. 1 len. 16]: Ӧ�о��� �� �ḻ ѡ�� �� �� Ϊ �� �� �ó� ���� ���� �� ���� ���� ��
Parsed file: input-gb18030.txt [1 sentences].
Parsed 16 words in 1 sentences (19.90 wds/sec; 1.24 sents/sec).
alvas@ubi:~/stanford-parser-full-2015-04-20$ cat input-gb18030.txt.parsed.80.stp 
(IP
  (NP
    (CP
      (IP
        (VP (VV Ӧ�о���)))
      (DEC ��))
    (ADJP (JJ �ḻ))
    (NP (NN ѡ��)))
  (VP (VV ��)
    (VP
      (ADVP (AD ��))
      (PP (P Ϊ)
        (NP
          (DNP
            (NP (PN ��))
            (DEG ��))
          (NP (NN �ó�))))
      (VP (VV ����)
        (NP
          (DNP
            (ADJP (JJ ����))
            (DEG ��))
          (NP (NN ����) (NN ����))))))
  (PU ��))

This seems to work, but how do I convert the file back to UTF-8?

I tried the following, but it printed only the first line:

$ cat input-gb18030.txt.parsed.80.stp | python -c "print raw_input().decode('GB18030').encode('utf8')"
(IP
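
The one-liner stops after `(IP` because `raw_input()` reads a single line. Below is a minimal Python 3 sketch that transcodes an entire stream instead. `transcode_stream` is a hypothetical helper name, and the small in-memory tree merely stands in for the parser's GB18030 output.

```python
import io


def transcode_stream(src, dst, src_enc="gb18030", dst_enc="utf-8"):
    """Read ALL bytes from src, decode them as src_enc, write them to dst as dst_enc."""
    data = src.read()  # the whole stream, not just the first line
    dst.write(data.decode(src_enc).encode(dst_enc))


# In-memory stand-in for the parser's GB18030-encoded output file.
tree = "(IP\n  (NP (NN 旅程))\n  (PU 。))\n"
src = io.BytesIO(tree.encode("gb18030"))
dst = io.BytesIO()

transcode_stream(src, dst)
converted = dst.getvalue()
print(converted.decode("utf-8"))
```

For a real pipe, the same function could presumably be called with `sys.stdin.buffer` and `sys.stdout.buffer`, which expose the raw bytes of the standard streams.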

To sum up, here are my questions:

How do I convert GB18030 to UTF-8, and UTF-8 to GB18030?
How do I use the Stanford LexParser on Chinese UTF-8 text?

Answer 1

I followed your steps, and it turns out you can simply use an encoding converter to get what you want.

I used iconv in my test.

iconv -f GB18030 -t UTF-8 input2.txt.parsed.80.stp -o output

Here is my output:

dmk@dmk-debian /t/stanford-parser-full-2015-04-20 ❯❯❯ cat input2.txt.parsed.80.stp
(IP
  (NP
    (CP
      (IP
        (VP (VV Ӧ�о���)))
      (DEC ��))
    (ADJP (JJ �ḻ))
    (NP (NN ѡ��)))
  (VP (VV ��)
    (VP
      (ADVP (AD ��))
      (PP (P Ϊ)
        (NP
          (DNP
            (NP (PN ��))
            (DEG ��))
          (NP (NN �ó�))))
      (VP (VV ����)
        (NP
          (DNP
            (ADJP (JJ ����))
            (DEG ��))
          (NP (NN ����) (NN ����))))))
  (PU ��))

dmk@dmk-debian /t/stanford-parser-full-2015-04-20 ❯❯❯ iconv -f GB18030 -t UTF-8 input2.txt.parsed.80.stp -o output
dmk@dmk-debian /t/stanford-parser-full-2015-04-20 ❯❯❯ cat output
(IP
  (NP
    (CP
      (IP
        (VP (VV 应有尽有)))
      (DEC 的))
    (ADJP (JJ 丰富))
    (NP (NN 选择)))
  (VP (VV 定)
    (VP
      (ADVP (AD 将))
      (PP (P 为)
        (NP
          (DNP
            (NP (PN 您))
            (DEG 的))
          (NP (NN 旅程))))
      (VP (VV 增添)
        (NP
          (DNP
            (ADJP (JJ 无数))
            (DEG 的))
          (NP (NN 赏心) (NN 乐事))))))
  (PU 。))
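
If iconv is not available, the same conversion can be sketched in Python 3 in both directions. This is a rough equivalent under stated assumptions, not a replacement for the command above: `convert_file` is a hypothetical helper, and the file names are created in a temporary directory purely for the demo.

```python
import os
import tempfile


def convert_file(src_path, dst_path, src_enc, dst_enc):
    """Copy a text file line by line, re-encoding it from src_enc to dst_enc."""
    # newline="" keeps the original line endings untouched on every platform.
    with open(src_path, "r", encoding=src_enc, newline="") as src, \
         open(dst_path, "w", encoding=dst_enc, newline="") as dst:
        for line in src:
            dst.write(line)


sentence = "应有尽有 的 丰富 选择 。\n"

with tempfile.TemporaryDirectory() as tmp:
    utf8_in = os.path.join(tmp, "input.txt")
    gb_in = os.path.join(tmp, "input-gb18030.txt")
    utf8_out = os.path.join(tmp, "roundtrip.txt")

    with open(utf8_in, "w", encoding="utf-8", newline="") as f:
        f.write(sentence)

    # UTF-8 -> GB18030 (what the parser expects by default, per the README).
    convert_file(utf8_in, gb_in, "utf-8", "gb18030")
    # GB18030 -> UTF-8 (what would be applied to the .stp output file).
    convert_file(gb_in, utf8_out, "gb18030", "utf-8")

    with open(gb_in, "rb") as f:
        gb_bytes = f.read()
    with open(utf8_out, encoding="utf-8") as f:
        roundtrip = f.read()

print(roundtrip)
```

Processing line by line keeps memory use flat for large parse files, at the cost of being slower than a single bulk read.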
