Pyparsing for Paragraphs
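I'm running into a small pyparsing problem that I can't seem to solve. I'd like to write a rule that parses a multiline paragraph for me. The end goal is a recursive grammar that parses something like: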
Heading: awesome
    This is a paragraph and then
    a line break is inserted
    then we have more text

    but this is also a different line
    with more lines attached

    Other: cool
        This is another indented block
        possibly with more paragraphs

        This is another way to keep this up
        and write more things

    But then we can keep writing at the old level
    and get this
into something like HTML (of course, once I have a parse tree, I can transform it into whatever format I like):
<Heading class="awesome">
    <p> This is a paragraph and then a line break is inserted and then we have more text </p>
    <p> but this is also a different line with more lines attached</p>
    <Other class="cool">
        <p> This is another indented block possibly with more paragraphs</p>
        <p> This is another way to keep this up and write more things</p>
    </Other>
    <p> But then we can keep writing at the old level and get this</p>
</Heading>
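Progress
I have managed to get to the point where I can parse the heading row and an indented block using pyparsing. But I can't:
- Define a paragraph as multiple lines that should be joined
- Allow a paragraph to be indented
An Example
Following on from here, I can output the paragraphs as a single line, but there doesn't seem to be a way to turn this into a parse tree without removing the line break characters.
I believe a paragraph should be: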
words = ## I've defined words to allow a set of characters I need
lines = OneOrMore(words)
paragraph = OneOrMore(lines) + lineEnd
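But this doesn't seem to work for me. Any ideas would be great :)
So I managed to solve this, for anyone who stumbles upon it in the future. You can define a paragraph like this, although it is certainly not ideal and does not exactly match the grammar I described. The relevant code is: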
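from pyparsing import CharsNotIn, OneOrMore, Suppress, lineEnd

# A line is a run of non-newline characters followed by a suppressed line end.
line = OneOrMore(CharsNotIn('\n')) + Suppress(lineEnd)
# Negative lookahead: succeeds only where another line cannot be matched.
emptyline = ~line
paragraph = OneOrMore(line) + emptyline
paragraph.setParseAction(join_lines)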
where join_lines is defined as:
def join_lines(tokens):
    # Strip surrounding whitespace from each line, then join with single spaces.
    stripped = [t.strip() for t in tokens]
    joined = " ".join(stripped)
    return joined
If that fits your needs, it should point you in the right direction :) Hope that helps!
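To see the definitions above working end to end, here is a minimal sketch. The `document` wrapper, the `sample` text, and the call to `leaveWhitespace()` are my own assumptions and not part of the original answer; they are only there so that several paragraphs separated by truly empty lines can be parsed in one go.

```python
from pyparsing import CharsNotIn, OneOrMore, Suppress, ZeroOrMore, lineEnd


def join_lines(tokens):
    """Strip each matched line and join the lines with single spaces."""
    return " ".join(t.strip() for t in tokens)


# Definitions from the answer above.
line = OneOrMore(CharsNotIn("\n")) + Suppress(lineEnd)
emptyline = ~line
paragraph = OneOrMore(line) + emptyline
paragraph.setParseAction(join_lines)

# Assumption (not in the original): a document is a run of paragraphs,
# each followed by any number of truly empty lines. leaveWhitespace()
# turns off whitespace skipping so that newlines stay significant.
document = OneOrMore(
    paragraph + ZeroOrMore(Suppress(lineEnd))
).leaveWhitespace()

sample = (
    "This is a paragraph and then\n"
    "a line break is inserted\n"
    "\n"
    "but this is also a different line\n"
    "with more lines attached\n"
)

paragraphs = document.parseString(sample).asList()
for p in paragraphs:
    print(p)
```

Each paragraph comes back as one joined string, so the line breaks inside a paragraph are gone from the result while the paragraph boundaries survive as separate tokens.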
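Better Empty Line
The definition of an empty line given above is definitely not ideal, and it can be improved considerably. The best way I've found is the following: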
from pyparsing import LineEnd, LineStart, Suppress, ZeroOrMore

empty_line = Suppress(LineStart() + ZeroOrMore(" ") + LineEnd())
empty_line.setWhitespaceChars("")
This allows you to pad empty lines with spaces without breaking the match.
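As an illustration of the same idea, here is one possible sketch of a paragraph grammar in which a blank line may be padded with spaces or tabs. It swaps in `Regex` elements instead of the `LineStart`/`LineEnd` combination above, so treat it as an alternative under my own assumptions rather than the answer's method; the names `text_line`, `blank_line`, and `document` are hypothetical.

```python
from pyparsing import OneOrMore, Regex, Suppress, ZeroOrMore


def join_lines(tokens):
    """Strip each matched line and join the lines with single spaces."""
    return " ".join(t.strip() for t in tokens)


# A text line must contain at least one non-whitespace character.
text_line = Regex(r"[ \t]*\S[^\n]*\n?")
# A blank line may be padded with spaces or tabs before its newline.
blank_line = Suppress(Regex(r"[ \t]*\n"))

paragraph = OneOrMore(text_line).setParseAction(join_lines)

# leaveWhitespace() turns off whitespace skipping throughout the grammar,
# so the regexes alone decide how spaces and newlines are consumed.
document = (
    ZeroOrMore(blank_line) + ZeroOrMore(paragraph + ZeroOrMore(blank_line))
).leaveWhitespace()

# The separator line below contains three spaces, not just a newline.
text = "first line\nsecond line\n   \nthird line\n"
padded_result = document.parseString(text).asList()
print(padded_result)
```

Because a text line is required to contain a non-whitespace character, a line made only of spaces can never be absorbed into a paragraph, which is what lets the padded separator still split the two paragraphs.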