String breaks on square bracket when parsed with lxml
I am new to parsing with lxml and am stuck on a simple parsing problem. My XML contains a line that looks like this:
The IgM BCR is essential for survival of peripheral B cells [<xref ref-type="bibr" rid="CR34">34</xref>]. In the absence of BTK B cell...
When I run the code below:
import lxml.html
from lxml import etree

e = open('somexml.xml', encoding='utf8')
tree = etree.parse(e)
titles = tree.xpath('/pmc-articleset/article/front/article-meta/title-group/article-title')
for node in titles:
    text = tree.xpath('/pmc-articleset/article/body/sec/p')
    for node in text:
        content = str(node.text).encode("utf-8")
        s = str(' '.join(lxml.html.fromstring(content).xpath("//text()")).encode('latin1'))
        print (s)
the result is:
The IgM BCR is essential for survival of peripheral B cells ['
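Even if I print just node.text, without any "join" call, the result looks much the same.
How can I get past the square bracket and receive the full string? Any help would be greatly appreciated!
Try the following:
from lxml import etree

e = open('somexml.xml', encoding='utf8')
tree = etree.parse(e)
titles = tree.xpath('/pmc-articleset/article/front/article-meta/title-group/article-title')
for title in titles:
    # The XPath is absolute, so it selects every body paragraph in the document
    ps = title.xpath('/pmc-articleset/article/body/sec/p')
    for p in ps:
        # itertext() yields p.text plus the text and tail of every descendant
        text = ''.join(p.itertext())
        print(text)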
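"]. In the absence of BTK B cell..." is the value of the tail attribute of the <xref> element. In lxml, text that follows an element's closing tag is stored on that element rather than on its parent, so the text attribute of the <p> element ends where its first child begins. See http://infohost.nmt.edu/tcc/help/pubs/pylxml/web/etree-view.html.
There is nothing special about the square brackets; they are just characters.
With itertext() you can get the text content of an element and all of its descendants. Tail content is included by default. See http://lxml.de/api/lxml.etree._Element-class.html#itertext.
Small demo: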
from lxml import etree
xml = "<p>TEXT <xref>34</xref>TAIL</p>"
p = etree.fromstring(xml)
print(p.text)
print(''.join(p.itertext()))
print(p.text + p.find("xref").tail)
Output:
TEXT
TEXT 34TAIL
TEXT TAIL
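The same idea applied to the sentence from the question, as a minimal sketch. The helper name text_without_children is hypothetical (it is not part of lxml), and the fallback import of the standard library's xml.etree.ElementTree is an assumption made only so the snippet runs where lxml is not installed; both libraries expose the same text/tail model used here.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which shares .text/.tail/itertext()
    import xml.etree.ElementTree as etree

xml = (
    '<p>The IgM BCR is essential for survival of peripheral B cells '
    '[<xref ref-type="bibr" rid="CR34">34</xref>]. In the absence of BTK B cell...</p>'
)
p = etree.fromstring(xml)


def text_without_children(elem):
    """Hypothetical helper: join elem.text with the tail of each direct child,
    skipping the children's own text (here, the citation number)."""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(child.tail or "")
    return "".join(parts)


head = p.text                   # stops at the first child element
tail = p.find("xref").tail      # text after </xref> belongs to <xref>
full = "".join(p.itertext())    # everything, including the "34"
no_refs = text_without_children(p)

print(head)
print(tail)
print(full)
print(no_refs)
```

If the citation markers are wanted in the output, ''.join(p.itertext()) is enough; the helper is only useful when the text inside child elements should be left out.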