Download and include referenced URL in XML
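I have an RSS feed from a news source. Alongside the news text and other metadata, the feed also contains a URL reference to each article's comments section, which is also available in RSS format. I want to download and include the contents of the comments section for each news article. My goal is to create an RSS feed that contains the articles together with the comments for each article, and then convert this new RSS to PDF in calibre.
Here is an example XML: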
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<entry>
<author>
<name>Some Author</name>
<uri>http://thenews.com</uri>
</author>
<category term="sports" label="Sports" />
<content type="html">This is the news text.</content>
<id>123abc</id>
<link href="http://thenews.com/article/123abc/comments" />
<updated>2016-04-29T13:44:00+00:00</updated>
<title>The Title</title>
</entry>
<entry>
<author>
<name>Some other Author</name>
<uri>http://thenews.com</uri>
</author>
<category term="sports" label="Sports" />
<content type="html">This is another news text.</content>
<id>123abd</id>
<link href="http://thenews.com/article/123abd/comments" />
<updated>2016-04-29T14:46:00+00:00</updated>
<title>The other Title</title>
</entry>
</feed>
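Now I want to replace the <link> tag with the contents of the URL. The RSS feed can be obtained by appending /rss to the URL. So in the end, a single entry would look like this: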
<entry>
<author>
<name>Some Author</name>
<uri>http://thenews.com</uri>
</author>
<category term="sports" label="Sports" />
<content type="html">This is the news text.</content>
<id>123abc</id>
<comments>
<comment>
<author>A commenter</author>
<timestamp>2016-04-29T16:00:00+00:00</timestamp>
<text>Cool story, yo!</text>
</comment>
<comment>
<author>Another commenter</author>
<timestamp>2016-04-29T16:01:00+00:00</timestamp>
<text>This is interesting news.</text>
</comment>
</comments>
<updated>2016-04-29T13:44:00+00:00</updated>
<title>The Title</title>
</entry>
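I'm open to any programming language. I tried this with Python and lxml but didn't get far. I was able to extract the comments URL and download the comments feed, but I couldn't replace the actual <link> tag.
Without downloading the actual RSS, here is how far I got: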
import lxml.etree as et
import urllib2
import re
# These will be downloaded from the RSS feed source when the code works
xmltext = """[The above news feed, too long to paste]"""
commentsRSS = """[The above comments feed]"""
hdr = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
article = et.fromstring(xmltext)
for elem in article.xpath('//feed/entry'):
    commentsURL = elem.xpath('link/@href')
    #request = urllib2.Request(commentsURL[0] + '.rss', headers=hdr)
    #comments = urllib2.urlopen(request).read()
    comments = commentsRSS
    # Now the <link>-tag should be replaced by the comments feed without the <?xml ...> tag
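For each <link> element, download the XML from its href attribute and parse it into a new Element. Then replace the <link> with the corresponding new Element. Note that the feed declares the default Atom namespace, so the XPath expression has to address the elements through a prefix mapped to that namespace, like this: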
....
article = et.fromstring(xmltext)
ns = {'d': 'http://www.w3.org/2005/Atom'}
for elem in article.xpath('//d:feed/d:entry/d:link', namespaces=ns):
    request = urllib2.Request(elem.attrib['href'] + '.rss', headers=hdr)
    comments = urllib2.urlopen(request).read()
    newElem = et.fromstring(comments)
    elem.getparent().replace(elem, newElem)

# print the result
print et.tostring(article)
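For readers on Python 3, where urllib2 no longer exists, the same idea can be sketched roughly as follows. This is only an illustration under a few assumptions: it swaps lxml for the standard library's xml.etree.ElementTree (which has no getparent(), so the loop walks each entry and replaces the link by position), it uses urllib.request for the download, and the names fetch_comments, embed_comments, SAMPLE_FEED, SAMPLE_COMMENTS and fake_fetch are made up for the example. The '.rss' suffix simply mirrors the snippet above and may need adjusting for the real site.

```python
import urllib.request
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
NS = {"d": ATOM_NS}  # prefix "d" is arbitrary; it only has to map to the Atom namespace


def fetch_comments(url):
    """Download one comments feed (hypothetical helper; '.rss' suffix is an assumption)."""
    request = urllib.request.Request(url + ".rss", headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request) as response:
        return response.read()


def embed_comments(feed_xml, fetch=fetch_comments):
    """Replace every <link> inside an Atom <entry> with the parsed XML that fetch() returns."""
    root = ET.fromstring(feed_xml)
    for entry in root.findall("d:entry", NS):
        # findall returns a list, so the entry can be modified safely inside the loop
        for link in entry.findall("d:link", NS):
            new_elem = ET.fromstring(fetch(link.get("href")))
            new_elem.tail = link.tail  # keep the whitespace that followed <link>
            position = list(entry).index(link)
            entry.remove(link)
            entry.insert(position, new_elem)
    return root


# Offline demonstration with made-up data, so nothing is downloaded.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>123abc</id>
    <link href="http://thenews.com/article/123abc/comments" />
    <title>The Title</title>
  </entry>
</feed>"""

SAMPLE_COMMENTS = (
    b"<comments><comment><author>A commenter</author>"
    b"<text>Cool story, yo!</text></comment></comments>"
)


def fake_fetch(url):
    """Stand-in for fetch_comments that returns canned comment XML."""
    return SAMPLE_COMMENTS


result = embed_comments(SAMPLE_FEED, fetch=fake_fetch)
print(ET.tostring(result, encoding="unicode"))
```

Passing the download function in as a parameter keeps the replacement logic testable without network access; calling embed_comments(xmltext) with the default fetch would perform the real requests.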