Download and include referenced URL in XML

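I have an RSS feed from a news source. Alongside the news text and other metadata, each entry in the feed contains a URL reference to its comments section, which is also available in RSS format. I want to download the comments for each news article and include them in the feed. My goal is to create a single RSS feed that contains every article together with its comments, and then convert this new feed to PDF in calibre.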

Here is an example XML:

<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <entry>
        <author>
            <name>Some Author</name>
            <uri>http://thenews.com</uri>
        </author>
        <category term="sports" label="Sports" />
        <content type="html">This is the news text.</content>
        <id>123abc</id>
        <link href="http://thenews.com/article/123abc/comments" />
        <updated>2016-04-29T13:44:00+00:00</updated>
        <title>The Title</title>
    </entry>
    <entry>
        <author>
            <name>Some other Author</name>
            <uri>http://thenews.com</uri>
        </author>
        <category term="sports" label="Sports" />
        <content type="html">This is another news text.</content>
        <id>123abd</id>
        <link href="http://thenews.com/article/123abd/comments" />
        <updated>2016-04-29T14:46:00+00:00</updated>
        <title>The other Title</title>
    </entry>
</feed>

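Now I want to replace each <link> tag with the contents of the URL it references. The comments RSS feed can be obtained by appending /rss to the end of that URL. In the end, a single entry should look like this: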

<entry>
  <author>
    <name>Some Author</name>
    <uri>http://thenews.com</uri>
  </author>
  <category term="sports" label="Sports" />
  <content type="html">This is the news text.</content>
  <id>123abc</id>
  <comments>
    <comment>    
      <author>A commenter</author>
      <timestamp>2016-04-29T16:00:00+00:00</timestamp>
      <text>Cool story, yo!</text>
    </comment>
    <comment>
      <author>Another commenter</author>
      <timestamp>2016-04-29T16:01:00+00:00</timestamp>
      <text>This is interesting news.</text>
    </comment>
  </comments>
  <updated>2016-04-29T13:44:00+00:00</updated>
  <title>The Title</title>
</entry>

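I'm open to any programming language. I tried it with Python and lxml, but didn't get far. I was able to extract the comments URL and download the comments feed, but I couldn't replace the actual <link> tag. Leaving out the download of the actual RSS, this is how far I got: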

import lxml.etree as et
import urllib2
import re

# These will be downloaded from the RSS feed source when the code works
xmltext = """[The above news feed, too long to paste]"""
commentsRSS = """[The above comments feed]"""

hdr = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}

article = et.fromstring(xmltext)

for elem in article.xpath('//feed/entry'):
    commentsURL = elem.xpath('link/@href')

    #request  = urllib2.Request(commentsURL[0] + '.rss', headers=hdr)
    #comments = urllib2.urlopen(request).read()
    comments = commentsRSS

    # Now the <link>-tag should be replaced by the comments feed without the <?xml ...> tag

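For each <link> element, download the XML from its href attribute and parse that XML into a new Element. Then replace the <link> with the corresponding new Element. Note that the feed declares the Atom default namespace (http://www.w3.org/2005/Atom), so the XPath expression has to use a prefix bound to that namespace; an unprefixed path such as //feed/entry matches nothing. Something like this: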

....
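article = et.fromstring(xmltext)
# The feed uses the Atom default namespace, so bind it to a prefix for XPath
ns = {'d': 'http://www.w3.org/2005/Atom'}
for elem in article.xpath('//d:feed/d:entry/d:link', namespaces=ns):
    # Download the comments feed referenced by this <link>
    request  = urllib2.Request(elem.attrib['href'] + '.rss', headers=hdr)
    comments = urllib2.urlopen(request).read()
    # Parse the downloaded XML and swap it in for the <link> element
    newElem = et.fromstring(comments)
    elem.getparent().replace(elem, newElem)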

# print the result
print et.tostring(article)
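
If you want to try the replacement step without hitting the network, here is a minimal Python 3 sketch of the same idea. It is not the lxml code above: it swaps in the standard library's xml.etree.ElementTree, which has no getparent(), so it walks each entry's children instead. The names embed_comments, fake_fetch, http_fetch, and requested are hypothetical helpers made up for this illustration, the sample data is trimmed from the question, and the suffix is a parameter because the question text says /rss while the code appends .rss.

```python
import urllib.request
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
NS = {"d": ATOM_NS}  # prefix "d" is arbitrary; only the URI matters

# Trimmed sample data based on the question (bytes, like urlopen().read()).
FEED = b"""<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <entry>
        <id>123abc</id>
        <link href="http://thenews.com/article/123abc/comments" />
        <title>The Title</title>
    </entry>
</feed>"""

COMMENTS = b"""<?xml version="1.0" encoding="UTF-8"?>
<comments>
    <comment>
        <author>A commenter</author>
        <text>Cool story, yo!</text>
    </comment>
</comments>"""

requested = []  # URLs seen by fake_fetch, for inspection


def fake_fetch(url):
    """Hypothetical stand-in for a download: records the URL, returns canned XML."""
    requested.append(url)
    return COMMENTS


def http_fetch(url, headers=None):
    """Real download variant (not called in this sketch)."""
    request = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(request) as response:
        return response.read()


def embed_comments(feed_xml, fetch, suffix="/rss"):
    """Replace every <link> child of each Atom <entry> with the fetched XML."""
    root = ET.fromstring(feed_xml)
    link_tag = "{%s}link" % ATOM_NS
    for entry in root.findall("d:entry", NS):
        # Iterate over a copy, since the entry is modified in place.
        for index, child in enumerate(list(entry)):
            href = child.get("href")
            if child.tag != link_tag or href is None:
                continue
            new_elem = ET.fromstring(fetch(href + suffix))
            new_elem.tail = child.tail  # keep the surrounding whitespace
            entry[index] = new_elem     # swap <link> for the comments element
    return root


root = embed_comments(FEED, fake_fetch)
print(ET.tostring(root, encoding="unicode"))
```

Passing http_fetch instead of fake_fetch (for example via a lambda that adds the User-Agent header) should give the networked behaviour, assuming the comments feed is well-formed XML. Note that the inserted elements are in no namespace unless the comments feed declares one.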