How can I replace the <title> element in many HTML files?

Question

My directory ~/foo contains many HTML files. Each one has a different, unwanted title element. That is, each file contains the code

<title>something unwanted</title>

Many of these files also contain some span elements, such as

<span class="org-document-info-keyword">#+Title:</span> 
<span class="org-document-title">correct title</span>

I would like to write a script that scans each HTML file and, for every file that contains the second kind of code block, replaces the unwanted title with the correct one.

After replacing the title, I would like the script to remove the code in the second block.

For example, running the script on

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<!-- Created by htmlize-1.47 in css mode. -->
<html>
  <head>
    <title>foo.org</title>
    <style type="text/css">
    <!--
      body {
        color: #839496;
        background-color: #002b36;
      }
      .org-document-info {
        /* org-document-info */
        color: #839496;
      }
      .org-document-info-keyword {
        /* org-document-info-keyword */
        color: #586e75;
      }
      .org-document-title {
        /* org-document-title */
        color: #93a1a1;
        font-size: 130%;
        font-weight: bold;
      }
      .org-level-1 {
        /* org-level-1 */
        color: #cb4b16;
        font-size: 130%;
      }

      a {
        color: inherit;
        background-color: inherit;
        font: inherit;
        text-decoration: inherit;
      }
      a:hover {
        text-decoration: underline;
      }
    -->
    </style>
  </head>
  <body>
    <pre>
<span class="org-document-info-keyword">#+Title:</span> <span class="org-document-title">my desired title
</span><span class="org-document-info-keyword">#+Date:</span> <span class="org-document-info">&lt;2015-08-23 Sun&gt;
</span>
<span class="org-level-1">* hello world</span>

Vivamus id enim.  

</pre>
  </body>
</html>

should result in

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<!-- Created by htmlize-1.47 in css mode. -->
<html>
  <head>
    <title>my desired title</title>
    <style type="text/css">
      <!--
      body {
          color: #839496;
          background-color: #002b36;
      }
      .org-document-info {
          /* org-document-info */
          color: #839496;
      }
      .org-document-info-keyword {
          /* org-document-info-keyword */
          color: #586e75;
      }
      .org-document-title {
          /* org-document-title */
          color: #93a1a1;
          font-size: 130%;
          font-weight: bold;
      }
      .org-level-1 {
          /* org-level-1 */
          color: #cb4b16;
          font-size: 130%;
      }

      a {
          color: inherit;
          background-color: inherit;
          font: inherit;
          text-decoration: inherit;
      }
      a:hover {
          text-decoration: underline;
      }
    -->
    </style>
  </head>
  <body>
    <pre>
      <span class="org-document-info-keyword">#+Date:</span> <span class="org-document-info">&lt;2015-08-23 Sun&gt;
      </span>
      <span class="org-level-1">* hello world</span>

      Vivamus id enim.  

    </pre>
  </body>
</html>

Any idea which HTML parser is suited to this job?

Answer 1

Here is one way to do it in Python.

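import sys
from lxml import etree
from lxml.html import parse

# Parse the HTML file named on the command line.
doc = parse(sys.argv[1])
title = doc.find('.//title')
span1 = doc.find('.//span[@class="org-document-info-keyword"]')
span2 = doc.find('.//span[@class="org-document-title"]')
# Only rewrite files that actually contain the org title span.
if title is not None and span2 is not None:
    title.text = span2.text.strip()
    # In lxml, remove() also discards the removed element's tail text.
    span2.getparent().remove(span2)
    if span1 is not None:
        span1.getparent().remove(span1)
# Serialize as HTML; encoding='unicode' returns str rather than bytes.
print(etree.tostring(doc, method='html', encoding='unicode'))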

You can put this in a file called script.py and run it on an HTML source file foo.html, writing the result to new-foo.html, like this:

python script.py foo.html > new-foo.html
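The command above handles one file at a time, while the question asks about every HTML file in ~/foo. Below is a minimal sketch of a batch driver using only the standard library. The function name rewrite_html_files, the transform callback, the .bak backup suffix, and the UTF-8 encoding are all assumptions made for illustration; none of them come from the question or from lxml.

```python
from pathlib import Path
from typing import Callable, List, Optional


def rewrite_html_files(
    directory: str,
    transform: Callable[[str], Optional[str]],
) -> List[str]:
    """Apply `transform` to every *.html file directly inside `directory`.

    `transform` (a hypothetical callback) receives a file's text and returns
    the new text, or None to leave that file untouched.
    Returns the names of the files that were rewritten.
    """
    changed = []
    for path in sorted(Path(directory).expanduser().glob("*.html")):
        # Assumption: the files are UTF-8 encoded.
        original = path.read_text(encoding="utf-8")
        updated = transform(original)
        if updated is None or updated == original:
            continue  # nothing to change in this file
        # Keep a backup next to the original before overwriting it.
        backup = path.with_name(path.name + ".bak")
        backup.write_text(original, encoding="utf-8")
        path.write_text(updated, encoding="utf-8")
        changed.append(path.name)
    return changed
```

To use it, the lxml logic from script.py would be wrapped in a function that takes the page text and returns the rewritten text (or None when the org-document-title span is absent), and that function would be passed as transform together with "~/foo" as the directory.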
