Replace occurrences in HTML files

Question

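I have to replace certain occurrences in thousands of HTML files, and I intend to use a Linux script for this. Here is an example of the replacements I have to make:

From: <a class="wiki_link" href="/WebSphere+Application+Server">

To: <a class="wiki_link" href="/confluence/display/WIKIHAB1/WebSphere%20Application%20Server">

That is, add /confluence/display/WIKIHAB1 as a prefix and replace each "+" with "%20".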
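
As a minimal sketch of the rule just described, assuming it is applied to a single attribute value at a time (the name rewrite_link is hypothetical and only used for illustration):

```python
# Prefix taken from the example in the question.
PREFIX = "/confluence/display/WIKIHAB1"

def rewrite_link(url, prefix=PREFIX):
    """Add the prefix to a site-relative URL and turn every '+' into '%20'."""
    return prefix + url.replace("+", "%20")

print(rewrite_link("/WebSphere+Application+Server"))
# /confluence/display/WIKIHAB1/WebSphere%20Application%20Server
```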

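I will do the same for other tags, such as img, iframe, and so on.

First, which tool should I use for this? sed? awk? Something else?

If anybody has an example, I would really appreciate it.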

Answer 1

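After some research I found Beautiful Soup. It is a Python library for parsing HTML files, easy to use and well documented. I had no experience with Python and was still able to write the code without problems. Here is a Python example that makes the replacement I described in the question.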

#!/usr/bin/env python3

import os
from bs4 import BeautifulSoup

# Replaces each plus sign (+) with %20 and adds the /confluence... prefix to the
# href attribute of every anchor (a) tag that has wiki_link in its class attribute
def fixAnchorTags(soup):
    for tag in soup.find_all('a'):
        href = tag.get("href")
        classes = tag.get("class")

        if href is not None and classes is not None and "wiki_link" in classes:
            tag['href'] = "/confluence/display/WIKIHAB1" + href.replace("+", "%20")

# Creates a folder to save the converted files
def setup():
    os.makedirs("converted", exist_ok=True)

# Runs all fixes for each html file in the current folder
def run():
    for file in os.listdir("."):
        if file.endswith(".html"):
            print("Converting " + file)

            # Read as bytes so Beautiful Soup can detect the encoding itself
            with open(file, "rb") as htmlfile:
                soup = BeautifulSoup(htmlfile, "html.parser")

            fixAnchorTags(soup)

            # prettify("UTF-8") returns bytes, so write in binary mode
            with open(os.path.join("converted", file), "wb") as converted:
                converted.write(soup.prettify("UTF-8"))

setup()
run()
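
The question also mentions img, iframe and other tags, which the script above does not touch. One possible way to extend it is sketched below. It is not part of the original script: the LINK_ATTRS mapping, the fix_links name, and the decision to rewrite every site-relative URL (not only tags with the wiki_link class) are assumptions, so adjust them to the actual markup.

```python
from bs4 import BeautifulSoup

PREFIX = "/confluence/display/WIKIHAB1"

# Assumed mapping of tag name -> attribute that holds the link.
LINK_ATTRS = {"a": "href", "img": "src", "iframe": "src"}

def fix_links(soup):
    """Prefix site-relative URLs and replace '+' with '%20' in the mapped tags."""
    for tag_name, attr in LINK_ATTRS.items():
        for tag in soup.find_all(tag_name):
            url = tag.get(attr)
            if not url:
                continue
            # Skip external, protocol-relative, and already converted links.
            if not url.startswith("/") or url.startswith("//"):
                continue
            if url.startswith(PREFIX):
                continue
            tag[attr] = PREFIX + url.replace("+", "%20")
    return soup

sample = '<a href="/Some+Page">x</a><img src="/file/view/pic+1.png"/>'
print(fix_links(BeautifulSoup(sample, "html.parser")))
```

Calling fix_links(soup) in run() in place of fixAnchorTags(soup) would apply it to every file. The check against PREFIX keeps the function safe to run twice on the same document.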

awk

sed

html-parsing