Erases the following tag while trying to replace an HTML tag using Python

Question

I have a Word document in the following format:

test1.docx

 ["<h58>This is article|", "", ", "<s1>Author is|", "<s33>Research is on|", "<h4>CASE IS|", "<s6>1-3|"]

I am trying to find the tags that start with <s and replace each tag and its content with "":

import json
import re

def locatestag():
    fileis = open("test1.docx")

    for line in fileis:
        print(line)
        # Intended to remove each <s...> tag together with its content
        newfile = re.sub('<s.*?> .*? ', '', line)

    with open("new file.json", "w") as newoutput:
        json.dump(newfile, newoutput)

The final output file also makes the <h4> tag disappear.

The final content looks like this:

["<h58>This is article|", "", ", ]

How can I remove only the <s> tags and their content while keeping the rest of the tags (i.e. keeping the <h> tags)?

Answer 1

You only want to remove the tag, not everything after it, so there is no need to add the extra .*?

Here is the final code for you:

re.sub('<s.*?>','',line)
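
To see what this pattern does, here is a small sketch that applies it to the sample line from the question. The variable names `line` and `tags_removed` are only illustrative, and the sample is assumed to have been read as plain text already. Note that the text following each <s...> tag stays in place, so this fits only if dropping the tag alone is what you need.

```python
import re

# Sample line copied from the question (the unbalanced quotes are kept as-is).
line = '["<h58>This is article|", "", ", "<s1>Author is|", "<s33>Research is on|", "<h4>CASE IS|", "<s6>1-3|"]'

# Remove only the <s...> tags themselves; the text after each tag is kept.
tags_removed = re.sub(r'<s.*?>', '', line)
print(tags_removed)
```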

Answer 2

Adjust your regular expression so that only the <s.*> tags and their content are matched.

for line in fileis:
    print(line)
    newfile = re.sub('<s[^>]*>[^"]*(?=")','',line)

The resulting newfile will be:

["<h58>This is article|", "", ", "", "", "<h4>CASE IS|", ""]

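Of course, this assumes that the "unclosed" double quotes in what looks like an array of strings are not a typo but are intentionally part of your file content.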
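
As a self-contained sketch of the same idea, the pattern can be wrapped in a small helper and checked against the sample line. The names `S_TAG_WITH_CONTENT` and `strip_s_tags` are hypothetical and not part of the original code, and the sketch assumes each piece of content ends at the next double quote, as in the sample.

```python
import re

# "<s", anything up to the closing ">", then everything up to
# (but not including) the next double quote.
S_TAG_WITH_CONTENT = re.compile(r'<s[^>]*>[^"]*(?=")')

def strip_s_tags(line: str) -> str:
    """Return the line with every <s...> tag and its content removed."""
    return S_TAG_WITH_CONTENT.sub('', line)

# Sample line copied from the question (the unbalanced quotes are kept as-is).
sample = '["<h58>This is article|", "", ", "<s1>Author is|", "<s33>Research is on|", "<h4>CASE IS|", "<s6>1-3|"]'

cleaned = strip_s_tags(sample)
print(cleaned)
```

Because `[^>]*` and `[^"]*` cannot cross a `>` or a `"`, each match stays inside one quoted item, which is why the <h58> and <h4> entries are left untouched.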
