Replacing a regex match with the length of matched text

Question

How can I replace patterns such as <html> and </mainbody> with <4> and </8>? Here 4 and 8 are the number of letters inside the <>. The input will be read from a file.

import re

def main():
    # Regular expression to find patterns such as <html> and </html>
    pattern = re.compile("</?[a-zA-Z]+>")
    with open("input.txt") as fh:
        for line in fh:
            print(pattern.sub("***", line.strip()))


if __name__ == "__main__":
    main()

Answer 1

Use a custom function that returns the length of the match:

import re

def get_length(obj):
    # Group 1 is the tag name, including the leading '/' for closing tags.
    s = obj.group(1)
    return '</{}>'.format(len(s[1:])) if s.startswith('/') else '<{}>'.format(len(s))

>>> re.sub("<(/?[a-zA-Z]+)>", get_length, '<html>')
'<4>'
>>> re.sub("<(/?[a-zA-Z]+)>", get_length, '</html>')
'</4>'
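
Combined with the file-reading loop from the question, this could look roughly like the sketch below. The names TAG_PATTERN, tag_length, replace_tags and the path parameter are made up for illustration. The pattern captures the optional slash and the tag name in separate groups, which is a small variation on the function above rather than the same code.

```python
import re

# Group 1: optional '/', group 2: the letters of the tag name.
TAG_PATTERN = re.compile(r"<(/?)([a-zA-Z]+)>")

def tag_length(match):
    # Keep the slash (if any) and replace the name with its length.
    slash, name = match.groups()
    return "<{}{}>".format(slash, len(name))

def replace_tags(text):
    # re.sub calls tag_length once per match and uses its return value.
    return TAG_PATTERN.sub(tag_length, text)

def main(path="input.txt"):
    # Assumes the input file from the question exists at this path.
    with open(path) as fh:
        for line in fh:
            print(replace_tags(line.strip()))
```

Calling main() processes input.txt line by line; replace_tags can also be applied to any string directly.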

I hope you realize that your regex is very basic; it will not correctly handle tags that have attributes.
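
If tags with attributes need to be covered, one possible direction is to let the pattern skip everything between the tag name and the closing >. This is only a sketch under assumptions: it assumes that only the tag name should be counted, and it still fails on attribute values that contain a > character and on self-closing tags. The names ATTR_TAG_PATTERN, name_length and replace_tags_ignoring_attributes are hypothetical. For real HTML, a proper parser such as the standard library's html.parser is the safer choice.

```python
import re

# Assumption: a tag is '<', optional '/', a run of letters, then optionally
# whitespace followed by anything up to the next '>'.
ATTR_TAG_PATTERN = re.compile(r"<(/?)([a-zA-Z]+)(?:\s[^>]*)?>")

def name_length(match):
    # Only the tag name is counted; attributes are dropped from the output.
    slash, name = match.groups()
    return "<{}{}>".format(slash, len(name))

def replace_tags_ignoring_attributes(text):
    return ATTR_TAG_PATTERN.sub(name_length, text)
```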


python

regex