Ignore regular HTML tags in regex
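I need to find patterns in the text of an ugly HTML file. It is ugly because every character is wrapped in an absolutely positioned <span>, and each <span> is on its own line, like this: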
<span style="position:absolute; color:black; left:422px; top:3497px; font-size:21.6px;">M</span>
<span style="position:absolute; color:black; left:440px; top:3497px; font-size:21.6px;">T</span>
<span style="position:absolute; color:black; left:452px; top:3497px; font-size:21.6px;">V</span>
<span style="position:absolute; color:black; left:464px; top:3497px; font-size:21.6px;"> </span>
<span style="position:absolute; color:black; left:470px; top:3497px; font-size:21.6px;">N</span>
<span style="position:absolute; color:black; left:484px; top:3497px; font-size:21.6px;">e</span>
<span style="position:absolute; color:black; left:493px; top:3497px; font-size:21.6px;">t</span>
<span style="position:absolute; color:black; left:499px; top:3497px; font-size:21.6px;">w</span>
<span style="position:absolute; color:black; left:513px; top:3497px; font-size:21.6px;">o</span>
<span style="position:absolute; color:black; left:523px; top:3497px; font-size:21.6px;">r</span>
<span style="position:absolute; color:black; left:531px; top:3497px; font-size:21.6px;">k</span>
<span style="position:absolute; color:black; left:541px; top:3497px; font-size:21.6px;">s</span>
<span style="position:absolute; color:black; left:549px; top:3497px; font-size:21.6px;">,</span>
<span style="position:absolute; color:black; left:554px; top:3497px; font-size:21.6px;"> </span>
<span style="position:absolute; color:black; left:559px; top:3497px; font-size:21.6px;">I</span>
<span style="position:absolute; color:black; left:566px; top:3497px; font-size:21.6px;">n</span>
<span style="position:absolute; color:black; left:577px; top:3497px; font-size:21.6px;">c</span>
<span style="position:absolute; color:black; left:586px; top:3497px; font-size:21.6px;">.</span>
<span style="position:absolute; color:black; left:592px; top:3497px; font-size:21.6px;">,</span>
<span style="position:absolute; color:black; left:597px; top:3497px; font-size:21.6px;"> </span>
<span style="position:absolute; color:black; left:602px; top:3497px; font-size:21.6px;">1</span>
<span style="position:absolute; color:black; left:613px; top:3497px; font-size:21.6px;">5</span>
<span style="position:absolute; color:black; left:623px; top:3497px; font-size:21.6px;">2</span>
<span style="position:absolute; color:black; left:634px; top:3497px; font-size:21.6px;"> </span>
<span style="position:absolute; color:black; left:639px; top:3497px; font-size:21.6px;">F</span>
<span style="position:absolute; color:black; left:650px; top:3497px; font-size:21.6px;">.</span>
<span style="position:absolute; color:black; left:656px; top:3497px; font-size:21.6px;">3</span>
<span style="position:absolute; color:black; left:666px; top:3497px; font-size:21.6px;">d</span>
<span style="position:absolute; color:black; left:677px; top:3497px; font-size:21.6px;"> </span>
<span style="position:absolute; color:black; left:682px; top:3497px; font-size:21.6px;">1</span>
<span style="position:absolute; color:black; left:693px; top:3497px; font-size:21.6px;">2</span>
<span style="position:absolute; color:black; left:703px; top:3497px; font-size:21.6px;">0</span>
<span style="position:absolute; color:black; left:714px; top:3497px; font-size:21.6px;">9</span>
<span style="position:absolute; color:black; left:724px; top:3497px; font-size:21.6px;"> </span>
<span style="position:absolute; color:black; left:729px; top:3497px; font-size:21.6px;">(</span>
<span style="position:absolute; color:black; left:736px; top:3497px; font-size:21.6px;">9</span>
<span style="position:absolute; color:black; left:747px; top:3496px; font-size:13.6px;">t</span>
<span style="position:absolute; color:black; left:751px; top:3496px; font-size:13.6px;">h</span>
<span style="position:absolute; color:black; left:757px; top:3497px; font-size:21.6px;"> </span>
<span style="position:absolute; color:black; left:763px; top:3497px; font-size:21.6px;">C</span>
<span style="position:absolute; color:black; left:777px; top:3497px; font-size:21.6px;">i</span>
<span style="position:absolute; color:black; left:782px; top:3497px; font-size:21.6px;">r</span>
<span style="position:absolute; color:black; left:789px; top:3497px; font-size:21.6px;">.</span>
<span style="position:absolute; color:black; left:795px; top:3497px; font-size:21.6px;"> </span>
<span style="position:absolute; color:black; left:800px; top:3497px; font-size:21.6px;">1</span>
<span style="position:absolute; color:black; left:810px; top:3497px; font-size:21.6px;">9</span>
<span style="position:absolute; color:black; left:821px; top:3497px; font-size:21.6px;">9</span>
<span style="position:absolute; color:black; left:831px; top:3497px; font-size:21.6px;">8</span>
<span style="position:absolute; color:black; left:842px; top:3497px; font-size:21.6px;">)</span>
Here is the regex I want to match (in Vim syntax): [0-9]\+ F\.3d [0-9]\+. So in this example I want to match 152 F.3d 1209. I want to wrap it in an <a>, ending up with this:
<a href="http://www.whosebug.com/">
<span style="position:absolute; color:black; left:602px; top:3497px; font-size:21.6px;">1</span>
<span style="position:absolute; color:black; left:613px; top:3497px; font-size:21.6px;">5</span>
<span style="position:absolute; color:black; left:623px; top:3497px; font-size:21.6px;">2</span>
<span style="position:absolute; color:black; left:634px; top:3497px; font-size:21.6px;"> </span>
<span style="position:absolute; color:black; left:639px; top:3497px; font-size:21.6px;">F</span>
<span style="position:absolute; color:black; left:650px; top:3497px; font-size:21.6px;">.</span>
<span style="position:absolute; color:black; left:656px; top:3497px; font-size:21.6px;">3</span>
<span style="position:absolute; color:black; left:666px; top:3497px; font-size:21.6px;">d</span>
<span style="position:absolute; color:black; left:677px; top:3497px; font-size:21.6px;"> </span>
<span style="position:absolute; color:black; left:682px; top:3497px; font-size:21.6px;">1</span>
<span style="position:absolute; color:black; left:693px; top:3497px; font-size:21.6px;">2</span>
<span style="position:absolute; color:black; left:703px; top:3497px; font-size:21.6px;">0</span>
<span style="position:absolute; color:black; left:714px; top:3497px; font-size:21.6px;">9</span>
</a>
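I could write a lengthy regex that ignores each HTML tag, but that quickly becomes unworkable (for example, it is hard to match [0-9]\+ when there are HTML tags before and after every digit).
I could strip out the HTML with something like %s/<.*>\(.*\)<.*>/\1/g, but that does not work either, because I need to preserve the formatting.
I understand that you can't parse HTML with a regex. But I don't need to parse arbitrary HTML; I only need to deal with a known set of tags. Is there an elegant way to do this? Or should I give up on regex and use something like an XPath parser?
I'm open to any language, but I'd prefer Python, JavaScript, or Vim.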
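Well, I would extract the text nodes into a plain string, match against that, and then go back to the DOM tree to retrieve the original HTML. Something like this: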
import re

import lxml.etree
import lxml.html

with open('foo.html') as f:
    source = lxml.html.parse(f)

# One <span> per character, in document order.
letters = source.findall('.//span')
# Assumes every span holds exactly one character, so string indices map to spans.
string = ''.join(s.text for s in letters)
match = re.search(r'[0-9]+ F\.3d [0-9]+', string)
assert match is not None
start, end = match.span()
html = '\n'.join(lxml.etree.tostring(el).decode('utf8')
                 for el in letters[start:end])
print('<a href="foo">{}</a>'.format(html))
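Note that calling tostring() in a loop is probably not the best for performance. You should instead build an a element, append the letters to it, and call tostring() once on the a element.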
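The suggestion of building a single a element could look roughly like the sketch below. It swaps lxml for the standard library's xml.etree.ElementTree so that it runs without extra packages; that swap is mine, not part of the answer. The SAMPLE markup, the wrap_match name, and the "foo" href are made up for illustration, and the code assumes every span holds exactly one character.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: one <span> per character, wrapped in a <div> so the
# fragment is well-formed XML.
SAMPLE = "<div>\n" + "\n".join(
    '<span style="left:{}px;">{}</span>'.format(10 * i, ch)
    for i, ch in enumerate("Inc., 152 F.3d 1209 (9th")
) + "\n</div>"


def wrap_match(markup, pattern, href):
    """Return an <a> element (as a string) holding the spans that match."""
    root = ET.fromstring(markup)
    letters = root.findall('.//span')
    # Index i of this string corresponds to letters[i].
    text = ''.join(s.text or '' for s in letters)
    match = re.search(pattern, text)
    if match is None:
        return None
    start, end = match.span()
    anchor = ET.Element('a', {'href': href})
    anchor.text = '\n'
    anchor.extend(letters[start:end])
    # Serialize once, on the wrapper, instead of once per span.
    return ET.tostring(anchor, encoding='unicode')


result = wrap_match(SAMPLE, r'[0-9]+ F\.3d [0-9]+', 'foo')
print(result)
```

The serialized spans keep their original attributes, so the absolute positioning survives the wrapping.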
This code lacks a lot of error handling and relies on a strict input format, but consider:
import re
import os
html = '''<span style="position:absolute; color:black; left:422px; top:3497px; font-size:21.6px;">M</span>
<span style="position:absolute; color:black; left:440px; top:3497px; font-size:21.6px;">T</span>
<span style="position:absolute; color:black; left:452px; top:3497px; font-size:21.6px;">V</span>
... (Lines omitted)
<span style="position:absolute; color:black; left:842px; top:3497px; font-size:21.6px;">)</span>
'''
# This is sloppy, but if your input format remains the same should work...
chars = ''.join([line[line.find('>') + 1] for line in html.splitlines()])
# chars => "MTV Networks, Inc., 152 F.3d 1209 (9th Cir. 1998)"
# Use regex to search chars
mat = re.search(r'\d+ F\.3d \d+', chars)
# Extract lines from html based on the start and end positions of the regex match
block = html.splitlines()[mat.start():mat.end()]
# Wrap the lines with your anchor tag
block = ['<a href="http://www.whosebug.com/">'] + block + ['</a>']
# Print the list
print(os.linesep.join(block))
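It first extracts the single character inside each <span> tag and puts them all into one string. It then searches that string with your regex (modified for Python's re module).
Because the position of a character in the chars string corresponds exactly to the line number of the matching line in html, we can use the start and end positions of the match in the chars string to select the lines of html we want to wrap.
We insert elements at the start and end of the block list, corresponding to your anchor tag, and print the result.
As long as your input is exactly as you specified, there is no need to bring in a DOM parser or anything very complicated, although it may turn out that something like that is needed.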
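If the file holds several citations, the same index-to-line correspondence can be applied to every match. The sketch below is one way to do that, under the same strict assumption of one single-character span per line; the wrap_matches name and the sample string are hypothetical. It works from the last match backwards so that inserting the anchor lines does not shift the indices of earlier matches.

```python
import re


def wrap_matches(html, pattern, href):
    """Wrap every run of lines whose characters match `pattern` in an <a> tag."""
    lines = html.splitlines()
    # Assumes each line is exactly '<span ...>X</span>' with one character X.
    chars = ''.join(line[line.find('>') + 1] for line in lines)
    # Go right to left so earlier match indices stay valid after inserts.
    for mat in reversed(list(re.finditer(pattern, chars))):
        lines.insert(mat.end(), '</a>')
        lines.insert(mat.start(), '<a href="{}">'.format(href))
    return '\n'.join(lines)


# Hypothetical input built from a short string, one span per character.
sample = '\n'.join(
    '<span style="left:{}px;">{}</span>'.format(5 * i, c)
    for i, c in enumerate('See 152 F.3d 1209 and 98 F.3d 7.')
)
result = wrap_matches(sample, r'\d+ F\.3d \d+', 'http://www.whosebug.com/')
print(result)
```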
Here is a solution using awk:
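$ cat mornin.awk
# First pass: strip the span tags and collect the bare characters in s
NR == FNR {
    gsub("</?span[^<]*>", "", $0)
    s = s $0
    next
}
# Second pass, first line: locate the match and open the anchor
FNR == 1 {
    i = match(s, /[0-9]+ F\.3d [0-9]+/)
    len = RLENGTH
    print "<a href=\"http://www.whosebug.com/\">"
}
# Print only the lines whose characters fall inside the match
FNR == i, FNR == (i + len - 1)
END {
    print "</a>"
}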
This solution needs two passes over the text, so you give the file twice on the command line:
$ awk -f mornin.awk mornin.txt mornin.txt
<a href="http://www.whosebug.com/">
<span style="position:absolute; color:black; left:602px; top:3497px; font-size:21.6px;">1</span>
<span style="position:absolute; color:black; left:613px; top:3497px; font-size:21.6px;">5</span>
<span style="position:absolute; color:black; left:623px; top:3497px; font-size:21.6px;">2</span>
<span style="position:absolute; color:black; left:634px; top:3497px; font-size:21.6px;"> </span>
<span style="position:absolute; color:black; left:639px; top:3497px; font-size:21.6px;">F</span>
<span style="position:absolute; color:black; left:650px; top:3497px; font-size:21.6px;">.</span>
<span style="position:absolute; color:black; left:656px; top:3497px; font-size:21.6px;">3</span>
<span style="position:absolute; color:black; left:666px; top:3497px; font-size:21.6px;">d</span>
<span style="position:absolute; color:black; left:677px; top:3497px; font-size:21.6px;"> </span>
<span style="position:absolute; color:black; left:682px; top:3497px; font-size:21.6px;">1</span>
<span style="position:absolute; color:black; left:693px; top:3497px; font-size:21.6px;">2</span>
<span style="position:absolute; color:black; left:703px; top:3497px; font-size:21.6px;">0</span>
<span style="position:absolute; color:black; left:714px; top:3497px; font-size:21.6px;">9</span>
</a>