Python. How to find all occurrences of a matched substring?

Question

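I have a large string, an HTML page. I need to find all the flash drive names, i.e. I need the content between the double quotes in: data-name="USB Flash-drive Leef Fuse 32Gb">. So I need the string between data-name=" and ">. Please don't mention BeautifulSoup; I need to do this without BeautifulSoup and preferably without regular expressions, though a regex solution is also acceptable.

I tried this: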

p = re.compile('(?<=")[^,]+(?=")')
result = p.match(html_str)
print(result)

But the result is None, even though the same pattern works on regex101.com.

Answer 1

py2: https://docs.python.org/2/library/htmlparser.html

py3: https://docs.python.org/3/library/html.parser.html

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # tag = 'sometag'
        for attr in attrs:
            # attr = ('data-name', 'USB Flash-drive Leef Fuse 32Gb')
            if attr[0] == 'data-name':
                print(attr[1])

parser = MyHTMLParser()
parser.feed('<sometag data-name="USB Flash-drive Leef Fuse 32Gb">hello  world</sometag>')

Output:

USB Flash-drive Leef Fuse 32Gb

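I added some comments to the code to show the kind of data structures the parser hands back.

It should be easy to build on from here.

Just feed it the HTML and it will parse it fine. Refer to the docs and keep experimenting.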
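
To get every name back as a list instead of printing each one, the same handler can append to an attribute on the parser. This is only a minimal sketch, assuming the attribute is always called data-name; the class name DataNameCollector, the helper collect_data_names and the second sample entry are made up for illustration:

```python
from html.parser import HTMLParser


class DataNameCollector(HTMLParser):
    """Collects the value of every data-name attribute it sees."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        for name, value in attrs:
            if name == 'data-name' and value is not None:
                self.names.append(value)


def collect_data_names(html_str):
    """Parse html_str and return all data-name values in document order."""
    parser = DataNameCollector()
    parser.feed(html_str)
    parser.close()
    return parser.names


# Hypothetical sample markup, not taken from the real page
sample = (
    '<div data-name="USB Flash-drive Leef Fuse 32Gb">a</div>'
    '<div data-name="Example Drive 16Gb">b</div>'
)
print(collect_data_names(sample))
```

Because the parser works on attributes rather than raw text, this does not depend on the attribute being the last one before the closing >.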

Answer 2

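If you want to do this with basic Python string parsing:

s = "html string"
marker = 'data-name="'
# skip past the marker so the result contains only the value
start = s.find(marker) + len(marker)
# look for the closing '">' only after the start of the value
end = s.find('">', start)
output = s[start:end]

This is what happens in my Python shell:

>>> s = 'junk...data-name="USB Flash-drive Leef Fuse 32Gb">...junk'
>>> marker = 'data-name="'
>>> start = s.find(marker) + len(marker)
>>> end = s.find('">', start)
>>> output = s[start:end]
>>> output
'USB Flash-drive Leef Fuse 32Gb'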
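
Since the question asks for all occurrences, the find calls can be repeated in a loop, each time resuming the search after the previous match. A rough sketch, assuming every value is terminated by "> as in the question; the function name find_all_between and the second sample entry are hypothetical:

```python
def find_all_between(text, start_marker, end_marker):
    """Return every substring found between start_marker and end_marker."""
    results = []
    pos = 0
    while True:
        start = text.find(start_marker, pos)
        if start == -1:
            break  # no more start markers
        start += len(start_marker)  # move to the first character of the value
        end = text.find(end_marker, start)
        if end == -1:
            break  # start marker without a matching end marker
        results.append(text[start:end])
        pos = end + len(end_marker)  # continue after this match
    return results


# Hypothetical sample text, not taken from the real page
s = ('junk...data-name="USB Flash-drive Leef Fuse 32Gb">...junk'
     '...data-name="Example Drive 16Gb">...junk')
print(find_all_between(s, 'data-name="', '">'))
```

Note that this breaks if data-name is not the last attribute in its tag, because the value would then not be followed by ">.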

Let me know whether this part of the script works on its own.
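
As for the regex attempt in the question: re.match only tries the pattern at the very beginning of the string, so it returns None unless the page starts with a match, whereas regex101.com searches the whole text. re.findall scans the entire string and returns every match. A sketch under the assumption that the values never contain a double quote; the pattern, the name find_data_names and the sample markup are illustrative, not taken from the original page:

```python
import re

# Capture everything between data-name=" and the next double quote
DATA_NAME_RE = re.compile(r'data-name="([^"]*)"')


def find_data_names(html_str):
    """Return the captured value of every data-name attribute in html_str."""
    # With exactly one capturing group, findall returns a list of that group's text
    return DATA_NAME_RE.findall(html_str)


# Hypothetical sample markup
html_str = (
    '<li data-name="USB Flash-drive Leef Fuse 32Gb"></li>'
    '<li data-name="Example Drive 16Gb"></li>'
)
print(find_data_names(html_str))

# match() anchors at position 0, where the text is '<li ...', so this is None
print(DATA_NAME_RE.match(html_str))
```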

Tags: python, regex, html-parsing, python-3.x