Python String Extract between </div> and </td>

Question

I am scraping a website using Python and BeautifulSoup. I am able to find all the tds on the page with the following command:

data = soup.find_all('td')

Then I pick out the individual td that I need to use first:

td = data[19]

If I print this td, the output is:

<td data-geoid="0617568" data-isnumeric="1" data-srcnote="true" data-value="18.8">
<span data-title="Culver City city, California"></span><div class="qf-sourcenote">
<span></span><a title="Source: 2018 American Community Survey (ACS), 5-year estimates. Estimates are not comparable to other geographic levels due to methodology differences that may exist between different data sources."></a>
</div>18.8%</td>

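Now I want to extract the data between the end of the div and the end of the td, i.e. 18.8%. Following this post, I tried to extract it with the code below: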

m = re.search('</div>(.+?)</td>', td)

This gave me the following error:

Traceback (most recent call last):
  File "/Users/Alfie/PycharmProjects/474scrape/srape.py", line 18, in <module>
    m = re.search('</div>(.+?)</td>', td)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/re.py", line 183, in search
    return _compile(pattern, flags).search(string)
TypeError: expected string or bytes-like object

I think the problem is with escape characters or something similar in the markup I am using. Any help is appreciated.

Answer 1

td is probably not of type str.

If td is of type str, the code should work fine.

import re

td = """
<td data-geoid="0617568" data-isnumeric="1" data-srcnote="true" data-value="18.8">
<span data-title="Culver City city, California"></span><div class="qf-sourcenote">
<span></span><a title="Source: 2018 American Community Survey (ACS), 5-year estimates. Estimates are not comparable to other geographic levels due to methodology differences that may exist between different data sources."></a>
</div>18.8%</td>
"""

m = re.search(r'</div>(.+?)</td>', td)
print(m.group(1))
# 18.8%

Try replacing

m = re.search(r'</div>(.+?)</td>', td)

with

m = re.search(r'</div>(.+?)</td>', str(td))
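To check the claim that td is not a str, here is a minimal sketch that parses a shortened version of the markup from the question. The shortened HTML and the variable names type_name, raised, value and sibling are my own assumptions for illustration; it also assumes the beautifulsoup4 package is installed and uses the built-in "html.parser" backend.

```python
import re
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Shortened, illustrative version of the markup from the question.
html = (
    '<table><tr>'
    '<td data-value="18.8"><span></span>'
    '<div class="qf-sourcenote"><span></span><a title="Source"></a>\n'
    '</div>18.8%</td>'
    '</tr></table>'
)

soup = BeautifulSoup(html, "html.parser")
td = soup.find_all("td")[0]

# find_all returns Tag objects, not strings.
type_name = type(td).__name__
print(type_name)  # Tag

# Passing the Tag straight to re.search reproduces the TypeError.
try:
    re.search(r"</div>(.+?)</td>", td)
    raised = False
except TypeError:
    raised = True
print(raised)  # True

# Converting the Tag to its markup string makes the pattern usable.
m = re.search(r"</div>(.+?)</td>", str(td))
value = m.group(1)
print(value)  # 18.8%

# Possible alternative without a regex: the text node right after the div.
sibling = str(td.find("div").next_sibling).strip()
print(sibling)  # 18.8%
```

The regex route depends on how BeautifulSoup serializes the tag, so the next_sibling lookup may be the more robust option if the markup changes.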

Answer 2

Try passing the pattern as a raw string.

m = re.search(r'</div>(.+?)</td>', td)

If that does not work, check the type of td; if it is not a string, convert it to a string before passing it to the function.
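The type check and conversion described above could be wrapped in a small helper. This is only a sketch using the standard library: extract_after_div and FakeTag are hypothetical names, and FakeTag is just a stand-in for any object whose str() returns its markup.

```python
import re

PATTERN = re.compile(r"</div>(.+?)</td>")


def extract_after_div(td):
    """Return the text between </div> and </td>, or None if not found."""
    # Anything that is not already a str is converted with str() first.
    text = td if isinstance(td, str) else str(td)
    match = PATTERN.search(text)
    return match.group(1) if match else None


class FakeTag:
    """Hypothetical stand-in for a parsed element; str() gives its markup."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


markup = '<td><div class="qf-sourcenote"></div>18.8%</td>'
print(extract_after_div(markup))                  # 18.8%
print(extract_after_div(FakeTag(markup)))         # 18.8%
print(extract_after_div("<td>no div here</td>"))  # None
```

Returning None when nothing matches avoids an AttributeError from calling group on a failed search.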

python

python-re