Python: extracting the string between </div> and </td>
I am scraping a website with Python and BeautifulSoup. I can find all the td elements on the page with the following call:
data = soup.find_all('td')
Then I pick out the individual td I need:
td = data[19]
If I print this td, the output is:
<td data-geoid="0617568" data-isnumeric="1" data-srcnote="true" data-value="18.8">
<span data-title="Culver City city, California"></span><div class="qf-sourcenote">
<span></span><a title="Source: 2018 American Community Survey (ACS), 5-year estimates. Estimates are not comparable to other geographic levels due to methodology differences that may exist between different data sources."></a>
</div>18.8%</td>
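Now I want to extract the data between the end of the div and the end of the td, i.e. 18.8%. Following this post, I tried to extract it with the code below: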
m = re.search('</div>(.+?)</td>', td)
This gives me the following error:
Traceback (most recent call last):
File "/Users/Alfie/PycharmProjects/474scrape/srape.py", line 18, in <module>
m = re.search('</div>(.+?)</td>', td)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/re.py", line 183, in search
return _compile(pattern, flags).search(string)
TypeError: expected string or bytes-like object
I think the problem is with escape characters or something similar in the markup I am using. Any help is appreciated.
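td is probably not of type str: an element returned by find_all is a bs4.element.Tag object, and re.search only accepts a string or bytes-like object. If td is of type str, the code works as expected: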
import re
td = """
<td data-geoid="0617568" data-isnumeric="1" data-srcnote="true" data-value="18.8">
<span data-title="Culver City city, California"></span><div class="qf-sourcenote">
<span></span><a title="Source: 2018 American Community Survey (ACS), 5-year estimates. Estimates are not comparable to other geographic levels due to methodology differences that may exist between different data sources."></a>
</div>18.8%</td>
"""
m = re.search(r'</div>(.+?)</td>', td)
print(m.group(1))
# 18.8%
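To see the failing case without needing the scraped page, here is a minimal sketch that reproduces the TypeError. FakeTag is a hypothetical stand-in for the parsed element, not part of BeautifulSoup; like a parsed element, it prints as markup but is not a str.

```python
import re


class FakeTag:
    """Hypothetical stand-in for a parsed element: printable, but not a str."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


td = FakeTag('<td><div class="qf-sourcenote"></div>18.8%</td>')
pattern = r'</div>(.+?)</td>'

# print(td) shows markup, which makes the object look like a string.
print(td)
print(type(td).__name__, isinstance(td, str))

try:
    re.search(pattern, td)  # re only accepts str or bytes-like objects
    error = None
except TypeError as exc:
    error = str(exc)

print(error)
```

The point is that print() calls str() on the object for you, so the printed output says nothing about the object's real type; checking type(td) does.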
Try replacing
m = re.search(r'</div>(.+?)</td>', td)
with
m = re.search(r'</div>(.+?)</td>', str(td))
Try passing the pattern as a raw string:
m = re.search(r'</div>(.+?)</td>', td)
If that does not work, check the type of td; if it is not a string, convert it to one before passing it to the function.
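The check-then-convert advice above could be wrapped in a small helper. This is only a sketch: text_after_div and DIV_TO_TD are hypothetical names, and it assumes that calling str() on the element yields its markup, as the printed output in the question suggests.

```python
import re

# Same pattern as in the question, compiled once for reuse.
DIV_TO_TD = re.compile(r'</div>(.+?)</td>')


def text_after_div(td):
    """Return the text between </div> and </td>, or None if there is no match."""
    # Convert non-str objects (for example a parsed element) to their markup.
    markup = td if isinstance(td, str) else str(td)
    match = DIV_TO_TD.search(markup)
    # Guard against a failed match instead of calling .group() on None.
    return match.group(1).strip() if match else None


print(text_after_div('<td><div></div>18.8%</td>'))
print(text_after_div('<td>no div here</td>'))
```

Returning None on a failed match avoids an AttributeError when a cell does not contain a div, which is likely when looping over every td on a page.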