Parsing a Unicode file containing non-breaking space characters

Question

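I am using BeautifulSoup to parse an HTML page so that I can find and extract specific items.

As far as I can tell, the problem comes from a conflict between BeautifulSoup and the Python parser. I am looking for a specific piece of text in the HTML that will lead me to the anchor tag I want to extract, but the comparison never seems to match. Here is my code: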

import requests
from bs4 import BeautifulSoup

with requests.Session() as s:
  r = s.get('https://www.rbkc.gov.uk/planning/searches/details.aspx?batch=20&id=PP/11/04187&type=&tab=#tabs-planning-6')
  c = s.cookies.get_dict()
soup = BeautifulSoup(r.text, 'lxml')
table = soup.find('table', {'id': 'casefiledocs'})

vals = []
rows = table.findAll('tr')
for tr in rows:
  cols = tr.findAll('td')
  for td in cols:
    # Look for the cell whose text is exactly 'Application Form'
    if td.get_text().encode('utf-8') == 'Application Form':
      print td

Does anyone have a solution for this? Any help is much appreciated.

Answer 1

Strip the surrounding whitespace before comparing:

if td.get_text().strip() == 'Application Form':
    ...
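
To illustrate why this helps, here is a minimal sketch that needs no network access. It assumes, as the question title suggests, that the cell text is padded with non-breaking spaces (U+00A0); the sample string and the helper name normalize_cell_text are made up for the example and are not part of the original page or of BeautifulSoup. On a Unicode string, strip() treats U+00A0 as whitespace, so it trims the padding, but it leaves any non-breaking space between words untouched. Replacing U+00A0 with a plain space first covers both cases.

```python
# Hypothetical cell text: non-breaking spaces around and between the words.
raw = u'\xa0Application\xa0Form\xa0'


def normalize_cell_text(text):
    """Replace non-breaking spaces with plain spaces, then trim both ends."""
    return text.replace(u'\xa0', u' ').strip()


# The direct comparison fails because of the U+00A0 characters.
print(raw == u'Application Form')                       # False

# strip() removes the leading and trailing U+00A0 but not the inner one.
print(raw.strip() == u'Application\xa0Form')            # True

# Normalizing first makes the comparison succeed.
print(normalize_cell_text(raw) == u'Application Form')  # True
```

In the loop from the question this would read `if normalize_cell_text(td.get_text()) == u'Application Form':`. Note that the comparison is done on the Unicode text returned by get_text(), without the encode('utf-8') call.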

python

unicode

beautifulsoup