How to avoid getting broken words while web crawling

Question

I'm trying to scrape movie titles from this site: https://www.the-numbers.com/market/2019/top-grossing-movies

and I keep getting truncated titles such as "John Wick: Chapter 3 –".

Here is the code:

import requests
from bs4 import BeautifulSoup

url = "https://www.the-numbers.com/market/" + "2019" + "/top-grossing-movies"
raw = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
html = BeautifulSoup(raw.text, "html.parser")
# Select the title link in each table row
movie_list = html.select("#page_filling_chart table tr > td > b > a")  # "#page_filling_chart > table > tbody > tr > td > b"
for movie in movie_list:
    print(movie.text)

This is the output:

Avengers: Endgame
The Lion King
Frozen II
Toy Story 4
Captain Marvel
Star Wars: The Rise of Skyw…
Spider-Man: Far From Home
Aladdin
Joker
Jumanji: The Next Level
It: Chapter Two
Us
Fast & Furious Presents: Ho…
John Wick: Chapter 3 â€” Para…
How to Train Your Dragon: T…
The Secret Life of Pets 2
PokÃ©mon: Detective Pikachu
Once Upon a Timeâ€¦in Hollywo…

I'd like to know why I keep getting these broken words and how to fix it!

Answer 1

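Because this page is server-rendered, the shortened titles are already in the HTML of the list page, so when a title is truncated you can request that movie's own page to get the full title. (Also remember to extract the title with a regular expression, because the heading on the movie page includes the release year.)

Try the code below: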

import requests
from bs4 import BeautifulSoup

url = "https://www.the-numbers.com/market/" + "2019" + "/top-grossing-movies"
raw = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
html = BeautifulSoup(raw.text, "html.parser")
movie_list = html.select("#page_filling_chart table tr > td > b > a")  # "#page_filling_chart > table > tbody > tr > td > b"
for movie in movie_list:
    raw = requests.get("https://www.the-numbers.com" + movie.get("href"), headers={'User-Agent': 'Mozilla/5.0'})
    raw.encoding = 'utf-8'
    html = BeautifulSoup(raw.text, "html.parser")
    print(html.select_one("#main > div > h1").text)

This gives me:

Avengers: Endgame (2019)
The Lion King (2019)
Frozen II (2019)
Toy Story 4 (2019)
Captain Marvel (2019)
Star Wars: The Rise of Skywalker (2019)
Spider-Man: Far From Home (2019)
....
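
The regular expression mentioned above is not shown in the code, so here is a minimal sketch of that step. The helper names `strip_year` and `is_truncated` are my own and not part of any library, and the pattern assumes the heading always ends with a four-digit year in parentheses, as in the output above.

```python
import re

# Assumption: movie page headings end with a year such as " (2019)".
YEAR_SUFFIX = re.compile(r"\s*\(\d{4}\)\s*$")


def strip_year(heading):
    """Remove a trailing "(YYYY)" from a movie page heading."""
    return YEAR_SUFFIX.sub("", heading).strip()


def is_truncated(title):
    """Return True if a list-page title was cut off with an ellipsis."""
    return title.rstrip().endswith(("…", "..."))


print(strip_year("Avengers: Endgame (2019)"))
print(is_truncated("Star Wars: The Rise of Skyw…"))
```

In the loop above you could call `is_truncated(movie.text)` first and only request the movie page when it returns True, which avoids one extra request for every title that is already complete, and then pass the `h1` text through `strip_year`.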

Answer 2

The strings need to be processed like this. The workaround code is:

import unicodedata

import requests
from bs4 import BeautifulSoup

url = "https://www.the-numbers.com/market/" + "2019" + "/top-grossing-movies"
raw = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
html = BeautifulSoup(raw.text, "lxml")
movie_list = html.select("#page_filling_chart table tr > td > b > a")  # "#page_filling_chart > table > tbody > tr > td > b"

for movie in movie_list:
    # Decompose accented characters (NFKD), then drop everything that is not ASCII
    print(unicodedata.normalize('NFKD', movie.text).encode('ascii', 'ignore').decode())

The output looks like this:

Avengers: Endgame
The Lion King
Frozen II
Toy Story 4
Captain Marvel
Star Wars: The Rise of Skyw...
Spider-Man: Far From Home
Aladdin
Joker
Jumanji: The Next Level
It: Chapter Two
Us
Fast & Furious Presents: Ho...
John Wick: Chapter 3 a Para...
How to Train Your Dragon: T...
The Secret Life of Pets 2
PokAmon: Detective Pikachu
Once Upon a Timeain Hollywo...
Shazam!
Aquaman
Knives Out
Dumbo
Maleficent: Mistress of Evil
.
.

Narcissister Organ Player
Chef Flynn
I am Not a Witch
Divide and Conquer: The Sto...
Senso
Never-Ending Man: Hayao Miy...
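
Note that titles such as "PokAmon" in this output still differ from the real title, because folding to ASCII drops characters instead of decoding them. Sequences like "Ã©" usually mean UTF-8 bytes were decoded with a single-byte encoding. Below is a minimal sketch of reversing that, assuming the wrong encoding was ISO-8859-1; `fix_mojibake` is a hypothetical helper name, not a library function.

```python
def fix_mojibake(text, wrong_encoding="iso-8859-1"):
    """Undo UTF-8 text that was decoded with a single-byte encoding.

    Assumption: wrong_encoding is the encoding that was applied by mistake.
    If the round trip fails, the text is returned unchanged.
    """
    try:
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Reproduce the problem: UTF-8 bytes read as ISO-8859-1.
garbled = "Pokémon: Detective Pikachu".encode("utf-8").decode("iso-8859-1")
print(garbled)
print(fix_mojibake(garbled))
```

In the scraper itself it is probably simpler to set `raw.encoding = 'utf-8'` before reading `raw.text`, as Answer 1 does, so the text is decoded correctly in the first place. Neither approach restores titles that the list page cut off with "…".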

python

web-crawler

css-selectors