如何避免在网络爬虫时出现断词
How to avoid getting broken words while webcrawling
我正在尝试从该网站抓取电影片名:https://www.the-numbers.com/market/2019/top-grossing-movies
并且不断出现像“John Wick: Chapter 3 –”这样的断句。
这是图片:
这是代码:
url = "https://www.the-numbers.com/market/" + "2019" + "/top-grossing-movies"
raw = requests.get(url,
headers={'User-Agent':'Mozilla/5.0'})
html = BeautifulSoup(raw.text, "html.parser")
movie_list = html.select("#page_filling_chart table tr > td > b > a") #"#page_filling_chart > table > tbody > tr > td > b"
for i in range(len(movie_list)):
print(movie_list[i].text)
这些是输出:
Avengers: Endgame
The Lion King
Frozen II
Toy Story 4
Captain Marvel
Star Wars: The Rise of Skyw…
Spider-Man: Far From Home
Aladdin
Joker
Jumanji: The Next Level
It: Chapter Two
Us
Fast & Furious Presents: Ho…
John Wick: Chapter 3 — Para…
How to Train Your Dragon: T…
The Secret Life of Pets 2
Pokémon: Detective Pikachu
Once Upon a Time…in Hollywo…
我想知道为什么我总是收到这些断字以及如何解决这个问题!
由于此页面是 server-render,当标题损坏时,您可以单独请求这些页面。(另外不要忘记通过正则表达式获取标题,因为其页面标题包含出版物日期。)
试试下面的代码:
import requests
from bs4 import BeautifulSoup
url = "https://www.the-numbers.com/market/" + "2019" + "/top-grossing-movies"
raw = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
html = BeautifulSoup(raw.text, "html.parser")
movie_list = html.select("#page_filling_chart table tr > td > b > a") # "#page_filling_chart > table > tbody > tr > td > b"
for movie in movie_list:
raw = requests.get("https://www.the-numbers.com" + movie.get("href"), headers={'User-Agent': 'Mozilla/5.0'})
raw.encoding = 'utf-8'
html = BeautifulSoup(raw.text, "html.parser")
print(html.select_one("#main > div > h1").text)
这给了我:
Avengers: Endgame (2019)
The Lion King (2019)
Frozen II (2019)
Toy Story 4 (2019)
Captain Marvel (2019)
Star Wars: The Rise of Skywalker (2019)
Spider-Man: Far From Home (2019)
....
需要这样处理字符串,解决方法代码为:
import requests
from bs4 import BeautifulSoup
url = "https://www.the-numbers.com/market/" + "2019" + "/top-grossing-movies"
raw = requests.get(url,
headers={'User-Agent':'Mozilla/5.0'})
html = BeautifulSoup(raw.text, "lxml")
movie_list = html.select("#page_filling_chart table tr > td > b > a") #"#page_filling_chart > table > tbody > tr > td > b"
import unicodedata
for i in range(len(movie_list)):
movie_name = movie_list[i].text
print(unicodedata.normalize('NFKD', movie_name).encode('ascii', 'ignore').decode())
输出是这样的:
Avengers: Endgame
The Lion King
Frozen II
Toy Story 4
Captain Marvel
Star Wars: The Rise of Skyw...
Spider-Man: Far From Home
Aladdin
Joker
Jumanji: The Next Level
It: Chapter Two
Us
Fast & Furious Presents: Ho...
John Wick: Chapter 3 a Para...
How to Train Your Dragon: T...
The Secret Life of Pets 2
PokAmon: Detective Pikachu
Once Upon a Timeain Hollywo...
Shazam!
Aquaman
Knives Out
Dumbo
Maleficent: Mistress of Evil
.
.
Narcissister Organ Player
Chef Flynn
I am Not a Witch
Divide and Conquer: The Sto...
Senso
Never-Ending Man: Hayao Miy...
我正在尝试从该网站抓取电影片名:https://www.the-numbers.com/market/2019/top-grossing-movies
并且不断出现像“John Wick: Chapter 3 –”这样的断句。
这是图片:
这是代码:
url = "https://www.the-numbers.com/market/" + "2019" + "/top-grossing-movies"
raw = requests.get(url,
headers={'User-Agent':'Mozilla/5.0'})
html = BeautifulSoup(raw.text, "html.parser")
movie_list = html.select("#page_filling_chart table tr > td > b > a") #"#page_filling_chart > table > tbody > tr > td > b"
for i in range(len(movie_list)):
print(movie_list[i].text)
这些是输出:
Avengers: Endgame
The Lion King
Frozen II
Toy Story 4
Captain Marvel
Star Wars: The Rise of Skyw…
Spider-Man: Far From Home
Aladdin
Joker
Jumanji: The Next Level
It: Chapter Two
Us
Fast & Furious Presents: Ho…
John Wick: Chapter 3 — Para…
How to Train Your Dragon: T…
The Secret Life of Pets 2
Pokémon: Detective Pikachu
Once Upon a Time…in Hollywo…
我想知道为什么我总是收到这些断字以及如何解决这个问题!
由于此页面是 server-render,当标题损坏时,您可以单独请求这些页面。(另外不要忘记通过正则表达式获取标题,因为其页面标题包含出版物日期。)
试试下面的代码:
import requests
from bs4 import BeautifulSoup
url = "https://www.the-numbers.com/market/" + "2019" + "/top-grossing-movies"
raw = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
html = BeautifulSoup(raw.text, "html.parser")
movie_list = html.select("#page_filling_chart table tr > td > b > a") # "#page_filling_chart > table > tbody > tr > td > b"
for movie in movie_list:
raw = requests.get("https://www.the-numbers.com" + movie.get("href"), headers={'User-Agent': 'Mozilla/5.0'})
raw.encoding = 'utf-8'
html = BeautifulSoup(raw.text, "html.parser")
print(html.select_one("#main > div > h1").text)
这给了我:
Avengers: Endgame (2019)
The Lion King (2019)
Frozen II (2019)
Toy Story 4 (2019)
Captain Marvel (2019)
Star Wars: The Rise of Skywalker (2019)
Spider-Man: Far From Home (2019)
....
需要这样处理字符串,解决方法代码为:
import requests
from bs4 import BeautifulSoup
url = "https://www.the-numbers.com/market/" + "2019" + "/top-grossing-movies"
raw = requests.get(url,
headers={'User-Agent':'Mozilla/5.0'})
html = BeautifulSoup(raw.text, "lxml")
movie_list = html.select("#page_filling_chart table tr > td > b > a") #"#page_filling_chart > table > tbody > tr > td > b"
import unicodedata
for i in range(len(movie_list)):
movie_name = movie_list[i].text
print(unicodedata.normalize('NFKD', movie_name).encode('ascii', 'ignore').decode())
输出是这样的:
Avengers: Endgame
The Lion King
Frozen II
Toy Story 4
Captain Marvel
Star Wars: The Rise of Skyw...
Spider-Man: Far From Home
Aladdin
Joker
Jumanji: The Next Level
It: Chapter Two
Us
Fast & Furious Presents: Ho...
John Wick: Chapter 3 a Para...
How to Train Your Dragon: T...
The Secret Life of Pets 2
PokAmon: Detective Pikachu
Once Upon a Timeain Hollywo...
Shazam!
Aquaman
Knives Out
Dumbo
Maleficent: Mistress of Evil
.
.
Narcissister Organ Player
Chef Flynn
I am Not a Witch
Divide and Conquer: The Sto...
Senso
Never-Ending Man: Hayao Miy...