for loop with lxml code used for scraping shows 'list index out of range' error but works for 2 instances
We are Python beginners.
We have a list of links/websites containing Donald Trump's utterances. Each link represents a whole interview, speech, etc. We now want to visit these sites, scrape them, and create a text file for each link. Currently our code does this for 2 or 3 of the links, but then only shows this error:
Traceback (most recent call last):
File "C:\Users\Lotte\AppData\Local\Programs\Python\Python37\Code\Corpus_create\Scrapen und alle inhalte laden und speichern - zusammengefügt.py", line 79, in <module>
Trump=(tree.xpath('//div[@class="media topic-media-row mediahover "]/div[3]/div/div[2]/a')[item2].text_content())
IndexError: list index out of range
We experimented with the index element, tried [0] and even left it out entirely. Nothing worked. We then tried running the code with only one link and without the first loop, and it worked fine:
import lxml
from lxml import html
from lxml.html import fromstring
import requests
import re
Linklist=['https://factba.se/transcript/donald-trump-remarks-briefing-room-border-security-january-3-2019', 'https://factba.se/transcript/donald-trump-remarks-cabinet-meeting-january-2-2019', 'https://factba.se/transcript/donald-trump-remarks-military-briefing-iraq-december-26-2018', 'https://factba.se/transcript/donald-trump-remarks-videoconference-troops-christmas-december-25-2018', 'https://factba.se/transcript/donald-trump-remarks-justice-reform-december-21-2018', 'https://factba.se/transcript/donald-trump-remarks-agriculture-bill-december-20-2018', 'https://factba.se/transcript/donald-trump-remarks-roundtable-school-safety-december-18-2018', 'https://factba.se/transcript/donald-trump-remarks-governors-elect-white-house-december-15-2018', 'https://factba.se/transcript/donald-trump-remarks-governors-elect-white-house-december-13-2018', 'https://factba.se/transcript/donald-trump-remarks-revitalization-council-executive-order-december-12-2018', 'https://factba.se/transcript/donald-trump-remarks-meeting-pelosi-schumer-december-11-2018', 'https://factba.se/transcript/donald-trump-remarks-bill-signing-genocide-december-11-2018', 'https://factba.se/transcript/donald-trump-remarks-chanukah-evening-reception-december-6-2018', 'https://factba.se/transcript/donald-trump-remarks-chanukah-afternoon-reception-december-6-2018', 'https://factba.se/transcript/donald-trump-remarks-bilat-china-xi-buenos-aires-december-1-2018', 'https://factba.se/transcript/donald-trump-remarks-bilat-germany-merkel-december-1-2018', 'https://factba.se/transcript/donald-trump-remarks-usmca-mexico-canada-buenos-aires-november-30-2018', 'https://factba.se/transcript/donald-trump-remarks-bilat-argentina-macri-november-30-2018', 'https://factba.se/transcript/donald-trump-remarks-bilat-morrison-australia-november-30-2018', 'https://factba.se/transcript/donald-trump-remarks-trilat-japan-india-abe-modi-november-30-2018']
for item in Linklist:
    headers= {"User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36"}
    page = requests.get(item, headers=headers)
    tree = html.fromstring(page.content)
    #loads everything trump said
    Text=[]
    for item2 in range(len(tree.xpath('//div[@class="media topic-media-row mediahover "]'))):
        Trump=(tree.xpath('//div[@class="media topic-media-row mediahover "]/div[3]/div/div[2]/a')[item2].text_content())
        Text.append(Trump)
    print(Text, '\n')
We only want Trump's utterances from each link.
Here is a modified version of your script.
code.py:
from lxml import html
import requests
import re
from pprint import pprint
url_list = [
    "https://factba.se/transcript/donald-trump-remarks-briefing-room-border-security-january-3-2019",
    "https://factba.se/transcript/donald-trump-remarks-cabinet-meeting-january-2-2019",
    "https://factba.se/transcript/donald-trump-remarks-military-briefing-iraq-december-26-2018",
    "https://factba.se/transcript/donald-trump-remarks-videoconference-troops-christmas-december-25-2018",
    "https://factba.se/transcript/donald-trump-remarks-justice-reform-december-21-2018",
    "https://factba.se/transcript/donald-trump-remarks-agriculture-bill-december-20-2018",
    "https://factba.se/transcript/donald-trump-remarks-roundtable-school-safety-december-18-2018",
    "https://factba.se/transcript/donald-trump-remarks-governors-elect-white-house-december-15-2018",
    "https://factba.se/transcript/donald-trump-remarks-governors-elect-white-house-december-13-2018",
    "https://factba.se/transcript/donald-trump-remarks-revitalization-council-executive-order-december-12-2018",
    "https://factba.se/transcript/donald-trump-remarks-meeting-pelosi-schumer-december-11-2018",
    "https://factba.se/transcript/donald-trump-remarks-bill-signing-genocide-december-11-2018",
    "https://factba.se/transcript/donald-trump-remarks-chanukah-evening-reception-december-6-2018",
    "https://factba.se/transcript/donald-trump-remarks-chanukah-afternoon-reception-december-6-2018",
    "https://factba.se/transcript/donald-trump-remarks-bilat-china-xi-buenos-aires-december-1-2018",
    "https://factba.se/transcript/donald-trump-remarks-bilat-germany-merkel-december-1-2018",
    "https://factba.se/transcript/donald-trump-remarks-usmca-mexico-canada-buenos-aires-november-30-2018",
    "https://factba.se/transcript/donald-trump-remarks-bilat-argentina-macri-november-30-2018",
    "https://factba.se/transcript/donald-trump-remarks-bilat-morrison-australia-november-30-2018",
    "https://factba.se/transcript/donald-trump-remarks-trilat-japan-india-abe-modi-november-30-2018"
]

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36"
}

media_row_xpath_marker = '//div[@class="media topic-media-row mediahover "]'
normal_xpath_marker = media_row_xpath_marker + "/div[3]/div/div[2]/a"
movieless_xpath_marker = media_row_xpath_marker + "/div[3]/div/div/a"

xpath_markers = [
    normal_xpath_marker,
    movieless_xpath_marker,
]

for url_index, url in enumerate(url_list):
    page = requests.get(url, headers=headers)
    tree = html.fromstring(page.content)
    lines = []
    media_row_list = tree.xpath(media_row_xpath_marker)
    if media_row_list:
        for xpath_marker in xpath_markers:
            post_list = tree.xpath(xpath_marker)
            if post_list:
                lines = [item.text_content() for item in post_list]
                break
    #pprint(lines)
    print("URL index: {0:02d} - Article count: {1:03d}".format(url_index, len(lines)))
Notes:
The problem is that the 3rd URL is slightly different from the others: if you look at it, it has no YouTube video, so the XPath has no match. Combined with the missing empty-list check, that produced the exception above. Now, 2 patterns are tried (a small reusable sketch of this fallback logic follows after these notes):
- movieless_xpath_marker - this works for the "faulty" page
- normal_xpath_marker - this works for the rest (and is the 1st one tried)
When one pattern yields some results, the remaining ones (if any) are simply ignored.
- I also refactored the code:
  - got rid of one loop (and of doing useless work several times)
  - renamed variables
  - extracted constants
  - code style
  - other minor changes
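The pattern-fallback logic can also be factored into a small helper so the main loop stays short. This is only a sketch of the same approach, reusing the xpath_markers list defined in the script above; the function name extract_lines is illustrative, not part of the original code:

def extract_lines(tree, xpath_markers):
    # Return the text of the first XPath pattern that matches anything,
    # or an empty list if none of the patterns match.
    for xpath_marker in xpath_markers:
        post_list = tree.xpath(xpath_marker)
        if post_list:
            return [item.text_content() for item in post_list]
    return []

# Usage inside the main loop, replacing the inner for/if block:
#     lines = extract_lines(tree, xpath_markers)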
Output (showing the article count for each URL):
(py_064_03.06.08_test0) e:\Work\Dev\Whosebug\q054043232>"e:\Work\Dev\VEnvs\py_064_03.06.08_test0\Scripts\python.exe" code.py
URL index: 00 - Article count: 018
URL index: 01 - Article count: 207
URL index: 02 - Article count: 063
URL index: 03 - Article count: 068
URL index: 04 - Article count: 080
URL index: 05 - Article count: 051
URL index: 06 - Article count: 045
URL index: 07 - Article count: 014
URL index: 08 - Article count: 036
URL index: 09 - Article count: 022
URL index: 10 - Article count: 105
URL index: 11 - Article count: 020
URL index: 12 - Article count: 025
URL index: 13 - Article count: 028
URL index: 14 - Article count: 010
URL index: 15 - Article count: 012
URL index: 16 - Article count: 015
URL index: 17 - Article count: 005
URL index: 18 - Article count: 005
URL index: 19 - Article count: 006
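Since the original goal was to create one text file per link, the collected lines can also be written to disk inside the loop instead of only printing the count. A minimal sketch, assuming the last path segment of each URL is an acceptable file name; the transcripts output folder and the UTF-8 encoding are assumptions, not part of the original script:

import os

def save_transcript(url, lines, output_dir="transcripts"):
    # Write one transcript's utterances to <output_dir>/<last-url-segment>.txt (assumed naming scheme).
    os.makedirs(output_dir, exist_ok=True)
    file_name = url.rstrip("/").rsplit("/", 1)[-1] + ".txt"
    with open(os.path.join(output_dir, file_name), "w", encoding="utf-8") as f:
        f.write("\n".join(lines))

# Usage inside the main loop, after lines has been built:
#     save_transcript(url, lines)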