Append href links into a dataframe list, getting all required info, but only links from the last page appear
Using Beautiful Soup and pandas, I am trying to append all the links on a website to a list with the code below. I am able to scrape every page containing the relevant information in the table, and the code mostly seems to work. The problem is that only the links from the last page appear; the output is not what I expect. In the end I want a list of all 40 links from the 2 pages, appended next to the required information. There are 618 pages in total, but I am trying to scrape 2 pages first. Do you have any suggestions on how to adjust the code so that every link is appended to the table? Many thanks.
import pandas as pd
import requests
from bs4 import BeautifulSoup

hdr = {'User-Agent': 'Chrome/84.0.4147.135'}

dfs = []
for page_number in range(2):
    http = "http://example.com/&Page={}".format(page_number + 1)
    print('Downloading page %s...' % http)
    url = requests.get(http, headers=hdr)
    soup = BeautifulSoup(url.text, 'html.parser')
    table = soup.find('table')
    df_list = pd.read_html(url.text)
    df = pd.concat(df_list)
    dfs.append(df)

links = []
for tr in table.findAll("tr"):
    trs = tr.findAll("td")
    for each in trs:
        try:
            link = each.find('a')['href']
            links.append(link)
        except:
            pass

df['Link'] = links

final_df = pd.concat(dfs)
final_df.to_csv('myfile.csv', index=False, encoding='utf-8-sig')
This fits your logic. You only add the Link column to the last df because that code sits outside your loop. Collect the links inside the page loop, add them to the df, and then append the df to your dfs list:
import pandas as pd
import requests
from bs4 import BeautifulSoup

hdr = {'User-Agent': 'Chrome/84.0.4147.135'}

dfs = []
for page_number in range(2):
    http = "http://example.com/&Page={}".format(page_number + 1)
    print('Downloading page %s...' % http)
    url = requests.get(http, headers=hdr)
    soup = BeautifulSoup(url.text, 'html.parser')
    table = soup.find('table')
    df_list = pd.read_html(url.text)
    df = pd.concat(df_list)

    # Collect this page's links while still inside the page loop
    links = []
    for tr in table.findAll("tr"):
        trs = tr.findAll("td")
        for each in trs:
            try:
                link = each.find('a')['href']
                links.append(link)
            except:
                # cell has no anchor
                pass
    df['Link'] = links

    # Append the per-page df, now carrying its own Link column
    dfs.append(df)

final_df = pd.concat(dfs)
final_df.to_csv('myfile.csv', index=False, encoding='utf-8-sig')
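As a side note, newer versions of pandas can pull the hrefs out for you, which removes the manual BeautifulSoup loop entirely. Below is a minimal sketch assuming pandas >= 1.5 (where read_html gained the extract_links parameter) and assuming the anchors live in the table's first column; adjust the column choice to match the real page.

import pandas as pd
import requests

hdr = {'User-Agent': 'Chrome/84.0.4147.135'}

dfs = []
for page_number in range(2):
    http = "http://example.com/&Page={}".format(page_number + 1)
    resp = requests.get(http, headers=hdr)

    # extract_links="body" (pandas >= 1.5) turns each body cell into a
    # (text, href) tuple, with href=None for cells without an anchor
    df = pd.concat(pd.read_html(resp.text, extract_links="body"))

    # Assumption: the anchors sit in the first column; split its tuples
    # back into plain text plus a separate Link column
    first_col = df.columns[0]
    df['Link'] = df[first_col].str[1]
    df[first_col] = df[first_col].str[0]

    dfs.append(df)

final_df = pd.concat(dfs)
final_df.to_csv('myfile.csv', index=False, encoding='utf-8-sig')

This also keeps each row aligned with its own link, since the href comes from the same cell as the text, whereas the manual loop relies on len(links) happening to match the number of table rows.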