How to iterate over hrefs with selenium?
I've been trying to get all the hrefs from a news site's homepage. Eventually, I want to build something that gives me the n most-used words across all the news articles. To do that, I figure I first need the hrefs, and then click through them one by one.
With a lot of help from another user on this platform, this is the code I have so far:
from bs4 import BeautifulSoup
from selenium import webdriver
url = 'https://ad.nl'
# launch firefox with your url above
# note that you could change this to some other webdriver (e.g. Chrome)
driver = webdriver.Chrome()
driver.get(url)
# click the "accept cookies" button
btn = driver.find_element_by_name('action')
btn.click()
# grab the html. It'll wait here until the page is finished loading
html = driver.page_source
# parse the html soup
soup = BeautifulSoup(html.lower(), "html.parser")
articles = soup.findAll("article")
for i in articles:
    article = driver.find_element_by_class_name('ankeiler')
    hrefs = article.find_element_by_css_selector('a').get_attribute('href')
    print(hrefs)
driver.quit()
It gives me what I think is the first href, but it won't iterate to the next ones. It just gives me the first href as many times as it iterates. Does anyone know how I can make it move on to the next href instead of being stuck on the first one?
PS. If anyone has suggestions on how to take my little project further, feel free to share, as I still have a lot to learn about Python and programming in general.
To get all the hrefs in the articles, you can do the following:
hrefs = article.find_elements_by_xpath('//a')
# OR article.find_elements_by_css_selector('a')
for href in hrefs:
    print(href.get_attribute('href'))
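For a browser-free illustration of the same link-collection idea, here is a sketch using only the standard library's HTML parser, with inline HTML standing in for the live page (the markup below is made up for the example):

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect every href attribute found on <a> tags."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hrefs.append(value)

collector = HrefCollector()
collector.feed('<article class="ankeiler"><a href="https://ad.nl/a1">One</a></article>'
               '<article class="ankeiler"><a href="https://ad.nl/a2">Two</a></article>')
print(collector.hrefs)  # ['https://ad.nl/a1', 'https://ad.nl/a2']
```

The same loop-over-all-matches principle applies with Selenium: the plural `find_elements_*` calls return a list you can iterate, whereas the singular `find_element_*` calls always return just the first match.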
That said, to move the project forward, maybe the following will help:
hrefs = article.find_elements_by_xpath('//a')
links = [href.get_attribute("href") for href in hrefs]
for link in links:
    driver.get(link)
    # Add all words in the article to a dictionary, with the words as keys
    # and the number of times they occur as values
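The counting step those comments describe could be sketched like this (plain Python, no browser needed; the example text is just a stand-in for scraped article text):

```python
def count_words(text, counts=None):
    """Tally word occurrences from text into a dictionary."""
    if counts is None:
        counts = {}
    for word in text.lower().split():
        # get() with a default of 0 avoids a separate "is it new?" check
        counts[word] = counts.get(word, 0) + 1
    return counts

counts = count_words("the cat sat on the mat")
print(counts["the"])  # 2
```

Passing the same dictionary back in for each article accumulates counts across the whole site, which is what the most-used-words goal needs.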
Without using Beautiful Soup, how about this?
articles = driver.find_elements_by_css_selector('article')
for i in articles:
    href = i.find_element_by_css_selector('a').get_attribute('href')
    print(href)
To improve on my previous answer, I have written a full solution to your problem:
from selenium import webdriver

url = 'https://ad.nl'

# Set up the selenium driver
driver = webdriver.Chrome()
driver.get(url)

# Click the accept-cookies button
btn = driver.find_element_by_name('action')
btn.click()

# Get the links of all articles
article_elements = driver.find_elements_by_xpath('//a[@class="ankeiler__link"]')
links = [link.get_attribute('href') for link in article_elements]

# Create a dictionary for every word in the articles
words = dict()

# Iterate through every article
for link in links:
    # Get the article
    driver.get(link)

    # Get the elements that form the body of the article
    article_elements = driver.find_elements_by_xpath('//*[@class="article__paragraph"]')

    # Initialise an empty string
    article_text = ''

    # Add all the text from the elements to the one string
    for element in article_elements:
        article_text += element.text + " "

    # Convert all characters to lower case
    article_text = article_text.lower()

    # Remove all characters other than lowercase letters and spaces
    for char in article_text:
        if ord(char) > 122 or ord(char) < 97:
            if ord(char) != 32:
                article_text = article_text.replace(char, "")

    # Split the article into words
    for word in article_text.split(" "):
        # If the word is already in the dictionary, update the count
        if word in words:
            words[word] += 1
        # Otherwise make a new entry
        else:
            words[word] = 1

# Print the final dictionary (very large, so better to sort for the most
# occurring words and display the top 10)
# print(words)

# Sort words by most used
most_used = sorted(words.items(), key=lambda x: x[1], reverse=True)

# Print the 10 most used words
print("TOP 10 MOST USED: ")
for i in range(10):
    print(most_used[i])

driver.quit()
Works fine for me; let me know if you get any errors.
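As a possible refinement (my suggestion, not part of the solution above): the character-by-character `ord()` filter also strips digits and accented letters, and `collections.Counter` can replace the manual dictionary bookkeeping. A sketch:

```python
import re
from collections import Counter

def top_words(text, n=10):
    # Keep only runs of ASCII letters, lower-cased; widen the
    # pattern (e.g. to include accented letters) if needed
    words = re.findall(r"[a-z]+", text.lower())
    # Counter tallies occurrences; most_common sorts by count, descending
    return Counter(words).most_common(n)

print(top_words("Dogs chase cats. Cats chase mice.", 2))
```

`most_common(n)` does the sort-and-slice that the manual `sorted(...)[:10]` step performs, in one call.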