How to scrape all topics from Twitter

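All topics on Twitter can be found at this link. I want to scrape all of them, including every subcategory.

BeautifulSoup does not seem to be useful here. I tried Selenium, but I don't know how to match the XPaths that appear after clicking a main category.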

from selenium import webdriver
from selenium.common import exceptions

url = 'https://twitter.com/i/flow/topics_selector'
driver = webdriver.Chrome('absolute path to chromedriver')
driver.get(url)
driver.maximize_window()

main_topics = driver.find_elements_by_xpath('/html/body/div[1]/div/div/div[1]/div[2]/div/div/div/div/div/div[2]/div[2]/div/div/div[2]/div[2]/div/div/div/div/span')

topics = {}
for main_topic in main_topics[2:]:
    print(main_topic.text.strip())
    topics[main_topic.text.strip()] = {}

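I know I can click a main category with main_topics[3].click(), but I don't know how to click through them recursively until I reach only the categories that have a Follow button on the right.

Take a look at how XPath works. You can write '//element[@attribute="foo"]' instead of spelling out the whole path. Note that the main topics and the subtopics (visible after clicking a main topic) have the same class names. That is what causes the error. Here is how I clicked the subtopics, though I'm sure there is a better way:

I located the topic elements with: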

topics = WebDriverWait(browser, 5).until(
        EC.presence_of_all_elements_located((By.XPATH, '//div[@class="css-901oao r-13gxpu9 r-1qd0xha r-1b6yd1w r-1vr29t4 r-ad9z0x r-bcqeeo r-qvutc0"]'))
    )

Then I created an empty list named main_topics:

main_topics = []

Then I looped over topics, appended each element's text to the main_topics list, and clicked each element to expand that main topic.

for topic in topics:
    main_topics.append(topic.text)
    topic.click()

Then I created a new variable, sub_topics, which now holds the elements of all the opened topics:

sub_topics = WebDriverWait(browser, 5).until(
        EC.presence_of_all_elements_located((By.XPATH, '//span[@class="css-901oao css-16my406 r-1qd0xha r-ad9z0x r-bcqeeo r-qvutc0"]'))
    )

Then I created two more lists: an empty subs_list, and skip_these_words, which holds the page text to ignore:

subs_list = []

skip_these_words = ["Done", "Follow your favorite Topics", "You’ll see top Tweets about them in your timeline. Don’t see your favorite Topics yet? New Topics are added every week.", "Follow"]

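Then I looped over sub_topics with an if statement that appends the element's text to subs_list only when it is in neither main_topics nor skip_these_words. I did this to filter out the main topics at the top and the unnecessary text, since all of these elements share the same class names. Finally, I click each subtopic. That last part is confusing, so here is the code: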

for sub in sub_topics:
    if sub.text not in main_topics and sub.text not in skip_these_words:
        subs_list.append(sub.text)
        sub.click()
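The filtering step in the loop above can be tried without a browser. This is a minimal sketch that works on plain strings rather than Selenium elements; the function name filter_subtopics and the sample texts are assumptions made for illustration, not part of the original answer.

```python
def filter_subtopics(texts, main_topics, skip_these_words):
    """Keep texts that are neither main topics nor boilerplate page text.

    Order is preserved and empty strings are dropped.
    """
    excluded = set(main_topics) | set(skip_these_words)
    kept = []
    for text in texts:
        text = text.strip()
        if text and text not in excluded:
            kept.append(text)
    return kept


# Hypothetical texts, standing in for [sub.text for sub in sub_topics].
sample_texts = ["Arts & culture", "Animation", "Follow", "Art", "Done", ""]
sample_main = ["Arts & culture"]
sample_skip = ["Done", "Follow"]

print(filter_subtopics(sample_texts, sample_main, sample_skip))
```

Separating the filter from the clicking makes it easier to check the text logic before pointing it at the live page.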

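There are also a few hidden sub-subtopics. See whether you can click through the remaining sub-subtopics, then find the Follow button elements and click them one by one.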
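One way to think about the recursive clicking is as a tree walk. The sketch below is only an outline under stated assumptions: get_children is a hypothetical callback that, in a real Selenium script, would click the element for a label and return the texts of the newly revealed entries. Here it is replaced by a fake in-memory page so the recursion itself can be run, and the nesting in fake_page is made up for the example.

```python
def collect_topic_tree(label, get_children):
    """Recursively map each child of `label` to its own subtree.

    A topic with no children (a leaf, i.e. one that only offers Follow)
    maps to an empty dict.
    """
    tree = {}
    for child in get_children(label):
        tree[child] = collect_topic_tree(child, get_children)
    return tree


# Fake hierarchy used instead of a browser; the nesting is invented.
fake_page = {
    "Music": ["Pop", "Rock"],
    "Rock": ["Example sub-subtopic"],
}


def fake_get_children(label):
    # A real implementation would click the label and read what appears.
    return fake_page.get(label, [])


topic_tree = {
    main: collect_topic_tree(main, fake_get_children)
    for main in ["Music", "Outdoors"]
}
print(topic_tree)
```

With Selenium, the callback would also need to wait for the new elements and re-locate them after each click, since previously found elements can go stale when the page re-renders.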

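To scrape all the main topics, e.g. Arts & culture, Business & finance, etc., using Selenium, you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following locator strategies:

  • Using XPATH and the text attribute: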

    driver.get("https://twitter.com/i/flow/topics_selector")
    print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[contains(., 'see top Tweets about them in your timeline')]//following::div[@role='button']/div/span")))])
    
  • Using XPATH and get_attribute():

    driver.get("https://twitter.com/i/flow/topics_selector")
    print([my_elem.get_attribute("textContent") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[contains(., 'see top Tweets about them in your timeline')]//following::div[@role='button']/div/span")))])
    
  • Console output:

    ['Arts & culture', 'Business & finance', 'Careers', 'Entertainment', 'Fashion & beauty', 'Food', 'Gaming', 'Lifestyle', 'Movies and TV', 'Music', 'News', 'Outdoors', 'Science', 'Sports', 'Technology', 'Travel']
    

To scrape all the main topics and subtopics using Selenium and Python, you can use the following locator strategy:

  • Using XPATH and get_attribute("textContent"):

    driver.get("https://twitter.com/i/flow/topics_selector")
    elements =  WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[contains(., 'see top Tweets about them in your timeline')]//following::div[@role='button']/div/span")))
    for element in elements:
        element.click()
    print([my_elem.get_attribute("textContent") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@role='button']/div/span[text()]")))])
    driver.quit()
    
  • Console output:

    ['Arts & culture', 'Animation', 'Art', 'Books', 'Dance', 'Horoscope', 'Theater', 'Writing', 'Business & finance', 'Business personalities', 'Business professions', 'Cryptocurrencies', 'Careers', 'Education', 'Fields of study', 'Entertainment', 'Celebrities', 'Comedy', 'Digital creators', 'Entertainment brands', 'Podcasts', 'Popular franchises', 'Theater', 'Fashion & beauty', 'Beauty', 'Fashion', 'Food', 'Cooking', 'Cuisines', 'Gaming', 'Esports', 'Game development', 'Gaming hardware', 'Gaming personalities', 'Tabletop gaming', 'Video games', 'Lifestyle', 'Animals', 'At home', 'Collectibles', 'Family', 'Fitness', 'Unexplained phenomena', 'Movies and TV', 'Movies', 'Television', 'Music', 'Alternative', 'Bollywood music', 'C-pop', 'Classical music', 'Country music', 'Dance music', 'Electronic music', 'Hip-hop & rap', 'J-pop', 'K-hip hop', 'K-pop', 'Metal', 'Musical instruments', 'Pop', 'R&B and soul', 'Radio stations', 'Reggae', 'Reggaeton', 'Rock', 'World music', 'News', 'COVID-19', 'Local news', 'Social movements', 'Outdoors', 'Science', 'Biology', 'Sports', 'American football', 'Australian rules football', 'Auto racing', 'Baseball', 'Basketball', 'Combat Sports', 'Cricket', 'Extreme sports', 'Fantasy sports', 'Football', 'Golf', 'Gymnastics', 'Hockey', 'Lacrosse', 'Pub sports', 'Rugby', 'Sports icons', 'Sports journalists & coaches', 'Tennis', 'Track & field', 'Water sports', 'Winter sports', 'Technology', 'Computer programming', 'Cryptocurrencies', 'Data science', 'Information security', 'Operating system', 'Tech brands', 'Tech personalities', 'Travel', 'Adventure travel', 'Destinations', 'Transportation']
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
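The second console output above is a flat list in which each main topic is followed by its subtopics. As a minimal sketch, assuming that ordering holds and that no subtopic has exactly the same name as a main topic, the list can be regrouped into the kind of dictionary the question was building. The helper name group_by_main_topic is hypothetical, and the sample input is a shortened excerpt of the output, not the full list.

```python
def group_by_main_topic(flat_names, main_topics):
    """Split a flat, ordered list into {main_topic: [subtopics]}.

    Every name found in main_topics starts a new group; all other names
    are appended to the most recently started group.
    """
    main_set = set(main_topics)
    grouped = {}
    current = None
    for name in flat_names:
        if name in main_set:
            current = name
            grouped[current] = []
        elif current is not None:
            grouped[current].append(name)
    return grouped


# Shortened excerpt of the console output, for illustration only.
flat_excerpt = ["Careers", "Education", "Fields of study",
                "Outdoors", "Science", "Biology"]
main_excerpt = ["Careers", "Outdoors", "Science"]

grouped_topics = group_by_main_topic(flat_excerpt, main_excerpt)
print(grouped_topics)
```

In practice, main_topics would come from the first snippet (main topics only) and flat_names from the second (main topics plus subtopics).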