Focusing in on specific results while scraping Twitter with Python and Beautiful Soup 4?

This is a follow-up to my previous post.

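I'm not using the Twitter API because it doesn't look at tweets by hashtag this far back. The complete code and output are below, after the examples.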

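I want to scrape specific data from each tweet. name and handle are retrieving exactly what I'm looking for, but I'm having trouble narrowing down the rest of the elements.

As an example: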

 link = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
 url = link[0]

Retrieves this:

 <a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/Mikepeeljourno/status/648787700980408320" title="2:13 AM - 29 Sep 2015">
 <span class="_timestamp js-short-timestamp " data-aria-label-part="last" data-long-form="true" data-time="1443518016" data-time-ms="1443518016000">29 Sep 2015</span></a>

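For url, I only need the href value from the first line.

Similarly, the retweets and favorites commands return large chunks of HTML, when all I really need is the numerical value that is displayed for each one.

How can I narrow down the results to just the required data for the url, retweet count and favorite count outputs?

I am planning to have this loop through all the tweets once I get it working, in case that has an influence on your suggestions.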

Complete code:

 from bs4 import BeautifulSoup
 import requests
 import sys

 url = 'https://twitter.com/search?q=%23bangkokbombing%20since%3A2015-08-10%20until%3A2015-09-30&src=typd&lang=en'
 headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
 r = requests.get(url, headers=headers)
 data = r.text.encode('utf-8')
 soup = BeautifulSoup(data, "html.parser")

 name = soup('strong', {'class': 'fullname js-action-profile-name show-popup-with-id'})
 username = name[0].contents[0]

 handle = soup('span', {'class': 'username js-action-profile-name'})
 userhandle = handle[0].contents[1].contents[0]

 link = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
 url = link[0]

 messagetext = soup('p', {'class': 'TweetTextSize  js-tweet-text tweet-text'})
 message = messagetext[0]

 retweets = soup('button', {'class': 'ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet'})
 retweetcount = retweets[0]

 favorites = soup('button', {'class': 'ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite'})
 favcount = favorites[0]

 print (username, "\n", "@", userhandle, "\n", "\n", url, "\n", "\n", message, "\n", "\n", retweetcount, "\n", "\n", favcount) #extra linebreaks for ease of reading

Complete output:

Michael Peel

@Mikepeeljourno

<a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/Mikepeeljourno/status/648787700980408320" title="2:13 AM - 29 Sep 2015"><span class="_timestamp js-short-timestamp " data-aria-label-part="last" data-long-form="true" data-time="1443518016" data-time-ms="1443518016000">29 Sep 2015</span></a>

<p class="TweetTextSize js-tweet-text tweet-text" data-aria-label-part="0" lang="en"><a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/FT?src=hash"><s>#</s><b>FT</b></a> Case closed: <a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/Thailand?src=hash"><s>#</s><b>Thailand</b></a> police chief proclaims <a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/Bangkokbombing?src=hash"><s>#</s><b><strong>Bangkokbombing</strong></b></a> solved ahead of his retirement this week -even as questions over case grow</p>

<button class="ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet" data-modal="ProfileTweet-retweet" type="button">
<div class="IconContainer js-tooltip" title="Undo retweet">
<span class="Icon Icon--retweet"></span>
<span class="u-hiddenVisually">Retweeted</span>
</div>
<div class="IconTextContainer">
<span class="ProfileTweet-actionCount">
<span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">4</span>
</span>
</div>
</button>

<button class="ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite" type="button">
<div class="IconContainer js-tooltip" title="Undo like">
<div class="HeartAnimationContainer">
<div class="HeartAnimation"></div>
</div>
<span class="u-hiddenVisually">Liked</span>
</div>
<div class="IconTextContainer">
<span class="ProfileTweet-actionCount">
<span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">2</span>
</span>
</div>
</button>

It was suggested that "BeautifulSoup - extracting attribute values" may have an answer to this question. However, I think that question and its answers do not have sufficient context or explanation to be helpful in more complex situations. The link to the relevant part of the Beautiful Soup documentation is helpful though: http://www.crummy.com/software/BeautifulSoup/documentation.html#The%20attributes%20of%20Tags

Use dictionary-like access to the Tag's attributes.

For example, to get the href attribute value:

links = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
url = links[0]["href"]

Or, if you need the href value of every link found:

links = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
urls = [link["href"] for link in links]
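As a self-contained sketch of the dictionary-like access described above, the snippet below parses the anchor markup from the question's output. The markup is inlined as a string instead of being fetched from Twitter, and the base URL passed to urljoin is an assumption made here so the relative href can be turned into a full link.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Markup copied from the question's output and inlined so the sketch runs offline.
html = '''
<a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/Mikepeeljourno/status/648787700980408320" title="2:13 AM - 29 Sep 2015">
<span class="_timestamp js-short-timestamp " data-time="1443518016">29 Sep 2015</span></a>
'''

soup = BeautifulSoup(html, "html.parser")
links = soup('a', {'class': 'tweet-timestamp'})

# Square brackets raise KeyError when the attribute is absent.
href = links[0]["href"]

# .get() returns None instead of raising, which is safer inside a loop.
title = links[0].get("title")
missing = links[0].get("data-not-present")  # hypothetical attribute name

# Assumed base URL, used only to resolve the relative href.
full_url = urljoin("https://twitter.com", href)

print(href)
print(title)
print(full_url)
```

Using .get() may be preferable when some anchors might lack the attribute, since a single missing href would otherwise stop the whole loop.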

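As a side note, you don't need to specify the complete class value to locate elements. class is a special multi-valued attribute, and you can use just one of the classes (if this is enough to narrow down the search for the desired elements). For example, instead of: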

soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})

You can use:

soup('a', {'class': 'tweet-timestamp'})

Or, a CSS selector:

soup.select("a.tweet-timestamp")
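To make the comparison concrete, here is a small sketch that runs the three lookups above against the same inlined markup. The second anchor (class "other js-nav") is a made-up element added only to show that a shared class such as js-nav matches more than one tag.

```python
from bs4 import BeautifulSoup

# First anchor mirrors the question's markup; the second is a hypothetical
# extra element that shares only the js-nav class.
html = (
    '<a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/a/status/1">x</a>'
    '<a class="other js-nav" href="/b">y</a>'
)
soup = BeautifulSoup(html, "html.parser")

# Exact string match on the whole class attribute (sensitive to class order).
by_full = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})

# Match on a single class, via the attrs dict or the class_ keyword.
by_one = soup('a', {'class': 'tweet-timestamp'})
by_kw = soup.find_all('a', class_='tweet-timestamp')

# CSS selectors: one class, or several classes in any order.
by_css = soup.select('a.tweet-timestamp')
by_two = soup.select('a.js-permalink.tweet-timestamp')

# A class shared by both anchors matches both.
js_nav = soup.select('a.js-nav')

print(len(by_full), len(by_one), len(by_kw), len(by_css), len(by_two), len(js_nav))
```

The single-class and CSS forms keep working if the site reorders or adds classes on the element, whereas the full-string form depends on the attribute staying exactly as written.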

Alexe already explained using the 'href' key to get the value.

So I'm going to answer the other part of your question:

Similarly, the retweets and favorites commands return large chunks of html, when all I really need is the numerical value that is displayed for each one.

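.contents returns a list of all the children. Since you are finding 'buttons' that contain several children you are interested in, you can reach them by indexing into the parsed contents lists like this: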

retweetcount = retweets[0].contents[3].contents[1].contents[1].string

This will return the value 4.
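The odd-numbered indices in that chain come from the whitespace between tags: with html.parser, each newline becomes a text child of its own. The sketch below, run on the retweet button markup from the question's output, lists the children to show this. The child_kinds variable is just an illustrative name.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Retweet button markup copied from the question's output.
html = '''<button class="ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet" data-modal="ProfileTweet-retweet" type="button">
<div class="IconContainer js-tooltip" title="Undo retweet">
<span class="Icon Icon--retweet"></span>
<span class="u-hiddenVisually">Retweeted</span>
</div>
<div class="IconTextContainer">
<span class="ProfileTweet-actionCount">
<span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">4</span>
</span>
</div>
</button>'''

soup = BeautifulSoup(html, "html.parser")
button = soup.find('button')

# Label each direct child: tag name for tags, 'text' for whitespace strings.
child_kinds = [c.name if isinstance(c, Tag) else 'text' for c in button.contents]
print(child_kinds)

# The same index chain as above: second div, then its span, then the inner span.
by_index = button.contents[3].contents[1].contents[1].string
print(by_index)
```

Because the indices depend on the exact whitespace and element order, this chain is likely to break if the markup changes even slightly.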

If you want a more readable approach, try this:

retweetcount = retweets[0].find_all('span', class_='ProfileTweet-actionCountForPresentation')[0].string

favcount = favorites[0].find_all('span', {'class': 'ProfileTweet-actionCountForPresentation'})[0].string

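These return 4 and 2 respectively. This works because we take the result set returned by soup/find_all, get a Tag element from it (using [0]), and then call find_all() on that tag to search all of its descendants recursively.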

Now you can loop through each tweet and extract this information fairly easily.
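A minimal sketch of such a loop follows. It runs on synthetic markup, and two things in it are assumptions rather than facts about Twitter's page: the per-tweet wrapper div class="tweet", and the second tweet's values. The helper count_in is a hypothetical name. The inner class names are taken from the question's output.

```python
from bs4 import BeautifulSoup

# Synthetic template: the <div class="tweet"> wrapper is an assumption;
# the anchor, button and span classes come from the question's output.
TWEET = '''
<div class="tweet">
<a class="tweet-timestamp js-permalink js-nav js-tooltip" href="{href}">ts</a>
<button class="ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet" type="button">
<span class="ProfileTweet-actionCountForPresentation">{rt}</span>
</button>
<button class="ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite" type="button">
<span class="ProfileTweet-actionCountForPresentation">{fav}</span>
</button>
</div>
'''
html = (
    TWEET.format(href="/Mikepeeljourno/status/648787700980408320", rt="4", fav="2")
    + TWEET.format(href="/example/status/1", rt="", fav="7")  # made-up second tweet
)


def count_in(tweet, button_class):
    """Return the count shown in the matching button, or 0 if blank or absent."""
    span = tweet.select_one(
        'button.%s span.ProfileTweet-actionCountForPresentation' % button_class
    )
    text = span.get_text(strip=True) if span else ""
    return int(text) if text.isdigit() else 0


soup = BeautifulSoup(html, "html.parser")
rows = []
for tweet in soup.select("div.tweet"):
    # Search inside each tweet so the values stay paired with that tweet.
    link = tweet.select_one("a.tweet-timestamp")
    rows.append({
        "url": link["href"] if link else None,
        "retweets": count_in(tweet, "js-actionRetweet"),
        "favorites": count_in(tweet, "js-actionFavorite"),
    })

for row in rows:
    print(row)
```

Searching within each tweet element, rather than indexing into page-wide result lists, keeps the url and counts aligned even when one tweet lacks a field. Note that this helper treats any non-digit text (for example an abbreviated count) as 0, which would need adjusting for real data.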