Focusing in on specific results while scraping Twitter with Python and Beautiful Soup 4?
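This is a follow-up to my post.
I am not using the Twitter API because it does not look at tweets by hashtag this far back. The complete code and output are below, after the examples.
I want to scrape specific data from each tweet. name and handle are retrieving exactly what I am looking for, but I am having trouble narrowing down the rest of the elements.
As an example: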
link = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
url = link[0]
Retrieves this:
<a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/Mikepeeljourno/status/648787700980408320" title="2:13 AM - 29 Sep 2015">
<span class="_timestamp js-short-timestamp " data-aria-label-part="last" data-long-form="true" data-time="1443518016" data-time-ms="1443518016000">29 Sep 2015</span></a>
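For url, I only need the href value from the first line.
Similarly, the retweets and favorites commands return large chunks of html, when all I really need is the numerical value that is displayed for each one.
How can I narrow down the results to just the data required for the url, retweetcount and favcount outputs?
I am planning to have this loop through all the tweets once I get it working, in case that has an influence on your suggestions.
Complete Code: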
from bs4 import BeautifulSoup
import requests
import sys
url = 'https://twitter.com/search?q=%23bangkokbombing%20since%3A2015-08-10%20until%3A2015-09-30&src=typd&lang=en'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
r = requests.get(url, headers=headers)
data = r.text.encode('utf-8')
soup = BeautifulSoup(data, "html.parser")
name = soup('strong', {'class': 'fullname js-action-profile-name show-popup-with-id'})
username = name[0].contents[0]
handle = soup('span', {'class': 'username js-action-profile-name'})
userhandle = handle[0].contents[1].contents[0]
link = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
url = link[0]
messagetext = soup('p', {'class': 'TweetTextSize js-tweet-text tweet-text'})
message = messagetext[0]
retweets = soup('button', {'class': 'ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet'})
retweetcount = retweets[0]
favorites = soup('button', {'class': 'ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite'})
favcount = favorites[0]
print (username, "\n", "@", userhandle, "\n", "\n", url, "\n", "\n", message, "\n", "\n", retweetcount, "\n", "\n", favcount) #extra linebreaks for ease of reading
Complete Output:
Michael Peel
@Mikepeeljourno
<a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/Mikepeeljourno/status/648787700980408320" title="2:13 AM - 29 Sep 2015"><span class="_timestamp js-short-timestamp " data-aria-label-part="last" data-long-form="true" data-time="1443518016" data-time-ms="1443518016000">29 Sep 2015</span></a>
<p class="TweetTextSize js-tweet-text tweet-text" data-aria-label-part="0" lang="en"><a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/FT?src=hash"><s>#</s><b>FT</b></a> Case closed: <a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/Thailand?src=hash"><s>#</s><b>Thailand</b></a> police chief proclaims <a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/Bangkokbombing?src=hash"><s>#</s><b><strong>Bangkokbombing</strong></b></a> solved ahead of his retirement this week -even as questions over case grow</p>
<button class="ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet" data-modal="ProfileTweet-retweet" type="button">
<div class="IconContainer js-tooltip" title="Undo retweet">
<span class="Icon Icon--retweet"></span>
<span class="u-hiddenVisually">Retweeted</span>
</div>
<div class="IconTextContainer">
<span class="ProfileTweet-actionCount">
<span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">4</span>
</span>
</div>
</button>
<button class="ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite" type="button">
<div class="IconContainer js-tooltip" title="Undo like">
<div class="HeartAnimationContainer">
<div class="HeartAnimation"></div>
</div>
<span class="u-hiddenVisually">Liked</span>
</div>
<div class="IconTextContainer">
<span class="ProfileTweet-actionCount">
<span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">2</span>
</span>
</div>
</button>
It was suggested that "BeautifulSoup - extracting attribute values" may already answer this question. However, I think that question and its answers do not have sufficient context or explanation to be helpful in more complex situations. The link to the relevant part of the Beautiful Soup documentation is helpful though: http://www.crummy.com/software/BeautifulSoup/documentation.html#The%20attributes%20of%20Tags
Use dictionary-like access to a Tag's attributes.
For example, to get the href attribute value:
links = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
url = links[0]["href"]
Or, if you need the href value of every link found:
links = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
urls = [link["href"] for link in links]
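To make the dictionary-like access concrete, here is a minimal sketch that runs against the anchor markup from the question's output (the inner timestamp span is shortened). The absolute-URL step with urljoin is my own addition and assumes the links are relative to https://twitter.com, as in the question's search URL.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Anchor markup taken from the question's output; the inner span is shortened.
html = (
    '<a class="tweet-timestamp js-permalink js-nav js-tooltip" '
    'href="/Mikepeeljourno/status/648787700980408320" '
    'title="2:13 AM - 29 Sep 2015"><span>29 Sep 2015</span></a>'
)
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")

href = link["href"]                # raises KeyError if the attribute is absent
title = link.get("title")          # .get() behaves like dict.get()
missing = link.get("data-absent")  # returns None instead of raising

# Assumption: the href is relative to https://twitter.com.
full_url = urljoin("https://twitter.com", href)
print(full_url)
```

Prefer link.get("href") over link["href"] when you are not sure every matched tag carries the attribute.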
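As a side note, you don't need to specify the complete class value to locate the elements. class is a special multi-valued attribute, and you can use just one of the classes (if that is enough to narrow down the search to the desired elements). For example, instead of: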
soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
You can use:
soup('a', {'class': 'tweet-timestamp'})
Or, a CSS selector:
soup.select("a.tweet-timestamp")
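As a quick check of the multi-valued class behaviour, the sketch below runs the three lookups against two anchors copied from the question's output (trimmed) and confirms that each one finds only the timestamp link. It also shows that Beautiful Soup exposes class as a list of separate values.

```python
from bs4 import BeautifulSoup

# Two anchors from the question's output, trimmed to the relevant attributes.
html = (
    '<a class="tweet-timestamp js-permalink js-nav js-tooltip" '
    'href="/Mikepeeljourno/status/648787700980408320">29 Sep 2015</a>'
    '<a class="twitter-hashtag pretty-link js-nav" '
    'href="/hashtag/FT?src=hash">#FT</a>'
)
soup = BeautifulSoup(html, "html.parser")

# Exact string of the whole class attribute.
by_full = soup("a", {"class": "tweet-timestamp js-permalink js-nav js-tooltip"})
# A single class is enough, because class is multi-valued.
by_one = soup("a", {"class": "tweet-timestamp"})
# CSS selector form.
by_css = soup.select("a.tweet-timestamp")

# The attribute itself comes back as a list of class names.
classes = by_one[0]["class"]
print(classes)
```

Note that both anchors share js-nav, so searching on that class alone would match both; pick a class that is specific to the element you want.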
Alexe has already explained how to use the 'href' key to get the value.
So I'm going to answer the other part of your question:
Similarly, the retweets and favorites commands return large chunks of
html, when all I really need is the numerical value that is displayed
for each one.
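.contents returns a list of all the children. Since you are finding 'buttons', which have several children you are interested in, you can get the value from the parsed contents list as follows: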
retweetcount = retweets[0].contents[3].contents[1].contents[1].string
This will return the value 4.
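The indices 3, 1 and 1 may look arbitrary. With html.parser, the newlines between tags are kept as text nodes, so the tags sit at the odd positions. Here is a small sketch using the retweet button markup from the question's output to show where those numbers come from:

```python
from bs4 import BeautifulSoup, Tag

# Retweet button markup copied from the question's output.
html = """<button class="ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet" data-modal="ProfileTweet-retweet" type="button">
<div class="IconContainer js-tooltip" title="Undo retweet">
<span class="Icon Icon--retweet"></span>
<span class="u-hiddenVisually">Retweeted</span>
</div>
<div class="IconTextContainer">
<span class="ProfileTweet-actionCount">
<span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">4</span>
</span>
</div>
</button>"""
button = BeautifulSoup(html, "html.parser").button

# Label each direct child: the tag name for tags, "text" for whitespace strings.
kinds = [c.name if isinstance(c, Tag) else "text" for c in button.contents]
print(kinds)

# contents[3] is the second div, then each [1] steps past a leading newline.
retweetcount = button.contents[3].contents[1].contents[1].string
print(retweetcount)
```

Because the positions depend on whitespace in the markup, this chain breaks as soon as the page layout changes, which is why searching by class is usually the safer choice.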
If you want a more readable approach, try this:
retweetcount = retweets[0].find_all('span', class_='ProfileTweet-actionCountForPresentation')[0].string
favcount = favorites[0].find_all('span', {'class': 'ProfileTweet-actionCountForPresentation'})[0].string
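These return 4 and 2 respectively.
This works because we take the ResultSet returned by soup/find_all, get the tag element (using [0]), and then recursively search all of its descendants by calling find_all() on it again.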
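If you only ever want the first match, find() saves the [0] and returns None when nothing is found, rather than raising an IndexError. The helper below is a hypothetical sketch (count_from is my own name, not part of Beautiful Soup). It also converts the text to an int, and returns None when no purely numeric count is present.

```python
from bs4 import BeautifulSoup

# Retweet button from the question's output, trimmed to the count part.
html = """<button class="ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet" type="button">
<div class="IconTextContainer">
<span class="ProfileTweet-actionCount">
<span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">4</span>
</span>
</div>
</button>"""
soup = BeautifulSoup(html, "html.parser")


def count_from(button):
    """Hypothetical helper: return the displayed count as an int, or None."""
    span = button.find("span", class_="ProfileTweet-actionCountForPresentation")
    text = span.get_text(strip=True) if span is not None else ""
    # Only plain digits are converted; anything else is reported as None.
    return int(text) if text.isdigit() else None


retweet_button = soup.find("button", class_="js-actionRetweet")
retweetcount = count_from(retweet_button)
print(retweetcount)
```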
Now you can loop over each tweet and extract this information rather easily.
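A rough sketch of such a loop follows. The question does not show the element that wraps a single tweet, so the div with class "tweet" here is an assumption, and the second tweet in the sample markup is made up for illustration. Inspect the real page to find the actual container before relying on this.

```python
from bs4 import BeautifulSoup

# Assumption: each tweet sits in <div class="tweet">. The second tweet is made up.
html = """
<div class="tweet">
  <a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/Mikepeeljourno/status/648787700980408320">29 Sep 2015</a>
  <button class="ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet"><span class="ProfileTweet-actionCountForPresentation">4</span></button>
  <button class="ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite"><span class="ProfileTweet-actionCountForPresentation">2</span></button>
</div>
<div class="tweet">
  <a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/example_user/status/1">28 Sep 2015</a>
  <button class="ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet"><span class="ProfileTweet-actionCountForPresentation">10</span></button>
  <button class="ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite"><span class="ProfileTweet-actionCountForPresentation">3</span></button>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

COUNT = "span.ProfileTweet-actionCountForPresentation"
rows = []
for tweet in soup.select("div.tweet"):
    # Search inside this tweet only, so values from other tweets cannot leak in.
    link = tweet.select_one("a.tweet-timestamp")
    retweets = tweet.select_one("button.ProfileTweet-actionButtonUndo.js-actionRetweet " + COUNT)
    favorites = tweet.select_one("button.ProfileTweet-actionButtonUndo.js-actionFavorite " + COUNT)
    rows.append({
        "url": link["href"],
        "retweets": retweets.get_text(strip=True),
        "favorites": favorites.get_text(strip=True),
    })

for row in rows:
    print(row)
```

Searching from each tweet element instead of from soup is the important part: it keeps the url and the counts of one tweet together.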