Trying to extract some data from a webpage (scraping beginner)
I'm trying to extract some data from a webpage using Requests and then BeautifulSoup. I first fetch the HTML with Requests and then put it into BeautifulSoup:
from bs4 import BeautifulSoup
import requests
result = requests.get("https://XXXXX")
#print(result.status_code)
#print(result.headers)
src = result.content
soup = BeautifulSoup(src, 'lxml')
Then I picked out part of the markup:
tags = soup.findAll('ol',{'class':'activity-popup-users'})
print(tags)
This is part of what I got:
<div class="account js-actionable-user js-profile-popup-actionable " data-emojified-name="" data-feedback-token="" data-impression-id="" data-name="The UN Times" data-screen-name="TheUNTimes" data-user-id="3787869561">
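What I want now is to extract the value that follows data-user-id=, that is, the number between the quotes "". I would then like to put that data into some kind of spreadsheet.
I'm an absolute beginner, and I'm pasting together code I found in tutorials and documentation elsewhere.
Thank you very much for your time...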
Edit:
So this is what I tried:
from bs4 import BeautifulSoup
import requests
result = requests.get("https://XXXX")
src = result.content
soup = BeautifulSoup(src, 'html.parser')
tags = soup.findAll('ol',{'class':'activity-popup-users'})
print(tags['data-user-id'])
This is what I got:
TypeError: list indices must be integers or slices, not str
So I tried:
from bs4 import BeautifulSoup
import requests
result = requests.get("https://XXXX")
src = result.content
soup = BeautifulSoup(src, 'html.parser')
#tags = soup.findAll('a',{'class':'account-group js-user-profile-link'})
tags = soup.findAll('ol',{'class':'activity-popup-users'})
tags.attrs
#print(tags['data-user-id'])
And got:
File "C:\Users\XXXX\element.py", line 1884, in __getattr__
"ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'attrs'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
You can get the value of any attribute of a tag by treating the tag like a dictionary of attributes.
Read the BeautifulSoup documentation on attributes.
tag['data-user-id']
For example:
html="""
<div class="account js-actionable-user js-profile-popup-actionable " data-emojified-name="" data-feedback-token="" data-impression-id="" data-name="The UN Times" data-screen-name="TheUNTimes" data-user-id="3787869561">
"""
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'html.parser')
tag=soup.find('div')
print(tag['data-user-id'])
Output:
3787869561
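To connect this back to the two errors in the question: find_all() (and its older alias findAll()) returns a list-like ResultSet, so the attribute lookup has to happen on each tag inside it, not on the result itself. Below is a minimal sketch using an inline HTML snippet modeled on the markup in the question. The second div without a data-user-id and the helper name extract_user_ids are made up for illustration and are not part of the original page.

```python
from bs4 import BeautifulSoup

# Inline sample modeled on the question's markup; the second <div> is
# hypothetical and only there to show a tag without the attribute.
html = """
<ol class="activity-popup-users">
  <div class="account js-actionable-user" data-screen-name="TheUNTimes" data-user-id="3787869561"></div>
  <div class="account js-actionable-user" data-screen-name="NoId"></div>
</ol>
"""


def extract_user_ids(markup):
    """Return the data-user-id of every div.account that has one."""
    soup = BeautifulSoup(markup, "html.parser")
    ids = []
    # find_all() returns a ResultSet (a list of tags), so loop over it.
    for div in soup.find_all("div", class_="account"):
        # Tag.get() returns None instead of raising KeyError when the
        # attribute is missing.
        user_id = div.get("data-user-id")
        if user_id is not None:
            ids.append(user_id)
    return ids


user_ids = extract_user_ids(html)
print(user_ids)
```

Using div.get('data-user-id') instead of div['data-user-id'] is a design choice for robustness: the square-bracket form raises KeyError as soon as one matching tag lacks the attribute.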
Edited to cover the OP's change to the question:
from bs4 import BeautifulSoup
import requests
result = requests.get("http://twitter.com/RussiaUN/media")
src = result.content
soup = BeautifulSoup(src, 'html.parser')
divs = soup.find_all('div',class_='account')
# just print
for div in divs:
    print(div['data-user-id'])
# write to a file
with open('file.txt', 'w') as f:
    for div in divs:
        f.write(div['data-user-id'] + '\n')
Output:
255471924
2154112404
408696260
1267887043
475954041
3787869561
796979978
261711504
398068796
1174451010
...
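Since the question mentions getting the IDs into a spreadsheet, one option is to write a CSV file, which spreadsheet programs can open. This is only a sketch using Python's standard csv module. The function name write_ids_to_csv, the file name user_ids.csv, and the user_id column header are assumptions chosen for illustration, and the sample IDs are copied from the output above.

```python
import csv


def write_ids_to_csv(user_ids, path):
    """Write one user id per row under a 'user_id' header."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["user_id"])
        for user_id in user_ids:
            writer.writerow([user_id])


# Sample values taken from the output above; with the scraping code this
# would be something like [div["data-user-id"] for div in divs].
sample_ids = ["255471924", "2154112404", "408696260"]
write_ids_to_csv(sample_ids, "user_ids.csv")
```

The resulting file has a header row followed by one ID per line, so more columns (for example data-screen-name) could be added later by extending each row.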