Scrape image metadata from Facebook public posts
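This is a follow-up to my earlier question about getting some data from Facebook public posts. This time I am trying to collect image metadata (the image URL). Link posts work fine, but some posts return empty data. I used the same approach suggested in the answer to that question, but it does not work for the example below. Any suggestions would be appreciated!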
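import requests
from bs4 import BeautifulSoup

link = "https://www.facebook.com/228735667216/posts/10151653129902217"
res = requests.get(link, headers={'User-Agent': 'Mozilla/5.0'})
# Strip the HTML comment markers so that markup wrapped in comments gets parsed
comment = res.text.replace("-->", "").replace("<!--", "")
soup = BeautifulSoup(comment, "lxml")
# find() returns None when no element matches, so the next lookup fails for such posts
image = soup.find("div", class_="uiScaledImageContainer _517g")
img = image.find("img", class_="scaledImageFitWidth img")
href = img["src"]
print(href)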
Logging in with requests is not easy, so I deliberately skipped that library. You can try doing this with selenium alone, or with selenium combined with BeautifulSoup.
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

url = "https://www.facebook.com/228735667216/posts/10156284868312217"

chrome_options = webdriver.ChromeOptions()
# This is how you can make the browser headless
chrome_options.add_argument("--headless")
# The following line controls the notification popping up right after login
prefs = {"profile.default_content_setting_values.notifications": 2}
chrome_options.add_experimental_option("prefs", prefs)

# Selenium 4 syntax: options= replaces chrome_options=, and
# find_element(By.ID, ...) replaces find_element_by_id(...)
driver = webdriver.Chrome(options=chrome_options)
driver.get(url)
# Log in with your own credentials
driver.find_element(By.ID, "email").send_keys("your_username")
driver.find_element(By.ID, "pass").send_keys("your_password", Keys.RETURN)
# Reload the post now that the session is logged in
driver.get(url)

soup = BeautifulSoup(driver.page_source, "lxml")
for img in soup.find_all(class_="scaledImageFitWidth"):
    print(img.get("src"))
driver.quit()
Output (partial):
https://external.fdac17-1.fna.fbcdn.net/safe_image.php?d=AQBjBuP0TBYabtnO&w=540&h=282&url=https%3A%2F%2Fs3.amazonaws.com%2Fprod-cust-photo-posts-jfaikqealaka%2F3065-6e4c325b07b921fdefed4dd727881f8d.jpg&cfs=1&upscale=1&fallback=news_d_placeholder_publisher&_nc_hash=AQCVKXMSqvNiHZik
https://external.fdac17-1.fna.fbcdn.net/safe_image.php?d=AQCJ6RFOF4dY2xTn&w=100&h=100&url=https%3A%2F%2Fcdn.images.express.co.uk%2Fimg%2Fdynamic%2F106%2F750x445%2F1046936.jpg&cfs=1&upscale=1&fallback=news_d_placeholder_publisher_square&_nc_hash=AQAyFxRaZTGV47Se
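The URLs printed above point at Facebook's safe_image.php proxy and carry the source image address percent-encoded in their url query parameter, along with w and h values. If the address of the underlying image is what you are after, a minimal sketch using only the standard library could look like the following. The helper name safe_image_info is hypothetical, and it assumes the parameter names stay as they appear in the output above, with w and h being the width and height.

```python
from urllib.parse import urlparse, parse_qs


def safe_image_info(safe_image_url):
    """Pull the source image URL and size out of a safe_image.php link.

    Hypothetical helper: it assumes the query parameters are named
    `url`, `w` and `h`, as in the sample output. Missing values come
    back as None.
    """
    # parse_qs splits the query string and percent-decodes each value
    query = parse_qs(urlparse(safe_image_url).query)

    def first(name):
        values = query.get(name)
        return values[0] if values else None

    width, height = first("w"), first("h")
    return {
        "source": first("url"),
        "width": int(width) if width and width.isdigit() else None,
        "height": int(height) if height and height.isdigit() else None,
    }


example = (
    "https://external.fdac17-1.fna.fbcdn.net/safe_image.php"
    "?d=AQCJ6RFOF4dY2xTn&w=100&h=100"
    "&url=https%3A%2F%2Fcdn.images.express.co.uk%2Fimg%2Fdynamic%2F106%2F750x445%2F1046936.jpg"
    "&cfs=1&upscale=1"
)
info = safe_image_info(example)
print(info["source"])
```

In the scraping loop you could call safe_image_info(img.get("src")) instead of printing the raw src. For an image that is not served through the proxy, the "source" entry is None and the original src is the one to keep.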