How to fetch a Polymer SPA webpage using Python on a headless server with no GUI
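I am trying to scrape the following URL:
https://docs-05-dot-polymer-project.appspot.com/0.5/articles/demos/spa/final.html
My goal is to get the page content (source) as a visitor sees it, that is, after all the JavaScript has been rendered.
To do this, I used the example described here: http://techstonia.com/scraping-with-phantomjs-and-python.html
That example works on my server. The challenge is getting it to work for the Polymer-based SPA site above as well, since that site is rendered entirely by JavaScript.
My code is as follows: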
import platform
from bs4 import BeautifulSoup
from selenium import webdriver

# PhantomJS files have different extensions
# under different operating systems
if platform.system() == 'Windows':
    PHANTOMJS_PATH = './phantomjs.exe'
else:
    PHANTOMJS_PATH = './phantomjs'

# here we'll use pseudo browser PhantomJS,
# but browser can be replaced with browser = webdriver.Firefox(),
# which is good for debugging.
browser = webdriver.PhantomJS(PHANTOMJS_PATH)
browser.get('https://docs-05-dot-polymer-project.appspot.com/0.5/articles/demos/spa/final.html')
# Parse the page source as it stands right after get() returns, then print it
soup = BeautifulSoup(browser.page_source, 'html.parser')
print(soup)
The problem is that this produces the following result:
<!DOCTYPE html>
<html><head>
<meta charset="utf-8">
<meta content="width=device-width, minimum-scale=1.0, initial-scale=1.0, user-scalable=yes" name="viewport">
<title>Single page app using Polymer</title>
<script async="" src="//www.google-analytics.com/analytics.js"></script><script src="/webcomponents.min.js"></script>
<!-- vulcanized version of imported elements --
see "elements.html" for unvulcanized list of imports. -->
<link href="vulcanized.html" rel="import">
<link href="styles.css" rel="stylesheet" shim-shadowdom="">
</link></link></meta></meta></head>
<body fullbleed="" unresolved="">
<template id="t" is="auto-binding">
<!-- Route controller. -->
<flatiron-director autohash="" route="{{route}}"></flatiron-director>
<!-- Keyboard nav controller. -->
<core-a11y-keys id="keys" keys="up down left right space space+shift" on-keys-pressed="{{keyHandler}}" target="{{parentElement}}"></core-a11y-keys>
<core-scaffold id="scaffold">
<nav>
<core-toolbar>
<span>Single Page Polymer</span>
</core-toolbar>
<core-menu on-core-select="{{menuItemSelected}}" selected="{{route}}" selectedmodel="{{selectedPage}}" valueattr="hash">
<template repeat="{{page, i in pages}}">
<paper-item hash="{{page.hash}}" noink="">
<core-icon icon="label{{route != page.hash ? '-outline' : ''}}"></core-icon>
<a href="#{{page.hash}}">{{page.name}}</a>
</paper-item>
</template>
</core-menu>
</nav>
<core-toolbar flex="" tool="">
<div flex="">{{selectedPage.page.name}}</div>
<core-icon-button icon="refresh"></core-icon-button>
<core-icon-button icon="add"></core-icon-button>
</core-toolbar>
<div center-center="" fit="" horizontal="" layout="">
<core-animated-pages id="pages" on-tap="{{cyclePages}}" selected="{{route}}" transitions="slide-from-right" valueattr="hash">
<template repeat="{{page, i in pages}}">
<section center-center="" hash="{{page.hash}}" layout="" vertical="">
<div>{{page.name}}</div>
</section>
</template>
</core-animated-pages>
</div>
</core-scaffold>
</template>
<script src="app.js"></script>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-43475701-2', 'auto'); // ebidel's
ga('create', 'UA-39334307-1', 'auto'); // pp.org
ga('send', 'pageview');
</script>
</body></html>
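As you can see, this is far from what you get when viewing the page in a browser.
My question: what am I doing wrong, and where should I look for a solution?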
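I think you are missing something from the Selenium WebDriver docs.
You can get the content of a dynamic page, but you have to make sure the element you are searching for exists and is visible on the page: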
# Note: webdriver.PhantomJS and find_element_by_xpath belong to Selenium 3 and earlier
from selenium import webdriver

browser = webdriver.PhantomJS()
browser.get('https://docs-05-dot-polymer-project.appspot.com/0.5/articles/demos/spa/final.html')
# Get the content of the first slide
res1 = browser.find_element_by_xpath('//*[@id="pages"]/section[1]/div')
# Save a screenshot so you can see why it is failing (if it is)
browser.save_screenshot('screen_test.png')
# Print the text within the div
print(res1.text)
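Because Polymer fills in the templates some time after the page loads, a lookup made immediately after `browser.get()` can fail or return empty text. Below is a minimal sketch of polling until the element exists, is displayed, and has text. The helper name `wait_for_visible_text` and its parameters are my own and not part of Selenium; it only assumes a driver object that offers `find_element_by_xpath()` as in the code above. Selenium also ships its own `WebDriverWait` together with `expected_conditions.visibility_of_element_located`, which is probably the better choice in real code.

```python
import time


def wait_for_visible_text(driver, xpath, timeout=10.0, poll=0.25,
                          ignored_exceptions=(Exception,)):
    """Poll until the element at `xpath` exists, is displayed and has text.

    `driver` is any object with a Selenium 3 style find_element_by_xpath().
    `ignored_exceptions` defaults to Exception to keep this sketch free of
    Selenium imports; with Selenium you would likely pass
    (NoSuchElementException,) instead.
    """
    deadline = time.monotonic() + timeout
    last_error = None
    while True:
        try:
            element = driver.find_element_by_xpath(xpath)
            text = element.text
            if element.is_displayed() and text.strip():
                return text
        except ignored_exceptions as exc:
            # Typically "element not found yet" while templates are unrendered
            last_error = exc
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "no visible text at {!r} after {}s".format(xpath, timeout)
            ) from last_error
        time.sleep(poll)
```

With the `browser` object from above, a call would look like `wait_for_visible_text(browser, '//*[@id="pages"]/section[1]/div')`. If it times out, the screenshot is still the quickest way to see what PhantomJS actually rendered.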
If you also need the text of the other slides, you need to click (using WebDriver) wherever makes the second slide visible, and only then read its text.
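As a rough sketch of that click-then-read step: the helper names `slide_xpaths` and `read_slide_text` are hypothetical, and the menu XPath is only my guess based on the `<core-menu>` / `<paper-item>` markup in the question's output, so check it against a screenshot. The fixed pause is a crude stand-in for a proper wait on the slide transition. In PhantomJS's small default window the menu may be hidden, in which case calling `browser.set_window_size(...)` with a larger size first may help.

```python
import time


def slide_xpaths(index):
    """Return (menu item XPath, slide text XPath) for a 1-based slide index.

    Both expressions are assumptions derived from the markup in the question.
    """
    if index < 1:
        raise ValueError("index is 1-based")
    menu_item = '//core-menu/paper-item[{}]'.format(index)
    slide_text = '//*[@id="pages"]/section[{}]/div'.format(index)
    return menu_item, slide_text


def read_slide_text(driver, index, settle_seconds=1.0):
    """Click the index-th menu entry, pause briefly, then read that slide."""
    menu_xpath, slide_xpath = slide_xpaths(index)
    driver.find_element_by_xpath(menu_xpath).click()
    # Crude pause so the slide transition can finish before reading
    time.sleep(settle_seconds)
    return driver.find_element_by_xpath(slide_xpath).text
```

Usage with the `browser` from above would be along the lines of `print(read_slide_text(browser, 2))`.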