How to get something from a webpage with Python

Question

There is a page at the URL https://www.example.com

<html>
<body>
<button id="button1" onclick="func1()"></button>
<button id="button2" onclick="func2()"></button>
</body>
<script>
function func1(){
  open("/doubt?s=AAAB_BCCCDD");
}

function func2(){
  open("/doubt?s=AABB_CCDDEE");
}
//something like that, it is working ....
</script>
</html>

AAAB_BCCCDD and AABB_CCDDEE are both tokens.

I want to get the first token on the page with Python. My Python code:

import requests

r = requests.get("https://www.example.com")
s = r.text

if "/doubt?s=" in s:
    # After this I don't understand what to do ...
    # I want to get the first token here as a variable
    pass

Please help me.

Answer 1

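Usually, after fetching the raw text of a page, you would first parse the HTML with a library such as BeautifulSoup. It builds a Document Object Model (DOM) tree that you can then query for the elements you need.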

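However, an HTML parser does not read or interpret JavaScript code, and your tokens only appear inside the script. For your problem, you can use regular expressions to extract the necessary information from the raw text.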
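To illustrate why parsing the HTML alone does not get you the token, here is a minimal sketch. It uses the standard library's `html.parser` as a stand-in for BeautifulSoup so that it runs without installing anything, and it works on a copy of the page from the question rather than a live request. `SAMPLE_HTML` and `ButtonCollector` are names made up for this example.

```python
from html.parser import HTMLParser

# Copy of the page from the question (attribute values quoted, buttons closed).
SAMPLE_HTML = """
<html>
<body>
<button id="button1" onclick="func1()"></button>
<button id="button2" onclick="func2()"></button>
</body>
<script>
function func1(){
  open("/doubt?s=AAAB_BCCCDD");
}

function func2(){
  open("/doubt?s=AABB_CCDDEE");
}
</script>
</html>
"""


class ButtonCollector(HTMLParser):
    """Hypothetical helper: records the attributes of every <button> tag."""

    def __init__(self):
        super().__init__()
        self.buttons = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "button":
            self.buttons.append(dict(attrs))


collector = ButtonCollector()
collector.feed(SAMPLE_HTML)
for button in collector.buttons:
    print(button["id"], button["onclick"])
# button1 func1()
# button2 func2()
```

The parser only sees that each button calls `func1()` or `func2()`. The tokens sit inside the script body, which the parser treats as plain text, so searching the raw text is the more direct route here.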

Example:

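import re
import requests

r = requests.get("https://www.example.com")
s = r.text

# Raw string so that \? and \w reach the regex engine unchanged.
pattern = re.compile(r'/doubt\?s=(?P<token>\w+)')
# With a single group, findall returns the list of captured tokens.
matches = pattern.findall(s)
if matches:
    token = matches[0]
    print(token)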
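If you only ever need the first token, `re.search` stops at the first match, so there is no need to build the full list. A small sketch of that variant, run on a hard-coded string instead of a live request so it can be tried offline; `first_token` and `sample` are names made up for this example:

```python
import re

# \w matches letters, digits and underscore, so it covers tokens like AAAB_BCCCDD.
TOKEN_PATTERN = re.compile(r"/doubt\?s=(?P<token>\w+)")


def first_token(html):
    """Return the first token found in html, or None if there is none."""
    match = TOKEN_PATTERN.search(html)
    return match.group("token") if match else None


sample = 'open("/doubt?s=AAAB_BCCCDD"); open("/doubt?s=AABB_CCDDEE");'
token = first_token(sample)
print(token)  # AAAB_BCCCDD
```

Returning `None` when nothing matches lets the caller decide what to do if the page layout changes. If the real tokens can contain characters other than letters, digits and underscore, the `\w+` part would need adjusting.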
