Python urllib2 or requests post method
I have a rough idea of how to make a POST request with urllib2 (encoding the data and so on), but the problem is that all the online tutorials use completely useless made-up example URLs to show how it's done (someserver.com, coolsite.org, etc.), so I can't see the specific HTML that their example code corresponds to. Even python.org's own tutorial is completely useless in this respect.
I need to make a POST request to this URL:
https://patentscope.wipo.int/search/en/search.jsf
The relevant part of the code is this (I think):
<form id="simpleSearchSearchForm" name="simpleSearchSearchForm" method="post" action="/search/en/search.jsf" enctype="application/x-www-form-urlencoded" style="display:inline">
<input type="hidden" name="simpleSearchSearchForm" value="simpleSearchSearchForm" />
<div class="rf-p " id="simpleSearchSearchForm:sSearchPanel" style="text-align:left;z-index:-1;"><div class="rf-p-hdr " id="simpleSearchSearchForm:sSearchPanel_header">
Or possibly this:
<input id="simpleSearchSearchForm:fpSearch" type="text" name="simpleSearchSearchForm:fpSearch" class="formInput" dir="ltr" style="width: 400px; height: 15px; text-align: left; background-image: url("https://patentscope.wipo.int/search/org.richfaces.resources/javax.faces.resource/org.richfaces.staticResource/4.5.5.Final/PackedCompressed/classic/org.richfaces.images/inputBackgroundImage.png"); background-position: 1px 1px; background-repeat: no-repeat;">
If I want to encode JP2014084003 as the search term, what is the corresponding value in the HTML? The input id? The name? The value?
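My guess (completely unverified; I don't know whether the key should come from the id, the name, or something else) is that it would look something like this:
import requests

# Guess: the input's name attribute is the form key and the search term is the value
payload = {'simpleSearchSearchForm:fpSearch': 'JP2014084003'}
r = requests.post('https://patentscope.wipo.int/search/en/search.jsf', data=payload)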
Addendum: this answer does not answer my question, because it just repeats information I have already looked at on the Python docs pages.
UPDATE:
I found this and tried the code from it, specifically:
import requests
headers = {'User-Agent': 'Mozilla/5.0'}
payload = {'name': 'simpleSearchSearchForm:fpSearch', 'value': '2014084003'}
link = 'https://patentscope.wipo.int/search/en/search.jsf'
session = requests.Session()
resp = session.get(link, headers=headers)
cookies = requests.utils.cookiejar_from_dict(requests.utils.dict_from_cookiejar(session.cookies))
resp = session.post(link, headers=headers, data=payload, cookies=cookies)
r = session.get(link)
f = open('htmltext.txt', 'w')
f.write(r.content)
f.close()
I get a successful response (200), but the data is again just the data from the original page, so I don't know whether I'm posting to the form correctly and there is something else I need to do to get the return data from the search results page, or whether I'm still posting the wrong data.
Yes, I realize this uses requests rather than urllib2, but I just want to get the data.
This is not the most straightforward post to make; if you look in the developer tools or firebug you can see the form data from a successful browser post:
All of that is pretty straightforward, except that you see some : embedded in the keys, which may be a bit confusing; simpleSearchSearchForm:commandSimpleFPSearch is the key and Search is the value.
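In other words, the colon is just part of the field name and needs no special handling when the form is encoded. As a small illustration (using Python 3's urllib.parse, not anything specific to the site):
from urllib.parse import urlencode

# The ':' in the key is simply percent-encoded as %3A like any other character
print(urlencode({"simpleSearchSearchForm:commandSimpleFPSearch": "Search"}))
# prints: simpleSearchSearchForm%3AcommandSimpleFPSearch=Search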
The only thing that cannot be hard-coded is javax.faces.ViewState; we need to make a request to the site and then parse out that value, which we can do with BeautifulSoup:
import requests
from bs4 import BeautifulSoup

url = "https://patentscope.wipo.int/search/en/search.jsf"

data = {"simpleSearchSearchForm": "simpleSearchSearchForm",
        "simpleSearchSearchForm:j_idt341": "EN_ALLTXT",
        "simpleSearchSearchForm:fpSearch": "automata",
        "simpleSearchSearchForm:commandSimpleFPSearch": "Search",
        "simpleSearchSearchForm:j_idt406": "workaround"}

head = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}

with requests.Session() as s:
    # Get the cookies and the source to parse the ViewState token
    init = s.get(url)
    soup = BeautifulSoup(init.text, "lxml")
    val = soup.select_one("#j_id1:javax.faces.ViewState:0")["value"]
    # update post data dict
    data["javax.faces.ViewState"] = val
    r = s.post(url, data=data, headers=head)
    print(r.text)
If we run the code above:
In [13]: import requests

In [14]: from bs4 import BeautifulSoup

In [15]: url = "https://patentscope.wipo.int/search/en/search.jsf"

In [16]: data = {"simpleSearchSearchForm": "simpleSearchSearchForm",
   ....:         "simpleSearchSearchForm:j_idt341": "EN_ALLTXT",
   ....:         "simpleSearchSearchForm:fpSearch": "automata",
   ....:         "simpleSearchSearchForm:commandSimpleFPSearch": "Search",
   ....:         "simpleSearchSearchForm:j_idt406": "workaround"}

In [17]: head = {
   ....:     "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}

In [18]: with requests.Session() as s:
   ....:     init = s.get(url)
   ....:     soup = BeautifulSoup(init.text, "lxml")
   ....:     val = soup.select_one("#j_id1:javax.faces.ViewState:0")["value"]
   ....:     data["javax.faces.ViewState"] = val
   ....:     r = s.post(url, data=data, headers=head)
   ....:     print("\n".join([s.text.strip() for s in BeautifulSoup(r.text, "lxml").select("span.trans-section")]))
   ....:
Fuzzy genetic learning automata classifier
Fuzzy genetic learning automata classifier
FINITE AUTOMATA MANAGER
CELLULAR AUTOMATA MUSIC GENERATOR
CELLULAR AUTOMATA MUSIC GENERATOR
ANALOG LOGIC AUTOMATA
Incremental automata verification
Cellular automata music generator
Analog logic automata
Symbolic finite automata
You can see it matches the webpage. If you want to scrape sites, you need to get familiar with the developer tools/firebug etc., watch how the requests are made, and then try to mimic them. To open firebug, right-click on the page and select Inspect Element, click the Network tab and submit your request. You just need to select the request from the list and then select whichever tab you want information on, i.e. the params for the post request:
You may also find this useful for how to approach posting to a site.
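Since the question also mentions urllib2: the same flow works with only the standard library. Below is a rough sketch, not a tested recipe; urllib2 became urllib.request in Python 3, which is what is shown here, and it reuses the same field names as the requests version above.
import urllib.request
import urllib.parse
import http.cookiejar
from bs4 import BeautifulSoup

url = "https://patentscope.wipo.int/search/en/search.jsf"

# A cookie-aware opener plays the role of requests.Session here
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
opener.addheaders = [("User-Agent", "Mozilla/5.0")]

# First GET: pick up the session cookies and the page source with the ViewState token
soup = BeautifulSoup(opener.open(url).read(), "lxml")
# Assumption: the hidden input's name attribute is javax.faces.ViewState
# (that is the key the browser posts), so look it up by name rather than id
viewstate = soup.find("input", {"name": "javax.faces.ViewState"})["value"]

data = {"simpleSearchSearchForm": "simpleSearchSearchForm",
        "simpleSearchSearchForm:j_idt341": "EN_ALLTXT",
        "simpleSearchSearchForm:fpSearch": "automata",
        "simpleSearchSearchForm:commandSimpleFPSearch": "Search",
        "simpleSearchSearchForm:j_idt406": "workaround",
        "javax.faces.ViewState": viewstate}

# Passing a bytes body to opener.open() turns the request into a POST
resp = opener.open(url, urllib.parse.urlencode(data).encode("utf-8"))
print(resp.read().decode("utf-8", errors="replace"))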