Parsing precise answers from HTML with redundant tags
I am looking at the bert-as-service FAQ.
I am interested in this HTML:
<h5>
<a id="user-content-q-how-do-you-get-the-fixed-representation-did-you-do-pooling-or-something" class="anchor" aria-hidden="true" href="#q-how-do-you-get-the-fixed-representation-did-you-do-pooling-or-something">
<svg class="octicon octicon-link" viewBox="0 0 16 16" version="1.1" width="16" height="16" aria-hidden="true">
<path fill-rule="evenodd" d="M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45">
</path>
</svg>
</a>
<strong>Q:</strong> How do you get the fixed representation? Did you do pooling or something?
</h5>
<p><strong>A:</strong> Yes, pooling is required to get a fixed representation of a sentence. In the default strategy <code>REDUCE_MEAN</code>, I take the second-to-last hidden layer of all of the tokens in the sentence and do average pooling.</p>
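I managed to retrieve the questions and the answers separately, but the way I select the answer tags is not robust. Here is my code for parsing this HTML: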
import requests
from bs4 import BeautifulSoup

wp = requests.get("https://github.com/hanxiao/bert-as-service")
soup = BeautifulSoup(wp.text, "html.parser")

# Parse the questions
results = soup.find_all("h5")
questions = []
for result in results:
    question = result.contents[2]
    questions.append(question)

# Parse the answers
new_tag = soup.find_all("p")
new_tag = new_tag[114:165]  # hard-coded range of the <p> tags that hold the answers
answers = []
for new in new_tag:
    answer = new.contents[1]
    answers.append(answer)
My approach for the answers is very poor, because <p> tags are very frequent on the page.
If you run
for i in results:
    print(i.text)
    print(i.find_next('p').text)  # first <p> that follows this <h5>
you get (one randomly picked Q/A pair):
Q: Can I use multilingual BERT model provided by Google?
A: Yes.
You can then append these to your lists and go from there.
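A minimal sketch of that appending step, assuming the page keeps the layout shown in the question (an `<h5>` question followed by a `<p>` answer). The helper name `extract_qa` and the inline `SAMPLE_HTML` are made up for illustration; the sample stands in for the downloaded page so the snippet runs without network access.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (not the real GitHub markup)
SAMPLE_HTML = """
<p>Intro paragraph that is not an answer.</p>
<h5><a class="anchor" href="#q1"></a><strong>Q:</strong> Can I use multilingual BERT model provided by Google?</h5>
<p><strong>A:</strong> Yes.</p>
<h5><a class="anchor" href="#q2"></a><strong>Q:</strong> Did you do pooling or something?</h5>
<p><strong>A:</strong> Yes, pooling is required.</p>
"""

def extract_qa(html):
    """Return (questions, answers), pairing each <h5> with the next <p>."""
    soup = BeautifulSoup(html, "html.parser")
    questions, answers = [], []
    for heading in soup.find_all("h5"):
        paragraph = heading.find_next("p")
        if paragraph is None:
            # No paragraph after this heading, so there is nothing to pair it with
            continue
        # Remove only the leading "Q:" / "A:" label, then trim whitespace
        questions.append(re.sub(r"^\s*Q:\s*", "", heading.get_text()).strip())
        answers.append(re.sub(r"^\s*A:\s*", "", paragraph.get_text()).strip())
    return questions, answers

questions, answers = extract_qa(SAMPLE_HTML)
for q, a in zip(questions, answers):
    print(q, "->", a)
```

Because each answer is looked up relative to its own heading, the intro paragraph and any other unrelated `<p>` tags are never touched, so no hard-coded slice is needed.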
You can also do the following:
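import re
import requests
from bs4 import BeautifulSoup

wp = requests.get("https://github.com/hanxiao/bert-as-service")
soup = BeautifulSoup(wp.text, "lxml")
# Remove only the leading "Q:" / "A:" label. str.lstrip('Q: ') strips any run of
# the characters 'Q', ':' and ' ', so it can also eat the start of the text.
titles = [re.sub(r"^\s*Q:\s*", "", item.text) for item in soup.select("h5")]
# 'h5 + p' matches only the <p> that immediately follows an <h5>
initial_paras = [re.sub(r"^\s*A:\s*", "", item.text) for item in soup.select("h5 + p")]
print(len(titles), len(initial_paras))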
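The length check at the end matters because `h5 + p` only matches a `<p>` that directly follows an `<h5>`, so the two lists can end up with different lengths. The sketch below illustrates this on a small hand-written snippet; `SAMPLE_HTML` is hypothetical and only there to show the selector's behaviour, not a copy of the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second question is answered with a list, not a <p>
SAMPLE_HTML = """
<h5><strong>Q:</strong> First question?</h5>
<p><strong>A:</strong> First answer.</p>
<p>Follow-up paragraph, not directly after an h5.</p>
<h5><strong>Q:</strong> Second question?</h5>
<ul><li>An answer given as a list, so there is no adjacent p.</li></ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
headings = soup.select("h5")
first_paras = soup.select("h5 + p")
print(len(headings), len(first_paras))  # the counts differ for this sample

# Pair each heading with its own adjacent <p>, or None when there is none,
# so a missing answer cannot shift the later pairs out of alignment.
pairs = []
for h in headings:
    sibling = h.find_next_sibling(True)  # True matches the next tag of any name
    answer = sibling.get_text() if sibling is not None and sibling.name == "p" else None
    pairs.append((h.get_text(), answer))
```

If the two counts match on the real page, zipping the two lists is enough; if they do not, pairing per heading as above keeps questions and answers aligned.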