How to get part of a script's text with Selenium WebDriver in Python

I want to use Selenium WebDriver to capture one piece of information from a page. The value I want to scrape is:

7197409

Below is the page's HTML. I want to grab "7197409":

<script type="text/javascript">
  var messageid = 7197409;
  var highlight_id = -1;
  var authorOnly = "N";
  var ftype = 'MB';
  var adsenseFront = '<table width="99%" cellspacing="0" cellpadding="0" style="background-color: #000000; margin-left: auto; margin-right: auto;"><tr><td style="width: 100%; background-color: #F7F3F7;">';
  var adsenseEnd = '</td></tr></table>';
  var Submitted = false;
  var subject = true;
  var HiddenThreads = new Array(26); //Temp variable to save the threads temporary
  var blocked_list = Sys.Serialization.JavaScriptSerializer.deserialize('[]');
  var currentUser = undefined;
  var followList = [];
  var lock = false;
</script>

I checked, and the full XPath of the script's text is /html/body/form/div[5]/div/div/div[2]/div[1]/script/text()

I ran the following code:

from datetime import date,datetime
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import NoSuchElementException
import numpy as np
import xlrd
import csv
import codecs
import time

url = "https://forumd.hkgolden.com/view.aspx?type=MB&message=7197409"
driver_blank=webdriver.Chrome('./chromedriver')
driver_blank.get(url)
id=driver_blank.find_element_by_xpath("/html/body/form/div[5]/div/div/div[2]/div[1]/script/text()")
print("ID:"+id.text)

driver_blank.close()

However, I get the following error message. It says: The result of the xpath expression "/html/body/form/div[5]/div/div/div[2]/div[1]/script/text()" is: [object Text]. It should be an element.

DevTools listening on ws://127.0.0.1:50519/devtools/browser/845d0800-1dd9-4f8a-a847-7d955c8cc5e3
libpng warning: iCCP: cHRM chunk does not match sRGB
[16136:16764:0411/213956.920:ERROR:ssl_client_socket_impl.cc(941)] handshake failed; returned -1, SSL error code 1, net_error -107
[16136:16764:0411/213957.351:ERROR:ssl_client_socket_impl.cc(941)] handshake failed; returned -1, SSL error code 1, net_error -107
Traceback (most recent call last):
  File ".\test.py", line 28, in <module>
    id=driver_blank.find_element_by_xpath("/html/body/form/div[5]/div/div/div[2]/div[1]/script/text()")
  File "C:\Program Files\Python37\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 394, in find_element_by_xpath
    return self.find_element(by=By.XPATH, value=xpath)
  File "C:\Program Files\Python37\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 978, in find_element
    'value': value})['value']
  File "C:\Program Files\Python37\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "C:\Program Files\Python37\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.InvalidSelectorException: Message: invalid selector: The result of the xpath expression "/html/body/form/div[5]/div/div/div[2]/div[1]/script/text()" is: [object Text]. It should be an element.
  (Session info: chrome=80.0.3987.132)

I have two questions:

  1. How do I fix the error?

  2. How do I get only the text 7197409 from within that same XPath?

Can anyone help me? Thanks.

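The error occurs because find_element must return an element, while an XPath ending in /text() selects a text node. Instead, first find the script WebElement (here driver is the WebDriver instance, called driver_blank in the question):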

div = driver.find_element_by_id("ctl00_ContentPlaceHolder1_view_form")
script = div.find_element_by_tag_name('script')

Get the script's innerHTML:

text = script.get_attribute('innerHTML')
print(text)

Find the line that contains "var messageid":

line = [l for l in text.split("\n") if "var messageid" in l][0]
print("Line:", line)

Get the number from that line:

ix_1 = line.find("=")
ix_2 = line.find(";")

number = int(line[ix_1+1:ix_2])
print("Number:", number)
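
The slicing above relies on "var messageid" sitting on its own line and on the first "=" and ";" in that line belonging to it. As a rough alternative sketch, a regular expression can search the whole script text in one pass. The names MESSAGE_ID_RE and extract_message_id are made up for this illustration, and the sample string stands in for the value of script.get_attribute('innerHTML'):

```python
import re

# Matches "var messageid = <digits>;" with flexible whitespace.
MESSAGE_ID_RE = re.compile(r"\bvar\s+messageid\s*=\s*(\d+)\s*;")


def extract_message_id(script_text):
    """Return the messageid as an int, or None if no assignment is found."""
    match = MESSAGE_ID_RE.search(script_text)
    return int(match.group(1)) if match else None


# Sample text standing in for script.get_attribute('innerHTML').
sample = """
  var messageid = 7197409;
  var highlight_id = -1;
"""
print("Number:", extract_message_id(sample))
```

Returning None when nothing matches avoids the IndexError that the list-indexing version raises if the page layout changes.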

Output (tested on Chromium 80.x):

Number: 7197409
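
Since messageid is declared with var at the top level of the script, it should also be reachable as a property of the page's window object, so another option may be to ask the browser for the value rather than parsing the script source. The sketch below uses Selenium's execute_script(script, *args); the helper name read_js_global is hypothetical, and the approach assumes the script has already run when the call is made. It has not been tested against the page in the question.

```python
def read_js_global(driver, name):
    """Return the value of the page's global JavaScript variable `name`.

    `driver` is any object exposing Selenium's execute_script(script, *args).
    """
    # Inside the script, arguments[0] is the first extra argument
    # passed to execute_script, here the variable name.
    return driver.execute_script("return window[arguments[0]];", name)
```

With the question's driver this would be called as read_js_global(driver_blank, "messageid") after driver_blank.get(url). Passing the name as an argument rather than formatting it into the script string keeps the JavaScript fixed and avoids quoting problems.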