Reading, writing and controlling a dynamically instantiated HTML web table using Python Selenium

Question

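Suppose a site has a search box for goods, and I search for 'Teddy'. There are 140 results in total, shown in a small table built from <div> elements (one row per item, one column per field of the item) with a scrollbar. The table shows at most 5 items at a time (each item is 40px high); to see more, I have to scroll the table down.

When I am looking at items 45 to 49, the HTML looks like this (item 45 is at the top of the current view).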

<div class="table-body" style="height:200px">            <!-- This contains the scrollbar -->
    <div class="table-panel" style="height:5600px">
        <div class="ag-row" style="height:40px" row="42"> <!-- Each row is one item -->
            <div class="name">Teddy</div>                 <!-- Each column is one field of the item -->
            <div class="price">200</div>
            <input class="amount" value="0">              <!-- Text box for the amount to buy -->
        </div>
        <div class="ag-row" style="height:40px" row="43">
            <div class="name">Brown Bess</div>
            <div class="price">230</div>
            <input class="amount" value="0">
        </div>
        <div class="ag-row" style="height:40px" row="44"> <!-- This is what I am seeing at the top. The row attribute is 0-based -->
            <div class="name">Blue</div>
            <div class="price">280</div>
            <input class="amount" value="0">
        </div>
        <div class="ag-row" style="height:40px" row="45">
            <div class="name">Scientist</div>
            <div class="price">400</div>
            <input class="amount" value="0">
        </div>
        <div class="ag-row" style="height:40px" row="46">
            <div class="name">Mouse</div>
            <div class="price">120</div>
            <input class="amount" value="0">
        </div>
        <div class="ag-row" style="height:40px" row="47">
            <div class="name">Hangover</div>
            <div class="price">150</div>
            <input class="amount" value="0">
        </div>
        <div class="ag-row" style="height:40px" row="48"> <!-- This is the last row I am seeing -->
            <div class="name">Building</div>
            <div class="price">420</div>
            <input class="amount" value="0">
        </div>
        <div class="ag-row" style="height:40px" row="50">
            <div class="name">Park</div>
            <div class="price">60</div>
            <input class="amount" value="0">
        </div>
        <div class="ag-row" style="height:40px" row="51">
            <div class="name">Coffee</div>
            <div class="price">160</div>
            <input class="amount" value="0">
        </div>
        <div class="ag-row" style="height:40px" row="49">
            <div class="name">Juice</div>
            <div class="price">100</div>
            <input class="amount" value="0">
        </div>
    </div>
</div>

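This is simplified markup that I made up; the real page is far more complicated because of its styles, attributes and scripts. I think it is enough for the purpose of my question.

I examined how the page behaves. It only creates HTML for rows near what I am currently viewing. When I am looking at around the 100th item, it creates HTML for roughly items 92 to 108, and the number of rows it instantiates is fairly random. When I scroll down or up, it removes rows that are far from the current position and creates new ones for the current screen.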
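The numbers in the sample markup fit together as simple arithmetic. The following is a minimal sketch of that arithmetic, assuming every row is exactly 40px high and the sample values (200px viewport, 140 rows) hold; the function names are made up for illustration and are not part of the page or of Selenium.

```python
ROW_HEIGHT = 40        # px per row, from the sample markup
VIEWPORT_HEIGHT = 200  # px, height of .table-body
TOTAL_ROWS = 140       # number of search results


def panel_height():
    """Full scrollable height of .table-panel in px."""
    return TOTAL_ROWS * ROW_HEIGHT


def scroll_top_for(row):
    """scrollTop that puts the given 0-based row at the top of the view."""
    return row * ROW_HEIGHT


def visible_rows(scroll_top):
    """0-based rows that intersect the viewport at this scrollTop."""
    first = scroll_top // ROW_HEIGHT
    # Round up so that a partially visible last row is included.
    last = (scroll_top + VIEWPORT_HEIGHT + ROW_HEIGHT - 1) // ROW_HEIGHT
    return range(first, min(last, TOTAL_ROWS))


print(panel_height())                            # 5600, matches the sample
print(list(visible_rows(scroll_top_for(44))))    # [44, 45, 46, 47, 48]
```

The rows actually present in the DOM (42 to 51 in the sample) are a superset of this visible range, which is why a scraper has to deduplicate by the row attribute.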

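I need to parse this data and build a list-like data structure in Python. Because the page instantiates only part of the data depending on the view (more precisely, it seems to use the scrollbar position to determine where I am looking), I tried to control the scrollbar, scrape the data present in the HTML at each position, and drop duplicates. The code is below.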

from collections import defaultdict

from selenium import webdriver
from selenium.webdriver.common.by import By
# ..blah..

def iterateOptionTable(driver):
    el_viewport = driver.find_element(By.CLASS_NAME, 'table-body')
    driver.execute_script('document.getElementsByClassName("{}")[0].scrollTop = 0;'.format('table-body'))
    max_height = int(driver.execute_script('return document.getElementsByClassName("{}")[0].scrollHeight;'.format('table-body')))
    scrolling_amnt = int(40 * 5)  # Each row height is 40
    cur_scroll = 0
    table = defaultdict(int)  # Rows already yielded, so they are not pushed twice
    ret = []
    while cur_scroll < max_height:
        el_products = el_viewport.find_elements(By.XPATH, './div/*')
        for el_p in el_products:
            rownum = int(el_p.get_attribute("row"))
            if rownum not in table:
                table[rownum] = True
                ret.append(el_p)
        yield ret   # List of WebElements, one per item
        ret.clear()
        cur_scroll += scrolling_amnt
        driver.execute_script('document.getElementsByClassName("{}")[0].scrollTop = {};'.format('table-body', cur_scroll))

def parseElementToData(elems):
    ret = []
    for el in elems:
        single_data = DO_EXTRACT_DATA_FROM_EL(el)  # Placeholder for the real extraction
        ret.append(single_data)
    return ret

def parseTable(driver):
    ret = []
    for elems in iterateOptionTable(driver):
        ret += parseElementToData(elems)
    return ret

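There are several other tasks to do on this page, and because of how the page is layered, the code is written with yield.

When I step through it one line at a time in the debugger, it works fine. But in a real run it does not even scroll the table down, not to mention that I think it is inefficient. I also tried an equivalent JavaScript version, executed through Selenium, with the same result.

Is there a more sophisticated way to do this, or can someone explain why this does not work in a normal run? I am new to web crawling and Selenium. Please help :)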

Answer 1

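Whether or not you can see the elements says nothing about whether they are already in the HTML; they are presumably just hidden with CSS until you scroll to them.

I am making assumptions here because you did not provide a link to the page in question, so I will try to explain using the code you provided.

My suggestion is to return all rows from the table one by one: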

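from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

# Assumed from the sample HTML in the question; adjust to the real page.
x_path_to_the_row = '//div[@class="table-panel"]/div'

i = 1  # XPath positions are 1-based
row_list = []

while True:
    try:
        row = f'({x_path_to_the_row})[{i}]'
        # `driver` is the already-initialized WebDriver from the question.
        name = driver.find_element(By.XPATH, f'{row}/div[1]').get_attribute('innerHTML')
        price = driver.find_element(By.XPATH, f'{row}/div[2]').get_attribute('innerHTML')
        row_list.append((name, price))
    except NoSuchElementException:
        break
    i += 1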

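Basically, loop until the next table row does not exist, read the columns of each row, and build a two-element tuple from them.

Note: unless the HTML is inside a Shadow DOM component, this should work without problems.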
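If the page really does create and remove rows while scrolling, as the question reports, a row-by-row loop only sees what is rendered at that moment. One plausible reason the question's code works in a debugger but not in a real run is timing: stepping gives the page time to render after each scroll. The following is a rough sketch under those assumptions, not a confirmed fix. It scrolls in steps, pauses briefly, and reads the rendered rows in a single `execute_script` call so that no WebElement can go stale. The class names come from the sample markup; `scrape_virtual_table`, `SCROLL_JS`, `SNAPSHOT_JS`, `step` and `settle` are made-up names, and the fixed pause is a crude stand-in for a proper wait condition.

```python
import time

# Scroll the container and report its full scrollable height.
SCROLL_JS = """
const body = document.getElementsByClassName(arguments[0])[0];
body.scrollTop = arguments[1];
return body.scrollHeight;
"""

# Read every row currently in the DOM as plain data.
SNAPSHOT_JS = """
const body = document.getElementsByClassName(arguments[0])[0];
return Array.from(body.querySelectorAll('.ag-row')).map(r => ({
    row: parseInt(r.getAttribute('row'), 10),
    name: r.querySelector('.name').textContent,
    price: r.querySelector('.price').textContent,
}));
"""


def scrape_virtual_table(driver, body_class="table-body", step=200, settle=0.2):
    """Return (name, price) tuples for all rows, ordered by row index."""
    seen = {}
    position = 0
    while True:
        height = driver.execute_script(SCROLL_JS, body_class, position)
        time.sleep(settle)  # assumption: the page needs a moment to render rows
        for item in driver.execute_script(SNAPSHOT_JS, body_class):
            # Keep the first copy of each row; later copies are duplicates.
            seen.setdefault(item["row"], (item["name"], item["price"]))
        if position >= height:
            break
        position += step
    return [seen[index] for index in sorted(seen)]
```

With a live session this would be called as `scrape_virtual_table(driver)`. The function only relies on `driver.execute_script`, so it can be exercised with a stub object before pointing it at the real page.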

Answer 2

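I did not get the result I expected. Scrolling is not reliably interactable in this case. I managed to work around it by selecting a single cell in the table and sending 'Keys.DOWN' to scroll down.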
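A minimal sketch of how that workaround could be combined with deduplication by row index. The collector is plain Python and takes two callbacks; `collect_all_rows` and `selenium_callbacks` are hypothetical names, and the selectors in the Selenium wiring are assumed from the question's sample markup rather than taken from the real page. It also assumes a cell has already been clicked so that the table has keyboard focus.

```python
import time


def collect_all_rows(read_visible_rows, advance, max_idle_steps=5, pause=0.0):
    """Collect rows from a table that renders only part of its data.

    read_visible_rows() returns [(row_index, data), ...] for rows in the DOM.
    advance() moves the view down a little, e.g. one DOWN key press.
    Stops after max_idle_steps consecutive steps that reveal no new row.
    """
    seen = {}
    idle = 0
    while idle < max_idle_steps:
        new_rows = 0
        for row_index, data in read_visible_rows():
            if row_index not in seen:
                seen[row_index] = data
                new_rows += 1
        idle = 0 if new_rows else idle + 1
        advance()
        time.sleep(pause)  # optional settle time between key presses
    return [seen[index] for index in sorted(seen)]


def selenium_callbacks(driver):
    """Hypothetical wiring for the sample markup; imports Selenium lazily."""
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys

    def read_visible_rows():
        rows = []
        for el in driver.find_elements(By.CSS_SELECTOR, ".table-panel > .ag-row"):
            name = el.find_element(By.CLASS_NAME, "name").text
            price = el.find_element(By.CLASS_NAME, "price").text
            rows.append((int(el.get_attribute("row")), (name, price)))
        return rows

    def advance():
        # Send DOWN to whichever element currently has focus.
        driver.switch_to.active_element.send_keys(Keys.DOWN)

    return read_visible_rows, advance
```

Usage would look like `collect_all_rows(*selenium_callbacks(driver), pause=0.1)`. Because the page may remove rows between `find_elements` and the reads that follow, `read_visible_rows` can still raise a stale-element error on a fast-changing table; catching that and retrying the read is one option.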

Tags: python, selenium, web-crawler