使用 jsoup 从具有可变页面数据的 'form' class 中提取文本

Use jsoup to extract text from 'form' class with variable page data

首先 post 在这里,所以我会尽力保持这一点。我一直在使用 Jsoup 从大量网页中提取数据以导入实用程序。我遇到了一个页面,该页面根据用户从下拉框中选择的内容动态更新数据。当我检查 Chrome 中的 html 时,我可以看到数据,但我似乎无法提取它。我可以提取它周围的所有文本元素,但是任何动态生成的东西都不会出来。

我正在查看的页面具有以下形式 class,很抱歉包装,我无法摆脱它。

<form class="variations_form cart" method="post" enctype="multipart/form-data" data-product_id="8044" data-product_variations="[{&quot;variation_id&quot;:8047,&quot;variation_is_visible&quot;:true,&quot;variation_is_active&quot;:true,&quot;is_purchasable&quot;:true,&quot;display_price&quot;:19.70,&quot;display_regular_price&quot;:19.70,&quot;attributes&quot;:{&quot;attribute_size&quot;:&quot;500g&quot;},&quot;image_src&quot;:&quot;http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/08\/LABELS_500g-FOOD-Vann-475x652.png&quot;,&quot;image_link&quot;:&quot;http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/08\/LABELS_500g-FOOD-Vann.png&quot;,&quot;image_title&quot;:&quot;LABELS_500g-FOOD Vann&quot;,&quot;image_alt&quot;:&quot;&quot;,&quot;image_srcset&quot;:&quot;http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/08\/LABELS_500g-FOOD-Vann-746x1024.png 746w, http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/08\/LABELS_500g-FOOD-Vann-475x652.png 475w, http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/08\/LABELS_500g-FOOD-Vann.png 1063w&quot;,&quot;image_sizes&quot;:&quot;(max-width: 475px) 100vw, 475px&quot;,&quot;price_html&quot;:&quot;<span class=\&quot;price\&quot;><span class=\&quot;amount\&quot;>.70<\/span><\/span>&quot;,&quot;availability_html&quot;:&quot;&quot;,&quot;sku&quot;:&quot;FOOD-Vanilla-500&quot;,&quot;weight&quot;:&quot;.5 kg&quot;,&quot;dimensions&quot;:&quot;&quot;,&quot;min_qty&quot;:1,&quot;max_qty&quot;:&quot;&quot;,&quot;backorders_allowed&quot;:false,&quot;is_in_stock&quot;:true,&quot;is_downloadable&quot;:false,&quot;is_virtual&quot;:false,&quot;is_sold_individually&quot;:&quot;no&quot;,&quot;variation_description&quot;:&quot;<p>500g<\/p>\n&quot;},{&quot;variation_id&quot;:8045,&quot;variation_is_visible&quot;:true,&quot;variation_is_active&quot;:true,&quot;is_purchasable&quot;:true,&quot;display_price&quot;:13.50,&quot;display_regular_price&quot;:13.50,&quot;attributes&quot;:{&quot;attribute_size&quot;:&quot;1kg&quot;},&quot;image_src&quot;:&quot;http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/09\/LABELS_1kg-FOOD-Van-475x652.png&quot;,&quot;image_link&quot;:&quot;http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/09\/LABELS_1kg-FOOD-Van.png&quot;,&quot;image_title&quot;:&quot;LABELS_1kg-FOOD Van&quot;,&quot;image_alt&quot;:&quot;&quot;,&quot;image_srcset&quot;:&quot;http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/09\/LABELS_1kg-FOOD-Van-746x1024.png 746w, http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/09\/LABELS_1kg-FOOD-Van-475x652.png 475w, http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/09\/LABELS_1kg-FOOD-Van.png 1063w&quot;,&quot;image_sizes&quot;:&quot;(max-width: 475px) 100vw, 475px&quot;,&quot;price_html&quot;:&quot;<span class=\&quot;price\&quot;><span class=\&quot;amount\&quot;>.50<\/span><\/span>&quot;,&quot;availability_html&quot;:&quot;&quot;,&quot;sku&quot;:&quot;FOOD-Vanilla-1kg&quot;,&quot;weight&quot;:&quot;1 kg&quot;,&quot;dimensions&quot;:&quot;&quot;,&quot;min_qty&quot;:1,&quot;max_qty&quot;:&quot;&quot;,&quot;backorders_allowed&quot;:false,&quot;is_in_stock&quot;:true,&quot;is_downloadable&quot;:false,&quot;is_virtual&quot;:false,&quot;is_sold_individually&quot;:&quot;no&quot;,&quot;variation_description&quot;:&quot;<p>1kg<\/p>\n&quot;},{&quot;variation_id&quot;:8046,&quot;variation_is_visible&quot;:true,&quot;variation_is_active&quot;:true,&quot;is_purchasable&quot;:true,&quot;display_price&quot;:199.95,&quot;display_regular_price&quot;:199.95,&quot;attributes&quot;:{&quot;attribute_size&quot;:&quot;3kg&quot;},&quot;image_src&quot;:&quot;http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/09\/LABELS_3kg-FOOD-Van-475x652.png&quot;,&quot;image_link&quot;:&quot;http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/09\/LABELS_3kg-FOOD-Van.png&quot;,&quot;image_title&quot;:&quot;LABELS_3kg-FOOD Van&quot;,&quot;image_alt&quot;:&quot;&quot;,&quot;image_srcset&quot;:&quot;http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/09\/LABELS_3kg-FOOD-Van-746x1024.png 746w, http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/09\/LABELS_3kg-FOOD-Van-475x652.png 475w, http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/09\/LABELS_3kg-FOOD-Van.png 1063w&quot;,&quot;image_sizes&quot;:&quot;(max-width: 475px) 100vw, 475px&quot;,&quot;price_html&quot;:&quot;<span class=\&quot;price\&quot;><span class=\&quot;amount\&quot;>9.95<\/span><\/span>&quot;,&quot;availability_html&quot;:&quot;&quot;,&quot;sku&quot;:&quot;FOOD-Vanilla-3kg&quot;,&quot;weight&quot;:&quot;3 kg&quot;,&quot;dimensions&quot;:&quot;&quot;,&quot;min_qty&quot;:1,&quot;max_qty&quot;:&quot;&quot;,&quot;backorders_allowed&quot;:false,&quot;is_in_stock&quot;:true,&quot;is_downloadable&quot;:false,&quot;is_virtual&quot;:false,&quot;is_sold_individually&quot;:&quot;no&quot;,&quot;variation_description&quot;:&quot;<p>3kg<\/p>\n&quot;}]">

  <table class="variations" cellspacing="0">
    <tbody>
      <tr>
        <td class="label">
          <label for="size">Size</label>
        </td>
        <td class="value">
          <select id="size" class="" name="attribute_size" data-attribute_name="attribute_size">
            <option value="">Choose an option</option>
            <option value="500g">500g</option>
            <option value="1kg" selected="selected">1kg</option>
            <option value="3kg">3kg</option>
          </select><a class="reset_variations" href="#" style="visibility: visible; display: block;">Clear selection</a> 
        </td>
      </tr>
    </tbody>
  </table>

  <div class="angelleye_buton_box_relative" style="position: relative;">

    <div class="single_variation_wrap">
      <div class="woocommerce-variation-description" style="border: 1px solid transparent;">
        <p>1kg</p>
      </div>
      <div class="single_variation"><span class="price"><span class="amount selectorgadget_selected">.50</span></span>
      </div>
      <div class="variations_button">
        <div class="quantity">
          <input type="number" step="1" name="quantity" value="1" title="Qty" class="input-text qty text" size="4" min="1">
        </div>
        <button type="submit" class="single_add_to_cart_button button alt">Add to basket</button>
        <input type="hidden" name="add-to-cart" value="8044">
        <input type="hidden" name="product_id" value="8044">
        <input type="hidden" name="variation_id" class="variation_id" value="8045">
      </div>
    </div>

    <div class="blockUI blockOverlay angelleyeOverlay" style="display:none;z-index: 1000; border: none; margin: 0px; padding: 0px; width: 100%; height: 100%; top: 0px; left: 0px; opacity: 0.6; cursor: default; position: absolute; background: url(http://www.sourcewebsite.com/wp-content/plugins/woocommerce/assets/images/select2-spinner.gif) 50% 50% / 16px 16px no-repeat rgb(255, 255, 255);"></div>
  </div>

</form>

我正在尝试从下方提取价格“13.50”div。

<div class="single_variation"><span class="price"><span class="amount selectorgadget_selected">.50</span></span>
</div>

我的代码如下:

    private class ParseFoodPriceURL extends AsyncTask<String, Void, String> {

    @Override
    protected String doInBackground(String... strings) {
        StringBuffer buffer = new StringBuffer();
        try {
            Document doc = Jsoup.connect(strings[0]).get();
            Elements foodPrice = doc.select("div.single_variation_wrap > div.single_variation");
            String priceTextSelection = foodPrice.text();
            buffer.append("Price: $" + priceTextSelection);

        }
        catch (Throwable t) {
            t.printStackTrace();
        }
        return buffer.toString();
    }

JSoup 不是浏览器,因此不会解释和执行JavaScript。如果网站内容是动态生成的,则不能直接使用 JSoup。我想到了两个选项:

  1. 直接识别AJAX调用,通过这些调用获取信息。通常响应不是 HTML,而是 JSON。所以你可能需要其他的解析库。此选项速度很快,但您需要调查并了解网页的工作原理。

  2. 使用selenium webdriver with a real browser engine (phantomjs for example). This will load the website like a real browser but you can access its contents similar to JSoup. This is relatively easy to program, but slow and uses a lot of resources. If you run within android this may be too much. Anyway for Android the right tool for this seems to be Selenoid.