Use jsoup to extract text from 'form' class with variable page data
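First post here, so I'll do my best to keep this to the point. I've been using Jsoup to extract data from a large number of web pages for an import utility. I've come across a page that updates its data dynamically based on what the user selects from a drop-down box. When I inspect the HTML in Chrome I can see the data, but I can't seem to extract it. I can extract all the text elements around it, but anything generated dynamically doesn't come out.
The page I'm looking at has the following form class; sorry about the wrapping, I couldn't get rid of it.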
<form class="variations_form cart" method="post" enctype="multipart/form-data" data-product_id="8044" data-product_variations="[{"variation_id":8047,"variation_is_visible":true,"variation_is_active":true,"is_purchasable":true,"display_price":19.70,"display_regular_price":19.70,"attributes":{"attribute_size":"500g"},"image_src":"http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/08\/LABELS_500g-FOOD-Vann-475x652.png","image_link":"http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/08\/LABELS_500g-FOOD-Vann.png","image_title":"LABELS_500g-FOOD Vann","image_alt":"","image_srcset":"http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/08\/LABELS_500g-FOOD-Vann-746x1024.png 746w, http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/08\/LABELS_500g-FOOD-Vann-475x652.png 475w, http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/08\/LABELS_500g-FOOD-Vann.png 1063w","image_sizes":"(max-width: 475px) 100vw, 475px","price_html":"<span class=\"price\"><span class=\"amount\">.70<\/span><\/span>","availability_html":"","sku":"FOOD-Vanilla-500","weight":".5 kg","dimensions":"","min_qty":1,"max_qty":"","backorders_allowed":false,"is_in_stock":true,"is_downloadable":false,"is_virtual":false,"is_sold_individually":"no","variation_description":"<p>500g<\/p>\n"},{"variation_id":8045,"variation_is_visible":true,"variation_is_active":true,"is_purchasable":true,"display_price":13.50,"display_regular_price":13.50,"attributes":{"attribute_size":"1kg"},"image_src":"http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/09\/LABELS_1kg-FOOD-Van-475x652.png","image_link":"http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/09\/LABELS_1kg-FOOD-Van.png","image_title":"LABELS_1kg-FOOD Van","image_alt":"","image_srcset":"http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/09\/LABELS_1kg-FOOD-Van-746x1024.png 746w, http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/09\/LABELS_1kg-FOOD-Van-475x652.png 475w, 
http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/09\/LABELS_1kg-FOOD-Van.png 1063w","image_sizes":"(max-width: 475px) 100vw, 475px","price_html":"<span class=\"price\"><span class=\"amount\">.50<\/span><\/span>","availability_html":"","sku":"FOOD-Vanilla-1kg","weight":"1 kg","dimensions":"","min_qty":1,"max_qty":"","backorders_allowed":false,"is_in_stock":true,"is_downloadable":false,"is_virtual":false,"is_sold_individually":"no","variation_description":"<p>1kg<\/p>\n"},{"variation_id":8046,"variation_is_visible":true,"variation_is_active":true,"is_purchasable":true,"display_price":199.95,"display_regular_price":199.95,"attributes":{"attribute_size":"3kg"},"image_src":"http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/09\/LABELS_3kg-FOOD-Van-475x652.png","image_link":"http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/09\/LABELS_3kg-FOOD-Van.png","image_title":"LABELS_3kg-FOOD Van","image_alt":"","image_srcset":"http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/09\/LABELS_3kg-FOOD-Van-746x1024.png 746w, http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/09\/LABELS_3kg-FOOD-Van-475x652.png 475w, http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/09\/LABELS_3kg-FOOD-Van.png 1063w","image_sizes":"(max-width: 475px) 100vw, 475px","price_html":"<span class=\"price\"><span class=\"amount\">9.95<\/span><\/span>","availability_html":"","sku":"FOOD-Vanilla-3kg","weight":"3 kg","dimensions":"","min_qty":1,"max_qty":"","backorders_allowed":false,"is_in_stock":true,"is_downloadable":false,"is_virtual":false,"is_sold_individually":"no","variation_description":"<p>3kg<\/p>\n"}]">
<table class="variations" cellspacing="0">
<tbody>
<tr>
<td class="label">
<label for="size">Size</label>
</td>
<td class="value">
<select id="size" class="" name="attribute_size" data-attribute_name="attribute_size">
<option value="">Choose an option</option>
<option value="500g">500g</option>
<option value="1kg" selected="selected">1kg</option>
<option value="3kg">3kg</option>
</select><a class="reset_variations" href="#" style="visibility: visible; display: block;">Clear selection</a>
</td>
</tr>
</tbody>
</table>
<div class="angelleye_buton_box_relative" style="position: relative;">
<div class="single_variation_wrap">
<div class="woocommerce-variation-description" style="border: 1px solid transparent;">
<p>1kg</p>
</div>
<div class="single_variation"><span class="price"><span class="amount selectorgadget_selected">$13.50</span></span>
</div>
<div class="variations_button">
<div class="quantity">
<input type="number" step="1" name="quantity" value="1" title="Qty" class="input-text qty text" size="4" min="1">
</div>
<button type="submit" class="single_add_to_cart_button button alt">Add to basket</button>
<input type="hidden" name="add-to-cart" value="8044">
<input type="hidden" name="product_id" value="8044">
<input type="hidden" name="variation_id" class="variation_id" value="8045">
</div>
</div>
<div class="blockUI blockOverlay angelleyeOverlay" style="display:none;z-index: 1000; border: none; margin: 0px; padding: 0px; width: 100%; height: 100%; top: 0px; left: 0px; opacity: 0.6; cursor: default; position: absolute; background: url(http://www.sourcewebsite.com/wp-content/plugins/woocommerce/assets/images/select2-spinner.gif) 50% 50% / 16px 16px no-repeat rgb(255, 255, 255);"></div>
</div>
</form>
I'm trying to extract the price "13.50" from the div below.
<div class="single_variation"><span class="price"><span class="amount selectorgadget_selected">$13.50</span></span>
</div>
My code is as follows:
private class ParseFoodPriceURL extends AsyncTask<String, Void, String> {
    @Override
    protected String doInBackground(String... strings) {
        StringBuilder buffer = new StringBuilder();
        try {
            // Fetch and parse the page at the URL passed to execute()
            Document doc = Jsoup.connect(strings[0]).get();
            Elements foodPrice = doc.select("div.single_variation_wrap > div.single_variation");
            String priceTextSelection = foodPrice.text();
            buffer.append("Price: $").append(priceTextSelection);
        } catch (Throwable t) {
            t.printStackTrace();
        }
        return buffer.toString();
    }
}
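jsoup is not a browser, so it does not interpret or execute JavaScript. If the site's content is generated dynamically, you cannot get it with jsoup directly. Two options come to mind:
1. Identify the AJAX calls directly and get the information through them. The response is usually JSON rather than HTML, so you may need a separate parsing library. This option is fast, but you need to investigate and understand how the web page works.
2. Use Selenium WebDriver with a real browser engine (PhantomJS, for example). This loads the website like a real browser, and you can then access its contents much as you would with jsoup. It is relatively easy to program, but slow and resource-hungry. If you are running on Android, this may be too much. In any case, for Android the right tool for this seems to be Selenoid.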
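As a side note on the first option: in the markup pasted in the question, the per-size prices already appear as JSON in the form's data-product_variations attribute. If that attribute is also present in the HTML that jsoup downloads (an assumption, since the paste came from Chrome's inspector), it could be read with something like doc.select("form.variations_form").attr("data-product_variations") and parsed without running any JavaScript. Below is a minimal sketch of the parsing step only. The class name VariationPrices and the method pricesBySize are made up for illustration, the sample string is a trimmed copy of the attribute from the question, and the regex is just a stand-in for a proper JSON parser (on Android, org.json would be the more robust choice).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: maps each size ("500g", "1kg", ...) to its display_price.
class VariationPrices {
    // Captures the number after "display_price": and the following attribute_size value.
    // DOTALL lets the lazy gap span line breaks inside the attribute text.
    private static final Pattern VARIATION = Pattern.compile(
            "\"display_price\":([0-9.]+).*?\"attribute_size\":\"([^\"]+)\"",
            Pattern.DOTALL);

    static Map<String, String> pricesBySize(String variationsJson) {
        Map<String, String> prices = new LinkedHashMap<>();
        Matcher m = VARIATION.matcher(variationsJson);
        while (m.find()) {
            prices.put(m.group(2), m.group(1)); // size -> price, kept as text
        }
        return prices;
    }

    public static void main(String[] args) {
        // Trimmed copy of the data-product_variations value from the question.
        String json = "[{\"variation_id\":8047,\"display_price\":19.70,"
                + "\"display_regular_price\":19.70,\"attributes\":{\"attribute_size\":\"500g\"}},"
                + "{\"variation_id\":8045,\"display_price\":13.50,"
                + "\"display_regular_price\":13.50,\"attributes\":{\"attribute_size\":\"1kg\"}},"
                + "{\"variation_id\":8046,\"display_price\":199.95,"
                + "\"display_regular_price\":199.95,\"attributes\":{\"attribute_size\":\"3kg\"}}]";
        Map<String, String> prices = pricesBySize(json);
        System.out.println("Price: $" + prices.get("1kg"));
    }
}
```

The sketch relies on display_price coming before attribute_size inside each variation object, as it does in the pasted markup; a real JSON parser would not depend on that ordering.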