Decode HTML response
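I am trying to scrape some data from the following website: https://bitcoin.pl/
I receive the response from the server and extract the body. I want to extract the links from the body, but I cannot, because the body is not decoded properly and contains escape characters.
I have tried a few solutions, including:
How to unescape HTML character entities in Java?
https://howtodoinjava.com/java/string/unescape-html-to-string/
Here is the code I wrote: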
import com.mashape.unirest.http.HttpResponse;
import com.mashape.unirest.http.Unirest;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class test_scraper {
    public static void main(String[] args) throws Exception {
        final String USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36";
        Unirest.setDefaultHeader("User-Agent",USER_AGENT);
        final HttpResponse<String> response = Unirest.post("https://bitcoin.pl/?ajax-request=jnews")
                .header("Accept","application/json, text/javascript, */*; q=0.01")
                .header("Content-Type","application/x-www-form-urlencoded; charset=UTF-8")
                .header("Referer","https://bitcoin.pl/")
                .header("user-agent",USER_AGENT)
                .header("accept-language", "en-US,en;q=0.9")
                .header("X-Requested-With","XMLHttpRequest")
                .header("accept-encoding", "gzip, deflate, br")
                .queryString("lang","pl_PL")
                .queryString("action","jnews_module_ajax_jnews_block_5")
                .queryString("data[current_page]",2)
                .queryString("data[attribute][number_post]", 1)
                .asString();
        //System.out.println(response.getHeaders());
        //System.out.println(response.getBody());
        final Document html = Jsoup.parseBodyFragment(response.getBody());
        System.out.println(Jsoup.parse(response.getBody()));
    }
}
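I get exactly the same response in the browser (Inspector -> Network -> XHR -> Response), but I want the HTML in the already decoded form shown in the Preview tab.
This is what I receive (part of the response):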
<html>
<head></head>
<body>
{"content":"
<div class="\"jeg_posts" jeg_load_more_flag\">
\n
<article class="\"jeg_post" jeg_pl_lg_2 post-9771 post type-post status-publish format-standard has-post-thumbnail hentry category-kryptowaluty tag-bitcoin tag-chinski-bank-ludowy tag-chinski-banki-centralny tag-chiny tag-cyfrowa-waluta tag-libra tag-token-pboc\">
\n
<div class="\"jeg_thumb\"">
\n \n
<a href="\"https:\/\/bitcoin.pl\/chiny-data-emisji-waluty\/\""></a>
<div class="\"thumbnail-container" animate-lazy size-715 \">
<a href="\"https:\/\/bitcoin.pl\/chiny-data-emisji-waluty\/\""><img width="\"350\"" height="\"250\"" src="\"https:\/\/bitcoin.pl\/wp-content\/themes\/jnews\/assets\/img\/jeg-empty.png\"" class="\"attachment-jnews-350x250" size-jnews-350x250 lazyload wp-post-image\" alt="\"chiny\"" data-src="\"https:\/\/bitcoin.pl\/wp-content\/uploads\/2019\/09\/chiny-350x250.jpg\"" data-sizes="\"auto\"" data-srcset="\"https:\/\/bitcoin.pl\/wp-content\/uploads\/2019\/09\/chiny-350x250.jpg" 350w, https:\ \ bitcoin.pl\ wp-content\ uploads\ 2019\ 09\ chiny-120x86.jpg 120w, chiny-750x536.jpg 750w\" data-expand="\"700\"" data-animate="\"0\""><\/div><\/a>\n
<div class="\"jeg_post_category\"">
\n
<span><a href="\"https:\/\/bitcoin.pl\/category\/kryptowaluty\/\"" class="\"category-kryptowaluty\"">Kryptowaluty<\/a><\/span>\n <\/div>\n <\/div>\n </a>
How can I decode the above properly to get valid HTML?
This service returns JSON:
{
  "content": "<div class=...",
  "next": false,
  "prev": true
}
Jsoup is not needed here, because the HTML is embedded in a JSON object. Use Jackson instead:
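import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Map;

// body is the raw JSON string returned by the server
String body = response.getBody();
ObjectMapper mapper = new ObjectMapper();
Map<?, ?> map = mapper.readValue(body, Map.class);
String content = map.get("content").toString();
System.out.println(content);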
You get plain HTML without any escaping:
<div class="jeg_posts jeg_load_more_flag">
<article class="jeg_post ...
<div class="jeg_thumb">
...
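To illustrate what the JSON parser undoes here, the following is a minimal standard-library sketch of decoding the escape sequences that are allowed inside a JSON string (`\"`, `\/`, `\n`, `\uXXXX` and so on). The class name `JsonUnescapeSketch` and the method `unescapeJsonString` are hypothetical names chosen for this example, and it handles only the contents of a single string value, not a whole JSON document. It is meant to explain the escaping, not to replace a real JSON library such as Jackson.

```java
public class JsonUnescapeSketch {

    /** Decodes the escape sequences permitted inside a JSON string value. */
    public static String unescapeJsonString(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '\\') {
                out.append(c);          // ordinary character, copy as-is
                continue;
            }
            if (i + 1 >= s.length()) {
                throw new IllegalArgumentException("Dangling backslash at end of input");
            }
            char next = s.charAt(++i);  // character following the backslash
            switch (next) {
                case '"':  out.append('"');  break;
                case '\\': out.append('\\'); break;
                case '/':  out.append('/');  break;  // "\/" is just an escaped slash
                case 'b':  out.append('\b'); break;
                case 'f':  out.append('\f'); break;
                case 'n':  out.append('\n'); break;
                case 'r':  out.append('\r'); break;
                case 't':  out.append('\t'); break;
                case 'u':
                    // four hex digits must follow the 'u'
                    if (i + 4 >= s.length()) {
                        throw new IllegalArgumentException("Truncated \\u escape");
                    }
                    out.append((char) Integer.parseInt(s.substring(i + 1, i + 5), 16));
                    i += 4;
                    break;
                default:
                    throw new IllegalArgumentException("Invalid escape: \\" + next);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A fragment in the same escaped form as the "content" value of the response
        String raw = "<a href=\\\"https:\\/\\/bitcoin.pl\\/category\\/kryptowaluty\\/\\\">Kryptowaluty<\\/a>";
        System.out.println(unescapeJsonString(raw));
        // prints: <a href="https://bitcoin.pl/category/kryptowaluty/">Kryptowaluty</a>
    }
}
```

This shows why feeding the raw body to an HTML parser produces attributes such as `class="\"jeg_thumb\""`: the backslashes belong to the JSON layer and have to be removed by a JSON decoder before the text is treated as HTML.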
The ObjectMapper class above is com.fasterxml.jackson.databind.ObjectMapper; do not confuse it with the similarly named class in Unirest.
To use Jackson, add the following dependency to your Gradle file (the Maven declaration is analogous):
implementation 'com.fasterxml.jackson.core:jackson-databind:2.10.0.pr3'
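For Maven, the equivalent declaration would look roughly like this, assuming the same coordinates and version as the Gradle line above:

```xml
<dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-databind</artifactId>
    <version>2.10.0.pr3</version>
</dependency>
```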