HTML and PHP cURL response UTF-8 encoding problem
I am fetching HTML from two websites with cURL.
SITE 1:
https://xperia.sony.jp/campaign/360RA/?s_tc=somc_co_ext_docomo_360RA_banner
My cURL code looks like this:
    $ua = "Mozilla/5.0 (X11; Linux i686; rv:36.0) Gecko/20100101 Firefox/36.0 SeaMonkey/2.33.1";
    $options = array(
        CURLOPT_RETURNTRANSFER => true, // return the page as a string
        CURLOPT_FAILONERROR => true, // fail on HTTP status codes >= 400
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_ENCODING => "", // accept all supported content encodings (compression, not charset)
        CURLOPT_USERAGENT => $ua, // who am I
        CURLOPT_AUTOREFERER => true, // set referer on redirect
        CURLOPT_CONNECTTIMEOUT => 10, // timeout on connect
        CURLOPT_TIMEOUT => 10, // timeout on response
        CURLOPT_MAXREDIRS => 5, // stop after 5 redirects
        CURLOPT_FORBID_REUSE => true, // close the connection after the transfer
    );
    $ch = curl_init($url);
    curl_setopt_array($ch, $options);
    $content = curl_exec($ch);
    // Use XPath or str_get_html($content) to parse
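The first URL comes out perfectly encoded and displays its characters as expected.
Example: $title_string = $html->find("title", 0)->plaintext shows the <title> tag text with correctly encoded characters.
The SECOND URL displays SQUARE BOXES: ¤ããªãããi��Ɨ�
However, when I run utf8_decode($title_string), this SECOND URL displays correctly encoded characters as expected.
The problem is that when I use utf8_decode($title_string), the FIRST URL now displays SQUARE BOXES.
Is there a universal way to solve this?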
I have tried:
    $charset = mb_detect_encoding($str);
    if ($charset == "UTF-8") {
        return utf8_decode($str);
    } else {
        return $str;
    }
It seems that both strings come back from cURL encoded as UTF-8. One works, the other shows square boxes.
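Why the mb_detect_encoding() check cannot separate the two cases can be sketched without any network access. This is only an illustration under assumptions: the sample string is made up, and mb_convert_encoding() to ISO-8859-1 stands in for utf8_decode() (which is deprecated as of PHP 8.2). A string whose UTF-8 bytes were misread as ISO-8859-1 and then re-encoded is still valid UTF-8, so the check passes for both the clean and the garbled title.

```php
<?php
// Illustrative input (an assumption, not taken from either site):
// three Japanese characters, written as escapes so the bytes are UTF-8.
$clean = "\u{65E5}\u{672C}\u{8A9E}";

// Simulate a parser that reads the UTF-8 bytes as ISO-8859-1
// and re-encodes the result as UTF-8 (double encoding).
$mojibake = mb_convert_encoding($clean, 'UTF-8', 'ISO-8859-1');

// Both strings are valid UTF-8, so an encoding check cannot tell them apart.
var_dump(mb_check_encoding($clean, 'UTF-8'));
var_dump(mb_check_encoding($mojibake, 'UTF-8'));

// Converting UTF-8 to ISO-8859-1 (what utf8_decode() does) undoes the double encoding...
$repaired = mb_convert_encoding($mojibake, 'ISO-8859-1', 'UTF-8');
var_dump($repaired === $clean);

// ...but damages a string that was correct to begin with.
$damaged = mb_convert_encoding($clean, 'ISO-8859-1', 'UTF-8');
var_dump($damaged === $clean);
```

This matches the symptom described above: the same conversion repairs one title and breaks the other, so the fix has to happen where the text is parsed, not afterwards.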
I have also tried:
php curl response encoding
Strange behaviour when encoding cURL response as UTF-8
Replace unicode character
https://www.php.net/manual/en/function.mb-convert-encoding.php
Which charset should i use for multilingual website?
French and Chinese characters are not appearing correctly
and many more.
I have spent many hours on this. Any ideas are welcome.
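Both pages are UTF-8 encoded, and cURL returns them as such. The problem is the processing that follows; assuming libxml2 is involved, it tries to guess the encoding from the <meta> elements, and if there are none, it assumes ISO-8859-1. It can be forced to assume UTF-8 by prepending a UTF-8 BOM ("\xEF\xBB\xBF") to the HTML.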
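The BOM approach can be tried on an inline document that has no <meta> charset declaration. This is a minimal sketch, assuming the DOM extension is available; the helper name loadUtf8Html and the sample markup are hypothetical and not part of the original code.

```php
<?php
// Hypothetical helper: parse a UTF-8 HTML string with DOMDocument,
// forcing libxml2 to treat it as UTF-8 by prepending a BOM.
function loadUtf8Html(string $html): DOMDocument
{
    $bom = "\xEF\xBB\xBF";
    // Avoid adding a second BOM if the input already starts with one.
    if (strncmp($html, $bom, 3) !== 0) {
        $html = $bom . $html;
    }

    // Collect parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// Sample page without any charset declaration (made up for illustration).
$html = "<html><head><title>\u{65E5}\u{672C}\u{8A9E}</title></head><body></body></html>";

$title = loadUtf8Html($html)->getElementsByTagName('title')->item(0)->textContent;
echo $title, PHP_EOL;
```

With real pages, the string returned by curl_exec() would be passed to the helper in place of the sample markup.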
As @cmb mentioned in the answer above, the BOM does the trick. For those who want to see the full details of my final code, here you go:
    $url = "https://whosebug.com/";
    // Raw HTML string; the $content returned by curl_exec() works as well.
    // (str_get_html() expects markup, not a URL.)
    $html = file_get_contents($url);
    libxml_use_internal_errors(true); // collect parse warnings instead of silencing them with @
    $doc = new DOMDocument();
    $doc->loadHTML("\xEF\xBB\xBF$html"); // this is where and how you put the BOM
    $xpath = new DOMXPath($doc);
    // Select every <meta> element whose property attribute starts with "og:"
    $query = '//*/meta[starts-with(@property, \'og:\')]';
    $metas = $xpath->query($query);
    $rmetas = array();
    foreach ($metas as $meta) {
        $property = $meta->getAttribute('property');
        $content = $meta->getAttribute('content');
        $rmetas[$property] = $content;
    }
    var_dump($rmetas);
Hope this helps anyone who runs into the same problem.