C# WebClient doesn't return UTF-8
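Hey :) I've been trying hard to get WebClient to return UTF-8. But when sub should contain something like Ä, I get something that looks more like E instead.
I've tried a lot of workarounds, but none of them work.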
private string translate(string input, string languagePair)
{
    string url = String.Format("https://translate.google.com/?hl=en&ie=UTF8&text={0}&langpair={1}", input, languagePair);
    WebClient wc = new WebClient();
    wc.Headers.Add(HttpRequestHeader.AcceptCharset, "UTF-8");
    wc.Encoding = Encoding.UTF8;
    var data = wc.DownloadData(url);
    var result = Encoding.UTF8.GetString(data);
    //string result = wc.DownloadString(url);
    int start = result.IndexOf("result_box");
    string sub = result.Substring(start);
    sub = sub.Substring(0, sub.IndexOf("</span>"));
    start = sub.LastIndexOf(">");
    sub = sub.Substring(start + 1);
    return sub;
}
Google simply ignores the encoding sent in the AcceptCharset header and returns the response in ISO-8859-1, as you can see from this shortened response:
HTTP/1.1 200 OK
Content-Type: text/html; charset=ISO-8859-1
Content-Language: en
Content-Length: 64202
<!DOCTYPE html><html><head><meta content="text/html; charset=ISO-8859-1" http-equiv="content-type">
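So when you decode the response as UTF-8, you get invalid characters. If you just want to get it working quickly, I found that Google returns the response in UTF-8 when a User-Agent header is added to the request, and you can leave the rest of the code unmodified: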
private static string translate(string input, string languagePair)
{
    string url = String.Format("https://translate.google.com/?hl=en&ie=UTF8&text={0}&langpair={1}", input, languagePair);
    WebClient wc = new WebClient();
    wc.Headers.Add(HttpRequestHeader.AcceptCharset, "utf-8");
    // With a browser-like User-Agent, Google answers in UTF-8.
    wc.Headers.Add(HttpRequestHeader.UserAgent, "Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/55.0");
    wc.Encoding = Encoding.UTF8;
    string result = wc.DownloadString(url);
    int start = result.IndexOf("result_box");
    string sub = result.Substring(start);
    sub = sub.Substring(0, sub.IndexOf("</span>"));
    start = sub.LastIndexOf(">");
    sub = sub.Substring(start + 1);
    return sub;
}
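The mismatch described above can be reproduced without any network access. This is only a minimal sketch; the class and method names (EncodingMismatchDemo, GetLatin1Bytes, DecodeLatin1AsUtf8) are made up for illustration and are not part of the original code. It encodes a string as ISO-8859-1, the way the server did, and then decodes those bytes as UTF-8, the way the question's code does.

```csharp
using System;
using System.Text;

// Hypothetical demo class, not part of the original answer.
public static class EncodingMismatchDemo
{
    // Encodes text the way the server did (ISO-8859-1).
    public static byte[] GetLatin1Bytes(string text)
    {
        return Encoding.GetEncoding("ISO-8859-1").GetBytes(text);
    }

    // Decodes ISO-8859-1 bytes as if they were UTF-8, which is what
    // Encoding.UTF8.GetString(data) does in the question's code.
    public static string DecodeLatin1AsUtf8(string text)
    {
        return Encoding.UTF8.GetString(GetLatin1Bytes(text));
    }

    // Decodes the same bytes with the encoding they were written in.
    public static string DecodeLatin1AsLatin1(string text)
    {
        return Encoding.GetEncoding("ISO-8859-1").GetString(GetLatin1Bytes(text));
    }
}
```

Calling EncodingMismatchDemo.DecodeLatin1AsUtf8("Ä") should not give back "Ä", because the single byte 0xC4 is not a complete UTF-8 sequence and the decoder substitutes a replacement character. DecodeLatin1AsLatin1("Ä") round-trips correctly.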
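A better solution is to detect the encoding used in the response and use that for decoding. WebClient doesn't have this detection built in, so you can either use the solution described here or use HttpClient, which does it for you automatically: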
private static async Task<string> translate(string input, string languagePair)
{
    string url = String.Format("https://translate.google.com/?hl=en&ie=UTF8&text={0}&langpair={1}", input, languagePair);
    using (var hc = new HttpClient())
    {
        // GetStringAsync decodes the body using the charset from the response's Content-Type header.
        var result = await hc.GetStringAsync(url).ConfigureAwait(false);
        int start = result.IndexOf("result_box");
        string sub = result.Substring(start);
        sub = sub.Substring(0, sub.IndexOf("</span>"));
        start = sub.LastIndexOf(">");
        sub = sub.Substring(start + 1);
        return sub;
    }
}
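If you would rather stay with WebClient, the detection step could look roughly like the sketch below. It is not necessarily what the linked solution does. The names CharsetAwareDownload, ResolveEncoding and DownloadStringWithDetectedCharset are hypothetical, and falling back to UTF-8 when no charset is given is my assumption. It downloads the raw bytes, reads the charset from the Content-Type response header, and decodes with that encoding.

```csharp
using System;
using System.Net;
using System.Net.Mime;
using System.Text;

// Hypothetical helper class, not part of the original answer.
public static class CharsetAwareDownload
{
    // Returns the encoding named in a Content-Type header value, or the
    // fallback when the header is missing, malformed, or names an unknown charset.
    public static Encoding ResolveEncoding(string contentTypeHeader, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(contentTypeHeader))
        {
            return fallback;
        }
        try
        {
            string charset = new ContentType(contentTypeHeader).CharSet;
            if (string.IsNullOrEmpty(charset))
            {
                return fallback;
            }
            // Some servers quote the value, e.g. charset="utf-8".
            return Encoding.GetEncoding(charset.Trim('"'));
        }
        catch (FormatException)
        {
            return fallback; // header could not be parsed
        }
        catch (ArgumentException)
        {
            return fallback; // charset name not known to this runtime
        }
    }

    // Downloads the raw bytes and decodes them with the charset the server declared.
    public static string DownloadStringWithDetectedCharset(string url)
    {
        using (var wc = new WebClient())
        {
            byte[] data = wc.DownloadData(url);
            string contentType = wc.ResponseHeaders[HttpResponseHeader.ContentType];
            return ResolveEncoding(contentType, Encoding.UTF8).GetString(data);
        }
    }
}
```

In translate you would then replace the DownloadData/GetString pair with a single call to CharsetAwareDownload.DownloadStringWithDetectedCharset(url). Note that this only looks at the HTTP header, not at a meta tag inside the HTML.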
Also note that Google has a Translation API, and using it is probably better than parsing the translation out of an HTML page.