Display Japanese characters in Visual Studio Code using Python
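According to this older answer, Python 3 strings are UTF-8 compliant by default. But in my web scraper, which uses BeautifulSoup, when I try to print or display a URL, the Japanese characters show up as "%E3%81%82" or "%E3%81%B3" instead of the actual characters.
This Japanese website is the one I'm gathering information from, more specifically the URLs that correspond to the links in the clickable letter buttons. When you hover over, for example, あa, your browser shows that the link you're about to click is https://kokugo.jitenon.jp/cat/gojuon.php?word=あ. However, when I extract the ["href"] attribute of the link with BeautifulSoup, I get https://kokugo.jitenon.jp/cat/gojuon.php?word=%E3%81%82.
Both versions link to the same web page, but for debugging purposes I'd like to know whether it's possible to make sure the displayed string contains the actual Japanese characters. If not, how can I convert the string for this purpose?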
Percent-encoding, also known as URL encoding, is a method to encode arbitrary data in a Uniform Resource Identifier (URI)
using only the limited US-ASCII characters legal within a URI.
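To see where a sequence like %E3%81%82 comes from, here is a minimal sketch that percent-encodes あ by hand and compares the result with urllib.parse.quote. The variable names are illustrative only and are not part of the original question.

```python
from urllib.parse import quote

char = "あ"

# Step 1: the character is encoded to its UTF-8 bytes (three bytes for あ).
utf8_bytes = char.encode("utf-8")

# Step 2: each byte is written as '%' followed by two uppercase hex digits.
manual = "".join(f"%{byte:02X}" for byte in utf8_bytes)

# The standard library does the same thing for non-ASCII characters.
encoded = quote(char)

print(utf8_bytes)  # b'\xe3\x81\x82'
print(manual)      # %E3%81%82
print(encoded)     # %E3%81%82
```

So the string BeautifulSoup returns is not corrupted; it is the ASCII-only spelling of the same URL.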
Apply the unquote method from the urllib.parse module:
urllib.parse.unquote(string, encoding='utf-8', errors='replace')
Replace %xx escapes by their single-character equivalent. The optional encoding and errors parameters specify how to decode percent-encoded sequences into Unicode characters, as accepted by the bytes.decode() method.
string may be a str or, since version 3.9, a bytes object (previously only str was supported).
encoding defaults to 'utf-8'. errors defaults to 'replace', meaning invalid sequences are replaced by a placeholder character.
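The effect of the encoding and errors parameters can be illustrated with a short sketch. The truncated input and the Shift_JIS round trip are made-up test cases for illustration, not something taken from the website in the question.

```python
from urllib.parse import quote, unquote

# A truncated escape sequence: the third byte of あ is missing.
truncated = "%E3%81"

# Default errors='replace': the invalid bytes become U+FFFD placeholders.
replaced = unquote(truncated)
print(repr(replaced))

# errors='strict' raises instead of silently substituting a placeholder.
try:
    unquote(truncated, errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
print(strict_failed)

# If a site percent-encodes with a different charset, pass it explicitly.
sjis_encoded = quote("あ", encoding="shift_jis")
right_charset = unquote(sjis_encoded, encoding="shift_jis")
wrong_charset = unquote(sjis_encoded)  # decoded as UTF-8 by default
print(right_charset, repr(wrong_charset))
```

If decoded URLs show placeholder characters, a charset mismatch like the last case is one possible cause worth checking.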
Example:
from urllib.parse import unquote
encodedUrl = 'JapaneseChars%E3%81%82or%E3%81%B3'
decodedUrl = unquote( encodedUrl )
print( decodedUrl )
JapaneseCharsあorび
The unquote method can be applied to almost any string, even one that has already been decoded:
print( unquote(decodedUrl) )
JapaneseCharsあorび
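Applied to the URL from the question, this could look like the sketch below. The helper name readable_url is hypothetical, and the href is hard-coded here; in the scraper it would be the value read from the link's ["href"] attribute.

```python
from urllib.parse import parse_qs, unquote, urlsplit


def readable_url(href: str) -> str:
    """Return href with percent-escapes decoded, intended for display or logging."""
    return unquote(href)


# Stand-in for the value BeautifulSoup returns for the link's href attribute.
href = "https://kokugo.jitenon.jp/cat/gojuon.php?word=%E3%81%82"

print(readable_url(href))

# parse_qs decodes percent-escapes in query values on its own,
# which is handy when only the 'word' parameter is of interest.
query = parse_qs(urlsplit(href).query)
print(query["word"][0])
```

For debugging output the decoded form is easier to read, while it is probably safest to keep passing the original percent-encoded href to whatever code fetches the page.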