How to encode RDF N-Triples string literals?
The RDF N-Triples specification states that string literals must be encoded.
https://www.w3.org/TR/n-triples/#grammar-production-STRING_LITERAL_QUOTE
Does this "encoding" have a name that I can look up and use in my programming language? If not, what does it mean in practice?
The grammar productions you need are right there in the document you linked to:
[9] STRING_LITERAL_QUOTE ::= '"' ([^#x22#x5C#xA#xD] | ECHAR | UCHAR)* '"'
[141s] BLANK_NODE_LABEL ::= '_:' (PN_CHARS_U | [0-9]) ((PN_CHARS | '.')* PN_CHARS)?
[10] UCHAR ::= '\u' HEX HEX HEX HEX | '\U' HEX HEX HEX HEX HEX HEX HEX HEX
[153s] ECHAR ::= '\' [tbnrf"'\]
This means that a string literal begins and ends with a double quote ("). Inside the double quotes, you can have:
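- any character except #x22 (double quote), #x5C (backslash), #xA (line feed), and #xD (carriage return); these four cannot appear raw and have to be written as escapes;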
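- a Unicode character, written as \u followed by four hex digits, or \U followed by eight hex digits; or
- an escape character, i.e. \ followed by one of t, b, n, r, f, ", ', \, which stand for tab, backspace, line feed, carriage return, form feed, double quote, single quote, and backslash respectively.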
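To make the rules above concrete, here is a minimal sketch of an escaper in Python. The function name `escape_ntriples_string` and the `ascii_only` flag are my own (hypothetical) names, not part of any library. It escapes only the four characters the grammar forbids, and optionally rewrites non-ASCII characters as UCHAR sequences; it does not try to handle lone surrogates or other invalid input.

```python
# ECHAR forms for the four characters that STRING_LITERAL_QUOTE excludes.
_ECHAR = {
    '"': '\\"',    # #x22 double quote
    "\\": "\\\\",  # #x5C backslash
    "\n": "\\n",   # #xA  line feed
    "\r": "\\r",   # #xD  carriage return
}


def escape_ntriples_string(value: str, ascii_only: bool = False) -> str:
    """Return `value` wrapped in double quotes with N-Triples escapes applied.

    Hypothetical helper: with ascii_only=True, every character above
    U+007E is written as a UCHAR (\\uXXXX or \\UXXXXXXXX) instead of raw.
    """
    out = []
    for ch in value:
        cp = ord(ch)
        if ch in _ECHAR:
            out.append(_ECHAR[ch])
        elif ascii_only and cp > 0x7E:
            # Four hex digits fit the BMP; anything higher needs eight.
            out.append(f"\\u{cp:04X}" if cp <= 0xFFFF else f"\\U{cp:08X}")
        else:
            out.append(ch)
    return '"' + "".join(out) + '"'


print(escape_ntriples_string('This "Literal" needs escaping!'))
# prints: "This \"Literal\" needs escaping!"
```

Escaping tab, backspace, and form feed is allowed by ECHAR but not required by the grammar, so this sketch leaves them raw.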
In addition to Josh's answer: it is almost always a good idea to normalize Unicode data to NFC. In Java, for example, you can use the following routine:
java.text.Normalizer.normalize("rdf literal", java.text.Normalizer.Form.NFC);
For details, see: http://www.macchiato.com/unicode/nfc-faq
What is NFC?
For various reasons, Unicode sometimes has multiple representations of the same character. For example, each of the following sequences (the first two being single-character sequences) represent the same character:
U+00C5 ( Å ) LATIN CAPITAL LETTER A WITH RING ABOVE
U+212B ( Å ) ANGSTROM SIGN
U+0041 ( A ) LATIN CAPITAL LETTER A + U+030A ( ̊ ) COMBINING RING ABOVE
These sequences are called canonically equivalent. The first of these forms is called NFC - for Normalization Form C, where the C is for composition. For more information on these, see the introduction of UAX #15: Unicode Normalization Forms. A function transforming a string S into the NFC form can be abbreviated as toNFC(S), while one that tests whether S is in NFC is abbreviated as isNFC(S).
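For comparison, the same idea can be sketched in Python with the standard `unicodedata` module. The helper names `to_nfc` and `is_nfc` are hypothetical; they just mirror the toNFC(S) and isNFC(S) abbreviations from the FAQ. The example runs the three sequences quoted above through NFC, and also shows why NFKC (a different form) is probably not what you want for RDF literals, since it rewrites compatibility characters as well.

```python
import unicodedata


def to_nfc(s: str) -> str:
    """Return s in Normalization Form C (hypothetical helper name)."""
    return unicodedata.normalize("NFC", s)


def is_nfc(s: str) -> bool:
    """True if s is already in NFC, checked by normalizing and comparing."""
    return to_nfc(s) == s


# The three canonically equivalent sequences from the FAQ excerpt.
forms = ["\u00C5", "\u212B", "A\u030A"]
for s in forms:
    print([f"U+{ord(c):04X}" for c in s], "->", [f"U+{ord(c):04X}" for c in to_nfc(s)])

# NFKC goes further than NFC: it also folds compatibility characters,
# e.g. the ligature U+FB01 becomes the two letters "fi".
print(unicodedata.normalize("NFKC", "\uFB01") == "fi")
print(to_nfc("\uFB01") == "\uFB01")
```

All three sequences come out as the single code point U+00C5, while the ligature survives NFC untouched.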
You can use Literal#n3()
For example:
# pip install rdflib
>>> from rdflib import Literal
>>> lit = Literal('This "Literal" needs escaping!')
>>> s = lit.n3()
>>> print(s)
"This \"Literal\" needs escaping!"