Swift string indexing combines "\r\n" as one char instead of two

Question

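I am working with strings that contain \r\n in Swift 4.2. I ran into a strange behavior of Swift indexing: it looks like \r\n is treated as one character instead of two by Swift's index methods. I wrote a piece of code to demonstrate this behavior: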

var text = "ABC\r\n\r\nDEF"

func printChar(_ lower: Int, _ upper: Int) {
    let start = text.index(text.startIndex, offsetBy: lower)
    let end = text.index(text.startIndex, offsetBy: upper)
    print("\"" + text[start..<end] + "\"")
}

printChar(0, 1) // "A"
printChar(1, 2) // "B"
printChar(2, 3) // "C"
printChar(3, 4) // new line
printChar(4, 5) // new line (okay, what's going on here?)
printChar(5, 6) // "D"
printChar(6, 7) // "E"
printChar(7, 8) // "F"

The printed result is:

"A"
"B"
"C"
"
"
"
"
"D"
"E"
"F"

Any idea why this happens?

Answer 1

TLDR: \r\n is a grapheme cluster, and Swift treats it as a single Character because of Unicode.

Swift treats \r\n as one Character.
Objective-C NSString treats it as two characters (going by the result of length).
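To make the difference concrete, here is a minimal sketch that compares the counts for the same two-scalar string. The constant names are only illustrative, and NSString(string:) is used instead of bridging so the snippet does not depend on the Apple-only `as NSString` cast.

```swift
import Foundation

let crlf = "\r\n"

// Swift's count is the number of extended grapheme clusters (Characters).
let characterCount = crlf.count
// The Unicode scalar view sees CR and LF as separate scalars.
let scalarCount = crlf.unicodeScalars.count
// The UTF-16 view sees one code unit for each of them.
let utf16Count = crlf.utf16.count
// NSString.length counts UTF-16 code units.
let nsLength = NSString(string: crlf).length

print(characterCount, scalarCount, utf16Count, nsLength) // prints "1 2 2 2"
```

Only the Character-based count collapses the pair; every other view still reports two units.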

On the swift-users forum someone wrote:

– "\r\n" is a single Character. Is this the correct behaviour?

– Yes, a Character corresponds to a Unicode grapheme cluster, and "\r\n" is considered a single grapheme cluster.

A subsequent reply posted a link to the Unicode documentation; check out this table, which officially states that CRLF is a grapheme cluster.

Take a look at the Apple documentation on Characters and Grapheme Clusters.

It's common to think of a string as a sequence of characters, but when working with NSString objects, or with Unicode strings in general, in most cases it is better to deal with substrings rather than with individual characters. The reason for this is that what the user perceives as a character in text may in many cases be represented by multiple characters in the string.

The Swift documentation on Strings and Characters is also worth reading.

This overview from objc.io is interesting as well.

NSString represents UTF-16-encoded text. Length, indices, and ranges are all based on UTF-16 code units.

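Another example is an emoji like this: 👍🏻. This single character is actually %uD83D%uDC4D%uD83C%uDFFB, which is four UTF-16 code units encoding two Unicode scalars (U+1F44D and U+1F3FB). But if you call count on a string containing just that emoji, you will (correctly) get 1.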
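As a rough check of the emoji example, the sketch below builds the string from its two scalars, U+1F44D (thumbs up) and U+1F3FB (light skin tone modifier), and inspects each view. The constant names are illustrative, and the result assumes Swift 4 or later, where emoji modifier sequences form a single grapheme cluster.

```swift
// U+1F44D (thumbs up) followed by U+1F3FB (light skin tone modifier).
let thumbsUp = "\u{1F44D}\u{1F3FB}"

// One user-perceived character...
let emojiCharacterCount = thumbsUp.count
// ...made of two Unicode scalars...
let emojiScalarCount = thumbsUp.unicodeScalars.count
// ...which UTF-16 encodes as two surrogate pairs, i.e. four code units.
let emojiUTF16Units = thumbsUp.utf16.map { String($0, radix: 16, uppercase: true) }

print(emojiCharacterCount, emojiScalarCount, emojiUTF16Units.count) // prints "1 2 4"
print(emojiUTF16Units.joined(separator: " "))                       // prints "D83D DC4D D83C DFFB"
```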

If you want to see the scalars, you can iterate over them like this:

for scalar in text.unicodeScalars {
    print("\(scalar.value) ", terminator: "")
}

For "\r\n" this gives you 13 10.

The Swift documentation explains why NSString is different:

The count of the characters returned by the count property isn’t always the same as the length property of an NSString that contains the same characters. The length of an NSString is based on the number of 16-bit code units within the string’s UTF-16 representation and not the number of Unicode extended grapheme clusters within the string.

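So this is not really "strange" behavior of Swift string indexing, but a result of how Unicode treats these characters and how String in Swift is designed. Swift string indexing goes by Character, and \r\n is a single Character.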
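If you do need to address \r and \n separately, one option is to work on the unicodeScalars view rather than on Characters. The following is a minimal sketch using the string from the question; copying each view into an Array is an assumption made here for simple integer subscripting, and it costs a linear-time copy.

```swift
let text = "ABC\r\n\r\nDEF"

// Indexing by Character: each "\r\n" pair is a single element.
let characters = Array(text)
// Indexing by Unicode scalar: CR and LF are separate elements.
let scalars = Array(text.unicodeScalars)

print(characters.count, scalars.count)     // prints "8 10"
print(characters[3] == "\r\n")             // prints "true"
print(scalars[3].value, scalars[4].value)  // prints "13 10"
```

This matches the output in the question: offsets 3 and 4 each select a whole \r\n Character, while the scalar view splits the first pair into 13 and 10.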
