Cleaning up text strings in Swift

Question

I want to use some text in my app that is a bit messy. I have no control over the text, so it is what it is.

I am looking for a lightweight¹ way to clean up everything shown in the examples here:

original: <p>Occasionally we&nbsp;deal&nbsp;with this.</p>                       desired: Occasionally we deal with this.
original: <p>Sometimes they \emphasize\ like this, I could live with it</p>      desired: Sometimes they emphasize like this, I could live with it
original: <p>This is junk, but it's what I have<\/p>\r\n                         desired: This is junk, but it's what I have
original: <p>This is test1</p>                                                   desired: This is test1
original: <p>This is u\u00f1icode</p>                                            desired: This is uñicode

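So we see special characters such as &nbsp;, unicode escapes such as \u00f1, HTML paragraph tags such as <p> and </p>, newline sequences such as \r\n, and some odd stray backslashes \. What is needed is to translate whatever is translatable and remove the rest of the junk.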

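While I could manipulate the strings directly and handle each of these cases individually, I am wondering whether there is a simple¹ way to clean up these strings without too much overhead¹.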

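A partial answer has been offered, but the examples I provided raise more issues than it covers. That solution translates HTML special characters, but it does not handle unicode in the \u0000 format, does not remove HTML tags, and so on.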

Other things I have tried

This is not the general solution I have been looking for, but it shows the direction toward solving the problem.

import Foundation

// Backslashes are doubled so that each literal holds the raw text shown in the examples above.
let samples = ["<p>This is test1</p>                                             ":"This is test1",
           "<p>This is u\\u00f1icode</p>                                      ":"This is uñicode",
           "<p>This is u&#x00f1;icode</p>                                       ":"This is uñicode",
           "<p>This is junk, but it's what I have<\\/p>\\r\\n                   ":"This is junk, but it's what I have",
           "<p>Sometimes they \\emphasize\\ like this, I could live with it</p>":"Sometimes they emphasize like this, I could live with it",
           "<p>Occasionally we&nbsp;deal&nbsp;with this.</p>                 ":"Occasionally we deal with this."]

for (key, value) in samples {
    print("original: \(key)      desired: \(value)")
}

print("\n\n\n")

for (key, _) in samples {
    var _key = key.trimmingCharacters(in: CharacterSet.whitespaces)
    // Turn the escaped slash "\/" into a plain "/".
    _key = _key.replacingOccurrences(of: "\\/", with: "/")

    // The trailing "\r\n" is four literal characters here, not a real line break.
    if _key.hasSuffix("\\r\\n") { _key = String(_key.dropLast(4)) }
    if _key.hasPrefix("<p>") { _key = String(_key.dropFirst(3)) }
    if _key.hasSuffix("</p>") { _key = String(_key.dropLast(4)) }

    // Rewrite each \uXXXX as the numeric entity &#xXXXX; so the HTML decoder can handle it.
    while let uniRange = _key.range(of: "\\u") {
        let charDefRange = uniRange.upperBound..<_key.index(uniRange.upperBound, offsetBy: 4)
        let uniFullRange = uniRange.lowerBound..<charDefRange.upperBound
        let charDef = "&#x" + _key[charDefRange] + ";"

        _key = _key.replacingCharacters(in: uniFullRange, with: charDef)
    }

    // stringByDecodingHTMLEntities is not defined here; it is assumed to be
    // the String extension from the partial answer mentioned above.
    let decoded = _key.stringByDecodingHTMLEntities
    print("decoded: \(decoded)")
}

Output

original: <p>Occasionally we&nbsp;deal&nbsp;with this.</p>                       desired: Occasionally we deal with this.
original: <p>Sometimes they \emphasize\ like this, I could live with it</p>      desired: Sometimes they emphasize like this, I could live with it
original: <p>This is u&#x00f1;icode</p>                                          desired: This is uñicode
original: <p>This is junk, but it's what I have<\/p>\r\n                         desired: This is junk, but it's what I have
original: <p>This is test1</p>                                                   desired: This is test1
original: <p>This is u\u00f1icode</p>                                            desired: This is uñicode




decoded: Occasionally we deal with this.
decoded: Sometimes they \emphasize\ like this, I could live with it
decoded: This is uñicode
decoded: This is junk, but it's what I have
decoded: This is test1
decoded: This is uñicode

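Footnote: 1. There are probably many larger packages or libraries that can do this as a small part of their overall functionality; those are of less interest here.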

Answer 1

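I have not been able to make sense of the odd backslashes, but to remove the HTML tags, HTML entities, and escapes, you can use regular expressions for the following replacements: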

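Note that you will need a dictionary of HTML entities, otherwise this will not work. The number of escapes is small, so creating a full dictionary of them is not complicated.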

import Foundation

// Backslashes are doubled so that each literal holds the raw text from the question.
let strings = [
    "<p>Occasionally we&nbsp;deal&nbsp;with this.</p> ",
    "<p>Sometimes they \\emphasize\\ like this, I could live with it</p>",
    "<p>This is junk, but it's what I have<\\/p>\\r\\n",
    "<p>This is test1</p>",
    "<p>This is u\\u00f1icode</p>",
]

// the pattern needs exactly one capture group
func replaceEntities(in text: String, pattern: String, replace: (String) -> String?) -> String {
    let buffer = NSMutableString(string: text)
    let regularExpression = try! NSRegularExpression(pattern: pattern, options: .caseInsensitive)

    let matches = regularExpression.matches(in: text, options: [], range: NSRange(location: 0, length: buffer.length))

    // need to replace from the end or the ranges will break after first replacement
    for match in matches.reversed() {
        let captureGroupRange = match.range(at: 1)
        let matchedEntity = buffer.substring(with: captureGroupRange)
        // a nil replacement means "leave this match untouched"
        guard let replacement = replace(matchedEntity) else {
            continue
        }
        buffer.replaceCharacters(in: match.range, with: replacement)
    }

    return buffer as String
}

let htmlEntities = [
    "nbsp": "\u{00A0}"
]

func replaceHtmlEntities(_ text: String) -> String {
    return replaceEntities(in: text, pattern: "&([^;]+);") {
        return htmlEntities[$0]
    }
}

let escapeSequences = [
    "n": "\n",
    "r": "\r"
]

func replaceEscapes(_ text: String) -> String {
    // regex \\([a-z]) : a literal backslash followed by one letter
    return replaceEntities(in: text, pattern: "\\\\([a-z])") {
        return escapeSequences[$0]
    }
}

func removeTags(_ text: String) -> String {
    return text
        .replacingOccurrences(of: "<[^>]+>", with: "", options: .regularExpression)
}

func replaceUnicodeSequences(_ text: String) -> String {
    // regex \\u([0-9a-f]{4}) : a literal "\u" followed by four hex digits
    return replaceEntities(in: text, pattern: "\\\\u([0-9a-f]{4})") {
        guard let value = UInt32($0, radix: 16), let scalar = Unicode.Scalar(value) else {
            return nil
        }
        return String(scalar)
    }
}

let purifiedStrings = strings
    .map(removeTags)
    .map(replaceHtmlEntities)
    .map(replaceEscapes)
    .map(replaceUnicodeSequences)

print(purifiedStrings.joined(separator: "\n"))
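The replacements above leave the stray backslashes from the `\emphasize\` sample in place, because no escape is defined for them. If simply dropping them is acceptable, a minimal sketch could look like the following. The name `removeStrayBackslashes` is hypothetical, and the function assumes it runs as the last step, after the escapes and `\uXXXX` sequences have already been translated.

```swift
import Foundation

// Hypothetical helper: deletes every backslash still present in the text.
// Assumption: escapes and \uXXXX sequences were already translated,
// so whatever backslashes remain carry no meaning.
func removeStrayBackslashes(_ text: String) -> String {
    return text.replacingOccurrences(of: "\\", with: "")
}

let emphasized = "Sometimes they \\emphasize\\ like this, I could live with it"
print(removeStrayBackslashes(emphasized))
// prints: Sometimes they emphasize like this, I could live with it
```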

You can also trim leading/trailing whitespace and replace multiple spaces with a single space, but that is trivial.
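For completeness, one way that trimming and collapsing step might be written is sketched below. The name `normalizeWhitespace` is hypothetical. The pattern lists U+00A0 explicitly so that the non-breaking spaces produced from `&nbsp;` become ordinary spaces as well.

```swift
import Foundation

// Hypothetical helper: trims both ends, then collapses each run of
// whitespace (including non-breaking spaces) into one ordinary space.
func normalizeWhitespace(_ text: String) -> String {
    return text
        .trimmingCharacters(in: .whitespacesAndNewlines)
        .replacingOccurrences(of: "[\\s\\u00A0]+", with: " ", options: .regularExpression)
}

print(normalizeWhitespace("  This is junk,   but it's what I have \r\n"))
// prints: This is junk, but it's what I have
```

It could be appended to the chain as one more `.map(normalizeWhitespace)` step.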

You can combine it with the solution in How do I decode HTML entities in swift?
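As an illustration of what such a combination could cover, the `htmlEntities` dictionary only handles named entities, whereas numeric character references such as `&#x00f1;` can be resolved without any table. The sketch below is not the code from the linked answer. `decodeNumericEntity` is a hypothetical helper that takes the text between `&` and `;`, which is what the capture group in `replaceHtmlEntities` yields.

```swift
import Foundation

// Hypothetical helper: resolves the body of a numeric character reference,
// e.g. "#x00f1" (hexadecimal) or "#241" (decimal).
// Returns nil for anything that is not a valid numeric reference.
func decodeNumericEntity(_ body: String) -> String? {
    guard body.hasPrefix("#") else { return nil }
    let digits = body.dropFirst()
    let value: UInt32?
    if digits.hasPrefix("x") || digits.hasPrefix("X") {
        value = UInt32(digits.dropFirst(), radix: 16)
    } else {
        value = UInt32(digits, radix: 10)
    }
    // Unicode.Scalar(_:) fails for surrogates and out-of-range values.
    guard let codePoint = value, let scalar = Unicode.Scalar(codePoint) else {
        return nil
    }
    return String(scalar)
}

print(decodeNumericEntity("#x00f1") ?? "not a numeric reference")
// prints: ñ
```

Inside the closure of `replaceHtmlEntities`, the dictionary lookup could then fall back to this helper when a name is not found.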
