Apple's Natural Language API returns unexpected results

Question

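I'm trying to figure out why Apple's Natural Language API returns unexpected results.

What am I doing wrong? Is it a grammar issue?

I have the following four strings, and I want to extract the "stem form" of each word.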

    // text 1 has two "accredited" in a different order
    let text1: String = "accredit accredited accrediting accredited accreditation accredits"
    
    // text 2 has three "accredited" in a different order
    let text2: String = "accredit accredits accredited accrediting accredited accredited accreditation"
    
    // text 3 has "accreditation"
    let text3: String = "accreditation"
    
    // text 4 has "accredited"
    let text4: String = "accredited"

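The problem is with the words accreditation and accredited.

The word accreditation never returns the stem. The word accredited returns different results depending on the order of the words in the string, as shown for Text 1 and Text 2 in the attached image.

I used the code from Apple's documentation.

Here is the complete code in SwiftUI: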

import SwiftUI
import NaturalLanguage

struct ContentView: View {
    
    // text 1 has two "accredited" in a different order
    let text1: String = "accredit accredited accrediting accredited accreditation accredits"
    
    // text 2 has three "accredited" in a different order
    let text2: String = "accredit accredits accredited accrediting accredited accredited accreditation"
    
    // text 3 has "accreditation"
    let text3: String = "accreditation"
    
    // text 4 has "accredited"
    let text4: String = "accredited"
    
    var body: some View {
        ScrollView {
            VStack {
                
                Text("Text 1").bold()
                tagText(text: text1, scheme: .lemma).padding(.bottom)
                
                Text("Text 2").bold()
                tagText(text: text2, scheme: .lemma).padding(.bottom)
                
                Text("Text 3").bold()
                tagText(text: text3, scheme: .lemma).padding(.bottom)
                
                Text("Text 4").bold()
                tagText(text: text4, scheme: .lemma).padding(.bottom)
                
            }
        }
    }
    
    // MARK: - tagText
    func tagText(text: String, scheme: NLTagScheme) -> some View {
        VStack {
            ForEach(partsOfSpeechTagger(for: text, scheme: scheme)) { word in
                Text(word.description)
            }
        }
    }
    
    // MARK: - partsOfSpeechTagger
    func partsOfSpeechTagger(for text: String, scheme: NLTagScheme) -> [NLPTagResult] {
        
        var listOfTaggedWords: [NLPTagResult] = []
        let tagger = NLTagger(tagSchemes: [scheme])
        tagger.string = text
        
        let range = text.startIndex..<text.endIndex
        let options: NLTagger.Options = [.omitPunctuation, .omitWhitespace]
        
        tagger.enumerateTags(in: range, unit: .word, scheme: scheme, options: options) { tag, tokenRange in
            
            if let tag = tag {
                let word: String = String(text[tokenRange])
                let result = NLPTagResult(word: word, tag: tag)
                
                //if !word.localizedCaseInsensitiveContains(tag.rawValue) {
                listOfTaggedWords.append(result)
                //}
            }
            return true
        }
        
        return listOfTaggedWords
    }
    
    // MARK: - NLPTagResult
    struct NLPTagResult: Identifiable, Equatable, Hashable, Comparable {
        var id = UUID()
        var word: String
        var tag: NLTag?
        
        var description: String {
            var newString: String = "\(word)"
            
            if let tag = tag {
                newString += " : \(tag.rawValue)"
            }
            
            return newString
        }
        
        // MARK: - Equatable & Hashable requirements
        static func == (lhs: Self, rhs: Self) -> Bool {
            lhs.id == rhs.id
        }
        
        func hash(into hasher: inout Hasher) {
            hasher.combine(id)
        }
        
        // MARK: - Comparable requirements
        static func <(lhs: NLPTagResult, rhs: NLPTagResult) -> Bool {
            lhs.id.uuidString < rhs.id.uuidString
        }
    }
    
}

// MARK: - Previews
struct ContentView_Previews: PreviewProvider {
    static var previews: some View {
        ContentView()
    }
}

Thanks for your help!

Answer 1

As for why the tagger doesn't find "accredit" from "accreditation": the .lemma scheme finds the lemma of a word, which is not actually the stem. See the difference between stem and lemma on Wikipedia:

The stem is the part of the word that never changes even when morphologically inflected; a lemma is the base form of the word. For example, from "produced", the lemma is "produce", but the stem is "produc-". This is because there are words such as production and producing. In linguistic analysis, the stem is defined more generally as the analyzed base form from which all inflected forms can be formed.

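The documentation uses the word "stem", but I do think the lemma is what is intended here, and getting "accreditation" back is the expected behaviour. See the Usage section of the Wikipedia article for "Word stem" for more info. The lemma is the dictionary form of a word, and "accreditation" has a dictionary entry, whereas something like "accredited" doesn't. Whatever you call these things, the point is that there are two distinct concepts: the tagger gives you one of them, but you are expecting the other.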
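To make the stem/lemma distinction concrete, here is a deliberately crude suffix stripper. It is only a sketch: `naiveStem` and its suffix list are hypothetical names invented for this illustration, they are not part of the NaturalLanguage framework, and this is not a real stemming algorithm such as Porter's. It simply shows the kind of result a stemmer aims for, as opposed to the dictionary form that .lemma returns.

```swift
// Toy stemmer: strips one of a few hard-coded English suffixes.
// Hypothetical helper for illustration only; not an Apple API.
func naiveStem(_ word: String) -> String {
    // Longer suffixes are listed first so "ation" is tried before "s".
    let suffixes = ["ation", "ing", "ed", "s"]
    let lower = word.lowercased()
    // Only strip when at least three characters would remain.
    for suffix in suffixes where lower.hasSuffix(suffix) && lower.count - suffix.count >= 3 {
        return String(lower.dropLast(suffix.count))
    }
    return lower
}

let sampleWords = ["accredit", "accredits", "accredited", "accrediting", "accreditation"]
for word in sampleWords {
    print("\(word) -> \(naiveStem(word))")
}
```

With this suffix list, every word in the sample reduces to "accredit", including "accreditation". A real application would want a proper stemming library; the suffix list above will mangle many ordinary words.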

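As for why the order of the words matters: the tagger tries to analyse your words as "natural language", rather than analysing each word individually, so of course word order matters. If you use .lexicalClass, you'll see that it thinks the third word in text2 is an adjective. That explains why it doesn't think the dictionary form of that word is "accredit", because adjectives don't conjugate like that. Note that accredited is listed as an adjective in the dictionary. So, "is it a grammar issue?" Exactly.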
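One way to check this yourself is to ask the tagger for both schemes at once and print them side by side. The following is a minimal sketch, assuming a platform where the NaturalLanguage framework is available; the function name `lexicalClassesAndLemmas` is made up for this example, and the tags you actually get depend on the language model shipped with your OS version.

```swift
import NaturalLanguage

// Returns the word, its lexical class, and its lemma for every word token.
// Hypothetical helper name; the NLTagger calls themselves are the framework's.
func lexicalClassesAndLemmas(in text: String) -> [(word: String, lexicalClass: String, lemma: String)] {
    // Register both schemes so one tagger can answer both questions.
    let tagger = NLTagger(tagSchemes: [.lexicalClass, .lemma])
    tagger.string = text

    let range = text.startIndex..<text.endIndex
    let options: NLTagger.Options = [.omitPunctuation, .omitWhitespace]

    // First pass: lexical class for each word token.
    let classTags = tagger.tags(in: range, unit: .word, scheme: .lexicalClass, options: options)

    // Second pass: look up the lemma at the start of each token range.
    return classTags.map { tag, tokenRange in
        let lemmaTag = tagger.tag(at: tokenRange.lowerBound, unit: .word, scheme: .lemma).0
        return (word: String(text[tokenRange]),
                lexicalClass: tag?.rawValue ?? "none",
                lemma: lemmaTag?.rawValue ?? "none")
    }
}

let text2 = "accredit accredits accredited accrediting accredited accredited accreditation"
let rows = lexicalClassesAndLemmas(in: text2)
for row in rows {
    print("\(row.word) | \(row.lexicalClass) | \(row.lemma)")
}
```

Running the same function on text1, text3 and text4 from the question should make it easier to see which occurrences of "accredited" are treated as verbs and which as adjectives.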

nlp

swift

swiftui