Solr Spellcheck Query term modification

I am using Solr for spellchecking, with both DirectSolrSpellChecker and WordBreakSolrSpellChecker enabled. I have the following problems:

a. When I query "worry", Solr converts the term to "worri" and returns suggestions for that form instead. Whenever a word ends in "y" ("injury", "worry", etc.), the trailing "y" is replaced with "i".

Sample query:

http://localhost:8983/solr/MY_CORE/spell?df=text&spellcheck.q=worry&spellcheck=true&spellcheck.extendedResults=true&spellcheck.onlyMorePopular=true

Solr result:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">5</int>
  </lst>
  <result name="response" numFound="0" start="0"/>
  <lst name="spellcheck">
    <lst name="suggestions">
      <lst name="worri">
        <int name="numFound">9</int>
        <int name="startOffset">0</int>
        <int name="endOffset">5</int>
        <int name="origFreq">5</int>
        <arr name="suggestion">
          <lst>
            <str name="word">wo r ri</str>
            <int name="freq">90</int>
          </lst>
          <lst>
            <str name="word">worst</str>
            <int name="freq">12</int>
          </lst>
          <lst>
            <str name="word">wo r r i</str>
            <int name="freq">5246</int>
          </lst>
          <lst>
            <str name="word">work</str>
            <int name="freq">2920</int>
          </lst>
          <lst>
            <str name="word">w o r ri</str>
            <int name="freq">530</int>
          </lst>
          <lst>
            <str name="word">worn</str>
            <int name="freq">81</int>
          </lst>
          <lst>
            <str name="word">w o r r i</str>
            <int name="freq">5246</int>
          </lst>
          <lst>
            <str name="word">wors</str>
            <int name="freq">79</int>
          </lst>
          <lst>
            <str name="word">worm</str>
            <int name="freq">10</int>
          </lst>
        </arr>
      </lst>
    </lst>
    <bool name="correctlySpelled">false</bool>
  </lst>
</response>

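b. The output above also contains entries like "w o r r i". I cannot find any of these in the Solr field, and I do not understand why Solr returns suggestions whose letters are separated by spaces.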

Here is the schema file:

<field name="MY FIELD" type="text_en" multiValued="false" indexed="true" stored="true"/>

The config file is as follows:

<!-- a spellchecker built from a field of the main index -->
        <lst name="spellchecker">
            <str name="name">default</str>
            <str name="field">MY FIELD</str>
            <str name="classname">solr.DirectSolrSpellChecker</str>
            <!-- the spellcheck distance measure used, the default is the internal levenshtein -->
            <str name="distanceMeasure">internal</str>
            <!-- minimum accuracy needed to be considered a valid spellcheck suggestion -->
            <float name="accuracy">0.5</float>
            <!-- the maximum #edits we consider when enumerating terms: can be 1 or 2 -->
            <int name="maxEdits">2</int>
            <!-- the minimum shared prefix when enumerating terms -->
            <int name="minPrefix">1</int>
            <!-- maximum number of inspections per result. -->
            <int name="maxInspections">5</int>
            <!-- minimum length of a query term to be considered for correction -->
            <int name="minQueryLength">4</int>
            <!-- maximum threshold of documents a query term can appear to be considered for correction -->
            <float name="maxQueryFrequency">0.01</float>
            <!-- uncomment this to require suggestions to occur in 1% of the documents
             <float name="thresholdTokenFrequency">.01</float>
             -->
        </lst>

        <!-- a spellchecker that can break or combine words.  See "/spell" handler below for usage -->

        <lst name="spellchecker">
            <str name="name">wordbreak</str>
            <str name="classname">solr.WordBreakSolrSpellChecker</str>
            <str name="field">MY FIELD</str>
            <str name="combineWords">false</str>
            <str name="breakWords">true</str>
            <int name="maxChanges">10</int>
        </lst>

    </searchComponent>


    <requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
        <lst name="defaults">
            <str name="spellcheck.dictionary">default</str>
            <str name="spellcheck.dictionary">wordbreak</str>
            <str name="spellcheck">on</str>
            <str name="spellcheck.extendedResults">true</str>
            <str name="spellcheck.count">10</str>
            <str name="spellcheck.alternativeTermCount">5</str>
            <str name="spellcheck.maxResultsForSuggest">5</str>
            <str name="spellcheck.collate">false</str>
            <str name="spellcheck.collateExtendedResults">false</str>
            <str name="spellcheck.maxCollationTries">10</str>
            <str name="spellcheck.maxCollations">5</str>
        </lst>
        <arr name="last-components">
            <str>spellcheck</str>
        </arr>
    </requestHandler>

I would really appreciate it if anyone could help me.

Thanks in advance!

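You are getting "strange" suggestions like "wo r r i" because you are using WordBreakSolrSpellChecker, which breaks tokens apart in an attempt to offer spellcheck suggestions. If you remove it, you should no longer get these suggestions. Here is a quote from the official documentation: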

WordBreakSolrSpellChecker offers suggestions by combining adjacent query terms and/or breaking terms into multiple words. It is a SpellCheckComponent enhancement, leveraging Lucene's WordBreakSpellChecker. It can detect spelling errors resulting from misplaced whitespace without the use of shingle-based dictionaries and provides collation support for word-break errors, including cases where the user has a mix of single-word spelling errors and word-break errors in the same query. It also provides shard support.
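If the word-break suggestions are not wanted, one way to drop them is to stop listing the wordbreak dictionary in the /spell handler defaults. The following is only a minimal sketch based on the handler from the question: the collation parameters are left out for brevity, and the only intended change is that the second spellcheck.dictionary entry is gone.

```xml
<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
    <lst name="defaults">
        <!-- only the DirectSolrSpellChecker dictionary; "wordbreak" is no longer listed -->
        <str name="spellcheck.dictionary">default</str>
        <str name="spellcheck">on</str>
        <str name="spellcheck.extendedResults">true</str>
        <str name="spellcheck.count">10</str>
    </lst>
    <arr name="last-components">
        <str>spellcheck</str>
    </arr>
</requestHandler>
```

The wordbreak spellchecker definition can stay in the searchComponent; as long as no request or default selects it, it should not contribute suggestions. Passing spellcheck.dictionary=default on the request URL should have the same effect for a single query, since request parameters take precedence over handler defaults.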

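So, basically, in your example you get normal suggestions from the Solr index, such as: worst, work, worm, worn, wors. All the others are just the output of WordBreakSolrSpellChecker, and you will never find them in the index.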
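As for (a), the "worri" form is most likely the result of stemming. In the stock Solr example schema, text_en applies a Porter stemmer (solr.PorterStemFilterFactory), which rewrites "worry" to "worri", and the spellchecker analyzes the spellcheck.q term with the analyzer of the field it is built on. A common workaround is to build the spellcheck dictionary from a separate, non-stemmed field. The sketch below assumes that this is what is happening in your setup; the names text_spell and spell are hypothetical, and MY FIELD is the placeholder from the question.

```xml
<!-- schema: a lightly analyzed type with no stemming (hypothetical name "text_spell") -->
<fieldType name="text_spell" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

<!-- hypothetical field that only feeds the spellcheckers -->
<field name="spell" type="text_spell" indexed="true" stored="false" multiValued="true"/>

<!-- copy the original, unanalyzed text into the spellcheck field at index time -->
<copyField source="MY FIELD" dest="spell"/>
```

Both spellcheckers in the config file would then point at the new field with <str name="field">spell</str>, and the documents need to be reindexed so that the field is populated. With that in place, "worry" should reach the spellchecker unchanged.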