How does Solr rank documents?
I indexed my document text in Solr using the following configuration:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <!-- in this example, we will only use synonyms at query time <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/> -->
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
<field name="desc" type="text_general" indexed="true" stored="true" multiValued="false"/>
A test query such as
desc:Alabama Crimson Tide Toddler Crimson Team Logo Flannel Pajama Pants
returns the following as its top 2 results:
{
"id":"_:node1b897e5ffccc354e5da5128066e2e9e4|https://www.crookscountry.com/product/alabama-greatest-hits",
"name":"Alabama - Greatest Hits",
"source_entity_index":"prod03",
"category":"",
"category_str":"",
"desc":"Alabama ~ Alabama - Greatest Hits",
"host":"www.crookscountry.com",
"url":"https://www.crookscountry.com/product/alabama-greatest-hits",
"_version_":1652845859059007489},
{
"id":"_:noded8c4ca8e98bb12e1132af18c76f277b|https://shop.spreadshirt.com/thatshirtcray/amateur+sketch+shirt-A12174934",
"name":"Amateur Sketch Shirt | Men's T-Shirt",
"source_entity_index":"prod03",
"category":"",
"category_str":"",
"desc":"Leprechaun in Alabama amateur sketch.",
"host":"shop.spreadshirt.com",
"url":"https://shop.spreadshirt.com/thatshirtcray/amateur+sketch+shirt-A12174934",
"_version_":1652846254331265025},
However, the documents I actually want ranked highly appear below the top 100, for example:
{
"id":"_:nodec65a89504cb5f3af808caf654ac7cb72|http://shop.rolltide.com/Alabama_Crimson_Tide_Sweatshirts_And_Fleece_Sweaters",
"host":"shop.rolltide.com",
"name":"Men's Crimson Alabama Crimson Tide Big Logo Sweater",
"text":"Show off your team spirit with this Alabama Crimson Tide Big Logo sweater.",
"_version_":1646377538225700866},
{
"id":"_:nodeebc0adb5a11937556ebdf77132fab580|http://shop.foxsports.com/FOX_Alabama_Crimson_Tide_Sweaters_And_Dress_Shirts",
"host":"shop.foxsports.com",
"name":"Men's Crimson Alabama Crimson Tide Big Logo Sweater",
"text":"Show off your team spirit with this Alabama Crimson Tide Big Logo sweater.",
"_version_":1646383652576165892},
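I don't quite understand how the default Solr ranking works. It seems to favour short text, even if there is only one overlapping word with the query. Is there any way I can change this to suit my needs?
Many thanks!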
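Solr document ranking relies on the Lucene Similarity.
"it seems that it favours short text, even if there is only one overlapping word with the query"
This behaviour is caused by field length normalization: a match in a short field scores higher than the same match in a long field. You can set omitNorms=true on the field to disable field length normalization (see https://lucene.apache.org/solr/guide/6_6/field-type-definitions-and-properties.html#field-default-properties).
For a more in-depth explanation, see this post.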
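As a minimal sketch of that change, the `desc` field from the question could carry the attribute directly. Only `omitNorms="true"` is new here; everything else is copied from the question's schema. Norms are written at index time, so the documents would need to be reindexed before the ranking changes.

```xml
<!-- Same field as in the question, with field length normalization switched off. -->
<field name="desc" type="text_general" indexed="true" stored="true" multiValued="false" omitNorms="true"/>
```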
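Alternatively, or in addition, if you use the (e)dismax query parser you can use the mm (MinimumShouldMatch) parameter to adjust how Solr matches documents. Note that mm controls which documents match, not how they are ranked.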
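For illustration, `mm` can be set as a default on a request handler in solrconfig.xml. This is a sketch rather than a recommended setting: the handler name `/select`, the `qf` value, and the `50%` threshold are assumptions chosen for this example and should be tuned against your own data.

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- use the extended dismax query parser -->
    <str name="defType">edismax</str>
    <!-- search the desc field from the question -->
    <str name="qf">desc</str>
    <!-- assumed threshold: half of the optional query terms must match -->
    <str name="mm">50%</str>
  </lst>
</requestHandler>
```

With `qf` set, the test query can be sent as plain terms without the `desc:` prefix, and documents that share only a single word with it, such as "Alabama - Greatest Hits", should no longer match.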