Solr is giving wrong field length
My feature list is as follows:
[
{
"store": "myfeature_store",
"name" : "titleLength",
"class" : "org.apache.solr.ltr.feature.FieldLengthFeature",
"params" : {
"field":"title"
}
}
]
When I run the following query:
curl -g 'http://localhost:8983/solr/nutch/select?indent=on&q=python&wt=json&fl=title,score,[features%20efi.query=python%20store=myfeature_store]'
I get the following result:
{
"responseHeader":{
"status":0,
"QTime":8,
"params":{
"q":"python",
"indent":"on",
"fl":"title,score,[features efi.query=python store=myfeature_store]",
"wt":"json"}},
"response":{"numFound":793,"start":0,"maxScore":0.33828905,"docs":[
{
"title":"Newest 'python' Questions - Stack Overflow",
"score":0.33828905,
"[features]":"titleLength=1820.4445"},
{
"title":"Newest 'python-3.x' Questions - Stack Overflow",
"score":0.14434122,
"[features]":"titleLength=5349.8774"},
{
"title":"Geographic Information Systems Stack Exchange",
"score":0.08331977,
"[features]":"titleLength=1820.4445"},
{
"title":"Stack Overflow em Português",
"score":0.08331977,
"[features]":"titleLength=1820.4445"},
{
"title":"Stack Overflow en español",
"score":0.07460209,
"[features]":"titleLength=2621.44"},
{
"title":"Hot Questions - Stack Exchange",
"score":0.06534503,
"[features]":"titleLength=655.36"},
{
"title":"Code Review Stack Exchange",
"score":0.05356382,
"[features]":"titleLength=1820.4445"},
{
"title":"Software Recommendations Stack Exchange",
"score":0.05356382,
"[features]":"titleLength=1820.4445"},
{
"title":"Raspberry Pi Stack Exchange",
"score":0.042962566,
"[features]":"titleLength=1820.4445"},
{
"title":"Welcome to The Apache Software Foundation!",
"score":0.042862184,
"[features]":"titleLength=455.1111"}]
}}
As you can see, titleLength is completely wrong. For example, for the last result the title is Welcome to The Apache Software Foundation!, so titleLength should be 5, but it is 455.1111. Where could the problem be?
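The titleLength handler uses the norms stored for the field. These norms are mapped to a lookup table of floats with 256 possible values. The values are not expected to be exact (since the length of a field can be larger than 256); their purpose is to map the whole space of 2^31 integer values into a single byte.
The norm also includes any index-time boost, so if a field was boosted at index time (for example by a Nutch plugin), that boost is reflected in the norm stored for the field. You can't rely on titleLength being the exact number of terms stored in that field for the document; it is closer to a representation of the field's "boost".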
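To make the effect concrete, here is a minimal Python sketch of a one-byte norm encoding in the style of Lucene's SmallFloat byte315 scheme. It is a re-implementation for illustration, not the Solr source. It assumes the older Lucene behaviour in which the stored norm is boost / sqrt(length) and the feature reports 1 / norm^2 after decoding; the helper names and the boost value 0.11 are hypothetical and were chosen only because they reproduce a number seen in the question.

```python
import math
import struct

# Offset between the float32 exponent and the one-byte exponent
# (assumption: mirrors the byte315 layout with zero exponent 15).
_ZERO = (63 - 15) << 3


def float_to_byte315(f: float) -> int:
    """Lossily pack a float into a single byte (0..255)."""
    bits = struct.unpack(">i", struct.pack(">f", f))[0]
    small = bits >> (24 - 3)  # keep only the top bits of the float32
    if small <= _ZERO:
        return 0 if bits <= 0 else 1  # zero/negative -> 0, underflow -> 1
    if small >= _ZERO + 0x100:
        return 255  # overflow maps to the largest byte
    return small - _ZERO


def byte315_to_float(b: int) -> float:
    """Expand the byte back into one of 256 possible floats."""
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << (24 - 3)) + ((63 - 15) << 24)
    return struct.unpack(">f", struct.pack(">i", bits))[0]


def stored_norm(field_length: int, boost: float = 1.0) -> int:
    """Assumed index-time step: encode boost / sqrt(length) into one byte."""
    return float_to_byte315(boost / math.sqrt(field_length))


def decoded_length(norm_byte: int) -> float:
    """Assumed query-time step: report 1 / norm^2 as the field length."""
    f = byte315_to_float(norm_byte)
    return 1.0 / (f * f) if f else 0.0


if __name__ == "__main__":
    # Five terms, no boost: the result is close to 5 but not exact (about 5.22).
    print(decoded_length(stored_norm(5)))
    # Five terms with a hypothetical index-time boost of 0.11: about 455.11.
    print(decoded_length(stored_norm(5, boost=0.11)))
```

Under these assumptions the decoded value equals length / boost^2 rounded to the nearest representable step, so a boost well below 1 inflates the reported length by orders of magnitude. That would be consistent with the values in the question, which all sit exactly on steps of this table (for example 455.1111 = 1 / 0.046875^2).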