[[relevance-is-broken]] === Relevance Is Broken!
Before we move on to discussing more-complex queries in
<
Every now and again a new user opens an issue claiming that sorting by relevance((("relevance", "differences in IDF producing incorrect results"))) is broken and offering a short reproduction: the user indexes a few documents, runs a simple query, and finds apparently less-relevant results appearing above more-relevant results.
To understand why this happens, let's imagine that we create an index with two
primary shards and we index ten documents, six of which contain the word foo
.
It may happen that shard 1 contains three of the foo
documents and shard
2 contains the other three. In other words, our documents are well distributed.
In <
However, for performance reasons, Elasticsearch doesn't calculate the IDF across all documents in the index.((("shards", "local inverse document frequency (IDF)"))) Instead, each shard calculates a local IDF for the documents contained in that shard.
Because our documents are well distributed, the IDF for both shards will be
the same. Now imagine instead that five of the foo
documents are on shard 1,
and the sixth document is on shard 2. In this scenario, the term foo
is
very common on one shard (and so of little importance), but rare on the other
shard (and so much more important). These differences in IDF can produce
incorrect results.
In practice, this is not a problem. The differences between local and global IDF diminish the more documents that you add to the index. With real-world volumes of data, the local IDFs soon even out. The problem is not that relevance is broken but that there is too little data.
For testing purposes, there are two ways we can work around this issue. The
first is to create an index with one primary shard, as we did in the section
introducing the <
The second workaround is to add ?search_type=dfs_query_then_fetch
to your
search requests. The dfs
stands((("searchtype", "dfs_query_then_fetch")))((("dfs_query_then_fetch search type")))((("DFS (Distributed Frequency Search)"))) for _Distributed Frequency Search, and it
tells Elasticsearch to first retrieve the local IDF from each shard in order
to calculate the global IDF across the whole index.
TIP: Don't use dfs_query_then_fetch
in production. It really isn't
required. Just having enough data will ensure that your term frequencies are
well distributed. There is no reason to add this extra DFS step to every query
that you run.