[[analysis-intro]]
[role="pagebreak-before"]
=== Analysis and Analyzers
Analysis is a ((("analysis", "defined")))process that consists of the following:

* First, tokenizing a block of text into individual _terms_ suitable for use in an inverted index
* Then, normalizing these terms into a standard form to improve their ``searchability,'' or _recall_
This job is ((("analyzers")))performed by analyzers. An _analyzer_ is really just a wrapper that combines three functions into a((("character filters"))) single package:
Character filters::
First, the string is passed through any _character filters_ in turn. Their
job is to tidy up the string before tokenization. A character filter could
be used to strip out HTML, or to convert `&` characters to the word
`and`.
Tokenizer::
Next, the string is tokenized into individual terms by a _tokenizer_. A simple tokenizer might split the text into terms whenever it encounters whitespace or punctuation.
Token filters::
Last, each term is passed through any _token filters_ in turn, which can
change terms (for example, lowercasing `Quick`), remove terms (for example,
stopwords such as `a`, `and`, `the`), or add terms (for example, synonyms
like `jump` and `leap`).
Elasticsearch provides many character filters, ((("token filters")))((("tokenizers")))tokenizers, and token filters
out of the box. These can be combined to create _custom analyzers_ suitable
for different purposes, as sketched below. We discuss these in detail in <<custom-analyzers>>.
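
For example, a hypothetical custom analyzer combining one of each component could be defined like this (a sketch: the names `my_index` and `my_analyzer` are ours to choose, while `html_strip`, the `standard` tokenizer, and the `lowercase` token filter are all built in):

[source,js]
--------------------------------------------------
PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "type":        "custom",
                    "char_filter": [ "html_strip" ],
                    "tokenizer":   "standard",
                    "filter":      [ "lowercase" ]
                }
            }
        }
    }
}
--------------------------------------------------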
==== Built-in Analyzers
However, Elasticsearch also ships with prepackaged analyzers that you can use directly.((("analyzers", "built-in"))) We list the most important ones next and, to demonstrate the difference in behavior, we show what terms each would produce from this string:
"Set the shape to semi-transparent by calling set_trans(5)"
Standard analyzer::
The standard analyzer ((("standard analyzer")))is the default analyzer that Elasticsearch uses. It is the best general choice for analyzing text that may be in any language. It splits the text on _word boundaries_, as((("word boundaries"))) defined by the http://www.unicode.org/reports/tr29/[Unicode Consortium], and removes most punctuation. Finally, it lowercases all terms. It would produce
+
    set, the, shape, to, semi, transparent, by, calling, set_trans, 5
Simple analyzer::
The simple analyzer splits ((("simple analyzer")))the text on anything that isn't a letter, and lowercases the terms. It would produce
+
    set, the, shape, to, semi, transparent, by, calling, set, trans
Whitespace analyzer::
The whitespace analyzer splits ((("whitespace analyzer")))the text on whitespace. It doesn't lowercase. It would produce
+
    Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
Language analyzers::
Language-specific analyzers ((("language analyzers")))are available for http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html[many languages]. They are able to
take the peculiarities of the specified language into account. For instance,
the `english` analyzer comes with a set of English ((("stopwords")))stopwords (common words
like `and` or `the` that don't have much impact on relevance), which it
removes. This analyzer also is able to _stem_ English ((("stemming words")))words because it understands the
rules of English grammar.
+
The `english` analyzer would produce the following:
+
    set, shape, semi, transpar, call, set_tran, 5
+
Note how `transparent`, `calling`, and `set_trans` have been stemmed to
their root form.
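
To verify these term lists yourself, one approach (a sketch; the `analyze` API is covered in <<analyze-api>>) is to run the sample sentence through each analyzer in turn:

[source,js]
--------------------------------------------------
GET /_analyze?analyzer=english
Set the shape to semi-transparent by calling set_trans(5)
--------------------------------------------------

Swap `english` for `standard`, `simple`, or `whitespace` to compare the output.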
==== When Analyzers Are Used
When we index a document, its full-text fields are analyzed into terms that are used to create the inverted index.((("indexing", "analyzers, use on full text fields"))) However, when we search on a full-text field, we need to pass the query string through the same analysis process, to ensure that we are searching for terms in the same form as those that exist in the index.
Full-text queries, which we discuss later, understand how each field is defined, and so they can do((("full text", "querying fields representing"))) the right thing:
* When you query a _full-text_ field, the query will apply the same analyzer to the query string to produce the correct list of terms to search for.

* When you query an _exact-value_ field, the query will not analyze the query string, ((("exact values", "querying fields representing")))but instead search for the exact value that you have specified.
Now you can understand why the queries that we demonstrated at the start of
this chapter behave as they do:

* The `date` field contains an exact value: the single term `2014-09-15`.

* The `_all` field is a full-text field, so the analysis process has
  converted the date into the three terms: `2014`, `09`, and `15`.

When we query the `_all` field for `2014`, it matches all 12 tweets, because
all of them contain the term `2014`:
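
A query-string search sketch (assuming the 12 tweet documents indexed earlier in the chapter):

[source,js]
--------------------------------------------------
GET /_search?q=2014              # 12 results
--------------------------------------------------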
// SENSE: 052_Mapping_Analysis/25_Data_type_differences.json
When we query the `_all` field for `2014-09-15`, it first analyzes the query
string to produce a query that matches _any_ of the terms `2014`, `09`, or
`15`. This also matches all 12 tweets, because all of them contain the term
`2014`:
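
The equivalent query-string sketch:

[source,js]
--------------------------------------------------
GET /_search?q=2014-09-15        # 12 results !
--------------------------------------------------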
// SENSE: 052_Mapping_Analysis/25_Data_type_differences.json
When we query the `date` field for `2014-09-15`, it looks for that _exact_
date, and finds one tweet only:
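
Here the query-string sketch restricts the search to the `date` field:

[source,js]
--------------------------------------------------
GET /_search?q=date:2014-09-15   # 1 result
--------------------------------------------------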
// SENSE: 052_Mapping_Analysis/25_Data_type_differences.json
When we query the `date` field for `2014`, it finds no documents
because none contain that exact date:
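
The same field-restricted sketch with just the year:

[source,js]
--------------------------------------------------
GET /_search?q=date:2014         # 0 results !
--------------------------------------------------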
// SENSE: 052_Mapping_Analysis/25_Data_type_differences.json
[[analyze-api]]
==== Testing Analyzers
Especially when you are new ((("analyzers", "testing")))to Elasticsearch, it is sometimes difficult to
understand what is actually being tokenized and stored into your index. To
better understand what is going on, you can use the `analyze` API to see how
text is analyzed. Specify which analyzer to use in the query-string
parameters, and the text to analyze in the body:
[source,js]
--------------------------------------------------
GET /_analyze?analyzer=standard
Text to analyze
--------------------------------------------------
// SENSE: 052_Mapping_Analysis/40_Analyze.json
Each element in the result represents a single term:
{
   "tokens": [
      {
         "token":        "text",
         "start_offset": 0,
         "end_offset":   4,
         "type":         "<ALPHANUM>",
         "position":     1
      },
      ...
   ]
}
The `token` is the actual term that will be stored in the index. The
`position` indicates the order in which the terms appeared in the original
text. The `start_offset` and `end_offset` indicate the character positions
that the original word occupied in the original string.
TIP: The `type` values like `<ALPHANUM>` vary ((("types", "type values returned by analyzers")))per analyzer and can be ignored.
The only place that they are used in Elasticsearch is in the
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/analysis-intro.html#analyze-api[`keep_types` token filter].
The `analyze` API is a useful tool for understanding what is happening
inside Elasticsearch indices, and we will talk more about it as we progress.
==== Specifying Analyzers
When Elasticsearch detects a new string field((("analyzers", "specifying"))) in your documents, it
automatically configures it as a full-text `string` field and analyzes it with
the `standard` analyzer.((("standard analyzer")))
You don't always want this. Perhaps you want to apply a different analyzer that suits the language your data is in. And sometimes you want a string field to be just a string field--to index the exact value that you pass in, without any analysis, such as a string user ID or an internal status field or tag.
To achieve this, we have to configure these fields manually by specifying the mapping.
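
As a preview, a minimal sketch of such a mapping might look like this (the index name `my_index`, type name `tweet`, and field names are hypothetical; `analyzer` and `"index": "not_analyzed"` are the real mapping parameters):

[source,js]
--------------------------------------------------
PUT /my_index
{
    "mappings": {
        "tweet": {
            "properties": {
                "tweet": {
                    "type":     "string",
                    "analyzer": "english"
                },
                "user_id": {
                    "type":  "string",
                    "index": "not_analyzed"
                }
            }
        }
    }
}
--------------------------------------------------

Here the `tweet` field would be analyzed with the `english` analyzer at index and search time, while `user_id` would be stored as a single exact term.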