[[bulk]]
=== Cheaper in Bulk
In the same way that `mget` allows us to retrieve multiple documents at
once, the `bulk` API allows((("bulk API"))) us to make multiple `create`,
`index`, `update`, or `delete` requests in a single step. This is
particularly useful if you need to index a data stream such as log events,
which can be queued up and indexed in batches of hundreds or thousands.
The `bulk` request body has the following, slightly unusual, format:
[source,js]
--------------------------------------------------
{ action: { metadata }}\n
{ request body        }\n
{ action: { metadata }}\n
{ request body        }\n
--------------------------------------------------
This format is like a stream of valid one-line JSON documents joined
together by newline (`\n`) characters.((("\n (newline) characters in bulk requests", sortas="n (newline)"))) Two important points to note:

* Every line must end with a newline character (`\n`), including the last
  line. These are used as markers to allow for efficient line separation.

* The lines cannot contain unescaped newline characters, as they would
  interfere with parsing. This means that the JSON must not be
  pretty-printed, as the sketch after this list shows.
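To make these two rules concrete, here is a minimal Python sketch of how a
client might assemble a valid bulk body. The document pairs and index names
are only illustrative, not prescribed by the API:

[source,python]
--------------------------------------------------
import json

# Illustrative (action, body) pairs; a delete carries no request body.
pairs = [
    ({"create": {"_index": "website", "_type": "blog", "_id": "123"}},
     {"title": "My first blog post"}),
    ({"delete": {"_index": "website", "_type": "blog", "_id": "123"}},
     None),
]

lines = []
for action, body in pairs:
    # json.dumps without indent keeps each document on a single line,
    # so no unescaped newlines can appear inside it.
    lines.append(json.dumps(action))
    if body is not None:
        lines.append(json.dumps(body))

# Join with newlines and terminate with a final \n -- the last line
# must end with a newline character too.
bulk_body = "\n".join(lines) + "\n"
print(bulk_body)
--------------------------------------------------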
TIP: In <<distrib-multi-doc>>, we explain why the `bulk` API uses this format.
The +action/metadata+ line specifies what action to do to which document.
The +action+ must be((("action, in bulk requests"))) one of the following:
`create`::
   Create a document only if the document does not already exist. See <<create-doc>>.

`index`::
   Create a new document or replace an existing document. See <<index-doc>>.

`update`::
   Do a partial update on a document. See <<partial-updates>>.

`delete`::
   Delete a document. See <<delete-doc>>.
The +metadata+ should ((("metadata, document", "in bulk requests")))specify
the `_index`, `_type`, and `_id` of the document to be indexed, created,
updated, or deleted.

For instance, a `delete` request could look like this:
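[source,js]
--------------------------------------------------
{ "delete": { "_index": "website", "_type": "blog", "_id": "123" }}
--------------------------------------------------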
The +request body+ line consists of the((("request body line, bulk requests"))) document `_source`
itself--the fields and values that the document contains. It is required
for `index` and `create` operations, which makes sense: you must supply the
document to index. It is also required for `update` operations and should
consist of the same request body that you would pass to the `update` API:
`doc`, `upsert`, `script`, and so forth. No +request body+ line is required
for a `delete`.

For example, a `create` request supplies the new document on its +request body+ line:
{ "create": { "_index": "website", "_type": "blog", "_id": "123" }}
If no `_id` is specified, an ID will be autogenerated:
{ "index": { "_index": "website", "_type": "blog" }}
To put it all together, a complete `bulk` request ((("bulk API", "common bulk request, example")))has this form:
[source,js]
--------------------------------------------------
POST /_bulk
{ "delete": { "_index": "website", "_type": "blog", "_id": "123" }} <1>
{ "create": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "My first blog post" }
{ "index":  { "_index": "website", "_type": "blog" }}
{ "title":    "My second blog post" }
{ "update": { "_index": "website", "_type": "blog", "_id": "123", "_retry_on_conflict" : 3} }
{ "doc" : {"title" : "My updated blog post"} } <2>
--------------------------------------------------
// SENSE: 030_Data/55_Bulk.json
<1> Notice how the `delete` action does not have a request body; it is
    followed immediately by another action.
<2> Remember the final newline character.
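Outside of Sense, the same request can be sent with any HTTP client. Here
is a rough sketch using Python's standard library; the localhost URL is an
assumption, and the `Content-Type` header shown is the one required by more
recent Elasticsearch releases:

[source,python]
--------------------------------------------------
import urllib.request

# A shortened version of the bulk body from the example above,
# newline-terminated.
bulk_body = (
    '{ "delete": { "_index": "website", "_type": "blog", "_id": "123" }}\n'
    '{ "create": { "_index": "website", "_type": "blog", "_id": "123" }}\n'
    '{ "title": "My first blog post" }\n'
)

request = urllib.request.Request(
    "http://localhost:9200/_bulk",          # assumed local node
    data=bulk_body.encode("utf-8"),
    headers={"Content-Type": "application/x-ndjson"},
    method="POST",
)
with urllib.request.urlopen(request) as response:
    print(response.read().decode("utf-8"))
--------------------------------------------------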
The Elasticsearch response contains the `items` array,((("items array, listing results of bulk requests")))((("bulk API", "Elasticsearch response"))) which lists the result of
each request, in the same order as we requested them:
{ "took": 4, "errors": false, <1> "items": [ { "delete": { "_index": "website", "_type": "blog", "_id": "123", "_version": 2, "status": 200, "found": true }}, { "create": { "_index": "website", "_type": "blog", "_id": "123", "_version": 3, "status": 201 }}, { "create": { "_index": "website", "_type": "blog", "_id": "EiwfApScQiiy7TIKFxRCTw", "_version": 1, "status": 201 }}, { "update": { "_index": "website", "_type": "blog", "_id": "123", "_version": 4, "status": 200 }} ]
// SENSE: 030_Data/55_Bulk.json
<1> All subrequests completed successfully.
Each subrequest is executed independently, so the failure of one subrequest
won't affect the success of the others. If any of the requests fail, the
top-level `errors` flag is set to `true` and the error details will be
reported under the relevant request:
[source,js]
--------------------------------------------------
POST /_bulk
{ "create": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "Cannot create - it already exists" }
{ "index":  { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":    "But we can update it" }
--------------------------------------------------
// SENSE: 030_Data/55_Bulk_independent.json
In the response, we can see that it failed to `create` document `123`
because it already exists, but the subsequent `index` request, also on
document `123`, succeeded:
{ "took": 3, "errors": true, <1> "items": [ { "create": { "_index": "website", "_type": "blog", "_id": "123", "status": 409, <2> "error": "DocumentAlreadyExistsException <3> [[website][4] [blog][123]: document already exists]" }}, { "index": { "_index": "website", "_type": "blog", "_id": "123", "_version": 5, "status": 200 <4> }} ]
// SENSE: 030_Data/55_Bulk_independent.json
<1> One or more requests have failed.
<2> The HTTP status code for this request reports `409 CONFLICT`.
<3> The error message explaining why the request failed.
<4> The second request succeeded with an HTTP status code of `200 OK`.
That also means ((("bulk API", "bulk requests, not transactions")))that `bulk`
requests are not atomic: they cannot be used to implement transactions. Each
request is processed separately, so the success or failure of one request
will not interfere with the others.
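Because failures are reported per item rather than for the request as a
whole, client code has to check the `errors` flag and walk the `items`
array itself. A minimal Python sketch, using a trimmed copy of the response
shown above:

[source,python]
--------------------------------------------------
import json

# A trimmed copy of the bulk response shown above.
response_text = """
{
  "took": 3,
  "errors": true,
  "items": [
    { "create": { "_index": "website", "_type": "blog", "_id": "123",
                  "status": 409, "error": "DocumentAlreadyExistsException ..." }},
    { "index":  { "_index": "website", "_type": "blog", "_id": "123",
                  "_version": 5, "status": 200 }}
  ]
}
"""

response = json.loads(response_text)

if response["errors"]:
    # Items appear in the same order as the actions in the request,
    # so the position identifies which action failed.
    for position, item in enumerate(response["items"]):
        action, result = next(iter(item.items()))  # e.g. ("create", {...})
        if "error" in result:
            print(f"action #{position} ({action}) failed: {result['error']}")
--------------------------------------------------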
==== Don't Repeat Yourself
Perhaps you are batch-indexing logging data into the same index and with
the same type. Having to ((("metadata, document", "not repeating in bulk requests")))((("bulk API", "default /_index or _index/_type")))specify the same metadata for every document is a waste.
Instead, just as for the `mget` API, the `bulk` request accepts a default
`/_index` or `/_index/_type` in the URL:
[source,js]
--------------------------------------------------
POST /website/_bulk
{ "index": { "_type": "log" }}
{ "event": "User logged in" }
--------------------------------------------------
// SENSE: 030_Data/55_Bulk_defaults.json
You can still override the `_index` and `_type` in the metadata line, but it
will use the values in the URL as defaults:
[source,js]
--------------------------------------------------
POST /website/log/_bulk
{ "index": {}}
{ "event": "User logged in" }
{ "index": { "_type": "blog" }}
{ "title": "Overriding the default type" }
--------------------------------------------------
// SENSE: 030_Data/55_Bulk_defaults.json
==== How Big Is Too Big?
The entire bulk request needs to be loaded into memory by the node that receives our request, so the bigger the request, the less memory available for other requests.((("bulk API", "optimal size of requests"))) There is an optimal size of bulk request. Above that size, performance no longer improves and may even drop off. The optimal size, however, is not a fixed number. It depends entirely on your hardware, your document size and complexity, and your indexing and search load.
Fortunately, it is easy to find this sweet spot: Try indexing typical documents in batches of increasing size. When performance starts to drop off, your batch size is too big. A good place to start is with batches of 1,000 to 5,000 documents or, if your documents are very large, with even smaller batches.
It is often useful to keep an eye on the physical size of your bulk requests. One thousand 1KB documents is very different from one thousand 1MB documents. A good bulk size to start playing with is around 5-15MB.
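One way to run this experiment is to cap both the document count and the
physical size of each batch while you measure throughput. A rough Python
sketch; the 1,000-document and 10MB limits are arbitrary starting points,
not recommendations from Elasticsearch itself:

[source,python]
--------------------------------------------------
import json

def batches(docs, max_docs=1000, max_bytes=10 * 1024 * 1024):
    """Yield lists of serialized documents, capped by count and size."""
    batch, size = [], 0
    for doc in docs:
        line = json.dumps(doc)
        # Start a new batch if adding this document would exceed a cap.
        # (The action/metadata lines would add a little more per request.)
        if batch and (len(batch) >= max_docs or size + len(line) + 1 > max_bytes):
            yield batch
            batch, size = [], 0
        batch.append(line)
        size += len(line) + 1  # +1 for the newline terminator
    if batch:
        yield batch

# Illustrative usage: report the size of each batch body you would POST.
docs = ({"event": "log line %d" % i} for i in range(10000))
for batch in batches(docs):
    body = "\n".join(batch) + "\n"
    print(len(batch), "docs,", len(body), "characters")
--------------------------------------------------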