Double counting of documents processed when both a default/request and a final pipeline are used #92843

@joegallo

Description

PUT _ingest/pipeline/pipeline-3
{
 "processors": [
  {
   "set": {
    "field": "field-3",
    "value": "pipeline-3"
   }
  }
 ]
}

PUT _ingest/pipeline/pipeline-2
{
 "processors": [
  {
   "set": {
    "field": "field-2",
    "value": "pipeline-2"
   }
  }
 ]
}

PUT _ingest/pipeline/pipeline-1
{
 "processors" : [
  {
   "set": {
    "field": "field-1",
    "value": "pipeline-1"
   }
  },
  {
   "pipeline": {
    "name": "pipeline-2"
   }
  }
 ]
}

PUT index-1

PUT index-1/_settings
{
 "index" : {
  "default_pipeline": "pipeline-1",
  "final_pipeline": "pipeline-3"
 }
}

POST _bulk
{ "index" : { "_index" : "index-1" } }
{ "doc_id" : 0 }

POST index-1/_search

GET _nodes/stats?filter_path=nodes.*.ingest

The above creates three pipelines, each of which writes its own name into a field of every document it processes. Note that pipeline-1 calls pipeline-2, and that pipeline-1 is installed as the default_pipeline while pipeline-3 is installed as the final_pipeline (reminder: both the default_pipeline and the final_pipeline are executed when a document is indexed).
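To make the expected execution order concrete, here is a minimal toy model in Python of the three pipelines defined above. It is only an illustration of the behaviour described in this issue, not Elasticsearch code: `PIPELINES`, `run_pipeline`, and `index_document` are hypothetical names invented for this sketch.

```python
# Toy model of the pipelines from this issue (hypothetical, not Elasticsearch code).
# Each pipeline is a list of processors: ("set", field, value) or ("pipeline", name).
PIPELINES = {
    "pipeline-1": [("set", "field-1", "pipeline-1"), ("pipeline", "pipeline-2")],
    "pipeline-2": [("set", "field-2", "pipeline-2")],
    "pipeline-3": [("set", "field-3", "pipeline-3")],
}


def run_pipeline(name, doc):
    """Apply every processor of the named pipeline to doc, in order."""
    for processor in PIPELINES[name]:
        if processor[0] == "set":
            _, field, value = processor
            doc[field] = value
        elif processor[0] == "pipeline":
            # A pipeline processor hands the same document to another pipeline.
            run_pipeline(processor[1], doc)
    return doc


def index_document(doc, default_pipeline, final_pipeline):
    """Run the default pipeline first, then the final pipeline, on a copy of doc."""
    doc = dict(doc)
    run_pipeline(default_pipeline, doc)
    run_pipeline(final_pipeline, doc)
    return doc


result = index_document({"doc_id": 0}, "pipeline-1", "pipeline-3")
print(result)
```

In this model a single document passes through all three pipelines, so it ends up with field-1, field-2, and field-3 set alongside the original doc_id.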

The _search near the end will give a result like this:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "index-1",
        "_id": "DFf9oYUBqS6z6DmM3nW4",
        "_score": 1,
        "_source": {
          "field-3": "pipeline-3",
          "field-1": "pipeline-1",
          "field-2": "pipeline-2",
          "doc_id": 0
        }
      }
    ]
  }
}

Note that field-1, field-2, and field-3 all have the expected values, indicating that all three pipelines executed against this document.

The _nodes/stats call will give a result like this:

{
  "nodes": {
    "e3BFjLN8STSspUcCnDcXkQ": {
      "ingest": {
        "total": {
          "count": 2,
          [...]
        },
        "pipelines": {
          "pipeline-1": {
            "count": 1,
            [...]
          },
          "pipeline-2": {
            "count": 1,
            [...]
          },
          "pipeline-3": {
            "count": 1,
            [...]
          }
        }
      }
    }
  }
}

The "count" for each individual pipeline is correct: each pipeline was executed against a single document. However, the top-level count is doubled: the "total" "count" is 2 but should have been 1, since only one document was indexed.
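The discrepancy can be checked mechanically. The following is a rough Python sketch that compares the top-level ingest count on each node against the number of documents actually indexed. It assumes the stats response has already been parsed into a dict; the helper name `find_total_mismatches` is hypothetical, and the sample data reproduces only the "count" fields from the response above (the elided fields are left out).

```python
def find_total_mismatches(stats: dict, expected_docs: int) -> list[str]:
    """Return one message per node whose ingest total count differs from expected_docs."""
    messages = []
    for node_id, node in stats.get("nodes", {}).items():
        total = node["ingest"]["total"]["count"]
        if total != expected_docs:
            messages.append(
                f"{node_id}: total.count={total}, expected {expected_docs}"
            )
    return messages


# Only the "count" values from the _nodes/stats response in this issue.
sample_stats = {
    "nodes": {
        "e3BFjLN8STSspUcCnDcXkQ": {
            "ingest": {
                "total": {"count": 2},
                "pipelines": {
                    "pipeline-1": {"count": 1},
                    "pipeline-2": {"count": 1},
                    "pipeline-3": {"count": 1},
                },
            }
        }
    }
}

mismatches = find_total_mismatches(sample_stats, expected_docs=1)
for message in mismatches:
    print(message)
```

With the sample above, the single node is reported because its total count is 2 while only one document was sent through the _bulk request.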
