String fields longer than 32kb cannot be indexed #873

Open
kroepke opened this issue Jan 14, 2015 · 22 comments · Fixed by kube-logging/logging-operator#1803

Comments

kroepke (Member) commented Jan 14, 2015

Elasticsearch has an upper limit for term length, so trying to index values longer than ~32kb fails with an error.
We need to find a way to store those values without trying to analyze them.

@kroepke kroepke added this to the 1.1.0 milestone Jan 14, 2015
kroepke (Member, Author) commented Jan 14, 2015

FWIW, here's how "other" people deal with it: http://answers.splunk.com/answers/136664/changing-max-length-of-field.html

razvanphp (Contributor):
+1

@kroepke kroepke modified the milestones: 1.2.0, 1.1.0 May 29, 2015
bernd (Member) commented Aug 11, 2015

Removing this from 1.2, not gonna make it. Sorry.

delfer commented Dec 1, 2015

Found another 'dummy' workaround: cut the tail of the long field off into a second field, then overwrite that second field with the value of any other field (here, the server IP), so the oversized tail is discarded.

{
  "extractors": [
    {
      "condition_type": "regex",
      "condition_value": "^.{16383,}$",
      "converters": [],
      "cursor_strategy": "cut",
      "extractor_config": {
        "regex_value": "^.{0,16383}(.*)"
      },
      "extractor_type": "regex",
      "order": 0,
      "source_field": "msg.response",
      "target_field": "responseTail",
      "title": "cut response"
    },
    {
      "condition_type": "none",
      "condition_value": "",
      "converters": [],
      "cursor_strategy": "copy",
      "extractor_config": {},
      "extractor_type": "copy_input",
      "order": 0,
      "source_field": "gl2_remote_ip",
      "target_field": "responseTail",
      "title": "replace responseTail by server IP"
    }
  ],
  "version": "1.2.2 (91c7822)"
}

ghost commented Jan 28, 2016

+1

Same issue:
2016-01-28T17:03:43.165+01:00 ERROR [Messages] Failed to index [1] messages. Please check the index error log in your web interface for the reason. Error: failure in bulk execution:
[13]: index [graylog2_4], type [message], id [b31f5110-c5d8-11e5-8227-001a4a777b5d], message [IllegalArgumentException[Document contains at least one immense term in field="other" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[49, 50, 51, 52, 53, 54, 55, 56, 57, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 48]...', original message: bytes can be at most 32766 in length; got 186000]; nested: MaxBytesLengthExceededException[bytes can be at most 32766 in length; got 186000]; ]

My GELF message:
{ "message": "OK", "other": "more than 32k...." }

joschi (Contributor) commented Jan 28, 2016

ghost commented Feb 1, 2016

@joschi Thanks for the links.
Is there a good definition of GELF somewhere (https://www.graylog.org/resources/gelf/), i.e. which field has which data type and which limitations? That would help us a lot.
Right now we discover such "limitations" by reverse engineering (bugs and testing).

joschi (Contributor) commented Feb 1, 2016

@kablz The GELF specification can be found at https://www.graylog.org/resources/gelf/ and describes the names and types of the mandatory fields in a GELF message. Additional fields (see the specification) naturally don't have a fixed schema unless you enforce one on your GELF producers.
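
For reference, a minimal GELF 1.1 payload as given in the specification; version, host, and short_message are mandatory, while everything with a leading underscore is an additional field with no fixed schema:

{
  "version": "1.1",
  "host": "example.org",
  "short_message": "A short message",
  "full_message": "Backtrace here\n\nmore stuff",
  "timestamp": 1385053862.3072,
  "level": 1,
  "_user_id": 9001,
  "_some_info": "foo"
}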

csquire commented Mar 22, 2016

You could try using an index template which includes a dynamic template that matches all string fields, then use ignore_above to prevent the document from failing to index. Below is a template I use on another Elasticsearch cluster I feed logs to (not Graylog), where I was getting rejections from long fields such as Java stack traces. For my purposes I didn't find it useful to index any field with over 512 characters, but that value can be tweaked to whatever you like. The other settings can be removed or changed as desired.

(See the Elasticsearch docs on Indices Templates and Index Mapping.)

{
  "logs_template": {
    "template": "logs*",
    "mappings": {
      "_default_": {
        "_all": {
          "enabled": false
        },
        "dynamic_templates": [
          {
            "notanalyzed": {
              "match": "*",
              "match_mapping_type": "string",
              "mapping": {
                "ignore_above": 512,
                "type": "string",
                "index": "not_analyzed",
                "doc_values": true
              }
            }
          }
        ]
      }
    }
  }
}

From the docs:

The analyzer will ignore strings larger than this size. Useful for generic not_analyzed fields that should ignore long text.

This option is also useful for protecting against Lucene’s term byte-length limit of 32766. Note: the value for ignore_above is the character count, but Lucene counts bytes, so if you have UTF-8 text, you may want to set the limit to 32766 / 3 = 10922 since UTF-8 characters may occupy at most 3 bytes.
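
Note that the JSON above is in the GET _template response format; to actually create the template you PUT the inner object, roughly like this (a sketch, assuming an ES 1.x/2.x-era cluster on localhost:9200):

# Create the template; it applies to indices created after this point.
curl -X PUT 'http://localhost:9200/_template/logs_template' -d '
{
  "template": "logs*",
  "mappings": {
    "_default_": {
      "_all": { "enabled": false },
      "dynamic_templates": [
        {
          "notanalyzed": {
            "match": "*",
            "match_mapping_type": "string",
            "mapping": {
              "ignore_above": 512,
              "type": "string",
              "index": "not_analyzed",
              "doc_values": true
            }
          }
        }
      ]
    }
  }
}'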

meixger commented Apr 14, 2016

@csquire Nice, and this would work with a custom template, but unfortunately I've not found a way to replace the store_generic template in the default graylog-internal template:

{
  "graylog-internal": {
    "order": -2147483648,
    "template": "graylog_*",
    "mappings": {
      "message": {
        ...
        "dynamic_templates": [
          {
            "internal_fields": {
              ...
              "match": "gl2_*"
            }
          },
          {
            "store_generic": {
              "mapping": {
                "index": "not_analyzed",
              },
              "match": "*"
            }
          }
        ],
        "properties": {
          ...
        }
      }
    }
  }
}

What would speak against adding ignore_above as a default?

...
"store_generic": {
  "mapping": {
    "index": "not_analyzed",
    "ignore_above": 32766
  },
  "match": "*"
}
...
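
In the meantime, you could try overriding store_generic from a second template (a sketch, untested; I'm assuming same-named dynamic templates merge by template order, with a higher order winning over graylog-internal's minimal order, and the template name here is made up):

# Hypothetical override template; 10922 follows the 32766 / 3 rule quoted above.
curl -X PUT 'http://localhost:9200/_template/graylog-ignore-above' -d '
{
  "template": "graylog_*",
  "order": 0,
  "mappings": {
    "message": {
      "dynamic_templates": [
        {
          "store_generic": {
            "match": "*",
            "mapping": {
              "index": "not_analyzed",
              "ignore_above": 10922
            }
          }
        }
      ]
    }
  }
}'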

sjoerdmulder commented Nov 15, 2016

I also hit this issue on Graylog 2.1.1; ignore_above seems like a good option that would fix this.

mike-daoust:
This is an issue for me also.

listingmirror commented May 9, 2017

I hit this and my entire cluster dies (stops getting new messages). Shouldn't there be some kind of default limit to prevent cluster death? (Maybe this didn't kill the cluster, still researching.)

james-gonzalez:
Same as @listingmirror. Any time we get a larger-than-normal Java stack trace, it brings down Graylog completely: not indexing documents anymore, no new logs. The only solution I've found so far is to kill -9 the process and then delete the on-disk journal, roughly as sketched below.
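
For the record, the recovery steps look roughly like this (a sketch; the journal path and service name assume a package install, so check message_journal_dir in your graylog.conf):

# WARNING: deleting the journal discards all queued, unprocessed messages.
sudo systemctl stop graylog-server || sudo kill -9 $(pgrep -f graylog-server)
sudo rm -rf /var/lib/graylog-server/journal
sudo systemctl start graylog-server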

jebucha commented May 31, 2017

I believe we are also running into this issue. I hadn't connected the dots, but I'm seeing indexing failures ("Document contains at least one immense term in field="full_message""), and the node that threw that error is not currently processing incoming messages, just queuing them up, backed up by 2 million and counting. As with others, my primary resolution has been to restart the service.

Aenima4six2 commented Aug 15, 2017

We were getting this issue on pre-2.3 versions of Graylog and fixed it using @joschi's advice above (custom mappings). However, with our recent upgrade to Graylog 2.3 the issue is back, even though a custom mapping preventing the field from being indexed is present in ES.

Current Error

{"type":"illegal_argument_exception","reason":"DocValuesField \"requestContent\" is too large, must be <= 32766"}

Old (Pre 2.3) Error

{"type":"illegal_argument_exception","reason":"Document contains at least one immense term in field=\"requestContent\" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[45, 45, 45, 32, 82, 101, 113, 117, 101, 115, 116, 32, 72, 101, 97, 100, 101, 114, 115, 32, 45, 45, 45, 13, 10, 67, 111, 110, 110, 101]...', original message: bytes can be at most 32766 in length; got 38345","caused_by":{"type":"max_bytes_length_exceeded_exception","reason":"max_bytes_length_exceeded_exception: bytes can be at most 32766 in length; got 38345"}}

Pushed the following custom mapping to ES to address this, but no luck.

curl -X PUT -d '{ "template": "graylog_*", "mappings" : { "message" : { "properties" : { "requestContent" : { "type" : "string", "index" : "no" } } } } }' http://localhost:9200/_template/graylog-custom-mapping?pretty

curl -X GET 'http://localhost:9200/graylog_deflector/_mapping?pretty' | jq
{
  "graylog_5": {
    "mappings": {
      "message": {
        "dynamic_templates": [
          {
            "internal_fields": {
              "match": "gl2_*",
              "mapping": {
                "type": "keyword"
              }
            }
          },
          {
            "store_generic": {
              "match": "*",
              "mapping": {
                "index": "not_analyzed"
              }
            }
          }
        ],
        "properties": {
          "AccountName": {
            "type": "keyword"
          },
          ...
          "requestContent": {
            "type": "keyword",
            "index": false
          },
         ...
        }
      }
    }
  }
}

UPDATE
Think I found a viable solution.
https://www.elastic.co/guide/en/elasticsearch/reference/current/doc-values.html

curl -X PUT http://localhost:9200/_template/graylog-custom-mapping?pretty -d '
{
  "template": "graylog_*",
  "mappings" : {
    "message" : {
      "properties" : {
        "requestContent" : {
          "type" : "string",
          "index" : "no",
          "doc_values": false ---> turn this off.. ES 5.5 appears to have a 32k size limit.
        }
      }
    }
  }
}'
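
To sanity-check it (and note, as far as I understand index templates, they only apply to indices created afterwards, so rotate the active write index in Graylog first):

# Confirm the custom template was stored.
curl -X GET 'http://localhost:9200/_template/graylog-custom-mapping?pretty'

# After rotating (System -> Indices -> Maintenance in the Graylog UI),
# confirm the new index picked up the mapping.
curl -X GET 'http://localhost:9200/graylog_deflector/_mapping?pretty'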

avdhoot commented Sep 25, 2017

@Aenima4six2 thanks for the above solution. +1

Ayyappa752:
Hi @csquire, I ran into the same problem when storing an HTML template in Elasticsearch. I tried not indexing the field and increasing the size with "ignore_above": 512, but that didn't work. Finally I had to use "doc_values": true along with the size. How come doc_values solved the issue?

zhangtemplar:
@Aenima4six2 Recent versions of Elasticsearch do not allow you to change the type, index, and/or doc_values of an existing field.

However, using ignore_above works. Here is the command:

curl -XPUT 'http://localhost:9200/graylog_0/_mapping/message' -d '
{
    "message" : {
      "properties" : {
        "screenShot" : {
          "type" : "keyword",
          "ignore_above": 32000
        }
      }
    }
}
'

joginsky:
Hi all,

I have the same problem... ignore_above seems to work, but not for "type": "text". Any idea how to solve this for the text type?
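
The only workaround I can think of (a sketch, untested; the template and field names are just examples): ignore_above is only valid on keyword/not-analyzed fields, so keep the field as text for full-text search (the analyzer tokenizes it, so individual terms stay under the limit) and add a keyword sub-field with ignore_above for exact matches and aggregations:

curl -X PUT 'http://localhost:9200/_template/long-text-fields?pretty' -H 'Content-Type: application/json' -d '
{
  "template": "graylog_*",
  "mappings": {
    "message": {
      "properties": {
        "full_message": {
          "type": "text",
          "fields": {
            "raw": {
              "type": "keyword",
              "ignore_above": 10922
            }
          }
        }
      }
    }
  }
}'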

aimhighrana:
Hello,

Same issue: max_bytes_length_exceeded_exception for large text data.

Thanks

bghira commented Aug 28, 2023

any chance some eyes on this one could happen in the next decade? :)
