Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redis brain storage of incident data still isn't working #13

Open
jm-welch opened this issue Jul 19, 2016 · 2 comments
Open

Redis brain storage of incident data still isn't working #13

jm-welch opened this issue Jul 19, 2016 · 2 comments

Comments

@jm-welch
Copy link
Contributor

jm-welch commented Jul 19, 2016

This one's going to be long, but hopefully, I've got a solution. I was digging around in the depths of redis to figure out how best to store control call data, and I discovered that there are a couple of problems with the current handling of the incident list in the redis brain.

When I looked at the redis data block, I saw the following:

{
  "users": {
    "1": {
      "id": "1",
      "name": "Shell",
      "room": "Shell"
    },
    "jmwelch": {
      "id": "jmwelch",
      "name": "jmwelch",
      "roles": [
        "admin",
        "oss"
      ]
    },
    "VictorOps": {
      "id": "VictorOps",
      "name": "VictorOps"
    }
  },
  "_private": {
    "1091": {
      "#ALERT JSON FOR 1091 HERE"
    },
    "1092": {
      "#ALERT JSON FOR 1092 HERE"
    },
    "VO_INCIDENT_KEYS": []
  }
}

(Except there were lots more incidents tracked). So this means that hubot doesn't use redis at the top level, but rather creates a json block stored as the value for the hubot:storage key, and @robot.brain.set is setting values under _private within that JSON block.

Notice the odd thing there? VO_INCIDENT_KEYS is blank. That's not a redaction, it's the observed behavior. And it stayed that way. I ran redis-cli monitor to keep an eye on it, and through every update, that value stayed a blank list. Commenting out the call to cleanupBrain "fixed" that, so I figured I'd do some more digging into that method to see what else might be odd there.

Here's the current implementation (in case it gets lost to the sands of time):

  cleanupBrain: ->
    # get a list of all the victor ops incident keys in the brain
    voIKeys = @robot.brain.get "VO_INCIDENT_KEYS"

    # remove keys from the victor ops incident keys list and from the brain
    # if they are older than 24 hours
    voIKeysFiltered = voIKeys.filter((item) ->
      return (new Date(item.timestamp).getDate() + 1 >= new Date)
    )

    # set the victor ops incident keys list in the the brain to the updated
    # list value
    @robot.brain.set "VO_INCIDENT_KEYS", voIKeysFiltered

The new Date(item.timestamp) creates a date object, which the getDate() method converts to an integer representing the day of the month (1-31) (reference). This means a couple of things.

  1. The new Date on the other side of the comparison is being compared to a raw integer. This isn't going to go well. Because of this, every time cleanupBrain() is called, the entirety of VO_INCIDENT_KEYS gets filtered out, since the integer is always less than the Date object.
  2. Even changing the right side of the comparison to new Date().getDate() doesn't work. The .getDate() on the left side means that the +1 isn't adding a day, it's adding a day-of-month, so any items created on the 31st of a given month will never be removed (there will never be a 32nd of the month), and objects created on the 30th won't qualify for removal until the 31st of the following month.

The fix for this is to use the .getTime() method. This converts the date to an epoch time instead, so instead of adding 1 to stand in for 24 hours, we add 86400000 (milliseconds in a day). This works out to something like:

      new Date(item.timestamp) + 86400000 >= new Date().getTime()

The second issue I found is one that we actually just removed the code for, without actually fixing it. It used to look like this:

  cleanupBrain: ->
    # get a list of all the victor ops incident keys in the brain
    voIKeys = @robot.brain.get "VO_INCIDENT_KEYS"

    # remove keys from the victor ops incident keys list and from the brain
    # if they are older than 24 hours
    voIKeysFiltered = voIKeys.filter((item) ->
      if new Date(item.timestamp).getDate() + 1 < new Date
        #@robot.brain.remove item.name
        return false
      true
    )

    # set the victor ops incident keys list in the the brain to the updated
    # list value
    @robot.brain.set "VO_INCIDENT_KEYS", voIKeysFiltered

Now that I've done some more digging into what's going on in the redis brain, I can see what was going on in the #@robot.brain.remove item.name line. Up in receiveWS, under the TIMELINE_LIST_REPLY_MESSAGE block, the incident is being written two places:

          voCurIName = item.ALERT["INCIDENT_NAME"]
          @robot.brain.set voCurIName, item.ALERT

... and ...

          # get a list of current victor ops incident keys in the brain
          voIKeys = @robot.brain.get "VO_INCIDENT_KEYS"
          # catch null lists and init as blank
          if not voIKeys?
            voIKeys = []

          voIKeys.push
            name: voCurIName
            timestamp: new Date
          @robot.brain.set "VO_INCIDENT_KEYS", voIKeys

The first call, above, adds the full alert JSON directly under _private, while the second call manipulates the contents list stored (at the same level) as VO_INCIDENT_KEYS.

This means that, by dropping the @robot.brain.remove item.name line, we rendered the bot incapable of cleaning up the full JSON of any alert it has received, while the list that was intended to be used to prune the larger body of data is being constantly purged (meaning that there is no way short of manual manipulation to remove those entries). This means that the size of the redis brain has the potential to get quite large over time, as no old entries will ever be removed.

Based on this, it looks like we really do need the @robot.brain.remove line, so I uncommented it, and now that we've got everything else going for us, I think we're in good shape.

  cleanupBrain: ->
    # get a list of all the victor ops incident keys in the brain
    voIKeys = @robot.brain.get "VO_INCIDENT_KEYS"

    # remove keys from the victor ops incident keys list and from the brain
    # if they are older than 24 hours
    voIKeysFiltered = voIKeys.filter((item) ->
      if new Date(item.timestamp).getTime() + 86400000 < new Date().getTime()
        @robot.brain.remove item.name
        return false
      true
    )

    # set the victor ops incident keys list in the the brain to the updated
    # list value
    @robot.brain.set "VO_INCIDENT_KEYS", voIKeysFiltered

I'm working on testing this at the moment to make sure that it really does work now, before I submit a PR. I'll be able to change the 'timeout' value from 1 day down to a few hours tomorrow morning to see if my test alert that I just created gets dropped from the brain.

@jm-welch
Copy link
Contributor Author

jm-welch commented Jul 20, 2016

Just found another issue with the VO_INCIDENT_KEYS list - it's not filtering out duplicates. If an incident receives 3 updates, all three of those will be present in the list. This may not fill a disk, but it is taking up a lot of extra space that doesn't need to be taken. Also, since the older message will be removed from the INCIDENT_KEYS list after 24 hours, this would also remove that incident from the main list at that time, even if it had been updated more recently.

@jm-welch
Copy link
Contributor Author

I think I have a full-on fix for #14 ready to go - doing final testing now. If it works, this issue won't be needed anymore, as the cleanup will be handled differently.

jm-welch added a commit to jm-welch/hubot-victorops that referenced this issue Jul 20, 2016
fixes victorops#14, closes victorops#13
- Added new environment variable (HUBOT_VICTOROPS_KEEP) to control how long incidents are stored in the brain
- Moved incident list to _private.VO_INCIDENTS, instead of as children of _private
- Now using .getTime() to compare timestamps for incident expiry
- Now using timestamps within the JSON data for each alert, rather than maintaining a separate list (VO_INCIDENT_KEYS)
- Bot will log how many days incidents are to be kept on initialization
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant