Splunk – How to deal with delayed events in Alerts

If you’re using Splunk for Alerting, there is a problem that can occur where an event:

  • Is generated before a scheduled search is executed
  • But is only received after the search is done

This will mean that:

  • The scheduled search that should inspect the time-frame where the event was generated will not find the event, because it hasn’t arrived yet
  • But the next search won’t see it either, because it is now outside the inspected time range!

This results in the event never being “seen” by any of the scheduled searches, and your alert won’t fire!
This is extremely rare if the difference between the event being generated and being received is small, but gets more and more common as the difference (lag/delay) increases.

This happens because Splunk does not use the index time (that is, when an event was received) when selecting events for a search. Instead, it uses the event time (_time) as reported by the source system.
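
If you want to know how big this lag actually is in your environment, you can compare the two timestamps directly. Here’s a minimal sketch, assuming a placeholder index called whatever (the same one used in the query later in this post):

index=whatever earliest=-24h
| eval lagSeconds = _indextime - _time
| stats avg(lagSeconds) max(lagSeconds) perc95(lagSeconds)

The max value gives you a rough idea of the “maximum expected delay” that both fixes below depend on.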

Here are two ways to fix this:

1 – Use a Sliding Window (Easy, but Sleazy)

When setting the time range of the search, you can use a sliding window to tolerate delays.

If the maximum delay you’re expecting is, for example, 1 minute, and your query runs every 15 minutes, then your range should be:

  • Start Date: -16m
  • End Date: -1m

This means that, whenever your scheduled search is executed, it will not look at the most recent 1 minute, because that’s where you’re expecting delayed events. Instead, that minute will be inspected during the next execution.
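
As a sketch, the same window can also be pinned directly in the query using the earliest and latest time modifiers (the index name is a placeholder; you can just as well set these values in the alert’s time range picker):

index=whatever earliest=-16m latest=-1m
| stats count

Each execution then covers the 15 minutes that ended 1 minute ago, so consecutive runs tile the timeline with no gaps and no overlaps.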

This solution is extremely simple, but it has a downside: if an event occurs just after the end of a search’s window, you’ll only be alerted up to (scheduled frequency) + (maximum expected delay) later.
In our example, that would be 15+1 = 16 minutes. 🙁

If, for some reason, an event’s delay ends up being bigger than you expected, that event will also not be seen :(((

2 – Use _indextime (Better!)

_indextime is a special field that tracks when a log was actually received (indexed). We can use it to solve our problem.

Let’s say you want your alert to run every 5 minutes. Then you should set your time range to something bigger, to accommodate your maximum expected delay. For example, if you’re expecting delays of up to 10 minutes in receiving logs, you should set it to 5+10=15 minutes. It’s even better to set it higher still, in case a log ends up with a bigger delay than expected.

Of course, if you’re looking at the past 15 minutes every 5 minutes, you will have problems with repeated alerts, as consecutive search windows will see the same events multiple times.

You can, however, use the following code at the start of your query to remove all events that were not received in the last x minutes (where x matches how often the search runs):

index=whatever
| eval indexTime = _indextime ```when the event was actually received```
| eval cutOffDate = relative_time(now(),"-5m") ```5 minutes ago, matching the schedule```
| where indexTime >= cutOffDate ```keep only events received since the cutoff```

This means that:

  • The query will “see” the events that were actually received in the last x minutes, not the ones that actually happened in the last x minutes. This removes the delay problem!
  • You can still use the _time field to denote when the event actually happened for whatever purposes
  • You don’t have to worry about the maximum expected delay: just set a big time range and you’ll receive the alert in the very next execution after the event arrives (see the combined sketch below).
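
Putting it together, here’s a minimal sketch of the full alert search for this example, assuming a 5-minute schedule, a deliberately generous 60-minute time range, and a placeholder index and aggregation:

index=whatever earliest=-60m
| eval indexTime = _indextime
| eval cutOffDate = relative_time(now(),"-5m")
| where indexTime >= cutOffDate
| stats count by host

Each execution only keeps events that were indexed in the last 5 minutes, so even though the time ranges of consecutive runs overlap heavily, each event is normally picked up by a single run: the first one that executes after the event arrives.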