Mehmet Ergene
Mastering Log Ingestion Delay in Detection Engineering
Log ingestion delays are an inevitable challenge in detection engineering, especially for large organizations. The topic has already been covered in several blogs. After reading the awesome post by Andrew VanVleet, I decided to write down my own thoughts, since I have a somewhat different opinion on how to handle log ingestion delays.
How Log Ingestion Delay Happens
Log ingestion delay happens when any of the components in the pipeline encounters an issue. These components are simply:
- The device generating the logs
- The agent/device processing and sending the logs
- The receiver that ingests the logs
- Network components (mainly connection availability and bandwidth)
Any of these components may have issues at any time, resulting in ingestion delay. Moreover, log ingestion delay usually doesn't occur uniformly. It varies depending on system load, log type, and other factors:
Business Hour Dynamics
During peak business hours, critical systems such as domain controllers, web servers, and Syslog collectors that gather logs from firewalls, web proxies, and similar devices experience heavy utilization. This load generates massive volumes of logs, which can lead to significant ingestion delays. Outside of business hours, these delays often diminish. To see this in your own environment, check your ingestion delay over 24 hours in 1-2 hour intervals ;)
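The 24-hour check suggested above can be sketched in a few lines of Python. This is only an illustration, assuming you can export (event time, ingest time) pairs from your SIEM. The function names `percentile` and `delay_profile`, the 2-hour bucket size, and the sample data are all hypothetical choices, not part of any product API.

```python
import math
from collections import defaultdict
from datetime import datetime


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def delay_profile(events, bucket_hours=2, pct=95):
    """Map each time-of-day bucket to the pct-th percentile delay in minutes.

    `events` is an iterable of (event_time, ingest_time) datetime pairs.
    Buckets are keyed by their starting hour (0, 2, 4, ... for 2-hour buckets).
    """
    buckets = defaultdict(list)
    for event_time, ingest_time in events:
        delay_min = (ingest_time - event_time).total_seconds() / 60
        bucket_start = (event_time.hour // bucket_hours) * bucket_hours
        buckets[bucket_start].append(delay_min)
    return {hour: percentile(delays, pct) for hour, delays in sorted(buckets.items())}


# Made-up sample: slow ingestion mid-morning, fast ingestion at night.
sample = [
    (datetime(2025, 1, 6, 10, 5), datetime(2025, 1, 6, 10, 35)),
    (datetime(2025, 1, 6, 2, 0), datetime(2025, 1, 6, 2, 2)),
]
print(delay_profile(sample))
```

A high percentile is used rather than the average because a detection window has to cover the slow sources, not the typical ones.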
The load on certain systems results in varying ingestion delays within the same table. For example, while 95% of the member servers have a delay of 2 minutes, the rest, especially the domain controllers, may have a delay of 30 minutes or longer.
Table Specific Utilization
Logs are typically segregated into different tables or indexes based on their type (e.g., authentication logs, network logs, application logs). Each table can experience different levels of delay. For example:
- SecurityEvent table might have a delay of only 5 minutes.
- CommonSecurityLog could face delays of up to 4 hours.
Strategies to Mitigate Ingestion Delay
This variability in ingestion delay makes a one-size-fits-all approach to detection queries impractical. To address ingestion delay challenges, different approaches should be employed depending on the detection requirements.
When Event Time Isn’t Part of Detection Logic
When the event timestamp isn’t part of detection logic, leveraging ingest time is the most straightforward solution. It ensures that the detection runs on whatever data is available, regardless of ingestion delays.
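To make the difference concrete, here is a minimal Python sketch that contrasts filtering on event time with filtering on ingest time for a single late-arriving event. The dictionary fields (`event_time`, `ingest_time`), the function names, and the 5-minute schedule are assumptions made for illustration only.

```python
from datetime import datetime, timedelta


def select_by_event_time(events, start, end):
    """Events whose event timestamp falls in (start, end]."""
    return [e for e in events if start < e["event_time"] <= end]


def select_by_ingest_time(events, start, end):
    """Events that arrived in (start, end], whenever they actually occurred."""
    return [e for e in events if start < e["ingest_time"] <= end]


# A logon that happened at 09:00 but was only ingested at 11:58.
late_event = {
    "name": "suspicious_logon",
    "event_time": datetime(2025, 1, 6, 9, 0),
    "ingest_time": datetime(2025, 1, 6, 11, 58),
}
now = datetime(2025, 1, 6, 12, 0)
last_run = now - timedelta(minutes=5)  # hypothetical 5-minute schedule

missed = select_by_event_time([late_event], last_run, now)   # window misses it
caught = select_by_ingest_time([late_event], last_run, now)  # window sees it
```

Because consecutive ingest-time windows tile the timeline without gaps, each event is evaluated exactly once, no matter how late it arrives.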
When Event Time Is Part of the Detection Logic
If the detection logic relies on event timestamps, for example in baseline comparisons (e.g., comparing the last 30 minutes of activity to the previous 12 hours), using ingest time can lead to both false negatives and false positives, because late-arriving events are counted in the wrong window. Delayed lookback windows are more reliable in these cases: shifting the window back gives delayed data time to arrive, so the analysis includes all relevant events.
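A delayed lookback window for the baseline example above might look like the following Python sketch. It assumes a fixed `delay` chosen from your observed ingestion delay. The helper names (`delayed_windows`, `count_in_window`, `is_spike`) and the threshold `factor` are hypothetical and only serve the illustration.

```python
from datetime import datetime, timedelta


def delayed_windows(now, delay, recent=timedelta(minutes=30), baseline=timedelta(hours=12)):
    """Return (baseline_window, recent_window), both shifted back by `delay`.

    Each window is a (start, end) pair over event time, start inclusive.
    """
    recent_end = now - delay
    recent_start = recent_end - recent
    baseline_start = recent_start - baseline
    return (baseline_start, recent_start), (recent_start, recent_end)


def count_in_window(event_times, window):
    start, end = window
    return sum(1 for t in event_times if start <= t < end)


def is_spike(event_times, now, delay, factor=3.0):
    """True if recent activity exceeds `factor` times the scaled baseline."""
    baseline_win, recent_win = delayed_windows(now, delay)
    recent_count = count_in_window(event_times, recent_win)
    baseline_count = count_in_window(event_times, baseline_win)
    # Scale the baseline count down to the length of the recent window.
    scale = (recent_win[1] - recent_win[0]) / (baseline_win[1] - baseline_win[0])
    expected = baseline_count * scale
    # Floor the expectation at 1 so an empty baseline does not flag everything.
    return recent_count > factor * max(expected, 1.0)
```

With `now` at 12:00 and a 30-minute delay, the recent window covers 11:00 to 11:30 and the baseline covers the 12 hours before 11:00, so both windows only contain data that has had at least 30 minutes to arrive.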
Queries Leveraging Joins within the Same Table
When detection queries involve joining data from different sources within the same table, they often rely on event time either implicitly or explicitly. For example, you may be correlating events from member servers with the events from domain controllers. If these sources experience varying ingestion delays, the join operation may fail to align events properly, resulting in missed detections or incorrect matches when ingest time is used. Using delayed lookback windows accounts for these disparities, reducing the likelihood of false negatives and positives.
Queries Leveraging Joins Across Tables
When detection queries involve joining data from different tables, they usually rely on event time either implicitly or explicitly. If these tables experience varying ingestion delays as explained above, the join operation may fail to align events properly, resulting in missed detections or incorrect matches when ingest time is used. Again, using delayed lookback windows accounts for these disparities, reducing the likelihood of false negatives and positives.
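The join problem described in the last two sections can be reproduced with a small Python simulation. This is a sketch under stated assumptions: two related events on the same host, one ingested after 2 minutes and one after 30 minutes, and a rule that runs every 30 minutes. All function names, field names, and timings are hypothetical.

```python
from datetime import datetime, timedelta


def join_on_host(left, right):
    """Pair up events from both sources that share the same host."""
    return [(l, r) for l in left for r in right if l["host"] == r["host"]]


def run_ingest_time(left, right, now, window):
    """Join only the events ingested in (now - window, now]."""
    start = now - window

    def pick(events):
        return [e for e in events if start < e["ingest_time"] <= now]

    return join_on_host(pick(left), pick(right))


def run_delayed_event_time(left, right, now, window, delay):
    """Join events whose event time lies in a window shifted back by `delay`.

    Only events already ingested at `now` are visible to the query.
    """
    end = now - delay
    start = end - window

    def pick(events):
        return [
            e for e in events
            if e["ingest_time"] <= now and start < e["event_time"] <= end
        ]

    return join_on_host(pick(left), pick(right))
```

In this setup, an ingest-time window at 12:00 sees only the fast event and the next one at 12:30 sees only the slow event, so the pair never meets. A delayed event-time window evaluated at 12:30 with a 30-minute delay covers 11:30 to 12:00 in event time and finds both. The delay has to be at least as long as the slowest source involved in the join.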
Balancing Time-to-Detect and Accuracy
While delayed lookback windows may increase the time it takes to detect a threat, this tradeoff is often worthwhile. I think a slightly delayed detection is preferable to completely missing an attack due to ingestion delays.
This approach ensures that critical events are not overlooked, even under challenging logging conditions. By carefully adjusting lookback windows and tailoring queries to the specific requirements of each detection scenario, we can strike a balance between timely response and robust threat detection.
Conclusion
Log ingestion delays are a complex but manageable aspect of detection engineering. By understanding the patterns of these delays and employing strategies tailored to specific detection scenarios, we can build resilient detection systems that maintain accuracy and effectiveness under varying conditions.
Copyright © 2025