August 23rd, 2006

Data structures

One of my recent tasks at work has been bringing a raging, out-of-control nagios installation back to earth. After changing a bunch of things, the 15-minute load average has dropped from the 8-12 range to the 3-6 range. The biggest CPU hog remaining on the system is nagios itself.

I poked around in the source code and looked at the nagios main loop. Half of nagios is essentially an event scheduler: perform task X, then put it back in the run queue to run in Y seconds. We have around 3200 tasks. Most are scheduled on 5-minute intervals, but some run every minute, some every hour, and some somewhere in between.

To do this, nagios keeps a singly-linked list of all tasks and, when inserting a task, walks the list until it finds the right place. That is O(n) work for every single insert. D'oh.
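For a sense of scale, here is a rough Python sketch of that pattern next to a binary heap. This is not the nagios source: `Node`, `list_insert`, `Entry`, and `demo` are names made up for illustration, and the workload is simplified to 3200 tasks that all share one 300-second interval. It pops the task that is due first, reschedules it one interval later, and counts how much work the re-insert takes in each structure.

```python
import heapq
import random


class Node:
    """One entry in a singly-linked list kept sorted by run time."""

    def __init__(self, run_at, task, next_node=None):
        self.run_at = run_at
        self.task = task
        self.next = next_node


def list_insert(head, run_at, task):
    """Insert into the sorted list; return (new_head, nodes_walked)."""
    if head is None or run_at < head.run_at:
        return Node(run_at, task, head), 0
    walked = 0
    cur = head
    # Walk until the next node is due later than the new task.
    while cur.next is not None and cur.next.run_at <= run_at:
        cur = cur.next
        walked += 1
    cur.next = Node(run_at, task, cur.next)
    return head, walked


class Entry:
    """Heap entry that counts how often heapq compares it."""

    comparisons = 0

    def __init__(self, run_at, task):
        self.run_at = run_at
        self.task = task

    def __lt__(self, other):
        Entry.comparisons += 1
        return self.run_at < other.run_at


def demo(n=3200, interval=300.0, seed=1):
    """Reschedule one task; return (list nodes walked, heap comparisons)."""
    rng = random.Random(seed)
    times = sorted(rng.uniform(0, interval) for _ in range(n))

    # Build the list latest-first so each setup insert lands at the head.
    head = None
    for i in range(n - 1, -1, -1):
        head, _ = list_insert(head, times[i], i)
    # Run the first task, then put it back one interval later.
    first, head = head, head.next
    head, walked = list_insert(head, first.run_at + interval, first.task)

    # Same workload on a binary heap.
    heap = [Entry(t, i) for i, t in enumerate(times)]
    heapq.heapify(heap)
    due = heapq.heappop(heap)
    Entry.comparisons = 0  # count only the re-insert
    heapq.heappush(heap, Entry(due.run_at + interval, due.task))
    return walked, Entry.comparisons


if __name__ == "__main__":
    print(demo())
```

In this setup the rescheduled task belongs at the very end, so the list walk passes nearly every node, while a heap push on 3200 entries needs at most about log2(3200), roughly 12, comparisons. The heap does pay a similar logarithmic cost on each pop, where the list pops its head for free, so this is a sketch of the trade-off rather than a measurement.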

(I have to actually profile before I consider this the main CPU-hogging part of nagios, but it doesn't give me hope for the quality of the rest of the system.)