Extremely high load average around 03:30 (AM) each night

elgholm · 10-09-2023, 03:08 AM

Quote:

Originally Posted by wpeckham

Actually those numbers do not reflect CPU load at all as far as I can tell. They appear to be related to queue loading as measured by stack size and/or response time snapshots. Many things can affect those numbers.

I agree that this does not appear any kind of problem and a detailed investigation may not be a good use of your time, but if the machine were one of mine I would be VERY curious!
Checking the I/O, memory, and CPU usage of the processes running during the event window and comparing against results outside the window might prove instructive. It is certainly something I would try.

Yes, and this is something I've tried to do. But the scripts I've run show me nothing out of the ordinary, and that's why I'm asking the question here if there's some other monitoring-command or stuff I can try - which perhaps will indicate to me what process to check out.

And yes, exactly, the CPU isn't swamped with work during this time - so I also believe it to be a queueing of processes/threads that stumble upon each other a bit (wait states), which pushes the load average through the roof for a very short amount of time. The problem is, as I've written above, that this makes our monitoring of the server sad - since the high peaks f*ck up our charts. =(

Yes, very curious indeed =)

elgholm · 10-09-2023, 03:12 AM

Quote:

Originally Posted by syg00

Yeah, me too.

But if the attempts the OP has already tried don't show anything, I'd be looking at some kernel traces. Hard to convince me it's more important than my weekend slacking off tho' ....

loadavg is probably the most misunderstood (and pointless) metric in linux IMHO.

Kernel traces sounds a bit too excessive for my taste, and I hope there's easier methods to just monitor and _find_ what process it is that's doing this at around that time. But we'll see. I'm thinking of starting the psacct service during the window now, and also hammering out more textfiles using pidstat. Hopefully that catches something.

loadavg is certainly pointless when people think of it like "CPU usage" - which it is not, really - but it's actually quite a good metric just for seeing stuff that stands out, like this... =(
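
For anyone following along: on Linux the load average counts tasks that are runnable (state R) as well as tasks in uninterruptible sleep (state D, typically waiting on I/O), which is why it can spike while the CPU is mostly idle. A minimal sketch for seeing which tasks are in those two states at a given moment; the function names are made up for illustration, and the `ps` process itself will normally show up as R:

```shell
# Keep only tasks in state R (runnable) or D (uninterruptible sleep),
# the two states that Linux counts toward the load average.
# Reads "STATE PID ..." lines on stdin.
filter_load_tasks() {
    awk '$1 ~ /^[RD]/'
}

# Print a timestamp followed by every thread currently in state R or D.
snapshot_load_tasks() {
    date '+%F %T'
    ps -eL -o state= -o pid= -o tid= -o comm= | filter_load_tasks
}
```

Calling `snapshot_load_tasks` once a second during the window should show whether the spike is a pile of D-state threads (I/O waits) or a burst of runnable ones.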

elgholm · 10-09-2023, 03:33 AM

Quote:

Originally Posted by MadeInGermany

Make a cron job that at 3:30 runs
pidstat >pidstat.out

I'm gonna start this monster at 03:25 now.

pidstat -l 1 400 >/tmp/ts.txt &
pidstat -r -l 1 400 >/tmp/tr.txt &
pidstat -t -l 1 400 >/tmp/tt.txt &
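
A possible way to schedule the three collectors above, sketched under a few assumptions: the script path and the date-stamped file names are hypothetical, chosen so that each night's capture is kept instead of overwritten. Putting the `date` call in a script rather than in the crontab line also avoids having to escape `%` in crontab.

```shell
# Start the three pidstat collectors from the post above, writing to
# date-stamped files. The output directory defaults to /tmp as in the
# original commands; the file naming scheme is an assumption.
start_pidstat_capture() {
    outdir=${1:-/tmp}
    stamp=$(date +%Y%m%d)
    pidstat -l 1 400    >"$outdir/ts-$stamp.txt" &
    pidstat -r -l 1 400 >"$outdir/tr-$stamp.txt" &
    pidstat -t -l 1 400 >"$outdir/tt-$stamp.txt" &
}

# Example crontab line, assuming this file is saved as the hypothetical
# /usr/local/sbin/pidstat-capture.sh and ends with a call to the function:
# 25 3 * * * /usr/local/sbin/pidstat-capture.sh
```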

pan64 · 10-09-2023, 03:38 AM

you might want to run it several times, from 3:25 to 3:45 even every second (for example).

elgholm · 10-10-2023, 02:42 AM

Quote:

Originally Posted by pan64

you might want to run it several times, from 3:25 to 3:45 even every second (for example).

I run it 400 times, once per second, so it covers the period when something is happening. That's what the parameters "1 400" specify.

Also, my previous script runs as a daemon, and wakes up as soon as the load average is above a threshold and then logs a bunch of things for a minute or two. But that script misses the actual burst of jobs happening.
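
The actual script isn't shown in the thread, so the following is only a guess at the general shape of such a threshold-triggered watcher; the function names, the default threshold, and the log path are all hypothetical. It also hints at why this approach tends to miss the burst: the 1-minute load average is a damped average, so by the time it crosses the threshold the tasks that caused the rise may already be gone.

```shell
# Succeed if the 1-minute load average (first field of a
# /proc/loadavg-style line) is above the given threshold.
# $1 = line such as "12.50 3.10 1.05 2/345 6789", $2 = threshold
load_above() {
    printf '%s\n' "$1" | awk -v t="$2" '{ exit !($1 > t) }'
}

# Poll once per second; while the threshold is exceeded, append a
# timestamp, the load figures, and all threads in state R or D.
watch_load() {
    threshold=${1:-8}                 # hypothetical default
    logfile=${2:-/tmp/loadwatch.log}  # hypothetical default
    while sleep 1; do
        if load_above "$(cat /proc/loadavg)" "$threshold"; then
            {
                date '+%F %T'
                cat /proc/loadavg
                ps -eL -o state= -o pid= -o tid= -o comm= | awk '$1 ~ /^[RD]/'
            } >>"$logfile"
        fi
    done
}
```

Logging unconditionally during a fixed window, as with the pidstat runs, avoids the lag at the cost of more output to sift through.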

metaed · 10-19-2023, 12:42 PM

Quote:

Originally Posted by elgholm

that script misses the actual burst of jobs happening

The only tool I know of that is 100 % reliable at recording short-lived processes is process accounting. When enabled, process accounting logs every process when it exits, no matter how short-lived.

If there are no short-lived processes logged in the process accounting file around the time in question, then you can safely conclude it is a long-lived process, not a burst of short-lived processes.
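
Once accounting has been running through the event window, `lastcomm` lists the recorded commands. A small sketch for narrowing that list to a time-of-day window, assuming each `lastcomm` line ends with the time as HH:MM (worth checking against the output on your system); `in_window` is a made-up helper name:

```shell
# Keep accounting records whose last field (HH:MM) falls within
# [start, end]. Comparison is done as strings, which orders
# zero-padded HH:MM values correctly.
in_window() {
    awk -v s="$1" -v e="$2" '($NF "") >= s && ($NF "") <= e'
}

# Usage:
#   lastcomm | in_window 03:20 03:35
```

Note that this filters on time of day only, so records from earlier nights in the same accounting file will also match.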

elgholm · 10-20-2023, 04:21 AM

I think I've narrowed it down.
It's exim_tidydb that gets started by anacron a couple of minutes before 03:30.

Now I just need to investigate why it wreaks havoc with my load avg.

Thank you all for giving me some pointers! Really helpful, much appreciated, and also nice to not feel so alone out there =)

elgholm · 10-20-2023, 04:40 AM

Quote:

Originally Posted by elgholm

I think I've narrowed it down.
It's exim_tidydb that gets started by anacron a couple of minutes before 03:30.

And just to quote myself... I don't even use this server as a mailserver, so, it's a standard exim install - just to be able to shuffle local mail. So it doesn't have any large databases to "tidy" up.

metaed · 10-20-2023, 12:13 PM

Quote:

Originally Posted by elgholm

It's exim_tidydb that gets started by anacron a couple of minutes before 03:30

How many messages in queue? exim -bp

elgholm · 10-21-2023, 03:08 AM

Quote:

Originally Posted by metaed

How many messages in queue? exim -bp

2.

I don't even use exim on this server. It's just default installed.

What's funny is that when I put the anacron-job back, the spike reappeared, so it's most probably that. But when I run it manually, as root, exactly nothing happens - as is expected, since there's nothing to clean. But somehow it manages to find a bunch of stuff to "do" at 03:20-03:30 each day... Weird.

elgholm · 10-21-2023, 03:54 AM

Update: I've now cleared everything in the /var/spool/exim/db directory and restarted Exim - which is just a default setup (it doesn't handle any mail for real, just sends it off to another host).

The directory had some files in it, nothing big. I think the biggest one was like 1-200 KB or so.

Now it's just wait and see..

MadeInGermany · 10-21-2023, 06:11 AM

Perhaps other stuff is running at that time that makes the disk slow.
Try to shift it to another time slot.
Low mail usage: then let it run at weekends only.

elgholm · 11-02-2023, 02:38 AM

Quote:

Originally Posted by MadeInGermany

Perhaps other stuff is running at that time that makes the disk slow.
Try to shift it to another time slot.
Low mail usage: then let it run at weekends only.

Yeah, but it wasn't that. Still no idea what is causing this. =(

pan64 · 11-02-2023, 03:49 AM

What is causing what? Do you have any data/log/info/whatever to analyze?

elgholm · 11-03-2023, 02:19 AM

Quote:

Originally Posted by pan64

What is causing what? Do you have any data/log/info/whatever to analyze?

No. That's exactly what I need help to find.
I've checked all the logs I know about, but there's nothing special in them. That's the issue.
I'm hoping for someone to perhaps point me towards something I haven't tried yet.
Attached you can see the load average for tonight... =(