Extremely high load average around 03:30 (AM) each night
Actually those numbers do not reflect CPU load at all as far as I can tell. They appear to be related to queue loading as measured by stack size and/or response time snapshots. Many things can affect those numbers.
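For reference, on Linux the load average is not a CPU-utilisation figure: it is a damped average of the number of tasks that are either runnable (state R) or in uninterruptible sleep (state D, usually waiting on disk or network I/O). A minimal Python sketch, assuming a Linux /proc filesystem, that parses /proc/loadavg (1-, 5- and 15-minute averages, then runnable/total scheduling entities, then the most recently created PID) and counts the threads currently in R or D state, which is roughly the instantaneous quantity the kernel averages:

```python
import os

def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into named fields."""
    fields = text.split()
    running, total = fields[3].split("/")
    return {
        "load1": float(fields[0]),
        "load5": float(fields[1]),
        "load15": float(fields[2]),
        "runnable": int(running),   # entities runnable right now
        "entities": int(total),     # scheduling entities that exist
        "last_pid": int(fields[4]),
    }

def task_state(stat_line):
    """Return the one-letter state from a /proc/.../stat line.

    The command name sits in parentheses and may itself contain spaces or
    ')', so the state is taken from just after the LAST closing parenthesis.
    """
    return stat_line[stat_line.rfind(")") + 2]

def count_rd_tasks(proc="/proc"):
    """Count threads currently in R (runnable) or D (uninterruptible) state."""
    counts = {"R": 0, "D": 0}
    for pid in os.listdir(proc):
        if not pid.isdigit():
            continue
        task_dir = os.path.join(proc, pid, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # the process exited while we were scanning
        for tid in tids:
            try:
                with open(os.path.join(task_dir, tid, "stat")) as f:
                    state = task_state(f.read())
            except OSError:
                continue
            if state in counts:
                counts[state] += 1
    return counts

if __name__ == "__main__" and os.path.exists("/proc/loadavg"):
    with open("/proc/loadavg") as f:
        print(parse_loadavg(f.read()))
    print(count_rd_tasks())
```

A sudden jump in the D count with little CPU use would fit the "queueing" explanation discussed in this thread.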
I agree that this does not appear to be any kind of problem and a detailed investigation may not be a good use of your time, but if the machine were one of mine I would be VERY curious!
Checking the I/O, memory, and CPU usage of the processes running during the event window and comparing against results outside the window might prove instructive. It is certainly something I would try.
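For the I/O side of that comparison, here is a minimal Python sketch that diffs /proc/<pid>/io between two snapshots and lists the processes that read or wrote the most in between. Assumptions: Linux with per-task I/O accounting enabled (so the io file exists) and root privileges to read other users' processes. A process that starts and exits between the two snapshots is not seen at all.

```python
import os
import time

def read_io(pid, proc="/proc"):
    """Return the key/value pairs of /proc/<pid>/io as ints."""
    values = {}
    with open(os.path.join(proc, str(pid), "io")) as f:
        for line in f:
            key, _, value = line.partition(":")
            values[key.strip()] = int(value)
    return values

def snapshot(proc="/proc"):
    """Map pid -> (command name, bytes read + written at the storage layer)."""
    snap = {}
    for pid in os.listdir(proc):
        if not pid.isdigit():
            continue
        try:
            io = read_io(pid, proc)
            with open(os.path.join(proc, pid, "comm")) as f:
                comm = f.read().strip()
        except OSError:
            continue  # exited, or not readable without root
        snap[int(pid)] = (comm, io.get("read_bytes", 0) + io.get("write_bytes", 0))
    return snap

def top_io(before, after, n=10):
    """Return up to n (delta_bytes, pid, comm) tuples, largest delta first.

    PIDs that only appear in `after` are counted from zero.
    """
    deltas = []
    for pid, (comm, total) in after.items():
        previous = before.get(pid, (comm, 0))[1]
        deltas.append((total - previous, pid, comm))
    deltas.sort(reverse=True)
    return deltas[:n]

if __name__ == "__main__" and os.path.exists("/proc/self/io"):
    first = snapshot()
    time.sleep(1.0)
    for delta, pid, comm in top_io(first, snapshot()):
        print(f"{delta:>12} bytes  {pid:>7}  {comm}")
```

Running it once inside the 03:20-03:30 window and once at another time gives the two sides of the comparison.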
Yes, and this is something I've tried to do. But the scripts I've run show me nothing out of the ordinary, and that's why I'm asking here whether there's some other monitoring command or tool I can try, which might point me to the process to check out.
And yes, exactly, the CPU isn't swamped with work during this time, so I also believe it to be a queueing of processes/threads that stumble over each other a bit (wait states), which pushes the load average through the roof for a very short time. The problem is, as I've written above, that this makes our monitoring of the server sad, since the high peaks f*ck up our charts. =(
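The "short burst, tall spike" behaviour follows from how the averages are computed. The kernel samples the number of R + D tasks roughly every 5 seconds and updates each average as load = load*e + n*(1 - e), with e = exp(-5/60) for the 1-minute figure and exp(-5/300), exp(-5/900) for the 5- and 15-minute ones. The kernel does this in fixed-point arithmetic; the Python sketch below uses floats and a made-up burst of 40 blocked tasks, purely to illustrate why a 15-second pile-up shows clearly on the 1-minute average and much less on the 15-minute one.

```python
import math

SAMPLE_SECONDS = 5  # the kernel recomputes the averages roughly every 5 seconds

def simulate(active_counts, window_seconds=60, start=0.0):
    """Feed 5-second samples of active (R + D) task counts through
    load = load * e + n * (1 - e), where e = exp(-5 / window_seconds)."""
    e = math.exp(-SAMPLE_SECONDS / window_seconds)
    load = start
    history = []
    for n in active_counts:
        load = load * e + n * (1 - e)
        history.append(load)
    return history

if __name__ == "__main__":
    # Hypothetical night: 1 minute idle, 40 tasks stuck in D state for 15 s, 2 minutes idle.
    samples = [0] * 12 + [40] * 3 + [0] * 24
    for minutes in (1, 5, 15):
        peak = max(simulate(samples, window_seconds=60 * minutes))
        print(f"{minutes:>2}-minute average peaks at {peak:.2f}")
```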
But if what the OP has already tried doesn't show anything, I'd be looking at some kernel traces. Hard to convince me it's more important than my weekend slacking off tho' ....
loadavg is probably the most misunderstood (and pointless) metric in linux IMHO.
Kernel traces sound a bit excessive for my taste, and I hope there are easier methods to just monitor and _find_ the process that's doing this at around that time. But we'll see. I'm thinking of starting the psacct service during the window now, and also hammering out more text files using pidstat. Hopefully that catches something.
loadavg is certainly pointless when people think of it as "CPU usage", which it really is not, but it's actually quite a good metric for spotting stuff that stands out, like this... =(
You might want to run it several times, from 3:25 to 3:45, even every second (for example).
I run it 400 times, once each second, so it covers the period when something is happening. That's what the parameters "1 400" specify.
Also, my previous script runs as a daemon, and wakes up as soon as the load average is above a threshold and then logs a bunch of things for a minute or two. But that script misses the actual burst of jobs happening.
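Part of why a watcher that wakes up on a high load average misses the burst is that the 1-minute average lags the event that causes it. A possible workaround, sketched below under the assumption of a Linux /proc filesystem and Python 3, is to sample the raw list of R/D threads every second into a fixed-size ring buffer and, when the instantaneous count crosses a threshold, print the buffered seconds leading up to it. The threshold of 20 and the 120-sample history are arbitrary placeholders.

```python
import collections
import os
import time

def parse_stat(line):
    """Return (state, comm) from one /proc/<pid>/task/<tid>/stat line."""
    start, end = line.find("("), line.rfind(")")
    return line[end + 2], line[start + 1:end]

def rd_tasks(proc="/proc"):
    """List (state, comm) for every thread currently in R or D state."""
    found = []
    for pid in filter(str.isdigit, os.listdir(proc)):
        task_dir = os.path.join(proc, pid, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # the process exited mid-scan
        for tid in tids:
            try:
                with open(os.path.join(task_dir, tid, "stat")) as f:
                    state, comm = parse_stat(f.read())
            except OSError:
                continue
            if state in ("R", "D"):
                found.append((state, comm))
    return sorted(found)

class PreTriggerRecorder:
    """Keep only the most recent `history` samples in memory."""

    def __init__(self, history=120):
        self.buffer = collections.deque(maxlen=history)

    def add(self, timestamp, sample):
        self.buffer.append((timestamp, sample))

    def drain(self):
        """Return the buffered samples (oldest first) and empty the buffer."""
        items = list(self.buffer)
        self.buffer.clear()
        return items

def watch(threshold=20, history=120, interval=1.0):
    """Sample forever; when the instantaneous R+D count reaches `threshold`,
    print the whole buffered history, lead-up included."""
    recorder = PreTriggerRecorder(history)
    while True:
        sample = rd_tasks()
        recorder.add(time.time(), sample)
        if len(sample) >= threshold:
            for ts, tasks in recorder.drain():
                stamp = time.strftime("%H:%M:%S", time.localtime(ts))
                top = collections.Counter(comm for _, comm in tasks).most_common(5)
                print(stamp, len(tasks), top, flush=True)
        time.sleep(interval)
```

Started shortly before 03:20 with watch(threshold=20), the output would show which command names were runnable or blocked in the seconds before and during the spike, instead of only after the average has caught up.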
The only tool I know of that is 100 % reliable at recording short-lived processes is process accounting. When enabled, process accounting logs every process when it exits, no matter how short-lived.
If there are no short-lived processes logged in the process accounting file around the time in question, then you can safely conclude it is a long-lived process, not a burst of short-lived processes.
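lastcomm (from the psacct/acct package) is the usual way to read the accounting file. If you want to filter it by process start time instead, here is a minimal Python sketch that decodes the raw records. It assumes the kernel writes the version 3 record format (struct acct_v3 in <linux/acct.h>, 64 bytes; check that your kernel was built with it, the sketch raises an error if the version byte disagrees), a little-endian machine such as x86-64, and the RHEL/CentOS psacct path /var/account/pacct (Debian-based systems use /var/log/account/pacct).

```python
import struct

# struct acct_v3 from <linux/acct.h>, little-endian (assumed x86-64), 64 bytes:
# flag, version, tty, exitcode, uid, gid, pid, ppid, btime, etime (float),
# eight comp_t counters, and a 16-byte NUL-padded command name.
ACCT_V3 = struct.Struct("<BBHIIIIIIf8H16s")
assert ACCT_V3.size == 64

# Assumption: RHEL/CentOS psacct location; Debian uses /var/log/account/pacct.
DEFAULT_PACCT = "/var/account/pacct"

def read_pacct(data):
    """Yield (start_time, pid, ppid, uid, command) for each complete v3 record."""
    usable = len(data) - len(data) % ACCT_V3.size
    for offset in range(0, usable, ACCT_V3.size):
        (_flag, version, _tty, _exitcode, uid, _gid, pid, ppid, btime, _etime,
         *_counters, comm) = ACCT_V3.unpack_from(data, offset)
        if version & 0x7F != 3:  # the top bit (ACCT_BYTEORDER) is set on big-endian kernels
            raise ValueError(f"record at offset {offset} has version {version}, not 3")
        yield btime, pid, ppid, uid, comm.split(b"\0", 1)[0].decode(errors="replace")

def started_between(records, start, end):
    """Keep records whose process start time (epoch seconds) is in [start, end)."""
    return [r for r in records if start <= r[0] < end]

def load(path=DEFAULT_PACCT):
    """Read and decode a whole accounting file (needs root)."""
    with open(path, "rb") as f:
        return list(read_pacct(f.read()))
```

Passing time.mktime() values for 03:15 and 03:45 of the night in question as start and end to started_between(load(), start, end) lists every process, however short-lived, that began in that window. Note that the kernel truncates command names to 15 characters.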
I think I've narrowed it down.
It's exim_tidydb that gets started by anacron a couple of minutes before 03:30.
And just to quote myself... I don't even use this server as a mail server, so it's a standard exim install, just there to shuffle local mail. So it doesn't have any large databases to "tidy" up.
I don't even use exim on this server. It's just installed by default.
What's funny is that when I put the anacron job back, the spike reappeared, so it's most probably that. But when I run it manually, as root, exactly nothing happens, as expected, since there's nothing to clean. But somehow it manages to find a bunch of stuff to "do" at 03:20-03:30 each day... Weird.
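One way to compare the manual run with the scheduled one is to have the job record how much CPU time and block I/O it actually caused. A minimal Python sketch using resource.getrusage(RUSAGE_CHILDREN) around subprocess.run; the command line is a placeholder, since the exact exim_tidydb invocation used by the anacron job isn't shown in this thread, and the script name runusage.py is hypothetical. RUSAGE_CHILDREN only covers children that have exited and been waited for, so anything the command leaves running in the background is not counted.

```python
import resource
import subprocess
import sys

def run_with_usage(cmd):
    """Run cmd, wait for it, and return (exit code, resources it consumed).

    RUSAGE_CHILDREN accumulates over all waited-for children, so the
    difference between the readings before and after is this command's share.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    result = subprocess.run(cmd)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    usage = {
        "user_seconds": after.ru_utime - before.ru_utime,
        "system_seconds": after.ru_stime - before.ru_stime,
        "blocks_read": after.ru_inblock - before.ru_inblock,
        "blocks_written": after.ru_oublock - before.ru_oublock,
    }
    return result.returncode, usage

if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python3 runusage.py <the command the anacron job runs>
    code, usage = run_with_usage(sys.argv[1:])
    print(code, usage, file=sys.stderr)
    sys.exit(code)
```

If the scheduled run reports next to no CPU time or block I/O, the job itself is probably not the load; it may just coincide with something else in that window.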
Update: I've now cleared everything in the /var/spool/exim/db directory and restarted Exim, which is just a default setup (it doesn't handle any mail for real, just sends it off to another host).
The directory had some files in it, nothing big. I think the biggest one was around 100-200 KB.
Perhaps other stuff is running at that time that makes the disk slow.
Try to shift it to another time slot.
Low mail usage: then let it run at weekends only.
Yeah, but it wasn't that. Still no idea what is causing this. =(
What is causing what? Do you have any data/log/info/whatever to analyze?
No. That's exactly what I need help to find.
I've checked all the logs I know about, but there's nothing special in them. That's the issue.
I'm hoping for someone to perhaps point me towards something I haven't tried yet.
Attached you can see the load average for tonight... =(