Determine what freezes a server

lenainjaune · 04-16-2024, 05:24 PM

Sorry we missed your answer ...

Quote:

Originally Posted by TB0ne

1 - How, exactly, did you 'swap' things to a new machine, and what kind of machine did you swap TO? And it 100% could be hardware related, since (if you're just moving the hard drive), that IT is the problem.

2 - Errors will obviously show up BEFORE the crash...afterwards, it's showing normal boot/warning messages.

3 - If you have several servers, why is it a bad idea to put in a monitoring solution that can watch whatever you have now and whatever you ADD??

You're omitting a good bit:

4 - What kind of hardware you're moving this hard drive to
5 - What services are running on this server
6 - Has anything changed/been modified/added to this server before this problem started?
7 - How many users?
8 - How much storage?
9 - How much memory? (and have you tested THAT as well??)

There are loads of factors that can cause this, but you've not given us any error messages to work with.

1 - We have 2 identical machines. We simply replaced the disk of the good machine with the one from the supposedly bad machine, and since we observed the same symptoms we deduced that the problem is not hardware.
2 - Are you saying that some error messages are missed and never written to the logs afterwards? In any case, for now we will at least be able to see what is displayed on the screen before forcing the restart.
3 - It is a good idea and it is planned, but it is not the priority right now. Also, we would prefer not to install Nagios, as it is far too big for our needs (we tested netdata, which is lighter).
4 - see 1
5 - A degraded proprietary telephony IPBX server
6 - No, not that we know
7 - About 60 persons
8 - Less than 100 GB
9 - 4 GB RAM

What we are looking for is to understand what is wrong. We lack a method.

wpeckham · 04-16-2024, 05:32 PM

Instead of moving that drive to a different identical server, try taking a new drive of the same capacity and type, CLONE or image the old one onto the new one, and put THAT in the other machine. If it still freezes the same it is either a software issue or the new drive is faulty. (Which is very unlikely.)

IF it is software, and those logs are not giving you useful data, you might need to turn on better logging. Be warned, additional logging may degrade performance, but is the option most likely to give you useful "root cause" information.
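As a rough illustration of "better logging", here is a minimal sketch assuming the server uses systemd-journald. It prints a journald drop-in that keeps the journal across reboots and syncs it to disk more often than the 5-minute default. The function name, the drop-in file name, and the 10s interval are illustrative choices, not something taken from this thread.

```shell
#!/bin/sh
# Sketch only: assumes systemd-journald is the logging daemon.

# Print a journald drop-in that keeps logs across reboots and flushes
# them to disk more frequently.
# $1 = sync interval (hypothetical default chosen here: 10s)
journald_dropin() {
    printf '[Journal]\nStorage=persistent\nSyncIntervalSec=%s\n' "${1:-10s}"
}

# Intended use, as root (not executed by this sketch):
#   mkdir -p /etc/systemd/journald.conf.d
#   journald_dropin 10s > /etc/systemd/journald.conf.d/freeze-debug.conf
#   systemctl restart systemd-journald
# After the next freeze and forced reboot, read the end of the previous boot:
#   journalctl -b -1 -e
```

A shorter sync interval means more disk writes, which is the performance cost mentioned above, so it is probably best removed once the cause is found.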

lenainjaune · 04-19-2024, 07:54 AM

Quote:

Originally Posted by wpeckham

Instead of moving that drive to a different identical server, try taking a new drive of the same capacity and type, CLONE or image the old one onto the new one, and put THAT in the other machine. If it still freezes the same it is either a software issue or the new drive is faulty. (Which is very unlikely.)

Oh yes! We completely missed this point, thank you for reminding us. We will clone it and test it!

Quote:

Originally Posted by wpeckham

IF it is software, and those logs are not giving you useful data, you might need to turn on better logging. Be warned, additional logging may degrade performance, but is the option most likely to give you useful "root cause" information.

Yes, that is exactly what we want, but we do not know how to activate it. We are aware that this will degrade performance, but we will test it temporarily to see what happens.

lenainjaune · 04-19-2024, 11:13 AM

Quote:

Originally Posted by lenainjaune

Oh yes! We completely missed this point, thank you for reminding us. We will clone it and test it!

Cloned with Clonezilla successfully! The system booted and the telephony works. We are now scanning with badblocks (smartctl was already run successfully while the system was running).

Also, since we must photograph the screen when the system freezes, do you have a solution to prevent the monitor from going to sleep?

We heard about setterm, but it does not seem to work (or we are not applying it correctly); neither does adding apm=off to the grub config, nor the related BIOS setting.

wpeckham · 04-19-2024, 12:12 PM

Quote:

Originally Posted by lenainjaune

Cloned with Clonezilla successfully! The system booted and the telephony works. We are now scanning with badblocks (smartctl was already run successfully while the system was running).

Also, since we must photograph the screen when the system freezes, do you have a solution to prevent the monitor from going to sleep?

We heard about setterm, but it does not seem to work (or we are not applying it correctly); neither does adding apm=off to the grub config, nor the related BIOS setting.

Every time I have had a problem like this, there was a code displayed on the monitor. If the monitor was asleep, waking it up displayed the code.

That said, there were times when that code was NOT the one I needed, and I had to redirect the std or kernel messages elsewhere so I could analyze them later.
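One way to send kernel messages "elsewhere" is the kernel's netconsole module, which transmits them over UDP to another machine so they survive a freeze of the local disk and display. The sketch below only builds the module parameter string. The IP addresses, interface name, and MAC address are placeholders, and the function name is made up for illustration.

```shell
#!/bin/sh
# Sketch only: addresses, interface and MAC used below are placeholders.

# Build the netconsole module parameter, which has the form
#   netconsole=<src-port>@<src-ip>/<dev>,<tgt-port>@<tgt-ip>/<tgt-mac>
# 6665 and 6666 are the module's default source and target ports.
netconsole_param() {
    src_ip=$1
    dev=$2
    tgt_ip=$3
    tgt_mac=$4
    printf 'netconsole=%s@%s/%s,%s@%s/%s\n' \
        6665 "$src_ip" "$dev" 6666 "$tgt_ip" "$tgt_mac"
}

# Intended use, as root on the freezing server (not executed here):
#   dmesg -n 8    # let all kernel messages reach the consoles
#   modprobe netconsole "$(netconsole_param 192.0.2.10 eth0 192.0.2.20 00:11:22:33:44:55)"
# On the receiving machine, listen on UDP port 6666, for example:
#   nc -u -l 6666 | tee kernel-messages.log   # option syntax varies between netcat variants
```

This only helps if the network stack is still alive when the last messages are emitted, so it complements rather than replaces a photo of the screen.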

wpeckham · 04-19-2024, 12:15 PM

If the machine with the cloned drive freezes, I would look at rebuilding the OS and software.
IF the cloned machine does not freeze, the hardware that DOES freeze is clearly suspect. Perhaps the old hard drive? One might hope, because replacing that old hard drive is the fastest and easiest fix.

lenainjaune · 04-22-2024, 06:58 AM

Quote:

Originally Posted by wpeckham

Every time I have had a problem like this, there was a code displayed on the monitor. If the monitor was asleep, waking it up displayed the code.

In our case the keyboard had no effect. It did not wake the monitor.

Quote:

Originally Posted by wpeckham

That said, there were times when that code was NOT the one I needed, and I had to redirect the std or kernel messages elsewhere so I could analyze them later.

Do you mean sending the output somewhere other than stdout?

Do you know how to do this globally, to a file?

lenainjaune · 04-22-2024, 07:05 AM

Quote:

Originally Posted by wpeckham

If the machine with the cloned drive freezes, I would look at rebuilding the OS and software.
IF the cloned machine does not freeze, the hardware that DOES freeze is clearly suspect. Perhaps the old hard drive? One might hope, because replacing that old hard drive is the fastest and easiest fix.

The old hard disk has just been checked (SMART + badblocks) ... without error!

So the problem is elsewhere ... or more subtle (we will observe the server for 2 weeks; if it does not freeze even once, the problem would still point to the hard drive)

MadeInGermany · 04-22-2024, 08:08 AM

Perhaps the NIC hangs when it wakes up from power-save.

Quote:

adding the apm=off to the grub config

This is not used!
Instead add acpi=off

For the old grub:

Quote:

Append acpi=off to the kernel boot command line in /boot/grub/grub.conf

For the new grub (grub2):

Quote:

Append acpi=off to the kernel boot command line in /boot/grub2/grub.cfg
and run the command
grub2-mkconfig

Ensure you do this for the running/active kernel line, not for alternative kernel lines (like old kernel or failsafe).
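Note that hand edits to a generated grub.cfg are overwritten the next time the file is regenerated. On grub2 the persistent route is usually to add the parameter to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and then regenerate. A small sketch of the "append only if missing" step follows; the function name is hypothetical.

```shell
#!/bin/sh
# Sketch only: add_kernel_param is an illustrative helper, not a grub tool.

# Print the command line with the parameter appended, unless it is
# already present as a whole word.
# $1 = current value of the command line, $2 = parameter to add
add_kernel_param() {
    cmdline=$1
    param=$2
    case " $cmdline " in
        *" $param "*) printf '%s\n' "$cmdline" ;;           # already there
        *) printf '%s\n' "${cmdline:+$cmdline }$param" ;;   # append
    esac
}

# Intended use (not executed here):
#   add_kernel_param "quiet" "acpi=off"    # prints: quiet acpi=off
# Put the result in GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then run
#   update-grub                               # Debian/Ubuntu wrapper
# or
#   grub2-mkconfig -o /boot/grub2/grub.cfg    # distributions using the grub2- prefix
```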

lenainjaune · 04-22-2024, 09:14 AM

Quote:

Originally Posted by MadeInGermany

Perhaps the NIC hangs when it wakes up from power-save.

This is not used!
Instead add acpi=off

For the old grub:

For the new grub (grub2):

Ensure you do this for the running/active kernel line, not for alternative kernel lines (like old kernel or failsafe).

We configured grub like this:

Code:

root@host:~# cat /etc/default/grub
...
# https://www.linuxquestions.org/questions/linux-hardware-18/acpi-errors-4175648794/#post5965271
GRUB_CMDLINE_LINUX_DEFAULT="quiet acpi=off apm=off"
...

Even so, after the update, the monitor still went into sleep mode.

We will try removing the "apm" parameter to leave only "acpi", but we doubt it will change the result ...

MadeInGermany · 04-22-2024, 11:01 AM

An extra boot argument should not harm; in the worst case it gives an "unknown, ignored" message.
Ensure that the generated /boot/grub/grub.cfg has the desired option!
Run grub2-mkconfig or grub-mkconfig to update it!

The X server (or Wayland) can switch the display to dark without ACPI; some monitors go into power-save soon after going dark.

lenainjaune · 04-22-2024, 11:35 AM

Quote:

Originally Posted by MadeInGermany

An extra boot argument should not harm; in the worst case it gives an "unknown, ignored" message.
Ensure that the generated /boot/grub/grub.cfg has the desired option!
Run grub2-mkconfig or grub-mkconfig to update it!

It has them:

Code:

root@host:~# grep acpi /boot/grub/grub.cfg 
	linux	/vmlinuz-4.9.0-19-amd64 root=/dev/mapper/ipbx--vg-root ro quiet acpi=off apm=off
		linux	/vmlinuz-4.9.0-19-amd64 root=/dev/mapper/ipbx--vg-root ro quiet acpi=off apm=off
		linux	/vmlinuz-4.9.0-6-amd64 root=/dev/mapper/ipbx--vg-root ro quiet acpi=off apm=off
root@host:~# uname -r
4.9.0-19-amd64
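grub.cfg shows what will be used at the next boot; the parameters the running kernel actually received are in /proc/cmdline. A small sketch for checking them follows (has_kernel_param is a made-up helper name). It also reports quiet, since that option suppresses most kernel messages on the console, which matters when the goal is to photograph the last messages before a freeze.

```shell
#!/bin/sh
# Sketch only: has_kernel_param is an illustrative helper.

# Succeed if the command line contains the parameter, either as a bare
# word (quiet) or in name=value form (acpi=off).
# $1 = kernel command line, $2 = parameter name
has_kernel_param() {
    case " $1 " in
        *" $2 "*|*" $2="*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report a few parameters of the running kernel, if /proc/cmdline exists.
if [ -r /proc/cmdline ]; then
    running=$(cat /proc/cmdline)
    for p in acpi apm quiet; do
        if has_kernel_param "$running" "$p"; then
            echo "$p: present"
        else
            echo "$p: absent"
        fi
    done
fi
```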

Quote:

Originally Posted by MadeInGermany

The X server (or Wayland) can switch the display to dark without ACPI; some monitors go into power-save soon after going dark.

We have a system without a graphical server (neither X server nor Wayland). It is a bare server system without a desktop environment.

Furthermore, we did not find a monitor setting dedicated to power-save, unless it is a BIOS parameter.

wpeckham · 04-22-2024, 12:05 PM

Quote:

Originally Posted by lenainjaune

The old hard disk has just been checked (SMART + badblocks) ... without error!

So the problem is elsewhere ... or more subtle (we will observe the server for 2 weeks; if it does not freeze even once, the problem would still point to the hard drive)

#1 a drive problem may not be the MEDIA, or detected by SMART, if it is in the ELECTRONICS! If you have ever had one apart, there is a circuit board in there that interfaces the media heads, motor, and interface circuits to the power and interface to the computer. On those electronics is where SMART lives.
I am looking forward to seeing how it does in the next two weeks!

#2 check your TTY options (Ctrl-Alt-F1 through Ctrl-Alt-F6 or higher) and see if one is echoing the kernel journal messages. That would be the one to check...

As much as I hold out hope that the problem is the storage device, it is not easy to imagine what it could do that would lock the machine so bad that it would go totally unresponsive on network, keyboard, and display. That sounds more like power or thermal/CPU issues, or a failed capacitor on the MB.
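One low-cost way to test the thermal/load idea is to log a health line every few seconds, so the last entry before a freeze shows the state of the machine. This is only a sketch, assuming the standard /proc/loadavg and /sys/class/thermal interfaces; the function names and the log path are hypothetical, and thermal zones may not be exposed at all on some hardware or when booting with acpi=off.

```shell
#!/bin/sh
# Sketch only: function names and the log path are illustrative.

# Convert a sysfs temperature in millidegrees Celsius to degrees with
# one decimal place, e.g. 45500 becomes 45.5.
millideg_to_c() {
    printf '%d.%d\n' $(( $1 / 1000 )) $(( ($1 % 1000) / 100 ))
}

# Print one timestamped line with the load averages and any readable
# thermal zone temperatures.
health_line() {
    line="$(date '+%F %T') load=$(cut -d' ' -f1-3 /proc/loadavg)"
    for f in /sys/class/thermal/thermal_zone*/temp; do
        [ -r "$f" ] || continue
        t=$(cat "$f" 2>/dev/null) || continue
        case $t in ''|*[!0-9]*) continue ;; esac   # skip empty or non-numeric readings
        line="$line temp=$(millideg_to_c "$t")"
    done
    printf '%s\n' "$line"
}

# Intended use, as root (not executed here):
#   while :; do health_line >> /var/log/health.log; sync; sleep 10; done
```

If the temperature or load climbs steadily before each freeze, that points toward cooling or a runaway process; a flat log up to the freeze points elsewhere.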

I am glad for you that the machine is being replaced anyway! It sounds like it is past due.
I also recommend that after this is all over you revisit your DR plan for this operation. IF this is a critical resource, as it sounds, then you might need a better fail-over/replacement and backup/restore plan for recovery. IF not to modify that plan, at least to verify that it is adequate for your needs.

lenainjaune · 04-25-2024, 09:47 AM

Quote:

Originally Posted by wpeckham

#1 a drive problem may not be the MEDIA, or detected by SMART, if it is in the ELECTRONICS! If you have ever had one apart, there is a circuit board in there that interfaces the media heads, motor, and interface circuits to the power and interface to the computer. On those electronics is where SMART lives.
I am looking forward to seeing how it does in the next two weeks!

There was a freeze on 23/04, but unfortunately someone rebooted before a photo was taken. We have just warned users to take a photo before restarting.

At least, since the problem is still there with the cloned system disk, this demonstrates that the problem is not related to the disk.

Quote:

Originally Posted by wpeckham

#2 check your TTY options (Ctrl-Alt-F1 through Ctrl-Alt-F6 or higher) and see if one is echoing the kernel journal messages. That would be the one to check...

As much as I hold out hope that the problem is the storage device, it is not easy to imagine what it could do that would lock the machine so bad that it would go totally unresponsive on network, keyboard, and display. That sounds more like power or thermal/CPU issues, or a failed capacitor on the MB.

I am glad for you that the machine is being replaced anyway! It sounds like it is past due.

Quote:

Originally Posted by wpeckham

I also recommend that after this is all over you revisit your DR plan for this operation. IF this is a critical resource, as it sounds, then you might need a better fail-over/replacement and backup/restore plan for recovery. IF not to modify that plan, at least to verify that it is adequate for your needs.

Yes we will! We are considering running 2 systems in redundancy, so that when the first becomes unresponsive the second takes over ... and, more importantly, we will manage our telephony ourselves with our own Asterisk system.

---

As we managed to keep the screen always on (parameter consoleblank=0 in the grub configuration), we noticed that a simulated crash with a kernel panic (we followed this method to achieve it) is also displayed on the login screen.

Will that be sufficient to get information from before the freeze?

We also experimented with running a journalctl command at boot (so before login) by modifying /etc/rc.local to run the detached command journalctl --follow &. In this case the screen is flooded continuously with no pause (however, we discovered that Ctrl + s can stop it and allow access to another tty with Ctrl + Alt + F2 or another). This flood is strange, because when run after login the command does not flood. We suppose this is not the right way to do it.
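A possible explanation for the flood, not verified on this system, is that output from commands started by rc.local is itself captured by the journal, so every line journalctl prints becomes a new entry to print. An alternative that avoids the loop is to let journald forward messages to a spare virtual terminal. The sketch below prints such a drop-in; ForwardToConsole, TTYPath and MaxLevelConsole are journald.conf options, while the function name, the choice of /dev/tty12 and the info level are assumptions for illustration.

```shell
#!/bin/sh
# Sketch only: the tty and level defaults below are arbitrary choices.

# Print a journald drop-in that forwards journal messages to a tty.
# $1 = tty device (default /dev/tty12), $2 = maximum level (default info)
journald_console_dropin() {
    printf '[Journal]\nForwardToConsole=yes\nTTYPath=%s\nMaxLevelConsole=%s\n' \
        "${1:-/dev/tty12}" "${2:-info}"
}

# Intended use, as root (not executed here):
#   mkdir -p /etc/systemd/journald.conf.d
#   journald_console_dropin /dev/tty12 info > /etc/systemd/journald.conf.d/console.conf
#   systemctl restart systemd-journald
# Then remove the journalctl line from /etc/rc.local and leave the
# monitor on that terminal (Ctrl + Alt + F12) so a photo shows the last messages.
```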

wpeckham · 04-25-2024, 10:22 AM

Information from before the freeze, in particular JUST before the freeze so it is likely to capture the cause, is the ONLY information that might be seriously helpful. AT the freeze logging will stop and you will get no information, and AFTER the freeze is also after the reboot and the cause information may be gone for good.

If I understand correctly:
1. If you move the drive to a different identical machine, that one does freeze.
That would eliminate the original machine hardware EXCEPT the drive.
2. IF cloned to a new drive, it will still freeze. That eliminates the drive itself.

If those are both true, we have eliminated all of the hardware and only a software issue can be left.

What has changed about the software or configuration in the few weeks just before this started?