Linux - Server: This forum is for the discussion of Linux software used in a server-related context.
1 - How, exactly, did you 'swap' things to a new machine, and what kind of machine did you swap TO? It 100% could still be hardware related: if you're just moving the hard drive, then the drive itself could be the problem.
2 - Errors will obviously show up BEFORE the crash...afterwards, it's showing normal boot/warning messages.
3 - If you have several servers, why is it a bad idea to put in a monitoring solution that can watch whatever you have now and whatever you ADD??
You're omitting a good bit:
4 - What kind of hardware you're moving this hard drive to
5 - What services are running on this server
6 - Has anything changed/been modified/added to this server before this problem started?
7 - How many users?
8 - How much storage?
9 - How much memory? (and have you tested THAT as well??)
There are loads of factors that can cause this, but you've not given us any error messages to work with.
1 - We have 2 identical machines. We simply replaced the disk of the good machine with the disk from the supposedly bad one, and since we observed the same symptoms, we deduced that the problem is not hardware.
2 - Are you saying that some error messages are missed and never written to the logs afterwards? In any case, for now we will at least be able to see what is displayed on the screen before forcing the restart.
3 - It is a good idea, but it is not the priority right now... it is planned, though. Also, we would prefer not to install Nagios, as it is too big for our needs (we tested netdata, which is lighter).
4 - See 1
5 - A degraded proprietary telephony IPBX server
6 - No, not that we know of
7 - About 60 people
8 - Less than 100 GB
9 - 4 GB RAM
What we are looking for is to understand what is wrong. We lack a method.
Last edited by lenainjaune; 04-16-2024 at 05:50 PM.
Instead of moving that drive to a different identical server, try taking a new drive of the same capacity and type, CLONE or image the old one onto the new one, and put THAT in the other machine. If it still freezes the same it is either a software issue or the new drive is faulty. (Which is very unlikely.)
IF it is software, and those logs are not giving you useful data, you might need to turn on better logging. Be warned, additional logging may degrade performance, but is the option most likely to give you useful "root cause" information.
Oh yes! We completely missed this point, thank you for reminding us. We will clone it and test it!
Quote:
Originally Posted by wpeckham
IF it is software, and those logs are not giving you useful data, you might need to turn on better logging. Be warned, additional logging may degrade performance, but is the option most likely to give you useful "root cause" information.
Yes, that is exactly what we want, but we do not know how to activate it. We are aware that this will degrade performance, but we will test it temporarily to see what happens.
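One possible way to get "better logging" that survives a forced reboot, assuming the server uses systemd-journald: tell journald to store its journal on disk instead of in memory. The sketch below is not from the thread. `set_journal_persistent` is a hypothetical helper name, and the script assumes GNU sed (`sed -i`) and that `[Journal]` is the only section in the file.

```shell
#!/bin/sh
# Sketch: make journald keep logs across reboots by setting Storage=persistent.
# set_journal_persistent is a hypothetical helper name, not part of systemd.
set_journal_persistent() {
    conf="$1"
    if grep -q '^[#[:space:]]*Storage=' "$conf"; then
        # Replace an existing (possibly commented-out) Storage= line in place.
        sed -i 's/^[#[:space:]]*Storage=.*/Storage=persistent/' "$conf"
    else
        # No Storage= line yet: append one (assumes [Journal] is the only section).
        printf 'Storage=persistent\n' >> "$conf"
    fi
}

# Typical use, as root:
#   set_journal_persistent /etc/systemd/journald.conf
#   systemctl restart systemd-journald
# After the next freeze and forced reboot:
#   journalctl -b -1 -e               # end of the previous boot's log
#   journalctl -b -1 -k -p warning    # kernel messages, warnings and worse
```

The last lines written before a freeze may still be lost if they had not reached the disk yet, so this is a starting point rather than a guarantee.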
Cloned successfully with Clonezilla! The system booted and the telephony works. Now we are scanning for bad blocks with badblocks (smartctl was already run successfully while the system was running).
Also, since we must photograph the screen when the system freezes, do you have a solution to prevent the monitor from going to sleep?
We heard about setterm, but it does not seem to work (or we are not applying it correctly); neither does adding apm=off to the GRUB config, nor the related BIOS setting.
Last edited by lenainjaune; 04-19-2024 at 11:21 AM.
Every time I have had a problem like this, there was a code displayed on the monitor. If the monitor was asleep, waking it up displayed the code.
That said, there were times when that code was NOT the one I needed, and I had to redirect the std or kernel messages elsewhere so I could analyze them later.
If the machine with the cloned drive freezes, I would look at rebuilding the OS and software.
IF the cloned machine does not freeze, the hardware that DOES freeze is clearly suspect. Perhaps the old hard drive? One might hope, because replacing that old hard drive is the fastest and easiest fix.
In our case the keyboard had no effect. It did not wake the monitor.
Quote:
Originally Posted by wpeckham
That said, there were times when that code was NOT the one I needed, and I had to redirect the std or kernel messages elsewhere so I could analyze them later.
If the machine with the cloned drive freezes, I would look at rebuilding the OS and software.
IF the cloned machine does not freeze, the hardware that DOES freeze is clearly suspect. Perhaps the old hard drive? One might hope, because replacing that old hard drive is the fastest and easiest fix.
The old hard disk has just been checked (SMART + badblocks)... without error!
So the problem is elsewhere... or more subtle (we will observe the server for 2 weeks; if it does not freeze once, the problem would still point to the hard drive).
An extra boot argument should do no harm; in the worst case it gives an "unknown, ignored" message.
Ensure that the generated /boot/grub/grub.cfg has the desired option!
Run grub2-mkconfig or grub-mkconfig to update it!
The X server (or Wayland) can blank the display without ACPI; some monitors go into power-save mode soon after the display goes dark.
It has them:
Code:
root@host:~# grep acpi /boot/grub/grub.cfg
linux /vmlinuz-4.9.0-19-amd64 root=/dev/mapper/ipbx--vg-root ro quiet acpi=off apm=off
linux /vmlinuz-4.9.0-19-amd64 root=/dev/mapper/ipbx--vg-root ro quiet acpi=off apm=off
linux /vmlinuz-4.9.0-6-amd64 root=/dev/mapper/ipbx--vg-root ro quiet acpi=off apm=off
root@host:~# uname -r
4.9.0-19-amd64
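Note that grub.cfg only shows what will be passed at the next boot; what the running kernel actually received is in /proc/cmdline. A minimal sketch of that check follows. `has_boot_arg` is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# has_boot_arg is a hypothetical helper: it succeeds only if the exact
# argument appears as a whole, space-separated word in the command line.
has_boot_arg() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# On the server, compare against the running kernel:
#   has_boot_arg acpi=off "$(cat /proc/cmdline)" && echo "acpi=off is active"
```

Matching whole words avoids false positives such as `acpi` matching `acpi=off`.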
Quote:
Originally Posted by MadeInGermany
The X server (or Wayland) can blank the display without ACPI; some monitors go into power-save mode soon after the display goes dark.
We have a system without a graphical server (neither X server nor Wayland). It is a bare server system without a desktop environment.
Furthermore, we did not find a monitor setting dedicated to power saving, unless it is a BIOS parameter.
Last edited by lenainjaune; 04-22-2024 at 11:40 AM.
#1 a drive problem may not be the MEDIA, or detected by SMART, if it is in the ELECTRONICS! If you have ever had one apart, there is a circuit board in there that interfaces the media heads, motor, and interface circuits to the power and interface to the computer. On those electronics is where SMART lives.
I am looking forward to seeing how it does in the next two weeks!
#2 check your TTY options (Ctrl-Alt-F1 through Ctrl-Alt-F6 or higher) and see if one is echoing the kernel journal messages. That would be the one to check...
As much as I hold out hope that the problem is the storage device, it is not easy to imagine what it could do that would lock the machine so bad that it would go totally unresponsive on network, keyboard, and display. That sounds more like power or thermal/CPU issues, or a failed capacitor on the MB.
I am glad for you that the machine is being replaced anyway! IT sounds like it is past due.
I also recommend that after this is all over you revisit your DR plan for this operation. IF this is a critical resource, as it sounds, then you might need a better fail-over/replacement and backup/restore plan for recovery. IF not to modify that plan, at least to verify that it is adequate for your needs.
There was a freeze on 23/04, but unfortunately someone rebooted before a photo was taken. We have just reminded users to take a photo before restarting.
At least, since the problem is still there with the cloned system disk, this demonstrates that the problem is not related to the disk.
Quote:
Originally Posted by wpeckham
#2 check your TTY options (Ctrl-Alt-F1 through Ctrl-Alt-F6 or higher) and see if one is echoing the kernel journal messages. That would be the one to check...
As much as I hold out hope that the problem is the storage device, it is not easy to imagine what it could do that would lock the machine so bad that it would go totally unresponsive on network, keyboard, and display. That sounds more like power or thermal/CPU issues, or a failed capacitor on the MB.
I am glad for you that the machine is being replaced anyway! IT sounds like it is past due.
Quote:
Originally Posted by wpeckham
I also recommend that after this is all over you revisit your DR plan for this operation. IF this is a critical resource, as it sounds, then you might need a better fail-over/replacement and backup/restore plan for recovery. IF not to modify that plan, at least to verify that it is adequate for your needs.
Yes, we will! We are considering running 2 systems in redundancy, so that when the first becomes unresponsive the second takes over... and, more importantly, we will be managing our telephony ourselves with our own Asterisk system.
---
As we managed to keep the screen always on (parameter consoleblank=0 in the GRUB configuration), we noticed that a simulated crash with a kernel panic (we followed this method to achieve it) is also displayed on the login screen.
Is that sufficient to get information from before the freeze?
We also experimented with running a journalctl command at boot (so before login) by modifying /etc/rc.local to run the detached command journalctl --follow &. In this case the screen is flooded continuously, with no pause (however, we discovered that Ctrl + S can stop it and allow access to another tty with Ctrl + Alt + F2 or similar). This flood is strange, because when the command is run after logging in there is no flood. We suppose this is not the right way to do it.
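As an alternative to launching journalctl --follow from /etc/rc.local, journald itself can copy messages to a dedicated console. One plausible explanation for the flood (an assumption, not verified on this system) is that output from commands started by rc.local is itself captured by the journal, so journalctl --follow keeps re-reading its own output. The sketch below writes a journald drop-in. `write_console_forwarding` is a hypothetical helper name. `ForwardToConsole=`, `TTYPath=` and `MaxLevelConsole=` are journald.conf options, but tty12 and the `info` level are illustrative assumptions.

```shell
#!/bin/sh
# write_console_forwarding is a hypothetical helper that writes a journald
# drop-in asking journald to copy log messages to a dedicated console.
# The choice of tty12 and the "info" level are assumptions.
write_console_forwarding() {
    cat > "$1" <<'EOF'
[Journal]
ForwardToConsole=yes
TTYPath=/dev/tty12
MaxLevelConsole=info
EOF
}

# Typical use, as root (the drop-in path assumes journald.conf.d support):
#   mkdir -p /etc/systemd/journald.conf.d
#   write_console_forwarding /etc/systemd/journald.conf.d/console.conf
#   systemctl restart systemd-journald
#   chvt 12    # leave tty12 on screen so it is visible at the next freeze
```

With consoleblank=0 already set, whatever was last printed on that console should still be on screen for the photo.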
Information from before the freeze, in particular JUST before the freeze so it is likely to capture the cause, is the ONLY information that might be seriously helpful. AT the freeze logging will stop and you will get no information, and AFTER the freeze is also after the reboot and the cause information may be gone for good.
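One low-cost way to capture state from JUST before the freeze is a small heartbeat logger: the last line written gives the approximate time of the freeze and the load and memory situation at that moment. This is only a sketch. `heartbeat_line` is a hypothetical helper name, and the log path and 10-second interval in the usage comment are assumptions.

```shell
#!/bin/sh
# heartbeat_line is a hypothetical helper: it prints one timestamped line
# with the load averages and available memory, both read from /proc.
heartbeat_line() {
    load=$(cut -d' ' -f1-3 /proc/loadavg 2>/dev/null)
    mem=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo 2>/dev/null)
    printf '%s load=%s mem_available_kB=%s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "${load:-unknown}" "${mem:-unknown}"
}

# Usage sketch (log path and interval are assumptions):
#   while :; do heartbeat_line >> /var/log/heartbeat.log; sync; sleep 10; done
```

The `sync` after each line pushes the entry to disk, which matters because a frozen machine never flushes its buffers.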
If I understand correctly:
1. If you move the drive to a different identical machine, that one freezes too.
That would eliminate the original machine hardware EXCEPT the drive.
2. IF cloned to a new drive, it still freezes. That eliminates the drive itself.
If those are both true, we have eliminated all of the hardware and only a software issue can be left.
What has changed about the software or configuration in the few weeks just before this started?