Linux - Kernel
This forum is for all discussion relating to the Linux kernel.
Got a bit of a brain-buster here and i'm not sure if anyone can help me, but i'll post anyway just on the off chance someone has any ideas of what i might try next.
First, i've been maintaining my own linux "distro" for the past 20 years, everything is compiled from source and i've implemented some custom package management via encap (similar to gnu stow, basically symlinking /usr from a directory of packages). I also treat the kernel as a package of sorts.
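For anyone unfamiliar with the encap/stow approach described above, here is a minimal sketch of the idea, not the poster's actual tooling. The function name `link_package` and the directory layout are hypothetical; it assumes each package lives in its own directory and that the package path is absolute so the symlinks resolve.

```shell
#!/bin/sh
# Hypothetical sketch of encap/stow-style linking (not the real encap tool).
# Every regular file under $pkgdir is symlinked to the same relative
# path under $target, e.g. <pkgdir>/bin/hello -> <target>/bin/hello.
# $pkgdir must be an absolute path, otherwise the links will dangle.
link_package() {
    pkgdir=$1
    target=$2
    ( cd "$pkgdir" && find . -type f ) | while IFS= read -r rel; do
        rel=${rel#./}                          # strip the leading "./" from find
        mkdir -p "$target/$(dirname "$rel")"   # create the parent directory
        ln -sfn "$pkgdir/$rel" "$target/$rel"  # (re)create the symlink
    done
}
```

Something like `link_package /usr/encap/hello-1.0 /usr` would then expose the package under /usr; the paths here are only examples.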
My home network is running off a central server and essentially workstations are just a deployment of the system software with /home (and a few other things) mounted over nfs from the server. Deploying a workstation is pretty easy, i boot up the machine with a system-rescue.org usb stick, partition the drives, copy the system software over from the server, set up an EFI partition using grub and i'm good to go.
I have a ryzen machine i bought in 2020 that i provisioned in just this way and has been my primary workstation since then, with everything working really nicely.
A few days ago my server crashed and i couldn't find a usb stick to save my life so i pulled the SSD software RAID drives out of it and put them in the ryzen machine to test. I've since found my usb stick and have managed to get the server back up and running - in the end it looks like the EFI bios essentially "forgot" about my grub install and it was defaulting to trying to boot using the old MBR method. I just had to re-run grub-install to fix it.
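The grub re-install mentioned above might look roughly like the sketch below. The ESP mount point and the boot entry label are assumptions, since the post doesn't give them, and the `run`/`DRY_RUN` wrapper is just a hypothetical helper so the commands can be previewed before running them as root.

```shell
#!/bin/sh
# run: execute a command, or only print it when DRY_RUN=1 (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# reinstall_grub_efi: list the firmware boot entries, then re-register grub.
#   $1 = ESP mount point (assumed default: /boot/efi)
#   $2 = boot entry label (assumed default: grub)
reinstall_grub_efi() {
    esp=${1:-/boot/efi}
    id=${2:-grub}
    run efibootmgr -v     # show what the firmware currently knows about
    run grub-install --target=x86_64-efi --efi-directory="$esp" --bootloader-id="$id"
}
```

`DRY_RUN=1 reinstall_grub_efi` prints the two commands without touching anything. From a rescue stick this would normally be done inside a chroot of the installed system.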
However, since i took the server drives out of it my ryzen workstation won't boot. Grub loads off its NVME drive EFI partition, but as soon as it tries to boot the kernel, i just get a blank screen and no activity. No messages at all to indicate what's going on, just a blank screen.
I've gone in with the systemrescue stick and tried just about everything i can think of, including a complete rebuild like i did when i first set it up in 2020. I also tried a rebuild onto a spare SATA SSD instead of the NVME drive. I've tried the last three kernels that are booting all my other machines just fine. No matter what i try, just a blank screen when grub tries to load the kernel.
This is trying to boot my custom kernel off the hard drive. The systemrescue usb stick boots just fine. I can even get it to use its "findroot" mode to find the nvme partition - i had to remove the /sbin/init symlink and replace it with the binary, but then it worked, though the sysresc kernel doesn't support a lot of stuff i need for a proper boot. i've also tried booting with a gentoo install stick, which also worked fine. Finally i tried an arch install disk, and actually installed arch onto the SSD, and that was able to boot fine.
My first assumption was that the kernel was somehow corrupted, or the partition/filesystem itself was (even though grub could see the files on it fine). But after several re-installs where i re-partitioned everything and copied all the known-good files from the server, that can't be it.
I figured maybe it was some BIOS setting but other x86_64 kernels work just fine.
I've recovered from a lot of odd situations over the past 20 years but i'm just not having any luck here. Any idea how i could get *some* kind of error message or anything to see more about what's maybe going on?
have you tried putting a video=1024x768 and/or something similar on the linux line of the grub menuentry?
I hadn't, but good thought. I did try nomodeset and some different connection options, though - i've normally used a displayport cable but the video card also has HDMI out, which i tried to no avail. Just tried the suggested video=1024x768 as well as the native resolution of the monitor, also no difference. I'll investigate the various video parameters, but something tells me this isn't the issue...
Also note i can tell there's video output, as the monitor remains on (doesn't go into sleep mode), just blank.
Also i've tried using the system rescue stick version of GRUB to load the kernel off the NVME disk, and that had the same effect i believe. Which really makes it seem like it's the kernel itself? Though it's the same one that's booting all my other machines and worked on this machine too before all this started...
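On the question of getting any output at all, one thing that may be worth a try alongside the video= options is an early console. A small sketch follows, assuming an EFI boot and a kernel built with EFI earlycon support; `earlycon=efifb`, `keep_bootcon` and `ignore_loglevel` are documented kernel parameters, while the helper name and the root device in the usage note are made up for illustration.

```shell
#!/bin/sh
# debug_cmdline: append early-console debugging options to a kernel
# command line and print the result (hypothetical helper).
#   earlycon=efifb   early messages on the EFI framebuffer
#                    (assumes the kernel has EFI earlycon support built in)
#   keep_bootcon     do not unregister the boot console later on
#   ignore_loglevel  print all kernel messages regardless of loglevel
debug_cmdline() {
    base=$1
    echo "$base earlycon=efifb keep_bootcon ignore_loglevel"
}
```

For example, `debug_cmdline "root=/dev/nvme0n1p2 ro"` prints a line that could be pasted after the kernel path on the `linux` line of the grub menuentry. Whether it produces anything depends on how far the kernel gets before hanging.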
Stabbing in the dark ... I'd do BIOS reset to clear CMOS.
I'm willing to try any sort of voodoo at this point...
I saw there's a recent BIOS update for my board (MSI MAG Tomahawk X570) so i installed that, reset a few things to my liking (numlock and fullscreen logo and such), booted into the rescue stick and re-installed grub.
I take it you did not reset your BIOS?
A BIOS upgrade won't reset CMOS in full. If your CMOS is corrupt, only a hard reset [with the jumper] clears it.
You are right of course. Long time since i've had to mess with jumpers! I figured out the method for my board, did it, verified on boot that CMOS was indeed cleared... and, still the same result. Was worth a try!
OK, has this box ever booted from NVMe? It dawns on me that some motherboards are choosy; mine, for instance, has two NVMe slots, but only one is bootable.
Yep, before i opened it up and tested the server SSDs it was booting just fine from NVMe. It just... stopped. But as noted i did also try moving everything over to a SATA SSD and had the same issue.
It really does look like it's something to do with the kernel itself, given that other kernels boot on the machine, and i can get one of those kernels to mount the root partition and start up the system with limited functionality. I've even installed one of those kernels (arch) on the SSD and booted it from there, but mine won't boot. I haven't tried another kernel on NVMe yet but that sure feels like a red herring at this point...
It just doesn't make sense as like i said, it's the same kernel that was booting this machine before, it's the same kernel that boots all my other machines (no other ryzens, but a couple of athlon FXs and an intel NUC), and i've re-copied it several times from the server. So it can't be that the kernel is corrupt, or the file system it's on is corrupt...
I'm stymied. I'll keep poking at it and if there are any other suggestions i'll try them... i'm sure something will work eventually as the machine is obviously okay, but man, this has me scratching my head.
I'm wondering if there's some way to run a kernel other than getting grub to start it. I mean, it's also the same kernel i've got installed on a libvirt/qemu virtual machine and works fine there too...
No, no initramfs required, though i was using one before all this happened, just for early loading of a non-critical SSD RAID0 array, which has since been scrapped anyway so i could try booting from one of the SSDs. I am still using the initramfs but i've tried both with and without.
Besides, wouldn't there be at least a little kernel output before it failed, if that was it? What's making this so hard to diagnose is the total lack of output.
I feel like if i could get the kernel to say *anything* at all, i'd have something to work with.
This is what kexec is made for. Boot some kernel normally, then try loading the problem kernel via kexec.
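A rough sketch of the kexec suggestion, assuming kexec-tools is installed on whichever system was booted normally. The kernel path, initrd path and command line are placeholders, and the `run`/`DRY_RUN` wrapper is a hypothetical helper for previewing the commands.

```shell
#!/bin/sh
# run: execute a command, or only print it when DRY_RUN=1 (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# kexec_boot: load a kernel with kexec and jump into it.
#   $1 = kernel image, $2 = kernel command line, $3 = optional initrd
# Note: "kexec -e" switches to the new kernel immediately, without a
# normal shutdown, so sync and unmount filesystems first. Needs root.
kexec_boot() {
    kernel=$1
    cmdline=$2
    initrd=${3:-}
    if [ -n "$initrd" ]; then
        run kexec -l "$kernel" --initrd="$initrd" --append="$cmdline"
    else
        run kexec -l "$kernel" --append="$cmdline"
    fi
    run kexec -e
}
```

For example, `DRY_RUN=1 kexec_boot /boot/vmlinuz-custom "root=/dev/nvme0n1p2 ro"` only prints what would be run; drop `DRY_RUN=1` to actually do it.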
Interesting! I'd vaguely heard of kexec but never really looked into it. Thanks, will do so tomorrow and see if it gives me any more clues! (or just another blank screen...)