How do you find the cause of an issue that is difficult to solve?

The best way is to isolate the problem.

Let's say, for example, that your computer randomly freezes. This is commonly caused by a driver, maybe firmware, faulty RAM, or some other broken hardware, and sometimes by the GL acceleration of your graphics card. But let's assume we have no idea about the cause, so we need to proceed by isolating it:

If it only occurred when you run a specific game, we would have a starting point: maybe the cause is the game, or maybe the GL it uses. But let's make it harder and say we don't have a starting point, so we think:

What can cause it? We have no idea, so let's look at what is behind it:

  • the computer itself: maybe it is broken?
  • the operating system: is something wrong in it that makes it freeze?

So now we have a starting point. We need to rule out one of these options to know whether the cause is hardware or software, and we need a fast way to do that, so we can try these tests:

  • hardware: put the hard disk into another machine and see if it freezes (note that this doesn't rule out the hard disk itself as a hardware cause of the issue)
  • hardware: run the Elive hardware checks
  • software: change the operating system: run Elive directly from a Live system and see if it freezes when using the same apps
  • software: same, but boot with another kernel
  • software: if you still have the issue, run the Live system of another Linux OS

By trying multiple options you can now tell whether the issue is hardware or software, which lets you move on to the next isolation step:

  • hardware: if we know that the issue is hardware, which part of the hardware is the broken one? Isolate further
  • software: let's go deeper on that...
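The hardware-or-software decision above can be sketched as a small piece of logic. This is only an illustration: the field names and the rules in `likely_area` are assumptions made up for this example, and real cases are often less clear-cut.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observations:
    """Results of the quick tests; None means "not tested yet"."""
    hardware_checks_pass: Optional[bool] = None
    freezes_with_disk_in_other_machine: Optional[bool] = None
    freezes_on_live_system: Optional[bool] = None
    freezes_on_other_kernel: Optional[bool] = None
    freezes_on_other_distro_live: Optional[bool] = None


def likely_area(obs: Observations) -> str:
    """Return a rough guess about where to keep isolating."""
    # Failing hardware checks point straight at the hardware.
    if obs.hardware_checks_pass is False:
        return "hardware"
    # Still freezing on a completely different Live OS makes the
    # installed software an unlikely cause.
    if obs.freezes_on_other_distro_live is True:
        return "hardware"
    # The freeze goes away with a Live system or another kernel.
    if obs.freezes_on_live_system is False or obs.freezes_on_other_kernel is False:
        return "software"
    # The freeze follows the disk: the installed system, or the disk itself.
    if obs.freezes_with_disk_in_other_machine is True:
        return "software or disk"
    return "unknown"
```

For example, `likely_area(Observations(freezes_on_live_system=False))` gives "software", which is the branch explored next.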

Let's say that the issue is software. As said before, we need to test the causes behind it to isolate the issue: is it the distro itself? Is it a specific kernel? A specific application? And so on.

There are many things you can try to isolate the cause:

  • GL: let's say that you suspect GL, so you need to check it: boot the computer without graphical mode (or at least without the desktop, since desktops sometimes use GL). Does it freeze in pure console mode? Does it freeze more often while running glxgears? Try more things, like Chrome with 3D acceleration enabled, a 3D game, or "expedite -e=opengl_x11 -a", to see if it is more likely to freeze when using GL. Also try disabling GL acceleration on the desktop, etc.
  • RAM: run a memory test to check your RAM
  • Hard disk: open the GNOME Disks tool ("sudo gnome-disks") and check whether the S.M.A.R.T. data reports failures, especially reallocated sectors (more information here). Also try running Elive from a USB drive, which avoids using the hard disk. Try not to use unusual filesystems, like the Microsoft Windows ones (NTFS), or even reiserfs, which has been reported to freeze some newer kernels (sadly, it is now considered a deprecated filesystem)
  • Kernel messages: a kernel freeze normally includes an (almost unreadable) message. Look in it for the name of a kernel module (driver) that is causing the freeze, or at least you will learn which part of your hardware is related to your problem
  • Check your logs: this is a must. For example, run the command "while true ; do dmesg | tail -n 40 ; LC_ALL=C sleep 0.3 ; done" and leave this terminal visible all the time. If the computer freezes, check whether you can see any important message there (if you can still read it; if not, you will need to inspect the /var/log/ files after your next reboot and find what happened before the lines of the new boot)
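Reading a kernel trace by eye is tedious, so here is a minimal sketch of how you could pick out the interesting lines and the module names from saved dmesg output. It relies on the kernel marking symbols that belong to a loadable module with a trailing "[module_name]" in stack traces. The keyword list is only a starting assumption, and the sample lines (including the "fake_gpu" module) are invented for the example.

```python
import re
from collections import Counter

# Stack-trace entries from a module look like "symbol+0x4c/0x120 [module]".
MODULE_RE = re.compile(r"\+0x[0-9a-f]+/0x[0-9a-f]+ \[(\w+)\]")

# Fragments that often appear in kernel error reports (not exhaustive).
SUSPICIOUS = ("BUG:", "Oops", "Call Trace", "Kernel panic",
              "I/O error", "blocked for more than")


def suspicious_lines(lines):
    """Return the lines that contain one of the SUSPICIOUS fragments."""
    return [line for line in lines if any(key in line for key in SUSPICIOUS)]


def modules_in_trace(lines):
    """Count how often each module name shows up in stack-trace lines."""
    counts = Counter()
    for line in lines:
        counts.update(MODULE_RE.findall(line))
    return counts.most_common()


# Invented sample, shaped like real dmesg output.
SAMPLE = [
    "[  101.5] BUG: unable to handle page fault for address: 0000000000001234",
    "[  101.5] Call Trace:",
    "[  101.5]  fake_gpu_submit+0x4c/0x120 [fake_gpu]",
    "[  101.5]  fake_gpu_ioctl+0x88/0x200 [fake_gpu]",
    "[  101.5]  do_syscall_64+0x33/0x80",
]

if __name__ == "__main__":
    for line in suspicious_lines(SAMPLE):
        print(line)
    print(modules_in_trace(SAMPLE))
```

The module that appears most often in the trace is not always the guilty one, but it is a reasonable first suspect to test without.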

If you reach this point and still don't know the cause, remember that it must be somewhere and you just haven't found it yet. Checking the logs as described above can be your best friend, because they can tell you exactly what is happening.

More tips

To be faster in your tests, try radically different options, like entirely different builds of Elive. For example, @triantares has an audio issue on one of his computers, where older versions of Elive (the ones based on Buster instead of Bullseye) worked. That gives him a starting point: for example, trying the same kernel version as those builds, or the same firmware files.

Personal experience:

For example, a long time ago my computer used to shut itself down for no apparent reason. The cause? The CPU became too hot (it reached 100 °C) and the kernel decided to shut down the computer for its own safety. How did I find out? By reading the latest kernel messages in /var/log/.
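Finding "what happened just before the last boot" can be sketched like this. It assumes the log is a plain text file where each boot starts with the kernel's "Linux version" banner. The function name and the default of 20 lines are choices made for this example.

```python
def last_lines_before_latest_boot(lines, count=20, boot_marker="Linux version"):
    """Return up to `count` lines logged right before the most recent boot."""
    boots = [i for i, line in enumerate(lines) if boot_marker in line]
    if not boots:
        return []  # no boot banner found in this file
    latest = boots[-1]
    return lines[max(0, latest - count):latest]


if __name__ == "__main__":
    # The path is an assumption: it varies between systems, the file may
    # need extra permissions to read, and older entries may be rotated away.
    with open("/var/log/kern.log", errors="replace") as log:
        for line in last_lines_before_latest_boot(log.read().splitlines()):
            print(line)
```

On systems that keep a persistent systemd journal, "journalctl -k -b -1" shows the kernel messages of the previous boot without any script.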

Bonus:

Thanks to that experience, Elive implemented a feature that informs you at boot when overheating was the cause of a shutdown. This is one of the many amazing features included in this amazing distro :)

How did Elive implement this? Simply by checking for specific keywords in the /var/log/ files at desktop start. More features like this can be added to Elive if you report to us the specific message lines and the file they appear in.
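To give an idea of what such a keyword check could look like, here is a rough sketch. It is not Elive's actual code: the keyword table, the notices, and the function name are hypothetical, and the exact wording of kernel messages changes between kernel versions.

```python
# Hypothetical table: lowercase log fragment -> notice shown to the user.
KNOWN_CAUSES = {
    "critical temperature reached": "The computer was shut down because it overheated.",
    "out of memory": "The kernel killed a process because memory ran out.",
}


def diagnose(lines, known=KNOWN_CAUSES):
    """Return one notice per known cause found in the log lines."""
    found = []
    for line in lines:
        lowered = line.lower()
        for keyword, notice in known.items():
            if keyword in lowered and notice not in found:
                found.append(notice)
    return found
```

Reporting a new message line and its file amounts to adding one more entry to a table like this one.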
