Evelyn forwarded this blog post on the First 5 Minutes Troubleshooting An Unknown Server. It's a pretty good general guide for a server you have never seen before.
Our focus is a little different: we try not to have unknown servers, which is why we ask our customers to commit to an ongoing relationship rather than just calling us when there's an emergency. We have had success at preventing problems rather than reacting to them. The diagnosis process is quite different when you are already familiar with a box and you know that munin and nagios monitoring are set up, etc…
In general, when there is a problem on a customer system, especially if it is performance-related, I will usually look at “uptime” first. This tells you whether the system was recently rebooted, and what the system load is. If the load is higher than 0.5 or so, I will usually run “vmstat 1” and watch that output, which tells me if the system is spending a lot of “cpu wa” time (usually I/O related), has no “cpu id” (idle CPU resources), has many “b” (blocked processes), or is swapping (non-zero “si” and “so” columns).
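As a rough illustration of what I am watching for in that “vmstat 1” output, here is a minimal Python sketch. It is not a tool we actually use: the function names, the thresholds (20% wait, 5% idle) and the sample numbers are all made up for the example, and it assumes the usual Linux vmstat layout with a header line starting with “r  b”.

```python
def parse_vmstat(text):
    """Parse vmstat output into a list of dicts keyed by column name."""
    rows = []
    columns = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "r":
            # The column-name header line (r, b, swpd, ..., id, wa, st).
            columns = fields
        elif (columns and len(fields) == len(columns)
              and all(f.isdigit() for f in fields)):
            rows.append(dict(zip(columns, map(int, fields))))
    return rows


def flag_sample(row, wa_limit=20, idle_floor=5):
    """Return warnings for one vmstat sample.

    wa_limit and idle_floor are illustrative assumptions, not
    recommended values.
    """
    warnings = []
    if row["wa"] >= wa_limit:
        warnings.append("high I/O wait")
    if row["id"] <= idle_floor:
        warnings.append("no idle CPU")
    if row["b"] > 0:
        warnings.append("blocked processes")
    if row["si"] > 0 or row["so"] > 0:
        warnings.append("swapping")
    return warnings


# Made-up sample: one quiet second, then one I/O-bound, swapping second.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 812344  10240 402112    0    0     5    12  110  220  3  1 95  1  0
 2  4  20480  10240   2048  51200  150  300  4200  3900  900 1500 10  5  3 82  0
"""

for row in parse_vmstat(SAMPLE):
    print(flag_sample(row) or "looks healthy")
```

The columns are looked up by the names in the header rather than by position, since the exact set of columns can vary between vmstat versions. In practice I just read the output by eye; the sketch only spells out which columns I am looking at.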
However, as the beginning of the above post says, knowing what you are troubleshooting is best. I always prefer to know how to reproduce the problem that I am troubleshooting, and whether any changes have been made since it was last working. Of course, some of this information you have to take with a grain of salt. If I had a nickel for every time I've heard “No, nothing has changed” when something had changed… :-)