Endogenous information
If you're working on a system, you want most of your information to come from within it, and while you're working on it you should be able to make that happen. If something goes wrong for a user, it's better that the information comes from the system you control than from outside it.
Information from most systems consists of logs, error messages and the like, and the reporting level should be configurable. You need to know what logs, log levels and error messages are available and how to act on each of them.
The very best error messages provide the user with instruction:
Error 2468: User cannot access this asset. Please check the user #987654321's permissions against asset #1357.
Most logging is configurable: you can change from a relatively quiet mode to a very verbose debug mode. Often, though, this isn't done on the fly in production, so the debugging information you'd like for the one time someone logged in without a password simply isn't available. For this reason, your production error/incident logs need to be:
- Minimal. Don't include noise, otherwise the log will be ignored.
- Actionable. If something is wrong, point to all the known places the problem can be fixed, and make sure you write this while you're writing the rest of the app. You will have forgotten the details later on.
- Read. Don't assume people read error logs; make sure they do.
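As a rough illustration of the two ideas above (an error that tells the reader where to look, and verbosity you can change without a redeploy), here is a minimal Python sketch. The class name `AccessDenied`, the logger name `assets` and the helper `set_verbosity` are all made up for this example; only the message text comes from the error shown earlier.

```python
import logging

logger = logging.getLogger("assets")  # hypothetical logger name


class AccessDenied(Exception):
    """An error whose message points the reader at the likely fix."""

    def __init__(self, user_id: int, asset_id: int) -> None:
        self.code = 2468  # code copied from the example message in the text
        self.user_id = user_id
        self.asset_id = asset_id
        super().__init__(
            f"Error {self.code}: User cannot access this asset. "
            f"Please check the user #{user_id}'s permissions "
            f"against asset #{asset_id}."
        )


def set_verbosity(level_name: str) -> None:
    """Change the log level of the running process.

    Assumption: something (an admin endpoint, a signal handler) calls this
    in production so that debug output can be switched on when needed.
    """
    logger.setLevel(level_name.upper())  # raises ValueError for unknown names
```

Calling `set_verbosity("debug")` while an incident is in progress, and `set_verbosity("warning")` afterwards, is one way to keep the production log minimal most of the time without losing the detail when it matters.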
There are typically two more types of endogenous information: the code and the data. The code tells you what the system really did, but you will have to read through it and probably simulate what happened. This is important because too often people spend hours or days pointlessly postulating based on their incomplete knowledge of how the system is built.
For goodness' sake, read the code and find out how the system will behave.
The data is the last type of endogenous information, but it can often be irrelevant. The problem with data is that it is often already broken: the user's account is corrupt or the images are missing. There are two ways around this: use backups and simulate what happened, or keep an audit of every action. The former is legwork when something goes wrong; the latter creates a lot of potential noise.
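To make the "audit of every action" option concrete, here is a small Python sketch of an append-only trail written as one JSON object per line. The function names `record_action` and `replay` and the field names are assumptions for illustration, not part of any particular system.

```python
import json
import time
from typing import Any, Iterator, TextIO


def record_action(log: TextIO, actor: str, action: str, **details: Any) -> None:
    """Append one action to the audit trail as a single JSON line."""
    entry = {"ts": time.time(), "actor": actor, "action": action, "details": details}
    log.write(json.dumps(entry, sort_keys=True) + "\n")


def replay(log: TextIO) -> Iterator[dict]:
    """Yield recorded actions in the order they were written."""
    for line in log:
        if line.strip():  # skip blank lines
            yield json.loads(line)
```

Because every action lands in the trail, you can replay the steps that led to a corrupt account instead of guessing, at the cost of having a lot of entries to filter.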
Exogenous information
Any information from outside the system is exogenous and has different characteristics.
Firstly, it's more likely to tell you how the system looks to the outside world, which means it's biased towards telling you your system is wrong. If the system is down, your system is at fault. If the financial numbers add up wrongly, it's because your system gave the wrong numbers.
We can call this dumb user syndrome. The system looks wrong and the user always blames the system rather than any other possible cause.
It's good to have dumb users because they will be right a lot of the time, and most users are, in the short term, dumb. They click something and it doesn't work, so they go to the next result on Google. Dumb user syndrome is an excellent benchmark and is exactly why non-technical people throw all their toys out of the pram when something apparently minor breaks.
Secondly, it is unaffected by the running of the system. If you're using an external measure of uptime this will be more accurate than any measure you have running on the system itself. If an external system is checking your data it is more likely to be objective simply because it is unaffected by your internal code and data changes.
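An external uptime measure can be as simple as a script on a machine outside your own infrastructure that fetches a page the way a visitor would. The Python sketch below is one possible shape for such a check; `ProbeResult`, `probe` and the `opener` parameter are names invented here, and the URL you pass in is whatever your users actually hit.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    ok: bool                # True when the server answered with a 2xx/3xx status
    status: Optional[int]   # HTTP status, or None if no answer was received
    seconds: float          # how long the attempt took


def probe(url: str, timeout: float = 5.0,
          opener: Callable = urllib.request.urlopen) -> ProbeResult:
    """Fetch the URL once from the outside and time the attempt."""
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        status = exc.code   # the server answered, but with an error status
    except OSError:
        status = None       # DNS failure, refused connection, timeout, ...
    elapsed = time.monotonic() - start
    ok = status is not None and 200 <= status < 400
    return ProbeResult(ok=ok, status=status, seconds=elapsed)
```

Run on a schedule from somewhere that shares nothing with the system under test, the share of results with `ok` set to true approximates the uptime your users see. The `opener` argument exists only so the check itself can be tested without a network.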
In short, exogenous information can be skewed but always reflects what the rest of the world really sees.
Which type is best?
You need both. Endogenous information is the best for finding the cause, and you must maximise it to reduce debugging time; you can also use it to proactively kill bugs before the rest of the world notices.
Exogenous information should be your measure of stability because it closely reflects what the world (i.e. your paying users) sees.