There's a similar phenomenon that would suggest two experimenters were working under different conditions, even if the data got all mixed together before you looked at it. Say you're gathering data to find out the average Intelligence in Cheliax, which you expect to be the sum of lots of tiny factors and hence to follow a central distribution.
Actually, your data was gathered for you by a professional data-gathering firm, and, uh, you might possibly not have done a lot of due diligence before hiring them. They were cheap, though!
This data-gathering firm immediately subcontracted out your job to two even cheaper subcontractors.
What these data-gatherers should have found - at least if the data Keltham himself was told is correct, and a couple of other facts seem to bear it out - was that Golarion has a mean Intelligence of 10, with a square-root-of-average-squared-deviation-from-the-mean of 2. (Baseline: 'Deviation' of 2.)
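In 'computer' terms, that square-root-of-average-squared-deviation-from-the-mean is just what gets calculated as a population standard deviation. A minimal sketch of the calculation, on made-up example scores:

```python
import math

def deviation(scores):
    """Square root of the average squared deviation from the mean."""
    mean = sum(scores) / len(scores)
    return math.sqrt(sum((x - mean) ** 2 for x in scores) / len(scores))

# Made-up example scores, purely to show the arithmetic.
sample = [8, 9, 10, 10, 11, 12]
print(sum(sample) / len(sample), deviation(sample))  # mean 10.0, deviation ~1.29
```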
One subcontractor, however, didn't spell-check their survey, and the spelling errors turned off the smarter and more perfectionist people reading it, so their biased sample of respondents had average Intelligence 8 and deviation 1.
Another subcontractor went where it was very convenient for them to find survey respondents, which was, it turned out, people standing in line to apply to a wizard academy. That subgroup had average Intelligence 12 and deviation 1.
If both datasets are completely mixed together before you get them, when you compute the average, you'll find it's around Intelligence 10, and the deviation... will not be exactly 2, but it will be around 2.
But the hypothesis "This is a central distribution with average 10 and deviation 2" would predict that 10 is the most likely Intelligence score for you to find. Intelligence-10s will actually be relatively rare if your distribution is really a mix of two subpopulations with deviation 1 and averages 8 and 12: only 6% or so of subjects will have Intelligence 10, instead of the 20% or so that the hypothesis predicted. You don't need to notice that particular deficiency by looking at Intelligence-10 subjects specifically. It'll show up in the combined likelihood of all the data being much lower than expected, even if the whole thing is calculated by a 'computer' that wasn't 'programmed' to detect that exact kind of anomaly.
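To make those numbers concrete, here's a minimal sketch - assuming, which the story doesn't say, that the two subcontractors contributed equally many respondents, that 'central distribution' means a normal distribution, and that Intelligence scores are whole numbers obtained by rounding. It reproduces the pooled mean of about 10, a pooled deviation of about 2.2, and the roughly 6%-versus-20% gap at Intelligence 10:

```python
import math, random

random.seed(0)
N = 100_000  # respondents per subcontractor; assumes the two subsamples are equal-sized

# Each subcontractor's biased sample, modeled as a rounded central (normal) distribution.
low  = [round(random.gauss(8, 1))  for _ in range(N)]   # misspelled survey: mean 8, deviation 1
high = [round(random.gauss(12, 1)) for _ in range(N)]   # wizard-academy line: mean 12, deviation 1
mixed = low + high

mean = sum(mixed) / len(mixed)
dev  = math.sqrt(sum((x - mean) ** 2 for x in mixed) / len(mixed))
frac_10 = mixed.count(10) / len(mixed)

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# What the single "mean 10, deviation 2" hypothesis predicts for scores rounding to 10.
predicted_10 = phi((10.5 - 10) / 2) - phi((9.5 - 10) / 2)

print(f"pooled mean ~ {mean:.2f}, pooled deviation ~ {dev:.2f}")   # ~10 and ~2.2
print(f"fraction of Intelligence-10s ~ {frac_10:.3f}")             # ~0.06
print(f"single-hypothesis prediction ~ {predicted_10:.3f}")        # ~0.20
```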
You can calculate what kind of score you'd expect to get under each of your hypotheses, if that hypothesis were actually true; and if all the hypotheses score much lower than they expected, they're all - in Baseline colloquialism - 'stupid with respect to the data'. This doesn't always happen when different experimenters are working under secretly different conditions and measuring actually different phenomena - it is not always obvious just from the likelihoods, especially if you mix all the data together before checking it - but it is one example of a pattern suggesting that the true hypothesis wasn't anything you were considering.
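A sketch of that check, under the same assumptions as before (equal-sized subsamples, normal subdistributions), scoring the mean-10/deviation-2 hypothesis in 'twos' - log-base-2 likelihood - and treating scores as continuous to keep the likelihood simple. The hypothesis can calculate in advance how many twos it expects to lose on data it itself generated, and how much that total should wobble; the mixed-up data then costs it hundreds of twos more than it promised:

```python
import math, random

random.seed(0)
# The mixed-up data, as before: half from each secretly-different subpopulation.
mixed = ([random.gauss(8, 1) for _ in range(1000)]
         + [random.gauss(12, 1) for _ in range(1000)])

MU, SIGMA = 10.0, 2.0  # the hypothesis being scored

def log2_density(x):
    """Log-base-2 density the mean-10/deviation-2 hypothesis assigns to one observation."""
    return (-0.5 * ((x - MU) / SIGMA) ** 2 - math.log(SIGMA * math.sqrt(2 * math.pi))) / math.log(2)

actual = sum(log2_density(x) for x in mixed)

# If the hypothesis were true, each observation would cost it 0.5*log2(2*pi*e*sigma^2)
# twos on average, with a per-observation spread of log2(e)/sqrt(2) twos around that.
n = len(mixed)
expected = -n * 0.5 * math.log2(2 * math.pi * math.e * SIGMA ** 2)
spread   = math.sqrt(n) * math.log2(math.e) / math.sqrt(2)

print(f"expected score ~ {expected:.0f} twos, give or take ~ {spread:.0f}")
print(f"actual score   ~ {actual:.0f} twos")  # hundreds of twos worse than promised
```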
One should always keep in mind, though, that the 'fair coin' hypothesis never looks stupid no matter how much pattern it's missing out on. If you spin a coin 1000 times, and it comes up Queen 1000 times, the fair-coin hypothesis expected to lose 1000 twos and that's exactly what it loses. In a case like that, you have to think of the specific better hypothesis - 'this coin has a 100% propensity to Queen' - or perform some more general test that implicitly takes into account the possibility of lower-'entropy' hypotheses like that - before you can see the problem.
If it's just never occurred to you that coins might be biased, if you haven't invented any tests to detect biases, then contemplating the fair-coin hypothesis alone is not going to show you that hypothesis doing any more poorly than it promised you it would do.
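A minimal sketch of that last point, scoring in 'twos' again: the fair-coin hypothesis loses exactly one two per spin no matter what comes up, so it can never do worse than it promised. The anomaly only becomes visible once you write down a rival hypothesis - here a made-up 'almost always Queen' propensity - and score that too:

```python
import math

spins = ["Queen"] * 1000  # the suspicious run of spins

def score_in_twos(prob_queen, spins):
    """Total log-base-2 likelihood a fixed-propensity hypothesis assigns to the spins."""
    return sum(math.log2(prob_queen if s == "Queen" else 1 - prob_queen) for s in spins)

fair = score_in_twos(0.5, spins)      # exactly -1000, for *any* sequence of 1000 spins
print(f"fair coin: {fair:.0f} twos lost, exactly as it promised")

# Only once the rival hypothesis is actually written down does the gap appear.
# Using 0.999 rather than 1.0 so a single stray non-Queen wouldn't cost infinite twos.
biased = score_in_twos(0.999, spins)
print(f"almost-always-Queen: {biased:.1f} twos lost")
```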