Willa's Story
Someone reading Willa's thoughts might be surprised to find out that she isn't afraid; she's excited. Willa's good at tests, the tricky 2-4-6 thing from earlier this week notwithstanding.
And this one isn't like that; it's pretty well defined. If anything's going to get in her way of being special, it won't be a test like this. She keeps her boosts back for now; she can already see how to start...
If the first data set has N Lefts and M Rights, then that would inform a set of relative Posteriors:
p^N*(1-p)^M
for some probability p that each ball goes Left.
Then you would want to normalize those so they integrate to 1. She can use the result Keltham already got for that, because it's a pretty hard problem on its own; she looks in her notes and finds that the integral is M!*N!/(M+N+1)!. So for the first data set the Hypotheses of left probability have their own probabilities:
P1(p) = [p^N*(1-p)^M]/[M!*N!/(M+N+1)!]
Similarly, if we say the second data set has L Lefts and R Rights, we can deduce the analogous probability function there:
P2(p) = [p^L*(1-p)^R]/[R!*L!/(R+L+1)!]
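A quick numerical sketch of what that normalizer does (Python, assuming SciPy's quad is available for the integral; the counts 7 and 4 are arbitrary examples, not anything from the problem):

# Sketch: check that dividing p^N*(1-p)^M by N!*M!/(N+M+1)! really makes it
# integrate to 1 over p in [0, 1].
from math import factorial
from scipy.integrate import quad

N, M = 7, 4  # arbitrary example: 7 Lefts, 4 Rights in the first data set

def P1(p):
    return p**N * (1 - p)**M / (factorial(N) * factorial(M) / factorial(N + M + 1))

print(quad(P1, 0.0, 1.0)[0])  # ~1.0, so P1(p) is a proper probability density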
Willa feels sure, somehow, that the right solution involves using these two functions together. She could instead just use the first one and then feed the other data into it, but why should one data set be treated differently than the other? The situation is symmetric, so the Law should treat them symmetrically.
So what's the important thing here? It's tempting to say that the functions want to be the same, but that's wrong. Both functions could come from no data at all, and both would be the same flat line, and she'd still know nothing about whether the ps were the same.
To KNOW FOR SURE the ps are the same, or different, you would have to know each p exactly. The function would have to be a lone spike of probability somewhere in each case. Like if P1 was a spike at p=0.5, and P2 was a spike at p=0.6, then you'd have a 0% chance they're the same. Similarly, if they're both spikes at p=0.5, there's a 100% chance (as long as the model is right in the first place...)
But how do you actually process the P functions to get those 100% or 0% or anything-in-between chances? Well, what do you always do with probabilities? You multiply them. So then... you'd have to multiply these functions together. With an integral?
Willa's been feverishly learning calculus since she saw it used to such powerful effect. She thinks you'd integrate the two of them multiplied together, and it would be a definite integral. You'd be integrating over the little probability, the p, so dp. The bounds would have to be from p=0 to p=1, the set of possible outcomes.
Would you have to normalize? Scary, she isn't sure. She'll think more about that part later.
INT([p^N*(1-p)^M]/[M!*N!/(M+N+1)!]*[p^L*(1-p)^R]/[R!*L!/(R+L+1)!],p,0,1)
What a mess. But a lot of this doesn't even have p in it, so it can seamlessly escape the integral. Goodbye denominators! To Abaddon with you!
INT([p^N*(1-p)^M]*[p^L*(1-p)^R],p,0,1)/[M!*N!*R!*L!/((M+N+1)!*(R+L+1)!)]
Rearrange a little bit, combine like terms...
INT(p^(N+L)*(1-p)^(M+R),p,0,1)/[M!*N!*R!*L!/((M+N+1)!*(R+L+1)!)]
And now it's the same form Keltham had again! She can use exactly what she used to normalize them earlier! She doesn't even have to do any work! She feels like cackling. She doesn't of course, but she'll remember this later and cackle.
[(N+L)!(M+R)!/(N+L+M+R+1)!] / [M!*N!*R!*L!/((M+N+1)!*(R+L+1)!)]
Clean this up...
[(N+L)!(M+R)!(M+N+1)!(R+L+1)!] / [(N+L+M+R+1)!*M!*N!*R!*L!]
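A sketch that cross-checks the cleaned-up expression against brute-force numerical integration of P1(p)*P2(p) (Python again, SciPy assumed available; the example counts are arbitrary):

# Sketch: the closed form [(N+L)!(M+R)!(M+N+1)!(R+L+1)!] / [(N+L+M+R+1)!*M!*N!*R!*L!]
# should equal the integral of P1(p)*P2(p) over [0, 1].
from math import factorial as f
from scipy.integrate import quad

N, M = 7, 4   # first data set: 7 Lefts, 4 Rights (arbitrary example)
L, R = 5, 6   # second data set: 5 Lefts, 6 Rights (arbitrary example)

def P1(p): return p**N * (1 - p)**M / (f(N) * f(M) / f(N + M + 1))
def P2(p): return p**L * (1 - p)**R / (f(L) * f(R) / f(L + R + 1))

numeric = quad(lambda p: P1(p) * P2(p), 0.0, 1.0)[0]
closed = (f(N+L) * f(M+R) * f(M+N+1) * f(R+L+1)) / (f(N+L+M+R+1) * f(M) * f(N) * f(R) * f(L))
print(numeric, closed)  # the two values agree up to integration error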
So this could be an answer. But calm down. Don't get overexcited.
OK, that's impossible.
But she still has to be a little careful here. First, was she supposed to normalize? She thinks about how the flat probability distributions would've looked. P(p) = 1 from 0 to 1 would be the flat one; that normalizes properly, she knows. If she integrated that times itself, obviously she'd get 1, and 1 can't literally be the probability of "same" when there's no data at all. Concerning. So she has some normalizing work to do still, then.
What if it was P(p) = 2 from 0 to 0.5, for each? Then she'd get 2, from 2^2=4, then 4*0.5=2. Makes sense: they're twice as much like each other. So the idea is at least relatively correct, good, good. She'll think of her answer as a Rating for now, rather than a true probability.
Rating = [(N+L)!(M+R)!(M+N+1)!(R+L+1)!] / [(N+L+M+R+1)!*M!*N!*R!*L!]
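A tiny sketch of those two toy cases, treating the flat and half-flat shapes as hand-built densities rather than anything coming out of the formula (Python; the setup is just illustrative):

# Sketch: the integral of two identical rectangular densities multiplied together.
# Flat case: density 1 on [0, 1].  Half-flat case: density 2 on [0, 0.5], 0 elsewhere.
flat_product = 1.0 * 1.0 * 1.0          # height 1 * height 1 * width 1
half_flat_product = 2.0 * 2.0 * 0.5     # height 2 * height 2 * width 0.5

print(flat_product)       # 1.0: no information either way
print(half_flat_product)  # 2.0: "twice as much like each other"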
It occurs to her now that if you assume each true probability can be anything between 0 and 1, the chances they line up exactly should be 0. In a way, it's nonsense to say they can be "the same", at least when working in this framework.
But they can still be nearer together or farther apart. Maybe what she's looking for is the expected difference in probability, or something like that. The half-flat ones were twice as good. And clearly, the half-flat ones are twice as near together. So distance apart is inversely related to the Rating, almost surely.
What's the average distance apart for a Rating of 1, then? That's the key to all this; she can work from that to get everything. But there's something tricky here, and she feels a tinge of suspicion.
Owl's Wisdom.
And she realizes she's at least a little wrong. The 0.5 and 0.6 spikes she thought about before would have a Rating of 0, and they're very definitely 0.1 apart. Darn. Is this the end of the road for the two-function method? But this sort of thing would have to happen no matter how she does it, wouldn't it? If she's working from an initial prior that the true p1 can be anything between 0 and 1, and the true p2 can be as well, then the odds of them ever being the same must always be zero.
She's so tempted to ask Keltham what the heck this problem is even supposed to be then, but she forces herself not to. This must be part of the test.
So let's think about this a little more, in a new and strange direction. Imagine she had a prior not about p1 or p2 individually, but about them being the same. If her prior was 1/2, say, that's sorta like saying p1 and p2 can each be one of two values with equal chance, and they might line up or might not. And a 1/3 chance of being the same would be three different values, and so on.
So maybe the chance p1=p2 is something you have to get both from the prior probability "Q" that p1=p2 and from the data itself. Maybe her Rating is useful after all? Let's think of some cases.
0.5 spike and 0.6 spike. Rating zero. Chance they're the same, always zero. Full-flat and Full-flat. Rating 1. Chance they're the same? Well, the full-flats are like having no data at all, which means the chance must remain Q, the prior you started with. Half-flat and Half-flat. Rating 2. Well, you definitely don't multiply, since Q might be more than 0.5, and 2Q would then be more than 1. Bad. But the chance is definitely more than Q. 0.5 spike and 0.5 spike. Rating... rating infinity. Has to be, infinitely squished together so infinitely big rating. Probability has to be 1.
So what sorta function looks like that? Multiplying is dumb; what about an exponent? But Q is less than 1 and big Rating is good, so maybe...
Updated Probability of Same = Q^(1/Rating) ???
It's a wild guess, but it gets points for being simple, at least. So this would mean half-flats with Rating 2 would give you SQRT(Q). 0.5 upgrading to 0.707. It seems plausible? It's at least something to use as a backup plan; it's not a terrible try.
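A one-line sketch of that guess, just to see the numbers being eyeballed here (Python; the Rating values are the flat, half-flat, and near-spike cases from before):

# Sketch of the wild guess: Updated Probability of Same = Q ** (1 / Rating).
Q = 0.5
for rating in (1.0, 2.0, 1e9):          # flat, half-flat, nearly-a-spike
    print(rating, Q ** (1.0 / rating))  # 0.5, ~0.707, and approaching 1.0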
Can she work it out from first principles now that she has a better idea what she's doing? Or maybe find that it's wrong and see something better? Owl's Wisdom runs out, and she decides to cool off for a little, do some sanity checks on her Rating to make sure it even makes sense. It gets better and better as both cluster to the same side, and worse and worse as they cluster to opposite sides. OK, good.
She's pretty sure now that the problem needs a Q. The way it's framed doesn't make any sense without it. If there were buckets, you could make guesses about bucket priors and it might be doable without a Q, but there are no buckets, and buckets are mean and nasty anyway. They went over that. And if you take a totally random 0 to 1 as the prior for both, then the answer to the question is just zero, and it's boring. You need a Q.
But how do you go from Q to anything useful? As necessary as it is, it's kind of an obnoxious object. She thinks about it hard, doesn't get anywhere, and then decides it's time for Fox's Cunning.
Let's go back to those half-flat functions: 2 from p=0 to p=0.5, 0 elsewhere. Imagine I'm given Q=0.5, and that function for each data set. What do I conclude?
It's difficult because the probability weight of the functions together is sort of fundamentally a line-shaped thing, and the functions apart is an area-shaped thing. But she knows it isn't an infinite update in favor of them being not the same; that'd be silly. The two full-flats make for no update at all. Maybe she can use that? With the full-flats, the Rating is 1, and Q isn't updated at all. That means Same and Not-Same had the same likelihood there. For the half-flats, the Rating is 2, so in a sense the likelihood of Same has doubled. The likelihood of Not-Same... maybe that can't really change? The total probability area can't really be affected by the little probability line.
So imagine a Rating of 2 is a 2:1 update. That feels right, in a comforting way. Her Ratings are just likelihood ratios, basically. So...
Updated Probability = 0.5*2/(0.5*2+0.5) = 2/3
Great, that makes sense. Or generally...
Updated Probability of Same = Q*Rating/(Q*Rating+(1-Q)*1)
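A small sketch of that update rule, read as prior odds Q:(1-Q) multiplied by the Rating and then converted back into a probability (Python; the function name is just illustrative):

# Sketch: treat the Rating as the likelihood ratio for 'Same' and update the prior Q.
def updated_probability_of_same(Q, rating):
    return Q * rating / (Q * rating + (1 - Q))

print(updated_probability_of_same(0.5, 2.0))  # 2/3, the half-flat example above
print(updated_probability_of_same(0.5, 1.0))  # 0.5: a Rating of 1 leaves Q alone
print(updated_probability_of_same(0.5, 0.0))  # 0.0: spikes in different places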
Yes, yes. She's going to register a second instance of cackling to be saved for later now. So in summary...
Prior Probability of Same = Q
Set 1: N Lefts, M Rights
Set 2: L Lefts, R Rights
Rating = [(N+L)!(M+R)!(M+N+1)!(R+L+1)!] / [(N+L+M+R+1)!*M!*N!*R!*L!] = Likelihood Ratio of 'Same' vs 'Not Same'
Updated Probability of Same = Q*Rating/(Q*Rating+1-Q)
In Civilization, they wouldn't even give a prior Q as a guess in most cases, just the "Rating", aka the Likelihood Ratio. But she figures it's good to be very clear about how to handle a Q, if you had one and wanted to do something with it.
Do her old sanity checks still work? A Rating of zero gives an update to zero. A Rating of infinity gives an update to one. A Rating of one still gives an "update" to Q. And this is more like the real language of probability than her first guess; it's actually justified, at least sort of. She thinks that for now this is as good as she can get with the main part of the problem.
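The whole recipe, gathered into one sketch (Python, using log-factorials via math.lgamma so larger counts don't overflow; the function names and example counts are just illustrative):

# Sketch: Rating and updated probability of 'Same', as summarized above.
from math import lgamma, exp

def lfact(k):
    return lgamma(k + 1)  # log(k!), safe for large counts

def rating(N, M, L, R):
    # (N+L)!(M+R)!(M+N+1)!(R+L+1)! / [(N+L+M+R+1)!*M!*N!*R!*L!]
    return exp(lfact(N + L) + lfact(M + R) + lfact(M + N + 1) + lfact(R + L + 1)
               - lfact(N + L + M + R + 1) - lfact(M) - lfact(N) - lfact(R) - lfact(L))

def updated_probability_of_same(Q, N, M, L, R):
    r = rating(N, M, L, R)
    return Q * r / (Q * r + (1 - Q))

print(rating(0, 0, 0, 0))                                 # 1.0: no data, Q unchanged
print(updated_probability_of_same(0.5, 40, 10, 38, 12))   # both lean Left: above 0.5
print(updated_probability_of_same(0.5, 40, 10, 10, 40))   # opposite leans: below 0.5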
But this is so so so important. She gets her second Fox's Cunning from staff, and then spends the first minute of it looking over everything she's done again, just in case. Nothing else catches her eye, so she starts thinking about the rest in truth now, writing paragraphs after all the equations, interspersed with notes:
"Now imagine that the p you're looking for really is the same, and you have some medicore Prior Q of that to start with. Let's say p=0.5. But one experiment is leaning right a little, and one left a little, so their true probabilities are p1=0.45 and p2=0.55."
"With small amounts of data in both experiments, the probability functions won't be very sharp. They'll just be soft hills, taller in the middle, and the Rating you get by combining them will still be higher than 1, and update Q in the correct direction, towards 'Same'. But as you took more and more data, the functions get sharper and sharper, like her imaginary spikes. Eventually you'll have spike-like functions at 0.45 and 0.55, and they'll barely even intersect! The rating would be terrible, much less than 1, and update Q in the wrong direction, towards 'Not Same.'"
"So at first, if the experiments just had a little data, comparing them would suggest the right result. But as they got more and more, the problem would get magnified, and it would eventually start suggesting the wrong result instead. It almost seems like a paradox: more data should be making things better, but it's making things worse."
"It's because even if the data is real data, the probability functions from it are sort of lies, if you think they're referencing the true behavior and not just the data from the experiment. p1 and p2 aren't really quite the same thing, they're both just nearby shadows of the same p. So really, the Law isn't lying to you. It's saying p1 and p2 are probably different, and they are! But that's just about the experiments, not about p itself."
"So when you look at an experiment, you should maybe limit how spiky you let your probability function get for the true p. You need to force it to be at least as wide as the inherent errors in the experiment are big. Exactly how wide you force it to be, and how to do that with Law instead of with handwaving, is probably the subject of another problem."
That last part seems like a big, difficult problem, and she's fresh out of Foxes and Owls. So that's that with that for now.
But as lunch approaches and she looks her paper over, she feels confident. Maybe some of this is wrong, it's possible, but she doesn't think it could be wrong enough to sink her.