Continuing from: Day 24 (20) / Morning Lecture
Dath ilan has many different ways of looking at ideas like "the integers" or “the points between 0 and 1”, and teaches its children to switch between them as appropriate, especially when they are doing rapidly-prototyped mathematics.
The problem before them is to sum up:
1/INF * [ (0/INF)^N + (1/INF)^N + (2/INF)^N + ... + (INF/INF)^N ]
...and get the correct answer of 1/(N+1). This obviously reduces to summing:
(0/INF)^N + (1/INF)^N + (2/INF)^N + ... + (INF/INF)^N
...and getting INF/(N+1).
(Retroactive parenthetical class chatter)
"So..." A hand goes up. It is the hand of a student who does not want to admit weakness, but will do so anyway, under the eyes of Security. "It isn't obvious to me how that reduces?"
Keltham starts to answer -
Oh wait never mind it is obvious.
It's too late to say that now, though, so the student will just listen to Keltham speak the awful words that he's just multiplying both the original sum and the original result 1/(N+1) by infinity.
Factoring out (1/INF)^N from all the terms (n/INF)^N reduces the problem to one of summing:
0^N + 1^N + 2^N + 3^N + ... + INF^N
...and getting INF^(N+1) / (N+1).
(Retroactive parenthetical class chatter)
A hand goes up, in obedience to stated Orders. "Why are we factoring this out, sir? And why does it require summing this?"
...they're taking (0/INF)^N + ... and factoring out (1/INF)^N from all terms (n/INF)^N to yield n^N * (1/INF)^N, for each term, and then dividing both sides by (1/INF)^N to yield the sum over n^N equalling [ INF/(N+1) ] / (1/INF)^N = INF^(N+1) / (N+1).
Then, once they get the solution for an infinite sum of powers n^N, they'll multiply that whole thing through by (1/INF)^N and by (1/INF), and get back out...
1/INF * [ (0/INF)^N + (1/INF)^N + (2/INF)^N + ... + (INF/INF)^N ]
...which represents an infinity of possible hypotheses, for all possible propensities between 0 and 1, for how often a ball goes LEFT, each hypothesis with 1/infinity prior probability.
Look, it's just the same as if they'd used four buckets each with 1/4 prior probability, and asked for the chance of seeing 9 balls going LEFT; that would be:
1/4 * [ (0/3)^9 + (1/3)^9 + (2/3)^9 + (3/3)^9 ]
...Somebody then asks whether it ought to be INF+1 in the denominator of the original expression.
Keltham stares at the wall for a moment, then sagely announces that, with a sum this large, it won't make any relative difference if they drop any finite number of terms from it, since those contribute 0% of anything. He's just going to arbitrarily declare that one of the ones in the middle is missing.
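For anyone who wants to check the arithmetic outside the lecture, here is a quick sketch in Python; the grid of M+1 equally weighted buckets is an arbitrary finite stand-in for the INF hypotheses, and as M grows the chance of seeing N LEFTs closes in on 1/(N+1), with the M-versus-M+1 quibble washing out:

# Sketch: a finite grid of M+1 equally likely propensities k/M, each
# predicting N LEFTs with probability (k/M)^N; the total chance of N LEFTs
# should approach 1/(N+1) as M grows. (M is a finite stand-in for INF.)
def chance_of_n_lefts(n_lefts, m):
    return sum((k / m) ** n_lefts for k in range(m + 1)) / (m + 1)

print(chance_of_n_lefts(9, 3))        # the four-bucket example: ~0.2565
print(chance_of_n_lefts(9, 1000))     # ~0.1004
print(chance_of_n_lefts(9, 1000000))  # ~0.1000004, converging on 1/(9+1)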
Now in an intuitive sense, INF^(N+1) / (N+1) is obviously what the answer has to be, of course.
If you stack up a bunch of one-dimensional lines of length 1, 2, 3, ... 10, you end up with roughly half of a two-dimensional square of side 10.
If you stack up a bunch of two-dimensional squares each with sides 1, 2, 3, ..., 10, you end up with roughly a third of a three-dimensional cube of side 10.
So naturally if you stack up a bunch of N-dimensional hypercubes with sides 1, 2, 3, ... infinity, you'll end up with exactly 1/(N+1) of an (N+1)-dimensional hypercube with side infinity.
(Retroactive parenthetical class chatter)
(This nerdsnipes multiple researchers hard enough that they have trouble following the next few sentences, which thankfully aren't vital ones.)
Actually, they might possibly want to check that on the numbers up to, like, 5, or something.
1 + 2 + 3 + 4 + 5 = 15 which is like not that far off from half of 25.
1 + 4 + 9 + 16 + 25 = 55 which is not that far off from 1/3 of 125.
1 + 8 + 27 + 64 + 125 = 225, not far from 1/4 of 625.
It's a bit more each time, since there are extra terms in the sum beyond the one for SIDE^(N+1) / (N+1). But as you sum a larger number of terms, the extra terms grow less quickly, and they can be ignored when the sums are infinite.
Like 1 + 2 + 3 + ... + 100 = 5050 and that's pretty close to half of 10,000.
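A small Python sketch of those checks, carried a bit further (the particular values of N and S are arbitrary):

# Sketch: compare 1^N + 2^N + ... + S^N against S^(N+1)/(N+1) for a few
# finite S; the ratio should drift toward 1 as S grows.
def power_sum(s, n):
    return sum(k ** n for k in range(1, s + 1))

for n in (1, 2, 3):
    for s in (5, 100, 10000):
        exact = power_sum(s, n)
        approx = s ** (n + 1) / (n + 1)
        print(f"N={n} S={s}: sum={exact}, S^(N+1)/(N+1)={approx:.1f}, ratio={exact / approx:.4f}")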
Someone is increasingly confused about how this INF business works. Is INF - 1 a different number from INF? Is INF-(INF-2)+3 equal to 5?
Yes to both, obviously.
The usual guideline for answering questions like that is to pretend that INF is a very large finite number? The whole idea is that a bunch of things that are approximately true about large numbers become exactly true about INF.
2 divided by a very large number is approximately 0.
So 2/INF is exactly 0, so long as it isn't being multiplied by infinity somewhere outside that.
Similarly, (INF-1)/(INF+1) = INF/INF = 1.
That sounds like a question SLOW people would ask! Boring question! Let's move on!
"Keltham, has it ever occurred to you that, as an eight-year-old boy, you were taught mathematics in a style that someone thought would appeal to eight-year-old boys?"
She doesn't actually say it out loud. Maybe when they're alone.
Now, the nice thing about working with infinities is that you don't have to be very precise at all, and you'll still get exactly the right answer at the end!
Instead of taking x^N, let's use the 'rising powers', okay that didn't translate so probably is not a thing in Golarion, or at least the Share Language donor didn't know it.
Instead of taking x * x * x * ... * x, with N copies of x, let's consider the expression x * (x + 1) * (x + 2) ... until you've got N of those.
Let's denote that x`N`.
Obviously if x is any finite fraction of the way towards infinity, x`N` = x^N exactly. Like, when you're working with trillions, a trillion squared is nearly the same number as a trillion times a trillion plus one, proportionally speaking; that's approximately true, so with infinities it's exactly true.
Example with x=10, N=3:
10`3`
= 10*11*12
= (10)(10+1)...(10+(3-1)) (3 terms total)
= (x)(x+1)...(x+(N-1)) (x`N` with x=10, N=3)
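A minimal Python sketch of the rising power, with the same x=10, N=3 example and a check that it tracks x^N proportionally once x is large compared to N (the helper name rising() is just an illustrative choice):

# Sketch: the 'rising power' x`N` = x * (x+1) * ... * (x+N-1).
def rising(x, n):
    result = 1
    for i in range(n):
        result *= (x + i)
    return result

print(rising(10, 3))                         # 10*11*12 = 1320
print(rising(10, 3) / 10 ** 3)               # 1.32: noticeably off for small x
print(rising(10 ** 6, 3) / (10 ** 6) ** 3)   # ~1.000003: nearly x^N for large x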
Now observe that:
11`3` - 10`3`
= 11*12*13 - 10*11*12
= (13 - 10) * (11 * 12)
= 3 * 11`2`
By pointwise generalization from 10 to x and 3 to N:
(x + 1)`N` - (x)`N`
= (x + 1)(x + 2)...(x + N) - (x)(x+1)...(x + N - 1)
= ((x + N) - x) * (x+1)...(x + N - 1)
= N * (x+1)`N - 1`
Rearrange:
(x + 1)`N` - (x)`N` = N * (x+1)`N - 1`
==>
x`N` + N*(x+1)`N-1` = (x+1)`N`
We now have an N-1 term in our rule, which would be bad if N were 0. Eliminate all subtraction whenever possible! In case something ends up 0 or negative that didn't start out that way!
In this case we'll just redescribe the number N as being some other number N plus one, so N-1 goes to "N" and N goes to "N+1". We coulda done that originally, but it woulda been needlessly more symbols.
x`N` + N*(x+1)`N-1` = (x+1)`N`
==>
x`N+1` + (N+1)*(x+1)`N` = (x+1)`N+1`
Sanity check with x=10, N=2:
x`N+1` + (N+1)*(x+1)`N` = (x+1)`N+1`
=> 10`2+1` + (2+1)*(10+1)`2` = (10+1)`2+1`
=> 10*11*12 + 3*(11*12) = 11*12*13
PASSED
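And a sketch that brute-checks the rearranged rule over a small range of x and N, using the rising() helper from the previous sketch (the tested ranges are arbitrary):

def rising(x, n):
    result = 1
    for i in range(n):
        result *= (x + i)
    return result

# Check x`N+1` + (N+1)*(x+1)`N` = (x+1)`N+1` for small x and N.
for n in range(0, 5):
    for x in range(0, 20):
        assert rising(x, n + 1) + (n + 1) * rising(x + 1, n) == rising(x + 1, n + 1)
print("recurrence holds for all tested x, N")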
Then, by induction on f(x) = x`N+1`:
0`N+1` = 0 (base case)
(x+1)`N+1` = x`N+1` + (N+1)*(x+1)`N`
==>
x`N + 1` = (N+1) * 1`N` + (N+1) * 2`N` + ... + (N+1) * x`N`
(can optionally add a starting term of (N+1)*0`N`, since that term equals zero)
Sanity check with x=3, N = 2:
(3)(4)(5) = 3(0)(1) + 3(1)(2) + 3(2)(3) + 3(3)(4)
60 = 0 + 6 + 18 + 36
PASSED
Divide both sides by N+1:
x`N+1` / (N + 1) = 1`N` + 2`N` + 3`N` + ... + x`N`
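The same kind of brute-check for the summed-up version, again with the illustrative rising() helper and arbitrary small ranges of x and N:

def rising(x, n):
    result = 1
    for i in range(n):
        result *= (x + i)
    return result

# Check x`N+1` / (N+1) = 1`N` + 2`N` + ... + x`N` for small x and N.
for n in range(0, 5):
    for x in range(1, 20):
        lhs = rising(x, n + 1) // (n + 1)
        rhs = sum(rising(k, n) for k in range(1, x + 1))
        assert lhs == rhs
print("summation identity holds for all tested x, N")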
Approximations are exactly correct at infinity:
(INF)`N` = (INF)^N
...and this is also exactly true once you get any substantial fraction of the way towards infinity. Like, even (1,000,000)`3` starts to look pretty close to (1,000,000)^3.
As for all the finite numbers at the start of the sequence, where the equality isn't exact, they collectively comprise 0% of anything and can be ignored.
Conclusion:
0^N + 1^N + 2^N + 3^N + ... + INF^N = INF^(N+1) / (N+1).
So if you stack up an infinite number of squares with sides 0, 1, 2, 3... you will get exactly a third of an infinitely large cube, and more generally, if you stack up an infinite number of N-hypercubes with sides 0, 1, 2, 3... you'll get exactly 1 / (N+1) of an infinite (N+1)-hypercube.
Which, in case anybody's forgotten by now, was what they needed to prove that:
1/INF * [ (0/INF)^N + (1/INF)^N + (2/INF)^N + ... + (INF/INF)^N ] = 1/(N+1)
...which in turn is how you:
- Represent an infinite number of possible hypotheses, about how far from 0% LEFT to 100% LEFT the first Ball 0 could have landed;
- Each hypothesis with equal prior probability 1/infinity;
- And then update on seeing N balls go LEFT;
- As would have a likelihood of p^N for each hypothesis a fraction p of the way between the left side and right side.
Also in case anybody's forgotten, if you imagine N balls plus the 0 ball getting randomly ordered, the chance of the 0 ball coming rightmost in the ordering - hence, all other balls being LEFT of it - is 1/(N+1). Which is how they originally knew that 1/(N+1) was the desired end result, for the likelihood of seeing all N balls land on the LEFT.
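For anyone who would rather see that ordering argument run than take it on faith, a quick simulation sketch (the trial count is arbitrary):

import random

# Shuffle balls 0..N and count how often ball 0 ends up rightmost;
# the fraction should hover around 1/(N+1).
def rightmost_fraction(n, trials=100000):
    hits = 0
    for _ in range(trials):
        order = list(range(n + 1))
        random.shuffle(order)
        if order[-1] == 0:
            hits += 1
    return hits / trials

print(rightmost_fraction(10))   # roughly 1/11 ~ 0.0909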
Also also in case anybody's forgotten, the reason they needed this sum was to renormalize the relative probabilities into absolute probabilities. That's another step along the direct path to deriving a prediction about the next ball observed.
After seeing N balls go LEFT, the posterior distribution ends up proportional to a distribution where each hypothesis p has weight p^N.
But if you forget to normalize that prior-times-likelihood distribution, and ask what happens if every part of the distribution then contributes p likelihood to the next ball going LEFT, you'd first add one extra factor of p everywhere, and then calculate:
1/INF * [ (0/INF)^(N+1) + (1/INF)^(N+1) + ... + (INF/INF)^(N+1) ] = 1/(N+2)
After seeing 100 balls go LEFT, you shouldn't calculate a 1/102 chance of the next ball going LEFT - you should think it's more likely than not for the next ball to go LEFT - so something has gone wrong here; and what's wrong is that we didn't normalize the prior-times-likelihood distribution.
The pre-normalized distribution sums to 1/(N+1); the normalized distribution should sum to 1. So we divide everything through by 1/(N+1), meaning we multiply by (N+1), to make it sum to 1. For this example, that means multiplying everything uniformly by 101 to normalize it. Which yields a 101/102 chance of the next ball going LEFT, which is what it should be.
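A sketch of that normalization step on a finite grid (the grid size M and the 100-LEFTs example are just illustrative choices), showing the predicted chance of the next LEFT landing near 101/102 rather than 1/102:

# Approximate the INF hypotheses with M+1 grid points k/M.
M = 100000
N = 100
unnormalized = [(k / M) ** N for k in range(M + 1)]   # prior-times-likelihood, up to a constant
total = sum(unnormalized)
posterior = [w / total for w in unnormalized]          # normalized to sum to 1
next_left = sum(weight * (k / M) for k, weight in enumerate(posterior))
print(next_left)       # ~0.9902
print(101 / 102)       # 0.99019...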
Though, Keltham quickly remarks, the principle of making the posterior sum to 1, is something that only applies within that whole metahypothesis. If you see balls going LEFT RIGHT LEFT RIGHT repeating, you might form, after a few steps, the hypothesis that the ball goes LEFT RIGHT repeating, which can't be expressed within that whole metahypothesis. All of this reasoning is taking place inside the metahypothesis about the balls having some stable but otherwise not-further-scrutable fractional propensity to go LEFT or RIGHT, with all such propensities between 0 and 1 being equally likely on priors. This whole metahypothesis then may do better or worse compared to some other hypotheses.
He's not saying that the metahypothesis absolutely has to be true, by saying that the sum of all the hypotheses inside it should be 1. If you started believing that, you'd never in principle be able to notice any other pattern like LEFT RIGHT repeating.
Rather, this whole metahypothesis is itself being entertained as one of many in-principle possible metahypotheses about how things could be, and the probabilities of all the hypotheses inside it summing to 1, are to be understood as being what we get when we entertain the possibility, condition on, reason inside, the metahypothesis being true.
After seeing N balls go left, the whole metahypothesis has racked up a joint probability of 1/(N+1) and scored log(1/(N+1)) in terms of how well it's doing. If you previously started out with any prior credence on an incompatible metahypothesis, like 'Actually maybe the ball just always goes LEFT or RIGHT, both equally credible on priors,' that other metahypothesis may be doing better. In fact, the always-left-or-always-right metahypothesis has now racked up probability 1/2 and a score of log(1/2) = -1, which is probably quite a bit better than 1/(N+1) if you've seen a lot of balls go LEFT. You'd probably be believing in that metahypothesis a bit more, now, even if it started out less meta-credible than the other.
"Why do we need that - other hypothesis, metahypothesis - about there being hypotheses where the ball always goes LEFT or RIGHT? Aren't those hypotheses already inside - the metahypothesis about all the fractions of times a ball could go LEFT or RIGHT?"
"Like, 0/INF is that it never goes left, and INF/INF is that it always goes left... am I missing something?"
"Both those hypotheses have prior probabilities of 1/INF, remember? That's equal to zero. So the metahypothesis thinks it's infinitely unlikely that the fraction of times the ball goes left is exactly equal to 1 or 0. It never notices that's true, no matter how many examples it sees."
"If it sees 100 balls go left, it'll get a joint prediction of 1/101. If it sees 10,000 balls go left, it'll get a joint prediction of 1/10,001. In the limit, the probability that it assigns to a sequence of left-going balls, continuing forever, is 0. Before you've rolled a single ball, the metahypothesis is already absolutely certain that it won't just go left every time."
"The alternative metahypothesis that the ball has a 50% chance of always going left, and 50% chance of always going right, loses all of its probability and scores negative infinity as soon as it sees any sequence with a mix of left and right balls. But if you present that metahypothesis with a sequence of balls going left forever, no matter how long that continues, it predicted that with 50% probability from the start."
Keltham will now zoom in closer on what happens to the 'any fraction between 0 and 1 is equally likely' metahypothesis when it updates.
Let's say that we see 10 balls go LEFT.
After seeing 10 balls go LEFT, the metahypothesis as a whole has a cumulative score of 1/(N+1) = 1/11, or equivalently, if you asked this metahypothesis at the dawn of time what probability it assigned to the first 10 balls going left, it would answer "1/11". Equivalently, isomorphically, these are stating the same thing, if you multiply out the prior distribution (1/INF everywhere) and the likelihood distribution (p^10 everywhere) you'd get a new distribution p^10/INF which, if you summed over the infinitely many terms inside it, would sum to 1/11.
Continuing to reason within this metahypothesis, one of the internal subhypotheses must by assumption be true, and so within the metahypothesis the posterior probabilities of the (infinite collection of) subhypotheses must sum to 1.
Which you'd obviously do by dividing all the terms through by 1/11, that is, multiplying them by 11.
The distribution...
[ (0/INF)^10 + (1/INF)^10 + ... + (INF/INF)^10 ] / (INF * 1/11)
...will, indeed, sum to 1.
To be clear, by having the distribution sum to 1, we are assuming-for-the-sake-of-extrapolation that some hypothesis in the collection must be true. But no matter how far the posterior updates on finite data, every individual hypothesis in the collection will still have posterior 0.
After rolling 1000 balls and seeing 985 go left, there's finite probability between 0.98 and 0.99, but 0 probability on the fraction being exactly 0.985.
That's just how this metahypothesis rolls. If you wanted a metahypothesis that could put little finite dots of probability mass on all the big round numbers like 0.985, 1/2pi, and so on, that would be a much more complicated ask.
What we ask about, having updated, is not so much the probability of individual hypotheses - which is always 0 - but the new shape of the curve.
If you see 10 balls go LEFT, the individual hypothesis 0.9 has now scored around 1/e, 37% or so, compared to a score of 1 for the individual hypothesis 1.0. If you see 100 balls go LEFT, raise all that to the power of 10; now 0.9 has less than 0.01% of the cumulative likelihood of 1.0. On the scale of the curve, this means that over 99.99% of the probability is now concentrated into the interval between 90% and 100%. Though, of course, still 0% is concentrated at exactly 100%.
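A grid sketch of that concentration claim, with the same arbitrary grid size as before:

# After 100 LEFTs, how much of the (grid-approximated) posterior sits
# between propensity 0.9 and 1.0?
M = 100000
N = 100
weights = [(k / M) ** N for k in range(M + 1)]
total = sum(weights)
mass_above_09 = sum(w for k, w in enumerate(weights) if k / M >= 0.9) / total
print(mass_above_09)   # ~0.99998, i.e. over 99.99% of the posterior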
Keltham will Prestidigitate-sketch a few rough images of what curves like these should maybe probably look like, after updating on various numbers of LEFT and RIGHT balls observed. Keeping in mind that Keltham's not a 'computer', and can't draw the curves perfectly.
If you're willing to assume your audience can already freely wield the machinery of derivatives and antiderivatives and understand on an intuitive level everything that they mean, then sure, you can solve the problem more quickly using overpowered tools like that.
Integrating x^N gives you x^(N+1)/(N+1), which evaluated from 0 to 1 will give you 1/(N+1).
Given a mix of N LEFTs and M RIGHTs, however, and the need to renormalize the posterior over that, they might find themselves in a bit more trouble when it comes to deriving the integral...
Integral from 0 to 1, over p, of (1 - p)^M * p^N dp = M! * N! / (M + N + 1)!
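One can at least confirm the formula numerically without remembering the proof; a rough midpoint-rule sketch in Python (the step count and the tested M, N pairs are arbitrary):

from math import factorial

# Numerically integrate (1-p)^M * p^N over [0, 1] and compare against
# M! * N! / (M + N + 1)!.
def beta_integral(m, n, steps=200000):
    dp = 1 / steps
    return sum((1 - (i + 0.5) * dp) ** m * ((i + 0.5) * dp) ** n for i in range(steps)) * dp

for m, n in [(2, 3), (5, 1), (4, 4)]:
    exact = factorial(m) * factorial(n) / factorial(m + n + 1)
    print(m, n, round(beta_integral(m, n), 8), exact)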
Keltham doesn't actually remember the exact proof, it's been a while, but he's pretty sure the introduction he used was the one his teachers used on him. So it's possibly the right introduction for getting the more complicated proof in the least painful way, once he starts trying to remember that?
And in any case, Keltham hasn't yet been told that he's allowed to assume everyone in this class has a full intuition for derivatives and integrals. Keltham hasn't taught them a calculus class himself, and who knows what weird malfunctions exist in the Golarion way of teaching it.
Why, it wouldn't shock him, at this point, if people are just being told to memorize formulas and not really taught to intuit how anything works!
So he's just talking directly about the infinity of possible hypotheses between 0 and 1, and how to sum up priors-times-likelihoods and posterior-predictions from those, rather than using calculus to abstract over that. Abstracting over that only works to teach mathematical intuition if people already know what's being abstracted away.