thellim vs. p-values
Permalink

Okay, but... how much money would he need to be paid to report likelihood functions anyways, despite their being associated with priors, which are 'subjective'?  Is it possibly just a hundred bucks, if he focuses hard on the feeling, and considers how, even though 'subjectivity' seems like a very bad thing, there isn't any actual concrete terrible bad thing that happens to you?  Especially if you just report likelihood functions, and don't talk about any priors or posteriors on the real-world facts (as opposed to internal model variables being integrated out).  As is standard practice anyways, because other people might know something you don't, and because your report isn't a realtime-updated prediction market -

Permalink

Well, no, the thing he'd worry about is that if you report on the data having whatever-it-was likelihood, given the coin having a 20% propensity to heads, somebody will decide that, even though this likelihood was pretty low, the coin still had trillion-to-one prior odds in favor of a 20% propensity to heads!

Wouldn't this make the scientific process open to obvious abuses?  What then?

Permalink

Ideally?  In a saner world?  Whoever says that can lose a ton of money on prediction markets about what the next experiment will show on the same coin!

Permalink

Aren't prediction markets about that sort of thing illegal because of gambling laws?

Permalink

Thellim is aware.  Sometimes she likes to pretend she's somewhere else where that isn't true.

Permalink

To be clear, he's not especially a supporter of those laws himself -

Permalink

Yes.  Thank you for the sentiment.

As it happens, a sane scientific process says that putting a prior probability like that, on any particular effect-size, or reporting a posterior after seeing the data, isn't supposed to be the job of experimentalists in the first place. So maybe this is a place where that whole horror of the 'subjective' could work in their favor.

And has he noticed that there's... kind of a lot of end-user-manipulable free variables inside the 'frequentist' procedures that have been constructed in incredibly elaborate ways so as to avoid all mention of likelihoods?

Permalink

 

...is she talking about p-hacking?

Permalink

She's talking about how, if you flip a coin and get the sequence HHHHHT, then this is 'statistically significant' if you got the sequence by following the rule 'Flip until you get the first tail and then stop', in which case the sequence is only six coinflips or longer on 1/32 of occasions, p < 0.05.

And 'fails' to achieve statistical significance if you got exactly the same observations by following the rule 'Flip the coin six times and count the tails', in which case there are 7 ways out of 64 to get one or fewer tails, p ≈ 11%, n.s.
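(A minimal sketch of that arithmetic in Python, just counting outcomes under the two stopping rules; the 1/32 and 7/64 fall straight out:)

```python
from fractions import Fraction
from math import comb

# Rule 1: "flip until the first tail, then stop".  Getting a sequence at
# least as long as HHHHHT requires the first five flips to all be heads:
p_stop_at_first_tail = Fraction(1, 2) ** 5        # 1/32 ~ 0.031, p < 0.05

# Rule 2: "flip exactly six times, count the tails".  "At least as extreme"
# now means one tail or fewer in six flips:
ways = comb(6, 0) + comb(6, 1)                    # HHHHHH plus six one-tail orders
p_fixed_six_flips = Fraction(ways, 2 ** 6)        # 7/64 ~ 0.109, not significant

print(p_stop_at_first_tail, float(p_stop_at_first_tail))
print(p_fixed_six_flips, float(p_fixed_six_flips))
```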

Permalink

Well, no reputable journal would accept the first analysis - it's not the way somebody would normally compute the p-values after collecting the data on six coinflips.  Which isn't really very much in the way of coinflips anyways -

Permalink

Or take 95% confidence intervals.  One way to get a 95% confidence interval is... whatever the traditional method is, Thellim tried to read it several times but her eyes started to bleed and black tentacles started squirming out from under nearby doors, but it involved taking the average of all the measurements so far, and picking a data-dependent constant c to add to and subtract from that mean, to get an interval (mean - c, mean + c) such that following this procedure would, 95% of the time, in the long run, give you a confidence interval that contained the true population average value of the parameter, if you were sampling randomly from that population.

There's also, obviously, going to be some way of picking a constant d and reporting the interval (mean - d, mean + 2d), such that this method will, 95% of the time in the long run, produce an interval containing the true population mean.

Yet another way to get a 95% confidence interval is to, like, construct a 99% confidence interval the traditional way, and then with 4/99 probability, report the interval (purple, blagoobah).  It doesn't matter that blagoobah isn't a word.  The interval will still contain the true average 95% of the time, in the long run, if the method is separately repeated infinitely many times.
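(And a quick Monte Carlo sketch of that third construction, assuming normally distributed measurements with known spread; the only point is that the silly procedure really does cover the true mean 95% of the time in the long run, since (95/99) × 0.99 = 0.95 exactly:)

```python
import random, statistics

def coverage(n_trials=200_000, n=20, mu=0.0, sigma=1.0):
    z99 = 2.576                                   # two-sided 99% normal quantile
    hits = 0
    for _ in range(n_trials):
        xs = [random.gauss(mu, sigma) for _ in range(n)]
        m = statistics.fmean(xs)
        half = z99 * sigma / n ** 0.5             # known-sigma 99% interval
        if random.random() < 4 / 99:
            covered = False                       # report (purple, blagoobah): never covers
        else:
            covered = (m - half) <= mu <= (m + half)
        hits += covered
    return hits / n_trials

print(coverage())                                 # ~0.95
```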

Permalink

He agrees that the second and third methods are worse ways to get 95% confidence intervals than the first method; they'll produce wider intervals with no gain in correctness.

Permalink

What she's trying to get at here is that all the 'p-values' and 'confidence intervals' are much more manipulable than likelihood functions as a summary of the data!

P-values contain free parameters for which class of other possible results you decide to lump in with the exact data that you saw.  If you say that the result HHHHHT is part of a class of results that includes {THHHHH, HTHHHH, HHTHHH, HHHTHH, HHHHTH, HHHHHT, HHHHHH}, then there are 7 lumped-in results like that with total probability 7/64, so you say 'not significant'.  If instead you say that HHHHHT is part of a class of results that includes {HHHHHT, HHHHHHT, HHHHHHHT...} then there are infinitely many lumped-in results like that with total probability 1/32, so you say 'significant'.

Similarly with 'confidence intervals', and all the different ways you could fiddle interval construction, and still have it be a valid deduction that the method would with 95% probability - pardon her, would with 95% frequency in the long run - produce an interval containing the true average measurement.

Likelihood functions don't have free parameters like that!  You don't take the actual results you saw and lump them in with other results you didn't see and calculate their summed probability!  You don't apply 'methods' to things to draw weird intervals around them!  You just use the actual data that you saw!  For any sufficiently characterized way the world can be, any hypothesis of a sort that experimenters ought to summarize statistics about, there's just some valid deduction about how likely that world was to produce the exact data observed.
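(A sketch of that deduction for the running example: the likelihood of the exact sequence HHHHHT, given a coin with heads-propensity θ, is just θ^5 × (1 − θ), and it is the same function of θ no matter which stopping rule produced the flips:)

```python
# Likelihood of the exact data HHHHHT as a function of the coin's
# heads-propensity theta.  "Flip until the first tail" and "flip exactly
# six times" both assign this same probability to the exact sequence seen
# (summarizing the fixed-six-flips design by the count instead only adds a
# constant factor of 6, which cancels out of any likelihood ratio).
def likelihood(theta, heads=5, tails=1):
    return theta ** heads * (1 - theta) ** tails

for theta in (0.2, 0.5, 0.8):
    print(theta, likelihood(theta))
```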

Permalink

You can't fiddle with the numbers if you're honest, is the idea in science.  Science does presume that scientists were honest in reporting which rules they followed to do their experiments; it does assume that somebody who flips a coin six times, gets HHHHHT, and stops, will honestly say 'I decided to flip the coin six times and then stop' and not lie and say 'I decided in advance that I'd flip the coin until I got tails and then stop'.

So long as nobody lies, though, the free parameters can't be used to cheat - you can't have some clever way of picking the method that lets you get p < 0.05, with probability greater than 5%, when the null hypothesis is actually true.

Science does presume honesty; but absent that presumption, people could just lie about the data anyways, no matter what stats you used.

Permalink

When you flip a coin and get HHHHHT, the meaning of that result should not depend on the experimenter's state of mind unless the experimenter's state of mind is able to affect the coin.  What tells you about reality is what is entangled with reality; the coinflips are entangled with the coin; if the experimenter is not telekinetic, then who cares what they were thinking.

Once you have characterized the world, the likelihood of the data, given that world, doesn't change with what the experimenter is thinking, unless the model says that the experimenter's thoughts are able to affect reality.

And let's be frank here, if Earthlings are not entirely immune to temptation in the face of terrible incentives and no prediction markets and not very much replication, people might be quantitatively more tempted to fudge their unobservable private intentions while conducting their experiment than to fudge the hard, objective facts of what was observed, which somebody would be more likely to catch them lying about.

(Thellim would ask how an epistemology with experimenter's-unobservable-private-intentions-dependent interpretation of the evidence is not 'subjective', but she has worked out by this point that 'subjective' actually means 'the moon makes me feel like I hate this' and doesn't really relate to the ordinary English meaning of the word.)

Permalink

Well, that's why there are very standard methods for computing the p-values, which, in practice, get rid of a lot of the downsides of the 'free parameters' Thellim is complaining might exist in principle.  Nobody's actually going to believe you if you flip a coin six times, get HHHHHT, and claim you decided to flip until you got a tail and then stop.

Permalink

With very very simple statistics, that might, possibly, be true with respect to that exact particular class of disaster that results from 'p-values' lumping different hypotheses and outcomes into weird buckets.  There's other disaster classes she'll get to.

But Earth science has been known, on occasion, to involve complicated procedures with dozens of parameters entered into 'stats' computer programs.

Past that point, if the reported-on strength of your evidence isn't independent of the experimenter's private state of mind - and there's no prediction markets, and there's no preregistration of studies including the analysis methods they'll use, and journals accept papers after the evidence gets gathered and analyzed instead of accepting the preregistered paper beforehand, and the journals openly take into account the results when deciding whether to publish the study, meaning experimenters are being openly plied with incentives that are not accuracy incentives - you're kind of screwed.

Permalink

Well, he's definitely not arguing against the point that there ought to be more preregistration of studies, though the idea that journals should accept papers based on their preregistration is one that he hasn't heard before.  He likes it in principle, but it's the sort of noble idea that basically never happens in real life.  His field is still trying to replace Elsevier journals with open-access ones.

Permalink

Ah, yes.  Elsevier.

Earth 'science' is a giant obvious flaming disaster, and you'd think people would be more open to the suggestion that maybe the incredibly complicated private-state-of-mind-dependent statistical analyses would perhaps also be broken and contributing to the giant flaming disaster.

Permalink

He really doesn't think that bad stats are the primary problem there -

Permalink

Every single aspect of Earth's science is broken SIMULTANEOUSLY and the stats are PART OF WHAT'S BROKEN.

The 'null hypothesis significance testing' paradigm means that a 10% apparent effect size on a small group and a 3% apparent effect size on a larger group can be treated as a replication having 'reproduced' the original's 'rejection of the null hypothesis' instead of as a failure to reproduce the apparent 10% effect size.

The 'null hypothesis significance testing' paradigm means that journals treat some findings of an experiment as 'failures' to find a 'statistically significant effect' rather than as valid evidence ruling out some possible effect sizes.  That journals then don't publish that evidence against larger effect sizes, because they didn't accept the paper on the basis of its preregistration, is an enormous blatant filter on the presented evidence which no sane society would tolerate for thirty seconds, and also a giant blatant incentive that is not an accuracy incentive.  If you think in likelihood functions there are no failures or successes, there is no 'significant' or 'insignificant' evidence, there is just the data and the summaries of how likely that data is given different states of the world.

If you've correctly disentangled your hypotheses, likelihood functions from different experiments just stack.  You literally just fucking multiply them together to get the combined update.  It becomes enormously easier to accumulate evidence across multiple experiments -

- although, yes, anybody who tries this on Earth will no doubt find that their likelihood function ends up zero-everywhere, because different experiments were done under different conditions, as is itself a vastly important fact that needs to be explicitly accounted-for, and which the "rejection of the null hypothesis" and "meta-analysis" paradigms are overwhelmingly failing in practice to turn up before it's too late.
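(A sketch of that stacking, with made-up flip counts and assuming both experiments really are measuring the same coin-propensity θ: the combined likelihood function is just the pointwise product of the two.)

```python
# Grid over the coin's heads-propensity theta, in steps of 0.001.
thetas = [i / 1000 for i in range(1, 1000)]

def coin_likelihood(heads, tails):
    # Likelihood of the exact flips observed, whatever the stopping rule was.
    return [t ** heads * (1 - t) ** tails for t in thetas]

L1 = coin_likelihood(heads=5, tails=1)        # experiment one: HHHHHT
L2 = coin_likelihood(heads=12, tails=8)       # experiment two, same coin, more flips

combined = [a * b for a, b in zip(L1, L2)]    # the evidence just stacks

def argmax_theta(L):
    return thetas[max(range(len(L)), key=L.__getitem__)]

print(argmax_theta(L1), argmax_theta(L2), argmax_theta(combined))
# ~0.833, ~0.600, ~0.654: the combined maximum-likelihood propensity is 17/26.
```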

'Null hypothesis significance testing' rejects the notion of a way reality can be that produces your data.  It just says reality isn't like the null hypothesis, now you win, you can publish a paper.  That doesn't exactly prompt people to notice if reality is being unlike the null hypothesis in incompatible ways on different occasions.

The 'p-value' paradigm's dependence on the experimenter's private state of mind means that people can't just go out and gather more data, when it turns out they didn't get enough data, because the fact that the experimenter has privately internally chosen how much data to gather breaks the p-value paradigm -

Permalink

Wait, is she saying that people should just be allowed to gather more data any time they feel like it?

Permalink

YES!  It's just DATA!  It's not going to HURT YOU if you don't MISTREAT it like Earth scientists do!

If the experimentalist's intentions are not telekinetically affecting the data, then your statistical method shouldn't care why they decided to gather the data, the data is telling you about the world and not about the experimenter!  People can gather data for any reasons they feel like!  Or no reasons at all!  Stop telling experimenters how to feel about their data!

Permalink

Okay, but then, what stops somebody from just continuing to flip a fair coin until there's a bunch more heads than tails in the results, and then stopping and reporting that the coin is biased towards heads?

Permalink

He is welcome to set up any computer program he likes, in which some coins are biased, some coins are fair, as selected at random according to a known distribution; and an analyst, who knows this prior distribution, is updating her credences correctly using likelihoods; and an experimenter, who does not know the coin's true bias but can observe the flips, can decide as he pleases when to stop gathering data.

The experimenter will not be able to make the analyst arrive at an ill-calibrated probability distribution.

Now, maybe if the analyst had bad priors, they'd come to bad conclusions.  Which is why experimental reports shouldn't report on posteriors like that.  But the likelihood functions don't introduce any problems for the analyst, even if the experimenter is deciding how much data to gather based on previous observations; because the experimenter's private decision process does not then affect the likelihood of the additionally gathered data given the true state of the world.
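(A sketch of that computer program, with made-up numbers: half the coins have a heads-propensity of 0.8 and half are fair, the experimenter adversarially stops whenever heads pull ahead by three, and the analyst updates on the exact flips by likelihoods alone. Her stated probabilities stay calibrated regardless of the stopping rule:)

```python
import random
from collections import defaultdict

PRIOR_BIASED = 0.5                # known prior: half the coins have propensity 0.8
P_BIASED, P_FAIR = 0.8, 0.5

def run_trial(rng, max_flips=40):
    biased = rng.random() < PRIOR_BIASED
    p = P_BIASED if biased else P_FAIR
    heads = tails = 0
    # Adversarial experimenter: flip until heads lead by three, or give up.
    while heads - tails < 3 and heads + tails < max_flips:
        if rng.random() < p:
            heads += 1
        else:
            tails += 1
    # Analyst: posterior from the likelihood of the exact flips; the stopping
    # rule never enters the calculation.
    like_biased = P_BIASED ** heads * (1 - P_BIASED) ** tails
    like_fair = P_FAIR ** heads * (1 - P_FAIR) ** tails
    posterior = PRIOR_BIASED * like_biased / (
        PRIOR_BIASED * like_biased + (1 - PRIOR_BIASED) * like_fair)
    return posterior, biased

rng = random.Random(0)
buckets = defaultdict(lambda: [0, 0])   # stated probability -> [trials, actually biased]
for _ in range(100_000):
    posterior, biased = run_trial(rng)
    b = round(posterior, 1)
    buckets[b][0] += 1
    buckets[b][1] += biased

for b in sorted(buckets):
    n, k = buckets[b]
    print(f"analyst said ~{b:.1f}: biased {k / n:.2f} of the time (n={n})")
# Calibration holds: whatever the experimenter's stopping shenanigans, coins the
# analyst called ~70% likely to be biased come up biased about 70% of the time.
```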
