In this note we’ll capture reflections about Jun Otsuka’s *Thinking About Statistics*.

## Statistical Models and the Problem of Induction

The book begins by framing the different epistemological projects of both #Bayesian and #Frequentist patterns of inference as approaches to solving the problem of induction expressed by David Hume. #book #philosophy #statistics

This is a nice lens on the development of statistics and the applied work of statistical modelling. The two probabilistic frameworks are initially contrasted with each other, and the distinction is drawn in terms of the epistemological process each involves. First, the notion of Bayesian conditionalisation, which incorporates new data to derive new beliefs coherent with the observed facts, is spelled out. Then we see how the frequentist approach can be considered as a species of reliabilist epistemology, where the focus is on the error control of well-defined processes. In both cases the problem of induction is located as one of inference, i.e. if we have an appropriate set of i.i.d. sample data we can warrant the inference that the future will look like the past.

He draws out the ontological commitments to probabilistic kinds that appear required to underwrite statistical inference in both paradigms. He supplies a justification for these commitments as being instances of **real patterns** in the sense of Daniel Dennett’s phrasing. This is a kind of indispensability argument for the deployment of probabilistic kinds in our best science.

## The Uniformity of Nature

If probabilistic kinds inhabit the world they can exhibit different characteristics - varying fauna and flora of the natural world. One process might be well described by a Gaussian distribution, another by a Laplace distribution… this diversity is all well and good, but to go beyond descriptive statistics we need to posit more. We need the assumption that there is some stability to the processes we seek to characterise. A uniformity of nature that underwrites statistical inference and probabilistic prediction models.

Otsuka suggests that this commitment is cashed out in contemporary statistics with the famous i.i.d. assumption. This posit holds that, for sound inference, we must assume that every observation in our sample is drawn from the same probability distribution and that each draw is *independent* of the others: independent and identically distributed.

“The IID condition is a mathematical specification of what Hume called the uniformity of nature. To say that nature is uniform means that whatever circumstances holds for the observed, the same circumstances will continue to hold for the unobserved. This is what Hume required for the possibility of inductive reasoning …” pg 25/26

This is obviously a constraint on sound inference, but also implicitly an ontological assertion regarding distributional drift. This commitment is deeper than when we argue for a particular distributional characterisation of our process of interest. Our target process could be articulated as a mixture distribution or some more complicated beast, but in each case we require that (a) it is well described by **some** probability model and (b) the world is set up in such a way that more observations enable us to learn which particular statistical model (parametric or non-parametric) fits the data.
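To make the stakes of the uniformity assumption concrete, here is a minimal sketch of what goes wrong under distributional drift. The function names, the Gaussian process and the drift rate are all illustrative assumptions of ours, not anything taken from the book. We compare how well the past predicts the future for a stable process and for one whose mean slowly moves (its draws are still independent, but no longer identically distributed).

```python
import random
from statistics import fmean


def simulate_process(n, drift_per_step, seed=3):
    """Gaussian draws with unit variance; the mean moves by drift_per_step each draw."""
    rng = random.Random(seed)
    return [rng.gauss(drift_per_step * t, 1.0) for t in range(n)]


def past_future_gap(series):
    """Absolute gap between the mean of the first half and the mean of the second half."""
    half = len(series) // 2
    return abs(fmean(series[:half]) - fmean(series[half:]))


# Hypothetical settings: 2000 draws, with and without a small drift in the mean.
iid_gap = past_future_gap(simulate_process(2000, drift_per_step=0.0))
drift_gap = past_future_gap(simulate_process(2000, drift_per_step=0.01))

print(f"i.i.d. process, past vs future gap:   {iid_gap:.3f}")
print(f"drifting process, past vs future gap: {drift_gap:.3f}")
```

In the stable case the first half of the data is a good guide to the second half; in the drifting case no amount of extra past data repairs the gap.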

## Approaches to Learning

With this background Otsuka goes on to describe the manner in which the Bayesian and frequentist schools approach the task of learning from data.

### Bayesian Machinery

In the Bayesian setting the focus is on conditionalisation as a logic of inductive reasoning, which expands on logical inference. These are frameworks for organising and interrogating our system of beliefs and their relationship to the achievement of knowledge. Otsuka argues that Bayesian inference plays a crucial role in justification. Moving from prior to posterior distribution is seen as a change in the weighting of our beliefs: an internalist species of the justification-relation between beliefs in light of data.

But founding a story about epistemological justification on Bayesian inference involves a defense of priors and likelihood specifications. Priors are defended using the usual moves: (1) appeal to wash-out theorems for iterative updating on sufficiently large data, i.e. convergence theorems assuring that Bayesian updated belief is truth conducive in the limit regardless of apparently subjective prior specification; (2) non-informative priors; and (3) objective priors or empirical Bayes.

We won’t spend much time on (2) because it’s just kind of silly to justify beliefs and their manipulation by constant appeals to ignorance. On (3) there is a more interesting discussion regarding the relationship between degrees of belief and chance. We want our priors to reflect our background knowledge and our beliefs to update in a way that tracks the actual occurrence of the events in question. David Lewis enshrined this requirement as the *Principal Principle*. But while this is an agreeable sounding tenet it cannot serve as a foundational justification for our priors within an **internalist** picture of justification without risk of infinite regress. This is problematic for the philosopher who seeks to establish Bayesian inference as the sole source of belief generation, but seems less serious if you can tolerate primitive or foundational epistemological commitments outside those justified with inductive inference in the Bayesian loop. Justification must end somewhere (the spade eventually turns), and in practice arguments and evidential exchanges rarely get anywhere close to an infinite series.

Additionally the Bayesian needs to defend the incorporation of different likelihood choices, their shape and their implications. Fortunately this can be more pragmatic in so far as likelihood specifications are in effect testable hypotheses about the data generating process. They can be justified by the success of the modelling endeavour and our ability to recover data akin to our observations. The data is a fixed quantity; we use it to update our probabilistic beliefs and commitments. This is to the good because it can be shown (via Dutch book style theorems) that in strategic decision making, strategies based on beliefs that adhere to the probability calculus will strictly dominate other strategies.

The Bayesian machinery is a set of tools for arranging coherence amongst our beliefs and commitments, and for tracing out their implications. We move dynamically between prior and posterior by means of the likelihood term. This process cannot serve as an ultimate court of appeal for the basic beliefs that kick off the learning process itself. It is an abstract, highly flexible set of tools applicable to a wide range of questions. It provides a very general model of learning where the concern is justification of our beliefs in the context of what we know.
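The move from prior to posterior, and the wash-out of prior choices, can be sketched with the simplest conjugate case. This is a minimal illustration under our own assumptions: a Bernoulli process with a hypothetical true rate of 0.7, and two deliberately opposed Beta priors labelled `sceptic` and `optimist`. None of these names or numbers come from the book.

```python
import random


def update_beta(alpha, beta, observations):
    """Conjugate Beta-Bernoulli update: successes add to alpha, failures to beta."""
    successes = sum(observations)
    failures = len(observations) - successes
    return alpha + successes, beta + failures


def posterior_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


rng = random.Random(0)
true_rate = 0.7  # hypothetical data generating process
data = [1 if rng.random() < true_rate else 0 for _ in range(5000)]

# Two opposed priors (illustrative assumptions).
sceptic = (2.0, 18.0)   # prior mean 0.1
optimist = (18.0, 2.0)  # prior mean 0.9

for n in (0, 10, 100, 5000):
    s_mean = posterior_mean(*update_beta(*sceptic, data[:n]))
    o_mean = posterior_mean(*update_beta(*optimist, data[:n]))
    print(f"n={n:5d}  sceptic={s_mean:.3f}  optimist={o_mean:.3f}")
```

The two posteriors start 0.8 apart and end up practically indistinguishable. Note what the sketch does not do: it says nothing about where the Beta family or the Bernoulli likelihood came from, which is exactly the part the machinery cannot justify for itself.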

### Frequentist Consolidation

So far so uncontroversial. The Bayesian perspective is a natural fit for a species of internalist epistemology. The framework is abstract and characterised concisely, so relatively straightforward to incorporate in a general philosophical picture. Otsuka’s synthesis of the “classical” inferential picture is in this way more impressive.

The classical frequentist picture has a history of poor pedagogy and can seem disparate and ad-hoc. Jaynes is famously dismissive of the absurdities engendered by the pick-and-mix approach to statistical inference adhered to in the “classical” approach.

The frequentist view ties statements of probability to measures of relative frequency within a collection of observations. This makes it impossible to articulate probability statements for specific hypotheses. The epistemological perspective is quite distinct from the Bayesian view of updating individual hypotheses. Instead the focus must lie on testing statements about stochastic processes - processes which are inherently repeatable.

As such these processes can be described by probabilistic kinds. The question then becomes: how can the properties of these observable processes feed into knowledge gathering routines? How can we go from a statement about an observed frequency to claims of knowledge or belief regarding the data generating process?

The route is to go via the framework of statistical testing, which has some relationship to Karl Popper’s *falsification*. This involves positing a statistical hypothesis with direct implications. These implications can be parsed as an explicit prediction that can be compared to future observations. In this way, the hypothesis is tested against the data. This gives us a means of arguing *reductio ad unlikely* against the initial hypothesis and turns statistical inference into a “process of systematically plowing out falsehood”. By iteratively testing and refining more targeted hypotheses, we define and reject null hypotheses as we go.

The constraints on the test design are built to ensure reliability over the course of repeated testing under a known null hypothesis. Sample size considerations, alpha-spending and statistical power are properties of the test. These properties need to be chosen in such a way as to ensure the reliability of a particular test, but there are also constraints for running repeated tests of multiple hypotheses. The entire testing enterprise is set up to minimise and control errors. The epistemological picture is one of reliabilism. Whether a belief is justified is determined by the nature and **defensibility** of the process that generated the claim. This approach allows us to reject the Gettier style counter examples to justified true belief analyses of knowledge. If the procedures of knowledge acquisition are not themselves well founded then the coincidence between claim and fact in the Gettier cases cannot be counted as justified. The procedures of knowledge acquisition are fixed; uncertainty stems from how we learn the shape of the data probabilistically.
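The sense in which error control is a property of the procedure, rather than of any single result, can be seen in a small simulation. The sketch below is ours and rests on assumed settings: a two-sided z-test with known unit variance, 30 observations per experiment, `alpha = 0.05`, and a hypothetical alternative mean of 0.5. We repeat the experiment many times and count how often the test rejects.

```python
import random
from statistics import NormalDist, fmean


def z_test_rejects(sample, mu0=0.0, sigma=1.0, alpha=0.05):
    """Two-sided z-test for the mean with known sigma; True if the null is rejected."""
    n = len(sample)
    z = (fmean(sample) - mu0) / (sigma / n ** 0.5)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha


def rejection_rate(true_mu, n=30, trials=4000, seed=1):
    """Long-run share of experiments in which the null (mu = 0) is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mu, 1.0) for _ in range(n)]
        rejections += z_test_rejects(sample)
    return rejections / trials


type_one_error = rejection_rate(true_mu=0.0)  # null true: should sit near alpha
power = rejection_rate(true_mu=0.5)           # null false: how often we notice

print(f"rejection rate when the null is true:  {type_one_error:.3f}")
print(f"rejection rate when the null is false: {power:.3f}")
```

Any individual run may mislead us, but the two long-run rates are what the design (sample size, alpha, test statistic) is chosen to guarantee.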

For this method to work, the kind of reliabilism the methodology seeks should be clarified. Otsuka suggests that the reliability implicit in statistical testing needs to underwrite truth-tracking counterfactuals. In particular the claims:

- If P were not true, S would not believe P.
- If P were true, S would believe P.

Which is not to say that any particular statistical test will lead to the endorsement/rejection of the hypothesis in question. Statistical tests cannot directly accept the null - just fail to reject. But cumulatively over many tests a reliabilist epistemology should ultimately ascertain the falsity of the null when the null is false. This points to a tension in the focus on individual tests and specific p-values. Statistical testing as an epistemological enterprise is a wholesale endeavour, and whether the conditions which underwrite the reliability of the process hold is not easily discerned from mere success in our world. The procedures must be valid and truth tracking in “nearby” counterfactual worlds where the null hypothesis is not how it is in our world. The conditions under which a process is ultimately truth-tracking may depend on contextual factors which cannot be easily turned into an algorithm. This perspective nicely unites all the ad-hoc approaches to defining tests with asymptotic error control properties. The concern is inherently procedural, and procedures can (and perhaps should) vary with the context of the learning. But their adoption is founded on the commitment that they would be reliably truth-tracking in all worlds similar enough to our own.

## When Philosophies Conflict

Otsuka’s survey of the two approaches is detailed and comprehensive. It sketches the motivations of each position well, and you might hope to reconcile the two by applying each in its own domain where appropriate, but unfortunately the motivating instincts can clash irreconcilably.

The Bayesian perspective necessitates the adoption of the **Likelihood Principle** which can be violated by the classical procedure of sequential testing. Recall how for the Bayesian the data is a fixed quantity, fed forward into the likelihood term to update our beliefs. This works the same whether we update our beliefs at time t1, t2 or tN. All the information is in the likelihood irrespective of how much data has accrued over time.

Under the frequentist model, by contrast, if the data is analysed under different experimental designs (e.g. one with a stopping rule and one without) the results of a test for a particular null hypothesis can differ, i.e. with the same data and the same null hypothesis, one experiment can reject the null and the other will fail to reject it. This is because the result incorporates information about the design of the experiment that cannot be captured in the simple likelihood. Otsuka makes the point that this is a broad problem for all **reliabilist** philosophies where the fit between process/test and reality is contestable. There is no perfectly general procedure that returns true results in all circumstances. As such the frequentist methodologist must argue anew in each circumstance for the appropriateness of their methods.
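The clash can be made concrete with a standard textbook coin-flipping illustration (not necessarily the example Otsuka uses). Suppose we observe 9 heads and 3 tails and test the null of a fair coin against a bias towards heads. In one design we fixed 12 flips in advance; in the other we flipped until the third tail. The function names below are our own; the sketch simply computes the one-sided p-value under each design and checks that the two likelihoods differ only by a constant.

```python
from math import comb


def p_value_fixed_n(heads, n, p0=0.5):
    """P(at least `heads` heads in n flips) when n was fixed in advance."""
    return sum(comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(heads, n + 1))


def p_value_stop_at_tails(heads, tails, p0=0.5):
    """P(at least `heads` heads before the `tails`-th tail) when flipping until that tail.

    Equivalent to seeing at most tails-1 tails in the first heads+tails-1 flips.
    """
    m = heads + tails - 1
    return sum(comb(m, j) * (1 - p0) ** j * p0 ** (m - j) for j in range(tails))


def likelihood_fixed_n(p, heads, tails):
    """Binomial likelihood of the observed counts."""
    return comb(heads + tails, heads) * p**heads * (1 - p) ** tails


def likelihood_stop_at_tails(p, heads, tails):
    """Negative binomial likelihood of the observed counts."""
    return comb(heads + tails - 1, heads) * p**heads * (1 - p) ** tails


heads, tails = 9, 3
p_fixed = p_value_fixed_n(heads, heads + tails)  # 299/4096, about 0.073
p_stop = p_value_stop_at_tails(heads, tails)     # 67/2048, about 0.033

print(f"fixed n = 12:       p = {p_fixed:.4f}")
print(f"stop at 3rd tail:   p = {p_stop:.4f}")
for p in (0.3, 0.5, 0.8):
    ratio = likelihood_fixed_n(p, heads, tails) / likelihood_stop_at_tails(p, heads, tails)
    print(f"likelihood ratio at p={p}: {ratio:.1f}")
```

At the conventional 0.05 threshold the same twelve flips reject the null under one design and fail to reject under the other, while the likelihoods are proportional for every value of `p`, so a Bayesian posterior is untouched by the stopping rule.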

This is a keen source of divisiveness between the two schools. Scientists historically cannot be trusted to review their methods with respect to their goals. Instead frequentist statistical methodology has been abused in practice. Adherents use the routine nature of procedure as a rubber stamp without due consideration for the reliability of the methods in the context of the question at hand. This pattern of abuse is rightly seen as a key contributor to the replication crisis in science. Ironically it is this aspect of frequentist methodology that introduces obscure subjective bias into the experimental exercise. The Bayesian (often accused of criminal subjectivity) wears their priors clearly and defends their appropriateness for each analysis.

Neither school of thought is a self-contained epistemology. Neither method is self-sustaining. Both priors and procedures must be justified with an appeal to their apt fit for the problem at hand. Statistical methodology slots into a broader epistemological endeavour and there is no substitute for the careful and knowing application of inferential machinery if we hope to justify the achievements of science.

## Bias and Regularisation

In later chapters we move towards the more pragmatist position of model selection based on predictive power. This mirrors a move away from understanding uncertainty as due to natural variation and towards model uncertainty in the analysis.

We may prefer a model which does not capture the true data generating process just so long as it performs better in prediction tasks. This effectively invokes the more typical approaches to machine learning model evaluation as driving us toward pragmatic biases that work well to optimise for some measure of out-of-sample prediction. Your inferential framework can be independent of your model selection criteria, but model ranking is still conducted under uncertainty. Each measure of performance is still an estimate.

Otsuka contrasts the ranking of models based on information criteria with the out-of-sample predictive performance of deep learning systems. The AIC style methodologies penalise models with too many parameters in order to optimise for predictive performance. Deep learning methods have explicit techniques to induce regularisation-like effects, such as drop-out and modified loss functions. What is striking is that while both methods are checks against overfitting of models to the data, both are pragmatic compromises that dispense with the idea of epistemological truth-tracking. They concede that it is often better to overcome the problem of model uncertainty with suitably practical abstraction. Don’t worry about the world-model fit so long as the outcomes of the model are workable.
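For readers who have not met it, the AIC trade-off is easy to write down: twice the number of fitted parameters minus twice the maximised log-likelihood, with lower being better. The sketch below is a minimal illustration under our own assumptions: Gaussian noise, simulated data from a hypothetical linear process, and two candidate models (a constant mean and a straight line) fitted by least squares. The helper names are ours.

```python
import math
import random


def gaussian_aic(residuals, n_mean_params):
    """AIC = 2k - 2 log L for a Gaussian model with the noise variance set to its MLE."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n
    log_lik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    k = n_mean_params + 1  # mean parameters plus the noise variance
    return 2 * k - 2 * log_lik


def residuals_constant(xs, ys):
    """Residuals from the constant-mean model."""
    mean = sum(ys) / len(ys)
    return [y - mean for y in ys]


def residuals_line(xs, ys):
    """Residuals from a least-squares straight line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Hypothetical data generating process: y = 1 + 2x + unit Gaussian noise.
rng = random.Random(4)
xs = [rng.random() for _ in range(200)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]

aic_constant = gaussian_aic(residuals_constant(xs, ys), n_mean_params=1)
aic_line = gaussian_aic(residuals_line(xs, ys), n_mean_params=2)

print(f"AIC constant model: {aic_constant:.1f}")
print(f"AIC linear model:   {aic_line:.1f}")
```

Each extra parameter costs exactly 2 points of AIC, so a richer model only wins the ranking if its improvement in fit more than pays for the penalty. Note that both scores are themselves estimates computed from a finite sample.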

But this comes with a burden of explainability often demanded of opaque models. Predictive success in one domain does not always translate to success in another. Regulators and stakeholders need to understand when and why the predictive reliability of deep learning systems can be expected to transfer well across tasks. Appeals to the epistemological surety of beliefs derived from predictions in these black-box systems stem from a loosely held belief in the virtues of the deep learning mechanisms. These virtues are less well established than the error-control rates of a statistical test and they are somewhat transient across tasks.

## Causal Inference and Statistics

Finally Otsuka pauses to consider the role of causal inference in contemporary statistics and how optimising for predictive power will often fail when the task requires subtlety or insight into the data generating process. The manner in which models and measurements can be confounded by aspects of the data generating process cannot be detected automatically without knowledge of the relationships in the system. The focus here is on how tools for **identification** are required when we want to be sure that the causal quantities of interest are soundly derivable from the data we have to hand. Whether we use the potential outcomes framework, the do-calculus or structural estimation, the work of statistical inference can only get off the ground when we have enough of a world-picture - a set of commitments or assumptions that allow our inferences to have warrant. Our priors are informed, our inferential procedures suitably well defined to sustain truth-tracking counterfactuals.
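A toy simulation shows why identification needs a world-picture. Everything in the sketch below is an assumption of ours: a binary confounder `z` that makes treatment `x` more likely and also raises the outcome `y`, with a hypothetical true treatment effect of 1.0. The naive comparison of treated and untreated groups is badly off; stratifying on `z` (a simple backdoor adjustment) recovers the effect, but only because we knew that `z` was the variable to adjust for.

```python
import random
from statistics import fmean


def simulate(n=20000, effect=1.0, seed=2):
    """Rows of (z, x, y) where z confounds the relationship between x and y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.random() < 0.5                      # confounder
        x = rng.random() < (0.8 if z else 0.2)      # treatment is likelier when z holds
        y = effect * x + 3.0 * z + rng.gauss(0, 1)  # outcome depends on both
        rows.append((z, x, y))
    return rows


def diff_in_means(rows):
    """Mean outcome of the treated minus mean outcome of the untreated."""
    treated = [y for _, x, y in rows if x]
    control = [y for _, x, y in rows if not x]
    return fmean(treated) - fmean(control)


def backdoor_adjusted(rows):
    """Average the within-stratum differences, weighted by the share of each z stratum."""
    total = len(rows)
    estimate = 0.0
    for z_value in (False, True):
        stratum = [row for row in rows if row[0] == z_value]
        estimate += len(stratum) / total * diff_in_means(stratum)
    return estimate


rows = simulate()
naive = diff_in_means(rows)
adjusted = backdoor_adjusted(rows)

print(f"naive estimate:    {naive:.2f}")
print(f"adjusted estimate: {adjusted:.2f}  (true effect in this simulation: 1.0)")
```

Nothing in the data alone tells us that the second estimate is the right one; that warrant comes from the assumed causal structure.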

## Conclusion

The broad picture painted in Otsuka’s survey of statistics is of the central role of inferential procedures in our broader epistemological landscape, the diversity of roles they can play and the different standards of rigour at play in different contexts. The links to the foundational questions and puzzles of epistemology illuminate the structure of the debates in statistics.