consistency – Spencer Greenberg

11 Types of Thinkers and Intellectuals (a little framework)

admin — Sun, 23 Aug 2020 11:39:00 +0000

1. Ideators: generate novel ideas

Ex: Einstein

Strengths: creativity, insight

2. Investigators: vigorously investigate a topic in order to understand it

Ex: Curie

Strengths: truth-seeking, curiosity, systematicness, persistence

3. Provers: demonstrate that the ideas of others are sound, explore their limits, strengthen or work out the implications of existing theories

Ex: Singer

Strengths: consistency, logic, rigor, bullet-biting

4. Appliers: explore new, useful applications for existing ideas, or combine ideas to make something valuable

Ex: Ford

Strengths: pragmatism, knowledge, goal orientation

5. Doers: do things in the world, reflect on what worked and what didn’t, draw insights from and generalize this experience so others can learn from it

Ex: Graham

Strengths: experience, reflection, generalization, synthesizing

6. Critics: dissect ideas to find the flaws in them

Ex: Wollstonecraft

Strengths: questioning, challenging, dissection, disagreeing

7. Enhancers: refine, hone or clarify existing ideas to make them better

Ex: Bostrom

Strengths: clarifying, honing, making rigorous

8. Popularizers: figure out how to explain important, complex ideas in simple ways; spread them to the public

Ex: Sagan

Strengths: simplification, explanation, inspiration, metaphor

9. Activators: use ideas to change norms or improve society

Ex: King

Strengths: inspiration, eloquence, courage, altruism

10. Storytellers: tell compelling stories to convey information and ideas and to capture the narrative behind ideas

Ex: Gladwell

Strengths: narrative, journalism

11. Cataloguers: collect, categorize and organize information

Ex: Dewey

Strengths: comprehensiveness, organization, categorization

This piece was first written on August 23, 2020, and first appeared on this site on July 8, 2022.

Do The Findings Of A Study Conducted In One Place Generalize To Other Places?

Spencer — Thu, 19 Jul 2018 23:42:00 +0000

Do the results of studies generalize to new situations?

For instance, suppose a study is conducted on a technique or intervention (e.g., providing health education to parents) and the study finds it to be effective for a particular outcome (e.g., improving the health of children). When the next study is conducted on (what appears to be) the same intervention and outcome, should we expect that study to ALSO find the intervention to be effective?

There are a lot of reasons why it may NOT:

(1) sampling error – studies are almost always done on only a subset of the population of interest, which creates random variation from study to study. For instance, it could have just been a fluke that this particular group of children happened to have their health improve more than the control group of children. So the two results might differ purely by chance. Fortunately, statistical methods allow us to analyze how likely this is as an explanation, and the larger the sample that we test the technique on, the lower the chance of this happening.

(2) questionable research practices – the first study’s apparent positive result may have been a false positive produced largely by bad research practices, such as analyzing many different outcomes and reporting only the one that seemed to work. If the study and its analysis plan are publicly pre-registered before the study is conducted, then this problem can be mostly avoided. In that case, researchers won’t be able to fool themselves about their original intention, and others can also hold them accountable to their original plan. On the other hand, it can be very useful to explore questions about data that you had not thought to explore before you looked at the data, or to engage in open-ended analyses where you aren’t certain what your hypothesis should be. But doing so comes with a greater risk of false positives.

(3) quality – the quality of the intervention and competence of the implementation may vary. For instance, if the first educational intervention studied was well designed, but the second one taught the information poorly or taught information that wasn’t actually useful, then it would hardly be surprising if the first succeeded and the second failed. More generally, a sufficiently bad implementation of any intervention will always fail! So an intervention of a certain type failing is NOT good evidence for the type of intervention not working more generally, unless a low-quality implementation is unlikely (or future implementors are unlikely to be able to create a more competent implementation). If the intervention went through multiple rounds of iterative improvement and user feedback before the study was conducted, this will help rule out (but not eliminate) low quality as the explanation for a failure.

(4) technique – the two interventions could have applied different techniques in order to produce change, despite very similar-sounding names and equal levels of quality. For instance, the first intervention might have focused more on the health benefits of purifying water, while the second might have focused more on the value of hand washing. These are both “water-related health education interventions for parents”, which sound extremely similar, but one may be much more effective than the other.

(5) dosage – the two interventions could have been administered in different amounts. For instance, the first might have been a 1-week course with 6 hours of education per day, and the second a 3-day course with 2 hours of education per day. Since the former in that case has a much higher dose, it would be little surprise if it had the potential to be much more effective (though it also might be substantially more costly to implement).

(6) format – the interventions could have used the same technique but delivered it using a different mechanism. For instance, the interventions might have been teaching the same information for the same duration, but the first might have been an in-person training in groups, whereas the second could have been an online training done individually.

(7) follow-up – the two studies could differ in how long they waited to collect the outcomes. For instance, the first might have looked at health outcomes 6 months later, whereas the second might have looked at health outcomes 3 weeks later. It’s possible that 3 weeks is too soon to find effects, or that the effects only last for a few months, so 6 months is too late to find them!

(8) measurement – different methods of measuring the outcomes could have been used. For instance, the first study might have measured health by looking at children’s medical usage records at nearby clinics, whereas the second might have asked parents to self-report the health of their children. These measurements may produce results that contradict each other even if every other detail of the studies is identical, simply because they are measuring different (though related) things.

(9) pre-existing attributes – the two populations studied might have a different distribution of relevant characteristics. For instance, in the first study, the children might have had a higher level of pre-existing illness than in the second study, making it easier to improve health outcomes. If the children are already really healthy, it’s going to be tough to produce a large effect on health. But on the flip side, if the children are already extremely ill, in some cases, it may be hard to help them recover compared to helping less sick people.

(10) culture – the two populations might differ culturally or in their personalities, language, attitudes, or beliefs. For instance, in the first study, the local culture may have made people more interested in the learning material, whereas in the second study, the local culture may have made people less trusting of the information because it contradicts traditionally held beliefs.

(11) environment – the surrounding environment that the populations exist in, or the resources already available, might have differed in the two studies. For instance, the first study group might live in an area where health resources are scarce, whereas the second group might live in one where alternative health resources are more abundant, making it harder to add value for the latter population. Or, for instance, in one area, certain easily treated childhood diseases might be common, whereas in the other region, the common childhood diseases might be harder to treat.

(12) control group – oftentimes, studies use control groups to help determine whether an intervention caused a change beyond what would have otherwise happened, but the choice of control can differ across studies. For instance, the first study might have used a wait-list control where people get nothing for a while, whereas the second group might have received an educational intervention as a control that is not expected to produce health improvements (i.e., an “active control”). Or it could be that the first study used statistical control after the fact (e.g., using linear regression), whereas the second study randomized some of the study participants to a control group. The former method is typically weaker in that it may not fully account for confounding variables, but in some instances, the latter method may be weaker if the size of the control group is very small.
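Reasons (1) and (2) above can be put into rough numbers. The sketch below is only an illustration and does not come from any particular study. The function names are hypothetical. The sampling-error part assumes two independent, equal-sized groups with a common standard deviation, and the multiple-outcomes part assumes the outcomes are statistically independent and each tested at a 5% significance level.

```python
import math


def standard_error_of_difference(sd: float, n_per_group: int) -> float:
    """Standard error of (treatment mean - control mean).

    Assumes two independent groups of equal size with a common standard deviation.
    """
    return sd * math.sqrt(2.0 / n_per_group)


def chance_of_any_false_positive(num_outcomes: int, alpha: float = 0.05) -> float:
    """Chance that at least one of several truly null outcomes looks 'significant'.

    Assumes the outcomes are independent and each is tested at level alpha.
    """
    return 1.0 - (1.0 - alpha) ** num_outcomes


if __name__ == "__main__":
    # Chance variation between studies shrinks as the groups get larger.
    for n in (25, 100, 400):
        print(n, round(standard_error_of_difference(1.0, n), 3))
    # Analyzing more outcomes raises the odds that one "works" by luck alone.
    for k in (1, 5, 20):
        print(k, round(chance_of_any_false_positive(k), 3))
```

Under these assumptions, quadrupling the group size halves the standard error, and analyzing 20 independent outcomes gives roughly a 64% chance that at least one looks like a positive result purely by chance.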

— Can we standardize effect sizes to make results more compatible across studies? —

How do we fix the problems above?

Well, one approach that is intended to help is to try to make interventions and outcomes more comparable by using statistical tricks for “standardizing” outcomes, for instance, by converting the size of the effects to “Cohen’s d” scores.

Example: calculate how much people getting the intervention changed in the outcome of interest, subtract away the change in the control group, and then divide by the pooled standard deviation (from both groups). Use this as the measure of how well each intervention worked.

So rather than having one intervention’s effect measured in one unit (like fraction of people cured) and another measured in another unit (like reduction in symptoms on a 1-7 symptom scale), making these outcomes hard to compare, all the effect sizes now are measured as “number of standard deviations that the intervention group changed relative to the change in the control group.”
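The calculation described above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the function name and the data are made up, and it assumes that the pooled standard deviation is taken over the change scores of the two groups (the text above does not pin down that detail).

```python
import math
from statistics import mean, variance


def cohens_d(treatment_changes, control_changes):
    """Difference in mean change between groups, in pooled standard deviations."""
    n_t, n_c = len(treatment_changes), len(control_changes)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = (
        (n_t - 1) * variance(treatment_changes)
        + (n_c - 1) * variance(control_changes)
    ) / (n_t + n_c - 2)
    return (mean(treatment_changes) - mean(control_changes)) / math.sqrt(pooled_var)


# Invented change scores (after minus before) for each participant.
treatment = [2.0, 4.0, 6.0]
control = [1.0, 3.0, 5.0]
d = cohens_d(treatment, control)
```

With these invented numbers, the intervention group improved by one unit more than the control group on average, the pooled standard deviation is two units, and so `d` is 0.5.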

This certainly improves the situation of comparing two interventions, but only *somewhat*. It doesn’t address most of the issues mentioned in the list above, and those it does help are only partially addressed by such standardization.

Two Cohen’s d values that are the same can mean quite different things with respect to how “effective” an intervention really is. For instance, suppose that the outcome of one intervention was recorded as 0 for each participant not completely cured, and 1 for each participant completely cured, whereas the outcome for the second intervention was recorded as 0 when there was no improvement, and 1 when there was any improvement. Even if the interventions and populations are identical and the Cohen’s d for both is 0.50, this does not imply that the interventions are anywhere close to as useful as each other!
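To make that concrete, here is a small sketch with invented proportions. The function name is hypothetical, and it assumes equal group sizes and the population variance p(1 - p) for a 0/1 outcome. The point is that the calculation only ever sees the share of 1s in each group, never what a “1” stands for.

```python
import math


def cohens_d_binary(p_treatment: float, p_control: float) -> float:
    """Cohen's d for a 0/1 outcome, from the share of 1s in each group.

    Assumes equal group sizes and uses the population variance p * (1 - p).
    """
    pooled_var = (p_treatment * (1 - p_treatment) + p_control * (1 - p_control)) / 2
    return (p_treatment - p_control) / math.sqrt(pooled_var)


# Study A codes 1 = completely cured; Study B codes 1 = any improvement at all.
# The proportions are invented and deliberately identical.
d_cured = cohens_d_binary(p_treatment=0.75, p_control=0.50)
d_any_improvement = cohens_d_binary(p_treatment=0.75, p_control=0.50)
```

The two values are equal by construction, yet one describes 25 additional complete cures per 100 people and the other describes 25 additional people showing any improvement at all.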

That being said, standardization is useful in that it at least somewhat reduces the problem of incomparability of outcomes.

— effect size doesn’t exist —

A potentially deeper and more disturbing point (one seemingly often overlooked) is that the “effect size” (i.e., strength of effect) of an intervention doesn’t really exist.

What does exist is the effect size of an intervention that’s been implemented in a particular way for a particular duration of implementation, and that’s been calculated using a particular choice of effect size measure, when applied to a particular population that’s being sampled using a particular sampling procedure.

Horrible to state, but its consequences are more horrible still.

That means: the “effect size of parent health education on child health” isn’t really a thing.

But: the “Cohen’s d of this particular 3-month health education intervention applied to randomly sampled families in a particular city using the outcome of medical clinic usage” does exist!

If we talk as though the “effect size of parent health education on child health” exists, we are either being sloppy, or optimistically hoping that all of those other details don’t matter much in terms of the effect size, or implicitly assuming some specific set of study attributes and procedures that we simply aren’t bothering to make explicit.

Now insofar as the “effect size of parent health education on child health” doesn’t depend on the messy details of things like which health education intervention we’re talking about, which population of children we’re talking about, what duration of intervention we’re talking about, and so on, it can be sensible to treat things as though there really is an “effect size” of that intervention. But really, how often do we expect that to be true?

— the average over what? —

This issue of the effect size not existing is actually an example of a more general mathematical problem, which is that if you say “this is the average of variable X”, a mathematician’s natural response might be, “The average of variable X taken over what probability distribution?” By that, they will mean that the average depends on the sampling procedure used. In other words, an average does not exist unless a sampling procedure is stated (or implicitly assumed).

Now, if someone says “the average age of people in this city”, the implied sampling procedure is probably choosing people uniformly at random from this city, getting the age of each such person, and then taking an average of those. But if the way we attempted to collect ages was to stand on a particular street and ask people their age as they passed, this would not be a uniformly random sampling procedure.

Presumably, there is some correlation between someone’s age and the chance that they bother to talk to you on that particular street about their age. This “average age” you calculate would be “the average age of people in this city who will stop to talk to me when I’m standing in this spot over this period” rather than “the average age of people in this city.”

Now, it might be the case that the sampling procedure used turns out to be a pretty good proxy for the average age overall (if one were to sample uniformly at random), but then again, it might not be (e.g., you might dramatically under-sample newborns who rarely approach you on the street).
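The dependence of an average on the sampling procedure can be shown with a toy calculation. Everything in the sketch below is hypothetical: the city, the age groups, and the chances of stopping to answer are invented purely for illustration, and the function name is an assumption.

```python
def weighted_average(values, weights):
    """Average of values under a sampling distribution proportional to weights."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical city: (age, number of residents, chance that a resident of
# that age stops to answer on the street). All numbers are invented.
city = [
    (0, 1000, 0.00),   # newborns never stop to chat
    (20, 3000, 0.10),
    (40, 3000, 0.20),
    (70, 3000, 0.30),
]

ages = [age for age, _, _ in city]
# Uniform sampling: each resident is equally likely to be picked.
uniform_weights = [count for _, count, _ in city]
# Street sampling: residents are picked in proportion to their chance of stopping.
street_weights = [count * p_stop for _, count, p_stop in city]

average_uniform = weighted_average(ages, uniform_weights)
average_street = weighted_average(ages, street_weights)
```

In this made-up city, the uniform average age is 39, while the “street” average comes out near 52, even though both are computed from exactly the same residents.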

— So…can we not generalize the results of studies? —

All of the above may sound very pessimistic. It may sound like doing a study on one intervention tells us almost nothing about future similar studies on similar interventions.

But what I’ll call the “five C’s of generalizability” should make us somewhat more optimistic:

(1) Constants – many psychological and biological features of humans transcend time and place (because they are hard-wired into us via genetics, or because the intervention is only planned to be used on a particular group of people that share a lot of common characteristics). Insofar as we’re working with these more basic mechanisms (or across homogeneous groups), we can expect a higher degree of generalization.

That being said, when we’re used to being embedded in one culture, it can sometimes be surprising what varies across cultures, so we have to be cautious about positing human universals.

(2) Causality – the more deeply we understand the causality underlying why an intervention works, the more we can predict when it will or won’t generalize. For instance, if we know it will only work where assumptions A, B, and C are true, then we can ask about the extent to which those assumptions hold in a particular time and place. So if we are in a situation where our model of the underlying causal factors predicts the intervention will work, we can be more confident that it really will.

That being said, causality can be very tricky to uncover.

(3) Consistency – if studies find consistent effects for an intervention across varied contexts, then we may not know quite how well it will work in a new context, but we at least have evidence that the effect is fairly robust, and may generalize across new contexts as well.

That being said, this only helps if we have found a fairly consistent pattern, and we still might get unlucky and find a new context where the intervention doesn’t actually work.

(4) Capability – if an intervention is found to be highly effective, it is more likely to generalize to other contexts. Even if variations in contexts reduce the effectiveness, hopefully, some of that very large effect will still remain.

That being said, it is certainly possible for something to work extremely well in one situation and fail totally in another.

(5) Circumstances – if the circumstances that an intervention was previously found to work in are very similar to the ones that it is being applied to now (including the sort of people it’s being applied to, the precise intervention being applied, the way outcomes are measured, the surrounding culture, etc.), we can be more confident that it will work about as well as it did before. For instance, if you take an intervention that worked in one school district and apply that exact same intervention in the same way in a similar school district nearby, it is more likely to work as it did previously than if you apply a modified version of the intervention in a different country.

That being said, it is often too limiting to only apply interventions in similar contexts to those in which they were previously tested.

— empirical data —

So how well, empirically speaking, does an intervention study done in one context tend to generalize to another?

Eva Vivalt has done some really interesting work on this topic, with an emphasis on international and economic development (e.g., see this paper: https://bit.ly/2yWQpRG). Also see this chart from the paper (https://i.imgur.com/6SVoVnr.png), which shows the standardized effect sizes for many different intervention/outcome pairs, such as bed nets for malaria and conditional cash transfers for attendance rate. The chart can help you gain an intuitive feeling for the extent to which generalizability does or does not occur for studies that are ostensibly on the same intervention and outcome, and the paper gives a lot more detail.

— What do we do about inconsistent effects? —

As you can see from Eva’s chart, it is pretty frequently the case that one study of an intervention will find a positive effect on an outcome, while another study finds approximately no effect or even a negative effect.

To understand why this happened, we can return again to our list of reasons that studies don’t generalize. We can ask ourselves, was this due to…

(1) sampling error?

(2) questionable research practices?

Or perhaps differences in…

(3) quality of the implementation?

(4) the technique used?

(5) dosage?

(6) format?

(7) follow-up?

(8) measurement?

(9) pre-existing attributes?

(10) culture?

(11) environment?

(12) control group?

We can’t always tell what the causes of different results are, but at least we can TRY to figure it out by considering the different possibilities and attempting to home in on likely candidates.

What’s more, we can apply the “five C’s of generalizability”.

The more that our intervention relies on human *constants*, the more our model of the underlying *causality* predicts we’ll get generalizability in this case, the more we have *consistency* of past effects across varied situations, the more that past studies showed high *capability* of achieving large effects, and the more similar the current *circumstances* are to those where the intervention worked in the past, the less of a problem we’ll tend to find with generalization.

— one of my favorite solutions for generalizability —

One of my favorite solutions to this problem of poor generalizability is different from any of those I’ve discussed above.

When feasible and not prohibitively expensive or difficult, I think it is great for intervention developers to study their exact intervention using an outcome they actually care about on a population that is the same as (or much like) the real one that the intervention will be applied to.

In other words, merging the study with the real-world deployment of the intervention.

This neatly sidesteps generalizability to a large degree, not by ensuring generalizability to new groups, but by testing your results on the group you most care about!

Then, as you “generalize” your intervention to new contexts or groups, if you keep collecting data of the right form as you deploy it more broadly, you may be able to detect whether its effectiveness is successfully generalizing as well, and even better, try to discover what is or isn’t causing it to work.

This piece was first written on July 19, 2018, and first appeared on my website on December 11, 2025.

Is Math True?

Spencer — Mon, 19 Jan 2009 00:51:00 +0000

Mathematics is often thought to be universally and unassailably true. Some people argue that even an omnipotent God couldn’t make math false.

But can mathematicians actually prove that math is true? If they can’t, does the fact that math is so useful in solving real-world problems provide evidence of its truth? And, if mathematics is not true, then does that imply that conclusions drawn from it are faulty or suspect? Let’s explore those questions.

The first attempt we might take to prove that math is true is to consider real-world situations where equations seem to appear. Some examples are:

If I have three red balls in a bag and add two more, the bag will then contain five red balls (3 balls in a bag with 2 balls added to the bag gives 5 balls).

If I am on a train traveling at three miles per hour and throw a ball at two miles per hour (measured with respect to the train), then the ball will be traveling at five miles per hour with respect to the ground (3 mph sped up by 2 mph gives 5 mph).

If I had three dollars worth of goods yesterday and borrowed two dollars worth of goods from you today, then I have five dollars worth of goods in my possession (3 dollars of goods with an additional 2 dollars of goods borrowed yields 5 dollars of goods).

Each of these three situations seems to imply the equation 3+2=5. But do they actually PROVE that the equation 3+2=5 is true?

One problem with drawing conclusions about mathematics from these examples is that the number ‘3’ is not the same as ‘3 balls’ or ‘3 miles per hour’ or ‘3 dollars’, and the operator ‘+’ is not the same as grouping balls or combining velocities or aggregating wealth.

While 3+2=5 is typically an excellent model for each of these situations, the equation is not precisely equivalent to these situations. Why not? Well, it’s true that when we group balls (by, in this case, placing them in a bag), the procedure generally behaves as though we are performing addition. But now suppose that the objects we are grouping together are made of packed sand. In this case, when we add new objects to our bag, they will sometimes fracture and split into multiple objects. Or if the balls are made of wet clay, they may fuse into a single object in the bag. The addition operator ‘+’ no longer models this situation well because when we place two new objects in the bag, it does not always increase the number of objects contained in the bag by two. So addition is not a perfect model for grouping physical objects in a confined space.

What about the other examples? Einstein’s theory of relativity tells us (in contradiction to the more intuitive but less accurate equations of Newtonian mechanics) that when a person on a train (which is moving three miles per hour with respect to the ground) throws a ball at two miles per hour (with respect to the train), then the speed of the ball with respect to the ground is actually very slightly less than 5 miles per hour, not equal to 5 miles per hour. So while addition is very accurate for modeling that situation, it’s known that it does not, in fact, give the exact correct answer.
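For the curious, the size of that discrepancy can be worked out with the standard velocity-addition formula from special relativity, w = (u + v) / (1 + uv/c²), which the text above alludes to but does not state. The sketch below is illustrative: the names are hypothetical, and exact fractions are used because the gap is far smaller than ordinary floating-point precision near 5.

```python
from fractions import Fraction

# Speed of light in miles per hour, kept exact:
# 299,792,458 meters per second and 1,609.344 meters per mile.
C_MPH = Fraction(299_792_458 * 3600 * 1000, 1_609_344)


def combine_velocities(u: Fraction, v: Fraction) -> Fraction:
    """Relativistic addition of two velocities along the same line (in mph)."""
    return (u + v) / (1 + u * v / C_MPH**2)


train = Fraction(3)  # train speed relative to the ground
ball = Fraction(2)   # ball speed relative to the train
ground_speed = combine_velocities(train, ball)
shortfall = (train + ball) - ground_speed  # positive, but astronomically small
```

Here `ground_speed` is strictly less than 5, and `shortfall` is on the order of 10^-17 miles per hour, which is why simple addition is such a good model at everyday speeds.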

What about the last example? If I had three dollars worth of goods yesterday and then borrowed two dollars worth of goods from you today, the total number of dollars worth of goods that I have possession of will not necessarily be five dollars if the value of my original goods changed between yesterday and today (as can happen in real economic markets).

What these examples show us is that the only reason to say that grouping balls or combining velocities or aggregating wealth encapsulates the idea of mathematical addition is that most of the time, the addition operator ‘+’ provides a good MODEL for these scenarios. We can no more conclude that 3+2=5 is a true statement simply because putting two balls into a bag that already has three balls usually produces a bag with five balls, than we can conclude that 3+2=5 is false, simply because if you’re using balls of packed sand, sometimes the balls will fracture into more balls when you place them in the bag. In other words, while real-world situations can motivate the equations of mathematics and provide justifications for applying them, they cannot prove that those equations are actually true.

We have stared at equations like 3+2=5 so many times in our lives that it can be difficult to consider them with fresh eyes in order to ask ourselves what it really is that they are saying. Clearly, ‘3’, ‘+’, ‘2’, ‘=’, and ‘5’ are not objects in the physical universe. You can go to the zoo and see three bears, or see the numeral ‘3’ printed on a sign, or perform arithmetic on paper using the symbol ‘3’, but nowhere in the universe can you find the actual (metaphysical) number ‘3’. This is hardly surprising, since ‘3’ is a concept or idea, not a physical thing. But this line of thought implies that 3+2=5 is a statement about the relationship among the concepts ‘3’, ‘2’, and ‘5’, and not a statement about physical entities that actually exist. The only time that 3+2=5 is a statement about physical things that actually exist is when we use it as a model for real-world properties that are sufficiently similar to the concepts for the model to be useful.

But how do we define the word “true” when it comes to relations among abstract concepts? One possible approach is to say that statements about abstract concepts are true if they follow as a logical consequence of the definitions of the concepts themselves.

This leads us to ask whether 3+2=5 and all other mathematical statements are simply true by definition as a consequence of our chosen definitions for ‘3’, ‘+’, ‘2’, ‘=’, ‘5’, and the other mathematical objects.

Unfortunately, this question cannot be answered without further qualification. How do we choose to define concepts such as ‘3’? Various authors have attempted to define mathematics by developing lists of axioms (which are simply assumed to be true) and then proving that the basic mathematical objects (e.g., integers) and theorems (e.g., a+b = b+a) follow from these axioms. There are a variety of different ways that math can be axiomatized (i.e., built up from basic axioms). Some approaches use sets as the most basic objects (as is done in what is probably the most popular axiomatization, Zermelo-Fraenkel set theory).

In contrast, others use category theory to provide the basic building blocks. Still others attempt to axiomatize only small portions of math, such as Euclid’s axioms of planar geometry, Hilbert’s axiomatization of Euclidean geometry, and the Peano axioms for arithmetic.
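To give a feel for what “true by definition” can look like, here is a toy sketch in the spirit of the Peano axioms. It is not a formal axiomatization: the tuple encoding and all of the names are assumptions made for illustration. Numbers are built from zero and a successor operation, addition is defined by two rules, and 3+2=5 then follows mechanically from those definitions.

```python
ZERO = ()


def succ(n):
    """Successor: wrap the numeral in one more layer."""
    return (n,)


def add(a, b):
    """Addition by recursion on b: a + 0 = a, and a + S(b) = S(a + b)."""
    if b == ZERO:
        return a
    return succ(add(a, b[0]))


def numeral(k: int):
    """Build the numeral S(S(...S(0))) with k successors (a convenience helper)."""
    n = ZERO
    for _ in range(k):
        n = succ(n)
    return n


three_plus_two = add(numeral(3), numeral(2))
five = numeral(5)
```

Comparing `three_plus_two` with `five` shows that they are the same nested structure. Nothing about balls, trains, or dollars enters into it; the result depends only on the chosen definitions.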

What is even trickier (when it comes to deciding what is true) than having so many conflicting viewpoints for constructing math is that the axioms of these viewpoints are themselves not provably true. If you are willing to assume the axioms of math are “true”, then all of the resulting theorems that can be derived from those axioms are also true, but the axioms themselves must simply be accepted without proof in order for this process to work. If we could prove that the axioms were true, then they would be called “theorems” and not “axioms”!

Even those mathematicians who agree to rely on a single basic axiomatization (such as Zermelo-Fraenkel set theory) sometimes cannot agree on whether certain extra axioms (such as the continuum hypothesis, which concerns the existence of sets of certain infinite sizes, or the axiom of choice, which pertains to being able to select one element from each set in a collection of non-empty sets) should be added or left out. And to top that off, mathematics (as defined by whichever axiomatization you like) has not even been proven to be consistent, meaning that no one has been able to mathematically demonstrate that the axioms of any single axiomatization do not contradict each other. In fact, Gödel’s second incompleteness theorem shows that if such an axiomatization is in fact consistent (and rich enough to express basic arithmetic), then it cannot be used to prove its own consistency!

In conclusion, numbers and other mathematical objects are simply concepts, and not things that are actually observable in the universe, so we cannot say that statements like 3+2=5 are true in the same way that we can say that the statement “massive objects exert forces on other massive objects” is true. We might like to think that mathematical statements are true by definition. Still, this idea is complicated by the fact that there is more than one way to axiomatize mathematics, and therefore more than one definition that we might choose in order to define numbers, operators, and other mathematical objects. But even if there were truly only one way to axiomatize math, the axioms themselves would still not be provably true (they would only be assumed to be true), and hence it would hardly seem fair to then conclude that mathematical theorems are “true” in some objective and universal sense.

In the end, while it hardly seems fair to say that math is “false”, it also does not seem fair to conclude that math is “true” in the usual sense of the word. It’s true, conditional on the axioms that we choose to accept, and only insofar as it is talking about the concepts that it defines (rather than the physical world).

Of course, math undeniably provides extraordinarily useful models for making predictions about what will happen in our physical universe. This will perhaps seem less surprising if we remember that mathematics was not originally developed from the ground up using axioms, but rather piece by piece in order to find solutions to problems that appear in the real world (like those related to calculating the size of plots of land, counting money, measuring roads, tracking the movements of the stars, understanding heat flow in cannons, etc.). Humans chose mathematical definitions to model physical reality so that we could make useful predictions, not to encapsulate metaphysical truth, so should we have expected math to be “true” rather than merely (very) useful?

This piece was first written on January 18, 2009, and first appeared on my website on March 3, 2026.