evidence – Spencer Greenberg

Is it a problem if students cheat using AI?

Spencer — Fri, 23 May 2025 22:26:52 +0000

A really bad take I’m hearing: “It’s fine if students use AI to cheat at writing, they’ll have AI in real life.” It’s bad because:

1) Learning to WRITE well is a primary way people learn to THINK well. There are other ways to learn to think well (e.g., a strong culture of oral debate and rigorous discussion), but that’s largely not how things are set up, so without writing, there’s a vacuum. Until schools change, students are sacrificing learning to think.

2) Normalizing cheating in one domain normalizes it in other domains too.

There are lots of ways to use AI to improve your thinking (e.g., ask an AI to critique a belief you hold or to help you explore points on all sides of a debated issue). But when a teacher says, “Write this without AI,” and you have an AI write the essay, it’s preventing you from engaging in significant thinking.

Thinking well involves a number of components, such as:

– gathering evidence
– considering arguments
– formulating a viewpoint
– honing your viewpoint
– presenting your viewpoint clearly

Replacing thinking with AI is not analogous to replacing doing multiplication with a calculator. That’s a memorized algorithm. Thinking well, on the other hand, is core to understanding the world, figuring out what goals to set, not being duped by others, and many other essential aspects of life.

This piece was first written on May 23, 2025, and first appeared on my website on June 5, 2025.

For Health And Longevity, Be Wary Of Mechanisms

Spencer — Fri, 09 May 2025 00:18:40 +0000

Often in health and longevity discussions, you’ll hear arguments about mechanisms. For instance:

Antioxidants -> reduced free radicals -> less DNA damage -> less cancer

Unfortunately, these biologically plausible-sounding claims usually don’t work when rigorously tested.

Are mechanistic arguments useless?

No. They are a great source of *hypotheses*. While most of these hypotheses fail, some eventually lead to important new treatments.

Unfortunately, health gurus, podcasters, and even sometimes (though they should know better) doctors and scientists use mechanistic arguments to convince the public about treatments for which we have little evidence.

Mechanistic arguments in health sound scientific and impressive. They make the speaker seem authoritative and knowledgeable. And they *seem* very hard to argue with. However, there is one general argument that works for most of them: “That sounds nice, but let’s look at randomized experiments in humans to check if it works.”

Why is it so common that health-related mechanistic claims don’t work when rigorously tested in randomized trials?

What goes wrong with X->Y biological thinking?

1) Causality: The first issue is that an X->Y claim may be true in terms of associations, without the links being *causal*. It’s typically a lot easier to establish that when people have higher X, they also have higher Y than to show that increasing X causes higher Y.

Alzheimer’s research seems to be experiencing this problem in a major way. The hypothesis:

Amyloid plaques -> Alzheimer’s

seems to be oversimplified or perhaps mostly associational (rather than causal), as drugs that reduce brain plaques have had disappointing results.

2) Multiple mechanisms: even if it’s true that X is in the causal chain for Y, Y may also be highly influenced by other mechanisms, and so changing X may not change Y that much, even if you control X completely.

3) Other effects: even if the mechanism is completely correct, there may be alternative effects of the treatment. These could undermine the original benefit through other pathways, or cause other forms of harm that mean the benefit is not worth it.

4) Equilibrium: even if mechanistically X->Y, the body may work hard to maintain a balance of Y (much as it does to keep your core body temperature roughly constant regardless of whether you’re drinking a hot beverage or standing outside in the cold). Hence, the effects of intervening on X may not create lasting impacts on Y because your body works to restore a homeostasis.

5) Evaluability: unlike arguments based on empirical evidence (we gave patients this treatment in a study and here’s how their outcomes differed from the control group) and logical arguments, which a reasonably knowledgeable non-expert can understand and assess to at least some degree, arguments based on biological mechanisms can’t be evaluated at all by non-experts. Take this claim, for example. Is it sound? See if you can tell:

“Subcutaneous WPP9 injections activate orexinergic neurons via Gq-coupled receptor agonism in the lateral hypothalamus, which increases daytime cortisol rhythm, leading to increased alertness.”

So, is this a valid mechanistic argument about human biology? Well, no, but I only know that because I had an LLM AI make this argument up by prompting it to generate a biologically plausible sounding but made-up argument. An expert on the topic may immediately identify it as implausible, but anyone else is going to have no realistic way of evaluating its soundness without help.

So, what’s the takeaway here? Well, when a podcaster or health guru tells you that we know a treatment works because [insert biological mechanistic argument here], remember that it isn’t strong evidence, no matter how impressive it sounds. We need careful randomized experiments (or other high-quality evidence) to be confident it’s true. Mechanistic arguments are for generating hypotheses; they give us a reason to collect more data and run studies to see if an idea pans out – they don’t themselves serve as strong evidence for what’s true.

Of course, we don’t always need strong evidence to try a treatment if it is worth it. If a treatment is inexpensive and low-risk, we would be able to tell whether it is working, and we don’t have more evidence-backed alternatives, then experimenting with the treatment (even if it only has weak evidence) can still be a good idea. But we shouldn’t mistake “worth experimenting with” for “having strong evidence for”.

This piece was first written on May 8, 2025, and first appeared on my website on May 15, 2025.

When is it worth it to argue over definitions?

Spencer — Thu, 10 Apr 2025 21:42:24 +0000

It’s almost always a waste of time debating definitions with people (“semantic debates”).

Just stop for a moment to define terms or switch to using the other person’s definition so you don’t talk past each other. Definitions can be whatever we want them to be, and most of the time the important thing is just that our definitions match closely enough so that we can communicate effectively. Attempts to argue about definitions usually are a fool’s errand.

And yet… there are some situations where disagreeing about definitions or trying to convince the other person to adopt a different definition may be wise:

Ambiguity. When someone attempts to use an ambiguous definition, that can cause reasoning about the topic to become sloppy. You can suggest an alternative, more precise definition.
Nonstandardness. When someone uses a word in a way that is out of sync with how most people use that word, it can create a lot of confusion. You can suggest switching to a standard definition or using a different word/phrase for what they are referring to.
Emotionality. Sometimes, people sneak judgments, offensiveness, or slants into arguments with an emotion-laden word. For instance, “slut” will sound negative to many, even if the speaker insists on giving it a neutral definition. You can suggest switching to a neutral word. Sometimes this can also go in the reverse direction, where someone tries to make something awful sound okay by giving it a very benign phrase.
Centrality. Sometimes, a definition is too broad or does not capture the core of what’s under debate. For instance, defining “criminal” as anyone who’s broken ANY law may make it hard to discuss “criminal justice reform.” You can suggest a new definition that’s better focused.
Circularity. Sometimes, people will try to win an argument by using an unusual definition that makes them right by definition. For instance, in a debate about the cost-effectiveness of medical care, if someone defines “routine medical care” in such a way as to exclude all non-cost-effective medical care, then, by definition, routine medical care will be cost-effective. In such cases, you can suggest using a widely accepted definition that doesn’t make the other person automatically right (by definition).
Benefits. Sometimes, using one definition is more useful or more beneficial to the world than using another definition. In such cases, it might be valuable for you to argue that the other person should switch to using a different definition just for these pragmatic benefits.
Shifting. Sometimes people make an unreasonable or false claim using one definition but then, when they’re challenged, they’ll switch to using another definition that makes their claim much more easily defensible (a “motte-and-bailey” fallacy). In such cases, you can argue against their usage of the fallback definition so as to pin down their claims.
Objective. There are some special situations (though they’re rare) in which there really is only one good way to define something. For instance, this sometimes happens in physics and mathematics, where any other definition (that’s not equivalent) fails to have the properties we want. I would argue that “evidence” is like this too – I believe there is only one definition that has all the properties we’d want “evidence” to have.

So, most of the time, when disagreements over definitions come up, you shouldn’t debate definitions. It’s simply a waste of time. These conversations usually are unresolvable because there are no agreed-upon criteria for deciding which definition is better, and the conversation amounts to pointlessly trading intuitions. Fundamentally, definitions are things we make up, so it’s usually best just to agree on definitions upfront or to adopt the other person’s definition so effective communication can happen.

But, as we’ve seen here, there are a handful of interesting cases where it’s actually helpful to propose a potentially “better” definition and to try to get the other person to agree to it before proceeding with the discussion!

This piece was first written on March 16, 2025, and first appeared on my website on April 10, 2025.

Has every made-up anecdote already happened?

Admin — Sat, 14 Sep 2024 03:37:00 +0000

A weird thing about anecdotes: there are so many humans, and each human has so many things happen to them, that for a great many simple stories you might make up (as long as they are within the bounds of physics/current technology/human capacity, and aren’t too specific), something similar has happened to somebody.

For instance, I just made up these stories that I’ve never heard of ever happening:

• a young child stealing their mother’s car

• a dog discovering buried treasure

And indeed, with a quick search I can confirm that these things seem to really have happened!

Though, of course, this won’t always be the case since the number of human events still pales in comparison to the number of concepts that can be mixed – for instance, I couldn’t find even one documented case of “a clown being killed by bees” (though I’m confident that at some point in history, someone was dressed in a clown suit when a bee stung them).

In any event, the preponderance of events on our planet means that something happening one single time tells us almost nothing. Having happened once is a very low bar.
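To make the “very low bar” concrete, here is a rough back-of-the-envelope sketch. The per-person probability and the population figure are made-up assumptions chosen only for illustration (they are not estimates from any data), and the calculation assumes the event happens independently for each person.

```python
import math

def prob_at_least_once(p_per_person: float, n_people: int) -> float:
    """Chance that an event happens to at least one of n_people,
    assuming each person independently has probability p_per_person."""
    # 1 - (1 - p)^n, computed with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(n_people * math.log1p(-p_per_person))

# Hypothetical numbers: a one-in-a-billion lifetime event, 8 billion people.
p = 1e-9
n = 8_000_000_000
chance = prob_at_least_once(p, n)
print(f"Chance it has happened to at least one person: {chance:.4f}")
```

Under these assumed numbers, the chance comes out above 99.9%, even though the event is extraordinarily unlikely for any one person. That is why “it happened once” says so little on its own.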

And yet, to make a point in a way that people find compelling, it’s sometimes mandatory (or close to it) to give real-world examples that demonstrate the point.

This creates an awkward tension where a single real-world example often has almost no evidentiary value but has substantial persuasive power.

There are some special cases where an anecdote can provide meaningful evidence. For instance, when the anecdote is so well documented or reliable that you know it happened AND the outcome couldn’t reasonably have been caused by anything other than the explanation the anecdote provides – such as a case study in a hospital where some experimental new treatment saves a patient with a previously incurable disease. Or when you yourself have tried something once (e.g., a self-help technique), and it seemed to work well, and that is sufficient justification for trying it again.

But in most cases, despite their usefulness in making a compelling point, anecdotes should be thought of as a way to imagine something more vividly and see more clearly specific ways it can manifest, not as evidence for something being true. They are important when explaining a concept, but usually not because they provide evidence of its validity.

This piece was first written on September 13, 2024, and first appeared on my website on October 11, 2024.

Conducting Instantaneous Experiments

Admin — Sat, 24 Aug 2024 11:54:00 +0000

Have a hypothesis about the world, society, human nature, physics, or anything else that nobody has directly tested before? It might seem like conducting a costly experiment would be required to find out whether it’s true. But a lot of the time, you can check your hypothesis easily using what I call an “Instantaneous Experiment.”

How to do an Instantaneous Experiment:

Step 1: Think of anything at all about the world that’s checkable that is likely to be true if your hypothesis is true, but that is likely to be false if your hypothesis is false.

Important: this checkable thing should be something that you have never investigated before – in other words, you don’t actually know if it’s true, and the only real reason you think it’s true is just because your hypothesis implies it would be. This is critical to help prevent bias from occurring during the process (for instance, this procedure doesn’t work if the fact you are checking is one that influenced your development of the hypothesis).

Step 2: Go check whether the checkable thing is true or not by trying to look the answer up (e.g., in an article or paper)!

The amount of evidence that the answer provides in favor of (or against) your hypothesis depends precisely on how many times more likely you are to see that result if your hypothesis is true compared to if it’s not true. The bigger that number is, the greater the evidence!

Instantaneous Experiments work because, to get evidence for a theory or hypothesis, it is not necessary to directly check whether that thing is true. All you have to do is check something that is implied by that theory (that would be unlikely to be true otherwise).

Here’s an example:

Suppose you believe that “greater intelligence causes people to worry a lot more”

That’s very hard to test. But you can do an Instantaneous Experiment:

Step 1: if intelligence causes worry, then you might expect higher IQ people to agree more often with a statement like “I worry too much,” whereas if the theory is not true, you wouldn’t expect a positive correlation between IQ and agreement with that statement.

Step 2: We go check this, and we find a paper that measures both IQ and the level of agreement on the statement “I worry too much.” The correlation between them is essentially 0.

Result: We haven’t completely disproven the theory, but we should now reduce our confidence in it compared to what we thought before.

How much we reduce our confidence depends on how many times less likely we’d be to find no correlation between self-reported worry and IQ if our hypothesis “greater intelligence causes people to worry a lot more” is true, compared to if it’s false.
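The “how many times more (or less) likely” idea above is the likelihood ratio from Bayes’ rule, and the update can be sketched in a few lines. The prior and the two probabilities below are hypothetical numbers picked purely for illustration; they do not come from the paper in the example, and the function name is my own.

```python
def update_probability(prior: float, likelihood_ratio: float) -> float:
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio.

    likelihood_ratio = P(result | hypothesis true) / P(result | hypothesis false)
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)

# Hypothetical numbers, chosen only for illustration:
prior = 0.60              # confidence in the hypothesis before checking
p_result_if_true = 0.10   # chance of seeing ~0 correlation if the hypothesis is true
p_result_if_false = 0.50  # chance of seeing ~0 correlation if the hypothesis is false

posterior = update_probability(prior, p_result_if_true / p_result_if_false)
print(round(posterior, 3))
```

With these assumed inputs, the result is five times less likely if the hypothesis is true than if it is false, so confidence falls from 0.60 to roughly 0.23. A likelihood ratio closer to 1 would move the estimate much less.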

This piece was first written on August 24, 2024, and first appeared on my website on October 11, 2024.

How can big problems get solved?

admin — Sun, 05 May 2024 15:12:00 +0000

I think that big problems in the world (like chronic homelessness, loneliness, depression, poverty, underrepresentation of groups, risks from A.I., global warming, etc.) are ridiculously complex – way more complex than the narratives about them suggest.

The only approach I know of that I think has a meaningful shot to help solve such huge problems, which you might call “Scientific Entrepreneurship,” combines two methods into one:

(1) Rigorous science to deeply understand the causal structures of the problem and how strong an effect each cause has (which often will begin with a qualitative approach to understand the outlines of the problem and then move to analysis of carefully conducted measurements).

(2) A “lean startup” approach, where you try things (guided by your current understanding of the causal relationships), see what happens, and then rapidly course-correct based on the results (and sometimes take large pivots) to iterate towards better and better approaches.

When dealing with these highly complex problems, I think that a “lean startup” approach without rigorous science ends up leading to lots of random attempts that have almost no chance of ultimate success (and sometimes even converge to self-propagating but useless approaches, such as charities that perpetually absorb money without genuinely helping the cause).

On the other hand, a purely scientific approach without the lean startup iterative mindset often ends up missing critical contextual details that are actually essential for the project to work. You’ll inevitably encounter many specific barriers that the scientific theory doesn’t address. In summary, I think our best bet for solving highly complex world problems is Scientific Entrepreneurship: developing a deep understanding of the high-level causal structures through scientific rigor and then combining that with an entrepreneurial form of rapid iteration.

This piece was first written on May 5, 2024, and first appeared on my website on June 5, 2024.

How to spot real expertise

admin — Tue, 23 Apr 2024 13:33:35 +0000

Thanks go to Travis (from the Clearer Thinking team) for coauthoring this with me. This is a cross-post from Clearer Thinking.

How can you tell who is a valid expert, and who is full of B.S.?

On almost any topic of importance you can find a mix of valid experts (who are giving you reliable information) and false but confident-seeming “experts” (who are giving you misinformation). To make matters even more confusing, sometimes the fake experts even have very impressive credentials, and every once in a while, the real, genuine experts are entirely self-taught.

Here are 12 signs we look for in an expert to help us determine whether they are trustworthy.

1. They have deep factual knowledge

Let’s start with the obvious: for most topics, a lot of factual knowledge is required before you can have genuine expertise. This means that a genuine expert will have an impressive command of the relevant (non-debated) facts on the topic of their expertise. Thankfully, it’s a lot easier to tell if an expert has a strong command of the non-debated facts than whether they are correct about more controversial claims.

2. They communicate their confidence levels

Not all knowledge is equally well-established. Even theories that are widely accepted enjoy different levels of support from the relevant evidence. When an expert regularly pretends that all their claims are equally well-established, they demonstrate they are willing to make you believe something is certain when it isn’t.

It’s a good sign that someone treats their subject with the nuance expected from genuine expertise when they indicate how confident they are (e.g., “It’s been shown in many high-quality studies that…”, or “My best guess is…”), and they explain limitations in the evidence they are using (e.g., “this is unfortunately based on just one study, but that is all that currently exists”).

3. They admit not knowing

Genuine experts also sometimes say that they don’t know the answer to a question, or that the answer is generally not known by anyone. This is important because every topic will have some unknowns, and no expert can know everything about a topic. Telling you when they don’t know is a sign that, when they say they do know, they actually do know.

4. They tell you to look at sources other than themselves

This might happen when an expert doesn’t know the answer to a question, or when they want to help you go beyond the answer they can give you. Genuine experts don’t seek to be seen as a sole arbiter of knowledge or authority on a topic (which can be an indication that ego, rather than truth-seeking, is a primary motivation for them), but instead encourage you to look at resources other than the ones they have produced.

5. They use logic and evidence

Anyone can use rhetorical devices like emotional appeals, no matter how wrong they are, but a well-reasoned argument that uses valid logic and strong evidence will tend to point toward truth. Or, put another way, using strong logic and strong evidence is easier to do when you’re right, whereas emotional appeals are no easier when you’re right than when you’re wrong.

6. They cite high-quality evidence

Some evidence is much more reliable than other evidence, and those who rely on the less reliable kinds when the more reliable kinds exist probably aren’t doing the best job they can at figuring out the truth. For this reason, genuine experts cite high-quality evidence when it exists (e.g., looking at multiple randomized controlled trials for causal claims) rather than low-quality evidence (e.g., just talking about personal anecdotes), and when high-quality evidence doesn’t exist, they cite the highest quality evidence that does exist.

7. They acknowledge the consensus

Consensus views among experts are more often correct than the idiosyncratic views of just one or two experts. The consensus will not always be right, of course, but often it will be the best understanding we have available. That’s why reliable experts are transparent about the degree to which their opinion differs from the majority of experts, provide reasoned explanations for any deviations, and are cautious not to present fringe theories as mainstream. This shows a deep engagement with the topic of their expertise and also an adherence to ethical standards of honesty and accuracy in communication.

8. They change their mind

Genuine experts will change their minds about topics within their expertise in response to evidence and arguments. It’s hard to become an expert in something without having been wrong from time to time.

That means that anyone claiming to be an expert who has never changed their mind probably has not found and corrected their mistakes. Relatedly, changing one’s mind in response to evidence is also a sign of the epistemic humility associated with genuine expertise.

Of course, if someone has a long history of being wrong, that is evidence against them being a genuine expert, not in favor of it. But, since everyone makes some mistakes, if they make mistakes from time to time and then note they were wrong and improve their beliefs, that is a sign that they are following the evidence where it leads rather than continuing to believe what they do regardless of the evidence.

9. They Steelman

When you ‘straw man’ an argument, you misrepresent or oversimplify someone else’s position to make it easier to attack or refute. Instead of dealing with the actual argument, you replace it with a weaker version that distorts the original point, which you then argue against. The opposite of this is called ‘steelmanning’, and it involves presenting the strongest possible version of an argument you’re objecting to, even if it’s more robust than the one originally presented. This approach aims to strengthen the opposing case in order to facilitate a more genuine and constructive debate.

The most reliable experts will accurately present the strongest arguments made by those that disagree with them while pointing out flaws in those arguments, rather than focusing on just weak arguments from the other side or just mocking the other side (including ad hominem attacks rather than focusing on the substance of the claims of the other side). This is important because knocking down a weak argument from the other side of a debate does little to show the other side is wrong; you have to refute the strongest claims of the other side to actually show they are wrong. Additionally, demonstrating a knowledge of the strongest arguments against your own position shows a deeper level of expertise than only understanding the opposing point of view at a superficial level.

10. They clearly explain their reasons for believing

The philosopher Daniel Dennett has said: “if I can’t explain something I’m doing to a group of bright undergraduates, I don’t really understand it myself.” This sentiment is echoed by philosopher John Searle, who said “In general, I feel if you can’t say it clearly you don’t understand it yourself.”

When communicating with non-experts, genuine experts are often able to give clear, easy-to-follow (and, ideally, checkable) explanations for why they believe what they believe – without dumbing down the points. They avoid unnecessary jargon and technical language (which sounds smart but makes their arguments very difficult for their audience to follow). Not every genuine expert is able to do this, but the ability to do this well is a sign of genuine expertise. This is important because an expert who cannot explain their ideas clearly will end up requiring you to believe them based on their authority rather than engaging with the arguments themselves. And sometimes, people claiming to be experts will hide behind technical expertise and jargon so that you won’t notice that their arguments are actually weak.

11. They have a track record

Sometimes genuine experts will have track records of predictions or successes that you can check, and this provides direct evidence of their knowledge or skill. Unfortunately, this only applies to some kinds of experts, like chess masters, martial arts experts who fight in tournaments, experts who make public predictions about the economy or politics, etc.

12. They use multiple lenses

The world is complex and multi-faceted, and any one simple theory is going to fail to explain a lot of what’s really going on. For this reason, genuine experts tend to look at problems from multiple frames and perspectives; they don’t act as though one way of looking at things solves all problems, or that one solution works for all problems, or that one simple theory explains everything.

So the next time you hear claims from an alleged expert on a topic that is important to you, you may want to consider: how many of these signs of expertise do they exhibit? You can use this checklist, considering if they:

have deep factual knowledge
communicate their confidence levels
admit not knowing
tell you to look at sources other than themselves
use logic and evidence
cite high-quality evidence
acknowledge the consensus
change their mind
steelman
clearly explain their reasons for believing
have a track record
use multiple lenses

And if you’re seeking to be an expert in something yourself, you may want to ask yourself: “to what extent do I exhibit these traits?”

Being able to discern genuine expertise from B.S. requires good judgment. If you’d like to improve your skills at making accurate judgments, why not try our Calibrate Your Judgment tool, created in partnership with Open Philanthropy?

This piece first appeared on ClearerThinking.org on April 16, 2024, and first appeared on my website on April 22, 2024.

Weird but potentially valuable new roles we could have in our society

admin — Thu, 19 Nov 2020 01:47:00 +0000

There are certain roles in society that come with special training, powers, and responsibilities. For instance: doctors (can prescribe medicine), lawyers (attorney-client privilege), judges (can bindingly interpret law), etc.

Here’s my list of some weird but potentially really valuable roles in society that don’t exist but maybe should:

Role 1: Truth Teller

They wear a special, very noticeable hat. When wearing it, they are not permitted to say anything they know to be untrue (they are punished severely and may be suspended or lose their license if they do, plus the incident becomes public). They can also get punished for clear lies of omission or for making misleading statements. Whenever it is worn, their hat records time-stamped, watermarked 360-degree video. Anyone who is caught on camera can request the segment of the video (and accompanying audio) in which they appear.

Training: practicing telling people very difficult truths (e.g., breaking the news to parents of service members that their child isn’t coming back), answering difficult personal questions fully truthfully, speaking very carefully about what they know and how they know it, etc.

Some uses:

• When you need an opinion, you can count on them being TOTALLY honest

• As an eyewitness to prove what occurred (e.g., at protests or high-stakes negotiations)

• Observing voting recounts and lottery drawings

• When needing an eyewitness to later prove to others very credibly (e.g., in court) that something did or didn’t happen

Role 2: Evidence Evaluator

They provide an impartial, apolitical, thoroughly researched, unbiased, and fallacy-aware perspective on any topic. When they use their official title in writing or speech (e.g., “Signed, Evidence Evaluator Jane Doe”), they can be suspended for falling into even minor fallacies or biases, and they can lose their license for significant ones.

Training: extensive practice with argument and evidence evaluation, avoiding rhetorical fallacies & cognitive biases, and calibration training for making predictions; extensive learning about Bayes’ rule, probabilistic and nuanced thinking, research best practices, statistics, summarizing evidence, scientific thinking, etc.

Some uses:

• When you want to know what’s known on a thorny topic, you can hire them to interview experts on all sides of the issue, or read papers on all sides, giving an impartial account of the evidence (e.g., what is known about how much human behavior is increasing global temperatures, and how certain this information is)

• When it’s helpful to find weaknesses or flaws in any perspective

Role 3: Unconditional Aide

They can be hired by the hour, and during that time, they are required to look out SOLELY for the interests of the person that hired them (as long as the health, property, or safety of anyone else is not at risk). In other words, they are fully and completely your supporter and your team member for the time you are paying them and will help you with ANYTHING you choose. They do, however, have the right to maintain a public list of activities they are not willing to do, to refuse clients who they would prefer not to work with, and to quit at any moment (by notifying you that they are quitting – in which case you would still owe them payments for any hours logged thus far). They also may have a price list (i.e., their hourly rate can fluctuate based on what you are asking for their help with). Credible reports that they are not acting on behalf of the client’s interests can lead to suspension or complete removal of their title.

Training: practicing active listening, practicing eliciting a person’s underlying goals, and real-world training where they have to help many different people with many different kinds of requests and goals (and then get assessed by the people they helped with qualitative and quantitative feedback).

Some uses:

• You are going into an emotionally difficult situation and would like a supporter there with you (but don’t want to ask friends/loved ones)

• You are trying to carry out a difficult activity and need someone’s help with it

• You are in a serious pickle and need another person’s help (e.g., your child had to suddenly go to the hospital, and you need someone to show a potential buyer around your house, then walk your dog, then bring something from your home to the hospital)

The established roles we have in society (doctors, judges, etc.) are very useful. Perhaps we could do with a few more of them.

This piece was first written on November 18, 2020, and first appeared on this site on October 21, 2022.

50 “Laws” of Everything

Spencer — Mon, 06 Jul 2020 23:06:00 +0000

Parkinson’s Law: Work expands so as to fill the time available for its completion.
Hofstadter’s Law: It always takes longer than you expect, even when you take into account Hofstadter’s Law.
Gates’ Law: Most people overestimate what they can do in one year and underestimate what they can do in ten years.
Goodhart’s Law: When a measure becomes a target, it ceases to be a good measure.
Hanlon’s Razor: Never attribute to malice that which is adequately explained by stupidity (or, don’t invoke conspiracy when ignorance and incompetence will suffice, as conspiracy implies intelligence).
Acton’s Dictum: Power tends to corrupt, and absolute power corrupts absolutely.
Amara’s Law: We tend to overestimate the effect of a technology in the short run and underestimate the effect in the long run.
Benford’s Law: In a diverse collection of unrelated statistics, a given statistic has roughly a 30% chance of starting with the digit 1.
Betteridge’s Law: Any headline which ends in a question mark can be answered by the word ‘no’.
Brooks’ Law: Adding manpower to a late software project makes it later.
Chesterton’s Fence: Reforms should not be made until the reasoning behind the existing state of affairs is understood.
Claasen’s Law: Usefulness = log(technology).
Clarke’s First Law: When a distinguished elderly scientist states that something is possible, they are almost certainly right, but when they state something is impossible, they are probably wrong.
Cromwell’s Rule: Nothing but logical impossibilities have a prior probability of 0 or 1.
Cunningham’s Law: The best way to get the right answer on the Internet is not to ask a question, it’s to post the wrong answer.
Doctorow’s Law: When someone puts a lock on a thing you own, against your wishes, and doesn’t give you the key, they’re not doing it for your benefit.
Dunbar’s Number: Most people can’t maintain stable social relationships with more than 150 people.
Eroom’s Law: Drug discovery is becoming slower and more expensive over time, despite improvements in technology.
Gell-Mann Amnesia Effect: You’ll believe articles outside your area of expertise, even after acknowledging that neighboring articles in your area of expertise are completely wrong.
Gibson’s Law (or the Expert Witness Law): For each PhD (to use as an expert witness for one side) there’s an equal and opposite PhD.
Godwin’s Law: As an online discussion grows longer, the probability of a comparison involving Nazis or Hitler approaches one.
Morley-Souter’s Law (Rule 34): There is porn of it (no exceptions).
Greenspun’s Tenth Rule: Any sufficiently complicated C program contains an ad hoc, informally specified, bug-ridden, slow implementation of half of Common Lisp.
Hebb’s Law: Neurons that fire together wire together.
Hubble’s Law: Galaxies recede from an observer at a rate proportional to their distance to that observer.
Hume’s Guillotine (Is-Ought Problem): Normative statements (about what’s moral/immoral/right/wrong) cannot be deduced exclusively from descriptive statements.
Humphrey’s Law: Conscious attention to a task normally performed automatically can impair its performance.
Kranzberg’s Law: Technology is neither good nor bad; nor is it neutral.
Lamarck’s Principle (or “Use it or Lose it”): Use it or lose it (evolutionarily speaking, but also in the brain).
Lewis’s Law: The comments you’ll inevitably find on any article about feminism justify feminism.
Littlewood’s Law: Individuals can expect miracles to happen to them, at the rate of about one per month.
Maes–Garreau Law: Favorable predictions about future technology will fall at the latest possible date they can come true and still remain in the lifetime of the predictor.
Metcalfe’s Law: The value of a system grows as approximately the square of the number of users of the system.
Miller’s Law: To understand what another person is saying, you must assume that it is true and try to imagine what it could be true of.
Moore’s Law: Computation per dollar grows exponentially (or: number of transistors per circuit doubles roughly every 24 months).
Murphy’s Law: Anything that can go wrong will go wrong.
Alder’s Law: What cannot be settled by experiment is not worth debating.
O’Sullivan’s Law: All organizations that are not actually right-wing will over time become left-wing.
Pareto’s Principle (80/20 Rule): For many phenomena 80% of consequences stem from 20% of the causes.
Peter’s Principle: In a hierarchy, every employee tends to rise to his level of incompetence.
Poisson’s Law (or Law of Large Numbers): For independent random variables with a common distribution, the average tends to the mean as sample size increases.
Pournelle’s Iron Law of Bureaucracy: In bureaucracy, those devoted to the bureaucracy get control, those devoted to what it’s supposed to achieve lose influence.
Putt’s Law: Technology is dominated by two types of people: those who understand what they do not manage and those who manage what they do not understand.
Rosenthal Effect (Pygmalion Effect): High expectations lead to an increase in performance, low expectations to a decrease in performance.
Schneier’s Law: Any person can invent a security system so clever that she or he can’t think of how to break it.
Shermer’s Law: Any sufficiently advanced extraterrestrial intelligence is indistinguishable from God.
Zipf’s Law: The frequency of use of the nth-most-frequently-used word in any natural language is approximately inversely proportional to n (few words are used often, most are used rarely).
Wirth’s Law: Software gets slower more quickly than hardware gets faster.
Sturgeon’s Law: Ninety percent of everything is crud.
Stigler’s Law: No discovery is named after its original discoverer, including this one.

This piece was first written on July 6, 2020, and first appeared on my website on May 30, 2026.

Do The Findings Of A Study Conducted In One Place Generalize To Other Places?

Spencer — Thu, 19 Jul 2018 23:42:00 +0000

Do the results of studies generalize to new situations?

For instance, suppose a study is conducted on a technique or intervention (e.g., providing health education to parents) and the study finds it to be effective for a particular outcome (e.g., improving the health of children). When the next study is conducted on (what appears to be) the same intervention and outcome, should we expect that study to ALSO find the intervention to be effective?

There are a lot of reasons why it may NOT:

(1) sampling error – studies are almost always done on only a subset of the population of interest, which creates random variation from study to study. For instance, it could have just been a fluke that this particular group of children happened to have their health improve more than the control group of children. So the two results might differ purely by chance. Fortunately, statistical methods allow us to analyze how likely this is as an explanation, and the larger the sample we test the technique on, the lower the chance of this happening.

(2) questionable research practices – the first study’s apparent positive result may have been a false positive produced largely by bad research practices, such as analyzing many different outcomes and reporting only the one that seemed to work. If the study and its analysis plan are publicly pre-registered before the study is conducted, then this problem can be mostly avoided. In that case, researchers won’t be able to fool themselves about their original intention, and others can also hold them accountable to their original plan. On the other hand, it can be very useful to explore questions about data that you had not thought to explore before you looked at the data, or to engage in open-ended analyses where you aren’t certain what your hypothesis should be. But doing so comes with a greater risk of false positives.

(3) quality – the quality of the intervention and competence of the implementation may vary. For instance, if the first educational intervention studied was well designed, but the second one taught the information poorly or taught information that wasn’t actually useful, then it would hardly be surprising if the first succeeded and the second failed. More generally, a sufficiently bad implementation of any intervention will always fail! So an intervention of a certain type failing is NOT good evidence for the type of intervention not working more generally, unless a low-quality implementation is unlikely (or future implementors are unlikely to be able to create a more competent implementation). If the intervention went through multiple rounds of iterative improvement and user feedback before the study was conducted, this will help rule out (but not eliminate) low quality as the explanation for a failure.

(4) technique – the two interventions could have applied different techniques in order to produce change, despite very similar-sounding names and equal levels of quality. For instance, the first intervention might have focused more on the health benefits of purifying water, while the second might have focused more on the value of hand washing. These are both “water-related health education interventions for parents”, which sound extremely similar, but one may be much more effective than the other.

(5) dosage – the two interventions could have been administered in different amounts. For instance, the first might have been a 1-week course with 6 hours of education per day, and the second a 3-day course with 2 hours of education per day. Since the former in that case has a much higher dose, it would be little surprise if it had the potential to be much more effective (though it also might be substantially more costly to implement).

(6) format – the interventions could have used the same technique but delivered it using a different mechanism. For instance, the interventions might have been teaching the same information for the same duration, but the first might have been an in-person training in groups, whereas the second could have been an online training done individually.

(7) follow-up – the two studies could differ in how long they waited to collect the outcomes. For instance, the first might have looked at health outcomes 6 months later, whereas the second might have looked at health outcomes 3 weeks later. It’s possible that 3 weeks is too soon to find effects, or that the effects only last for a few months, so 6 months is too late to find them!

(8) measurement – different methods of measuring the outcomes could have been used. For instance, the first study might have measured health by looking at children’s medical usage records at nearby clinics, whereas the second might have asked parents to self-report the health of their children. These measurements may produce results that contradict each other even if every other detail of the studies is identical, simply because they are measuring different (though related) things.

(9) pre-existing attributes – the two populations studied might have a different distribution of relevant characteristics. For instance, in the first study, the children might have had a higher level of pre-existing illness than in the second study, making it easier to improve health outcomes. If the children are already really healthy, it’s going to be tough to produce a large effect on health. But on the flip side, if the children are already extremely ill, in some cases, it may be hard to help them recover compared to helping less sick people.

(10) culture – the two populations might differ culturally or in their personalities, language, attitudes, or beliefs. For instance, in the first study, the local culture may have made people more interested in the learning material, whereas in the second study, the local culture may have made people less trusting of the information because it contradicts traditionally held beliefs.

(11) environment – the surrounding environment that the populations exist in, or the resources already available, might have differed in the two studies. For instance, the first study group might live in an area where health resources are scarce, whereas the second group might live in one where alternative health resources are more abundant, making it harder to add value for the latter population. Or, for instance, in one area, certain easily treated childhood diseases might be common, whereas in the other region, the common childhood diseases might be harder to treat.

(12) control group – oftentimes, studies use control groups to help determine whether an intervention caused a change beyond what would have otherwise happened, but the choice of control can differ across studies. For instance, the first study might have used a wait-list control where people get nothing for a while, whereas the second study’s control group might have received an educational intervention that is not expected to produce health improvements (i.e., an “active control”). Or it could be that the first study used statistical control after the fact (e.g., using linear regression), whereas the second study randomized some of the study participants to a control group. The former method is typically weaker in that it may not fully account for confounding variables, but in some instances, the latter method may be weaker if the size of the control group is very small.
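To make reason (1) concrete, here is a minimal simulation sketch. It assumes a made-up true effect of 0.2 standard deviations and normally distributed outcomes (both are illustrative assumptions, not numbers from any real study), and it measures how much the observed effect bounces around from study to study at different sample sizes.

```python
import random
import statistics


def simulate_study(n_per_group, true_effect, rng):
    """Observed difference in mean outcome (treatment minus control) for one study."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
    return statistics.mean(treated) - statistics.mean(control)


def spread_of_results(n_per_group, true_effect=0.2, n_studies=500, seed=0):
    """Standard deviation of the observed effect across many simulated studies.

    true_effect is a hypothetical value chosen only for illustration.
    """
    rng = random.Random(seed)
    effects = [simulate_study(n_per_group, true_effect, rng) for _ in range(n_studies)]
    return statistics.stdev(effects)


if __name__ == "__main__":
    for n in (20, 200, 2000):
        print(f"n per group = {n:5d}   spread of observed effects = {spread_of_results(n):.3f}")
```

In this setup, the spread shrinks roughly in proportion to 1/sqrt(n), so with small samples, two studies of the very same intervention can easily disagree by chance alone.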

— Can we standardize effect sizes to make results more compatible across studies? —

How do we fix the problems above?

Well, one approach that is intended to help is to try to make interventions and outcomes more comparable by using statistical tricks for “standardizing” outcomes, for instance, by converting the size of the effects to “Cohen’s d” scores.

Example: calculate how much people getting the intervention changed in the outcome of interest, subtract away the change in the control group, and then divide by the pooled standard deviation (from both groups). Use this as the measure of how well each intervention worked.

So rather than having one intervention’s effect measured in one unit (like fraction of people cured) and another measured in another unit (like reduction in symptoms on a 1-7 symptom scale), making these outcomes hard to compare, all the effect sizes now are measured as “number of standard deviations that the intervention group changed relative to the change in the control group.”
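The calculation described in the example can be sketched in a few lines. This is one reasonable reading of it, under the assumption that the pooled standard deviation is taken over each participant's change score; the function name and the numbers are hypothetical.

```python
import math
import statistics


def cohens_d_of_changes(treated_changes, control_changes):
    """Standardized effect: (mean treated change - mean control change) / pooled SD.

    Each list holds one change score (after minus before) per participant.
    Assumption: the pooled SD is computed from the change scores of both groups.
    """
    n_t, n_c = len(treated_changes), len(control_changes)
    var_t = statistics.variance(treated_changes)  # sample variance, treated group
    var_c = statistics.variance(control_changes)  # sample variance, control group
    pooled_sd = math.sqrt(((n_t - 1) * var_t + (n_c - 1) * var_c) / (n_t + n_c - 2))
    mean_difference = statistics.mean(treated_changes) - statistics.mean(control_changes)
    return mean_difference / pooled_sd


if __name__ == "__main__":
    # Made-up change scores, for illustration only.
    treated = [2, 3, 4, 5, 6]
    control = [1, 2, 3, 4, 5]
    print(round(cohens_d_of_changes(treated, control), 2))  # roughly 0.63
```

Note that the result carries no units: the same function could be fed changes in symptom scores or changes in clinic visits, which is exactly what makes the outputs look comparable.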

This certainly improves the situation of comparing two interventions, but only *somewhat*. It doesn’t address most of the issues mentioned in the list above, and those it does help are only partially addressed by such standardization.

Two Cohen’s d values that are the same can mean quite different things with respect to how “effective” an intervention really is. For instance, suppose that the outcome of one intervention was recorded as 0 for each participant not completely cured, and 1 for each participant completely cured, whereas the outcome for the second intervention was recorded as 0 when there was no improvement, and 1 when there was any improvement. Even if the interventions and populations are identical and the Cohen’s d for both is 0.50, this does not imply that the interventions are anywhere close to as useful as each other!

That being said, standardization is useful in that it at least somewhat reduces the problem of incomparability of outcomes.

— effect size doesn’t exist —

A potentially deeper and more disturbing point (one seemingly often overlooked) is that the “effect size” (i.e., strength of effect) of an intervention doesn’t really exist.

What does exist is the effect size of an intervention that’s been implemented in a particular way for a particular duration of implementation, and that’s been calculated using a particular choice of effect size measure, when applied to a particular population that’s being sampled using a particular sampling procedure.

Horrible to state, but its consequences are more horrible still.

That means: the “effect size of parent health education on child health” isn’t really a thing.

But: the “Cohen’s d of this particular 3-month health education intervention applied to randomly sampled families in a particular city using the outcome of medical clinic usage” does exist!

If we talk as though the “effect size of parent health education on child health” exists, we are either being sloppy, or optimistically hoping that all of those other details don’t matter much in terms of the effect size, or implicitly assuming some specific set of study attributes and procedures that we simply aren’t bothering to make explicit.

Now, insofar as the “effect size of parent health education on child health” doesn’t depend on the messy details of things like which health education intervention we’re talking about, which population of children we’re talking about, what duration of intervention we’re talking about, etc., it can be sensible to treat things as though there really is an “effect size” of that intervention. But really, how often do we expect that to be true?

— the average over what? —

This issue of the effect size not existing is actually an example of a more general mathematical problem, which is that if you say “this is the average of variable X”, a mathematician’s natural response might be, “The average of variable X taken over what probability distribution?” By that, they will mean that the average depends on the sampling procedure used. In other words, an average does not exist unless a sampling procedure is stated (or implicitly assumed).

Now, if someone says “the average age of people in this city”, the implied sampling procedure is probably choosing people uniformly at random from this city, getting the age of each such person, then taking an average of those, but if the way we attempted to collect ages was to stand on a particular street and ask people their age as the people passed, this would not be a uniformly random sampling procedure.

Presumably, there is some correlation between someone’s age and the chance that they bother to talk to you on that particular street about their age. This “average age” you calculate would be “the average age of people in this city who will stop to talk to me when I’m standing in this spot over this period” rather than “the average age of people in this city”.

Now, it might be the case that the sampling procedure used turns out to be a pretty good proxy for the average age overall (if one were to sample uniformly at random), but then again, it might not be (e.g., you might dramatically under-sample newborns who rarely approach you on the street).
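The street-corner example can be simulated. The sketch below is purely illustrative: the city's age distribution and the chance that a person of a given age stops to answer are invented assumptions, chosen only to show that the "average age" you get depends on the sampling procedure.

```python
import random
import statistics


def make_city(rng, size=10_000):
    """Hypothetical population: ages drawn uniformly from 0 to 90."""
    return [rng.randint(0, 90) for _ in range(size)]


def stop_probability(age):
    """Assumed (made-up) chance that a passerby of this age stops to answer."""
    if age < 5:
        return 0.0  # newborns and toddlers do not stop to chat
    if age < 18:
        return 0.1
    return 0.5


def uniform_sample_mean(city, rng, n=2000):
    """Average age when residents are chosen uniformly at random."""
    return statistics.mean(rng.sample(city, n))


def street_sample_mean(city, rng, n=2000):
    """Average age among passersby who actually stop and answer."""
    answered = []
    while len(answered) < n:
        age = rng.choice(city)
        if rng.random() < stop_probability(age):
            answered.append(age)
    return statistics.mean(answered)


if __name__ == "__main__":
    rng = random.Random(1)
    city = make_city(rng)
    print("uniform sampling:", round(uniform_sample_mean(city, rng), 1))
    print("street sampling: ", round(street_sample_mean(city, rng), 1))
```

Under these assumptions, the street-corner average comes out noticeably higher than the uniform one, because young people are under-sampled. Both numbers are legitimate averages; they are just averages over different distributions.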

— So…can we not generalize the results of studies? —

All of the above may sound very pessimistic. It may sound like doing a study on one intervention tells us almost nothing about future similar studies on similar interventions.

But what I’ll call the “five C’s of generalizability” should make us somewhat more optimistic:

(1) Constants – many psychological and biological features of humans transcend time and place (because they are hard-wired into us via genetics, or because the intervention is only planned to be used on a particular group of people that share a lot of common characteristics). Insofar as we’re working with these more basic mechanisms (or across homogeneous groups), we can expect a higher degree of generalization.

That being said, when we’re used to being embedded in one culture, it can sometimes be surprising what varies across cultures, so we have to be cautious about positing human universals.

(2) Causality – the more deeply we understand the causality underlying why an intervention works, the more we can predict when it will or won’t generalize. For instance, if we know it will only work where assumptions A, B, and C are true, then we can ask about the extent to which those assumptions hold in a particular time and place. So if we are in a situation where our model of the underlying causal factors predicts the intervention will work, we can be more confident that it really will.

That being said, causality can be very tricky to uncover.

(3) Consistency – if studies find consistent effects for an intervention across varied contexts, then we may not know quite how well it will work in a new context, but we at least have evidence that the effect is fairly robust, and may generalize across new contexts as well.

That being said, this only helps if we have found a fairly consistent pattern, and we still might get unlucky and find a new context where the intervention doesn’t actually work.

(4) Capability – if an intervention is found to be highly effective, it is more likely to generalize to other contexts. Even if variations in contexts reduce the effectiveness, hopefully, some of that very large effect will still remain.

That being said, it is certainly possible for something to work extremely well in one situation and fail totally in another.

(5) Circumstances – if the circumstance that an intervention was previously found to work in is very similar to the one that it is being applied to now (including the sort of people it’s being applied to, the precise intervention being applied, the way outcomes are measured, the surrounding culture, etc.), we can be more confident that it will work about as well as it did before. For instance, if you take an intervention that worked in one school district and apply that exact same intervention in the same way in a similar school district nearby, it is more likely to work as it did previously than if you apply a modified version of the intervention in a different country.

That being said, it is often too limiting to only apply interventions in similar contexts to those in which they were previously tested.

— empirical data —

So how well, empirically speaking, does an intervention study done in one context tend to generalize to another?

Eva Vivalt has done some really interesting work on this topic, with an emphasis on international and economic development (e.g., see this paper: https://bit.ly/2yWQpRG). Also see this chart from the paper (https://i.imgur.com/6SVoVnr.png), which shows the standardized effect sizes for many different intervention/outcome pairs, such as bed nets for malaria and conditional cash transfers for attendance rate. The chart can help you gain an intuitive feeling for the extent to which generalizability does or does not occur for studies that are ostensibly on the same intervention and outcome, and the paper gives a lot more detail.

— What do we do about inconsistent effects? —

As you can see from Eva’s chart, it is pretty frequently the case that one study of an intervention will find a positive effect on an outcome, while another study finds approximately no effect or even a negative effect.

To understand why this happened, we can return again to our list of reasons that studies don’t generalize. We can ask ourselves, was this due to…

(1) sampling error?

(2) questionable research practices?

Or perhaps differences in…

(3) quality of the implementation?

(4) the technique used?

(5) dosage?

(6) format?

(7) follow-up?

(8) measurement?

(9) pre-existing attributes?

(10) culture?

(11) environment?

(12) control group?

We can’t always tell what the causes of different results are, but at least we can TRY to figure it out by considering the different possibilities and attempting to home in on likely candidates.

What’s more, we can apply the “five C’s of generalizability”.

The more that our intervention relies on human *constants*, the more our model of the underlying *causality* predicts we’ll get generalizability in this case, the more we have *consistency* of past effects across varied situations, the more that past studies produced high *capability* of achieving large effects, and the more similar this situation is to the *circumstances* in which the effect worked in the past, the less of a problem we’ll tend to find with generalization.

— one of my favorite solutions for generalizability —

One of my favorite solutions to this problem of poor generalizability is different from any of those I’ve discussed above.

When feasible and not prohibitively expensive or difficult, I think it is great for intervention developers to study their exact intervention using an outcome they actually care about on a population that is the same as (or much like) the real one that the intervention will be applied to.

In other words, merging the study with the real-world deployment of the intervention.

This neatly sidesteps generalizability to a large degree, not by ensuring generalizability to new groups, but by testing your results on the group you most care about!

Then, as you “generalize” your intervention to new contexts or groups, if you keep collecting data of the right form as you deploy it more broadly, you may be able to detect whether its effectiveness is successfully generalizing as well, and even better, try to discover what is or isn’t causing it to work.

This piece was first written on July 19, 2018, and first appeared on my website on December 11, 2025.