research findings – Spencer Greenberg

Victims and Perpetrators Of Sexual Harassment and Sexual Assault: Gender Differences and Rates Of Victimization and Perpetration

Spencer — Wed, 05 May 2021 20:01:00 +0000

This piece was first written in 2017, and first appeared on my website on April 29, 2026.

This essay is one in a series examining sexual harassment and assault. Sexual harassment and assault occur appallingly often. We have an absurdly long way to go as a society to correct this problem.

To make progress on persistent problems, as this one clearly is, it’s often helpful to try to deeply understand the forces that drive and maintain the problem. As a small step in that direction, I ran a study surveying victims and perpetrators of sexual harassment and sexual assault, collecting quantitative and qualitative data. I’m summarizing the findings in a series of posts.

This data has helped me understand this problem a bit better, and I hope you find it does the same for you. Note that this study only scratches the surface of this very complex topic.

Warning: this post contains extensive discussion about and numerous accounts of sexual harassment and sexual assault.

I surveyed 574 people in the United States online, 52% of whom are female, using Positly.com, our participant recruitment platform. They answered multiple-choice questions about their experience (or lack thereof) as victims and/or perpetrators of sexual harassment and sexual assault, as well as a number of questions about their other characteristics. Additionally, 46% were also asked to give free-form written responses explaining some of their multiple-choice answers. The participants for this study leaned somewhat more liberal and younger than the broader United States population.

Since this is a self-report survey, people may be answering in a way that is socially desirable or otherwise obscuring the truth. That being said, the survey was 100% anonymous, and the survey takers were aware that it was anonymous, so they had nothing to lose from answering honestly. But still, it seems likely that the rates of being the victim and/or perpetrator of sexual harassment and sexual assault are underreported here.

PREVALENCE

Here are the high-level results regarding how frequently participants reported being victims and/or perpetrators:

Cat-calling (by/to a stranger): 32% of males and 81% of females reported being victims, 18% of males and 9% of females reported being perpetrators
Verbal sexual harassment: 25% of males and 62% of females reported being victims, 9% of males and 5% of females reported being perpetrators
Physical sexual harassment: 18% of males and 45% of females reported being victims, 4% of males and 4% of females reported being perpetrators
Unwanted persistent sexual advances: 33% of males and 59% of females reported being victims, 9% of males and 4% of females reported being perpetrators
Unwanted requests for sexual favors: 25% of males and 50% of females reported being victims, 6% of males and 4% of females reported being perpetrators
Unwanted sexual activity (which was continued after the perpetrator had reason to believe it was unwanted): 18% of males and 37% of females reported being victims, 8% of males and 4% of females reported being perpetrators

GENDER DIFFERENCES

Unsurprisingly, males are much more likely to be perpetrators than females, and females are much more likely to be victims than males.

We assigned each participant a score from 0 to 6 based on the number of types of sexual harassment or sexual assault they had experienced (“victim scores”), and another score from 0 to 6 based on the number of types of sexual harassment or sexual assault they had perpetrated (“perpetrator scores”).
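The scoring just described amounts to counting yes answers. Here is a minimal sketch of that counting in Python; the item keys and the example participant are hypothetical, since the post does not show the survey's actual data format.

```python
# Hypothetical keys for the six yes/no items; the real survey export is not shown in this post.
ITEMS = [
    "cat_calling",
    "verbal_harassment",
    "physical_harassment",
    "persistent_advances",
    "requests_for_favors",
    "unwanted_activity",
]


def score(answers: dict) -> int:
    """Count the yes answers across the six items, giving a score from 0 to 6."""
    return sum(1 for item in ITEMS if answers.get(item, False))


# One made-up participant, for illustration only.
victim_answers = {"cat_calling": True, "verbal_harassment": True, "unwanted_activity": False}
perpetrator_answers = {"cat_calling": True}

victim_score = score(victim_answers)
perpetrator_score = score(perpetrator_answers)
print(victim_score, perpetrator_score)
```

The same function is applied once to the victim questions and once to the perpetrator questions, so each participant ends up with two separate 0 to 6 scores.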

Victim scores: females averaged 3.35 yes answers versus 1.52 for males, which means that females reported about 2.2 times as many types of sexual harassment or sexual assault victimization as males did, a statistically significant result (p<0.01).

Perpetrator scores: males averaged 0.54 yes answers versus 0.30 for females, which means males reported about 1.8 times as many types of sexually harassing or sexually assaulting behavior as females did, a statistically significant result (p<0.01).

51% of females reported having experienced 4 or more of the 6 items in the list above as victims, vs. 19% of males.

RELATIVE FREQUENCIES OF REPORTS OF BEING A VICTIM AND PERPETRATOR

Far more people report being victims than report being perpetrators.

If we look at all participants in the study, they reported having experienced 2.4 of these types of events on average as victims, and only 0.44 of these types of events as perpetrators, meaning that people reported 5.5 times more of these events occurring from the victim perspective than the perpetrator perspective.
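For readers who want to check the arithmetic, the ratios quoted here and in the gender differences section follow directly from the reported averages. This sketch uses only the rounded numbers from the text, so the results are approximate; the variable names are my own.

```python
# Rounded averages as reported in the text.
female_victim_mean, male_victim_mean = 3.35, 1.52
male_perp_mean, female_perp_mean = 0.54, 0.30
overall_victim_mean, overall_perp_mean = 2.4, 0.44

# Females vs. males on victim scores (reported as about 2.2 times).
victim_gender_ratio = female_victim_mean / male_victim_mean

# Males vs. females on perpetrator scores (reported as about 1.8 times).
perp_gender_ratio = male_perp_mean / female_perp_mean

# Victim reports vs. perpetrator reports across all participants (reported as about 5.5 times).
victim_to_perp_ratio = overall_victim_mean / overall_perp_mean

print(round(victim_gender_ratio, 1), round(perp_gender_ratio, 1), round(victim_to_perp_ratio, 1))
```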

Another way to look at this is that while 70% of participants in the study reported having experienced 1 or more of these events as a victim, only 25% admit to having been the perpetrator in one or more of these types of events. Similarly, while 55% say they have experienced 2 or more of these types of events as victims, only 10% say they have perpetrated 2 or more. The gender breakdown is that 20% of females say they have perpetrated one or more, vs. 30% of males, and 6% of females say they have perpetrated 2 or more, vs. 14% of males.

What to make of the number of reports of being victims of these events being so dramatically larger than the number of reports of being perpetrators of these events?

One possible interpretation is that victims are being much more honest than perpetrators, and many of the actual perpetrators are simply lying about things they’ve done. Another possibility is that perpetrators are not purposely lying, but classifying their actions differently than the victims (i.e., the victim views what happened as sexual harassment, but the perpetrator doesn’t view it as sexual harassment). It would make a self-interested kind of sense that perpetrators wouldn’t want to admit even to themselves that their past actions were bad. A third possible interpretation is that a much wider range of people are victims than are perpetrators (e.g., the repeat offenders produce many victims, raising the number of victims far above the number of perpetrators). This is consistent with other data that suggests that offending once is a strong risk factor for offending again. A fourth possibility is that victims are much more likely to readily remember their experiences of being victims than their perpetrators are to remember their experiences of being perpetrators. The reality may be a mix of these.

CONCLUSION

So, what do we learn here? It seems that it’s common to be a victim and also common (though less so) to be a perpetrator of sexual harassment and/or assault. This is also an inherently gendered issue, with very different rates of victimization and perpetration when dividing by gender.

— APPENDIX —

DEMOGRAPHICS

The demographics of the sample leaned younger than the broader U.S. population (mean age 37, median 35, with 50% of people within the range 29 to 44). The population was also more liberal than the broader U.S. (especially more socially liberal, but also more economically liberal). The median and also the most common education level of participants was a bachelor’s degree. The mean household income was $62,000, with a median of $50,000 (a little less than the U.S. household median of about $60,000). Please keep the above demographics in mind when considering the results, as they may have been different for a different population.

—

If you’re curious exactly how the questions were asked about whether someone was a victim or perpetrator for each of the 6 categories mentioned above, here are the wordings. Each of these was asked as a “yes/no” question, and the number of yes’s on the victim questions gave each person their 0-6 victimization score, whereas the number of yes’s on the perpetrator questions gave each person their 0-6 perpetration score.

Cat-calling

“Have you ever been cat-called by a person you didn’t know (i.e., had a stranger make a sexual whistle, shout, or comment of a sexual nature towards you)?”
“Have you ever cat-called a person that you don’t know (by making a sexual whistle, shout, or comment of a sexual nature)?”

Verbal sexual harassment

“Have you ever had a person verbally harass you in a manner that was sexual in nature?”
“Have you ever verbally harassed another person in a manner that was sexual in nature?”

Physical sexual harassment

“Have you ever had a person physically harass you in a manner that was sexual in nature?”
“Have you ever physically harassed another person in a manner that was sexual in nature?”

Unwanted persistent sexual advances

“Have you ever had a person make unwanted sexual advances towards you that they continued to make even though they had reason to believe their sexual advances were unwanted?”
“Have you ever made unwanted sexual advances towards a person that you continued to make even though you believed the sexual advances were unwanted?”

Unwanted requests for sexual favors

“Have you ever had a person make unwanted requests for sexual favors from you that they continued to make even though they had reason to believe their requests were unwanted?”
“Have you ever made unwanted requests for sexual favors that you continued to make even though you believed the requests were unwanted?”

Unwanted sexual activity

“Have you ever had a person engage in sexual activity with you when they had reason to doubt whether you actually wanted to engage in that sexual activity, but they continued anyway?”
“Have you ever had sexual activity with a person when you had reason to doubt whether the person actually wanted to engage in that sexual activity, but you continued anyway?”

Do The Findings Of A Study Conducted In One Place Generalize To Other Places?

Spencer — Thu, 19 Jul 2018 23:42:00 +0000

Do the results of studies generalize to new situations?

For instance, suppose a study is conducted on a technique or intervention (e.g., providing health education to parents) and the study finds it to be effective for a particular outcome (e.g., improving the health of children). When the next study is conducted on (what appears to be) the same intervention and outcome, should we expect that study to ALSO find the intervention to be effective?

There are a lot of reasons why it may NOT:

(1) sampling error – studies are almost always done on only a subset of the population of interest, which creates random variation from study to study. For instance, it could have just been a fluke that this particular group of children happened to have their health improve more than the control group of children. So the two results might differ purely by chance. Fortunately, statistical methods allow us to analyze how likely this is as an explanation, and the larger the sample we test the technique on, the lower the chance of this happening.

(2) questionable research practices – the first study’s apparent positive result may have been a false positive produced largely by bad research practices, such as analyzing many different outcomes and reporting only the one that seemed to work. If the study and its analysis plan are publicly pre-registered before the study is conducted, then this problem can be mostly avoided. In that case, researchers won’t be able to fool themselves about their original intention, and others can also hold them accountable to their original plan. On the other hand, it can be very useful to explore questions about data that you had not thought to explore before you looked at the data, or to engage in open-ended analyses where you aren’t certain what your hypothesis should be. But doing so comes with a greater risk of false positives.

(3) quality – the quality of the intervention and competence of the implementation may vary. For instance, if the first educational intervention studied was well designed, but the second one taught the information poorly or taught information that wasn’t actually useful, then it would hardly be surprising if the first succeeded and the second failed. More generally, a sufficiently bad implementation of any intervention will always fail! So an intervention of a certain type failing is NOT good evidence for the type of intervention not working more generally, unless a low-quality implementation is unlikely (or future implementors are unlikely to be able to create a more competent implementation). If the intervention went through multiple rounds of iterative improvement and user feedback before the study was conducted, this will help rule out (but not eliminate) low quality as the explanation for a failure.

(4) technique – the two interventions could have applied different techniques in order to produce change, despite very similar-sounding names and equal levels of quality. For instance, the first intervention might have focused more on the health benefits of purifying water, while the second might have focused more on the value of hand washing. These are both “water-related health education interventions for parents”, which sound extremely similar, but one may be much more effective than the other.

(5) dosage – the two interventions could have been administered in different amounts. For instance, the first might have been a 1-week course with 6 hours of education per day, and the second a 3-day course with 2 hours of education per day. Since the former in that case has a much higher dose, it would be little surprise if it had the potential to be much more effective (though it also might be substantially more costly to implement).

(6) format – the interventions could have used the same technique but delivered it using a different mechanism. For instance, the interventions might have been teaching the same information for the same duration, but the first might have been an in-person training in groups, whereas the second could have been an online training done individually.

(7) follow-up – the two studies could differ in how long they waited to collect the outcomes. For instance, the first might have looked at health outcomes 6 months later, whereas the second might have looked at health outcomes 3 weeks later. It’s possible that 3 weeks is too soon to find effects, or that the effects only last for a few months, so 6 months is too late to find them!

(8) measurement – different methods of measuring the outcomes could have been used. For instance, the first study might have measured health by looking at children’s medical usage records at nearby clinics, whereas the second might have asked parents to self-report the health of their children. These measurements may produce results that contradict each other even if every other detail of the studies is identical, simply because they are measuring different (though related) things.

(9) pre-existing attributes – the two populations studied might have a different distribution of relevant characteristics. For instance, in the first study, the children might have had a higher level of pre-existing illness than in the second study, making it easier to improve health outcomes. If the children are already really healthy, it’s going to be tough to produce a large effect on health. But on the flip side, if the children are already extremely ill, in some cases, it may be hard to help them recover compared to helping less sick people.

(10) culture – the two populations might differ culturally or in their personalities, language, attitudes, or beliefs. For instance, in the first study, the local culture may have made people more interested in the learning material, whereas in the second study, the local culture may have made people less trusting of the information because it contradicts traditionally held beliefs.

(11) environment – the surrounding environment that the populations exist in, or the resources already available, might have differed in the two studies. For instance, the first study group might live in an area where health resources are scarce, whereas the second group might live in one where alternative health resources are more abundant, making it harder to add value for the latter population. Or, for instance, in one area, certain easily treated childhood diseases might be common, whereas in the other region, the common childhood diseases might be harder to treat.

(12) control group – oftentimes, studies use control groups to help determine whether an intervention caused a change beyond what would have otherwise happened, but the choice of control can differ across studies. For instance, the first study might have used a wait-list control where people get nothing for a while, whereas the second group might have received an educational intervention as a control that is not expected to produce health improvements (i.e., an “active control”). Or it could be that the first study used statistical control after the fact (e.g., using linear regression), whereas the second study randomized some of the study participants to a control group. The former method is typically weaker in that it may not fully account for confounding variables, but in some instances, the latter method may be weaker if the size of the control group is very small.
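Reasons (1) and (2) in the list above lend themselves to a quick numerical illustration. The sketch below is a toy simulation, not an analysis of any real study: it assumes normally distributed outcomes with a standard deviation of 1, a made-up true effect of 0.3, and (for the second function) outcomes that are independent of one another. All function names are my own.

```python
import random
import statistics


def simulate_study(true_effect, n_per_group, rng):
    """Return the observed difference in means for one simulated two-arm study."""
    treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    return statistics.fmean(treated) - statistics.fmean(control)


def spread_of_estimates(true_effect, n_per_group, n_studies, seed=0):
    """Standard deviation of the estimated effect across many identical studies."""
    rng = random.Random(seed)
    estimates = [simulate_study(true_effect, n_per_group, rng) for _ in range(n_studies)]
    return statistics.stdev(estimates)


def chance_of_any_false_positive(n_outcomes, alpha=0.05):
    """Chance that at least one of n independent null outcomes reaches p < alpha."""
    return 1 - (1 - alpha) ** n_outcomes


# Sampling error: same true effect, different sample sizes.
small_study_spread = spread_of_estimates(0.3, n_per_group=20, n_studies=500)
large_study_spread = spread_of_estimates(0.3, n_per_group=500, n_studies=500)
print(small_study_spread, large_study_spread)

# Questionable research practices: testing many outcomes and reporting the best one.
print(chance_of_any_false_positive(1), chance_of_any_false_positive(20))
```

With small groups, the estimated effect swings widely from one simulated study to the next even though the true effect never changes, and the swings shrink as the groups grow. Separately, if a researcher checks 20 independent outcomes that truly have no effect, the chance that at least one looks "significant" at p < 0.05 is well over half.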

— Can we standardize effect sizes to make results more compatible across studies? —

How do we fix the problems above?

Well, one approach that is intended to help is to try to make interventions and outcomes more comparable by using statistical tricks for “standardizing” outcomes, for instance, by converting the size of the effects to “Cohen’s d” scores.

Example: calculate how much people getting the intervention changed in the outcome of interest, subtract away the change in the control group, and then divide by the pooled standard deviation (from both groups). Use this as the measure of how well each intervention worked.

So rather than having one intervention’s effect measured in one unit (like fraction of people cured) and another measured in another unit (like reduction in symptoms on a 1-7 symptom scale), making these outcomes hard to compare, all the effect sizes now are measured as “number of standard deviations that the intervention group changed relative to the change in the control group.”
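As a rough sketch, the calculation described above could look like this in Python. I am assuming here that the pooled standard deviation is taken over each group's change scores (other conventions exist, such as pooling baseline standard deviations), and the data are made up purely for illustration.

```python
import math
import statistics


def cohens_d(treatment_changes, control_changes):
    """Difference in mean change between groups, divided by the pooled standard deviation."""
    n_t, n_c = len(treatment_changes), len(control_changes)
    var_t = statistics.variance(treatment_changes)  # sample variance (n - 1 denominator)
    var_c = statistics.variance(control_changes)
    # Assumption: pool the standard deviations of the change scores from both groups.
    pooled_sd = math.sqrt(((n_t - 1) * var_t + (n_c - 1) * var_c) / (n_t + n_c - 2))
    mean_difference = statistics.fmean(treatment_changes) - statistics.fmean(control_changes)
    return mean_difference / pooled_sd


# Made-up change scores (after minus before) for each participant.
treatment_changes = [2.0, 3.0, 4.0, 5.0, 6.0]
control_changes = [1.0, 2.0, 3.0, 4.0, 5.0]

d = cohens_d(treatment_changes, control_changes)
print(round(d, 2))
```

Because the result is expressed in standard deviations, the same function can be applied to outcomes recorded in entirely different units.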

This certainly improves the situation of comparing two interventions, but only *somewhat*. It doesn’t address most of the issues mentioned in the list above, and those it does help are only partially addressed by such standardization.

Two Cohen’s d values that are the same can mean quite different things with respect to how “effective” an intervention really is. For instance, suppose that the outcome of one intervention was recorded as 0 for each participant not completely cured, and 1 for each participant completely cured, whereas the outcome for the second intervention was recorded as 0 when there was no improvement, and 1 when there was any improvement. Even if the interventions and populations are identical and the Cohen’s d for both is 0.50, this does not imply that the interventions are anywhere close to as useful as each other!

That being said, standardization is useful in that it at least somewhat reduces the problem of incomparability of outcomes.

— effect size doesn’t exist —

A potentially deeper and more disturbing point (one seemingly often overlooked) is that the “effect size” (i.e., strength of effect) of an intervention doesn’t really exist.

What does exist is the effect size of an intervention that’s been implemented in a particular way for a particular duration of implementation, and that’s been calculated using a particular choice of effect size measure, when applied to a particular population that’s being sampled using a particular sampling procedure.

Horrible to state, but its consequences are more horrible still.

That means: the “effect size of parent health education on child health” isn’t really a thing.

But: the “Cohen’s d of this particular 3-month health education intervention applied to randomly sampled families in a particular city using the outcome of medical clinic usage” does exist!

If we talk as though the “effect size of parent health education on child health” exists, we are either being sloppy, or optimistically hoping that all of those other details don’t matter much in terms of the effect size, or implicitly assuming some specific set of study attributes and procedures that we simply aren’t bothering to make explicit.

Now, insofar as the “effect size of parent health education on child health” doesn’t depend on the messy details of things like which health education intervention we’re talking about, which population of children we’re talking about, what duration of intervention we’re talking about, and so on, it can be sensible to treat things as though there really is an “effect size” of that intervention. But really, how often do we expect that to be true?

— the average over what? —

This issue of the effect size not existing is actually an example of a more general mathematical problem, which is that if you say “this is the average of variable X”, a mathematician’s natural response might be, “The average of variable X taken over what probability distribution?” By that, they will mean that the average depends on the sampling procedure used. In other words, an average does not exist unless a sampling procedure is stated (or implicitly assumed).

Now, if someone says “the average age of people in this city”, the implied sampling procedure is probably choosing people uniformly at random from this city, getting the age of each such person, then taking an average of those. But if the way we attempted to collect ages was to stand on a particular street and ask people their age as they passed, this would not be a uniformly random sampling procedure.

Presumably, there is some correlation between someone’s age and the chance that they bother to talk to you on that particular street about their age. This “average age” you calculate would be “the average age of people in this city who will stop to talk to me when I’m standing in this spot over this period” rather than “the average age of people in this city.”

Now, it might be the case that the sampling procedure used turns out to be a pretty good proxy for the average age overall (if one were to sample uniformly at random), but then again, it might not be (e.g., you might dramatically under-sample newborns who rarely approach you on the street).
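To make the street example concrete, here is a small sketch comparing the two averages. The ages and the chances of stopping to talk are invented for illustration; the only point is that the same population yields different averages under different sampling procedures.

```python
# Made-up population: (age, chance this person stops to answer on the street).
people = [
    (1, 0.0),   # a newborn never stops to talk
    (8, 0.1),
    (25, 0.5),
    (40, 0.4),
    (67, 0.3),
    (83, 0.2),
]

# Sampling uniformly at random: every person counts equally.
uniform_mean = sum(age for age, _ in people) / len(people)

# Street sampling: each person counts in proportion to their chance of responding.
total_weight = sum(p for _, p in people)
street_mean = sum(age * p for age, p in people) / total_weight

print(round(uniform_mean, 1), round(street_mean, 1))
```

In this toy population, the street procedure gives a noticeably higher average age than uniform sampling, mostly because the youngest people almost never respond.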

— So…can we not generalize the results of studies? —

All of the above may sound very pessimistic. It may sound like doing a study on one intervention tells us almost nothing about future similar studies on similar interventions.

But what I’ll call the “five C’s of generalizability” should make us somewhat more optimistic:

(1) Constants – many psychological and biological features of humans transcend time and place (because they are hard-wired into us via genetics, or because the intervention is only planned to be used on a particular group of people that share a lot of common characteristics). Insofar as we’re working with these more basic mechanisms (or across homogeneous groups), we can expect a higher degree of generalization.

That being said, when we’re used to being embedded in one culture, it can sometimes be surprising what varies across cultures, so we have to be cautious about positing human universals.

(2) Causality – the more deeply we understand the causality underlying why an intervention works, the more we can predict when it will or won’t generalize. For instance, if we know it will only work where assumptions A, B, and C are true, then we can ask about the extent to which those assumptions hold in a particular time and place. So if we are in a situation where our model of the underlying causal factors predicts the intervention will work, we can be more confident that it really will.

That being said, causality can be very tricky to uncover.

(3) Consistency – if studies find consistent effects for an intervention across varied contexts, then we may not know quite how well it will work in a new context, but we at least have evidence that the effect is fairly robust, and may generalize across new contexts as well.

That being said, this only helps if we have found a fairly consistent pattern, and we still might get unlucky and find a new context where the intervention doesn’t actually work.

(4) Capability – if an intervention is found to be highly effective, it is more likely to generalize to other contexts. Even if variations in contexts reduce the effectiveness, hopefully, some of that very large effect will still remain.

That being said, it is certainly possible for something to work extremely well in one situation and fail totally in another.

(5) Circumstances – If the circumstance that an intervention was found to work in previously is very similar to the one that it is being applied to now (including the sort of people it’s being applied to, the precise intervention being applied, the way outcomes are measured, the surrounding culture, etc.), we can be more confident that it will work about as well as the prior one. For instance, if you take an intervention that worked in one school district and apply that exact same intervention in the same way in a similar school district nearby, it is more likely to work as it did previously than if you apply a modified version of the intervention in a different country.

That being said, it is often too limiting to only apply interventions in similar contexts to those in which they were previously tested.

— empirical data —

So how well, empirically speaking, does an intervention study done in one context tend to generalize to another?

Eva Vivalt has done some really interesting work on this topic, with an emphasis on international and economic development (e.g., see this paper: https://bit.ly/2yWQpRG). Also see this chart from the paper (https://i.imgur.com/6SVoVnr.png), which shows the standardized effect sizes for many different intervention/outcome pairs, such as bed nets for malaria and conditional cash transfers for attendance rate. The chart can help you gain an intuitive feeling for the extent to which generalizability does or does not occur for studies that are ostensibly on the same intervention and outcome, and the paper gives a lot more detail.

— What do we do about inconsistent effects? —

As you can see from Eva’s chart, it is pretty frequently the case that one study of an intervention will find a positive effect on an outcome, while another study finds approximately no effect or even a negative effect.

To understand why this happened, we can return again to our list of reasons that studies don’t generalize. We can ask ourselves, was this due to…

(1) sampling error?

(2) questionable research practices?

Or perhaps differences in…

(3) quality of the implementation?

(4) the technique used?

(5) dosage?

(6) format?

(7) follow-up?

(8) measurement?

(9) pre-existing attributes?

(10) culture?

(11) environment?

(12) control group?

We can’t always tell what the causes of different results are, but at least we can TRY to figure it out by considering the different possibilities and attempting to home in on likely candidates.

What’s more, we can apply the “five C’s of generalizability”.

The more that our intervention relies on human *constants*, the more our model of the underlying *causality* predicts we’ll get generalizability in this case, the more we have *consistency* of past effects across varied situations, the more that past studies produced high *capability* of achieving large effects, and the more similar this situation is to the *circumstances* where the effect worked in the past, the less of a problem we’ll tend to find with generalization.

— one of my favorite solutions for generalizability —

One of my favorite solutions to this problem of poor generalizability is different from any of those I’ve discussed above.

When feasible and not prohibitively expensive or difficult, I think it is great for intervention developers to study their exact intervention using an outcome they actually care about on a population that is the same as (or much like) the real one that the intervention will be applied to.

In other words, merging the study with the real-world deployment of the intervention.

This neatly sidesteps generalizability to a large degree, not by ensuring generalizability to new groups, but by testing your results on the group you most care about!

Then, as you “generalize” your intervention to new contexts or groups, if you keep collecting data of the right form as you deploy it more broadly, you may be able to detect whether its effectiveness is successfully generalizing as well, and even better, try to discover what is or isn’t causing it to work.

This piece was first written on July 19, 2018, and first appeared on my website on December 11, 2025.