# Moderation with Analysis of Variance (ANOVA) {#anova}

> Key concepts: eta-squared, between-groups variance, within-groups variance, *F* test on analysis of variance model, pairwise comparisons, post-hoc tests, one-way analysis of variance, two-way analysis of variance, balanced design, main effects, moderation, interaction effect.

Watch this micro lecture on moderation with analysis of variance for an overview of the chapter.

```{r, echo=FALSE, out.width="640px", fig.pos='H', fig.align='center', dev="png", screenshot.opts = list(delay = 5)}
knitr::include_url("https://www.youtube.com/embed/klV2FFgH9OU", height = "360px")
```

### Summary {.unnumbered}

```{block2, type='rmdimportant'}
How do we test mean differences for three or more groups and what if group effects are not the same for all participants?
```

Imagine an experiment in which participants watch a video promoting a charity. They see George Clooney, Angelina Jolie, or no celebrity endorse the charity's fund-raiser. Afterwards, their willingness to donate to the charity is measured. Which campaign works best, that is, produces the highest average willingness to donate? Or does one campaign work better for females, another for males?

In this example, we want to compare the outcome scores (average willingness to donate) across more than two groups (participants who saw Clooney, Jolie, or no celebrity). To this end, we use analysis of variance. The null hypothesis tested in analysis of variance states that all groups have the same average outcome score in the population.

This null hypothesis is similar to the one we test in an independent-samples *t* test for two groups. With three or more groups, we must use the variance of the group means (between-groups variance) to test the null hypothesis. If the between-groups variance is zero, all group means are equal.

In addition to between-groups variance, we have to take into account the variance of outcome scores within groups (within-groups variance). Within-groups variance is related to the fact that we may obtain different group means even if we draw random samples from populations with the same means. The ratio of between-groups variance over within-groups variance gives us the *F* test statistic, which has an *F* distribution.

Differences in average outcome scores for groups on one independent variable (usually called *factor* in analysis of variance) are called a main effect. A main effect represents an overall or average effect of a factor. If we have only one factor in our model, for instance, the endorser of the fund-raiser, we apply a one-way analysis of variance. With two factors, we have a two-way analysis of variance, and so on.

With two or more factors, we can have interaction effects in addition to main effects. An interaction effect is the joint effect of two or more factors on the dependent variable. An interaction effect is best understood as different effects of one factor across different groups on another factor. For example, Clooney may increase willingness to donate among females but Jolie works best for males.

The phenomenon that a variable can have different effects for different groups on another variable is called moderation. We usually think of one factor as the predictor (or independent variable) and the other factor as the moderator. The moderator (e.g., sex) changes the effect of the predictor (e.g., celebrity endorser) on the dependent variable (e.g., willingness to donate).
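As a rough illustration of moderation, the sketch below computes the effect of sex (female mean minus male mean) separately for each endorser. The twelve scores are invented for this example and are not the data behind the figures in this chapter; the variable names are also assumptions.

```r
# Invented scores: 3 endorser conditions x 2 sexes, 2 participants per cell.
d <- data.frame(
  endorser = factor(rep(c("Nobody", "Clooney", "Jolie"), each = 4),
                    levels = c("Nobody", "Clooney", "Jolie")),
  sex = factor(rep(c("female", "female", "male", "male"), times = 3)),
  willingness = c(4, 5, 3, 4,   8, 9, 4, 5,   6, 7, 7, 8)
)

# Average willingness for each combination of endorser and sex.
cell_means <- tapply(d$willingness, list(d$endorser, d$sex), mean)

# Effect of sex within each endorser condition.
# If these differences are not equal, sex moderates the endorser effect.
sex_effect <- cell_means[, "female"] - cell_means[, "male"]
sex_effect

# Two-way analysis of variance with main effects and the interaction effect.
fit <- aov(willingness ~ endorser * sex, data = d)
summary(fit)
```

In these invented data, the difference between females and males varies across endorsers, which is the pattern that an interaction effect captures.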

#### Essential Analytics {.unnumbered}

In SPSS, we use the *One-Way ANOVA* option in the *Compare Means* submenu for one-way analysis of variance and the *Univariate* option in the *General Linear Model* submenu for two-way analysis of variance.

```{r ANOVAtable, echo=FALSE, out.width="60%", fig.pos='H', fig.align='center', fig.cap="SPSS table of main and interaction effects in a two-way analysis of variance."}
knitr::include_graphics("figures/S7_AE1.png")
```

The significance tests on the main effects and interaction effect are reported in the **Tests of Between-Subjects Effects** table. Figure \@ref(fig:ANOVAtable) offers an example. The tests on the main effects are in the red box and the green box contains the test on the interaction effect. The APA-style summary of the main effect of endorser is: *F* (2, 137) = 8.43, *p* \< .001, eta^2^ = .10. Note the two degrees of freedom in between the brackets, which are marked by a blue ellipse in the figure. You get the effect size eta^2^ by dividing the sum of squares of an effect by the corrected total sum of squares (in purple ellipses in the figure): 37.456 / 389.566 = 0.10.
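The arithmetic for eta^2^ and the reported *p* value can be checked with a few lines of R. This is only a sketch: the numbers are copied from the text above and the object names are made up for the example.

```r
# Values reported in the text for the main effect of endorser.
ss_endorser <- 37.456          # sum of squares of the effect
ss_corrected_total <- 389.566  # corrected total sum of squares

# Effect size: proportion of the total sum of squares due to the effect.
eta_squared <- ss_endorser / ss_corrected_total
round(eta_squared, 2)

# p value for F(2, 137) = 8.43: probability of an F value at least this large.
p_value <- pf(8.43, df1 = 2, df2 = 137, lower.tail = FALSE)
p_value < .001
```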

Interpret the effects by comparing mean scores on the dependent variable among groups:

1. If there are two groups on a factor, for example, females and males, compare the two group means: Which group scores higher? For example, females score on average 5.05 on willingness to donate whereas the average willingness is only 4.19 for males. The _F_ test shows whether or not the difference between the two groups is statistically significant.

2. If a factor has more than two groups, for example, Jolie, Clooney, and no celebrity endorser, use post-hoc comparisons with Bonferroni correction. The results tell you which group scores on average higher than another group and whether the difference is statistically significant if we correct for capitalization on chance.

3. If you want to interpret an interaction effect, create means plots such as Figure \@ref(fig:ANOVAplots). Compare the differences between means across groups. In the left panel, for example, we see that the effect of sex on willingness to donate (the difference between the mean score of females and the mean score of males) is larger for Clooney (pink box in the middle) than for no celebrity endorser (pink box on the left), and it is smallest for Angelina Jolie (pink box on the right). Similarly, we see that the effect of seeing Clooney instead of no celebrity endorser is larger for females (right-hand panel, pink box on the right) than for males (right-hand panel, pink box on the left).

```{r ANOVAplots, echo=FALSE, out.width="100%", fig.pos='H', fig.align='center', fig.cap="SPSS means plots of the interaction effect of sex and endorser on willingness to donate. Note that the pink boxes have been added manually to aid the interpretation."}
knitr::include_graphics("figures/S7_AE2.png")
```
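The post-hoc comparisons in step 2 can also be sketched in R, assuming a one-way design. The scores below are invented for illustration. `pairwise.t.test()` from base R uses a pooled standard deviation by default, so its results need not match SPSS output in every detail.

```r
# Invented scores for 12 participants, 4 per experimental condition.
willingness <- c(2, 3, 3, 4,   5, 6, 6, 7,   6, 7, 7, 8)
endorser <- factor(rep(c("Nobody", "Clooney", "Jolie"), each = 4),
                   levels = c("Nobody", "Clooney", "Jolie"))

# All pairwise t tests with Bonferroni-corrected p values.
posthoc <- pairwise.t.test(willingness, endorser,
                           p.adjust.method = "bonferroni")
posthoc$p.value
```

Each cell of the resulting matrix holds the corrected *p* value for one pair of groups; pairs that appear twice or compare a group with itself are shown as `NA`.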

## Different Means for Three or More Groups

Celebrity endorsement theory states that celebrities who publicly state that they favour a product, candidate, or cause, help to persuade consumers to adopt or support the product, candidate, or cause [for a review, see @RefWorks:3940; for an alternative approach, see @RefWorks:3941].

Imagine that we want to test whether the celebrity who endorses a fund-raiser in a fund-raising campaign makes a difference to people's willingness to donate. We will be using the celebrities George Clooney and Angelina Jolie, and we will compare campaigns with one of them to a campaign without celebrity endorsement.

```{r clooneyjolie, echo=FALSE, out.width="50%", fig.pos='H', fig.align='center', fig.cap="George Clooney and Angelina Jolie. Photo Clooney by Angela George [CC BY-SA 3.0](https://upload.wikimedia.org/wikipedia/commons/3/32/GeorgeClooneyHWoFJan12.jpg). Photo Jolie by Foreign and Commonwealth Office [CC BY 2.0](https://upload.wikimedia.org/wikipedia/commons/a/ad/Angelina_Jolie_2_June_2014_%28cropped%29.jpg), via Wikimedia Commons."}
# Include portraits: Clooney, Jolie.
knitr::include_graphics("figures/ClooneyJolie.png")
```

Let us design an experiment to investigate the effects of celebrity endorsement. We sample a number of people (participants), whom we assign randomly to one of three groups. We show a campaign video with George Clooney to one group, a video with Angelina Jolie to another group, and the third group---the control group---sees a campaign video without celebrity endorsement. So we have three experimental conditions (Clooney, Jolie, no endorser) as our independent variable.

Our dependent variable is a numeric variable assessing the participant's willingness to donate to the fund-raiser on a scale from 1 ("absolutely certain that I will not donate") to 10 ("absolutely certain that I will donate"). We will compare the average outcome scores among groups. If groups with Clooney or Jolie as endorser have systematically higher average willingness to donate than the group without celebrity endorsement, we conclude that celebrity endorsement has a positive effect.

In statistical terminology, we have a categorical independent (or predictor) variable and a numerical dependent variable. In experiments, we usually have a very limited set of treatment levels, so our independent variable is categorical. For nuanced results, we usually want to have a numeric dependent variable. Analysis of variance was developed for this kind of data [@RefWorks:3955], so it is widely used in the context of experiments.

### Mean differences as effects {#anova-meandiffs}

Figure \@ref(fig:anova-means) shows the willingness to donate scores for twelve participants in our experiment. Four participants saw Clooney, four saw Jolie, and four did not see a celebrity endorser in the video that they watched.

```{r anova-means, fig.pos='H', fig.align='center', fig.cap="How do group means relate to effect size?", echo=FALSE, screenshot.opts = list(delay = 5), dev="png", out.width="540px"}
# Goal: Illustrate that differences between group means represent effects (effect size given by eta^2).
# Generate 4 random observations with mean 6.4 (Clooney), 4 observations with mean 6.8 (Jolie), and 4 observations with mean 3.3 (no endorser). Use colour for the treatment factor (3 levels). Represent observations in a dotplot, each with a separate value on the x axis, clustered by factor level (experimental condition). Display group means as horizontal line segments (coloured by factor level). Add vertical double-sided arrows between each pair of group means to illustrate group differences. Display eta^2 for the data. Allow user to change the group means and update the plot, mean (difference) lines, and eta^2.
knitr::include_app("http://82.196.4.233:3838/apps/anova-means/", height="490px")
```

<A name="question7.1.1"></A>

```{block2, type='rmdquestion'}
1. In the sample of (12) participants displayed as dots in Figure \@ref(fig:anova-means), what do the double-sided vertical arrows represent? [<img src="icons/2answer.png" width=115px align="right">](#answer7.1.1)
```

<A name="question7.1.2"></A>

```{block2, type='rmdquestion'}
2. According to this figure, does a celebrity endorser matter to the willingness to donate? Explain your answer. [<img src="icons/2answer.png" width=115px align="right">](#answer7.1.2)
```

<A name="question7.1.3"></A>

```{block2, type='rmdquestion'}
3. How do the double-sided vertical arrows relate to effect size (eta^2^)? Change the group means (and update the graph) and explain what you see. [<img src="icons/2answer.png" width=115px align="right">](#answer7.1.3)
```

A group's average score on the dependent variable represents the group's score level. The group averages in Figure \@ref(fig:anova-means) tell us for which celebrity the average willingness to donate is higher and for which situation it is lower.

Random assignment of test participants to experimental groups (e.g., which video is shown) creates groups that are in principle equal on all imaginable characteristics except the experimental treatment(s) administered by the researcher. Participants who see Clooney should have more or less the same average age, knowledge, and so on as participants who see Jolie or no celebrity. After all, each experimental group is just a random sample of participants.

If random assignment was done successfully, differences between group means can only be caused by the experimental treatment (we will discuss this in more detail in Chapter \@ref(confounder)). Mean differences are said to represent the *effect* of experimental treatment in analysis of variance.

Analysis of variance was developed for the analysis of randomized experiments, where effects can be interpreted as causal effects. Note, however, that analysis of variance can also be applied to non-experimental data. Although mean differences are still called effects in the latter type of analysis, these do not have to be causal effects.

In analysis of variance, then, we are simply interested in differences between group means. The conclusion for a sample is easy: Which groups have a higher average score on the dependent variable and which groups have a lower average score? A means plot, such as Figure \@ref(fig:anova-meansplot), aids interpretation and helps to communicate results to the reader. On average, participants who saw Clooney or Jolie have higher willingness to donate than participants who did not see a celebrity endorser.

```{r anova-meansplot, echo=FALSE, fig.pos='H', fig.align='center', fig.cap="A means plot showing that average willingness to donate is higher with a celebrity endorser than without a celebrity endorser. As a reading instruction, effects of endorsers are represented by arrows.", fig.asp=0.6, out.width="70%"}
# Insert means plot for celebrity endorsement example.
d <- data.frame(endorser = factor(c("Nobody","Clooney","Jolie"), levels = c("Nobody","Clooney","Jolie")), willingness_av = c(3, 6, 7), const = 1)
library(ggplot2)
ggplot(d, aes(endorser, willingness_av)) +
  geom_point(size = 3, color=brewercolors["Blue"]) +
  geom_line(aes(group = const), size = 1, color=brewercolors["Blue"]) +
  geom_segment(aes(x = 1, xend = 3, y = d[1,2], yend = d[1,2]),
               linetype = "dashed", color = "black") +
  geom_segment(aes(x = 2, xend = 2, y = d[1,2], yend = (d[2,2] - 0.1)),
               color = "darkgrey",
               arrow = arrow(length = unit(2,"mm"),
                                 # ends = "both",
                                 type = "closed")) +
  geom_text(aes(x = 2.02, y = (d[1,2] + d[2,2])/2,
            label = "Clooney effect",
            hjust = 0), color = "darkgrey"
            ) +
  geom_segment(aes(x = 3, xend = 3, y = d[1,2], yend = (d[3,2] - 0.1)),
               color = "darkgrey",
               arrow = arrow(length = unit(2,"mm"),
                                 # ends = "both",
                                 type = "closed")) +
  geom_text(aes(x = 2.98, y = (d[1,2] + d[3,2])/2,
            label = "Jolie effect",
            hjust = 1), color = "darkgrey"
            ) +
  theme_general() +
  scale_y_continuous(limits = c(1, 10), breaks = c(1, 5, 10)) + labs(x = "Endorser", y = "Average willingness to donate")
rm(d)
```

Effect size in an analysis of variance refers to the overall differences between group means. We use eta^2^ as effect size, which gives the proportion of variance in the dependent variable (willingness to donate) explained or predicted by the group variable (experimental condition).

This proportion is informative and precise. If you want to classify the effect size in more general terms, you should take the square root of eta^2^ to obtain *eta*. As a measure of association, eta can be interpreted with the following rules of thumb:

-   0.1 = small or weak effect,

-   0.3 = medium-sized or moderate effect,

-   0.5 = large or strong effect.
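The conversion from eta^2^ to eta and the rules of thumb above can be sketched in R as follows. Treating the three benchmarks as lower bounds of intervals, and the label "negligible" for values below 0.1, are assumptions made for this example; the rules of thumb themselves only name three reference values.

```r
eta_squared <- 0.10        # proportion of variance explained
eta <- sqrt(eta_squared)   # measure of association

# Classify eta, using the benchmarks 0.1, 0.3, and 0.5 as lower bounds.
effect_label <- cut(eta,
                    breaks = c(0, 0.1, 0.3, 0.5, Inf),
                    labels = c("negligible", "small", "medium", "large"),
                    right = FALSE)
round(eta, 2)
as.character(effect_label)
```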

### Between-groups variance and within-groups variance {#between-variance}

For a better understanding of eta^2^ and the statistical test of an analysis of variance model, we have to compare the individual scores to the group averages and to the overall average. Figure \@ref(fig:anova-between) adds overall average willingness to donate to the plot (horizontal black line) with participants' scores and average experimental group scores (coloured horizontal lines).

```{r anova-between, fig.pos='H', fig.align='center', fig.cap="Which part of score differences tells us about the differences between groups?", echo=FALSE, screenshot.opts = list(delay = 5), dev="png", out.width="540px"}
# Goal: Illustrate that between-groups variance represents differences between group means and the grand mean. And that it is a (smaller or larger) proportion of total variance.
# App anova-means: Generate 4 random observations from a normally distributed population with mean 6.4, sd = 1 (Clooney), 4 observations from a population N(m = 6.8, sd = 1) (Jolie), and 4 observations from N(m = 3.3, sd = 1) (no endorser). Use colour for the treatment factor (3 levels). Represent observations in a dotplot, each with a separate value on the x axis, clustered by factor level (experimental condition). Display group means as horizontal line segments (coloured by factor level). Display eta^2 for the data. Allow user to change the group means and update the plot, mean (difference) lines, and eta^2.
# Extension/replacement: Add horizontal line for grand mean, vertical red solid double-sided arrows between each observation and the grand mean (total variance), vertical black solid double-sided arrows for each observation between its group mean and the grand mean (between variance), and vertical black dotted double-sided arrows for each observation between the dot and its group mean (within variance).
knitr::include_app("http://82.196.4.233:3838/apps/anova-between/", height="490px")
```

<A name="question7.1.4"></A>

```{block2, type='rmdquestion'}
4. In Figure \@ref(fig:anova-between), what do the solid red arrows represent? [<img src="icons/2answer.png" width=115px align="right">](#answer7.1.4)
```

<A name="question7.1.5"></A>

```{block2, type='rmdquestion'}
5. What do the solid black arrows represent? [<img src="icons/2answer.png" width=115px align="right">](#answer7.1.5)
```

<A name="question7.1.6"></A>

```{block2, type='rmdquestion'}
6. What do the dotted black arrows in Figure \@ref(fig:anova-between) represent? [<img src="icons/2answer.png" width=115px align="right">](#answer7.1.6)
```

<A name="question7.1.7"></A>

```{block2, type='rmdquestion'}
7. Which arrows relate to effect size eta^2^? Change group means (and press the _Update graph_ button) and describe what happens. [<img src="icons/2answer.png" width=115px align="right">](#answer7.1.7)
```

Let us assume that we have measured willingness to donate for a sample of 12 participants in our study as depicted in Figure \@ref(fig:anova-between). Once we have our data, we first have a look at the proportion of variance that is explained, eta^2^. What does it mean if we say that a proportion of the variance is explained when we interpret eta^2^?

The variance that we want to explain consists of the differences between the scores of the participants on the dependent variable and the overall or grand mean of all outcome scores. Remember that a variance measures deviations from the mean. The solid red arrows in Figure \@ref(fig:anova-between) express the distances between outcome scores and the grand average. Squaring, summing, and averaging these distances over all observations gives us the total variance in outcome scores.

The goal of our experiment is to explain why some of our participants have a willingness to donate that is far above the grand mean (horizontal black line in Figure \@ref(fig:anova-between)) while others score a lot lower. We hypothesized that participants are influenced by the endorser they have seen. If an endorser has a positive effect, the average willingness should be higher for participants confronted with this endorser.

If we know the group to which a participant belongs---which celebrity she saw endorsing the fundraising campaign---we can use the average outcome score for the group as the predicted outcome for each group member---her willingness to donate due to the endorser she saw. The predicted group scores are represented by the coloured horizontal lines for group means in Figure \@ref(fig:anova-between).

Now what part of the variance in outcome scores (solid red arrows in Figure \@ref(fig:anova-between)) is explained by the experimental treatment? If we use the experimental treatment as predictor of willingness to donate, we predict that a participant's willingness equals her group average (horizontal coloured line) instead of the overall average (horizontal black line), which we use if we do not take into account the participant's experimental treatment.

So the difference between the overall average and the group average is what we predict and explain by the experimental treatment. This difference is represented by the solid black arrows in Figure \@ref(fig:anova-between). The variance of the predicted scores is obtained if we average the squared sizes of the solid black arrows for all participants. This variance is called the *between-groups variance*.

Playing with the group means in Figure \@ref(fig:anova-between), you may have noticed that eta^2^ is high if there are large differences between group means. In this situation we have high between-groups variance (long solid black arrows), so we can predict a lot of the variation in outcome scores between participants.

In contrast, small differences between group averages allow us to predict only a small part of the variation in outcome scores. If all group means are equal, we can predict none of the variation in outcome scores because the between-groups variance is zero. As we will see in Section \@ref(anova-model), zero between-groups variance is central to the null hypothesis in analysis of variance.
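The relation between eta^2^ and the two variances can be written compactly. The notation is an assumption introduced here: $Y_{ij}$ is the score of participant $i$ in group $j$, $\bar{Y}_j$ is the mean of group $j$ with $n_j$ participants, and $\bar{Y}$ is the grand mean.

$$
\eta^2 = \frac{\sum_{j} n_j \left(\bar{Y}_j - \bar{Y}\right)^2}{\sum_{j}\sum_{i} \left(Y_{ij} - \bar{Y}\right)^2}
$$

The numerator sums the squared solid black arrows and the denominator sums the squared distances between scores and the grand mean. Averaging both sums over the same number of observations turns them into the between-groups variance and the total variance without changing the ratio. If all group means equal the grand mean, the numerator is zero and so is eta^2^.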

```{r skipped1, eval = FALSE, echo = FALSE}
# Skipped to simplify the explanation of ANOVA.

### Within-groups variance {#within-variance}

# {r anova-within, fig.pos='H', fig.align='center', fig.cap="How does within-groups variance relate to between-groups variance?"}
# Goal: Sensitize students to the fact that larger population variance creates larger random variance of sample means.
# 3 populations (of willingness to donate scores; arranged vertically, so means can easily be compared) with equal means and variances (N(5.2, 2)?) ; add button to draw a random sample from all three populations (N = 10 per sample) and display as (3 vertically arranged) dotplots ; display the mean (vertical line) and the variance as a number within each plot ; calculate and display the between-groups variance of the three sample means ; allow user to change the population variance (range: [0, 8], initially 2) to see how it relates to random between-groups variance


1. In Figure \@ref(fig:anova-within), samples are drawn from three populations that have the same means and variances. Do the samples have the same means?

2. How does between-groups variance in the samples relate to the variance in the populations?

3. What happens if you set population variance to zero in Figure \@ref(fig:anova-within)?

4. What, do you expect, is within-groups variance in Figure \@ref(fig:anova-within)?

If we draw samples from the same population or from populations with the same means, the sample means can still be different because we draw samples at random. These sample mean differences are due to chance; they do not reflect true differences between the populations.

Random samples from the same population can only have different sample means if there is variation in the population scores. After all, if all people exposed to George Clooney as endorser would have exactly the same willingness to donate, every random sample drawn from these people would contain people with exactly the same willingness to donate. Average willingness can only be exactly the same for all samples.

The variation in scores within a population, for example, all people who would be exposed to George Clooney as endorser, is called _within-groups variance_. Within-groups variance gives rise to chance differences between the means of samples drawn from the same population or from identical populations.

The amount of variation in population scores is important. Chance differences between sample means are more likely to be larger if within-groups variance is larger. You are more likely to draw some observations far away from the population mean if score variation is larger in the population. Observations far from the mean influence the sample mean strongly, so the means of samples drawn from this population fluctuate more.
```

The experimental treatment predicts that a participant's willingness equals the average willingness of the participant's group. It cannot predict or explain that a participant's willingness score is slightly different from her group mean (the dotted black arrows in Figure \@ref(fig:anova-between)). *Within-groups variance* in outcome scores is what we cannot predict with our experimental treatment; it is prediction error. In some SPSS output, it is therefore labeled as "Error".
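The split of the total variation into a predicted part and prediction error can be verified on a small example. The sketch below uses twelve invented scores (four per condition), not the scores shown in the interactive figure, and works with sums of squared distances; dividing each sum by the number of observations would give the variances without changing the proportions.

```r
# Invented scores for 12 participants, 4 per experimental condition.
willingness <- c(2, 3, 3, 4,   5, 6, 6, 7,   6, 7, 7, 8)
endorser <- factor(rep(c("Nobody", "Clooney", "Jolie"), each = 4),
                   levels = c("Nobody", "Clooney", "Jolie"))

grand_mean <- mean(willingness)
group_mean <- ave(willingness, endorser)  # each participant's own group mean

ss_total   <- sum((willingness - grand_mean)^2)  # score versus grand mean
ss_between <- sum((group_mean - grand_mean)^2)   # group mean versus grand mean
ss_within  <- sum((willingness - group_mean)^2)  # score versus group mean

# The predicted part and the prediction error add up to the total.
all.equal(ss_total, ss_between + ss_within)

# Proportion of the variation explained by the experimental treatment.
eta_squared <- ss_between / ss_total
round(eta_squared, 2)
```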

### *F* test on the model {#anova-model}

Average group scores tell us whether the experimental treatment has effects within the sample (Section \@ref(anova-meandiffs)). If the group who saw Angelina Jolie as endorser has higher average willingness to donate than the group who did not see an endorser, we conclude that Angelina Jolie makes a difference in the sample. But how about the population?

```{r skipped2, eval=FALSE, echo=FALSE}
# Skipped to simplify the explanation of ANOVA.

#### Test statistic

{r anova-F}
# Goal: Illustrate that between-groups variance is zero for equal group means and that it increases with larger (population) differences between group means.
# Generate a sample (N = 12) and plot it as in the app anova-between. Include lines (line segments) for grand mean and group means but do not include double-sided arrows. Display between-groups variance instead of eta^2 as a value. Allow users to change the three group means. Adjust current plots and between-groups variance but don't draw a new sample.

1. In which situation is between-groups variance zero? Adjust the group means in Figure \@ref(fig:anova-F) to check your answer.
```

If we want to test whether the difference that we find in the sample also applies to the population, we use the null hypothesis that all average outcome scores are equal in the population from which the samples were drawn. In our example, the null hypothesis states that people in the population who would see George Clooney as endorser are on average just as willing to donate as people who would see Angelina Jolie or who would not see a celebrity endorser at all.

```{r skipped3, eval=FALSE, echo=FALSE}
# Skipped to simplify the explanation of ANOVA.

A statistical test requires a single number that expresses how close the sample result is to the hypothesis. How can we express the equality of three or more population means in one number?

In an independent-samples _t_ test, it is easy to express the equality of the two group means as one number: Just take the difference of the two means. Subtraction, however, does not work for three or more groups.

It is easy to see that subtraction of three means may yield zero even if the means are not the same. Assume that the Clooney group scores 6 as average willingness to donate, the Jolie group scores 4, and the group without celebrity endorser scores 2. If we subtract in this order, the result is 6 - 4 - 2 = 0. But the differences between the means are not zero!

Instead of subtraction,
```

We use the variance in group means as the number that expresses the differences between group means. If all groups have the same average outcome score, the between-groups variance is zero. The larger the differences, the larger the between-groups variance (see Section \@ref(between-variance)).

```{r skipped4, eval=FALSE, echo=FALSE}
# Skipped to simplify the explanation of ANOVA.

#### Chance differences in samples

#{r anova-Fratio}
# Goal: Sensitize student to importance of ratio between groups over within-groups variance for rejecting the null hypothesis of equal population means.
# Generate a sample (N = 12) and plot it as in the app anova-F. Display between-groups variance and within-groups variance as a pie chart. Display the F and p values. Allow users to change both variances and adjust the dotplot and pie chart accordingly.

1. How should you adjust between-groups variance and within-groups variance to get more convincing differences between the average group scores in Figure \@ref(fig:anova-Fratio)?

2. What happens to the F test statistic and its p value if differences are more convincing?
```

We cannot just use the between-groups variance as the test statistic because we have to take into account chance differences between sample means. Even if we draw different samples from the same population, the sample means will be different because we draw samples at random. These sample mean differences are due to chance; they do not reflect true differences between groups in the population.

We have to correct for chance differences and this is done by taking the ratio of between-groups variance over within-groups variance. This ratio gives us the relative size of observed differences between group means over group mean differences that we expect by chance.

Our test statistic, then, is the ratio of two variances: between-groups variance and within-groups variance. The *F* distribution approximates the sampling distribution of the ratio of two variances, so we can use this probability distribution to test the significance of the group mean differences we observe in our sample.

Long story short: We test the null hypothesis that all groups have the same population mean in an analysis of variance. But behind the scenes, we actually test between-groups variance against within-groups variance. That is why it is called analysis of variance.
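
To make the ratio concrete, here is a minimal sketch in R that calculates the *F* ratio by hand and compares it to R's built-in analysis of variance. The scores are made up for illustration. The sketch assumes the standard analysis of variance calculation in which each variance is a sum of squares divided by its degrees of freedom (a mean square), a detail that the explanation above leaves out.

```r
# Hypothetical willingness-to-donate scores (made up for illustration).
willing  <- c(6, 7, 5, 6,   5, 6, 4, 5,   3, 4, 2, 3)
endorser <- factor(rep(c("Clooney", "Jolie", "Nobody"), each = 4))

grand_mean  <- mean(willing)
group_means <- tapply(willing, endorser, mean)
group_sizes <- tapply(willing, endorser, length)

# Between-groups variance: squared distances of the group means to the grand
# mean, weighted by group size, divided by (number of groups - 1).
df_between <- nlevels(endorser) - 1
ms_between <- sum(group_sizes * (group_means - grand_mean)^2) / df_between

# Within-groups variance: squared distances of the scores to their own group
# mean, divided by (number of observations - number of groups).
df_within <- length(willing) - nlevels(endorser)
ms_within <- sum((willing - ave(willing, endorser))^2) / df_within

# Test statistic and its p value under the F distribution.
f_ratio <- ms_between / ms_within
p_value <- pf(f_ratio, df_between, df_within, lower.tail = FALSE)

# The same test with R's built-in analysis of variance.
summary(aov(willing ~ endorser))
```

The *F* value in the output of `aov()` should match `f_ratio`: large differences between group means relative to the spread of scores within groups yield a large *F*.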

### Assumptions for the *F* test in analysis of variance {#anova-assumpt}

There are two important assumptions that we must make if we use the *F* distribution in analysis of variance: (1) independent samples and (2) homogeneous population variances.

#### Independent samples

The first assumption is that the groups can be regarded as independent samples. As in an independent-samples *t* test, it must be possible *in principle* to draw a separate sample for each group in the analysis. Because this is a matter of principle instead of how we actually draw the sample, we have to argue that the assumption is reasonable. We cannot check the assumption against the data.

Here is an example of an argument that we can make. In an experiment, we usually draw one sample of participants and, as a next step, we assign participants randomly to one of the experimental conditions. We could have easily drawn a separate sample for each experimental group. For example, we first draw a participant for the first condition: seeing George Clooney endorsing the fundraising campaign. Next, we draw a participant for the second condition, e.g., Angelina Jolie. The two draws are independent: whomever we have drawn for the Clooney condition is irrelevant to whom we draw for the Jolie condition. Therefore, draws are independent and the samples can be regarded as independent.

Situations where samples cannot be regarded as independent are the same as in the case of dependent/paired-samples *t* tests (see Section \@ref(dependentsamples)). For example, samples of first and second observations in a repeated measurement design should not be regarded as independent samples. Some analysis of variance models can handle repeated measurements but we do not discuss them here.

#### Homogeneous population variances

The *F* test on the null hypothesis of no effect (the nil) in analysis of variance assumes that the groups are drawn from the same population. This implies that they have the same average score on the dependent variable in the population as well as the same variance of outcome scores. The null hypothesis tests the equality of population means but we must assume that the groups have equal dependent variable variances in the population.

We can use a statistical test to decide whether or not the population variances are equal (homogeneous). This is Levene's *F* test, which is also used in combination with independent-samples *t* tests. The test's null hypothesis is that the population variances of the groups are equal. If we do *not* reject the null hypothesis, we decide that the assumption of equal population variances is plausible.
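
As an illustration, Levene's test can be sketched in a few lines of R. The sketch assumes the mean-based variant of the test, which is a one-way analysis of variance on the absolute distances between the scores and their group means; software may offer other variants, so the numbers need not match every program. The scores are made up.

```r
# Hypothetical scores (made up); the Jolie group is more spread out.
willing  <- c(6, 7, 5, 6,   5, 8, 2, 5,   3, 4, 2, 3)
endorser <- factor(rep(c("Clooney", "Jolie", "Nobody"), each = 4))

# Absolute distance of each score to its own group mean.
abs_dev <- abs(willing - ave(willing, endorser))

# Levene's F test (mean-based variant): a one-way analysis of variance on
# these absolute distances.
levene   <- anova(lm(abs_dev ~ endorser))
levene_F <- levene[["F value"]][1]
levene_p <- levene[["Pr(>F)"]][1]

# We do not reject the null hypothesis of equal population variances if the
# p value is not below the significance level.
variances_plausibly_equal <- levene_p >= 0.05
```

With groups this small, the test has little power, so a non-significant result is weak evidence at best.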

The assumption of equal population variances is less important if group samples are more or less of equal size (a balanced design, see Section \@ref(balanced)). As a rule of thumb, we regard groups as equally large if the difference in size between the largest and the smallest group is less than 10% of the size of the largest group. If this is the case, we do not care about the assumption of homogeneous population variances.
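
The rule of thumb is simple arithmetic. The sketch below wraps it in a small helper function; the function name `roughly_equal_sizes()` and the group sizes are hypothetical and only serve as an illustration.

```r
# Hypothetical helper: TRUE if the difference between the largest and the
# smallest group is less than 10% of the size of the largest group.
roughly_equal_sizes <- function(sizes, tolerance = 0.10) {
  (max(sizes) - min(sizes)) / max(sizes) < tolerance
}

# Made-up group sizes.
equal_enough <- roughly_equal_sizes(c(50, 48, 46))  # difference 4, 8% of 50
too_unequal  <- roughly_equal_sizes(c(40, 30, 30))  # difference 10, 25% of 40
```

In the first case we would not worry about Levene's test; in the second case we would check it.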

### Which groups have different average scores?

Analysis of variance tests the null hypothesis of equal population means but it does not yield confidence intervals for group means. It does not always tell us which groups score significantly higher or lower.

```{r anova-posthoc, fig.pos='H', fig.align='center', fig.cap="Which groups have different average outcome scores in the population? The _p_ values belong to independent-samples _t_ tests on the means of two groups.", echo=FALSE, screenshot.opts = list(delay = 5), dev="png", out.width="525px"}
# Goal: Sensitize students to the need for and problems with post-hoc tests.
# Generate and display a sample as in the app anova-means with option to change
# group means.
# Display the model F test and the results of three t-tests (without Bonferroni
# correction) for pairwise comparisons (if feasible: as labels to the vertical
# arrows between group means in the plot).
knitr::include_app("http://82.196.4.233:3838/apps/anova-posthoc/", height="485px")
```

<A name="question7.1.8"></A>

```{block2, type='rmdquestion'}
8. Does the _F_ test in analysis of variance tell us which groups have significantly different average population outcome scores? Can we have the same _F_ test result with different sets of group means? Adjust group means in Figure \@ref(fig:anova-posthoc) to demonstrate your answer. [<img src="icons/2answer.png" width=115px align="right">](#answer7.1.8)
```

<A name="question7.1.9"></A>

```{block2, type='rmdquestion'}
9. Is it possible that the _F_ test is statistically significant but none of the _t_ tests that compare groups one by one is? Can you obtain this situation in Figure \@ref(fig:anova-posthoc)? [<img src="icons/2answer.png" width=115px align="right">](#answer7.1.9)
```

<A name="question7.1.10"></A>

```{block2, type='rmdquestion'}
10. Is it okay that we apply both an _F_ test and several _t_ tests to the same group differences? [<img src="icons/2answer.png" width=115px align="right">](#answer7.1.10)
```

If the *F* test is statistically significant, we reject the null hypothesis that all groups have the same population mean on the dependent variable. In our current example, we reject the null hypothesis that average willingness to donate is equal for people who saw George Clooney, Angelina Jolie, or no endorser for the fund raiser. In other words, we *reject* the null hypothesis that the endorser does *not* matter to willingness to donate.

#### Pairwise comparisons as post-hoc tests

With a statistically significant *F* test for the analysis of variance model, several questions remain to be answered. Does an endorser increase or decrease the willingness to donate? Are both endorsers equally effective? The *F* test does not provide answers to these questions. We have to compare groups one by one to see which condition (endorser) is associated with a higher level of willingness to donate.

In a pairwise comparison, we have two groups, for instance, participants confronted with George Clooney and participants who did not see a celebrity endorse the fund raiser. We want to compare the two groups on a numeric dependent variable, namely their willingness to donate. An independent-samples *t* test is appropriate here.

With three groups, we can make three pairs: Clooney versus Jolie, Clooney versus nobody, and Jolie versus nobody. We have to execute three *t* tests on the same data. We already know that there are most likely differences in average scores, so the *t* tests are executed after the fact, in Latin *post hoc*. Hence the name *post-hoc tests*.

Applying more than one test to the same data increases the probability of finding at least one statistically significant difference even if there are no differences at all in the population. Section \@ref(cap-chance) discussed this phenomenon as capitalization on chance and it offered a way to correct for this problem, namely Bonferroni correction. We ought to apply this correction to the independent-samples *t* tests that we execute if the analysis of variance *F* test is statistically significant.

The Bonferroni correction divides the significance level by the number of tests that we do. In our example, we do three *t* tests on pairs of groups, so we divide the significance level of five per cent by three. The resulting significance level for each *t* test is .0167. If a *t* test's *p* value is below .0167, we reject the null hypothesis, but we do not reject it otherwise.
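
The arithmetic of the correction can be sketched in R as follows. The three *p* values are made up for illustration. The base R function `p.adjust()` applies the same correction from the other side: it multiplies each *p* value by the number of tests (with a maximum of 1) instead of dividing the significance level.

```r
alpha   <- 0.05
n_tests <- 3                       # Clooney-Jolie, Clooney-nobody, Jolie-nobody
alpha_per_test <- alpha / n_tests  # 0.0167 after rounding

# Hypothetical p values of the three t tests (made up for illustration).
p_values <- c(clooney_jolie = 0.640, clooney_nobody = 0.012, jolie_nobody = 0.030)

# Option 1: compare each p value to the corrected significance level.
reject_1 <- p_values < alpha_per_test

# Option 2: correct the p values and compare them to the original
# significance level.
p_corrected <- p.adjust(p_values, method = "bonferroni")
reject_2    <- p_corrected < alpha
```

Both options lead to the same decisions. Software that reports corrected *p* values uses the second option, which is why a corrected *p* value can be as high as 1.000.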

#### Two steps in analysis of variance

Analysis of variance, then, consists of two steps. In the first step, we test the general null hypothesis that all groups have equal average scores on the dependent variable in the population. If we cannot reject this null hypothesis, we have too little evidence to conclude that there are differences between the groups. Our analysis of variance stops here, although it is recommended to report the confidence intervals of the group means to inform the reader. Perhaps our sample was just too small to reject the null hypothesis.

If the *F* test is statistically significant, we proceed to the second step. Here, we apply independent-samples *t* tests with Bonferroni correction to each pair of groups to see which groups have significantly different means. In our example, we would compare the Clooney and Jolie groups to the group without celebrity endorser to see if celebrity endorsement increases willingness to donate to the fund raiser, and, if so, how much. In addition, we would compare the Clooney and Jolie groups to see if one celebrity is more effective than the other.
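
The two steps can be sketched in R as follows, again with made-up scores. The sketch assumes that `pairwise.t.test()` with `pool.sd = FALSE` is an acceptable stand-in for the post-hoc tests described here: it runs a separate two-group *t* test for each pair (by default the version that does not assume equal variances) and applies the Bonferroni correction to the *p* values. Results from SPSS can differ in the details.

```r
# Hypothetical data (made up for illustration).
willing  <- c(6, 7, 5, 6,   5, 6, 4, 5,   3, 4, 2, 3)
endorser <- factor(rep(c("Clooney", "Jolie", "Nobody"), each = 4))

# Step 1: F test on the null hypothesis of equal population means.
model <- aov(willing ~ endorser)
p_F   <- summary(model)[[1]][["Pr(>F)"]][1]

if (p_F < 0.05) {
  # Step 2: compare the groups pair by pair with Bonferroni correction.
  posthoc <- pairwise.t.test(willing, endorser,
                             p.adjust.method = "bonferroni", pool.sd = FALSE)
  print(posthoc$p.value)
} else {
  # No significant F test: report confidence intervals of the group means.
  print(confint(lm(willing ~ 0 + endorser)))
}
```

The matrix of corrected *p* values contains one entry for each pair of groups.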

#### Contradictory results

It may happen that the *F* test on the model is statistically significant but none of the post-hoc tests is statistically significant. This mainly happens when the *p* value of the *F* test is near .05. Perhaps the correction for capitalization on chance is too strong; this is known to be the case with the Bonferroni correction. Alternatively, the sample can be too small for the post-hoc test. Note that we have fewer observations in a post-hoc test than in the *F* test because we only look at two of the groups.

This situation illustrates the limitations of null hypothesis significance tests (Chapter \@ref(crit-discus)). Remember that the 5 per cent significance level remains an arbitrary boundary and statistical significance depends a lot on sample size. So do not panic if the *F* and *t* tests have contradictory results.

A statistically significant *F* test tells us that we may be quite confident that at least two group means are different in the population. If none of the post-hoc *t* tests is statistically significant, we should note that it is difficult to pinpoint the differences. Nevertheless, we should report the sample means of the groups (and their standard deviations) as well as the confidence intervals of their differences as reported in the post-hoc test. The two groups that have the most different sample means are most likely to have different population means.

### Answers {.unnumbered}

<A name="answer7.1.1"></A>

```{block2, type='rmdanswer'}
Answer to Question 1.

* The double-sided vertical arrows represent the differences in average
willingness to donate across the three experimental conditions: exposure to
Clooney as endorser, Jolie as endorser, or no celebrity endorser. [<img src="icons/2question.png" width=161px align="right">](#question7.1.1)
```

<A name="answer7.1.2"></A>

```{block2, type='rmdanswer'}
Answer to Question 2.

* Yes, a celebrity endorser seems to matter to participants’ willingness to donate.
* Overall, participants who saw celebrities Jolie or Clooney have higher willingness to donate (the blue and orange dots) than those who did not see a celebrity endorser (the green dots). Average willingness to donate is clearly higher for participants who saw celebrities Jolie (blue line) or Clooney (orange line) than for the control group (green line), who did not see a celebrity endorser. [<img src="icons/2question.png" width=161px align="right">](#question7.1.2)
```

<A name="answer7.1.3"></A>

```{block2, type='rmdanswer'}
Answer to Question 3.

* The more different the group means, the larger the red arrows, the larger the
between-groups variance, the larger eta^2^.
* For those of you who love the details: Between-groups variance squares the
distances (actually the distances between group means and the grand mean but
that is not relevant here) before taking the average. As a consequence, a long
red arrow contributes more to between-groups variance than two short arrows
that are just as long as the long arrow if they are summed. This is the reason
that eta^2^ increases if the group in the middle moves closer to the top
group (or, from some point, closer to the bottom group). [<img src="icons/2question.png" width=161px align="right">](#question7.1.3)
```

<A name="answer7.1.4"></A>

```{block2, type='rmdanswer'}
Answer to Question 4.

![](figures/S7_1Q4.png)

* The solid red arrows represent the difference between an individual score
and the average score of the group to which the individual belongs. For
example, the left-most orange dot represents the willingness score of a
participant who was exposed to Clooney as endorser. The orange line represents
the average willingness score of the participants who were exposed to Clooney.
The red arrow is the difference between the individual's willingness score and
the mean score of its group. Squaring, summing, and averaging the solid red
arrows (within-groups differences) yields the within-groups variance. [<img src="icons/2question.png" width=161px align="right">](#question7.1.4)
```

<A name="answer7.1.5"></A>

```{block2, type='rmdanswer'}
Answer to Question 5.

* The solid black arrows represent the difference between an individual's group
score, for instance, the average willingness score of all participants who were
exposed to Clooney, and the average score of all participants (the grand or
overall mean).
* If we square, sum, and average these differences, we get the variance of
group means, which is called the between-groups variance. [<img src="icons/2question.png" width=161px align="right">](#question7.1.5)
```

<A name="answer7.1.6"></A>

```{block2, type='rmdanswer'}
Answer to Question 6.

* The dotted black arrows represent the difference between individual
willingness scores and the average willingness scores of all participants.
Square, sum, and average them to get the overall variance. Take the square
root of the variance to obtain the overall standard deviation. [<img src="icons/2question.png" width=161px align="right">](#question7.1.6)
```

<A name="answer7.1.7"></A>

```{block2, type='rmdanswer'}
Answer to Question 7.

* The solid black arrows relate to eta^2^. Eta^2^ is larger if the
differences between the group means are larger. The solid black arrows express these differences.

![](figures/S7_1Q7.png)

* If you decrease the differences between the group means, eta^2^
decreases and the solid black arrows become smaller. The red arrows (differences
between scores and their group means) remain the same.
* The dotted black arrows change but some become longer and others become
shorter if you decrease the differences between the group means, so the
dotted black arrows are not clearly related to eta^2^. [<img src="icons/2question.png" width=161px align="right">](#question7.1.7)
```

<A name="answer7.1.8"></A>

```{block2, type='rmdanswer'}
Answer to Question 8.

* If the _F_ test in analysis of variance is statistically significant, we
reject the null hypothesis that all groups have the same mean score in the
population. If we have more than two groups, as in the example, we do not know
which groups score higher and which groups score lower in the population. We
need post-hoc tests to find that out.
* Because the _F_ test does not tell us which group has a higher or lower score,
we can have the same _F_ value for different situations. For example, exchange
the means of two groups. You will get exactly the same _F_ value (in this
balanced design). [<img src="icons/2question.png" width=161px align="right">](#question7.1.8)
```

<A name="answer7.1.9"></A>

```{block2, type='rmdanswer'}
Answer to Question 9.

* Yes, this is possible. For example, set the mean score of the No Endorser
group at 4.5, while you leave the Clooney and Jolie averages at 6.4 and 6.8.
* A _t_ test uses only two of the three groups, so the total number of
observations is lower and, therefore, test power is lower and the null
hypothesis is more difficult to reject.
* Note that this situation mainly occurs if the _F_ test is just significant
(slightly below .05). This illustrates that the .05 threshold is an artificial
boundary. [<img src="icons/2question.png" width=161px align="right">](#question7.1.9)
```

<A name="answer7.1.10"></A>

```{block2, type='rmdanswer'}
Answer to Question 10.

* This is okay provided that we correct for capitalization on chance (see Section \@ref(cap-chance)). [<img src="icons/2question.png" width=161px align="right">](#question7.1.10)
```

## One-Way Analysis of Variance in SPSS {#onewaySPSS}

### Instructions

```{r SPSS1way, echo=FALSE, out.width="640px", fig.pos='H', fig.align='center', fig.cap="(ref:1waySPSS)", dev="png", screenshot.opts = list(delay = 5)}
knitr::include_url("https://www.youtube.com/embed/_2MmzWRcm2k", height = "360px")
# Execute one-way analysis of variance in SPSS with Analyze > Compare Means > One-Way ANOVA (or use General Linear Model also for one-way? MCRS probably uses Compare Means because it only discusses one-way ANOVA; between & within variance not labeled as such!).
# Goal: association as level differences between three or more groups: Does the endorser matter to the level of willingness to donate to a fund raiser?
# Example: donors.sav, outcome is willingness to donate (post), predictor (grouping variable) is endorser (0, 1).
# Technique: one-way ANOVA.
# SPSS menu: Compare Means ; post hoc: Bonferroni ; options: descriptives, homogeneity of variance test, means plot.
# Paste & Run.
# Check assumptions: F test homogeneous population variances or groups of equal size ; post-hoc t tests: each group more than 30 observations or normally distributed.
# Interpret output: F test on the null hypothesis that all groups have equal population means - point out between groups sum of squares? ; post-hoc t test for pairwise comparison of (population) means ; test result and significance, confidence interval.
```

### Exercises

<A name="question7.2.1"></A>

```{block2, type='rmdquestion'}
1. How does celebrity endorsement affect the willingness to donate and is one celebrity more effective than the other? Use the data in [donors.sav](http://82.196.4.233:3838/data/donors.sav). [<img src="icons/2answer.png" width=115px align="right">](#answer7.2.1)
```

<A name="question7.2.2"></A>

```{block2, type='rmdquestion'}
2. The data set [smokers.sav](http://82.196.4.233:3838/data/smokers.sav) contains information on smoking behaviour and attitude towards smoking for a random sample of adults. Does the attitude towards smoking differ among smokers, former smokers, and non-smokers (variable: *status3*)? [<img src="icons/2answer.png" width=115px align="right">](#answer7.2.2)
```

### Answers {.unnumbered}

<A name="answer7.2.1"></A>

```{block2, type='rmdanswer'}
Answer to Exercise 1.

SPSS syntax:

\* Check data.
FREQUENCIES VARIABLES=willing_post endorser
  /ORDER=ANALYSIS.
\* One-way analysis of variance.
ONEWAY willing_post BY endorser
  /ES=OVERALL
  /STATISTICS DESCRIPTIVES HOMOGENEITY
  /PLOT MEANS
  /MISSING ANALYSIS
  /POSTHOC=BONFERRONI ALPHA(0.05).
\* Note: ES=OVERALL gives eta^2^ (SPSS version 27 and higher)

Check data:

There are no impossible values on the variables.

Check assumptions:

The three groups are more or less of equal size: The largest difference is 4
participants, which is less than ten per cent of the largest group (*N* = 49).
Anyway, we may assume equal population variances, Levene *F* (2, 140) = .02, *p* =
.978. Note that the results of this test can be slightly different in
different versions of SPSS.

Interpret the results:

* Willingness to donate depends on the endorsing celebrity. There is a statistically significant difference between average willingness to donate for the three endorsers, *F* (2, 140) = 7.44, *p* = .001, eta^2^ = 0.10, 95% CI [0.02, 0.19].
* People are more willing to donate if they have seen Clooney (*M* = 4.99, *SD* = 1.64) or Jolie (*M* = 4.95, *SD* = 1.63) endorse the fund raiser than people who do not see a celebrity endorser (*M* = 3.87, *SD* = 1.47). The differences between, on the one hand, no celebrity endorser and, on the other hand, Clooney (*p* = .002, 95% CI [-1.91, -0.33]) or Jolie (*p* = .004, 95% CI [-1.88, -0.29]) are statistically significant and range between circa 0.3 and 1.9 with 95% confidence in the population.
* However, there is not a substantial or statistically significant difference between Clooney and Jolie with respect to their effect on willingness to donate (mean difference = 0.04, *p* = 1.000, 95% CI [-0.74, 0.81]); participants seeing Clooney may just as well have higher as lower willingness to donate than participants seeing Jolie.

Instead of reporting the _F_ test result in the text, the ANOVA table can be included.

Note that eta^2^ must be calculated by hand in SPSS version 26 or older: Divide the sum of squares of the main effect (between groups) by the total sum of squares. The confidence interval for eta^2^ cannot be calculated by hand. [<img src="icons/2question.png" width=161px align="right">](#question7.2.1)
```

<A name="answer7.2.2"></A>

```{block2, type='rmdanswer'}
Answer to Exercise 2.

SPSS syntax:

\* Check data.
FREQUENCIES VARIABLES=attitude status3
  /ORDER=ANALYSIS.
\* One-way analysis of variance.
ONEWAY attitude BY status3
  /ES=OVERALL
  /STATISTICS DESCRIPTIVES HOMOGENEITY
  /PLOT MEANS
  /MISSING ANALYSIS
  /POSTHOC=BONFERRONI ALPHA(0.05).
\* Note: ES=OVERALL gives eta^2^ (SPSS version 27 and higher)

Check data:

There are no impossible scores on the two variables.

Check assumptions:

The Levene test is not statistically significant, *F* (2, 82) = 2.82, *p* = .066,
so we assume that smoking attitude has equal population variances in the three
groups. Note that different versions of SPSS may give slightly different
results, but the test is never statistically significant.

Interpret the results:

In the sample, former smokers have a much more negative attitude towards smoking (*M* = -1.69, *SD* = 1.71) than non-smokers (*M* = 0.64, *SD* = 1.17) and smokers (*M* = 0.80, *SD* = 1.67).

With 95% confidence, the attitude towards smoking for former smokers is on average at least 1.35 points more negative on a scale from -5 to +5 than for non-smokers and the difference can be as large as 3.3 points in the population (mean difference = -2.33, *p* < .001, 95% CI [-3.31, -1.35]). Similarly, former smokers have an attitude that is on average 1.3 to 3.7 points more negative than smokers (mean difference = -2.49, *p* < .001, 95% CI [-3.66, -1.32]). The difference in attitude towards smoking between smokers and non-smokers in the population is not clear. Smokers can on average be up to 0.78 points more negative about smoking than non-smokers, but they can also be up to 1.09 points more positive (mean difference = 0.16, *p* = 1.000, 95% CI [-0.78, 1.09]). We need a larger sample to be more specific about the difference between smokers and non-smokers.

Due to the differences between on the one hand former smokers and on the other hand smokers and non-smokers, average attitude towards smoking scores are significantly different across smoking status, *F* (2, 82) = 18.85, *p* < .001, eta^2^ = 0.31 (SPSS yields 0.31498, which is rounded to 0.315). We have sufficient evidence to conclude that former smokers have a more negative attitude towards smoking than the other groups in the population of adults. The differences are substantial (a strong effect, _eta_ = square root of 0.31 = 0.56).

Note that eta^2^ must be calculated by hand in SPSS version 26 or older: Divide the sum of squares of the main effect (between groups) by the total sum of squares. The confidence interval for eta^2^ cannot be calculated by hand. [<img src="icons/2question.png" width=161px align="right">](#question7.2.2)
```
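
The hand calculation of eta^2^ mentioned in the answers above, the between-groups sum of squares divided by the total sum of squares, can be sketched in R. The data are made up for illustration; they are not the donors or smokers data.

```r
# Hypothetical data (made up for illustration).
willing  <- c(6, 7, 5, 6,   5, 6, 4, 5,   3, 4, 2, 3)
endorser <- factor(rep(c("Clooney", "Jolie", "Nobody"), each = 4))

# The ANOVA table contains the sums of squares.
anova_table <- summary(aov(willing ~ endorser))[[1]]
ss_between  <- anova_table[["Sum Sq"]][1]    # main effect (between groups)
ss_total    <- sum(anova_table[["Sum Sq"]])  # between groups + within groups

# Eta squared and its square root.
eta_squared <- ss_between / ss_total
eta         <- sqrt(eta_squared)
```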

## Different Means for Two Factors

The participants in the experiment do not only differ because they see different endorsers in the charity video. In addition, there are differences according to sex: female versus male participants. Does a participant's sex matter to the effect of the endorser on willingness to donate?

```{r anova-twoway, fig.pos='H', fig.align='center', fig.cap="How do group means inform us about (main) effects in analysis of variance?", echo=FALSE, screenshot.opts = list(delay = 5), dev="png", out.width="775px"}
# Goal: Illustrate that different main effects merely use means of different
# groupings.
# Similar to app anova-between but with a double classification of cases
# (according to endorser and sex) and the option to display means of different
# groupings.
# Generate 6 sets of 2 (random) observations (from a normally distributed
# population with mean runif(3, 7) and sd = 1). Assign the groups to the
# experimental treatment factor (endorser, 3 levels) and sex factor (2 levels).
# Represent observations in a dotplot with treatment as dot colour and sex as
# dot shape, each observation with a separate value on the x axis, ordered by
# factor levels.
# Display the grand mean as a horizontal line.
# Allow the user to select (display) group means on one of the two factors or
# both. On selection, add group means as horizontal lines (coloured for the 3
# levels factor and different line styles for 2 levels factor) with between
# group variation indicated by double-sided arrows between group mean and grand
# mean for each dot (between). Add subgroup (endorser by sex) means if both
# factors are selected.
knitr::include_app("http://82.196.4.233:3838/apps/anova-twoway/", height="408px")
```

<A name="question7.3.1"></A>

```{block2, type='rmdquestion'}
1. How does an analysis of variance test the effect of endorser on willingness to donate with the data displayed in Figure \@ref(fig:anova-twoway)? Select the endorser factor to check your answer. [<img src="icons/2answer.png" width=115px align="right">](#answer7.3.1)
```

<A name="question7.3.2"></A>

```{block2, type='rmdquestion'}
2. Compare a plot with the endorser factor selected to a plot with the sex factor selected. Which effect on willingness to donate is probably stronger: the effect of endorser or of sex? Motivate your answer, for example, using the grey arrows in the plots. [<img src="icons/2answer.png" width=115px align="right">](#answer7.3.2)
```

<A name="question7.3.3"></A>

```{block2, type='rmdquestion'}
3. Where do you expect the group means to show up in the graph if you select both the Endorser and Sex check boxes? [<img src="icons/2answer.png" width=115px align="right">](#answer7.3.3)
```

In the preceding section, we have looked at the effect of a single factor on willingness to donate, namely, the endorser to whom participants are exposed. Thus, we take into account two variables: one independent variable and one dependent variable. This is an example of *bivariate analysis*.

Usually, however, we expect an outcome to depend on more than one variable. Willingness to donate does not depend only on the celebrity endorsing a fundraising campaign. It is easy to think of more factors, such as a person's available budget, her personal level of altruism, and so on.

It is straightforward to include more factors in an analysis of variance. These can be additional experimental treatments in the context of an experiment as well as participant characteristics that are not manipulated by the researcher. For example, we may hypothesize that females are generally more charitable than males.

### Two-way analysis of variance {#anova2way}

If we use one factor, the analysis is called one-way analysis of variance. With two factors, it is called two-way analysis of variance, and with three factors... well, you probably already guessed that name.

A two-way analysis of variance using a factor with three levels, for instance, exposure to three different endorsers, and a second factor with two levels, for example, female versus male, is called a 3x2 (say: three by two) factorial design.

### Balanced design {#balanced}

In analysis of variance with two or more factors, it is quite nice if the factors are statistically independent from one another. In other words, it is nice if the scores on one factor are not associated with scores on another factor. This is called a *balanced design*.

In an experiment, we can ensure that factors are independent if we have the same number of participants in each combination of levels on all factors. In other words, a factorial design is balanced if we have the same number of observations in each subgroup. A subgroup contains the participants that have the same level on both factors just like a cell in a contingency table.

```{r anova-balanced, echo=FALSE}
# Table for a balanced 3x2 factorial design.
# Create data.
df <- data.frame(Female = rep(2, 3), Male = rep(2, 3))
row.names(df) <- c("Clooney", "Jolie", "No endorser")
# Display table.
knitr::kable(df, booktabs = TRUE, caption = "Number of observations per subgroup in a balanced 3x2 factorial design.") %>%
  kable_styling(font_size = 12, full_width = FALSE, position = "float_right",
                latex_options = c("HOLD_position"))
# Cleanup.
rm(df)
```

Table \@ref(tab:anova-balanced) shows an example of a balanced 3x2 factorial design. Each subgroup (cell) contains two participants (cases). Equal distributions of frequencies across columns or across rows indicate statistical independence. In the example, the distributions are the same across columns (and rows), so the factors are statistically independent.
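
A quick way to verify independence is to compare the observed counts with the counts that we would expect if the two factors were statistically independent (row total times column total divided by the overall total). The sketch below does this for the balanced 3x2 design; the object names are chosen for illustration only.

```r
# Number of observations per subgroup in the balanced 3x2 design.
counts <- matrix(2, nrow = 3, ncol = 2,
                 dimnames = list(endorser = c("Clooney", "Jolie", "No endorser"),
                                 sex = c("Female", "Male")))

# Distribution of sex within each endorser group (row proportions):
# the same in every row.
prop.table(counts, margin = 1)

# Counts expected if the two factors are statistically independent.
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# In a balanced design, observed and expected counts coincide.
is_independent <- isTRUE(all.equal(unname(counts), unname(expected)))
```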

In practice, it may not always be possible to have exactly the same number of observations for each subgroup. A participant may drop out from the experiment, a measurement may go wrong, and so on. If the numbers of observations are more or less the same for all subgroups, the factors are nearly independent, which is okay. We can use the same rule of thumb for a balanced design as for the conditions of an *F* test in analysis of variance: If the size of the smallest subgroup is less than ten per cent smaller than the size of the largest subgroup, we call a factorial design balanced.

A balanced design is nice but not necessary. Unbalanced designs can be analyzed but estimation is more complicated (a problem for the computer, not for us) and the assumption of equal population variances for all groups (Levene's *F* test) is more important (a problem for us, not for the computer) because we do not have equal group sizes. Note that the requirement of equal group sizes applies to the *subgroups* in a two-way analysis of variance. With a balanced design, we ensure that we have the same number of observations in all subgroups, so we are on the safe side.

### Main effects in two-way analysis of variance {#maineffects}

A two-way analysis of variance tests the effects of both factors on the dependent variable in one go. It tests the null hypothesis that participants exposed to Clooney have the same average willingness to donate in the population as participants exposed to Jolie or those who are not exposed to an endorser. At the same time, it tests the null hypothesis that females and males have the same average willingness to donate in the population.
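
As a minimal sketch, the two main effects can be tested in R with `aov()`. The data are made up: a balanced 3x2 design with two participants per subgroup. The model below contains main effects only; a term for the combination of the two factors is deliberately left out of this sketch.

```r
# Hypothetical balanced 3x2 data: two participants per subgroup.
d <- expand.grid(participant = 1:2,
                 sex = c("Women", "Men"),
                 endorser = c("Nobody", "Clooney", "Jolie"))
d$willing <- c(5, 4,  3, 3,  7, 6,  5, 5,  9, 8,  7, 7)

# Group means used by each main effect.
endorser_means <- tapply(d$willing, d$endorser, mean)
sex_means      <- tapply(d$willing, d$sex, mean)

# Two-way analysis of variance with two main effects.
model <- aov(willing ~ endorser + sex, data = d)
summary(model)
```

The output contains one *F* test for each factor: one for the differences between the endorser group means and one for the difference between the means of females and males.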

```{r anova-meansplot2, echo=FALSE, fig.pos='H', fig.align='center', fig.cap="Means plots for the main effects of endorser and sex on willingness to donate. As a reading instruction, effects of endorsers and of being female are represented by arrows.", out.width="50%", fig.asp=0.8, fig.show='hold'}
# Insert means plot for celebrity endorsement example.
d <- data.frame(
  endorser = factor(c("Nobody","Nobody","Clooney","Clooney","Jolie","Jolie"),
                    levels = c("Nobody","Clooney","Jolie")),
  sex = factor(rep(c("Women", "Men"), 3)),
  willingness_av = c(4.5, 3, 6.5, 5, 8.5, 7), const = 1)
library(ggplot2)
library(dplyr) # Provides %>%, group_by(), and summarise() used below.
d2 <- d %>% group_by(endorser) %>%
  summarise(willing = mean(willingness_av),
            const = max(const))
# Plot for main effect endorser.
ggplot(d2, aes(endorser, willing)) +
    geom_point(size = 3, color=brewercolors["Blue"]) +
    geom_line(aes(group = const), size = 1, color=brewercolors["Blue"]) +
    geom_segment(aes(x = 1, xend = 3, y = willing[[1]], yend = willing[[1]]),
               linetype = "dashed", color = "black") +
    geom_segment(aes(x = 2, xend = 2, y = willing[[1]], yend = (willing[[2]] - 0.1)),
               color = "darkgrey",
               arrow = arrow(length = unit(2,"mm"),
                                 # ends = "both",
                                 type = "closed")) +
    geom_text(aes(x = 2.02, y = (willing[[1]] + willing[[2]])/2,
            label = "Clooney effect",
            hjust = 0), color = "darkgrey", size =5
            ) +
    geom_segment(aes(x = 3, xend = 3, y = willing[[1]], yend = (willing[[3]] - 0.1)),
               color = "darkgrey",
               arrow = arrow(length = unit(2,"mm"),
                                 # ends = "both",
                                 type = "closed")) +
    geom_text(aes(x = 2.98, y = (willing[[1]] + willing[[3]])/2,
            label = "Jolie effect",
            hjust = 1), color = "darkgrey", size =5
            ) +
    theme_general() +
    theme(text = element_text(size = 18)) +
    scale_y_continuous(limits = c(1, 10), breaks = c(1, 5, 10)) +
    labs(x = "Endorser", y = "Average willingness to donate")
# Plot for main effect sex.
d %>% group_by(sex) %>%
  summarise(willing = mean(willingness_av),
            const = max(const)) %>%
  ggplot(aes(sex, willing)) +
    geom_point(size = 3, color=brewercolors["Blue"]) +
    geom_line(aes(group = const), size = 1, color=brewercolors["Blue"]) +
    geom_segment(aes(x = 1, xend = 2, y = willing[[1]], yend = willing[[1]]),
               linetype = "dashed", color = "black") +
    geom_segment(aes(x = 2, xend = 2, y = willing[[1]], yend = (willing[[2]] - 0.1)),
               color = "darkgrey",
               arrow = arrow(length = unit(2,"mm"),
                                 # ends = "both",
                                 type = "closed")) +
    geom_text(aes(x = 1.98, y = (willing[[1]] + willing[[2]])/2,
            label = "Effect of being female",
            hjust = 1), color = "darkgrey", size =5
            ) +
    theme_general() +
    theme(text = element_text(size = 18)) +
    scale_y_continuous(limits = c(1, 10), breaks = c(1, 5, 10)) +
    labs(x = "Sex", y = "Average willingness to donate")
# Cleanup.
rm(d, d2)
```

The tested effects are main effects because they represent the effect of one factor. They express an overall or average difference between the mean scores of the groups on the dependent variable. The main effect of the endorser factor shows the mean differences for endorser groups if we do not distinguish between females and males. Likewise, the main effect for sex shows the average difference in willingness to donate between females and males without taking into account the endorser to whom they were exposed.

We could have used two separate one-way analyses of variance to test the same effects. Moreover, we could have tested the difference between females and males with an independent-samples *t* test. The results would have been the same (if the design is balanced). But there is an important advantage to using a two-way analysis of variance, to which we turn in the next section.
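As a minimal sketch of how the main effects could be estimated in R, the example below fits a two-way analysis of variance with `aov()` on simulated data. The data frame `donors`, the variable names, and the assumed population means are hypothetical. Because the simulated design is balanced, the sum of squares attributed to endorser is the same in the two-way and the one-way analysis.

```r
set.seed(1)

# Hypothetical balanced data: 10 participants in each of the 3 x 2 subgroups.
donors <- expand.grid(
  endorser = c("Nobody", "Clooney", "Jolie"),
  sex = c("female", "male"),
  participant = 1:10
)

# Assumed population means: endorser and sex each shift willingness.
endorser_shift <- c(Nobody = 0, Clooney = 2, Jolie = 4)
sex_shift <- c(female = 1.5, male = 0)
donors$willingness <- 3 +
  unname(endorser_shift[as.character(donors$endorser)]) +
  unname(sex_shift[as.character(donors$sex)]) +
  rnorm(nrow(donors))

# Two-way analysis of variance with main effects only.
two_way <- aov(willingness ~ endorser + sex, data = donors)
summary(two_way)

# One-way analysis of variance for endorser alone.
one_way <- aov(willingness ~ endorser, data = donors)

# Compare the sum of squares for endorser in both analyses.
ss_two_way <- summary(two_way)[[1]][["Sum Sq"]][1]
ss_one_way <- summary(one_way)[[1]][["Sum Sq"]][1]
all.equal(ss_two_way, ss_one_way)
```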

### Answers {.unnumbered}

<A name="answer7.3.1"></A>

```{block2, type='rmdanswer'}
Answer to Question 1.

* Analysis of variance tests the variance (variation, spread) in the mean
scores on the dependent variable among groups. If the group means are more widely
apart, the variance of group means is larger, so the _F_ test value is larger and
further away from what we expect if there are no differences among group means
in the population. The variation among group means is larger if they are more
distant from the overall (or grand) mean.
* To test the endorser effect, analysis of variance looks at the mean scores of
groups defined by the celebrity endorser they have seen. It will check the
distance between the three celebrity group means and the grand mean. [<img src="icons/2question.png" width=161px align="right">](#question7.3.1)
```

<A name="answer7.3.2"></A>

```{block2, type='rmdanswer'}
Answer to Question 2.

* The effect of endorser is probably stronger than the effect of sex because
the observations within an endorser group are less equally distributed around
the grand mean (they are more clustered above or below the grand mean) than
observations within a sex group. Both females and males are found nearly
equally above and below the grand mean, so their group means are close to the
grand mean, and there is little variation in sex group means.

![](figures/S7_3Q2.png)

* The grey arrows represent the distance between group mean and grand mean for each observation. They are much shorter for the difference between females and males than for differences between endorser groups. [<img src="icons/2question.png" width=161px align="right">](#question7.3.2)
```

<A name="answer7.3.3"></A>

```{block2, type='rmdanswer'}
Answer to Question 3.

* If both boxes are ticked, the scores are grouped by the combination of
endorser and sex. The graph will show the means of subgroups: females exposed to
Clooney, males exposed to Clooney, males exposed to Jolie, and so on. [<img src="icons/2question.png" width=161px align="right">](#question7.3.3)
```

## Moderation: Group-Level Differences that Depend on Context {#moderationanova}

```{r echo=FALSE}
# TBD: the concept of moderation ; picture (sketch as video) as icon
# for moderation
```

In the preceding section, we analyzed the effects of both endorser and sex on willingness to donate to a fundraiser. The two main effects isolate the influence of endorser on willingness from the effect of sex and the other way around. This assumes that endorser and sex have an effect on their own, a general effect.

We should, however, wonder whether endorser always has the same effect. Even if there is a general effect of endorser on willingness to donate, is this effect the same for females and males? Note that one endorser is a male celebrity who is reputed to be quite attractive to women. The other endorser is a female celebrity with a similar reputation among men. In this situation, shouldn't we expect that one endorser is more effective among female participants and the other among male participants?

If the effect of a factor is different for different groups on another factor, the first factor's effect is *moderated* by the second factor. The phenomenon that effects are moderated is called *moderation*. Both factors are independent variables. To distinguish between them, we will henceforth refer to them as the predictor and the moderator.

With moderation, factors have a combined effect. The context (group score on one factor) affects the effect of the other factor on the dependent variable. The conceptual diagram for moderation expresses the effect of the moderator on the effect of the predictor as an arrow pointing at another arrow. Figure \@ref(fig:anova-diagram) shows the conceptual diagram for participant's sex moderating the effect of endorsing celebrity on willingness to donate.

```{r anova-diagram, echo=FALSE, fig.pos='H', fig.align='center', fig.cap="Conceptual diagram of moderation.", fig.asp=0.3}
library(ggplot2)
# Create coordinates for the variable names.
variables <- data.frame(x = c(0.3, 0.5, 0.7),
                        y = c(.1, .3, .1),
                        label = c("Endorser", "Sex", "Willingness"))
ggplot(variables, aes(x, y)) +
  geom_segment(aes(x = x[1], y = y[1], xend = x[3] - 0.05, yend = y[1]), arrow = arrow(length = unit(0.04, "npc"), type = "closed")) +
  geom_segment(aes(x = x[2], y = y[2], xend = x[2], yend = y[1]), arrow = arrow(length = unit(0.04, "npc"), type = "closed")) +
  geom_label(aes(label=label)) +
  coord_cartesian(xlim = c(0.2, 0.8), ylim = c(0, 0.4)) +
  theme_void()
#Cleanup.
rm(variables)
```

### Types of moderation

Moderation as different effects for different groups is best interpreted using a cross-tabulation of group means, which is visualized as a means plot. In a group means table, the **Totals** row and column contain the means for each factor separately, for example the means for males and females (factor sex) or the means for the endorsers (factor endorser). These means represent the main effects. In contrast, the means in the cells of the table are the means of the subgroups, which represent moderation. Draw them in a means plot for easy interpretation.

In a means plot, we use the groups of the predictor on the horizontal axis, for example, the three endorsers. The average score on the dependent variable is used as the vertical axis. Finally, we plot the average scores for every predictor-moderator group, for instance, an endorser-sex combination, and we link the means that belong to the same moderator group, for example, the means for females and the means for males (Figure \@ref(fig:anova-moderation)).

```{r anova-moderation, fig.pos='H', fig.align='center', fig.cap="How can we recognize main effects and moderation in a means plot?", echo = FALSE, screenshot.opts = list(delay = 5), dev="png", out.width="775px"}
# Goals: Learn to recognize different types of moderation, visually distinguishing between main effects (differences) and moderation (different differences).
# Create a means plot with 3x2 means, willingness on Y axis, endorser (nobody, Clooney, Jolie) on X axis, and different colours for sex. Initial means have main effects but no interaction effect. Connect the means per sex by line segments. Link the female & male mean for the same group by a vertical double-sided arrow. Allow user to change all six means (if possible, by dragging them vertically?). Display marginal (total) means for each sex and endorser.
# Initial means.
# d <- data.frame(endorser = factor(c("Nobody","Clooney","Jolie","Nobody","Clooney","Jolie"), levels = c("Nobody","Clooney","Jolie")), sex = as.factor(c(rep("male", 3), rep("female", 3))), willingness_av = c(3, 5, 7, 4.5, 6.5, 8.5))
knitr::include_app("http://82.196.4.233:3838/apps/anova-moderation/", height="305px")
```

<A name="question7.4.1"></A>

```{block2, type='rmdquestion'}
1. Does the plot in Figure \@ref(fig:anova-moderation) display a main effect of the factor sex? Motivate your answer with numbers from the table and/or the lines in the plot. If sex has a main effect in this sample, describe the effect: What is the difference between women and men here? [<img src="icons/2answer.png" width=115px align="right">](#answer7.4.1)
```

<A name="question7.4.2"></A>

```{block2, type='rmdquestion'}
2. Is there a main effect of endorser? Again, motivate your answer and describe the effect if there is one. [<img src="icons/2answer.png" width=115px align="right">](#answer7.4.2)
```

<A name="question7.4.3"></A>

```{block2, type='rmdquestion'}
3. Is the effect of participant's sex the same for all three types of endorser in Figure \@ref(fig:anova-moderation)? How can you tell? [<img src="icons/2answer.png" width=115px align="right">](#answer7.4.3)
```

<A name="question7.4.4"></A>

```{block2, type='rmdquestion'}
4. Adjust the means in such a way that the effect of sex is different for different endorsers. In other words, adjust the means such that sex moderates the effect of endorser on willingness to donate. [<img src="icons/2answer.png" width=115px align="right">](#answer7.4.4)
```

Moderation happens a lot in communication science for the simple reason that the effects of messages are stronger for people who are more susceptible to the message. If you know more people who have adopted a new product or a healthy/risky lifestyle, you are more likely to be persuaded by media campaigns to also adopt that product or lifestyle. If you are more impressionable in general, media messages are more effective.

#### Effect strength moderation

Moderation refers to contexts that strengthen or diminish the effect of, for instance, a media campaign. Let us refer to this type of moderation as *effect strength moderation*. In our current example, we would hypothesize that the effect of George Clooney as an endorser is stronger for female participants than for male participants.

In analysis of variance, effects are differences between average outcome scores. The effect of Clooney on willingness to donate, for instance, is the difference between the average willingness score of participants exposed to Clooney and the average score of participants who were not exposed to a celebrity endorser.

Different "Clooney effects" for female and male participants imply different differences! The difference in average willingness scores between females exposed to Clooney and females who are not exposed to an endorser is different from the difference in average scores for males. We have four subgroups with average willingness scores that we have to compare. We have six subgroups if we also include endorsement by Angelina Jolie.
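The "different differences" can be made concrete with a little arithmetic. The sketch below uses the subgroup means plotted in Figure \@ref(fig:anova-effectstrengthmod). The object names `means` and `clooney_effect` are chosen for this illustration only.

```r
# Subgroup means as plotted in the effect strength moderation figure.
means <- matrix(
  c(3.0, 5, 5.0,
    4.5, 9, 6.5),
  nrow = 2, byrow = TRUE,
  dimnames = list(sex = c("male", "female"),
                  endorser = c("Nobody", "Clooney", "Jolie"))
)

# Clooney effect per sex: mean with Clooney minus mean without an endorser.
clooney_effect <- means[, "Clooney"] - means[, "Nobody"]
clooney_effect
# male: 2.0, female: 4.5

# The difference between the two differences.
unname(clooney_effect["female"] - clooney_effect["male"])
# 2.5
```

In these sample means, the Clooney effect is 2.5 points stronger for females than for males.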

```{r anova-effectstrengthmod, echo=FALSE, out.width="70%", fig.asp=0.6, fig.pos='H', fig.align='center', fig.cap="Moderation as a stronger effect within a particular context."}
# Add means plot with Clooney, Jolie, and Nobody on the x axis, willingness to donate on the y axis, and separate (coloured) lines linking the mean scores for females and males. In the plot, the average for females*Clooney is much higher than males*Clooney but the reverse does not apply to Jolie.
d <- data.frame(endorser = factor(c("Nobody","Clooney","Jolie","Nobody","Clooney","Jolie"), levels = c("Clooney","Nobody","Jolie")), sex = as.factor(c(rep("male", 3), rep("female", 3))), willingness_av = c(3, 5, 5, 4.5, 9, 6.5))
library(ggplot2)
ggplot(d, aes(endorser, willingness_av, colour = sex)) +
  geom_point(size = 3) +
  geom_line(aes(group = sex), size = 1) +
  geom_segment(aes(x = 0.8, xend = 2.2, y = d[5,3], yend = d[5,3]),
               linetype = "dashed", color = "black") +
  geom_segment(aes(x = 2, xend = 2, y = d[4,3], yend = d[5,3]),
               color = "darkgrey",
               arrow = arrow(length = unit(2,"mm"),
                                 # ends = "both",
                                 type = "closed")) +
  geom_text(aes(x = 2.02, y = (d[4,3] + d[5,3])/2,
            label = "Clooney effect\nfor females",
            hjust = 0), color = "darkgrey"
            ) +
  geom_segment(aes(x = 0.8, xend = 2.2, y = d[1,3], yend = d[1,3]),
               linetype = "dashed", color = "black") +
  geom_segment(aes(x = 1, xend = 1, y = d[1,3], yend = (d[2,3] - 0.1)),
               color = "darkgrey",
               arrow = arrow(length = unit(2,"mm"),
                                 # ends = "both",
                                 type = "closed")) +
  geom_text(aes(x = 0.98, y = (d[1,3] + d[2,3])/2,
            label = "Clooney effect\nfor males",
            hjust = 1), color = "darkgrey"
            ) +
  theme_general() +
  scale_y_continuous(limits = c(1, 10), breaks = c(1, 5, 10)) +
  labs(x = "Endorser", y = "Average willingness to donate") +
  scale_color_manual(values=c(brewercolors[[1]], brewercolors[[5]]))
rm(d)
```

A means plot is a very convenient tool to interpret different differences. Draw a line connecting the means of the subgroups that belong to the same group on the factor you use as moderator. Each line in the plot represents the differences between predictor groups within one moderator group. If a line goes up or down, predictor groups have different means, so the predictor has an effect within that moderator group. A flat (horizontal) line tells us that there is no effect at all within that moderator group.

The distances between the lines show the difference of the differences. If the lines for females and males are parallel, the difference between endorsers is the same for females and males. Then, the effects are the same and there is *no* moderation. In contrast, if the lines are not parallel but diverge or converge, the differences are different for females and males and there is moderation.
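Checking whether lines are parallel amounts to checking whether the gap between them is constant. The sketch below compares two sets of hypothetical subgroup means. The helpers `line_gaps()` and `has_moderation()` are names invented for this example. They describe sample means only and are not a statistical test.

```r
endorsers <- c("Nobody", "Clooney", "Jolie")

# Gap between the female line and the male line at each endorser.
line_gaps <- function(female, male) female - male

# Hypothetical means with the same endorser differences for both sexes.
parallel <- line_gaps(
  female = setNames(c(4.5, 6.5, 8.5), endorsers),
  male = setNames(c(3, 5, 7), endorsers)
)

# Hypothetical means with a stronger Clooney effect for females.
diverging <- line_gaps(
  female = setNames(c(4.5, 9, 6.5), endorsers),
  male = setNames(c(3, 5, 5), endorsers)
)

# A constant gap means parallel lines, so no moderation in the sample means.
has_moderation <- function(gaps) diff(range(gaps)) > 1e-8

parallel                   # 1.5 at every endorser
has_moderation(parallel)   # FALSE
diverging                  # 1.5, 4.0, 1.5
has_moderation(diverging)  # TRUE
```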

A special case of effect strength moderation is the situation in which the effect is absent (zero) in one context and present in another context. A trivial example would be the effect of an anti-smoking campaign on smoking frequency. For smokers (one context), smoking frequency may go down with campaign exposure and the campaign may have an effect. For non-smokers (another context), smoking frequency cannot go down and the campaign cannot have this effect.

Except for trivial cases such as the effect of anti-smoking campaigns on non-smokers, it does not make much sense to distinguish sharply between moderation in which the effect is strengthened and moderation in which the effect is present versus absent. In non-trivial cases, it is very rare that an effect is precisely zero. (See Holbert and Park [-@HolbertConceptualizingOrganizingPositing2019] for a different view on this matter.)

#### Effect direction moderation

In the other type of moderation, the effect in one group is the opposite of the effect in another group. In Figure \@ref(fig:anova-effectdirmod1), for example, Clooney increases the average willingness to donate among females in comparison to females who did not see a celebrity endorser. In contrast, average willingness for male Clooney viewers is lower than the average for males without an endorser. Let us call this *effect direction moderation*. Males reverse the Clooney effect that we find for females.

```{r anova-effectdirmod1, echo=FALSE, out.width="70%", fig.asp=0.6, fig.pos='H', fig.align='center', fig.cap="Moderation as a positive effect in one context and a negative effect in another context."}
# Add means plot with Clooney, Jolie, and Nobody on the x axis, willingness to donate on the y axis, and separate (coloured) lines linking the mean scores for females and males. In the plot, the average for females*Clooney is much higher than males*Clooney but the reverse does not apply to Jolie.
d <- data.frame(endorser = factor(c("Nobody","Clooney","Jolie","Nobody","Clooney","Jolie"), levels = c("Clooney","Nobody","Jolie")), sex = as.factor(c(rep("male", 3), rep("female", 3))), willingness_av = c(5, 3.5, 7, 6, 7.5, 9))
library(ggplot2)
ggplot(d, aes(endorser, willingness_av, colour = sex)) +
  geom_point(size = 3) +
  geom_line(aes(group = sex), size = 1) +
  geom_segment(aes(x = 0.8, xend = 2.2, y = d[5,3], yend = d[5,3]),
               linetype = "dashed", color = "black") +
  geom_segment(aes(x = 2, xend = 2, y = d[4,3], yend = d[5,3]),
               color = "darkgrey",
               arrow = arrow(length = unit(2,"mm"),
                                 # ends = "both",
                                 type = "closed")) +
  geom_text(aes(x = 1.98, y = (d[4,3] + d[5,3])/2,
            label = "Clooney effect\nfor females",
            hjust = 1), color = "darkgrey"
            ) +
  geom_segment(aes(x = 0.8, xend = 2.2, y = d[1,3], yend = d[1,3]),
               linetype = "dashed", color = "black") +
  geom_segment(aes(x = 1, xend = 1, y = d[1,3], yend = (d[2,3] + 0.1)),
               color = "darkgrey",
               arrow = arrow(length = unit(2,"mm"),
                                 # ends = "both",
                                 type = "closed")) +
  geom_text(aes(x = 0.98, y = (d[1,3] + d[2,3])/2,
            label = "Clooney effect\nfor males",
            hjust = 1), color = "darkgrey"
            ) +
  theme_general() +
  scale_y_continuous(limits = c(1, 10), breaks = c(1, 5, 10)) +
  labs(x = "Endorser", y = "Average willingness to donate") +
  scale_color_manual(values=c(brewercolors[[1]], brewercolors[[5]]))
rm(d)
```

In an extreme situation, the effect in one group can compensate for the effect in another group if it is about as strong but in the opposite direction (Figure \@ref(fig:anova-effectdirmod2)). Imagine that George Clooney convinces females to donate but discourages males from donating because his charm backfires on men (pure jealousy, perhaps). Similarly, Angelina Jolie may have opposite effects on females and males.

```{r anova-effectdirmod2, echo=FALSE, out.width="70%", fig.asp=0.6, fig.pos='H', fig.align='center', fig.cap="Moderation as opposite effects in different contexts."}
# Add means plot with Clooney, Jolie, and Nobody on the x axis, willingness to donate on the y axis, and separate (coloured) lines linking the mean scores for females and males. In the plot, the average for females*Clooney is much higher than males*Clooney and the reverse applies to Jolie, so the effects cancel out on average.
d <- data.frame(endorser = factor(c("Nobody","Clooney","Jolie","Nobody","Clooney","Jolie"), levels = c("Clooney","Nobody","Jolie")), sex = as.factor(c(rep("male", 3), rep("female", 3))), willingness_av = c(6, 3, 9, 6, 9, 3))
library(ggplot2)
ggplot(d, aes(endorser, willingness_av, colour = sex)) +
  geom_point(size = 3) +
  geom_point(aes(x = endorser, y = 6), size = 3, color = "darkgrey") +
  geom_line(aes(group = sex), size = 1) +
  geom_segment(aes(x = 0.8, xend = 3.2, y = d[1,3], yend = d[1,3]),
               linetype = "dashed", color = "black") +
  geom_segment(aes(x = 1, xend = 1, y = d[4,3], yend = (d[5,3] - 0.1)),
               color = "darkgrey",
               arrow = arrow(length = unit(2,"mm"),
                                 # ends = "both",
                                 type = "closed")) +
  geom_text(aes(x = 0.98, y = (d[4,3] + d[5,3])/2,
            label = "Clooney effect\nfor females",
            hjust = 1), color = "darkgrey") +
  geom_segment(aes(x = 3, xend = 3, y = d[4,3], yend = d[6,3]),
               color = "darkgrey",
               arrow = arrow(length = unit(2,"mm"),
                                 # ends = "both",
                                 type = "closed")) +
  geom_text(aes(x = 3.02, y = (d[4,3] + d[6,3])/2,