core-microbiome/core-microbiome.Rmd at master · microDM/core-microbiome · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
---
title: "Core Microbiome"
output: html_document
---

Identifying a core microbiome is useful for simplifying microbiome, for example to facilitate description and analyses, or as a quality control measure rejecting rarer sequences. There is not a single established concept or method which defines the core microbiome (we do not suggest that there should be), but [Hu et al (2013)](http://dx.doi.org/10.1371/journal.pone.0056343) recently outlined some of the commonly used methods. These are most often based on OTU _detection_ in all samples or OTU detection in a certain percentage of samples. Detection based criteria are not very well suited to microbiome analyses where sequencing depth, and thus detection potential, varies between samples and studies.

## Abundance threshold
An alternative method is to use an abundance threshold rather than presence or absence. This approach mostly avoids the problem of detection being affected by sequencing depth, because the detection threshold can be set much higher than 1 sequence per sample for typical thresholds, which are likely to be in the range of 0.1 - 5 % (estimate - need to check real examples in literature). When applying an abundance threshold to identify the 'core microbiome', it is important to consider whether the threshold is applied to the whole sequencing library, to specific experimental factors, or individually to every sample. This choice will make a big difference to the identified core OTUs:

### Experiment-wide abundance threshold
Identifies OTUs that are common overall, but may not pick up OTUs that are common only in a few samples

### Sample-wise abundance threshold
An OTU that is numerically dominant in one or few samples can be picked up, even if it's absence in most samples brings it's experiment-wide abundance below the threshold. This would seem a more sensible approach if the objective is to compare samples. A sample-wise approach has a strong chance of picking up less relevant OTUs which are abundant in single erroneous or contaminated samples.

### Factor-wise abundance threshold
Similar to the sample-wise approach, using a grouping factor requires that an OTU be reliably abundant in replicates, reducing the chance of unusual samples contributing OTUs that are abundant as a result of sample related problems.

### Examples
...

## Inclusion of higher taxa
Most core microbiome selections and quality control measures in the literature are performed at the OTU level; i.e. selecting and de-selecting the clustered de-replicated sequences based upon their abundance or presence. A 97 % similarity is commonly used in microbiome studies and is widely regarded to approximate species-level similarities, however this choice is not universal, and from a biological point of view it is arbitrary. Furthermore, depending upon the taxonomic group in question, the similarity level required to differentiate species varies considerably, therefore applying selection criteria only at the OTU level is not easy to justify. The great genetic diversity of microbes is reflected in a variety of life strategies. Selecting OTUs of interest only at a single similarity level fails to recognise this diversity, favouring representation of higher taxonomic groups which are represented by fewer OTUs.

In [Elliott et al (2014)](http://dx.doi.org/10.1007/s10531-014-0684-8) we characterised the bacteria of Kalahari Sand, in which both _Bacteroidetes_ and _Cyanobacteria_ phyla were found to be abundant. Whilst _Cyanobacteria_ were represented predominantly by a single OTU, over 50 relatively rare OTUs contributed to the _Bacteroidetes_ phylum abundance. Depending on specific thresholds used, in cases like this common higher taxonomic ranks can be excluded from OTU selections intended to facilitate microbiome analyses. For this reason we implemented a multi-rank approach to selecting core OTUs, using the [phylosq](http://joey711.github.io/phyloseq/) package for R. We suggest that this is a useful strategy to achieve better representation of numerically abundant taxa in microbiome analyses.

## Relevant references not cited as yet
http://dx.doi.org/10.1016/j.ecolind.2011.10.008

## Core microbiome function
```{r functions}

# core microbiome function
# finds the core microbiome with various options
# physeq: the phyloseq object to query
# threshold: the abundance below which OTUs are not regarded as being "core taxa"
# method: "max" or "mean" - define whether to calculate the threshold for the max or mean
# ranks: vector of taxonomic ranks at which to search for the core taxa
# group: factor on which to merge samples before applying threshold criteria.

# Note.
# max and mean are calculated for the whole OTU table per row
# therefore
# mean gives the overall abundant OTUs
# max will include any OTU which is abundant any 1 sample, so it is really like a sample-wise search

# returns a phyloseq object containing only the OTUs matching the criteria set
core_microbiome <- function(physeq,threshold=2,method="mean",ranks=c("Species"),group="",verbose=0){
  core_otus_glom <- c()
  physeq.original <- physeq
  if(group!="") {
    # note that grouping will sum the counts
    # but for the threshold to be applied correctly we need to
    # use the mean for the group.
    # the mean is worked out and applied below.
    # I'm sure this can be done in a better way
    # HAS NOT BEEN TESTED ON OBJECTS WHERE THE GROUP FREQUENCY VARIES
    sample_data(physeq)$group_frequency <- NA
    for(n in 1:nrow(sample_data(physeq))) {
      sample_data(physeq)$group_frequency <- sum(get_variable(physeq, group)==get_variable(physeq, group)[n])
    }
    otu_table(physeq) <- otu_table(physeq)/sample_data(physeq)$group_frequency
    physeq <- merge_samples(x=physeq,group=group)
    # I don't know why the above transposes the OTU table. Turn it back:
    otu_table(physeq) <- t(otu_table(physeq))

    # transform_sample_counts(physeq, function(x) x/sample_data(physeq)$group_frequency)
  }
  for (rank in ranks) {
    # agglomerate to the taxonomic rank being searched
    # but only do this if we are not at the bottom rank already
    if(rank!=rank_names(physeq)[length(rank_names(physeq))]) {
      physeq.glom <- tax_glom(physeq,rank)
    } else {
      physeq.glom <- physeq
    }
    # get the OTU table into a dataframe
    glom_otus <- as.data.frame(otu_table(physeq.glom))
    # calculate the test statistic for each OTU
    glom_otus$stat <- apply(glom_otus,1,method)
    # make a list of the taxa meeting the abundance criteria
    glom_rank_candidates <- row.names(glom_otus[glom_otus$stat >= threshold,])
    # identify the taxonomy of the candidates
    if(length(glom_rank_candidates)>0) {
      glom_rank_taxonomy <- tax_table(physeq.glom)[glom_rank_candidates,rank]

      # identify the most abundant OTU within the identified taxonomic group, and add it to the list
      # tax_glom does not always pick the most abundant OTU in the classification
      # Therefore need to specifically check for the most abundant OTU
      # should only do this if we are not working at the lowest rank in the phyloseq object
      if(rank!=rank_names(physeq)[length(rank_names(physeq))]) {
        for (candidate in glom_rank_taxonomy) {
          OTUs_in_taxonomic_group <- tax_table(physeq)[,rank] == as.character(candidate)
          most_abundant_OTU <- names(which.max(taxa_sums(physeq)[as.vector(OTUs_in_taxonomic_group)]))
          core_otus_glom <- c(core_otus_glom,most_abundant_OTU)
        }
      }
      else {
        # at OTU level we don't need to search for the most abundant one - we want them all
        core_otus_glom <- c(core_otus_glom,glom_rank_candidates)
      }
    }
    if(verbose>1)  print(paste(rank,"level search found",length(glom_rank_candidates),"taxa above",method,"abundance of",threshold))
  }
  if(verbose>0) print(paste("Search found", length(core_otus_glom),"unique OTUs"))
  core_otus <- unique(core_otus_glom)
  return(prune_taxa(core_otus,physeq.original) )
}

```

```{r other_functions, echo=FALSE}

library("phyloseq"); library("grid");

tidy_phyloseq <- function(my_phyloseq){
  # set ranks.
  colnames(tax_table(my_phyloseq)) = c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")
  # fix taxa names
  tax_table(my_phyloseq)[,colnames(tax_table(my_phyloseq))] <- gsub(tax_table(my_phyloseq)[,colnames(tax_table(my_phyloseq))],pattern="[a-z]__",replacement="")
  # If NA in phylum position of taxonomy table then change it to "Unidentified"
  tax_table(my_phyloseq)[tax_table(my_phyloseq)[,"Phylum"]==NA,"Phylum"] <- "Unidentified"
  return(my_phyloseq)
}


# core compare function
# Input 2 phyloseq objects which have been pruned to contain a 'core microbiome'
# returns a list containing:
# extra OTUs in the first object, not present in the second
# extra taxonomic groups represented by those OTUs, which are not present in the second object

core_compare <- function(physeq1,physeq2) {
  out <- list()
  # Find out which OTUs were added by the multi-rank approach
  extras <- tax_table(physeq1)[!rownames(otu_table(physeq1))%in%rownames(otu_table(physeq2))]
  out[["extra_otus"]] <- rownames(extras)
  # Find out which taxa are additionally represented by those OTUs
  for(rank in rank_names(physeq1)) {
    index <- !extras[,rank]%in%unique(tax_table(physeq2)[,rank])
    if(sum(index)) {
      out[[rank]] <- unique(as.vector(extras[index,rank]))
    }
    else {
      out[rank] <- "none"
    }
  }
  return(out)
}

```


```{r load_date,echo=FALSE}

# Kalahari bacteria
biom <- import_biom("data/kb/otu_table.biom",taxaPrefix="X")
map <- import_qiime_sample_data(mapfilename="data/kb/map.txt")
kb <- merge_phyloseq(biom, map)
kb <- tidy_phyloseq(kb)

### Holme Moss bacteria
biom <- import_biom("data/hb/otu_table.biom",taxaPrefix="X")
map <- import_qiime_sample_data(mapfilename="data/hb/map.txt")
tax_assignments <- read.delim("data/hb/tax.txt", header=F)
row.names(tax_assignments) <- tax_assignments$V1
taxonomy <- as.matrix(tax_assignments[2:8])
tax = tax_table(taxonomy)
hb <- merge_phyloseq(biom, map, tax)
hb <- tidy_phyloseq(hb)

### Holme Moss fungi
biom <- import_biom("data/hf/otu_table.biom",taxaPrefix="X")
map <- import_qiime_sample_data(mapfilename="data/hf/map.txt")
tax <- read.delim("data/hf/tax.txt")

tax_assignments <- read.delim("data/hf/tax.txt", header=F)
row.names(tax_assignments) <- tax_assignments$V1
taxonomy <- as.matrix(tax_assignments[2:8])
tax = tax_table(taxonomy)
hf <- merge_phyloseq(biom, map, tax)
hf <- tidy_phyloseq(hf)

# convert OTU observations into relative abundance per sample
kb.rel <- transform_sample_counts(kb, function(x) 100*x/(sum(x)))
hb.rel <- transform_sample_counts(hb, function(x) 100*x/(sum(x)))
hf.rel <- transform_sample_counts(hf, function(x) 100*x/(sum(x)))

```


```{r core_OTU, fig.width=18, fig.height=15, results='asis'}

expts <- c(kb.rel , hb.rel , hf.rel)
studies <- c("dryland bacteria" , "peatland bacteria" , "peatland fungi")

for ( i in 1:length(expts) ) {
  expt <- expts[[i]]
  study <- studies[i]
  otus <- as.data.frame(otu_table(expt))

  # Identify core OTUs using different options
  # then make a pruned phyloseq object for each list
  expt.species_mean <- core_microbiome(expt,threshold=2,method="mean",ranks=c("Species"))
  expt.all_ranks_mean <- core_microbiome(expt,threshold=2,method="mean",ranks=rank_names(expt))
  expt.species_max <- core_microbiome(expt,threshold=2,method="max",ranks=c("Species"))
  expt.all_ranks_max <- core_microbiome(expt,threshold=2,method="max",ranks=rank_names(expt))

  # Examine the differently defined core microbiomes
  expt_list <- c(expt.species_mean,expt.all_ranks_mean,expt.species_max,expt.all_ranks_max)
  expt_names <- c("species mean", "all ranks mean", "species max", "all ranks max")
  n <- 1
  r <- c(1,1,2,2)
  c <- c(1,2,1,2)
  grid.newpage()
  pushViewport(viewport(layout = grid.layout(2,2)))

  for(this_expt in  expt_list) {
      plot_title <- paste(study,"-",expt_names[n],"-",ntaxa(this_expt),"otus")
      print(plot_bar(this_expt, fill="Phylum",title=plot_title),vp=viewport(layout.pos.row = r[n], layout.pos.col = c[n]))
      expt_names[1]
      n <- n+1
  }

  # Identify differences in OTU and taxa represented with each method
  cat("<b>",study,"- mean method - extra OTUs and taxa represented:</b><br>")
  print(core_compare(expt.all_ranks_mean,expt.species_mean))

  cat("<b>",study,"- max method - extra OTUs and taxa represented:</b><br>")
  print(core_compare(expt.all_ranks_max,expt.species_max))
}

```