## Friday, March 26, 2021

### Calculating beta diversity

Diversity? Comme au courant! Well, you know I like to cover all the bases.

There are (or were when I was in college) three types of ecological diversity: alpha, beta, and gamma. Let's say we're talking about a territory in which there are a number of separate forests, each of which contains a number of trees which may be classified into discrete species.

Gamma diversity is the total diversity of trees within the territory. If we select two of the territory's trees at random, what is the probability that they will be of different species? That number (a "diversity index") would be a quantification of the territory's gamma diversity. You can think of the gamma as standing for global; we're looking at the diversity of individuals (trees) in the entire territory, without considering any of the smaller subgroups (forests) among which those individuals are distributed.

Alpha diversity is the diversity of trees within each forest. If we randomly select two trees from the same forest, what is the probability that they will be of different species? This is an index of alpha diversity. Think of alpha as representing the article a -- the internal diversity within a single forest. We can calculate a diversity index for each of the forests in the territory, and the mean of these numbers will be the alpha diversity of the territory as a whole.

For maximum simplicity, let's just look at territories that have only two forests (Forest 1 and Forest 2), which each have the same number of trees, and only two tree species (redwoods and bluewoods).

Calculating diversity indices is quite straightforward. You take the percentages for each species in the population (for example, the trees of Charliestan are 75% redwood and 25% bluewood). For each species, the probability of randomly selecting two members of that species is its percentage squared -- so the sum of the squares of all the percentages is the probability of selecting two trees of the same species. For Charliestan, that probability is .75² + .25² = .625; the diversity index (the probability of selecting two trees of different species) is 1 minus that number, or 37.5%.
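For anyone who would rather see that arithmetic as code, here is a minimal Python sketch. The function name `diversity_index` is just a label chosen for illustration, and the draw is assumed to be with replacement, which is what squaring the percentages implies.

```python
def diversity_index(proportions):
    """Probability that two randomly drawn trees (with replacement) differ in species."""
    same_species = sum(p * p for p in proportions)  # chance of drawing a matching pair
    return 1.0 - same_species


charliestan = [0.75, 0.25]  # 75% redwood, 25% bluewood
print(diversity_index(charliestan))  # 0.375
```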

Note that when there are only two species, the highest possible diversity index is 50%. Note also that gamma diversity sets a cap for alpha diversity. The two measures can be equal (as in Bakerstan and Charliestan), or gamma can be higher (as in Ablestan and Dogstan), but alpha can never be higher than gamma.
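As a rough check on those claims, the sketch below computes alpha and gamma for the four territories. The forest compositions are not given as numbers in the text; they are inferred here from the figures quoted elsewhere in the post (for example, Dogstan as 75/25 and 25/75), so treat them as assumptions. The forests are taken to be equal in size, so the territory-wide profile is just the plain average of the forest profiles.

```python
def diversity_index(proportions):
    return 1.0 - sum(p * p for p in proportions)


def alpha(forests):
    """Mean of the per-forest diversity indices."""
    return sum(diversity_index(f) for f in forests) / len(forests)


def gamma(forests):
    """Diversity index of the pooled territory, assuming equal-sized forests."""
    n_species = len(forests[0])
    pooled = [sum(f[i] for f in forests) / len(forests) for i in range(n_species)]
    return diversity_index(pooled)


# Each forest is [redwood share, bluewood share]; compositions are inferred, not quoted.
TERRITORIES = {
    "Ablestan": [[1.0, 0.0], [0.0, 1.0]],
    "Bakerstan": [[0.5, 0.5], [0.5, 0.5]],
    "Charliestan": [[0.75, 0.25], [0.75, 0.25]],
    "Dogstan": [[0.75, 0.25], [0.25, 0.75]],
}

for name, forests in TERRITORIES.items():
    print(name, "alpha =", alpha(forests), "gamma =", gamma(forests))
```

With these inputs, alpha equals gamma for Bakerstan and Charliestan and falls below it for Ablestan and Dogstan.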

When there is a difference between gamma (the diversity of trees in the territory) and alpha (the diversity of trees within the forests of the territory), that difference must be accounted for by beta diversity: diversity between forests. (Think of beta as standing for between -- though of course you really ought to say among if there are more than two.) The remainder of this post will discuss the relative merits of various ways of calculating beta diversity.

Approach 1: Forests as units

Can we calculate beta diversity as a diversity index of the sort we have used for alpha and gamma? Well, we could, but that would mean treating entire forests the way we have been treating trees -- as unanalyzable units to be classified into a finite number of discrete "species." For example, if a territory had 10 spruce-fir forests, 5 oak-hickory forests, and 5 maple-beech-birch forests, we could calculate its beta diversity as 1 - (.5² + .25² + .25²) = .625.

The obvious problem with this is that forests just aren't unanalyzable units, and classifying them qualitatively seems the wrong way to go about things. Forests can be more or less similar in their species profile; it's not a binary same/different question. Imagine a spruce-fir forest that is pretty much just spruce and fir, and a maple-beech-birch forest that is also pretty much just what it says on the tin. Now imagine a different country where the spruce-fir forest also has significant numbers of maple and birch trees, and where both the spruce-fir and the maple-beech-birch forests have plenty of hemlocks. This latter country obviously has less beta diversity -- that is, its forests are more similar to one another -- but this approach can't see that.

(Incidentally, this objection also applies to some extent to alpha and gamma diversity. Doesn't a white-red-jack forest, where all the major species are species of pine, have less diversity than an oak-gum-cypress forest? Isn't a neighborhood that's half black and half white more diverse than one that's half German and half Austrian?)

Approach 2: All the gamma that's not alpha

The logic is simple: gamma diversity is total diversity; some of it is accounted for by alpha diversity; all the rest must be beta diversity.

Robert Whittaker's original equation for beta diversity was β = γ/α, which is obviously suboptimal. It would make 1 the minimum figure for beta diversity, when it is the hypothetical maximum for alpha and gamma, making it incommensurable with the other two types of diversity. It is also unable to deal with countries like Ablestan, which have 0 alpha diversity and thus cause a divide-by-zero error.

Later ecologists (perhaps for the reasons I mention) decided to subtract rather than divide, making the new formula β = γ - α. Let's look at our three territories again (reproduced here so you don't have to scroll up).

Using the subtractive formula, we get 0 beta diversity for Bakerstan and Charliestan -- which is correct, since in each of those countries the two forests are identical in terms of species profile -- 12.5% for Dogstan, and 50% for Ablestan. But, wait, isn't that a little strange? The two forests of Ablestan are 100% different -- not a single tree in Forest 1 is the same species as any tree in Forest 2 -- so shouldn't the beta diversity be 100%?

Compare Ablestan to Easystan -- which, unlike the territories we have looked at so far, has yellowwoods.

Both alpha and gamma are higher for Easystan, which makes sense. It has greater global (gamma) diversity, and Forest 2 has greater internal (alpha) diversity. But shouldn't its beta diversity -- the difference between the two forests -- be exactly the same as Ablestan's? In both territories, the trees in Forest 1 are 100% different from those in Forest 2. But the subtractive formula gives us a beta of only 37.5% for Easystan, lower than Ablestan's 50%. Obviously this formula is not capturing the intuitive meaning of beta diversity.
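The subtractive numbers can be reproduced with a short sketch. The compositions are inferred from the figures quoted in the post (Easystan is assumed to be an all-redwood Forest 1 and a half-bluewood, half-yellowwood Forest 2, which matches its alpha of 25%), and the forests are assumed to be equal in size.

```python
def diversity_index(proportions):
    return 1.0 - sum(p * p for p in proportions)


def alpha(forests):
    return sum(diversity_index(f) for f in forests) / len(forests)


def gamma(forests):
    n_species = len(forests[0])
    pooled = [sum(f[i] for f in forests) / len(forests) for i in range(n_species)]
    return diversity_index(pooled)


def beta_subtractive(forests):
    """Approach 2: whatever part of gamma is not accounted for by alpha."""
    return gamma(forests) - alpha(forests)


# Shares are [redwood, bluewood, yellowwood]; compositions are inferred assumptions.
ablestan = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
dogstan = [[0.75, 0.25, 0.0], [0.25, 0.75, 0.0]]
easystan = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]]

print(beta_subtractive(ablestan))  # 0.5
print(beta_subtractive(dogstan))   # 0.125
print(beta_subtractive(easystan))  # 0.375
```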

Or consider Foxstan, which differs from Ablestan only in that its forests are not the same size; 75% of its trees are in Forest 1.

Both Ablestan and Foxstan have an alpha of 0, which is correct because there is no internal diversity within their forests at all. Ablestan has a higher gamma because it is half redwoods and half bluewoods -- the maximum diversity possible when there are only two species. In Foxstan, redwoods are a solid majority, making it less diverse.

What about beta? In each territory, how different is one forest from the other? Well, it seems obvious that both Ablestan and Foxstan have equal, because maximal, beta diversity. In both countries, the trees in Forest 1 are 100% different from the trees in Forest 2. If anything, we might even say that the two forests differ more in Foxstan than in Ablestan, because they differ in size as well as in species profile. But if we use the formula β = γ - α, and alpha is 0, each territory's beta is equal to its gamma, which means Foxstan has less beta diversity than Ablestan. This seems clearly wrong.
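A sketch of the Foxstan comparison follows. It assumes gamma is computed from the pooled trees, so each forest's profile is weighted by its size, while alpha stays a plain mean of the per-forest indices as defined earlier in the post. The `sizes` argument is a hypothetical parameter introduced for illustration.

```python
def diversity_index(proportions):
    return 1.0 - sum(p * p for p in proportions)


def alpha(forests):
    return sum(diversity_index(f) for f in forests) / len(forests)


def gamma_weighted(forests, sizes):
    """Diversity index of the pooled territory, weighting each forest by its size."""
    total = sum(sizes)
    n_species = len(forests[0])
    pooled = [
        sum(size * forest[i] for forest, size in zip(forests, sizes)) / total
        for i in range(n_species)
    ]
    return diversity_index(pooled)


# Forest 1 is all redwood and Forest 2 all bluewood in both territories.
forests = [[1.0, 0.0], [0.0, 1.0]]

ablestan_beta = gamma_weighted(forests, [1, 1]) - alpha(forests)  # equal-sized forests
foxstan_beta = gamma_weighted(forests, [3, 1]) - alpha(forests)   # 75% of trees in Forest 1

print(ablestan_beta)  # 0.5
print(foxstan_beta)   # 0.375
```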

Approach 3: An outgroup diversity index

Both gamma and alpha are calculated by means of a diversity index -- the probability that two randomly selected trees will be of different species. For gamma, the figure is for any two trees in the territory; for alpha, it is for any two trees that are in the same forest. So can't we get beta by calculating a diversity index for any two trees that are not in the same forest?

No, this doesn't work, either. Consider the case of Bakerstan and Charliestan.

Both of these territories should have a beta of 0, because each has two identical forests. But -- precisely because the two forests are identical -- comparing two random trees from different forests is the same as comparing two from the same forest, or from the territory as a whole, so β = α = γ. This method gives Bakerstan a beta of 50%, when it ought to be 0. That's a pretty serious error!

So maybe we should say beta diversity is outgroup diversity (call it xi) minus ingroup diversity (alpha): β = ξ - α. That would give us the desired 0 beta value for Bakerstan and Charliestan. Does it work more generally? No. It fails the Easystan test.

In Ablestan, xi is 1 and alpha is 0, so beta is also 1. This is correct, since the two forests are maximally different from each other.

In Easystan, the two forests are also maximally different from one another -- not a single tree in Forest 1 is the same species as any tree in Forest 2 -- so its xi is 1, and its beta ought to be 1 as well. But because it has an alpha of 25%, its beta is only 75%.
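Here is a small sketch of the outgroup index and the ξ - α variant for a two-forest territory. The function names are illustrative, and the compositions are again inferred from the numbers in the post rather than quoted from it.

```python
def diversity_index(proportions):
    return 1.0 - sum(p * p for p in proportions)


def xi(forest_a, forest_b):
    """Probability that one tree from each forest are of different species."""
    same_species = sum(a * b for a, b in zip(forest_a, forest_b))
    return 1.0 - same_species


def beta_xi_minus_alpha(forest_a, forest_b):
    mean_alpha = (diversity_index(forest_a) + diversity_index(forest_b)) / 2
    return xi(forest_a, forest_b) - mean_alpha


# Shares are [redwood, bluewood, yellowwood]; compositions are inferred assumptions.
bakerstan = ([0.5, 0.5, 0.0], [0.5, 0.5, 0.0])
ablestan = ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
easystan = ([1.0, 0.0, 0.0], [0.0, 0.5, 0.5])

print(xi(*bakerstan))                   # 0.5, though the forests are identical
print(beta_xi_minus_alpha(*bakerstan))  # 0.0
print(beta_xi_minus_alpha(*ablestan))   # 1.0
print(beta_xi_minus_alpha(*easystan))   # 0.75
```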

Approach 4: Slice-matching

And now we come to my final answer! I assume I'm not the first to have thought of it, but I'm much too lazy and unprofessional to play the "literature review" game when it's so much more fun to just reinvent the wheel. I do hope I'm not making an original contribution to diversitology here because, you know, that would just be sad. (Alas, my experience with astronomy does not fill me with optimism.)

This method yields the correct values of 1 for Ablestan, Easystan, and Foxstan; and 0 for Bakerstan and Charliestan. Of the territories we have looked at so far, only Dogstan has a non-trivial beta value, so we will look at it first to demonstrate how the slice-matching method works.

You take the two forests' pie charts and remove all matching slices. That is, you can cut a slice out of a pie and remove it if and only if you can remove a slice of the same size and color from the other pie chart. You keep doing this until you can't do it anymore, and the percentage of the pies remaining is your beta diversity. (When I talk about the "size" of a slice, I mean its relative size as a percentage of its pie; beta diversity is not affected by differences in absolute size among forests.)

For Dogstan, we can remove a 25%-sized slice of blue from each pie, then a 25%-sized slice of red, and then we're done. We still have 50% of each pie left, so Dogstan's beta diversity is 50%.
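One way to sketch slice-matching in code: for each species, the matchable slice is the smaller of the two forests' shares, and beta is whatever is left after all such slices are removed. This is my reading of the pie-chart procedure, not a formula stated in the post, and the function name is made up for the example.

```python
def slice_matching_beta(forest_a, forest_b):
    """Share of each pie left after removing every slice matched in the other pie."""
    matched = sum(min(a, b) for a, b in zip(forest_a, forest_b))
    return 1.0 - matched


# Shares are [redwood, bluewood]; Dogstan's compositions are inferred from the text.
dogstan_f1 = [0.75, 0.25]
dogstan_f2 = [0.25, 0.75]

print(slice_matching_beta(dogstan_f1, dogstan_f2))  # 0.5
```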

What if there are more than two forests in the territory? Do we remove only those slices that can be removed from every forest? No, that clearly won't work. Imagine a territory with 4 all-redwood forests and 1 all-bluewood forest; no slices could be removed, and thus the beta would be 1 -- maximal beta diversity, despite the fact that four of the five forests are identical. No, slice-matching can only be done between a pair of forests, and the beta diversity of the whole territory is calculated by taking the mean beta of all possible pairs of forests. In our example, there are 5 forests and thus 10 possible pairs of forests. Of these, 6 are red-red pairs with beta of 0, and 4 are red-blue pairs with beta of 1. The mean beta diversity for the whole territory would thus be 40%.
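The all-pairs averaging can be sketched as follows, using the five-forest example from the paragraph above. The pairwise step assumes the "smaller share per species" reading of slice-matching; the function names are illustrative.

```python
from itertools import combinations


def slice_matching_beta(forest_a, forest_b):
    return 1.0 - sum(min(a, b) for a, b in zip(forest_a, forest_b))


def territory_beta(forests):
    """Mean slice-matching beta over every unordered pair of forests."""
    pairs = list(combinations(forests, 2))
    return sum(slice_matching_beta(a, b) for a, b in pairs) / len(pairs)


# Shares are [redwood, bluewood]: four all-redwood forests and one all-bluewood forest.
forests = [[1.0, 0.0]] * 4 + [[0.0, 1.0]]

print(len(list(combinations(forests, 2))))  # 10 pairs
print(territory_beta(forests))              # 0.4
```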

As a further illustration of how this works, let's take a look at Georgestan and Howstan -- territories which each have four different forests and four different tree species.

The bottom row of pie charts shows the species distribution for each of the forests. I have so designed these distributions as to give the two territories identical gamma diversity, but Georgestan's diversity is more of the alpha variety (each forest is internally diverse), while Howstan's is more beta (each forest is different from the other forests). I've limited myself to pie slices that are multiples of 12.5%, so as not to overtax my MSPaint skillz.

The pyramid of pie charts above each bottom row shows the "slice-matching" results for each pair of forests. Go diagonally down to the left and to the right to see which two forests each chart is comparing. For example, the pie at the apex of the Georgestan pyramid is comparing the forests F1 and F4, which are highly similar. Slice-matching rules allow us to remove quarter slices of red, yellow, and blue, and an eighth slice of green, from each forest. What is left -- the slices that cannot be matched -- is shaded black and represents the beta diversity between those two forests, which in this case is 12.5%. Looking at the corresponding pie at the apex of the Howstan pyramid, we can see that its F1 and F4 are very different, with 50% beta diversity. Beta diversity for a whole territory is simply the mean beta diversity of all possible forest pairs.

The diversity figures for the two territories, then, are as follows:
• Georgestan
  • gamma = 75%
  • alpha = 72.7%
  • beta = 18.8%
• Howstan
  • gamma = 75%
  • alpha = 57.8%
  • beta = 52.1%

We can compare these to the extreme cases of Itemstan (all four forests look like Georgestan's F1) and Jigstan (the forests are all red, all yellow, all green, and all blue, respectively).
• Itemstan
  • gamma = 75%
  • alpha = 75%
  • beta = 0%
• Jigstan
  • gamma = 75%
  • alpha = 0%
  • beta = 100%
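The two extreme cases are easy to check in code. The sketch below assumes equal-sized forests and assumes that Georgestan's F1 is an even four-way split, which is what Itemstan's alpha and gamma of 75% with four species imply; the pie charts themselves are not reproduced here.

```python
from itertools import combinations


def diversity_index(proportions):
    return 1.0 - sum(p * p for p in proportions)


def alpha(forests):
    return sum(diversity_index(f) for f in forests) / len(forests)


def gamma(forests):
    n_species = len(forests[0])
    pooled = [sum(f[i] for f in forests) / len(forests) for i in range(n_species)]
    return diversity_index(pooled)


def beta(forests):
    """Mean slice-matching beta over all pairs of forests."""
    pairs = list(combinations(forests, 2))
    unmatched = [1.0 - sum(min(a, b) for a, b in zip(x, y)) for x, y in pairs]
    return sum(unmatched) / len(pairs)


# Shares are [red, yellow, green, blue].
itemstan = [[0.25, 0.25, 0.25, 0.25]] * 4  # assumed even split in every forest
jigstan = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]  # one species each

for name, forests in (("Itemstan", itemstan), ("Jigstan", jigstan)):
    print(name, gamma(forests), alpha(forests), beta(forests))
# Itemstan 0.75 0.75 0.0
# Jigstan 0.75 0.0 1.0
```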

Is there a formula?

Robert Whittaker had a simple formula -- β = γ/α -- which we have found inadequate. Can the slice-matching approach to beta diversity also be reduced to a formula? This much seems intuitively obvious:

If gamma is held constant, increasing alpha causes beta to decrease and vice versa. This seems to imply that we should be able to derive alpha if we know beta and gamma, or derive beta if we know alpha and gamma. (Seems. I haven't fully thought this through yet.)

We clearly cannot derive gamma if we know alpha and beta. Jigstan and Ablestan both have an alpha of 0 and a beta of 1, but their gamma is different. This is only possible because Ablestan has two forests but Jigstan has four, so perhaps a fourth variable -- the number of forests -- has to be included in the formula. My hunch (just a hunch) is that any one of those variables should be derivable from the other three, hopefully in a tolerably elegant manner.

Perhaps some of my more mathematically gifted readers (you know who you are!) would like to give it a shot.

NLR said...

This is a great post. Your method of comparisons is a creative idea; it somewhat reminds me of the Condorcet method of pairwise comparisons for voting. I don't have a formula for the three, but here are some preliminary thoughts:

Alpha and beta do not depend on the size of the forests because we are only concerned with the proportion. But gamma does depend on the size of the forests because we are adding up all the trees together. So that is another variable we need to consider. If all the forests have the same size, it's not a concern, but in the Foxstan example, it would be.

Another method for beta diversity would be the probability of picking two different trees given that we pick one tree from forest one and one from forest two.

So for Dogstan, we have four possibilities:
RR, RB, BR, BB. The probabilities are:

RR: 0.75*0.25 = 0.1875
RB: 0.75*0.75 = 0.5625
BR: 0.25*0.25 = 0.0625
BB: 0.25*0.75 = 0.1875

So, then the two that involve picking different trees are RB and BR, so beta = 0.5625+0.0625 = 0.625

Doing the same method with Georgestan is more work, but you get six values, one for each of the pairs and then average these so the beta is 0.7890625.

John Goes said...

Some of the details of this are a little unclear to me.

The definition of alpha is clear when we are talking about one forest, because it's the same as gamma diversity applied to that forest. But how is alpha typically defined when you have a territory with, say, two forests of different size? My understanding from your post is that alpha diversity is defined as the average gamma diversity of the two forests, separately.

But then I am a bit confused about the statement that alpha diversity is necessarily lower than gamma diversity, because if you have one forest with a million red trees and another with 1 green tree and 1 red tree, the gamma diversity is close to zero, but the alpha diversity is some average of the gamma diversity of the first forest (zero) and the second (50%). For your claim to be true, the average would have to be weighted by the number of trees in the forest, right?

I'm interested in the problem you posed at the end, but the basic problem is still not well formed in my mind.
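The example in this comment can be checked numerically. The sketch below works from raw tree counts, uses the same with-replacement index as the post, and compares a plain mean of the per-forest indices against a mean weighted by forest size. The variable names are illustrative.

```python
def diversity_from_counts(counts):
    """Diversity index (with replacement) computed from raw tree counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Counts are [red, green]: a million red trees, then one red and one green tree.
forest_counts = [[1_000_000, 0], [1, 1]]

sizes = [sum(counts) for counts in forest_counts]
per_forest = [diversity_from_counts(counts) for counts in forest_counts]

unweighted_alpha = sum(per_forest) / len(per_forest)
weighted_alpha = sum(s * d for s, d in zip(sizes, per_forest)) / sum(sizes)

pooled_counts = [sum(column) for column in zip(*forest_counts)]
gamma = diversity_from_counts(pooled_counts)

print(unweighted_alpha)  # 0.25
print(weighted_alpha)    # tiny, and below gamma
print(gamma)             # close to zero
```

In this example the unweighted alpha exceeds gamma, while the size-weighted alpha does not.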

Wm Jas Tychonievich said...

Good catch, John. Yes, alpha can exceed gamma when the forests are unequal in size.

Wm Jas Tychonievich said...

@NLR

"Another method for beta diversity would be the probability of picking two different trees given that we pick one tree from forest one and one from forest two."

Isn't that my Approach 3? If there's a difference, I'm missing it.

Wm Jas Tychonievich said...

A comment from my much more mathematical brother Luther:

I'm assuming you have, or can easily gain, some familiarity with P-norms (https://en.wikipedia.org/wiki/Norm_(mathematics)#p-norm and https://en.wikipedia.org/wiki/Lp_space#The_p-norm_in_finite_dimensions).

Your task in beta is to measure the distance between population vectors; that is, vectors whose components sum to one. Ergo, the question is: what is the best measurement of distance for such vectors?

Your final beta for two populations is (a normalizing ½ times) the L-1 distance between the population vectors. That is, given x = (x1, x2, ... xn) where sum(xi) = 1 and y = (y1, y2, ... yn) where sum(yi) = 1 you report ½ sum(|xi-yi|). Given the base-line assumption that tree species are all equally far apart, L-1 is a reasonable choice, though not the only option. Do you want the difference between (1, 0, 0) and (0, 1, 0) to be equal to the difference between (1, 0, 0) and (0, 0.5, 0.5)? If so, use L-1. If you want the second to be more different, use a fractional norm instead (like L-⅔).
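A quick sketch of both points: half the L-1 distance agrees with slice-matching (read as "remove the smaller share of each species"), and a fractional norm such as p = ⅔ rates (0, 0.5, 0.5) as further from (1, 0, 0) than (0, 1, 0) is. The function names are made up for illustration, and for p < 1 the quantity computed is not a true norm, only the same formula applied with a fractional exponent.

```python
def half_l1(x, y):
    """Half the L-1 distance between two population vectors."""
    return 0.5 * sum(abs(a - b) for a, b in zip(x, y))


def slice_matching_beta(x, y):
    return 1.0 - sum(min(a, b) for a, b in zip(x, y))


def p_distance(x, y, p):
    """(sum |x_i - y_i|^p)^(1/p); only a true norm when p >= 1."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


dogstan_f1, dogstan_f2 = [0.75, 0.25], [0.25, 0.75]  # compositions inferred from the post
print(half_l1(dogstan_f1, dogstan_f2), slice_matching_beta(dogstan_f1, dogstan_f2))

base = (1.0, 0.0, 0.0)
single = (0.0, 1.0, 0.0)
split = (0.0, 0.5, 0.5)

# Under L-1 the two comparisons tie; under p = 2/3 the split vector is further away.
print(p_distance(base, single, 1), p_distance(base, split, 1))
print(p_distance(base, single, 2 / 3), p_distance(base, split, 2 / 3))
```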

If you wanted to handle the "all pines are similar" you mentioned in the introduction, you'd instead want to have a weighted feature vector (I'd probably implement the weights as a matrix to make the whole a well-defined inner-product space). Once you add weights, it is likely that L-2 will be what you want rather than L-1, as L-2 is better at measuring comparable distances and outliers are not a problem when you have just two vectors.

For 3+ populations, you are computing all-pairs L-1 distances and averaging them. I suppose that's a fine approach, though expensive to compute if there are many forests. It's also a bit tricky to weight if you want big forests to matter more than little forests. I'd probably have found the weighted average population vector, then taken the weighted average difference from that average population to each other population instead of simply averaging all pairwise differences; but to know if that approach is actually better or not would require knowing the intended application of the measurement. 
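The centroid idea in that last paragraph might look something like the sketch below. This is one possible reading of the suggestion, not a worked-out method: it weights each forest by a hypothetical `sizes` value, builds the weighted average population vector, and then takes the weighted average of half the L-1 distance from each forest to that average.

```python
def half_l1(x, y):
    return 0.5 * sum(abs(a - b) for a, b in zip(x, y))


def centroid_beta(forests, sizes):
    """Size-weighted mean distance from each forest to the weighted average forest."""
    total = sum(sizes)
    n_species = len(forests[0])
    centroid = [
        sum(size * forest[i] for forest, size in zip(forests, sizes)) / total
        for i in range(n_species)
    ]
    weighted = sum(size * half_l1(forest, centroid) for forest, size in zip(forests, sizes))
    return weighted / total


# Jigstan: four equal-sized forests, each made up of a single different species.
jigstan = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

print(centroid_beta(jigstan, [1, 1, 1, 1]))  # 0.75
```

For Jigstan this gives 0.75 rather than the all-pairs value of 1, so the two approaches are not on the same scale and would need separate interpretation.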