Tuesday, February 19, 2008

playing a mean harmonica

(yes -- the punny blog titles will just keep getting worse until someone starts posting comments!!)

as promised: part II of my engrossing saga of the two lesser-known cousins of the arithmetic mean. previously, we considered the geometric mean, and a fairly obscure use it might be put to in (slightly) more accurately summarizing population change over time. i'd be interested to hear (comments, anyone?) about any other uses of a biological nature to which it can be put.

so, on to the harmonic mean: S&R show you how to calculate it (p. 44) and if you google "harmonic mean + use" the interwebs will tell you that it might be useful for figuring out how fast you went on average under certain very unnatural driving conditions. evidently it also has some uses in calculating electrical resistance and maybe in petroleum geology as well. but -- we're all biologists... why should we care?

as it turns out, this is a fairly important measure in conservation biology as well, used in calculating effective population size over time. a number of papers and books (including Gotelli and Ellison, 2004, referenced in my previous blog post) outline or advocate for its use in 'averaging' population sizes over time.

as a hypothetical example (modified from Gotelli and Ellison, 2004): over a decade, a population has the following sizes: 986, 1067, 95, 221, 489, 821, 961, 1017, 1039, 1126. obviously something pretty bad happened there in year #3, from which it took several years to recover.

the arithmetic mean population size for the decade is still a pretty high 782.2.


> x = c(986, 1067, 95, 221, 489, 821, 961, 1017, 1039, 1126)
> mean(x)
[1] 782.2


however, most of you will immediately recognize the scenario i've laid out above as a "bottleneck" of the type you learned about in reference to genetic drift. in terms of genetic diversity, such an event has a pronounced negative effect. the harmonic mean, not coincidentally, emphasizes the smaller values in a series, and gives them greater weight:


> 1/mean(1/x)
[1] 414.2493
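
to get a feel for how much that one bad year drives the result, here's a minimal sketch that drops year #3 and recomputes both means. note that `harmonic_mean` is just a helper name made up for this illustration (base R has no built-in for it), and dropping a year is purely a thought experiment, not a recommended analysis:

```r
# harmonic mean: reciprocal of the arithmetic mean of the reciprocals
# (hypothetical helper for this sketch; not part of base R)
harmonic_mean <- function(v) 1 / mean(1 / v)

x <- c(986, 1067, 95, 221, 489, 821, 961, 1017, 1039, 1126)

am_all <- mean(x)              # arithmetic mean, all ten years
hm_all <- harmonic_mean(x)     # harmonic mean, all ten years

# drop the bottleneck year (year 3, N = 95) and recompute both
x_no_crash  <- x[-3]
am_no_crash <- mean(x_no_crash)
hm_no_crash <- harmonic_mean(x_no_crash)

# how far each mean shifts when the crash year is removed
c(arithmetic = am_no_crash - am_all, harmonic = hm_no_crash - hm_all)
```

the harmonic mean should shift by considerably more than the arithmetic mean does, which is just another way of saying that the small values carry most of the weight.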


most of the references that i consulted don't actually provide a citation to the original use of the harmonic mean for this purpose; however, using my amazing sleuthing skills, i managed to trace it back to at least the 1930s (Wright, 1938). i'd be curious to know if there are any references that pre-date this.

Reference

Wright, S. 1938. Size of population and breeding structure in relation to evolution. Science 87:430-431.

Tuesday, February 12, 2008

don't be mean :)

as promised in class today, a brief discursion into the realm of the 'alternate' means: geometric and harmonic. S&R do a fine job of explaining how to calculate these values, but as to why one might want to -- eh, not so good (imho). also of note is that -- at least according to the index -- S&R only mention geometric means once more (and even then just in passing) and don't seem to bring up harmonic means again at all. i'm not sure why (other than historical inertia) these statistics are almost always introduced, other than maybe to keep students on their toes. perhaps noteworthy is that collectively the arithmetic, geometric, and harmonic means are known as the 'Pythagorean' means.

anyway, there are a couple of very particular circumstances in which you might use one of these in a biological context, and it would be arguably superior to the arithmetic mean. a book which i think does a pretty good job of laying these out is A Primer of Ecological Statistics by N.J. Gotelli and A.M. Ellison (2004), pp. 61-63. (if you're really into this stuff, you're welcome to borrow my copy and read it for yourself).

so here i expand somewhat on one of their hypothetical examples to illustrate the use of geometric mean in summarizing population growth rates: assume an initial population of 1000 individuals and, for simplicity's sake, a growth rate of 10% the first year, increasing by 1% per year up to 20% in the eleventh year. so, at the end of the first year, the population is (1000 * 1.10) = 1100. likewise, in the second year, the population grows by 11% to (1100 * 1.11) = 1221. by the end of the eleventh year, the population reaches 4633.07 (you'll have to forgive the biologically unrealistic fractional individuals).

now, if you wanted to summarize the growth over these eleven years, you'd be tempted to just 'average' them -- that is, take the arithmetic mean of 10%, 11%, 12% ... up to 20%, which -- as you can probably do in your head -- is exactly 15%. in other words, on average, you'd say, there was 15% growth per year for those eleven years. it makes sense, but, as it turns out, it's not quite exactly precisely right: 1000 * 1.15 = 1150 (1st year); 1150 * 1.15 = 1322.5 (2nd year) ... ending with 4652.39, which is almost 20 greater than it should be (4633.07; from previous paragraph).

so, the arithmetic mean overestimates the average growth. as it turns out, you get the right answer if you instead use the geometric mean of the eleven values (1.10, 1.11, 1.12 ... 1.20), which is only a little bit smaller: 1.149565... (as opposed to 1.15).

as a side-note: "R" doesn't have a built-in function for calculating geometric means, but it's nevertheless fairly easy to do:

> Y = c(1.10, 1.11, 1.12, 1.13, 1.14, 1.15, 1.16, 1.17, 1.18, 1.19, 1.20)
> mean(Y) ## regular old arithmetic mean
[1] 1.15
> exp(mean(log(Y))) ## geometric mean using base "e"
[1] 1.149565
> 10^(mean(log(Y, base=10))) ## same answer in base 10
[1] 1.149565


this blog entry has already turned out much longer than i anticipated, so i'll leave it as an exercise to the reader (if there are any of you left by now) to work through the calculations. given that you haven't yet been introduced to R's 'looping' functions, it would probably make more sense to do the calculations using a spreadsheet. (i know, i know; i warned you away from them for statistical work, but they nevertheless have their uses for quick-and-dirty calculations).
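
for anyone who'd rather check the arithmetic in R than in a spreadsheet, here's one possible sketch that sidesteps loops altogether by using `cumprod()`. the variable names (`N0`, `actual`, `final`) are just ones picked for this example:

```r
# yearly growth multipliers: 10% up to 20% in 1% steps
Y  <- seq(1.10, 1.20, by = 0.01)
N0 <- 1000                          # initial population

# population at the end of each of the eleven years, using the real rates
actual <- N0 * cumprod(Y)

am <- mean(Y)                       # arithmetic mean multiplier
gm <- exp(mean(log(Y)))             # geometric mean multiplier

# population after eleven years under each 'average' rate
final <- c(actual     = actual[length(Y)],
           arithmetic = N0 * am^length(Y),
           geometric  = N0 * gm^length(Y))
round(final, 2)
```

if all goes well, this reproduces the 4633.07 and 4652.39 figures quoted above, with the geometric-mean version landing on the actual value.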

let me know if you're interested in doing this, and i'm happy to help you get started.

i'll pick up with harmonic means in my next entry! (i know you can hardly wait!)

Monday, January 28, 2008

let's blogroll

over on the right side of this page now resides a list ("blogroll") of the blogs of those students who've sent me their addresses so far; this list will grow as more people get on board. although you can click on the "Read More" link at the bottom to go to a page that pulls together everyone's most recent posts (an "aggregator"), it's still worth visiting particular blogs individually, both to see some of the impressive design jobs that your fellow students have done (very artistic!) and to read and participate in the commenting that follows on the various posts.

Thursday, January 24, 2008

*cough* who knew blogs could get so dusty? *wheeze*

well, i'm back. please, no applause. thank you.

hereby i shall resurrect my old blog from last year to serve as a model, inspiration, and touchstone for you, my class, whom i have again tasked with starting and keeping your own blogs, where you will comment on your readings, thinkings, analyses, and general development as statisticians. and probably crack a few corny jokes.

i have kept the links to last year's blogs (over on the right side of the page) for the time being so that -- browsing through them -- you can get a sense of what was attempted by last year's students. some were quite successful.

my goals for this project this time around are twofold: first -- to foster introspection, or, as the educational psychologists call it, metacognition. in short: if you have to think about what you're thinking about, you're likely to get more out of thinking about it. that's the idea, anyways. ymmv.

second -- i want to foster discussion. again, for pedagogical reasons, this has important benefits: it builds a sense of community (which is especially important in a challenging class such as this one), and it gives each of you the opportunity to share what you've figured out. you never really know a subject so well as when you've had to teach it to someone else.

at any rate, even if all that fails, it's still better than quizzes.

and you get to crack jokes. e.g.: "97.3% of all statistics are made up."

Tuesday, April 17, 2007

types i, ii, and iii contingency tables

hereby some examples drawn from Sokal and Rohlf (1995: 724 et seq.), and edited and expanded a bit by me to (hopefully) clarify the distinction among Models I, II, and III contingency tables:

Type I: neither set of marginal totals (row or column) set by investigator:

100 plants are examined, and their soil type and leaf texture are recorded:


                      Pubescent Leaves   Smooth Leaves   Total
Serpentine Soil                     12              30      42
Non-serpentine Soil                 47              11      58
Total                               59              41     100
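
as a quick check on the marginal totals, here's a small sketch that enters the four cell counts and lets R add up the margins via `addmargins()`. the object names (`plants`, `with_totals`) are just placeholders chosen for this example:

```r
# cell counts from the table above (rows: soil type; columns: leaf texture)
plants <- as.table(matrix(c(12, 30,
                            47, 11),
                          nrow = 2, byrow = TRUE,
                          dimnames = list(Soil   = c("Serpentine", "Non-serpentine"),
                                          Leaves = c("Pubescent", "Smooth"))))

# append row and column sums; in a Type I design only the grand total
# (100 plants) was decided in advance, the margins simply fell out of the data
with_totals <- addmargins(plants)
with_totals
```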


Type II: one set of marginal totals set by investigator:

100 moths are exposed to bird predation: 50 light morphs and 50 dark morphs (note that the proportion doesn't have to be 50:50, though); investigator records whether moths are eaten or not:



              Prey   Survivor   Total
Light Morph     39         11      50
Dark Morph      30         20      50
Total           69         31     100


Type III: both sets of marginal totals (row and column) set by investigator:

one hundred beans are placed in a jar: 50 with thick skins and 50 with thin skins (again, doesn't have to be 50:50). seventy hungry weevil larvae -- each of which will burrow in to one unoccupied bean -- are added to the jar, and some time later the investigator records the numbers of each type of bean and whether or not it was attacked:



             Attacked   Not Attacked   Total
Thick Skin          a              b      50
Thin Skin           c              d      50
Total              70             30     100


at first glance, it seems that having both the row totals and the column totals fixed will automatically fix the individual cell counts; this is not actually true, as the following values of a, b, c, and d illustrate:

a = 20, b = 30, c = 50, d = 0;
a = 25, b = 25, c = 45, d = 5;
a = 35, b = 15, c = 35, d = 15;
etc.
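
to spell out the "etc.", here's a rough sketch that enumerates every table consistent with the fixed margins. once a is chosen, b, c, and d follow from the totals, so it's enough to try each a and keep the tables with no negative cells. the variable names (`thick`, `thin`, `attacked`, `valid`) are simply ones made up for this example:

```r
thick <- 50; thin <- 50     # row totals, fixed by the investigator
attacked <- 70              # column total (so 'not attacked' = 30)

# once 'a' is chosen, the fixed margins determine the other three cells
a <- 0:thick
tables <- data.frame(a = a,
                     b = thick - a,
                     c = attacked - a,
                     d = thin - (attacked - a))

# keep only the tables in which no cell count is negative
valid <- tables[apply(tables >= 0, 1, all), ]

nrow(valid)       # number of distinct tables with these margins
range(valid$a)    # smallest and largest possible value of 'a'
```

with these margins, a can be anything from 20 to 50, so there are 31 possible tables; fixing both sets of totals leaves exactly one cell free to vary.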

Sokal and Rohlf indicate that they have "not yet encountered a [non-hypothetical] example of this model."

Friday, April 6, 2007

ancova

as a (relatively) uncomplicated published example of ancova, i humbly present for my biostats students' consideration the following: http://www.tulane.edu/~guill/Reprints/Guill_and_Heins_2000.pdf

perhaps most useful to them will be the formats it uses for reporting the results of the analyses (which -- looking back over it, i'm embarrassed to say, aren't perfect: one needs 2 values for the degrees of freedom for an F ratio. my bad.)

also, it may serve as a reasonably useful model for what i'll be looking for in their independent projects -- basically something approximating the 'methods' and 'results' section of this paper in length and depth, supported perhaps by a few well-crafted figures and tables, as appropriate. anything beyond that (e.g. intro or discussion) will be lagniappe.

anova by hand

and here i was thinking i was being all progressive and modern by not making my biostats students work through all the calculations for anova by hand, but -- lo and behold! -- busy tosser has opined that the old-skool approach might actually be helpful. she's probably right ;) so, let it never be said that i'm not willing to hand out additional work when it's asked for -- here you go:

a quick google search on "anova by hand" turned up the following worksheet: statisticshell.com/anovabyhand.pdf. the 'parent' site that it comes from -- statisticshell.com -- is a hoot. i've worked through the worksheet and it's actually quite good -- he walks you through one example (response to viagra, no less!) and then gives you a second problem to work on your own, followed by the answers to that one as well. i checked his results in R and get the same answers as he did, so -- if you're so inclined -- have at it!
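
if you'd like to check your own hand calculations the same way, here's a bare-bones sketch of a one-way anova done "by hand" in R and then compared against `anova(lm())`. the data below are made up purely for illustration (they are NOT the worksheet's numbers), and the variable names are arbitrary:

```r
# made-up data for illustration only: three groups of four observations
y <- c(4, 5, 6, 5,   7, 8, 6, 9,   10, 9, 11, 8)
g <- factor(rep(c("A", "B", "C"), each = 4))

grand_mean <- mean(y)
group_mean <- ave(y, g)     # each observation's own group mean

# partition the total sum of squares
ss_between <- sum((group_mean - grand_mean)^2)
ss_within  <- sum((y - group_mean)^2)

# an F ratio needs two degrees of freedom: numerator and denominator
df_between <- nlevels(g) - 1
df_within  <- length(y) - nlevels(g)

F_by_hand <- (ss_between / df_between) / (ss_within / df_within)
p_by_hand <- pf(F_by_hand, df_between, df_within, lower.tail = FALSE)

# compare against R's own anova table
fit <- anova(lm(y ~ g))
c(by_hand = F_by_hand, from_R = fit[["F value"]][1])
```

the two F values should agree; if they don't, the discrepancy usually points to a slip in one of the sums of squares.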

let me know if it helps.