Tuesday, April 17, 2007

models i, ii, and iii contingency tables

hereby some examples, drawn from Sokal and Rohlf (1995: 724 et seq.) and edited and expanded a bit by me, to (hopefully) clarify the distinction among Model I, II, and III contingency tables:

Model I: neither set of marginal totals is fixed by the investigator:

100 plants are examined, and each plant's soil type and leaf texture are recorded:


                      Pubescent Leaves   Smooth Leaves   Total
Serpentine Soil              12               30            42
Non-serpentine Soil          47               11            58
Total                        59               41           100


Model II: one set of marginal totals is fixed by the investigator:

100 moths are exposed to bird predation: 50 light morphs and 50 dark morphs (note that the fixed proportion doesn't have to be 50:50, though); the investigator records whether or not each moth is eaten:



              Prey   Survivor   Total
Light Morph    39       11        50
Dark Morph     30       20        50
Total          69       31       100


Model III: both sets of marginal totals are fixed by the investigator:

one hundred beans are placed in a jar: 50 with thick skins and 50 with thin skins (again, it doesn't have to be 50:50). seventy hungry weevil larvae -- each of which will burrow into one unoccupied bean -- are added to the jar, and some time later the investigator records, for each bean, its skin type and whether or not it was attacked:



             Attacked   Not Attacked   Total
Thick Skin       a            b          50
Thin Skin        c            d          50
Total           70           30         100


at first glance, it seems that having both the row totals and the column totals fixed will automatically fix the individual cell frequencies; this is not actually true, as the following values of a, b, c, and d illustrate:

a = 20, b = 30, c = 50, d = 0;
a = 25, b = 25, c = 45, d = 5;
a = 35, b = 15, c = 35, d = 15;
etc.

Sokal and Rohlf indicate that they have "not yet encountered a [non-hypothetical] example of this model."
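whichever model you're in, by the way, the mechanics of the usual tests in R look the same. here's a quick sketch using the moth numbers from the Model II example above:

```r
# the moth table from the Model II example above
moths <- matrix(c(39, 30,    # first column:  prey      (light, dark)
                  11, 20),   # second column: survivors (light, dark)
                nrow = 2,
                dimnames = list(Morph = c("Light", "Dark"),
                                Fate  = c("Prey", "Survivor")))
chisq.test(moths)                   # 2x2 table, so Yates' correction by default
chisq.test(moths, correct = FALSE)  # uncorrected: X-squared = 3.79 on 1 df
fisher.test(moths)                  # exact test, for the so-inclined
```

(fisher's exact test, incidentally, conditions on both sets of marginal totals -- in effect it pretends we're in Model III no matter how the data actually arose.)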

Friday, April 6, 2007

ancova

as a (relatively) uncomplicated published example of ancova, i humbly present for my biostats students' consideration the following: http://www.tulane.edu/~guill/Reprints/Guill_and_Heins_2000.pdf

perhaps most useful to them will be the formats it uses for reporting the results of the analyses (which -- looking back over it, i'm embarrassed to say -- aren't perfect: one needs two values for the degrees of freedom of an F ratio, not one. my bad.)

also, it may serve as a reasonably useful model for what i'll be looking for in their independent projects -- basically something approximating the 'methods' and 'results' section of this paper in length and depth, supported perhaps by a few well-crafted figures and tables, as appropriate. anything beyond that (e.g. intro or discussion) will be lagniappe.
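for the curious, the core of an ancova in R is just lm() with a covariate and a factor on the right-hand side. here's a minimal sketch with simulated (i.e. entirely made-up) data -- the variable names are hypothetical, not the ones from the paper:

```r
set.seed(1)
# made-up data: a covariate (length), a factor (population), a response (eggmass)
fish <- data.frame(length = runif(40, 30, 60),
                   population = gl(2, 20, labels = c("A", "B")))
fish$eggmass <- 2 + 0.1 * fish$length +
  ifelse(fish$population == "B", 1.5, 0) + rnorm(40, sd = 0.5)

# step 1: check homogeneity of slopes (the interaction term)
anova(lm(eggmass ~ length * population, data = fish))
# step 2: if the interaction is n.s., drop it and read off the ancova proper
fit <- lm(eggmass ~ length + population, data = fish)
anova(fit)
```

(and note that the anova table hands you both df for each F -- the effect's df and the residual df -- which are the two numbers one needs to report.)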

anova by hand

and here i was thinking i was being all progressive and modern by not making my biostats students work through all the calculations for anova by hand, but -- lo and behold! -- busy tosser has opined that the old-skool approach might actually be helpful. she's probably right;) so, let it never be said that i'm not willing to hand out additional work when it's asked for -- here you go:

a quick google search on "anova by hand" turned up the following worksheet: statisticshell.com/anovabyhand.pdf. the 'parent' site that it comes from -- statisticshell.com -- is a hoot. i've worked through the worksheet and it's actually quite good -- he walks you through one example (response to viagra, no less!) and then gives you a second problem to work on your own, followed by the answers to that one as well. i checked his results in R and got the same answers he did, so -- if you're so inclined -- have at it!
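the general recipe for checking a one-way anova in R, in case you want to verify your own hand calculations, is essentially a one-liner; toy numbers here, not necessarily the worksheet's:

```r
# toy data: three groups of five observations each
y <- c(3, 2, 1, 1, 4,  5, 2, 4, 2, 3,  7, 4, 5, 3, 6)
g <- gl(3, 5, labels = c("group1", "group2", "group3"))
summary(aov(y ~ g))  # compare the SS, df, MS, and F to your hand calculations
```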

let me know if it helps.

Thursday, March 8, 2007

testing testing

this is just a test to see what kind of HTML formatting blogger will allow you to use such that we can incorporate R code and output while preserving spacing for columns and the like (so it looks like the output on your R terminal).

for instance, compare this:

> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa

with this:

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
>

the trick is to enclose the cut-and-pasted output from R between the HTML markup tags <PRE> ... </PRE>

Tuesday, March 6, 2007

g&e ch 6

i realized only this morning that i hadn't blogged my usual preview note for the upcoming chapter. i apologize; theryn will be acting as MC today, leading our class discussion, and i guess i shifted into "participant" mode a little too soon. at any rate, it's probably too late for anyone to benefit from reading my rambling thoughts (i doubt anyone is up reading my blog at 6 AM) but at least this will be here for later review and reflection.

this chapter is straightforward, mostly non-quantitative (only a couple of equations, pertaining to what is -- imho -- a largely tangential bit about modelling...), and full of good advice on how to worry about all of the things that might go wrong with your field study. no, not really... but it does lay out and emphasize well the general principles of replication and randomization, and how necessary they are in order to make the results of your hard work as generalizable (and interesting, and useful, and, therefore, publishable) as possible.

i think the authors may be a bit too sanguine about the possibility of controlling for or taking into account all of the potential confounding variables that may affect a field study. my perspective is that, given finite resources and time, there will always be the risk of an unmeasured covariate that presents itself as possibly important after the fact, but a reasonably well-replicated and randomized design minimizes (though doesn't eliminate) this risk.

Monday, February 26, 2007

G&E CH 5

I like the framework that G&E have laid out in this chapter on the several different general approaches to statistical analysis, and I do think it is all worth reading fairly closely. That said, I think the simple example that they use (ant nests in forests and fields) to illustrate the different approaches (an excellent pedagogical device, IMHO) is telling: Their description of how one would go about implementing the "monte carlo" approach is clear, and I expect it would be easy (if tedious) for most anyone at your level to implement using a spreadsheet. Their description of the standard parametric analysis is -- I think -- a reasonable compromise between overview and detail (which you'll get a little later in the semester); after reading it you should have some sense of what F represents in an ANOVA (although not yet the ability to calculate it). As to Bayesian analysis -- I'll keep my opinion to myself for now, but I will prompt you with the following: after reading through this section, ask yourself whether you could begin to put together the approach you would need to follow in order to repeat the authors' analysis.

I do think they do a bit of a disservice to non-parametric statistics, and, given their ubiquity, maybe should have spent a bit more time on them. We will, ultimately, come back to some of the more popular of these approaches (e.g. chi square) in later chapters.

Wednesday, February 14, 2007

JV ch. 4

although this chapter is titled 'multivariate data', most of it is spent filling in the gaps and expanding your understanding of how R deals with data in the form of lists and data frames. [although we haven't really talked about it, you've been using data frames since you first started using attach().] also of note will be the additional practice you will get (and skills you will develop) in making plots. although it may seem insanely hard at first, once you get the hang of it, R will allow you to make some really nice plots with comparatively little effort (at least in comparison to some other statistical graphing packages that i'm familiar with).

as to what to focus on -- at the beginning of the chapter the author again spends some time showing you how to make various tables, which, as i've indicated before, i think may be something better left to spreadsheets. at least at the beginning.

the end of section 4.1 gives you a nice explanation of high- versus low- level plotting features, and some examples of additional plotting options.

section 4.2 is a tedious but useful (and necessary) breakdown of some of the details of data frames and lists, whereas section 4.3 is, in my opinion, a little on the tangential side. if you're reading along about xtabs(), split(), and stack(), and you're zoning out, don't worry too much. you can come back to these things when you find a problem that necessitates them.

lattice graphics (section 4.4) are pretty cool when your data are appropriate to be shown in this fashion, so this section is worth a read, whereas, possibly with the exception of 'factors', most of section 4.5 can be safely skimmed or skipped at this point (as JV himself indicates).
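if you'd like a quick taste of lattice before (or while) you read, try something along these lines with the built-in iris data:

```r
library(lattice)  # included with R, but must be loaded explicitly
# one scatterplot panel per species -- the 'conditioning' idea lattice is built on
xyplot(Sepal.Length ~ Sepal.Width | Species, data = iris)
```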

Monday, February 12, 2007

so i finally found my book...

after searching all day, it turns out my copy of G&E was in the back of the car which my wife had at work...

anyway, by the minute it's becoming too late for me to give you any real guidance on reading ch. 4, so i'm going to suggest something different -- we'll turn the tables and i'm going to ask you to bring a list of a dozen or so things YOU thought were most important about hypothesis testing, and we'll compare notes before we begin going through the chapter together. oh -- and try not to bunch them all up at the beginning of the chapter.

if you want to REALLY impress me, you can assemble your list in the form of questions. :)

Wednesday, February 7, 2007

jv ch. 3

hereby my hopefully helpful but nevertheless random thoughts as i read back through JV chapter 3. unfortunately, our two texts will begin to diverge for a bit at this point. it was a nice bit of synchronicity that JV chapter 2 and G&E chapter 3 largely overlapped in terms of summary statistics, but, for the next little while, the authors of each book take a bit of a different tack. in a way, i think this is good, because, on their own, G&E would be a bit too theoretical, whereas JV would be a bit too pragmatic. i think the two balance each other nicely in this regard, although i do wish the content covered meshed a bit more consistently. looking ahead, we'll return to synchronous treatments of the fairly detailed topics of regression (which we touch on in this chapter) and ANOVA.

imho, some things are better done in spreadsheets, at least until you get the hang of the R way of doing things, so if you find yourself getting bogged down with binding vectors and adding margins in the early part of the chapter, i'd say you can safely skim it, and just be aware that you can do such things. in general, spreadsheets (such as microsoft excel) do this more easily and intuitively, and may be the tool of choice if you wish to do this for a big data set.

i do think it's pretty cool when JV shows you how to produce the side-by-side boxplots and overlapping density plots, and that skill will be useful in the future.
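here's roughly what those look like, using the built-in iris data instead of JV's:

```r
# side-by-side boxplots: one box per species
boxplot(Sepal.Length ~ Species, data = iris)

# overlapping density plots, base-graphics style
d1 <- density(iris$Sepal.Length[iris$Species == "setosa"])
d2 <- density(iris$Sepal.Length[iris$Species == "virginica"])
plot(d1, xlim = range(d1$x, d2$x), main = "sepal length: setosa vs. virginica")
lines(d2, lty = 2)  # dashed line distinguishes the second species
```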

the q-q plots are a bit arcane, and i wouldn't spend too long on them. imho, there are better (if less visual) ways of checking the normality of your data.
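one of those less-visual ways, for instance, is the shapiro-wilk test:

```r
set.seed(42)
x <- rnorm(50)   # draws from a normal distribution
y <- rexp(50)    # draws from a (strongly skewed) exponential distribution
shapiro.test(x)  # typically a largish p-value: no evidence against normality
shapiro.test(y)  # a tiny p-value: normality soundly rejected
```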

scatterplots are particularly important, as are correlation and regression. be aware, though, that we'll be coming back to regression later in the semester. it does make good sense to at least introduce it here, though. i think the short bits on transformations and outliers are worth reading closely, too. though, again, we'll be coming back to them.

Thursday, January 25, 2007

jv ch. 1

today we did (as a class) our first real work in R, and my impression was that it went okay. my sense was that most people didn't get through even the problems from the text that i'd assigned, much less on to the new project of uploading and looking at the class questionnaire data that i collected the first day. personally, though, i'm okay with that. i wonder how others feel.

maybe someone will comment and let me know... hint, hint

:)

Wednesday, January 24, 2007

more on the independent project

one of my students sought guidance as to what was "required" with respect to the independent project for this course, and i find that i've composed an uncharacteristically lucid (for me, anyways) reply, with which i'm quite pleased. hereby i share it with you that it may inform your thinking on the subject:

.... ultimately, i'd like you to do a project that you (1) enjoy; (2) get something out of; (3) learn from; and (4) have an opportunity to apply what you will have learned in this class.

i know that all sounds very vague and touchy-feely, so, as to specifics:

(1) there should be some data involved. you can gather them yourself, borrow them from someone else, make them up (though scientists tend to frown somewhat on this last option), whatever. they should probably involve something you have at least a passing interest in... maybe?

(2) these data (probably) shouldn't have already been analyzed and published (at least in the way you plan to analyze them).

(3) you should do some analysis. test a hypothesis or two. do the right tests. do them the right way. don't violate any assumptions, or if you do, do it with panache. :)

(4) you should maybe keep a notebook, journal, big pile of paper (with a note on top that says "don't move this!"), or, if you like blogging, maybe a separate blog (blogger lets one "person" (your login name) keep multiple blogs) in which you record the progress and process of your analyses. i'll show you how to do this in class.

(5) you should write it all up in some format approximating the "methods" and "results" portions of a standard scientific paper. you may or may not want to add a little "intro" and/or "discussion."

(6) you should give it all to me by the end of the course. if you're not shy, you can show me what you've got as you go along, and i will try to offer helpful advice, humor, anecdotes, and other generally irritatingly useless and vague comments.

(7) or you can ignore all that and just do something else.

how's that?

Tuesday, January 23, 2007

G&E ch 1

seventy-five minutes isn't nearly as much time as you think it is. which is weird, because looking at the computer screen when i finished writing my 'socratic' questions for today, i thought to myself 'i wonder if these are enough to keep us talking the whole time?' the truth is, i probably talked too much. the 'atlas complex' is a hard one to shake. (i encourage those of you who see a teaching component in your future career to follow that last link!)

anyway, today's material was a bit of (very) general background on experimental design, followed by -- in my opinion -- just enough set theory and probability calculus to get us all into trouble, but probably not enough to get us back out again. hmmm. i wondered to myself as i was reading it over again last night just how central these topics were going to be in the rest of our 'primer' of (mostly) regular old parametric stats. i started looking up the terms they chose to boldface (e.g. 'complex events,' 'proper subset,' 'conditional probability') in the index and, curiously enough, most of them don't appear there at all. and most of the few that do appear (e.g. 'venn diagram') only come back up in passing in a later chapter. except 'independence,' of course.

so, given G&E's bold statements on the first page:
In this chapter, we develop basic concepts and definitions required to understand probability and sampling.... The concepts in this chapter lay the foundations for the use and understanding of statistics....
i'm curious to see how (and if) they link these back in in later chapters. (note my skeptical tone).

Monday, January 22, 2007

my own independent project

one of the main assignments for the biostatistics course i'm teaching is to devise and execute a small 'independent project' that will make use of the techniques that the students will be learning this semester. i've been deliberately vague so far about the details because i don't want to constrain anyone's thinking about what he or she might want to do, and some interesting ideas are already beginning to form. (see the blogs that this one links to).

i realize, though, that i do need to give some sort of guidance to my students (especially those who have never been involved in a research project before), but rather than a formalized checklist of things it has to include, etc., i thought it would be better to model the sort of project i had in mind with my own research.

dr. david heins and i have started working on another project together that will build on his work with the effects of infection with cestode parasites on the life-histories of threespine sticklebacks (Gasterosteus aculeatus). my contribution will include gathering data on the body shapes of fish from several different lakes in alaska (and possibly from the UK) and conducting the analyses to test the hypotheses that parasitism is associated with differences in overall body shape and/or differences in the shape of the head.

i hope to begin gathering the data sometime this week.

Wednesday, January 17, 2007

asset or liability?

in addition to these blogs that i'm requiring my students to keep, we're also making use of a "blackboard" course-content-management-system web site, and a wiki. each of these is hosted in different places, has different access controls, and -- i'm hopeful -- provides something useful to the students.

while getting all of this set up, though, it occurs to me just how complex all of this is, and causes me to wonder if the pedagogical value of these tools actually outweighs whatever confusion might arise from all this technological complexity. and, as if almost in answer to my own question, i just now had to go back and edit the links i included above because when i first wrote them, they were in wiki format (enclosed in double brackets) rather than regular html format (a href, and all of that).

this, in turn, caused me to remember how dr. dunlap (from whom i took all of my statistics -- back in the last century!) showed up to class with a stack of notes (which i don't think he ever looked at) and a piece of chalk.

i wonder if that was the better approach.

Tuesday, January 9, 2007

that first yucky skipping-ballpoint line in an otherwise pristine notebook

i thought it only fair -- since i'm requiring my students in Biostatistics and Experimental Design (EBIO 408/708) to keep a blog -- that i should start one up myself.

(actually, i believe i started one way back in the paleozoic of the blogosphere [that is, a couple of years ago], but i don't think i ever did anything with it, and i can't seem to remember what i called it, nor where it even was.)

so i'm requiring my students to record their various musings about their own voyages of statistical discovery in order to engage in what the educational psychologists call 'metacognition' -- thinking about how they think about things. of course, journal-keeping in academic disciplines is nothing new (although i suppose it is a bit unusual in a stats course), but i hadn't heard of anyone using this newfangled technology to enable students to read and respond to one another's thoughts.

i'm curious to see what they write.