Finding patterns in data

A Marine Ecological Example

Dataset: marine.xls

This case-study is typical of those seen in community ecology; species abundances measured at a set of sites and with associated 'environmental' data. The dataset was collected to test a specific hypothesis - that run-off from a rubbish site near Australia's Casey Station in Antarctica influenced marine benthic species composition and abundance. The study work was part of a PhD thesis by Jonny Stark of the Australian Antarctic Division, Hobart, Tasmania (see references below).

The Data

The dataset contains species abundances, physical and chemical environmental variables. The design of the experiment was to sample marine benthic species and associated environmental properties at sites close to, and far away from the influence of a rubbish site. While it would have been nice to have been able to setup a BACI design (before-after-control-impact), no systematic samples were ever taken of the variables before the establishment of the rubbish site. Therefore, a ‘control’ site was selected at a location that was hopefully out of the area of potential influence of the rubbish tip, but with (also hopefully) other superficially similar physical properties to the putatively impacted sites.

The sites marked as ‘BB’ are from Brown’s Bay near to the rubbish site (putatively impacted). Those marked “OB”, the control sites, are from O’Brian’s Bay, the next bay to the south of Brown’s Bay. In each area, the samples are from two locations. In the case of Brown’s Bay, closer to, and further from the shore (the rubbish site). Further details of the survey can be obtained from the references.

There are 16 sites from Brown Bay (2 lots of 8) and 16 from O'Brien Bay (also 2 lots of 8). There are 78 variables, that I have broken into 5 classes; biota, biotic summary statistics, chemistry, sediment grainsize and one a-priori variable (location). For 'location', I have simply labelled each of the four areas 1, 2, 3 and 4. I have done this as an 'a-priori' grouping. Think of it as a type of extrinsic variable; one that will not be involved in the analysis but may be handy in the evaluation.

Variables

Biota Stats Chemistry Grainsize A-priori
syllids individuals Zn Mgraph Location
capitellids richness As Sorting
paraonids shannon Cd Skew
orbiniids evenness Cr Kurtosis
maldanids Cu <2mm
cirratulids Fe <63mm
nephtyids Pb 2-63mm
opheliids Mn >2mm
spirorbids Hg Wholewt
Oligochaetes Ni Twt<2mm
Nemerteans Ag <63mmws
Nematodes Sb <63of<2
SBA Sn <2mm>63ws
Gastropod II TOC <2ws
Gastropod IV >2ws
Abatus MMicron
Ophuroids SrtMicr
Gammarid I
Gammarid II
Gammarid IIA
Gammarid III
Gammarid IV
Gammarid VI
Gammarid VII
Gammarid VIIA
Cumacean I
Cumacean II
Cumacean V
Tanaid I
Tanaid II
Tanaid IV
Isopod I
Isopod II
Isopod III
Isopod IIIB
Isopod VIIA
Ostracod I
Ostracod II
Ostracod V
Cyclopoid I
Harpact
Pycno I

References

Austin, M.P. & Belbin, L.(1982). A new approach to the species classification problem in floristic analysis. Australian Journal of Ecology 7, 75-89.

Belbin, L., Faith, D.P. & Minchin, P.R.(1984). Some Computer Algorithms Contained in the Numerical Taxonomy Package NTP. CSIRO Division of Land Use Research Technical Memorandum 84/23. October 1984, Canberra. 31p.

Belbin, L. (1991). Semi-strong hybrid scaling, a new ordination algorithm. Journal of Vegetation Science, 2, 491-496.

Belbin, L., Faith, D.P. & Milligan, G.M. (1992). A comparison of two approaches to beta flexible clustering. Multivariate Behavioural Research, 27, 417-433.

Clarke, K. R. (1993). Non-parametric multivariate analysis of changes in community structure. Australian Journal of Ecology 18, 117-143.

Stark J. S. (1998) Heavy metal pollution and macrobenthic assemblages in soft-sediments in two Sydney estuaries. Aust. J. Mar. Freshwat. Res. 49, 533–40.

Stark J. S. (2000) The distribution and abundance of softsediment macrobenthos around Casey Station, East Antarctica. Polar Biol. 23, 840–50.

Stark, J. S. (2000). The distribution and abundance of soft-sediment macrobenthos around Casey Station, East Antarctica. Polar Biology 23, 840–850.

Stark J. S. (2001) Human impacts and assemblages in marine soft-sediments at Casey Station, Antarctica. PhD Thesis, University of New England, Armidale, NSW.

Stark J. S., Riddle M. J., Snape I. & Scouller R. C. (in press) Human impacts in Antarctic marine soft-sediment assemblages: Correlations between multivariate biological patterns and environmental variables. Est. Coast. Shelf Science.

The Goal

To see if PATN can be used to understand the nature of the variation of marine benthic communities and to identify what if any environmental properties seem to be controlling the distribution of species. Do we have a difference in species communities from putatively impacted to ‘non-impacted’ similar sites? We will see.

The Analysis

Step 1: Import the data

The original data is in a Microsoft Excel™ file with the samples forming the rows of the data matrix. Here is what a component of the original data looks like in Excel™-

Excel page showing top left of data matrix

In this dataset, all the biotic variables happen to be on the left of the Excel file and all the 'environmental' variables are on the right-hand side. To PATN, it would not matter at all what order the variables or the samples were in; the results would remain the same. The order of the variables has no influence on the analysis.

So, to PATN. We start PATN and select Import from the '"What would you like to do?" opening screen. As there are in this case, a number of worksheets in the original Excel file (but not the one on the Web site), PATN will ask which worksheet is to be imported. In this case, I select one called 'PATN', that has both the biotic and environmental variables together in the one worksheet. PATN will scan through the worksheet on entry, checking values for setting up defaults, and creating the 'visible stats' values to the right and to the bottom of the Data Table. PATN will also display the record number during the import and then, on completion, display the total number of rows and columns in the Data Table. PATN has automatically picked up the first row as the column labels and the first column as the row labels.

Here is the resulting Data Table. As you can see below, PATN is reporting 32 rows (samples) and 78 variables. Note that the visible stats are currently displaying (by default with PATN V3.1), the minimum and maximum values and the number of values > 0. Because we have a mixture of intrinsic (the variable that will be used in the analysis phase) and extrinsic (the variables that will be used only as an aid to evaluating the analysis), the row stats will be 'messy'. There are some environmental variables with very high values while many of the abundances are zero. The row statistics contain biotic and environmental values. The column stats are however very useful.

Step 2: Identifying Extrinsic Variables

The next step is to identify to PATN, those variables that we only want involved in the interpretation of the analysis; the extrinsics. In this case, all the extrinsic variables are on the right-hand side of the Data table so it makes it easier. Click the left mouse button on the column marked 'individuals', scroll across to the right-hand side of the Data Table using the scroll bar, hold down the SHIFT-key and click the column, marked 'Location'. All the extrinsic variables are now highlighted.

Next click the 'Extrinsic' button Make extrinsic button to move them across into the extrinsic area of the data table. Marginal stats - number > 0 for rows Now note the TABs on the on the right-hand side of the Data Table. They are marked 'Stats',

'Intrinsic stats', 'Groups' and 'Ordination'. The intrinsic stats tab applied only the intrinsic variables. 'Groups' will list the group number for each of the rows. As now, these numbers will be 'missing' as no groups have been created as yet. If there were a-priori groups, they would also be listed on this tab. The ordination tab is used to display the ordination coordinates. As with the groups, they are currently set to missing as an ordination has not yet been created.

If you now click the Intrinsic Stats TAB, you will get a summary of only the intrinsic (biotic) variables.

This is now a lot easier to browse to make sure all your data looks 'right'!

We can now see that sample BB2S1P13 ranges from 0 to 128 individual counts of species and that there are 15 non-zero values. Scanning down the sites, we can see that there is a site with a maximum of 6 and one with a maximum of 200. Quite a range, but expected for this type of data. Is it ok to leave the coding as is? We will address that below.

Step 3: Save the Data as a PATN Project File

At the moment, the imported data, with our newly created extrinsic variables has not been saved. If we were to try and exit PATN, it would warn you to save the file. But, it is a good idea to save the data at regular intervals so that you can, if required, back-up and take a different analysis route. Now is a good time to save the data. Just press CTRL-S or select File | Save, or File | Save As and call it Casey-1; our first version of the data.

Step 4: Examine the Values of the Intrinsic Variables

Examining the intrinsic values is always a wise move, regardless of whether you think the data is just fine. If you have collected the data, it is hard to stand back and make sure that it is appropriate. It is far easier to do a few quick checks before you get into serious analysis than trying to 'back-peddle' when you find some strange outlier caused by a slip of the fingers, or a stray cosmic ray, or even some digits falling down the cracks between various bits of software.

The intrinsic data is species abundance values. Typical ecological data. The way that you must start to think in Pattern Analysis, is "what is PATN going to do with these values?" This, of course takes a little practice, and this is why the use-case. The default 'visual stats' are minimum, maximum and number of values > 0. They are the defaults as these parameters are the most useful for data such as this. Scanning the columns, the minimum values should all be 0, and this seems to be the case here. The maximum values vary considerably, from 1 to 200. This is an issue that will need to be addressed. Why?

PATN has already figured out that the Bray and Curtis association measure will be best for these data. How has it done this? PATN has examined the Data Table and noted that the number of zeros in the table suggests that the data seem to be 'ecological'; counts of species. This time, it got it right. Even if it wasn't species data, the conclusion as to an appropriate measure of association would likely have been the same.

The Bray and Curtis association measure is effectively 'the sum of the differences between two objects (sites/samples in this case) over the sum of the sums of the two objects (see the PATN Help). Therefore, if nothing is done to recode the values, the species with the higher abundances will be more influential than those species with low abundances. Is this what is wanted? Usually the answer is "no". What should be done then?

Step 5: Transforming the Abundance Values

We probably need to down-weight the species with high abundance values. This must, by the way, up-weight species with lower abundance values! Very few analysts seem to recognise this fact. OK, how should be down-weight & up-weight? The easiest method is to use some form of transformation. [A transformation in PATN means that a value in the Data Table can be changed without any need to reference other values]. Common choices of transformation include (in order of increasing intensity) square root, cube root and log. How intense? This is a hard question to answer. The more intense the transformation, the less influence abundant species will have and the more influence less abundant species will have. So, if you think there is some systematic signal in the less abundant species, up-weighting should do little harm.

Highlight syllids column in Data Table

Looking at histograms will help to get a little better idea of how the values are distributed. You would expect rather skewed, I hope. In PATN, you can easily get a histogram of all species abundance values. Click the left mouse button on the label of the left-most column (syllids) and then scroll to the right, hold down the SHIFT- key, and click on the species label furthest to the right in the intrinsics in the Data table (Pycno I). All the species will now be highlighted.

Next, simply click the histogram button, just to the right of the Data Table button on the PATN button bar; the one that looks like a histogram. The histogram and the summary statistics will now be displayed. Sure enough, quite skewed. The histogram also confirms that the minimum value is 0 and the maximum value is 200.

So, back to transformations. What is the square root, cube root and log of 200? Answer: ~14, ~6 and ~2 (base 10) respectively. If I have the opportunity, I usually asked the person who collected the data how they would discriminate between various values (in this case abundances). Most of the time, I found that around 5 values were sufficient to provide a level of discrimination that made sense ecologically. For example, 0 mapped to 0, 1-3 mapped to 1, 4-10 mapped to 2, 11-30 mapped to 3, 31-100 mapped to 4 and 101+ mapped to 5. Coming from an analysts perspective, 5 or so categories seems to reduce the noise with little loss of signal.

Transformation standardization options

It is always wise to err on the conservative side. A square root transformation doesn't look like producing a strong enough influence and a log transform seems a tad 'heavy'. I'm therefore going to opt for a cube-root transformation.

Click on the transformation/ standardization button on the top-right of the PATN Toolbar,Transformation button  and select all the intrinsic variables (the species), click the Add button, then select the top transformation option "Value**Constant", and then press the Run button. You will then be asked for the value of the exponent, enter 0.3333 and press OK.

Transformation exponent

That's it. All the abundance values will now be transformed such that abundant species will not swamp the analysis, and hopefully, the rarer, but meaningful species will play a more important role. Fingers crossed.

Just to be safe, do a File | Save As, and save the transformed data to a new PATN Project file - call it something different to the unstandardized file. I'm going to call this project file "Casey_2".

Remember: If you think you can avoid this transformation decision, you ARE making a decision anyway - you have chosen to have an analysis strongly influenced by species with higher abundances!

Step 6: First Analysis

It's time for a preliminary analysis. There could be some really noisy species in this dataset, so we may need some refinement. That's what PATN is designed for; interaction. Click on the analysis button on the PATN Toolbar. Analysis button

The master data analysis window is displayed. "Perform analysis on rows" should be already selected.

Analysis options window

Next, click in the "All Evaluation" box as this will automatically apply all the evaluation options (box and whisker plots, ANOSIM, PCC and MCAO) to all extrinsic variables. This will save us having to manually run the third phase (prepare, analyse and evaluate), although we can easily run any part of the process in seconds if we wish to.

At this stage, we will not run any analysis of the variables (the columns of the Data Table). Again, we can do it later if we want to.

6.1 Association

Association is the first an most important step in an analysis. The subsequent classification, ordination or network will depend totally on the measure of association chosen. Get is wrong here and recovery of real patterns may be an isue.

Association matrix and histogram display

Select "Set Parameters" and the association tab of the analysis window will be displayed and the Bray and Curtis association measure will be highlighted.

Why Bray and Curtis? This measure has been shown to best relate to what we may term 'ecological distance'. How do we know this? More on this below, but for the moment, we also know that even this measure tends to underestimate true distance. Simply put, once two objects (sites in this case) fail to have many or any species in common, it is difficult to estimate the true association. Bray and Curtis will go to a value of '1', when we know that some association values must exceed '1'.

PATN will produce a lower-symmetric matrix of Bray and Curtis values ranging from 0-1 where 0 means that the two objects (sites here) are identical and 1 means that the two sites share no species in common and are maximally different. Bray and Curtis is therefore a distance-type measure, not like the (Pearson's Product Moment) correlation coefficient. All association measures in PATN are distance-type. What's a lower symmetric matrix? Think of a spreadsheet where the row and column are both labelled with our 32 sites. We can enter the value of the Bray and Curtis association between site 1 and site 1 (itself) as 0. Ditto with all self comparisons, so we really don't need to store those values as we know what the result will be. Comparing site 1 with site 2 will give the same result as comparing site 2 with site 1. The association is therefore symmetric (two-step is not totally but we won't go there for the moment). Therefore we also don't have to store one side of the comparisons. 'Lower symmetric' means we just store the lower-left part. You will see this structure if, after running the analysis, you click the association matrix button.

6.2 Classification

Next, click on the classification tab and let's see what we may want to change from PATN's defaults.

Classification options

Agglomerative hierarchical classification is highlighted (because we have less than 100 objects) with Flexible UPGMA set in the selection box. This particular strategy is selected because it is the one that has been found to have the best recovery of 'truth' for 'ecological-style' data. Truth has been defined by knowledge of how species relate to environmental gradients, and generating artificial datasets with known properties.

Flexible UPGMA weights objects equally (Flexible WPGMA weights groups equally), and is the technique that also best correlates ultrametric distance with our association matrix. That is, if you were to measure the implied associations from the resulting dendrogram, those would most closely match the input association matrix.

The beta value controls the distortion of the 'space' defined by the species; hopefully some form of 'ecological space'. Negative beta values 'dilate' the space (make objects and groups appear to move away from each other as fusion progresses, just like stars in our galaxy and galaxies in our universe). Positive values do the opposite, this is called 'space contracting'. A beta value of '0' is termed space conserving. I use a default beta value of -0.1 (slightly dilating) as it counteracts the underestimation of larger Bray and Curtis values.

Lance and Williams (the inventors of the flexible formula which gave rise to the 'flexible' strategy) used to use Flexible WPGMA with a beta value of -0.25, but the beta values are not totally equivalent in terms of the degree of dilation/contraction. Lance and Williams didn't like dendrograms that were 'chained' (where a joins b, then c joins a and b, then d joins a, b, and c etc). While chaining does tend to make the dendrogram difficult to interpret (you often have a few large groups and all the rest are single- membered groups), such as structure may be reflecting reality. A degree of chaining often reflects species partitioning environment gradients.

If you decrease the value of beta toward -0.5, it will force each group in the dendrogram (regardless of where you slice it) to have an equal number of members. This is a rarity in reality, unless you know what is driving the variation and sampled carefully to partition it! So, I suggest you stick with beta = -0.1 until you know more than I do.

The number of groups? This is a "how long is a piece of string" question. The dendrogram presents everything from n groups (where n=number of objects) to 1 group, but the structure of the dendrogram will give you some idea if 'natural' (read well-defined') groups exist, and maybe a hint on how many groups may be appropriate. From my experience, a quality dataset of 400 objects can lend itself to interpreting over 200 meaningful groups. But, try communicating 200 groups in a seminar or a journal paper - not likely! Many years ago, Bill Williams suggested that the a rule of thumb of the square root of the number of objects seemed about right. He was implying that information increased with the square root of the number of objects. This is the starting point in PATN. Personally, I like 5 for just about everything. Why? Most people can hold the concept of 5 different things in their mind concurrently. This means, if your going to communicate an outcome, 5 groups maybe a very effective number. I'm going to use 4 here as I know there are four sample locations. Simple.

Think of classification as reducing the n rows by m columns in the Data Table to g (where g=number of groups) rows by m variables.

6.3 Ordination

The term ordination is derived from the German word for axes. In Pattern Analysis, ordination refers to a suite of techniques that attempt to reduce the dimensionality of a dataset. The Casey data has 42 intrinsic variables (the species), that could be considered as axes (using abundance as the scale) - all at right angles to each other. Tough to visualize, but if we could view the 42-dimensional space. we would see a set of points representing the sites. Ordination attempts to squeeze the 42 dimensions down to a few that we can visualize; 2 or 3 if we are lucky. The measure of success is summarised by stress.

In PATN, the ordination method is called semi-strong hybrid multidimensional scaling, or SSH for short. Multidimensional scaling (MDS) is probably the simplest, robust and most effective ordination technique currently around. What MDS does is to first ask you how many dimensions you want to try for. Same issue here as the number of groups, but you know that it will have to be 1, 2 or 3 in PATN, and 3 is the default. MDS then will randomly allocate some x, y and z coordinates to each site. It will then measure the Euclidean distance between each pair of sites. This will generate another lower-symmetric matrix. MDS then compares (using regression) the Bray and Curtis association values with their corresponding Euclidean distance values. As you may expect, the result will almost certainly show a poor relationship, but the MDS algorithm will know which way to move the sites in the Euclidean space to improve the fit. The points are moved and then the Euclidean distance is re-measured, followed by another regression. This process repeats (iterates) until any movement of the sites will decrease the fit. The degree of fit is the stress.

SSH has two features that are in improvement over basic MDS. First, unlike all other MDS programs, SSH can mix linear (metric) regression with ordinal (non-metric) regression. From above, we know that association measures such as Bray and Curtis will be accurate for smaller association values but will underestimate larger association values. What this means is that smaller association values will be linearly related to true distance and, at best, larger association values will be ordinally related to true distance. Therefore SSH asks for a cut-point association value. Linear regression will be used on all association values below this value and ordinal regression will be used on association values above this. Simulation studies suggest that a value of around 0.9 produces good results. That's the 'hybrid' part of SSH.

So what is 'semi-strong'? Most non-metric (ordinal regression) MDS programs will not break tied values of association. This means that 1's will remain 1's. We know however that many 1's are greater than 1, in some cases, much greater. If there are a lot of zero values in your Data Table, chances are that there will be lots of 1's in your Bray and Curtis association matrix. It is by the way, a good idea to view the histogram of association values, once the analysis has been run. Semi-strong ordinal regression will allow tied values of association to be broken. Doesn't this seem like a logical strategy?

Ordination options

What else is on the ordination tab? Number of random starts. Why? Well, SSH, like most Pattern Analysis algorithms (including classification) can rarely if ever guarantee to find an optimal solution. PATN's algorithms are heuristic. They use simple rules of thumb to try and find the best solution.

By default, SSH tries 10 different random starts (randomly allocated coordinates) and picks the one that gives the best result. Ten is usually enough to be reasonably confidant that you have found the best, but never completely. If you want more, try more, but be warned, SSH run-time is proportional to the cube of the number of objects! 200 objects may be fine, but for 300, you make to take an extended coffee break.

Think of the random starts like this: The 'solution surface' of SSH is like a landscape, where your looking for the absolute lowest point, the lowest stress value. Each random start is like a parachutist who is blindfolded and dropped from an airplane and given the instructions "Land and then proceed downhill till you can't go downwards any further. Stop and report your altitude". As you can see from this analogy, having 10 parachutists is a lot better than having one and relying on the fact that they have found the overall lowest point in the landscape!

Random number seed? PATN has an inbuilt random number generator that is used to generate a sequence of random values. Entering the same seed value as a previous run will (hopefully) result in the same values being produced.

Number of iterations? This is how many times SSH will move the points and do the regressions. Fifty is usually more than enough. If the movement is too small, or the stress goes up, SSH will stop at that iteration anyway.

Analysis recipe

Now you can press Run, and the Analysis Recipe window will be displayed. It lists the options selected for the analysis. Just run your eye over it an check that it is indeed what you want to do.

Note that by selecting 'All Evaluations', PATN has automatically added in all the extrinsic variables to all standard evaluation modules (box and whisker, ANOSIM, PCC and MCAO). More on those later.

At this point, you can save the recipe to a text file by pressing Save and/or just press OK and stand back! PATN will do all the analysis and evaluations, then it is just a matter of examining all the results at your leisure. The analysis and evaluations should only take a few seconds on the Casey dataset.

SSH will report the stress automatically. As PATN uses the ordination as the basis of its display, this is one of the more important values to watch. Stress!

Ordination stress

Should I be happy with a value of 0.1086? Basically yes, but here is the rule of thumb that I use. It is based on the experience of Joe Kruskal (the inventor of MDS) and myself, but in Pattern Analysis, as ever, these values are just guides. If you see something that doesn't quite jell, then think about it and test a few ideas. While I have designed PATN to be a lot of fun (easy and exciting) to use, don't treat it as a 'black box'. Think!

SSH uses Kruskal's stress formula 2: SQRT((Dout-Dest)/Dout) where 'Dout' is the Euclidean distance as measured in the reduced dimensional space and 'Dest' is the estimate of that distance from the regression. Stress varies from 0 (perfect fit) to asymptotically approaching 1. Here is my take on stress-

Stress Value Interpretation
> 0.3 Try again!
0.2-0.3 Not good
0.15-0.2 Lower would be better!
0.1-0.15 Maybe OK
<0.1 Not bad
<0.05 Too good!

So, we are in the 'acceptable' region right off. This is a very good sign, but we may be able to do better. I'll bet the stress would have been higher if we had not transformed the abundances. Why?

Step 7. Interpretation of the Analysis

The analysis is quick, the interpretation isn't! A thorough knowledge of your data is essential if you are to get the most of out of it.

7.1 Association

The first action should be to look at the histogram of associations to see what the extent of the under-estimation problem may be. The stress suggests that it may not be too bad, and that is what we find. The environmental gradient may not be too long. The histogram shows 35 or so values above 0.9, but this is only around 7% of the values. Comforting. Why? If we got too many 1's (say 90%), then there is so little discriminatory information (so little overlap between sites in terms of species composition) that we should start again!

Association matrix and histogram

7.2 The Dendrogram

Next, let's look at the dendrogram of the sites, press the button on the toolbar that has a dendrogram on it. Dendrogram button

I'm going to reduce the size of the font to enable the whole dendrogram of the 32 sites to be displayed within the window. Using the right mouse button in the windows is a function that you should use all the time. So, press the right mouse button in the dendrogram window and select font and select '8'. I could have done the same thing from the Tools | Options menu by the way. In either case, the default font for all the displays will now be Arial 8.

Row dendrogram

What can we make from the dendrogram?

Four groups certainly look like a solid bet, with the BB3 group (Brown Bay outer) looking like a clear split at the 5-group level if you wanted to go that far. While there is more interpretable structure below the 5-group level, we won't go further at the moment. Suffice to say, that generally the replicates (the 6th letter in the site codes) within the 4 groups appear in terms of species abundance, to be replicates. That's nice.

We seem to have perfect discrimination between the four sample areas (BB2, BB3, OB1 and OB2) with the BB (putatively impacted) sites and the OB (hopefully the control) sites nicely split at the 2-group level. We could not ask for a better result first off!

We could in fact write up a pretty good report based on the dendrogram and the group species composition alone. Not a bad start.

7.3 Box and Whisker Plots

Box and whisker button What next? If we had done an analysis of species, we could have done a two-way table. Let's take a look at the box and whisker plots and see how the environmental variables may help us to interpret the groups. Press the box and whisker button on the PATN task bar.

Box and whiker plots

By default PATN sorts the box and whisker plots by decreasing Kruskal-Wallis statistical values (the most significant discriminating variables are at the top of the window). It's now obvious that our location code is the best. Perfect discrimination of the 4 groups!

The next best discriminating variable is 'individuals'. Group 1 (BB inner) has the highest number of individuals (but the greatest variation) while groups 2 (BB outer) and 3 (OB1) have the lowest number of individuals. Group 4 is intermediate. Is this counterintuitive? Yes, maybe a little if you are not familiar with these type of data. Would we expect a high abundance near the contamination source? My first thought is "no", but let's look further and see if it adds up.

Copper values are high in groups 1 and 2 and low in 3 and 4. This makes perfect sense. Because there is no overlap in quartiles (edges of the boxes), we know that there is a statistically significance difference in Cu between groups 1+2 and 3+4. I'll look at the other metals further down.

What else? We could look through the grainsize, but let's see if we can figure out more about the species stats. Site richness (the number of different species at a site) has a KW value of 23.3; very high. There is a significant difference between the BB sites and the OB sites with the OB sites being high and the BB sites low. Shannon-Weiner diversity, with a KW value of 19.6 says the same thing. So, the BB sites are high in number of individuals but low in diversity while the opposite is true of the OB sites. This may make ecological sense. There could be a few specialist species that can cope with impacted sites and can take advantage of lack of competition from a diverse range of species operating in the benthic environment.

Now, let's look at the chemistry, the metals that we would expect to see with the rubbish site in Brown Bay. Sb (antimony), Cd (cadmium), Zn (zinc), Cr (chromium), As (arsenic), Fe (iron), Pb (lead), Ni (Nickel), Sn (tin), Ag (silver) and TOC (total organic carbon) all echo the same pattern (in decreasing discriminatory order); high in Brown Bay and low in O'Brien Bay. Note that in most cases, group 2 has less variation in the metals than group 1. The only anomaly in the trend is cadmium. It is high in group 4, the O'Brien Bay site 2 where you would not normally expect it. Why? Superficially, given the spectrum of metal discrimination, it would appear that this site is naturally high in cadmium. This idea (hypothesis?) should be tested.

It is interesting that Mn (manganese) and Hg (mercury) appear to show poor site discrimination. Mercury was probably at the limits of detection, but this also needs to be checked. Manganese may be naturally occurring in the sediments, be mobile, or be in low concentrations in the tip or all three. Again, this should be tested. Either way, we should know what is happening and then may be able to refrain from using variables that effectively don't vary significantly enough to help untangle the patterns and processes in similar situations. When you move to another environment, however superficially similar it may appear, one never knows how the conclusions from prior sites will hold up. That's science.

One thing that the KW values can suggest, is to see if we can improve the 'signal to noise' within the dataset by eliminating the least discriminating variables. The extrinsic variables mentioned here have no influence on the analysis so removing them from the dataset is meaningless. What we can see so far is that we have excellent group discrimination and what appears to be a fairly clear ecological story; we appear to be getting a fair idea on what appears to be causing the variation. Other external factors may be driving it, but at least we know what the variables in the dataset seem to be saying. What we don't have as yet is a species profile of the groups, other than the issue of number of individuals and richness. I'd like however to see what the other evaluation tools are saying before I go back and have a look later at the species profiles.

7.4 ANOSIM

ANOSIM button ANOSIM is a technique for the 'analysis of similarities' due to Clarke (1993). ANOSIM is more akin to ANOVA using association values rather than raw data. The approach taken in PATN is to use the generated association matrix, calculate the average within group to between group association (like an f-ratio), remember this value, then randomly re-allocate objects between groups and re-calculate the within/between association value. This permutation process continues usually for 100 or 1,000 iterations. The idea is then to see how many of the permuted configurations generated an average within/between value greater than the actual one. The idea suggests you have a significant result if less than the equivalent of 5 in 100 or 1 in 100 permuted solutions beat the real one (or whatever significance level you come up with).

A lot of people use ANOSIM, but I find it extremely rare to find any permuted configuration to beat the one generated by PATN. After all, that's what PATN is designed to do! That's just what happens here. The real average between/within is 2.15 while the best permuted value is 1.23. Surprise! ANOSIM was mainly included in PATN to test a-priori groups (those established in association with a hypothesis). With this dataset the a-priori groups (1+2) vs. (3+4) would certainly be significantly different!

7.5 Ordination

Ordination button Now, let's get to the real cute bit in PATN, the 3d-ordination display. Does it confirm the classification? Does it tell us different things? Click on the ordination button on the PATN toolbar to find out.

Row ordination plot

The image to the left is obviously a 2-dimensional image of a 3-dimensional structure. I've also added the minimum spanning tree (MST) to help clarify the structure (press "m" or use the right mouse button menu for many other options).

The colours are meaningful, and will also give you an idea of where the groups are. If you want the group colours (all objects in each group have the colour of the group centroid), simply press "g" on the keyboard and there they are.

In PATN, my favourite (most used) tool is the manual rotation available in the 3d ordination display- hold the CTRL+SHIFT keys and with the left mouse, drag the structure until you have a comprehensive knowledge of what the configuration is like.

You can see that one of the objects is labelled (BB3S2P22: a Brown Bay outer site). You can also click the left mouse button anywhere not on a site and you will get the location of the closest group centroid. This will tell you where the groups centroids are. Click you way around the configuration to help to understand what PATN is trying to convey to you. A lot of information is in this diagram, so invest the time.

7.6 PCC

PCC button I have also selected 6 extrinsic variables to display using PCC (also from the right mouse button menu). PCC is from PATN's old DOS version where it stood for Principal Component Correlation. PCC actually does a multiple linear regression on the user selected variables. Remember we selected all evaluations, so each extrinsic variable will be independently fitted . So, the vectors you see are the best linear fits in the ordination (MDS)-space. Hopefully, in this space, linear makes sense. I selected the 6 variables to display on the basis of their r-squared values (listed in the variable selection table). All these variables have r-squared values greater than 0.6. Remember r-squared is roughly the amount of variance (variation if you like) accounted for. High r-squared values are implying that the ordination (done by the species remember!) seems to be reflecting these environmental variables. The implication is that the species abundances are in some way controlled by these environmental variables. Modelling could confirm this.

As expected, the metals all increase toward groups 1 and 2, 'individuals' increase toward group 1 (BB inner) while 'site richness' increases toward groups 3 and 4. This all makes good sense. You can get a listing of the coordinates and the r-squared values at the bottom of the Data Table by pressing the PCC button. This will simply select the PCC tab for you.

If we drag between groups 1 and 4, on the left of the 3d-display, we see that the best 5 discriminating variables are location (surprise!), richness, Cu, Sb and individuals. That makes sense as group 1 is closest to an old tip site and group 4 is furthest away. Obviously, there are a lot more subtleties and these are included in the publications noted above.

The variables that are showing a high correlation in the ordination space are either responsible for the patterns as intrinsic variables (the species abundances) or potentially causative, as in the case of the extrinsic (environmental) variables. There is a lot to be learnt particularly for the latter. But, remember that there is also the other side of the 'coin' - whose variables with low r-square values, for example, syllids or Hg from the table below (which was extracted from the Data Table export and sorted by r-square value). Why are these variables appearing not to be 'important'? Syllids are in low abundance (there is a maximum of 2 individuals in two sites and single specimens in another 4 sites) but in all site groups except 1. Syllids don't appear to be great discriminators, except maybe group 1, but their low abundance is not solid evidence. With more abundant species, a low r-square value is possible if the abundance at sites increases along a line in the ordination and then decreases again. There is a pattern, but it is not linear. In fact, this is what you would expect normally if the environmental gradients are long enough. A species will prefer a combination of environmental conditions. Move away from any of those conditions in any direction and you would expect lower abundances. What about Hg (mercury)? As noted above, it is probably on the limits of detection, so is probably in the same situation as the Syllids. Mn (manganese) discriminates a little between groups 3 and 4, but is widespread. But, unlike the Syllids, you would expect variables such as Hg to be linearly related to this ordination space. This was a brief outline of what must be done for each variable until you understand what is happening and/or understand what must be done to understand!

Variable r-squared Variable r-squared Variable r-squared
Gammarid IIA 0.88 MMicron 0.58 (<2mm>63ws 0.29
Cumacean I 0.86 Sorting 0.57 nephtyids 0.29
Gammarid I 0.85 Cd 0.57 Fe 0.27
Gammarid VIIA 0.79 Mgraf 0.54 evenness 0.26
Ostracod I 0.79 Skew 0.48 Gammarid VII 0.26
Gammarid II 0.77 Cumacean II 0.48 maldanids 0.23
Gammarid III 0.77 Tanaid IV 0.46 Ostracod V 0.23
richness 0.75 Ag 0.45 Isopod II 0.23
Tanaid I 0.73 Twt<2mm 0.44 Gastropod II 0.22
Gammarid IV 0.73 Pb 0.43 Abatus 0.21
Isopod I 0.70 SBA 0.43 >2mm 0.20
shannon 0.70 Gammarid VI 0.42 <2ws 0.19
Oligochaetes 0.69 <2mm 0.40 >2ws 0.19
individuals 0.69 capitellids 0.39 Cyclopoid I 0.19
Ostracod II 0.67 Wholewt 0.39 orbiniids 0.18
paraonids 0.67 2-63mm 0.39 Gastropod IV 0.18
Sb 0.66 Nematodes 0.39 Kurtosis 0.16
Ni 0.65 Cumacean V 0.39 Mn 0.15
Tanaid II 0.63 spirorbids 0.38 Isopod III 0.15
cirratulids 0.63 Cu 0.38 Isopod IIIB 0.15
Zn 0.61 <63mmws 0.37 Pycno I 0.15
SrtMicr 0.61 Sn 0.36 Nemerteans 0.13
As 0.61 <63of<2 0.34 opheliids 0.11
Cr 0.60 Isopod VIIA 0.34 Ophuroids 0.09
Location 0.59 <63mm 0.33 syllids 0.04
TOC 0.58 Harpact 0.30 Hg 0.03

7.7 MCAO

MCAO button MCAO standards for "Monte-Carlo attributes (variables) in an Ordination". It is a permutation test similar to ANOSIM, except that the values of the extrinsic variables used in PCC are randomly re-allocated between objects (sites) and the multiple linear regression (PCC) re-run. Usually 100 or 1,000 permutations are run. Each result is compared with the true r-squared value. Personally, I think the test here is more meaningful than ANOSIM! MCAO is trying to test the 'robustness' of the multiple linear regression (PCC).

MCAO statistics table

We can access the statistical results by pressing the "M" key on the PATN toolbar.

In this dataset, most of the PCC-created vectors are statistically significant. Those that aren't are Mn, Hg, Kurtosis (the 'peakiness' of sediment sizes), and a few of the grainsize fractions. If you look back at the box and whisker plots, there is a reasonable story why each of these are not that useful. We can therefore feel reasonably confidant that the other vectors are meaningful.

You should examine each significant vector in the 3d-plot to make sure that you have a good understanding of what it is saying. Also, try and see if there are any anomalies. We could write a small book on this dataset (and a fair bit has already been written!) so I won't go into further detail on the groups or ordination, as there is another cat to fry!

8. What now?

Well, the proper thing is to see if we have answered the question we started out with, viz "To see if PATN can be used to understand the nature of the variation of marine benthic communities and to identify what if any environmental properties seem to be controlling the distribution of species. Do we have a difference in species communities from putatively impacted to ‘non-impacted’ similar sites?"

We seem to have answered the first question. We know that there are significant (even statistically!) differences between the impacted and non-impacted sites and we at least have a good idea what maybe causing this. At this point, we could run specific experiments to prove the link between contaminant metals and 'individuals', 'richness', 'evenness' etc. While we can assume that there are real differences in species composition between the four groups (or the two groups), we don't actually know what the species profiles are. Remember that our intrinsic data is species composition, and we have four very different groups, that must be based on differences in species composition/abundance.

I would like to run PCC and MCAO on the species to a) understand a little more about the species profiles of the groups and b) to see if we can improve the signal to noise ratio by skipping species that do not appear to be discriminating. That even raises the issue of c) can we develop one or more indicator species; those that could be used in other studies in a similar environment to detect impact? Maybe we can.

9. Evaluate the Species - No, re-run the Analysis.

To run PCC and MCAO on the species abundance data, we could go back and click the evaluation button Evaluation button, but to show you more of how PATN works, I'll go back a re-run the analysis, this time getting an analysis of the species as well as the sites.

Select the analysis button again Analysis button and leave the original parameters that were selected for the sites, but this time, select an analysis of the columns (the species) and hit the 'Set Parameters' button. For the association measure, select 'Twostep' (see Austin & Belbin, 1982). This is an unusual association measure; one designed specifically for species, but one that may have broader application.

With species community data, an absence of a species is very different to a presences of a species. In mathematical terms, the situation is not symmetric. Short of miss-identification (which is common!), if a species is found at a site, it is really there! If a species is however not found at a site, it may not be there for many rather different reasons. It may not be there because the investigator didn't happen see it, or sample it, and it was there. The species may not be there because it really wasn't there, because it couldn't be there for many reasons. This observation gave rise to an asymmetric association matrix (Bray and Curtis as it happens); a species could only make a direct observation of other species at a site if it was there in that site itself. This builds in a degree of robustness, but how do you now deal with this asymmetric matrix? I treat it like a data matrix, comparing the rows is like comparing a pair of species on the basis of their joint association with all other species, including each other. So, that is why it is called Twostep. It is a neat strategy that can get around the underestimation problem (species that have no co-occurrences can be meaningfully associated (linked) via other species).

Under the classification tab, and make sure hierarchical classification is selected, with Flexible UPGMA again, and select 6 groups, then run the analysis. This process will replicate the site analysis, but without the extrinsics. We will deal with the intrinsics later.

9.1 Dendrogram of Species Groups

Column variables dendrogram

Let's take a look at the species dendrogram. The first thing to note is the overall structure. There does appear to be reasonably well-defined species groups. We can tell this from the gap between group fusions toward the top of the dendrogram. We could just as easily selected 12 groups or 2. I'm sure both configurations would have been readily interpretable. For the moment, 6 groups seems a fair start to summarise.

Note that unlike the colouring on the site dendrogram which comes from the ordination, we simply alternate colours of the 6 groups between blue and red down the dendrogram. This makes our 6 groups easier to see.

What else? We have Gammarids in group 1, 4 and 5. We also have Tannaids in 1 and 5, Isopods in 1,2 and 6, and Ostracods in 1 and 6. This means that there is greater similarity (smaller association value) between some different genera in terms of where they occur, than there is between species. This may simply imply that different genera may coexist for a range of good reasons. They may be less competitive for resources, be partitioning the resources differently etc. Overall, there are as many species in the same group as there are in different groups so no universal truth there.

By the way, the quick way to scan the species groups is to click on the Groups tab at the bottom of the Data Table.

Anyone with a good knowledge of the marine benthic species should be able to easily put a label on each of the 6 groups. As I will assume that most of the audience is probably not in that category, let's see how PATN can tell us more about the local ecology.

 

 

 9.2 Two-way table

two-way table button Let's take the next button that we haven't used on the PATN Toolbar- the two-way table. You can select a series of options controlling the display. I've gone with the defaults, but when the two-way table was displayed, I clicked the right-mouse button and swapped the axes to lay out the table in a form more suited to presentation.

Two-way table We can immediately see that the species groups (the row blocks or noda) are partitioned neatly between the site groups (the column noda).

Species group 1 we can now see are the 'generalist' species. While they occur mainly in group 1 (darker shading), they appear in all groups to a fair extent. If they can handle the most impacted site, it looks like they can handle anything! This makes sense. If the other species groups are not able to handle the heavy metals as well, this leaves this part of the 'environmental space' to the most tolerant. We could setup an experiment to test the implied high tolerance of species in this group. The Gammarids would be good candidates!

Species group 2 seems to be a specialist group that prefers Brown Bay inner impacted sites but does have a very minor presence in the other site groups.

Species group 3 (Ophuroids) may as well be part of group 2. The dendrogram confirms this.

Species group 4 is concentrated in Brown Bay outer (site group 2) but with the Gammarids also at the inner site. We know that site group 2 has fair concentrations of heavy metals, but lower than site group 1. Maybe this group prefers the sediment here, which is coarser (and therefore less likely to capture heavy metals than finer grainsize sediments)?

Species group 5 is the opposite of species group 4; it is almost universally restricted to putatively non-impacted O'Brien Bay sites. Almost any species in this group except the Oligochaetes could be handy indicators of the lack of biologically significant levels of heavy metals. Again, some trials would confirm this hypothesis (and remember, hypothesis generation is one of the roles of Pattern Analysis).

Species group 6 seem to be a less abundant version of the species in group 5, and could for most purposes, be lumped with them. Why are they generally less abundant than species in group 5? Are they more abundant in other environments? Are they poorer competitors? Why are species group 6 more abundant in sites OB2S2P11 and OB2S2P23? We can highlight these two rows in the data table (hold the CTRL-key and click the left mouse button on them) and then look at them in the ordination by pressing the "o" key. They don't seem extreme at first glance, although they certainly have low contaminants.

Ordination with MST

9.3 Box and Whisker Plots of Intrinsics and Extrinsics

To see box and whisker plots of the species, we now need to run the evaluations on both the intrinsics and extrinsics. All the PATN evaluation algorithms treat all the variables independently. We don't need to re-run ANOVA as this is based on the association matrix, so just add all the variables for box and whisker plots, PCC and MCAO and then run the evaluations.

Let's look at the box and whisker plots. 'Location' is still right at the top. No surprise. Next comes Ostracod II and I. The plots show up some of the features that are not quite so obvious from the two-way table. These features are there but you need to look harder to see them. Both these Ostracods occur mainly in site groups 1 (putatively most impacted) and 4 (putatively least impacted). Odd. Why? What have these two groups got in common? Note also that site groups 2 and 4 are somewhat adjacent in the ordination, so the latter is responding meaningfully at least in relation to what these species appear to be doing. If this is the case, it is likely that other species also have the same profile. Gammarid III and 'individuals' are similar, but what looks like the common factor is the sediment composition, the home of the benthic species. That makes sense.

You should go through each of the plots to see if there is anything else significant that we may have missed from the two-way table. But for brevity here, I'll move along.

Box and whisker plots Are there any species that have no useful discriminatory profile between the groups? Go to the bottom of the box and whisker plots and scan up. The species down the bottom are rare, occurring with low abundance (often a single occurrence) in a few sites. While these species do seem to discriminate between the groups, their rarity would make them less reliable as indicators. Because they do seem to discriminate, masking them out of the analysis (this is easily done by making such species extrinsic) would gain little.

If we really did have some species with a low KW statistic were genuinely 'noisy', I would probably eliminate them. This would usually reduce the stress in the ordination, and this is particularly important if the stress is high!

Now let's see if there is something new in the PCC/MCAO of the species.

9.4 PCC and MCAO on the Species

PCC statistics and selection table If we click the right mouse button on the ordination plot and select those variables with an r-squared > 0.7 (a pretty impressive variance accounted for!), we note that there are quite a few. This is a nice outcome, and one you may expect if the environmental gradients are relatively short. Why? Look at section 7.6; the high r-square value species are only increasing or decreasing in the ordination (environmental) space, not both.

You select individual species in the standard (Microsoft!) way by holding down the CTRL key and clicking on the species in the list, as in the image to the left.

The number of high r-squared species confirms that we have a pretty good ordination. Why would this be? Remember the species generated the ordination. If we see many intrinsic variables showing a good correlation with the configuration, it suggests that there is strong pattern in the distribution of the objects. Once we click OK, the following pattern emerges-

Ordination with PCC vectors While some of the species appear overwritten, it is just a matter of rotating the plot to identify what is on top of what. For example, 'individuals' (which I selected as useful to see) is almost on top of Gammarid IV. This makes sense as this species is one of the most abundant.

I clicked in the area around group 3 just to be reminded of where the groups were, and to be reminded of the best discriminating 5 variables.

Species 'richness' (and Shannon-Weiner diversity) align with each other and with group 4, as they should. Gammarid I and IIA align with group 2 as they should. Cumacian I, Gammarid VIIA and the Oligochaetes align with groups 3 and 4, as they should. If we drag the mouse between groups 3 and 4, we will see the variables that best discriminate between them on the left of the plot. Other than 'location', this turns out to be Ostracod I and II, the clay fraction and 'individuals' The box and whisker plot tells us that group 4 has significantly more Ostracod I and II, more individuals and less clay than group 3. This is useful information.

We could go a lot further, but let's finish off by looking at the MCAO to see if we were basing any of our observations on a dodgy PCC. The variables that don't appear to have a significant multiple linear regression are - Syllids, Orbiniids, Maldanids, Opheliids, Nemertians, Gastropod IV, Abatus, Ophuroids, Gammarid VII, Isopod IIIb, Ostracod V, Cyclopoid I and Pycnol. As these species were not considered key determinants of any of the conclusions, we can feel safe. These species may still however make a contribution. Some may have a low r-square but also have a strong signal in the ordination. You need to work through them on the ordination plot, two-way table, box and whisker plots and most importantly, the Data Table.

I hope that this example has been useful in demonstrating how PATN can work for you.

Lee Belbin