Finding patterns in data

Socio-Economic Example

The data

Dataset: Medals.xls

This dataset is essentially socio-economic in nature. The data comes from a range of institutions that collect international data such as the United Nations. This dataset was generated and distributed with PATN because most will find it easy to understand the variables and their significance to national differences. This does not necessarily mean that the data is 'simple'. Far from it. There are 18 variables and 191 countries. I selected the 18 variables on the basis of personal interest and their potential to profile nation states. Many more variables could have been chosen-

  1. Medals: The number of Olympic medals received in Sydney in 2000. Gold=3, Silver=2, Bronze=1 (extrinsic)
  2. Population: The total population of the country (extrinsic)
  3. GNP/Cap: The gross national product of the country divided by the population of the country
  4. Cars/1000: The average number of cars/1000 people
  5. Ed/GNP%: The percentage of gross national product that is given to education
  6. Univ%: The percentage of the population that has a tertiary-level educational qualification
  7. Prot Land%: The percentage of land that is under some form of protection
  8. Life Exp: Average life expectancy in years (male and female)
  9. Arable/1000: The amount of arable land (Ha)/1000 people
  10. CO2/Cap: The volume of CO2 produced divided by the population of the country
  11. Mil/GNP%: The percentage of gross national product that is given to the military
  12. Literacy rate: Percentage of the population that can read and write (total of males and females)
  13. Deforestation rate: Percentage of annual clearing of forest land
  14. Population Density: The number of people/square kilometre
  15. Death rate: Number of deaths annually/1,000 persons
  16. People/doctor: The number of people/trained doctor
  17. Electricity/Cap: Electricity consumption/capita (kilowatt-hours)
  18. Coal/Cap: Coal consumption/capita (kg)

The goal

The dataset was collected to pursue my interest in Australia’s unhealthy interest in elite sport. Divide #medals by the population and you will see what I mean. The dataset raises questions about approaches to national identity and 'development'. Primarily, it is a useful basic dataset to profile nations, and learn from the experience of others. Sadly, few national governments appear capable of seeing how others have approached development, and are therefore destined to re-invent the wheel, and often experience the negative outcomes. Pity that the public rather than politicians always seem to be the 'guinea pigs'.

I would encourage you to post any insights into what the dataset may be hinting at to the discussion group on the PATN Web site http://www.patn.com.au/phpBB3. Any ideas of other useful national statistics would also be appreciated.

The analysis

1. Import the data

While the Medals dataset is included with PATN as a PATN-formatted file, it may be helpful to demonstrate a simple import from Excel™. Here is the original data-

The dataset

Note that the dataset in some versions of Excel will display "#N/A" as 'not-applicable' or missing data. This code is automatically detected by PATN on import. Missing data is typical in many datasets and some thought is required about how it is handled.

Run PATN and select the bottom button (Import data from an external file)-

New data options

and point PATN at the Excel file containing the data-

File list

in this case, there are a number of Worksheets in the spreadsheet, so select the worksheet that is to be imported, in this case, the one named 'ALL'-

Spreadsheet tabs

and here is the resulting Data Table in PATN-

Data Table

Note that 191 rows have been imported, that the labels are detected and used by PATN, that missing data is identified by '..', and that PATN has automatically generated some basic statistics (called 'Visible Statistics' in PATN).

2. Examine the data

It is always wise to scan the Visible Statistics of an imported file to see if the process worked well, and to detect issues that will need to be addressed before any serious analysis. In this case I have used the minimum, maximum and the number of missing values as visible stats. To select these, use Tools | Options and select the required summary statistics. You can also set the number of decimals to be displayed (it does not change the data). Given the size of some of the variables, 0 or 1 decimal place is appropriate. You can also change the number of decimal places displayed in the Data Table; just tick the box next to "Set the number of visible decimals on all columns to" and set the number desired.

Data Table statistics

After examining the stats, it appears that there are some countries and variables that have a level of missing data that could make for a less than robust analysis. My first step is to eliminate countries that have 5 or more of the 18 variables missing. You can do this either semi-automatically or manually.

Selection options

To automatically select the countries with too many missing values, go to Data | Selection options and select the Rows in the left column, check "Number of Missing values" in the centre box, Greater than in the right box and then enter the value "4" and click Run. The 15 countries that fit this criterion (listed below with their number of missing values) will then be highlighted in the Data Table. Then right click on any one of the highlighted country names and select Delete. The following countries will then be removed from the Data Table.

  • Andorra (12)
  • Côte d’Ivoire (8)
  • Federated States of Micronesia (10)
  • Grenada (5)
  • Kiribati (6)
  • Liechtenstein (10)
  • Marshall Islands (10)
  • Monaco (14)
  • Nauru (12)
  • Palau (8)
  • San Marino (12)
  • Solomon Islands (6)
  • Tuvalu (14)
  • Vatican City (17)
  • Yugoslavia (8)

You can also select the countries manually if you wish. To do this, hold CTRL and use the left mouse button to select the labels of those countries to be eliminated. Once they are all highlighted, you could either make them extrinsic, or simply delete them. I'll delete these countries (click the right mouse button on any of the country labels and select Delete). We now have 176 countries left, none with more than 2 missing values.
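The row selection described above can be sketched outside PATN as follows. This is only an illustration: the three rows and their values are hypothetical, `None` stands in for a missing value (shown as '..' in the Data Table), and the function names are my own rather than anything PATN provides.

```python
# Hypothetical mini-table: None marks a missing value.
table = {
    "Country A": [1.0, None, 3.0, None, None, None, None, 2.0],
    "Country B": [1.0, 2.0, 3.0, 4.0, 5.0, None, 7.0, 8.0],
    "Country C": [None, None, None, None, None, None, 1.0, 2.0],
}


def missing_count(values):
    """Number of missing entries in one row."""
    return sum(1 for v in values if v is None)


def rows_over_threshold(rows, threshold):
    """Labels of rows with strictly more than `threshold` missing values."""
    return [label for label, values in rows.items()
            if missing_count(values) > threshold]


# Same rule as the Selection options dialog: "Greater than 4".
to_remove = rows_over_threshold(table, 4)
kept = {label: values for label, values in table.items()
        if label not in to_remove}
print(to_remove)
```

The same counting, applied to columns instead of rows, gives the candidates for the variable threshold.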

What about the variables? After examining the stats, I'll set as extrinsic the following variables with greater than 10 missing values. Select the column labels as done with the countries, either automatically or manually, but this time rather than deleting the variables, press the Make extrinsic button. This sets those variables apart from the analysis (called extrinsic) but makes them available for the evaluation if required.

  • Medals (99)
  • Cars/1000 (11)
  • University% (13)
  • Prot Land (12)
  • Mil%GNP (11)

We now have 13 variables in the analysis: the intrinsic variables. Why eliminate countries with 5 or more missing values or variables with more than 10? It is a best guess threshold after a careful examination of the data. Eliminating all countries or variables with missing data is too extreme as PATN handles 'fair' levels of missing data. The other extreme would be to leave the data 'as is'. This would, in some cases, generate estimates of resemblance or relationship between countries or variables that were based on too few values, leading to potential unreliability of the outcomes. Somewhere in between seems fair. A lot of pattern analysis is like this.

Save!

At this point it is wise to SAVE your data as a new PATN file. If saving is done on a regular basis, or when the data or analysis has changed, you can go back to prior steps and also have a good backup. Just do a File | Save As and give the PTN file a name that reflects the contents and the step.

3. Preliminary analysis

I'll now do a quick analysis (which is very easy with PATN) to see what the structure is like and if there are other issues to address. I never do just one analysis as you will see! First, select the analysis button or from the menu Data | Analysis-

Analysis options

Association

Association options

The Association tab is selected and PATN is 'greying-out' association to suggest it probably shouldn't happen. Why? There are more than 100 objects and PATN is suggesting that non-hierarchical classification is preferable. Read what PATN says about this in the box below. But, if we want an ordination, we need the association matrix. So, we check "Generate a lower-symmetric matrix of associations" and select the Gower Metric.

PATN would have selected the Gower Metric as a default on examining the imported data. Why? I won't go into details here, except to say that the variables have widely different ranges and that the Gower metric, a range-standardized Manhattan distance, is appropriate.
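To make 'range-standardized Manhattan distance' concrete, here is a minimal sketch of a Gower-style dissimilarity for continuous variables. The three rows are hypothetical, and the handling of missing values (skip any variable where either value is missing, then average over what is left) is an assumption for illustration, not a statement of how PATN treats them internally.

```python
def column_ranges(rows):
    """Range (max - min) of each variable, ignoring missing values (None)."""
    ranges = []
    for k in range(len(rows[0])):
        present = [row[k] for row in rows if row[k] is not None]
        ranges.append(max(present) - min(present))
    return ranges


def gower(a, b, ranges):
    """Mean of |a_k - b_k| / range_k over variables present in both rows."""
    total, used = 0.0, 0
    for x, y, rng in zip(a, b, ranges):
        if x is None or y is None or rng == 0:
            continue  # skip missing or constant variables
        total += abs(x - y) / rng
        used += 1
    return total / used if used else None


# Hypothetical rows: three variables on very different scales.
rows = [
    [70.0, 100.0, 0.5],
    [50.0, 20000.0, None],
    [80.0, 40100.0, 1.5],
]
ranges = column_ranges(rows)
print(gower(rows[0], rows[1], ranges))
print(gower(rows[0], rows[2], ranges))
```

Dividing by the range puts every variable on a 0 to 1 scale, so a variable measured in tens of thousands cannot swamp one measured in single digits.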

Classification

Classification options

We then select the Classification tab and then select hierarchical classification and the Flexible UPGMA option. In tests, this strategy seems the most robust. In summary, flexible UPGMA with a beta value of -0.1 seems to be able to classify realistic artificial data better than other methods. You can only evaluate such algorithms by the analysis of many datasets where 'truth' is known. This of course is no easy matter when realistic data are complex. This is not the place to dive into too much theory. The beta value controls what is referred to as space-dilation and space-contraction. It is known from artificial and real datasets, that as differences become greater between objects, measures of association tend to underestimate the true difference. Setting beta to -0.1 dilates the space defined by the variables so as to recover a better estimate of true difference. A value of zero would not contract or dilate the space while a positive value (such as 0.1) would contract the space.
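For readers who want to see where beta enters, the following is a small sketch of agglomerative clustering using the Lance-Williams form of flexible UPGMA as I understand it from the clustering literature: the distance from a group h to the merged group (i, j) is (1 - beta) times the size-weighted average of d(h, i) and d(h, j), plus beta times d(i, j). The function name, the toy distance matrix and the tie-breaking are my own assumptions; this is not PATN's code.

```python
def flexible_upgma(dist, beta=-0.1, n_groups=2):
    """Merge the closest pair repeatedly until n_groups remain.

    dist is a square symmetric list of lists. Returns sorted member lists.
    """
    n = len(dist)
    clusters = {i: [i] for i in range(n)}
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}

    def get(a, b):
        return d[(a, b) if a < b else (b, a)]

    next_id = n
    while len(clusters) > n_groups:
        # Closest pair of current groups.
        i, j = min(((a, b) for a in clusters for b in clusters if a < b),
                   key=lambda pair: d[pair])
        ni, nj, dij = len(clusters[i]), len(clusters[j]), d[(i, j)]
        for h in clusters:
            if h in (i, j):
                continue
            # Flexible UPGMA update; beta = 0 gives ordinary UPGMA.
            average = (ni * get(h, i) + nj * get(h, j)) / (ni + nj)
            d[(h, next_id)] = (1 - beta) * average + beta * dij
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical objects on a line, with absolute difference as the distance.
points = [0, 1, 10, 11, 50]
dist = [[abs(a - b) for b in points] for a in points]
three_groups = flexible_upgma(dist, beta=-0.1, n_groups=3)
two_groups = flexible_upgma(dist, beta=-0.1, n_groups=2)
print(three_groups, two_groups)
```

With a negative beta the updated distances come out slightly larger than the plain average, which is the dilation described above.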

Select 6 groups. Hierarchical classification produces everything from 1 group to, in our case, 176 groups (one per remaining country). Selecting 6 here simply tells PATN how many groups you would like to use as the level of summary. Why 6? Seems like a good number to me. It is more a matter of what your brain can handle easily than anything else, such as the number of real groups in the data, if there are any. 6 is a good number of groups of things that can be easily differentiated. We can change this later if we wish, but for the moment, 6. By the way, if we chose 30 here for example, I'll bet (if the data is 'good') that PATN would summarise 30 meaningful groups.

Classification reduces a larger number of objects to a smaller number of groups. 6 is easier to consider than 176.

Ordination

Ordination options

Next click the Ordination tab and select SSH; semi-strong hybrid multidimensional scaling. Quite a mouthful, but an excellent ordination method. I'm biased though. What is ordination and what is SSH? Ordination attempts to reduce the number of variables necessary to describe or summarise the relationships between objects. If we can get down from 18 to 3, we can display the objects in a 3-dimensional plot. SSH is a type of multidimensional scaling (MDS) method, with a twist. MDS generally asks the user to select the number of dimensions. The answer is most often 3 or 2. MDS then randomly distributes the objects in the selected space and then measures the distances between each pair of objects. It then compares these distances with an association matrix. MDS uses either metric (linear) or non-metric (non-linear) regression for this comparison. The 'fit' will be bad, but MDS knows which way to move the objects to improve the fit. MDS does this and re-measures the distances, and repeats the comparison. When any movement of objects fails to improve the fit, it stops. SSH uses a combination of metric and non-metric regressions for the fit. Why? Remember above about the under-estimation of associations between distant objects? SSH (like the beta -0.1 value) attempts to circumvent this by applying linear regression where distances are assumed accurate, and non-linear where they are at best ok in terms of rank. Simple and effective.

Stick with the defaults unless you know more than I do.

The analysis is now ready to go. All of it. Once you click OK in the analysis box, PATN displays what it calls the recipe. This is what it will do once you press OK. All analysis steps are saved and can be exported as required. Again, handy for future reference. At this point, read the recipe and see if it is what you want. If it is not, it is no big deal. An analysis in PATN is very quick so it can always be re-done quickly.

Analysis recipe

The first thing that you will notice (depending on the speed of your computer) is a series of progress bars for each phase of the analysis. The first window displayed at the end is the stress value from the ordination, in this case, it is 0.0945.

SSH Stress

This is the standardized difference between the association matrix and the ordination distances. Obviously, a stress value of zero would be neat, but don't hold your breath for that rarity. A value of one would be terrible. It would mean that PATN could not squeeze the original 18 variables to anything less. Again, thankfully a rarity. A useful rule of thumb suggests-

  • >0.3 = try again!
  • 0.2-0.3 = not great
  • 0.15-0.2 = lower would be better
  • 0.1-0.15 = possibly ok
  • 0.05-0.1 = not bad
  • <0.05 = so good something is probably wrong

So, 0.0945 is not too bad for complex data, but we may be able to improve on this.
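As a rough illustration of what a stress value measures, here is a sketch of a Kruskal-style stress (formula 1) for a given configuration. It is a simplification and an assumption on my part: it compares ordination distances directly with the association values, whereas SSH compares them with values fitted by its metric and non-metric regressions, so the numbers will not match PATN's. The coordinates are hypothetical.

```python
import math
from itertools import combinations


def stress(coords, assoc):
    """sqrt( sum (d - a)^2 / sum d^2 ) over all pairs of objects.

    d is the distance in the ordination, a is the association value.
    """
    num = den = 0.0
    for i, j in combinations(range(len(coords)), 2):
        d = math.dist(coords[i], coords[j])
        num += (d - assoc[i][j]) ** 2
        den += d ** 2
    return math.sqrt(num / den)


# Hypothetical associations that a 3-4-5 triangle reproduces exactly.
assoc = [[0, 3, 5],
         [3, 0, 4],
         [5, 4, 0]]
perfect = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)]
distorted = [(0.0, 0.0), (3.0, 0.0), (3.0, 3.0)]
print(stress(perfect, assoc), stress(distorted, assoc))
```

The first configuration gives zero stress; moving one object away from its 'correct' position pushes the value up.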

Evaluation

We have a result. That is the easy part. Now we have to analyse what PATN is suggesting. That's the hard part. While we could have selected 'All evaluations' on the first analysis dialog box, this would have only evaluated the extrinsic variables. I'd prefer the lot. So, either click the evaluation button on the Menu Bar at the top of the window or select Data | Evaluation and select the 'Box Whisker' tab and press the 'Add All >>' button. This will run all variables (extrinsic and intrinsic) across the 6 groups seeking to find how they are distributed using box and whisker plots and associated Kruskal-Wallis (KW) statistical values. More on KW later.

Evaluation options - Box and Whiskers
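For those curious about the Kruskal-Wallis value itself, here is a minimal sketch of the standard H statistic: pool all values, rank them, and compare the rank sums of the groups. Tied values get their average rank, but the usual tie correction is left out for brevity, and whether PATN applies one is not something I can state. The groups below are hypothetical.

```python
def kruskal_wallis(groups):
    """H = 12 / (N (N + 1)) * sum(R_g^2 / n_g) - 3 (N + 1)."""
    pooled = sorted((value, g) for g, values in enumerate(groups)
                    for value in values)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of the ranks i+1 .. j
        for _, g in pooled[i:j]:
            rank_sums[g] += average_rank
        i = j
    total = sum(r * r / len(values) for r, values in zip(rank_sums, groups))
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


# Hypothetical variable measured in three groups.
separated = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # groups do not overlap
mixed = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]       # groups fully interleaved
h_separated = kruskal_wallis(separated)
h_mixed = kruskal_wallis(mixed)
print(h_separated, h_mixed)
```

The separated groups give a much larger value than the interleaved ones, which is why a higher KW value indicates a variable that discriminates better between groups.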

Next, select the PCC / MCAO tab and also 'Add All' variables for PCC. Don't bother with MCAO. What is PCC? It is an abbreviation for Principal Coordinate Correlation. It's a hangover term from the old days of DOS PATN where I correlated a set of variables with the axes of another ordination technique called principal coordinate analysis. Basically, what PCC does is to use multiple linear regression to fit a set of variables into an ordination space. In our case, PCC will take each of the 18 variables and a) give us the best fit direction and b) the correlation. This will show us diagrammatically, how the variables can help define the directions in the ordination we have produced. You will see this further on.

Evaluation options PCC and MCAO

While it is only the intrinsic variables that generated the 3-dimensional ordination, there is no harm in seeing what this means to all variables. It won't change the ordination, it is just an evaluation. Next, as in the analysis, PATN displays the evaluation recipe for you to check-

Evaluation recipe

We press OK and the evaluation is done in a fraction of a second. We can now look at the analysis with the help of the evaluations. Note that the PATN menu bar is useful for this phase. The buttons that are greyed out refer to steps that have not been run as yet (in this case a two-way table and MCAO). The buttons (from left to right) are (display) association matrix, dendrogram, two-way table, ordination, ANOSIM, box and whisker plots, PCC, MCAO and the analysis recipe.

Evaluation buttons

First up, click the box and whisker button and the following window will be displayed (only the first two variables are shown here)-

Box and whisker plots

This is a very useful set of graphs to examine how the variables are distributed across the 6 groups. By default, the graphs are sorted by decreasing Kruskal-Wallis value. The higher the KW value, the better the variable is at discriminating between the 6 groups. In this case, Life Expectancy is best (KW=141.4) and Electricity/capita is next best (KW=128.3). Here is the complete table of KW values (easily generated from PATN's Export | Evaluation Data menu). In the table I have added an "x" (extrinsic at time of analysis) and "->x" (make extrinsic for re-analysis).

Variable             KW Value   Extrinsic
Life Exp             141.4
Electricity/Capita   128.3
CO2/cap              115.5
GNP/Cap              114.9
Cars/1000            113.4      x
Univ%                109.2      x
People/Doctor        107.7
Literacy             89.5
Deaths/1000          87.8
Deforestation Rate   52.8
Coal/Capita          37.2       ->x
Ed /GNP%             26.9       ->x
Population Density   18.7       ->x
Prot Land%           10.7       x
Arable land/1K       10.0       ->x
Pop                  8.6        ->x
Medals               5.6        x
Mil%GNP              5.6        x

From this table we can see that two extrinsic variables (Cars/1000 and University%) show good discrimination between groups even though they were extrinsic. This means that these two variables align with the classification based on intrinsic variables.

It appears that variables below deforestation rate are nowhere near as good as those above. Look at the plots. The left bar is the minimum, the left edge of the box is the 1st quartile (25% of values are below this line), the vertical line in the box is the median (50% below), the circle is the mean, the right box edge is the 3rd quartile (25% of the values are above this line) and the right bar is the maximum. Deforestation rate does show fairly good discrimination across the groups while Coal/capita doesn't.
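The six numbers behind each box and whisker can be computed along these lines. This is a sketch with hypothetical values; quartile conventions differ between packages, and the 'inclusive' method used here is simply one common choice, not necessarily the one PATN uses.

```python
import statistics


def box_summary(values):
    """Minimum, quartiles, mean and maximum of one group, skipping None."""
    data = sorted(v for v in values if v is not None)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(data),
        "q3": q3,
        "max": data[-1],
    }


# Hypothetical values of one variable within one group (None = missing).
summary = box_summary([1, 2, 3, 4, 5, None])
print(summary)
```

Two groups discriminate well on a variable when their boxes (q1 to q3) do not overlap.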

Remember that variables can be considered as comprising two parts, signal (what we are seeking) and noise (what we would like to remove). Some variables, at least as far as this dataset goes, appear to have a high-level of noise. If we eliminate them from the analysis, we may be able to increase the signal. We could use the same argument to suggest that Cars/1000 and Univ% could be made intrinsic. Probably, but for this demo, I'll leave them out (as extrinsics) and see what they do next. So, set Coal/cap, Ed/GNP%, Population Density, Arable land/1000 and population as extrinsic and re-do the analysis and see what happens. I did take a quick look at the ordination plot and it did suggest we had a rational pattern, but to save space, on with the re-analysis using the same parameters as before.

4. Analysis 2

The stress has now dropped from 0.0945 to 0.0639. That's a good sign, but remember we now have 8 variables, not 13, and it will be easier to squeeze 8 into 3 than it is 13 into 3.

Ordination stress

When the analysis is done, re-do the evaluation exactly the same as before, including all 18 variables. Examine the box and whisker plots and associated KW values carefully, but this time, I'll just display the PCC values to demonstrate how they work. For this, press the PCC button and note that PATN highlights the PCC tab in the Data Table. This tab has 4 values - x, y, z and r-squared. The x, y and z are the coordinates in the ordination space of the 'head' of the vector representing the variable. For example, Life Expectancy is -0.3, 0.9, -0.3 and 0.3 (rounded to 1 decimal place). This means that a vector in the ordination space will have the tip at these coordinates and the r-squared value suggests this variable is poorly correlated with the distribution. Deaths/1000 however has an r-squared value of 0.9 suggesting that ~90% of variation in those values is explained by the ordination. The table of PCC values can be viewed from the right-mouse button menu on the ordination plot, or by exporting the PCC values from the File | Export Evaluation Data menu.

Data Table

View the main PATN display, the ordination plot (click on the graph icon on the icon bar). This is the display where you will spend most of your time analysing the results. For the display below, I have rotated the plot manually (SHIFT + left mouse button) to a position where it is easy to get an idea of the overall structure. There is nothing like rotating the objects (the countries) every which way manually to get a good idea of the variation across countries. The affluent Group 5 countries are the orange balls to the top-left of the plot. If you left-click on the plot but not on a country, you will see displayed on the left side of the plot, the best overall KW values across the 6 groups. If you press 'G', PATN will display the group, rather than the individual object colours. This makes it easier to see the groups generated by the hierarchical classification. If you then click near the centre of Group 5 and drag the mouse until PATN identifies Group 1, the 5 variables that best discriminate between these two groups will be identified. Nice?

Obviously there is an affluence trend from the SE (group 1, the poorest) to the NW (group 5, the richest). As 'poor' and 'rich' are not variables, this hints at the type of evaluation that needs to be made.

Ordination 3d plot

Outliers

Clicking on countries in the ordination plot will help you identify the various components of the structure. It is efficient to identify the outliers, like Yemen in the south of the plot. As can be seen from the group colour, Yemen is a single member group. Note: When you click on an object (a country here) in the ordination plot, the object is also highlighted in the data table, and vice versa.

Outlier identification

Outliers have a strong influence on the overall ordination structure. Ordination is the same as regression here as SSH uses regressions. There is therefore a strong argument to eliminate serious outliers as they will influence the ordination well beyond their single-object status. Once outliers are identified, it is wise to figure out why they are outliers and then eliminate them from subsequent analyses.

So, I'll make Yemen extrinsic and go again on the analysis. Highlight Yemen in the Data Table by clicking on its label (left mouse button) and then either right-click on the label and select Make Extrinsic or click on the make extrinsic button on the PATN toolbar. This will place Yemen out of the analysis but available for any interpretation you may wish to perform.

5. Analysis 3

No change in parameters for this re-analysis. This time, the SSH stress is down to 0.0624. We should be more than happy about this given the number of objects and the complexity of the dataset.

Ordination stress

Now we can get to work on the evaluation proper. Select the Evaluation button and run all evaluation options (ANOSIM, Box and Whisker, PCC and MCAO) on all variables; intrinsics and extrinsics.

Box and Whisker plots and Kruskal-Wallis values

First, let's look at how all the variables (intrinsic and extrinsic) are distributed across the new 6 groups (extrinsic variables are marked with an "x")-

Variable             KW Value   Extrinsic?
Life Exp             135.0
Electricity/Capita   129.4
Literacy             122.8
CO2/cap              119.4
People/Doctor        115.4
Univ%                110.8      x
GNP/Cap              110.7
Cars/1000            107.2      x
Deaths/1000          97.2
Coal/Capita          57.3       x
Deforestation Rate   35.5
Ed /GNP%             24.8       x
Mil%GNP              23.7       x
Prot Land%           17.0       x
Medals               13.6       x
Population Density   10.9       x
Arable land/1K       10.6       x
Pop                  6.9        x

Things to note:

  1. Most of the intrinsic variables are effective discriminators across the 6 groups, except maybe 'Deforestation rate'. This is strongly backed up by the box and whisker plots. You can judge the effectiveness of the box and whisker plots by their ability to create a decision tree that could be used to discriminate the objects between groups. In this case, it is easy to build up such a decision tree starting from the best discriminating variable first, then second best, down to (if necessary) 'deaths /1000'.
  2. 'Deforestation rate' (an intrinsic) isn't a useful group discriminator. We could cull it as an extrinsic and go back and re-analyse, but that is probably not necessary as the stress is so good, and the classification also makes a lot of sense.
  3. Some of the extrinsic variables show good discrimination. This implies that although some extrinsic variables had an uncomfortable number of missing values, the values that are there do seem to display a signal that aligns well with the classification based on intrinsic variables. I'd include cars/1000 and Univ% in this category. It means that we can use these two variables to help interpret the groups and to correlate with the intrinsic variables.

ANOSIM

ANOSIM will tell us how effective (different) the 6 groups are on the basis of the 'within-group' and 'between-group' values of the Gower metric. ANOSIM is similar to an F-test using association values rather than the original variables. Click the 'A-button' on the PATN toolbar and you will see this window pop-up-

Row ANOSIM window

This tells us that none of the 100 randomised solutions (swapping objects between the 6 groups) is better than the grouping that PATN generated. With a standard analysis in PATN, this is far from surprising. A value > 5% would suggest a poor classification which may indicate a lot of noise or poor variables or poor sampling.
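The randomisation idea can be sketched as follows. Note the assumptions: the statistic here is simply the mean between-group association minus the mean within-group association, which is a simplified stand-in and not necessarily the statistic PATN's ANOSIM reports; the data, group labels and function names are hypothetical.

```python
import random
from itertools import combinations


def separation(dist, labels):
    """Mean between-group distance minus mean within-group distance."""
    within, between = [], []
    for i, j in combinations(range(len(labels)), 2):
        (within if labels[i] == labels[j] else between).append(dist[i][j])
    return sum(between) / len(between) - sum(within) / len(within)


def permutation_test(dist, labels, n_perm=100, seed=0):
    """Percentage of random re-groupings that separate at least as well."""
    rng = random.Random(seed)
    observed = separation(dist, labels)
    shuffled = list(labels)
    as_good = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # swap objects between groups at random
        if separation(dist, shuffled) >= observed:
            as_good += 1
    return observed, 100.0 * as_good / n_perm


# Hypothetical objects forming three obvious groups on a line.
points = [0, 1, 2, 3, 4, 20, 21, 22, 23, 24, 40, 41, 42, 43, 44]
labels = [0] * 5 + [1] * 5 + [2] * 5
dist = [[abs(a - b) for b in points] for a in points]
observed, percent_as_good = permutation_test(dist, labels)
print(observed, percent_as_good)
```

For groups this distinct, essentially no random re-grouping should do as well, which mirrors the result in the window above.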

Note: As we have not yet run an analysis on the variables, ANOSIM on variables is not currently available.

PCC

PATN's PCC routine uses multiple linear regression to fit each selected variable independently into the ordination space (1, 2 or 3-dimensions). The result is a set of coordinates that represents the tip of the vector of the variable. SSH centres the coordinates of the objects so this vector represents the best fit direction of the variable. An r-squared value provides some estimate of how good the fit was. For example, an r-squared value of 0.7 means that 70% of the variation of that variable is accounted for by the vector.
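A minimal sketch of this kind of fit, assuming ordinary least squares on centred coordinates with no missing values. The small linear solver, the function names and the data are my own for illustration; the direction is scaled to unit length, which appears consistent with the x, y, z values PATN reports, but that is an inference rather than something documented here.

```python
import math


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def pcc(coords, values):
    """Best-fit unit direction of a variable in the ordination, plus r-squared."""
    n, dims = len(values), len(coords[0])
    mean_y = sum(values) / n
    means = [sum(c[k] for c in coords) / n for k in range(dims)]
    x = [[c[k] - means[k] for k in range(dims)] for c in coords]
    y = [v - mean_y for v in values]
    # Normal equations (X'X) b = X'y of the multiple linear regression.
    xtx = [[sum(r[a] * r[b] for r in x) for b in range(dims)]
           for a in range(dims)]
    xty = [sum(r[a] * yi for r, yi in zip(x, y)) for a in range(dims)]
    coef = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, r)) for r in x]
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum(yi ** 2 for yi in y)
    length = math.sqrt(sum(c * c for c in coef))
    return [c / length for c in coef], 1 - ss_res / ss_tot


# Hypothetical ordination coordinates and a variable that rises along axis 1.
coords = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)]
values = [2 * c[0] + 5 for c in coords]
direction, r_squared = pcc(coords, values)
print(direction, r_squared)
```

Here the variable depends only on the first axis, so the fitted vector points along it and the r-squared is 1.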

Pressing the PCC button on the PATN toolbar will highlight the PCC TAB in the Data Table.

PCC values in the Data Table

Alternatively, you could export the PCC values from the File Menu. If we do that and then sort on the r-squared values we get the following table. I've added an extra column to designate the extrinsic variables (marked with an "x")

Use Case - Medals 2 ...

How does the 'utility' of the variables as measured by the r-squared value differ from those highlighted by the Kruskal-Wallis value and why? First, why? Remember that the KW values are based on the 6 groups while the r-squared value is based on the coordinates of the countries in the ordination plot. KW is 'clumped' and r-squared isn't.

I have tabulated both the KW and PCC values below and included a difference in the ranks as the last column. For example, the difference in the rank of Deaths/1000 for KW and r-squared is 7; different.

Variable             X      Y      Z      r-squared  KW      r-Rank  KW-rank  Rank Diff  Extrinsic
Life Exp             0.98   -0.13  -0.17  0.89       135.04  1       1        0
Electricity/Cap      0.16   -0.86  -0.48  0.55       129.43  8       2        6
Literacy             -0.21  -0.12  -0.97  0.76       122.68  3       3        0
CO2/Cap              0.22   -0.94  0.26   0.44       119.38  9       4        5
People/Doctor        -0.51  -0.15  0.85   0.65       115.38  4       5        1
Univ%                -0.05  -0.50  -0.86  0.61       110.81  6       6        0          X
GNP/Cap              0.09   -0.76  -0.65  0.59       110.70  7       7        0
Cars /1000           -0.07  -0.70  -0.72  0.64       107.18  5       8        3          X
Deaths/1000          -0.96  -0.18  -0.23  0.83       97.27   2       9        7
Coal/Cap             -0.44  -0.69  -0.57  0.18       57.31   11      10       1          X
Deforestation Rate   0.11   0.99   -0.05  0.29       35.49   10      11       1
Ed /GNP%             -0.31  0.14   -0.94  0.14       24.79   12      12       0          X
Mil%GNP              0.25   -0.47  0.85   0.07       23.67   15      13       2          X
Prot Land%           0.53   0.00   -0.85  0.02       16.96   17      14       3          X
Medals               -0.25  0.25   -0.94  0.09       13.60   13      15       2          X
Population Density   0.75   0.64   -0.18  0.04       10.88   16      16       0          X
Arable land/1000     -0.70  -0.47  -0.54  0.07       10.65   14      17       3          X
Pop                  0.62   -0.31  0.72   0.01       6.86    18      18       0          X
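The rank columns in the table above can be reproduced with a few lines. The sketch below uses only the first nine rows (values copied from the table), because the rounded r-squared values lower down contain ties that a simple ranking would break arbitrarily.

```python
# (variable, r-squared, KW) copied from the first nine rows of the table.
rows = [
    ("Life Exp", 0.89, 135.04),
    ("Electricity/Cap", 0.55, 129.43),
    ("Literacy", 0.76, 122.68),
    ("CO2/Cap", 0.44, 119.38),
    ("People/Doctor", 0.65, 115.38),
    ("Univ%", 0.61, 110.81),
    ("GNP/Cap", 0.59, 110.70),
    ("Cars /1000", 0.64, 107.18),
    ("Deaths/1000", 0.83, 97.27),
]


def ranks(values):
    """Rank 1 = largest value; assumes there are no ties."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]


r_rank = ranks([r for _, r, _ in rows])
kw_rank = ranks([kw for _, _, kw in rows])
rank_diff = {name: abs(a - b)
             for (name, _, _), a, b in zip(rows, r_rank, kw_rank)}
for name, diff in rank_diff.items():
    print(f"{name:16s} {diff}")
```

Deaths/1000 stands out with a difference of 7: it is second best by r-squared but only ninth by KW.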

What do these differences in ranking imply? It could imply problems in the ordination. There may be a few problems (which we will discuss later) but overall, this is unlikely. Looking at the Box and Whisker plot for Literacy, you can see why it has a high KW value; the only overlap between 1st and 3rd quartiles is between groups 2 and 6.

Box and whisker plot

The order in the B&W plots is group 1(lowest)-3-2-6-4-5 (highest). You can order the group centroids in the ordination plot so they are ordered group 1 (lower left)-2-3-4-6-5 (right). The only difference in order is a swap between groups 5 and 6. One would think a vector for Literacy could therefore go from lower left to upper right, but this isn't the case.

Group ordination (group centroids) 3d plot

Instead, it goes at an angle of about 45 degrees from the main distribution but has a high r-squared value (0.76). GNP/capita aligns perfectly with the main trend (r-squared 0.68), clearly indicating why Norway is on one end and Niger is on the other!

Ordination and attribute vectors

Looking at the overall distribution of countries in the ordination (above), it makes a lot of sense. The 'poorest' countries (light green) are at one end while the 'richest' countries (orange and red) are up the other end, and those in between from an economic perspective, are in between in the plot.

The red group (group 6) is interesting. They seem to be the oil states. Some are close to the rich group 5 and some are closer to group 4. Note that in distribution on the SSH plot, the countries are rather scattered. Does this suggest that they have been more difficult for the ordination than the classification?

Group 6

  • Bahrain
  • Brunei
  • Kuwait
  • Libya
  • Oman
  • Qatar
  • Saudi Arabia
  • United Arab Emirates

Looking at the ordination above, 'GNP/Cap' and 'Electricity/Cap' are co-linear, suggesting that these variables are highly correlated. This doesn't seem surprising. These two variables align with Group 5 (see below), the most affluent countries. The opposite direction, as expected, identifies the 'poorest' countries (Group 1). 'Deaths per 1000' and 'People per doctor' are almost co-linear but opposite, suggesting that as one goes up, the other goes down. This makes sense. We could run a classification of variables to quantify this relationship (all variables should be normalised before doing this - and don't forget to save the normalised dataset to another PTN file).

The orientation of 'Deaths per 1000' aligns as expected with Group 2, with Lesotho at an extreme. 'Life expectancy' should align roughly with group 5 (positive relationship - see the box and whisker plot), but is also oriented to align with the negatively related Group 2.

What about our Medals variable? With a low r-squared (0.09) and KW (13.6), it is hardly inspiring. It is almost co-linear with Literacy, which is interesting, but it doesn't align with the main trend. I might have expected Group 5 to maximise it, but this is not the case, at least in the ordination. Group 5 does have the highest median score though. There is a lot of variation in the values of Medals in Group 5 and this suggests affluence isn't the only factor. There are also a lot of missing values, which won't help. My conclusion is that there are probably other factors driving it.

The Minimum Spanning Tree

Qatar is an outlier. It was also on the previous ordination, but I decided to leave it in and see what happens. Looking at the Minimum Spanning Tree imposed on the ordination below, I should have eliminated it (use the right mouse button in the ordination plot window and select Display MST, or simply press "M" on the keyboard)-

Ordination plot with MST

Qatar is connected by the MST to another group 6 member, the United Arab Emirates. Makes eminent sense. Kuwait also looks problematic. This shows us that MST is a powerful tool to identify countries that SSH had problems placing. The classification probably got it right, but SSH may need 4 dimensions to place Qatar and Kuwait 'correctly'. Note all other group 6 members are 'well-connected' by the MST, as are most other groups.
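For reference, a minimum spanning tree can be built from an association matrix with Prim's algorithm, sketched below. The distances are hypothetical and the function is my own illustration, not PATN's routine. The tree links every object to the rest using the smallest total association, so a link that looks long or awkward when drawn on the ordination suggests an object the ordination has struggled to place.

```python
def minimum_spanning_tree(dist):
    """Prim's algorithm: grow the tree by the shortest edge leaving it."""
    n = len(dist)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        i, j = min(((a, b) for a in in_tree
                    for b in range(n) if b not in in_tree),
                   key=lambda edge: dist[edge[0]][edge[1]])
        edges.append((i, j))
        in_tree.add(j)
    return edges


# Hypothetical objects on a line; object 3 is a long way from the others.
points = [0, 1, 3, 10]
dist = [[abs(a - b) for b in points] for a in points]
edges = minimum_spanning_tree(dist)
total_length = sum(dist[i][j] for i, j in edges)
print(edges, total_length)
```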

Group Summaries

Group 1: The 'poorest' countries
  • Mean (closest country to the centroid) - Chad, Extreme - Niger, Marginals - Bhutan (group 3), Mali (group 3)
  • lowest electricity usage/capita
  • lowest CO2/capita
  • lowest GNP/capita
  • lowest literacy
  • lowest university %
  • lowest cars/1000
  • 2nd lowest life expectancy
  • 2nd highest deaths/1000
  • highest number of people/doctor
  • highest deforestation
Group 2: 'Poor' countries with higher literacy
  • Mean - Swaziland, Extreme - Botswana, Marginals - Tanzania (group 3), Rwanda (group 1)
  • lowest life expectancy
  • low CO2/capita
  • low electricity usage/capita
  • 2nd lowest university %
  • 2nd lowest GNP/capita
  • 2nd highest people/doctor
  • high literacy
  • high deforestation
  • highest deaths/1000
Group 3: Intermediate between poorest and 'average' countries
  • Mean - Papua New Guinea, Extreme - Cape Verde, Marginals - Togo (group 1), Republic of Congo (group 2), Seychelles (group 4)
  • 2nd lowest CO2/capita
  • low electricity usage/capita
  • intermediate literacy
  • low university %
  • low GNP/capita
  • low cars/1000
  • intermediate life expectancy
  • intermediate deaths/1000
Group 4: 'Healthy and literate' countries
  • Mean - Suriname, Extreme - none, Marginals - Italy (group 5), Nicaragua (group 3), Jordan (group 6)
  • low CO2/capita
  • low electricity usage/capita
  • low GNP/capita
  • low cars/1000
  • 2nd lowest deaths/1000
  • 2nd lowest people/doctor
  • intermediate university %
  • high life expectancy
  • 2nd highest literacy
Group 5: 'Affluent' countries
  • Mean - Finland, Extreme - Norway, Marginals - Israel (group 4)
  • lowest people/doctor
  • intermediate deaths/1000
  • 2nd highest CO2/capita
  • highest life expectancy
  • highest electricity/capita
  • highest coal/capita
  • highest literacy
  • highest university %
  • highest GNP/capita
  • highest cars/1000
Group 6: Oil states
  • Mean - Bahrain, Extreme - Qatar, Marginals - Saudi Arabia (group 4), Oman (group 3)
  • lowest coal/capita
  • low people/doctor
  • 2nd highest life expectancy
  • 2nd highest electricity usage/capita
  • 2nd highest CO2/capita
  • 2nd highest GNP/capita
  • 2nd highest cars/1000

Number of groups

If we were to reduce the number of groups, how would our 6 groups merge? To see what happens, display the dendrogram and use the right mouse button to display it at a group level like so-

Row - country dendrogram

PATN labels each group by its first country (in terms of sequence number in the Data Table, from top to bottom). Why? PATN always generates the dendrogram in a way that has group 1 at the top, group 2 next down, then group 3 ... group k. Therefore you always know that the kth group is k groups from the top. For example, Australia is the 'label' for group 5 because the Data Table is in alphabetic order, which makes Australia the first country in group 5. The label simply helps you to quickly identify what that group may be. That's my logic anyway.

If we want to simplify the groups (reduce to 5, 4 or 3 groups) we would in order-

  1. Join groups 1 & 2 (giving 5 groups)
  2. Join groups 5 & 6 (giving 4 groups)
  3. Join group 3 with 1 & 2 (giving 3 groups)
  4. Join group 4 with 5 & 6 (giving 2 groups)
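The merge sequence above can be replayed with a short Python sketch that shows the composition of the groups at any level. The merge list is taken from the steps above; the function name and data layout are hypothetical and have nothing to do with how PATN stores its dendrogram.

```python
# Merge order read off the group-level dendrogram (see the numbered steps above).
# Each pair names one member from each of the two clusters being joined.
MERGES = [(1, 2), (5, 6), (3, 1), (4, 5)]

def groups_at(k, n_groups=6, merges=MERGES):
    """Return the group composition when the 6-group dendrogram is cut at k groups."""
    clusters = {g: {g} for g in range(1, n_groups + 1)}
    for a, b in merges[: n_groups - k]:
        # Locate the clusters that currently hold groups a and b, then join them
        key_a = next(key for key, members in clusters.items() if a in members)
        key_b = next(key for key, members in clusters.items() if b in members)
        absorbed = clusters.pop(key_b)
        clusters[key_a].update(absorbed)
    return sorted(sorted(members) for members in clusters.values())

for k in (5, 4, 3, 2):
    print(k, groups_at(k))
```

At 3 groups, for instance, this gives groups 1, 2 and 3 together, group 4 on its own, and groups 5 and 6 together.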

More groups could be defined by re-running the classification and asking for whatever number of groups you want. Re-analysis is so easy in PATN that there is little need for a dendrogram slicing routine such as GDEF, which re-defined the number of groups in older versions of PATN.

6. Analysis of variables

We could analyse any or all of the original 18 variables. For brevity, I'll analyse the 8 intrinsic variables we have at this point. More of the 18 variables could be used with the usual caveats. The level of missing data is less of an issue for the variables as there are up to 176 values (countries) per variable for the current dataset. The only variable you may think twice about is 'Medals' with 99 missing values out of the 176.

Standardisation

In PATN, the variables could be analysed in parallel with the countries, but the values of the variables are so disparate, some form of standardisation is required first. The countries were analysed with the Gower Metric, which has an in-built range standardisation. This approach enabled each variable to contribute equally to country differences and also enabled the evaluations to display the raw data values.

What option should be used? Take the simplest option: range standardisation.

First, save the current project by pressing CTRL S, then do a Save As either from the File menu or using ALT F A and select a different project name. I'm up to 'Medals 5' at this stage as I have saved all of the intermediate datasets, just in case I want to quickly return to an intermediate step in the analyses.

Next, press the transformation/standardisation button and select range standardisation on all of the 8 intrinsic variables (columns)-

Transformation options window

Press 'Run' and you will end up with all the intrinsic variables ranging from zero to one inclusive. This will enable us to compare the variables because they are now on the same scale. Take a look at the Visible Stats and you will see that this is so.
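As a rough illustration of what range standardisation does to a single column, here is a minimal Python sketch. The function name and the sample values are made up, and leaving missing values untouched is an assumption about how such gaps would be handled, not a description of PATN's internals.

```python
def range_standardise(column):
    """Rescale a column to run from 0 to 1; None (missing) entries stay None."""
    present = [v for v in column if v is not None]
    lo, hi = min(present), max(present)
    if hi == lo:
        # A constant column has no range; map every present value to 0
        return [None if v is None else 0.0 for v in column]
    return [None if v is None else (v - lo) / (hi - lo) for v in column]

# Made-up values with one missing entry
print(range_standardise([10.0, None, 20.0, 30.0]))
```

The smallest value in the column becomes 0 and the largest becomes 1, which is why every standardised variable ends up on the same scale.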

Analysis

We may as well run both the countries (use the same parameters as before) and the variables analysis. This will enable us to swap back and forth if needed. The standardisation will have no effect on the analysis of countries, as the previous use of the Gower Metric effectively range standardised the data anyway. So why didn't we do this from the start? Simple, I prefer to see the raw values on the evaluation, not the standardised values.
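The claim that range standardisation leaves the country analysis unchanged can be checked with a small Python sketch. It assumes the Gower Metric between two objects is the mean over variables of the absolute difference divided by that variable's range. The values are made up and the function names are hypothetical.

```python
def column_ranges(rows):
    """Range (max - min) of each column."""
    return [max(col) - min(col) for col in zip(*rows)]

def gower(a, b, ranges):
    """Mean over variables of |difference| / range (assumed form of the metric)."""
    return sum(abs(x - y) / r for x, y, r in zip(a, b, ranges)) / len(ranges)

def range_standardise(rows):
    """Rescale every column to run from 0 to 1."""
    cols = list(zip(*rows))
    scaled = [[(v - min(c)) / (max(c) - min(c)) for v in c] for c in cols]
    return [list(r) for r in zip(*scaled)]

# Made-up data: 3 objects (rows) by 2 variables (columns) on very different scales
raw = [[100.0, 2.0], [300.0, 4.0], [500.0, 3.0]]
std = range_standardise(raw)

d_raw = gower(raw[0], raw[1], column_ranges(raw))
d_std = gower(std[0], std[1], column_ranges(std))
print(d_raw, d_std)
```

Both distances come out the same because the metric already divides each difference by the variable's range; standardising first just makes every range equal to 1.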

Select Gower Metric for the variables, flexible UPGMA with beta=-0.1 and choose 3 groups; more than enough for 8 variables. Run this. With only 8 'objects' (the variables), the main interest this time is really the association matrix (press the display association matrix button)-

                    GNP/Cap  Life Exp  CO2/Cap  Literacy  DeforRate  Deaths/1K  People/doctor
Life Exp               0.62
CO2/Cap                0.10      0.71
Literacy               0.76      0.22     0.84
Deforestation Rate     0.57      0.34     0.63      0.38
Deaths/1000            0.34      0.61     0.32      0.64       0.35
People/doctor          0.17      0.73     0.09      0.86       0.63       0.31
Electricity/Cap        0.07      0.65     0.06      0.78       0.59       0.32           0.14

and the dendrogram-

Column (variables) dendrogram

The three 'per capita' variables are closely related as expected. These three variables are also closely related to 'People per doctor'. In other words, affluence (high GNP, CO2 and Electricity use) equates roughly with a low number of people per doctor. Interestingly 'Life expectancy' and 'Literacy' are related, but not quite as closely. From the association matrix, we see a value of 0.22 which is a fair relationship (0=identical and 1=no similarity). Deforestation rate is related to Deaths per 1000 but again, not highly (0.35).
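To make reading the lower-triangular association matrix less error-prone, the values can be put into a small Python lookup. This is only a convenience sketch: the numbers are copied from the matrix above, while the function names and data layout are hypothetical.

```python
LABELS = [
    "GNP/Cap", "Life Exp", "CO2/Cap", "Literacy",
    "Deforestation Rate", "Deaths/1000", "People/doctor", "Electricity/Cap",
]

# Row i holds the associations between variable i and variables 0..i-1
LOWER = [
    [],
    [0.62],
    [0.10, 0.71],
    [0.76, 0.22, 0.84],
    [0.57, 0.34, 0.63, 0.38],
    [0.34, 0.61, 0.32, 0.64, 0.35],
    [0.17, 0.73, 0.09, 0.86, 0.63, 0.31],
    [0.07, 0.65, 0.06, 0.78, 0.59, 0.32, 0.14],
]

def association(a, b):
    """Look up the association between two variables (0 = identical)."""
    i, j = LABELS.index(a), LABELS.index(b)
    if i == j:
        return 0.0
    if i < j:
        i, j = j, i
    return LOWER[i][j]

def closest_pairs(n=3):
    """Return the n most closely related pairs of variables."""
    pairs = [
        (LOWER[i][j], LABELS[j], LABELS[i])
        for i in range(len(LABELS))
        for j in range(i)
    ]
    return sorted(pairs)[:n]

print(association("Life Exp", "Literacy"))
print(closest_pairs())
```

The three smallest values all involve the 'per capita' variables or 'People/doctor', which matches the reading of the dendrogram given above.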

Conclusions and inferences

  1. An 'analysis' usually requires more than one re-run due to missing data, errors, standardisation or transformations, and outliers. Each of these is an important issue to consider carefully. Remember 'garbage in - garbage out'.
  2. In most situations, all evaluation tools in PATN should be enlisted to aid understanding.
  3. The analysis appears to be logical and meaningful as the groups are internally cohesive (have similar country profiles) and externally relatively distinct. The classification will always imply better separation of groups than the ordination will usually suggest.
  4. The classification is more robust than the ordination, but the ordination is more powerful.
  5. The ordination identifies the major trends which could be interpreted as affluence and health.
  6. There is sufficient evidence to support some predictive models based on the variables identified as having fair r-squared and KW values: 'Life expectancy', 'Literacy', 'Deaths/1000', 'People/doctor' and 'University%'.
  7. The analysis of variables corroborates what we have learnt from the analysis of countries.

This hasn't been a comprehensive analysis, but it should provide an indication of how to quickly get a useful perspective on your data.

Lee Belbin