Finding patterns in data

Large Datasets

When does the size of the dataset change your analysis strategy? There are four options in PATN that may not be appropriate for large datasets:

  1. Association (when the generation of a lower symmetric matrix is involved) and three options that are based on this association matrix
  2. Hierarchical classification (such as flexible UPGMA)
  3. Ordination (SSH in PATN) and

Once you have more than around 100 objects, traditional hierarchical clustering is less appropriate than a non-hierarchical strategy. Examining a dendrogram of more than 100 objects can be overwhelming, unless you have an intimate knowledge of the data and the processes that are generating the variation.

Ordination of more than 100 objects can pose greater problems. The ordination plot will display only the objects that are visible on the outer parts of clusters, and from these you can usually infer the location on the plot of the unseen objects. It may still be useful to run an ordination for a few hundred objects just to examine the overall distribution of objects and to get the PCC orientations (the directions of the variables in the ordination space).

When you get to thousands of objects, the time and memory needed to generate the association values start to become a consideration. PATN will happily try to produce the association matrix, but it may not be wise for you to ask for it! Hierarchical classification of thousands of objects really is getting crazy. Ordination will be even worse, because it requires the most computer resources of any analysis in PATN. Waiting a few hours for a useful outcome may be appropriate, but when the outcome is of marginal value, think again.
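To see why the association matrix becomes a concern, it helps to count the values it holds. The sketch below assumes that a lower symmetric matrix for n objects stores one value per pair of objects, n(n-1)/2 in total. The function names and the 4 bytes per value are illustrative assumptions, not PATN's documented storage format.

```python
def pairwise_association_count(n_objects: int) -> int:
    """Number of object pairs, i.e. values in the lower triangle (no diagonal)."""
    return n_objects * (n_objects - 1) // 2


def approx_megabytes(n_values: int, bytes_per_value: int = 4) -> float:
    """Rough storage estimate. bytes_per_value is an assumption, not a PATN figure."""
    return n_values * bytes_per_value / 1_000_000


for n in (100, 1000, 5000):
    count = pairwise_association_count(n)
    print(f"{n:>5} objects: {count:>12,} association values, "
          f"roughly {approx_megabytes(count):.1f} MB")
```

The count grows with the square of the number of objects, so ten times as many objects means roughly one hundred times as many association values.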

A Suggested Strategy for Larger Datasets

  1. Don't generate a (pair-wise) association matrix between objects
  2. Use non-hierarchical classification (which uses object-centroid association values)
  3. Group statistics and box and whisker plots as usual
  4. Export row (object) group centroids
  5. Import the group centroids as a new PATN dataset
  6. Run association and SSH
  7. Use PCC, MCAO and ANOSIM as required
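The idea behind steps 1 to 6 can be sketched outside PATN. The code below is only an illustration on synthetic random data: it uses a simple k-means-style allocation, which is not necessarily the algorithm PATN uses, and it assumes the range-standardised form of the Gower metric. All function names are hypothetical. Objects are compared only with group centroids, and the pair-wise association matrix is generated only for the small set of centroids.

```python
import random


def gower_distance(a, b, ranges):
    """Mean range-standardised absolute difference (assumed form of the Gower metric)."""
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r > 0:  # skip constant variables to avoid dividing by zero
            total += abs(x - y) / r
    return total / len(ranges)


def centroid(rows):
    """Mean of each variable over the rows of one group."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def non_hierarchical(data, k, iterations=10, seed=0):
    """k-means-style allocation using object-centroid distances only (a stand-in)."""
    ranges = [max(col) - min(col) for col in zip(*data)]
    centroids = random.Random(seed).sample(data, k)  # k objects as starting seeds
    labels = [0] * len(data)
    for _ in range(iterations):
        # Allocate each object to its nearest centroid.
        labels = [
            min(range(k), key=lambda g: gower_distance(row, centroids[g], ranges))
            for row in data
        ]
        # Recompute each centroid from its current members.
        for g in range(k):
            members = [row for row, lab in zip(data, labels) if lab == g]
            if members:
                centroids[g] = centroid(members)
    return labels, centroids, ranges


def lower_triangle(points, ranges):
    """Pair-wise association values between the (few) centroids."""
    return [
        [gower_distance(points[i], points[j], ranges) for j in range(i)]
        for i in range(len(points))
    ]


# Synthetic example data: 2000 objects by 5 variables.
rng = random.Random(1)
data = [[rng.random() for _ in range(5)] for _ in range(2000)]

labels, centroids, ranges = non_hierarchical(data, k=8)
matrix = lower_triangle(centroids, ranges)
print(len(data), "objects reduced to", len(centroids), "group centroids")
print(sum(len(row) for row in matrix), "centroid-pair association values")
```

With 8 groups the centroid matrix holds 28 values, compared with 1,999,000 for the 2000 original objects, so ordination and hierarchical classification of the centroids stay cheap.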

NOTE: If there are more than 100 objects in the Data Table, PATN will by default:

  1. Deselect the generation of pair-wise association values
  2. Select non-hierarchical classification with an appropriate measure
  3. Deselect ordination

Very approximate run times for PATN datasets (3 GHz, Windows XP, 512 MB memory)

# Objects  # Variables  Time (seconds)  Analysis
100        100          2               Gower metric on rows and columns, UPGMA (defaults), SSH (defaults), all variables applied to ANOSIM and MCAO (100 iterations), and PCC
200        200          4               As above
300        300          12              As above
500        500          50              As above
1000       1000         33              As above
1000       50           17              As above
2000       50           54              Non-hierarchical classification only (Gower metric)
5000       50           210             As above

Large Datasets and Dendrograms

The non-hierarchical classification algorithm in PATN will not generate a dendrogram. The algorithm will create a set of k groups. Often what is wanted is a dendrogram of these groups. In PATN V3, this can be achieved fairly easily. When any classification is run, PATN automatically produces group statistics. For each variable in each group, PATN reports the following values:

  1. Minimum
  2. First quartile
  3. Median
  4. Mean
  5. Third quartile
  6. Maximum

An example of this file:

[Figure: Group statistics]
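The six values reported per variable per group can be illustrated with a short sketch. PATN's exact quartile method is not stated here, so the code assumes the "inclusive" method of Python's statistics.quantiles. The function names and the small data table are made up for the example.

```python
import statistics


def six_number_summary(values):
    """Minimum, quartiles, mean and maximum for one variable within one group.

    Needs at least two values. The quartile method is an assumption.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.mean(values),
        "q3": q3,
        "max": max(values),
    }


def group_statistics(rows, labels, variable_names):
    """Summarise every variable within every group of objects."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return {
        label: {
            name: six_number_summary(list(column))
            for name, column in zip(variable_names, zip(*members))
        }
        for label, members in sorted(groups.items())
    }


# Made-up example: five objects in group 1, three in group 2, two variables.
rows = [[1, 10], [2, 20], [3, 30], [4, 40], [5, 50], [7, 5], [8, 6], [9, 7]]
labels = [1, 1, 1, 1, 1, 2, 2, 2]
stats = group_statistics(rows, labels, ["Var1", "Var2"])

for label, per_variable in stats.items():
    for name, summary in per_variable.items():
        print(f"Group {label} {name}: {summary}")
```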

To produce a group dendrogram, do the following:

  1. Select File | Export Evaluation Data | Row Group Statistics
  2. Edit the file (or write a program or script) to produce an Excel file with the row groups as rows and the variable means or medians as columns. The first row of the table will be group 1, and the first column will be the mean or median of variable 1. The second column will be the mean or median of variable 2, and so on, with one row for each of the k groups. Make sure you have both row and column labels, and save the file in Excel format.
  3. Import the Excel file into PATN
  4. Select the same association measure that you used for the non-hierarchical classification
  5. Select hierarchical clustering with defaults, and select the number of final groups that you think you may want
  6. Run the classification and display the dendrogram
  7. If needed, alter the number of groups and re-run.
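Step 2 can be scripted. The sketch below is only a starting point: the layout of the exported Row Group Statistics file is not described here, so the code assumes a hypothetical "long" CSV layout with the columns group, variable, statistic and value, and the values in the example are made up. It writes CSV text, which Excel can open and re-save in Excel format. Adapt the parsing to the real export before using it.

```python
import csv
import io


def pivot_group_statistics(long_csv_text, statistic="mean"):
    """Build a groups-by-variables table for one statistic.

    Assumes hypothetical input columns: group, variable, statistic, value.
    """
    reader = csv.DictReader(io.StringIO(long_csv_text))
    variables = []  # column order follows first appearance in the input
    table = {}      # group number -> {variable name: value}
    for record in reader:
        if record["statistic"].strip().lower() != statistic:
            continue
        group = int(record["group"])
        variable = record["variable"].strip()
        if variable not in variables:
            variables.append(variable)
        table.setdefault(group, {})[variable] = record["value"].strip()
    header = ["Group"] + variables
    rows = [
        [f"Group {g}"] + [table[g].get(v, "") for v in variables]
        for g in sorted(table)
    ]
    return header, rows


def to_csv_text(header, rows):
    """Serialise the table with row and column labels."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()


# Made-up input in the assumed layout.
example = """group,variable,statistic,value
1,Var1,mean,2.5
1,Var1,median,2.0
1,Var2,mean,10.0
1,Var2,median,9.0
2,Var1,mean,7.5
2,Var1,median,8.0
2,Var2,mean,3.0
2,Var2,median,3.5
"""

header, rows = pivot_group_statistics(example, statistic="mean")
print(to_csv_text(header, rows))
```

Passing statistic="median" gives the table of medians instead. Whichever is chosen, use the same statistic for every group so that the rows of the table are comparable.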