Answers to Frequently Asked Questions about PATN and pattern analysis-
PATN is a software package that performs Pattern Analysis. PATN aims to display patterns in complex data. Complex, in PATN's terms, means that you have at least 6 objects that you want to know something about and a suite of more than 4 variables that describe those objects. Data must be in the form of a spreadsheet of rows (the objects in PATN) and columns (the variables), as in Microsoft Excel™.
There are usually around 7 components to a 'realistic' (read as adequate, comprehensive, fair, reasonable or intelligent) pattern analysis in PATN-
- Import the data
- Check the data using PATN's Visible Statistics functions. It is important that you are confident that your data is error-free. PATN will operate on whatever data is available. There have been examples where conclusions have been drawn from flawed data. Don't be among them!
- Possible data transformation or standardization. Make sure the data are in a form where the association, classification and ordination will have the greatest opportunity to detect patterns.
- Generate association values between objects and / or variables. Association in PATN (resemblance, distance, dissimilarity, affinity etc) is a quantitative estimate of the relationship between each pair of objects and / or variables. This is probably the most important step in PATN. Understand how the different options work!
- Classify. PATN has two options to organise the objects and / or the variables into a set of discrete groups. If there really are well-defined groups in the data, PATN will easily find them. In most cases, there is a gradation between groups, so objects may be marginal. User-defined groups or groups from other applications can be imported and analysed. Think of classification as reducing n-objects to k-groups. Makes information easier to view!
- Ordinate. This is the most powerful technique in PATN. Think of ordination as reducing the number of significant variables to 2 or, more usually, 3. With 3 new variables, visualization of the objects (not variables in PATN) is easy!
- Analyse the results. This is where you come in! PATN can do most of the computational work, but the hard work is in interacting with PATN to interpret the results. Name the groups from a classification! Detect the trends from an ordination!
PATN is set up to make it easy for you to follow this process. For the average dataset, the first 6 steps should take no more than 5 minutes! The 7th step could, however, take many hours as you come to grips with what PATN is trying to say about your data. The 3-dimensional plot in PATN is the most powerful component for helping you with step 7.
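To make the association step more concrete, here is a minimal Python sketch of one measure named later in this FAQ, the Gower metric, taken here to mean the range-standardized absolute difference averaged over variables. The function name `gower_metric` and the sample data are illustrative only, and it is an assumption that this formula matches PATN's own implementation in every detail. The result is the lower triangle of the pair-wise association matrix between objects (rows).

```python
def gower_metric(rows):
    """Return the lower triangle of Gower distances between rows.

    rows: list of equal-length lists of numbers (objects x variables).
    Assumes there are no missing values (an illustrative simplification).
    """
    n_vars = len(rows[0])

    # Range of each variable; a constant variable gets range 1.0
    # so that it contributes zero instead of dividing by zero.
    ranges = []
    for k in range(n_vars):
        column = [row[k] for row in rows]
        span = max(column) - min(column)
        ranges.append(span if span > 0 else 1.0)

    # lower[i] holds the distances from object i to objects 0..i-1.
    lower = []
    for i in range(len(rows)):
        lower.append([
            sum(abs(rows[i][k] - rows[j][k]) / ranges[k]
                for k in range(n_vars)) / n_vars
            for j in range(i)
        ])
    return lower


# Three hypothetical objects described by two variables.
data = [
    [0.0, 10.0],
    [5.0, 10.0],
    [10.0, 20.0],
]
assoc = gower_metric(data)
for i, values in enumerate(assoc):
    print(i, values)
```

Each value lies between 0 (identical objects) and 1 (objects at opposite extremes on every variable), which is one reason a range-standardized measure is convenient when variables are on different scales.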
PATN was the result of research in CSIRO toward finding patterns in ecological data. Traditionally, PATN has comprised four types of basic analytical tools-
- Clustering or classification
- Evaluation methods
While statistical packages all have some techniques in most of the three areas, some of the early innovative developments of these techniques happened in Australia in botany. Dr Godfrey Lance was the Director of the CSIRO Division of Computing Research and Professor Bill Williams was a research scientist with the CSIRO Division of Plant Industry. Between them, they developed a range of new classification techniques that pioneered what was then known as pattern analysis. Years later, Dan Faith, Peter Minchin and I advanced understanding of association measures and ordination. The result of this work was originally called NTP (Numerical Taxonomy Package) and subsequently PATN. It was originally a computer program designed for research: to seek new, more effective algorithms and strategies for finding patterns in complex data.
The DOS versions of PATN have a comprehensive range of options because each needed to be tested for effectiveness. Most of the options would rarely be used. Subsequent investigations suggested that many should never be used! The philosophy behind the DOS version was modularity; modifications and new options needed to be able to be easily added in a research environment. Another key was getting PATN users to think about why certain options were being selected; to move away from a 'black box' approach to pattern analysis that existed at the time. PATN also had a comprehensive range of data manipulation options. With all the options available, there were thousands of potential pathways through a typical analysis.
With the Windows version of PATN, the philosophy has radically changed. I have used my 25 years of experience in pattern analysis to establish a far smaller suite of robust pattern analysis options. The new version examines your data on import and designs an appropriate suite of default analysis options and analysis scenario. A typical analysis that may have taken hours to work through many steps is now done in seconds. Clustering, ordination and network components are now all done together (on rows or columns of the Data Table) and can be presented to you in ways that make it easier to understand. Now, most of your time should be spent working with the Ordination Plot. It is here that most of the results can be interactively displayed and interpreted.
Additions to PATN (for Windows™) will only be made when significant improvements can be demonstrated from extensive research and user feedback.
Installing PATN should be very simple but it does have some oddities. These are due to the requirement of some universities that use PATN in teaching. In this case, additional security has required the embedding of code that monitors the installations. As we didn't want to manage another version of PATN, everyone is subjected to the same indignities. My apologies.
The number of steps involved is dependent on your Internet connection. Generally, the process has these steps-
- On the eSellerate Web site, fill in your details, pay and download PATN
- Save a copy of the downloaded file somewhere secure. If you purchased the optional eSellerate Download Service, you can always re-download the original file from the eSellerate site for up to a year after the date of purchase.
- You will receive an e-mail using your supplied e-mail address containing the following information-
- A URL that will link to your purchase details
- An Order Number
- Date of purchase
- Your contact details
- Your order details
- Your Serial Number
It is important to keep a copy of this e-mail secure. I suggest you should make a few copies (at least one hard copy) and keep the information in a secure location.
Double click on the downloaded file to run the PATN setup. The program will check if you are online. If you are online, you will be requested to enter your Serial Number. When this is done, PATN will then authenticate the serial number against the user database. Once this is done, PATN can then be run with no further bureaucracy. Note, it may take a minute or two for PATN to try a few options due to security and firewall complexities.
PATN will present the following window with three options-
Option 1 - Activate using a web browser on this computer
This option will use the web browser on the same system as PATN. In cases where firewalls stop non-browser applications from accessing the Internet, you can open your web browser at the URL in the window below, follow the instructions and receive an activation key from the eSellerate site.
Option 2 - Activate using a different computer that has web access
This option uses a web browser on another computer that is linked to the Internet. If Option 2 is selected, the following screen is displayed. Just follow the instructions. Take a copy of the Installation ID displayed as below to another computer that is linked to the Internet and when you get an Activation Key, return to the computer that PATN is on and follow the instructions.
Option 3 - I have an activation key
This option is for when you have received an Activation Key.
Simply enter the Key and the following screen will be displayed. Note that it is wise to use the Save button and keep a copy of the Installation ID and Activation Key in a secure location.
The process of entering your serial number or activation key creates an activation event in the PATN user database.
The activation process is certainly easier if you are online!
Training courses in pattern analysis and PATN can be arranged on demand. If you are new to pattern analysis and PATN, then a two-day course is recommended. A one-day course can be arranged for those who want to get the most out of PATN, but participants require some prior experience in pattern analysis.
These courses are intense, and wherever possible, will make use of your own data for running examples.
The courses are run by Lee Belbin, the author of PATN. It is recommended that the courses contain a maximum of 8 people. Charges are on a per day basis with expenses. If you are interested in organising or attending a course, please email Lee Belbin for further details.
Anderberg, M.R. (1973). Cluster Analysis for Applications. Academic Press.
Austin, M.P. and Belbin, L. (1982). A new approach to the species classification problem in floristic analysis. Australian Journal of Ecology., 7, 75-89 (two-step)
Belbin, L. (1980). Twostep: A Program Incorporating Asymmetric Comparisons that uses Two steps to Produce a Dissimilarity Matrix. CSIRO Division of Land Use Research. Technical Memorandum 80/9, June 1980. Canberra.
Belbin, L. (1984). FUSE, a FORTRAN 5 program for agglomerative fusion on micro-computers. Computers and Geosciences 10(4), 361-384
Belbin, L. (1987). The use of non-hierarchical allocation methods for clustering large sets of data. Australian Computer Journal,19,1,32-41.
Belbin, L. (1991). Semi-strong hybrid scaling, a new ordination algorithm. Journal of Vegetation Science, 2: 491-496.
Belbin, L. (1995). A multivariate approach to the selection of biological reserves. Biodiversity and Conservation 4, 951-963.
Belbin, L., Faith, D.P. and Milligan, G.W. (1992). A comparison of two approaches to ß-flexible clustering. Multivariate Behavioural Research. 27, 417-433.
Belbin, L., Faith, D.P. and Minchin, P.R. (1984). Some algorithms contained in the Numerical Taxonomy Package NTP. CSIRO Division of Water and Land Resources Technical Memorandum 84/23.
Belbin, L., Marshall, C. & Faith, D.P.(1983). Representing relationships by automatic assignment of colour. The Australian Computing Journal 15, 160-163.
Bray, J.R. and Curtis J.T. (1957). An ordination of the upland forest communities of southern Wisconsin, Ecological Monographs, 27, 325-349.
Clarke, K.R. & Green, R.H. (1988). Statistical design and analysis for a 'biological effects' study. Marine Ecology Progress Series, 46: 213-226.
Clifford, H.C and Stephenson, W.C (1975). An Introduction to Numerical Classification. (Wiley).
Coxon A.P.M. (1982). The user's guide to multidimensional scaling. Heinemann, London, 271p. (good text)
Czekanowski J (1913): Zarys metod statystycznych. Warsaw.
Everitt, B. (1980). Cluster Analysis. 2nd Ed. (Heinemann Educational for Social Science Research Council: London). 136 p.
Faith, D.P., Minchin, P.R. and Belbin, L (1987). Compositional dissimilarity as a robust measure of ecological distance: A theoretical model and computer simulations. Vegetatio 69, 57-68. (hybrid scaling)
Goodall, D.W. (1969) Affinity between an individual and a cluster in numerical taxonomy. Biometrie-Praximetrie 9, 52-55.
Gower, J. C. (1967): A comparison of some methods of cluster analysis. Biometrics 23(4):623-637.
Gower, J.C and Ross, G.J.S. (1969). Minimum spanning trees and single linkage cluster analysis. Applied Statistics 18: 54-64.
Gower, J.C. (1971). A general coefficient of similarity and some of its properties. Biometrics 27: 857-71.
Guttman L (1968) A general non-metric technique for finding the smallest coordinate space for a configuration of points. Psychometrika 33, 469-506.
Jaccard, P. (1908). Nouvelles recherches sur la distribution florale. Bull. Soc. Vaud. Sci. Nat., 44: 223-270.
Jardine, N. & Sibson R. (1971) Mathematical Taxonomy. Wiley, London. 286p.
Kruskal J B & Wish M (1978) Multidimensional scaling. Sage, California, 94p. (very readable)
Kruskal J B, Young F W and Seery J B (1973) How to use KYST, a very flexible program to do multidimensional scaling and unfolding. Unpublished, Bell Laboratories. (KYST manual, not fabulous)
Kruskal, J B (1964) Multidimensional scaling by optimising goodness of fit to a non-metric hypothesis. Psychometrika 29(1), 1-27.
Kruskal, J B (1964) Non-metric multidimensional scaling: a numerical method. Psychometrika 29(2), 115-129.
Lance, G.N. & Williams, W.T. (1967) A general theory of classificatory sorting strategies. 1. Hierarchical systems, Computing Journal, 9, 373-380.
Lehmann, E.L. (1975). Nonparametrics: statistical methods based on ranks. Holden-Day, Oakland, Cal.
Lingoes J C & Roskam E . (1973) A mathematical and empirical analysis of two multidimensional scaling algorithms. Psychometrika 38(1), 1-81. (technical summary)
Manly, B.F.J. (1991). Randomization and Monte Carlo Methods in Biology. Chapman & Hall, London, 281p.
Mantel, N. (1967). The detection of disease clustering and a generalized regression approach. Cancer Research, 27, 209-220.
Shepard R N (1962) The analysis of proximities: multidimensional scaling with an unknown distance function. Psychometrika 27(2), 125-140. (important paper)
Schiffman S S, Reynolds M L & Young F W (1981) Introduction to multidimensional scaling. Theory methods and applications. Academic Press, New York, 413p. (lots of examples)
Sneath, P.H.A. and Sokal, R.R. (1973). Numerical Taxonomy. (W.H. Freeman and Company: San Francisco). 573 p.
Sokal, R.R. & Michener, C.D. (1958) A statistical method for evaluating systematic relationships. Univ.Kansas.Sci.Bull., 38, 1409-1438.
Spencer R (1986) Similarity mapping. Byte, August, pp 85-92. (very simple introduction to multidimensional scaling)
When does the size of the dataset change your analysis strategy? There are 4 options in PATN that may not be appropriate for large datasets-
- Association (when the generation of a lower symmetric matrix is involved) and three options that are based on this association matrix
- Hierarchical classification (such as flexible UPGMA)
- Ordination (SSH in PATN) and
Once you have more than around 100 objects, traditional hierarchical clustering is less appropriate than a non-hierarchical strategy. Examining a dendrogram of more than 100 objects can be overwhelming, unless you have an intimate knowledge of the data and the processes that are generating the variation.
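As a rough illustration of why dendrograms grow unwieldy, the following Python sketch performs plain (unweighted) average-linkage agglomerative clustering, usually called UPGMA. It is a simplified stand-in and not PATN's flexible UPGMA: there is no flexibility parameter, and the function name `upgma` and the input layout are assumptions made for this example. The point to notice is that n objects always produce n - 1 fusions, each of which becomes a node in the dendrogram.

```python
def upgma(distances, n_objects):
    """Plain average-linkage clustering.

    distances: dict mapping frozenset({i, j}) -> distance for every
               pair of objects numbered 0..n_objects-1.
    Returns a list of fusions (cluster_a, cluster_b, height); each new
    cluster receives the next unused id, starting at n_objects.
    """
    sizes = {i: 1 for i in range(n_objects)}  # cluster id -> member count
    d = dict(distances)
    merges = []
    next_id = n_objects

    while len(sizes) > 1:
        # Fuse the closest pair of current clusters.
        pair = min(d, key=d.get)
        a, b = sorted(pair)
        height = d[pair]
        size_a, size_b = sizes[a], sizes[b]

        # Distance from the new cluster to every other cluster is the
        # size-weighted average of the two old distances.
        for c in sizes:
            if c != a and c != b:
                d[frozenset((next_id, c))] = (
                    size_a * d[frozenset((a, c))]
                    + size_b * d[frozenset((b, c))]
                ) / (size_a + size_b)

        # Drop every distance that involves the two fused clusters.
        d = {p: v for p, v in d.items() if a not in p and b not in p}
        del sizes[a], sizes[b]
        sizes[next_id] = size_a + size_b
        merges.append((a, b, height))
        next_id += 1

    return merges


# Three hypothetical objects and their pair-wise distances.
example = {
    frozenset((0, 1)): 0.25,
    frozenset((0, 2)): 1.0,
    frozenset((1, 2)): 0.75,
}
merges = upgma(example, 3)
print(merges)
```

With 100 objects the same loop yields 99 fusions, which is about the point where reading the resulting tree by eye becomes hard work.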
Ordination of more than 100 objects can pose greater problems. The Ordination plot will display objects that are visible on the external parts of clusters. By omission, you can usually infer the location on the plot of unseen objects. It may be useful to have an ordination for a few hundred objects just to examine the overall distribution of objects and get the PCC (the directions of the variables in the ordination space) orientations.
When you get to thousands of objects, generation time and memory requirements in the generation of association values start to be a consideration. PATN will happily try to produce the association matrix, but it may not be wise for you to ask for it! Hierarchical classification of thousands of objects really is getting crazy. Ordination will be even worse because the computer resources required for it are the greatest in PATN. Waiting a few hours for a useful outcome may be appropriate, but when it is of marginal value, think again.
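The growth in the association matrix is easy to quantify: a lower symmetric matrix for n objects holds n(n - 1)/2 values. The short Python sketch below works this out for a few dataset sizes. The figure of 8 bytes per value is an assumption made for illustration; how PATN actually stores association values is not stated here.

```python
def pairwise_count(n_objects):
    """Number of values in a lower symmetric matrix (no diagonal)."""
    return n_objects * (n_objects - 1) // 2


def matrix_megabytes(n_objects, bytes_per_value=8):
    """Approximate storage in MB; bytes_per_value is an assumed figure."""
    return pairwise_count(n_objects) * bytes_per_value / 1_000_000


for n in (100, 2000, 1_000_000):
    print(n, pairwise_count(n), round(matrix_megabytes(n), 1))
```

The count grows with the square of the number of objects, so going from 100 to 2000 objects (20 times as many) multiplies the number of association values by roughly 400.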
A Suggested Strategy for Larger Datasets
- Don't generate a (pair-wise) association matrix between objects
- Use non-hierarchical classification (which uses object-centroid association values)
- Group statistics and box and whisker plots as usual
- Export row (object) group centroids
- Import the group centroids as a new PATN dataset
- Run association and SSH
- Use PCC, MCAO and ANOSIM as required
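The centroid export is the step that shrinks the problem: thousands of objects become k rows, one per group. PATN produces this export for you, so the Python sketch below is only meant to show what such a table contains. It assumes a centroid is the per-variable mean of the group's members, and the function name `group_centroids` and the sample data are hypothetical.

```python
def group_centroids(rows, groups):
    """Return {group label: list of per-variable means}.

    rows:   list of equal-length lists of numbers (objects x variables)
    groups: list of group labels, one per row
    """
    sums = {}
    counts = {}
    for row, label in zip(rows, groups):
        if label not in sums:
            sums[label] = [0.0] * len(row)
            counts[label] = 0
        for k, value in enumerate(row):
            sums[label][k] += value
        counts[label] += 1
    return {
        label: [total / counts[label] for total in sums[label]]
        for label in sorted(sums)
    }


# Three hypothetical objects allocated to two groups.
rows = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]]
groups = [1, 1, 2]
centroids = group_centroids(rows, groups)
print(centroids)
```

The resulting k-row table is small enough that pair-wise association and SSH become practical again.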
NOTE: If there are more than 100 objects in the Data Table, PATN will by default-
- Deselect the generation of pair-wise association values
- Select non-hierarchical classification with an appropriate measure
- Deselect ordination
Very approximate run times for PATN datasets (3GHz, Windows XP, 512MB memory)
|# Objects||# Variables||Analysis||Time (seconds)|
|100||100||Gower Metric on rows and columns, UPGMA (defaults), SSH (defaults) and all variables applied to ANOSIM & MCAO (100 iterations) and PCC||2|
|2000||50||Non-hierarchical classification only (Gower metric)||54|
Large Datasets and Dendrograms
The non-hierarchical classification algorithm in PATN will not generate a dendrogram. The algorithm will create a set of k groups. Often what is wanted is a dendrogram of these groups. In PATN V3, this can be achieved fairly easily. When any classification is run, PATN automatically produces group statistics. For each variable in each group PATN reports the following values-
- First quartile
- Third quartile
An example of this file -
To produce a group dendrogram do the following-
- Select File | Export Evaluation Data | Row Group Statistics
- Edit the file (or write a program or script) to produce an Excel file in the form of row groups by column means or medians. The first row of the table will be group 1 and the first column will be the variable 1 mean or median. The second column will be the variable 2 mean or median, and so on for each of the k groups. Make sure you have both row and column labels and save the file in Excel format.
- Import the Excel file into PATN
- Select the same association measure that you used for the non-hierarchical classification
- Select hierarchical clustering with defaults, and select the number of final groups that you think you may want
- Run the classification and display the dendrogram
- If needed, alter the number of groups and re-run.
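The editing step in this procedure can be scripted. The Python sketch below is one possible approach and rests on a loud assumption: it treats the exported statistics as a CSV with one record per group and variable and with columns named `group`, `variable`, `mean` and `median`. Those column names and that layout are hypothetical, so check them against your own Row Group Statistics export and adjust. The script returns CSV text with row and column labels; opening it in Excel and saving it in Excel format is left as a manual step.

```python
import csv
import io


def pivot_group_stats(stats_csv_text, statistic="mean"):
    """Reshape long-format group statistics into groups x variables.

    stats_csv_text: CSV text with (assumed) columns
                    group, variable, mean, median
    statistic:      which column to place in the table cells
    """
    reader = csv.DictReader(io.StringIO(stats_csv_text))
    table = {}      # group -> {variable: value}
    variables = []  # column order, in order of first appearance
    for record in reader:
        group = record["group"]
        variable = record["variable"]
        if variable not in variables:
            variables.append(variable)
        table.setdefault(group, {})[variable] = record[statistic]

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["group"] + variables)  # column labels
    for group, values in table.items():     # one row per group
        writer.writerow([group] + [values.get(v, "") for v in variables])
    return out.getvalue()


# Made-up statistics for two groups and two variables.
example = """group,variable,mean,median
1,pH,6.5,6.4
1,depth,12.0,11.0
2,pH,7.1,7.0
2,depth,30.5,29.0
"""
wide = pivot_group_stats(example)
print(wide)
```

Passing `statistic="median"` builds the same table from medians instead of means.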
File and Data Management
File and data management are issues that need to be considered in any computing environment and the efficient use of PATN is no exception. What do I mean by data management? Basically, data management is having a strategy about how your data is stored, managed, displayed, recovered and re-used. Always assume that you or someone else will want to go back to your data and/or analysis at some point in the future. If you want to save a lot of angst, an understanding of a few basics will help a lot.
PATN produces three basic types of output-
- PATN project Files (xxx.ptn)
- CSV export files (xxx.csv)
- Images (either xxx.emf or xxx.bmp) or video (xxx.avi).
All analysis results are stored in PATN's Project file (a database with the extension 'PTN'). This database will, however, only store the latest analysis results, so it pays to save Project files when any changes are made to the data or analyses. A key strategy is to fill in the Tools | Option | Project Comment box with as much detail as possible to reflect the exact description/status of the file. That way, any recovery/re-analysis is hopefully self-explanatory.
Comma-Separated Values Files
PATN can export all numerical results in comma-separated values (CSV) format. These files can be read by Excel and most other statistical or analytical programs. As these data are from the Project file, there is usually little reason to export them for archiving. The only exception here is the Data Table itself (File | Export | Data Table). This is a very safe format for long-term archival.
It may be a good idea to dump out the Data Table and all analysis results and then zip them into a single file. CSV files can usually be highly compressed, so the resultant zip file containing all the PATN files is usually small.
Images and Video Files
PATN exports all graphical material either as EMF format (enhanced meta file - vector-based) or BMP format (image based). As these are images, they can be edited by any image editor (e.g. Photoshop, ACDSee etc.)
In summary, the best strategy for data management is to ensure that well-documented multiple copies of the PATN Project files are maintained along with an occasional CSV dump of the Data Table and optionally the analyses exports - all zipped into a single file.
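If you make such archives regularly, the zipping can be automated. The following Python sketch uses only the standard library to bundle every `.csv` and `.ptn` file in a folder into one compressed zip file. The function name `archive_exports`, the folder layout and the file names in the demonstration are all made up for illustration.

```python
import tempfile
import zipfile
from pathlib import Path


def archive_exports(export_dir, archive_path):
    """Zip all .csv and .ptn files found directly in export_dir.

    Returns the list of file names that were added to the archive.
    """
    export_dir = Path(export_dir)
    files = sorted(
        p for p in export_dir.iterdir()
        if p.is_file() and p.suffix.lower() in {".csv", ".ptn"}
    )
    with zipfile.ZipFile(archive_path, "w",
                         compression=zipfile.ZIP_DEFLATED) as zf:
        for p in files:
            zf.write(p, arcname=p.name)  # store without the folder path
    return [p.name for p in files]


# Demonstration in a throwaway folder with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "data_table.csv").write_text("site,pH\nA,6.5\n")
    (tmp / "notes.txt").write_text("not archived\n")
    archived = archive_exports(tmp, tmp / "patn_archive.zip")
    with zipfile.ZipFile(tmp / "patn_archive.zip") as zf:
        names_in_zip = zf.namelist()

print(archived)
```

Including the date in the archive name is a simple way to keep the multiple copies apart.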
PATN v3 will check for updates when it is running, or you can use the Files Menu | Check for Updates command. Note that some firewalls may block access to port 80 (the standard Web port) from non-browser applications, which PATN is. We have also noted some problems with some users' proxy servers. If these problems occur, a manual update can be done.
- A file named something like PATNxxx (where xxx is the version number) will be downloaded to a directory on the root drive (usually c:) called something like "C:ESELLERATE DOWNLOADS 7-20-2004", where "7-20-2004" is the date in USA format. Normally, you won't need to know this.
- When the file has been downloaded, PATN will close and the installer will run and detect that a version of PATN is already installed. This is fine. Just install the new version over the top of the old one.
You will only have to enter the serial number that you have from the original download of PATN if you had inadvertently uninstalled PATN.
That is all there is to it.
Updating PATN manually if you have firewall or proxy server problems-
- Open your browser and enter the following URL http://www.patn.com.au/patnupgrade
- Double click on the file called patn300_302update.exe
- This will install PATN v3.02 over PATN v3.00 or PATN v3.01.
You will only have to enter the serial number that you have from the original download of PATN if you had inadvertently uninstalled PATN.
That is all there is to it.
Ancient DOS PATN vs PATN v3
The original DOS version of PATN contained a huge suite of options, but most were rarely used. The Windows version has fewer options but is far simpler to use. The underlying algorithms in PATN V3 are basically the same as the DOS version of PATN, but the user-interface of PATN V3 provides a radically different analysis environment. For those unusual people who occasionally ask, the following table outlines the differences in utilities and options between the DOS version of PATN and PATN V3.
|DOS PATN||Description||PATN v3|
|ASIM||ANOSIM (randomization tests based on association)||Yes|
|ASO||Association measures between rows (17)||Yes 4|
|ASON||Input, manipulation and output of association matrices||Yes (LSM)|
|BOND||Bonding lists based on NNB||No|
|CHI2||Chi-squared statistic across groups||No|
|COLR||Mapping of objects by colour||Yes|
|DATN||Data input, editing and export||Yes|
|DEND||Dendrograms (from hierarchical classification)||Yes|
|DCOR||Detrended correspondence analysis (DECORANA)||No|
|FUSE||Agglomerative hierarchical clustering (8)||Yes 4|
|GASO||Association between rows with variable grouping||No|
|GDEF||Define or manipulate group definitions (9)||Yes 3|
|GSTA||Statistical summaries of variables across groups||Yes|
|HIST||Histograms and univariate statistics||Yes|
|LABN||Input, creation and output of labels||Yes|
|MASK||Masking, sampling of data||Yes|
|MCAO||Monte-Carlo testing of variables in ordination||Yes|
|MDIV||Monothetic divisive clustering||No|
|MERG||Left-right and up-down merging of data||Yes|
|MST||Minimum spanning tree||Yes|
|NNB||Nearest neighbour analysis||No|
|PCA||Principal component/coordinate analysis||No|
|PCC||Fit variables to ordination using multiple linear regression||Yes|
|PCR||Orthogonal rotation of ordination axes||No|
|PDIV||Polythetic divisive clustering||No|
|PRAM||Specify data and environmental parameters||Yes|
|PROC||Procrustes analysis (compare two ordinations)||No|
|RAND||Data generation by random variates (8 distributions)||Yes|
|RIND||Hubert & Arabie statistic of group difference||No|
|SALE||Travelling salesman network||No|
|SAMP||Row or column sampling strategies||No|
|SCAT||Scatter plots||Yes (ord)|
|SENS||Sensitivity/ redundancy analysis of variables across groups||No|
|SERE||Seriation (1d ordination by 'smoothness')||No|
|SSH||Semi-strong hybrid multidimensional scaling||Yes|
|TRNA||Transformations and standardizations of association matrices||No|
|TRND||Transformations and standardizations of data (12)||Yes 7|
|TSPN||Pre-processor for Acrospin 3d visualization (+SPIN)||Yes better!|
|TWAY||two-way table (re-ordering of data by classifications)||Yes|
|TWIN||Mark Hill's Twinspan (two-way indicator species analysis)||No|
*PATN v3+ does not include ALOB's weighting of variables option as such, but a weighting may be achieved using a combination of variable standardization and Minkowski metrics. PATN V3 does handle large datasets (1 million+ objects).