This Appendix lists software implementations of the clustering methods presented earlier. The coverage is not exhaustive: only software known to the author to be useful either via direct experience or online reviews is included.
8.1 Cluster analysis facilities in general-purpose statistical packages
Most general-purpose statistics / data analysis packages provide some subset of the standard dimensionality reduction and cluster analysis methods: principal component analysis, factor analysis, multidimensional scaling, k-means clustering, hierarchical clustering, and sometimes others not covered in this book. In addition, they typically provide an extensive range of extremely useful data creation and transformation facilities. A selection of them is listed in alphabetical order below; URLs are given for each and are valid at the time of writing.
8.1.1 Commercial
- GENSTAT
http://www.vsni.co.uk/software/genstat
- MINITAB
http://www.minitab.com/en-US/products/minitab/
- NCSS
http://www.ncss.com/
- SAS
http://www.sas.com/
- SPSS
http://www-01.ibm.com/software/uk/analytics/spss/
- STATA
http://www.stata.com/
- STATGRAPHICS
http://www.statgraphics.com/
- STATISTICA
http://www.statsoft.com/
- SYSTAT
http://www.systat.com/
8.1.2 Freeware
- CHAMELEON STATISTICS
http://www.seventh-sense-software.com/chameleon.htm
- MICROSIRIS
http://www.microsiris.com/
- ORIGINLAB
http://www.originlab.com/
- PAST
http://folk.uio.no/ohammer/past/
- PSPP
http://www.gnu.org/software/pspp/
- TANAGRA
http://eric.univ-lyon2.fr/~ricco/tanagra/en/tanagra.html
This is unusual in including the self-organizing map in addition to the standard methods.
- WINIDAMS
www.unesco.org/idams/
- WINSTAT
http://www.winstat.com/
8.3 Programming languages
All the foregoing packages are good, most are excellent, and any corpus linguist who is seriously interested in applying cluster analysis to his or her research can use them with confidence. Such a researcher should, however, consider learning how to use at least one programming language for this purpose. The packages listed above offer a small subset of the dimensionality reduction and cluster analysis methods currently available in the research literature, and users of them are restricted to this subset; developments of and alternatives to these methods, such as DBSCAN and the many others that were not even mentioned, remain unavailable to them. These developments and alternatives have appeared and continue to appear for a reason: to refine cluster analytic methodology. In principle, researchers should be in a position to use the best methodology available in their field, and programming makes the current state of clustering methodology accessible to corpus linguists because it renders implementation of any current or future clustering method feasible. A similar case for programming is made by Gries (2011a).
There are numerous programming languages, and in principle any of them can be used for corpus linguistic applications. In practice, however, two have emerged as the languages of choice for quantitative natural language processing generally: Matlab and R. Both are high-level programming languages in the sense that they provide many of the functions relevant to statistical and mathematical computation as language-native primitives and offer a wide range of excellent graphics facilities for display of results. For any given algorithm, this allows programs to be shorter and less complex than they would be for lower-level, less domain-specific languages like, say, Java or C++, and makes the languages themselves easier to learn.
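To give a concrete sense of what implementing a method outside the packages' subset involves, the following is a minimal sketch of DBSCAN. It is written in Python purely for illustration; Python is not one of the two languages recommended here, and an equivalent could be written in Matlab or R. The function names `dbscan` and `region_query`, the parameter names `eps` and `min_pts`, and the sample data are hypothetical choices made for this sketch, which assumes Euclidean distance and a brute-force neighbour search and makes no attempt at efficiency.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

NOISE = -1  # label given to points that belong to no cluster


def region_query(points, i, eps):
    """Return the indices of all points within eps of points[i], including i."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster number (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not yet visited"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # provisional: may become a border point later
            continue
        # points[i] is a core point, so it starts a new cluster
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is a core point: keep growing
    return labels


# Hypothetical data: two tight groups of four points and one isolated point.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
    (50.0, 50.0),
]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The two groups are recovered as clusters 0 and 1, and the isolated point is labelled as noise. In real use the rows of a data matrix derived from a corpus would replace the hypothetical `points`, and `eps` and `min_pts` would have to be chosen with reference to the data.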
Matlab (http://www.mathworks.co.uk/) is described by its website as a high-level language and interactive environment for numerical computation, visualization, and programming. It provides numerous and extensive libraries of functions specific to different types of quantitative computation such as signal and image processing, control system design and analysis, and computational finance. One of these libraries is called Math, Statistics, and Optimization, and it contains a larger range of dimensionality reduction and cluster analysis functions than any of the above software packages: principal component analysis, canonical correlation, factor analysis, singular value decomposition, multidimensional scaling, Sammon's mapping, hierarchical clustering, k-means, self-organizing map, and Gaussian mixture models. This is a useful gain in coverage, but the real advantage of Matlab over the packages is twofold. On the one hand, Matlab makes it possible for users to contribute application-specific libraries to the collection of language-native ones. Several such contributed libraries exist for cluster analysis, and these substantially expand the range of available methods. Some examples are:
- D. Corney: Clustering with Matlab
http://www.dcorney.com/ClusteringMatlab.html
- J. Abonyi: Clustering and Data Analysis Toolbox
http://www.mathworks.co.uk/matlabcentral/fileexchange/