Multiple factor analysis of mixed tables: a proposal for analysing problematic metric variables
Elena Abascal Fernández(1), Maria Isabel Landaluce Calvo(2) & Ignacio Garcia Lautre(1)
(1)Universidad Pública de Navarra & (2)Universidad de Burgos, Spain
It is commonly accepted that principal component analysis (PCA) is a suitable method for the graphical analysis of a rectangular table composed of n individuals and p metric variables. The objective is to study the association between variables and the similarities between individuals; PCA allows us to reduce the dimension of the table and to project variables and individuals onto the factorial axes. However, some problems emerge depending on the type of data. Two main points are considered in this work:
· Presence of variables with irregular and/or asymmetric distribution or with a large amount of zero values.
· Non-linear relations between variables.
The linear correlation coefficient is not an adequate measure of the relationship between variables in these two situations, so PCA does not seem a suitable method in these cases. If the analyst has not detected these problems, the results will probably be interpreted incorrectly.
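The second problem can be illustrated with a minimal sketch. The data below are hypothetical (a deterministic quadratic relation), and the recoding into terciles and the use of Cramér's V are assumptions chosen for illustration; the abstract does not specify how the problematic variables are converted to qualitative ones.

```python
import numpy as np

# Hypothetical data: a perfectly deterministic but non-linear relation.
x = np.linspace(-1.0, 1.0, 201)
y = x ** 2

# Pearson's r is numerically zero because the relation is symmetric in x.
pearson_r = np.corrcoef(x, y)[0, 1]

def to_terciles(values):
    """Recode a metric variable into three equal-frequency classes (0, 1, 2)."""
    cuts = np.quantile(values, [1 / 3, 2 / 3])
    return np.digitize(values, cuts)

def cramers_v(a, b, n_classes=3):
    """Cramér's V from the cross-tabulation of two coded variables."""
    table = np.zeros((n_classes, n_classes))
    np.add.at(table, (a, b), 1)  # count co-occurrences of the class codes
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    chi2 = ((table - expected) ** 2 / expected).sum()
    return np.sqrt(chi2 / (table.sum() * (n_classes - 1)))

v = cramers_v(to_terciles(x), to_terciles(y))
print(f"Pearson r = {pearson_r:.3f}, Cramér's V = {v:.3f}")
```

In this sketch the linear correlation misses the association entirely, while the recoded variables show a strong one, which is the motivation for treating such variables as qualitative.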
The objective of this work is twofold. First, we propose a new way to analyse this type of table, converting the problematic variables to qualitative variables (Escofier & Pagès, 1986). After doing this, we apply multiple factor analysis (MFA - see Escofier & Pagès, 1986, 1990, 1994) to the resulting mixed table and compare the results with others suggested in the literature. The second objective is to develop explicitly the formulas that allow us to interpret an MFA of this type of mixed table. We explain how to interpret the MFA planes that maintain, at the same time, the characteristics of PCA planes for quantitative variables and the characteristics of MCA (multiple correspondence analysis) planes for qualitative variables (see, for example, Lebart et al., 1995).
Escofier, B. & Pagès, J. (1990). Analyses Factorielles Simples et Multiples. Paris: Dunod.
Escofier, B. & Pagès, J. (1994). Multiple factor analysis (AFMULT package). Computational Statistics & Data Analysis, 18, 121-140.
Lebart, L., Morineau, A. & Piron, M. (1995). Statistique Exploratoire Multidimensionnelle. Paris: Dunod.
Measuring the degree of congruence between groups of related respondents: an application of correspondence analysis
Anadolu University, Turkey
A point of interest in scientific fields such as psychology, market research and education is the degree to which related groups agree on a particular research topic. For instance, it could be important to examine the degree to which the points of view of pupils, teachers and school directors differ.
One way to measure those differences or similarities for categorical data is to analyze the congruency of survey responses between the surveyed groups. In the first step it is possible to show the responses of the different groups on tables which are similar to contingency tables. Based on these tables it is then possible to test the agreement between the researched groups statistically.
General methods for measuring the strength of agreement for categorical data are Cronbach’s alpha, the Kendall rank correlation coefficient and the Kappa index (Light, 1971; Carmines & Zeller, 1979). These measures show the degree of agreement between the groups of respondents but not what this agreement looks like. For example, if a mother and a father have the same opinion on when their child should go to bed, Kappa will be high, but it gives no information about the specific hour they have agreed upon. Correspondence analysis, which is used for the analysis of contingency tables in which two or more categorical variables are cross-classified, can also be used to measure the degree of agreement between related groups; in addition, it visualizes where this agreement is located among the different response categories.
This presentation will demonstrate the possibilities of correspondence analysis for measuring the degree of agreement of related groups, using mothers, fathers and children as an example. The similarities of the groups will be illustrated by opinions on how much time children should spend with their computers. To address this research question we first used the Kappa index and afterwards applied correspondence analysis.
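As a minimal sketch of the first step, Cohen's Kappa for two related respondents can be computed from a square agreement table. The table below is hypothetical (mothers in rows, fathers in columns, three assumed response categories for daily computer time); it is not data from the study.

```python
import numpy as np

def cohen_kappa(table):
    """Cohen's Kappa for a square table cross-classifying two respondents."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    observed = np.trace(table) / n  # proportion of identical answers
    # Agreement expected by chance, from the two sets of marginal totals.
    expected = (table.sum(axis=1) @ table.sum(axis=0)) / n ** 2
    return (observed - expected) / (1.0 - expected)

# Hypothetical counts; categories assumed to be "<1h", "1-2h", ">2h".
mother_by_father = [[12, 3, 0],
                    [4, 10, 2],
                    [1, 2, 6]]

kappa = cohen_kappa(mother_by_father)
print(f"Kappa = {kappa:.3f}")
```

The single coefficient summarizes how often the two parents agree, but the diagonal cells it is built from (12, 10 and 6 here) are collapsed into one number, which is the information a correspondence analysis map of the same table would keep visible.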
Carmines, E. G. & Zeller, R. A. (1979). Reliability and Validity Assessment. London: Sage Publications.
Light, R. J. (1971). Measures of response agreement for qualitative data: generalizations and alternatives. Psychological Bulletin, 76, 365-377.
Correspondence analysis of ordinal cross-classifications
Eric J. Beh & Pam J. Davy
University of Western Sydney & University of Wollongong, Australia
Multiple correspondence analysis (MCA) is a popular method for graphically identifying the association between more than two variables of a contingency table. A common way to carry out the procedure is to apply singular value decomposition (SVD) to the indicator matrix or the Burt matrix generated from the data. More recently, extensions of SVD, such as the Tucker and CANDECOMP/PARAFAC decompositions, have also been used.
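The SVD route can be sketched as follows. This is a generic illustration of MCA on the indicator matrix, not the ordinal method proposed in this abstract; the coded data and the function names are hypothetical.

```python
import numpy as np

def indicator_matrix(codes):
    """Stack one block of 0/1 dummy columns per categorical variable."""
    codes = np.asarray(codes)
    blocks = [np.eye(col.max() + 1)[col] for col in codes.T]
    return np.hstack(blocks)

def ca_inertias(table):
    """Principal inertias: squared singular values of standardized residuals."""
    P = table / table.sum()
    r = P.sum(axis=1)  # row masses
    c = P.sum(axis=0)  # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    return np.linalg.svd(S, compute_uv=False) ** 2

# Hypothetical data: 8 individuals, 3 variables with categories coded 0, 1, 2.
codes = np.array([[0, 0, 1], [0, 1, 0], [1, 1, 2], [1, 2, 2],
                  [2, 2, 0], [2, 0, 1], [0, 0, 0], [1, 1, 1]])

Z = indicator_matrix(codes)
inertias = ca_inertias(Z)
print(inertias.round(4))
```

A useful check on such a sketch is that the total inertia of an indicator matrix equals J/Q - 1, where J is the total number of categories and Q the number of variables (here 9/3 - 1 = 2).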
For the analysis of ordinal variables, the existing methods do not take into consideration their ordinal structure, and so neglect the information they may provide. One approach is to modify the method of decomposition so that the co-ordinates reflect the ordinality. For the analysis of bivariate ordinal cross-classifications this has been studied, for example, by Parsa & Smith (1993), Ritov & Gilula (1993) and Schriever (1983). Alternatively, the correspondence analysis approach of Beh (1997) is applicable to ordinal two-way categorical data and uses the bivariate moment decomposition (BMD) to identify linear (location), quadratic (dispersion) and higher order moments for each of the ordinal variables. While most of the techniques designed to analyse ordinal cross-classifications focus solely on the linear-by-linear association, the advantage of using the BMD is that non-linear measures of association, called generalised correlations, can easily be found. The linear correlations reflect the Pearson product moment correlation and Spearman’s rank correlation, and higher order correlations reflect generalised, non-linear versions of these.
This paper describes the method of simple correspondence analysis for ordinal cross-classifications, and shows how it may be generalised to perform ordinal MCA. The development of the technique will focus on the three-way contingency table with three ordered variables. We will demonstrate that the correspondence plots generated from the analysis have the same mathematical properties as the Tucker and PARAFAC/CANDECOMP approaches to MCA, yet offer a far more intuitive interpretation of the association of the variables. Of particular interest, for a completely ordered three-way table, the total inertia can be decomposed into three bivariate chi-squared terms and a three-way term (Beh & Davy, 1998). The procedure does not rely on maximum likelihood estimation, but instead on orthogonal polynomials generated from a simple recurrence relation described in Beh (1997), and so is a simple and easily computable tool for the analysis of ordinal cross-classified data.
Beh, E. J. (1997). Simple correspondence analysis of ordinal cross-classifications using orthogonal polynomials. Biometrical Journal, 39, 589-613.
Beh, E. J. & Davy, P. J. (1998). Partitioning Pearson's chi-squared statistic for a completely ordered three-way contingency table. The Australian and New Zealand Journal of Statistics, 40, 465-477.
Parsa, A. R. & Smith, W. B. (1993). Scoring under ordered constraints in contingency tables. Communications in Statistics (Theory and Methods), 22, 3537-3551.
Ritov, Y. & Gilula, Z. (1993). Analysis of contingency tables by correspondence models subject to ordered constraints. Journal of the American Statistical Association, 88, 1380-1387.
Schriever, B. F. (1983). Scaling of order dependent categorical variables with correspondence analysis. International Statistical Review, 51, 225-237.
Statistical aspects of pottery quantification for dating some archaeological contexts in the city of Tours
Lise Bellanger, Philippe Husi & Richard Tomassone
Université de Nantes, France, Université François Rabelais de Tours, France & Institut National Agronomique, Paris, France
This paper describes some statistical analyses of a particular archaeological material (pottery) coming from some sites in the city of Tours. The large number of excavations carried out, with the same system of data recording, during the last thirty-five years (1968-2002) explains the interest in Tours. We list 16 excavations leading to stratigraphic abundance data in the historic centre of the town. As pottery is a very good chronological indicator, its quantitative study is crucial to comparing different archaeological contexts (or sets). Each context retained in the study is represented by its pottery assemblage. The corpus of data comprises a two-way table with rows representing different fabrics and columns specifying archaeological contexts. Columns are separated into two groups. The first one, the active group, includes archaeological contexts for which dates are attested by coins; the second one, the supplementary group, includes contexts whose dates are poorly defined or unknown.
Our statistical approach corresponds to different archaeological needs:
i) A comparison of the most used measures of pottery quantification to assess their performance using the multidimensional scaling of categorical data.
ii) A spatial (inter-assemblage) and chronological approach to estimating the dates of contexts, in which the primary source of variation is thought to be the different proportions of fabrics (pies of pottery), representing either geographical or temporal variation (or both).
The statistical procedure is tackled in several steps:
1) Investigation of the relationship between contexts and fabrics using correspondence analysis.
2) Use of the secure representation of the contexts obtained previously to estimate their dates with a regression model.
3) Model checking as an essential component of this fitting process, including resampling methods (jackknife and bootstrap).
4) Estimating the dates for the supplementary group, using the regression model.
5) Scrutiny of estimated dates and possible insertion into the active group of some new contexts, belonging to the previously defined supplementary group.
6) Repetition of this process to obtain a new basic active group consisting of well-dated contexts.
This method provides an effective complementary tool for dating archaeological contexts. It seems to be a good example of integration between formal statistical theories and their practical application in a scientific discipline.
On the connection between the distribution of eigenvalues in multiple correspondence analysis and log-linear models
Saloua Ben Ammou & Gilbert Saporta
CEDRIC, Paris, France
Multiple correspondence analysis (MCA) and log-linear modelling are two techniques for analysing multi-way contingency tables, with different aims and fields of application. Log-linear models are profitable when applied to a small number of variables (Bishop et al., 1975). Multiple correspondence analysis is useful for large tables (Lebart et al., 2000). This efficiency is balanced by the fact that MCA cannot make explicit the relations between more than two variables, as can be done by log-linear modelling (Andersen, 1991). The two approaches are complementary.
In this presentation we shall demonstrate that in MCA, under the independence hypothesis, each observed eigenvalue is asymptotically normally distributed. These distributions have the same mean but different variances (Ben Ammou, 1996; Ben Ammou & Saporta, 1998).
Under some modelling hypotheses, the diagram of MCA eigenvalues takes particular shapes, especially under the mutual independence model: theoretically there is only one non-trivial, multiple eigenvalue, equal to 1/p, where p is the number of variables. In practice, the observed eigenvalues µi differ from one another but remain close to 1/p: µi = 1/p ± ε. The shape of the observed eigenvalue diagram is therefore very distinctive. This shape changes if there are one or more interactions between variables. In some particular cases we can recognize the model fitted by the data, especially when the number of interactions is not very large, i.e. when we can easily identify the observed eigenvalues that are equal (or very close) to 1/p. When the number of interactions increases, we can no longer distinguish between eigenvalues theoretically equal to 1/p and those different from 1/p.
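The behaviour under mutual independence can be checked with a small simulation. This is a sketch under stated assumptions: the sample size, the number of variables (p = 4), the three equiprobable categories per variable and the random seed are all arbitrary choices, and the eigenvalues are those of MCA on the indicator matrix.

```python
import numpy as np

def mca_eigenvalues(codes):
    """Non-trivial MCA eigenvalues from the SVD of the indicator matrix."""
    blocks = [np.eye(col.max() + 1)[col] for col in codes.T]
    Z = np.hstack(blocks)
    P = Z / Z.sum()
    r = P.sum(axis=1)
    c = P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    eig = np.linalg.svd(S, compute_uv=False) ** 2
    n_nontrivial = Z.shape[1] - codes.shape[1]  # J - p
    return eig[:n_nontrivial], eig[n_nontrivial:]

rng = np.random.default_rng(0)   # arbitrary seed
n, p, n_categories = 5000, 4, 3  # assumed simulation settings
codes = rng.integers(0, n_categories, size=(n, p))  # independent variables

mu, remainder = mca_eigenvalues(codes)
print("observed eigenvalues:", mu.round(3))
print("theoretical value 1/p:", 1 / p)
```

With independent variables the J - p = 8 observed eigenvalues scatter narrowly around 1/p = 0.25; introducing an interaction between two of the variables would be expected to pull some of them away from that value.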
Based on these results we propose a simple procedure that fits log-linear models progressively, where the goodness of fit is assessed from the MCA eigenvalue diagram: the model is inferred through successive applications of MCA (which is not constrained by the number of variables).
The procedure is validated on several data sets from the literature corresponding to various cases: mutual independence, saturated models and graphical models with two-way interactions.
Andersen, E. B. (1991). The Statistical Analysis of Categorical Data (second edition). New York: Springer.
Ben Ammou, S. (1996). Comportement des Valeurs propres en Analyse des Correspondances Multiples sous certaines Hypothèses de Modèles. Doctoral Thesis, University Paris IX Dauphine.
Ben Ammou, S. & Saporta, G. (1998). Sur la normalité asymptotique des valeurs propres en ACM sous l’hypothèse d’indépendance des variables. Revue de Statistique Appliquée, 46, 21-35.
Bishop, Y. M. M., Fienberg, S. E. & Holland, P. W. (1975). Discrete Multivariate Analysis: Theory and Practice. Cambridge, MA: MIT Press.
Lebart, L., Morineau, A. & Piron, M. (2000). Statistique Exploratoire Multidimensionnelle, 3ème édition, Paris: Dunod.
Statistical method in image retrieval
Mónica Benito & Daniel Peña
University Carlos III of Madrid, Spain
Exploratory image studies generally aim at data inspection and dimensionality reduction. Any particular image is represented by a matrix X of dimension IxJ, i.e., with I rows and J columns. Principal component analysis (PCA) has been used in the past to reduce dimensionality and derive useful compact representations for image data. Low-dimensional representations are also important when one considers the intrinsic computational aspect. This work is concerned in particular with dimension reduction from large image databases with applications to image reconstruction. PCA was first applied to reconstruct human faces by Kirby and Sirovich (1990), considering the images as vectors in a high dimensional space. Turk and Pentland (1991) further developed a well-known face recognition method, known as eigenfaces, where the eigenfaces correspond to the eigenvectors associated with the dominant eigenvalues of the face covariance matrix. The eigenfaces define a feature space, or ‘face space’, which drastically reduces the dimensionality of the original space, and face reconstruction and identification are carried out in the reduced space. An important property of PCA is its optimal signal reconstruction in the sense of minimum mean square error (MSE) when only a subset of principal components is used to represent the original signal.
The new method proposed is based on the projection of the images as matrices and it is shown to lead to a better reconstruction for the data analysed. Instead of considering the images as vectors, as in the PCA approach, the idea is to maintain the matrix structure of the images, and project each matrix onto a vector. We measure the discriminatory power of the projection vector by the scatter of the projected samples. The optimal set of projection axes consists of the eigenvectors corresponding to the highest eigenvalues of the image total covariance matrix. This covariance matrix has dimension IxI (assuming I<J). The set of projected feature vectors associated with any image X of the sample can be used to form a feature matrix and estimate a multivariate linear model using this known design matrix. The design matrix is unique for each observation, and it is obtained using the information of the whole training sample. The unifying theme of the new schemes is that of lowering the space dimension (data compression) subject to increased fitness for the reconstruction task.
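The matrix-projection idea can be sketched as follows. This is a generic illustration in the spirit of the image projection technique of Yang & Yang (2002), not the authors' full method (the multivariate linear model step is omitted); the synthetic random "images", the function names and the reconstruction formula are assumptions.

```python
import numpy as np

def image_covariance_axes(images, d):
    """Top-d eigenvectors of the I x I image total covariance matrix."""
    mean_image = images.mean(axis=0)
    centered = images - mean_image
    # Average of (X - mean)(X - mean)^T over the sample: an I x I matrix.
    G = np.einsum('kij,klj->il', centered, centered) / len(images)
    eigvals, eigvecs = np.linalg.eigh(G)
    order = np.argsort(eigvals)[::-1][:d]  # largest eigenvalues first
    return mean_image, eigvecs[:, order]

def reconstruct(image, mean_image, axes):
    """Project the centered image matrix onto the axes and map it back."""
    features = axes.T @ (image - mean_image)  # d x J feature matrix
    return mean_image + axes @ features

# Synthetic stand-in for a face database: 20 images of size I=6 by J=10.
rng = np.random.default_rng(1)
images = rng.normal(size=(20, 6, 10))

errors = []
for d in (2, 4, 6):
    mean_image, axes = image_covariance_axes(images, d)
    approx = reconstruct(images[0], mean_image, axes)
    errors.append(np.linalg.norm(images[0] - approx))
print([round(e, 4) for e in errors])
```

The reconstruction error shrinks as more projection axes are kept and vanishes when all I axes are used, while the matrix to be diagonalized stays of size IxI rather than (IJ)x(IJ) as in the vectorized PCA approach.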
The method is illustrated using a set of full-face pictures of males and females, extracted from digitized images in a gray-scale.
Christensen, R. (1991). Linear Models for Multivariate, Time Series and Spatial Data. Springer-Verlag.
Kirby, M. & Sirovich, L. (1990). Application of the Karhunen-Loeve procedure for the characterization of human faces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12, 103-108.
Swets, D. & Weng, J. (1996). Using discriminant eigenfeatures for image retrieval. Technical Report.
Turk, M. & Pentland, A. (1991). Face recognition using eigenfaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 586-591.
Yang, J. & Yang, J. (2002). From image vectors to matrices: a straightforward image projection technique. Pattern Recognition, 35, 1997-1999.
Types and anti-types as test points in correspondence analysis
Jörg Betzin & Erwin Lautsch
Technical University of Berlin & University of Kassel, Germany
The concept of types and anti-types was mainly developed by psychologists in order to analyze the relationship between categorical variables more closely than under the more or less unspecific general independence hypothesis.
In the framework of configuration frequency analysis (CFA), first introduced in the late sixties by Lienert (see e.g. Krauth/Lienert, 1973, “Die Konfigurationsfrequenzanalyse”), the single cells of a configuration frequency table are investigated. We denote by “type” a cell in which the observed frequency is significantly higher than the frequency expected under the hypothesis of general independence of the table. The term “anti-type” is defined analogously, with an observed frequency that is significantly lower. The significance of the difference between observed and expected frequencies is measured by the respective χ² component of the χ² test statistic for the table. The χ² components are treated as χ² tests with one degree of freedom, using adjusted significance levels. We connect this approach with correspondence analysis (CA), where a χ² distance is used to measure relationships between different categories.
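The cell-wise test can be sketched for a two-way table. The counts are hypothetical, and the Bonferroni correction is an assumption: the abstract says only that adjusted significance levels are used, without naming the adjustment.

```python
from statistics import NormalDist

import numpy as np

def cfa_types(table, alpha=0.05):
    """Label each cell as a type, an anti-type or neither ('-')."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    components = (table - expected) ** 2 / expected  # chi-squared components
    # Assumed Bonferroni adjustment: one test per cell.
    alpha_adj = alpha / table.size
    # Critical value of chi-squared with 1 df = squared normal quantile.
    critical = NormalDist().inv_cdf(1 - alpha_adj / 2) ** 2
    significant = components > critical
    labels = np.where(significant & (table > expected), "type",
                      np.where(significant & (table < expected),
                               "anti-type", "-"))
    return components, labels

# Hypothetical 2 x 2 configuration frequency table.
components, labels = cfa_types([[40, 10], [10, 40]])
print(components)
print(labels)
```

Here every expected frequency is 25, so each component equals 9 and exceeds the adjusted critical value; the diagonal cells are flagged as types and the off-diagonal cells as anti-types.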
One of the difficulties in CA is the interpretation and, particularly, the detection and visualization of meaningful multivariate categories points. By a multivariate categories point we understand a combination of category scores of different variables lying close together in the “graphical description” of CA. By the term “graphical description” we do not refer to the real graphical mapping of the CA solution, but rather the possible spatial description in more than two or three dimensions. Comparing types/anti-types from CFA with graphical representations from CA may be helpful to detect relevant points and to interpret relations from the graphic.
In this presentation we will describe the technique of CFA and the use of types and anti-types as test points in CA. For multivariate categories points we show the connection between being a type or an anti-type in CFA and the point's spatial location in the CA graphic. In particular, we discuss the importance, for this connection, of the distance from the origin, the distance to the dimensional axes and the volume of the convex envelope of a multivariate categories point in the graphical description of CA. Moreover, we show the usefulness of a so-called determination coefficient from the CFA for interpreting practical significance in the CA context.
The usefulness of the envisaged concept is demonstrated by results from real data from the “Shell youth study 2001 (Germany)”. We drew two different data sets from this survey. The first one describes the confidence in social and political institutions of German adolescents. This data set has a very clear and stable CA solution, so the types/anti-types concept is not very helpful here. The second data set contains variables from the sociodemographic environment of the survey participants. Here the CA solution is very complex. We point out that introducing types and anti-types is helpful in the interpretation of the CA solution.
A three-step approach to assessing the behaviour of survey items in cross-national research using biplots
Jörg Blasius & Victor Thiessen
University of Bonn, Germany & Dalhousie University, Halifax, Canada
firstname.lastname@example.org & Victor.Thiessen@Dal.Ca
Making meaningful international comparisons using survey data presupposes a common understanding of the questionnaire items and an acceptable level of quality of the data. To the extent that these conditions are not met, cross-national findings are not comparable. This paper employs a three-step approach to assessing the comparability of survey items in cross-national research. In the first step, classical principal component analysis (PCA) is used, which makes rather stringent assumptions about the distributions and the level of measurement of the survey items. In the second step, the results of PCA are compared with those obtained using nonlinear PCA. Divergences in the results of these two types of analyses indicate the existence of measurement problems. These are then explored more systematically using the biplot methodology in the third step. This methodology helps to locate both differences in the underlying structure of the survey items and violations of metric properties in the individual items.
We exemplify our approach by focusing on a set of five-point Likert-type items used in the 1994 International Social Survey Program (ISSP), which focuses on family and gender roles. Information is available for 24 countries on opinions in the areas of women and work, marriage, and children. Our results show that several subsets of countries can be meaningfully compared within but not across the subsets. The reason that they cannot be compared across subsets is that the underlying structures in the different subsets of countries are not equivalent. In addition, for several countries the behaviour of the survey items was such that we conclude that they cannot fruitfully be used for substantive comparisons.
The implication of social classification for analyses of the field of higher education – the case of Sweden
Mikael Börjesson, Donald Broady & Mikael Palme
Department of Teacher Education, Uppsala University, Sweden
This paper falls into two parts. Firstly, the classification of social origin found in official statistics in Sweden is discussed, as well as the possibility of an alternative classification giving a less uni-dimensional representation of social space. In this context, the key issue of what constitutes a “household” is highlighted. Secondly, an alternative, multi-dimensional classification system is employed for analysing the recruitment to higher education in Sweden in 1998.
Until recently, Statistics Sweden has used two different types of social classification systems, the Nordic Occupational Classification (NYK) and the Socio-Economic Index (SEI). Being based on professions, the NYK allows the separation of over 3,000 different occupations, which can be aggregated according to branches with different levels of aggregation. The SEI, comprising some 20 categories, is a hierarchical classification, using several different criteria for distinguishing between the categories. In order to obtain a classification system that accounts both for hierarchical differences between social groups and for the specific nature of the assets that these groups possess, a social classification that combines NYK and SEI is described, distinguishing between 32 social groups.
Social groups separated by any classification system tend to differ also as regards properties usually not transparent in the definition of the groups as such. For a cohort of all grade 9 leavers in 1988 (approx. 110,000 individuals), the characteristics of the particular social group of origin are analysed along several dimensions (marriage patterns, income, education, immigration, number of children, etc.; all data provided from the national census in 1990). A main finding is that groups with high social positions that largely depend on educational capital (especially university teachers and physicians) differ significantly from groups pertaining to the economic fractions of the dominant class. Typically, they find spouses with an equally high social position based on cultural capital, while men belonging to the economic elite more often marry women in lower social positions. It is concluded that the statistical analysis must be specifically aware of the implications of various alternative definitions of the “household” or “family” when exploring the effects of social origin as a variable.
In the second part of the article, the social structure of the field of higher education in Sweden is analysed using simple correspondence analysis. The columns of the matrix consist of 32 social groups divided by sex, i.e. forming 64 social groups (where sons of university teachers are separated from daughters of university teachers, etc.), and the rows consist of approx. 1,400 educational programmes (distinguishing both between different types of programmes, such as civil engineering programmes in computer sciences, physics, and architecture, and between different institutions of higher education). Three important dimensions constitute the field. The first axis separates men from women, where programmes in natural sciences and technology stand against programmes in education, social services and nursing. The second axis differentiates the dominating social groups, especially those whose positions depend on educational or cultural capital, from dominated ones. The former normally attend longer, more prestigious study programmes at the traditional universities and prominent professional schools, while the latter are directed towards shorter programmes and provincial institutions. The third dimension shows an opposition between the cultural fractions and the economic fractions of the dominant class. It is concluded that an understanding of the complexity of the structure of the field of higher education requires a differentiated classification system of social origin that separates social groups with different kinds of assets.
Discriminant analysis on categorical variables
Stéphanie Bougeard(1), El Mostafa Qannari(2) & Hicham Noçairi(2)
(1)AFSSA, Ploufragan & (2) ENITIAA-INRA, Nantes, France
Fisher’s discriminant analysis and logistic regression are used in order to predict a categorical variable from a set of numeric variables. Adaptations of these methods to the case where it is desirable to predict a categorical variable from other categorical variables are discussed in the literature. After a brief review of these techniques, we investigate methods of prediction that aim at circumventing the well-known problem of multicollinearity among predictors. A first approach consists in performing multiple correspondence analysis on the predictors and, thereafter, uses a subset of the principal axes as predictors. It should be stressed, however, that the issue regarding how to choose the principal axes to be introduced in the prediction model is a tricky problem. On the one hand, with few axes there is a risk to discard useful information for the discrimination purpose and, on the other hand, one should pay attention not to introduce principal axes that may cause instability in the model.
We investigate alternative methods which make it possible to derive, step by step, principal axes that are tightly related to the categorical variable to be predicted. These methods pertain to redundancy analysis and partial least squares discriminant analysis performed on categorical variables.
The various methods of analysis are illustrated and compared on the basis of real data sets.
GINKGO, a multivariate analysis program oriented towards distance-based classifications
Miquel De Cáceres, Francesc Oliva & Xavier Font
Universitat de Barcelona, Spain
Although there are many multivariate programs already available, most of them only present the same classical methods. As a result, non-expert users are not aware of more specialised techniques, which could be more useful for their application needs. GINKGO is an application oriented towards the representation and classification of individuals in multivariate spaces. It is mainly concerned with providing multivariate methods applied to dissimilarity matrices.
Unsupervised classifications can be performed using three different clustering models. 1) Hierarchical agglomerative clusters (single, complete, UPGMA,...). 2) Crisp (K-means, MacQueen, 1967) and fuzzy (FCM, Bezdek, 1981) partitions. 3) Independent clusters (possibilistic C-means, Krishnapuram & Keller, 1993). Additionally, GINKGO allows clustering models (2) and (3) to be performed directly on symmetric dissimilarity matrices (Oliva et al., 2001), avoiding the use of Pythagorean distance or MDS. Supervised classification methods available are: linear discriminant, quadratic discriminant and distance-based discriminant (Cuadras et al., 1997) analyses.
Ordination methods implemented in the program are principal components analysis (PCA), metric scaling (MDS), non-metric multidimensional scaling (NMDS), correspondence analysis (CA), as well as related metric scaling (RMDS, Cuadras & Fortiana, 1998).
GINKGO has been developed entirely in Java and is freely distributed (http://biodiver.bio.ub.es/vegana). Software updates are performed automatically using Java Web Start technology.
Bezdek, J. C. (1981). Pattern Recognition with Fuzzy Objective Functions. New York: Plenum Press.
Cuadras, C. M., Fortiana, J. & Oliva, F. (1997). The proximity of an individual to a population with applications in discriminant analysis. Journal of Classification, 14, 117-136.
Cuadras, C. M. & Fortiana, J. (1998). Visualizing categorical data with related metric scaling. In Visualization of Categorical Data (eds. J. Blasius and M. Greenacre), 365-376. London: Academic Press.
Krishnapuram, R. & Keller, J. M. (1993). A possibilistic approach to clustering. IEEE Transactions on Fuzzy Systems, 1, 98-110.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 281-297.
Oliva, F., De Cáceres, M., Font, X. & Cuadras, C. M. (2001). Contribuciones desde una perspectiva basada en distancias al fuzzy C-means clustering. XXV Congreso Nacional de Estadística e Investigación Operativa. Úbeda 2001.
Hierarchical factor classification for contingency tables
Sergio Camiz, Jean-Jacques Denimal & Elena Rova
Università di Roma La Sapienza, Italy, Université des Sciences et Technologies de Lille, France & Università di Venezia Ca’ Foscari, Italy
Recently, Denimal (2001) introduced a hierarchical classification of continuous variables, based on a sequence of principal component analyses. Two particular features deserve mention: first, it can cluster highly correlated variables, irrespective of the direction of correlation, thus producing dipoles of variables opposed to each other; second, for each node it produces a specific factor plane, where both the clustered variables and the units, as seen only by these variables, are projected. In this way, a sequence of factor planes is produced. On these, the factors previously built in the hierarchical process can also be projected, since axes belonging to adjacent nodes, far from being orthogonal to each other, are usually the most correlated.
In this paper, an analogous procedure is proposed for the columns of a contingency table. We work within the framework of correspondence analysis (CA), that is, we deal with profiles. This implies that we cannot base the analysis on a pair of columns (which would produce only one factor), but rather on four columns. Given a contingency table crossing m rows with n columns, for any column j a new column j* is built, whose i-th element is ni· - nij, the complement of nij to the row total ni·. For every pair of columns (j1, j2), a four-column contingency table K(j1, j2, j*1, j*2) is then submitted to correspondence analysis, which gives two factors. The agglomeration criterion is to aggregate into a new node n+1 the pair of columns whose correspondence analysis has the lowest second eigenvalue. Based on the first factor coordinates, both the node column jn+1 and its complement j*n+1 are computed, and the procedure is iterated to form a complete hierarchy.
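The pair-selection step described above can be sketched as follows. This is only an illustration under stated assumptions: an ordinary simple CA (singular value decomposition of the standardized residuals) is assumed for the four-column table, the function names `ca_inertias`, `four_column_table` and `best_pair` are hypothetical, and the small table is invented for the example. The construction of the node column from the first factor is not covered.

```python
import numpy as np

def ca_inertias(table):
    """Principal inertias (eigenvalues) of a simple CA, in decreasing order."""
    P = np.asarray(table, dtype=float)
    P = P / P.sum()
    r = P.sum(axis=1)                      # row masses
    c = P.sum(axis=0)                      # column masses
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected) # standardized residuals
    return np.linalg.svd(S, compute_uv=False) ** 2

def four_column_table(N, j1, j2):
    """Columns j1, j2 and their complements to the row totals."""
    N = np.asarray(N, dtype=float)
    row_tot = N.sum(axis=1)
    return np.column_stack([N[:, j1], N[:, j2],
                            row_tot - N[:, j1], row_tot - N[:, j2]])

def best_pair(N):
    """Pair of columns whose four-column CA has the lowest second eigenvalue."""
    N = np.asarray(N, dtype=float)
    best = None
    for j1 in range(N.shape[1]):
        for j2 in range(j1 + 1, N.shape[1]):
            lam2 = ca_inertias(four_column_table(N, j1, j2))[1]
            if best is None or lam2 < best[0]:
                best = (lam2, j1, j2)
    return best

# Invented example: columns 0 and 2 are identical, so they should merge first.
N = np.array([[10, 2, 10, 5],
              [3, 8, 3, 6],
              [6, 6, 6, 1],
              [1, 9, 1, 7]])
lam2, j1, j2 = best_pair(N)
print(j1, j2, lam2)
```

Because each column and its complement sum to the row total, the four-column table has at most two non-trivial factors, which is why only the first two eigenvalues are of interest here.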
It will be shown that: i) at each step n, where columns j1 and j2 are merged, a factor plane is produced whose first factor represents what the two columns have in common and the second what distinguishes them; the factors' variances correspond to the two eigenvalues respectively; ii) at each step, all merged columns and all units can be projected on the factor plane: the latter are represented as they are seen only by these columns; iii) since the resemblance between two nodes is evaluated irrespective of the covariance sign, the groups of columns may assume a shape of dipoles; iv) the sequence of second eigenvalues is non-decreasing, so that they can be used as indexes of the hierarchy; v) the total inertia is decomposed as the sum of these indexes with the first eigenvalue of the last CA, corresponding to the (n–1)-th node of the hierarchy.
An application to the study of images of Mesopotamian sealings (4th millennium B.C.) will be presented: the images were coded through a formalised text, describing in detail the iconographic content of the image, and textual analysis allowed us to build a contingency table crossing the images with the lexical forms used for the description. In this way the forms indicate the nature of the represented elements, their attributes, and the relations among elements, and the classification of forms can group the elements, attributes, attitudes, and relations, in order to detect the compositional elements that most frequently occur jointly.
Denimal, J. J. (2001). Hierarchical factorial analysis, Actes du 10th International Symposium on Applied Stochastic Models and Data Analysis, Compiègne, 12-15 June 2001.
Regression biplot: linear and non-linear
P O S T E R
Olesia Cárdenas C.1, Ma Purificación Galindo V.2 & José L. Vicente-Villardón2
1Universidad Central, Venezuela & 2Universidad de Salamanca, Spain
Even where classic biplot methods are used to describe data matrices without making assumptions about the population distribution, it may be possible to interpret the biplot of a matrix as a multiplicative bilinear model (Gollob, 1968) and use it for model diagnosis (Bradu & Gabriel, 1978; Gabriel, 1998), considering it as an extension of the generalized linear model (Nelder & Wedderburn, 1972). Gower & Hand (1996) follow an approach other than the classical approach, which may be related to the classic factorial form of the French school of data analysis, as well as to the ordination methods used in the biometric school, i.e., they describe biplot geometry in terms of projections onto a subspace, as opposed to the a geometric approach followed in model diagnosis, calling these regression biplots. In an entirely different context, the classic factorial form of data analysis for variables with distributions within the exponential family may be compared to arriving at continuous latent variables in the social sciences, as is the case with item response theory (Baker, 1992), for instance.
The purpose of this paper falls within these varied lines of research, i.e., it is aimed at describing a data matrix, using general multiplicative bilinear models in approximating regression biplots, analyzing their geometry and formally proposing one alternative estimation method. One advantage of regression biplots versus principal components regression, for example, is the possibility that the distribution of the variables contained in the data matrix belongs to the exponential family, making it possible to exhibit on the plot the association between individuals and variables. Another advantage of this estimating procedure is that it may be generalized to include external information. A practical application is carried out to demonstrate the applicability of the method.
Baker, F. B. (1992). Item Response Theory. New York: Marcel Dekker.
Bradu, D. & Gabriel, K. R. (1978). The biplot as a diagnostic tool for models of two-way tables. Technometrics, 20, 47-68.
Gabriel, K. R. (1998). Generalised bilinear regression. Biometrika, 85, 689-700.
Gollob, H. (1968). A statistical model which combines features of factor analytic and analysis of variance techniques. Psychometrika, 33, 73-115.
Gower, J. C. & Hand, D. J. (1996). Biplots. London: Chapman & Hall.
Nelder, J. A. & Wedderburn, R. W. (1972). Generalized linear models. Journal of the Royal Statistical Society A, 135, 370-384.
Which structures do generalised principal component analyses display? The case of multiple correspondence analysis
Henri Caussinus & Anne Ruiz-Gazen
Université Paul Sabatier & Université des Sciences Sociales, Toulouse, France
firstname.lastname@example.org & email@example.com
Let us consider an individual × variable array. Projection pursuit aims to find low-dimensional projections displaying interesting features in the structure of the units distribution. Principal component analysis and related methods produce such graphical displays for users whose interest focuses on preserving dispersion as far as possible. However, various choices of the metric on the units space allow them to give various meanings to the word dispersion. Some metrics lead to generalised principal component analyses which are likely to display various kinds of special structures in the data, thus meeting the aims of projection pursuit techniques. For some proposals and their properties, see, for example, Caussinus & Ruiz-Gazen (1995) and Caussinus et al. (2002, 2003). In these papers, the authors investigate the properties of their methods for quantitative data. They rely on a mixture model where the “non-interesting noise” is a normal distribution, while the (non-normal) mixing distribution is the “structure” of interest. Roughly speaking, their methods look like factor discriminant analyses where the classes would not be known.
In the case of qualitative data (n units × p categorical variables) the same methods can be formally applied to indicator matrices, but their properties are far from being clear. In particular, the mixture model above no longer makes sense. It is now more sensible to replace the normal noise by the independence of the p responses inside each component of the mixture. This is exactly the latent class model. The complementary use of this model and multiple correspondence analysis has been considered by several authors (Aitkin et al., 1987; McCutcheon, 1998). In our framework, it is easy to see that both techniques are actually very strongly related. We show why and give illustrative examples.
Aitkin, M., Francis, B. & Raynal, N. (1987). Une étude comparative d’analyses des correspondances ou de classifications et des modèles de variables latentes ou de classes latentes. Revue de Statistique Appliquée, 35, 53-82.
Caussinus, H. & Ruiz-Gazen, A. (1995). Metrics for finding typical structures by means of principal component analysis. In Data Science and its Applications (eds Y. Escoufier & C. Hayashi), 177-192. Tokyo: Academic Press.
Caussinus, H., Hakam, S. & Ruiz-Gazen, A. (2002). Projections révélatrices contrôlées: recherche d’individus atypiques. Revue de Statistique Appliquée, 50, 5-37.
Caussinus, H., Hakam, S. & Ruiz-Gazen, A. (2003). Projections révélatrices contrôlées: groupements et structures diverses. Revue de Statistique Appliquée, 51, 37-58.
Caussinus, H., Fekri, M., Hakam, S. & Ruiz-Gazen, A. (2003). A monitoring display of multivariate outliers. Computational Statistics and Data Analysis (forthcoming).
McCutcheon, A. L. (1998). Correspondence analysis used complementary to latent class analysis in comparative social research. In Visualization of Categorical Data (eds J. Blasius & M. J. Greenacre), 477-488. London: Academic Press.
Advantages and limits of correspondence analysis for
comparative analysis of socio-political data
CNRS, Grenoble, France
There are many methodological problems related to the comparative analysis of socio-political data, especially data coming from survey research. Some of these problems are due to the comparative survey framework itself: the way items and questions are understood across national (not to mention sub-national) contexts can seriously call into question the “comparativeness” of the research. There are famous examples of “mistranslation”, misunderstandings and poor equivalence between measurements across countries. This is especially true in the case of cross-cultural studies (large comparisons between Western/Asian countries for instance) but can also be true in the case of cross-national comparison across culturally close countries (between EU countries for instance). Large scale comparative surveys like the Eurobarometers, the International Social Survey Program (ISSP) or the new European Social Survey (ESS) are facing such methodological challenges. A second problem comes from the analysis of data, not the collection of it: how to control for national variations in data? How to analyse data in a way that allows the discovery of common or different patterns across nations or time? Must the analysis be done country by country (to discover each national pattern of data) or for all “countries simultaneously”?
Among the statistical techniques available to study the patterns of association between categorical or ordinal level data, correspondence analysis (binary or multiple) can be a very useful way to investigate both problems. It has advantages over other techniques such as loglinear analysis of tabular data. This paper will investigate these questions by looking at some data sets such as Eurobarometers or national election studies. The use of multiple correspondence analysis or nonlinear principal component analysis will help in assessing whether the response patterns vary across countries and how to present the national variations of a “common structure”. Examples of responses to EU attitudes will be investigated. The technique of supplementary points can also help, using one nation or one group of nations as the structure onto which supplementary points are projected. Finally the paper will compare the advantages of correspondence analysis and loglinear analysis for comparative survey research.
The political space of the French electorate in 2002:
geometric data analysis applied to the French political life
Jean Chiche, Brigitte Le Roux, Pascal Perrineau, & Henry Rouanet
CNRS & Université René Descartes, Paris, France
firstname.lastname@example.org & Henry.Rouanet@math-info.univ-paris5.fr
The aim of the study is to delineate the structure of the political space of French electors and the social and ideological evolutions that entailed the presence of an extreme right wing candidate at the second round of the French presidential election that took place in April 2002 (Perrineau, 2003). As a basic statistical method we will use specific multiple correspondence analysis (see Le Roux & Chiche, 1998 and Le Roux, 1999) concentrating on the representation and interpretation of the clouds of individuals.
During the spring of 2002, French research laboratories conducted three waves of surveys involving more than 10,000 respondents. The first wave was administered during the two weeks before the first ballot of the presidential election (April 21), the second wave after the second ballot (May 5), and the third after the legislative elections in the last days of June. These surveys, known as “The French Electoral Panel 2002”, pertain to attitudes, values and stakes of French electors.
In the paper, we will first present the results of geometric data analysis on the first wave data, following an approach similar to the one of our earlier studies (Chiche et al., 2000) of a 1997 data survey (Boy & Mayer, 1997). We will characterize the political cleavages among the electors by means of the interpretation of principal axes. Then, using the method of structuring factors, as an extension of that of supplementary variables (see for example, Le Roux & Rouanet, 1998), we will project the individual positions of the main electorates – shown as concentration ellipses – in the principal geometric space that represents the French political space. We will match these results with the ternary structure of the political space found in the earlier paper, showing the major cleavages (ancient or novel) in French society. Then we will compare the structures found for the first wave with those obtained in the second (post presidential election). Does the space remain “stable”? Are the intensities of the main factors comparable?
In answering these questions, we hope to show how geometric data analysis – reintroducing the individuals at the heart of statistical analysis – can contribute to a major social debate and bring elements of an answer to the burning question: will the presence of an extreme right candidate at the second ballot of the presidential election in the spring of 2002 be remembered as a mere “accident de l’histoire”, or …?
Boy, D. & Mayer, N. (1997). L’Électeur a ses Raisons. Paris: Presses de la Fondation nationale des sciences politiques.
Chiche, J., Le Roux, B., Perrineau, P. & Rouanet, H. (2000). L’espace politique des électeurs français à la fin des années 1990. Revue Française de Sciences Politiques, 50, 463-487.
Le Roux, B. (1999). Analyse spécifique d’un nuage euclidien: application à l’étude des questionnaires. Math. Inf. Sc. Hum., 146, 65-83.
Le Roux, B. & Chiche, J. (1998). Analyse spécifique d’un questionnaire: cas particulier des non-réponses. xxx-èmes journées de Statistique de la S.F.d.S., Rennes, Mai 1998.
Le Roux, B. & Rouanet, H. (1998). Interpreting axes in MCA: method of the contributions of points and deviations. In Visualization of Categorical Data (eds Jörg Blasius & Michael Greenacre). London: Academic Press.
Perrineau, P. (2003). Le vote de tous les refus : les élections présidentielle et législatives d’avril-mai 2002. Collection Chroniques électorales, Paris: Presses de Sciences Politiques.
Correspondence analysis and two-way clustering
Antonio Ciampi & Ana González Marcos
McGill University, Montreal, Canada & University of La Rioja, Spain
email@example.com & firstname.lastname@example.org
In modern clustering problems such as micro-array analysis and text mining, the challenge is not only to discover proximity relationships among individuals and variables, but also to discover groups of variables and of individuals such that the variables are useful in describing proximities among the individuals. To this end, techniques known as two-way and crossed classifications have been developed, with the aim of producing homogeneous blocks in a rectangular data matrix.
Correspondence analysis (CA), as well as other biplot techniques (Gordon, 1999), offers the remarkable feature of jointly representing individuals and variables. As a result of such analyses, not only does one gain insight into the relationships amongst individuals and amongst variables, but one can also find an indication of which variables are important in the description of each individual. It is therefore natural to develop clustering algorithms that are based on the coordinates of a CA. Indeed this was commonly done by practitioners of “analyse des données” well before the advent of micro-array analysis and text mining (Lebart et al., 1984).
More recently, in an early attempt to develop clustering methods for micro-array data, Tibshirani et al. (1999) used the coordinates associated with the first vectors of a singular value decomposition to simultaneously rearrange the rows and the columns of a data matrix. They eventually abandoned this approach to concentrate on block clustering. In this work we explore their early idea further. Instead of using only the first axis, we select a few important axes of a CA and apply clustering algorithms to the corresponding coordinates of both rows and columns. Then, instead of ordering rows and columns by the value of the respective first coordinates, we use one of the orderings of rows and columns induced by the classification thus obtained. The result is an algorithm that combines two-way clustering, as currently applied in micro-array analysis, with the ‘dual’ perspective provided by CA.
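The reordering idea can be illustrated with a small sketch. It is not the authors' algorithm: the number of axes is simply fixed at k = 2, the clustering is a deliberately naive centroid-linkage agglomeration written for this example (`cluster_order` is a hypothetical helper), and the block-structured table is invented.

```python
import numpy as np

def ca_coordinates(table, k):
    """Principal coordinates of rows and columns on the first k CA axes."""
    P = np.asarray(table, dtype=float)
    P = P / P.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    F = (U[:, :k] * sv[:k]) / np.sqrt(r)[:, None]     # rows
    G = (Vt.T[:, :k] * sv[:k]) / np.sqrt(c)[:, None]  # columns
    return F, G

def cluster_order(X):
    """Leaf order of a naive centroid-linkage agglomerative clustering."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.linalg.norm(X[clusters[a]].mean(axis=0)
                                   - X[clusters[b]].mean(axis=0))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]   # keep merged leaves adjacent
        clusters = [cl for i, cl in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return clusters[0]

# Invented table with two blocks, then shuffled rows and columns.
blocks = np.array([[20, 18, 22, 2, 1, 3],
                   [19, 21, 20, 1, 2, 2],
                   [22, 20, 19, 3, 2, 1],
                   [2, 1, 3, 21, 19, 20],
                   [1, 3, 2, 20, 22, 18],
                   [3, 2, 1, 18, 20, 21]])
row_perm = [3, 0, 5, 1, 4, 2]
col_perm = [1, 4, 0, 5, 2, 3]
table = blocks[row_perm][:, col_perm]

F, G = ca_coordinates(table, k=2)
row_order = cluster_order(F)
col_order = cluster_order(G)
reordered = table[row_order][:, col_order]
print(reordered)
```

Any hierarchical or partitioning method could replace the naive agglomeration; the point is only that the ordering comes from a classification built on several CA axes rather than from the first coordinate alone.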
Our novel contribution consists in i) proposing a simple method for selecting the number of axes; ii) visualizing the data matrix as is done in micro-array analysis; iii) enhancing this representation by emphasizing those variables and those individuals which are ‘well represented’ in the subspace of the chosen axes. Also, we underline the utility of this approach to clustering by presenting a ‘traditional’ clustering problem: the classification of a group of psychiatric patients.
Lebart, L., Morineau, A. & Warwick, K. (1984). Multivariate Descriptive Statistical Analysis. New York: Wiley.
Tibshirani, R., Hastie, T., Eisen, M., Ross, D., Botstein, D. & Brown, P. (1999). Clustering methods for the analysis of DNA microarray data. Technical Report, Division of Biostatistics, Stanford University. http://www-stat.stanford.edu/~tibs/research.html
Comparing three methods for representing categorical data
Carles M. Cuadras & Michael Greenacre
Universitat de Barcelona & Universitat Pompeu Fabra, Barcelona, Spain
email@example.com & firstname.lastname@example.org
Correspondence analysis (CA) is a multivariate method to visualize categorical data, typically presented as a two-way contingency table. The distance used in the graphical display of the rows (and columns) of the table is the so-called chi-square distance between the profiles of rows (and columns).
In an early paper, Rao (1948) introduced the concept of canonical coordinates, also for graphical representation of multivariate data, especially quantitative multivariate data in several populations. More recently, Rao (1995) also used canonical coordinates to represent the rows of a contingency table, using the Hellinger distance (HD) between the profiles of rows.
A third alternative to represent categorical data is based on compositional data (Aitchison, 1986). Suppose that the rows of the table are vectors of positive values summing to one, for example the row profiles in CA. Let us consider the singular value decomposition of the weighted double-centering of the logarithms of this table. This log-ratio method (LR) is equivalent to considering a third distance between rows (see Aitchison & Greenacre, 2000).
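To make the three distances concrete, the sketch below computes them for two row profiles. The formulas are common textbook forms and are assumptions here, not necessarily the exact definitions used by the authors: the chi-square distance is weighted by the inverse column masses, the Hellinger distance is the Euclidean distance between square-rooted profiles, and the log-ratio distance uses centring and weighting by the column masses. The profiles and masses are invented and chosen close to independence.

```python
import math

def chi_square_distance(a, b, col_masses):
    """Chi-square distance between two row profiles."""
    return math.sqrt(sum((x - y) ** 2 / c
                         for x, y, c in zip(a, b, col_masses)))

def hellinger_distance(a, b):
    """Euclidean distance between square-rooted profiles."""
    return math.sqrt(sum((math.sqrt(x) - math.sqrt(y)) ** 2
                         for x, y in zip(a, b)))

def log_ratio_distance(a, b, col_masses):
    """Weighted log-ratio distance (assumed form: logs centred and
    weighted by the column masses)."""
    la = [math.log(x) for x in a]
    lb = [math.log(y) for y in b]
    mean_a = sum(c * v for c, v in zip(col_masses, la))
    mean_b = sum(c * v for c, v in zip(col_masses, lb))
    return math.sqrt(sum(c * ((u - mean_a) - (v - mean_b)) ** 2
                         for c, u, v in zip(col_masses, la, lb)))

# Invented profiles, both close to the average profile (near independence).
col_masses = [0.30, 0.30, 0.40]
row_a = [0.31, 0.29, 0.40]
row_b = [0.29, 0.31, 0.40]

d_chi = chi_square_distance(row_a, row_b, col_masses)
d_hel = hellinger_distance(row_a, row_b)
d_lr = log_ratio_distance(row_a, row_b, col_masses)
print(d_chi, d_hel, d_lr)
```

With these definitions, a first-order expansion around the average profile gives a Hellinger distance of about half the chi-square distance and a weighted log-ratio distance about equal to it, so the three displays differ essentially by scale when the table is close to independence.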
First, we compare CA and HD along principal dimensions. Both methods are equivalent for tables close to independence between rows and columns. A measure of agreement between the matrices used in CA and HD is defined. This measure is decomposed into components, each component being the product of the weighted means of coordinates in CA and HD, measuring the difference along a specific dimension.
Second, we jointly compare CA, HD and LR. Then CA can be compared to HD and to LR, but a formal analogy between HD and LR is not apparent. However, when rows and columns are almost independent, a simple formula shows that CA, HD and LR may provide a quite similar graphical display.
Finally, two illustrative examples are given. In the first one the results are very similar for the first dimension, but some differences are found along the second dimension. In the case of the second example there are hardly any differences along the first and second dimensions.
As a conclusion, these methods may provide similar results under some circumstances. CA is the best for several reasons (symmetric joint representation, probabilistic interpretation), but actually may have some drawbacks when the rows are multinomial populations, for which the HD approach may be preferable.
Aitchison, J. (1986). The Statistical Analysis of Compositional Data. London: Chapman and Hall.
Aitchison, J. & Greenacre, M. J. (2000). Biplots of compositional data. Applied Statistics, 51, 375-392.
Rao, C. R. (1948). The utilization of multiple measurements in problems of biological classification (with discussion), Journal of the Royal Statistical Society, Series B, 10, 159-193.
Rao, C. R. (1995). A review of canonical coordinates and an alternative to correspondence analysis using Hellinger distance. Qüestiió, 19, 23-63.
Content and functions of social sharing of emotions: an application of multiple correspondence analysis
P O S T E R
Antonietta Curci & Giannangela Mastrorilli
University of Bari, Italy
The aim of the present study is the investigation of the contents of social sharing of emotions and its intra- and interpersonal functions for individuals’ life. Small samples of people are requested to answer semi-structured interviews on emotional experiences of medium to high intensity. Participants are individuals who have experienced recent emotional experiences (e.g., undergraduate students after an important exam, users of health services who have been exposed to stressful and/or traumatic situations, etc.).
After an initial free account of the original experience, participants are requested to recall relevant episodes of social sharing of that experience with important partners, in order to investigate the contents and reasons for their social sharing. Interviews are accompanied by the usual measures of intensity and type of emotions, and scales of mental rumination and social sharing (Rimé et al., 1998). A corresponding number of people is interviewed on non-emotional experiences (e.g., a work-day, hobby, etc.). Participants are randomly assigned to one of the two emotional vs. non-emotional conditions.
Interviews are audiotaped and the texts are content-analysed according to a predefined category system. With respect to the contents of social sharing, beside the classical distinction between emotional and factual aspects (Pennebaker & Beall, 1986), some other features are expected to emerge from the analysis, that is references to evaluative processes, Self and life goals, coping strategies, belief system affected by the emotional experience. Concerning the functions of social sharing, references are expected to emerge to the functions of catharsis and insight, social support, search for meaning, subjective feeling of well-being, perceptions of self-efficacy and continuity, social influence and cultural references, construction of new life goals and/or consolidation of already adopted goals.
Category frequencies are entered in a multiple correspondence analysis model in order to provide a visual display of the main contents and functions of social sharing of emotions with respect to the type and intensity of the emotional experiences.
Pennebaker, J. W. & Beall, S. K. (1986). Confronting a traumatic event: toward an understanding of inhibition and disease. Journal of Abnormal Psychology, 95, 274-281.
Rimé, B., Finkenauer, C., Luminet, O., Zech, E. & Philippot, P. (1998). Social sharing of emotion: new evidence and new questions. In European Review of Social Psychology (eds W. Stroebe & M. Hewstone), 8, 145-189, Chichester: Wiley.
Using correspondence analysis to explore the position
of major organisational constructs in a comprehensive model: organisational climate, trust and mental health.
Alessia D'Amato, Alexandra Lopes & Antonietta Curci
University of Surrey, UK, London School of Economics, UK & University of Bari, Italy
email@example.com, firstname.lastname@example.org & email@example.com
Organisational climate is widely recognised as an organisational framework to understand employees’ perceptions and behaviour (Forehand & von Haller, 1964; Burke et al., 2002). While organisational climate is considered to be an organisational-level variable, there are other organisation-level and individual-level constructs that are affected by it (Schneider et al., 1998; Ashkanasy et al., 2000), and recent studies have empirically demonstrated that strategically focused climate measures produce strong relationships with specific organizational outcomes.
The paper will be presenting the results of the use of correspondence analysis to analyse the structure of ten core first-order factors included in the general organisational climate, the resulting structure with regard to the socio-demographic variables wards and function and the correlation with other major organizational constructs: trust, stress, burnout and climate for service.
The ten core first-order factors included in the general organisational climate were: communication, leadership, job involvement, job description, team, reward, innovativeness, development, autonomy and consistency. They were obtained from a social constructionist perspective. In this model the variables are not context-specific but applied to different organizations: organizational climate is therefore considered as a generalizable set of factors (Majer & D’Amato, 2001). Socio-demographic variables have been included in the overall model to account for their impact as mediators of perceptions of organisational climate.
Data were collected using a survey of 406 employees in different functions of a major Italian hospital. Results are discussed with regard to current research literature on organizational climate and some organizational and individual outcomes and demonstrate that, although correspondence analysis can be considered a neglected method in the research and literature on organizational behaviour, its contribution can be substantive for the understanding of the organizational processes and their relationships.
Ashkanasy, N. M., Wilderom, C. & Peterson, M. F. (eds.) (2000). Handbook of Organizational Culture & Climate. Thousand Oaks: Sage Publications.
Burke, M. J., Borucki, C. C. & Kaufman, J. D. (2002). Contemporary perspectives on the study of psychological climate: A commentary. European Journal of Work and Organizational Psychology, 11, 325-340.
Forehand, G. A. & von Haller, G. (1964). Environmental variation in studies of organizational behavior. Psychological Bulletin, 62, 361-382.
Majer, V. & D’Amato, A. (2001). L’MDOQ, il questionario multidimensionale per la diagnosi del Clima Organizzativo. Padova: Unipress.
Schneider, B., White, S. S. & Paul, M. C. (1998). Linking Service Climate and Customer Perceptions of Service Quality: Test of a Causal model. Journal of Applied Psychology, 83, 150-163.
Finding significant partitions in multiple correspondence analysis
Josep Daunis-i-Estadella1, Tomàs Aluja-Banet2 & Santiago Thió-Henestrosa1
1Universitat de Girona & 2Universitat Politècnica de Catalunya, Barcelona, Spain
firstname.lastname@example.org, email@example.com & firstname.lastname@example.org
Displaying the existing relationships between variables on a global map is one of the most appealing tools in multivariate analysis. It is intended for discovering hidden patterns and revealing meaningful information. Multiple correspondence analysis (MCA) is a common default analysis for the case of categorical data.
Categorical data can be represented by means of a hypercube. MCA then provides a global map of the associations among variables existing in the faces of the hypercube. It is also well known that with actual data the hypercube is sparse due to the curse of dimensionality, making it impossible to assess high-order interactions among variables. In this paper we intend to go one step further in the analysis of the association among variables within the framework of multivariate descriptive analysis, that is, using the inertia as a measure of association between variables. The approach is based on the decomposition of global inertia into between-inertia and within-inertia (see, for example, Greenacre, 1984), as in a conditional multiple correspondence analysis (Escofier, 1987). In particular, we compute the significance of "the partition induced for every variable in the remaining ones", using the asymptotic distribution of the between-inertia, based on a chi-square distribution. We can then look more deeply into the relationships existing in the hypercube.
We will present an application of this methodology to the analysis of the scores on different subjects in a first university course and we will compare the obtained results with the significance of the corresponding terms of a log-linear model, which is the classical approach for such data.
Escofier, B. (1987). Analyse des correspondances multiples conditionnelle. Technical Report, INRIA.
Greenacre, M. J. (1984). Theory and Applications of Correspondence Analysis. London: Academic Press.
Conditional bias measures of influence in correspondence analysis
Coral de la Cámara-García1, Rafael Pino-Mejías2,3, Juan Muñoz-Pichardo2
& María-Dolores Cubiles-de-la-Vega2
1Universidad de Huelva, 2Universidad de Sevilla & 3Centro Andaluz de Prospectiva, Spain
email@example.com, firstname.lastname@example.org, email@example.com & firstname.lastname@example.org
We face the problem of constructing influence diagnostics in simple correspondence analysis, considering the identification of influential rows or columns. Our paper offers an alternative approach to the measures based on influence functions, and we present a formalised study arising from the topic of conditional bias, introduced by Muñoz-Pichardo et al. (1995). Given a realisation xI of a subset XI of the sample X, the conditional bias of a statistic T is defined as S(xI;T) = E[T | XI = xI] - E[T], which takes into account the joint influence of a set of observations and accommodates the study of single observations as a particular case. In preceding work we proposed influence measures based on the estimation of the conditional bias in the general linear model, both univariate (Muñoz-Pichardo et al., 1995) and multivariate (Muñoz-Pichardo et al., 2000). In later work (Enguix-González, 2002) we considered the principal components model, developing influence measures for the eigenvalues and eigenvectors of the correlation and covariance matrices.
Our previous research led us to the problem of correspondence analysis. We exploit the principal components interpretation of correspondence analysis that emerges from the generalized singular value decomposition of the chi-square residuals for the independence test between rows and columns. From this viewpoint, we consider expansions for the eigenvalues and eigenvectors resulting from the deletion of a set of rows or columns, building approximations based on the first and second order terms, so the cumbersome task of recomputing the correspondence analysis results is avoided. These approximations are incorporated into the definition of the measures we propose to identify influential categories for the eigenvalues and the coordinates, which are the main objectives of correspondence analysis and are therefore the statistics considered when adapting the conditional bias definition.
We have implemented the proposed measures in the R system and illustrate them with several datasets. Finally, we suggest possible extensions to the multiple correspondence analysis framework.
Enguix-González, A. (2002). Influence Analysis in Principal Components. PhD thesis, Universidad de Sevilla.
Muñoz-Pichardo, J. M., Muñoz-García, J., Moreno-Rebollo, J. L. & Pino-Mejías, R. (1995). A new approach to influence analysis in linear models. Sankhya: The Indian Journal of Statistics, Series A, 57, 393-409.
Muñoz-Pichardo, J. M., Muñoz-García, J., Fernández Ponce, J. M. & Jiménez-Gamero, M. D. (2000). Influence analysis in multivariate general linear models. Communications in Statistics. Theory and Methods, 29, 529-547.
Principal curves for correcting the horseshoe effect in
Pedro Delicado & Tomàs Aluja
Universitat Politècnica de Catalunya, Barcelona, Spain
Correspondence analysis (CA) is a useful technique to analyze count data, revealing meaningful association patterns among the row and column categories of a table. However, a non-linear ordination of rows and columns often appears in CA displays. When the probability distributions of the row or column categories are unimodal, the successive factors appear to be increasing polynomials of the first one, though they are uncorrelated by construction. This is known as the horseshoe effect or Guttman effect (Benzécri, 1973; Greenacre, 1984). Although it is interesting to explore the departures of points from the curvature, the effect is problematic when we are interested in building an index from data, since a linear effect is shown as being non-linear. This is the case in ecology when sites are correlated with species or when building a social status index from census data.
The technique of principal curves (Hastie & Stuetzle, 1989) appeared as a way to generalize principal component analysis to non-linear settings. Principal curves are smooth curves that pass through the middle of a multivariate continuous data set (see also Kégl et al., 2000, and Delicado, 2001).
In this presentation we propose to extract the principal curves from the data as a way of eliminating the non-linearity. In practice, only a few (from 2 to 4) factors are used as input for the principal curves algorithm. It is possible to define new individual scores from their relative position with respect to principal curves. If only one principal curve is extracted, the new scores summarize the data better than the first correspondence analysis factor. This is because the principal curve maximizes the projected dispersion (inertia) over a class of functions in which the straight lines are included.
The method is illustrated for deriving a social status index for the geographical units that compose the city of Barcelona, using the socio-professional profile of their inhabitants.
Benzécri, J.-P. et al. (1973). L’Analyse des Données. Tome 1: La Taxinomie. Tome 2: L’Analyse des Correspondances. Paris: Dunod.
Delicado, P. (2001). Another look at principal curves and surfaces. Journal of Multivariate Analysis, 77, 84-116.
Greenacre, M. J. (1984). Theory and Applications of Correspondence Analysis. London: Academic Press.
Hastie, T. & Stuetzle, W. (1989). Principal curves. Journal of the American Statistical Association, 84, 502-516.
Kégl, B., Krzyzak, A., Linder, T. & Zeger, K. (2000). Learning and design of principal curves. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 281-297.
Analyses of matched pairs of data matrices by complex
singular value decomposition
Université de Pau et des Pays de l'Adour, Pau, France
A central concept in the analysis of square tables is that of symmetry and, consequently, that of departure from symmetry. The square table under consideration, possibly pre-processed, is split into two matrices, a symmetric part and a skew symmetric part (Greenacre, 2000). A similar decomposition can be elaborated for a set of two matched two-way tables. In the more general setting of the set of two matched pairs of matrices A and B, the concept of symmetry translates into the concept of “common” part, while departure from symmetry into departure from the common part, that is the “specific” part. Most statistical analyses addressed to square tables can then be extended and it turns out that descriptive and modelling points of view are closely intertwined.
The main purpose of this presentation is an investigation into the exploratory analysis of a set of two matched two-way tables and their biplot visualizations. It is shown how standard methods, initially derived for the analysis of square tables, extend to this more general setting. We recall that biplot visualizations for a given data matrix M are derived from reduced rank approximations obtained by (generalized) singular value decomposition, and that these approximations are least-squares optimal (Falguerolles & Greenacre, 2000). Then we review some of the singular value decompositions which can be considered for the joint analysis of tables A and B (or of the associated “common” part C and “specific” part D).
The idea of this paper is to consider the complex matrix C+iD for the joint analysis of C and D. Then the central trick in this work is the singular value decomposition of complex matrices which provides simultaneously a reduced rank approximation for both the “common” part and the “specific” part. The natural bi-dimensionality of this approach is appealing: biplots are best displayed in two-dimensions. The modelling interpretation will be emphasized and is taken up for comparing the different approaches for the analysis of such matched pair of tables. It turns out that it can also be used to fit reduced rank models (Falguerolles, 1998). The results are illustrated on data sets.
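The central step can be sketched numerically. The example below is a minimal illustration in Python with NumPy, under the assumption (not stated in the abstract) that the common and specific parts are taken as C = (A + B)/2 and D = (A - B)/2, which reduces to the symmetric and skew-symmetric parts when B is the transpose of A. The function names common_specific and complex_rank_approx are hypothetical.

```python
import numpy as np

def common_specific(A, B):
    """Split matched matrices into a common and a specific part.

    Assumed definitions: C = (A + B) / 2 and D = (A - B) / 2,
    so that A = C + D and B = C - D.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    return (A + B) / 2.0, (A - B) / 2.0

def complex_rank_approx(C, D, rank=1):
    """Reduced rank approximation of C and D from one complex SVD.

    The complex matrix Z = C + iD is decomposed once; the real and
    imaginary parts of its rank-`rank` approximation approximate
    C and D simultaneously.
    """
    Z = C + 1j * D
    U, s, Vh = np.linalg.svd(Z, full_matrices=False)
    Z_hat = (U[:, :rank] * s[:rank]) @ Vh[:rank, :]
    return Z_hat.real, Z_hat.imag
```

Because the squared modulus of a complex entry is the sum of the squares of its real and imaginary parts, the least-squares error of the complex approximation is the sum of the errors on C and on D, and it equals the sum of the discarded squared singular values.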
Falguerolles, Antoine de (1998). Log-bilinear biplots in action. In Visualisation of Categorical Data (eds J. Blasius & M. Greenacre). London: Academic Press.
Falguerolles, Antoine de, & Greenacre, Michael (2000). Statistical modelling for matched tables. In Statistical Modelling, proceedings of the 15th International Workshop on Statistical Modelling (IWSM) (eds V. Núñez-Antón & E. Ferreira), Universidad del País Vasco, 195-200.
Greenacre, Michael (2000). Correspondence analysis of square asymmetric matrices. Applied Statistics, 49, 297-310.
Introduction of correspondence analysis in multiway
methods of simultaneous ordination
Anne B. Dufour, Sandrine Pavoine & Daniel Chessel
Université Claude Bernard, Lyon, France
We propose to introduce the logic of correspondence analysis (CA), using duality diagrams, into K-table methods such as multiple factor analysis (MFA), multiple co-inertia analysis and the simultaneous analysis of tables (ACT-STATIS, Lavit, 1988; Lavit et al., 1994). The experimental situation is the analysis of a set of K contingency tables or a set of arrays with positive or null values. All the arrays are paired by rows. An example of this kind of data is the study of genetic relationships among cattle breeds with microsatellites (Moazami-Goudarzi et al., 1997).
Different preliminary methods are available for analysing these data. Each array is a contingency table which can be analysed separately, resulting in an association of K analyses for which each row has the same weight. It is also possible to introduce a coordination of each separate analysis with the intra-class analysis onto the column partition. This method is a CA when the marginal profiles are constant. Finally, this approach is generalized by the correspondence analysis of doubly partitioned arrays, the so-called internal correspondence analysis (Cazes et al., 1988).
Other methods, such as MFA (Escofier & Pagès, 1994), analyse K duality diagrams for contingency tables. MFA is an intra-block correspondence analysis introducing a simultaneous representation of the rows (cattle breeds in our example). It can be shown that MFA is also related to another method, multiple co-inertia analysis (Chessel & Hanafi, 1996), which explores the importance of each table in the graph of the synthetic variables. The typology revealed by the two previous analyses can be obtained by the compromise structure of ACT-STATIS. In conclusion, we illustrate the interest of each method for studying a data structure. All the analyses and plots are performed with the ade4 package for the R environment.
Cazes, P., Chessel, D. & Doledec, S. (1988). L'analyse des correspondances internes d'un tableau partitionné : son usage en hydrobiologie. Revue de Statistique Appliquée, 36, 39-54.
Chessel, D. & Hanafi, M. (1996). Analyses de la co-inertie de K nuages de points. Revue de Statistique Appliquée, 44, 35-60.
Escofier, B. & Pagès, J. (1994). Multiple factor analysis (AFMULT package). Computational Statistics and Data Analysis, 18, 121-140.
Lavit, C. (1988). Analyse Conjointe de Tableaux Quantitatifs. Paris: Masson.
Lavit, C., Escoufier, Y., Sabatier, R. & Traissac, P. (1994). The ACT (Statis method). Computational Statistics and Data Analysis, 18, 97-119.
Moazami-Goudarzi, K., Laloë, D., Furet, J. P. & Grosclaude, F. (1997). Analysis of genetic relationships between 10 cattle breeds with 17 microsatellites. Animal Genetics, 28, 338-345.
Application of constrained and unconstrained correspondence analysis to benthic communities of the Great Barrier Reef
Rodney Ellis¹, Roland Pitcher², Bronwyn Harch² & Kaye Basford¹
¹University of Queensland, Brisbane, ²CSIRO, Cleveland, Australia
Finding patterns in multivariate species assemblage data and additionally relating those patterns to a collected set of environmental parameters is an important and common endeavour amongst marine ecologists. These types of studies can be approached from either a constrained (direct gradient) or an unconstrained (indirect gradient) analysis using various ordination techniques. This presentation applies both approaches using forms of correspondence analysis to analyse biomass data collated for 922 epibenthic species assemblages and 19 related environmental parameters at 162 sampling stations inside the far northern section of the Great Barrier Reef. The effects of data transformation, taxonomic resolution, and numbers of species retained on the results and interpretations given by each approach were investigated. Procrustes analysis was used to aid in the comparison of the ordinations given by these different sets of analyses.
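The comparison of two ordinations by Procrustes analysis can be sketched as follows. This is a minimal illustration in Python with NumPy, assuming both configurations contain the same sampling stations in the same order and the same number of retained axes; the function name procrustes_m2 is hypothetical and the abstract does not say which software or variant the authors used.

```python
import numpy as np

def procrustes_m2(X, Y):
    """Procrustes statistic between two configurations of the same points.

    Both configurations are centred and scaled to unit sum of squares.
    After the optimal orthogonal rotation and scaling of Y onto X, the
    residual sum of squares is 1 - (sum of singular values of X'Y)^2.
    Values near 0 mean the two ordinations tell the same story.
    """
    X0 = np.asarray(X, dtype=float) - np.mean(X, axis=0)
    Y0 = np.asarray(Y, dtype=float) - np.mean(Y, axis=0)
    X0 = X0 / np.linalg.norm(X0)
    Y0 = Y0 / np.linalg.norm(Y0)
    s = np.linalg.svd(X0.T @ Y0, compute_uv=False)
    return 1.0 - s.sum() ** 2
```

A configuration that is only a rotated, rescaled and shifted copy of another gives a statistic of zero up to rounding error, so differences between ordinations due to arbitrary axis orientation are ignored.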
In the unconstrained ordinations, the first axes accounted for 3.1% to 16.2% of the total biological variation, with some ordinations showing detectable inshore and offshore trends among sampling stations. The constrained ordinations showed the same inshore and offshore trends, with the first axis explaining 17% to 51% of the total species-environment relationship. Percent mud, benthic stress, phosphate, silicate, grain size and chlorophyll-a were the contributing parameters in each of the analyses. The number of species used in the analysis greatly affected the results and interpretations given by both approaches, as did the taxonomic resolution. The difference between the results of using a log(biomass + 1) transformation and a conversion to presence/absence scores applied to assemblage data increased with decreasing taxonomic resolution. The ordination scores on the first axes from both the constrained and unconstrained analyses generally revealed the same underlying gradient, separating inshore and offshore sites.
A joint statistical analysis for a pair of tables which are not completely matched
Antoine de Falguerolles
Université Paul Sabatier, Toulouse, France
In this presentation, I will analyze a pair of historical contingency tables taken from the Mémoires pour servir à l'histoire de Languedoc by Nicolas de Lamoignon de Basville (1734). Nicolas de Lamoignon de Basville (26 April 1648 - 17 May 1724), the intendant (governor) of the Languedoc province for 23 years (1695-1718), is famous for his harsh repression of the Calvinist “phanatiques” living in the Cévennes. During his governance, he supervised a memoir for the instruction of the Duc de Bourgogne giving a thorough description of the Languedoc province. Numerous hand-written copies of the Mémoires were circulated after 1697 before its late publication in 1734 (see Moreil, 1985).
Among several tables, the Mémoires presents two so-called maps: one containing the number of ecclesiastics and the number of religious houses and monasteries, and one concerning the convents for women and the number of nuns. Interestingly, the layout of these maps is the current standard form for two-way tables: counts cross-classified by dioceses (23 in the Languedoc province) and by religious orders (25 for men, 23 for women). Clearly, the two tables have a common geographical dimension and are thus related. However, they are not completely matched, since the categories for men do not correspond to those for women (see Falguerolles & Greenacre, 2000).
I will discuss several bi-linear models (Falguerolles & Francis, 1992; Falguerolles, 2000) for the analysis of a pair of tables having in common one marginal. I will try to see if Nicolas de Lamoignon de Basville could report to the Court in Versailles a proper coverage of the Languedoc province by Roman Catholic nuns and friars with special attention to the unrest in the Cévennes.
Basville, N. Lamoignon de (1734). Mémoires pour servir à l’Histoire de Languedoc. Amsterdam: Pierre Boyer.
Falguerolles, A. de (2000). Gbms: Glms with bilinear terms. In COMPSTAT 2000, proceedings in computational statistics, (eds J. Bethlehem & P.G.M. van der Heijden), 53-64. Heidelberg: Physica-Verlag.
Falguerolles, A. de & Francis, B. (1992). Algorithmic approaches for fitting bilinear models. In COMPSTAT 92, proceedings in computational statistics (eds Y. Dodge & J. Whittaker), 1, 77-82. Heidelberg: Physica-Verlag.
Falguerolles, A. de & Greenacre, M. J. (2000). Statistical modelling for matched tables. In Statistical Modelling, Proceedings of the 15th International Workshop on Statistical Modelling (IWSM) (eds V. Núñez-Antón and E. Ferreira), 195-200. Bilbao: Universidad del País Vasco.
Moreil, F. (1985). L'Intendance de Languedoc à la Fin du XVIIème Siècle, Édition Critique du Mémoire pour l'Instruction du Duc de Bourgogne. Paris: CTHS.
Type of organizational culture at a public university
Karmele Fernández-Aguirre, Petr Mariel & Ana Martín-Arroyuelos
Universidad del País Vasco/Euskal Herriko Unibertsitatea, Bilbao, Spain
The objective of this paper is to analyze organizational aspects at the University of the Basque Country at three different levels: departments, research groups and the university as a whole, paying special attention to the organizational culture. The model we adopt in our analysis is the Model of Values in Competition (Cameron & Quinn, 1999), which is based on two bipolar dimensions. The first one opposes an internal against an external organizational orientation, and the second one opposes flexibility against control. These two axes form four quadrants in which the following organizational positions (types of culture) can be placed: clan, hierarchy, market and innovation.
The hierarchy type of culture is defined as a space of formalized and structured work, where formal rules and policies are the pillars of the organization. Here, effective leaders have to coordinate and organize properly, and the general long-run objectives are stability, the ability to foresee and efficiency. The organizational form called market is based on “management by objectives” and “cost transaction”. Such an institution is oriented towards the exterior more than towards its own internal issues, that is, towards transactions with external organizations such as suppliers, customers, trade unions, etc. Internal control is maintained through the economic mechanisms of the market and not by central decisions and rules as in the hierarchy. The third type of culture is called clan because of its similarity to a family organization. This type of organization resembles a large family more than an economic unit, and is characterized by shared values, shared objectives, cohesion, participation and a very strong feeling of “we”. The rules and procedures typical of hierarchy are replaced by employee involvement and corporate agreement. Finally, innovation means that the most important task of management is to stimulate knowledge, risk and creativity in order to be the “most recent”. This type of culture is based on groups which improve the basic procedures of an organization to achieve adaptability, flexibility and creativity.
We use a data set obtained from a survey which collects responses from 600 lecturers out of a total of 2900 who work at the University of the Basque Country. We apply cluster analysis in the space of the first factors obtained from a multiple correspondence analysis (Lebart, 1994), centering our analysis on the characteristics of the formation and dissolution of research groups, which present a quite different organizational structure from that of the university. The conclusions we obtain indicate that the flexible culture prevails over the rigid one and that the clan type of culture with a high percentage of innovation is the one most often perceived. These conclusions support the hypothesis about the opening of the rigid university structure through high-quality research groups.
Cameron, Kim S. & Quinn, Robert E. (1999). Diagnosing and Changing Organizational Culture. Reading, MA: Addison Wesley Longman.
Lebart, L. (1994). Complementary use of correspondence analysis and cluster analysis. In Correspondence Analysis in the Social Sciences (eds M. Greenacre & J. Blasius), 162-178. London: Academic Press.
Using optimal scaling to scale items for questionnaires
Giovanni Battista Flebus
Università degli Studi di Milano-Bicocca, Milano, Italy
Although the technique of optimal scaling has been known for decades (Guttman, 1950), there is hardly any example of its applications in mental test construction (Greenacre, 1984). The method enables a researcher to scale nominal answers in multiple choice tests (see, for example, Gifi, 1990), even though current examples imply the existence of one right (=1) and several wrong answers (=0). It will be shown that the technique can also be applied to "typical performance" tests, such as attitude or personality questionnaires. To illustrate this principle, two empirical examples of test construction with the optimal scaling technique are presented, where there are no right answers, and (as in the first example) where there is no a priori or ordered scoring.
Example 1: an attitude questionnaire. An eight-item attitude scale, meant to measure attitudes towards gay people, was constructed using the multiple-choice format. Each item (the stem) is to be answered by selecting one sentence out of five; two of them depict a negative attitude, two others present a positive attitude, while the fifth presents a more or less indifferent attitude. The optimal scoring technique was applied to the eight items, and the total score was compared with a Likert scale validated to measure attitudes in a more traditional way (Flebus & Montano, 2001). The sample, made up of 2323 Italian adults, gave a high reliability coefficient for the optimal score scale, and a high correlation coefficient was found with parallel, more traditional Likert scales. The Guttman effect (horseshoe effect) can be used as a diagnostic tool to ascertain that the scale is, as it should be, unidimensional.
Example 2: a multi-factor questionnaire to detect vocational indecision. A 62-item questionnaire, written to detect students' indecision, in the same format as a multiple-choice test, was scaled with the optimal scoring technique. By alternating factor analysis and optimal scoring, a multi-factor solution was found: the internal validity was ascertained with Cronbach's alpha coefficient, and concurrent validity was assessed with interviews (Flebus, 2000).
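The scaling of nominal answers without any a priori scoring can be sketched as a correspondence analysis of the indicator matrix of the answers. The example below is a minimal illustration in Python with NumPy, not the procedure actually used in the two studies; the function name optimal_scores is hypothetical, and it assumes every respondent answers every item with exactly one category.

```python
import numpy as np

def optimal_scores(answers):
    """First-axis optimal scaling of multiple-choice answers.

    answers: (n_persons, n_items) array of nominal category codes.
    Returns the person scores and the category weights on the first
    axis of the correspondence analysis of the indicator matrix.
    """
    answers = np.asarray(answers)
    n, m = answers.shape
    # Indicator (dummy) coding: one 0/1 column per observed category
    blocks = []
    for j in range(m):
        cats = np.unique(answers[:, j])
        blocks.append((answers[:, j][:, None] == cats[None, :]).astype(float))
    Z = np.hstack(blocks)
    P = Z / Z.sum()
    r = P.sum(axis=1)
    c = P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    category_weights = Vt[0] / np.sqrt(c)       # standard coordinates
    # A person's score is the mean weight of the categories chosen
    person_scores = Z @ category_weights / m
    return person_scores, category_weights
```

The sign of the axis is arbitrary, so in practice the scale is oriented afterwards, for example so that high scores correspond to the positive attitude statements.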
Flebus, G. B. (2000). Un questionario di autovalutazione della scelta scolastica. In Orientamenti per l'Orientamento (ed. S. Soresi). Firenze: Giunti.
Flebus, G. B. & Montano, A. (2001). The Italian Homophobia Scale - an internal and concurrent validity study. Presentation at 2001 ISSID Congress in Edinburgh.
Gifi, A. (1990). Nonlinear Multivariate Analysis. New York: John Wiley.
Greenacre, M. J. (1984). Theory and Applications of Correspondence Analysis. New York: Academic Press.
Guttman, L. (1950). The principal components of scale analysis. In Measurement and Prediction (ed. S. A. Stouffer). Princeton, NJ: Princeton University Press.
The Milestones Project: a case-study in the historiography of data visualization
York University, Toronto, Canada
The graphic representation of quantitative information has deep roots. These roots reach into the histories of the earliest map-making and visual depiction. Later, they extend to thematic cartography, statistics and statistical graphics, medicine, and other fields which have now come to rely upon visual representations to display, illustrate, or explain relations or phenomena more easily than with just words or tables (Friendly & Denis, 2000).
Along the way, developments in technologies (printing, reproduction, computing), in mathematical and statistical theory and practice, and in empirical observation and recording nurtured and replenished the soil. The Milestones Project (Friendly & Denis, 2001) attempts to document and illustrate these historical developments leading to modern data visualization and visual thinking. There are several goals:
· Prepare a comprehensive catalog of important milestones in all fields related to data visualization.
· Collect representative images, bibliographical citations, cross-references, web links in a single location.
· Enable researchers to search for and study themes, antecedents, influences, patterns, trends, and so forth.
In this presentation I discuss:
· An overview of the project and its current status.
· Some examples of graphical excellence from the “golden age of statistical graphics” (1860-1900), e.g. Friendly (2002).
· Questions of documenting milestone “events” for modern historiography.
· Meta-questions of representation of this history.
Friendly, M. (2002). Visions and re-visions of Charles Joseph Minard. Journal of Educational and Behavioral Statistics, 27, 31–51.
Friendly, M. & Denis, D. J. (2000). The roots and branches of statistical graphics. Journal de la Société Française de Statistique, 141, 51–60 (published in 2001).
Friendly, M. & Denis, D. J. (2001). Milestones in the history of thematic cartography, statistical graphics, and data visualization. http://www.math.yorku.ca/SCS/Gallery/milestone/1
An alternative to the nonsymmetrical correspondence analysis based on TUCKALS3 algorithm
Purificación Galindo & Sonia Salvo-Garrido
Universidad de Salamanca, Spain & Universidad de la Frontera, Chile
There are three different approaches to the study of a three-way contingency table. The first is to construct two two-way contingency tables, crossing the dependent variable with each of the explanatory variables, that is, working with the marginal distributions. This approach does not consider possible relationships between the explanatory variables. The second approach interactively codes the two explanatory variables in a new I × JK table; however, this approach does not consider the information as it is structured in the original table. The third approach corresponds to the partial nonsymmetrical correspondence analysis defined by Lauro & Balbi (1999), which consists in analyzing the relationship between the dependent variable i and the explanatory variable j at a given level of k, or conversely.
However, none of the previous approaches truly considers the three-way structure of the table, rather decomposing it into several forms of two-way tables. We need an additional approach which, considering the dependency among the independent variables, simultaneously analyzes the dependence of the response with respect to them. For this purpose, it is necessary to define a new form of representation of the three-way structure of the table in the plane.
Taking into account the nonsymmetrical correspondence analysis (NSCA) proposed by Lauro and D’Ambra (1984), we propose an alternative based on the generalized SVD, using a criterion proposed by Timmerman and Kiers (2000) and the new graphical interactive biplot representation proposed by Carlier and Kroonenberg (1996). The interpretation of the matrix of interactions between the latent dimensions of the three modes, after applying the TUCKALS3 algorithm to the residuals matrix, allows us to determine the predictive capacity of the explanatory variables. The interactive biplot representation of the residuals matrix shows the best predicted categories of the response variable, and the categories of the explanatory variables which have the greatest predictive capacity. By projecting the categories of the response variable onto the vectors defined by combinations of categories of the predictors, we obtain the graphical representation of the levels of association between the predictor and response variables.
We also present a generalization of the Gray and Williams (1975) multiple association index for the case of several explanatory variables.
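Two of the ingredients mentioned above can be sketched in a few lines: the interactive coding of a three-way table into an I × JK table, and a predictive association index of the Goodman and Kruskal tau type. The example is a minimal illustration in Python with NumPy; the function names interactive_coding and gk_tau are hypothetical. Applying tau to the interactively coded table is one way of forming an index with several predictors, and whether it coincides with the generalization the authors present is an assumption, since the abstract gives no formula.

```python
import numpy as np

def interactive_coding(T):
    """Flatten an I x J x K table into the I x JK table.

    Column j * K + k of the result holds the counts for the
    combination (j, k) of the two explanatory variables.
    """
    T = np.asarray(T)
    I, J, K = T.shape
    return T.reshape(I, J * K)

def gk_tau(N):
    """Goodman-Kruskal tau for predicting the row variable from the columns.

    tau = (sum_ij p_ij^2 / p_.j - sum_i p_i.^2) / (1 - sum_i p_i.^2)
    Empty columns are skipped to avoid dividing by zero.
    """
    N = np.asarray(N, dtype=float)
    P = N / N.sum()
    r = P.sum(axis=1)
    c = P.sum(axis=0)
    keep = c > 0
    numerator = np.sum(P[:, keep] ** 2 / c[keep]) - np.sum(r ** 2)
    return numerator / (1.0 - np.sum(r ** 2))
```

Under these assumptions, gk_tau(interactive_coding(T)) measures how well the response is predicted from the joint categories of the two explanatory variables; it is 0 when the response is independent of them and 1 when it is perfectly predicted.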
Carlier, A. & Kroonenberg, P. (1996). Decompositions and biplots in three-way correspondence analysis. Psychometrika, 61, 355-373.
Gray, I. N. & Williams, J. S. (1975). Goodman and Kruskal's tau b: multiple and partial analogs. Proceedings of Social Statistics Sections of the American Statistical Association, 444-448.
Kroonenberg, P. (1989). Singular value decompositions of interactions in three-way contingency tables. In Multiway data analysis (eds R. Coppi & S. Bolasco). Amsterdam: North-Holland.
Lauro, N. C. & D'Ambra, L. (1984). L'analyse non symétrique des correspondances. In Data Analysis and Informatics, III (eds E. Diday et al.), 433-446. Amsterdam: North Holland.
Lauro, N. C. & Balbi, S. (1999). The analysis of structured qualitative data. Applied Stochastic Models and Data Analysis, 15, 1-27.
Timmerman, M. E. & Kiers, H. A. L. (2000). Three-mode principal component analysis: choosing the numbers of components and sensitivity to local optima. British Journal of Mathematical and Statistical Psychology, 53, 1-16.
Exploring differences and overlap between Middle Stone Age artefacts using multiple correspondence analysis and biplot methodology
Sugnet Gardner & Niël J. le Roux
University of Stellenbosch, South Africa
Technology changed more subtly in the Middle Stone Age (MSA) than in today’s rapid Computer Age. Artefacts known as blades and points excavated on the south coast of Africa provide information for exploring differences between the sub-stages MSA I, MSA II Upper and MSA II Lower. Because of the slow rate of change, some overlap between the sub-stages is to be expected. Both categorical and continuous variables were measured for exploring the relationships between the sub-stages. However, since the categorical measurements are more subjective, conclusions based on the correspondence between these variables and the sub-stage classification might be questioned.
Developments in biplot methodology (Gower & Hand, 1996; Gardner, 2001) since its introduction by Gabriel (1971) have provided the infrastructure for many novel applications when dealing with such exploratory data analyses. The unified biplot methodology introduced by Gower and Hand allows for separate as well as simultaneous graphical representations of continuous and categorical variables. Multiple correspondence analysis and generalised biplot representations can easily be obtained through this unified approach by utilising different distance metrics. These graphical representations can then just as easily be linked to class separation through canonical variate analysis biplot displays.
Gardner (2001) explored different methods of describing the spread of a cloud of points, leading to the definition of an α-bag. Quantifying the separation and overlap between the artefacts of the three sub-stages is possible by superimposing these α-bags onto biplot displays.
In this presentation the continuous measurements in the data set discussed by Wurz et al. (2003) are supplemented with categorical data. Multiple correspondence analysis representations are compared to several generalised biplot displays. Canonical variate biplot displays with accompanying α-bags are utilised to quantify and describe the overlap and separation between the sub-stages. The relevance of the categorical variables in discriminating between the sub-stages is evaluated with biplot displays based on the continuous variables.
The paper provides an illustration of how the exploration of multivariate data sets consisting of both continuous and categorical data can be approached by combining multiple correspondence analysis with various biplot techniques and α-bags.
Gabriel, K. R. (1971). The biplot graphical display of matrices with application to principal component analysis. Biometrika, 58, 453 – 467.
Gardner, S. (2001). Extensions of biplot methodology to discriminant analysis with applications of non-parametric principal components. Unpublished PhD thesis. University of Stellenbosch.
Gower, J. C. & Hand, D. J. (1996). Biplots. London: Chapman & Hall.
Wurz, S., le Roux, N. J., Gardner, S. & Deacon, H. J. (2003). Discriminating between the end products of the earlier Middle Stone Age sub-stages at Klasies River using biplot methodology. Journal of Archaeological Science (in press).
Topics of interest in internet access: an application of
Beatriz Goitisolo & Amaya Zárraga
Universidad del País Vasco/Euskal Herriko Unibertsitatea, Bilbao, Spain
The aim of this work is to show the conduct of individuals in relation to new technologies and, especially, their behaviour towards the Internet. To this end we use the Survey on the Information Society (ESI) drawn up by the Basque Institute of Statistics (EUSTAT) in the fourth quarter of 2000.
The ESI obtains information on the equipment available to individuals at home, at school and at work, on administrative use of automatic teller machines (ATMs) and on contact with the different mass media and the Internet. The variables that refer to the Internet can be grouped in four blocks:
· Internet knowledge
· Use of Internet services
· Topics of interest in Internet access
The individuals surveyed are characterized using the sociodemographic variables extracted from the Survey of Population in Relation to Activity (PRA), also drawn up by EUSTAT for the same period of time.
With the information from these surveys various contingency tables are created crossing the sociodemographic variables with those relative to the Internet. The method used for the joint study of these tables is simultaneous analysis (Zárraga & Goitisolo, 2002 and 2003). This method allows the internal structure of each table to be maintained, and prevents any one of them from dominating in the overall analysis. A more in-depth study on the topic can be found in Goitisolo (2002).
EUSTAT- Instituto Vasco de Estadística (www.eustat.es).
Goitisolo, B. (2002). El Análisis Simultáneo. Propuesta y Aplicación de un Nuevo Método de Análisis Factorial de Tablas de Contingencia. Doctoral thesis, University of the Basque Country.
Zárraga, A. & Goitisolo, B. (2002). Méthode factorielle pour l’analyse simultanée de tableaux de contingence. Revue de Statistique Appliquée, 50, 47-70.
Zárraga, A. & Goitisolo, B. (2003). Étude de la structure inter-tableaux à travers l’Analyse Simultanée. Revue de Statistique Appliquée (forthcoming).
Non-orthogonality in correspondence analysis and related methods
John C. Gower
The Open University, Milton Keynes, U.K.
England and the U.S.A. have been described as two nations divided by a common language. There is ample room for confusion in trying to understand the "nations" of correspondence analysis and related methods. There, the common language is the algebraic eigenvalue problem that is an inevitable consequence of minimising quadratic forms, or ratios of quadratic forms, arising from least-squares criteria. This is certainly a unifying principle, but the apparent similarity induced on different methods tends to obfuscate fundamental differences of importance for understanding data analysis, for example the non-orthogonality mentioned in the title. I shall try to isolate what I believe to be some of the basic issues, first for approximations to two-way arrays of quantitative data and then for categorical data. I shall focus on:
(1) The basic models: rank r representations of X, of X'X, of XX', of r-dimensional distances derived from the rows and/or columns of X, or of distances derived from X'X (including or excluding the diagonal).
(2) The criteria used to fit (1): least-squares, ratios of quadratic forms, minimal L1-norm, robust methods, likelihood, …
(3) The algorithm used to fit (2): algebraic eigenvalue and SVD algorithms, alternating least-squares algorithms (ALS), majorisation, …
(4) The role of constraints: normalisations of eigenvectors, to work in deviations from the means or not, orthogonalisation in ALS algorithms (Gower, 1998).
(5) The geometric visualisation of the approximation: what are the appropriate interpretative tools: inner products, distances, centroids, scales, prediction regions?
(6) The measurement of fit and orthogonality of fitted components: some models/algorithms give non-orthogonal fitted and residual components, which complicates the interpretation of measures of goodness-of-fit (Gower and Hand, 1996). Fits obtained by optimising one criterion may be evaluated in terms of another criterion (Gabriel, 2002).
(7) The distinction between a data matrix and a two-way table: a table whose columns represent variables is statistically very different from a two-way table of counts, or of a third variable classified by two other variables.
Misunderstandings arise because (a) (1), (2), (3) and (4) are often presented in a nearly inextricable manner, (b) there is a too-uncritical readiness to carry over algebra that is valid for a data matrix to the analysis of a two-way table (and vice versa), (c) the correspondence analysis of a two-way contingency table is linked in a fairly opaque way to the multiple correspondence analysis of a data matrix with two categorical variables, (d) fit statistics may be misinterpreted, (e) by concentrating on the fitted part of a model, sight may be lost of what is happening to the residual part and (f) the performance of the primary criterion may be evaluated in terms of a secondary criterion.
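As a small illustration of points (1) and (6), the sketch below fits a rank-2 least-squares approximation to a made-up data matrix by truncated SVD and checks that, in this particular case, the fitted and residual parts are orthogonal, so the sums of squares add up. The data, the function name `rank_r_fit` and the choice of r = 2 are assumptions for illustration only; other criteria or algorithms need not share this orthogonality.

```python
import numpy as np

def rank_r_fit(X, r):
    """Least-squares rank-r approximation of X by truncated SVD."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))      # made-up data matrix (8 cases, 5 variables)
X = X - X.mean(axis=0)           # constraint: work in deviations from the means

fit = rank_r_fit(X, 2)
resid = X - fit

total_ss = np.sum(X ** 2)
fit_ss = np.sum(fit ** 2)
resid_ss = np.sum(resid ** 2)
cross = np.sum(fit * resid)      # inner product of fitted and residual parts

print(f"fit share of total sum of squares: {fit_ss / total_ss:.3f}")
print(f"fit/residual inner product: {cross:.2e}")
```

Because the cross term vanishes here, the ratio fit_ss / total_ss is an unambiguous measure of fit; when the two parts are not orthogonal, that ratio and 1 - resid_ss / total_ss no longer agree.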
Gabriel, K. R. (2002). Goodness of fit of biplots and correspondence analysis. Biometrika, 89, 423-436.
Gower, J. C. (1998). The role of constraints in determining optimal scores. Statistics in Medicine, 17, 2709-2721.
Gower, J. C. & Hand, D. J. (1996). Biplots. London: Chapman and Hall.
Correspondence analysis with quantitative supplementary variables
Universitat Politècnica de Catalunya, Barcelona, Spain
Correspondence analysis (CA) is a well-known method for making pictures (biplots) of contingency tables and tables of count data. On occasions it is of interest to display samples or cases in a biplot made by correspondence analysis that were not included in the original analysis. Such samples are known as supplementary points, and their position in a biplot is usually calculated by using the “transition formulae” or “barycentric relationships” of correspondence analysis.
On other occasions it may be of interest to represent a quantitative variable, not used in the original analysis, in a biplot obtained by correspondence analysis. The representation of such supplementary variables can greatly enhance the interpretation of the biplot. In ecological studies the procedure for representing such variables is known as “indirect gradient analysis” (ter Braak, 1987). General formulae for the calculation of coordinates of supplementary points and supplementary variables in biplots are given by Gabriel (1995). Specific results for correspondence analysis are discussed by Graffelman & Aluja-Banet (2003).
In this talk we define a specific geometrical problem, and search for an optimal direction in a CA-biplot that best represents the quantitative supplementary variable. The optimal direction can be found by solving a weighted least squares problem, and plotting the regression coefficients in the biplot. If the supplementary variable is standardized (in the weighted sense), then its coordinates in the biplot are given by the weighted correlation coefficients of the variable with the standardized biplot axes. Both row and column markers in the CA biplot are interpretable with respect to the supplementary variable vector: the projections of the standard coordinates approximate the supplementary data, and the projections of the principal coordinates approximate weighted averages with respect to the supplementary variable. Geometrical properties of the solution, goodness of fit issues and the relationship with canonical correspondence analysis (ter Braak, 1986) will be pointed out in the talk. Empirical data will be used to illustrate the results.
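The claim that a weighted-standardized supplementary variable is placed at its weighted correlations with the standardized axes can be checked numerically. The following is a minimal sketch, assuming a made-up 5 x 3 table of counts and a made-up row variable `y`; the function names are hypothetical and this is not the authors' code.

```python
import numpy as np

def ca_row_standard_coords(N, k=2):
    """Row standard coordinates of a simple CA, plus the row masses."""
    P = N / N.sum()
    r = P.sum(axis=1)                      # row masses
    c = P.sum(axis=0)                      # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    Phi = U[:, :k] / np.sqrt(r)[:, None]   # weighted mean 0, weighted variance 1
    return Phi, r

def weighted_standardize(y, w):
    """Centre and scale y using weights w that sum to one."""
    m = np.sum(w * y)
    sd = np.sqrt(np.sum(w * (y - m) ** 2))
    return (y - m) / sd

# Made-up contingency table and supplementary quantitative variable for the rows
N = np.array([[20, 5, 3], [6, 18, 7], [4, 9, 22], [10, 10, 10], [2, 14, 9]], float)
y = np.array([1.2, 3.4, 5.1, 2.9, 4.0])

Phi, r = ca_row_standard_coords(N, k=2)
z = weighted_standardize(y, r)

# Weighted least squares of z on the two axes (weights = row masses)
w_sqrt = np.sqrt(r)
b, *_ = np.linalg.lstsq(w_sqrt[:, None] * Phi, w_sqrt * z, rcond=None)

# Weighted correlations of z with each standardized axis
corr = Phi.T @ (r * z)

print("regression coefficients:", np.round(b, 4))
print("weighted correlations:  ", np.round(corr, 4))
```

The two vectors coincide because the standard coordinates are orthonormal in the metric of the row masses, so the weighted regression reduces to weighted inner products.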
Gabriel, K. R. (1995). Biplot display of multivariate categorical data, with comments on multiple correspondence analysis. In Recent Advances in Descriptive Multivariate Analysis (ed. W. J. Krzanowski).
Graffelman, J. & Aluja-Banet, T. (2003). Optimal representation of supplementary variables in biplots from principal component analysis and correspondence analysis. Biometrical Journal, 45 (in press).
ter Braak, C. J. F. (1986). Canonical correspondence analysis: a new eigenvector technique for multivariate direct gradient analysis. Ecology, 67, 1167-1179.
ter Braak, C. J. F. (1987). Ordination. In Data analysis in community and landscape ecology. (eds Jongman, R. H. G., ter Braak, C. J. F. & van Tongeren, O. F. R.), 91-173. Wageningen: Pudoc.
Biplots of compositional data using weighted logratio maps
POSTER
Michael Greenacre & John Aitchison
Universitat Pompeu Fabra, Barcelona & University of Glasgow, Scotland
firstname.lastname@example.org & John.Aitchison@btinternet.com
Compositional data are a special case of categorical data. They are vectors of data which sum up to a constant, usually proportions or percentages. Common examples are: results of elections, time budgets and gene frequencies in population genetics. These data have what is known as the "unit-sum constraint", i.e. they add up to a constant, for example in the above three examples: 100%, 24 hours, and 1 respectively (Aitchison, 1986). It seems on the surface that such data can be analyzed quite easily using conventional multivariate techniques such as principal component analysis and correspondence analysis (see, for example, Greenacre & Blasius, 1994), but it turns out that these methods do not respect the unit-sum constraint in their solutions. Also, they do not have what is called "subcompositional coherence", a property deemed essential for any methodology applied to compositional data, meaning that the analysis of a subset of the components should not give different results compared to when the subset is analyzed as part of the whole.
In this poster we describe the logratio approach to visualizing compositional data using biplots, an approach which is tailored to compositional data but which also works just as well for general tabular data on a ratio scale, for example contingency tables (Aitchison & Greenacre, 2002). This method does not, however, follow the principle of “distributional equivalence”, deemed by Benzécri (1973) to be the most important property for analysing categorical data. This principle states that merging two categories which have the same conditional distribution (or profile) should not affect the analysis in any way. By introducing a simple modification of the logratio approach which is inspired by the row and column weighting in correspondence analysis, a method can be defined which we call the "weighted logratio map". This modified logratio approach turns out to have both properties: subcompositional coherence and distributional equivalence. In this sense the method improves on the existing logratio approach, and not only forms an interesting competitor to correspondence analysis but appears to have better properties. But, like all methods involving logratios, it suffers from the inconvenience of a logarithmic transformation, which causes problems when data values are zero, as is often the case in the social and environmental sciences.
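One way to sketch such a weighted logratio analysis is to double-centre the matrix of log counts using the row and column masses as weights and then take a weighted SVD. The code below follows that recipe under stated assumptions: the function name `weighted_logratio_map`, the made-up tables and the requirement that all counts are strictly positive are ours, and the exact formulation in the poster may differ. It also checks distributional equivalence numerically by merging two proportional columns.

```python
import numpy as np

def weighted_logratio_map(N, k=2):
    """Row principal coordinates and singular values of a weighted logratio analysis.

    Assumes every entry of N is strictly positive (log of zero is undefined).
    """
    P = N / N.sum()
    r = P.sum(axis=1)                      # row masses
    c = P.sum(axis=0)                      # column masses
    L = np.log(N)
    L = L - (L @ c)[:, None]               # remove weighted row means
    L = L - (r @ L)[None, :]               # remove weighted column means
    S = np.sqrt(r)[:, None] * L * np.sqrt(c)[None, :]
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    F = (U[:, :k] * s[:k]) / np.sqrt(r)[:, None]
    return F, s

def squared_distances(F):
    diff = F[:, None, :] - F[None, :, :]
    return (diff ** 2).sum(axis=-1)

# Made-up table whose second column is exactly twice the first (same profile)
N_split = np.array([[10, 20, 5, 8],
                    [4, 8, 12, 6],
                    [7, 14, 3, 15],
                    [9, 18, 9, 2]], float)
# Merge the two proportional columns into one
N_merged = np.column_stack([N_split[:, 0] + N_split[:, 1], N_split[:, 2:]])

F_split, s_split = weighted_logratio_map(N_split)
F_merged, s_merged = weighted_logratio_map(N_merged)

print("singular values (split): ", np.round(s_split[:2], 6))
print("singular values (merged):", np.round(s_merged[:2], 6))
```

Coordinates are determined only up to reflection, so the comparison is best made on singular values and on distances between row points rather than on the coordinates themselves.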
Interestingly, this weighted logratio map is theoretically identical to what is known in a completely different context as "spectral mapping", developed in biochemical research by Lewi (1976) - see Lewi's invited paper in this conference. The method is illustrated using compositional data from an archeological study of Roman glass cups.
Aitchison, J. (1986). The Statistical Analysis of Compositional Data. London: Chapman and Hall.
Aitchison, J. & Greenacre, M. J. (2002). Biplots of compositional data. Applied Statistics, 51, 375-392.
Benzécri, J.-P. (1973). Analyse des Données. Tome II: Analyse des Correspondances. Paris: Dunod.
Greenacre, M. J. & Blasius, J. (1994). Correspondence analysis and its interpretation. In Correspondence Analysis and the Social Sciences (eds M. J. Greenacre & J. Blasius), 3-22. London: Academic Press.
Lewi, P. J.(1976). Spectral mapping, a technique for classifying biological activity profiles of chemical compounds. Arzneim. Forsch. (Drug Research), 26, 1295-1300.
Decomposing interactions by generalized bi-additive models for categorical data
Patrick J. F. Groenen & Alex Koning
Erasmus University Rotterdam, The Netherlands
In the analysis of categorical data, generalized linear models are often used as a generalization of analysis of variance techniques (see, for example, McCullagh and Nelder, 1989). Among these techniques are, for example, loglinear analysis and categorical logistic regression. In many cases, interaction terms are modelled and the most interesting ones are the bivariate interactions. However, as the number of categorical variables increases, the total number of bivariate interactions also increases dramatically. Our aim here is to provide a simple graphical representation to facilitate the interpretation of all two-way interactions simultaneously. To reach this goal, we impose rank restrictions on the two-way interactions, thus leading to a bi-additive model. We propose identification constraints on the bi-additive part that allow the main effects to be separated from the interaction effects.
The main reason for proposing the current model is that the bivariate interaction effects in ordinary GLM are hard to interpret, especially if the number of variables or the number of categories per variable is large. The advantage of the bi-additive model is that the interactions can be easily represented in a graphical representation that is similar to the one in multiple correspondence analysis: each category of every variable is represented by a vector. Then, the bivariate interaction effect is modelled by the scalar product of any two vectors representing the categories of two different variables, that is, the projection of one vector onto the other.
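The scalar-product reading of the interaction can be sketched in a few lines. The category vectors below are hypothetical two-dimensional values, centred over categories as one possible identification constraint (an assumption, not necessarily the constraint proposed by the authors).

```python
import numpy as np

# Hypothetical 2-D vectors for the categories of two variables A and B;
# each coordinate sums to zero over the categories of its variable.
A = np.array([[1.0, 0.5], [-0.5, 1.0], [-0.5, -1.5]])
B = np.array([[0.8, -0.2], [-0.3, 0.9], [-0.5, -0.7]])

# Rank-2 interaction table: entry (i, j) is the scalar product a_i . b_j
interaction = A @ B.T

# Geometric reading for one pair: length of a times the signed projection of b onto a
a, b = A[0], B[1]
signed_projection = (a @ b) / np.linalg.norm(a)

print(np.round(interaction, 3))
print("a_1 . b_2 =", round(float(np.linalg.norm(a) * signed_projection), 3))
```

With centred vectors the interaction table has zero row and column sums, which keeps it separate from the main effects in this toy setting.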
We show that the current model is an extension of the generalized bi-additive model for two categorical variables as discussed by van Eeuwijk (1995), De Falguerolles and Francis (1992) and Gabriel (1996), who also provided algorithms. Our extension may be viewed as a generalization of multiple correspondence analysis to GLM.
We shall illustrate our model using an empirical data set.
Eeuwijk, F. A. (1995). Multiplicative interaction in generalized linear models. Biometrics, 85, 1017–1032.
Falguerolles, A. de & Francis, B. (1992). Algorithmic approaches for fitting bilinear models. In Compstat 1992 (eds Y. Dodge & J. Whittaker), 77-82. Heidelberg: Physica-Verlag.
Gabriel, K. R. (1996). Generalised bilinear regression. Biometrika, 85, 689–700.
McCullagh, P. & Nelder, J. A. (1989). Generalized Linear Models. London: Chapman and Hall.
Estimating population genetics ‘F-statistics’ using correspondence analysis with respect to instrumental variables
Bruno Guinand1, Bertrand Parisseaux1, Jean-Dominique Lebreton2 & François Bonhomme1
1Université de Montpellier II, France & 2CNRS, Montpellier, France
Foundations of population genetics are built on a strong and very formal statistical background. Wright’s fixation indices, also known as “F-statistics” (Wright, 1951) are part of this foundation. “F-statistics” aim at analysing how genetic variance is partitioned within and among populations analysed for a set of genetic markers by considering data at several hierarchical levels: individuals, subpopulations, and total sample (Weir & Hill, 2002). Such a hierarchical partitioning of genetic variation should be implemented in a multivariate framework, providing alternative estimations of “F-statistics”.
However, multivariate methods have scarcely attracted population geneticists and, basically, only principal component analysis has been used in the analysis of human data sets (Cavalli-Sforza et al., 1994). Moreover, most analyses focused on differentiation between subpopulations and did not consider “F-statistics” as a whole. Therefore, population geneticists have repeatedly argued that multivariate methods do not provide a clear alternative to the use of “F-statistics”, being merely exploratory methods that lack links with the fundamentals of their discipline.
Here, we simply show that the relationship of “F-statistics” with chi-square statistics induces a relationship with scalar products and norms in a Euclidean space that, to our knowledge, has never been exploited to develop links between “F-statistics” and bilinear multivariate methods. “F-statistics” can be estimated and decomposed in subspaces, using correspondence analysis with respect to instrumental variables (or canonical correspondence analysis; Lebreton et al., 1991) in an appropriate way.
We illustrate this approach using previously published data on hybridizing mouse subspecies (Mus musculus) (Orth et al., 1998). We also briefly discuss the main benefits of using and developing a coherent multivariate approach, given our increasing knowledge of the natural genetic variation of numerous organisms.
Cavalli-Sforza, L. L., Menozzi, P. & Piazza, A. (1994). The History and Geography of Human Genes. Princeton: Princeton University Press.
Lebreton, J.-D., Sabatier, R., Banco, G. & Bacou, A. M. (1991). Principal component and correspondence analyses with respect to instrumental variables: an overview of their role in studies of structure-activity and species-environment relationships. In Applied Multivariate Analysis in SAR and Environmental Studies (eds Devillers, J. & Karcher, J.), 85-114. Dordrecht: Kluwer.
Orth, A., Adama, T., Din, W. & Bonhomme, F. (1998). Hybridation naturelle entre deux sous espèces de souris domestique Mus musculus domesticus et Mus musculus castaneus près de Lake Casitas (Californie). Genome, 41, 104-110.
Weir, B. S. & Hill, W. G. (2002). Estimating F-statistics. Annual Review of Genetics, 36, 721-750.
Wright, S. (1951). The genetical structure of populations. Annals of Eugenics, 15, 323-354.
Interset distances in the barycentric representation of
Willem J. Heiser
Leiden University, The Netherlands
There has been some debate about the correct interpretation of distances between row elements and column elements in a joint display of a correspondence table. The conventional view is that we can scale this joint display in such a way that either the distances between rows can be interpreted, or the distances between columns, but never directly the distances between rows and columns (Heiser & Meulman, 1983; Greenacre & Hastie, 1987). Carroll et al. (1986) proposed an alternative scaling of the coordinates for which they claimed that both between-set and within-set squared distances could be interpreted, but Greenacre (1989) has shown that this claim is not warranted.
Before any dimension reduction, the representation of the data in correspondence analysis is a barycentric configuration of profile points with respect to the unit profiles, which are hypothetical profiles for which all mass is concentrated in one cell. It is shown that a between-set distance interpretation is possible in any barycentric configuration or plot, in comparison with the distance to some specific supplementary points. The distance involved is not of the chi-squared type, but simply Euclidean. The result is equally valid in the full-dimensional space as in a reduced space obtained by projection, or by any other method producing a suitable configuration of the unit profiles.
Carroll, J. D., Green, P. E., & Schaffer, C. M. (1986). Interpoint distance comparisons in correspondence analysis. Journal of Marketing Research, 23, 271-280.
Greenacre, M. J. (1989). The Carroll-Green-Schaffer scaling in correspondence analysis: A theoretical and empirical appraisal. Journal of Marketing Research, 26, 358-365.
Greenacre, M. J. & Hastie, T. (1987). The geometric interpretation of correspondence analysis. Journal of the American Statistical Association, 82, 437-447.
Heiser, W. J. & Meulman, J. (1983). Analyzing rectangular tables with joint and constrained multidimensional scaling. Journal of Econometrics, 22, 139-167.
"Le patronat norvégien": capital structures and political position-taking in the Norwegian field of power
Johs. Hjellbrekke & Olav Korsnes
University of Bergen, Norway.
When it comes to sociological applications of correspondence analysis, Bourdieu's (1989) and Bourdieu & de Saint-Martin's (1978) analyses of the French field of power count among the classic works. Drawing inspiration from these two studies, and following the approach first outlined in Le Roux & Rouanet (1998), and Chiche et al. (2000), this presentation will describe the relations between field positions and agents' political position taking (i.e. their political orientations) in the Norwegian field of power.
Based on data from a survey by the Norwegian Power and Democracy Project, distributed to 1711 people in top positions within Norwegian society - positions within politics, academia, larger private and public companies, the central administration, the church, the judicial system, the military, cultural institutions and also larger occupational and managerial organisations - two main questions will be addressed:
(1) What are the dominant oppositions within the Norwegian field of power in the year 2000? What areas of this field are the most open with respect to social mobility, and where is the intergenerational reproduction at its strongest? These structures will be uncovered in an analysis where 17 variables are defined as active capital indicators.
(2) What are the relations between the capital structures in the Norwegian field of power and the structures in the habituses of the agents that are located in these positions? To what degree are structural oppositions between field positions also present in the agents' political position taking? These relations will be analysed using data on the agents' position taking towards 20 statements, mainly on the relations between the state and the market, but also on more general political issues.
Defining the variables on political position taking as the active set, the capital indicators and the field position variable will be defined as supplementary variables. Finally, both in order to get a better view of the distributions of the individuals within the field, and also of the degree of intrapositional opposition and interpositional separation, ellipses will be drawn around the positions' mean points (see Chiche et al., op.cit.).
Bourdieu, Pierre (1989). La Noblesse d'État. Paris: Editions de minuit.
Bourdieu, Pierre & de Saint-Martin, Monique (1978). Le patronat. In Actes de la recherche en sciences sociales, #20/21, mars-avril 1978.
Chiche, J., Le Roux, B., Perrineau, P. & Rouanet, H. (2000). L'espace politique des électeurs français à la fin des années 1990. In Revue française de science politique, 50, juin 2000, 463-488.
Le Roux, Brigitte & Rouanet, Henry (1998). Interpreting axes in MCA: method of the contributions of points and deviations. In Visualization of Categorical Data (eds J. Blasius & M. J. Greenacre), pp.197-220. London: Academic Press.
Using nonlinear principal component analysis to assess the relationship between industrial sectors, import countries and export barriers: the case of Norway
Bodø Graduate School of Business, Norway
In this paper we investigate external export barriers, how they are related to each other and how they are related to industrial sectors and to import countries. In 1995, 459 chief executive officers from Norwegian companies were asked to depict their perceptions of ten export barriers they meet in their most important import country. We used Likert-type items with five categories each, ranging from “very low importance” to “very high importance”.
Applying nonlinear principal component analysis (NLPCA), we chose a two-dimensional space in which the first dimension is described as “level of export barriers”, ranging from “very low export barriers” to “very high export barriers”. The second dimension mirrors the “composition of export barriers” and differentiates between those barriers which are very important for the fishery sector, for example “veterinarian certification”, and those barriers which are important to all sectors. The application of NLPCA allows the production of a two-dimensional map. This map was used to produce the profile of quantifications for the ten indicators, which turned out to be an effective visual framework for investigating scale properties. The visualizations of the quantifications showed that the assumption that measures of export barriers are of a continuous nature does not hold, at least in this instance. In many cases the items had ordinal properties, but several items only had dichotomous properties. We show that the largest differences in the quantifications are in most cases between the first and the second category; they reflect the greatest differences in the (latent) perception of export barriers. If we wish to create a dichotomous scale, we should cut the items at this point and not in the middle of the manifest scale. To visualize the structure of responses in the two-dimensional space, we use biplots of the indicators.
The reported findings show that industrial sectors have different levels of export barriers and that the patterns are heterogeneous across import countries. A free trade agreement with the European Union does not result in any lower level of perceived export barriers than in countries without a free trade agreement. Furthermore, firms within a highly competitive industry meet higher levels of perceived export barriers than firms within other industrial sectors.
Cultural and social backgrounds and students' choice of craft within the field of Vocational Education and Training in Denmark
Roskilde University, Denmark
Educational reforms in youth education in Denmark are carried out based on the presumption that young people are individualised, released from tradition and culture and constantly preoccupied with creating their own identity (Giddens 1991; Ziehe 1982). Therefore the political agenda is to reform the educational system according to this “presumed youth”. This also goes for the field of Vocational Education and Training (VET). Traditionally the VET area, due to its strong focus on practical work in companies and firms, has absorbed a large share of young people, who are primarily seeking a practical education with little or no resemblance to traditional school-based teaching. The latest reform of the VET system, called Reform 2000, is characterized by a complete individualisation of education and its adaptation to each individual student. The students are supposed, in cooperation with a teacher, to create their own individual education by choosing between different modules in order to take account of their own “learning style”. Furthermore, the students are supposed to take examinations in traditional school-based subjects, preparing for further education.
The purpose of my research project is to establish an empirically based knowledge of the existing social and cultural differences between students in VET in Denmark, their choice of craft and their appreciation of the results of Reform 2000. One of the main theses guiding the project is that young people in general are much more differentiated than the “presumed youth” suggests.
Data were collected using questionnaires distributed among approximately 1200 students from the following crafts: carpenter, metalworker, plumber, graphic designer and computer mechanic. The questionnaires are composed of about 200 questions prepared for electronic processing. The aim of the questionnaire is to make it possible to recreate the social, cultural and economic background of the students, in order to construct their economic, cultural and social capital (Bourdieu, 1984; Bennett et al., 1999) and the interrelation of these with their choice of craft.
Using a multiple correspondence analysis our strategy is to construct indicators that are able to show differences in the composition of capital among the students and possible correspondences with choice of craft and appreciation of the results of Reform 2000. I shall present preliminary results of the project and discuss the problems of using questionnaires in the construction of the students' composition of capital. Furthermore, I shall present my considerations of the applicability of correspondence analysis in this particular project.
Bennett, Tony, Emmison, Michael & Frow, John (1999). Accounting for Tastes: Australian Everyday Cultures. Cambridge University Press.
Bourdieu, Pierre (1984). Distinction: A Social Critique of the Judgement of Taste. Routledge & Kegan Paul.
Giddens, Anthony (1991). Modernity and Self-Identity: Self and Society in the Late Modern Age. Cambridge: Polity.
Ziehe, Thomas & Stubenrauch, Herbert (1982). Plädoyer für ungewöhnliches Lernen: Ideen zur Jugendsituation. Reinbek: Rowohlt.
Additive and multiplicative models for three-way tables
Pieter M. Kroonenberg
Leiden University, The Netherlands
In a little-referenced paper Darroch (1974) discussed the relative merits of additive and multiplicative modelling for contingency tables. In particular, he compared the following aspects: partition properties, closeness to independence, conditional independence as a special case, distributional equivalence, subtable invariance, and constraints on the marginal probabilities. On the basis of this investigation, he believed that multiplicative modelling is preferable to additive modelling, "but not by so wide a margin as the difference in the attention these two definitions have received in the literature" (p. 213).
It is surprising that one important aspect of modelling contingency tables did not figure in this comparison, i.e. interpretability. The major aim in most empirical sciences is to apply models to data and to get a deeper insight into the subject matter by interpreting the outcome of the statistical models. In this presentation Darroch's investigations are extended by investigating the interpretational possibilities and impossibilities of multiplicative and additive modelling of contingency tables. The investigation will primarily take place at the empirical level, and is limited to three-way contingency tables with medium to large numbers of categories. The focus lies with the interpretation of the dependence present in the table and how one can gain insight into complex patterns of different types of dependence.
In particular, empirical comparisons are made between Goodman's RCM association models (Anderson, 1996; Wong, 2001), and three-mode correspondence analysis for moderate to large three-way contingency tables (Carlier & Kroonenberg, 1996). By limiting ourselves to three-way tables, some of the generality of Darroch's argument is lacking, but some higher-order tables can be fruitfully reduced to three-way tables by multiplicative (or interactive, as it is sometimes called) coding of the categories. Such coding will inevitably eliminate a certain number of interactions and the relative merits of doing so will also be a subject of discussion.
Several data sets which have been previously analysed in the literature will be scrutinised with both multiplicative and additive models, and the models will be evaluated on the extent to which they succeed in bringing the patterns contained in the dependence to the fore. An attempt will be made to formulate specific recommendations about when each technique is most likely to be informative, but such recommendations will only be very preliminary.
Anderson, C. J. (1996). The analysis of three-way contingency tables by three-mode association models. Psychometrika, 61, 465‑483.
Carlier, A., & Kroonenberg, P. M. (1996). Decompositions and biplots in three-way correspondence analysis, Psychometrika, 61, 355‑373.
Darroch, J. N. (1974). Multiplicative and additive interaction in contingency tables. Biometrika, 61, 207‑214.
Wong, R. S.-K. (2001). Multidimensional association models. Sociological Methods & Research, 30, 197‑240.
Robustness in nonmetric multidimensional scaling
University of Zurich, Switzerland
Similarity-based records in the social sciences (measuring relations between a large number of subjects or representing the knowledge about an object field in the form of a cognitive structure) are often obscured by a mixture of scatter and outliers, especially when working with questionnaire data, when dealing with measurements on relatively small samples or when modelling on the level of single individuals.
When such data are processed with metric or nonmetric multidimensional scaling methods, a large portion of scatter and outliers can severely affect the resulting geometric structure. The reason is that classical (N)MDS algorithms are only partly suitable for such records because of their squared error model: to minimize the stress value, large errors in the fit of single distances must be avoided because they affect the stress to a major degree when squared. Outliers (which, by definition, must produce such “errors” in a geometric solution) therefore result in an inadequate shift of the respective points, which spreads the error over as many distances as possible in order to minimize the squares. By this shifting of points, outliers can distort the “true” solution to a significant degree.
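The leverage of a single outlier under a squared error model can be seen in a toy calculation. The sketch below keeps a made-up configuration fixed, perturbs its distances to obtain dissimilarities with one gross outlier, and compares the outlier's share of a squared loss with its share of an absolute loss. It is not the RobuScal algorithm; all data and names are assumptions for illustration.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distances between all rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))                 # made-up "true" configuration
iu = np.triu_indices(6, k=1)                # the 15 distinct pairs
d = pairwise_distances(X)[iu]

delta = d + rng.normal(scale=0.05, size=d.size)   # dissimilarities with small scatter
delta[0] += 3.0                                    # one gross outlier

res = delta - d                              # residuals at the true configuration
share_sq = res[0] ** 2 / np.sum(res ** 2)    # outlier's share of the squared loss
share_abs = abs(res[0]) / np.sum(np.abs(res))  # outlier's share of the absolute loss

print(f"outlier share of squared loss:  {share_sq:.3f}")
print(f"outlier share of absolute loss: {share_abs:.3f}")
```

Since the outlier dominates the squared loss far more than the absolute loss, a least-squares fit has a stronger incentive to move points away from the true configuration to accommodate it.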
The subsequent problems of data interpretation which arise from such non-robust results can be illustrated by various examples from the field of social sciences, from intuitive data as well as from prominent published results.
If the concern about outlier-affected data is justified, a method would be appropriate which can separate the signal (i.e. “true” structure) from the noise (scatter and outliers). As a suggested solution to this problem of robustness, we present the RobuScal NMDS algorithm, which is based on a robust starting configuration (a further development of the metric TUFSCAL algorithm by Spence & Lewandowsky, 1989) and a weighted error model as proposed by Heiser (1988) for the nonmetric part.
The robustness of the RobuScal algorithm can be demonstrated by a systematic test. This Monte Carlo study provides an adequate set of simulation data which can also be used for a more general evaluation of all existing NMDS algorithms with regard to their robustness. We conclude with the general recommendation that scaling algorithms should pass such a test before they are used for processing empirical data.
Heiser, W. J. (1988). Multidimensional scaling with least absolute residuals. In Classification and Related Methods (ed H. H. Bock), 145-155. Berlin: Springer.
Spence, I. & Lewandowsky, S. (1989). Robust multidimensional scaling. Psychometrika, 54, 501-513.
Evolutionary analysis for correspondence data
N. Carlo Lauro & Simona Balbi
Università “Federico II”, Naples, Italy
In two-way correspondence analysis (CA) applications, one of the dimensions concerned is often represented by occasions, namely by the time. Treating time as a categorical variable does not allow consideration of the asymmetrical role that it plays with respect to the other variables. On the usual plots of CA, we are just able to draw trajectories joining points referred to the different occasions. While nonsymmetrical correspondence analysis (NSCA) (Lauro & D'Ambra, 1984; Lauro & Balbi, 1999) solves the problem of asymmetry, it does not offer a suitable visualisation and interpretation of data evolution.
In the literature the case of three-way tables has been considered by different authors (e.g. Foucart, 1979; Glaçon, 1981), developing some proposals based on the analysis of a contingency table series, obtained by juxtaposing/superposing the strata of the multiple table. Whereas the first author focuses his attention on the distances between marginal distributions and a reference distribution (given by averaging the independence hypothesis in different times, or by building profiles), the second one proposes a STATIS approach. The graphics proposed in both approaches rely on the classical joint plots, on which it is possible to represent trajectories obtained by joining points representing the same (row or column) category, referred to different times. In addition, STATIS allows us to represent synthetically each matrix by one point and visualises global trajectories on the interstructure factorial planes.
It must be noticed that the data evolution is not the actual aim of all these methods and it is just considered descriptively in a final step of the analysis.
The aim of this paper is to introduce an approach based on time-series analysis within the framework of the analysis of correspondence data, in order to take the temporal dimension into account explicitly (Lauro, 1973). The core of the analysis consists in understanding the mechanism of transition from one state to the following one by estimating the matrix generating the evolutionary process of the data, according to a first-order autoregressive model, and its eigenstructure. Thus, similarly to NSCA, the proposal allows us to decompose the transition matrix in terms of latent components that now depend on time. One of the main outcomes is the possibility of an additional time-series-like graphical representation, visualising the evolution of the structure with respect to time, which appears explicitly on the graphs.
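The estimation step can be illustrated with a minimal sketch, assuming that each occasion is summarised by a vector of k values (one row of a hypothetical array `states`) and that the generating matrix is estimated by ordinary least squares. The function name `fit_ar1`, the toy matrix `A_true` and the noise-free series are illustrative assumptions, not the authors' procedure.

```python
import numpy as np

def fit_ar1(states):
    """Least-squares estimate of A in x[t+1] ~ A @ x[t].

    states: array of shape (T, k), one row per occasion (hypothetical layout).
    """
    X0 = states[:-1].T          # k x (T-1): states at occasions 1..T-1
    X1 = states[1:].T           # k x (T-1): states at occasions 2..T
    return X1 @ np.linalg.pinv(X0)

# Toy series generated from a known transition matrix (illustrative values only).
A_true = np.array([[0.9, 0.1],
                   [0.2, 0.7]])
x = np.array([1.0, 0.0])
series = [x]
for _ in range(10):
    x = A_true @ x
    series.append(x)
states = np.array(series)

A_hat = fit_ar1(states)
eigvals, eigvecs = np.linalg.eig(A_hat)   # eigenstructure of the estimated matrix
```

In this noise-free toy case the generating matrix is recovered exactly; with real data the least-squares fit is only an approximation, and the eigenvalues and eigenvectors would supply the latent components to be displayed against time.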
Methodological implications, computational and interpretative problems are discussed. The method is illustrated using evolutionary data referred to the Italian economic structure, and comparing results with the other previously mentioned methods.
Foucart, T. (1979). Structures de tableaux de probabilité. Description et prévision. Thèse du III cycle, Université de Sciences et Techniques du Languedoc.
Glaçon, F. (1981). Analyse conjointe de plusieurs matrices. Thèse du III cycle, Université de Grenoble.
Lavit, Ch., Escoufier, Y., Sabatier, R. & Treissac, P. (1994). The ACT (STATIS method). Computational Statistics & Data Analysis, 18, 97-119.
Lauro, N. C. (1973). Tendenze evolutive del sistema produttivo italiano alla luce di un'analisi strutturale. Quaderni del C.S.E.I., 12.
Lauro, N. C. & D'Ambra, L. (1984). L'Analyse non symétrique des correspondances. In: Data Analysis and Informatics III. Amsterdam: North-Holland.
Lauro, N. C. & Balbi, S. (1999). The analysis of structured qualitative data. Applied Stochastic Models and Data Analysis, 15, 1-27.
The space of central bankers in the world
Université de Picardie Jules Verne, CSE et CEFRESS, Paris, France
The space of the educational, academic and professional trajectories of central bank governors testifies to the coexistence and rivalry between different types of “symbolic capital” (forms of prestige and specific authority). There are "insiders", who come from the central bank institution, on the one hand, and "outsiders", whose legitimacy may be academic, economic or political, on the other hand. There are leaders who come from the financial world and the private sector on the one hand, and leaders who come from the political or university arenas on the other. Central banks are the places where these different forms of symbolic capital confront each other and combine (Lebaron, 2000). These “neutral” places are characterized by a particular distribution of resources, which defines their underlying structure.
Multiple correspondence analysis (MCA) helps to reveal the structure of this very particular social space inside the economic world, and to answer specific comparative questions related to the social characteristics of governors from different regions. In this sense, geometric data analysis (GDA) appears to be a specific kind of “structural” analysis, in which “social agents” (“individuals”) are central to the analysis and the interpretation (see Rouanet et al., 2000).
This paper will present the utility of MCA to treat these sociological problems in the particular case of a central economic institution (devoted to monetary policy and banking supervision). Possible relations between GDA and economic sociology will be discussed. The paper will then focus on the construction of a relevant social space, using a particular set of active questions, related to educational, professional and academic trajectories. It will discuss the relations between statistical and sociological interpretations of the relevant dimensions of this space. It will then try to characterize regions of the world according to the types of capital which are dominant inside the central banks, using the technique of supplementary elements. It will, in the end, assess the general relation between positions in this social space and “opinions”, “position takings” and “strategic choices” (see Bourdieu, 1984; Lebaron, 2001), in the same methodological frame.
Bourdieu, Pierre (1984). Homo Academicus. Paris: Minuit.
Lebaron, Frédéric (2000). The space of economic neutrality. Trajectories and types of legitimacy of central bank managers. International Journal of Contemporary Sociology, 37, 208-229.
Lebaron, Frédéric (2001). Economists and the economic order. The field of economists and the field of power in France. European Societies, 3, 91-110.
Rouanet, H., Ackermann, W. & Le Roux, B. (2000). The geometric analysis of questionnaires: The lesson of Bourdieu's La Distinction. Bulletin de Méthodologie Sociologique, 65.
Validation procedures for principal axes methods
Centre National de la Recherche Scientifique, ENST., France
Bootstrap resampling techniques are frequently used to produce confidence areas on two-dimensional displays derived from principal axes techniques such as correspondence analysis (CA) and principal component analysis (PCA). In the case of PCA, numerous papers have addressed the selection of the relevant number of axes, and have proposed confidence intervals for points in the subspace spanned by the principal axes. These parameters are computed after the realization of several replicated samples, and involve constraints that depend on these samples. Several procedures have been proposed to overcome these difficulties: partial replications using supplementary elements (partial bootstrap), use of a three-way analysis to process simultaneously the whole set of replications (Holmes, 1989), filtering techniques involving reordering of axes and Procrustean rotations (Markus, 1994; Milan & Whittaker, 1995). Gifi (1981), Meulman (1982), Greenacre (1984) have addressed the problem in the context of two-way and multiple correspondence analyses.
In the PCA case, variants of bootstrap (partial and total bootstrap) are presented for active variables, supplementary variables, and supplementary nominal variables as well (Château & Lebart, 1996). In the case of numerous homogeneous variables, a bootstrap on variables is also proposed, with examples of application to the case of semiometric data (Lebart et al., 2003).
In the context of CA (two-way or multi-way), the bootstrap allows one to draw confidence ellipses or convex hulls for both supplementary categories and for supplementary continuous variables. It appears easier to assess eigenvectors than eigenvalues (see Alvarez et al., 2002). In the domain of textual data, these techniques can efficiently tackle the difficult problem of the plurality of statistical units (words, lemmas, segments, sentences, respondents).
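As a rough illustration of the partial bootstrap idea for PCA, the sketch below keeps the principal axes of the original analysis fixed and positions each replicated variable as a supplementary element, here taken to be its correlation with the original principal coordinates over the resampled individuals. The function names, the simulated data and the choice of 100 replicates are assumptions made for illustration only.

```python
import numpy as np

def pca_scores(X, n_axes=2):
    """Principal coordinates of the individuals from a standardized PCA."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :n_axes] * s[:n_axes]

def partial_bootstrap_variables(X, n_axes=2, n_rep=200, seed=0):
    """Replicated variable coordinates on the ORIGINAL axes (axes are not recomputed)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    scores = pca_scores(X, n_axes)
    reps = np.empty((n_rep, p, n_axes))
    for b in range(n_rep):
        idx = rng.integers(0, n, size=n)            # resample individuals with replacement
        both = np.hstack([X[idx], scores[idx]])
        # correlations between resampled variables and resampled original scores
        reps[b] = np.corrcoef(both, rowvar=False)[:p, p:]
    return reps

# Simulated data (illustrative only): four variables driven by one latent factor.
rng = np.random.default_rng(1)
latent = rng.normal(size=(80, 1))
X = latent @ np.array([[1.0, 0.8, -0.6, 0.1]]) + 0.5 * rng.normal(size=(80, 4))
reps = partial_bootstrap_variables(X, n_axes=2, n_rep=100)
```

The cloud of 100 replicated points obtained for each variable could then be summarised by a confidence ellipse or a convex hull on the factorial plane.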
Alvarez, R., Bécue, M., Lanero, J. J. & Valencia, O. (2002). Results stability in textual analysis: its application to the study of the Spanish investiture speeches (1979-2000). In: JADT-2002, 6th International Conference on Textual Data Analysis, (eds. A. Morin & P. Sébillot), INRIA-IRISA, Rennes, 1-12.
Château, F. & Lebart, L. (1996). Assessing sample variability and stability in the visualization techniques related to principal component analysis; bootstrap and alternative simulation methods. COMPSTAT 1996 (ed A. Prat), 205-210. Heidelberg: Physica-Verlag.
Gifi, A. (1981, 1990). Nonlinear Multivariate Analysis. Chichester: Wiley.
Greenacre, M. (1984). Theory and Applications of Correspondence Analysis. London: Academic Press.
Holmes, S. (1989). Using the bootstrap and the RV coefficient in the multivariate context. In: Data Analysis, Learning Symbolic and Numeric Knowledge (ed. E. Diday), 119-132. New York: Nova Science.
Lebart, L., Piron, M. & Steiner, J.-F. (2003). La Sémiométrie. Paris: Dunod.
Markus, M. Th. (1994). Bootstrap confidence regions for homogeneity analysis: the influence of rotation on coverage percentages. COMPSTAT 1994 (eds R. Dutter & W. Grossmann), 337-342. Heidelberg: Physica-Verlag.
Meulman, J. (1982)
Milan, L. & Whittaker, J. (1995). Application of the parametric bootstrap to models that incorporate a singular value decomposition. Applied Statistics, 44, 31-49.
Detection of density modes based on distributional distance: an application to fine-grained clustering of bibliographical data
Alain Lelu & Claire François
INRA, Jouy-en-Josas, France et Université de Franche-Comté/LASELDI
& INIST, Vandoeuvre-lès-Nancy, France
The design of our local components analysis method has been motivated by two main considerations:
(1) In order to keep the same desirable property of distributional equivalence as correspondence analysis, while also extracting local eigen-elements for each document cluster obtained from large collections, we use the distributional distance defined by Matusita (1955) and Escofier (1978). This distance is also used in spherical factor analysis by Domengès & Volle (1979). In doing so, documents are characterized by a centrality index in their own cluster, and clusters are characterized in a dual sense by deduced values for terms.
(2) Most clustering methods use iterative optimizing algorithms leading to a local maximum of a global quality index (see Diday et al., 1979, for centroid shift methods, and Buntine, 2002, for more recent EM methods). This results in unstable representations, which are difficult to use if one wants to evaluate the influence of deleting, inserting or merging terms or documents, or, a major challenge, to evaluate the evolution of a data-flow. We have adapted a density-mode seeking algorithm (Trémolières, 1994) to an adaptive density definition, based on reciprocal K-nearest-neighbours. An application is presented, and stability is evaluated for a data set of 1397 bibliographical records, described by 935 keywords.
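The distributional equivalence property mentioned in point (1) can be checked numerically. The sketch below assumes the distance is computed between document profiles (rows divided by their totals) as the Euclidean distance between square-rooted profiles; the small count table and the function names are hypothetical.

```python
import numpy as np

def row_profiles(counts):
    """Divide each document (row) by its total."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)

def matusita_distance(p, q):
    """Euclidean distance between square-rooted profiles."""
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def pairwise(counts):
    P = row_profiles(counts)
    n = P.shape[0]
    return np.array([[matusita_distance(P[i], P[j]) for j in range(n)]
                     for i in range(n)])

# Three documents, four terms; term 1 is exactly twice term 0 in every document,
# so the two terms have proportional distributions over the documents.
counts = np.array([[2, 4, 1, 3],
                   [1, 2, 5, 2],
                   [3, 6, 0, 1]])
# Same table after merging the two proportional terms into one.
merged = np.column_stack([counts[:, 0] + counts[:, 1], counts[:, 2:]])

D_before = pairwise(counts)
D_after = pairwise(merged)
```

Merging the two proportional terms leaves every inter-document distance unchanged, which is the property the method shares with correspondence analysis.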
Buntine, W. (2002). Variational extensions to EM and multinomial PCA. In Proceedings of ECML 2002.
Diday, E. et al. (1979). Optimisation en Classification Automatique. Rocquencourt: INRIA.
Domengès, D. & Volle, M. (1979). Analyse factorielle sphérique: une exploration. Annales de l'INSEE, 35.
Escofier, B. (1978). Analyses factorielles et distances répondant au principe d'équivalence distributionnelle. Revue de Statistique Appliquée, 26, 29-37.
Matusita, K. (1955). Decision rules, based on the distance, for problems of fit, two samples, and estimation. Annals of Mathematical Statistics, 26, 631-640.
Trémolières, R. C. (1994). Percolation and multimodal data structuring. In New Approaches in Classification and Data Analysis, (eds Diday E. et al.), pp. 263-268. Berlin: Springer-Verlag.
Graphical displays of Markov chains by means of nonsymmetrical correspondence analysis
Danilo Leone & Marilena Fucili
University of Naples, Italy
The Markov chains framework (Bertsekas, 1987) is widely used to model dynamic systems: examples come from engineering and social applications and from the class of problems solved by Markovian decision processes. Recently these tools have been widely used in machine learning and reinforcement learning techniques. The proposed tool provides an original way to visualize the inner structure of Markov chains, the dynamics of the systems and the advancement of the learning process. The advantages of this methodology are manifold: it uses nonsymmetrical correspondence analysis (NSCA) (Lauro & D’Ambra, 1984) for dealing with the Markovian dependence assumption; it can manage the presence and combinations of several categorical variables (attributes) defining different states; and it allows a direct visualization of relations between starting states and destination states by using graphical displays in the style of principal component analysis.
We assume that the starting states (and the destination states) of the system are defined either by I different levels of a unique attribute, or by combinations of n attributes with L_a (a = 1, …, n) levels. As the number of states grows it becomes difficult to read the transition matrix and, eventually, to discover relations among attributes; that is why we have focused our attention on factorial analysis and, more precisely, on NSCA. The I destination states are the levels of the dependent variable, and the I starting states are the levels of the explanatory variable in the NSCA. The current formulation of NSCA differs from the usual one because the same variable is allocated to the rows and to the columns of the correspondence matrix (the Markov chain). Furthermore, the particular probability data we handle lead to special issues of interpretation of the NSCA.
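A minimal numerical sketch of NSCA applied to a table of observed transitions follows, assuming rows are starting states (explanatory) and columns are destination states (dependent), and that starting-state masses are taken from the observed row totals. The function name, the scaling of the coordinates and the toy counts are assumptions; the authors' formulation may differ in these conventions.

```python
import numpy as np

def nsca_transitions(counts):
    """NSCA of a transition count table (rows: starting states, columns: destinations)."""
    N = np.asarray(counts, dtype=float)
    P = N / N.sum()
    row_mass = P.sum(axis=1)                 # masses of the starting states
    col_mass = P.sum(axis=0)                 # marginal distribution of destinations
    T = P / row_mass[:, None]                # row-stochastic transition matrix
    Pi = T - col_mass                        # conditional minus marginal probabilities
    U, s, Vt = np.linalg.svd(np.sqrt(row_mass)[:, None] * Pi, full_matrices=False)
    row_coords = (U / np.sqrt(row_mass)[:, None]) * s   # starting states
    col_coords = Vt.T                                    # destination states
    return T, Pi, s, row_coords, col_coords

# Hypothetical counts of observed transitions between three states.
counts = np.array([[30, 5, 5],
                   [10, 20, 10],
                   [2, 8, 30]])
T, Pi, s, row_coords, col_coords = nsca_transitions(counts)
inertia = np.sum(s ** 2)   # weighted sum of squared departures from the marginal
```

Because every row of the centred matrix sums to zero, at most I - 1 axes carry inertia, and the product of the row and column coordinates reproduces the centred transition probabilities.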
We also propose to represent the attributes coordinates as supplementary points on the principal axes. The graphical displays must be analyzed applying Markov chain principles (e.g. recurrence) and the geometries of the factorial method. We also describe an application on real data to illustrate the use of our methodology and the advantages of the suggested graphical representation.
Bertsekas, D. P. (1987). Dynamic Programming. Deterministic and Stochastic Models. Prentice Hall.
Lauro, C. & D’Ambra, L. (1984). Non symmetrical correspondence analysis. In Data Analysis and Informatics III (eds E. Diday et al.), pp. 433-446. North Holland.
Specific multiple correspondence analysis
Brigitte Le Roux & Jean Chiche
Université René Descartes, Paris & Centre d’Etudes de la Vie Politique Française, Paris, France
Lerb@math-info.univ-paris5.fr & Chiche@msh-paris.fr
The method of specific multiple correspondence analysis (SMCA) is motivated by the need to analyze questionnaires with nonresponses, the overall aim being to free oneself from the constraints of complete disjunctive coding while preserving the structural properties of multiple correspondence analysis (MCA). In the paper, we present the method within the general framework of geometric data analysis, and then deal with the special case of a questionnaire, following the line of Le Roux (1999).
We first recall the properties of principal directions and variables of a Euclidean cloud. Then we apply these properties to a protocol for which individual profiles are measures, and we derive the properties of biweighted principal component analysis. Then we recall the formulas of MCA viewed as a particular biweighted principal component analysis.
We then compare SMCA with standard MCA, writing inequalities between eigenvalues and studying the rotation of principal subspaces when one goes from the global analysis to the specific one.
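One way to picture the comparison is the sketch below, which assumes that SMCA keeps the category weights of the standard MCA but drops the columns of the passive categories (here, a nonresponse code) before the singular value decomposition. The coding of answers as integers, the function names and the random data are hypothetical, and the inequality checked at the end is the interlacing that follows from this particular construction.

```python
import numpy as np

def indicator_matrix(codes):
    """Complete disjunctive coding; returns Z (n x K) and (question, category) labels."""
    blocks, labels = [], []
    for q in range(codes.shape[1]):
        cats = np.unique(codes[:, q])
        blocks.append((codes[:, q][:, None] == cats[None, :]).astype(float))
        labels += [(q, int(c)) for c in cats]
    return np.hstack(blocks), labels

def mca_eigenvalues(codes, passive=()):
    """Eigenvalues of MCA; categories listed in `passive` are dropped (specific MCA)."""
    Z, labels = indicator_matrix(codes)
    n, Q = codes.shape
    f = Z.mean(axis=0)                        # category frequencies from the full table
    S = (Z - f) / np.sqrt(n * Q * f)          # centred, weighted indicator matrix
    passive = set(passive)
    keep = [lab not in passive for lab in labels]
    return np.linalg.svd(S[:, keep], compute_uv=False) ** 2

# Hypothetical questionnaire: 40 individuals, 3 questions, code 2 = nonresponse.
rng = np.random.default_rng(0)
codes = rng.integers(0, 3, size=(40, 3))
full = mca_eigenvalues(codes)
specific = mca_eigenvalues(codes, passive=[(q, 2) for q in range(3)])
```

Under this construction each specific eigenvalue is bounded above by the standard MCA eigenvalue of the same rank, and the total inertia of the standard analysis equals K/Q - 1.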
Examples of SMCA are found in Bourdieu (1999) and Chiche et al. (2000).
Bourdieu, P. (1999). Une révolution conservatrice dans l’édition, Actes de la Recherche en Sciences Sociales, 127.
Le Roux, B. (1999). Analyse spécifique d’un nuage euclidien. Mathématiques, Informatique et Sciences Humaines, 37, 65-83.
Chiche, J., Le Roux, B., Perrineau, P. & Rouanet, H. (2000). L’espace politique des électeurs français à la fin des années 1990. Revue Française de Sciences Politiques, 50, 463-487.
Spectral mapping. Was it worth the effort?
Paul J. Lewi
Center for Molecular Design, Janssen Pharmaceutica, Vosselaar, Belgium
I started my career 40 years ago in the research laboratory of Dr. Paul Janssen in Beerse, Belgium, a laboratory which has produced 75 original medicines during the course of 45 years of pharmacological research. Pharmacology is both multidimensional and dual. The multidimensionality follows from the fact that chemical compounds, medicinal drugs in particular, may exhibit a vast spectrum of activities when tested on proteins (and other biopolymers), cells, isolated organs, micro-organisms, plants, animals and man. (Activity in a test is specified on a ratio scale as the estimated dose or concentration of a drug that produces a certain effect in 50% of replicated cases.) The duality derives from the following consideration. Any two drugs can be contrasted in terms of the log ratio of their activities obtained in a given test. Vice-versa, the contrast between any two tests is defined by the log ratio of the activities produced in them by a given drug.
The Janssen laboratory frequently produced exhaustive results for a relatively large number of drugs and tests, with resulting data tables that were intrinsically multivariate and dual, as just described. The problem, then, was to classify the drugs and tests in order to reveal the biological structure underlying the data. Initially, the spectra were drawn on cardboard cards and displayed on a table in the laboratory. But different researchers arranged the cards in distinct ways, some being more biased by the average activity (or size) in the spectrum of a drug or test, others paying more attention to certain features of the profile (or shape) of the corresponding activity spectra. In 1975, in order to resolve the controversies, Dr. Janssen asked me whether it would be possible to find or design an ‘objective’ method to solve this dually multivariate problem. By coincidence, some French statisticians had just handed me a rather cryptic computer code for correspondence (CA) and principal components (PCA) analysis. After some experimenting and re-engineering, I realized that CA revealed contrasts between drugs and, reciprocally, displayed contrasts between tests. It also showed dual specificities between drugs and tests, independently of the potencies of the compounds and of the sensitivities of the tests. It did not provide, however, an interpretation of contrasts in terms of log ratios. The basic design of CA involves double-closure of a contingency table, singular value decomposition (SVD) and biplot, all weighted by marginal sums of rows and columns. It seemed natural to me to replace the double-closure operation in CA by double-centring of the log transformed table of activity spectra, all other things remaining equal. The weighting by marginal sums makes the resulting spectral map less influenced by drugs with low potency and less biased by tests with weak sensitivity.
This ‘spectral mapping’ approach is formally identical to the ‘weighted logratio’ method described in the poster at this conference by Greenacre and Aitchison.
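The sequence of operations just described (log transformation, weighted double-centring, weighted SVD, biplot coordinates) can be sketched as follows, assuming a strictly positive table with drugs in rows and tests in columns and weights proportional to the marginal sums. The function name, the coordinate scaling and the toy activities are illustrative assumptions.

```python
import numpy as np

def spectral_map(X):
    """Weighted double-centring of log(X) followed by a weighted SVD."""
    X = np.asarray(X, dtype=float)
    r = X.sum(axis=1) / X.sum()               # row (drug) weights from marginal sums
    c = X.sum(axis=0) / X.sum()               # column (test) weights from marginal sums
    Y = np.log(X)
    row_mean = Y @ c                          # weighted mean of each row
    col_mean = r @ Y                          # weighted mean of each column
    grand = r @ Y @ c                         # weighted grand mean
    Z = Y - row_mean[:, None] - col_mean[None, :] + grand
    U, s, Vt = np.linalg.svd(np.sqrt(r)[:, None] * Z * np.sqrt(c)[None, :],
                             full_matrices=False)
    rows = U / np.sqrt(r)[:, None] * s        # drug coordinates
    cols = Vt.T / np.sqrt(c)[:, None]         # test coordinates
    return r, c, Z, s, rows, cols

# Hypothetical activities of three drugs in three tests.
X = np.array([[10.0, 2.0, 1.0],
              [4.0, 8.0, 2.0],
              [1.0, 3.0, 9.0]])
r, c, Z, s, rows, cols = spectral_map(X)

# A purely multiplicative table (potency times sensitivity) has no contrasts left.
_, _, Z_mult, s_mult, _, _ = spectral_map(np.outer([1.0, 5.0, 20.0], [2.0, 3.0, 7.0]))
```

Double-centring removes the size effects of rows and columns, so a contrast such as Z[0,0] - Z[1,0] - Z[0,1] + Z[1,1] equals the log of the corresponding cross-ratio of activities, which is the log ratio interpretation sought in the text.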
Over the years, spectral mapping has been applied to a large variety of data in various fields of research, marketing and finance. Both inside and outside the laboratory the method has had its strong believers, and also its fierce opponents. At the onset, Dr. Janssen said he would only be convinced of its usefulness if a spectral map revealed something that could not have been readily observed beforehand from the data with the unaided eye. One or two cases were produced that made the point; much less, however, than was hoped for. Looking back now, after more than 25 years, the time has come to ask: was it really worth the effort?
Lewi, P. J. (1998). Analysis of contingency tables. In Handbook of Chemometrics and Qualimetrics: Part B (eds. B. G. M. Vandeginste, D. L. Massart, L. M. C. Buydens, S. de Jong, P .J. Lewi & J. Smeyers-Verbeke), pp. 161-206. Amsterdam: Elsevier.
Three-way multidimensional scaling analysis of corporate failure
Cecilio Mar Molinero & Evridiki Neophytou
Universitat Politècnica de Catalunya, Barcelona, Spain
& University of Southampton, United Kingdom
Cecilio.email@example.com & En498@soton.ac.uk
Multivariate statistical analysis has long been used to study corporate failure, using discriminant analysis or logistic regression. It has long been suggested that company size and area of activity are important factors when predicting failure. For this reason, the companies in the sample of failed and continuing firms are often matched by size and area of activity. However, proceeding in this way has serious disadvantages. First, using samples of equal size does not reflect real life, where continuing companies are much more common than failed companies; this can be addressed by means of Bayesian techniques, but it is rare to find a study that makes such a correction. Second, matching by size and area of activity makes it impossible to assess the importance of such factors. Third, it is unrealistic to assume that the conclusions of an analysis based on a sample that does not take into account time evolution will hold for the future. Fourth, the practitioner who will make use of the results is unlikely to understand the complexities of the analysis.
This paper reports on a large study that attempts to overcome all the above limitations. The sample includes 370 failed companies, a far larger number than has previously been reported in the US and UK literature. All the UK publicly quoted companies included in the active file of the FAME database satisfying certain criteria have been included: a total of 818 companies and over 6400 company accounts. The data cover the period 1993 to 2001. For each failed company, data from three to five reporting periods prior to failure were obtained. In the case of continuing companies, financial data cover up to eight reporting periods. As is usual in this type of study, the analysis is based on financial ratios obtained from the balance sheet and the profit and loss account. For each company 19 such ratios were calculated.
The analysis relies on a three-way scaling technique: individual differences scaling (INDSCAL). INDSCAL works from data on proximities. Given the size of the data set, the proximities were calculated between ratios, using companies as observations. A proximity matrix between ratio structures was calculated for each financial year. INDSCAL generates a “common map” that shows the average relationship between financial ratios during the period, and a set of weights that show how the financial ratio structure of companies evolves over time. It was found that the economic cycle influences the structure of company accounts.
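The construction of the input to INDSCAL, one proximity matrix between ratios per financial year with companies as observations, might look like the sketch below. The abstract does not state which proximity measure was used; Euclidean distance between standardized ratios is assumed here purely for illustration, and the simulated data and names are hypothetical. The INDSCAL fit itself is not shown.

```python
import numpy as np

def ratio_proximities(ratios_by_year):
    """One (ratios x ratios) distance matrix per year.

    ratios_by_year: dict mapping year -> array (companies x ratios).
    Assumed measure: Euclidean distance between standardized ratio columns.
    """
    out = {}
    for year, R in ratios_by_year.items():
        Z = (R - R.mean(axis=0)) / R.std(axis=0)      # standardize each ratio
        G = Z.T @ Z                                   # cross-products between ratios
        sq = np.diag(G)[:, None] + np.diag(G)[None, :] - 2 * G
        out[year] = np.sqrt(np.maximum(sq, 0.0))
    return out

# Simulated data (illustrative only): 50 companies, 19 ratios, three financial years.
rng = np.random.default_rng(0)
ratios_by_year = {year: rng.normal(size=(50, 19)) for year in (1993, 1994, 1995)}
proximities = ratio_proximities(ratios_by_year)
```

With standardized ratios the squared distance equals 2n(1 - r), where r is the correlation between two ratios over the n companies, so ratios that move together across companies lie close to each other in the resulting map.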
Companies were projected on the common map. Location differences between failed and continuing companies in the common map were studied by a series of methods that include visual inspection, cluster analysis, and logit methods. Previously unobserved non-linearities, as suggested by theory, were discovered. It was found that failed companies tend to concentrate in certain areas of the maps, and that these areas are associated with low profitability, bad cash flow, and unsatisfactory debt structure. The impact of size and area of activity also became clear. These are well-known results, but the scaling approach has the advantage of visualising the results and, in this way, helps in the process of decision-making as it makes it possible to combine the qualitative and the quantitative aspects of any decision involving an assessment of the future of a company.
The evaluation of “don't know” responses by partial homogeneity analysis
Herbert Matschinger & Matthias C. Angermeyer
University of Leipzig, Germany
Attitudes and other latent dimensions are measured quite frequently by means of Likert-type items where the respondent is asked to evaluate the item with respect to a closed set of mostly ordinal categories. It is implicitly assumed that the respondents are familiar with the problem addressed in the questionnaire. If this is not the case, quite frequently an extra category is employed in order to prevent an inflation of missing values and an uncontrollable bias of the sample. Unfortunately, these "don't know" categories do not fit into the ordinal scheme of the rest of the categories and therefore are very often treated “per fiat” as neutral categories or as missing values, neither of which is a satisfactory solution. In two surveys on attitudes towards the mentally ill, conducted in 1990 and in 2001 in both parts of Germany, among other questions, ten 5-point items were employed to measure attitudes towards positive or negative effects of psychotropic drugs. Half of the items are worded in favour of psychotropic drugs, the other half deny the potential effects of these drugs. Listwise deletion of respondents with respect to "don't know" responses would reduce the sample from 5613 to 2921 (52% of the original sample), which makes the evaluation of these responses vitally important. The goal of this investigation therefore is
(1) to estimate the relationship between the latent dimension and the "don't know" category
(2) to evaluate the meaning of the "don't know" categories in relation to the other - ordinal - categories for each item and conditional on the wording of the item
(3) to control for the impact of the amount of "don't know" responses for each respondent on the dimensional structure of the construct to be portrayed. This amount might serve as an indicator for a more general willingness to respond to the questionnaire.
The structural relationship of the categories is evaluated by means of a partial homogeneity analysis (Bekker & De Leeuw, 1988; Heiser & Meulman, 1994). Here, each set of items not only contains one of the variables of interest but also a copy of the variable: number of "don't know" responses. Treating the items of the scale as multiple nominal and the sum of "don't know" responses as numerical we obtain an extra dimension with a perfect fit (eigenvalue of 1) (Verdegaal, 1986). All the other axes then portray the dimension of interest.
It is shown that the "don't know" response may serve as an indicator for a more critical appraisal of the effect of psychotropic drugs and that this effect is not the same for the two independent samples in 1990 and 2001.
Bekker, P., & De Leeuw, J. (1988). Relations between variants of non-linear principal component analysis. In Component and Correspondence Analysis (eds J. L. A. van Rijckevorsel & J. De Leeuw), 1-31. Chichester: Wiley.
Heiser, W. J., & Meulman, J. J. (1994). Homogeneity analysis: exploring the distribution of variables and their nonlinear relationships. In Correspondence Analysis in the Social Sciences: Recent Developments and Applications (eds M. Greenacre & J. Blasius), 179-209. London: Academic Press.
Verdegaal, R. (1986). OVERALS, Users Manual. (UG-86-1 ed.). Leiden: Department of Data Theory.
Multiple correspondence analysis for industrial specialised local areas in southern Italy
Fernanda Mazzotta1, Gianluigi Coppola2 & Maria Rosaria Garofalo1
1University of Salerno, Italy & 2Celpe, Italy
One of the most important social and economic problems in Italy is the underdevelopment and lower level of industrialisation of the southern part of the country. Since the first years after World War II, many Government interventions have taken place in order to encourage the location of firms in this large area. In particular, at the end of the 1950s, along with the phase of strong industrial expansion in northern Italy, the Government's choice was to encourage the location of large firms in the area in order to employ as many workers as possible.
In the following decades, and particularly from the middle of the 1970s, the end of the Fordist model, and with it the constant decline of large industries, redrew the Italian economic geography. In those years clusters of small and medium-sized firms emerged and drove the economic growth of many Italian geographical areas that had previously played a secondary role, for example the regions of Veneto and Marche. But of the 199 industrial districts identified by ISTAT (National Institute of Statistics) in 1991, only 15 were located in southern Italy (ISTAT, 1996).
This research is a detailed study of Salerno’s productive system, carried out in part through a direct survey of local firms. Using data from the Intermediate Census of Industry and Services (CIIS, 1996), the specialisation indices of the Local Labour Market Systems of the province of Salerno were calculated in order to identify areas of high manufacturing specialisation. Subsequently, by applying a cluster analysis, the existence of micro-areas of specialisation was also tested.
Industrial district areas are also characterised by historical, cultural, social and political factors, in addition to high manufacturing specialisation indices or other quantitative indices. Therefore attention could not be focused exclusively on the productive structures, but had to take into account the institutions, the social network existing in the area, and the mechanisms of interaction between productive structures and the social framework (Brusco & Paba, 1997). Most of these factors are qualitative in nature, and hence difficult to measure and quantify. This difficulty is frequently addressed through case studies of specific areas, with interviews directed at entrepreneurs and privileged actors (Viesti, 2000).
In order to obtain these qualitative variables, an in-depth questionnaire was administered during 1998/1999 through an ad hoc survey (Permanent Observatory of the Enterprises of the Province of Salerno, OPIS) to a sample of 462 firms of all sizes located in the province of Salerno. The questionnaire was structured into nine sections (about 200 questions) covering all aspects of the firm. The objective of this work is to analyse the relation between membership of each cluster area, identified by the previous cluster analysis, and several qualitative variables. To this end we first divided the variables into five groups: variables indicating the initial situation (such as financial facilitation, who established the firm, the entrepreneur's previous activity); institutional characteristics (opinion about local institutions, professional level of the workers in the area, supplementary wages, workers' involvement in the firm); market (sales, purchasing, relations with other firms); vitality of the firm (innovation, hiring and dismissal, training); and information channels. We then ran five MCAs, one for each group of variables, inserting the variable “cluster” as a supplementary variable.
Constructing fields with multiple correspondence analysis: an applied researcher’s view
Ludwig Boltzmann-Institut für Historische Sozialwissenschaft, Vienna, Austria
The importance of correspondence analysis (CA) for the work of Pierre Bourdieu has been noted repeatedly (e.g. Blasius, 2001; Rouanet & Le Roux, 1993). In fact, the possibility of an experimentally controlled construction of social spaces and relatively autonomous fields has had a profound impact even on the theoretical conceptions themselves. This can be seen for example in the shift from mostly typological models of fields and spaces illustrated with network-sketches (e.g. Bourdieu, 1971) to much more systematic or structural research objects. These objects are based on the formal logic of (n-)dimensional mathematical spaces and constructed with the help of multiple correspondence analysis (MCA; e.g. Bourdieu, 1999). The field in particular manifests a dialectical relation between concepts and techniques and can therefore be regarded as a research programme rather than as a theory or a method. Many articles by different authors in the “Actes de la recherche en sciences sociales” show this dialectization and pluralization of the field programme, which results from its experimental orientation.
In such a dialectization, method changes theory just as theory changes method. So, MCA becomes only one tool (but a crucial one) for the construction of a relatively autonomous field, and needs appropriate adaptation for this specific use. Neither an exploratory-typological use (e.g. Cibois, 1984) nor its use to test a given closed theory is appropriate for constructing historical (social/cultural) fields. This requires an integration of the exploratory formulation and the testing of systematic hypotheses concerning the field-structure. Difficulties increase when dealing with historical data from fragmented and heterogeneous sources.
An application to construct the field of the national-socialist youth education 1941-1944 (Mejstrik, 2000) will help to discuss the following issues:
· the fragmentary status of historical data and various ways of structural samplings,
· the use of homogeneity and heterogeneity of the original data for the experimental construction of historical fields,
· a dynamic determination of active and passive elements of the MCA in view of the exploratory as well as the explanatory use,
· a systematic interpretation of the results of an MCA as definition of a latent multidimensional structure and its principle of differentiation and hierarchy with the help of one-dimensional auxiliary graphical representations of categories and individuals, combined with numerical interpretation, and
· the use of the constructed structural principle to describe and explain face-to-face interactions and/or events as well as their dynamics underlying historical/social/cultural change.
Blasius, Jörg (2001). Korrespondenzanalyse. München & Wien.
Bourdieu, Pierre (1971). Le marché des biens symboliques. Année Sociologique, 22, 49-126.
Bourdieu, Pierre (1999). Une révolution conservatrice dans l’édition. Actes de la recherche en sciences sociales, 126-127, 3-26.
Cibois, Philippe (1984). L'Analyse des Données en Sociologie. Paris.
Mejstrik, Alexander (2000). Die Erfindung der deutschen Jugend. Erziehung in Wien 1938-1945. In NS-Herrschaft in Österreich. Ein Handbuch (eds Emmerich Tálos, Ernst Hanisch, Wolfgang Neugebauer & Reinhard Sieder), 494-522. Wien.
Rouanet, Henri & Le Roux, Brigitte (1993). Analyse des Données Multidimensionnelles. Paris.
Relations of inertias
George Menexes & Iannis Papadimitriou
University of Macedonia, Thessaloniki, Greece
Correspondence analysis (CA), used to investigate the association of two categorical variables, can be applied to at least three data tables: a) the simple contingency table, b) the 0-1 indicator matrix and c) the generalized contingency table (Burt table). The “picture” on factorial planes of the phenomenon under investigation is the same in the three cases (Greenacre 1984, Lebart et al. 2000, Andersen 1991). However, the total inertia and the inertia explained by each factorial axis are different: they depend on which of the three tables the analysis is applied to. As a result, the percentage of total inertia explained by, for example, the first two factorial axes sometimes suggests a “poor” and at other times a “good” fit to the data and to the information being analysed.
Initially, this study examines the mathematical relations that connect the total inertias of the three tables in the case of two variables. More specifically, if F is a $k \times l$ contingency table of two categorical variables X and Y, then it is known that the total inertia of F is given by $I_{k \times l} = \chi^2 / n$, where $\chi^2$ is the chi-square statistic calculated on the table and n is the total sample size. If Z is the corresponding 0-1 indicator matrix, then it is also known that the total inertia $I_{0\text{-}1}$ of Z is equal to $[(k+l)/2] - 1$. We prove that the total inertia of the corresponding Burt matrix is $I_B =$ [...]. The proof is not based on matrix algebra but on the assumption that $I_B$ must be equal to the chi-square statistic of the Burt table divided by $4n$, where the chi-square statistic is calculated in the usual way on the Burt matrix as $\sum (\text{observed frequency} - \text{expected frequency})^2 / (\text{expected frequency})$. The same rationale is also applied to prove that $I_{0\text{-}1} = [(k+l)/2] - 1$. An obvious conclusion is $I_{0\text{-}1} > I_B > I_{k \times l}$.
Next, the corresponding generalizations of the relations in the case of multiple variables are examined. We prove that the inertia of the Burt table in the multivariate case is equal to $(I_{0\text{-}1}/m) + (2/m)(\sum I_{\text{all}}/m)$, where $I_{0\text{-}1}$ is the inertia of the corresponding indicator matrix, m is the number of variables and $\sum I_{\text{all}}$ is the sum of the inertias of all $m(m-1)/2$ pairwise contingency tables. The proof is based on the logic that the total inertia of the Burt table is equal to the chi-square statistic calculated on the Burt table divided by $m^2 n$.
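The multivariate relation stated above can be checked numerically. The following Python sketch is only an illustration: the cell counts, the number of variables and all function names are made-up assumptions, not the authors' data or code. Every inertia is computed as the chi-square statistic of the table divided by its grand total, which is the logic the abstract describes (the grand total of the Burt table is $m^2 n$).

```python
import numpy as np
from itertools import combinations

def total_inertia(table):
    """Total CA inertia of a table: chi-square statistic / grand total."""
    t = np.asarray(table, dtype=float)
    total = t.sum()
    expected = np.outer(t.sum(axis=1), t.sum(axis=0)) / total
    return ((t - expected) ** 2 / expected).sum() / total

def indicator(codes, n_levels):
    """0-1 indicator block for one categorical variable coded 0..n_levels-1."""
    return np.eye(n_levels)[codes]

# Hypothetical joint counts for m = 3 variables with 3, 3 and 2 categories.
cells = {(0, 0, 0): 10, (0, 1, 1): 5, (1, 1, 0): 8, (1, 2, 1): 7,
         (2, 2, 0): 6, (2, 0, 1): 4, (0, 2, 0): 3, (1, 0, 1): 2}
levels = [3, 3, 2]
m = len(levels)

# One row per individual, one column per variable.
data = np.repeat(np.array(list(cells.keys())), list(cells.values()), axis=0)
blocks = [indicator(data[:, q], levels[q]) for q in range(m)]
Z = np.hstack(blocks)   # indicator matrix
burt = Z.T @ Z          # Burt table

I_ind = total_inertia(Z)
I_burt = total_inertia(burt)
# Sum of the inertias of all m(m-1)/2 pairwise contingency tables.
I_pairs = sum(total_inertia(blocks[a].T @ blocks[b])
              for a, b in combinations(range(m), 2))

rhs = I_ind / m + (2 / m) * (I_pairs / m)
print("indicator inertia:", I_ind, "expected:", sum(levels) / m - 1)
print("Burt inertia:", I_burt, "formula:", rhs)
```

Setting m = 2 in the same relation gives $I_B = (I_{0\text{-}1} + I_{k \times l})/2$, the mean of the other two inertias, which is consistent with the ordering $I_{0\text{-}1} > I_B > I_{k \times l}$ stated for the two-variable case.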
These relations can shed light on the quality of the information produced by CA: for example, the different physical meanings of inertia according to the data table analysed, the need for some kind of correction or modification of the basic results produced by CA, and the sense that an average bivariate effect size (inertia) is analysed every time. The relations mentioned above imply the need for pairwise testing of the categorical variables before application of CA, in order to develop practical criteria for variable selection. If a decision has to be made about the inclusion of some variables in the CA model, we can proceed by selecting those variables that maximise the inertia of the Burt table.
Andersen, E. (1991). The Statistical Analysis of Categorical Data. Springer-Verlag.
Gifi, A. (1996). Nonlinear Multivariate Analysis. John Wiley & Sons.
Greenacre, M. (1984). Theory and Applications of Correspondence Analysis. London: Academic Press.
Lebart, L., Morineau, A. & Piron, M. (2000). Statistique Exploratoire Multidimensionnelle. Paris: Dunod.
New proposals in the exploratory analysis of joint tables of categorical variables
Juan Ignacio Modroño Herrán, Karmele Fernández-Aguirre & M. Isabel Landaluce Calvo
Universidad del País Vasco, Bilbao, Spain & Universidad de Burgos, Spain
It is particularly common in survey analysis to have many more than two categorical variables available, for which multiple correspondence analysis (MCA) is suitable for dimensionality reduction. It is also common for the categorical data to come from a survey carried out over somewhat different populations or moments of time, so that it makes sense to arrange the variables into coherent groups whose influence as a group should also be considered. In such a case a multiple tables analysis such as multiple factor analysis (MFA, see Escofier & Pagès, 1992) can be used; others such as STATIS (see, for example, Lavit, 1994) are not considered because of their inapplicability to categorical data. As a further problem, such data sets, when arranged in groups, typically do not have the same number of rows, which makes standard use of MFA impossible.
For such a situation, the authors (Abascal et al., 2001) have recently proposed a two-step method, which consists in replacing the original variables by their coordinates on the main factors extracted from a previous MCA of each of the tables defined by the groups, and then performing MFA on the result. The number of coordinates is the same across groups whenever the original variables are the same and the categorisation is also equal. This transformation allows an MFA to be carried out on these matrices of coordinates and, furthermore, as the coordinates are now continuous variables, also permits the use of STATIS.
This method is applied to responses to a survey carried out at the University of the Basque Country which, as part of a large project concerning the process of scientific knowledge research, development and transfer, measures the opinions of scientific staff from five broad areas of knowledge on characteristics of the university research culture. The results show some strong, common and positive features of the research carried out at the university, but also reveal markedly opposed opinions in particular areas.
Finally, a simulation exercise is carried out to check the stability of the results.
Abascal, E., Fernández, K., Landaluce, M. I. & Modroño, J. I. (2001). Diferentes aplicaciones de las técnicas factoriales de análisis de tablas múltiples en las investigaciones mediante encuestas. Metodología de Encuestas, 3, 251-279.
Escofier, B. & Pagès, J. (1992). Análisis Factoriales Simples y Múltiples. Bilbao: Servicio Editorial de la Universidad del País Vasco.
Lavit, C. (1994). The act (Statis method). Computational Statistics and Data Analysis, 18, 97-119.
Relationship between a clinical classification of diabetes and a typification after a multiple correspondence analysis in a murine model
Nora Moscoloni, Silvana Montenegro, Stella Maris Martínez, Juan Carlos Picena, Hugo Navone & María Cristina Tarrés
Universidad Nacional de Rosario, Argentina.
A major requirement for investigation and management of diabetes is to derive an appropriate criterion to identify its different forms and stages. The Expert Committee on the Diagnosis and Classification of Diabetes Mellitus (American Diabetes Association, 2002) proposed, based on the values of oral glucose tolerance test, a classification of diabetes and other categories of glucose regulation.
Our aim was to characterize individuals of the genetically diabetic eSS line of rats (Martínez et al., 1993) by applying multiple correspondence analysis to the values obtained in the oral glucose tolerance test and the assessment of glucosuria, together with other physiological and environmental characteristics, totalling 12 variables, either continuous quantitative or nominal. Beforehand, missing values of glucosuria were imputed by means of an artificial neural network classifier, chosen on two criteria: 1) total independence from the analysis used to determine the typology of individuals and 2) high flexibility of the technique, in order to obtain a predictive model with adequate capacity for generalization (Duda et al., 2001). Continuous glycemic variables were recoded and considered active, whilst the rest were illustrative (supplementary). In the simultaneous graphical representation of the data structure on the factorial coordinates, the levels of fasting glycemia and glucose intolerance appeared in order. The study was completed with a cluster analysis on the factorial coordinates of the individuals, yielding a typology of four classes. When these results were compared with the clinical classification, it was possible to order eSS males from the youngest rats, with low body weight, not glucosuric, with normal fasting glycemia but impaired glucose tolerance, to diabetic individuals that were older, heavier and glucosuric.
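The recoding of a continuous variable into categories before MCA can be sketched as follows. This is a minimal Python illustration under stated assumptions: the glycemia values are invented, and the tertile cut points are a hypothetical choice, since the abstract does not say which cut points were used.

```python
import numpy as np

# Hypothetical fasting glycemia values (illustrative numbers only).
glycemia = np.array([4.8, 5.1, 5.5, 6.0, 6.4, 7.2, 8.9, 10.5, 12.1])

# Assumed recoding rule: tertiles of the observed distribution.
cuts = np.quantile(glycemia, [1 / 3, 2 / 3])

# Category code per individual: 0 = low, 1 = medium, 2 = high.
codes = np.digitize(glycemia, cuts)

# 0-1 indicator block that would enter the MCA as an active variable.
indicator = np.eye(3)[codes]
print(codes)
print(indicator.sum(axis=0))  # number of individuals per category
```

Because the categories keep their natural order, their positions on the factorial plane can then be inspected to see whether the levels line up along an axis.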
We conclude that the typology obtained agrees with the clinical criterion proposed by the American Diabetes Association, allowing the identification of stages in the progression of diabetic syndrome. This confirms the usefulness of multivariate classificatory algorithms in this biological context.
American Diabetes Association. (2002). Report of the Expert Committee on the Diagnosis and Classification of Diabetes Mellitus. Diabetes Care, 25, S5-S20.
Duda, R. O., Hart, P. E. & Stork, D. G. (2001). Pattern Classification. New York: John Wiley & Sons.
Martínez, S. M, Tarrés, M. C., Picena, J. C. et al. (1993). eSS rat, an animal model for the study of spontaneous non-insulin-dependent diabetes. In Lessons from Animal Diabetes IV (ed E. Shafrir), 75-90. London: Smith-Gordon.
Visualizing three-dimensional maps in correspondence analysis
Oleg Nenadić, Daniel Adler & Walter Zucchini
University of Göttingen, Germany
Maps in correspondence analysis are usually displayed in two dimensions. The lack of convenient software militates against the use of a full three-dimensional display in cases where a third dimension would substantially improve the quality. We illustrate how the package RGL can be used for creating three-dimensional displays that can be examined interactively.
Although modern computer hardware provides adequate processing power for real-time visualization in three dimensions, most statistical software packages do not support sophisticated graphics in three dimensions. RGL (see Nenadić et al., 2003; Adler & Nenadić, 2003, for a technical overview) is a library for the statistical computing environment R (Ihaka & Gentleman, 1996) that offers real-time three-dimensional visualization capabilities using OpenGL as the rendering backend. It has been ported to the major platforms Win32 and X11, and is released under the GPL (General Public License, “Copyleft”). The current release can be downloaded from http://22.214.171.124/~dadler/rgl/index.html
RGL has been designed as a general framework for three-dimensional visualization and as such does not offer special purpose functions for particular statistical analyses. It provides basic building blocks (such as points, lines, triangles, planes, surfaces and spheres in three dimensional space) and a number of appearance features (such as lighting properties, transparency effects and texture mapping). A convenient navigational interface for exploring the three-dimensional space using a mouse is supplied. The 21 functions offered by RGL are structured into six categories, with the shape and appearance functions comprising the core. RGL functions are semantically similar to the standard R-commands such as "plot" and "persp" that are familiar to R users. These functions can be used in a very flexible manner to create complex three-dimensional graphics.
In most applications of correspondence analysis the first two dimensions explain a sufficiently high percentage of the total inertia, but in some cases the inclusion of the third dimension improves quality of the display substantially. In such cases it is usual to examine each two-dimensional projection of the three-dimensional map individually, i.e. 1&2, 1&3 and 2&3. In this presentation we will illustrate the visualization capabilities of RGL in the context of correspondence analysis using some examples of application. We show how RGL can be used for interactive exploration of the three-dimensional maps; e.g. to zoom into particular regions in order to examine details of interest. The familiar projections onto two-dimensional space can be viewed by simply moving the viewpoint using the mouse. We illustrate how appearance features offered by RGL (apart from colour) can be used to enhance correspondence analysis displays by incorporating attributes, such as mass and quality, in the display. This capability is especially useful for visualizing maps from stacked tables.
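Whether a third dimension is worth displaying can be judged from the share of inertia it adds. The sketch below is written in Python with NumPy rather than R/RGL, and uses a made-up 5 x 4 contingency table; it computes the principal inertias by the usual singular value decomposition of the standardized residuals and returns three-dimensional principal coordinates that could be passed to any 3D plotting tool. All names and counts are illustrative assumptions.

```python
import numpy as np

# Hypothetical 5 x 4 contingency table (illustrative counts only).
N = np.array([[20, 5, 3, 12],
              [6, 18, 7, 4],
              [4, 6, 22, 9],
              [10, 3, 8, 25],
              [7, 9, 5, 6]], dtype=float)

P = N / N.sum()          # correspondence matrix
r = P.sum(axis=1)        # row masses
c = P.sum(axis=0)        # column masses

# Standardized residuals and their singular value decomposition.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

principal_inertias = sv ** 2
share = principal_inertias / principal_inertias.sum()

# Principal coordinates on the first three axes.
row_xyz = (U[:, :3] * sv[:3]) / np.sqrt(r)[:, None]
col_xyz = (Vt.T[:, :3] * sv[:3]) / np.sqrt(c)[:, None]

print("share of inertia, axes 1-2:", share[:2].sum())
print("share of inertia, axes 1-3:", share[:3].sum())
```

The pairs of columns (1, 2), (1, 3) and (2, 3) of `row_xyz` and `col_xyz` correspond to the three two-dimensional projections mentioned above.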
Adler, D. & Nenadić, O. (2003). A framework for an R to OpenGL interface for interactive 3D graphics, Proceedings of the 3rd International Workshop on Statistical Computing, Vienna (forthcoming).
Ihaka, R. & Gentleman, R. (1996). R: a language for data analysis and graphics, Journal of Computational and Graphical Statistics, 5, 299-314.
Nenadić, O., Adler, D. & Zucchini, W. (2003). RGL: a R-library for 3D visualization with OpenGL, Proceedings of the 35th Symposium of the Interface: Computing Science and Statistics, Salt Lake City (forthcoming).
Multidimensional structure and information
University of Toronto, Canada
Following the tradition of multivariate analysis, the total information is typically given by the sum of eigenvalues of the variance-covariance matrix. Thus, for a data set with five standardized variables, the total information is five, irrespective of the covariances among variables. No-one seems to question this definition.
Nishisato (2002a, 2002b), however, discarded this time-honoured definition. When five standardized variables are perfectly correlated, the first eigenvalue is five, and the remaining eigenvalues are all zero; when five variables are totally uncorrelated, the five eigenvalues are all equal to 1. In both cases, the sum of eigenvalues is five. The key objection to this traditional definition comes from the fact (1) that if all five variables are perfectly correlated, only one variable is needed to explain the data since the other four variables are totally redundant, and (2) that if all the variables are uncorrelated one needs all of them to explain the data. It is not difficult to visualize how the volume of the clouds of data points may be influenced by the correlations between variables. Therefore the conclusion is that the data set of perfectly correlated variables contains much less information than that of totally uncorrelated variables.
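The two extreme cases described above are easy to reproduce. The following Python sketch builds the two 5 x 5 correlation matrices and shows that their eigenvalues sum to five in both cases. The determinant is added purely as an illustrative assumption of how the "volume" of the data cloud might be quantified; it is not claimed to be the measure proposed in the study.

```python
import numpy as np

p = 5
R_perfect = np.ones((p, p))   # five perfectly correlated standardized variables
R_uncorr = np.eye(p)          # five totally uncorrelated standardized variables

# Eigenvalues in descending order.
eig_perfect = np.linalg.eigvalsh(R_perfect)[::-1]
eig_uncorr = np.linalg.eigvalsh(R_uncorr)[::-1]

print(np.round(eig_perfect, 10))
print(np.round(eig_uncorr, 10))
print("sums:", eig_perfect.sum(), eig_uncorr.sum())

# Illustrative volume proxy (an assumption, not the author's measure).
det_perfect = np.linalg.det(R_perfect)
det_uncorr = np.linalg.det(R_uncorr)
print("determinants:", det_perfect, det_uncorr)
```

The sum of eigenvalues cannot distinguish the two situations, whereas a volume-type quantity collapses to zero when the variables are redundant.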
This view was tied to research on dual scaling of discretized continuous variables (Nishisato, 2000, 2002a, 2002b; Eouanzoui, 2003) for a unified treatment of multivariate data. Far-reaching implications for data analysis are noted here since multivariate analysis typically employs eigenvalues as key statistics of information, while the current study offers a different view that the sum of eigenvalues is not an appropriate measure of total information.
The study proposes new measures of information for individual variables in each dimension and total space, and measures of their joint dimensional contribution and contribution to the total space. Consequences of new measures for multivariate analysis of both continuous and categorical data are discussed.
Eouanzoui, K. (2003). On desensitizing data from interval to nominal measurement with minimum loss of information. Doctoral thesis, University of Toronto.
Nishisato, S. (2000). Data analysis and information: beyond the current practice of data analysis. In Classification and Information Processing at the Turn of the Millennium (eds R. Decker & W. Gaul), 40-51, Heidelberg: Springer-Verlag.
Nishisato, S. (2002a). Differences in data structures between continuous and categorical variables from dual scaling perspectives, and a suggestion for a unified mode of analysis. Japanese Journal of Sensory Evaluation, 6, 89-94 (in Japanese).
Nishisato, S. (2002b). Total information in multivariate data from a dual scaling perspective. Paper presented at the Conference in Honour of Prof. Ross E. Traub's Retirement, December, Toronto.
Nishisato, S. (2003). Geometric perspectives of dual scaling for assessment of information in data. In Recent Developments in Psychometrics (eds H.Yanai, A. Okada, K. Shigemasu, Y. Kano & J. Meulman), 453-462. Tokyo: Springer-Verlag.
Multiple factor analysis for contingency tables
Jérôme Pagès & Mónica Bécue-Bertaut
ENSA / INFSA Rennes, France & Universitat Politècnica de Catalunya, Barcelona, Spain
We study, in the correspondence analysis (CA) framework, a set of contingency tables having the same rows. This kind of data is frequently found in surveys, when one qualitative variable is crossed with several others or when surveys from different countries are compared.
CA is usually applied to such multiple contingency tables using one of two methodologies: i) separate CA of each table; ii) CA of row-wise juxtaposed tables.
In the first case a separate CA of each table is performed and principal axes thus obtained are compared. This basic methodology presents two drawbacks: firstly, structures common to the different tables are only pointed out if they correspond to principal axes; secondly, when the row weights differ between the tables, comparisons are not easy. Furthermore, it is difficult to manage the comparison of several maps.
In the second case a CA is performed of all the tables juxtaposed row-wise (Benzécri, 1982; Cazes, 1980). In this analysis, the inertia of the global columns’ set (union of the columns’ sub-sets of each table) can be decomposed, according to Huygens principle, as the sum of the inertia within the columns of each table (within-tables inertia) and of the inertia between the columns of the different tables (between-tables inertia). This between inertia must not intervene in the study of the profiles: for example, in the case of tables coming from different surveys, the between inertia only expresses a difference between the quotas imposed on the samples. Benzécri (1983), Escofier & Drouet (1983) and Cazes & Moreau (2000) proposed the intra-tables correspondence analysis (ITCA) which eliminates the between-tables inertia.
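The decomposition of the column inertia into within-tables and between-tables parts can be illustrated numerically. The Python sketch below uses two invented contingency tables with the same four rows; the counts and variable names are assumptions for illustration only. Column profiles are weighted by their masses and compared with the chi-square metric defined by the global row margin.

```python
import numpy as np

# Two hypothetical contingency tables sharing the same 4 rows.
T1 = np.array([[12, 5, 8], [7, 9, 4], [3, 6, 10], [8, 2, 6]], dtype=float)
T2 = np.array([[20, 15], [5, 25], [10, 10], [30, 5]], dtype=float)
tables = [T1, T2]

N = np.hstack(tables)        # row-wise juxtaposed tables
F = N / N.sum()
row_mass = F.sum(axis=1)     # global row margin, defines the chi-square metric
col_mass = F.sum(axis=0)
profiles = F / col_mass      # column j holds the profile of column j over rows

def chi2_dist2(a, b):
    """Squared chi-square distance between two column profiles."""
    return (((a - b) ** 2) / row_mass).sum()

# The mass-weighted mean of all column profiles is the row margin.
global_centroid = row_mass
total = sum(col_mass[j] * chi2_dist2(profiles[:, j], global_centroid)
            for j in range(N.shape[1]))

within = 0.0
between = 0.0
start = 0
for T in tables:
    cols = list(range(start, start + T.shape[1]))
    mass_t = col_mass[cols].sum()
    # Centroid of the columns of this table (its own row margin profile).
    centroid_t = sum(col_mass[j] * profiles[:, j] for j in cols) / mass_t
    within += sum(col_mass[j] * chi2_dist2(profiles[:, j], centroid_t) for j in cols)
    between += mass_t * chi2_dist2(centroid_t, global_centroid)
    start += T.shape[1]

print("total:", total)
print("within + between:", within + between)
```

The between-tables term depends only on how the row margins of the individual tables differ from the global row margin, which is why it reflects sampling quotas rather than profile structure.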
But two drawbacks remain in this second methodology: some tables can play a dominant role, which conflicts with the aim of a simultaneous analysis; it does not include any reference to the row structure induced by each table. Thus the methodology presented here takes into account the three main problems arising in the simultaneous analysis of several contingency tables having the same rows: the differences between the row margins, the need for balancing the influence of the different tables in a global analysis and the need for a tool to compare the row structures induced by the different tables.
The properties of this method are described and illustrated using contingency tables from an international survey (Lebart et al., 1998) about dishes liked and often eaten.
Benzécri, J. P. (1982). Sur la généralisation du tableau de Burt et son analyse par bandes. Les Cahiers de l’Analyse des Données, 7, 33-43.
Benzécri, J. P. (1983). Analyse de l’inertie intraclasse par l’analyse d’un tableau de contingence. Les Cahiers de l’Analyse des Données, 8, 351-358.
Cazes, P. (1980). Analyse de certains tableaux rectangulaires décomposés en blocs. Les Cahiers de l’Analyse des Données, 5, 145-161; 387-403.
Cazes, P. & J. Moreau (2000). Analyse des correspondances d’un tableau de contingence dont les lignes et les colonnes sont munies d’une structure de graphes bistochastique. In L’Analyse des Correspondances et les Techniques Connexes. Approches Nouvelles pour l’Analyse Statistique des Données (eds J. Moreau, P.A. Doudin & P. Cazes), 87-103. Berlin-Heidelberg: Springer.
Escofier, B. & D. Drouet (1983). Analyse des différences entre plusieurs tableaux de fréquence. Les Cahiers de l’Analyse des Données, 8, 491-499.
Lebart, L., Salem, A. & Berry, E. (1998). Exploring Textual Data. Dordrecht: Kluwer. 181-199.
Using correspondence analysis for exploring regional differences in the educational system. Decentralization, marketization and the social structure of the field of secondary education in four Swedish regions
Mikael Palme, Donald Broady, Mikael Börjesson, Monica Langerth Zetterman, Ida Lidegran, Sverker Lundin & Ingrid Nordqvist
Stockholm Institute of Education; Dept. of Teacher Education, Uppsala University; Dept. of Education, Uppsala University; Chalmers University of Technology, Gothenburg; University College of Gävle, Sweden
Analyzing the effects of the 1991 reform of the Swedish secondary school on the social structure of secondary education in various regional settings, we discuss the use of the Bourdieuan notion of “field” and the employment of correspondence analysis for understanding regional differences in the education system.
In the 1991 reform, all Swedish secondary school study programs were made homogenous in terms of length (3 years) and status as regards formal qualification for the entry into post-secondary education. Affirming the formally equal status of all secondary education study programs in all schools throughout the country, the reform abandoned the previous sharp division between “theoretical” programs and vocational training programs. However, the shift towards a unified secondary education system was accompanied by a parallel shift from a bureaucratic, rule-based management of the education system to a goal and result-oriented type of management (“decentralization”). Great freedom was given to secondary schools to create their own local “profiled” versions of the 16 national study programs, creating a previously unknown heterogeneity of secondary education programs. Also, in 1992, Sweden opted for a voucher system, giving families the right to freely “invest” the public funding for the schooling of their children into any private (“independent”) school, regardless of district or commune boundaries. As a result, the 1990’s witnessed, especially in the large cities, a rapid expansion of independent schools and a sharp increase of secondary school study programs with a local “profile”. In the Swedish capital, Stockholm, the local right-wing government abolished, in 1999, the principle that secondary school students had the right to study in the public school neighbouring their home residence, granting pupils and families the right to compete for entry into any secondary school in the city. As a consequence, the tendency towards homogenization inherent in the reform of secondary education in 1991 was counter-balanced by the creation of an educational market in which both schools and students and their families have to compete.
Using individual data on all pupils in secondary education between 1997 and 2001 (information on schools, educational programs, parents' occupations, income, education level, national origin, housing, place of residence, etc.), the effects of the 1991 reform on the social structure are analysed, comparing the “fields” of secondary education in Stockholm, Gothenburg, Uppsala and the provincial town of Gävle. It is shown that while the social structure of each geographical area is reflected in the social structure of secondary education, the analysis also has to take into account the effects of the specific local, politically determined, management models regulating secondary education. These models, in turn, depend on the social structure of the area concerned, its specific history and its impact on political traditions. Much of the information relevant to these aspects of the analysis of the various educational fields cannot be found in the statistical data on secondary school pupils used in the correspondence analysis as such.
Canonical correspondence analysis, a standard in ecology
Sandrine Pavoine, Anne B. Dufour & Daniel Chessel
Université Claude Bernard Lyon 1, France
Canonical correspondence analysis (CCA) was introduced by ter Braak (1986). This method is widely used in ecology (Birks et al., 1996). It was developed to study the relationship between species composition and environment within sites. A site is a basic sampling unit separated in space or time from other sites. CCA is an extension of correspondence analysis (CA), where CA is viewed as a means of finding site coefficients that maximise the variance of the species' average positions (Hill, 1977). CCA looks for coefficients of environmental variables to obtain a site score that maximises the variance of the average positions of species. This viewpoint corresponds to CA under a linear constraint, where the site scores should represent a synthetic variable (ter Braak, 1987).
In this presentation, we will recall that CCA is an example of a duality diagram. We emphasize that this ordination analysis is a special case of principal component analysis with respect to instrumental variables (PCAIV). This PCAIV is computed after a CA on the array that contains species composition and a principal component analysis (PCA) on the array that contains environmental variables, which may be quantitative, qualitative or both (Kiers, 1994). CCA as a particular PCAIV implies another kind of data interpretation. Indeed, CCA corresponds to the following process. It looks for scores of species to obtain a site score that maximises the variance explained by the multiple regression on environmental variables rather than the total variance in the site positions. This explained variance is the product of the total variance and the coefficient of determination of the multiple regression. If many environmental variables are involved, then the explained variance becomes equal to the total variance and CCA is then equivalent to CA.
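The last point, that the explained variance approaches the total variance as environmental variables are added, can be illustrated with an ordinary least-squares regression. The Python sketch below is a simplification under stated assumptions: the site scores and environmental variables are random numbers, and the regression is unweighted, whereas CCA weights sites by their totals in the species table. It shows the extreme case in which the number of linearly independent environmental variables reaches the number of sites minus one.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sites = 10
site_score = rng.normal(size=n_sites)   # hypothetical site scores

def explained_variance(y, env):
    """Return (explained variance, R^2, total variance) of y regressed on env."""
    X = np.column_stack([np.ones(len(y)), env])   # add an intercept
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ coef
    total_var = np.var(y)
    r2 = 1.0 - np.var(residual) / total_var
    return total_var * r2, r2, total_var

env_few = rng.normal(size=(n_sites, 2))             # 2 environmental variables
env_many = rng.normal(size=(n_sites, n_sites - 1))  # as many as sites minus one

expl_few, r2_few, total_var = explained_variance(site_score, env_few)
expl_many, r2_many, _ = explained_variance(site_score, env_many)

print("R^2 with 2 variables:", r2_few)
print("R^2 with n - 1 variables:", r2_many)
print("explained vs total:", expl_many, total_var)
```

With a coefficient of determination of one, the constraint imposed by the environmental variables no longer restricts the site scores, which is the sense in which CCA then coincides with CA.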
Birks, H. J. B., Peglar, S. M. & Austin, H. A. (1996). An annotated bibliography of canonical correspondence analysis and related constrained ordination methods 1986-1993. Abstracta Botanica, 20, 17-36.
Hill, M. O. (1977). Use of simple discriminant functions to classify quantitative phytosociological data. In Proceedings of the First International Symposium on Data Analysis and Informatics (ed E. Diday), 181-199. Rocquencourt: IRIA.
Kiers, H. A. L. (1994). Simple structure in component analysis techniques for mixtures of qualitative and quantitative variables. Psychometrika, 56, 197-212.
ter Braak, C. J. F. (1986). Canonical correspondence analysis: a new eigenvector technique for multivariate direct gradient analysis. Ecology, 67, 1167-1179.
ter Braak, C. J. F. (1987). CANOCO - a FORTRAN program for Canonical community ordination by [partial][detrended][canonical] correspondence analysis and redundancy analysis, version 2.1, TNO Institute of Applied Computer Science, Wageningen.
Value orientations of adolescents: applications of cluster and correspondence analyses
Andreas Pöge & Jost Reinecke
University of Trier, Germany
Poge@uni-muenster.de & Reinecke@uni-trier.de
Social inequality is one of the prominent research areas in the social sciences. Classical theories (Marx, Weber) emphasize vertical differences in society according to socio-economic differences. With the change from an industrial to a more business- and service-oriented society, a vertical scale of social inequality is no longer sufficient. In addition, horizontal differences are currently under study with concepts such as values, expressive life styles and leisure behaviour. If variables measuring vertical differences are analyzed together with those concepts, people can be classified into distinct milieus (Bourdieu, 1979).
Our research strategy is based on current social inequality research and focuses on the relation between the value orientations of adolescents and their deviant behavior. Studies have addressed this relation, but the theoretical background is often unclear and the empirical analyses are based on simple bivariate analyses (for a discussion, see Hermann, 2003). Expressive life styles, leisure and other peer group behaviour are incorporated in our empirical analysis, but are not part of our presentation.
Our empirical data are part of an ongoing criminological and sociological panel study of adolescents' deviant and criminal behavior. With a self-administered questionnaire data are collected from schools in two German towns (Münster and Duisburg). The sample consists of adolescents from the 7th and 9th grades. Value orientations are classified by cluster analysis and related to behaviour. In a second step the value orientations are analyzed by correspondence analysis. Here, correspondence analysis serves as a confirmation method (for applications see Reinecke & Tarnai, 2000) to validate the results of the first step. Cluster information will be considered in the correspondence analyses as supplementary variables. Comparisons between the cohorts will also be addressed.
Bourdieu, P. (1979). La Distinction. Critique Sociale du Jugement. Paris: Les Editions de Minuit.
Hermann, D. (2003). Werte und Kriminalität. Opladen: Westdeutscher Verlag.
Reinecke, J. & Tarnai, C. (2000) (eds). Angewandte Klassifikationsanalyse in den Sozialwissenschaften. Münster: Waxmann.
Prototype analysis based on similarities
El Mostafa Qannari, Hicham Noçairi & Evelyne Vigneau
ENITIAA-INRA, Nantes, France
We show how the formalization of some data analysis problems in terms of a similarity measure between individuals leads to various statistical methods, some of which are already known and some new. We call this general approach prototype analysis.
Consider a data set and a similarity measure between the individuals. The similarity measure may be computed from the data set itself or may be related to external data. For each individual, we associate a prototype defined as a weighted average of all the individuals, using the entries of the similarity matrix as weights. Thereafter, the method of analysis consists in seeking axes for the representation of the individuals in such a way that individuals are as close to their prototypes as possible. This leads to analyses akin to canonical correlation analysis or PLS2 performed on the original data matrix and the matrix of prototypes.
In the particular case where the similarity measure between individuals is related to the partition of the individuals into various known groups (that is, two individuals have a similarity equal to 1 if they belong to the same group and 0, otherwise), prototype analysis leads to Fisher’s canonical discrimination analysis, or PLS-discriminant analysis.
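The prototype construction described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `prototypes` and `group_similarity` and the toy data are hypothetical, and the weights are normalised so that each row of the similarity matrix sums to one, which is one reading of "weighted average". With the 0/1 group similarity, each prototype is simply the mean of the individual's group.

```python
import numpy as np

def prototypes(X, S):
    """Return one prototype per individual: a weighted average of all
    rows of X, using the matching row of the similarity matrix S as weights."""
    X = np.asarray(X, dtype=float)
    S = np.asarray(S, dtype=float)
    W = S / S.sum(axis=1, keepdims=True)  # normalise so each row of weights sums to 1
    return W @ X

def group_similarity(labels):
    """Similarity equal to 1 for two individuals in the same group, 0 otherwise."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

# Toy data (invented): five individuals, two variables, two known groups.
X = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 0.0], [12.0, 2.0], [14.0, 4.0]])
labels = ["a", "a", "b", "b", "b"]
P = prototypes(X, group_similarity(labels))
print(P)  # each row equals the mean of that individual's group
```

The axes would then be sought by relating `X` to `P`, for instance with canonical correlation analysis or PLS2; that step is not shown here.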
In the case where individual profiles are derived from a contingency table and the similarity measure is the scalar product associated with the chi-square distance, we retrieve correspondence analysis. Within this context, other choices of the similarity measure are also discussed.
We also discuss how prototype analysis can be used in order to predict the components of a mixture from physical/instrumental data.
Correspondence analysis and homogeneity of style in
Tirant lo Blanc
Alex Riba & Josep Ginebra
Universitat Politècnica de Catalunya, Barcelona, Spain
Tirant lo Blanc is the main work in Catalan literature and it has been considered to be the first modern novel in Europe. Its main body was written between 1460 and 1465, but it was not printed until 1490. There is an intense and long lasting debate around its authorship arising from its first edition, where its introduction states that the whole book is the work of Martorell (1413?-1468), while at the end it is stated that the last quarter of the book is by Galba (?-1490). For an overview of this debate, see Riquer (1990) and for tentative attempts at this problem using statistical tools, see Ginebra and Cabos (1998) and Riba & Ginebra (2000).
For the current study we exclude words in italics, and chapters of less than 200 words, leaving 425 chapters with a total of 398,242 words and 13,828 different words. Following the lead of the extensive stylometry literature, we use word length, and the use of function words and vowels to try to detect heterogeneities in the style of the book, that might indicate the existence of two authors. In particular, we classify words according to their number of letters, with a category for all the words of more than nine letters, and build the corresponding 425 x 10 contingency table of ordered rows. We also count the number of appearances of each of the 25 most frequent context-free words in each chapter, forming a 425 x 25 contingency table. Finally, we consider the 425 x 5 table of counts of each vowel in each chapter.
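As a rough illustration of how one row of the word-length table could be built, the sketch below counts words by number of letters and pools everything longer than nine letters into a tenth category. The tokenisation rule (runs of letters), the function name `word_length_row` and the two toy strings are assumptions made for the example; they are not the authors' preprocessing and the strings are not quotations from the novel.

```python
import re
from collections import Counter

MAX_LEN = 10  # category 10 pools all words of more than nine letters

def word_length_row(text):
    """Count the words of a chapter by number of letters (1..9, then 10+)."""
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters, accents included
    counts = Counter(min(len(w), MAX_LEN) for w in words)
    return [counts.get(k, 0) for k in range(1, MAX_LEN + 1)]

# Invented toy "chapters"; the real table has 425 rows.
chapters = [
    "A la fi de la batalla lo cavaller",
    "extraordinariament gran e bell",
]
table = [word_length_row(c) for c in chapters]
for row in table:
    print(row)
```

Stacking one such row per chapter gives a chapters-by-10 contingency table; the function-word and vowel tables could be built the same way with a different counting rule.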
Neither of the two candidate authors left any text comparable to the one under study, and therefore one cannot use discriminant analysis to classify chapters by author. Instead, we explore the three sequences of 425 multinomial observations in the three tables, and the 40 marginal binomial sequences, and observe a clear change in their distributions. Assuming that change to be a sudden one, we find that for most sequences, the maximum likelihood estimate of the change-point is between chapters 371 and 382. Through correspondence analysis, we identify the features that distinguish the chapters before and after that boundary.
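The change-point estimate mentioned above can be illustrated with a minimal sketch for a single sudden change in a sequence of multinomial counts. It assumes one set of category probabilities before the change and another after it, each estimated by pooled relative frequencies; the function names and the five toy rows are hypothetical and much smaller than the 425-chapter tables.

```python
import math

def segment_loglik(totals):
    """Maximised multinomial log-likelihood of a segment, up to a constant,
    with probabilities estimated by the pooled relative frequencies."""
    n = sum(totals)
    return sum(c * math.log(c / n) for c in totals if c > 0)

def change_point(rows):
    """Return tau such that rows[:tau] and rows[tau:] give the highest
    total log-likelihood under a single sudden change."""
    k = len(rows[0])
    best_tau, best_ll = None, -math.inf
    for tau in range(1, len(rows)):
        before = [sum(r[j] for r in rows[:tau]) for j in range(k)]
        after = [sum(r[j] for r in rows[tau:]) for j in range(k)]
        ll = segment_loglik(before) + segment_loglik(after)
        if ll > best_ll:
            best_tau, best_ll = tau, ll
    return best_tau

# Invented counts: the first three "chapters" favour category 1, the last two category 2.
rows = [[30, 10], [28, 12], [31, 9], [12, 28], [10, 30]]
tau = change_point(rows)
print(tau)  # number of rows placed before the estimated change
```

The multinomial coefficients are omitted because they do not depend on the position of the change-point.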
In spite of the fact that the change is rather sharp, correspondence analysis seems to indicate that a few of the chapters appearing after the change-point might be more like the ones before that boundary. That is why we proceed to cluster the rows of the three contingency tables (Greenacre, 1988), using non-hierarchical algorithms based on the fit of generalized linear models for polytomous data (McCullagh & Nelder, 1983). We also discuss simple ways to combine the cluster analysis on each of the three tables to identify the chapters that are most likely being misclassified by the estimated change-point.
Ginebra, J. & Cabos, S. (1998). Anàlisi estadística de l’estil literari; Aproximació a l’autoria del Tirant lo Blanc. Afers, 29, 185-206.
Greenacre, M. J. (1988). Clustering the rows and columns of a contingency table, Journal of Classification, 5, 39-51.
McCullagh, P. & Nelder, J. A. (1983). Generalized Linear Models, 2nd Ed. London: Chapman and Hall.
Riba, A. & Ginebra, J. (2000). Riquesa de vocabulari i homogeneïtat d’estil en el Tirant lo Blanc. Revista de Catalunya, 152, 99-118.
Riquer, M. de (1990). Excurs VI. Martí Joan de Galba i la seva intervenció en la novel·la. Aproximació al Tirant lo Blanc. 285-299. Barcelona: Quaderns Crema.
Stavanger University College, Norway
The point of departure of my paper is the idea of “structural causality of a network of factors” that Pierre Bourdieu advocates in Distinction (Bourdieu, 1979, 1984) to come to terms with the shortcomings of the standard methods in quantitative social research. According to him, the most popular and most widely used methods in quantitative analysis are in fact problematic.
In Distinction this approach is developed by introducing two concepts, or constructs: the space of social positions (the social space) and the space of lifestyles. The social space aims at giving the “universe of material conditions of existence” the best possible representation. The space of lifestyles has a similar aim; it tries to disclose the main oppositions and divisions within the “universe of lifestyles”, a space of “position-takings”, whose contents are the products of the habitus, i.e., judgements, classifications and perceptions, tastes and distastes.
These two representations function as key concepts in this alternative methodology, where the space of social positions tends to command the space of lifestyles in periods of equilibrium. Technically, the social space is the resulting map of the first and second principal axes of a multiple correspondence analysis (MCA) - a network of factors - based on carefully chosen background variables (various indices of economic and cultural capital). In a similar way the space of lifestyles is the resulting map of an MCA of relevant indices of lifestyle signs (another network of factors). My intention is to focus on this alternative methodology and furnish a basis for an empirical evaluation of it.
Thus, I will present results from an ongoing collective research project (with Johs. Hjellbrekke & Olav Korsnes, University of Bergen) on the Norwegian social space. I will present an outline and the characteristics of one version of it, which corresponds well with Bourdieu’s findings. In a second movement the analyses will be undertaken utilising the “reciprocal approach” (Lebart et al., 1984). Then lifestyle components are the “raw material” for an MCA and the resulting map reveals the main divisions among these “products” of the habitus.
A careful comparison of these two independently constructed spaces by focusing on the positions of the individuals reveals that they are structured according to the same set of principles; they are homologous. They are both swayed by the two mechanisms of social differentiation identified by Bourdieu: volume and composition of capital. These two mechanisms are operating both in creating social differences between people in the objective social structure of which the social space is the representation, as well as in catching the formative principles of lifestyles. Further, these homological relations seem to be of a robust nature. They appear invariably, as it seems, in any sub-universe of lifestyle components.
Bourdieu, P. (1979, 1984). Distinction - A Social Critique of the Judgement of Taste. London: Routledge & Kegan Paul.
Lebart, L., Morineau, A. & Warwick, K. (1984). Multivariate Descriptive Statistical Analysis. Correspondence Analysis and Related Techniques for Large Matrices. New York: John Wiley & Sons.
Measure vs variable duality in geometric data analysis
Henry Rouanet & Brigitte Le Roux
Université René Descartes, Paris, France
Rouanet@math-info.univ-paris5.fr & Lerb@math-info.univ-paris5.fr
The formal approach used by Benzécri to develop correspondence analysis (CA) was not an accidental matter of notation, but an integral part of the construction (Benzécri & coll., 1973). The properties of CA are entirely founded on the underlying mathematical theory, essentially abstract linear algebra - with a zest of measure theory - as found in classical textbooks (MacLane, Halmos, etc.): finite-dimensional vector space, homomorphism, scalar product, etc. The cornerstone of the approach is the measure vs variable duality, which formalizes the distinction between two sorts of quantities: those for which grouping units entails summing (adding up) values, such as weights and frequencies, which we call measures (as in mathematical measure theory), versus those for which grouping units entails averaging values, such as scores and rates, which we call variables. This duality is reflected in the duality notation (alias transition notation), putting lower indices for measures and upper indices for variables (Rouanet & Le Roux, 1993).
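A small numerical sketch may help fix the distinction. When units are grouped, the measure (here a weight) is summed while the variable (here a score) is averaged; the example assumes the average is weighted by the measure, and the function name `group_units` and the toy values are hypothetical.

```python
def group_units(weights, scores, groups):
    """Merge units into groups: weights (a measure) are summed,
    scores (a variable) are averaged, using the weights."""
    merged = {}
    for w, s, g in zip(weights, scores, groups):
        total_w, total_ws = merged.get(g, (0.0, 0.0))
        merged[g] = (total_w + w, total_ws + w * s)
    return {g: (tw, tws / tw) for g, (tw, tws) in merged.items()}

# Invented toy data: three units, two of which are merged into "north".
weights = [2.0, 6.0, 4.0]
scores = [10.0, 20.0, 5.0]
groups = ["north", "north", "south"]
result = group_units(weights, scores, groups)
print(result)  # group -> (summed weight, weighted mean score)
```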
In the paper, we describe the role of measure vs variable duality in CA at the following two crucial stages of geometric modelling:
i. Construction of clouds and the chi-square metric. The marginal frequencies of the table firstly provide reference measures over rows and columns. Secondly, they define Euclidean isomorphisms from variable vector spaces to dual measure vector spaces, hence scalar products and Euclidean norms, therefore they determine without arbitrariness the chi-square metric over those spaces.
ii. Principal directions of clouds and principal coordinates. The fundamental mathematical result is that the solution of spectral equations is the singular decomposition of two adjoint homomorphisms and/or the associated bilinear form. Applying these results to CA immediately yields the transition equations and the reconstitution formulas.
Two implications will be briefly discussed:
1. Formal approach vs matrix approach. Translating abstract linear results with the various roles of duality into matrix formulas is an easy task, and does provide a compact format to transmit the algorithm of CA - using matrix formulas as a shorthand - but no more than the algorithm. The converse translation - i.e. from matrix formulas, deciphering the rationale of the procedure – is more of a headache.
2. Methodologically, measure vs variable duality provides firm operational guidelines to practical data analysis, especially for devising the codings most appropriate to the situation under study.
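The two stages listed above can be mirrored in the matrix shorthand the abstract mentions. The sketch below is one common matrix formulation of CA, not the authors' formal construction: marginal frequencies define the chi-square metric, a singular value decomposition gives the principal coordinates, and the last lines check a transition equation and the reconstitution formula numerically. The function name and the small table are assumptions made for the example.

```python
import numpy as np

def correspondence_analysis(N):
    """CA of a contingency table through the SVD of standardised residuals."""
    N = np.asarray(N, dtype=float)
    P = N / N.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row margins (reference measure on rows)
    c = P.sum(axis=0)                    # column margins (reference measure on columns)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    keep = sv > 1e-12                    # drop numerically null axes
    U, sv, Vt = U[:, keep], sv[keep], Vt[keep]
    F = (U / np.sqrt(r)[:, None]) * sv       # row principal coordinates
    G = (Vt.T / np.sqrt(c)[:, None]) * sv    # column principal coordinates
    return P, r, c, sv, F, G

# Invented 4 x 3 table.
N = np.array([[20, 5, 3], [6, 18, 4], [2, 7, 25], [10, 10, 10]])
P, r, c, sv, F, G = correspondence_analysis(N)

# Transition equation: row coordinates are rescaled averages of column coordinates.
row_profiles = P / r[:, None]
F_from_G = row_profiles @ G / sv

# Reconstitution formula: the table is recovered from margins and coordinates.
P_rebuilt = np.outer(r, c) * (1.0 + (F / sv) @ G.T)
print(np.allclose(F, F_from_G), np.allclose(P, P_rebuilt))
```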
Benzécri, J.-P. & coll. (1973). Analyse des Données, Volume 2, Analyse des Correspondances. Paris: Dunod.
Rouanet, H. & Le Roux, B. (1993). Analyse des Données Multidimensionnelles. Paris: Dunod.
Changes in UK leisure patterns (1973-1997)
Loughborough University, United Kingdom
This paper advances the results of Gershuny and Fisher’s (1999) work on ‘leisure in the UK across the 20th century’. The authors investigated leisure time-use patterns of UK residents across a 25-year period by means of multiple classification analysis, which provides estimates of the average time spent on leisure and sports activities and the effects of belonging to a particular sub-group in the population. Even though multiple classification analysis is an excellent tool for the quantification of time-use patterns, its contribution to building a qualitative context is limited. Hence, using the same data set, which was originally drawn from the General Household Survey (GHS), a simple correspondence analysis (CA) was conducted to explore how people’s leisure behaviour and level of sports participation changed by region, sex, age, and profession between 1973 and 1997. Since the leisure questions of the GHS were not asked in a consistent manner over the years, the concept of historic profiles as introduced by Mueller-Schneider (1994), rather than absolute frequencies, was used for this study.
Gershuny and Fisher (1999) revealed that they had difficulties finding the appropriate documentation for the 1973 data file. This problem was reflected in the results of the CA and led to the exclusion of this particular year for many variables. Other results showed that even though there was an apparent north-south division with regard to some leisure activities, patterns of leisure behaviour with regard to the regions varied widely. However, there seemed to be a tendency for ‘new’ leisure styles to emerge in London and the South-East before travelling further north.
The paper will also take ‘leisure studies’ as an example of an interdisciplinary subject that has neglected the wealth of basic statistical information ever since it came into existence. It will be shown how the result of a CA has the potential to integrate geographical, sociological and management issues in leisure research.
Gershuny, J. I. & Fisher, K. (1999). Leisure in the UK across the 20th century, Working papers of the ESRC Research Centre on Micro-social Change, paper number 99-3. Colchester: Institute for Social and Economic Research, University of Essex.
Mueller-Schneider, T. (1994). The visualization of structural change by means of correspondence analysis. In Correspondence Analysis in the Social Sciences (eds M. J. Greenacre & J. Blasius). London: Academic Press.
The environmental impact of Italian farming activities:
testing group membership in surveys through multi-dimensional data analysis
Renato Salvatore & Carlo Russo
University of Cassino, Italy
The study of the impact of human activities on the environment is a current research issue, because of the new directions in European policy in the field of sustainable and environmental-responsible development. However, the unavailability of agri-environmental data is considered a major constraint, preventing analysts from providing reliable assessments (Moxey et al., 1998).
In this paper, a general framework is provided to study the environmental impact of agriculture through farm-level sample surveys, focusing on the system of relationships between environmental and structural data identified by a multiple correspondence analysis. The complex relations between the farm structure and its environmental impact have been empirically identified through exploratory analysis and can be utilized by researchers to infer environmental information from structural data.
The approach utilizes the readily available structural data to infer the environmental behaviour of Italian farms. The methodology is based on testing the membership of sample units in the census typology of farms at different levels of environmental impact. This typology is obtained by multiple correspondence analysis and cluster analysis, in order to group census farms in homogeneous classes (Sabbatini & Russo, 2002). Utilizing sample survey data, in this paper we apply multidimensional data analysis as a tool to estimate group membership in a pre-established typology, using the distribution of the supplementary variables in each group.
In order to design a sampling strategy for the population of farms grouped in homogeneous classes, we have adapted the convex programming approach to the multivariate sample allocation problem (Bethel, 1989) to the needs of clustering procedures. This technique (Innocenzi & Salvatore, 2002) is useful when we do not need population estimates at the analytic or territorial domains level, but the aim is to allocate the sample in strata that can represent the research domains. The method proposed tests the sampling distribution of the categories of the environmental impact supplementary variables, under the null hypothesis of farm membership to a pre-established environmental impact class.
The technique is evaluated using data from the Italian 2000 census of the Lazio region.
The approach can increase the efficiency of the agri-environmental statistic systems in terms of cost reduction, more timely estimates of environmental trend and more meaningful and intelligible statistics based on a multidimensional approach rather than on separate indicators.
Bethel, J. (1989). Sample allocation in multivariate surveys. Survey Methodology, 15, 47-57.
Innocenzi, G. & Salvatore, R. (2002). The implementation of the DPSIR model in the Italian agri-environmental statistic system: methodology issues rising from the 1998 FSS experience. Proceedings of the Eurostat International Conference on the Agricultural Statistics in the new Millennium, Greece (http://www.ariadne2002.gr/en/).
Moxey, A., Whitby, M. & Low, P. (1998). Agri-environmental indicators: issues and choices. Land Use Policy, 15.
Sabbatini, M. & Russo, C. (2002). Assessing agricultural environmental impact: a cluster analysis approach. Proceedings of the Eurostat International Conference on the Agricultural Statistics in the new Millennium, Greece (http://www.ariadne2002.gr/en/).
Cluster analysis and HJ-biplot: a joint approach applied to the evaluation of the adolescent personality
POSTER
Sonia Salvo, Paula Alarcón & Eugenia Vinet
Universidad de la Frontera, Temuco, Chile
The aim of cluster analysis is to organise objects in relation to their own characteristics. There are a number of ways to construct clusters with respect to p variables on n taxonomic units, but in general it is not possible to know directly the particular configuration of variables responsible for each of the groups.
Our work establishes a relationship between cluster analysis and the HJ-Biplot technique (Galindo & Cuadras 1986). This technique can be applied to any data matrix and gives a better simultaneous quality representation for rows and columns projected on a subspace of maximum inertia. This methodology makes it possible to identify in the factorial plane each one of the clusters and the variables associated with these clusters.
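To make the simultaneous representation concrete, the sketch below computes row and column markers from a singular value decomposition, scaling both by the singular values, which is the usual description of the HJ-biplot. Column centring, the function name `hj_biplot` and the simulated data are assumptions of this example; any cluster labels would then be overlaid on the row markers.

```python
import numpy as np

def hj_biplot(X, n_axes=2):
    """Row markers J and column markers H, both in principal coordinates."""
    Xc = X - X.mean(axis=0)                       # centre each variable (an assumption)
    U, sv, Vt = np.linalg.svd(Xc, full_matrices=False)
    J = U[:, :n_axes] * sv[:n_axes]               # row (individual) markers
    H = Vt[:n_axes].T * sv[:n_axes]               # column (variable) markers
    explained = sv[:n_axes] ** 2 / np.sum(sv ** 2)  # share of inertia per axis
    return J, H, explained

# Simulated data: 12 individuals, 4 variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(12, 4))
J, H, explained = hj_biplot(X)
print(J.shape, H.shape, explained)
```

Because rows and columns share the same scaling, both sets of markers have the same spread along each axis, which is what allows them to be read on one plane.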
In our study we illustrate these techniques using a sample of 104 low-level adolescent offenders. The Millon Adolescent Clinical Inventory (MACI), published in 1993, was applied with an associative design within a selective methodology. Using both analysis techniques, four clusters of personality patterns were obtained. The first two were characterised by external behaviour of a destructive type. The other two were characterised by passive and inhibited behaviour. The model explains 28.5% of the index variance of the non-social adaptive. These findings are discussed within the psychological evaluation of adolescents and its applications in forensic psychology.
This work was funded by the project FONDECYT No 1010514, which is gratefully acknowledged.
Galindo, M. P. & Cuadras, C. (1986). Una alternativa de representación simultánea: HJ-biplot. Qüestiió, 10, 13-23.
An application of nonsymmetrical correspondence analysis, based on TUCKALS3 algorithm, to electoral marketing data
POSTER
Sonia Salvo1, Purificación Galindo2, Luis Cid3, Javier Martín2
1Universidad de la Frontera, Chile, 2Universidad de Salamanca, Spain &
3Universidad de Concepción, Chile
Electoral analysis consists in evaluating information obtained from previous elections in order to compile segmented voting records. This targeting task offers a campaign the possibility to fine-tune and direct its communication to certain segments of the electorate, through direct mail, telephone and the internet. Targeting is a particularly useful tool to identify and profile undecided voters. As an illustration we analysed data from a pre-electoral survey on the 1996 elections in Spain. The data came from the “Centro de Investigaciones Sociológicas” (C.I.S.). The survey is based on a census list prepared for the 1996 election in the Spanish State, excluding Ceuta and Melilla. The sample size was 2547 individuals once the missing and wrong data were eliminated. Fifty variables were measured (more details in Dorado et al., 2002). The aim is to describe the preferences of survey respondents regarding their voting intention for state parties (IU, PP, PSOE) depending on their age (18-24; 25-44; and >45 years old) and educational level (no studies; primary level; secondary level; higher level).
We analysed the three-way contingency table (state parties x age x educational level) by partial nonsymmetrical correspondence analysis (Lauro & Balbi, 1999) and by nonsymmetrical correspondence analysis based on the TUCKALS3 algorithm (Salvo, 2002) to illustrate the interpretation, advantages and limitations of the former method.
Dorado, A, Galindo M.P., Vicente-Villardón, J & Vicente-Tavera, S. (2002). El CHAID como herramienta de marketing politico. Esic Market. Vol 111.
Lauro, N. C. & Balbi, S. (1999). The analysis of structured qualitative data. Applied Stochastic Models and Data Analysis, 15, 1-27.
Salvo, S. (2002). Contribuciones al análisis de modelos para variables cualitativas que contemplan variable respuesta. Ph D thesis. Universidad de Salamanca.
Correspondence analysis and classification
Conservatoire National des Arts et Métiers, Paris, France
The use of correspondence analysis for classification purposes goes back to the “prehistory” of data analysis (Fisher, 1940) where one looks for the optimal scaling of categories of a variable X in order to predict a categorical variable Y. When there are several categorical predictors a commonly used technique consists in a two-step analysis: multiple correspondence analysis is first performed on the predictor set, followed by a discriminant analysis using factor coordinates of units as numerical predictors (Bouroche et al., 1977).
However, in banking applications (for example, credit scoring) logistic regression seems to be used more and more instead of discriminant analysis when predictors are categorical. One of the reasons advanced in favour of logistic regression is that it gives a probabilistic model, and it is often claimed among econometricians that its theoretical basis is more solid, but this is arguable. This tendency is also due to the flexibility of logistic regression software, which has been more developed than discriminant analysis software. However, it can easily be proved that discarding non-informative eigenvectors gives more robust results than direct logistic regression, since it is a regularisation technique similar to principal component regression (Hastie et al., 2001). Moreover, correspondence analysis provides an insight into the data, which is always useful.
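The two-step procedure can be sketched as follows, under several assumptions: MCA is computed as a correspondence analysis of the indicator matrix, only the first `n_axes` factors are kept (the regularisation step), and the discriminant step is a plain two-class Fisher direction. The function names and the eight invented units are hypothetical and far too small for a real scoring problem.

```python
import numpy as np

def indicator_matrix(columns):
    """One 0/1 column per category of each categorical predictor."""
    blocks = []
    for col in columns:
        cats = sorted(set(col))
        blocks.append(np.array([[1.0 if v == k else 0.0 for k in cats] for v in col]))
    return np.hstack(blocks)

def mca_coordinates(Z, n_axes):
    """Unit coordinates on the first axes of the CA of the indicator matrix."""
    P = Z / Z.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, _ = np.linalg.svd(S, full_matrices=False)
    return (U[:, :n_axes] / np.sqrt(r)[:, None]) * sv[:n_axes]

def fisher_direction(F, y):
    """Two-class Fisher discriminant direction on the factor coordinates."""
    y = np.asarray(y)
    F0, F1 = F[y == 0], F[y == 1]
    W = (F0 - F0.mean(0)).T @ (F0 - F0.mean(0)) + (F1 - F1.mean(0)).T @ (F1 - F1.mean(0))
    return np.linalg.solve(W, F1.mean(0) - F0.mean(0))

# Invented data: three categorical predictors and a two-class response.
x1 = ["a", "a", "a", "b", "b", "b", "c", "c"]
x2 = ["u", "v", "u", "v", "u", "v", "v", "v"]
x3 = ["p", "p", "q", "q", "p", "q", "q", "q"]
y = [0, 0, 0, 0, 1, 1, 1, 1]

F = mca_coordinates(indicator_matrix([x1, x2, x3]), n_axes=2)  # discard later axes
scores = F @ fisher_direction(F, y)
print(scores)
```

Keeping fewer axes than the full set is what plays the role of regularisation here; the number of axes would normally be chosen by validation rather than fixed at two.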
Since factor coordinates are derived without taking into account the response variable, one could think of adapting partial least squares (PLS) regression. We will show that PLS is related, at least for the first PLS component, to barycentric discrimination (Celeux & Nakache, 1994; Verde & Palumbo, 1996).
For two-class discrimination, we will also present a combination of logistic regression and correspondence analysis, as well as ridge regression which are interesting alternatives. A comparison of all these methods will be illustrated on a real case study.
Bouroche, J. M., Saporta, G. & Tenenhaus, M. (1977). Some methods of qualitative data analysis. In Recent Developments in Statistics (ed J. R. Barra), 749-755. Amsterdam: North-Holland.
Celeux, G. & Nakache, J. P.(1994). Discrimination sur Variables Qualitatives. Paris: Polytechnica.
Fisher, R. A. (1940). The precision of discriminant functions. Annals of Eugenics, 10, 422-429.
Hastie, T., Tibshirani, R. & Friedman, J. (2001). The Elements of Statistical Learning. New York: Springer.
Verde, R. & Palumbo, F. (1996). Analisi fattoriale discriminante non-simmetrica su predittori qualitativi. Atti del Convegno della XXXVIII Riunione Scientifica della Società Italiana di Statistica, Rimini.
Simple, optimal, factor and unidimensional scale scores:
Hans Schadee & Giovanni Battista Flebus
Università di Milano-Bicocca, Milano, Italy
In many analyses individual scores are obtained by forming weighted sums from several observed variables or items. Simple sums, optimal scores, factor scores and scores resulting from unidimensional scaling models - whether cumulative scales (Guttman, Mokken, Rasch) or unfolding and seriation models (Coombs) - have all been used for this purpose. The relations between these techniques are relatively well known, though often ignored in applications. The degree to which the results of one analysis are informative with respect to another model, or whether scores from different models give the same results, is less well known. The empirical investigation of trace functions (item characteristic functions) using local (non parametric) regression of item responses on the total score sheds light on these empirical questions.
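An empirical trace function of the kind mentioned above can be approximated by a local regression of an item response on the total score. The sketch below uses a Gaussian kernel smoother as one possible choice of local regression; the bandwidth, the function name `trace_function` and the six invented response patterns are assumptions, and the total score here includes the item itself.

```python
import math

def trace_function(item, total, grid, bandwidth=1.0):
    """Kernel-weighted mean of the item response at each total score on the grid."""
    curve = []
    for t in grid:
        w = [math.exp(-0.5 * ((s - t) / bandwidth) ** 2) for s in total]
        curve.append(sum(wi * xi for wi, xi in zip(w, item)) / sum(w))
    return curve

# Invented responses of six persons to three dichotomous items (cumulative pattern).
responses = [[0, 0, 0], [1, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 0], [1, 1, 1]]
total = [sum(r) for r in responses]
grid = [0, 1, 2, 3]
easy_item = trace_function([r[0] for r in responses], total, grid)
hard_item = trace_function([r[2] for r in responses], total, grid)
print(easy_item)
print(hard_item)
```

For a cumulative scale the curves should rise with the total score; a single-peaked curve would instead point towards an unfolding model.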
As examples we use psychological test data - Eysenck's neuroticism scale, the Bem Sex Role Inventory, an abridged version of the Adorno F-scale - and data from public opinion surveys on electoral behaviour.
Regularization in multiple-set canonical correlation
Yoshio Takane & Heungsun Hwang
McGill University, Montreal & HEC, Montreal, Canada
Generalized (multiple-set) canonical correlation analysis (GCANO; Carroll, 1968; Horst, 1961) has attracted the attention of many data analysts primarily because it subsumes a number of interesting techniques in multivariate analysis as special cases (Yanai, 1998). More recently, however, it is recognized as an important method of integrating information from multiple sources (Takane & Oshima-Takane, 2001). In this paper we discuss a regularization technique for linear GCANO. Regularization is considered important as a way of solving ill-posed problems, of supplementing insufficient data by prior knowledge, or of incorporating certain desirable properties in the estimates of parameters in the model. We discuss some mathematical properties of a matrix operator involved in the ridge type of regularization method in GCANO and discuss their implications for multiple correspondence analysis (MCA).
Let $X_i$ denote a column-wise centered cases-by-variables matrix for the $i$th data set, and let $X$ denote a super-matrix formed from the $X_i$ ($i = 1, \ldots, K$) arranged side by side. Define $M(\lambda) = I + \lambda (XX')^{-}$, where $\lambda$ is a regularization parameter and $(XX')^{-}$ indicates a g-inverse of $XX'$. Regularized GCANO obtains the generalized eigenvalue-vector decomposition (GEVD) of $X'M(\lambda)X$ with respect to $D(\lambda)$, which is a block diagonal matrix with $D_i(\lambda) = X_i'M(\lambda)X_i$ as the $i$th diagonal block. An optimal value of $\lambda$ is determined by cross validation. Note that when $\lambda = 0$, and consequently $M(\lambda) = I$, the problem reduces to the same GEVD problem as solved in conventional MCA. A small positive value of $\lambda$, on the other hand, yields biased parameter estimates, but with a smaller expected mean square error (Hoerl & Kennard, 1970). It is useful when the number of variables (categories in the case of MCA) is large relative to the sample size.
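The definitions above translate fairly directly into a numerical sketch. It assumes the Moore-Penrose inverse as the g-inverse, continuous blocks of full column rank so that the block-diagonal matrix is invertible, and a fixed regularization value instead of cross validation; the function name `regularized_gcano` and the simulated data are hypothetical.

```python
import numpy as np

def regularized_gcano(blocks, lam):
    """GEVD of X'M X with respect to the block-diagonal D, for a given lam."""
    X = np.hstack(blocks)
    n = X.shape[0]
    M = np.eye(n) + lam * np.linalg.pinv(X @ X.T)  # Moore-Penrose inverse as g-inverse
    A = X.T @ M @ X
    D = np.zeros_like(A)
    start = 0
    for Xi in blocks:                              # fill the diagonal blocks Xi' M Xi
        stop = start + Xi.shape[1]
        D[start:stop, start:stop] = Xi.T @ M @ Xi
        start = stop
    # Solve the GEVD through the symmetric form D^{-1/2} A D^{-1/2}
    # (assumes D is positive definite).
    evals_D, evecs_D = np.linalg.eigh(D)
    D_inv_half = evecs_D @ np.diag(evals_D ** -0.5) @ evecs_D.T
    evals, vecs = np.linalg.eigh(D_inv_half @ A @ D_inv_half)
    order = np.argsort(evals)[::-1]
    return evals[order], D_inv_half @ vecs[:, order]

# Simulated data: K = 3 column-centred blocks observed on the same 30 cases.
rng = np.random.default_rng(0)
blocks = []
for p in (2, 3, 2):
    Xi = rng.normal(size=(30, p))
    blocks.append(Xi - Xi.mean(axis=0))

evals_0, weights_0 = regularized_gcano(blocks, lam=0.0)   # unregularized case
evals_r, weights_r = regularized_gcano(blocks, lam=5.0)   # ridge-type regularization
print(evals_0[:3])
print(evals_r[:3])
```

With indicator matrices, as in MCA, the diagonal blocks are singular after centring, so the inverse square root would have to be replaced by a generalized inverse; that case is not handled in this sketch.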
Some examples are given to illustrate the method.
Carroll, J. D. (1968). A generalization of canonical correlation analysis to three or more sets of variables. Proceedings of the 76th Annual Convention of the American Psychological Association, 227–228.
Hoerl, A. E. & Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12, 55–67.
Horst, P. (1961). Generalized canonical correlations and applications to experimental data. Journal of Clinical Psychology, 17, 331–347.
Takane, Y. & Oshima-Takane, Y. (2001). Nonlinear generalized canonical correlation analysis by neural network models. In Measurement and Multivariate Analysis (eds S. Nishisato et al.), 183–190. Tokyo: Springer Verlag.
Yanai, H. (1998). Generalized canonical correlation analysis with linear constraints. In Data Science, Classification and Related Methods (eds C. Hayashi et al.), 539–546. Tokyo: Springer Verlag.
Increasing Cronbach’s alpha for questionnaire reliability
by using optimal scaling
Dicle Taspinar1, Vedat Coskun2 & Nihat Demirhan2
1Istanbul Commerce University, Turkey & 2Turkish Naval Academy, Turkey
Questionnaires are the most commonly used technique for collecting data in marketing and social research. A decisive criterion of questionnaire reliability is Cronbach’s alpha coefficient: as the value of Cronbach’s alpha increases, the questionnaire becomes more reliable. In this paper, we will prepare a simulated questionnaire in order to determine which choices of the respondents affect the consistency, and therefore the reliability, of the questionnaire.
Flebus (1990), Bernardi (1994) and Barnette (1999) also showed that Cronbach's alpha coefficient can be affected by the observations. Flebus (1990) studied observations and the correlations among variables together with the variance of these variables. He then developed a program which can calculate the point where Cronbach's alpha is maximized. Bernardi (1994) tried to show that some relations can be found between the variables and the observations even if Cronbach’s alpha is small. Barnette (1999) performed a simulation and found a way to increase and decrease Cronbach's alpha by looking at respondent refusals.
We will show that homogeneity analysis, an optimal scaling method, makes it possible to find the observations that decrease consistency. The questionnaire reliability can then be increased simply by removing those observations.
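As a point of reference, Cronbach's alpha and its sensitivity to single observations can be computed directly. The sketch below uses the standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of the total score), and a plain leave-one-out loop; it does not implement homogeneity analysis, and the function names and the six invented respondents are hypothetical.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha for a respondents-by-items list of scores."""
    k = len(rows[0])
    item_vars = [variance([r[j] for r in rows]) for j in range(k)]
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def alpha_without_each(rows):
    """Alpha recomputed after leaving out each respondent in turn."""
    return [cronbach_alpha(rows[:i] + rows[i + 1:]) for i in range(len(rows))]

# Invented answers of six respondents to three items.
rows = [
    [1, 1, 2],
    [2, 2, 2],
    [3, 3, 4],
    [4, 4, 4],
    [5, 5, 5],
    [5, 1, 1],  # a respondent answering inconsistently
]
alpha_all = cronbach_alpha(rows)
loo = alpha_without_each(rows)
worst = max(range(len(rows)), key=lambda i: loo[i])  # removal that raises alpha most
print(round(alpha_all, 3), worst, round(loo[worst], 3))
```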
Barnette, J. (1999). Nonattending respondent effects on the internal consistency of self-administered surveys: a Monte Carlo simulation study. Educational and Psychological Measurement, 59, 38-46.
Bernardi, Richard (1994). Validating research results when Cronbach's alpha is below .70: A methodological procedure. Educational and Psychological Measurement, 54, 766-776.
Flebus, Giovanni Battista (1990). A program to select the best items that maximize Cronbach's alpha. Educational and Psychological Measurement, 50, 831.
Co-correspondence analysis: a new ordination method to relate two species compositions
Cajo J. F. ter Braak & André P. Schaffers
University and Research Centre, Wageningen, The Netherlands
Cajo.firstname.lastname@example.org & Andre.Schaffers@wur.nl
A new ordination method, called co-correspondence analysis, is developed to relate two types of communities of species (e.g. a plant community and an animal community) sampled at a common set of n sites in a direct way. The two data sets contain nonnegative values (abundances), typically with very many zeroes, and have many more variables (species) than statistical units (sites). The method improves the simple, indirect approach of applying correspondence analysis (reciprocal averaging) to the separate species data sets and correlating the resulting ordination axes. Co-correspondence analysis maximizes the weighted covariance between weighted averaged species scores of one community with weighted averaged species scores of the other community. It thus attempts to identify the patterns that are common to both communities. Both a symmetric, descriptive and an asymmetric, predictive form are developed. The symmetric form relates to co-inertia analysis (Dolédec & Chessel, 1994). Predictive co-correspondence analysis relates to correspondence analysis as partial least squares (PLS) regression (Martens & Naes, 1992; ter Braak & de Jong, 1998) relates to principal component analysis.
Co-correspondence analysis uses weighted averages where PLS uses linear combinations (weighted sums), as in ter Braak (1995). The new method performs better than PLS when the data have a unimodal structure, a strong qualitative nature and/or are sum-constrained, i.e. when each data set is better analyzed by correspondence analysis than by principal component analysis (ter Braak & Prentice, 1988).
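The weighted-averaging idea can be illustrated on a single table with ordinary correspondence analysis computed by reciprocal averaging. The sketch below is not co-correspondence analysis itself; it only shows the building block (species scores as weighted averages of site scores, and vice versa) for one sites-by-species table. The function name and the small abundance table are hypothetical.

```python
import numpy as np

def reciprocal_averaging(Y, n_iter=500):
    """First non-trivial CA axis of a nonnegative sites x species table Y."""
    Y = np.asarray(Y, dtype=float)
    r = Y.sum(axis=1)                          # site totals (row weights)
    c = Y.sum(axis=0)                          # species totals (column weights)
    x = np.arange(Y.shape[0], dtype=float)     # arbitrary starting site scores
    eigenvalue = 0.0
    for _ in range(n_iter):
        u = (Y.T @ x) / c                      # species scores: weighted averages of site scores
        x_new = (Y @ u) / r                    # site scores: weighted averages of species scores
        x_new -= (r @ x_new) / r.sum()         # remove the trivial constant solution
        eigenvalue = np.sqrt((r @ x_new**2) / r.sum())  # shrinkage factor, converges to the first inertia
        x = x_new / eigenvalue                 # rescale to unit weighted variance
    return x, u, eigenvalue

# Hypothetical abundance table with a single dominant gradient.
demo_Y = np.array([[10, 5, 0, 0],
                   [4, 8, 3, 0],
                   [0, 3, 9, 2],
                   [0, 0, 4, 10]], dtype=float)
site_scores, species_scores, first_inertia = reciprocal_averaging(demo_Y)
```

At convergence the shrinkage factor equals the first principal inertia of the correspondence analysis of the table, which can be checked against a singular value decomposition of the standardized residuals.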
In two examples the predictive power of co-correspondence analysis is compared with that of canonical correspondence analyses (ter Braak, 1986; ter Braak & Verdonschot, 1995) on syntaxonomic and environmental data. In the first example carabid beetles in roadside verges are shown to be more closely related to plant species composition than to vegetation structure, and in the second example bryophytes in spring meadows are shown to be more closely related to the species composition of the vascular plants than to the measured water chemistry.
Dolédec, S. & Chessel, D. (1994). Co-inertia analysis: an alternative method for studying species-environment relationships. Freshwater Biology, 31, 277-294.
Martens, H. & Naes, T. (1992). Multivariate Calibration. Chichester: Wiley.
ter Braak, C. J. F. (1986). Canonical correspondence analysis: a new eigenvector technique for multivariate direct gradient analysis. Ecology, 67, 1167-1179.
ter Braak, C. J. F. (1995). Non-linear methods for multivariate statistical calibration and their use in palaeoecology: a comparison of inverse (k-nearest neighbours, partial least squares and weighted averaging partial least squares) and classical approaches. Chemometrics and Intelligent Laboratory Systems, 28, 165-180.
ter Braak, C. J. F. & de Jong, S. (1998). The objective function of partial least squares regression. Journal of Chemometrics, 12, 41-54.
ter Braak, C. J. F. & Prentice, I. C. (1988). A theory of gradient analysis. Advances in Ecological Research, 18, 271-317.
ter Braak, C. J. F. & Verdonschot, P. F. M. (1995). Canonical correspondence analysis and related multivariate methods in aquatic ecology. Aquatic Sciences, 57, 255-289.
Correspondence analysis and categorical conjoint
Universidad Carlos III de Madrid, Spain
Quantifying the trade-offs that individuals make when choosing between multidimensional alternatives is a typical task in marketing research, usually handled by conjoint analysis. We want to show that, in a particular case, correspondence analysis (CA) can also be used to analyze conjoint data and that, furthermore, it offers a map which helps to understand the results obtained.
Conjoint analysis can be understood as a technique which predicts what products or services people will prefer and assesses the weight people give to various factors that underlie their decisions. There exist different conjoint algorithms for analyzing such data, depending on the type of conjoint measurement: in our case we are interested in conjoint measurement on a categorical scale. For this case there exists an algorithm due to Carroll (1969) known as categorical conjoint measurement. In this study we show how correspondence analysis can be applied to tables concatenated in a certain way in order to emulate Carroll’s algorithm. We use canonical correlation analysis applied to dummy variables as a bridge in order to show the equivalence in results between categorical conjoint analysis and correspondence analysis. Previous literature has already demonstrated the equivalence between simple correspondence analysis and canonical correlation analysis for two categorical variables (Greenacre, 1984). Our first innovation came with the demonstration of the equivalence between correspondence analysis and canonical correlation analysis for more than two categorical variables.
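The known two-variable equivalence (Greenacre, 1984) can be checked numerically: the non-trivial singular values of a simple CA of the contingency table coincide with the canonical correlations between the two sets of dummy variables. The sketch below only covers this two-variable case, not the extension to more than two variables or to interactions; the function names and the toy data are hypothetical.

```python
import numpy as np

def indicator(labels):
    """Dummy (indicator) matrix with one column per category."""
    labels = np.asarray(labels)
    cats = np.unique(labels)
    return (labels[:, None] == cats[None, :]).astype(float)

def ca_singular_values(N):
    """Singular values of a simple CA of the contingency table N."""
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    return np.linalg.svd(S, compute_uv=False)

def canonical_correlations(X, Y):
    """Canonical correlations between the column spaces of centred X and Y."""
    def basis(M):
        # Orthonormal basis of the column space, dropping degenerate directions.
        U, s, _ = np.linalg.svd(M - M.mean(axis=0), full_matrices=False)
        return U[:, s > 1e-10 * s.max()]
    return np.linalg.svd(basis(X).T @ basis(Y), compute_uv=False)

# Hypothetical responses of ten subjects on two categorical variables.
var_a = np.array(["x", "x", "y", "y", "z", "z", "x", "y", "z", "z"])
var_b = np.array(["p", "q", "p", "q", "q", "q", "p", "p", "q", "p"])
Z1, Z2 = indicator(var_a), indicator(var_b)
sv_ca = ca_singular_values(Z1.T @ Z2)    # CA of the cross-tabulation
cc = canonical_correlations(Z1, Z2)      # CCA on the dummy variables
```

Centring removes the trivial solution on the canonical correlation side, which corresponds to the trivial dimension removed by subtracting the independence model in CA.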
A further issue in the conjoint analysis literature is the study of interaction effects (Green, 1973). We incorporated interactions in canonical correlation analysis for the usual way of coding interactions as well as for a new one, establishing the connection with correspondence analysis and categorical conjoint measurement in the presence of interactions.
As an example we give an application in which a potential interaction effect between the type of fragrance for a perfume and its intensity may exist. For example, a particular subject may prefer floral fragrance as well as low intensity fragrance, but for the particular case of citric fragrance, high intensity is preferred.
Carroll, J. D. (1969). Categorical Conjoint Measurement. Unpublished Manuscript. Bell Laboratories, Murray Hill.
Green, P. E. (1973). On the analysis of interactions in marketing research data. Journal of Marketing Research, 10, 410-420.
Greenacre, M. J. (1984). Theory and Applications of Correspondence Analysis. London: Academic Press.
Resampling methods applied to stability analysis in
multiple correspondence analysis: two case studies in
P O S T E R
Leonardo Trujillo & Elquin A. Huertas
National University of Colombia, Bogotá, Colombia & Citizen Coexistence Project
email@example.com & firstname.lastname@example.org
When a multiple correspondence analysis (MCA) is applied, there are very important questions regarding either stability of the configurations representing the data or the estimators produced from this data. In particular, we are concerned with two types of stability:
(1) Internal (or inner) stability, which refers to the stability of configurations or estimators due to changes in the data inside a particular sample. A question related to inner stability may be: does a particular observation excessively influence the obtained representation or estimators?
(2) External (or outer) stability, which refers to the stability of configurations and estimators to changes in the whole sample. A question related to outer stability is: would it be possible to obtain the same configurations or estimators if we used a different sample with the same characteristics? Thus, a factorial plane can be said to be externally stable if its orientation is minimally altered across several samples from the same population.
Jackknife and bootstrap techniques (Efron, 1982; Shao & Tu, 1995) are used to study inner and outer stability, respectively. The idea is to repeat MCA on each of the simulated samples to study their fluctuations. Additionally, the bootstrap allows us to obtain confidence zones (Lebart et al., 1995) in order to study the stability of some configurations obtained in the factorial planes.
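A minimal sketch of this resampling scheme is given below, assuming MCA is computed as the correspondence analysis of an indicator matrix and that stability is summarised through the leading eigenvalues only (the study of configurations and confidence zones would need the coordinates as well). All function names and the simulated data are hypothetical.

```python
import numpy as np

def mca_eigenvalues(Z, n_axes=2):
    """Leading principal inertias from the CA of an indicator matrix Z."""
    Z = Z[:, Z.sum(axis=0) > 0]               # drop categories absent from this sample
    P = Z / Z.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    return np.linalg.svd(S, compute_uv=False)[:n_axes] ** 2

def jackknife_eigenvalues(Z, n_axes=2):
    """Inner stability: repeat the analysis leaving out one individual at a time."""
    return np.array([mca_eigenvalues(np.delete(Z, i, axis=0), n_axes)
                     for i in range(Z.shape[0])])

def bootstrap_eigenvalues(Z, n_boot=200, n_axes=2, seed=0):
    """Outer stability: repeat the analysis on samples drawn with replacement."""
    rng = np.random.default_rng(seed)
    n = Z.shape[0]
    return np.array([mca_eigenvalues(Z[rng.integers(0, n, size=n)], n_axes)
                     for _ in range(n_boot)])

# Hypothetical data: 30 individuals, 4 questions with 3 categories each.
demo_rng = np.random.default_rng(1)
demo_labels = demo_rng.integers(0, 3, size=(30, 4))
demo_Z = np.hstack([(demo_labels[:, [q]] == np.arange(3)).astype(float)
                    for q in range(4)])
jack = jackknife_eigenvalues(demo_Z)
boot = bootstrap_eigenvalues(demo_Z, n_boot=50)
percentile_band = np.percentile(boot, [2.5, 97.5], axis=0)  # rough 95% band per axis
```

A large change in the eigenvalues when a single individual is removed points to an influential observation, while the spread of the bootstrap replicates gives an impression of how much the axes would vary from sample to sample.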
In this work, two particular applications of stability analyses are presented. First, a study of internal and external stability for two methods of longitudinal data analysis, namely qualitative harmonic analysis (QHA) and STATIS, is performed. These methods were applied to study the mobility of individuals in Bogota, Colombia (Trujillo, 2002). Second, the external stability in the construction of indicators of citizen coexistence in adolescents of Bogota was studied (Huertas & Corzo, 2001).
Efron, B. (1982). The Jackknife, the Bootstrap and other Resampling Plans. Society for Industrial and Applied Mathematics. Philadelphia.
Huertas, E. & Corzo, J. (2001). Análisis de estabilidad de indicadores: Pluralismo en la convivencia ciudadana. Memorias simposio de Estadística. Estadística en la investigación social. Santa Marta - Colombia. Agosto de 2001.
Lebart, L., Morineau, A., & Piron, M. (1995). Statistique Exploratoire Multidimensionnelle. Paris: Dunod.
Shao, J. & Tu, D. (1995). The Jackknife and the Bootstrap. New York: Springer-Verlag.
Trujillo, L. (2002). Estimación de la Varianza de los Valores Propios Estimados para dos Métodos de Análisis de Datos Longitudinales: STATIS Y ACC. Tesis de maestría. Universidad Nacional de Colombia. Facultad de Ciencias. Departamento de Estadística.
Interactive software to produce biplots
Universitat Pompeu Fabra, Barcelona
We analyse and discuss how a generic software to produce biplot graphs should be designed. We describe a data structure appropriate to include the biplot description and we specify the algorithm(s) to be used for different biplot types.
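As an illustration of what such a data structure and algorithm might look like, the sketch below stores a biplot description in a small container and computes a classical SVD-based biplot. It is written in Python for illustration only and is not the Xlisp-Stat implementation discussed in this abstract; the class, field and function names are hypothetical.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Biplot:
    """Hypothetical container for a biplot description."""
    row_coords: np.ndarray        # one point per row of the data matrix
    col_coords: np.ndarray        # one vector per column of the data matrix
    singular_values: np.ndarray   # retained singular values
    alpha: float                  # 1: row-metric-preserving, 0: column-metric-preserving

def svd_biplot(X, n_dims=2, alpha=1.0, center=True):
    """Classical biplot: X is approximated by row_coords @ col_coords.T."""
    X = np.asarray(X, dtype=float)
    if center:
        X = X - X.mean(axis=0)                 # column-centre the data
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U, s, Vt = U[:, :n_dims], s[:n_dims], Vt[:n_dims]
    # Split the singular values between rows and columns according to alpha.
    return Biplot(U * s**alpha, Vt.T * s**(1 - alpha), s, alpha)

# Hypothetical data matrix with 5 individuals and 3 variables.
demo_X = np.array([[2.0, 1.0, 0.5],
                   [1.0, 3.0, 1.5],
                   [4.0, 2.0, 2.5],
                   [3.0, 5.0, 0.0],
                   [0.0, 1.0, 3.0]])
demo_biplot = svd_biplot(demo_X, n_dims=2, alpha=1.0)
```

Keeping the scaling parameter inside the stored description means that an interactive front end could switch between biplot types without refitting from scratch; other biplot types would need their own fitting functions filling the same container.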
We specify the options the software should offer to the user in two different environments. In a highly interactive environment the user should be able to specify many graphical options and also to change them using the usual interactive tools (Bond & Michailidis, 1997). The resulting graph needs to be available in several formats, including high quality printing format. In a Web-based environment, the user submits a data file together with some options specified either in a file or using a form. Then the graphic is sent back to the user in one of several possible formats according to the specifications.
We review some of the already available software (for example, Lipkovich & Smith, 2002) and we present an implementation of the proposed software based on Xlisp-Stat. It can be run under Unix or Windows, and it is also part of a web service that provides biplot graphs through the web.
Preliminary information will eventually be available at http://gauss.upf.es/xls-biplot/ and http://gauss.upf.es/bp-form.html.
Bond, J. & Michailidis, G. (1997). Interactive correspondence analysis in a dynamic object-oriented environment. Journal of Statistical Software, 2, 8.
Lipkovich, I. & Smith, E. P. (2002). Biplot and singular value decomposition macros for Excel. Journal of Statistical Software, 7, 5.
Multiple correspondence analysis to explore relationships among genetic polymorphisms
Joan Valls, Elisabet Guinó & Víctor Moreno
Catalan Institute of Oncology, Barcelona, Spain
email@example.com, firstname.lastname@example.org & email@example.com
The polymorphisms detected in multiple genes that are related to processes of xenobiotic metabolism or inflammation could partially explain the variability in cancer predisposition. New technology in molecular biology based on DNA microarrays helps to determine polymorphisms in hundreds of genes simultaneously. Classical statistical techniques, useful in the analysis of few variables, lose their utility when the number of variables is close to or higher than the number of observations. In these cases, dimension reduction techniques can be useful. These methods help to identify variation patterns or groups of variables, and suggest hypotheses that can be tested using other techniques.
In this paper multiple correspondence and cluster analysis will be used to explore frequency patterns of categorical variables, such as the different variants identified in multiple genes of interest in colorectal cancer.
We have determined 150 polymorphisms in 50 genes related to inflammation or metabolism in a group of 323 patients with colorectal cancer and 283 hospital controls. For each polymorphism, the genotype has been identified and classified as a categorical variable with three levels (normal homozygous, heterozygous, variant homozygous). Multiple correspondence analysis has been applied to the whole set of polymorphisms (independently of the group of patients) and the categories have been represented in factorial biplots.
Initially, in order to simplify the exploratory analysis, only 5 polymorphisms in different genes have been selected (IL6, IL8, PPARG, NFKB, TNF) and the categories of variant homozygous and heterozygous have been combined assuming a dominant effect. Subsequently, cluster analysis, applied to the new factors created, helps us to understand the relationships between the polymorphisms.
Escofier, B. & Pagès, J. (1988). Análisis Factoriales Simples y Múltiples: Objetivos, Métodos e Interpretación. Bilbao: Servicio editorial de la Universidad del País Vasco.
Greenacre, M. J. (1984). Theory and Applications of Correspondence Analysis. London: Academic Press.
Lebart, L., Morineau, A. & Warwick, K. M. (1984). Multivariate Descriptive Statistical Analysis: Correspondence Analysis and Related Techniques for Large Matrices. New York: John Wiley & Sons.
Inverse correspondence analysis
Michel van de Velden & Patrick Groenen
Universitat Pompeu Fabra, Barcelona, Spain & Erasmus Universiteit Rotterdam,
firstname.lastname@example.org & email@example.com
In correspondence analysis, rows and columns of a data matrix are depicted as points in low-dimensional space. The row and column profiles are approximated by minimizing the so-called weighted chi-squared distance between the original profiles and their approximations. In this paper, we will study the inverse correspondence analysis problem, that is, the possibilities of retrieving one or more data matrices from a low-dimensional correspondence analysis solution. We will show that there exists a nonempty closed and bounded polyhedron of such matrices. We also present an algorithm to find the vertices of the polyhedron. A proof that the maximum of the Pearson chi-squared statistic is attained at one of the vertices is given. In addition, it is discussed how extra equality constraints on some elements of the data matrix can be imposed on the inverse correspondence analysis problem. As a special case, we present a method for imposing integer restrictions on the data matrix as well. The approach to inverse correspondence analysis followed here is similar to the one employed by De Leeuw and Groenen (1997) in their inverse multidimensional scaling problem.
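To make the forward problem concrete, the sketch below computes a correspondence analysis by singular value decomposition and applies the usual reconstitution formula with a chosen number of dimensions. It is not the polyhedron or vertex-finding algorithm of this paper: it only produces one matrix with the same margins and the same retained dimensions, which may contain non-integer or even negative entries. The function name and the toy table are hypothetical.

```python
import numpy as np

def ca_reconstitution(N, n_dims):
    """Rebuild a table from its margins and the first n_dims CA dimensions."""
    N = np.asarray(N, dtype=float)
    n = N.sum()
    P = N / n
    r, c = P.sum(axis=1), P.sum(axis=0)                   # row and column masses
    expected = np.outer(r, c)                             # independence model
    S = (P - expected) / np.sqrt(expected)                # standardized residuals
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    S_k = (U[:, :n_dims] * s[:n_dims]) @ Vt[:n_dims]      # rank-limited residuals
    return n * (expected + np.sqrt(expected) * S_k)

# Hypothetical 3 x 3 contingency table.
demo_N = np.array([[20, 10, 5],
                   [8, 15, 12],
                   [3, 9, 18]], dtype=float)
approx_1d = ca_reconstitution(demo_N, n_dims=1)   # one-dimensional approximation
exact = ca_reconstitution(demo_N, n_dims=2)       # full dimensionality of a 3 x 3 table
```

With all dimensions retained the original table is recovered exactly, and with fewer dimensions the margins are still preserved; the inverse problem studied here asks which other data matrices are compatible with the same low-dimensional solution.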
De Leeuw, J. & Groenen, P. J. F. (1997). Inverse multidimensional scaling. Journal of Classification, 14, 3-21.
Measuring values or preferences in a cross-national
context: rating or ranking?
Hester van Herk & Michel van de Velden
Vrije Universiteit Amsterdam, The Netherlands & Universitat Pompeu Fabra, Barcelona, Spain
firstname.lastname@example.org & email@example.com
To measure values or preferences both ratings and rankings are often used. With rankings individuals are explicitly forced to express an ordering of their preferences or values. Ratings on the other hand, allow the individuals to express their preferences or values more freely on a particular, usually predefined scale.
No consensus exists about which method should be preferred for studying preferences or values in a cross-national context. Some argue that ranking is the most appropriate (e.g., Kamakura & Mazzon, 1991), whereas others argue that ratings should be preferred (Klein & Artzheimer, 1999), especially in a cross-national context (Ng, 1982). Unfortunately, comparisons of the two measurement methods are typically made at the aggregate level across all individual subjects. Results at this aggregate level indicate that rankings and ratings provide similar results. Moreover, most studies comparing the two measurement types use a between-subject design in which subjects either rated or ranked the items. Consequently a comparison at the individual level is not possible. An exception is the study by Russell and Gray (1994), who let the same subjects rate and rank the same item set. However, this study was done in one country only.
In this paper we consider, at the individual level, the relationships between ratings and rankings across countries. A sample is used from five countries in the European Union: Germany, the UK, France, Italy and Spain, including more than 4000 respondents. Each of the respondents supplied ratings and rankings for the same group of items. Insight is given into the relative merits of rating and ranking measurement by using three-way correspondence analysis (Carlier & Kroonenberg, 1996). This technique enables us to model the expected ordinal character in both scales, as well as differences and similarities between countries. Results show that the way in which people assess rating and ranking measurement procedures is similar across these countries even if substantive content of items differs.
Carlier, A. & Kroonenberg, P. M. (1996). Decompositions and biplots in three-way correspondence analysis. Psychometrika, 61, 355-373.
Kamakura, W. A. & Mazzon, J. A. (1991). Value segmentation: a model for the measurement of values and value systems. Journal of Consumer Research, 18, 208-218.
Klein, M. & Artzheimer, K. (1999). Ranking und Rating Verfahren zur Messung von Wertorientierungen, untersucht am Beispiel des Inglehart-Index. Empirische Befunde eines Methodenexperiments. Kölner Zeitschrift für Soziologie und Sozialpsychologie, 51, 550-564.
Ng, S. H. (1982). Choosing between the ranking and rating procedures for the comparison of values across cultures. European Journal of Social Psychology, 12, 169-172.
Russell, P. A. & Gray, C. D. (1994). Ranking or rating? Some data and their implications for the measurement of evaluative response. British Journal of Psychology, 85, 79-92.
A comparison of correspondence analysis
and nonmetric item response models
Wijbrandt van Schuur & Jörg Blasius
University of Groningen, The Netherlands & University of Bonn, Germany
firstname.lastname@example.org & email@example.com
In the world of survey research we can distinguish at least three different schools by the way they go about measuring their concepts. First, there is the school that starts with the concept of a Likert scale, continues with reliability analysis, and ends with factor analysis and its offspring. In this school, measurement is always metric, the interval properties of the variables are taken for granted, and there is little emphasis on systematic differences among the items.
The second school dates back to Thurstone, continues with Guttman, Coombs, Lazarsfeld and Henry, and ends with item response models such as the parametric Rasch model, the nonparametric Mokken model, and their offspring. Here the interval (or ordinal) properties of the variables are derived from the measurement model, and are subject to falsification. Some adherents claim that only the most parsimonious item response theory (IRT) model leads to 'objective' measurement in which measurements are externally valid, in that they are comparable over different groups of subjects in different times and places.
Finally, the third school, which we shall refer to here as correspondence analysis and multiple correspondence analysis, goes back to Hirschfeld. Its major adherents are Benzécri, Gifi, Greenacre, and Nishisato.
In this paper we will make some comparisons between the second and third schools, and relate nonmetric item-response models (Van Schuur, 2003) to multiple correspondence analysis with respect to:
· types of data (dichotomous, rating scales, pick or rank k/n or any/n data)
· interpretation of variables (dependent, independent, indicator, intervening, active, passive)
· representation in one or more dimensions
· representation of variables or of response categories
· measurement of subjects
· sensitivity to frequency distributions
· top-down and bottom-up approaches to finding interpretable structure in the data.
Our findings will be illustrated with a variety of data sets.
Schuur, W. H. van (2003). Mokken scale analysis: between the Guttman scale and parametric item response theory. Political Analysis, 11, 139-163.
Multiple correspondence analysis for symbolic data
Seconda Università di Napoli, Italy
In recent years, the development of symbolic data analysis has yielded many methods for the synthesis and the representation of complex information, expressed in terms of symbolic objects (Bock & Diday, 2000).
According to the definition given by Diday (1989), a symbolic object (SO) is a suitable way of modelling a concept. It can be described by a set of multi-valued variables (multi-categorical, intervals, distributions); furthermore, logical rules can be included in the SO’s description in order to reduce the description space of such variables.
In the framework of factorial methods for representing symbolic data, the present work aims to provide an extension of multiple correspondence analysis (MCA) to the study of data described by multi-categorical variables. Specifically, starting from the classical application of MCA to multiple binary data tables within the generalised canonical analysis (GCA) approach, an extension is suggested for the case where the descriptors of the symbolic objects are multi-categorical. Moreover, a possible generalization of GCA (Verde, 1998) has been proposed for the case where several kinds of descriptors are present in the SO’s description.
The proposed procedure is based on a quantification phase of the symbolic descriptors: relational operators are used on the transformed data in order to preserve the information about the relationships among such descriptors. A fuzzy coding of the data is performed, as well as an evaluation of the quality of representation of the SOs on the factorial plane.
In order to provide a suitable visualisation of symbolic data on factorial planes, we propose different kinds of representation forms (convex polygons) and a symbolic interpretation of the factorial axes.
The criterion optimized in the analysis is a kind of squared correlation ratio on the factorial variables.
Finally, an application on will be performed using SODAS software, allowing us to validate the proposed approach.
Bock, H. H. & Diday, E. (2000). Analysis of symbolic data, exploratory methods for extracting statistical information from complex data. Studies in Classification, Data Analysis and Knowledge Organisation. Springer-Verlag.
Diday, E. (1989). Knowledge representation and symbolic data analysis. In Proceedings of Second International Workshop on Data, Expert Knowledge, and Decision. Hamburg.
Verde, R. (1998). Generalised canonical analysis on symbolic objects. In Classification and Data Analysis (eds M. Vichi & O. Opitz), 195-202. Heidelberg: Springer Verlag.
Logistic biplot for binary data
José Luis Vicente Villardón1, M. P. Galindo Villardón1, Miguel Yánez-Alvarado2 & Antonio Blázquez Zaballos1
1Universidad de Salamanca, Spain & 2Universidad de Los Lagos, Osorno, Chile
Classical biplot methods allow for the simultaneous representation of individuals and variables in a data matrix of continuous variables. When variables are binary (presence/absence), a classical linear biplot representation is not suitable and multiple correspondence analysis is commonly used.
In this paper we propose a linear biplot representation based on logistic response models, closely related to latent trait models and item response theory. The geometry of the biplot is such that the coordinates of individuals and variables are calculated to have logistic responses along the latent dimensions.
Gabriel (1998) took into account the probability distributions of the identically distributed manifest variables and adjusted the biplot using generalised bilinear regression. However, that procedure was developed for contingency tables, and has some problems when applied to a matrix of individuals by variables, owing to the size of the matrices involved. The biplot method proposed in this paper has been developed for data matrices that contain individuals by variables.
The main characteristic of the proposal is that, although it is based on a non-linear response model, the representation is linear: the directions of the variable vectors on the biplot show the directions of increasing logit values and therefore, the directions in which the probability of having the characteristic increases, with optimum fit.
A modification of the Newton-Raphson method (Murray, 1972) is considered for the estimation of parameters by joint maximum likelihood, leading to a procedure similar to that used by Baker (1992).
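The response model behind such a biplot can be sketched as follows, assuming that the logit of the probability for individual i and variable j is an intercept plus the inner product of their coordinates. For brevity the sketch maximises the joint likelihood by plain gradient ascent, which is a substitute for the modified Newton-Raphson procedure mentioned above; the parameter names, step size and data are hypothetical.

```python
import numpy as np

def logistic_biplot_probs(A, b0, B):
    """P(x_ij = 1) with logit_ij = b0[j] + sum_k A[i, k] * B[j, k]."""
    logits = b0[None, :] + A @ B.T
    return 1.0 / (1.0 + np.exp(-logits))

def log_likelihood(X, A, b0, B):
    """Joint Bernoulli log-likelihood of the binary matrix X."""
    p = np.clip(logistic_biplot_probs(A, b0, B), 1e-12, 1 - 1e-12)
    return float(np.sum(X * np.log(p) + (1 - X) * np.log(1 - p)))

def fit_gradient_ascent(X, n_dims=2, n_iter=200, step=0.05, seed=0):
    """Joint maximum likelihood by simple gradient ascent (illustrative only)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    A = 0.1 * rng.standard_normal((n, n_dims))   # individual coordinates
    B = 0.1 * rng.standard_normal((p, n_dims))   # variable directions
    b0 = np.zeros(p)                             # variable intercepts
    for _ in range(n_iter):
        resid = X - logistic_biplot_probs(A, b0, B)   # derivative of the log-likelihood w.r.t. the logits
        A, B, b0 = (A + step * resid @ B,
                    B + step * resid.T @ A,
                    b0 + step * resid.sum(axis=0))
    return A, b0, B

# Hypothetical presence/absence matrix: 8 individuals, 4 variables.
demo_X = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 0],
                   [0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 1]], dtype=float)
A_hat, b0_hat, B_hat = fit_gradient_ascent(demo_X)
```

Each row of the variable directions points towards increasing logit for that variable, so projecting an individual onto it orders the individuals by predicted probability. With perfectly separable data the joint likelihood has no finite maximum, so a practical implementation would need some safeguard that this sketch omits.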
The method is illustrated using real data.
Baker, F. B. (1992). Item Response Theory. Parameter Estimation Techniques. New York: Marcel Dekker.
Gabriel, K. R. (1998). Generalised bilinear regression. Biometrika, 85, 689-700.
Murray, W. (1972). Numerical Methods for Unconstrained Optimization. London: Academic Press.
Amaya Zárraga & Beatriz Goitisolo
Universidad del País Vasco / Euskal Herriko Unibertsitatea, Bilbao, Spain
firstname.lastname@example.org & email@example.com
The classical method used to analyze several contingency tables consists of performing a separate correspondence analysis on each one and/or a correspondence analysis of the juxtaposition of the tables (if the rows of the tables, for example, are the same units). However, the results of these processes can be affected, as Bécue-Bertaut & Pagès (2000) point out, by differences between the marginal row profiles of the different tables and by the relative importance of the tables in the analysis, measured through the contributions of the columns. This, in turn, is due to the differences between the grand totals of the tables and to the differences of structure intensity between the tables.
The aim of this work is to present a new method of factor analysis called simultaneous analysis (Zárraga & Goitisolo, 2002, 2003) which is based on the already known technologies of correspondence analysis and multiple factor analysis (Escofier & Pagès, 1988). Simultaneous analysis allows the treatment and joint study of several tables of information, solving the problems encountered with classical techniques. Simultaneous analysis is especially suitable for the study of several contingency tables and, by extension, complete disjunctive tables, incomplete disjunctive tables and Burt or pseudo-Burt tables.
The proposed method of analysis allows us:
· to balance the influence of the tables, transforming the values of each one;
· to balance the influence of the tables according to the differences in structure intensity between them; and
· to preserve both the weight and the metric of each table in an overall factor analysis.
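The balancing idea can be sketched in the style of multiple factor analysis, under the assumption that each table is first analysed with its own total and margins and then divided by its first singular value before the joint decomposition. This is an illustration of the weighting principle only; the details of simultaneous analysis as defined by the authors may differ. Function names and the two toy tables are hypothetical.

```python
import numpy as np

def standardized_residuals(N):
    """CA standardized residuals of one table, using its own total and margins."""
    P = N / N.sum()                       # dividing by the table's own total removes size effects
    r, c = P.sum(axis=1), P.sum(axis=0)
    return (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

def weighted_joint_analysis(tables):
    """Weight each table by 1 / (its first singular value), then analyse jointly."""
    blocks = []
    for N in tables:
        S = standardized_residuals(np.asarray(N, dtype=float))
        first_sv = np.linalg.svd(S, compute_uv=False)[0]
        blocks.append(S / first_sv)       # every table now has leading singular value 1
    joint = np.hstack(blocks)             # tables share the same rows
    U, s, Vt = np.linalg.svd(joint, full_matrices=False)
    return blocks, U, s, Vt

# Hypothetical tables with the same rows but very different grand totals.
table_1 = np.array([[30, 10, 5], [10, 25, 10], [5, 10, 40]], dtype=float)
table_2 = np.array([[300, 250, 200], [220, 310, 240], [180, 260, 330]], dtype=float)
demo_blocks, demo_U, demo_s, demo_Vt = weighted_joint_analysis([table_1, table_2])
```

After the weighting no single table can dominate the first joint axis, since the leading singular value of the joint analysis lies between 1 and the square root of the number of tables.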
Bécue-Bertaut, M. & Pagès, J. (2000). Analyse factorielle multiple intra-tableaux. Application à l’analyse simultanée de plusieurs questions ouvertes. In JADT 2000: 5es Journées Internationales d’Analyse Statistique des Données Textuelles.
Escofier, B. & Pagès, J. (1988) (third edition 1998). Analyses Factorielles Simples et Multiples. Objectifs, Méthodes et Interprétation. Paris: Dunod.
Zárraga, A. & Goitisolo, B. (2002). Méthode factorielle pour l’analyse simultanée de tableaux de contingence. Revue de Statistique Appliquée, 50, 47-70.
Zárraga, A. & Goitisolo, B. (2003). Étude de la structure inter-tableaux à travers l’analyse simultanée. Revue de Statistique Appliquée (forthcoming).
Constrained ordination analysis with flexible response curves
Mu Zhu & Trevor J. Hastie
University of Waterloo, Canada & Stanford University, U.S.A.
firstname.lastname@example.org & email@example.com
Canonical correspondence analysis, or CCA (ter Braak, 1986), is a popular multivariate method for constrained ordination analysis. By straightforward manipulations with matrix algebra, it can be shown (e.g., Takane et al., 1991; Zhu, 2001) that CCA is equivalent to Fisher's linear discriminant analysis (LDA), but this equivalence is apparently not widely known among practitioners.
We provide a more intuitive (and less algebraic) argument to show how this equivalence can be understood directly in the context of the Gaussian response model, a model that is widely used in constrained ordination analysis.
The Gaussian response model, however, only provides a reasonable approximation if the species have unimodal and symmetric response functions. Canonical correspondence analysis also implicitly assumes that the species have the same tolerance level, perhaps the most unreasonable simplification of all. There is growing empirical evidence (e.g., Johnson & Altman, 1999) that such assumptions are often violated in practice.
We show that, by exploiting the equivalence between CCA and LDA, we can model the response functions much more flexibly in constrained ordination analysis. In particular, a nonparametric generalization of Fisher’s LDA (Zhu & Hastie, 2003) can be applied. This allows the species to have different tolerance levels: for example, they can even have response functions that are asymmetric and multimodal.
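The role of the symmetry assumption can be illustrated with a small numerical sketch. Under a Gaussian response curve with optimum u and tolerance t, the abundance-weighted average of the gradient values recovers the optimum, whereas for an asymmetric curve it does not. The sketch uses noise-free expected abundances on an evenly sampled gradient; the function names, parameter values and the asymmetric curve are hypothetical, and the nonparametric LDA generalization itself is not shown.

```python
import numpy as np

def gaussian_response(x, optimum, tolerance, maximum=1.0):
    """Expected abundance along a gradient under the Gaussian response model."""
    return maximum * np.exp(-0.5 * ((x - optimum) / tolerance) ** 2)

def asymmetric_response(x, optimum, tol_left, tol_right, maximum=1.0):
    """Unimodal but skewed curve: different tolerances on each side of the optimum."""
    tol = np.where(x < optimum, tol_left, tol_right)
    return maximum * np.exp(-0.5 * ((x - optimum) / tol) ** 2)

def weighted_average_optimum(x, abundance):
    """Weighted-averaging estimate of a species optimum."""
    return float(np.sum(abundance * x) / np.sum(abundance))

# Evenly sampled gradient and noise-free expected abundances.
gradient = np.linspace(-10.0, 10.0, 2001)
y_sym = gaussian_response(gradient, optimum=1.5, tolerance=1.0, maximum=20.0)
y_asym = asymmetric_response(gradient, optimum=1.5, tol_left=1.0, tol_right=3.0, maximum=20.0)
est_sym = weighted_average_optimum(gradient, y_sym)     # stays at the optimum
est_asym = weighted_average_optimum(gradient, y_asym)   # pulled towards the long tail
```

In the asymmetric case the weighted average drifts away from the mode of the curve, which is one way to see why methods built on the symmetric Gaussian model can mislead when the assumption fails.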
Johnson, K. W. & Altman, N. S. (1999). Canonical correspondence analysis as an approximation to Gaussian ordination. Environmetrics, 10.
Takane, Y., Yanai, H. & Mayekawa, S. (1991). Relationships among several methods of linearly constrained correspondence analysis. Psychometrika, 56, 667-684.
Zhu, M. (2001). Feature Extraction and Dimension Reduction with applications to Classification and the Analysis of Co-occurrence Data. Ph.D. dissertation, Stanford University.
Zhu, M. & Hastie, T. J. (2003). Feature extraction for nonparametric discriminant analysis. Journal of Computational and Graphical Statistics, 12, 101-120.