Multiple factor analysis of mixed tables: a proposal for analysing problematic metric variables

Elena Abascal Fernández^{1}, Maria Isabel Landaluce Calvo^{2} & Ignacio Garcia Lautre^{1}

^{1}Universidad Pública de Navarra & ^{2}Universidad de Burgos, Spain

eabascal@unavarra.es, iland@ubu.es & nacho@unavarra.es

It is commonly accepted that principal component analysis (PCA) is a suitable method for the graphical analysis of a rectangular table composed of *n* individuals and *p* metric variables. The objective is to study the association between variables and the similarities between individuals; PCA allows us to reduce the dimension of the table and to project variables and individuals onto the factorial axes. However, some problems emerge depending on the type of data. Two main points are considered in this work:

· Presence of variables with irregular and/or asymmetric distribution or with a large amount of zero values.

· Non-linear relationships between variables.

The linear correlation coefficient is not an adequate measure of the relationship between variables in these two situations, so PCA does not seem a suitable method in these cases. An analyst who has not detected these problems will probably interpret the results incorrectly.

The objective of this work is twofold. First, we propose a new way to analyse this type of table, converting the problematic variables to qualitative variables (Escofier & Pagès, 1986). We then apply multiple factor analysis (MFA - see Escofier & Pagès, 1986, 1990, 1994) to the resulting mixed table and compare the results with those of other approaches suggested in the literature. The second objective is to develop explicitly the formulas that allow us to interpret an MFA of this type of mixed table. We explain how to interpret the MFA planes, which maintain, at the same time, the characteristics of PCA planes for quantitative variables and the characteristics of MCA (multiple correspondence analysis) planes for qualitative variables (see, for example, Lebart et al., 1995).
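As a hedged illustration of the recoding step only (not the authors' procedure), the sketch below codes a non-negative metric variable with many zeros into a qualitative variable, with one category for the zeros and quantile classes for the positive values, and then builds the indicator columns that would enter the mixed table. The function names, the default of three classes and the use of quantile cut points are assumptions made for the example.

```python
import numpy as np

def code_zero_inflated(x, n_classes=3):
    """Code a non-negative variable: 0 for zeros, 1..n_classes for
    quantile classes of the strictly positive values (assumed scheme)."""
    x = np.asarray(x, dtype=float)
    codes = np.zeros(x.shape, dtype=int)
    pos = x > 0
    # Interior quantiles of the positive values serve as cut points.
    probs = np.linspace(0, 1, n_classes + 1)[1:-1]
    cuts = np.quantile(x[pos], probs)
    codes[pos] = 1 + np.searchsorted(cuts, x[pos], side="right")
    return codes

def indicator_matrix(codes):
    """One 0/1 column per observed category (complete disjunctive coding)."""
    codes = np.asarray(codes)
    cats = np.unique(codes)
    return (codes[:, None] == cats[None, :]).astype(int)
```

The indicator columns produced this way would form the qualitative group of the mixed table, alongside the metric variables that are left unchanged.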

Escofier, B. & Pagès, J. (1990). *Analyses Factorielles Simples et Multiples*. Paris: Dunod.

Escofier, B. & Pagès, J. (1994). Multiple factor analysis (MFULT package). *Computational Statistics & Data Analysis*, **18**, 121-140.

Lebart, L., Morineau, A. & Piron, M. (1995). *Statistique Exploratoire Multidimensionnelle*. Paris: Dunod.

Measuring the degree of congruence between groups of related respondents: an application of correspondence analysis

Zerrin Aşan

Anadolu University, Turkey

zasan@anadolu.edu.tr

A point of interest in scientific fields such as psychology, market research and education is the degree to which related groups agree on a particular research topic. For instance, it could be important to examine the degree to which the points of view of pupils, teachers and school directors differ.

One way to measure those differences or similarities for categorical data is to analyze the congruence of survey responses between the surveyed groups. As a first step, the responses of the different groups can be displayed in tables similar to contingency tables. Based on these tables it is then possible to test the agreement between the groups statistically.

General methods to measure the strength of agreement for categorical data are Cronbach’s alpha, Kendall’s rank correlation coefficient and the Kappa index (Light, 1971; Carmines & Zeller, 1979). These measures generally show the degree of agreement between the groups of respondents but do not show what this agreement looks like. For example, if a mother and a father have the same opinion on when their child should go to bed, Kappa will be high, but it gives no information about the actual hour they have agreed upon. Correspondence analysis, which is used for the analysis of contingency tables where two or more categorical variables are shown in one table, can also be used for measuring the degree of agreement between related groups; in addition, it visualizes where this agreement is located among the different response categories.

This presentation will demonstrate the possibilities of correspondence analysis for measuring the degree of agreement between related groups, here mothers, fathers and children. The similarities between the groups are illustrated by their opinions on how much time children should spend at their computers. To address this research question we first used the Kappa index and then applied correspondence analysis.
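A minimal sketch of the first step, assuming the two-group (Cohen) form of the Kappa index computed from a square agreement table whose rows are one group's answers and whose columns are the other's. The function name is hypothetical, and the abstract does not state which variant of Kappa was used.

```python
import numpy as np

def cohen_kappa(table):
    """Cohen's kappa for a square agreement table (rows: group A's
    category, columns: group B's category for the same family)."""
    t = np.asarray(table, dtype=float)
    p = t / t.sum()
    p_obs = np.trace(p)                    # observed agreement
    p_exp = p.sum(axis=1) @ p.sum(axis=0)  # agreement expected by chance
    return (p_obs - p_exp) / (1.0 - p_exp)
```

Kappa condenses the table into a single number; which categories carry the agreement has to be read from the table itself or from a correspondence analysis map of it.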

Carmines, E. G. & Zeller, R. A. (1979). *Reliability and Validity Assessment*. London: Sage Publications.

Light, R. J. (1971). Measures of response agreement for qualitative data: generalizations and alternatives. *Psychological Bulletin*, **76**, 365-377.

Correspondence analysis of ordinal cross-classifications

Eric J. Beh & Pam J. Davy

University of Western Sydney & University of Wollongong, Australia

e.beh@uws.edu.au & pam_davy@uow.edu.au

Multiple correspondence analysis (MCA) is a popular method of graphically identifying the association between more than two variables of a contingency table. A common way to carry out the procedure is to apply the singular value decomposition (SVD) to the indicator matrix or the Burt matrix generated from the data. More recently, extensions of the SVD, such as the Tucker and CANDECOMP/PARAFAC decompositions, have also been used.

For the analysis of ordinal variables, the existing methods do not take into consideration their ordinal structure, and so neglect the information they may provide. One approach is to modify the method of decomposition so that the co-ordinates reflect the ordinality. For the analysis of bivariate ordinal cross-classifications this has been studied, for example, by Parsa & Smith (1993), Ritov & Gilula (1993) and Schriever (1983). Alternatively, the correspondence analysis approach of Beh (1997) is applicable to ordinal two-way categorical data and uses the bivariate moment decomposition (BMD) to identify linear (location), quadratic (dispersion) and higher order moments for each of the ordinal variables. While most of the techniques designed to analyse ordinal cross-classifications focus solely on the linear-by-linear association, the advantage of using the BMD is that non-linear measures of association, called generalised correlations, can easily be found. The linear correlations reflect the Pearson product moment correlation and Spearman’s rank correlation, and higher order correlations reflect generalised, non-linear versions of these.

This paper describes the method of simple correspondence analysis for ordinal cross-classifications, and shows how it may be generalised to perform ordinal MCA. The development of the technique will focus on the three-way contingency table with three ordered variables. We will demonstrate that the correspondence plots generated from the analysis have the same mathematical properties as the Tucker and PARAFAC/CANDECOMP approaches to MCA, yet offer a far more intuitive interpretation of the association between the variables. Of particular interest, for a completely ordered three-way table, the total inertia can be decomposed into three bivariate chi-squared terms and a three-way term (Beh & Davy, 1998). The procedure does not rely on maximum likelihood estimation, but instead on orthogonal polynomials generated from a simple recurrence relation described in Beh (1997), and so is a simple and easily computable tool for the analysis of ordinal cross-classified data.
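For the two-way case, the decomposition can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the orthonormal polynomials are obtained here by a weighted QR factorisation of the powers of the scores rather than by the recurrence relation of Beh (1997), the category scores default to 1, 2, ..., all marginal proportions are assumed positive, and the function names are hypothetical.

```python
import numpy as np

def orthonormal_polynomials(weights, scores=None):
    """Columns a_0, a_1, ... orthonormal with respect to `weights`
    (a_0 is constant, a_1 linear, a_2 quadratic, ...)."""
    w = np.asarray(weights, dtype=float)
    s = (np.arange(1, w.size + 1, dtype=float) if scores is None
         else np.asarray(scores, dtype=float))
    V = np.vander(s, N=w.size, increasing=True)   # 1, s, s^2, ...
    Q, R = np.linalg.qr(np.sqrt(w)[:, None] * V)  # weighted Gram-Schmidt
    Q = Q * np.sign(np.diag(R))                   # fix the sign of each polynomial
    return Q / np.sqrt(w)[:, None]

def generalised_correlations(table):
    """Matrix Z with Z[u-1, v-1] = sqrt(n) * sum_ij a_u(i) b_v(j) p_ij, u, v >= 1."""
    N = np.asarray(table, dtype=float)
    n = N.sum()
    P = N / n
    A = orthonormal_polynomials(P.sum(axis=1))
    B = orthonormal_polynomials(P.sum(axis=0))
    Z = np.sqrt(n) * (A.T @ P @ B)
    return Z[1:, 1:]   # drop the trivial constant terms
```

Under these assumptions the squared entries of `Z` sum to Pearson's chi-squared statistic for the table, and `Z[0, 0]` divided by the square root of *n* is the Pearson correlation of the category scores, i.e. the linear-by-linear term.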

Beh, E. J. (1997). Simple correspondence analysis of ordinal cross-classifications using orthogonal polynomials. *Biometrical Journal*, **39**, 589-613.

Beh, E. J. & Davy, P. J. (1998). Partitioning Pearson's chi-squared statistic for a completely ordered three-way contingency table. *The Australian and New Zealand Journal of Statistics*, **40**, 465-477.

Parsa, A. R. & Smith, W. B. (1993). Scoring under ordered constraints in contingency tables. *Communications in Statistics (Theory and Methods)*, **22**, 3537-3551.

Ritov, Y. & Gilula, Z. (1993). Analysis of contingency tables by correspondence models subject to ordered constraints. *Journal of the American Statistical Association*, **88**, 1380-1387.

Schriever, B. F. (1983). Scaling of order dependent categorical variables with correspondence analysis. *International Statistical Review*, **51**, 225-237.

Statistical aspects of pottery quantification for dating some archaeological contexts in the city of Tours

Lise Bellanger, Philippe Husi & Richard Tomassone

Université de Nantes, France, Université François Rabelais de Tours, France & Institut National Agronomique, Paris, France

lise.bellanger@math.univ-nantes.fr, husi@univ-tours.fr & rr.tomassone@wanadoo.fr

This paper describes some statistical analyses of a particular archaeological material (pottery) from a number of sites in the city of Tours. The interest in Tours is explained by the large number of excavations carried out there over the last thirty-five years (1968-2002) using the same system of data recording. We list 16 excavations yielding stratigraphic abundance data in the historic centre of the town. As pottery is a very good chronological indicator, its quantitative study is crucial for comparing different archaeological contexts (or sets). Each context retained in the study is represented by its pottery assemblage. The corpus of data comprises a two-way table with rows representing different fabrics and columns representing archaeological contexts. The columns are separated into two groups. The first, the active group, includes archaeological contexts whose dates are attested by coins; the second, the supplementary group, includes contexts whose dates are poorly defined or unknown.

Our statistical approach corresponds to different archaeological needs:

i) A comparison of the most widely used measures of pottery quantification, assessing their performance using multidimensional scaling of categorical data.

ii) A spatial (inter-assemblage) and chronological approach to estimating the dates of contexts, in which the primary source of variation is thought to be the different proportions of fabrics (pies of pottery), representing either geographical or temporal variation (or both).

The statistical procedure is carried out in several steps:

1) Investigation of the relationship between contexts and fabrics using correspondence analysis.

2) Use of the secure representation of the contexts obtained previously to estimate their dates with a regression model.

3) Model checking as an essential component of this fitting process, including resampling methods (jackknife and bootstrap).

4) Estimation of the dates for the supplementary group, using the regression model.

5) Scrutiny of the estimated dates and possible insertion into the active group of some new contexts belonging to the previously defined supplementary group.

6) Repetition of this process to obtain a new basic active group consisting of well-dated contexts.

This method provides an effective complementary tool for dating archaeological contexts. It seems to be a good example of the integration of formal statistical theory with its practical application in a scientific discipline.
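Steps 1, 2 and 4 of the procedure can be sketched roughly as below. This is an illustrative outline only: the abstract does not specify the form of the regression model, so an ordinary least-squares fit of the date on the first correspondence analysis axes is assumed, the function names are hypothetical, and every fabric is assumed to occur in at least one active context.

```python
import numpy as np

def ca_columns(active, supplementary, k=1):
    """Correspondence analysis of a fabrics x contexts table.
    Returns principal coordinates of the active contexts and the
    projections of the supplementary contexts on the first k axes."""
    active = np.asarray(active, dtype=float)
    P = active / active.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    row_std = U[:, :k] / np.sqrt(r)[:, None]            # standard row coordinates
    col_active = Vt[:k].T / np.sqrt(c)[:, None] * sv[:k]
    supp = np.asarray(supplementary, dtype=float)
    # Transition formula: a context's profile times the standard row coordinates.
    col_supp = (supp / supp.sum(axis=0)).T @ row_std
    return col_active, col_supp

def fit_dates(coords, dates):
    """Least-squares regression of known dates on CA coordinates (assumed model)."""
    X = np.column_stack([np.ones(len(coords)), coords])
    beta, *_ = np.linalg.lstsq(X, np.asarray(dates, dtype=float), rcond=None)
    return beta

def predict_dates(beta, coords):
    return np.column_stack([np.ones(len(coords)), coords]) @ beta
```

A typical call would compute `col_active, col_supp = ca_columns(active, supplementary)`, fit `beta = fit_dates(col_active, known_dates)` and then evaluate `predict_dates(beta, col_supp)`; the jackknife and bootstrap checks of step 3 would wrap the fitting call.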

On the connection between the distribution of eigenvalues in multiple correspondence analysis and log-linear models

Saloua Ben Ammou & Gilbert Saporta

CEDRIC, Paris, France

saloua.benammou@fdseps.rnu.tn & saporta@cnam.fr

Multiple correspondence analysis (MCA) and log-linear modelling are two techniques for analysing multi-way contingency tables, with different aims and fields of application. Log-linear models are useful when applied to a small number of variables (Bishop et al., 1975). Multiple correspondence analysis is useful for large tables (Lebart et al., 2000). This efficiency is offset by the fact that MCA cannot make explicit the relations between more than two variables, as log-linear modelling can (Andersen, 1991). The two approaches are complementary.

In this presentation we shall demonstrate that in MCA, under the independence hypothesis, each observed eigenvalue is asymptotically normally distributed. These asymptotic distributions have the same mean but different variances (Ben Ammou, 1996; Ben Ammou & Saporta, 1998).

Under some modelling hypotheses, the diagram of the distribution of the MCA eigenvalues takes particular shapes, especially in the case of the mutual independence model. Theoretically there is only one non-trivial, multiple eigenvalue, equal to 1/*p*, where *p* is the number of variables; in practice, the observed eigenvalues µ_{i} differ but remain close to 1/*p*: µ_{i} = 1/*p* ± ε. The shape of the diagram of observed eigenvalues is therefore very distinctive. This shape changes if there are one or more interactions between variables. We can recognize the model fitted by the data in some particular cases, especially when the number of interactions is not very large, i.e. when we can easily identify the observed eigenvalues that are equal (or very close) to 1/*p*. When the number of interactions increases, we can no longer distinguish between eigenvalues theoretically equal to 1/*p* and those different from 1/*p*.
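The behaviour under mutual independence can be checked numerically with a small simulation. The sketch below is only an illustration: it computes the MCA eigenvalues as the principal inertias of the correspondence analysis of the indicator matrix, and the function name, sample size and number of categories are assumptions.

```python
import numpy as np

def mca_eigenvalues(data):
    """Non-trivial MCA eigenvalues from the CA of the indicator matrix.
    `data` is an (n, p) array of category codes."""
    data = np.asarray(data)
    n, p = data.shape
    blocks = []
    for j in range(p):
        cats = np.unique(data[:, j])
        blocks.append((data[:, [j]] == cats[None, :]).astype(float))
    Z = np.hstack(blocks)                      # indicator matrix
    P = Z / Z.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    sv = np.linalg.svd(S, compute_uv=False)
    k = Z.shape[1] - p                         # number of non-trivial axes
    return sv[:k] ** 2

# Four mutually independent variables with three categories each.
rng = np.random.default_rng(0)
eig = mca_eigenvalues(rng.integers(0, 3, size=(5000, 4)))
print(np.round(eig, 3))   # values scattered around 1/p = 0.25
```

The mean of the non-trivial eigenvalues is always exactly 1/*p*; what independence adds is that every individual eigenvalue stays close to that value, which gives the eigenvalue diagram its flat shape.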

Based on these results we propose a simple procedure for progressively fitting log-linear models, in which goodness of fit is assessed from the diagram of MCA eigenvalues: the model is inferred through successive applications of MCA (which is not constrained by the number of variables).

The procedure is validated on several data sets from the literature corresponding to various cases: mutual independence, saturated models and graphical models with two-way interactions.

Andersen, E. B. (1991). *The Statistical Analysis of Categorical Data* (second edition). New York: Springer.

Ben Ammou, S. (1996). *Comportement des Valeurs propres en Analyse des Correspondances Multiples sous certaines Hypothèses de Modèles*. Doctoral Thesis, University Paris IX Dauphine.

Ben Ammou, S. & Saporta, G. (1998). Sur la normalité asymptotique des valeurs propres en ACM sous l’hypothèse d’indépendance des variables. *Revue de Statistique Appliquée*, **46**, 21-35.

Bishop, Y. M. M., Fienberg, S. E. & Holland, P. W. (1975). *Discrete Multivariate Analysis: Theory and Practice*. Cambridge, MA: MIT Press.

Lebart, L., Morineau, A. & Piron, M. (2000). *Statistique Exploratoire Multidimensionnelle*, 3^{ème} édition. Paris: Dunod.

Statistical method in image retrieval

Mónica Benito & Daniel Peña

University Carlos III of Madrid, Spain

mbenito@est-econ.uc3m.es & dpena@est-econ.uc3m.es

Exploratory image studies generally aim at data inspection and dimensionality reduction. Any particular image is represented by a matrix **X** of dimension *I*×*J*, i.e., with *I* rows and *J* columns. Principal component analysis (PCA) has been used in the past to reduce dimensionality and derive useful compact representations for image data. Low-dimensional representations are also important when one considers the intrinsic computational aspect. This work is concerned in particular with dimension reduction for large image databases, with applications to image reconstruction. PCA was first applied to reconstruct human faces by Kirby and Sirovich (1990), who considered the images as vectors in a high-dimensional space. Turk and Pentland (1991) further developed a well-known face recognition method, known as eigenfaces, where the eigenfaces correspond to the eigenvectors associated with the dominant eigenvalues of the face covariance matrix. The eigenfaces define a feature space, or ‘face space’, which drastically reduces the dimensionality of the original space, and face reconstruction and identification are carried out in the reduced space. An important property of PCA is its optimal signal reconstruction in the sense of minimum mean square error (MSE) when only a subset of principal components is used to represent the original signal.
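The PCA baseline described above can be sketched as follows, under the assumption that the images are stored in an array of shape (K, I, J) and vectorised row by row. The function names are hypothetical, and the principal directions are taken from an SVD of the centred data rather than from an explicit covariance matrix.

```python
import numpy as np

def pca_reconstruct(images, d):
    """Reconstruct K images of size I x J from their first d principal
    components, treating each image as a vector of length I*J."""
    K = images.shape[0]
    X = images.reshape(K, -1)
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions ("eigenfaces").
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    W = Vt[:d]
    recon = mean + (X - mean) @ W.T @ W
    return recon.reshape(images.shape)

def reconstruction_mse(images, recon):
    return float(np.mean((images - recon) ** 2))
```

With K centred images there are at most K - 1 informative components, so the reconstruction error falls to zero once `d` reaches K - 1.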

The new method proposed is based on the projection of the images as matrices, and it is shown to lead to a better reconstruction for the data analysed. Instead of considering the images as vectors, as in the PCA approach, the idea is to maintain the matrix structure of the images and project each matrix onto a vector. We measure the discriminatory power of the projection vector by the scatter of the projected samples. The optimal set of projection axes consists of the eigenvectors corresponding to the highest eigenvalues of the image total covariance matrix. This covariance matrix has dimension *I*×*I* (assuming *I*<*J*). The set of projected feature vectors associated with any image **X** of the sample can be used to form a feature matrix and to estimate a multivariate linear model using this known design matrix. The design matrix is unique for each observation, and it is obtained using the information in the whole training sample. The unifying theme of the new schemes is that of lowering the space dimension (data compression) subject to increased fitness for the reconstruction task.
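A rough sketch of the matrix projection idea, in the spirit of Yang & Yang (2002), is given below. It is an assumption-laden illustration rather than the authors' method: the *I*×*I* image covariance matrix is taken to be the average of (**X** - mean)(**X** - mean)^T over the sample, the reconstruction simply projects back onto the retained axes, and the multivariate linear model mentioned in the text is not reproduced.

```python
import numpy as np

def matrix_projection(images, d):
    """Project K images (array of shape K x I x J) onto the d leading
    eigenvectors of the I x I image covariance matrix and reconstruct."""
    mean = images.mean(axis=0)
    C = images - mean
    # G = average over images of C_k @ C_k.T, an I x I matrix.
    G = np.einsum('kij,klj->il', C, C) / images.shape[0]
    _, V = np.linalg.eigh(G)                  # eigenvalues in ascending order
    V = V[:, ::-1][:, :d]                     # d leading eigenvectors
    feats = np.einsum('id,kij->kdj', V, C)    # V.T @ C_k, shape d x J
    recon = mean + np.einsum('id,kdj->kij', V, feats)
    return feats, recon

def reconstruction_mse(images, recon):
    return float(np.mean((images - recon) ** 2))
```

Each retained axis yields one projected feature vector of length *J* per image, so an image is summarised by a *d*×*J* feature matrix instead of a single vector of scores.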

The method is illustrated using a set of full-face pictures of males and females, extracted from digitized grey-scale images.

Christensen, R. (1991). *Linear Models for Multivariate, Time Series and Spatial Data*. Springer-Verlag.

Kirby, M. & Sirovich, L. (1990). Application of the Karhunen-Loeve procedure for the characterization of human faces. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, **12**, 103-108.

Swets, D. & Weng, J. (1996). Using discriminant eigenfeatures for image retrieval. *Technical Report.*

Turk, M. & Pentland, A. (1991). Face recognition using eigenfaces. In *Proceedings of the IEEE Conference in Computer Vision and Pattern Recognition*, 586-591.

Yang, J. & Yang, J. (2002). From image vectors to matrices: a straightforward image projection technique. *Pattern Recognition*, **35**, 1997-1999.

Types and anti-types as test points in correspondence analysis

Jörg Betzin & Erwin Lautsch

Technical University of Berlin & University of Kassel, Germany

betzin@cs.tu-berlin.de & erla@uni-kassel.de

The concept of types and anti-types was mainly developed by psychologists in order to analyze the relationship between categorical variables more closely than under the more or less unspecific general independence hypothesis.

In the framework of configuration frequency analysis (CFA), first introduced in the late sixties by Lienert (see e.g. Krauth/Lienert, 1973, “Die Konfigurationsfrequenzanalyse”), the single cells of a configuration frequency table are investigated. We denote by “type” a cell in which the observed frequency is significantly higher than the frequency expected under the hypothesis of general independence of the table. The term “anti-type” is defined analogously, with an observed frequency that is significantly lower. The significance of the difference between observed and expected frequencies is measured by the respective χ² component of the χ² test statistic for the table. The χ² components are treated as χ² tests with one degree of freedom, using adjusted significance levels. We connect this approach with correspondence analysis (CA), where a χ² distance is used to measure relationships between different categories.
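The cell-wise test can be sketched as follows for a two-way table. This is a minimal illustration under assumptions: the expected frequencies come from the independence model, the adjustment of the significance level is taken to be Bonferroni (the abstract does not say which adjustment is used), and the function name is hypothetical.

```python
import math
import numpy as np

def cfa_types(table, alpha=0.05):
    """Label each cell of a two-way table as 'type', 'anti-type' or '-'."""
    obs = np.asarray(table, dtype=float)
    n = obs.sum()
    expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / n
    comp = (obs - expected) ** 2 / expected    # chi-squared components
    alpha_adj = alpha / obs.size               # Bonferroni adjustment (assumed)
    labels = np.full(obs.shape, "-", dtype=object)
    for idx in np.ndindex(obs.shape):
        # Upper tail of a chi-squared variable with one degree of freedom.
        p_value = math.erfc(math.sqrt(comp[idx] / 2.0))
        if p_value < alpha_adj:
            labels[idx] = "type" if obs[idx] > expected[idx] else "anti-type"
    return labels, comp
```

The same χ² components, summed over the cells, give the total inertia of CA multiplied by the sample size, which is the link between the two techniques.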

One of the difficulties in CA is the interpretation and, particularly, the detection and visualization of meaningful multivariate categories points. By a multivariate categories point we understand a combination of category scores of different variables lying close together in the “graphical description” of CA. By the term “graphical description” we do not refer to the real graphical mapping of the CA solution, but rather the possible spatial description in more than two or three dimensions. Comparing types/anti-types from CFA with graphical representations from CA may be helpful to detect relevant points and to interpret relations from the graphic.

In this presentation we will describe the technique of CFA and the use of types and anti-types as test points in CA. For multivariate categories points we show the connection between being a type or an anti-type in CFA and the spatial location of the point in the CA graphic. In particular, we discuss the importance, for this connection, of the distance from the origin, the distance to the dimensional axes and the volume of the convex envelope of a multivariate categories point in the graphical description of CA. Moreover, we show the usefulness of a so-called determination coefficient from the CFA for interpreting practical relevance in the CA context.

The usefulness of the envisaged concept is demonstrated by results from real data from the “Shell youth study 2001 (Germany)”. We drew two different data sets from this survey. The first one describes the confidence in social and political institutions of German adolescents. This data set has a very clear and stable CA solution, so the types/anti-types concept is not very helpful here. The second data set contains variables from the sociodemographic environment of the survey participants. Here the CA solution is very complex. We point out that introducing types and anti-types is helpful in the interpretation of the CA solution.

A three-step approach to assessing the behaviour of survey items in cross-national research using biplots

Jörg Blasius & Victor Thiessen

University of Bonn, Germany & Dalhousie University, Halifax, Canada

jblasius@uni-bonn.de & Victor.Thiessen@Dal.Ca

Making meaningful international comparisons using survey data presupposes a common understanding of the questionnaire items and an acceptable level of data quality. To the extent that these conditions are not met, cross-national findings are not comparable. This paper employs a three-step approach to assessing the comparability of survey items in cross-national research. In the first step, classical principal component analysis (PCA) is used, which makes rather stringent assumptions about the distributions and the level of measurement of the survey items. In the second step, the results of PCA are compared with those obtained using nonlinear PCA. Divergences in the results of these two types of analyses indicate the existence of measurement problems. These are then explored more systematically using the biplot methodology in the third step. This methodology helps to locate both differences in the underlying structure of the survey items and violations of metric properties in the individual items.

We exemplify our approach by focusing on a set of five-point Likert-type items used in the 1994 International Social Survey Program (ISSP), which focuses on family and gender roles. Information is available for 24 countries on opinions in the areas of women and work, marriage, and children. Our results show that several subsets of countries can be meaningfully compared within but not across the subsets. The reason that they cannot be compared across subsets is that the underlying structures in the different subsets of countries are not equivalent. In addition, for several countries the behaviour of the survey items was such that we conclude that they cannot fruitfully be used for substantive comparisons.

The implication of social classification for analyses of the field of higher education – the case of Sweden

Mikael Börjesson, Donald Broady & Mikael Palme

Department of Teacher Education, Uppsala University, Sweden

mikael.borjesson@ilu.uu.se, broady@nada.kth.se & mikael.palme@lhs.se

This paper falls into two parts. Firstly, the classification of social origin found in official statistics in Sweden is discussed, as well as the possibility of an alternative classification giving a less uni-dimensional representation of social space. In this context, the key issue of what constitutes a “household” is highlighted. Secondly, an alternative, multi-dimensional classification system is employed for analysing the recruitment to higher education in Sweden in 1998.

Until recently, Statistics Sweden has used two different types of social classification systems, the Nordic Occupational Classification (NYK) and the Socio-Economic Index (SEI). Being based on professions, the NYK allows the separation of over 3,000 different occupations, which can be aggregated according to branches with different levels of aggregation. The SEI, comprising some 20 categories, is a hierarchical classification, using several different criteria for distinguishing between the categories. In order to obtain a classification system that accounts both for hierarchical differences between social groups and for the specific nature of the assets that these groups possess, a social classification that combines NYK and SEI is described, distinguishing between 32 social groups.

Social groups separated by any classification system also tend to differ in properties that are usually not transparent in the definition of the groups as such. For a cohort of all grade 9 leavers in 1988 (approx. 110,000 individuals), the characteristics of the particular social group of origin are analysed along several dimensions (marriage patterns, income, education, immigration, number of children, etc.; all data provided from the national census in 1990). A main finding is that groups with high social positions that largely depend on educational capital (especially university teachers and physicians) differ significantly from groups pertaining to the economic fractions of the dominant class. Typically, they find spouses with an equally high social position based on cultural capital, while men belonging to the economic elite more often marry women in lower social positions. It is concluded that the statistical analysis must be specifically aware of the implications of various alternative definitions of the “household” or “family” when exploring the effects of social origin as a variable.

In the second part of the article, the social structure of the field of higher education in Sweden is analysed using simple correspondence analysis. The columns of the matrix consist of 32 social groups divided by sex, i.e. forming 64 social groups (where sons of university teachers are separated from daughters of university teachers, etc.), and the rows consist of approx. 1,400 educational programmes (distinguishing both between different types of programmes, such as civil engineering programmes in computer sciences, physics, and architecture, and between different institutions of higher education). Three important dimensions constitute the field. The first axis separates men from women, where programmes in natural sciences and technology stand against programmes in education, social services and nursing. The second axis differentiates the dominating social groups, especially those whose positions depend on educational or cultural capital, from dominated ones. The former normally attend longer, more prestigious study programmes at the traditional universities and prominent professional schools, while the latter are directed towards shorter programmes and provincial institutions. The third dimension shows an opposition between the cultural fractions and the economic fractions of the dominant class. It is concluded that an understanding of the complexity of the structure of the field of higher education requires a differentiated classification system of social origin that separates social groups with different kinds of assets.

Discriminant analysis on categorical variables

Stéphanie Bougeard^{(1)}, El Mostafa Qannari^{(2)} & Hicham Noçairi^{(2)}

^{(1)}AFSSA, Ploufragan & ^{(2)}ENITIAA-INRA, Nantes, France

s.bougeard@ploufragan.afssa.fr

Fisher’s discriminant analysis and logistic regression are used to predict a categorical variable from a set of numeric variables. Adaptations of these methods to the case where it is desirable to predict a categorical variable from other categorical variables are discussed in the literature. After a brief review of these techniques, we investigate methods of prediction that aim at circumventing the well-known problem of multicollinearity among predictors. A first approach consists in performing multiple correspondence analysis on the predictors and then using a subset of the principal axes as predictors. It should be stressed, however, that choosing the principal axes to be introduced in the prediction model is a tricky problem. On the one hand, with few axes there is a risk of discarding information that is useful for discrimination; on the other hand, one should take care not to introduce principal axes that may cause instability in the model.

We investigate alternative methods which make it possible to derive, step by step, principal axes that are tightly related to the categorical variable to be predicted. These methods pertain to redundancy analysis and partial least squares discriminant analysis performed on categorical variables.

The various methods of analysis are illustrated and compared on the basis of real data sets.

GINKGO, a multivariate analysis program oriented towards distance-based classifications

Miquel De Cáceres, Francesc Oliva & Xavier Font

Universitat de Barcelona, Spain

mcaceres@bio.ub.es, francesc@bio.ub.es & xavier@bio.ub.es

Although many multivariate programs are already available, most of them offer only the same classical methods. As a result, non-expert users are not aware of more specialised techniques that could be more useful for their applications. GINKGO is an application oriented towards the representation and classification of individuals in multivariate spaces. Its main purpose is to provide multivariate methods applied to dissimilarity matrices.

Unsupervised classifications can be performed using three different clustering models: 1) hierarchical agglomerative clusters (single, complete, UPGMA, ...); 2) crisp (K-means, MacQueen, 1967) and fuzzy (FCM, Bezdek, 1981) partitions; 3) independent clusters (possibilistic C-means, Krishnapuram & Keller, 1993). Additionally, GINKGO allows clustering models (2) and (3) to be performed directly on symmetric dissimilarity matrices (Oliva et al., 2001), avoiding the use of Pythagorean distance or MDS. The supervised classification methods available are linear discriminant, quadratic discriminant and distance-based discriminant (Cuadras et al., 1997) analyses.

Ordination methods implemented in the program are principal components analysis (PCA), metric scaling (MDS), non-metric multidimensional scaling (NMDS), correspondence analysis (CA), as well as related multidimensional scaling (RMDS, Cuadras & Fortiana, 1998).

GINKGO has been developed entirely in Java and is freely distributed (http://biodiver.bio.ub.es/vegana). Software updates are performed automatically using Java Web Start technology.

Bezdek, J. C. (1981). *Pattern Recognition with Fuzzy Objective Functions.* New York: Plenum Press.

Cuadras, C. M., Fortiana, J. & Oliva, F. (1997). The proximity of an individual to a population with applications in discriminant analysis. *Journal of Classification*, **14**, 117-136.

Cuadras, C. M. & Fortiana, J. (1998). Visualizing categorical data with related metric scaling. In *Visualization of Categorical Data* (eds. J. Blasius and M. Greenacre), 365-376. London: Academic Press.

Krishnapuram, R. & Keller, J. M. (1993). A possibilistic approach to clustering. *IEEE Transactions on Fuzzy Systems*, **1**, 98-110.

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observation. *Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability*, 281-297.

Oliva, F., De Cáceres, M., Font, X. & Cuadras, C. M. (2001). Contribuciones desde una perspectiva basada en distancias al fuzzy C-means clustering. *XXV Congreso Nacional de Estadística e Investigación Operativa.* Úbeda 2001.

Hierarchical factor classification for contingency tables

Sergio Camiz,
Jean-Jacques Denimal & Elena Rova

*Università
di Roma La Sapienza, Italy, Université des Sciences et Technologies de Lille,
France & Università
di Venezia Ca’ Foscari, Italy*

sergio.camiz@uniroma1.it, jean-jacques.denimal@univ-lille1.fr & erova@unive.it

Recently, Denimal (2001) introduced a
hierarchical classification of continuous variables, based on a sequence of
principal component analyses. Two particular features deserve mentioning:
first, it can cluster highly correlated variables, irrespective of the
direction of correlation, thus producing dipoles of variables opposed to each
other; second, for each node it produces a specific factor plane, where both
the clustered variables and the units, as seen only by these variables, are
projected. In this way, a sequence of factor planes is produced. On these, the
factors previously built in the hierarchical process can also be projected,
since axes belonging to adjacent nodes, far from being orthogonal to each
other, are usually the most correlated.

In this paper, an analogous
procedure is proposed for the columns of a contingency data table. We work within
the framework of correspondence analysis (CA), that is, we deal with profiles. This
implies that we cannot base the analysis on a pair of columns (that would
produce only one factor), but rather on four columns. Given a contingency
table, crossing *m* rows with *n* columns, for any column *j* a
new column *j ^{*}* is built,
whose

It will be shown that: i) at each
step n, where columns *j*_{1}
and *j*_{2} are merged, a
factor plane is produced whose first factor represents what the two columns
have in common and the second what distinguishes them; the factors' variances
correspond to the two eigenvalues respectively; ii) at each step, all merged
columns and all units can be projected on the factor plane: the latter are
represented as they are seen only by these columns; iii) since the resemblance
between two nodes is evaluated irrespective of the covariance sign, the groups of
columns may assume a shape of dipoles; iv) the sequence of second eigenvalues
is non-decreasing, so that they can be used as indexes of the hierarchy; v) the
total inertia is decomposed as the sum of these indexes with the first
eigenvalue of the last CA, corresponding to the (n–1)-th node of the hierarchy.

An application to the study of images of Mesopotamian sealings (4th millennium B.C.) will be presented. The images were coded through a formalised text describing in detail the iconographic content of each image, and textual analysis made it possible to build a contingency table crossing the images with the lexical forms used for the description. In this way the forms indicate the nature of the represented elements, their attributes, and the relations among elements, and the classification of forms can group the elements, attributes, attitudes, and relations, in order to detect the compositional elements that most frequently occur together.

Denimal, J. J. (2001). Hierarchical factorial analysis. *Actes
du 10th International Symposium on Applied Stochastic Models and Data
Analysis*, Compiègne, 12-15 June 2001.

Regression biplot: linear and non-linear

P O S T E R

Olesia Cárdenas C.^{1}, M^{a} Purificación
Galindo V.^{2} & José L. Vicente-Villardón^{2}

^{1}Universidad Central, Venezuela
& ^{2}Universidad de Salamanca, Spain

ocardena@cantv.net, pgalindo@usal.es & villardon@usal.es

Even where classic biplot methods are used to describe data matrices without making assumptions about the population distribution, it may be possible to interpret the biplot of a matrix as a multiplicative bilinear model (Gollob, 1968) and use it for model diagnosis (Bradu & Gabriel, 1978; Gabriel, 1998), considering it as an extension of the generalized linear model (Nelder & Wedderburn, 1972). Gower & Hand (1996) depart from this classical approach. Their approach may be related to the classic factorial form of the French school of data analysis, as well as to the ordination methods used in the biometric school: they describe biplot geometry in terms of projections onto a subspace, as opposed to the non-geometric approach followed in model diagnosis, and call these regression biplots. In an entirely different context, the classic factorial form of data analysis for variables with distributions within the exponential family may be compared to arriving at continuous latent variables in the social sciences, as is the case with item response theory (Baker, 1992), for instance.

The purpose of this paper falls within these varied lines of research: it aims to describe a data matrix using general multiplicative bilinear models to approximate regression biplots, to analyse their geometry and to propose formally an alternative estimation method. One advantage of regression biplots over principal components regression, for example, is that the distribution of the variables contained in the data matrix may belong to the exponential family, which makes it possible to exhibit on the plot the association between individuals and variables. Another advantage of this estimation procedure is that it may be generalized to include external information. A practical application demonstrates the applicability of the approach.

Baker, F. B. (1992). *Item Response Theory.* New York: Marcel
Dekker.

Bradu, D. & Gabriel, K. R. (1978).
The biplot as a diagnostic tool for models of two-way tables. *Technometrics*,
**20**, 47-68.

Gabriel, K.
R. (1998). Generalised bilinear regression. *Biometrika*, **85**, 689-700.

Gollob, H. (1968). A statistical model
which combines features of factor analytic and analysis of variance techniques. *Psychometrika*,
**33**, 73-115.

Gower, J.
C. & Hand, D. J. (1996). *Biplots.* London: Chapman & Hall.

Nelder, J.
A. & Wedderburn, R. W. (1972). Generalized linear
models. *Journal of the Royal
Statistical Society A*, **135**, 370-384.

Which structures do generalised principal component analyses display? The case of multiple correspondence analysis

Henri Caussinus & Anne Ruiz-Gazen

Université Paul Sabatier & Université des Sciences Sociales, Toulouse, France

caussinus@cict.fr & ruiz@cict.fr

Let us consider an individual × variable array. Projection pursuit aims to find low-dimensional
projections displaying interesting features in the structure of the units
distribution. Principal component analysis and related methods produce such
graphical displays for users whose interest focuses on preserving dispersion as
far as possible. However, various choices of the metric on the units space
allow them to give various meanings to the word dispersion. Some metrics lead to
generalised principal component analyses which are likely to display various
kinds of special structures in the data, thus meeting the aims of projection pursuit
techniques. For some proposals and their properties, see, for example,
Caussinus & Ruiz-Gazen (1995) and Caussinus et al. (2002, 2003). In these
papers, the authors investigate the properties of their methods for
quantitative data. They rely on a mixture model where the “non-interesting
noise” is a normal distribution, while the (non-normal) mixing distribution is the “structure” of interest. Roughly
speaking, their methods look like factor discriminant analyses where the
classes would not be known.

In the case of qualitative data (*n*
units × *p* categorical variables) the
same methods can be formally applied to indicator matrices, but their
properties are far from being clear. In particular, the mixture model above no
longer makes sense. It is now more sensible to replace the normal noise by the
independence of the *p* responses inside each component of the mixture.
This is exactly the latent class model. The complementary use of this model and
multiple correspondence analysis has been considered by several authors (Aitkin
et al., 1987; McCutcheon, 1998). In our framework, it is easy to see that both
techniques are actually very strongly related. We show why and give
illustrative examples.

Aitkin, M., Francis, B. & Raynal, N.
(1987). Une étude comparative d’analyses des correspondances ou de
classifications et des modèles de variables latentes ou de classes latentes. *Revue
de Statistique Appliquée*, **35**, 53-82.

Caussinus, H. & Ruiz-Gazen, A.
(1995). Metrics for finding typical structures by means of principal component
analysis. In *Data Science and its Applications* (eds Y. Escoufier &
C. Hayashi), 177-192. Tokyo: Academic Press.

Caussinus,
H., Hakam, S. & Ruiz-Gazen, A. (2002). Projections
révélatrices contrôlées: recherche d’individus atypiques. *Revue de
Statistique Appliquée*, **50**, 5-37.

Caussinus,
H., Hakam, S. & Ruiz-Gazen, A. (2003). Projections révélatrices
contrôlées: groupements et structures diverses. *Revue de Statistique
Appliquée*, **51**, 37-58.

Caussinus,
H., Fekri, M., Hakam, S. & Ruiz-Gazen, A. (2003). A
monitoring display of multivariate outliers. *Computational Statistics and
Data Analysis* (forthcoming).

McCutcheon, A. L. (1998). Correspondence
analysis used complementary to latent class analysis in comparative social research.
In *Visualization of Categorical Data* (eds J. Blasius & M. J.
Greenacre), 477-488. London: Academic Press.

Advantages and limits of correspondence analysis for comparative analysis of socio-political data

Bruno Cautres

CNRS,
Grenoble, France

cautres@cidsp.upmf-grenoble.fr

There are many methodological problems related to the comparative analysis of socio-political data, especially data coming from survey research. Some of these problems are due to the comparative survey framework itself: the way items and questions are understood across national (not to mention sub-national) contexts can seriously call into question the “comparativeness” of the research. There are famous examples of “mistranslation”, misunderstandings and poor equivalence between measurements across countries. This is especially true in the case of cross-cultural studies (large comparisons between Western and Asian countries, for instance) but can also be true in the case of cross-national comparisons between culturally close countries (between EU countries, for instance). Large-scale comparative surveys like the Eurobarometers, the International Social Survey Program (ISSP) or the new European Social Survey (ESS) face such methodological challenges. A second problem comes from the analysis of the data, not their collection: how to control for national variations in data? How to analyse data in a way that allows the discovery of common or different patterns across nations or time? Must the analysis be done one country at a time (to discover each national pattern of data) or for all countries simultaneously?

Among the statistical techniques available to study the patterns of association between categorical or ordinal data, correspondence analysis (binary or multiple) can be a very useful way to investigate both problems. It has advantages over other techniques such as loglinear analysis of tabular data. This paper will investigate these questions by looking at some data sets such as Eurobarometers or national election studies. The use of multiple correspondence analysis or nonlinear principal component analysis will help in assessing whether the response patterns vary across countries and how to present the national variations of a “common structure”. Examples of responses to EU attitudes will be investigated. The technique of supplementary points can also help, by using one nation or one group of nations as the structure onto which supplementary points are projected. Finally, the paper will compare the advantages of correspondence analysis and loglinear analysis for comparative survey research.

The political space of the French electorate in 2002: geometric data analysis applied to French political life

Jean Chiche,
Brigitte Le Roux, Pascal Perrineau, & Henry Rouanet

CNRS & Université René Descartes, Paris, France

chiche@msh-paris.fr,
Brigitte.Leroux@math-info.univ-paris5.fr

pascal.perrineau@sciences-po.fr & Henry.Rouanet@math-info.univ-paris5.fr

The aim of the study is to delineate the
structure of the political space of French electors and the social and
ideological evolutions that led to
the presence of an extreme right-wing candidate in the second round of the French
presidential election that took place in April 2002 (Perrineau, 2003). As a
basic statistical method we will use specific multiple correspondence analysis
(see Le Roux & Chiche, 1998 and Le Roux, 1999) concentrating on the
representation and interpretation of the clouds of individuals.

In the spring of 2002, French research laboratories conducted three waves of surveys involving more than 10,000 respondents. The first wave was administered during the two weeks before the first ballot of the presidential election (April 21), the second wave after the second ballot (May 5), and the third after the legislative elections in the last days of June. These surveys, known as “The French Electoral Panel 2002”, pertain to attitudes, values and stakes of French electors.

In the paper, we will first present the results of geometric data analysis on the first wave data, following an approach similar to that of our earlier study (Chiche et al., 2000) of a 1997 survey (Boy & Mayer, 1997). We will characterize the political cleavages among the electors by interpreting the principal axes. Then, using the method of structuring factors, an extension of that of supplementary variables (see, for example, Le Roux & Rouanet, 1998), we will project the individual positions of the main electorates (shown as concentration ellipses) in the principal geometric space that represents the French political space. We will match these results with the ternary structure of the political space found in the earlier paper, showing the major cleavages (old or new) in French society. Then we will compare the structures found for the first wave with those obtained in the second (post presidential election). Does the space remain “stable”? Are the intensities of the main factors comparable?

In answering these questions, we hope to show how geometric data analysis, by reintroducing the individuals at the heart of statistical analysis, can contribute to a major social debate and bring elements of an answer to the burning question: will the presence of an extreme right candidate at the second ballot of the presidential election in the spring of 2002 be remembered as a mere “accident de l’histoire”, or … ?

Boy, D. & Mayer, N. (1997). *L’Électeur a ses Raisons*. Paris: Presses de la
Fondation nationale des sciences politiques.

Chiche, J., Le Roux, B., Perrineau, P.
& Rouanet, H. (2000). L’espace politique des électeurs français à la fin
des années 1990. *Revue Française de Sciences Politiques*, **50**,
463-487.

Le Roux, B. (1999). Analyse spécifique
d’un nuage euclidien: application à l’étude des questionnaires. *Math. Inf. Sc. Hum*., **146,** 65-83.

Le Roux, B. & Chiche, J. (1998).
Analyse spécifique d’un questionnaire: cas particulier des non-réponses. *XXXèmes Journées de Statistique de la
S.F.d.S*., Rennes, Mai 1998.

Le Roux, B. &
Rouanet, H. (1998). Interpreting axes in MCA:
method of the contributions of points and deviations. In *Visualization of Categorical Data* (eds J. Blasius & M. Greenacre). London: Academic
Press.

Perrineau, P.
(2003). *Le vote de tous les refus : les élections présidentielle et législatives d’avril-mai 2002*. Collection Chroniques électorales, Paris: Presses de Sciences
Politiques.

Correspondence analysis and two-way clustering

Antonio Ciampi & Ana González Marcos

McGill University, Montreal, Canada & University of La Rioja, Spain

antonio.ciampi@mcgill.ca
& ana.gonzal@dim.unirioja.es

In modern clustering problems such as micro-array analysis and text mining, the challenge is not only to discover proximity relationships among individuals and variables, but also to discover groups of variables and of individuals such that the variables are useful in describing proximities among the individuals. To this end, techniques known as two-way and crossed classifications have been developed, with the aim of producing homogeneous blocks in a rectangular data matrix.

Correspondence analysis (CA), as well as other biplot techniques (Gordon, 1999), offers the remarkable feature of jointly representing individuals and variables. As a result of such analyses, not only does one gain insight into the relationships amongst individuals and amongst variables, but one can also find an indication of which variables are important in the description of each individual. It is therefore natural to develop clustering algorithms that are based on the coordinates of a CA. Indeed this was commonly done by practitioners of “analyse des données” well before the advent of micro-array analysis and text mining (Lebart et al., 1984).

More recently, in an early attempt to develop clustering methods for micro-array data, Tibshirani et al. (1999) used the coordinates associated with the first vectors of a singular value decomposition to simultaneously rearrange the rows and the columns of a data matrix. They eventually abandoned this approach to concentrate on block clustering. In this work we explore their early idea further. Instead of using only the first axis, we select a few important axes of a CA and apply clustering algorithms to the corresponding coordinates of both rows and columns. Then, instead of ordering rows and columns by the value of their respective first coordinates, we use one of the orderings of rows and columns induced by the classification thus obtained. The result is an algorithm that combines two-way clustering, as currently applied in micro-array analysis, with the ‘dual’ perspective provided by CA.
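The first-axis reordering that this work takes as its starting point can be sketched in a few lines. The example below is only an illustration under stated assumptions, not the authors' algorithm: it obtains the first non-trivial CA axis by reciprocal averaging (a technique swapped in here to avoid an SVD library), and the function name `first_ca_axis` and the 4 × 4 table are hypothetical. The method described above goes further, keeping several axes and clustering the coordinates.

```python
def first_ca_axis(table, iterations=200):
    """Approximate the first non-trivial CA axis by reciprocal averaging.

    Returns (row_scores, col_scores). Assumes all row and column totals
    are positive and that the table is not exactly independent.
    """
    n_rows, n_cols = len(table), len(table[0])
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(n_rows)) for j in range(n_cols)]
    total = sum(row_tot)
    col_scores = [float(j) for j in range(n_cols)]  # arbitrary non-constant start
    for _ in range(iterations):
        # Centre and standardise the column scores, weighting by column totals;
        # centring removes the trivial constant solution.
        mean = sum(c * y for c, y in zip(col_tot, col_scores)) / total
        col_scores = [y - mean for y in col_scores]
        norm = (sum(c * y * y for c, y in zip(col_tot, col_scores)) / total) ** 0.5
        col_scores = [y / norm for y in col_scores]
        # Row scores are weighted averages of column scores, and vice versa.
        row_scores = [sum(n * y for n, y in zip(row, col_scores)) / rt
                      for row, rt in zip(table, row_tot)]
        col_scores = [sum(table[i][j] * row_scores[i] for i in range(n_rows)) / col_tot[j]
                      for j in range(n_cols)]
    return row_scores, col_scores


# Hypothetical table whose two blocks are hidden by the row and column order.
table = [
    [9, 1, 8, 0],
    [0, 9, 1, 8],
    [8, 0, 9, 1],
    [1, 8, 0, 9],
]
row_scores, col_scores = first_ca_axis(table)
row_order = sorted(range(len(table)), key=lambda i: row_scores[i])
col_order = sorted(range(len(table[0])), key=lambda j: col_scores[j])
reordered = [[table[i][j] for j in col_order] for i in row_order]
```

Sorting both margins by their first-axis scores brings rows 0 and 2 together with columns 0 and 2, so the large counts fall into two diagonal blocks of `reordered`.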

Our novel contribution consists in i) proposing a simple method for selecting the number of axes; ii) visualizing the data matrix as is done in micro-array analysis; iii) enhancing this representation by emphasizing those variables and those individuals which are ‘well represented’ in the subspace of the chosen axes. Also, we underline the utility of this approach to clustering by presenting a ‘traditional’ clustering problem: the classification of a group of psychiatric patients.

Lebart, L.,
Morineau, A. & Warwick, K. (1984). *Multivariate Descriptive Statistical Analysis.* New York:
Wiley.

Tibshirani,
R., Hastie, T., Eisen, M., Ross, D., Botstein, D. & Brown, P. (1999). *Clustering methods for the analysis of DNA microarray data*. Technical Report, Division
of Biostatistics, Stanford University.
http://www-stat.stanford.edu/~tibs/research.html

Comparing three methods for representing categorical data

Carles M. Cuadras & Michael Greenacre

*Universitat de Barcelona & Universitat Pompeu
Fabra, Barcelona, Spain *

ccuadras@ub.edu & michael@upf.es

Correspondence analysis (CA) is a multivariate method to visualize categorical data, typically presented as a two-way contingency table. The distance used in the graphical display of the rows (and columns) of the table is the so-called chi-square distance between the profiles of rows (and columns).
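As a small illustration of the chi-square distance mentioned above, the following sketch computes it between two row profiles, weighting each squared difference by the inverse of the corresponding column mass. The function names and the table of counts are hypothetical and serve only as an example.

```python
from math import sqrt


def row_profiles(table):
    """Divide each row of a contingency table by its row total."""
    return [[x / sum(row) for x in row] for row in table]


def column_masses(table):
    """Column totals divided by the grand total."""
    total = sum(sum(row) for row in table)
    return [sum(row[j] for row in table) / total for j in range(len(table[0]))]


def chi_square_distance(table, i, k):
    """Chi-square distance between the profiles of rows i and k."""
    profiles = row_profiles(table)
    masses = column_masses(table)
    return sqrt(sum((a - b) ** 2 / c
                    for a, b, c in zip(profiles[i], profiles[k], masses)))


# Hypothetical 3 x 3 table of counts; rows 0 and 2 are proportional.
table = [[20, 10, 10], [10, 20, 10], [40, 20, 20]]
d_02 = chi_square_distance(table, 0, 2)
d_01 = chi_square_distance(table, 0, 1)
```

Rows with proportional counts have identical profiles, so their distance is zero whatever their totals.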

In an early paper, Rao (1948) introduced the concept of canonical coordinates, also for graphical representation of multivariate data, especially quantitative multivariate data in several populations. More recently, Rao (1995) also used canonical coordinates to represent the rows of a contingency table, using the Hellinger distance (HD) between the profiles of rows.
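For comparison with the chi-square distance, a minimal sketch of the Hellinger distance between two profiles is given below. It assumes the unnormalised form, the Euclidean distance between square-rooted profiles; some authors include an extra factor of 1/√2. The function name and the two profiles are hypothetical.

```python
from math import sqrt


def hellinger_distance(p, q):
    """Euclidean distance between the square roots of two profiles.

    Assumption: unnormalised form, without the 1/sqrt(2) factor.
    """
    return sqrt(sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(p, q)))


# Hypothetical profiles (each sums to one).
p = [0.64, 0.36]
q = [0.36, 0.64]
d_pq = hellinger_distance(p, q)
d_pp = hellinger_distance(p, p)
```

Unlike the chi-square distance, this distance between two rows does not depend on the column masses of the whole table.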

A third alternative to represent categorical data is based on compositional data (Aitchison, 1986). Suppose that the rows of the table are vectors of positive values summing to one, for example the row profiles in CA. Let us consider the singular value decomposition of the weighted double-centering of the logarithms of this table. This log-ratio method (LR) is equivalent to considering a third distance between rows (see Aitchison & Greenacre, 2000).
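The weighted double-centering step of the log-ratio method can be sketched as follows. This is an illustration under assumptions: the weights are taken to be the row and column masses of the table (the text does not say which weights are used), the function names are hypothetical, and the singular value decomposition that would follow is omitted.

```python
from math import log


def masses(table):
    """Row and column masses: marginal totals divided by the grand total."""
    total = sum(sum(row) for row in table)
    rows = [sum(row) / total for row in table]
    cols = [sum(row[j] for row in table) / total for j in range(len(table[0]))]
    return rows, cols


def weighted_double_centre_logs(table, row_w, col_w):
    """Double-centre log(table) using weighted row and column means."""
    logs = [[log(x) for x in row] for row in table]
    row_means = [sum(c * v for c, v in zip(col_w, row)) for row in logs]
    col_means = [sum(r * logs[i][j] for i, r in enumerate(row_w))
                 for j in range(len(col_w))]
    grand = sum(r * m for r, m in zip(row_w, row_means))
    return [[logs[i][j] - row_means[i] - col_means[j] + grand
             for j in range(len(col_w))]
            for i in range(len(row_w))]


# Hypothetical table with exactly proportional rows (independence).
independent = [[2, 5], [4, 10], [6, 15]]
r, c = masses(independent)
centred = weighted_double_centre_logs(independent, r, c)
```

For a table with exactly proportional rows the centred matrix vanishes, which is one way to see why the displays agree near independence.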

First, we compare CA and HD along principal dimensions. Both methods are equivalent for tables close to independence between rows and columns. A measure of agreement between the matrices used in CA and HD is defined. This measure is decomposed into components, each component being the product of the weighted means of coordinates in CA and HD, measuring the difference along a specific dimension.

Second, we jointly compare CA, HD and LR. Then CA can be compared to HD and to LR, but a formal analogy between HD and LR is not apparent. However, when rows and columns are almost independent, a simple formula shows that CA, HD and LR may provide a quite similar graphical display.

Finally, two illustrative examples are given. In the first one the results are very similar for the first dimension, but some differences are found along the second dimension. In the case of the second example there are hardly any differences along the first and second dimensions.

In conclusion, these methods may provide similar results under some circumstances. CA has several advantages (symmetric joint representation, probabilistic interpretation), but it may have some drawbacks when the rows are multinomial populations, for which the HD approach may be preferable.

Aitchison, J.
(1986). *The Statistical Analysis of Compositional Data*. London: Chapman
and Hall.

Aitchison, J. & Greenacre, M. J. (2000).
Biplots of compositional data. *Applied Statistics*, **51**, 375-392.

Rao, C. R. (1948). The utilization of multiple
measurements in problems of biological classification (with discussion), *Journal
of the Royal Statistical Society, Series B,* **10,** 159-193.

Rao, C. R. (1995). A review of canonical
coordinates and an alternative to correspondence analysis using Hellinger distance.
*Qüestiió*, **19**, 23-63.

Content and functions of social sharing of emotions: an application of multiple correspondence analysis

P O S T E R

Antonietta Curci & Giannangela Mastrorilli

University
of Bari, Italy

a.curci@psico.uniba.it

The aim of the present study is to investigate the contents of social sharing of emotions and its intra- and interpersonal functions in individuals’ lives. Small samples of people are asked to answer semi-structured interviews on emotional experiences of medium to high intensity. Participants are individuals who have recently had emotional experiences (e.g., undergraduate students after an important exam, users of health services who have been exposed to stressful and/or traumatic situations, etc.).

After an initial free account of the original experience, participants are requested to recall relevant episodes of social sharing of that experience with important partners, in order to investigate the contents and reasons for their social sharing. Interviews are accompanied by the usual measures of intensity and type of emotions, and scales of mental rumination and social sharing (Rimé et al., 1998). A corresponding number of people is interviewed on non-emotional experiences (e.g., a work-day, hobby, etc.). Participants are randomly assigned to one of the two emotional vs. non-emotional conditions.

Interviews are audiotaped and the texts are content-analysed according to a predefined category system. With respect to the contents of social sharing, besides the classical distinction between emotional and factual aspects (Pennebaker & Beall, 1986), some other features are expected to emerge from the analysis, namely references to evaluative processes, Self and life goals, coping strategies, and the belief system affected by the emotional experience. Concerning the functions of social sharing, references are expected to emerge to the functions of catharsis and insight, social support, search for meaning, subjective feeling of well-being, perceptions of self-efficacy and continuity, social influence and cultural references, and construction of new life goals and/or consolidation of already adopted goals.

Category frequencies are entered in a multiple correspondence analysis model in order to provide a visual display of the main contents and functions of social sharing of emotions with respect to the type and intensity of the emotional experiences.

Pennebaker, J. W. & Beall, S. K.
(1986). Confronting a traumatic event: toward an understanding of inhibition
and disease. *Journal of Abnormal Psychology*, **95**, 274-281.

Rimé,
B., Finkenauer, C., Luminet, O., Zech, E. & Philippot, P. (1998). Social
sharing of emotion: new evidence and new questions. In *European Review of
Social Psychology* (eds W. Stroebe & M. Hewstone), **8**, 145-189,
Chichester: Wiley.

Using correspondence analysis to explore the position of major organisational constructs in a comprehensive model: organisational climate, trust and mental health

Alessia D'Amato, Alexandra Lopes & Antonietta Curci

University of Surrey, UK, London School of Economics, UK & University of Bari, Italy

aldamato@unipd.it, a.c.lopes@lse.ac.uk & a.curci@psico.uniba.it

Organisational climate is widely recognised as an organisational framework for understanding employees’ perceptions and behaviour (Forehand & von Haller, 1964; Burke et al., 2002). While organisational climate is considered to be an organisational-level variable, there are other organisation-level and individual-level constructs that are affected by it (Schneider et al., 1998; Ashkanasy et al., 2000), and recent studies have empirically demonstrated that strategically focused climate measures produce strong relationships with specific organizational outcomes.

The paper presents the results of using correspondence analysis to analyse the structure of ten core first-order factors included in the general organisational climate, the resulting structure with regard to the socio-demographic variables wards and function, and the correlation with other major organizational constructs: trust, stress, burnout and climate for service.

The ten core first-order factors included in the general organisational climate were communication, leadership, job involvement, job description, team, reward, innovativeness, development, autonomy and consistency; they were obtained from a social constructionist perspective. In this model the variables are not context-specific but apply to different organizations: organizational climate is therefore considered as a generalizable set of factors (Majer & D’Amato, 2001). Socio-demographic variables have been included in the overall model to account for their impact as mediators of perceptions of organisational climate.

Data were collected using a survey
of 406 employees in different functions of a major Italian hospital. Results
are discussed with regard to current research literature on organizational climate
and some organizational and individual outcomes and demonstrate that, although
correspondence analysis can be considered a neglected method in the research
and literature on organizational behaviour, its contribution can be substantive
for the understanding of the organizational processes and their relationships.

Ashkanasy,
N. M., Wilderom, C. & Peterson, M. F. (eds) (2000). *Handbook
of Organizational Culture & Climate.* Thousand Oaks: Sage Publications.

Burke, M.
J., Borucki, C. C. & Kaufman, J. D. (2002). Contemporary
perspectives on the study of psychological climate: A commentary. *European Journal of Work and Organizational
Psychology*, **11**, 325-340.

Forehand,
G. A. & von Haller, G. (1964). Environmental variation
in studies of organizational behavior. *Psychological
Bulletin*, **62**, 361-382.

Majer, V. & D’Amato, A. (2001). *L’MDOQ, il questionario
multidimensionale per la diagnosi* *del Clima Organizzativo*. Padova:
Unipress.

Schneider, B., White, S. S. & Paul,
M. C. (1998). Linking Service Climate and Customer Perceptions of Service
Quality: Test of a Causal model. *Journal
of Applied Psychology*, **83**, 150-163.

Finding significant partitions in multiple correspondence analysis

Josep Daunis-i-Estadella^{1}, Tomàs
Aluja-Banet^{2} & Santiago Thió-Henestrosa^{1}

^{1}Universitat de Girona & ^{2}Universitat
Politècnica de Catalunya, Barcelona, Spain

josep.daunis@udg.es, tomas.aluja@upc.es &
santiago.thio@udg.es

Displaying the existing relationships between variables on a global map is one of the most appealing tools in multivariate analysis. It is intended for discovering hidden patterns and revealing meaningful information. Multiple correspondence analysis (MCA) is a common default analysis for the case of categorical data.

Categorical data can be represented by means of a hypercube. MCA then provides a global map of the associations among variables existing in the faces of the hypercube. It is also well known that with actual data the hypercube is sparse due to the curse of dimensionality, making it impossible to assess high-order interactions among variables. In this paper we intend to go one step further in the analysis of the association among variables within the framework of multivariate descriptive analysis, that is, using inertia as a measure of association between variables. The approach is based on the decomposition of global inertia into between-inertia and within-inertia (see, for example, Greenacre, 1984), as is done in a conditional multiple correspondence analysis (Escofier, 1987). In particular, we compute the significance of "the partition induced for every variable in the remaining ones", using the asymptotic distribution of the between-inertia, which is based on a chi-square distribution. This allows us to look more deeply into the relationships existing in the hypercube.
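The link between inertia and the chi-square distribution can be illustrated in the simplest case, a single cross-tabulation of two categorical variables, where the total inertia equals Pearson's chi-square statistic divided by the grand total. The sketch below is not the authors' procedure for between-inertia in MCA; the function name and the counts are hypothetical.

```python
def inertia_and_chi_square(table):
    """Return (inertia, chi_square, degrees_of_freedom) for a two-way table."""
    n_rows, n_cols = len(table), len(table[0])
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(n_rows)) for j in range(n_cols)]
    total = sum(row_tot)
    chi_square = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            expected = row_tot[i] * col_tot[j] / total  # count under independence
            chi_square += (table[i][j] - expected) ** 2 / expected
    inertia = chi_square / total
    degrees_of_freedom = (n_rows - 1) * (n_cols - 1)
    return inertia, chi_square, degrees_of_freedom


# Hypothetical cross-tabulation of two binary variables.
inertia, chi_square, dof = inertia_and_chi_square([[10, 20], [30, 40]])
```

Under independence, the grand total multiplied by the inertia is asymptotically chi-square distributed with the returned degrees of freedom, so it can be compared with the usual critical values.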

We will present an application of this methodology to the analysis of the scores on different subjects in a first university course and we will compare the obtained results with the significance of the corresponding terms of a log-linear model, which is the classical approach for such data.

Escofier, B. (1987). *Analyse des correspondances
multiples conditionnelle.* Technical Report, INRIA.

Greenacre,
M. J. (1984). *Theory and Applications of Correspondence Analysis*.
London: Academic Press.

Conditional bias measures of influence in correspondence analysis

Coral de la Cámara-García^{1}, Rafael
Pino-Mejías^{2,3}, Juan Muñoz-Pichardo^{2}

& María-Dolores Cubiles-de-la-Vega^{2}

^{1}Universidad de Huelva, ^{2}Universidad
de Sevilla & ^{3}Centro Andaluz de Prospectiva, Spain

camara@uhu.es, rafaelp@us.es, juanm@us.es & cubiles@us.es

We face the problem of constructing
influence diagnostics in simple correspondence analysis, considering the
identification of influential rows or columns. Our paper offers an alternative
approach to the measures based on the influence functions, and we present a
formalised study arising from the topic of conditional bias, introduced by
Muñoz-Pichardo et al. (1995). Given a realisation x_{I} of a subset X_{I}
of the sample X, the conditional bias of a statistic *T* is defined as *S*(x_{I}; *T*) = E[*T* | X_{I} = x_{I}] - E[*T*],
thus taking into account the conjoint influence of a set of observations, and
it accommodates the study of single observations as a particular case. In
preceding work we have proposed influence measures based on the estimation of
the conditional bias in the general linear model, both univariate
(Muñoz-Pichardo et al., 1995) and multivariate (Muñoz-Pichardo et al., 2000)
ones. In later work
(Enguix-González, 2002) we considered the principal components model,
developing influence measures for the eigenvalues and eigenvectors of the correlation
and covariance matrices.
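The definition of conditional bias can be checked on a toy case. The sketch below computes it exactly, by enumeration, for the sample mean of independent draws from a small discrete distribution; the helper names, the distribution and the sample size are hypothetical, and independence of the observations is assumed. For the sample mean of *n* such draws, fixing one observation at *x* gives a conditional bias of (*x* - μ)/*n*, where μ is the population mean.

```python
from fractions import Fraction
from itertools import product


def expected_statistic(stat, support, probs, n, fixed=None):
    """Exact E[stat(sample)] for n independent draws.

    `fixed` maps observation index -> value to condition on.
    """
    fixed = fixed or {}
    free = [i for i in range(n) if i not in fixed]
    total = Fraction(0)
    for combo in product(range(len(support)), repeat=len(free)):
        weight = Fraction(1)
        sample = [None] * n
        for i, value in fixed.items():
            sample[i] = value
        for i, k in zip(free, combo):
            sample[i] = support[k]
            weight *= probs[k]
        total += weight * stat(sample)
    return total


def conditional_bias(stat, support, probs, n, fixed):
    """S(x_I; T) = E[T | X_I = x_I] - E[T]."""
    return (expected_statistic(stat, support, probs, n, fixed)
            - expected_statistic(stat, support, probs, n))


def sample_mean(sample):
    return Fraction(sum(sample), len(sample))


# Hypothetical distribution with mean 5/4; condition on the first draw being 4.
support = [0, 1, 4]
probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
bias = conditional_bias(sample_mean, support, probs, n=3, fixed={0: 4})
```

Here `bias` equals (4 - 5/4)/3 = 11/12, in agreement with the closed form.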

Our previous research led us to the problem of correspondence analysis. We exploit the principal components interpretation of correspondence analysis that emerges from the generalized singular value decomposition of the chi-square residuals for the independence test between rows and columns. From this viewpoint, we consider expansions for the eigenvalues and eigenvectors resulting from the deletion of a set of rows or columns, building approximations based on the first- and second-order terms, so that the cumbersome task of recomputing the correspondence analysis results is avoided. These approximations are incorporated into the definition of the measures we propose to identify influential categories for the eigenvalues and the coordinates. Since these are the main objectives of correspondence analysis, they are the statistics considered when adapting the conditional bias definition.

We have implemented the proposed measures in the R system and illustrate them with several datasets. Finally, we suggest possible extensions to the multiple correspondence analysis framework.

Enguix-González, A. (2002). *Influence
Analysis in Principal Components*. PhD thesis, Universidad de Sevilla.

Muñoz-Pichardo, J. M., Muñoz-García, J., Moreno-Rebollo,
J. L. & Pino-Mejías, R. (1995). A new approach to influence analysis in
linear models. *Sankhya: The Indian Journal of Statistics, Series A*, **57**, 393-409.

Muñoz-Pichardo, J. M., Muñoz-García, J., Fernández Ponce,
J. M. & Jiménez-Gamero, M. D. (2000). Influence analysis in multivariate
general linear models. *Communications in
Statistics. Theory and Methods*, **29**, 529-547.

Principal curves for correcting the horseshoe effect in correspondence analysis

Pedro Delicado & Tomàs Aluja

Universitat
Politècnica de Catalunya, Barcelona, Spain

pedro.delicado@upc.es
& tomas.aluja@upc.es

Correspondence analysis (CA) is a useful
technique to analyze count data, revealing meaningful association patterns
among row and column categories of a table. However, a non-linear ordination of
rows and columns often appears in CA displays. When the probability
distributions of the row or column categories are unimodal, the successive
factors appear to be polynomials of increasing degree in the first one, though they are
uncorrelated by construction. This is known as the horseshoe effect or Guttman
effect (Benzécri, 1973; Greenacre, 1984). Although it is interesting to explore
the departures of points from the curvature, it is problematic when we are interested
in building an index from data, since a linear effect is shown as being
non-linear. This is the case in ecology when sites are correlated with species
or when building a social status index from census data.
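The horseshoe effect described above can be reproduced on synthetic data. The sketch below is only an illustration under stated assumptions: it builds a hypothetical table in which each column has a unimodal (Gaussian) response along a single gradient, runs a plain CA by SVD, and checks how well the second factor is explained by a quadratic function of the first. The table, the bandwidth 0.15 and the name `ca_rows` are all assumptions, not part of the presented method.

```python
import numpy as np

def ca_rows(N, k=2):
    """Row principal coordinates (first k axes) and row masses of a simple CA."""
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, _ = np.linalg.svd(S, full_matrices=False)
    F = (U[:, :k] * sv[:k]) / np.sqrt(r)[:, None]
    return F, r

# Hypothetical data: 30 sites on one gradient, 20 species with unimodal responses.
sites = np.linspace(0.0, 1.0, 30)
optima = np.linspace(0.0, 1.0, 20)
N = np.exp(-(sites[:, None] - optima[None, :]) ** 2 / (2 * 0.15 ** 2))

F, r = ca_rows(N)
f1, f2 = F[:, 0], F[:, 1]

# The two factors are uncorrelated by construction (weighted by row masses) ...
weighted_cov = np.sum(r * f1 * f2)

# ... yet the second is close to a quadratic function of the first.
coef = np.polyfit(f1, f2, 2)
r2 = 1.0 - np.var(f2 - np.polyval(coef, f1)) / np.var(f2)
```

Plotting `f2` against `f1` shows the arch; `r2` measures how much of the second factor a parabola in the first one accounts for.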

The technique of principal curves
(Hastie & Stuetzle, 1989) appeared as a way to generalize principal
component analysis to non-linear settings. Principal curves are smooth curves
that pass through the middle of a multivariate continuous data set (see also
Kégl et al., 2000, and Delicado, 2001).

In this presentation we propose to
extract the principal curves from the data as a way of eliminating the
non-linearity. In practice, only a few (from 2 to 4) factors are used as input
for the principal curves algorithm. It is possible to define new individual
scores from their relative position with respect to principal curves. If only
one principal curve is extracted, the new scores summarize the data better than
the first correspondence analysis factor. This is because the principal curve
maximizes the projected dispersion (inertia) over a class of functions in which
the straight lines are included.

The method is illustrated for deriving a social status index for the geographical units that compose the city of Barcelona, using the socio-professional profile of their inhabitants.

Benzécri, J.-P. et al. (1973). *L’Analyse
des Données. Tome 1: La Taxinomie. Tome 2:* *L’Analyse des Correspondances*.
Paris: Dunod.

Delicado, P. (2001). Another look at principal curves and surfaces. *Journal
of Multivariate Analysis*, **77**, 84-116.

Greenacre, M. J. (1984). *Theory and Applications of Correspondence Analysis*.
London: Academic Press.

Hastie, T.
& Stuetzle, W. (1989). Principal curves. *Journal of the American Statistical Association*,
**84**, 502-516.

Kégl, B.,
Krzyzak, A., Linder, T. & Zeger, K. (2000). Learning and design of principal curves. *IEEE
Transactions on Pattern Analysis and Machine Intelligence*, **22**, 281-297.

Analyses of matched pairs of data matrices by complex singular value decomposition

Simplice Dossou-Gbété

Université de Pau et des Pays de l'Adour, Pau, France

simplice.dossou-gbete@univ-pau.fr

A central concept in the analysis of square
tables is that of symmetry and, consequently, that of departure from symmetry.
The square table under consideration, possibly pre-processed, is split into two
matrices, a symmetric part and a skew symmetric part (Greenacre, 2000). A
similar decomposition can be elaborated for a set of two matched two-way
tables. In the more general setting of a matched pair of matrices
**A** and **B**, the concept of symmetry translates into the concept of a
“common” part, and departure from symmetry into departure from the common
part, that is, the “specific” part. Most statistical analyses addressed to
square tables can then be extended and it turns out that descriptive and
modelling points of view are closely intertwined.

The main purpose of this
presentation is an investigation into the exploratory analysis of a set of two
matched two-way tables and their biplot visualizations. It is shown how standard
methods, initially derived for the analysis of square tables, extend to this
more general setting. We recall that biplot visualizations for a given data matrix **M**
are derived from reduced rank approximations obtained by (generalized) singular
value decomposition and that these approximations are least-squares optimal (Falguerolles
& Greenacre, 2000). Then we review some of the singular value decompositions
which can be considered for the joint analysis of tables **A** and **B**
(or of associated “common” part **C** and “specific” part **D**).

The idea of this paper is to
consider the complex matrix **C**+i**D** for the joint analysis of **C**
and **D**. Then the central trick in this work is the singular value
decomposition of complex matrices which provides simultaneously a reduced rank
approximation for both the “common” part and the “specific” part. The natural
bi-dimensionality of this approach is appealing: biplots are best displayed in
two dimensions. The modelling interpretation will be emphasized and used
to compare the different approaches to the analysis of such matched pairs of
tables. It turns out that it can also be used to fit reduced rank models
(Falguerolles, 1998). The results are illustrated on data sets.
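A minimal numerical sketch of the complex-matrix idea follows. It assumes, for illustration only, that the “common” and “specific” parts are **C** = (**A**+**B**)/2 and **D** = (**A**-**B**)/2; the abstract does not state these formulas, and the random matrices are hypothetical. NumPy's `numpy.linalg.svd` accepts complex input, so a single decomposition of **C**+i**D** yields approximations of both parts.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))              # hypothetical first table
B = A + 0.3 * rng.normal(size=(6, 4))    # hypothetical matched second table

C = (A + B) / 2    # assumed definition of the "common" part
D = (A - B) / 2    # assumed definition of the "specific" part

Z = C + 1j * D
U, s, Vh = np.linalg.svd(Z, full_matrices=False)

def rank_r(r):
    """Rank-r least-squares approximation of Z from its complex SVD."""
    return (U[:, :r] * s[:r]) @ Vh[:r, :]

Z2 = rank_r(2)
C2, D2 = Z2.real, Z2.imag                # joint approximations of C and D
err = np.linalg.norm(Z - Z2) ** 2        # equals the sum of discarded s**2
```

The squared error of the complex approximation splits additively into the errors on **C** and on **D**, which is what makes the single decomposition a joint fit of both parts.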

Falguerolles, Antoine de (1998).
Log-bilinear biplots in action. In *Visualisation of Categorical Data* (eds
J. Blasius & M. Greenacre). London: Academic Press.

Falguerolles, Antoine de, &
Greenacre, Michael (2000). Statistical modelling for matched tables. In *Statistical Modelling,
Proceedings of the 15th International Workshop on Statistical Modelling (IWSM)*
(eds V. Núñez-Antón & E. Ferreira), Universidad del País Vasco,
195-200.

Greenacre, Michael (2000).
Correspondence analysis of square asymmetric matrices. *Applied Statistics*,
**49**, 297-310.

Introduction of correspondence analysis in multiway methods of simultaneous ordination

Anne B. Dufour, Sandrine Pavoine & Daniel Chessel

Université
Claude Bernard, Lyon, France

dufour@biomserv.univ-lyon1.fr,
pavoine@biomserv.univ-lyon1.fr & chessel@biomserv.univ-lyon1.fr

We propose to introduce the logic of
correspondence analysis (CA), using the duality diagrams, in the K-tables
methods such as the multiple factor analysis (MFA), multiple co-inertia
analysis and the simultaneous analysis of tables (ACT-STATIS, Lavit, 1988;
Lavit et al., 1994). The experimental situation is the analysis of a set of *K*
contingency tables or a set of arrays with positive or null values. All the
arrays are paired by rows. An example of this kind of data is the study of genetic
relationships among cattle breeds with microsatellites (Moazami-Goudarzi et
al., 1997).

Different preliminary methods are
available for analysing these data. Each array is a contingency table which can
be analysed separately, resulting in an association of *K* analyses for
which each row has the same weight. It is also possible to introduce a
coordination of each separate analysis with
the intra-class analysis onto the column partition. This method is a CA
when the marginal profiles are constant. Finally, this approach is generalized
by the correspondence analysis of doubly partitioned arrays, the so-called
internal correspondence analysis (Cazes et al., 1988).

Other methods, such as MFA (Escofier
& Pagès, 1994), analyse *K* duality diagrams for contingency tables.
MFA is an intra-block correspondence analysis introducing a simultaneous representation
of the rows (cattle breeds in our example). It can be shown that the MFA is
also related to another method, multiple co-inertia analysis (Chessel &
Hanafi, 1996) which explores the importance of each table in the graph of the
synthetic variables. The typology revealed by the two previous analyses can be
obtained by the compromise structure of ACT-STATIS. In conclusion, we
illustrate the interest of each method for studying a data structure. All the analyses
and plots are performed with the ade4 package for the R environment.

Cazes, P., Chessel, D. & Doledec, S.
(1988). L'analyse des correspondances internes d'un tableau partitionné : son
usage en hydrobiologie. *Revue de Statistique Appliquée*, **36**,
39-54.

Chessel, D.
& Hanafi, M. (1996). Analyses de la co-inertie de K
nuages de points. *Revue de Statistique Appliquée*, **44**,
35-60.

Escofier, B. & Pagès, J. (1994). Multiple factor analysis (AFMULT
package). *Computational Statistics and Data Analysis*, **18**, 121-140.

Lavit, C. (1988). *Analyse Conjointe
de Tableaux Quantitatifs.* Paris: Masson.

Lavit, C., Escoufier, Y., Sabatier, R.
& Traissac, P. (1994). The ACT (Statis method). *Computational
Statistics and Data Analysis*, **18**,
97-119.

Moazami-Goudarzi,
K., Laloë, D., Furet, J. P. & Grosclaude, F. (1997). Analysis of genetic relationships
between 10 cattle breeds with 17 microsatellites. *Animal Genetics*, **28**, 338-345.

Application of constrained and unconstrained correspondence analysis to benthic communities of the Great Barrier Reef

Rodney Ellis¹, Roland Pitcher²,
Bronwyn Harch² & Kaye Basford¹

¹University of Queensland, Brisbane, ²CSIRO, Cleveland, Australia

rellis@uq.edu.au

Finding patterns in multivariate species assemblage data and additionally relating those patterns to a collected set of environmental parameters is an important and common endeavour amongst marine ecologists. These types of studies can be approached from either a constrained (direct gradient) or an unconstrained (indirect gradient) analysis using various ordination techniques. This presentation applies both approaches using forms of correspondence analysis to analyse biomass data collated for 922 epibenthic species assemblages and 19 related environmental parameters at 162 sampling stations inside the far northern section of the Great Barrier Reef. The effects of data transformation, taxonomic resolution, and numbers of species retained on the results and interpretations given by each approach were investigated. Procrustes analysis was used to aid in the comparison of the ordinations given by these different sets of analyses.

The unconstrained ordinations resulted in the first axes accounting for 3.1% to 16.2% of the total biological variation with some ordinations showing detectable inshore and offshore trends among sampling stations. The constrained ordinations showed the same inshore and offshore trends with the first axis explaining 17% to 51% of the total species-environmental relationship. Percent mud, benthic stress, phosphate, silicate, grainsize and chlorophyll-a were the contributing parameters in each of the analyses. The number of species used in the analysis greatly affected the results and interpretations given by both approaches, as did the taxonomic resolution. The difference between the result of using a log(biomass +1) transformation and a conversion to presence/absence scores applied to assemblage data increased with decreasing taxonomic resolution. The ordination scores on the first axes from both the constrained and unconstrained analyses generally revealed the same underlying gradient, separating inshore and offshore sites.
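The comparison of ordinations obtained under different data transformations can be sketched as follows. This is an illustration on simulated data, not the reef data set: the biomass matrix is hypothetical, the ordination is a plain unconstrained CA, and `procrustes_m2` implements the usual symmetric Procrustes sum of squares (both configurations centred and scaled to unit norm) with NumPy only.

```python
import numpy as np

def ca_site_scores(N, k=2):
    """Site (row) principal coordinates on the first k CA axes."""
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, _ = np.linalg.svd(S, full_matrices=False)
    return (U[:, :k] * sv[:k]) / np.sqrt(r)[:, None]

def procrustes_m2(X, Y):
    """Symmetric Procrustes statistic: 0 for identical shapes, at most 1."""
    X0 = X - X.mean(axis=0)
    Y0 = Y - Y.mean(axis=0)
    X0 = X0 / np.linalg.norm(X0)
    Y0 = Y0 / np.linalg.norm(Y0)
    sv = np.linalg.svd(X0.T @ Y0, compute_uv=False)
    return 1.0 - sv.sum() ** 2

# Hypothetical biomass data: 25 stations on a gradient, 15 species.
rng = np.random.default_rng(42)
gradient = np.linspace(0.0, 1.0, 25)
optima = np.linspace(0.0, 1.0, 15)
response = np.exp(-(gradient[:, None] - optima[None, :]) ** 2 / (2 * 0.2 ** 2))
biomass = 50.0 * response * rng.lognormal(0.0, 0.5, size=response.shape)
biomass[response < 0.1] = 0.0            # species absent far from its optimum

log_scores = ca_site_scores(np.log1p(biomass))             # log(biomass + 1)
pa_scores = ca_site_scores((biomass > 0).astype(float))    # presence/absence
m2 = procrustes_m2(log_scores, pa_scores)
```

A small `m2` indicates that the two transformations lead to nearly the same configuration of stations after rotation, reflection and scaling.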

A joint statistical analysis for a pair of tables which are not completely matched

Antoine de Falguerolles

Université Paul Sabatier, Toulouse, France

falguero@cict.fr

In this presentation, I will analyze a pair
of historical contingency tables taken from the *Mémoires pour servir à l'histoire de Languedoc* by Nicolas de
Lamoignon de Basville (1734). Nicolas de Lamoignon de Basville (26 April 1648 - 17 May 1724), the *intendant* (governor) of the Languedoc
province for 23 years (1695-1718), is famous for his harsh repression of the
Calvinist “phanatiques” living in the Cévennes. During his governance, he
supervised a memoir for the instruction of the Duc de Bourgogne giving a
thorough description of the Languedoc province. Numerous hand-written copies of
the *Mémoires* were circulated after 1697 before their late publication in
1734 (see Moreil, 1985).

Among several tables, the *Mémoires* presents two so-called maps:
one containing the number of ecclesiastics, the number of religious houses and
monasteries, and one concerning the convents for women and the number of nuns.
Interestingly, the layout of these maps is the present-day standard form for
two-way tables: counts cross-classified by dioceses (23 in the Languedoc
province) and by religious orders (25 for men, 23 for women). Clearly, the two
tables have a common geographical dimension and are thus related. However, they
are not completely matched since the categories for the men do not correspond
to those for the women (see Falguerolles & Greenacre, 2000).

I will discuss several bi-linear models (Falguerolles & Francis, 1992; Falguerolles, 2000) for the analysis of a pair of tables having in common one marginal. I will try to see if Nicolas de Lamoignon de Basville could report to the Court in Versailles a proper coverage of the Languedoc province by Roman Catholic nuns and friars with special attention to the unrest in the Cévennes.

Basville, N. Lamoignon de (1734). *Mémoires pour servir à l’Histoire de
Languedoc*. Amsterdam: Pierre Boyer.

Falguerolles, A. de (2000). Gbms: Glms with bilinear terms. In *COMPSTAT
2000, proceedings in computational statistics*, (eds J. Bethlehem &
P.G.M. van der Heijden), 53-64. Heidelberg: Physica-Verlag.

Falguerolles, A. de & Francis, B.
(1992). Algorithmic approaches for fitting bilinear models. In *COMPSTAT 92, proceedings in computational statistics* (eds Y. Dodge
& J. Whittaker), **1**, 77-82. Heidelberg:
Physica-Verlag.

Falguerolles, A. de & Greenacre, M.
J. (2000). Statistical modelling for matched tables. In *Statistical Modelling, Proceedings of the 15th International Workshop
on Statistical Modelling (IWSM)* (eds V. Núñez-Antón & E. Ferreira),
195-200. Bilbao: Universidad del País Vasco.

Moreil, F. (1985). *L'Intendance de Languedoc à la Fin du XVII^{ème} Siècle,
Édition Critique du Mémoire pour l'Instruction du Duc de Bourgogne*. Paris: CTHS.

Type of organizational culture at a public university

Karmele Fernández-Aguirre, Petr Mariel & Ana Martín-Arroyuelos

Universidad del País Vasco/Euskal Herriko Unibertsitatea, Bilbao, Spain

etpfeagk@bs.ehu.es

The objective of this paper is to analyze
organizational aspects at the University of the Basque Country at three
different levels: departments, research groups and the university overall, paying
special attention to the organizational culture. The model we adopt in our
analysis is the *Model of Values in
Competition* (Cameron & Quinn, 1999) which is based on two bipolar dimensions.
The first one opposes the organizational position towards *interior* against *exterior*
and the second one opposes *flexibility*
against *control*. These two axes form
four quadrants in which the following organizational positions (types of culture)
can be placed: *clan*, *hierarchy*, *market* and *innovation*.

The type of culture compatible with *hierarchy* is defined as a space of
formalized and structured work, where the formal rules and policies are pillars
of the organization. Here, effective
leaders have to coordinate and organize properly, and the general long-run
objectives are stability, predictability and efficiency. The organizational
form called *market* is based on
“management by objectives” and “transaction costs”. The institution is oriented
towards the exterior more than towards its own internal issues, that is, towards
transactions with external organizations such as suppliers, customers, trade unions,
etc. Internal control is maintained through the economic mechanisms of
the market and not by central decisions and rules as in the *hierarchy*. The name *clan*
of the third type of culture is used for its similarity with the family
organization. This type of organization resembles a large family more than an
economic unit, and is characterized by shared values, shared objectives, cohesion,
participation and a very strong feeling of “we”. The rules and procedures typical
of *hierarchy* are replaced by
employee involvement and corporate consensus. Finally, *innovation* means that the most important
task of management is to stimulate knowledge, risk-taking and creativity in order to
be the “most recent”. This type of culture is based on groups which improve the
basic procedures of an organization to achieve adaptability, flexibility and
creativity.

We use a data set obtained from a
survey which collects responses from 600 lecturers out of a total of 2900 who
work at the University of the Basque Country. We apply cluster analysis in the
space of the first factors obtained from a multiple correspondence analysis
(Lebart, 1994), centering our analysis on the characteristics of the formation
and dissolution of research groups which present quite different organizational
structure in comparison with the university. The conclusions we obtain indicate
that the *flexible* culture prevails
over the *rigid* one and that the *clan* type of culture with a high
percentage of *innovation* is the one most
often perceived. These conclusions support the hypothesis that
the rigid university structure is opening up through high-quality research groups.

Cameron, Kim S. & Quinn, Robert E.
(1999). *Diagnosing and Changing
Organizational Culture.* Reading, MA: Addison Wesley Longman.

Lebart, L. (1994). Complementary use of correspondence analysis
and cluster analysis. In *Correspondence Analysis in the Social
Sciences* (eds M. Greenacre & J. Blasius), 162-178. London: Academic
Press.

Using optimal scaling to scale items for questionnaires

Giovanni Battista Flebus

Università degli Studi di Milano-Bicocca, Milano, Italy

giovannibattista.flebus@unimib.it

Although the technique of optimal scaling
has been known for decades (Guttman, 1950), there are hardly any examples of its
application in mental test construction (Greenacre, 1984). The method enables
a researcher to scale nominal answers in multiple choice tests (see, for example,
Gifi, 1990), even though current examples imply the existence of one right (=1)
and several wrong answers (=0). It will be shown that the technique can also be
applied to "typical performance" tests, such as attitude or
personality questionnaires. To illustrate this principle, two empirical
examples of test construction with the optimal scaling technique are presented,
where there are no right answers, and (as in the first example) where there is
no *a priori* or ordered scoring.
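One common formulation of optimal scaling for nominal items takes the category quantifications from the first dimension of a correspondence analysis of the indicator (dummy) matrix, and scores each respondent by the mean quantification of the options chosen. The sketch below follows that formulation as an assumption, since the abstract does not specify the algorithm; the simulated responses and the name `optimal_scores` are hypothetical. Category labels are used only as nominal codes.

```python
import numpy as np

def optimal_scores(responses, n_cat):
    """First-dimension category quantifications and respondent scores."""
    n, q = responses.shape
    # Indicator matrix: one column per (item, category) pair.
    Z = np.zeros((n, q * n_cat))
    for j in range(q):
        Z[np.arange(n), j * n_cat + responses[:, j]] = 1.0
    Z = Z[:, Z.sum(axis=0) > 0]          # drop categories never chosen
    P = Z / Z.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    cat_quant = Vt[0] / np.sqrt(c)       # category quantifications
    person = (Z @ cat_quant) / q         # mean quantification of chosen options
    return cat_quant, person, sv[0] ** 2

# Hypothetical data: 200 respondents, 8 items, 5 response options each.
rng = np.random.default_rng(7)
n, q, n_cat = 200, 8, 5
latent = rng.normal(size=n)
raw = latent[:, None] + rng.normal(scale=0.8, size=(n, q))
responses = np.clip(np.round(raw + 2), 0, n_cat - 1).astype(int)

cat_quant, person, inertia = optimal_scores(responses, n_cat)
```

The sign of the scale is arbitrary, so in practice the scores are oriented by inspecting which categories receive positive quantifications. The variance of `person` equals the first principal inertia, which summarizes how homogeneous the items are.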

*Example 1*: an attitude questionnaire. An eight-item attitude scale, meant to
measure attitudes towards gay people, was constructed using the multiple-choice
format. Each item (the stem) is to be answered by selecting one sentence out of
five; two of them depict a negative attitude, two others present a positive
attitude, while the fifth presents a more or less indifferent attitude. The
optimal scoring technique was applied to the eight items, and the total score was
compared with a Likert scale, validated to measure attitudes in a more
traditional way (Flebus & Montano, 2001). The sample, made up of 2323
Italian adults, gave a high reliability coefficient for the optimal score
scale, and a high correlation coefficient was found with parallel, more traditional Likert
scales. The Guttman effect (*horseshoe effect*) can be used as a
diagnostic tool to ascertain that the scale is, as it should be,
unidimensional.

*Example 2*: a multi-factor questionnaire to detect vocational indecision. A
62-item questionnaire, written to detect students' indecision, in the same
format as a multiple-choice test, was scaled with the optimal score technique.
By alternating factor analysis and optimal scoring, a multi-factor solution was
found: the internal validity was ascertained with Cronbach's alpha coefficient,
and concurrent validity was assessed with interviews (Flebus, 2000).

Flebus, G. B. (2000). Un questionario di autovalutazione della
scelta scolastica. In *Orientamenti per l'Orientamento* (ed. S.
Soresi). Firenze: Giunti.

Flebus, G. B. & Montano, A. (2001). *The Italian
Homophobia Scale - an internal and concurrent validity study*. Presentation
at 2001 ISSID Congress in Edinburgh.

Gifi, A. (1990). *Nonlinear Multivariate Analysis*.
New York: John Wiley.

Greenacre, M. J. (1984). *Theory and Applications of
Correspondence Analysis*. New York: Academic Press.

Guttman, L. (1950). The principal components of scale
analysis. In *Measurement and Prediction* (ed. S. A. Stouffer). Princeton,
NJ: Princeton University Press.

The Milestones Project: a case-study in the historiography of data visualization

Michael Friendly

York University, Toronto, Canada

friendly@yorku.ca

The graphic representation of quantitative information has deep roots. These roots reach into the histories of the earliest map-making and visual depiction. Later, they extend to thematic cartography, statistics and statistical graphics, medicine, and other fields which have now come to rely upon visual representations to display, illustrate, or explain relations or phenomena more easily than with just words or tables (Friendly & Denis, 2000).

Along the way, developments in technologies (printing, reproduction, computing), in mathematical and statistical theory and practice, and in empirical observation and recording nurtured and replenished the soil. The Milestones Project (Friendly & Denis, 2001) attempts to document and illustrate these historical developments leading to modern data visualization and visual thinking. There are several goals:

· Prepare a comprehensive catalog of important milestones in all fields related to data visualization.

· Collect representative images, bibliographical citations, cross-references, web links in a single location.

· Enable researchers to search for and study themes, antecedents, influences, patterns, trends, and so forth.

In this presentation I discuss:

· An overview of the project and its current status.

· Some examples of graphical excellence from the “golden age of statistical graphics” (1860-1900), e.g. Friendly (2002).

· Questions of documenting milestone “events” for modern historiography.

· Meta-questions of representation of this history.

Friendly, M. (2002). Visions and
re-visions of Charles Joseph Minard. *Journal of Educational and Behavioral
Statistics*, **27**, 31–51.

Friendly, M. & Denis, D. J. (2000).
The roots and branches of statistical graphics. *Journal de la Société
Française de Statistique*, **141**, 51–60 (published in 2001).

Friendly, M. & Denis, D. J. (2001). Milestones in the history of
thematic cartography, statistical graphics, and data visualization. http://www.math.yorku.ca/SCS/Gallery/milestone/1

An alternative to the nonsymmetrical correspondence analysis based on TUCKALS3 algorithm

Purificación Galindo & Sonia Salvo-Garrido

Universidad de Salamanca, Spain & Universidad de la Frontera, Chile

pgalindo@usal.es & ssalvo@ufro.cl

There are three different approaches for
the study of a three-way contingency table; one is to construct two two-way
contingency tables, crossing the dependent variable with each of the explanatory
variables, that is, working with the marginal distributions. This approach does
not consider possible relationships between the explanatory variables. A second
approach considers interactively coding the two explanatory variables in a new *I* × *JK*
table; however, this approach does not consider the information as it is included in
the original table. The third approach corresponds to the partial
nonsymmetrical correspondence analysis defined by Lauro & Balbi (1999),
which consists in analyzing the relationship between the dependent variable *i* and the explanatory variable
*j* at a given level of *k*, or conversely.
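The three ways of reducing a three-way table to two-way tables can be written down directly with array operations. The sketch below uses a hypothetical *I* × *J* × *K* table of counts, with the response on the first axis; it only builds the tables that each approach would analyse and performs no correspondence analysis.

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, K = 3, 4, 2
N = rng.integers(1, 20, size=(I, J, K))   # hypothetical three-way counts

# Approach 1: two marginal two-way tables (response by each predictor).
N_ij = N.sum(axis=2)                      # I x J, summed over k
N_ik = N.sum(axis=1)                      # I x K, summed over j

# Approach 2: interactive coding of the two predictors, an I x JK table.
# Column j * K + k holds the counts for the predictor combination (j, k).
N_i_jk = N.reshape(I, J * K)

# Approach 3: partial analysis, one I x J table per level k.
slices = [N[:, :, k] for k in range(K)]
```

Each construction keeps the response categories as rows, so the resulting tables differ only in how the information on the two explanatory variables is arranged or discarded.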

However, none of the previous approaches truly considers the three-way structure of the table, rather decomposing it into several forms of two-way tables. We need an additional approach which, considering the dependency among the independent variables, simultaneously analyzes the dependence of the response with respect to them. For this purpose, it is necessary to define a new form of representation of the three-way structure of the table in the plane.

Taking into account the nonsymmetrical correspondence analysis (NSCA) proposed by Lauro and D’Ambra (1984), we propose an alternative based on the generalized SVD using a criterion proposed by Timmerman and Kiers (2000) and the new graphical interactive biplot representation proposed by Carlier and Kroonenberg (1996). The interpretation of the matrix of interactions between the latent dimensions of the three modes, after applying the TUCKALS3 algorithm to the residuals matrix, allows us to determine the predictive capacity of the explanatory variables. The interactive biplot representation of the residuals matrix shows the best predicted categories of the response variable, and the categories of the explanatory variables which have the greatest predictive capacity. By projecting the categories of the response variable onto the vectors defined by combinations of categories of the predictors, we obtain the graphical representation of the levels of association between the predictor and response variables.

We also present a generalization of the Gray and Williams (1975) multiple association index for the case of several explanatory variables.
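For orientation, the two-way quantity that such an index generalizes is the Goodman and Kruskal tau, whose numerator is the inertia decomposed by NSCA. The sketch below covers only this two-way case, with the response in the rows and a single predictor in the columns; it is not the multiple index presented here, and the example table and the name `gk_tau` are hypothetical.

```python
import numpy as np

def gk_tau(N):
    """Goodman-Kruskal tau for predicting rows (response) from columns."""
    P = N / N.sum()
    r = P.sum(axis=1)                     # response margins
    c = P.sum(axis=0)                     # predictor margins
    # Column profiles of the response minus the marginal response profile.
    Pi = P / c - r[:, None]
    numerator = (c * Pi ** 2).sum()       # inertia analysed by NSCA
    tau = numerator / (1.0 - (r ** 2).sum())
    # NSCA axes: SVD of the residuals weighted by the predictor masses.
    sv = np.linalg.svd(Pi * np.sqrt(c), compute_uv=False)
    return tau, numerator, sv

# Hypothetical 3 x 4 table of counts.
N = np.array([[30,  5,  5, 10],
              [ 5, 25, 10,  5],
              [ 5, 10, 30, 10]], dtype=float)

tau, numerator, sv = gk_tau(N)
```

The squared singular values in `sv` sum to `numerator`, so each NSCA axis accounts for a share of the predictability measured by tau.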

Carlier, A. & Kroonenberg, P.
(1996). Decompositions and biplots in three-way correspondence analysis. *Psychometrika*, **61**, 355-373.

Gray, I. N. & Williams, J. S.
(1975). Goodman and Kruskal's tau *b*: multiple and partial analogs. *Proceedings of Social Statistics Sections of
the American Statistical Association*, 444-448.

Kroonenberg, P. (1989). Singular value
decompositions of interactions in three-way contingency tables. In *Multiway
Data Analysis* (eds R. Coppi & S. Bolasco). Amsterdam: North-Holland.

Lauro, N. C. & D'Ambra, L. (1984). L'analyse non symétrique des
correspondances. In *Data Analysis and Informatics, III* (eds
E. Diday et al.), 433-446. Amsterdam: North-Holland.

Lauro, N. C. & Balbi, S. (1999). The analysis of
structured qualitative data. *Applied Stochastic Models and Data Analysis*,
**15**, 1-27.

Timmerman,
M. E. & Kiers, H. A. L. (2000). Three-mode principal
component analysis. Choosing the numbers of components and sensitivity to local
optima. *British Journal of Mathematical
and Statistical Psychology*, **53**, 1-16.

Exploring differences and overlap between Middle Stone Age artefacts using multiple correspondence analysis and biplot methodology

Sugnet Gardner & Niël J. le Roux

University of Stellenbosch, South Africa

njlr@sun.ac.za

Technology changed more subtly in the Middle Stone Age (MSA) than in today’s rapid Computer Age. Artefacts known as blades and points excavated on the south coast of Africa provide information for exploring differences between the sub-stages MSA I, MSA II Upper and MSA II Lower. Because of the slow rate of change, some overlap between the sub-stages is to be expected. Both categorical and continuous variables were measured for exploring the relationships between the sub-stages. However, since the categorical measurements are more subjective, conclusions based on the correspondence between these variables and the sub-stage classification might be questioned.

Developments in biplot methodology (Gower & Hand, 1996; Gardner, 2001) since its introduction by Gabriel (1971) have provided the infrastructure for many novel applications when dealing with such exploratory data analyses. The unified biplot methodology introduced by Gower and Hand allows for separate as well as simultaneous graphical representations of continuous and categorical variables. Multiple correspondence analysis and generalised biplot representations can easily be obtained through this unified approach by utilising different distance metrics. These graphical representations can then just as easily be linked to class separation through canonical variate analysis biplot displays.

Gardner (2001) explored different methods of describing the spread of a cloud of points, leading to the definition of an α-bag. Quantifying the separation and overlap between the artefacts of the three sub-stages is possible by superimposing these α-bags onto biplot displays.

In this presentation the continuous measurements in the data set discussed by Wurz et al. (2003) are supplemented with categorical data. Multiple correspondence analysis representations are compared to several generalised biplot displays. Canonical variate biplot displays with accompanying α-bags are utilised to quantify and describe the overlap and separation between the sub-stages. The relevance of the categorical variables in discriminating between the sub-stages is evaluated with biplot displays based on the continuous variables.

The paper provides an illustration of how the exploration of multivariate data sets consisting of both continuous and categorical data can be approached by combining multiple correspondence analysis with various biplot techniques and α-bags.

Gabriel, K.
R. (1971). The biplot graphical display of matrices with
application to principal component analysis. *Biometrika*, **58**, 453-467.

Gardner,
S. (2001). *Extensions of biplot methodology to
discriminant analysis with applications of non-parametric principal components*.
Unpublished PhD thesis, University of Stellenbosch.

Gower, J. C. & Hand, D. J. (1996). *Biplots*. London: Chapman & Hall.

Wurz, S., le Roux, N. J., Gardner, S.
& Deacon, H. J. (2003). Discriminating between the end products of the earlier Middle Stone Age
sub-stages at Klasies River using biplot methodology. *Journal of Archaeological Science* (in press).

Topics of interest in internet access: an application of simultaneous analysis

Beatriz Goitisolo & Amaya Zárraga

Universidad del País Vasco/Euskal Herriko Unibertsitatea, Bilbao, Spain

bg@alcib.bs.ehu.es & az@alcib.bs.ehu.es

The aim of this work is to describe how
individuals relate to new technologies and, especially, their behaviour
towards the Internet. To this end we use the Survey on the Information Society
(ESI) drawn up by the Basque Institute of Statistics (EUSTAT) in the fourth
quarter of 2000.

The ESI obtains information on the equipment available to individuals at home, at school and at work, on administrative use of automatic teller machines (ATMs) and on contact with the different mass media and the Internet. The variables that refer to the Internet can be grouped in four blocks:

· Access

· Internet knowledge

· Use of Internet services

· Topics of interest in Internet access

The individuals surveyed are characterized
using the sociodemographic variables extracted from the Survey of Population in
Relation to Activity (PRA), also drawn up by EUSTAT for the same period of
time.

With the information from these
surveys various contingency tables are created crossing the sociodemographic
variables with those relative to the Internet. The method used for the joint
study of these tables is simultaneous analysis (Zárraga & Goitisolo, 2002
and 2003). This method allows the internal structure of each table to be
maintained, and prevents any one of them from dominating in the overall
analysis. A more in-depth study on the topic can be found in Goitisolo (2002).

EUSTAT - Instituto Vasco de Estadística (www.eustat.es).

Goitisolo, B. (2002). *El Análisis
Simultáneo. Propuesta y Aplicación de un Nuevo Método de Análisis Factorial de
Tablas de Contingencia*. Doctoral thesis, University of the Basque Country.

Zárraga, A. & Goitisolo, B. (2002). Méthode factorielle pour l’analyse simultanée de tableaux de contingence. *Revue de Statistique Appliquée*, **50**, 47-70.

Zárraga, A. & Goitisolo, B. (2003).
Étude de la structure inter-tableaux à travers l’Analyse Simultanée. *Revue
de Statistique Appliquée* (forthcoming).

Non-orthogonality in correspondence analysis and related methods

John C. Gower

The Open University, Milton Keynes, U.K.

j.c.gower@open.ac.uk

England and the U.S.A. have been described as two nations divided by a common language. There is ample room for confusion in trying to understand the "nations" of correspondence analysis and related methods. There, the common language is the algebraic eigenvalue problem that is an inevitable consequence of minimising quadratic forms, or ratios of quadratic forms, arising from least-squares criteria. This is certainly a unifying principle, but the apparent similarity it induces among different methods tends to obscure fundamental differences of importance for understanding data analysis, for example the non-orthogonality mentioned in the title. I shall try to isolate what I believe to be some of the basic issues - first for approximations to two-way arrays of quantitative data and then for categorical data. I shall focus on:

(1) The basic models: rank *r*
representations of **X**, of **X**'**X**,
of **XX**', of *r*-dimensional distances derived from the rows and/or columns of **X**, or of distances derived from **X**'**X**
(including or excluding the diagonal).

(2) The criteria used to fit (1): least-squares, ratios of quadratic
forms, minimal L_{1}-norm, robust methods, likelihood, …

(3) The algorithm used to fit (2): algebraic eigenvalue and SVD algorithms, alternating least-squares algorithms (ALS), majorisation, …

(4) The role of constraints: normalisations of eigenvectors, to work in deviations from the means or not, orthogonalisation in ALS algorithms (Gower, 1998).

(5) The geometric visualisation of the approximation: what are the appropriate interpretative tools: inner products, distances, centroids, scales, prediction regions?

(6) The measurement of fit and orthogonality of fitted components: some models/algorithms give non-orthogonal fitted and residual components, which complicate the interpretation of measures of goodness-of-fit (Gower and Hand, 1996). Fits obtained by optimising one criterion may be evaluated in terms of another criterion (Gabriel, 2002).

(7) The distinction between a data matrix and a two-way table: a table whose columns represent variables is statistically very different from a two-way table of counts, or of a third variable classified by two other variables.

Misunderstandings arise because (a) items (1), (2), (3) and (4) are often presented in a nearly inextricable manner, (b) there is a too-uncritical readiness to carry over algebra that is valid for a data matrix to the analysis of a two-way table (and vice versa), (c) the correspondence analysis of a two-way contingency table is linked in a fairly opaque way to the multiple correspondence analysis of a data matrix with two categorical variables, (d) fit statistics may be misinterpreted, (e) by concentrating on the fitted part of a model, sight may be lost of what is happening to the residual part and (f) the performance of the primary criterion may be evaluated in terms of a secondary criterion.

Gabriel, K.
R. (2002). Goodness of fit of biplots and correspondence
analysis. *Biometrika,* **89**,
423-436.

Gower, J. C. (1998). The role of
constraints in determining optimal scores. *Statistics in Medicine*, **17**,
2709-2721.

Gower, J. C. & Hand, D. J. (1996). *Biplots*.
London: Chapman and Hall.

Correspondence analysis with quantitative supplementary variables

Jan Graffelman

Universitat
Politècnica de Catalunya, Barcelona, Spain

jan.graffelman@upc.es

Correspondence analysis (CA) is a
well-known method for making pictures (biplots) of contingency tables and
tables of count data. On occasions it is of interest to display samples or
cases in a biplot made by correspondence analysis that were not included in the
original analysis. Such samples are known as *supplementary points*, and
their position in a biplot is usually calculated by using the “transition
formulae” or “barycentric relationships” of correspondence analysis.

On other occasions it may be of
interest to represent a quantitative variable, not used in the original
analysis, in a biplot obtained by correspondence analysis. The representation
of such *supplementary variables* can greatly enhance the interpretation
of the biplot. In ecological studies the procedure for representing such
variables is known as “indirect gradient analysis” (ter Braak, 1987). General
formulae for the calculation of coordinates of supplementary points and
supplementary variables in biplots are given by Gabriel (1995). Specific
results for correspondence analysis are discussed by Graffelman &
Aluja-Banet (2003).

In this talk we define a specific geometrical problem, and search for an optimal direction in a CA-biplot that best represents the quantitative supplementary variable. The optimal direction can be found by solving a weighted least squares problem, and plotting the regression coefficients in the biplot. If the supplementary variable is standardized (in the weighted sense), then its coordinates in the biplot are given by the weighted correlation coefficients of the variable with the standardized biplot axes. Both row and column markers in the CA biplot are interpretable with respect to the supplementary variable vector: the projections of the standard coordinates approximate the supplementary data, and the projections of the principal coordinates approximate weighted averages with respect to the supplementary variable. Geometrical properties of the solution, goodness of fit issues and the relationship with canonical correspondence analysis (ter Braak, 1986) will be pointed out in the talk. Empirical data will be used to illustrate the results.
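The statement that a standardized supplementary variable is placed at its weighted correlations with the standardized axes can be checked on a small example. The following is only a sketch under stated assumptions: the row masses `w`, the two axes `axis1` and `axis2` and the supplementary values are hypothetical numbers, chosen so that the axes have weighted mean 0, weighted variance 1 and zero weighted cross-product, as standard coordinates do. It is not the authors' implementation.

```python
import math

def wmean(w, x):
    """Weighted mean of x with weights w (assumed to sum to 1)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def wstandardize(w, x):
    """Centre and scale x to weighted mean 0 and weighted variance 1."""
    m = wmean(w, x)
    centred = [xi - m for xi in x]
    sd = math.sqrt(wmean(w, [c * c for c in centred]))
    return [c / sd for c in centred]

def wcorr(w, x, z):
    """Weighted correlation coefficient between x and z."""
    xs, zs = wstandardize(w, x), wstandardize(w, z)
    return wmean(w, [a * b for a, b in zip(xs, zs)])

# Hypothetical row masses and two biplot axes in standard coordinates.
w = [0.1, 0.4, 0.4, 0.1]
axis1 = [1.0, 1.0, -1.0, -1.0]
axis2 = [2.0, -0.5, -0.5, 2.0]

# Hypothetical supplementary variable, standardized in the weighted sense.
y = wstandardize(w, [3.0, 1.0, 4.0, 2.0])

# Weighted least squares without intercept: solve the 2x2 normal equations
# (X' D_w X) b = X' D_w y by Cramer's rule.
s11 = wmean(w, [a * a for a in axis1])
s22 = wmean(w, [b * b for b in axis2])
s12 = wmean(w, [a * b for a, b in zip(axis1, axis2)])
s1y = wmean(w, [a * v for a, v in zip(axis1, y)])
s2y = wmean(w, [b * v for b, v in zip(axis2, y)])
det = s11 * s22 - s12 * s12
b1 = (s1y * s22 - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det

# The regression coefficients coincide with the weighted correlations.
r1 = wcorr(w, axis1, y)
r2 = wcorr(w, axis2, y)
```

Because the axes are orthonormal in the weighted metric, the normal equations reduce to the identity and each coefficient is simply the weighted cross-product of the variable with that axis.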

Gabriel, K. R. (1995). Biplot display of multivariate categorical data, with comments on multiple correspondence analysis. In *Recent Advances in Descriptive Multivariate Analysis* (ed. W. J. Krzanowski).

Graffelman, J. & Aluja-Banet, T. (2003). Optimal representation of supplementary variables in biplots from principal component analysis and correspondence analysis. *Biometrical Journal*, **45** (in press).

ter Braak, C. J. F. (1986). Canonical correspondence analysis: a new eigenvector technique for multivariate direct gradient analysis. *Ecology*, **67**, 1167-1179.

ter Braak, C. J. F. (1987). Ordination. In *Data
analysis in community and landscape ecology.* (eds Jongman, R. H. G., ter
Braak, C. J. F. & van Tongeren, O. F. R.), 91-173. Wageningen: Pudoc.

Biplots of compositional data using weighted logratio maps

P O S T E R

Michael Greenacre & John Aitchison

Universitat Pompeu Fabra, Barcelona & University of Glasgow,
Scotland

michael@upf.es & John.Aitchison@btinternet.com

Compositional data are a special case of categorical data. They are
vectors of data which sum up to a constant, usually proportions or percentages.
Common examples are: results of elections, time budgets and gene frequencies
in population genetics. These data have what is known as the "unit-sum
constraint", i.e. they add up to a constant, for example in the above
three examples: 100%, 24 hours, and 1 respectively (Aitchison, 1986). It seems
on the surface that such data can be analyzed quite easily using conventional
multivariate techniques such as principal component analysis and correspondence
analysis (see, for example, Greenacre & Blasius, 1994), but it turns out
that these methods do not respect the unit-sum constraint in their solutions.
Also, they do not have what is called "subcompositional coherence", a
property deemed essential for any methodology applied to compositional data,
meaning that the analysis of a subset of the components should not give
different results compared to when the subset is analyzed as part of the whole.
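A small numerical illustration may help to make subcompositional coherence concrete. The sketch below uses a hypothetical four-part composition and a helper `close` (a name introduced here for illustration) that rescales parts to sum to 1. It shows that raw proportions change when a subset of parts is re-closed, whereas the log-ratio between two retained parts does not.

```python
import math

def close(parts):
    """Rescale positive parts so that they sum to 1 (unit-sum constraint)."""
    total = sum(parts)
    return [p / total for p in parts]

# Hypothetical four-part composition, e.g. vote shares.
full = close([0.40, 0.30, 0.20, 0.10])

# Subcomposition: keep the first three parts and re-close them.
sub = close(full[:3])

# The raw share of the first part changes after re-closing ...
share_changed = abs(full[0] - sub[0]) > 1e-9

# ... but the log-ratio of the first two parts is unaffected.
logratio_full = math.log(full[0] / full[1])
logratio_sub = math.log(sub[0] / sub[1])
```

Any analysis built only on such log-ratios therefore gives the same answer for a subset of parts whether it is analysed alone or as part of the whole.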

In
this poster we describe the logratio approach to visualizing compositional data
using biplots, an approach which is tailored to compositional data but which
also works just as well for general tabular data on a ratio scale, for example
contingency tables (Aitchison & Greenacre, 2002). This method does not,
however, follow the principle of “distributional equivalence”, deemed by Benzécri
(1973) to be the most important property for analysing categorical data. This
principle states that merging two categories which have the same conditional
distribution (or *profile*) should not affect the analysis in any way. By
introducing a simple modification of the logratio approach which is inspired by
the row and column weighting in correspondence analysis, a method can be
defined which we call the "weighted logratio map". This modified logratio
approach now turns out to have both properties of subcompositional coherence
and distributional equivalence. In this sense this method improves the existing
logratio approach, and not only forms an interesting competitor to
correspondence analysis but appears to have better properties. But, like all methods involving logratios,
it suffers from the inconvenience of involving a logarithmic transformation
which causes problems when data values are zero, which is often the case in the
social and environmental sciences.
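The effect of the weighting on distributional equivalence can be sketched numerically. The code below assumes that the weighted logratio approach double-centres the log-transformed table using the row and column masses as weights, and that squared distances between columns are weighted by the row masses. This form, the function names and the small tables are assumptions made for illustration and are not taken from the poster.

```python
import math

def weighted_logratio(table):
    """Double-centre log(table) using row and column masses as weights."""
    total = sum(sum(row) for row in table)
    n_cols = len(table[0])
    r = [sum(row) / total for row in table]                       # row masses
    c = [sum(row[j] for row in table) / total for j in range(n_cols)]
    logs = [[math.log(x) for x in row] for row in table]          # fails on zeros
    row_mean = [sum(cj * v for cj, v in zip(c, row)) for row in logs]
    col_mean = [sum(ri * row[j] for ri, row in zip(r, logs))
                for j in range(n_cols)]
    grand = sum(ri * m for ri, m in zip(r, row_mean))
    z = [[logs[i][j] - row_mean[i] - col_mean[j] + grand
          for j in range(n_cols)] for i in range(len(table))]
    return z, r, c

def column_distance_sq(table, j, k):
    """Squared distance between columns j and k, weighted by row masses."""
    z, r, _ = weighted_logratio(table)
    return sum(ri * (zi[j] - zi[k]) ** 2 for ri, zi in zip(r, z))

# Hypothetical table whose first two rows have the same profile.
table = [[10.0, 20.0, 30.0],
         [20.0, 40.0, 60.0],
         [30.0, 10.0, 5.0]]

# The same table with the two equivalent rows merged (summed).
merged = [[30.0, 60.0, 90.0],
          [30.0, 10.0, 5.0]]

d_full = column_distance_sq(table, 0, 1)
d_merged = column_distance_sq(merged, 0, 1)
```

Under these assumptions the two distances agree, which is what distributional equivalence requires. The call to `math.log` raises an error for a zero entry, which reflects the inconvenience with zero data values noted above.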

Interestingly, this weighted logratio map is theoretically identical to what is known in a completely different context as "spectral mapping", developed in biochemical research by Lewi (1976) - see Lewi's invited paper in this conference. The method is illustrated using compositional data from an archeological study of Roman glass cups.

Aitchison, J. (1986). *The Statistical Analysis of Compositional Data*. London: Chapman and Hall.

Aitchison, J. & Greenacre, M. J. (2002). Biplots of compositional data. *Applied Statistics*, **51**, 375-392.

Benzécri, J.-P. (1973). *Analyse des Données. Tome II: Analyse des Correspondances.* Paris: Dunod.

Greenacre, M. J. & Blasius, J. (1994).
Correspondence analysis and its interpretation. In *Correspondence Analysis
and the Social Sciences* (eds M. J. Greenacre & J. Blasius), 3-22. London: Academic Press.

Lewi, P. J. (1976). Spectral mapping, a technique for classifying biological activity profiles of chemical compounds. *Arzneim. Forsch. (Drug Research)*, **26**, 1295-1300.

Decomposing interactions by generalized bi-additive models for categorical data

Patrick J. F. Groenen & Alex Koning

Erasmus
University Rotterdam, The Netherlands

groenen@few.eur.nl & koning@few.eur.nl

In the analysis of categorical data, generalized linear models are often used as a generalization of analysis of variance techniques (see, for example, McCullagh and Nelder, 1989). Among these techniques are, for example, loglinear analysis and categorical logistic regression. In many cases, interaction terms are modelled and the most interesting ones are the bivariate interactions. However, as the number of categorical variables increases, the total number of bivariate interactions also increases dramatically. Our aim here is to provide a simple graphical representation to facilitate the interpretation of all two-way interactions simultaneously. To reach this goal, we impose rank restrictions on the two-way interactions, thus leading to a bi-additive model. We propose identification constraints on the bi-additive part that allow the main effects to be separated from the interaction effects.

The main reason for proposing the current model is that the bivariate interaction effects in ordinary GLM are hard to interpret, especially if the number of variables or the number of categories per variable is large. The advantage of the bi-additive model is that the interactions can be easily represented in a graphical representation that is similar to the one in multiple correspondence analysis: each category of every variable is represented by a vector. Then, the bivariate interaction effect is modelled by the scalar product of any two vectors representing the categories of two different variables, that is, the projection of one vector onto the other.
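How a scalar product of category vectors can carry the interaction is illustrated by the following sketch. The variables, the two-dimensional category vectors, the main effects and the constant are all hypothetical values, and the function `linear_predictor` is only one plausible reading of a bi-additive linear predictor. It is not the authors' estimation procedure.

```python
import math

def dot(u, v):
    """Scalar product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical two-dimensional vectors, one per category of each variable.
vectors = {
    ("sex", "female"): (0.8, 0.1),
    ("age", "young"): (0.5, -0.4),
    ("region", "north"): (-0.2, 0.6),
}

# Hypothetical main effects and constant of the linear predictor.
main = {
    ("sex", "female"): 0.2,
    ("age", "young"): -0.1,
    ("region", "north"): 0.4,
}
constant = 0.3

def linear_predictor(categories):
    """Constant + main effects + scalar products for all pairs of variables."""
    eta = constant + sum(main[c] for c in categories)
    for i in range(len(categories)):
        for j in range(i + 1, len(categories)):
            eta += dot(vectors[categories[i]], vectors[categories[j]])
    return eta

u = vectors[("sex", "female")]
v = vectors[("age", "young")]
interaction = dot(u, v)

# Projection reading: the length of the projection of u onto v,
# multiplied by the length of v, equals the scalar product.
norm_v = math.sqrt(dot(v, v))
projection_length = dot(u, v) / norm_v

eta = linear_predictor(list(vectors))
```

In a plot, two category vectors pointing in similar directions thus indicate a positive interaction, and vectors at right angles indicate none.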

We show that the current model is an extension of the generalized bi-additive model for two categorical variables as discussed by van Eeuwijk (1995), De Falguerolles and Francis (1992) and Gabriel (1996), who also provided algorithms. Our extension may be viewed as a generalization of multiple correspondence analysis to GLM.

We shall illustrate our model using an empirical data set.

Eeuwijk, F. A. van (1995). Multiplicative interaction in generalized linear models. *Biometrics*, **51**, 1017–1032.

Falguerolles, A. de & Francis, B. (1992). Algorithmic approaches for fitting bilinear models. In *Compstat 1992* (eds Y. Dodge & J. Whittaker), 77-82. Heidelberg: Physica-Verlag.

Gabriel, K. R. (1996). Generalised bilinear regression. *Biometrika*, **85**, 689–700.

McCullagh, P. & Nelder, J. A. (1989). *Generalized Linear Models*. London: Chapman and Hall.

Estimating population genetics ‘F-statistics’ using correspondence analysis with respect to instrumental variables

Bruno Guinand^{1}, Bertrand Parisseaux^{1},
Jean-Dominique Lebreton^{2} & François Bonhomme^{1}

^{1}Université de Montpellier II, France & ^{2}CNRS,
Montpellier, France

Foundations of population genetics are built on a strong and very formal statistical background. Wright’s fixation indices, also known as “F-statistics” (Wright, 1951) are part of this foundation. “F-statistics” aim at analysing how genetic variance is partitioned within and among populations analysed for a set of genetic markers by considering data at several hierarchical levels: individuals, subpopulations, and total sample (Weir & Hill, 2002). Such a hierarchical partitioning of genetic variation should be implemented in a multivariate framework, providing alternative estimations of “F-statistics”.

However, multivariate methods have scarcely attracted population geneticists and, basically, only principal component analysis has been used in the analysis of human data sets (Cavalli-Sforza et al., 1994). Moreover, most analyses have focused on differentiation between subpopulations and have not considered “F-statistics” as a whole. Therefore, population geneticists have repeatedly argued that multivariate methods do not provide a clear alternative to the use of “F-statistics”, being merely exploratory methods that lack links with the fundamentals of their discipline.

Here, we simply show that the relationship of “F-statistics” with chi-square statistics induces a relationship with scalar products and norms in a Euclidean space that, to our knowledge, has never been exploited to develop links between “F-statistics” and bilinear multivariate methods. “F-statistics” can be estimated and decomposed in subspaces, using correspondence analysis with respect to instrumental variables (or canonical correspondence analysis; Lebreton et al., 1991) in an appropriate way.

We illustrate this approach using previously published data on hybridizing mouse subspecies (*Mus musculus*) (Orth et al., 1998). We also briefly discuss the main benefits of using and developing a coherent multivariate approach in light of our increasing knowledge of the natural genetic variation of numerous organisms.

Cavalli-Sforza,
L. L., Menozzi, P. & Piazza, A. (1994). *The History
and Geography of Human Genes*. Princeton: Princeton
University Press.

Lebreton, J.-D., Sabatier, R., Banco, G. & Bacou, A.
M. (1991). Principal component and correspondence analyses with
respect to instrumental variables: an overview of their role in studies of
structure-activity and species-environment relationships. In *Applied
Multivariate Analysis in SAR and Environmental Studies* (eds Devillers, J.
& Karcher, J.), 85-114. Dordrecht: Kluwer.

Orth, A., Adama, T., Din, W. & Bonhomme, F. (1998). Hybridation naturelle entre deux sous espèces de souris domestique *Mus musculus domesticus* et *Mus musculus castaneus* près de Lake Casitas (Californie). *Genome*, **41**, 104-110.

Weir, B. S. & Hill, W. G. (2002). Estimating F-statistics. *Annual Review of Genetics*, **36**, 721-750.

Wright, S. (1951). The genetical structure of populations. *Annals of Eugenics*, **15**, 323-354.

Interset distances in the barycentric representation of profiles

Willem J. Heiser

Leiden University, The Netherlands

heiser@rulfsw.fsw.leidenuniv.nl

There has been some debate about the correct interpretation of distances between row elements and column elements in a joint display of a correspondence table. The conventional view is that we can scale this joint display in such a way that either the distances between rows can be interpreted, or the distances between columns, but never directly the distances between rows and columns (Heiser & Meulman, 1983; Greenacre & Hastie, 1987). Carroll et al. (1986) proposed an alternative scaling of the coordinates for which they claimed that both between-set and within-set squared distances could be interpreted, but Greenacre (1989) has shown that this claim is not warranted.

Before any dimension reduction, the representation of the data in correspondence analysis is a barycentric configuration of profile points with respect to the unit profiles, which are hypothetical profiles for which all mass is concentrated in one cell. It is shown that a between-set distance interpretation is possible in any barycentric configuration or plot, in comparison with the distance to some specific supplementary points. The distance involved is not of the chi-squared type, but simply Euclidean. The result is equally valid in the full-dimensional space as in a reduced space obtained by projection, or by any other method producing a suitable configuration of the unit profiles.

Carroll, J. D., Green, P. E., &
Schaffer, C. M. (1986). Interpoint distance comparisons in correspondence analysis.
*Journal of Marketing Research*, **23**, 271-280.

Greenacre, M. J. (1989). The
Carroll-Green-Schaffer scaling in correspondence analysis: A theoretical and
empirical appraisal. *Journal of Marketing Research*, **26**,
358-365.

Greenacre, M. J. & Hastie, T. (1987). The geometric interpretation of correspondence analysis. *Journal of the American Statistical Association*, **82**, 437-447.

Heiser, W. J. & Meulman, J. (1983). Analyzing rectangular tables with joint and constrained multidimensional scaling. *Journal of Econometrics*, **22**, 139-167.

"Le patronat norvégien": capital structures and political position-taking in the Norwegian field of power

Johs. Hjellbrekke & Olav Korsnes

University
of Bergen, Norway.

johs.hjellbrekke@sos.uib.no & olav.korsnes@sos.uib.no

When it comes to sociological applications
of correspondence analysis, Bourdieu's (1989) and Bourdieu & de
Saint-Martin's (1978) analyses of the French field of power count among the
classic works. Drawing inspiration from these two studies, and following the
approach first outlined in Le Roux & Rouanet (1998), and Chiche et al.
(2000), this presentation will describe the relations between field positions
and agents' political position taking (i.e. their political orientations) in
the Norwegian field of power.

Based on data from a survey by the Norwegian Power and Democracy Project, distributed to 1711 people in top positions within Norwegian society - positions within politics, academia, larger private and public companies, the central administration, the church, the judicial system, the military, cultural institutions and also larger occupational and managerial organisations - two main questions will be addressed:

(1) What are the dominant oppositions within the Norwegian field of power in the year 2000? What areas of this field are the most open with respect to social mobility, and where is the intergenerational reproduction at its strongest? These structures will be uncovered in an analysis where 17 variables are defined as active capital indicators.

(2) What are the relations between the capital structures in the Norwegian field of power and the structures in the habituses of the agents that are located in these positions? To what degree are structural oppositions between field positions also present in the agents' political position taking? These relations will be analysed using data on the agents' position taking towards 20 statements, mainly on the relations between the state and the market, but also on more general political issues.

Defining the variables on political
position taking as the active set, the capital indicators and the field
position variable will be defined as supplementary variables. Finally, both in
order to get a better view of the distributions of the individuals within the
field, and also of the degree of intrapositional opposition and interpositional
separation, ellipses will be drawn around the positions' mean points (see Chiche et al., op.cit.).

Bourdieu, Pierre (1989). *La Noblesse d'État.* Paris: Éditions de Minuit.

Bourdieu,
Pierre & de Saint-Martin, Monique (1978). Le patronat. In *Actes de la recherche en sciences sociales*,
#20/21, mars-avril 1978.

Chiche, J., Le Roux, B., Perrineau, P. & Rouanet, H. (2000). L'espace politique des électeurs français à la fin des années 1990. In *Revue française de science politique*, **50**, juin 2000, 463-488.

Le Roux, Brigitte & Rouanet, Henry (1998). Interpreting axes in MCA: method of the contributions of points and deviations. In *Visualization of Categorical Data* (eds J. Blasius & M. J. Greenacre), pp. 197-220. London: Academic Press.

Using nonlinear principal component analysis to assess the relationship between industrial sectors, import countries and export barriers: the case of Norway

Tor Korneliussen

Bodø Graduate School of Business, Norway

tor.korneliussen@hibo.no

In this paper we investigate external export barriers, how they are related to each other and how they are related to industrial sectors and to import countries. In 1995, 459 chief executive officers from Norwegian companies were asked to describe their perceptions of ten export barriers they meet in their most important import country. We used Likert-type items with five categories each, ranging from “very low importance” to “very high importance”.

Applying nonlinear principal component analysis (NLPCA), we chose a two-dimensional space in which the first dimension is described as “level of export barriers”, ranging from “very low export barriers” to “very high export barriers”. The second dimension mirrors the “composition of export barriers” and differentiates between those barriers which are very important for the fishery sector, for example “veterinarian certification”, and those barriers which are important to all sectors. The application of NLPCA allows the production of a two-dimensional map. This map was used to produce the profile of quantifications for the ten indicators, which turned out to be an effective visual framework for investigating scale properties. The visualizations of the quantifications showed that the assumption that measures of export barriers are continuous does not hold, at least in this instance. In many cases the items had ordinal properties, but several items had only dichotomous properties. We show that the largest differences in the quantifications are in most cases between the first and the second category; these reflect the greatest differences in the (latent) perception of export barriers. If we wish to create a dichotomous scale, we should cut the items at this point and not in the middle of the manifest scale. To visualize the structure of responses in the two-dimensional space, we use biplots of the indicators.
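The rule of cutting an item at the largest gap between adjacent category quantifications, rather than at the middle of the manifest scale, can be sketched as follows. The quantifications are hypothetical values for a single five-category item, and `best_cut` is a name introduced here for illustration. The paper's own quantifications are not reproduced.

```python
def best_cut(quantifications):
    """Return k such that categories 1..k form the lower group.

    The cut is placed at the largest absolute difference between the
    quantifications of adjacent categories.
    """
    gaps = [b - a for a, b in zip(quantifications, quantifications[1:])]
    k = max(range(len(gaps)), key=lambda i: abs(gaps[i]))
    return k + 1

# Hypothetical quantifications for categories 1 ("very low importance")
# to 5 ("very high importance") of one Likert-type item.
quant = [-1.9, 0.1, 0.3, 0.5, 0.7]

cut = best_cut(quant)

def dichotomise(category):
    """Recode a category (1..5) as 0 (lower group) or 1 (upper group)."""
    return 0 if category <= cut else 1

recoded = [dichotomise(c) for c in range(1, 6)]
```

With these illustrative values the first category is separated from the other four, whereas a cut at the middle of the manifest scale would have split categories that are almost indistinguishable on the latent dimension.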

The reported findings show that industrial sectors have different levels of export barriers and that the patterns are heterogeneous across import countries. A free trade agreement with the European Union does not result in any lower level of perceived export barriers than in countries without a free trade agreement. Furthermore, firms within a highly competitive industry meet higher levels of perceived export barriers than firms within other industrial sectors.

Cultural and social backgrounds and students' choice of craft within the field of Vocational Education and Training in Denmark

Peter Koudahl

Roskilde
University, Denmark

koudahl@ruc.dk

Educational reforms in youth education in Denmark are carried out based on the presumption that young people are individualised, released from tradition and culture and constantly preoccupied with creating their own identity (Giddens 1991; Ziehe 1982). Therefore the political agenda is to reform the educational system according to this “presumed youth”. This also goes for the field of Vocational Education and Training (VET). Traditionally the VET area, due to its strong focus on practical work in companies and firms, has absorbed a large part of the youth who are primarily seeking a practical education with little or no resemblance to traditional school-based teaching. The latest reform of the VET system, called Reform 2000, is characterized by a complete individualisation of the education and its adaptation to each individual student. The students are supposed – in cooperation with a teacher – to create their own individual education, choosing between different modules in order to accommodate their own “learning style”. Furthermore, the students are supposed to take examinations in traditional school-based subjects, preparing for further education.

The purpose of my research project is to establish an empirically based knowledge of the existing social and cultural differences between students in VET in Denmark, their choice of craft and their appreciation of the results of Reform 2000. One of the main theses guiding the project is that young people in general are much more differentiated than the “presumed youth” suggests.

Data were collected using questionnaires distributed among approximately 1200 students from the following crafts: carpenter, metalworker, plumber, graphic designer and computer mechanic. The questionnaires are composed of about 200 questions prepared for electronic processing. The aim of the questionnaire is to make it possible to reconstruct the social, cultural and economic background of the students, in order to construct their economic, cultural and social capital (Bourdieu, 1984; Bennett et al., 1999) and its interrelation with their choice of craft.

Using multiple correspondence analysis, our strategy is to construct indicators that are able to show differences in the composition of capital among the students and possible correspondences with choice of craft and appreciation of the results of Reform 2000. I shall present preliminary results of the project and discuss the problems of using questionnaires in the construction of the students' composition of capital. Furthermore, I shall present my considerations on the applicability of correspondence analysis in this particular project.

Bennett, Tony, Emmison, Michael & Frow, John (1999). *Accounting
for Tastes: Australian Everyday Cultures*. Cambridge University Press.

Bourdieu, Pierre (1984). *Distinction: A Social Critique of the
Judgement of Taste*. Routledge & Kegan Paul.

Giddens, Anthony
(1991). *Modernity and Self-Identity: Self and Society in the Late Modern Age*. Cambridge: Polity.

Ziehe, Thomas & Stubenrauch, Herbert (1982). *Plädoyer für ungewöhnliches Lernen: Ideen zur Jugendsituation*. Reinbek: Rowohlt.

Additive and multiplicative models for three-way tables

Pieter M. Kroonenberg

Leiden
University, The Netherlands

kroonenb@fsw.leidenuniv.nl

In a little-referenced paper, Darroch (1974) discussed the relative merits of additive and multiplicative modelling for contingency tables. In particular, he compared the following aspects: partition properties, closeness to independence, conditional independence as a special case, distributional equivalence, subtable invariance, and constraints on the marginal probabilities. On the basis of this investigation, he believed that multiplicative modelling is preferable to additive modelling, "but not by so wide a margin as the difference in the attention these two definitions have received in the literature" (p. 213).

It is surprising that one important aspect of modelling contingency tables did not figure in this comparison, i.e. interpretability. The major aim in most empirical sciences is to apply models to data and to get a deeper insight into the subject matter by interpreting the outcome of the statistical models. In this presentation Darroch's investigations are extended by investigating the interpretational possibilities and impossibilities of multiplicative and additive modelling of contingency tables. The investigation will primarily take place at the empirical level, and is limited to three-way contingency tables with medium to large numbers of categories. The focus lies with the interpretation of the dependence present in the table and how one can gain insight into complex patterns of different types of dependence.

In particular, empirical comparisons are made between Goodman's RCM association models (Anderson, 1996; Wong, 2001), and three-mode correspondence analysis for moderate to large three-way contingency tables (Carlier & Kroonenberg, 1996). By limiting ourselves to three-way tables, some of the generality of Darroch's argument is lacking, but some higher-order tables can be fruitfully reduced to three-way tables by multiplicative (or interactive, as it is sometimes called) coding of the categories. Such coding will inevitably eliminate a certain number of interactions and the relative merits of doing so will also be a subject of discussion.

Several data sets which have been previously analysed in the literature will be scrutinised with both multiplicative and additive models, and the models will be evaluated on the extent to which they succeed in bringing the patterns contained in the dependence to the fore. An attempt will be made to formulate specific recommendations about when each technique is most likely to be informative, but such recommendations will only be very preliminary.

Anderson, C. J. (1996). The analysis of three-way contingency tables by three-mode association models. *Psychometrika*, **61**, 465-483.

Carlier, A. & Kroonenberg, P. M. (1996). Decompositions and biplots in three-way correspondence analysis. *Psychometrika*, **61**, 355-373.

Darroch, J. N. (1974). Multiplicative and additive interaction in contingency tables. *Biometrika*, **61**, 207-214.

Wong, R. S.-K. (2001). Multidimensional association models. *Sociological Methods & Research*, **30**, 197-240.

Robustness in nonmetric multidimensional scaling

Damian Läge

University of Zurich, Switzerland

dlaege@allgpsy.unizh.ch

Similarity-based data in the social sciences (measuring relations between a large number of subjects or representing the knowledge about an object field in the form of a cognitive structure) are often obscured by a mixture of scatter and outliers – especially when working with questionnaire data, when dealing with measurements on relatively small samples or when modelling at the level of single individuals.

When such data are processed with metric or nonmetric multidimensional scaling methods, a large portion of scatter and outliers can severely affect the resulting geometric structure. The reason is that classical (N)MDS algorithms are only partly suitable for such records because of their squared error model: to minimize the stress value, large errors in the fit of single distances must be avoided because, once squared, they dominate the stress. Outliers (which, by definition, must produce such “errors” in a geometric solution) therefore cause an unwarranted shift of the respective points, which spreads the error over as many distances as possible in order to minimize the sum of squares. Through this shifting of points, outliers can distort the “true” solution to a significant degree.
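
The effect of squaring can be sketched numerically. The residuals below are hypothetical values chosen for illustration only; the snippet is not the stress function of any particular (N)MDS program, nor of the RobuScal algorithm. It simply compares how much of the total loss one outlying distance accounts for under a squared and under an absolute error model:

```python
import numpy as np

# Hypothetical residuals (fitted distance minus target) for ten point pairs;
# the last pair plays the role of an outlier.
residuals = np.array([0.1, -0.2, 0.15, -0.1, 0.05, 0.2, -0.15, 0.1, -0.05, 3.0])

squared = residuals ** 2
absolute = np.abs(residuals)

# Share of the total loss that is due to the single outlying pair.
share_squared = squared[-1] / squared.sum()
share_absolute = absolute[-1] / absolute.sum()

print(f"outlier share, squared loss:  {share_squared:.3f}")
print(f"outlier share, absolute loss: {share_absolute:.3f}")
```

Under the squared model almost all of the loss comes from the one outlying pair, so an optimizer has a strong incentive to move points in order to shrink that single residual.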

The
subsequent problems of data interpretation which arise from such non-robust
results can be illustrated by various examples from the field of social
sciences, from intuitive data as well as from prominent published results.

If the concern about outlier-affected data is justified, a method is needed that can separate the signal (i.e. the “true” structure) from the noise (scatter and outliers). As a suggested solution to this problem of robustness, we present the RobuScal NMDS algorithm, which is based on a robust starting configuration (a further development of the metric TUFSCAL algorithm by Spence & Lewandowsky, 1989) and a weighted error model as proposed by Heiser (1988) for the nonmetric part.

The robustness of the RobuScal algorithm can be demonstrated by a systematic test. This Monte Carlo study provides an adequate set of simulation data which can also be used for a more general evaluation of existing NMDS algorithms with regard to their robustness. We conclude with the general recommendation that scaling algorithms should pass such a test before they are used for processing empirical data.

Heiser, W. J. (1988). Multidimensional scaling with least absolute residuals. In *Classification and Related Methods* (ed. H. H. Bock), 145-155. Berlin: Springer.

Spence, I. & Lewandowsky, S. (1989). Robust multidimensional scaling. *Psychometrika*, **54**, 501-513.

Evolutionary analysis for correspondence data

N. Carlo Lauro & Simona Balbi

Università “Federico II”, Naples, Italy

clauro@unina.it & sb@unina.it


In two-way correspondence analysis (CA) applications, one of the dimensions concerned is often represented by occasions, that is, by time. Treating time as a categorical variable does not allow consideration of the asymmetrical role that it plays with respect to the other variables. On the usual plots of CA, we are only able to draw trajectories joining points that refer to the different occasions. While nonsymmetrical correspondence analysis (NSCA) (Lauro & D'Ambra, 1984; Lauro & Balbi, 1999) solves the problem of asymmetry, it does not offer a suitable visualisation and interpretation of data evolution.

In the literature the case of three-way tables has been considered by different authors (e.g. Foucart, 1979; Glaçon, 1981), who have developed proposals based on the analysis of a series of contingency tables obtained by juxtaposing or superposing the strata of the multiple table. Whereas the first author focuses his attention on the distances between marginal distributions and a reference distribution (given by averaging the independence hypothesis in different times, or by building profiles), the second one proposes a STATIS approach. The graphics proposed in both approaches rely on the classical joint plots, on which it is possible to represent trajectories obtained by joining points that represent the same (row or column) category at different times. In addition, STATIS allows us to represent each matrix synthetically by one point and visualises global trajectories on the interstructure factorial planes.

It should be noted that data evolution is not the actual aim of any of these methods; it is only considered descriptively in a final step of the analysis.

The aim of this paper is to introduce an approach based on time-series analysis in the framework of the analysis of correspondence data, in order to take the temporal dimension into account explicitly (Lauro, 1973). The core of the analysis consists in understanding the mechanism of transition from one state to the following one by estimating the matrix generating the evolutionary process of the data, according to a first-order autoregressive model, and its eigenstructure. Thus, similarly to NSCA, the proposal allows us to decompose the transition matrix in terms of latent components that now depend on time. One of the main outcomes is the possibility of an additional time-series-like graphical representation, visualising the evolution of the structure with respect to time, which appears explicitly on the graphs.
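
The estimation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: profiles are assumed to evolve as `x[t+1] = x[t] @ A`, the matrix `A` is estimated by ordinary least squares, and the data are simulated from a hypothetical matrix `A_true`.

```python
import numpy as np

# Hypothetical transition mechanism (rows sum to 1), used only to simulate data.
A_true = np.array([
    [0.80, 0.15, 0.05],
    [0.10, 0.70, 0.20],
    [0.05, 0.25, 0.70],
])

# Simulate T = 8 occasions of a profile over 3 categories: x[t+1] = x[t] @ A_true.
T = 8
X = np.empty((T, 3))
X[0] = [0.6, 0.3, 0.1]
for t in range(T - 1):
    X[t + 1] = X[t] @ A_true

# Least-squares estimate of the generating matrix from consecutive occasions.
A_hat, *_ = np.linalg.lstsq(X[:-1], X[1:], rcond=None)

# Eigenstructure of the estimated matrix, ordered by decreasing modulus.
eigvals, eigvecs = np.linalg.eig(A_hat)
order = np.argsort(-np.abs(eigvals))
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
```

With noise-free simulated data the estimate reproduces `A_true`; with real tables the number of occasions must be at least as large as the number of categories for the least-squares problem to be determined.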

Methodological implications and computational and interpretative problems are discussed. The method is illustrated using evolutionary data on the Italian economic structure, and the results are compared with those of the other methods mentioned above.

Foucart, T. (1979). *Structures de tableaux de probabilité. Description et prevision.* Thèse du III cycle, Université de Sciences et Techniques du Languedoc.

Glaçon, F. (1981). *Analyse conjointe de plusieurs matrices*. Thèse du III cycle, Université de Grenoble.

Lavit, Ch., Escoufier, Y., Sabatier, R. & Traissac, P. (1994). The ACT (STATIS method). *Computational Statistics & Data Analysis*, **18**, 97-119.

Lauro, N. C. (1973). Tendenze evolutive del sistema produttivo italiano
alla luce di un'analisi strutturale. *Quaderni del C.S.E.I.*, **12**.

Lauro, N. C. & D'Ambra, L. (1984). L'Analyse non symétrique des correspondances. In: *Data Analysis and Informatics III*. Amsterdam: North-Holland.

Lauro, N. C. & Balbi, S. (1999). The analysis of
structured qualitative data. *Applied Stochastic Models and Data Analysis*,
**15**, 1-27.

The space of central bankers in the world

Frédéric Lebaron

Université de Picardie Jules Verne, CSE et CEFRESS, Paris, France

lebaron@msh-paris.fr

The space of the educational, academic and professional trajectories of *central bank governors* testifies to the coexistence of, and rivalry between, different types of “symbolic capital” (forms of prestige and specific authority). There are “insiders”, who come from the central bank institution, on the one hand, and “outsiders”, whose legitimacy may be academic, economic or political, on the other. There are leaders who come from the financial world and the private sector on the one hand, and leaders who come from the political or university arenas on the other. Central banks are the places where these different forms of symbolic capital confront each other and combine (Lebaron, 2000). These “neutral” places are characterized by a particular distribution of resources, which defines their underlying *structure*.

Multiple correspondence analysis
(MCA) helps to reveal the *structure*
of this very particular social space inside the economic world, and to answer
specific comparative questions related to the social characteristics of
governors from different regions. In this sense, geometric data analysis (GDA)
appears to be a specific kind of “structural” analysis, in which “social
agents” (“*individuals*”) are central
to the analysis and the interpretation (see Rouanet et al., 2000).

This paper will present the utility
of MCA to treat these sociological problems in the particular case of a central
economic institution (devoted to monetary policy and banking supervision).
Possible relations between GDA and economic sociology will be discussed. The paper
will then focus on the construction of a relevant social space, using a
particular set of *active questions*, related
to educational, professional and academic trajectories. It will discuss the
relations between statistical and sociological interpretations of the relevant
dimensions of this space. It will then try to characterize regions of the world
according to the types of capital which are dominant inside the central banks,
using the technique of *supplementary
elements*. It will, in the end, assess the general relation between
positions in this social space and “opinions”, “position takings” and
“strategic choices” (see Bourdieu, 1984; Lebaron, 2001), in the same
methodological frame.
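
The technique of supplementary elements can be sketched in a few lines. The sketch below is an illustration only: the answers, the `region` variable and the helper names are hypothetical, and MCA is computed as a correspondence analysis of the indicator matrix. The axes are built from the active questions alone; supplementary categories are then placed at the mean point of the individuals who chose them, rescaled by the inverse singular value.

```python
import numpy as np

def indicator(codes):
    """Complete disjunctive coding of one categorical variable (integer codes)."""
    codes = np.asarray(codes)
    return (codes[:, None] == np.arange(codes.max() + 1)).astype(float)

# Hypothetical answers of 8 individuals to 3 active questions.
active = [
    [0, 0, 1, 1, 2, 2, 0, 1],
    [0, 1, 0, 1, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 2, 2],
]
# Hypothetical supplementary variable (e.g. a region); it does not shape the axes.
region = [0, 0, 1, 1, 0, 1, 0, 1]

Z = np.hstack([indicator(q) for q in active])
P = Z / Z.sum()
r = P.sum(axis=1)                                  # masses of individuals
c = P.sum(axis=0)                                  # masses of active categories

# CA of the indicator matrix: SVD of the standardized residuals.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sing, Vt = np.linalg.svd(S, full_matrices=False)
k = 2
F = (U[:, :k] * sing[:k]) / np.sqrt(r)[:, None]    # individuals, principal coordinates
G = (Vt[:k].T * sing[:k]) / np.sqrt(c)[:, None]    # active categories

def project_categories(Zsup):
    """Supplementary categories via the transition formula on the fixed axes."""
    profiles = Zsup / Zsup.sum(axis=0)             # each column sums to 1
    return profiles.T @ F / sing[:k]

G_region = project_categories(indicator(region))
```

As a consistency check, projecting the active categories through `project_categories(Z)` returns the same coordinates as `G`, which is why the same formula can be used to position categories that took no part in the construction of the space.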

Bourdieu, Pierre (1984). *Homo Academicus*. Paris: Minuit.

Lebaron, Frédéric (2000). The space of
economic neutrality. Trajectories and types of legitimacy of central bank
managers. *International Journal of
Contemporary Sociology*, **37**, 208-229.

Lebaron, Frédéric (2001). Economists and
the economic order. The field of economists and the field of power in France. *European Societies*, **3**, 91-110.

Rouanet, H., Ackermann, W. & Le
Roux, B. (2000). The geometric analysis of questionnaires: The lesson of
Bourdieu's *La Distinction*. *Bulletin de Méthodologie Sociologique,* **65**.

Validation procedures for principal axes methods

Ludovic Lebart

Centre National de la Recherche Scientifique, ENST, France

lebart@enst.fr

Bootstrap resampling techniques are frequently used to produce confidence areas on two-dimensional displays derived from principal axes techniques such as correspondence analysis (CA) and principal component analysis (PCA). In the case of PCA, numerous papers have addressed the selection of the relevant number of axes and have proposed confidence intervals for points in the subspace spanned by the principal axes. These parameters are computed after the realization of several replicated samples, and involve constraints that depend on these samples. Several procedures have been proposed to overcome these difficulties: partial replications using supplementary elements (partial bootstrap), use of a three-way analysis to process simultaneously the whole set of replications (Holmes, 1989), and filtering techniques involving reordering of axes and Procrustean rotations (Markus, 1994; Milan & Whittaker, 1995). Gifi (1981), Meulman (1982) and Greenacre (1984) have addressed the problem in the context of two-way and multiple correspondence analyses.

In the PCA case, variants of the bootstrap (partial and total bootstrap) are presented for active variables, supplementary variables, and supplementary nominal variables (Château & Lebart, 1996). In the case of numerous homogeneous variables, a *bootstrap on variables* is also proposed, with examples of application to *semiometric* data (Lebart et al., 2003).
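
A partial bootstrap for the variables of a normalized PCA can be sketched as follows. This is one possible reading of the idea, under stated assumptions: the data are simulated, the axes of the original analysis are kept fixed, and each replicated correlation matrix is projected onto them as a set of supplementary variables. The function and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 60 individuals, 4 correlated metric variables.
latent = rng.normal(size=(60, 2))
X = latent @ rng.normal(size=(2, 4)) + 0.5 * rng.normal(size=(60, 4))

# Normalized PCA of the original sample: eigen-decomposition of the correlation matrix.
C = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1][:2]
lam, Uax = eigvals[order], eigvecs[:, order]

def variable_coords(corr):
    """Project a correlation matrix onto the fixed axes of the original analysis."""
    return corr @ Uax / np.sqrt(lam)

original = variable_coords(C)          # equals Uax * sqrt(lam)

# Partial bootstrap: the axes stay fixed; only the replicated correlations change.
B = 200
replicates = np.empty((B, X.shape[1], 2))
for b in range(B):
    idx = rng.integers(0, X.shape[0], size=X.shape[0])
    replicates[b] = variable_coords(np.corrcoef(X[idx], rowvar=False))

spread = replicates.std(axis=0)        # bootstrap dispersion per variable and axis
```

The cloud of replicated points for each variable can then be summarised by a confidence ellipse or a convex hull on the original plane. A total bootstrap would instead redo the eigen-decomposition for each replicate, which raises the reordering and rotation issues mentioned above.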

In the context of CA (two-way or multi-way), the bootstrap allows one to draw confidence ellipses or convex hulls both for supplementary categories and for supplementary continuous variables. It appears easier to assess eigenvectors than eigenvalues (see Alvarez et al., 2002). In the domain of textual data, these techniques can efficiently tackle the difficult problem of the plurality of statistical units (words, lemmas, segments, sentences, respondents).

Alvarez, R., Bécue, M., Lanero, J. J. & Valencia, O. (2002). Results stability in textual analysis: its application to the study of the Spanish investiture speeches (1979-2000). In: *JADT-2002, 6th International Conference on Textual Data Analysis* (eds A. Morin & P. Sébillot), INRIA-IRISA, Rennes, 1-12.

Château, F. & Lebart, L. (1996). Assessing sample variability and stability in the visualization techniques related to principal component analysis; bootstrap and alternative simulation methods. *COMPSTAT 1996* (ed. A. Prat), 205-210. Heidelberg: Physica-Verlag.

Gifi, A. (1981, 1990). *Nonlinear Multivariate Analysis*. Chichester: Wiley.

Greenacre, M. (1984). *Theory and Applications of Correspondence Analysis*. London: Academic Press.

Holmes, S. (1989). Using the bootstrap and the RV coefficient in the multivariate context. In: *Data Analysis, Learning Symbolic and Numeric Knowledge* (ed. E. Diday), 119-132. New York: Nova Science.

Lebart, L., Piron, M. & Steiner, J.-F. (2003). *La Sémiométrie*. Paris: Dunod.

Markus, M. Th. (1994). Bootstrap confidence regions for homogeneity analysis: the influence of rotation on coverage percentages. *COMPSTAT 1994* (eds R. Dutter & W. Grossmann), 337-342. Heidelberg: Physica-Verlag.

Meulman, J. (1982)

Milan, L. & Whittaker, J. (1995).
Application of the parametric bootstrap to models that incorporate a singular
value decomposition. *Applied Statistics*, **44**, 31-49.

Detection of density modes based on distributional distance: an application to fine-grained clustering of bibliographical data

Alain Lelu & Claire François

INRA, Jouy-en-Josas, France and Université de Franche-Comté/LASELDI & INIST, Vandoeuvre-lès-Nancy, France

alain.lelu@jouy.inra.fr & francois@inist.fr

The design of our local components analysis method has been motivated by two main considerations:

(1) In order to keep the desirable property of distributional equivalence found in correspondence analysis, while also extracting *local* eigen-elements for each document cluster obtained from large collections, we use the distributional distance defined by Matusita (1955) and Escofier (1978). This distance is also used in spherical factor analysis by Domengès & Volle (1979). In doing so, documents are characterized by a centrality index in their own cluster, and clusters are characterized in a dual sense by deduced values for terms.

(2) Most clustering methods use iterative optimizing algorithms leading to a local maximum of a global quality index (see Diday et al. (1979) for centroid shift methods, and Buntine (2002) for more recent EM methods). This results in unstable representations that are difficult to use if one wants to evaluate the influence of deleting, inserting or merging terms or documents, or (a major challenge) to evaluate the evolution of a data flow. We have adapted a density-mode seeking algorithm (Trémolières, 1994) to an adaptive density definition, based on reciprocal *K*-nearest-neighbours. An application is presented, and stability is evaluated for a data set of 1397 bibliographical records described by 935 keywords.
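
The distributional equivalence property referred to in point (1) can be checked on a toy example. The sketch assumes that the distributional distance takes the usual Matusita form, the Euclidean distance between square-rooted profiles; the count table and function names are hypothetical.

```python
import numpy as np

def matusita(p, q):
    """Matusita (Hellinger-type) distance between two profiles that each sum to 1."""
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def profiles(counts):
    """Row profiles: each document's term counts divided by its total."""
    return counts / counts.sum(axis=1, keepdims=True)

# Hypothetical documents x terms counts. The third term is exactly twice the
# second one, so these two terms have identical profiles over the documents.
counts = np.array([
    [4.0, 1.0, 2.0, 3.0],
    [2.0, 3.0, 6.0, 1.0],
    [1.0, 2.0, 4.0, 5.0],
])

# Merge the two distributionally equivalent terms into a single term.
merged = np.column_stack([counts[:, 0], counts[:, 1] + counts[:, 2], counts[:, 3]])

P, Pm = profiles(counts), profiles(merged)
pairs = [(0, 1), (0, 2), (1, 2)]
d_before = np.array([matusita(P[i], P[j]) for i, j in pairs])
d_after = np.array([matusita(Pm[i], Pm[j]) for i, j in pairs])
```

The distances between documents are the same before and after the merge, which is the behaviour one wants when terms with proportional distributions are merged or split.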

Buntine, W. (2002). Variational extensions to EM and multinomial PCA. In *Proceedings of ECML 2002*.

Diday, E. et al. (1979). *Optimisation en Classification Automatique*. Rocquencourt: INRIA.

Domengès, D. & Volle, M. (1979). Analyse factorielle sphérique: une exploration. *Annales de l'INSEE*, **35**.

Escofier, B. (1978). Analyses factorielles et distances répondant au principe d'équivalence distributionnelle. *Revue de Statistique Appliquée*, **26**, 29-37.

Matusita, K. (1955). Decision rules, based on the distance for problems of fit, two examples, and estimation. *Annals of Statistical Mathematics*, 631-640.

Trémolières, R. C. (1994). Percolation
and multimodal data structuring. In *New Approaches in Classification and
Data Analysis*, (eds Diday E. et al.), pp. 263-268. Berlin: Springer-Verlag.

Graphical displays of Markov chains by means of nonsymmetrical correspondence analysis

Danilo Leone & Marilena Fucili

University of Naples, Italy

danilo.leone@dsaonline.it & fucili@unina.it

The Markov chain framework (Bertsekas, 1987) is widely used to model dynamic systems: examples come from engineering and social applications and from the class of problems solved by Markovian decision processes. Recently these tools have been used extensively in machine learning and reinforcement learning techniques. The proposed tool provides an original way to visualize the inner structure of Markov chains, the dynamics of the systems and the progress of the learning process. The advantages of this methodology are manifold: it uses nonsymmetrical correspondence analysis (NSCA) (Lauro & D’Ambra, 1984) to deal with the Markovian dependence assumption; it can handle the presence and combinations of several categorical variables (attributes) defining different states; and it allows a direct visualization of relations between starting states and destination states by using graphical displays in the style of principal component analysis.

We assume that the starting states
(and the destination states) of the system are defined either by *I* different levels of a unique
attribute, or by combinations of *n* attributes with *L _{a}* (

We also propose to represent the attribute coordinates as supplementary points on the principal axes. The graphical displays must be analyzed by applying Markov chain principles (e.g. recurrence) and the geometry of the factorial method. We also describe an application to real data to illustrate the use of our methodology and the advantages of the suggested graphical representation.
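
As a rough sketch of how NSCA can be applied to a table of observed transitions, assuming a hypothetical 3-state count table (the numbers and variable names are illustrative, and the scaling of the coordinates is one common choice rather than necessarily the one used by the authors):

```python
import numpy as np

# Hypothetical counts of observed transitions:
# rows = starting states, columns = destination states.
N = np.array([
    [30.0, 10.0, 5.0],
    [8.0, 25.0, 12.0],
    [4.0, 9.0, 27.0],
])

P = N / N.sum()
r = P.sum(axis=1)                     # masses of the starting states
c = P.sum(axis=0)                     # marginal distribution of the destinations
T = P / r[:, None]                    # estimated transition matrix, rows sum to 1

# NSCA: centre each transition row on the marginal destination distribution,
# weight the rows by their mass, then take the SVD.
Pi = T - c
U, sing, Vt = np.linalg.svd(np.sqrt(r)[:, None] * Pi, full_matrices=False)

row_coords = (U * sing) / np.sqrt(r)[:, None]   # starting states, principal coordinates
col_coords = Vt.T                                # destination states, standard coordinates

# The total inertia equals the numerator of the Goodman-Kruskal tau.
tau_num = np.sum(r[:, None] * Pi ** 2)
```

With this scaling the inner products `row_coords @ col_coords.T` reproduce the centred transition probabilities, so a starting state lying in the direction of a destination state has a higher-than-average probability of moving there.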

Bertsekas, D. P. (1987). *Dynamic Programming. Deterministic and Stochastic Models*. Prentice Hall.

Lauro, C. & D’Ambra, L. (1984). Non symmetrical correspondence analysis. In *Data Analysis and Informatics III* (eds E. Diday et al.), pp. 433-446. North Holland.

Specific multiple correspondence analysis

Brigitte Le Roux & Jean Chiche

Université René Descartes, Paris & Centre d’Etudes de la Vie Politique Française, Paris, France

Lerb@math-info.univ-paris5.fr & Chiche@msh-paris.fr

The method of specific multiple correspondence analysis (SMCA) is motivated by the need to analyze questionnaires with nonresponses, the overall aim being to free oneself from the constraints of complete disjunctive coding while preserving the structural properties of multiple correspondence analysis (MCA). In the paper, we will present the method within the general framework of geometric data analysis, then we deal with the special case of a questionnaire, following the line of Le Roux (1999).

We first recall the properties of principal directions and variables of a Euclidean cloud. Then we apply these properties to a protocol for which individual profiles are measures, and we derive the properties of biweighted principal component analysis. Then we recall the formulas of MCA viewed as a particular biweighted principal component analysis.

We then compare SMCA with standard MCA, writing inequalities between eigenvalues and studying the rotation of principal subspaces when one goes from the global analysis to the specific one.

Examples of SMCA are found in Bourdieu (1999) and Chiche et al. (2000).

Bourdieu, P. (1999). Une révolution conservatrice dans l’édition. *Actes de la Recherche en Sciences Sociales*, **127**.

Le Roux, B. (1999). Analyse spécifique d’un nuage euclidien. *Mathématiques, Informatique et Sciences Humaines*, **37**, 65-83.

Chiche, J., Le Roux, B., Perrineau, P.
& Rouanet, H. (2000). L’espace politique des électeurs français à la fin
des années 1990. *Revue Française de
Sciences Politiques*, **50**, 463-487.

Spectral mapping. Was it worth the effort?

Paul J. Lewi

Center for Molecular Design, Janssen Pharmaceutica, Vosselaar, Belgium

plewi@prdbe.jnj.com

I started my career 40 years ago in the
research laboratory of Dr. Paul Janssen in Beerse, Belgium, a laboratory which
has produced 75 original medicines during the course of 45 years of
pharmacological research. Pharmacology is both multidimensional and dual. The
multidimensionality follows from the fact that chemical compounds, medicinal
drugs in particular, may exhibit a vast spectrum of activities when tested on
proteins (and other biopolymers), cells, isolated organs, micro-organisms,
plants, animals and man. (Activity in a test is specified on a ratio scale as
the estimated dose or concentration of a drug that produces a certain effect in
50% of replicated cases.) The duality derives from the following consideration.
Any two drugs can be contrasted in terms of the log ratio of their activities
obtained in a given test. Vice-versa, the contrast between any two tests is
defined by the log ratio of the activities produced in them by a given drug.

The Janssen laboratory frequently
produced exhaustive results for a relatively large number of drugs and tests,
with resulting data tables that were intrinsically multivariate and dual, as
just described. The problem, then, was to classify the drugs and tests in order to reveal the biological structure
underlying the data. Initially, the spectra were drawn on cardboard cards and
displayed on a table in the laboratory. But different researchers arranged the
cards in distinct ways, some being more biased by the average activity (or
size) in the spectrum of a drug or test, others paying more attention to
certain features of the profile (or shape) of the corresponding activity
spectra. In 1975, in order to resolve the controversies, Dr. Janssen asked me
whether it would be possible to find or design an ‘objective’ method to solve
this dually multivariate problem. By coincidence, some French statisticians had
just handed me a rather cryptic computer code for correspondence (CA) and
principal components (PCA) analysis. After some experimenting and
re-engineering, I realized that CA revealed contrasts between drugs and,
reciprocally, displayed contrasts between tests. It also showed dual
specificities between drugs and tests, independently of the potencies of the
compounds and of the sensitivities of the tests. It did not provide, however,
an interpretation of contrasts in terms of log ratios. The basic design of CA
involves double-closure of a contingency table, singular value decomposition
(SVD) and biplot, all weighted by marginal sums of rows and columns. It seemed
natural to me to replace the double-closure operation in CA by double-centring
of the log transformed table of activity spectra, all other things remaining
equal. The weighting by marginal sums makes the resulting spectral map less
influenced by drugs with low potency and less biased by tests with weak
sensitivity. This ‘spectral mapping’ approach is formally identical to the
‘weighted logratio’ method described in the poster at this conference by
Greenacre and Aitchison.
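
The construction described above can be sketched in a few lines. This is an illustration under stated assumptions, not the original program: the activity table is hypothetical, all values are assumed strictly positive, weights are taken proportional to the marginal sums, and the biplot scaling (rows in principal, columns in standard coordinates) is one of several possible choices.

```python
import numpy as np

def spectral_map(X, k=2):
    """Weighted log-ratio analysis of a strictly positive table X."""
    X = np.asarray(X, dtype=float)
    r = X.sum(axis=1) / X.sum()          # row weights from marginal sums
    c = X.sum(axis=0) / X.sum()          # column weights from marginal sums
    Y = np.log(X)
    # Weighted double-centring of the log-transformed table.
    Y = Y - (Y @ c)[:, None]
    Z = Y - (r @ Y)[None, :]
    # Weighted singular value decomposition.
    S = np.sqrt(r)[:, None] * Z * np.sqrt(c)[None, :]
    U, sing, Vt = np.linalg.svd(S, full_matrices=False)
    rows = (U[:, :k] * sing[:k]) / np.sqrt(r)[:, None]
    cols = Vt[:k].T / np.sqrt(c)[:, None]
    return Z, r, c, rows, cols, sing

# Hypothetical activities of 3 drugs (rows) in 4 tests (columns).
activities = np.array([
    [1.0, 4.0, 16.0, 2.0],
    [2.0, 2.0, 8.0, 8.0],
    [8.0, 1.0, 4.0, 4.0],
])
Z, r, c, rows, cols, sing = spectral_map(activities, k=3)

# A table in which drugs differ only in potency and tests only in sensitivity
# (a rank-one product) carries no contrasts after double-centring.
Z_flat = spectral_map(np.outer([1.0, 2.0, 4.0], [3.0, 5.0, 7.0, 2.0]))[0]
```

The last two lines illustrate the point made in the text: differences in overall potency or sensitivity are removed by the double-centring, so only the contrasts (log ratios) between drugs and between tests are displayed.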

Over the years, spectral mapping has been applied to a large variety of data in various fields of research, marketing and finance. Both inside and outside the laboratory the method has had its strong believers, and also its fierce opponents. At the outset, Dr. Janssen said he would only be convinced of its usefulness if a spectral map revealed something that could not have been readily observed beforehand from the data with the unaided eye. One or two cases were produced that made the point; far fewer, however, than was hoped for. Looking back now, after more than 25 years, the time has come to ask: was it really worth the effort?

Lewi, P. J. (1998). Analysis of contingency tables. In *Handbook of Chemometrics and Qualimetrics: Part B* (eds B. G. M. Vandeginste, D. L. Massart, L. M. C. Buydens, S. de Jong, P. J. Lewi & J. Smeyers-Verbeke), pp. 161-206. Amsterdam: Elsevier.

Three-way multidimensional scaling analysis of corporate failure

Cecilio Mar Molinero & Evridiki Neophytou

Universitat Politècnica de Catalunya, Barcelona, Spain & University of Southampton, United Kingdom

Cecilio.mar@upc.es & En498@soton.ac.uk

Multivariate statistical analysis has long been used to study corporate failure, using discriminant analysis or logistic regression. It has long been suggested that company size and area of activity are important factors when predicting failure. For this reason, the companies in the sample of failed and continuing firms are often matched by size and area of activity. However, proceeding in this way has serious disadvantages. First, using samples of equal size does not reflect real life, where continuing companies are much more common than failed companies; this can be addressed by means of Bayesian techniques, but it is rare to find a study that makes such a correction. Second, matching by size and area of activity makes it impossible to assess the importance of such factors. Third, it is unrealistic to assume that the conclusions of an analysis based on a sample that does not take into account time evolution will hold for the future. Fourth, the practitioner who will make use of the results is unlikely to understand the complexities of the analysis.

This paper reports on a large study that attempts to overcome all the above limitations. 370 failed companies are included in the sample, a far larger number of companies than has been previously reported in the US and UK literature. All the UK public quoted companies included in the active file of the FAME database satisfying certain criteria have been included: a total of 818 companies and over 6400 company accounts. The data covers the period 1993 to 2001. For each failed company, data from three to five reporting periods prior to failure were obtained. In the case of continuing companies, financial data covers up to eight reporting periods. As is usual in this type of study, the analysis is based on financial ratios obtained from the balance sheet and the profit and loss account. For each company 19 such ratios were calculated.

The analysis relies on a three-way scaling technique: individual differences scaling (INDSCAL). INDSCAL works from data on proximities. Given the size of the data set, the proximities were calculated between ratios, using companies as observations. A proximity matrix between ratio structures was calculated for each financial year. INDSCAL generates a “common map” that shows the average relationship between financial ratios during the period, and a set of weights that show how the financial ratio structure of companies evolves over time. It was found that the economic cycle influences the structure of company accounts.

Companies were projected on the common map. Location differences between failed and continuing companies in the common map were studied by a series of methods that include visual inspection, cluster analysis, and logit methods. Previously unobserved non-linearities, as suggested by theory, were discovered. It was found that failed companies tend to concentrate in certain areas of the maps, and that these areas are associated with low profitability, bad cash flow, and unsatisfactory debt structure. The impact of size and area of activity also became clear. These are well-known results, but the scaling approach has the advantage of visualising the results and, in this way, helps in the process of decision-making, as it makes it possible to combine the qualitative and the quantitative aspects of any decision involving an assessment of the future of a company.

The evaluation of “don't know” responses by partial homogeneity analysis

Herbert Matschinger & Matthias C. Angermeyer

University of Leipzig, Germany

math@medizin.uni-leipzig.de

Attitudes and other latent dimensions are measured quite frequently by means of Likert-type items where the respondent is asked to evaluate the item with respect to a closed set of mostly ordinal categories. It is implicitly assumed that the respondents are familiar with the problem addressed in the questionnaire. If this is not the case, quite frequently an extra category is employed in order to prevent an inflation of missing values and an uncontrollable bias of the sample. Unfortunately, these "don't know" categories do not fit into the ordinal scheme of the rest of the categories and therefore are very often treated “per fiat” as neutral categories or as missing values, neither of which is a satisfactory solution. In two surveys on attitudes towards the mentally ill, conducted in 1990 and in 2001 in both parts of Germany, ten 5-point items were employed, among other questions, to measure attitudes towards positive or negative effects of psychotropic drugs. Half of the items are worded in favour of psychotropic drugs, the other half deny the potential effects of these drugs. Listwise deletion of respondents with respect to "don't know" responses would lead to a reduction of the sample from 5613 to 2921 (52%), which makes the evaluation of these responses vitally important. The goal of this investigation therefore is

(1) to estimate the relationship between the latent dimension and the "don't know" category;

(2) to evaluate the meaning of the "don't know" categories in relation to the other (ordinal) categories for each item, conditional on the wording of the item;

(3) to control for the impact of the number of "don't know" responses for each respondent on the dimensional structure of the construct to be portrayed. This number might serve as an indicator of a more general willingness to respond to the questionnaire.

The structural relationship of the categories is evaluated by means of a partial homogeneity analysis (Bekker & De Leeuw, 1988; Heiser & Meulman, 1994). Here, each set of items contains not only one of the variables of interest but also a copy of the variable: number of "don't know" responses. Treating the items of the scale as multiple nominal and the sum of "don't know" responses as numerical, we obtain an extra dimension with a perfect fit (eigenvalue of 1) (Verdegaal, 1986). All the other axes then portray the dimension of interest.

It is shown that the "don't know" response may serve as an indicator of a more critical appraisal of the effect of psychotropic drugs, and that this effect is not the same for the two independent samples in 1990 and 2001.

Bekker, P., & De Leeuw, J. (1988). Relations between variants of
non-linear principal component analysis. In *Component and Correspondence
Analysis* (eds J. L. A. van Rijckevorsel & J. De Leeuw), 1-31. Chichester:
Wiley.

Heiser, W. J., & Meulman, J. J. (1994). Homogeneity analysis: exploring the distribution of variables and their nonlinear relationships. In *Correspondence Analysis in the Social Sciences: Recent Developments and Applications* (eds M. Greenacre & J. Blasius), 179-209. London: Academic Press.

Verdegaal, R. (1986). *OVERALS, Users Manual*. (UG-86-1 ed.).
Leiden: Department of Data Theory.

Multiple correspondence analysis for industrial specialised local areas in southern Italy

Fernanda Mazzotta^{1}, Gianluigi Coppola^{2} & Maria Rosaria Garofalo^{1}

^{1}University of Salerno, Italy & ^{2}Celpe, Italy

mazzotta@unisa.it, glc@xcom.it & garofalo@unisa.it

One of the most important social and economic problems in Italy is the underdevelopment and lower level of industrialisation of the southern part of the country. Since the first years after World War II, many Government interventions have taken place in order to encourage firms to locate in this large area. Particularly at the end of the 1950s, along with the phase of strong industrial expansion in northern Italy, the Government’s choice was to encourage large firms to locate in the area in order to employ as many workers as possible.

In the following decades, and particularly from the middle of the 1970s, the end of the Fordist model, and with it the constant decline of large industries, redrew the Italian economic geography. In those years clusters of small and medium-sized firms emerged and drove the economic growth of many Italian geographical areas that had previously played a secondary role, for example the regions of Veneto and Marche. But of the 199 industrial districts covered by ISTAT (National Institute of Statistics) in 1991, only 15 were located in southern Italy (ISTAT, 1996).

This research is a detailed study of Salerno’s productive reality, carried out partly through a direct survey of local firms. Using data from the Intermediate Census of Industry and Services (CIIS, 1996), the specialization indexes of the Local Labour Market System of the province of Salerno were calculated in order to identify areas of high manufacturing specialization. Subsequently, a cluster analysis was applied to test for the existence of micro-areas of specialization.

Industrial district areas are also characterized by historical, cultural, social and political factors, in addition to high manufacturing specialization indices or other quantitative indices. Therefore attention could not be focused exclusively on the productive structures, but had to take into account the institutions, the social network existing in the area, and the mechanisms of interaction between productive structures and the social framework (Brusco & Paba, 1997). Most of these factors are of a qualitative nature, and hence difficult to measure and to quantify. This difficulty is frequently addressed through case studies of specific areas, with interviews directed to entrepreneurs and privileged actors (Viesti, 2000).

In order to obtain these qualitative variables, an in-depth questionnaire was administered during 1998/1999 through an *ad hoc* survey (Permanent Observatory of the Enterprises of the Province of Salerno, OPIS) to a sample of 462 firms of all sizes located in the province of Salerno. The questionnaire was structured into nine sections (about 200 questions) covering all aspects of the firm. The objective of this work is to analyse the relation between membership of each cluster area, identified by the previous cluster analysis, and several qualitative variables. To this end we first divided the variables into five groups: variables indicating the initial situation (financial facilitation, who established the firm, the entrepreneur's previous activity); institutional characteristics (opinion about local institutions, professional level of the workers in the area, supplementary wages, workers' involvement in the firm); market (sales, purchasing, relations with other firms); vitality of the firm (innovation, hiring and dismissal, training); and information channels. We then ran five MCAs, one for each group of variables, inserting the variable “cluster” as a supplementary variable.

Constructing fields with multiple correspondence analysis: an applied researcher’s view

Alexander Mejstrik

Ludwig Boltzmann-Institut für Historische Sozialwissenschaft, Vienna, Austria

alexander.mejstrik@univie.ac.at

The importance of correspondence analysis (CA) for the work of Pierre Bourdieu has been noted repeatedly (e.g. Blasius, 2001; Rouanet & Le Roux, 1993). In fact, the possibility of an experimentally controlled construction of social spaces and relatively autonomous fields has had a profound impact even on the theoretical conceptions themselves. This can be seen, for example, in the shift from mostly typological models of fields and spaces illustrated with network sketches (e.g. Bourdieu, 1971) to much more systematic or structural research objects. These objects are based on the formal logic of (*n*-)dimensional mathematical spaces and constructed with the help of multiple correspondence analysis (MCA; e.g. Bourdieu, 1999). The field in particular manifests a dialectical relation between concepts and techniques and can therefore be regarded as a research programme rather than as a theory or a method. Many articles by different authors in *Actes de la recherche en sciences sociales* show this dialectization and pluralization of the field programme, which results from its experimental orientation.

In such a dialectization, method changes theory just as theory changes method. MCA thus becomes only one tool (albeit a crucial one) for the construction of a relatively autonomous field, and needs appropriate adaptation for this specific use. Neither an exploratory-typological use (e.g. Cibois, 1984) nor the testing of a given closed theory is appropriate for constructing historical (social/cultural) fields. This requires integrating exploratory formulation with the testing of systematic hypotheses concerning the field structure. Difficulties increase when dealing with historical data from fragmented and heterogeneous sources.

An application, the construction of the field of National Socialist youth education 1941-1944 (Mejstrik, 2000), will help to discuss the following issues:

· the fragmentary status of historical data and various ways of structural sampling,

· the use of homogeneity and heterogeneity of the original data for the experimental construction of historical fields,

· a dynamic determination of active and passive elements of the MCA in view of the exploratory as well as the explanatory use,

· a systematic interpretation of the results of an MCA as definition of a latent multidimensional structure and its principle of differentiation and hierarchy with the help of one-dimensional auxiliary graphical representations of categories and individuals, combined with numerical interpretation, and

· the use of the constructed structural principle to describe and explain face-to-face interactions and/or events as well as their dynamics underlying historical/social/cultural change.

Blasius, Jörg (2001). *Korrespondenzanalyse*. München & Wien.

Bourdieu, Pierre (1971). Le marché des biens symboliques. *Année Sociologique*, **22**, 49-126.

Bourdieu, Pierre (1999). Une révolution conservatrice dans l’édition. *Actes de la recherche en sciences sociales*, **126-127**, 3-26.

Cibois, Philippe (1984). *L'Analyse des Données en Sociologie*. Paris.

Mejstrik, Alexander (2000). Die Erfindung der deutschen Jugend. Erziehung in Wien 1938-1945. In *NS-Herrschaft in Österreich. Ein Handbuch* (eds Emmerich Tálos, Ernst Hanisch, Wolfgang Neugebauer & Reinhard Sieder), 494-522. Wien.

Rouanet, Henri & Le Roux, Brigitte (1993). *Analyse des Données Multidimensionnelles*. Paris.

Relations of inertias

George Menexes & Iannis Papadimitriou

University of Macedonia, Thessaloniki, Greece

mariston@hol.gr & iannis@uom.gr

Correspondence analysis (CA), when used to investigate the association between two categorical variables, can be applied to at least three data tables: a) the simple contingency table, b) the 0-1 indicator matrix and c) the generalized contingency table (Burt table). The “picture” of the phenomenon under investigation on the factorial planes is the same in the three cases (Greenacre 1984, Lebart et al. 2000, Andersen 1991). However, the total inertia and the inertia explained by each factorial axis differ, depending on which of the three tables is analysed. As a result, the percentage of total inertia explained by, for example, the first two factorial axes sometimes gives “poor” and at other times “good” fit indices for the same data and information.

In this study we first examine the mathematical relations that connect the total inertias of the three tables in the case of two variables. More specifically, if **F** is a *k*×*l* contingency table of two categorical variables *X* and *Y*, then it is known that the total inertia of **F** is given by *I*_{*k*×*l*} = χ^{2}/*n*, where χ^{2} is the chi-square statistic calculated on the table and *n* is the total sample size. If **Z** is the corresponding 0-1 indicator matrix, then it is also known that the total inertia *I*_{0-1} of **Z** is equal to [(*k*+*l*)/2] - 1. We prove that the total inertia of the corresponding Burt matrix is *I*_{B} = (*I*_{0-1} + *I*_{*k*×*l*})/2.

Next, the corresponding generalizations of these relations to the case of multiple variables are examined. We prove that the inertia of the Burt table in the multivariate case is equal to

*I*_{B} = (*I*_{0-1}/*m*) + (2/*m*)(Σ *I*_{all}/*m*),

where *I*_{0-1} is the inertia of the corresponding indicator matrix, *m* is the number of variables and Σ *I*_{all} is the sum of the inertias of all *m*(*m*-1)/2 pairwise contingency tables. The proof rests on the fact that the total inertia of the Burt table is equal to the chi-square statistic calculated on the Burt table divided by *m*^{2}*n*.
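The multivariate relation can be checked numerically. The sketch below, using exact rational arithmetic, builds the indicator matrix and the Burt table for a small data set and compares the inertia of the Burt table (chi-square divided by the grand total, which is *m*^{2}*n*) with the value given by the formula. The data set and all function names are hypothetical and serve only as an illustration; this is not the authors' proof.

```python
from fractions import Fraction
from itertools import combinations


def inertia(table):
    """Total inertia of a table of counts: chi-square statistic / grand total."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    chi2 = Fraction(0)
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = Fraction(row_sums[i] * col_sums[j], total)
            chi2 += (observed - expected) ** 2 / expected
    return chi2 / total


def levels(data, q):
    """Sorted categories observed for variable q."""
    return sorted({row[q] for row in data})


def indicator_matrix(data):
    """0-1 indicator matrix Z: one column per category of every variable."""
    m = len(data[0])
    return [[int(row[q] == lev) for q in range(m) for lev in levels(data, q)]
            for row in data]


def burt_table(z):
    """Burt table B = Z'Z."""
    cols = list(zip(*z))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]


def contingency_table(data, q1, q2):
    """Two-way table of counts for variables q1 and q2."""
    return [[sum(1 for row in data if row[q1] == a and row[q2] == b)
             for b in levels(data, q2)] for a in levels(data, q1)]


# Hypothetical responses of n = 8 individuals to m = 3 categorical variables.
data = [("a", "x", "p"), ("a", "y", "p"), ("b", "x", "q"), ("b", "y", "q"),
        ("a", "x", "q"), ("c", "y", "p"), ("c", "x", "p"), ("b", "y", "r")]
m = len(data[0])
z = indicator_matrix(data)
i_indicator = inertia(z)            # equals (number of categories / m) - 1
i_burt = inertia(burt_table(z))     # chi-square of the Burt table / (m^2 n)
i_pairs = sum(inertia(contingency_table(data, a, b))
              for a, b in combinations(range(m), 2))
formula = i_indicator / m + Fraction(2, m) * (i_pairs / m)
```

With *m* = 2 the same expression reduces to the two-variable case, since there is then a single pairwise table.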

These relations shed light on the quality of the information produced by CA: for example, the different physical meanings of inertia according to the data table analysed, the need for corrections or modifications of the basic results produced by CA, and the sense that an average bivariate effect size (inertia) is being analysed every time. The relations above imply the need for pairwise testing of the categorical variables before applying CA, in order to develop practical criteria for variable selection. If a decision has to be made about including some variables in the CA model, we can proceed by selecting those variables that maximise the inertia of the Burt table.

Andersen, E. (1991). *The Statistical Analysis of Categorical Data*. Springer-Verlag.

Gifi, A. (1996). *Nonlinear Multivariate Analysis*. John Wiley & Sons.

Greenacre, M. (1984). *Theory and Applications of Correspondence Analysis*. London: Academic Press.

Lebart, L., Morineau, A. & Piron, M. (2000). *Statistique Exploratoire Multidimensionnelle*. Paris: Dunod.

New proposals in the exploratory analysis of joint tables of categorical variables

Juan Ignacio Modroño Herrán, Karmele Fernández-Aguirre & M. Isabel Landaluce Calvo

Universidad del País Vasco, Bilbao, Spain & Universidad de Burgos, Spain

etpmohej@bs.ehu.es, etpfeagk@bs.ehu.es & iland@ubu.es

In survey analysis it is particularly common to have many more than two categorical variables, for which multiple correspondence analysis (MCA) is suitable for dimensionality reduction. It is also common for the categorical data to come from a survey carried out over somewhat different populations or moments in time, so that it makes sense to form coherent groups of variables whose influence as a group should also be considered. In such a case a multiple-table analysis such as multiple factor analysis (MFA, see Escofier & Pagès, 1992) can be used; others such as STATIS (see, for example, Lavit, 1994) are not considered because they are not applicable to categorical data. As a further problem, such data sets, when arranged in groups, typically do not have the same number of rows, which makes standard use of MFA impossible.

For such a situation, the authors (in Abascal et al., 2001) have recently proposed a two-step method, which consists in replacing the original variables by their coordinates on the main factors extracted from a previous MCA of each of the tables defined by the groups, and then performing MFA on these coordinates. The number of coordinates is the same across groups whenever the original variables are the same and the categorisation is also equal. This transformation allows an MFA to be carried out on these matrices of coordinates and, furthermore, as the coordinates are now continuous variables, also permits the use of STATIS.

This method is applied to responses to a survey carried out at the University of the Basque Country which, as part of a large project concerning the process of scientific knowledge research, development and transfer, measures the opinions of scientific staff from five broad areas of knowledge on characteristics of the university research culture. The results show some strong, positive features common to research across the university, but also reveal opposing opinions in particular areas.

Finally, a simulation exercise is carried out to check the stability of the results.

Abascal, E., Fernández, K., Landaluce, M. I. & Modroño, J. I. (2001). Diferentes aplicaciones de las técnicas factoriales de análisis de tablas múltiples en las investigaciones mediante encuestas. *Metodología de Encuestas*, **3**, 251-279.

Escofier, B. & Pagès, J. (1992). *Análisis Factoriales Simples y Múltiples*. Bilbao: Servicio Editorial de la Universidad del País Vasco.

Lavit, C. (1994). The ACT (STATIS method). *Computational Statistics and Data Analysis*, **18**, 97-119.

Relationship between a clinical classification of diabetes and a typification after a multiple correspondence analysis in a murine model

Nora Moscoloni, Silvana Montenegro, Stella Maris Martínez, Juan Carlos Picena, Hugo Navone & María Cristina Tarrés

Universidad Nacional de Rosario, Argentina.

piad@sede.unr.edu.ar

A major requirement for the investigation and management of diabetes is to derive an appropriate criterion to identify its different forms and stages. The Expert Committee on the Diagnosis and Classification of Diabetes Mellitus (American Diabetes Association, 2002) proposed a classification of diabetes and other categories of glucose regulation based on the values of the oral glucose tolerance test.

Our aim was to characterize individuals of the eSS line of genetically diabetic rats (Martínez et al., 1993) by means of multiple correspondence analysis, using the values obtained in the oral glucose tolerance test and the assessment of glucosuria, together with other physiological and environmental characteristics, for a total of 12 variables, either continuous or nominal. Beforehand, missing values of glucosuria were imputed with an artificial neural network classifier, chosen on two criteria: 1) total independence from the analysis used to determine the typology of individuals and 2) high flexibility of the technique, so as to obtain a predictive model with adequate capacity for generalization (Duda et al., 2001). To characterize individuals, multiple correspondence analysis was applied. Continuous glycemic variables were recoded into categories and treated as active, whilst the rest were illustrative (supplementary). In the simultaneous graphical representation of the data structure on the factorial coordinates, the levels of fasting glycemia and glucose intolerance appeared in order. The study was completed with a cluster analysis on the factorial coordinates of the individuals, yielding a typology of four classes. When these results were related to the clinical classification, it was possible to order eSS males from the youngest rats, with low body weight, no glucosuria and normal fasting glycemia but impaired glucose tolerance, to diabetic individuals that were older, heavier and glucosuric.
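The recoding of continuous variables into categories before MCA can be sketched as follows. The cut points, labels and glycemia values are hypothetical placeholders chosen for illustration; they are not the thresholds or data used in the study.

```python
from bisect import bisect_right


def recode(values, cut_points, labels):
    """Map each continuous value to the label of the interval it falls in.

    Intervals are closed on the left, so a value equal to a cut point is
    assigned to the higher category.
    """
    if len(labels) != len(cut_points) + 1:
        raise ValueError("need exactly one more label than cut points")
    return [labels[bisect_right(cut_points, v)] for v in values]


def indicator_rows(categories, labels):
    """0-1 coding of a categorical variable, one column per label."""
    return [[int(c == lab) for lab in labels] for c in categories]


# Hypothetical fasting glycemia values and hypothetical cut points.
fasting = [85, 110, 118, 126, 140]
labels = ["low", "medium", "high"]
classes = recode(fasting, cut_points=[110, 126], labels=labels)
rows = indicator_rows(classes, labels)
```

Each row of `rows` contains a single 1, which is the form in which a recoded variable enters the indicator matrix analysed by MCA.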

We conclude that the typology obtained agrees with the clinical criterion proposed by the American Diabetes Association, allowing the identification of stages in the progression of diabetic syndrome. This confirms the usefulness of multivariate classificatory algorithms in this biological context.

American Diabetes Association (2002). Report of the Expert Committee on the Diagnosis and Classification of Diabetes Mellitus. *Diabetes Care*, **25**, S5-S20.

Duda, R. O., Hart, P. E. & Stork, D. G. (2001). *Pattern Classification*. New York: John Wiley & Sons.

Martínez, S. M., Tarrés, M. C., Picena, J. C. et al. (1993). eSS rat, an animal model for the study of spontaneous non-insulin-dependent diabetes. In *Lessons from Animal Diabetes IV* (ed E. Shafrir), 75-90. London: Smith-Gordon.

Visualizing three-dimensional maps in correspondence analysis

Oleg Nenadić, Daniel Adler & Walter Zucchini

University of Göttingen, Germany

onenadi@uni-goettingen.de, dadler@gwdg.de & wzucchi@uni-goettingen.de

Maps in correspondence analysis are usually displayed in two dimensions. The lack of convenient software militates against the use of a full three-dimensional display in cases where a third dimension would substantially improve the quality of the display. We illustrate how the package RGL can be used to create three-dimensional displays that can be examined interactively.

Although modern computer hardware
provides adequate processing power for real-time visualization in three
dimensions, most statistical software packages do not support sophisticated
graphics in three dimensions. RGL (see Nenadić et al., 2003; Adler &
Nenadić, 2003, for a technical overview) is a library for the statistical
computing environment **R** (Ihaka & Gentleman, 1996) that offers real-time
three-dimensional visualization capabilities using OpenGL as the rendering
backend. It has been ported to the major platforms Win32 and X11, and is
released under the GPL (General Public License, “Copyleft”). The current release
can be downloaded from http://134.76.173.220/~dadler/rgl/index.html

RGL has been designed as a general
framework for three-dimensional visualization and as such does not offer
special purpose functions for particular statistical analyses. It provides basic
building blocks (such as points, lines, triangles, planes, surfaces and spheres
in three dimensional space) and a number of appearance features (such as
lighting properties, transparency effects and texture mapping). A convenient
navigational interface for exploring the three-dimensional space using a mouse
is supplied. The 21 functions offered by RGL are structured into six
categories, with the shape and appearance functions comprising the core. RGL
functions are semantically similar to the standard **R**-commands such
as "plot" and "persp" that are familiar to **R**
users. These functions can be used in a very flexible manner to create complex
three-dimensional graphics.

In most applications of correspondence analysis the first two dimensions explain a sufficiently high percentage of the total inertia, but in some cases the inclusion of the third dimension improves quality of the display substantially. In such cases it is usual to examine each two-dimensional projection of the three-dimensional map individually, i.e. 1&2, 1&3 and 2&3. In this presentation we will illustrate the visualization capabilities of RGL in the context of correspondence analysis using some examples of application. We show how RGL can be used for interactive exploration of the three-dimensional maps; e.g. to zoom into particular regions in order to examine details of interest. The familiar projections onto two-dimensional space can be viewed by simply moving the viewpoint using the mouse. We illustrate how appearance features offered by RGL (apart from colour) can be used to enhance correspondence analysis displays by incorporating attributes, such as mass and quality, in the display. This capability is especially useful for visualizing maps from stacked tables.

Adler, D. & Nenadić, O. (2003). A framework for an R to OpenGL interface for interactive 3D graphics, *Proceedings of the 3rd International Workshop on Statistical Computing*, Vienna (forthcoming).

Ihaka, R. & Gentleman, R. (1996). R: a language for data analysis and graphics, *Journal of Computational and Graphical Statistics*, **5**, 299-314.

Nenadić, O., Adler, D. & Zucchini, W. (2003). RGL: a R-library for 3D visualization with OpenGL, *Proceedings of the 35th Symposium of the Interface: Computing Science and Statistics*, Salt Lake City (forthcoming).

Multidimensional structure and information

Shizuhiko Nishisato

University of Toronto, Canada

snishisato@oise.utoronto.ca

Following
the tradition of multivariate analysis, the total information is typically
given by the sum of eigenvalues of the variance-covariance matrix. Thus, for a
data set with five standardized variables, the total information is five,
irrespective of the covariances among variables. No-one seems to question this
definition.

Nishisato
(2002a, 2002b), however, discarded this time-honoured definition. When five standardized
variables are perfectly correlated, the first eigenvalue is five, and the
remaining eigenvalues are all zero; when five variables are totally uncorrelated,
the five eigenvalues are all equal to 1. In both cases, the sum of eigenvalues
is five. The key objection to this traditional definition comes from the fact
(1) that if all five variables are perfectly correlated, only one variable is
needed to explain the data since the other four variables are totally
redundant, and (2) that if all the variables are uncorrelated one needs all of
them to explain the data. It is not difficult to visualize how the volume of
the clouds of data points may be influenced by the correlations between
variables. Therefore the conclusion is that the data set of perfectly
correlated variables contains much less information than that of totally
uncorrelated variables.
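The argument can be illustrated numerically. The sketch below builds 5×5 correlation matrices with a common correlation *r* and compares the trace (the sum of the eigenvalues) with the determinant (their product). The determinant is used here only as one familiar volume-type summary and is an assumption of this illustration; it is not the measure of information proposed in the abstract, and the function names are hypothetical.

```python
from fractions import Fraction


def equicorrelation(p, r):
    """p x p correlation matrix with every off-diagonal entry equal to r."""
    r = Fraction(r)
    return [[Fraction(1) if i == j else r for j in range(p)] for i in range(p)]


def trace(matrix):
    """Sum of the diagonal entries, which equals the sum of the eigenvalues."""
    return sum(matrix[i][i] for i in range(len(matrix)))


def determinant(matrix):
    """Exact determinant by Gaussian elimination (product of the eigenvalues)."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)          # singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det                  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


# (trace, determinant) for uncorrelated, moderately and perfectly correlated sets.
summary = {r: (trace(equicorrelation(5, r)), determinant(equicorrelation(5, r)))
           for r in (0, Fraction(1, 2), 1)}
```

The trace is 5 in all three cases, whereas the determinant falls from 1 (uncorrelated) to 0 (perfectly correlated), which matches the point that the sum of eigenvalues does not register redundancy among variables.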

This
view was tied to research on dual scaling of discretized continuous variables
(Nishisato, 2000, 2002a, 2002b; Eouanzoui, 2003) for a unified treatment of
multivariate data. Far-reaching implications for data analysis are noted here
since multivariate analysis typically employs eigenvalues as key statistics of
information, while the current study offers a different view that the sum of
eigenvalues is not an appropriate measure of total information.

The
study proposes new measures of information for individual variables in each
dimension and total space, and measures of their joint dimensional contribution
and contribution to the total space. Consequences of new measures for
multivariate analysis of both continuous and categorical data are discussed.

Eouanzoui, K. (2003). *On desensitizing data from interval to nominal measurement with minimum loss of information.* Doctoral thesis, University of Toronto.

Nishisato, S. (2000). Data analysis and information: beyond the current practice of data analysis. In *Classification and Information Processing at the Turn of the Millennium* (eds R. Decker & W. Gaul), 40-51. Heidelberg: Springer-Verlag.

Nishisato, S. (2002a). Differences in data structures between continuous and categorical variables from dual scaling perspectives, and a suggestion for a unified mode of analysis. *Japanese Journal of Sensory Evaluation*, **6**, 89-94 (in Japanese).

Nishisato, S. (2002b). Total information in multivariate data from a dual scaling perspective. Paper presented at the Conference in Honour of Prof. Ross E. Traub's Retirement, December, Toronto.

Nishisato, S. (2003). Geometric perspectives of dual scaling for assessment of information in data. In *Recent Developments in Psychometrics* (eds H. Yanai, A. Okada, K. Shigemasu, Y. Kano & J. Meulman), 453-462. Tokyo: Springer-Verlag.

Multiple factor analysis for contingency tables

Jérôme Pagès & Mónica Bécue-Bertaut

ENSA / INFSA Rennes, France & Universitat Politècnica de Catalunya, Barcelona, Spain

pages@agrorennes.educagri.fr & monica@eio.upc.es

We study, in the correspondence analysis (CA) framework, a set of contingency tables having the same rows. This kind of data is frequently found in surveys, when one qualitative variable is crossed with several others or when surveys from different countries are compared.

CA is usually applied to such multiple contingency tables using one of two methodologies: i) separate CA of each table; ii) CA of row-wise juxtaposed tables.

In the first case a separate CA of each table is performed and principal axes thus obtained are compared. This basic methodology presents two drawbacks: firstly, structures common to the different tables are only pointed out if they correspond to principal axes; secondly, when the row weights differ between the tables, comparisons are not easy. Furthermore, it is difficult to manage the comparison of several maps.

In the second case a CA is performed of all the tables juxtaposed row-wise (Benzécri, 1982; Cazes, 1980). In this analysis, the inertia of the global columns’ set (union of the columns’ sub-sets of each table) can be decomposed, according to Huygens principle, as the sum of the inertia within the columns of each table (within-tables inertia) and of the inertia between the columns of the different tables (between-tables inertia). This between inertia must not intervene in the study of the profiles: for example, in the case of tables coming from different surveys, the between inertia only expresses a difference between the quotas imposed on the samples. Benzécri (1983), Escofier & Drouet (1983) and Cazes & Moreau (2000) proposed the intra-tables correspondence analysis (ITCA) which eliminates the between-tables inertia.
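The Huygens decomposition described above can be verified on a small example. The sketch below treats the columns of the juxtaposed tables as a weighted cloud of profiles under the chi-square metric and computes the total, within-tables and between-tables inertia with exact fractions. The two tables and the function names are hypothetical; this illustrates the decomposition only, not the method proposed in the abstract.

```python
from fractions import Fraction


def chi2_dist(x, y, row_masses):
    """Squared chi-square distance between two column profiles."""
    return sum((a - b) ** 2 / r for a, b, r in zip(x, y, row_masses))


def inertia_decomposition(tables):
    """Return (total, within, between) inertia of the columns of juxtaposed
    contingency tables that share the same rows."""
    n_rows = len(tables[0])
    grand = sum(sum(sum(row) for row in t) for t in tables)
    # Row masses of the juxtaposed table = centroid of all column profiles.
    row_masses = [Fraction(sum(sum(t[i]) for t in tables), grand)
                  for i in range(n_rows)]
    total = within = between = Fraction(0)
    for t in tables:
        columns = list(zip(*t))
        table_sum = sum(sum(col) for col in columns)
        # Centroid of this table's column profiles = its own row margin.
        centroid = [Fraction(sum(t[i]), table_sum) for i in range(n_rows)]
        between += Fraction(table_sum, grand) * chi2_dist(
            centroid, row_masses, row_masses)
        for col in columns:
            mass = Fraction(sum(col), grand)
            profile = [Fraction(v, sum(col)) for v in col]
            total += mass * chi2_dist(profile, row_masses, row_masses)
            within += mass * chi2_dist(profile, centroid, row_masses)
    return total, within, between


# Two hypothetical tables with the same three rows but different row margins.
table_1 = [[20, 10], [10, 20], [10, 10]]
table_2 = [[5, 15, 10], [10, 5, 5], [25, 5, 10]]
total, within, between = inertia_decomposition([table_1, table_2])
```

The between-tables term is non-zero here only because the two tables have different row margins; when the row margins are proportional it vanishes, which is why it reflects differences such as sample quotas rather than differences between profiles.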

But two drawbacks remain in this second methodology: some tables can play a dominant role, which conflicts with the aim of a simultaneous analysis; it does not include any reference to the row structure induced by each table. Thus the methodology presented here takes into account the three main problems arising in the simultaneous analysis of several contingency tables having the same rows: the differences between the row margins, the need for balancing the influence of the different tables in a global analysis and the need for a tool to compare the row structures induced by the different tables.

The properties of this method are described and illustrated using contingency tables from an international survey (Lebart et al., 1998) about dishes liked and often eaten.

Benzécri, J. P. (1982). Sur la généralisation du tableau de Burt et son analyse par bandes. *Les Cahiers de l’Analyse des Données*, **7**, 33-43.

Benzécri, J. P. (1983). Analyse de l’inertie intraclasse par l’analyse d’un tableau de contingence. *Les Cahiers de l’Analyse des Données*, **8**, 351-358.

Cazes, P. (1980). Analyse de certains tableaux rectangulaires décomposés en blocs. *Les Cahiers de l’Analyse des Données*, **5**, 145-161; 387-403.

Cazes, P. & J. Moreau (2000). Analyse des correspondances d’un tableau de contingence dont les lignes et les colonnes sont munies d’une structure de graphes bistochastique. In *L’Analyse des Correspondances et les Techniques Connexes. Approches Nouvelles pour l’Analyse Statistique des Données* (eds J. Moreau, P.A. Doudin & P. Cazes), 87-103. Berlin-Heidelberg: Springer.

Escofier, B. & D. Drouet (1983). Analyse des différences entre plusieurs tableaux de fréquence. *Les Cahiers de l’Analyse des Données*, **8**, 491-499.

Lebart, L., Salem, A. & Berry, E. (1998). *Exploring Textual Data*. Dordrecht: Kluwer. 181-199.

Using correspondence analysis for exploring regional differences in the educational system. Decentralization, marketization and the social structure of the field of secondary education in four Swedish regions

Mikael Palme, Donald Broady, Mikael Börjesson, Monica Langerth Zetterman, Ida Lidegran, Sverker Lundin & Ingrid Nordqvist

Stockholm
Institute of Education; Dept. of Teacher Education, Uppsala University; Dept.
of Education, Uppsala University; Chalmers University of Technology,
Gothenburg; University College of Gävle, Sweden

mikael.palme@lhs.se, broady@nada.kth.se,
mikael.borjesson@ilu.uu.se, ida.lidegran@ilu.uu.se, monica.langerth@ped.uu.se,
sverker@cs.chalmers.se, int@hig.se

Analyzing the effects of the 1991 reform of the Swedish secondary school on the social structure of secondary education in various regional settings, we discuss the use of the Bourdieuan notion of “field” and the employment of correspondence analysis for understanding regional differences in the education system.

In the 1991 reform, all Swedish secondary school study programs were made homogenous in terms of length (3 years) and status as regards formal qualification for the entry into post-secondary education. Affirming the formally equal status of all secondary education study programs in all schools throughout the country, the reform abandoned the previous sharp division between “theoretical” programs and vocational training programs. However, the shift towards a unified secondary education system was accompanied by a parallel shift from a bureaucratic, rule-based management of the education system to a goal and result-oriented type of management (“decentralization”). Great freedom was given to secondary schools to create their own local “profiled” versions of the 16 national study programs, creating a previously unknown heterogeneity of secondary education programs. Also, in 1992, Sweden opted for a voucher system, giving families the right to freely “invest” the public funding for the schooling of their children into any private (“independent”) school, regardless of district or commune boundaries. As a result, the 1990’s witnessed, especially in the large cities, a rapid expansion of independent schools and a sharp increase of secondary school study programs with a local “profile”. In the Swedish capital, Stockholm, the local right-wing government abolished, in 1999, the principle that secondary school students had the right to study in the public school neighbouring their home residence, granting pupils and families the right to compete for entry into any secondary school in the city. As a consequence, the tendency towards homogenization inherent in the reform of secondary education in 1991 was counter-balanced by the creation of an educational market in which both schools and students and their families have to compete.

Using individual data on all pupils in secondary education between 1997 and 2001 (information on schools, educational programs, parents' occupations, income, education level, national origin, housing, place of residence, etc.), the effects of the 1991 reform on the social structure are analysed, comparing the “fields” of secondary education in Stockholm, Gothenburg, Uppsala and the provincial town of Gävle. It is shown that while the social structure of each geographical area is reflected in the social structure of secondary education, the analysis also has to take into account the effects of the specific local, politically determined management models regulating secondary education. These models, in turn, depend on the social structure of the area concerned, its specific history and its impact on political traditions. Much of the information relevant to these aspects of the analysis of the various educational fields cannot be found in the statistical data on secondary school pupils used in the correspondence analysis as such.

Canonical correspondence analysis, a standard in ecology

Sandrine Pavoine, Anne B. Dufour & Daniel Chessel

Université Claude Bernard Lyon 1, France

pavoine@biomserv.univ-lyon1.fr, dufour@biomserv.univ-lyon1.fr & chessel@biomserv.univ-lyon1.fr

Canonical correspondence analysis (CCA) was introduced by ter Braak (1986). The method is widely used in ecology (Birks et al., 1996). It was developed to study the relationship between species composition and environment within sites, where a site is a basic sampling unit separated in space or time from other sites. CCA is an extension of correspondence analysis (CA) in which CA is viewed as a means of finding site coefficients that maximise the variance of the species' average positions (Hill, 1977). CCA looks for coefficients of environmental variables that yield a site score maximising the variance of the average positions of species. This viewpoint corresponds to CA under a linear constraint, where the site scores should represent a synthetic variable (ter Braak, 1987).

In this presentation, we recall that CCA is an example of a duality diagram. We emphasize that this ordination analysis is a special case of principal component analysis with respect to instrumental variables (PCAIV). This PCAIV is computed after a CA of the array that contains species composition and a principal component analysis (PCA) of the array that contains the environmental variables, which may be quantitative, qualitative or both (Kiers, 1994). Viewing CCA as a particular PCAIV implies another kind of data interpretation. Indeed, CCA corresponds to the following process: it looks for species scores that yield a site score maximising the variance explained by the multiple regression on the environmental variables, rather than the total variance of the site positions. This explained variance is the product of the total variance and the coefficient of determination of the multiple regression. If many environmental variables are involved, the explained variance becomes equal to the total variance and CCA is then equivalent to CA.

Birks, H. J. B., Peglar, S. M. & Austin, H. A. (1996). An annotated bibliography of canonical correspondence analysis and related constrained ordination methods 1986-1993. *Abstracta Botanica*, **20**, 17-36.

Hill, M. O. (1977). Use of simple discriminant functions to classify quantitative phytosociological data. In *Proceedings of the First International Symposium on Data Analysis and Informatics* (ed E. Diday), 181-199. Rocquencourt: IRIA.

Kiers, H. A. L. (1994). Simple structure in component analysis techniques for mixtures of qualitative and quantitative variables. *Psychometrika*, **56**, 197-212.

ter Braak, C. J. F. (1986). Canonical correspondence analysis: a new eigenvector technique for multivariate direct gradient analysis. *Ecology*, **67**, 1167-1179.

ter Braak, C. J. F. (1987). CANOCO - a FORTRAN program for Canonical community ordination by [partial][detrended][canonical] correspondence analysis and redundancy analysis, version 2.1, TNO Institute of Applied Computer Science, Wageningen.

Value orientations of adolescents: applications of cluster and correspondence analyses

Andreas Pöge & Jost Reinecke

University of Trier, Germany

Poge@uni-muenster.de & Reinecke@uni-trier.de

Social inequality is one of the prominent research areas in the social sciences. Classical theories (Marx, Weber) emphasize vertical differences in society arising from socio-economic differences. With the change from an industrial to a more business- and service-oriented society, a vertical scale of social inequality is no longer sufficient. In addition, horizontal differences are currently studied through concepts such as values, expressive lifestyles and leisure behaviour. If variables measuring vertical differences are analyzed together with those concepts, people can be classified into distinct milieus (Bourdieu, 1979).

Our research strategy is based on current social inequality research and focuses on the relation between the value orientations of adolescents and their deviant behavior. Earlier studies have examined this relation, but their theoretical background is often unclear and their empirical analyses are based on simple bivariate analyses (for a discussion, see Hermann, 2003). Expressive lifestyles, leisure and other peer group behaviour are incorporated in our empirical analysis, but are not part of our presentation.

Our empirical data are part of an ongoing criminological and sociological panel study of adolescents' deviant and criminal behavior. Data are collected with a self-administered questionnaire in schools in two German towns (Münster and Duisburg). The sample consists of adolescents from the 7th and 9th grades. In a first step, value orientations are classified by cluster analysis and related to behaviour. In a second step, the value orientations are analyzed by correspondence analysis. Here, correspondence analysis serves as a confirmatory method (for applications see Reinecke & Tarnai, 2000) to validate the results of the first step. Cluster information will be included in the correspondence analyses as supplementary variables. Comparisons between the cohorts will also be addressed.

Bourdieu, P. (1979). *La Distinction. Critique Sociale du Jugement*. Paris: Les Editions de Minuit.

Hermann, D. (2003). *Werte und Kriminalität*. Opladen: Westdeutscher Verlag.

Reinecke, J. & Tarnai, C. (eds) (2000). *Angewandte Klassifikationsanalyse in den Sozialwissenschaften*. Münster: Waxmann.

Prototype analysis based on similarities

El Mostafa Qannari, Hicham Noçairi & Evelyne Vigneau

ENITIAA-INRA, Nantes, France

qannari@enitiaa-nantes.fr

We show how the formalization of some data analysis problems in terms of a similarity measure between individuals leads to various statistical methods, some of which are already known and some new. We call this general approach *prototype analysis*.

Consider a data set and a similarity measure between the individuals. The similarity measure may be computed from the data set itself or may be related to external data. With each individual, we associate a prototype defined as a weighted average of all the individuals, using the entries of the similarity matrix as weights. The method of analysis then consists in seeking axes for the representation of the individuals such that individuals are as close to their prototypes as possible. This leads to analyses akin to canonical correlation analysis or PLS2 performed on the original data matrix and the matrix of prototypes.

In the particular case where the similarity measure between individuals is related to the partition of the individuals into various known groups (that is, two individuals have a similarity equal to 1 if they belong to the same group and 0, otherwise), prototype analysis leads to Fisher’s canonical discrimination analysis, or PLS-discriminant analysis.
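The construction of prototypes can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `prototypes`, the toy data and the choice to rescale each row of the similarity matrix so that its weights sum to one are assumptions made here. With the 0/1 group similarity described above, each prototype reduces to the mean of the individual's own group, which is the link to discriminant analysis.

```python
import numpy as np

def prototypes(X, S):
    """Prototype of individual i = weighted average of all rows of X,
    with weights taken from row i of the similarity matrix S.
    Rows of S are rescaled to sum to one (an assumption of this sketch)."""
    W = S / S.sum(axis=1, keepdims=True)
    return W @ X

# Toy data: five individuals, two variables, two known groups.
groups = np.array([0, 0, 1, 1, 1])
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [10.0, 0.0],
              [12.0, 2.0],
              [14.0, 4.0]])

# Similarity equal to 1 within a group and 0 otherwise.
S = (groups[:, None] == groups[None, :]).astype(float)

P = prototypes(X, S)
print(P)  # each row is the mean of the corresponding individual's group
```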

In the case where individual profiles are derived from a contingency table and the similarity measure is the scalar product associated with the chi-square distance, we retrieve correspondence analysis. Within this context, other choices of the similarity measure are also discussed.

We also discuss how prototype analysis can be used in order to predict the components of a mixture from physical/instrumental data.

Correspondence analysis and homogeneity of style in *Tirant lo Blanc*

Alex Riba & Josep Ginebra

Universitat Politècnica de Catalunya, Barcelona, Spain

Alex.riba@upc.es & josep.ginebra@upc.com

*Tirant lo Blanc* is the main work in Catalan literature and it has been considered
to be the first modern novel in Europe. Its main body was written between 1460
and 1465, but it was not printed until 1490. There is an intense and long
lasting debate around its authorship arising from its first edition, where its
introduction states that the whole book is the work of Martorell (1413?-1468),
while at the end it is stated that the last quarter of the book is by Galba
(?-1490). For an overview of this debate, see Riquer (1990) and for tentative
attempts at this problem using statistical tools, see Ginebra and Cabos (1998)
and Riba & Ginebra (2000).

For the current study we exclude words in italics and chapters of fewer than 200 words, leaving 425 chapters with a total of 398,242 words and 13,828 different words. Following the lead of the extensive stylometry literature, we use *word length* and the use of *function words* and *vowels* to try to detect heterogeneities in the style of the book that might indicate the existence of two authors. In particular, we classify words according to their number of letters, with a category for all the words of more than nine letters, and build the corresponding 425 x 10 contingency table with ordered rows. We also count the number of appearances of each of the 25 most frequent context-free words in each chapter, forming a 425 x 25 contingency table. Finally, we consider the 425 x 5 table of counts of each vowel in each chapter.

Neither of the two candidate authors left any text comparable to the one under study, and therefore one cannot use discriminant analysis to classify chapters by author. Instead, we explore the three sequences of 425 multinomial observations in the three tables, and the 40 marginal binomial sequences, and observe a clear change in their distributions. Assuming that change to be a sudden one, we find that for most sequences, the maximum likelihood estimate of the change-point is between chapters 371 and 382. Through correspondence analysis, we identify the features that distinguish the chapters before and after that boundary.
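To make the change-point step concrete, the following sketch estimates a single sudden change in a sequence of multinomial rows by maximum likelihood. It is an illustration under stated assumptions rather than the authors' procedure: rows before the change share one set of category probabilities, rows after it share another, and the function names and the small table are hypothetical.

```python
import numpy as np

def multinomial_loglik(counts):
    """Maximised multinomial log-likelihood (constant term omitted) when all
    rows of `counts` are pooled under one set of category probabilities."""
    totals = counts.sum(axis=0)
    n = totals.sum()
    nonzero = totals > 0
    return float(np.sum(totals[nonzero] * np.log(totals[nonzero] / n)))

def change_point(table):
    """Return k such that rows [0, k) and rows [k, n) modelled as two
    separate multinomial segments give the highest likelihood."""
    n_rows = table.shape[0]
    best_k, best_ll = None, -np.inf
    for k in range(1, n_rows):  # both segments must be non-empty
        ll = multinomial_loglik(table[:k]) + multinomial_loglik(table[k:])
        if ll > best_ll:
            best_k, best_ll = k, ll
    return best_k

# Hypothetical table: six "chapters" with one profile, then four with another.
table = np.array([[80, 20]] * 6 + [[20, 80]] * 4)
k_hat = change_point(table)
print(k_hat)  # number of rows before the estimated change
```

Applied column by column (each category against the rest), the same function gives the marginal binomial change-points.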

In spite of the fact that the change is rather sharp, correspondence analysis seems to indicate that a few of the chapters appearing after the change-point might be more like the ones before that boundary. That is why we proceed to cluster the rows of the three contingency tables (Greenacre, 1988), using non-hierarchical algorithms based on the fit of generalized linear models for polytomous data (McCullagh & Nelder, 1983). We also discuss simple ways to combine the cluster analysis on each of the three tables to identify the chapters that are most likely being misclassified by the estimated change-point.

Ginebra, J. & Cabos, S. (1998). Anàlisi estadística de l’estil literari; Aproximació a l’autoria del *Tirant lo Blanc*. *Afers*, **29**, 185-206.

Greenacre, M. J. (1988). Clustering the rows and columns of a contingency table. *Journal of Classification*, **5**, 39-51.

McCullagh, P. & Nelder, J. A. (1983). *Generalized Linear Models*, 2^{nd} Ed. London: Chapman and Hall.

Riba, A. & Ginebra, J. (2000).
Riquesa de vocabulari i homogeneïtat d’estil en el *Tirant lo Blanc*. *Revista
de Catalunya,* **152**, 99-118.

Riquer, M. de (1990). Excurs VI. Martí
Joan de Galba i la seva intervenció en la novel·la. *Aproximació al Tirant lo
Blanc.* 285-299. Barcelona: Quaderns Crema.

Structural homologies

Lennart Rosenlund

Stavanger University College, Norway

lennart.rosenlund@oks.his.no

The point of departure of my paper is the idea of “structural causality of a network of factors” that Pierre Bourdieu advocates in *Distinction* (Bourdieu, 1979, 1984) to come to terms with the shortcomings of the standard methods in quantitative social research. According to him, the most popular and most widely used methods in quantitative analysis are in fact problematic to use.

In *Distinction* this approach is developed by introducing two concepts, or constructs: the *space of social positions* (the social space) and the *space of lifestyles*. The social space aims to give the “universe of material conditions of existence” the best possible representation. The space of lifestyles has a similar aim; it tries to disclose the main oppositions and divisions within the “universe of lifestyles”, a space of “position-takings”, whose content consists of the products of the *habitus*, i.e., judgements, classifications and perceptions, tastes and distastes.

These two representations function as key concepts in this alternative methodology, where the space of social position tends to command the space of lifestyles in periods of equilibrium. Technically, the social space is the resulting map of the first and second principal axes of a multiple correspondence analysis (MCA) - a network of factors - based on carefully chosen background variables (various indices of economic and cultural capital). In a similar way the space of lifestyles is the resulting map of an MCA of relevant indices of lifestyles signs (another network of factors). My intention is to focus on this alternative methodology and furnish a basis for an empirical evaluation of it.

Thus, I will present results from an ongoing collective research project (with Johs. Hjellbrekke & Olav Korsnes, University of Bergen) on the Norwegian social space. I will present an outline and the characteristics of one version of it, which corresponds well with Bourdieu’s findings. In a second step the analyses will be undertaken using the “reciprocal approach” (Lebart et al., 1984): lifestyle components are then the “raw material” for an MCA, and the resulting map reveals the main divisions among these “products” of the habitus.

A careful comparison of these two
independently constructed spaces by focusing on the positions of the *individuals* reveals that they are
structured according to the same set of principles; they are homologous. They
are both swayed by the two mechanisms of social differentiation identified by
Bourdieu: volume and composition of capital. These two mechanisms are operating
both in creating social differences between people in the objective social
structure of which the social space is the representation, as well as in
catching the formative principles of lifestyles. Further, these homological
relations seem to be of a robust nature. They appear invariably, as it seems,
in any sub-universe of lifestyle components.

Bourdieu, P. (1979, 1984). *Distinction: A Social Critique of the Judgement of Taste*. London: Routledge & Kegan Paul.

Lebart, L., Morineau, A. & Warwick, K. (1984). *Multivariate Descriptive Statistical Analysis. Correspondence Analysis and Related Techniques for Large Matrices*. New York: John Wiley & Sons.

Measure vs variable duality in geometric data analysis

Henry Rouanet & Brigitte Le Roux

Université René Descartes, Paris, France

Rouanet@math-info.univ-paris5.fr & Lerb@math-info.univ-paris5.fr

The formal approach used by Benzécri to
develop correspondence analysis (CA) was not an accidental matter of notation,
but an integral part of the construction (Benzécri & coll., 1973). The
properties of CA are entirely founded on the underlying mathematical theory,
essentially abstract linear algebra - with a zest of measure theory - as found
in classical textbooks (MacLane, Halmos, etc.): finite-dimensional vector
space, homomorphism, scalar product, etc. The cornerstone of the approach is
the measure vs variable duality, which formalizes the distinction between two
sorts of quantities: those for which grouping units entails summing (adding up)
values, such as weights and frequencies, which we call *measures* (as in mathematical measure theory), versus those for
which grouping units entails averaging values, such as scores, rates, which we
call *variables*. This duality is
reflected in the duality notation (alias transition notation), which puts lower
indices on measures and upper indices on variables (Rouanet & Le Roux, 1993).

In the paper, we describe the role of measure vs variable duality in CA at the following two crucial stages of geometric modelling:

i. *Construction of clouds and the chi-square metric.* The marginal frequencies of the table firstly provide reference measures over rows and columns. Secondly, they define Euclidean isomorphisms from variable vector spaces to dual measure vector spaces, hence scalar products and Euclidean norms; they therefore determine the chi-square metric over those spaces without arbitrariness.

ii. *Principal directions of clouds and principal coordinates.* The fundamental mathematical result is that the solution of the spectral equations is the singular value decomposition of two adjoint homomorphisms and/or the associated bilinear form. Applying this result to CA immediately yields the transition equations and the reconstitution formulas.
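The two stages listed above can be followed numerically in matrix shorthand. The sketch below is an illustration only: it uses the standard matrix formulation of CA (singular value decomposition of the standardized residuals, with the marginal frequencies defining the chi-square metric), the function name `correspondence_analysis` and the small contingency table are hypothetical, and the transition and reconstitution formulas are then checked on the result.

```python
import numpy as np

def correspondence_analysis(N):
    """CA of a contingency table N via the SVD of standardized residuals."""
    P = N / N.sum()                      # correspondence matrix (a measure)
    r = P.sum(axis=1)                    # row margins: reference measure on rows
    c = P.sum(axis=0)                    # column margins: reference measure on columns
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    keep = sv > 1e-12                    # drop the trivial null dimensions
    U, sv, Vt = U[:, keep], sv[keep], Vt[keep]
    F = (U / np.sqrt(r)[:, None]) * sv       # principal coordinates of rows
    G = (Vt.T / np.sqrt(c)[:, None]) * sv    # principal coordinates of columns
    return P, r, c, sv, F, G

N = np.array([[20.0, 10.0, 5.0],
              [10.0, 15.0, 10.0],
              [5.0, 10.0, 25.0]])
P, r, c, sv, F, G = correspondence_analysis(N)

# Transition equation: a row point is the average of the column points,
# weighted by the row profile, rescaled by the singular value of each axis.
F_from_G = (P / r[:, None]) @ G / sv

# Reconstitution formula: p_ij = r_i c_j (1 + sum_k f_ik g_jk / sv_k).
P_rebuilt = np.outer(r, c) * (1.0 + (F / sv) @ G.T)

print(np.allclose(F, F_from_G), np.allclose(P, P_rebuilt))
```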

Two implications will be briefly discussed:

1. *Formal approach vs matrix approach.* Translating abstract linear results, with the various roles of duality, into matrix formulas is an easy task, and does provide a compact format to transmit the algorithm of CA - using matrix formulas as a shorthand - but no more than the algorithm. The converse translation, i.e. deciphering the rationale of the procedure from matrix formulas, is more of a headache.

2. *Methodologically*, measure vs variable duality provides firm operational guidelines for practical data analysis, especially for devising the codings most appropriate to the situation under study.

Benzécri, J.-P. & coll. (1973). *Analyse des Données*, Volume 2, *Analyse des Correspondances.* Paris: Dunod.

Rouanet, H. & Le Roux, B. (1993). *Analyse des
Données Multidimensionnelles.* Paris: Dunod.

Changes in UK leisure patterns (1973-1997)

Iris Rubbert

Loughborough University, United Kingdom

i.m.rubbert@lboro.ac.uk

This paper advances the results of Gershuny
and Fisher’s (1999) work on ‘leisure in the UK across the 20^{th}
century’. The authors investigated leisure time-use patterns of UK residents
across a 25-year period by means of multiple classification analysis, which
provides estimates of the average time spent on leisure and sports activities
and the effects of belonging to a particular sub-group in the population. Even
though multiple classification analysis is an excellent tool for the
quantification of time-use patterns, its contribution to building a qualitative
context is limited. Hence, using the same data set that was originally drawn
from the General Household Survey (GHS) a simple correspondence analysis (CA)
was conducted to explore how people’s leisure behaviour and level of sports
participation changed by region, sex, age, and profession between 1973 and
1997. Since the leisure questions of the GHS were not asked in a consistent
manner over the years, the concept of historic profiles introduced by
Mueller-Schneider (1994) was used for this study rather than absolute
frequencies.

Gershuny and Fisher (1999) revealed that they had difficulties finding the appropriate documentation for the 1973 data file. This problem was reflected in the results of the CA and led to the exclusion of this particular year for many variables. Other results showed that even though there was an apparent north-south division with regard to some leisure activities, patterns of leisure behaviour with regard to the regions varied widely. However, there seemed to be a tendency for ‘new’ leisure styles to emerge in London and the South-East before travelling further north.

The paper will also take ‘leisure studies’ as an example of an interdisciplinary subject that has neglected the wealth of basic statistical information ever since it came into existence. It will be shown how the results of a CA have the potential to integrate geographical, sociological and management issues in leisure research.

Gershuny, J. I. & Fisher, K. (1999). *Leisure in the UK across the 20^{th} century*, Working papers of the ESRC Research Centre on Micro-social Change, paper number 99-3. Colchester: Institute for Social and Economic Research, University of Essex.

Mueller-Schneider, T. (1994). The visualization of structural change by means of correspondence analysis. In *Correspondence Analysis in the Social Sciences* (eds M. J. Greenacre & J. Blasius). London: Academic Press.

The environmental impact of Italian farming activities: testing group membership in surveys through multi-dimensional data analysis

Renato Salvatore & Carlo Russo

University of Cassino, Italy

rsalvatore@unicas.it & russocar@unicas.it

The study of the impact of human activities on the environment is a current research issue because of the new directions in European policy in the field of sustainable and environmentally responsible development. However, the unavailability of agri-environmental data is considered a major constraint, preventing analysts from providing reliable assessments (Moxey et al., 1998).

In this paper, a general framework is provided to study the environmental impact of agriculture through farm-level sample surveys, focusing on the system of relationships between environmental and structural data identified by a multiple correspondence analysis. The complex relations between farm structure and environmental impact have been identified empirically through exploratory analysis and can be used by researchers to infer environmental information from structural data.

The approach uses readily available structural data to infer the environmental behaviour of Italian farms. The methodology is based on testing the membership of sample units in a census typology of farms at different levels of environmental impact; the typology was obtained by multiple correspondence analysis and cluster analysis in order to group census farms into homogeneous classes (Sabbatini & Russo, 2002). Using sample survey data, in this paper we apply multidimensional data analysis as a tool to estimate group membership in a pre-established typology, using the distribution of the supplementary variables in each group.

In order to design a sampling strategy for the population of farms grouped in homogeneous classes, we have adapted the convex programming approach to the multivariate sample allocation problem (Bethel, 1989) to the needs of clustering procedures. This technique (Innocenzi & Salvatore, 2002) is useful when we do not need population estimates at the analytic or territorial domains level, but the aim is to allocate the sample in strata that can represent the research domains. The method proposed tests the sampling distribution of the categories of the environmental impact supplementary variables, under the null hypothesis of farm membership to a pre-established environmental impact class.

The technique is evaluated using data from the Italian 2000 census of the Lazio region.

The approach can increase the efficiency of agri-environmental statistical systems through cost reduction, more timely estimates of environmental trends, and more meaningful and intelligible statistics based on a multidimensional approach rather than on separate indicators.

Bethel, J. (1989). Sample allocation in multivariate surveys. *Survey Methodology*, **15**, 47-57.

Innocenzi, G. & Salvatore, R. (2002). The implementation of the DPSIR model in the Italian agri-environmental statistic system: methodology issues rising from the 1998 FSS experience. *Proceedings of the Eurostat International Conference on the Agricultural Statistics in the new Millennium, Greece* (http://www.ariadne2002.gr/en/).

Moxey, A., Whitby, M. & Low, P. (1998). Agri-environmental indicators: issues and choices. *Land Use Policy*, **15**.

Sabbatini, M. & Russo, C. (2002).
Assessing agricultural environmental impact: a cluster analysis approach. *Proceedings of the Eurostat International
Conference on the Agricultural Statistics in the new Millennium, Greece* (http://www.ariadne2002.gr/en/).

Cluster analysis and HJ-biplot: a joint approach applied to the evaluation of the adolescent personality

P O S T E R

Sonia Salvo, Paula Alarcón & Eugenia Vinet

Universidad de la Frontera, Temuco, Chile

ssalvo@ufro.cl, paulandr@ufro.cl & evinet@ufro.cl

The aim of cluster analysis is to organise objects according to their own characteristics. There are a number of ways to construct clusters of *n* taxonomic units with respect to *p* variables, but in general it is not possible to know directly the particular configuration of variables responsible for each of the groups.

Our work establishes a relationship between cluster analysis and the HJ-Biplot technique (Galindo & Cuadras, 1986). This technique can be applied to any data matrix and gives a simultaneous representation of rows and columns, both with high quality of representation, projected on a subspace of maximum inertia. This methodology makes it possible to identify in the factorial plane each of the clusters and the variables associated with them.
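A minimal sketch of the markers used in an HJ-biplot is given below. It assumes the usual construction from the singular value decomposition, in which both row and column markers are scaled by the singular values so that both sets are in principal coordinates; column centring, the function name `hj_biplot` and the simulated data are choices made for this illustration only.

```python
import numpy as np

def hj_biplot(X, n_axes=2):
    """Row and column markers of an HJ-biplot of the column-centred matrix X.

    With Xc = U diag(sv) V', rows are plotted at U diag(sv) and columns at
    V diag(sv), restricted to the first n_axes axes.
    """
    Xc = X - X.mean(axis=0)
    U, sv, Vt = np.linalg.svd(Xc, full_matrices=False)
    row_markers = U[:, :n_axes] * sv[:n_axes]
    col_markers = Vt[:n_axes].T * sv[:n_axes]
    return row_markers, col_markers, sv

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 4))              # simulated units-by-variables matrix
rows, cols, sv = hj_biplot(X)

# Both clouds carry the same inertia on each retained axis.
print((rows ** 2).sum(axis=0), (cols ** 2).sum(axis=0), sv[:2] ** 2)
```

Cluster labels obtained separately can then be used to colour the row markers, and the column markers lying in the direction of a cluster point to the variables that characterise it.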

In our study we illustrate these techniques using a sample of 104 adolescent low-level offenders. The Millon Adolescent Clinical Inventory (MACI), published in 1993, was applied using an associative design within a selective methodology. Using both analysis techniques, four clusters of personality patterns were obtained. The first two were characterised by external behaviour of a destructive type. The others were characterised by passive and inhibited behaviour. The model explains 28.5% of the variance of the non-social adaptation index. These findings are discussed in the context of the psychological evaluation of adolescents and its applications in forensic psychology.

This work was funded by the project
FONDECYT No 1010514, which is gratefully acknowledged.

Galindo, M. P. & Cuadras, C. (1986). Una alternativa de representación simultánea: HJ-biplot. *Qüestiió*, **10**, 13-23.

An application of nonsymmetrical correspondence analysis, based on TUCKALS3 algorithm, to electoral marketing data

P O S T E R

Sonia Salvo^{1}, Purificación Galindo^{2}, Luis Cid^{3} & Javier Martín^{2}

^{1}Universidad de la Frontera, Chile, ^{2}Universidad de Salamanca, Spain & ^{3}Universidad de Concepción, Chile

ssalvo@ufro.cl & pgalindo@usal.es

Electoral analysis consists in evaluating information obtained from previous elections in order to compile segmented voting records. This targeting task allows a campaign to fine-tune its communication and direct it to certain segments of the electorate through direct mail, telephone and the internet. Targeting is particularly useful for identifying and profiling undecided voters. As an illustration we analysed data from a pre-electoral survey on the 1996 elections in Spain. The data came from the “*Centro de Investigaciones Sociológicas*” (C.I.S.). The survey is based on a census list prepared for the 1996 election in Spain, excluding Ceuta and Melilla. The sample size was 2547 individuals once missing and erroneous data were eliminated. Fifty variables were measured (more details in Dorado et al., 2002). The aim is to describe the voting intentions of survey respondents for the state parties (IU, PP, PSOE) according to their age (18-24; 25-44; and >45 years old) and educational level (no studies; primary level; secondary level; higher level).

We analysed the three-way contingency table (state parties x age x educational level) using partial nonsymmetrical correspondence analysis (Lauro & Balbi, 1999) and nonsymmetrical correspondence analysis based on the TUCKALS3 algorithm (Salvo, 2002) to illustrate the interpretation, advantages and limitations of the former method.

Dorado, A., Galindo, M. P., Vicente-Villardón, J. & Vicente-Tavera, S. (2002). El CHAID como herramienta de marketing politico. *Esic Market*, **111**.

Lauro, N. C. & Balbi, S. (1999). The analysis of
structured qualitative data. *Applied Stochastic Models and Data Analysis*,
**15**, 1-27.

Salvo, S. (2002). *Contribuciones al análisis de modelos para variables cualitativas que contemplan variable respuesta*. PhD thesis. Universidad de Salamanca.

Correspondence analysis and classification

Gilbert Saporta

Conservatoire National des Arts et Métiers, Paris, France

saporta@cnam.fr

The use of correspondence analysis for classification purposes goes back to the “prehistory” of data analysis (Fisher, 1940), where one looks for the optimal scaling of the categories of a variable *X* in order to predict a categorical variable *Y*. When there are several categorical predictors, a commonly used technique consists of a two-step analysis: multiple correspondence analysis is first performed on the set of predictors, followed by a discriminant analysis using the factor coordinates of units as numerical predictors (Bouroche et al., 1977).

However, in banking applications (for example, credit scoring) logistic regression is increasingly used instead of discriminant analysis when predictors are categorical. One of the reasons advanced in favour of logistic regression is that it gives a probabilistic model, and econometricians often claim that its theoretical basis is more solid, but this is arguable. This tendency is also due to the flexibility of logistic regression software, which is more developed than software for discriminant analysis. However, it can easily be shown that discarding non-informative eigenvectors gives more robust results than direct logistic regression, since it is a regularisation technique similar to principal component regression (Hastie et al., 2001). Moreover, correspondence analysis provides insight into the data, which is always useful.

Since factor coordinates are derived without taking into account the response variable, one could think of adapting partial least squares (PLS) regression. We will show that PLS is related, at least for the first PLS component, to barycentric discrimination (Celeux & Nakache, 1994; Verde & Palumbo, 1996).

For two-class discrimination, we will also present a combination of logistic regression and correspondence analysis, as well as ridge regression, both of which are interesting alternatives. A comparison of all these methods will be illustrated on a real case study.

Bouroche, J. M., Saporta, G. &
Tenenhaus, M. (1977). Some methods of qualitative data analysis. In *Recent
Developments in Statistics* (ed J. R. Barra), 749-755. Amsterdam:
North-Holland.

Celeux, G. & Nakache, J. P. (1994). *Discrimination sur Variables Qualitatives*. Paris: Polytechnica.

Fisher, R. A. (1940). The precision of discriminant functions. *Annals of Eugenics*, **10**, 422-429.

Hastie, T., Tibshirani, R. & Friedman, J. (2001). *The Elements of Statistical Learning*. New York: Springer.

Verde, R. & Palumbo, F. (1996).
Analisi fattoriale discriminante non-simmetrica su predittori qualitativi. *Atti
del Convegno della XXXVIII Riunione Scientifica della Società Italiana di
Statistica*, Rimini.

Simple, optimal, factor and unidimensional scale scores: a comparison

Hans Schadee & Giovanni Battista Flebus

Università di Milano-Bicocca, Milano, Italy

hans.schadee@unimib.it & giovannibattista.flebus@unimib.it

In many analyses individual scores are obtained by forming weighted sums from several observed variables or items. Simple sums, optimal scores, factor scores and scores resulting from unidimensional scaling models - whether cumulative scales (Guttman, Mokken, Rasch) or unfolding and seriation models (Coombs) - have all been used for this purpose. The relations between these techniques are relatively well known, though often ignored in applications. The degree to which the results of one analysis are informative with respect to another model, or whether scores from different models give the same results, is less well known. The empirical investigation of trace functions (item characteristic functions) using local (non parametric) regression of item responses on the total score sheds light on these empirical questions.

As examples we use psychological test data - Eysenck's neuroticism scale, the Bem Sex Role Inventory, an abridged version of the Adorno F-scale - and data from public opinion surveys on electoral behaviour.

Regularization in multiple-set canonical correlation analysis

Yoshio Takane & Heungsun Hwang

McGill University, Montreal & HEC, Montreal, Canada

takane@takane2.psych.mcgill.ca & heungsun.hwang@hec.ca

Generalized (multiple-set) canonical correlation analysis (GCANO; Carroll, 1968; Horst, 1961) has attracted the attention of many data analysts primarily because it subsumes a number of interesting techniques in multivariate analysis as special cases (Yanai, 1998). More recently, however, it has been recognized as an important method of integrating information from multiple sources (Takane & Oshima-Takane, 2001). In this paper we discuss a regularization technique for linear GCANO. Regularization is considered important as a way of solving ill-posed problems, of supplementing insufficient data by prior knowledge, or of incorporating certain desirable properties in the estimates of the model parameters. We discuss some mathematical properties of a matrix operator involved in the ridge type of regularization method in GCANO and discuss their implications for multiple correspondence analysis (MCA).

Let *X*_{i} denote a column-wise centered cases-by-variables matrix for the *i*^{th} data set, and let *X* denote the super-matrix formed from the *X*_{i} (*i* = 1, …, *K*) arranged side by side. Define *M*(λ) = *I* + λ(*XX*’)^{-}, where λ is a regularization parameter and (*XX*’)^{-} indicates a g-inverse of *XX*’. Regularized GCANO obtains the generalized eigenvalue-vector decomposition (GEVD) of *X*’*M*(λ)*X* with respect to *D*(λ), a block diagonal matrix with *D*_{i}(λ) = *X*’_{i}*M*(λ)*X*_{i} as the *i*^{th} diagonal block. An optimal value of λ is determined by cross validation. Note that when λ = 0, and consequently *M*(λ) = *I*, the problem reduces to the same GEVD problem as solved in conventional MCA. A small positive value of λ, on the other hand, yields parameter estimates that are biased but have a smaller expected mean squared error (Hoerl & Kennard, 1970). Regularization is useful when the number of variables (categories in the case of MCA) is large relative to the sample size.
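The definitions above translate directly into a short numerical sketch. It is an illustration under stated assumptions, not the authors' software: the Moore-Penrose inverse is used as the g-inverse, the blocks are simulated continuous data with full column rank (so that *D*(λ) is positive definite), λ is fixed by hand rather than chosen by cross validation, and the function name `regularized_gcano` is hypothetical.

```python
import numpy as np

def regularized_gcano(blocks, lam):
    """GEVD of X'M(lam)X with respect to the block-diagonal D(lam),
    where M(lam) = I + lam * (XX')^- and X = [X_1, ..., X_K]."""
    X = np.hstack(blocks)
    n = X.shape[0]
    # Moore-Penrose inverse as one particular g-inverse of XX'.
    M = np.eye(n) + lam * np.linalg.pinv(X @ X.T, rcond=1e-10, hermitian=True)
    A = X.T @ M @ X
    D = np.zeros_like(A)
    start = 0
    for Xi in blocks:                       # fill the diagonal blocks D_i(lam)
        stop = start + Xi.shape[1]
        D[start:stop, start:stop] = Xi.T @ M @ Xi
        start = stop
    # Reduce A v = mu D v to a symmetric problem with D^{-1/2}.
    w, Q = np.linalg.eigh(D)
    D_inv_half = Q @ np.diag(w ** -0.5) @ Q.T
    evals, evecs = np.linalg.eigh(D_inv_half @ A @ D_inv_half)
    order = np.argsort(evals)[::-1]
    return evals[order], (D_inv_half @ evecs)[:, order]

rng = np.random.default_rng(1)
blocks = [rng.normal(size=(30, p)) for p in (3, 2, 4)]
blocks = [B - B.mean(axis=0) for B in blocks]     # column-wise centring

evals0, weights0 = regularized_gcano(blocks, 0.0)  # conventional GCANO
evals1, weights1 = regularized_gcano(blocks, 0.5)  # ridge-type regularization
print(evals0[:3])
print(evals1[:3])
```

When *X* has full column rank, *X*’(*XX*’)^{-}*X* is the identity matrix, so *X*’*M*(λ)*X* = *X*’*X* + λ*I*, which makes the ridge character of the method explicit.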

Some examples are given to illustrate the method.

Carroll, J. D. (1968). A generalization
of canonical correlation analysis to three or more sets of variables. *Proceedings
of the 76th Annual Convention of the American Psychological Association*,
227–228.

Hoerl, A. E. & Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal problems. *Technometrics*, **12**, 55–67.

Horst, P. (1961). Generalized canonical
correlations and applications to experimental data. *Journal of Clinical
Psychology*, **17**, 331–347.

Takane, Y. & Oshima-Takane, Y. (2001). Nonlinear generalized canonical correlation analysis by neural network models. In *Measurement and Multivariate Analysis* (eds S. Nishisato et al.), 183–190. Tokyo: Springer Verlag.

Yanai, H. (1998). Generalized canonical
correlation analysis with linear constraints. In *Data Science,
Classification and Related Methods* (eds C. Hayashi et al.), 539–546. Tokyo:
Springer Verlag.

Increasing Cronbach’s alpha for questionnaire reliability by using optimal scaling

Dicle
Taspinar^{1}, Vedat Coskun^{2} & Nihat Demirhan^{2}

^{1}Istanbul Commerce University, Turkey & ^{2}Turkish
Naval Academy, Turkey

dtaspinar@iticu.edu.tr & vedatcoskun@dho.edu.tr

Questionnaires are the most commonly used technique for collecting data in marketing and social research. A key criterion of questionnaire reliability is Cronbach’s alpha coefficient: as the value of Cronbach’s alpha increases, the questionnaire becomes more reliable. In this paper, we prepare a simulated questionnaire in order to determine which respondent choices affect the consistency, and therefore the reliability, of the questionnaire.

Flebus (1990), Bernardi (1994) and Barnette (1999) also showed that Cronbach's alpha coefficient can be affected by the observations. Flebus (1990) studied the observations and the correlations among variables, together with the variance of these variables. He then developed a program which can calculate the point where Cronbach's alpha is maximized. Bernardi (1994) tried to show that some relations can be found between the variables and the observations even if Cronbach’s alpha is small. Barnette (1999) performed a simulation and found a way to increase and decrease Cronbach's alpha by looking at respondent refusals.

We will show that homogeneity analysis, an optimal scaling method, makes it possible to identify the observations that decrease consistency. Questionnaire reliability can then be increased simply by removing those observations.
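To make the quantity under discussion concrete, here is a minimal sketch of Cronbach's alpha computed from a respondents-by-items score matrix, together with a simple leave-one-respondent-out scan. The scan is only an illustrative stand-in, not the homogeneity analysis proposed in this abstract, and the function names are hypothetical.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a (respondents x items) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1).sum()
    total_variance = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1.0 - item_variances / total_variance)

def alpha_without_each_respondent(scores):
    """Alpha recomputed after removing each respondent in turn."""
    scores = np.asarray(scores, dtype=float)
    return np.array([
        cronbach_alpha(np.delete(scores, i, axis=0))
        for i in range(scores.shape[0])
    ])
```

Respondents whose removal raises alpha the most are candidates for the inconsistent observations discussed above.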

Barnette, J. (1999). Nonattending respondent effects on the internal consistency of self-administered surveys: a Monte Carlo simulation study. *Educational and Psychological Measurement*, **59**, 38-46.

Bernardi, R. (1994). Validating research results when Cronbach's alpha is below .70: a methodological procedure. *Educational and Psychological Measurement*, **54**, 766-776.

Flebus, G. B. (1990). A program to select the best items that maximize Cronbach's alpha. *Educational and Psychological Measurement*, **50**, 831.

Co-correspondence analysis: a new ordination method to relate two species compositions

Cajo J. F. ter
Braak & André P. Schaffers

University and Research Centre, Wageningen, The Netherlands

Cajo.terbraak@wur.nl & Andre.Schaffers@wur.nl

A new ordination method, called
co-correspondence analysis, is developed to relate two types of communities of
species (e.g. a plant community and an animal community) sampled at a common
set of *n *sites in a direct way. The two data sets contain nonnegative
values (abundances), typically with very many zeroes, and have many more
variables (species) than statistical units (sites). The method improves the
simple, indirect approach of applying correspondence analysis (reciprocal averaging)
to the separate species data sets and correlating the resulting ordination
axes. Co-correspondence analysis maximizes the weighted covariance between
weighted averaged species scores of one community with weighted averaged species
scores of the other community. It thus attempts to identify the patterns that
are common to both communities. Both a symmetric, descriptive and an
asymmetric, predictive form are developed. The symmetric form relates to
co-inertia analysis (Dolédec & Chessel, 1994). Predictive co-correspondence
analysis relates to correspondence analysis as partial least squares (PLS)
regression (Martens & Naes, 1992; ter Braak & de Jong, 1998) relates to
principal component analysis.

Co-correspondence analysis uses weighted averages where PLS uses linear combinations (weighted sums), as in ter Braak (1995). The new method performs better than PLS when the data have a unimodal structure, a strong qualitative nature and/or are sum-constrained, i.e. when each data set is better analyzed by correspondence analysis than by principal component analysis (ter Braak & Prentice, 1988).
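The contrast between weighted sums and weighted averages can be sketched as follows. This is only an illustration of the two kinds of site score, not the co-correspondence algorithm itself; `Y` is a hypothetical sites-by-species abundance matrix, `u` a hypothetical vector of species scores, and every site is assumed to have a positive total abundance.

```python
import numpy as np

def weighted_sum_scores(Y, u):
    """Site scores as linear combinations (weighted sums) of species scores."""
    return np.asarray(Y, dtype=float) @ np.asarray(u, dtype=float)

def weighted_average_scores(Y, u):
    """Site scores as abundance-weighted averages of species scores.

    Assumes every site (row of Y) has a positive total abundance.
    """
    Y = np.asarray(Y, dtype=float)
    return (Y @ np.asarray(u, dtype=float)) / Y.sum(axis=1)
```

Weighted averages depend only on the relative composition of a site, which is why they suit sum-constrained abundance data.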

In two examples the predictive power of co-correspondence analysis is compared with that of canonical correspondence analyses (ter Braak, 1986; ter Braak & Verdonschot, 1995) on syntaxonomic and environmental data. In the first example carabid beetles in roadside verges are shown to be more closely related to plant species composition than to vegetation structure, and in the second example bryophytes in spring meadows are shown to be more closely related to the species composition of the vascular plants than to the measured water chemistry.

Dolédec, S. & Chessel, D. (1994).
Co-inertia analysis: an alternative method for studying species-environment
relationships. *Freshwater Biology,* **31**, 277-294.

Martens,
H. & Naes, T. (1992). *Multivariate Calibration*. Chichester: Wiley.

ter
Braak, C. J. F. (1986). Canonical correspondence analysis: a new eigenvector
technique for multivariate direct gradient analysis. *Ecology,* **67**,
1167-1179.

ter Braak, C. J. F. (1995). Non-linear methods for multivariate statistical calibration and their use in palaeoecology: a comparison of inverse (k-nearest neighbours, partial least squares and weighted averaging partial least squares) and classical approaches. *Chemometrics and Intelligent Laboratory Systems*, **28**, 165-180.

ter
Braak, C. J. F. & de Jong, S. (1998). The objective function of partial
least squares regression. *Journal of Chemometrics,* **12**, 41-54.

ter Braak, C. J. F. & Prentice, I. C. (1988). A theory of gradient analysis. *Advances in Ecological Research*, **18**, 271-317.

ter Braak, C. J. F. & Verdonschot, P. F. M. (1995). Canonical correspondence analysis and
related multivariate methods in aquatic ecology. *Aquatic Sciences,* **57**, 255-289.

Correspondence analysis and categorical conjoint measurement

Anna Torres-Lacomba

Universidad Carlos III de Madrid, Spain

atorres@emp.uc3m.es

Quantifying the trade-offs individuals make when choosing between multidimensional alternatives is a typical task in marketing research, usually handled by conjoint analysis. We show that, in a particular case, correspondence analysis (CA) can also be used to analyze conjoint data and, further, that it offers a map which helps to understand the results obtained.

Conjoint analysis can be understood as a technique which predicts what products or services people will prefer and assesses the weight people give to various factors that underlie their decisions. There exist different conjoint algorithms for analyzing such data, depending on the type of conjoint measurement: in our case we are interested in conjoint measurement on a categorical scale. For this case there exists an algorithm due to Carroll (1969) known as categorical conjoint measurement. In this study we show how correspondence analysis can be applied to tables concatenated in a certain way in order to emulate Carroll’s algorithm. We use canonical correlation analysis applied to dummy variables as a bridge in order to show the equivalence in results between categorical conjoint analysis and correspondence analysis. Previous literature has already demonstrated the equivalence between simple correspondence analysis and canonical correlation analysis for two categorical variables (Greenacre, 1984). Our first innovation came with the demonstration of the equivalence between correspondence analysis and canonical correlation analysis for more than two categorical variables.
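For the two-variable case mentioned above, simple correspondence analysis can be sketched as a singular value decomposition of the standardized residuals of a contingency table. The singular values obtained this way coincide with the canonical correlations between the two sets of dummy variables, which is the bridge referred to in the text. The function name is hypothetical and the concatenated tables used to emulate Carroll's algorithm are not reproduced here.

```python
import numpy as np

def correspondence_analysis(N):
    """Simple CA of a two-way contingency table N (all margins assumed positive).

    Returns singular values and principal row and column coordinates.
    """
    N = np.asarray(N, dtype=float)
    P = N / N.sum()                       # correspondence matrix
    r = P.sum(axis=1)                     # row masses
    c = P.sum(axis=0)                     # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    rows = (U * sv) / np.sqrt(r)[:, None]
    cols = (Vt.T * sv) / np.sqrt(c)[:, None]
    return sv, rows, cols
```

The sum of the squared singular values is the total inertia, that is, the Pearson chi-squared statistic divided by the grand total of the table.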

A further issue in the conjoint analysis literature is the study of interaction effects (Green, 1973). We incorporated interactions in canonical correlation analysis for the usual way of coding interactions as well as for a new one, establishing the connection with correspondence analysis and categorical conjoint measurement in the presence of interactions.

As an example we give an application in which a potential interaction effect between the type of fragrance for a perfume and its intensity may exist. For example, a particular subject may prefer floral fragrance as well as low intensity fragrance, but for the particular case of citric fragrance, high intensity is preferred.

Carroll, J. D. (1969). *Categorical Conjoint Measurement.* Unpublished manuscript. Bell Laboratories, Murray Hill.

Green, P. E. (1973). On the analysis of interactions in marketing research data. *Journal of Marketing Research*, **10**, 410-420.

Greenacre, M. J. (1984). *Theory and Applications of Correspondence Analysis*. London: Academic Press.

Resampling methods applied to stability analysis in multiple correspondence analysis: two case studies in Colombia

P O S T E R

Leonardo Trujillo & Elquin A. Huertas

National
University of Colombia, Bogotá, Colombia & Citizen Coexistence Project

trujillo@matematicas.unal.edu.co & elquin_huertas@fundacion-social.com.co

When a multiple correspondence analysis (MCA) is applied, important questions arise regarding the stability of either the configurations representing the data or the estimators produced from these data. In particular, we are concerned with two types of stability:

(1) Internal (or inner) stability, which refers to the stability of configurations or estimators due to changes in the data inside a particular sample. A question related to inner stability may be: does a particular observation excessively influence the obtained representation or estimators?

(2) External (or outer) stability, which refers to the stability of configurations and estimators to changes in the whole sample. A question related to outer stability is: would it be possible to obtain the same configurations or estimators if we used a different sample with the same characteristics? Thus, it could be said that a factorial plane is externally stable if its orientation is minimally altered across several samples of the same population.

Jackknife and bootstrap techniques (Efron, 1982; Shao & Tu, 1995) are used to study inner and outer stability, respectively. The idea is to repeat MCA on each of the simulated samples to study their fluctuations. Additionally, the bootstrap allows us to obtain confidence zones (Lebart et al., 1995) in order to study the stability of some configurations obtained in the factorial planes.
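The resampling idea can be sketched for a single summary of the MCA solution, here the largest principal inertia of an indicator matrix. This is a simplified illustration under stated assumptions: the statistic, the function names and the number of bootstrap replicates are hypothetical choices, and confidence zones for points on the factorial planes are not computed.

```python
import numpy as np

def first_inertia(Z):
    """Largest principal inertia from the CA of an indicator matrix Z (n x J)."""
    Z = Z[:, Z.sum(axis=0) > 0]           # drop categories absent from a resample
    P = Z / Z.sum()
    r = P.sum(axis=1)
    c = P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    return np.linalg.svd(S, compute_uv=False)[0] ** 2

def jackknife(Z, stat):
    """Leave-one-out values of stat (inner stability)."""
    return np.array([stat(np.delete(Z, i, axis=0)) for i in range(Z.shape[0])])

def bootstrap(Z, stat, n_boot=200, seed=0):
    """Values of stat on samples drawn with replacement (outer stability)."""
    rng = np.random.default_rng(seed)
    n = Z.shape[0]
    return np.array([stat(Z[rng.integers(0, n, size=n)]) for _ in range(n_boot)])
```

A single observation that shifts the jackknife value markedly points to inner instability, while a wide spread of the bootstrap values points to outer instability.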

In this work, two particular applications of stability analyses are presented. First, a study of internal and external stability for two methods of longitudinal data analysis, namely qualitative harmonic analysis (QHA) and STATIS, is performed. These methods were applied to study the mobility of individuals in Bogota, Colombia (Trujillo, 2002). Second, the external stability in the construction of indicators of citizen coexistence in adolescents of Bogota was studied (Huertas & Corzo, 2001).

Efron, B. (1982). *The Jackknife, the
Bootstrap and other Resampling Plans*. Society for Industrial and Applied
Mathematics. Philadelphia.

Huertas, E. & Corzo, J. (2001). Análisis de estabilidad de indicadores: Pluralismo en la convivencia ciudadana. *Memorias simposio de Estadística. Estadística en la investigación social*. Santa Marta - Colombia. Agosto de 2001.

Lebart, L., Morineau, A., & Piron, M. (1995). *Statistique Exploratoire Multidimensionnelle*. Paris: Dunod.

Shao, J. & Tu, D. (1995). *The Jackknife and the Bootstrap*. New York: Springer-Verlag.

Trujillo, L. (2002). *Estimación de la Varianza de los Valores Propios Estimados para dos Métodos de Análisis de Datos Longitudinales*: STATIS Y ACC. Tesis de maestría. Universidad Nacional de Colombia. Facultad de Ciencias. Departamento de Estadística.

Interactive software to produce biplots

Frederic Udina

Universitat
Pompeu Fabra, Barcelona

udina@upf.es

We analyse and discuss how generic software to produce biplot graphs should be designed. We describe a data structure appropriate for holding the biplot description and we specify the algorithm(s) to be used for different biplot types.
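One common way to obtain different biplot types from a single algorithm is through the singular value decomposition with a scaling exponent. The sketch below assumes that approach; the function name and the `alpha` parameter are hypothetical and are not taken from the software described in this abstract.

```python
import numpy as np

def biplot_coordinates(X, alpha=1.0, center=True):
    """Row and column coordinates for an SVD-based biplot.

    alpha = 1 puts the singular values on the rows (row-metric preserving),
    alpha = 0 puts them on the columns (column-metric preserving).
    """
    X = np.asarray(X, dtype=float)
    if center:
        X = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    rows = U * s ** alpha
    cols = Vt.T * s ** (1.0 - alpha)
    return rows, cols, s
```

Plotting the first two columns of `rows` and `cols` gives the planar display; the inner products of the full coordinate sets reproduce the (centered) data matrix for any `alpha`.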

We specify the options the software should offer to the user in two different environments. In a highly interactive environment the user should be able to specify many graphical options and also to change them using the usual interactive tools (Bond & Michailidis, 1997). The resulting graph needs to be available in several formats, including a high-quality printing format. In a Web-based environment, the user submits a data file together with some options specified either in a file or using a form. The graphic is then sent back to the user in one of several possible formats according to the specifications.

We review some of the software already available (for example, Lipkovich & Smith, 2002) and we present an implementation of the proposed software based on Xlisp-Stat. It can be run under Unix or Windows, and it is also part of a web service that provides biplot graphs through the web.

Preliminary information will eventually be available at http://gauss.upf.es/xls-biplot/ and http://gauss.upf.es/bp-form.html.

Bond, J. & Michailidis, G. (1997). Interactive correspondence analysis in a dynamic object-oriented environment. *Journal of Statistical Software*, **2**, 8.

Lipkovich I. & Smith, E. P. (2002).
Biplot and singular value decomposition macros for Excel. *Journal of
Statistical Software*, **7**, 5.

Multiple
correspondence analysis to explore relationships among genetic polymorphisms

Joan
Valls, Elisabet Guinó & Víctor Moreno

Catalan
Institute of Oncology, Barcelona, Spain

joan.valls@iconcologia.catsalut.net, eguino@ico.scs.es & v.moreno@ico.scs.es

The polymorphisms detected in multiple genes that are related to processes of xenobiotic metabolism or inflammation could partially explain the variability in cancer predisposition. New technology in molecular biology based on DNA microarrays helps to determine polymorphisms in hundreds of genes simultaneously. Classical statistical techniques, useful in the analysis of few variables, lose their utility when the number of variables is close to or greater than the number of observations. In these cases, dimension reduction techniques can be useful. These methods help to identify patterns of variation or groups of variables, and suggest hypotheses that can be tested using other techniques.

In
this paper multiple correspondence and cluster analysis will be used to explore
frequency patterns of categorical variables, such as the different variants
identified in multiple genes of interest in colorectal cancer.

We have determined 150 polymorphisms in 50 genes related to inflammation or metabolism in a group of 323 patients with colorectal cancer and 283 hospital controls. For each polymorphism, the genotype has been identified and coded as a categorical variable with three levels (normal homozygous, heterozygous, variant homozygous). Multiple correspondence analysis has been applied to the whole set of polymorphisms (independently of the patient group) and the categories have been represented in factorial biplots.

Initially,
in order to simplify the exploratory analysis, only 5 polymorphisms in different
genes have been selected (IL6, IL8, PPARG, NFKB, TNF) and the categories of
variant homozygous and heterozygous have been combined assuming a dominant
effect. Subsequently, cluster analysis, applied to the new factors created,
helps us to understand the relationships between the polymorphisms.
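The recoding step described above, combining heterozygous and variant homozygous genotypes under a dominant model before building the indicator matrix used by MCA, can be sketched as follows. The category labels (`"normal"`, `"het"`, `"variant"`, `"carrier"`) and function names are hypothetical, and the small example data are invented purely to show the shapes involved.

```python
import numpy as np

# Hypothetical labels: under a dominant model, carriers of at least one
# variant allele are merged into a single category.
DOMINANT = {"normal": "normal", "het": "carrier", "variant": "carrier"}

def recode_dominant(genotypes):
    """Map three-level genotypes to two levels (normal / carrier)."""
    return [DOMINANT[g] for g in genotypes]

def indicator_matrix(columns):
    """Build a complete disjunctive (indicator) matrix.

    columns : dict mapping a polymorphism name to its list of category labels.
    Returns the 0/1 matrix and the list of "name:category" column labels.
    """
    blocks, labels = [], []
    for name, values in columns.items():
        cats = sorted(set(values))
        blocks.append(np.array([[v == c for c in cats] for v in values], dtype=float))
        labels += [f"{name}:{c}" for c in cats]
    return np.hstack(blocks), labels
```

Each row of the resulting matrix sums to the number of polymorphisms, which is the structure MCA expects.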

Escofier,
B. & Pagès, J. (1988). *Análisis Factoriales Simples y Múltiples:
Objetivos, Métodos e Interpretación*. Bilbao: Servicio editorial de la Universidad
del País Vasco.

Greenacre,
M. J. (1984). *Theory and Applications of Correspondence Analysis*.
London: Academic Press.

Lebart, L., Morineau, A. & Warwick, K. M. (1984). *Multivariate Descriptive
Statistical Analysis: Correspondence Analysis and Related Techniques for Large
Matrices*.
New York: John Wiley & Sons.

Inverse correspondence analysis

Michel van de Velden & Patrick Groenen

Universitat Pompeu Fabra, Barcelona, Spain & Erasmus Universiteit Rotterdam, The Netherlands

michel.vandevelden@econ.upf.es & groenen@few.eur.nl

In correspondence analysis, rows and
columns of a data matrix are depicted as points in low-dimensional space. The
row and column profiles are approximated by minimizing the so-called weighted
chi-squared distance between the original profiles and their approximations. In
this paper, we will study the *inverse
correspondence analysis* problem, that is, the possibilities of retrieving
one or more data matrices from a low-dimensional correspondence analysis
solution. We will show that there exists a nonempty closed and bounded
polyhedron of such matrices. We also present an algorithm to find the vertices
of the polyhedron. A proof that the maximum of the Pearson chi-squared
statistic is attained at one of the vertices is given. In addition, it is discussed
how extra equality constraints on some elements of the data matrix can be
imposed on the inverse correspondence analysis problem. As a special case, we
present a method for imposing integer restrictions on the data matrix as well.
The approach to inverse correspondence analysis followed here is similar to the
one employed by De Leeuw and Groenen (1997) in their inverse multidimensional
scaling problem.
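To see why more than one data matrix can correspond to the same low-dimensional solution, it may help to look at the rank-*k* reconstruction of a correspondence matrix from its first *k* CA dimensions. The sketch below is only that reconstruction: it returns one candidate matrix with the original margins (possibly with negative entries), and it is not the vertex-finding algorithm presented in this abstract. The function name is hypothetical.

```python
import numpy as np

def ca_reconstruct(N, k):
    """Rank-k reconstruction of the correspondence matrix of table N.

    Uses the first k singular triplets of the standardized residuals;
    with k equal to the full rank the original proportions are recovered.
    """
    N = np.asarray(N, dtype=float)
    P = N / N.sum()
    r = P.sum(axis=1)
    c = P.sum(axis=0)
    scale = np.sqrt(np.outer(r, c))
    S = (P - np.outer(r, c)) / scale
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    Sk = (U[:, :k] * sv[:k]) @ Vt[:k]
    return np.outer(r, c) + scale * Sk
```

Any matrix that shares the margins and the first *k* singular triplets yields the same *k*-dimensional display, which is the source of the non-uniqueness studied here.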

De Leeuw, J. & Groenen, P. J. F.
(1997). Inverse multidimensional scaling. *Journal of Classification*, **14**,
3-21.

Measuring values or preferences in a cross-national context: rating or ranking?

Hester van Herk
& Michel van de Velden

Vrije Universiteit Amsterdam, The Netherlands & Universitat Pompeu Fabra, Barcelona, Spain

hherk@feweb.vu.nl & michel.vandevelden@econ.upf.es

To measure values or preferences both
ratings and rankings are often used. With rankings individuals are explicitly
forced to express an ordering of their preferences or values. Ratings on the
other hand, allow the individuals to express their preferences or values more
freely on a particular, usually predefined scale.

No consensus exists about which method should be preferred for studying preferences or values in a cross-national context. Some argue that ranking is the most appropriate (e.g., Kamakura & Mazzon, 1991), whereas others argue that ratings should be preferred (Klein & Arzheimer, 1999), especially in a cross-national context (Ng, 1982). Unfortunately, comparisons of the two measurement methods are typically made at the aggregate level across all individual subjects. Results at this aggregate level indicate that rankings and ratings provide similar results. Moreover, most studies comparing the two measurement types use a between-subject design in which subjects either rated or ranked the items. Consequently, a comparison at the individual level is not possible. An exception is the study by Russell and Gray (1994), who let the same subjects rate and rank the same item set. However, this study was done in one country only.

In this paper we consider, at the individual level, the relationships between ratings and rankings across countries. A sample is used from five countries in the European Union: Germany, the UK, France, Italy and Spain, including more than 4000 respondents. Each of the respondents supplied ratings and rankings for the same group of items. Insight is given into the relative merits of rating and ranking measurement by using three-way correspondence analysis (Carlier & Kroonenberg, 1996). This technique enables us to model the expected ordinal character in both scales, as well as differences and similarities between countries. Results show that the way in which people assess rating and ranking measurement procedures is similar across these countries even if substantive content of items differs.

Carlier, A.
& Kroonenberg, P. M. (1996). Decompositions and biplots in three-way correspondence analysis. *Psychometrika*, **61**, 355-373.

Kamakura, W. A. & Mazzon, J. A. (1991). Value segmentation: a model for the measurement of values and value systems. *Journal of Consumer Research*, **18**, 208-218.

Klein, M. & Arzheimer, K. (1999). Ranking und Rating Verfahren zur Messung von Wertorientierungen, untersucht am Beispiel des Inglehart-Index. Empirische Befunde eines Methodenexperiments. *Kölner Zeitschrift für Soziologie und Sozialpsychologie*, **51**, 550-564.

Ng, S. H. (1982). Choosing between the
ranking and rating procedures for the comparison of values across cultures. *European
Journal of Social Psychology*, **12**, 169-172.

Russell, P.
A. & Gray, C. D. (1994). Ranking or rating? Some data
and their implications for the measurement of evaluative response. *British*
*Journal of Psychology*, **85**, 79-92.

A comparison of correspondence analysis and nonmetric item response models

Wijbrandt van Schuur & Jörg Blasius

University of Groningen, The Netherlands & University of Bonn, Germany

h.van.schuur@ppsw.rug.nl & jblasius@uni-bonn.de

In the world of survey research we can distinguish at least three different schools by the way they go about measuring their concepts. First, there is the school that starts with the concept of a Likert scale, continues with reliability analysis, and ends with factor analysis and its offspring. In this school, measurement is always metric, the interval properties of the variables are taken for granted, and there is little emphasis on systematic differences among the items.

The second school dates back to Thurstone, continues with Guttman, Coombs, Lazarsfeld and Henry, and ends with item response models such as the parametric Rasch model, the nonparametric Mokken model, and their offspring. Here the interval (or ordinal) properties of the variables are derived from the measurement model, and are subject to falsification. Some adherents claim that only the most parsimonious item response theory (IRT) model leads to 'objective' measurement in which measurements are externally valid, in that they are comparable over different groups of subjects in different times and places.

Finally, the third school, that we shall refer to here as correspondence analysis and multiple correspondence analysis, goes back to Hirschfeld. Its major adherents are Benzécri, Gifi, Greenacre, and Nishisato.

In this paper we will make some comparisons between the second and third schools, and relate nonmetric item-response models (Van Schuur, 2003) to multiple correspondence analysis with respect to:

· types of data (dichotomous, rating scales, pick or rank k/n or any/n data)

· interpretation of variables (dependent, independent, indicator, intervening, active, passive)

· representation in one or more dimensions

· representation of variables or of response categories

· measurement of subjects

· sensitivity to frequency distributions

· top-down and bottom-up approaches to finding interpretable structure in the data.

Our findings
will be illustrated with a variety of data sets.

Schuur, W. H. van (2003). Mokken scale
analysis: between the Guttman scale and parametric item response theory. *Political Analysis*, **11**, 139-163.

Multiple correspondence analysis for symbolic data

Rosanna Verde

Seconda
Università di Napoli, Italy

rosanna.verde@unina2.it

In recent years, the development of symbolic data analysis has yielded many methods for the synthesis and the representation of complex information, expressed in terms of symbolic objects (Bock & Diday, 2000).

According to the definition given by Diday (1989), a symbolic object (SO) is a suitable model of a concept. It can be described by a set of multi-valued variables (multi-categorical, intervals, distributions); furthermore, logical rules can be included in the SO’s description in order to reduce the description space of such variables.

In the framework of factorial methods for representing symbolic data, the present work aims to provide an extension of multiple correspondence analysis (MCA) to the study of data described by multi-categorical variables. In fact, according to the classical application of the MCA to multiple binary data tables, following the generalised canonical analysis (GCA) approach, an extension of this approach is suggested when the symbolic objects descriptors are multi-categorical ones. Moreover, a possible generalization of GCA (Verde, 1998) has been proposed when all the several kinds of descriptors are present in the SO’s description.

The proposed procedure is based on a quantification phase of the symbolic descriptors: relational operators are used on the transformed data in order to preserve the information about the relationships among such descriptors. A fuzzy coding of the data is performed, as well as an evaluation of SO quality on the factorial plane.

In order to provide a suitable visualisation of symbolic data on factorial planes, we propose different kinds of representation (convex polygons) and a symbolic interpretation of the factorial axes.

The criterion optimized in the analysis is a kind of squared correlation ratio on the factorial variables.

Finally, an application will be performed using SODAS software, allowing us to validate the proposed approach.

Bock, H. H. & Diday, E. (2000).
Analysis of symbolic data, exploratory methods for extracting statistical
information from complex data. *Studies in Classification, Data Analysis and*
*Knowledge Organisation*. Springer-Verlag.

Diday, E.
(1989). Knowledge representation and symbolic data
analysis. In *Proceedings of Second International Workshop on Data, Expert
Knowledge, and Decision. *Hamburg.

Verde, R. (1998). Generalised canonical
analysis on symbolic objects. In *Classification and* *Data Analysis*
(eds M. Vichi & O. Opitz), 195-202. Heidelberg: Springer Verlag.

Logistic biplot for binary data

José Luis Vicente Villardón^{1} , M. P. Galindo Villardón^{1}, Miguel Yánez-Alvarado^{2} & Antonio Blázquez Zaballos^{1}

^{1}Universidad de Salamanca, Spain & ^{2}Universidad
de Los Lagos, Osorno, Chile

villardon@usal.es

Classical biplot methods allow for the simultaneous representations of individuals and variables in a data matrix of continuous variables. When variables are binary (presence/absence) a classical linear biplot representation is not suitable and multiple correspondence analysis is commonly used.

In this paper we propose a linear biplot representation based on logistic response models, closely related to latent trait models and item response theory. The geometry of the biplot is such that the coordinates of individuals and variables are calculated to have logistic responses along the latent dimensions.

Gabriel (1998) took into account the probability distributions of the identically distributed manifest variables and adjusted the biplot using generalised bilinear regression. However, that procedure was developed for contingency tables, and has some problems when applied to a matrix of individuals by variables, owing to the size of the matrices involved. The biplot method proposed in this paper has been developed for data matrices that contain individuals by variables.

The main characteristic of the proposal is that, although it is based on a non-linear response model, the representation is linear: the directions of the variable vectors on the biplot show the directions of increasing logit values and therefore, the directions in which the probability of having the characteristic increases, with optimum fit.
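The linearity on the logit scale can be sketched as follows: the logit of each probability is an intercept plus the inner product of an individual's coordinates and a variable's direction vector. The parameter names (`A` for individual coordinates, `b0` for intercepts, `B` for variable directions) and function names are hypothetical, and the estimation step by joint maximum likelihood is not shown.

```python
import numpy as np

def logistic_biplot_probs(A, b0, B):
    """Expected probabilities under a logistic biplot model.

    A  : (n x d) coordinates of individuals
    b0 : (p,) intercepts, one per binary variable
    B  : (p x d) direction vectors of the variables
    The logit of the probability is linear in the coordinates: b0 + A B'.
    """
    logits = b0 + np.asarray(A, dtype=float) @ np.asarray(B, dtype=float).T
    return 1.0 / (1.0 + np.exp(-logits))

def joint_log_likelihood(X, A, b0, B):
    """Bernoulli log-likelihood of a binary data matrix X given all parameters."""
    P = logistic_biplot_probs(A, b0, B)
    return np.sum(X * np.log(P) + (1.0 - X) * np.log(1.0 - P))
```

Moving an individual along the direction of a variable's vector increases the logit, and hence the probability, for that variable.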

A modification of the Newton-Raphson method (Murray, 1972) is considered for the estimation of parameters by joint maximum likelihood, leading to a procedure similar to that used by Baker (1992).

The method is illustrated using real data.

Baker, F. B. (1992). *Item Response Theory. Parameter Estimation Techniques*. New York: Marcel
Dekker.

Gabriel, K. R. (1998). Generalised bilinear regression. *Biometrika*, **85**, 689-700.

Murray,
W. (1972). *Numerical
Methods for Unconstrained Optimization*. London: Academic Press.

Simultaneous analysis

Amaya Zárraga & Beatriz Goitisolo

Universidad del País Vasco / Euskal Herriko Unibertsitatea, Bilbao, Spain

az@alcib.bs.ehu.es &
bg@alcib.bs.ehu.es

The classical method used to analyze
several contingency tables consists of performing a separate correspondence analysis on each one and/or a correspondence
analysis of the juxtaposition
of the tables (if the rows of the tables, for example, are the same units).
However, the results of these processes can be affected, as Bécue-Bertaut &
Pagès (2000) point out, by differences between the marginal row profiles of the
different tables and by the relative importance of the tables in the analysis, measured through the contributions of the columns. This, in
turn, is due to the differences between the grand totals of the tables and to
the differences of structure intensity
between the tables.

The aim of this work is to present a new method of factor analysis called simultaneous analysis (Zárraga & Goitisolo, 2002, 2003), which is based on the well-known techniques of correspondence analysis and multiple factor analysis (Escofier & Pagès, 1988). Simultaneous analysis allows the treatment and joint study of several tables of information, solving the problems encountered with classical techniques. Simultaneous analysis is especially suitable for the study of several contingency tables and, by extension, complete disjunctive tables, incomplete disjunctive tables and Burt or pseudo-Burt tables.

The proposed method of analysis allows us:

· to balance the influence of the tables, transforming the values of each one;

· to balance the influence of the tables according to the differences in structure intensity between them; and

· to preserve both the weight and the metric of each table in an overall factor analysis.

Bécue-Bertaut, M. & Pagès, J. (2000). Analyse factorielle multiple intra-tableaux. Application à l’analyse simultanée de plusieurs questions ouvertes. In *JADT 2000: 5es Journées Internationales d’Analyse Statistique des Données Textuelles.*

Escofier, B. & Pagès, J. (1988)
(third edition 1998). *Analyses Factorielles Simples et
Multiples. Objectifs, Méthodes et Interprétation*.
Paris : Dunod.

Zárraga, A. & Goitisolo, B. (2002). Méthode factorielle pour l’analyse simultanée de tableaux de contingence. *Revue de Statistique Appliquée*, **50**, 47-70.

Zárraga, A.
& Goitisolo, B. (2003). Étude de la structure
inter-tableaux à travers l’analyse simultanée.
*Revue de Statistique Appliquée*
(forthcoming).

Constrained
ordination analysis with flexible response curves

Mu Zhu
& Trevor J. Hastie

University of Waterloo, Canada & Stanford University,
U.S.A.

m3zhu@uwaterloo.ca
& hastie@stat.stanford.edu

Canonical correspondence analysis, or CCA (ter Braak, 1986), is a popular multivariate method for constrained ordination analysis. By straightforward manipulations with matrix algebra, it can be shown (e.g., Takane et al., 1991; Zhu, 2001) that CCA is equivalent to Fisher's linear discriminant analysis (LDA), but this equivalence is apparently not widely known among practitioners.

We
provide a more intuitive (and less algebraic) argument to show how this
equivalence can be understood directly in the context of the Gaussian response
model, a model that is widely used in constrained ordination analysis.

The
Gaussian response model, however, only provides a reasonable approximation if
the species have unimodal and
symmetric response functions. Canonical correspondence analysis also implicitly
assumes that the species have the same tolerance level, perhaps the most unreasonable simplification of all. There
is growing empirical evidence (e.g., Johnson & Altman, 1999) that such
assumptions are often violated in practice.
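The Gaussian response model referred to above can be written as a bell-shaped curve along an environmental gradient, with an optimum, a tolerance and a maximum for each species. The sketch below uses that standard form; the parameter and function names are hypothetical.

```python
import numpy as np

def gaussian_response(x, optimum, tolerance, maximum=1.0):
    """Expected abundance of a species at gradient position x.

    The curve is unimodal and symmetric about `optimum`;
    `tolerance` plays the role of a standard deviation.
    """
    x = np.asarray(x, dtype=float)
    return maximum * np.exp(-0.5 * ((x - optimum) / tolerance) ** 2)
```

Using the same `tolerance` for every species corresponds to the equal-tolerance assumption discussed above; asymmetric or multimodal curves cannot be expressed in this form.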

We
show that, by exploiting the equivalence between CCA and LDA, we can model the
response functions much more flexibly in constrained ordination analysis. In
particular, a nonparametric generalization of
Fisher’s LDA (Zhu & Hastie, 2003) can be applied. This allows the species
to have different tolerance levels: for example, they can even have response
functions that are asymmetric and
multimodal.

Johnson,
K. W. & Altman, N. S. (1999). Canonical correspondence analysis as an
approximation to Gaussian ordination. *Environmetrics*, **10**.

Takane,
Y., Yanai, H. & Mayekawa, S. (1991). Relationships among several methods of
linearly constrained correspondence
analysis. *Psychometrika*, **56**, 667-684.

ter
Braak, C. J. F. (1986). Canonical correspondence analysis: a new eigenvector
technique for multivariate direct
gradient analysis. *Ecology*, **67**, 1167-1179.

Zhu, M. (2001). *Feature Extraction and Dimension Reduction with Applications to Classification and the Analysis of Co-occurrence Data*. Ph.D. dissertation, Stanford University.

Zhu, M. & Hastie, T. J. (2003). Feature extraction for nonparametric
discriminant analysis. *Journal of Computational and Graphical Statistics*, **12**, 101-120.