Mixture and latent variable models for causal inference and analysis of socio-economic data

Research project


The research project concerns methodological and empirical developments in several classes of statistical models formulated through mixtures of distributions or through latent variables/factors. As is well known, these approaches are closely related and have considerable potential for application in fields where the characteristic of interest is not directly observable (e.g., quality of life, ability in a certain subject) or where differences in the behaviour of the statistical units depend on a form of heterogeneity that cannot be explained by the observed variables (unobservable heterogeneity).

The objective of the research project is not only to propose advanced tools for data analysis, but also to develop new tools for causal inference and methods for evaluating the efficacy of policies or treatments. In this context, the project will mainly focus on the evaluation of healthcare or similar facilities, and on the performance evaluation of public intervention policies concerning education, health, juvenile crime and the youth labour market. These aims are consistent with the Horizon 2020 objectives for a “better society”.

The research project will be focused on the following classes of models:

1. Generalized Linear Latent Variable Models (GLLVMs) and Item Response Theory (IRT) models: these are used when the response variables are of different types and depend on variables of interest that are not directly observable; IRT models may be seen as particular GLLVMs suited to the analysis of data from the administration of tests measuring a certain ability.

2. Component Analysis (CA) models: these are statistical models where the latent variables are estimated through linear combinations of the observed variables chosen on the basis of an optimality criterion.

3. Mixture of a discrete Uniform and a translated (shifted) Binomial distribution (CUB): the aim is to decompose the psychological process of choice when there is a set of ordered response categories.

4. Cluster Weighted (CW) models: a wide family of mixture models that formulate assumptions on the joint distribution of a response variable and a set of explanatory variables.

5. Latent Markov (LM) models: these are used for the analysis of longitudinal data when the interest lies in the evolution of an unobservable individual characteristic.

6. Mixed effects models: a wide class of models that allow for the analysis of longitudinal or multilevel data.

7. Multilinear models: which are suitable to study a sample of units on the basis of different variables observed across time and are based on a set of latent factors.
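
As an illustration of the mixture structure behind item 3, the following is a minimal sketch of a CUB probability mass function. It assumes the parametrization commonly used in the CUB literature, which the text above does not spell out: for m ordered categories, P(R = r) = pi * C(m-1, r-1) * (1 - xi)^(r-1) * xi^(m-r) + (1 - pi) / m. The function name cub_pmf and its arguments are hypothetical and are not taken from the project's software.

```python
from math import comb


def cub_pmf(m: int, pi: float, xi: float) -> list[float]:
    """Probabilities P(R = r), r = 1..m, of a CUB mixture (illustrative helper).

    Assumed parametrization: pi weights the shifted Binomial component,
    (1 - pi) weights the discrete Uniform component, and xi is the
    Binomial parameter.
    """
    if m < 2:
        raise ValueError("m must be at least 2")
    if not (0.0 <= pi <= 1.0 and 0.0 <= xi <= 1.0):
        raise ValueError("pi and xi must lie in [0, 1]")

    probs = []
    for r in range(1, m + 1):
        # Shifted Binomial on the support 1..m
        shifted_binomial = comb(m - 1, r - 1) * (1 - xi) ** (r - 1) * xi ** (m - r)
        # Discrete Uniform on the same support
        uniform = 1.0 / m
        probs.append(pi * shifted_binomial + (1 - pi) * uniform)
    return probs


if __name__ == "__main__":
    # Example: a 7-point rating scale
    for r, p in enumerate(cub_pmf(m=7, pi=0.8, xi=0.3), start=1):
        print(r, round(p, 4))
```

Under this assumed parametrization, setting pi to 0 gives a purely uniform response distribution, while values of pi close to 1 let the shifted Binomial component dominate.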

As regards the methodological developments, the research project involves the following themes:

1. Inferential developments on GLLVMs and LM models (estimation methods which are efficient from a computational point of view and that allow us to overcome the limits of classical estimation methods).

2. Formulation of extended versions of GLLVMs for ordinal data (extensions to the longitudinal context with emphasis on dynamic factor analysis and latent growth models).

3. Development of powerful tests for the goodness of fit of GLLVMs in the presence of sparseness problems.

4. Improvement of the accuracy of IRT model parameter estimates and development of methods for assembling tests with optimal properties.

5. Extensions of the CA model for estimating latent scores when complex relationships are assumed between exogenous, endogenous, and concomitant variables.

6. Extensions of CUB models (to analyse multilevel data and to account for the “shelter effect”).

7. Development of extended CW models for mixed-type variables and development of related inferential methods.

8. Formulation of LM models for multilevel data allowing for a dynamic cluster effect.

9. Developments in the context of mixed effects models and multilinear methods for longitudinal and multivariate data (analysis of mixed-type data; treatment of cases where the observations are not recorded at the same time occasions).

10. Methodological developments for causal inference (formulation of extended versions of LM models in terms of potential outcomes; identification and estimation of causal effects in the presence of intermediate variables; problems related to multilevel data).

As regards the applications of the aforementioned methodologies, the analysis will be focused on the following themes:

1. Education (evaluation of the relationship between performance and socio-economic status, educational path and family characteristics; study of the effectiveness of schools on learning; effectiveness of University grants). For this purpose, datasets from the OECD-PISA survey and from the National Institute for the Evaluation of the Education and Training System (INVALSI) will be analysed.

2. Labour market and training (evaluation of the effect of the type of degree and of the University attended on labour market participation and on job satisfaction; estimation of human capital and its link with career path and income dynamics; evaluation of the external efficacy of universities based on the longitudinal economic performance of their graduates). The reference datasets are obtained from: (i) merging the data produced by the Job Centres of the Region of Umbria, the administrative data of the University of Perugia, and the AlmaLaurea data; (ii) merging data from the Job Centres of Lombardy, the “Agenzia delle Entrate” (revenue agency), and some Universities; (iii) the Eurostat panel database EU-SILC (European Union Statistics on Income and Living Conditions); (iv) the panel Survey on Household Income and Wealth (SHIW) conducted by the Bank of Italy.

3. Criminal activity (evaluation of juvenile criminal behaviour, taking into account different types of crime and recidivism). The dataset used results from an agreement with the Department of Juvenile Justice of the Ministry of Justice.

4. Health (evaluation of the effectiveness of healthcare providers using healthcare databases and discharge datasets; analysis of routine data recorded at national level regarding healthcare policy and organization; analysis of data drawn from longitudinal studies of the individual and environment-specific determinants of healthy ageing and longevity). These analyses are based on data provided by the Lombardy Region, the Ministry of Health, and the Leiden University Medical Centre.

The research project involves five Research Units that will deal, in particular, with the following themes:

1. Unit of Perugia: methodological developments on LM models and CUB models, and applications to the labour market and juvenile crime.

2. Unit of Bologna: methodological developments on GLLVMs for ordinal and longitudinal data, and applications in education.

3. Unit of Firenze: methodological developments on causal inference, and applications in education.

4. Unit of Milano-Bicocca: methodological developments on LM, CA, and CW models, and applications in the fields of education, the labour market, and the evaluation of the effectiveness of healthcare providers.

5. Unit of Roma: methodological developments on mixed effects and multilinear models, with applications in health.

Although each Research Unit has its own specific focus, it will collaborate closely with the other Units, from both the methodological and the applied points of view. We plan to disseminate the results of the research activity through journal articles and presentations at international conferences. As part of this activity, we plan to make available user-friendly software packages implemented in R and Stata.

Effective start/end date: 1/1/12 → …

Keywords: Markov model; Causal inference; Labour market; Data base; Longitudinal data; Statistical model; Quality of life; Performance evaluation; Education policy; Educational system; Health education; Job satisfaction