Spaces
Before modeling or analyzing data, it is essential to fully understand the nature of the variables being manipulated. The type of variable determines:
the mathematical space in which it exists;
the distance measures that can be used to compare it to others;
and the relevant models to use.
In this section, we present the most common types of variables, as well as the associated spaces.
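To illustrate how the type of a variable constrains the distance measure, here is a minimal Python sketch. It assumes two kinds of variables, numerical and categorical. The function names and example values are hypothetical, and other distances could be chosen just as well.

```python
import math


def euclidean_distance(x, y):
    """Distance between two numerical observations of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mismatch_distance(x, y):
    """Distance between two categorical values: 0 if equal, 1 otherwise."""
    return 0 if x == y else 1


# Hypothetical values, for illustration only.
d_num = euclidean_distance([0.0, 0.0], [3.0, 4.0])
d_cat = mismatch_distance("blue", "green")
```

Subtracting two categories has no meaning, which is why the second function only tests for equality.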
Statistical unit
A statistical unit is the basic element on which an observation is made. Informally, it is the “carrier” of the information, and it determines the level of aggregation of the analysis. The statistical unit is a choice made by the modeler.
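As a sketch of how the choice of statistical unit changes the dataset, consider the following Python example. The records, the names, and the decision to sum amounts per customer are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical raw records: one row per purchase (customer, amount).
purchases = [
    ("alice", 20.0),
    ("bob", 5.0),
    ("alice", 30.0),
    ("bob", 15.0),
    ("carol", 10.0),
]

# Choice 1: the statistical unit is the purchase.
units_purchase = [amount for _, amount in purchases]

# Choice 2: the statistical unit is the customer.
# Amounts are summed per customer (one possible aggregation among others).
totals = defaultdict(float)
for customer, amount in purchases:
    totals[customer] += amount
units_customer = dict(totals)

print(len(units_purchase), "observations at the purchase level")
print(len(units_customer), "observations at the customer level")
```

The same raw records yield five observations in one case and three in the other, so the unit has to be fixed before any analysis.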
Types of variables
There are generally four types of variables, which are identified at the level of the smallest statistical unit in the dataset.
Although these types of variables are the most common, many others exist. For example, we may be interested in comparing curves, texts, images, networks, etc. In these situations, the choice of representation depends on the level at which we wish to analyze the data, and therefore on the statistical unit.
Associated spaces
Once our data has been collected, the first step in statistical analysis is to choose a mathematical space in which to work. This space, sometimes called the observation space and denoted by \(\mathcal{X}\), depends on the type of data observed. It constitutes the formal framework in which our variables take their values, and it guides the methodological choices that follow.
When the data is more complex, more suitable spaces must be chosen. For curve or signal analysis, we can work in a function space. For example, we can consider the space of continuous functions on a closed interval \([a, b]\), denoted \(\mathcal{X} = \mathcal{C}([a, b])\). For text analysis, where a text is viewed as a sequence of characters, each character takes its value in an alphabet. For example, we can consider \(\mathcal{X} = \{ \text{A}, \text{B}, \dots, \text{Z} \}\).
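The two examples above can be sketched in Python. This is only an illustration under stated assumptions: a curve is represented by a Python function, the supremum distance is approximated on a regular grid, and the alphabet is taken to be the 26 uppercase ASCII letters. The function names are hypothetical.

```python
import math
import string


def sup_distance(f, g, a, b, n_points=1001):
    """Approximate sup |f(t) - g(t)| over [a, b] on a regular grid."""
    step = (b - a) / (n_points - 1)
    grid = [a + i * step for i in range(n_points)]
    return max(abs(f(t) - g(t)) for t in grid)


# Two elements of C([0, pi]) and an approximate distance between them.
d_curves = sup_distance(math.sin, math.cos, 0.0, math.pi)

# Alphabet {A, B, ..., Z}: each character of a text must belong to it.
ALPHABET = set(string.ascii_uppercase)


def is_over_alphabet(text):
    """Check that every character of `text` lies in the alphabet."""
    return all(ch in ALPHABET for ch in text)
```

The grid only approximates the supremum. A finer grid, or another norm on the function space, may be more appropriate depending on the application.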
Often, several variables are observed at the same time, e.g., the height, weight, and gender of an individual. In this case, the observation space is the Cartesian product (also called the product set) of the spaces associated with each variable: \[\mathcal{X} = \mathcal{X}_1 \times \mathcal{X}_2 \times \dots \times \mathcal{X}_p,\] where \(p\) is the number of variables. When all \(p\) variables are numerical, we simply write \(\mathcal{X} = \mathbb{R}^p\).
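As a minimal sketch of a product space, the height, weight, and gender example can be encoded in Python as a tuple with one field per variable. The field names, the units, and the set of gender labels are assumptions made for illustration, not part of the original text.

```python
import math
from typing import NamedTuple


class Individual(NamedTuple):
    """One observation in X = X_1 x X_2 x X_3."""
    height_cm: float  # X_1: real numbers (unit is an assumption)
    weight_kg: float  # X_2: real numbers (unit is an assumption)
    gender: str       # X_3: finite set of labels


# Hypothetical label set for the third component.
GENDER_LEVELS = {"F", "M"}


def in_observation_space(obs):
    """Check that each component lies in its own space."""
    return (
        math.isfinite(obs.height_cm)
        and math.isfinite(obs.weight_kg)
        and obs.gender in GENDER_LEVELS
    )


obs_ok = Individual(175.0, 70.0, "F")
obs_bad = Individual(175.0, 70.0, "unknown")
```

Membership in the product space is checked component by component, which mirrors the definition of the Cartesian product.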
