In today's digital age, where almost everything is quantified and databases keep growing in size and number, it is tempting to think that any phenomenon can be understood from data. But the sheer mass of data raises a question: how can we extract the information and patterns a dataset contains?
The case study discussed below is based on a dataset of quantitative variables covering 141 countries worldwide. We'll use PCA (Principal Component Analysis) to analyze this data, in order to extract as much information as possible while reducing its dimensionality.
The aim here is to briefly present the methodology of PCA and then to apply it to the dataset.
The database analyzed here provides indicators for 141 countries around the world. It dates back a few years, so the groups described in the results section may well have changed since. In addition to the name and continent of each country, it contains 18 quantitative variables, which fall into three main categories: demographic variables (infant mortality rate, percentage of the population aged 0-14 and 65+, annual population growth); economic variables (GDP per capita, growth in GDP per capita, gross fixed capital formation, share of agriculture, exports, industry and services in GDP, share of foreign direct investment in GDP); and indicators of the standard of living and living conditions (percentage of the population with access to drinking water, number of mobile phone users, laptop users and Internet users per 1,000 inhabitants).
Principal Component Analysis (PCA) is one of the factorial analysis methods used to describe a table of several quantitative variables. It summarizes the information provided by this set of variables in a smaller set of new variables called factors, which are linear combinations of the initial variables. Given the type of table we're dealing with here, PCA is one of the most appropriate ways of describing the data at our disposal, both effectively and efficiently.
The dataset (a standardized table Z) can be viewed geometrically in two different ways: as the set of individuals photographed in the space spanned by the variables (cloud of individuals), or as the set of variables photographed in the space spanned by the individuals (cloud of variables). In our case, we have 18 variables measured on 141 countries, so each observation (row) is a vector in an 18-dimensional space (in the cloud of individuals) and each variable is a vector in a 141-dimensional space (in the cloud of variables). PCA seeks to answer two questions: which individuals are similar to each other, and which variables are related to each other? To answer them, we need to find one or more subspaces in which the projections (photos) of the data are clear enough to assess similarities and correlations.
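The two viewpoints can be illustrated with a minimal sketch, assuming NumPy and a small made-up table of 5 individuals and 3 variables (the values are hypothetical and are not taken from the country dataset). It standardizes the table and reads it once by rows and once by columns.

```python
import numpy as np

# Hypothetical raw table: 5 individuals (rows) x 3 variables (columns).
X = np.array([
    [1.0, 10.0, 200.0],
    [2.0, 14.0, 180.0],
    [3.0, 19.0, 150.0],
    [4.0, 22.0, 170.0],
    [5.0, 30.0, 100.0],
])

# Standardize each column: subtract its mean, divide by its standard deviation.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

n, p = Z.shape
individuals = [Z[i, :] for i in range(n)]  # n vectors, each living in a p-dimensional space
variables = [Z[:, j] for j in range(p)]    # p vectors, each living in an n-dimensional space

print(len(individuals), individuals[0].shape)
print(len(variables), variables[0].shape)
```

With the real table, the same reading would give 141 vectors of length 18 and 18 vectors of length 141.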
The search for these subspaces is based on two principles: the principle of maximum elongation and the principle of maximum spread. The first consists in finding projections that give good photos of the cloud. The second, that of maximum spread, is the search for the axis, plane or subspace (of dimension greater than or equal to 3) on which the elements are best spread out. If we're interested in the cloud of variables, the quality of a photo is measured by its inertia. Finding the axis along which the cloud spreads best amounts to solving the program:
$$\max_{V} \; V' Z Z' V \quad \text{subject to} \quad V'V = 1$$
where V is the unit direction vector of the axis and ZZ' is the inertia matrix of the variable cloud. It can be shown that V is the eigenvector of ZZ' associated with its largest eigenvalue. More generally, the p-dimensional subspace along which the cloud spreads the most is the one generated by the p eigenvectors associated with the p largest eigenvalues. The reasoning is similar for the individual cloud, where the associated inertia matrix is Z'Z.
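The eigenvector result can be checked numerically. The sketch below assumes NumPy and randomly generated data (hypothetical, not the country dataset), and scales the inertia matrices by 1/n, a common convention that the text does not state. It verifies that no random unit direction beats the leading eigenvector, and that ZZ' and Z'Z share their non-zero eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))          # hypothetical data: 30 individuals, 4 variables
X[:, 1] += 2.0 * X[:, 0]              # make two variables correlated
Z = (X - X.mean(axis=0)) / X.std(axis=0)
n_ind, n_var = Z.shape

# Individual cloud: Z'Z / n is the correlation matrix of the variables.
R = Z.T @ Z / n_ind
eigvals, eigvecs = np.linalg.eigh(R)            # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

v1 = eigvecs[:, 0]                              # unit direction vector of the first axis
inertia_v1 = v1 @ R @ v1                        # inertia of the cloud projected on v1

# Compare with 1000 random unit directions: none should carry more inertia.
trials = rng.normal(size=(1000, n_var))
trials /= np.linalg.norm(trials, axis=1, keepdims=True)
best_trial = max(u @ R @ u for u in trials)

# Variable cloud: ZZ' / n is much larger but has the same non-zero eigenvalues.
dual_vals = np.sort(np.linalg.eigvalsh(Z @ Z.T / n_ind))[::-1][:n_var]

print(np.round(eigvals, 4))
print(np.round(dual_vals, 4))
print(round(float(inertia_v1), 4), round(float(best_trial), 4))
```

Because the data are standardized, the eigenvalues sum to the number of variables, which is why shares of inertia are obtained by dividing each eigenvalue by that number.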
Since factor graphs are derived from projections, what we observe on them can be distorted. To guard against this, two main tools are used for PCA interpretation: the cosine squared (CO2) and the contribution (CTR). CO2 measures the quality of an element's representation. It is the squared cosine of the angle formed by the element and the factorial axis. The higher it is, the better the element is represented on the factorial axis. The CTR of an element in the formation of a factorial axis is the share of information contributed by this element to the formation of the axis. It can be used to identify the influential points in the formation of the axis.
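Both indicators are simple ratios once the coordinates on the axes are known. The following is a minimal sketch for the individuals, assuming NumPy, hypothetical random data and uniform weights 1/n for the individuals (an assumption; the text does not specify the weights).

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))                     # hypothetical data: 20 individuals, 5 variables
Z = (X - X.mean(axis=0)) / X.std(axis=0)
n, p = Z.shape

eigvals, eigvecs = np.linalg.eigh(Z.T @ Z / n)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

F = Z @ eigvecs                                  # coordinates of the individuals on every axis

# CO2: squared coordinate on the axis divided by the squared distance to the origin.
cos2 = F**2 / (Z**2).sum(axis=1, keepdims=True)

# CTR: share of the axis inertia (its eigenvalue) due to each individual.
ctr = (F**2 / n) / eigvals

# Three individuals contributing most to axis 1, with their quality of representation.
top = np.argsort(ctr[:, 0])[::-1][:3]
for i in top:
    print(i, round(float(ctr[i, 0]), 3), round(float(cos2[i, 0]), 3))
```

For each individual the CO2 values sum to 1 across all axes, and for each axis the CTR values sum to 1 across all individuals, which is what makes them readable as shares.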
The following procedure is intended as a guide only, and may be modified to suit the requirements of the subject or the objectives pursued. To carry out a PCA, one should proceed as follows:
1- Check that the table is suitable for PCA.
2- Recall the objectives of the PCA.
3- Decide on the number of axes to be retained, according to one of the following criteria:
- Inertia rate criterion: keep the first axes whose cumulative share of total inertia reaches around 60 to 80%.
- Kaiser criterion: keep the axes whose eigenvalue is greater than or equal to 1.
- Elbow criterion: keep the axes located before the sharpest drop in inertia from one axis to the next.
4- Examine the correlations between variables.
5- Examine the correlations between variables and factors.
6- Observe the positions of the individuals (here, countries).
7- Interpret the axes.
8- Conclude.
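The three criteria of step 3 can be sketched as small functions. The function names and the eigenvalue list below are hypothetical, and the elbow rule is deliberately simplified to "largest drop between consecutive eigenvalues"; in practice the elbow is often judged visually on a scree plot.

```python
def inertia_rate(eigenvalues, target=0.60):
    """Smallest number of axes whose cumulative share of inertia reaches `target`."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for k, value in enumerate(eigenvalues, start=1):
        cumulative += value
        if cumulative / total >= target:
            return k
    return len(eigenvalues)


def kaiser(eigenvalues):
    """Number of eigenvalues greater than or equal to 1 (standardized PCA)."""
    return sum(1 for value in eigenvalues if value >= 1.0)


def elbow(eigenvalues):
    """Number of axes located before the largest drop between consecutive eigenvalues."""
    drops = [eigenvalues[i] - eigenvalues[i + 1] for i in range(len(eigenvalues) - 1)]
    return drops.index(max(drops)) + 1


# Hypothetical eigenvalues of a standardized PCA on 6 variables, in decreasing order.
spectrum = [2.9, 1.6, 0.7, 0.4, 0.3, 0.1]
print(inertia_rate(spectrum), kaiser(spectrum), elbow(spectrum))  # prints: 2 2 1
```

The criteria need not agree, as this toy spectrum shows, so the choice is usually guided by how interpretable the retained axes are.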
Demographically, the average proportion of people under 15 years of age is 30.95%, and the average proportion of people aged 65 and over is 7.36%. The average annual population growth rate for the countries in the database is 1.319% (see Table I below).
Table I: Summary statistics on the data
Variable Label | Mean | Minimum | Maximum |
---|---|---|---|
GDP per capita (constant 2000 international $) | 8,840.900 | 582.960 | 51,874.200 |
Share of agriculture in GDP | 15.437 | 0.110 | 55.150 |
Share of population with access to drinking water | 83.479 | 22.000 | 100.000 |
Number of individuals with a mobile phone (per 1000 inhabitants) | 363.111 | 3.240 | 1,399.810 |
Number of individuals with a laptop computer (per 1000 inhabitants) | 140.482 | 0.670 | 765.630 |
Number of internet users (per 1000 inhabitants) | 158.747 | 0.790 | 771.750 |
Share of population aged 0–14 years | 30.952 | 14.060 | 50.430 |
Share of population aged 65 and over | 7.362 | 1.060 | 19.660 |
Annual population growth rate (%) | 1.319 | -1.030 | 6.290 |
In terms of living conditions and standard of living, on average 140 out of every 1,000 people in a country use a laptop computer, around 363 have a cell phone and over 158 have access to the Internet. On average, 83.479% of a country's population has access to drinking water. However, the figures vary widely from one continent to the next. In Oceania, the entire population has access to drinking water, followed by Europe, North America and South America, with 96.35%, 92.82% and 90.81% respectively. Sub-Saharan Africa comes last with 66.26%, behind Asia (83.00%) and North Africa (89.25%).
On the economic front, GDP per capita shows major disparities between countries, as evidenced by its standard deviation ($9,740.79), which exceeds the mean ($8,840.90). The lowest GDP per capita is observed in Malawi ($582.96) and the highest in Luxembourg ($51,874.20). On average, agriculture accounts for 15.437% of GDP in the countries of the dataset.
We now perform a standardized PCA on the data, using the SPAD v5.5 software. Note that in this PCA, Guinea Bissau is included as an illustrative (supplementary) individual: a previous analysis showed it to be an atypical point.
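Treating an atypical point as illustrative means that it is projected onto the axes without taking part in their construction. The sketch below shows the idea with NumPy on hypothetical random data; it is not a reproduction of what SPAD does internally, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
active = rng.normal(size=(25, 4))                # hypothetical active individuals
extra = np.array([6.0, -5.0, 7.0, 4.0])          # hypothetical atypical individual

# Mean, standard deviation and axes are computed from the active individuals only.
mean, std = active.mean(axis=0), active.std(axis=0)
Z = (active - mean) / std
eigvals, eigvecs = np.linalg.eigh(Z.T @ Z / len(Z))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# The illustrative individual is standardized with the active statistics
# and projected afterwards, so it has no influence on the axes.
extra_coords = ((extra - mean) / std) @ eigvecs

# For comparison: treating it as active changes the eigenvalues.
both = np.vstack([active, extra])
Zb = (both - both.mean(axis=0)) / both.std(axis=0)
eigvals_with = np.sort(np.linalg.eigvalsh(Zb.T @ Zb / len(Zb)))[::-1]

print(np.round(extra_coords, 2))
print(np.round(eigvals, 3))
print(np.round(eigvals_with, 3))
```

Comparing the two sets of eigenvalues gives a feel for how much a single extreme point can pull the axes toward itself when it is left active.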
TABLE II: Eigenvalues
Number | Eigenvalue | Percentage | Cumulative Percentage |
---|---|---|---|
1 | 8.7704 | 48.72% | 48.72% |
2 | 2.1247 | 11.80% | 60.53% |
3 | 1.5525 | 8.63% | 69.15% |
4 | 1.1716 | 6.51% | 75.66% |
5 | 0.9349 | 5.19% | 80.86% |
6 | 0.7799 | 4.33% | 85.19% |
7 | 0.5582 | 3.10% | 88.29% |
8 | 0.4882 | 2.71% | 91.00% |
9 | 0.4085 | 2.27% | 93.27% |
10 | 0.3139 | 1.74% | 95.02% |
11 | 0.2605 | 1.45% | 96.46% |
12 | 0.2043 | 1.13% | 97.60% |
13 | 0.1708 | 0.95% | 98.55% |
14 | 0.0980 | 0.54% | 99.09% |
To choose the number of axes for this analysis, we use the inertia rate criterion. The first axis explains 48.72% of total inertia and the second adds 11.80%, so the first two axes together account for 60.53%. In other words, 60.53% of the information contained in the data for the 141 countries (excluding Guinea Bissau) is captured by the first two axes alone. This share falls within the range recommended by the inertia rate criterion, so we retain the first two axes for our analysis.
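The percentages in Table II can be recomputed from the eigenvalues. The sketch below assumes that the total inertia equals 18, the number of active variables in a standardized PCA, which is consistent with the published figures; only the first six eigenvalues are copied.

```python
# Eigenvalues copied from Table II (first six axes).
eigenvalues = [8.7704, 2.1247, 1.5525, 1.1716, 0.9349, 0.7799]
n_variables = 18  # assumed total inertia of the standardized PCA

# Share of inertia of each axis, in percent, and running total.
shares = [100 * value / n_variables for value in eigenvalues]
cumulative = []
running = 0.0
for share in shares:
    running += share
    cumulative.append(running)

for k, (share, cum) in enumerate(zip(shares, cumulative), start=1):
    print(f"axis {k}: {share:.2f}% (cumulative {cum:.2f}%)")

# Number of axes suggested by two of the criteria.
axes_inertia = next(k for k, cum in enumerate(cumulative, start=1) if cum >= 60)
axes_kaiser = sum(1 for value in eigenvalues if value >= 1)
print(axes_inertia, axes_kaiser)  # prints: 2 4
```

On these figures the Kaiser criterion would keep four axes (75.66% of inertia), whereas the inertia rate criterion with a 60% floor keeps two, which is the choice made in this analysis.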
Looking at the correlation matrix, we can identify which variables are negatively correlated and which are positively correlated. For the purposes of this work, we consider only correlations whose absolute value is greater than or equal to 0.70. A summary of these correlations is presented in Appendix 2.
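Filtering a correlation matrix by this threshold can be sketched as follows, assuming NumPy and four hypothetical variables named "a" to "d" (synthetic data, not the country indicators).

```python
import numpy as np

rng = np.random.default_rng(3)
base = rng.normal(size=200)
# Hypothetical variables: "a" and "b" move together, "c" moves against them, "d" is unrelated.
data = {
    "a": base + 0.3 * rng.normal(size=200),
    "b": base + 0.3 * rng.normal(size=200),
    "c": -base + 0.3 * rng.normal(size=200),
    "d": rng.normal(size=200),
}
names = list(data)
corr = np.corrcoef(np.array([data[name] for name in names]))  # one row per variable

threshold = 0.70
# Keep each pair once (upper triangle) when |correlation| reaches the threshold.
strong_pairs = [
    (names[i], names[j], round(float(corr[i, j]), 2))
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) >= threshold
]
for first, second, r in strong_pairs:
    print(first, second, r)
```

The sign of each retained coefficient then tells whether the pair belongs to the same group of variables or to opposite groups.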
Two main groups stand out in terms of correlations. The first group is made up of variables that are strongly positively correlated with each other (GDP per capita, agricultural value added per worker, number of people with a cell phone per 1,000 inhabitants, number of people with a laptop per 1,000 inhabitants, proportion of the population aged 65 and over, number of Internet users per 1,000 inhabitants). The second group is also made up of highly correlated variables (share of agriculture in GDP, share of rural population, share of population aged 0-14). Each of the variables in a given group is negatively correlated with those in the opposite group.
➡️ Group 1 (Positive correlation within Group 1)
- GDP per capita
- Agricultural value added per worker
- Number of people with a mobile phone (per 1000 inhabitants)
- Number of people with a laptop computer (per 1000 inhabitants)
- Share of population aged 65 and over
- Number of internet users (per 1000 inhabitants)
➡️ Group 2 (Positive correlation within Group 2)
- Share of agriculture in GDP
- Share of rural population
- Share of population aged 0–14 years
We consider here the variables whose correlation with the factors is greater than or equal to 0.69 in absolute value. The variables most characteristic of axes 1 and 2 are summarized in Tables III and IV. The variables GDP per capita, agricultural value added per worker, access to drinking water, number of people with a cell phone per 1,000 inhabitants, number of people with a laptop per 1,000 inhabitants, number of internet users per 1,000 inhabitants, share of population aged 65 and over, and share of services in GDP are positively correlated with each other and are well represented on axis 1, mainly on the negative side of the axis.
Similarly, the variables infant mortality rate, share of agriculture in GDP, share of population aged 0-14, and percentage of rural population are positively correlated with each other and well represented on axis 1, mainly on the positive side of the axis. They are negatively correlated with the variables on the negative side of the axis.
Axis 1, therefore, contrasts countries with high values for infant mortality rate, share of agriculture in GDP, share of population aged 0-14, and percentage of rural population (i.e. underdeveloped countries) with those with high values for GDP per capita, agricultural value added per worker, access to drinking water, number of people with a cell phone per 1,000 inhabitants, number of people with a laptop per 1,000 inhabitants, number of internet users per 1,000 inhabitants, share of population aged 65 and over, and share of services in GDP (i.e. developed countries).
Axis 1 would therefore seem to characterize the level of development.
TABLE III: Correlation Between Variables and Axis 1
Correlation | Variables |
---|---|
Negatively Correlated | GDP per capita, agricultural value added per worker, access to drinking water, number of people with a mobile phone per 1000 inhabitants, number of people with a laptop per 1000 inhabitants, number of internet users per 1000 inhabitants, population aged 65 and over, share of services in GDP |
Positively Correlated | Infant mortality rate, share of agriculture in GDP, share of population aged 0–14 years, share of rural population |
The variables annual per capita GDP growth and industry's share of GDP illustrate Axis 2, and are strongly positively correlated with it. Axis 2 therefore contrasts countries with high values for the variables: annual per capita GDP growth and industry's share of GDP (i.e. emerging countries) with those with low values for these variables. This suggests that Axis 2 reflects the importance of industry in the GDP of the economies of the countries in the base.
TABLE IV: Correlation between variables and axis 2
Correlation | Variables |
---|---|
Negatively Correlated | (None) |
Positively Correlated | Annual GDP per capita growth rate, share of industry in GDP |
The first factorial plane illustrates the development cycle of the various economies in the dataset. Axes 1 and 2 divide the plane into three main parts (see figure 1). Part 1 of the diagram below characterizes underdeveloped countries with a high share of the primary sector (mainly agriculture) in GDP, and a low rate of industrialization with low growth in GDP per capita.
The second part (part 2), groups together emerging countries with high GDP growth, a strong preponderance of industry within their economy (high share of industry in GDP) and high values in terms of gross fixed capital formation.
The last part (part 3) is marked by developed countries with a high tertiary sector share of GDP, which greatly reduces the share of industry, even though they remain highly industrialized. They are also characterized by high GDP values and relatively improved living conditions for their populations.
Figure 2: Countries graph
The countries graph allows us to identify certain countries that clearly characterize the different parts of the factorial plane. Part 1 is dominated by sub-Saharan African and Latin American countries. The second part includes countries from North America, North Africa and Asia. Part 3 is typical of European and North American countries.
Part 1 features Benin, an underdeveloped country with little or no industrial contribution to GDP (an economy mainly dependent on the primary sector) and a high proportion of rural population. It is characterized by an age pyramid with a broad base (0-14 years), the result of strong population growth, and by a high infant mortality rate. In the same group, we find Bangladesh, an underdeveloped country with a relatively high share of industry in its GDP.
In the second group, we find China, a highly industrialized emerging country where industry accounts for the major part of GDP.
In the final group, we find Belgium, a developed country with a highly tertiarized economy and one of the highest per capita GDPs in the world. The entire population has access to drinking water, and the majority have access to ICT and enjoy a good level of well-being. These living conditions are conducive to longevity, which explains the high proportion of the population aged 65 and over.
The standardized PCA performed on our dataset of 18 quantitative variables measured on 141 countries worldwide enabled us to identify the macro-characteristics of the countries in the database. Our analysis reveals three main groups of countries: underdeveloped countries, emerging countries and developed countries. Overall, the analysis describes the economic cycle: the process by which underdeveloped countries can become emerging and then developed, drastically improving the living conditions of their populations.
Variable Label | Mean | Std. Dev. | Minimum | Maximum |
---|---|---|---|---|
GDP per capita (constant 2000 international $) | 8840.900 | 9740.790 | 582.960 | 51874.200 |
Annual GDP per capita growth rate (%) | 3.567 | 3.188 | -7.440 | 14.860 |
GFCF (% of GDP) | 21.655 | 6.172 | 9.660 | 47.350 |
Agricultural value added per worker (constant 2000 US$) | 7721.660 | 13444.100 | 59.390 | 63451.900 |
Share of agriculture in GDP | 15.437 | 12.981 | 0.110 | 55.150 |
Share of exports in GDP | 42.157 | 27.285 | 8.610 | 228.970 |
Share of FDI in GDP | 5.841 | 23.823 | -0.490 | 282.400 |
Share of population with access to drinking water | 83.479 | 17.396 | 22.000 | 100.000 |
Share of industry in GDP | 30.526 | 11.170 | 11.970 | 69.180 |
Number of people with a mobile phone (per 1000 inhabitants) | 363.111 | 333.011 | 3.240 | 1399.810 |
Infant mortality rate | 40.335 | 37.071 | 2.000 | 154.000 |
Number of people with a laptop (per 1000 inhabitants) | 140.482 | 199.780 | 0.670 | 765.630 |
Share of population aged 0–14 years | 30.952 | 10.782 | 14.060 | 50.430 |
Share of population aged 65 and over | 7.362 | 5.105 | 1.060 | 19.660 |
Share of rural population | 45.745 | 22.245 | 0.000 | 90.280 |
Share of services in GDP | 54.048 | 13.133 | 22.760 | 82.660 |
Annual population growth rate | 1.319 | 1.140 | -1.030 | 6.290 |
Number of internet users (per 1000 inhabitants) | 158.747 | 188.472 | 0.790 | 771.750 |
Only correlations with absolute value ≥ 0.7
Variable | Positively Correlated | Negatively Correlated |
---|---|---|
GDP per capita | Agricultural value added per worker (0.86), Mobile phones per 1000 inhabitants (0.87), Laptops per 1000 inhabitants (0.90), Internet users per 1000 inhabitants (0.88), Population aged 65+ (0.72) | Population aged 0–14 (-0.70) |
Agricultural value added per worker | Laptops per 1000 inhabitants (0.83), Internet users per 1000 inhabitants (0.83) | Mobile phones per 1000 inhabitants (-0.70), Infant mortality (-0.78), Population aged 0–14 (-0.80) |
Share of agriculture in GDP | Rural population (0.70) | |
Access to drinking water | Mobile phones per 1000 inhabitants (0.72) | Infant mortality (-0.71), Population aged 0–14 (-0.80) |
Mobile phones per 1000 inhabitants | Laptops per 1000 inhabitants (0.76), Population aged 65+ (0.77), Internet users per 1000 inhabitants (0.84) | Infant mortality (-0.71), Population aged 0–14 (-0.80) |
Population aged 0–14 | | Population aged 65+ (-0.88), Internet users per 1000 inhabitants (-0.71) |
Population aged 65+ | Internet users per 1000 inhabitants (0.72) | Population growth rate (-0.70) |
Variable Label | Axis 1 | Axis 2 |
---|---|---|
GDP per capita (constant 2000 international $) | -0.90 | -0.22 |
Annual GDP per capita growth rate (%) | -0.04 | 0.78 |
GFCF (% of GDP) | -0.06 | 0.55 |
Agricultural value added per worker (constant 2000 US$) | -0.77 | -0.31 |
Share of agriculture in GDP | 0.79 | -0.25 |
Share of exports in GDP | -0.36 | 0.29 |
Share of FDI in GDP | -0.23 | -0.04 |
Share of population with access to drinking water | -0.77 | 0.04 |
Share of industry in GDP | -0.07 | 0.69 |
Mobile phones per 1000 inhabitants | -0.92 | -0.05 |
Infant mortality rate | 0.82 | -0.14 |
Laptops per 1000 inhabitants | -0.84 | -0.24 |
Share of population aged 0–14 | 0.89 | -0.24 |
Share of population aged 65 and over | -0.83 | 0.07 |
Share of rural population | 0.75 | 0.06 |
Share of services in GDP | -0.73 | -0.34 |
Annual population growth rate | 0.60 | -0.37 |
Internet users per 1000 inhabitants | -0.89 | -0.18 |
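The selection rule used in the analysis (absolute correlation with the axis of at least 0.69) can be applied mechanically to the table above. The sketch below copies the 18 pairs of coefficients into a dictionary; the function name `characteristic` is hypothetical.

```python
# (axis 1, axis 2) correlations copied from the table above.
loadings = {
    "GDP per capita": (-0.90, -0.22),
    "Annual GDP per capita growth rate": (-0.04, 0.78),
    "GFCF": (-0.06, 0.55),
    "Agricultural value added per worker": (-0.77, -0.31),
    "Share of agriculture in GDP": (0.79, -0.25),
    "Share of exports in GDP": (-0.36, 0.29),
    "Share of FDI in GDP": (-0.23, -0.04),
    "Access to drinking water": (-0.77, 0.04),
    "Share of industry in GDP": (-0.07, 0.69),
    "Mobile phones per 1000 inhabitants": (-0.92, -0.05),
    "Infant mortality rate": (0.82, -0.14),
    "Laptops per 1000 inhabitants": (-0.84, -0.24),
    "Share of population aged 0-14": (0.89, -0.24),
    "Share of population aged 65 and over": (-0.83, 0.07),
    "Share of rural population": (0.75, 0.06),
    "Share of services in GDP": (-0.73, -0.34),
    "Annual population growth rate": (0.60, -0.37),
    "Internet users per 1000 inhabitants": (-0.89, -0.18),
}
threshold = 0.69


def characteristic(axis):
    """Split the variables with |correlation| >= threshold on `axis` by sign."""
    negative = [name for name, r in loadings.items() if r[axis] <= -threshold]
    positive = [name for name, r in loadings.items() if r[axis] >= threshold]
    return negative, positive


axis1_negative, axis1_positive = characteristic(0)
axis2_negative, axis2_positive = characteristic(1)
print(len(axis1_negative), len(axis1_positive))  # prints: 8 4
print(axis2_negative, axis2_positive)
```

The resulting lists match Tables III and IV: eight variables on the negative side and four on the positive side of axis 1, and only two variables, both positive, on axis 2.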