
This repository presents an exploratory analysis using Principal Component Analysis (PCA) on a dataset of 18 quantitative indicators for 141 countries. The project aims to reduce dimensionality while extracting meaningful patterns to characterize countries based on development level, economic structure, and living conditions.


ameudes/Uncovering-Development-Patterns-in-141-Countries-Using-PCA


Introduction

In today's digital age, where everything is quantified and databases keep growing in size and number, it can seem easy to understand any phenomenon. But the sheer mass of data raises a question: how can we extract the information and patterns a dataset contains?

The case study discussed below is based on a dataset of quantitative variables covering 141 countries worldwide. We use PCA (Principal Component Analysis), a method suited to this type of data, to extract as much information as possible while reducing the dimensionality of the data.

The aim here is to briefly present the methodology and then apply it to the dataset.

The data

The dataset analyzed here provides indicators for 141 countries around the world. It is not up to date and dates back a few years, so the groups identified in the results section may well have changed since then. In addition to the name and continent of each country, it contains 18 quantitative variables. These variables fall into three main categories: demographic variables (infant mortality rate, percentage of population aged 0-14 and 65+, annual population growth); economic variables (GDP per capita, growth in GDP per capita, gross fixed capital formation, share of agriculture, exports, industry and services in GDP, share of foreign direct investment in GDP); and indicators of the standard of living and living conditions (percentage of the population with access to drinking water, number of mobile phone and laptop users, number of people with access to the Internet per 1,000 inhabitants).

Quick intro to PCA

Justification of PCA

Principal Component Analysis (PCA) is one of the factorial analysis methods used to describe a table of several quantitative variables. It summarizes the information carried by these variables in a smaller set of new variables called factors, which are linear combinations of the initial variables. Given the type of table we're dealing with here, PCA is one of the most appropriate ways of describing the data at our disposal, both effectively and efficiently.

What are we looking for in PCA?

The dataset (a standardized table Z) can be viewed geometrically in two different ways: as the set of individuals photographed in the space spanned by the variables (cloud of individuals), or as the set of variables photographed in the space spanned by the individuals (cloud of variables). In our case, we have 18 variables measured in 141 countries, so each observation (row) is a vector in an 18-dimensional space (in the cloud of individuals) and each variable (column) is a vector in a 141-dimensional space (in the cloud of variables). In PCA, we seek to answer two questions: which individuals are similar to each other, and which variables are related to each other? To answer them, we need to find one or more subspaces in which the projections (photos) of the data are clear enough to assess similarities and correlations.
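To make the two views concrete, here is a minimal sketch using NumPy. The data are synthetic stand-ins for the real table (the original file is not reproduced here), and dividing by the population standard deviation (`ddof=0`) is an assumption about how the standardization was done.

```python
import numpy as np

# Synthetic stand-in for the real table: 141 countries x 18 indicators.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(141, 18))

# Standardize each column: subtract its mean, divide by its standard deviation.
# ddof=0 (population standard deviation) is an assumption.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Cloud of individuals: each row of Z is a point in an 18-dimensional space.
individuals = Z          # shape (141, 18)
# Cloud of variables: each column of Z is a point in a 141-dimensional space.
variables = Z.T          # shape (18, 141)

print(individuals.shape, variables.shape)
# After standardization every variable has mean 0 and variance 1.
print(np.allclose(Z.mean(axis=0), 0.0), np.allclose(Z.var(axis=0), 1.0))
```

Both clouds are read from the same table Z; only the roles of rows and columns are swapped.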

The search for these subspaces is based on two principles: the principle of maximum elongation and the principle of maximum spread. The first consists in finding projections that give good photos, that is, photos that distort the cloud as little as possible. The second, maximum spread, is the search for the axis, plane or subspace (of dimension greater than or equal to 3) on which the elements are best spread out. If we're interested in the cloud of variables, the quality of a photo is measured by its inertia. Finding the axis along which the cloud spreads best amounts to solving the program:

$$ \max_{V} \; V'ZZ'V \quad \text{subject to} \quad \|V\|^2 = 1 $$

where V is the unit direction vector of the axis and ZZ' is the inertia matrix of the variable cloud. It can be shown that the solution V is the eigenvector of ZZ' associated with its largest eigenvalue. More generally, the p-dimensional subspace along which the cloud spreads the most is the one generated by the p eigenvectors associated with the p largest eigenvalues. The reasoning is similar for the individual cloud, where the associated inertia matrix is Z'Z.
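The eigenvector result can be checked numerically. The sketch below uses synthetic data (an assumption, not the project's dataset) and verifies that the objective V'ZZ'V evaluated at the leading eigenvector of ZZ' equals the largest eigenvalue, that no random unit direction does better, and that ZZ' and Z'Z share the same largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic table with correlated columns, then standardized.
X = rng.normal(size=(141, 18)) @ rng.normal(size=(18, 18))
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Z'Z is 18 x 18 (individual cloud), ZZ' is 141 x 141 (variable cloud).
evals_small, _ = np.linalg.eigh(Z.T @ Z)
evals_big, evecs_big = np.linalg.eigh(Z @ Z.T)

# eigh returns eigenvalues in ascending order, so the largest is last.
top_small = evals_small[-1]
top_big = evals_big[-1]
V = evecs_big[:, -1]                 # unit vector in the 141-dimensional space

# Value of the objective V'ZZ'V at the leading eigenvector.
objective = V @ (Z @ Z.T) @ V

# Compare against 1000 random unit directions t: t'ZZ't = ||Z't||^2.
trials = rng.normal(size=(1000, 141))
trials /= np.linalg.norm(trials, axis=1, keepdims=True)
best_random = np.max(np.sum((trials @ Z) ** 2, axis=1))

print(np.isclose(objective, top_big))   # objective equals largest eigenvalue
print(best_random <= top_big)           # no random direction beats it
print(np.isclose(top_small, top_big))   # both matrices share this eigenvalue
```

Because the nonzero eigenvalues of the two matrices coincide, in practice one would diagonalize the smaller 18 x 18 matrix.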

Tools for interpretation

Since factor graphs are derived from projections, reading them can be subject to distortion. To guard against this, two main tools are used for PCA interpretation: the squared cosine (CO2) and the contribution (CTR). CO2 measures the quality of an element's representation. It is the squared cosine of the angle between the element and the factorial axis. The higher it is, the better the element is represented on the factorial axis. The CTR of an element is the share of an axis's inertia contributed by that element. It can be used to identify the points that most influence the formation of the axis.
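As an illustration of these two tools, here is a minimal sketch for the cloud of individuals, on synthetic data and assuming uniform weights 1/n for the individuals (both are assumptions; the variable names are illustrative).

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(141, 18)) @ rng.normal(size=(18, 18))
Z = (X - X.mean(axis=0)) / X.std(axis=0)
n = Z.shape[0]

# Diagonalize the correlation matrix Z'Z / n and sort axes by decreasing inertia.
eigenvalues, U = np.linalg.eigh(Z.T @ Z / n)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, U = eigenvalues[order], U[:, order]

# Coordinates of the individuals on all factorial axes.
F = Z @ U

# CO2: squared coordinate on an axis divided by the squared distance to the origin.
cos2 = F**2 / np.sum(F**2, axis=1, keepdims=True)

# CTR: weighted squared coordinate divided by the inertia (eigenvalue) of the axis.
ctr = (F**2 / n) / eigenvalues

print(np.allclose(cos2.sum(axis=1), 1.0))  # CO2 of an individual sums to 1 over all axes
print(np.allclose(ctr.sum(axis=0), 1.0))   # CTR on an axis sums to 1 over individuals
```

An individual whose CTR on an axis is far above the average 1/n is an influential point for that axis.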

GENERAL APPROACH TO PCA

The following procedure is intended as a guide only, and may be modified to suit the requirements of the subject or the objectives pursued. To carry out a PCA, one should proceed as follows:

1- Check that the table is suitable for PCA.

2- Recall the objectives of the PCA.

3- Decide on the number of axes to be retained, according to one of the following criteria:

  • Inertia rate criterion: retain enough axes for their cumulative share of total inertia to reach around 60 to 80%.

  • Kaiser criterion: retain the axes whose eigenvalues are greater than or equal to 1.

  • Elbow criterion: retain the axes that come before the sharpest drop in inertia from one axis to the next.

4- Examine the correlations between variables.

5- Examine the correlations between variables and factors.

6- Observe the positions of the individuals (here, countries).

7- Make sense of the axes.

8- Conclude.

Analysis and results

Descriptive statistics

Demographically, the average proportion of people under 15 years of age is 30.95%, and the average proportion of people aged 65 and over is 7.36%. The average annual population growth rate for the countries in the dataset is 1.319% (see Table I below).

Table I: Summary statistics on the data

| Variable | Mean | Minimum | Maximum |
|---|---|---|---|
| GDP per capita (constant 2000 international $) | 8,840.900 | 582.960 | 51,874.200 |
| Share of agriculture in GDP | 15.437 | 0.110 | 55.150 |
| Share of population with access to drinking water | 83.479 | 22.000 | 100.000 |
| Number of individuals with a mobile phone (per 1000 inhabitants) | 363.111 | 3.240 | 1,399.810 |
| Number of individuals with a laptop computer (per 1000 inhabitants) | 140.482 | 0.670 | 765.630 |
| Number of internet users (per 1000 inhabitants) | 158.747 | 0.790 | 771.750 |
| Share of population aged 0–14 years | 30.952 | 14.060 | 50.430 |
| Share of population aged 65 and over | 7.362 | 1.060 | 19.660 |
| Annual population growth rate (%) | 1.319 | -1.030 | 6.290 |

In terms of living conditions and standard of living, on average 140 out of every 1,000 people in a given country use a laptop computer, around 363 have a cell phone and over 158 have access to the Internet. On average across the countries in the dataset, 83.479% of the population has access to drinking water. However, the figures vary from one continent to the next. In Oceania, the entire population has access to drinking water, followed by Europe, North America and South America, with 96.35%, 92.82% and 90.81% respectively. Sub-Saharan Africa comes last with 66.26%, behind Asia (83.00%) and North Africa (89.25%).

On the economic front, GDP per capita shows major disparities between countries, as evidenced by its standard deviation ($9,740.79), which exceeds the mean ($8,840.90). The lowest GDP per capita is observed in Malawi ($582.96) and the highest in Luxembourg ($51,874.20). On average, agriculture accounts for 15.437% of GDP in the countries of the dataset.
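The size of these disparities can be put in numbers with two simple ratios. The short sketch below uses only the GDP per capita figures reported in Appendix 1; the coefficient of variation and the max/min ratio are illustrative additions, not statistics computed in the original analysis.

```python
# GDP per capita figures taken from Appendix 1.
mean_gdp = 8840.900
std_gdp = 9740.790
min_gdp = 582.960      # Malawi
max_gdp = 51874.200    # Luxembourg

# Coefficient of variation: standard deviation relative to the mean.
cv = std_gdp / mean_gdp
# Ratio between the richest and the poorest country in the dataset.
ratio = max_gdp / min_gdp

print(f"coefficient of variation: {cv:.2f}")
print(f"max/min ratio: {ratio:.0f}")
```

A coefficient of variation above 1 means the spread is larger than the average itself.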

CHOICE OF NUMBER OF AXES

We now perform a standardized PCA on the data. The software used is SPAD v5.5. Note that in this PCA, Guinea Bissau is included as an illustrative (supplementary) individual, because a previous analysis identified it as an atypical point.
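The analysis itself was run in SPAD, but the treatment of an illustrative individual can be sketched in Python. In the usual sense of the term, such an individual does not take part in building the axes and is only projected onto them afterwards. The sketch below follows that reading on synthetic data; `X_active` and `x_supp` are hypothetical names, and the sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic active individuals and one hypothetical atypical row.
X_active = rng.normal(size=(141, 18)) @ rng.normal(size=(18, 18))
x_supp = rng.normal(size=18) * 5.0

# Means and standard deviations come from the active individuals only.
mean = X_active.mean(axis=0)
std = X_active.std(axis=0)
Z = (X_active - mean) / std

# Axes are computed from the active individuals only.
eigenvalues, U = np.linalg.eigh(Z.T @ Z / Z.shape[0])
order = np.argsort(eigenvalues)[::-1]
eigenvalues, U = eigenvalues[order], U[:, order]

# Coordinates of the active individuals on the first two axes.
F_active = Z @ U[:, :2]

# The supplementary individual is standardized with the active statistics
# and projected onto the same axes; it has no influence on them.
z_supp = (x_supp - mean) / std
f_supp = z_supp @ U[:, :2]

print(F_active.shape, f_supp.shape)
```

This keeps an atypical point visible on the factorial plane without letting it pull the axes towards itself.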

TABLE II: Eigenvalues

| Number | Eigenvalue | Percentage | Cumulative Percentage |
|---|---|---|---|
| 1 | 8.7704 | 48.72% | 48.72% |
| 2 | 2.1247 | 11.80% | 60.53% |
| 3 | 1.5525 | 8.63% | 69.15% |
| 4 | 1.1716 | 6.51% | 75.66% |
| 5 | 0.9349 | 5.19% | 80.86% |
| 6 | 0.7799 | 4.33% | 85.19% |
| 7 | 0.5582 | 3.10% | 88.29% |
| 8 | 0.4882 | 2.71% | 91.00% |
| 9 | 0.4085 | 2.27% | 93.27% |
| 10 | 0.3139 | 1.74% | 95.02% |
| 11 | 0.2605 | 1.45% | 96.46% |
| 12 | 0.2043 | 1.13% | 97.60% |
| 13 | 0.1708 | 0.95% | 98.55% |
| 14 | 0.0980 | 0.54% | 99.09% |

In choosing the number of axes for this analysis, we use the inertia rate criterion. The first axis explains 48.72% of total inertia and the second adds 11.80%, so the first two axes together account for 60.53%. In other words, 60.53% of the variability observed across the 141 countries (excluding Guinea Bissau) is captured by the first two axes alone. This share falls within the range recommended by the inertia rate criterion, so we retain the first two axes for our analysis.
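The percentages in Table II and the selection criteria listed earlier can be reproduced with a few lines of arithmetic. The sketch below takes the eigenvalues from Table II and assumes a total inertia of 18, which is what a standardized PCA on 18 variables gives; the 60% lower bound is the one quoted in the inertia rate criterion.

```python
# Eigenvalues from Table II (only the first 14 of 18 are listed there).
eigenvalues = [8.7704, 2.1247, 1.5525, 1.1716, 0.9349, 0.7799, 0.5582,
               0.4882, 0.4085, 0.3139, 0.2605, 0.2043, 0.1708, 0.0980]
total_inertia = 18  # standardized PCA: one unit of inertia per variable

# Share of inertia per axis and cumulative share, in percent.
shares = [100 * ev / total_inertia for ev in eigenvalues]
cumulative = []
running = 0.0
for share in shares:
    running += share
    cumulative.append(running)

# Inertia rate criterion: fewest axes whose cumulative share reaches 60%.
n_inertia = next(k for k, c in enumerate(cumulative, start=1) if c >= 60)

# Kaiser criterion: axes whose eigenvalue is at least 1.
n_kaiser = sum(ev >= 1 for ev in eigenvalues)

# Elbow criterion: inspect the drop between consecutive eigenvalues.
drops = [eigenvalues[k] - eigenvalues[k + 1] for k in range(len(eigenvalues) - 1)]

print(round(shares[0], 2), round(cumulative[1], 2))  # 48.72 60.53
print(n_inertia, n_kaiser)                           # 2 4
print([round(d, 2) for d in drops[:4]])
```

On these figures the Kaiser criterion would point to four axes, while the 60% lower bound of the inertia rate criterion is reached with two, which is the choice made above.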

CORRELATION BETWEEN VARIABLES

Looking at the correlation matrix, we can identify which variables are negatively correlated and which are positively correlated. For the purposes of this work, we only consider correlations whose absolute value is greater than or equal to 0.70. A summary of these correlations is presented in Appendix 2.
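A filter of this kind is straightforward to write. The sketch below builds a small synthetic table (the four column names are hypothetical and the data are not the project's) and lists the pairs of variables whose correlation reaches the 0.70 threshold in absolute value.

```python
import numpy as np

rng = np.random.default_rng(4)
names = ["gdp", "internet", "agri_share", "noise"]  # hypothetical variables

# Three columns driven by a common factor (one with opposite sign), one unrelated.
base = rng.normal(size=141)
X = np.column_stack([
    base + 0.3 * rng.normal(size=141),
    base + 0.3 * rng.normal(size=141),
    -base + 0.3 * rng.normal(size=141),
    rng.normal(size=141),
])

# Correlation matrix between columns.
R = np.corrcoef(X, rowvar=False)

# Keep each pair once (upper triangle) when |r| >= threshold.
threshold = 0.70
strong = [
    (names[i], names[j], round(float(R[i, j]), 2))
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(R[i, j]) >= threshold
]

for first, second, r in strong:
    sign = "positive" if r > 0 else "negative"
    print(f"{first} / {second}: {r} ({sign})")
```

The sign of each retained correlation is what separates the two groups of variables described next.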

Two main groups stand out in terms of correlations. The first group is made up of variables that are strongly positively correlated with each other (GDP per capita, agricultural value added per worker, number of people with a cell phone per 1,000 inhabitants, number of people with a laptop per 1,000 inhabitants, proportion of the population aged 65 and over, number of Internet users per 1,000 inhabitants). The second group is also made up of highly correlated variables (share of agriculture in GDP, share of rural population, share of population aged 0-14). Each variable in a given group is negatively correlated with those in the opposite group.

➡️ Group 1 (Positive correlation within Group 1)

  • GDP per capita
  • Agricultural value added per worker
  • Number of people with a mobile phone (per 1000 inhabitants)
  • Number of people with a laptop computer (per 1000 inhabitants)
  • Share of population aged 65 and over
  • Number of internet users (per 1000 inhabitants)

➡️ Group 2 (Positive correlation within Group 2)

  • Share of agriculture in GDP
  • Share of rural population
  • Share of population aged 0–14 years

CORRELATION BETWEEN VARIABLES AND FACTORS

We consider here the variables whose correlation with the factors is greater than or equal to 0.69 in absolute value. The variables most characteristic of axes 1 and 2 are summarized in Tables III and IV. The variables GDP per capita, agricultural value added per worker, access to drinking water, number of people with a cell phone per 1,000 inhabitants, number of people with a laptop per 1,000 inhabitants, number of internet users per 1,000 inhabitants, share of population aged 65 and over, and share of services in GDP are positively correlated with each other and are well represented on axis 1, on its negative side.

Similarly, the variables infant mortality rate, share of agriculture in GDP, share of population aged 0-14, and percentage of rural population are positively correlated with each other and well represented on axis 1, on its positive side. They are negatively correlated with the variables on the negative side of the axis.

Axis 1 therefore contrasts countries with high values for infant mortality rate, share of agriculture in GDP, share of population aged 0-14 and percentage of rural population (i.e. underdeveloped countries) with those with high values for GDP per capita, agricultural value added per worker, access to drinking water, number of people with a cell phone per 1,000 inhabitants, number of people with a laptop per 1,000 inhabitants, number of internet users per 1,000 inhabitants, share of population aged 65 and over and share of services in GDP (i.e. developed countries).

Axis 1 would therefore seem to characterize the level of development.

TABLE III: Correlation Between Variables and Axis 1

| Correlation | Variables |
|---|---|
| Negatively correlated | GDP per capita, agricultural value added per worker, access to drinking water, number of people with a mobile phone per 1000 inhabitants, number of people with a laptop per 1000 inhabitants, number of internet users per 1000 inhabitants, population aged 65 and over, share of services in GDP |
| Positively correlated | Infant mortality rate, share of agriculture in GDP, share of population aged 0–14 years, share of rural population |

The variables annual per capita GDP growth and industry's share of GDP characterize Axis 2 and are strongly positively correlated with it. Axis 2 therefore contrasts countries with high values for these two variables (i.e. emerging countries) with those with low values for them. This suggests that Axis 2 reflects the importance of industry in the GDP of the countries in the dataset.

TABLE IV: Correlation between variables and axis 2

| Correlation | Variables |
|---|---|
| Negatively correlated | (None) |
| Positively correlated | Annual GDP per capita growth rate, share of industry in GDP |

Interpretation of the factorial plane

The first factorial plane illustrates the development cycle of the various economies in the dataset. Axes 1 and 2 divide the plane into three main parts (see Figure 1). Part 1 of the diagram below characterizes underdeveloped countries with a high share of the primary sector (mainly agriculture) in GDP, a low rate of industrialization and low growth in GDP per capita.

Figure 1: Variable graph

The second part (part 2), groups together emerging countries with high GDP growth, a strong preponderance of industry within their economy (high share of industry in GDP) and high values in terms of gross fixed capital formation.

The last part (part 3) is marked by developed countries with a high tertiary sector share of GDP, which greatly reduces the share of industry, even though they remain highly industrialized. They are also characterized by high GDP values and relatively better living conditions for their populations.

Countries projections

Figure 2: Countries graph

The graph of countries allows us to identify certain countries that clearly characterize the different parts of the factorial plane. Part 1 is dominated by sub-Saharan African and Latin American countries. The second part includes countries from North America, North Africa and Asia. Part 3 is typical of European and North American countries.

Part 1 features Benin, an underdeveloped country with little or no industrial contribution to GDP (an economy mainly dependent on the primary sector) and a high proportion of rural population. It is characterized by an age pyramid with a broad base (0-14 years), the result of strong population growth, and by a high infant mortality rate. In the same group, we find Bangladesh, an underdeveloped country with a relatively high share of industry in its GDP.

In the second group, we find China, a highly industrialized emerging country where industry accounts for the major part of GDP.

In the final group, we find Belgium, a developed country with a highly tertiarized economy and one of the highest per capita GDPs in the world. The entire population has access to drinking water, and the majority have access to ICT and well-being. These living conditions are conducive to longevity, which explains the high proportion of the population aged 65 and over.

Conclusion

The standardized PCA performed on our dataset of 18 quantitative variables measured for 141 countries worldwide enabled us to identify the macro-characteristics of the countries in the dataset. Our analysis reveals three main groups of countries: underdeveloped countries, emerging countries and developed countries. Taken together, the analysis describes the economic cycle: the process by which underdeveloped countries can become emerging and then developed, drastically improving the living conditions of their populations.

Appendix

Appendix 1: Dataset variables

| Variable | Mean | Std. Dev. | Minimum | Maximum |
|---|---|---|---|---|
| GDP per capita (constant 2000 international $) | 8840.900 | 9740.790 | 582.960 | 51874.200 |
| Annual GDP per capita growth rate (%) | 3.567 | 3.188 | -7.440 | 14.860 |
| GFCF (% of GDP) | 21.655 | 6.172 | 9.660 | 47.350 |
| Agricultural value added per worker (constant 2000 US$) | 7721.660 | 13444.100 | 59.390 | 63451.900 |
| Share of agriculture in GDP | 15.437 | 12.981 | 0.110 | 55.150 |
| Share of exports in GDP | 42.157 | 27.285 | 8.610 | 228.970 |
| Share of FDI in GDP | 5.841 | 23.823 | -0.490 | 282.400 |
| Share of population with access to drinking water | 83.479 | 17.396 | 22.000 | 100.000 |
| Share of industry in GDP | 30.526 | 11.170 | 11.970 | 69.180 |
| Number of people with a mobile phone (per 1000 inhabitants) | 363.111 | 333.011 | 3.240 | 1399.810 |
| Infant mortality rate | 40.335 | 37.071 | 2.000 | 154.000 |
| Number of people with a laptop (per 1000 inhabitants) | 140.482 | 199.780 | 0.670 | 765.630 |
| Share of population aged 0–14 years | 30.952 | 10.782 | 14.060 | 50.430 |
| Share of population aged 65 and over | 7.362 | 5.105 | 1.060 | 19.660 |
| Share of rural population | 45.745 | 22.245 | 0.000 | 90.280 |
| Share of services in GDP | 54.048 | 13.133 | 22.760 | 82.660 |
| Annual population growth rate | 1.319 | 1.140 | -1.030 | 6.290 |
| Number of internet users (per 1000 inhabitants) | 158.747 | 188.472 | 0.790 | 771.750 |

Appendix 2: Summary of the Correlation Matrix

Only correlations with absolute value ≥ 0.7 are listed. Entries are placed in a column according to the sign of the reported value.

| Variable | Positively Correlated | Negatively Correlated |
|---|---|---|
| GDP per capita | Agricultural value added per worker (0.86); Mobile phones per 1000 inhabitants (0.87); Laptops per 1000 inhabitants (0.90); Internet users per 1000 inhabitants (0.88); Population aged 65+ (0.72) | Population aged 0–14 (-0.70) |
| Agricultural value added per worker | Laptops per 1000 inhabitants (0.83); Internet users per 1000 inhabitants (0.83) | Mobile phones per 1000 inhabitants (-0.70); Infant mortality (-0.78); Population aged 0–14 (-0.80) |
| Share of agriculture in GDP | Rural population (0.70) | |
| Access to drinking water | Mobile phones per 1000 inhabitants (0.72) | Infant mortality (-0.71); Population aged 0–14 (-0.80) |
| Mobile phones per 1000 inhabitants | Laptops per 1000 inhabitants (0.76); Population aged 65+ (0.77); Internet users per 1000 inhabitants (0.84) | Infant mortality (-0.71); Population aged 0–14 (-0.80) |
| Population aged 0–14 | | Population aged 65+ (-0.88); Internet users per 1000 inhabitants (-0.71) |
| Population aged 65+ | Internet users per 1000 inhabitants (0.72) | Population growth rate (-0.70) |

Appendix 3: Coordinates of variables on the axes

| Variable | Axis 1 | Axis 2 |
|---|---|---|
| GDP per capita (constant 2000 international $) | -0.90 | -0.22 |
| Annual GDP per capita growth rate (%) | -0.04 | 0.78 |
| GFCF (% of GDP) | -0.06 | 0.55 |
| Agricultural value added per worker (constant 2000 US$) | -0.77 | -0.31 |
| Share of agriculture in GDP | 0.79 | -0.25 |
| Share of exports in GDP | -0.36 | 0.29 |
| Share of FDI in GDP | -0.23 | -0.04 |
| Share of population with access to drinking water | -0.77 | 0.04 |
| Share of industry in GDP | -0.07 | 0.69 |
| Mobile phones per 1000 inhabitants | -0.92 | -0.05 |
| Infant mortality rate | 0.82 | -0.14 |
| Laptops per 1000 inhabitants | -0.84 | -0.24 |
| Share of population aged 0–14 | 0.89 | -0.24 |
| Share of population aged 65 and over | -0.83 | 0.07 |
| Share of rural population | 0.75 | 0.06 |
| Share of services in GDP | -0.73 | -0.34 |
| Annual population growth rate | 0.60 | -0.37 |
| Internet users per 1000 inhabitants | -0.89 | -0.18 |
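These coordinates are enough to re-derive Tables III and IV and to cross-check Table II. In a standardized PCA, a variable's coordinate on an axis is its correlation with that axis, and the squared coordinates on an axis sum to its eigenvalue. The sketch below applies the 0.69 threshold used in the text; the dictionary keys are shortened labels chosen for readability.

```python
# (Axis 1, Axis 2) coordinates copied from Appendix 3.
coords = {
    "GDP per capita": (-0.90, -0.22),
    "Annual GDP per capita growth rate": (-0.04, 0.78),
    "GFCF": (-0.06, 0.55),
    "Agricultural value added per worker": (-0.77, -0.31),
    "Share of agriculture in GDP": (0.79, -0.25),
    "Share of exports in GDP": (-0.36, 0.29),
    "Share of FDI in GDP": (-0.23, -0.04),
    "Access to drinking water": (-0.77, 0.04),
    "Share of industry in GDP": (-0.07, 0.69),
    "Mobile phones": (-0.92, -0.05),
    "Infant mortality rate": (0.82, -0.14),
    "Laptops": (-0.84, -0.24),
    "Population aged 0-14": (0.89, -0.24),
    "Population aged 65 and over": (-0.83, 0.07),
    "Share of rural population": (0.75, 0.06),
    "Share of services in GDP": (-0.73, -0.34),
    "Annual population growth rate": (0.60, -0.37),
    "Internet users": (-0.89, -0.18),
}
threshold = 0.69

# Variables that characterize each side of each axis (Tables III and IV).
axis1_neg = [v for v, (a1, _) in coords.items() if a1 <= -threshold]
axis1_pos = [v for v, (a1, _) in coords.items() if a1 >= threshold]
axis2_neg = [v for v, (_, a2) in coords.items() if a2 <= -threshold]
axis2_pos = [v for v, (_, a2) in coords.items() if a2 >= threshold]

# Squared coordinates on an axis should sum to its eigenvalue (Table II),
# up to the rounding of the coordinates to two decimals.
sum_sq_axis1 = sum(a1**2 for a1, _ in coords.values())
sum_sq_axis2 = sum(a2**2 for _, a2 in coords.values())

# Quality of representation of each variable on the first plane (cos2).
quality = {v: a1**2 + a2**2 for v, (a1, a2) in coords.items()}

print(len(axis1_neg), len(axis1_pos), axis2_neg, axis2_pos)
print(round(sum_sq_axis1, 2), round(sum_sq_axis2, 2))
print(min(quality, key=quality.get))  # least well represented variable
```

The sums come out close to the eigenvalues 8.7704 and 2.1247 reported in Table II, which is a useful consistency check between the two tables.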
