01_Exploratory analyses

Zaira Rosario Pérez-Vázquez edited this page May 2, 2025 · 1 revision

---
title: "Exploratory Analysis for Random Forest Modeling"
author: "Dra. Zaira Rosario Pérez-Vázquez"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: true
    number_sections: true
    toc_depth: 2
    fig_caption: true
    fig_path: "docs/"
    theme: flatly
---

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)

1. Introduction

This report documents the exploratory data analysis (EDA) that precedes fitting a Random Forest model to simulated forest floor carbon data. The data include measurements from two forest floor layers (L and FH) taken in three years: 2013, 2018, and 2023.

2. Load Libraries

library(tidyverse)
library(psych)
library(here)
library(DescTools)
library(ggpubr)
library(car)
library(Hmisc)
library(moments)
library(cowplot)
library(readr)

3. Load Data

data_path <- here("data", "RF_data_Carbon_ForestFloor.csv")
datasetFF <- read_csv(data_path)
str(datasetFF)

4. Data Preparation

datasetFF <- datasetFF %>%
  mutate(
    YEAR = as.factor(YEAR),
    LAYER = as.factor(LAYER),
    CONDITION = as.factor(CONDITION),
    LAYER_NUM = as.factor(LAYER_NUM),
    CON_NUM = as.factor(CON_NUM)
  )

5. Summary Statistics

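# Hmisc is attached after psych and also exports describe(), so call psych's version explicitly
psych::describe(datasetFF$C_STOCKS)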
describeBy(datasetFF$C_STOCKS, list(datasetFF$YEAR, datasetFF$LAYER))
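
As a cross-check on the `describeBy()` output, the same group-wise counts, means, and standard deviations can be computed with base R. This is a minimal sketch, assuming a hypothetical toy data frame `toy_ff` that stands in for `datasetFF`; its values are generated for illustration only and are not the study data.

```r
# Hypothetical toy data standing in for datasetFF (illustrative values only)
set.seed(1)
toy_ff <- expand.grid(
  YEAR  = factor(c("2013", "2018", "2023")),
  LAYER = factor(c("L", "FH")),
  rep   = 1:10
)
toy_ff$C_STOCKS <- rlnorm(nrow(toy_ff), meanlog = 1, sdlog = 0.4)

# n, mean, and sd of C_STOCKS for each YEAR x LAYER combination
grp_summary <- aggregate(
  C_STOCKS ~ YEAR + LAYER,
  data = toy_ff,
  FUN = function(x) c(n = length(x), mean = mean(x), sd = sd(x))
)

# aggregate() returns the three statistics as one matrix column; flatten it
grp_summary <- do.call(data.frame, grp_summary)
print(grp_summary)
```

Each row of `grp_summary` corresponds to one year and layer, which makes it easy to spot groups with few observations before modeling.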

6. Histogram by Layer and Year

datasetFF %>%
  ggplot(aes(x = C_STOCKS)) +
  geom_histogram(bins = 30, fill = "steelblue", alpha = 0.7) +
  facet_grid(YEAR ~ LAYER) +
  theme_minimal() +
  labs(title = "Distribution of C Stocks", x = "C Stocks (Mg ha^-1)", y = "Count")

7. Normality Assessment (L layer)

plots <- list()
for (yr in c("2013", "2018", "2023")) {
  data_tmp <- datasetFF %>% filter(YEAR == yr & LAYER == "L")
  p <- ggplot(data_tmp, aes(x = C_STOCKS)) +
    # after_stat(density) replaces the deprecated ..density.. notation
    geom_histogram(aes(y = after_stat(density)), binwidth = 0.5, fill = "gray", color = "black") +
    geom_density(color = "blue") +
    stat_function(fun = dnorm, args = list(mean = mean(data_tmp$C_STOCKS), sd = sd(data_tmp$C_STOCKS)), color = "red") +
    labs(title = paste("Year", yr), x = "C Stocks (Mg ha^-1)", y = "Density") +
    theme_minimal()
  plots[[yr]] <- p
}
ggarrange(plotlist = plots, ncol = 3, common.legend = TRUE)
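
The visual comparison against the normal curve can be complemented with numeric checks per year. The sketch below is not part of the original analysis: it computes a moment-based skewness by hand and runs a Shapiro-Wilk test with base R on a hypothetical toy data frame, `toy_l`, whose values are simulated for illustration.

```r
# Hypothetical toy data standing in for the L-layer subset (illustrative values only)
set.seed(42)
toy_l <- data.frame(
  YEAR = rep(c("2013", "2018", "2023"), each = 40),
  C_STOCKS = rlnorm(120, meanlog = 1, sdlog = 0.5)
)

# Moment-based sample skewness: 0 for a perfectly symmetric sample
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# One row per year: sample size, skewness, Shapiro-Wilk W and p-value
normality_by_year <- do.call(rbind, lapply(
  split(toy_l$C_STOCKS, toy_l$YEAR),
  function(x) {
    sw <- shapiro.test(x)
    data.frame(
      n = length(x),
      skewness = skew(x),
      W = unname(sw$statistic),
      p_value = sw$p.value
    )
  }
))
print(normality_by_year)
```

A small p-value together with clearly positive skewness would point to a right-skewed distribution, which is one reason to prefer rank-based statistics such as Spearman correlations in later steps.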

8. Spearman Correlations

data2013 <- datasetFF %>% filter(YEAR == "2013")
corr_vars <- data2013 %>% select(C_STOCKS, UTMX, UTMY, STAND_AGE, BASAL_AREA, DOM_HEIGHT, SPECIES_RICHNESS, SHANNON_INDEX, GAP_FRACTION, CANOPY_COVER, ELEVATION, SLOPE, ASPECT)
rcorr(as.matrix(corr_vars), type = "spearman")
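
A correlation matrix of this size is easier to act on when the strongly correlated pairs are listed explicitly. The sketch below uses base R `cor()` rather than `rcorr()` and a hypothetical toy data frame, `toy_vars`; the 0.8 cutoff is an assumed screening threshold, not a value taken from this analysis.

```r
# Hypothetical toy predictors (illustrative values only);
# CANOPY_COVER is built to be strongly related to GAP_FRACTION
set.seed(7)
n <- 50
toy_vars <- data.frame(GAP_FRACTION = runif(n, 5, 60))
toy_vars$CANOPY_COVER <- 100 - toy_vars$GAP_FRACTION + rnorm(n, sd = 2)
toy_vars$SLOPE <- runif(n, 0, 35)

# Spearman rank correlation matrix
rho <- cor(toy_vars, method = "spearman")

# Keep each pair once (upper triangle) if |rho| exceeds the assumed cutoff
cutoff <- 0.8
idx <- which(abs(rho) > cutoff & upper.tri(rho), arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(rho)[idx[, "row"]],
  var2 = colnames(rho)[idx[, "col"]],
  rho  = rho[idx]
)
print(strong_pairs)
```

The same screening can be applied to the `r` element of the list returned by `rcorr()`, with the `P` element supplying the corresponding p-values.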

9. Variance Inflation Factor (VIF)

modelC_2013VIF <- lm(C_STOCKS ~ UTMX + UTMY + STAND_AGE + BASAL_AREA + DOM_HEIGHT + SPECIES_RICHNESS + SHANNON_INDEX + GAP_FRACTION + ELEVATION + SLOPE + ASPECT, data = data2013)
vif(modelC_2013VIF)
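
For numeric predictors, the VIF of predictor j is 1 / (1 - R²_j), where R²_j comes from regressing that predictor on all the others. The sketch below computes this by hand in base R to show what the number means. The toy data frame `toy` and its values are hypothetical, and the hand-rolled function is an illustration rather than a replacement for `car::vif()`.

```r
# Hypothetical toy predictors (illustrative values only);
# DOM_HEIGHT is built to depend on STAND_AGE, SLOPE is independent
set.seed(11)
n <- 60
toy <- data.frame(STAND_AGE = runif(n, 20, 120))
toy$DOM_HEIGHT <- 5 + 0.2 * toy$STAND_AGE + rnorm(n, sd = 2)
toy$SLOPE <- runif(n, 0, 35)

predictors <- names(toy)

# VIF_j = 1 / (1 - R^2_j): regress predictor j on the remaining predictors
vif_manual <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = toy))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))
```

In this toy example the two related predictors receive clearly larger values than the independent one, which is the pattern to look for in the `vif()` output above.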

10. Train/Test Split

set.seed(123)
train_idx <- sample(seq_len(nrow(data2013)), size = 0.85 * nrow(data2013))
training_2013 <- data2013[train_idx, ]
testing_2013 <- data2013[-train_idx, ]

write_csv(training_2013, here("data", "training_2013_85.csv"))
write_csv(testing_2013, here("data", "testing_2013_15.csv"))
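
Before the split files are used downstream, it is worth confirming that the two subsets are disjoint and together cover every row. A minimal sketch, assuming a hypothetical toy data frame `toy_2013` with a made-up `PLOT_ID` column; rounding the training size explicitly keeps the split size unambiguous when 0.85 × n is not a whole number.

```r
# Hypothetical toy data standing in for data2013 (illustrative values only)
set.seed(123)
toy_2013 <- data.frame(PLOT_ID = 1:40, C_STOCKS = rlnorm(40, 1, 0.4))

# 85/15 split with an explicit integer training size
n_train <- round(0.85 * nrow(toy_2013))
train_idx <- sample(seq_len(nrow(toy_2013)), size = n_train)
train <- toy_2013[train_idx, ]
test <- toy_2013[-train_idx, ]

# Sanity checks: no row is lost and no plot appears in both sets
stopifnot(
  nrow(train) + nrow(test) == nrow(toy_2013),
  length(intersect(train$PLOT_ID, test$PLOT_ID)) == 0
)
split_sizes <- c(train = nrow(train), test = nrow(test))
print(split_sizes)
```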

11. Conclusion

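This exploratory analysis examined the distribution of C stocks by layer and year, the Spearman correlation structure among candidate predictors, and their multicollinearity (VIF) to inform the Random Forest modeling stages that follow. The 2013 training (85%) and testing (15%) sets are written to the data folder by the code above. Other outputs are saved in the /outputs folder, and this report is rendered to /docs for easy viewing.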