1. Official poster: 2. Background story: 3. Data source: 4. Research links: https://education.nationalgeographic.org/resource/paradox-undernourishment Food Safety status in African countries https://agrilinks.org/post/advancing-food-safety-africa-opportunities-and-action-areas 5. The development of Food Safety in Sub-Saharan Africa: Based on the data found and researches, we concluded that Food Safety issue in Sub-Saharan Africa was determined by 2 main factors: Food Quantity and Food […]

1. Official poster:

2. Background story:

3. Data source:

4. Research links:

https://education.nationalgeographic.org/resource/paradox-undernourishment

Food Safety status in African countries

https://agrilinks.org/post/advancing-food-safety-africa-opportunities-and-action-areas

5. The development of Food Safety in Sub-Saharan Africa:

Based on the data found and researches, we concluded that Food Safety issue in Sub-Saharan Africa was determined by 2 main factors: Food Quantity and Food Quality. As we developed deeper analysis on each factor, we found that there was no sufficient data to support Food Quality. Therefore, we decided to primarily focus on Food Quantity.

6. Indicator for Food Quantity & Other variables:

  • We chose The Prevalence Number of Undernourished People in 20 years (2001 – 2020) as an indicator for Food Quantity (Y)
  • We chose randomly a certain number of factors as variable X that we believed to have an impact on Food Quantity (Y)

7. Data cleaning:

We used R Studio to merge data tables from Excel files and convert rows to columns. Then we removed any variables that had more than 10% of missing values. That helped us to narrow down the number of variable X that was finally used for the analysis.

8. My major tasks: Gather data, Clean data, Develop K-means clustering analysis on the regional level.

  • R script for Data Cleaning

library(readxl)
library(tidyr)
library(janitor)
library(tidyverse)

countMissingValue <- function(tbl_col) { return(length(which(tbl_col == “” | tbl_col == “NULL” | tbl_col == “NA” | is.na(tbl_col) | is.null(tbl_col) ))) }

import_and_transform_data <- function(file_path, table_all = NULL) {
  # file_path = “FAOSTAT_number of undernourished people_new.xlsx”
  # Step 1: Import dataset from Excel
  tbl <- read_excel(file_path)
  # glimpse(tbl)
 # Step 2: Convert data table from long to wide format
  tbl <- clean_names(tbl)
  tbl_var <- pivot_wider(
    data = tbl,
    id_cols = c(“year”, “area”),
    names_from = “item”,
    values_from = “value”
  tbl_var <- clean_names(tbl_var)
  # names(tbl_var)
  # tbl_var
  # Step 3: Transform the year column to numeric format
  tbl_var <- transform(tbl_var, year = as.numeric(as.character(year)))
  # glimpse(tbl_var)
  # Step 4: Repeat the first two steps for another dataset and join them
  if (!is.null(table_all)) {
    table_all <- inner_join(tbl_var, table_all, by = c(“year”, “area”))
  } else {
    table_all <- tbl_var
  }
glimpse(tbl_var)
  # Step 7: Count the percentage of missing values for each column
  total_rows <- nrow(table_all)
# cols_to_keep <- c(“Number of people undernourished (million) (3-year average)”, “Population – Est. & Proj.”)
# Initialize with columns to keep
  cols_white_list = c(‘number_of_people_undernourished_million_3_year_average’)
  cols_to_keep <- c()
  for (col in names(table_all)) {
   # count_nan <- sum(is.na(table_all
))
    #count_null <- nrow(table_all
== “” | is.numeric())
    # count_nan <- nrow(table_all[is.na(table_all
),])
    # count_null <- nrow(table_all[table_all
== “NULL”,])
    count_missing_value <- countMissingValue(table_all
)
    percent_missing <- count_missing_value * 100 / total_rows
    if (!is.na(percent_missing) && (percent_missing <= 10 | col %in% cols_white_list)  ) {
      cols_to_keep <- c(cols_to_keep, col)
      # print(paste0(“hello”, 10))
      print(paste0(“Count missing values = “, count_missing))
      print(paste0(“Add column “, col, ” with “, percent_missing, “% missing values”))
    } else {
      print(paste0(“Removing column “, col, ” with “, count_missing, “/”, total_rows, “=” , percent_missing, “% missing values”))
    }
  }

table_subset <- table_all[, cols_to_keep] # Step 9: View the final table View(table_subset) return(table_subset) # cols_to_keep }

setwd(“~/Documents/Niagara University/MCBAC”)
table_all <- NULL
table_all <- import_and_transform_data(“FAOSTAT_total population_new.xlsx”)
table_all <- import_and_transform_data(“FAOSTAT_number of undernourished people_new.xlsx” , table_all = table_all)
table_all <- import_and_transform_data(“FAOSTAT_imports_4-5-2023.xls” , table_all = table_all)
table_all <- import_and_transform_data(“FAOSTAT_NO2 Emissions_4-5-2023 (3).xls” , table_all = table_all)
table_all <- import_and_transform_data(“FAOSTAT_production_4-6-2023.xlsx” , table_all = table_all)
table_all <- import_and_transform_data(“FAOSTAT_water&sanitation_new.xlsx” , table_all = table_all)
View(table_all)
df <- table_all[,order(names(table_all))] df <- data.frame(df)
df <-apply(df, 2 , as.character)
write.csv(df, file = “table_all_output3.csv”)
  • R script for K-means clustering analysis:

# https://www.statology.org/k-means-clustering-in-r/
library(factoextra)
library(cluster)
library(readxl)

# install.packages(‘factoextra’)
setwd(“~/Documents/Niagara University/MCBAC/”)

#load data
final_data_cluster_36 <- read_excel(“final_data_cluster.xlsx”, sheet = “36 countries”, col_names =TRUE)
df1 <- data.frame(final_data_cluster_36, row.names =  TRUE )
df <- df1[1:nrow(df1), 2:ncol(df1)]
#remove rows with missing values
df <- na.omit(df)
#scale each variable to have a mean of 0 and sd of 1
df <- scale(df)
df <- na.omit(df)
fviz_nbclust(df, kmeans, method = “wss”)
gap_stat <- clusGap(df,
                    FUN = kmeans,
                    nstart = 25,
                    K.max = 10 ,
                    B = 50)
#plot number of clusters vs. gap statistic
fviz_gap_stat(gap_stat)
km1 <- kmeans(df, centers = 3, nstart = 25)
fviz_cluster(km1, data = df )
final_data <- cbind(df, cluster=km1$cluster)
fviz_cluster(km1, data = final_data)
View(final_data)
write.csv(final_data, ‘final_data.csv’)
system(“open final_data.csv”)

 

9. Judges’ comments:
  • The in-person competition took place on Manhattan College’s campus, which I, as the team advisor, did not attend since only Undergraduate students were eligible to participate. The poster was presented in the 1st round, after which all participating teams were selected for the 2nd round. At this point, a new challenge was given as each team was required to run another analysis on the Food Safety issue for the Central America region and compare the results with Sub-Saharan Africa’s. Our team was not selected for the final round after the 2nd presentation. However, the overall evaluation for both phases was positive with every rating being better than 3.0 score.

.

Credit to Dr. Caruso, my entire team - Chance, Nolan, Mai Anh, and Kevin. 
Special thanks and deep gratitude to anh Võ Minh Tiến (a.k.a Coding Instructor, Researching Buddy, Healing Partner, and Mental Supporter)