
JustClust: JustNature Clustering

Introduction

This repository contains the data and code used to define ecological and socio-economic status and disparity profiles at the urban level, within the activities carried out by the JustNature H2020 project.

Repository structure

The main structure of the repository is the following:

  • data
    • raw
      • bolzano: directory with the data of Bolzano, copied directly from Selected indicators
      • ... data of the other CiPeLs
  • justclust: directory containing python source code used for the analyses
    • alg:
    • analysis
      • bolzano.py: file defining the whole analysis for the municipality of Bolzano.
      • ...: add a file for each CiPeL
    • data:
      • bolzano.py: variables and functions to read Bolzano's data
      • ...: add a file with the variables and functions to read the data of the other CiPeLs
    • explore:
      • algclusters.py: definition of the clustering algorithms to explore (get_algclusters) and of the compute_metrics function
      • preprocs.py: definition of the pre-processing pipelines to explore (get_preprocs)
    • paths:
      • bolzano.py: file defining all the Bolzano-related paths
      • ...: add a file for the other CiPeLs
    • hopkins.py: file containing the function to compute the Hopkins statistic used to measure the cluster tendency of a data set.
    • plots.py: collection of functions used to plot data
    • utils.py: a collection of functions to work with the data
  • report:
    • bolzano
  • LICENSE
  • poetry.lock
  • poetry.toml
  • pyproject.toml
  • README.md

Install

The python packages are managed by poetry, so ensure that poetry is properly installed on your machine.

Then:

  1. clone the repository: git clone git@gitlab.inf.unibz.it:URS/justnature/clustering.git;
  2. move inside the newly created folder: cd clustering;
  3. install the required python packages with: poetry install;
  4. enable the virtualenv with poetry shell, or prefix each python command with poetry run (e.g. instead of python justclust/analysis/bolzano.py use poetry run python justclust/analysis/bolzano.py).

Extend to other cities

To extend the analysis to other cities follow the steps below, replacing {city} with the name of your place. Consider changing the content of the python lists and dictionaries, but try to avoid changing the variables' names, otherwise you need to consistently rename the variables in all the subsequent steps.

  1. create a directory with the city's data with mkdir -p data/raw/{city};
  2. Copy the file from Bolzano with cp justclust/paths/bolzano.py justclust/paths/{city}.py and adapt it to the new city's inputs and variables;
  3. Copy the file from Bolzano cp justclust/data/bolzano.py justclust/data/{city}.py and:
    • define the main columns for each justice dimension;
    • define the weights that you want to use in the wcols variable;
    • define the conv dictionary values to convert the local variable names into more standardized and international names;
    • define a python function that reads the files from the data/raw/{city} directory and returns a geopandas GeoDataFrame (see the sketch after this list).
  4. Copy the analysis file for Bolzano with: cp justclust/analysis/bolzano.py justclust/analysis/{city}.py and:
    • fix the imports substituting bolzano with {city}
  5. Do the analysis; the three main steps are explained in detail in the next section, but they can be summarized as follows:
    • run python justclust/analysis/{city}.py to explore the pre-processing options, define your own if you like, then select the pre_key to be used;
    • run python justclust/analysis/{city}.py to explore the clustering algorithms, define your own if you like, and then select the sel_clsts to be used;
    • run python justclust/analysis/{city}.py to generate the main reports and analysis.
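As a reference, here is a minimal sketch of what justclust/data/{city}.py could look like. The column names, weights, conversion entries, and file name are hypothetical examples, not the repository's actual values:

```python
import geopandas as gpd

# hypothetical justice-dimension columns and weights (adapt to your city)
cols = ["green_area_pct", "median_income"]
wcols = {"green_area_pct": 1.0, "median_income": 1.0}

# hypothetical mapping from local variable names to standardized names
conv = {"superficie_verde": "green_area_pct", "reddito_mediano": "median_income"}


def read_data():
    """Read the raw city data and return a geopandas GeoDataFrame."""
    # "city_data.gpkg" is a placeholder file under data/raw/{city}
    gdf = gpd.read_file("data/raw/mycity/city_data.gpkg")
    # rename the local column names to the standardized ones
    return gdf.rename(columns=conv)
```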

Usage

Under the folder justclust/analysis/ you find a python file for each CiPeL (at the moment only bolzano.py). Once the python packages are installed (see the previous section), you can launch the analysis with: python justclust/analysis/bolzano.py

The analysis process is divided in three main steps:

  • explore the main pre-processing alternatives;
  • explore the possible clustering algorithms;
  • extract and report the main features for the selected cluster.

The process requires modifying the file justclust/analysis/{city}.py with a text editor.

The file contains the procedural steps required by the analysis. To avoid repeating all the steps every time and to save processing time, the execution of the code produces a set of output files for each step; before executing the code of a step, it checks whether the file already exists: if it does, the main variables are read from the file, otherwise the code is executed.

This approach is convenient, but be aware that if you need to redo something from the beginning or from a previous step, you need to delete or move the corresponding file before re-executing the code.
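The caching pattern can be sketched as follows; compute_scores is a hypothetical stand-in for an expensive step, while the real logic lives in justclust/analysis/{city}.py:

```python
from pathlib import Path

import pandas as pd

cache = Path("report/bolzano/scores_preprocess.xlsx")


def compute_scores():
    """Hypothetical stand-in for the expensive pre-processing step."""
    return pd.DataFrame({"hopkins": [0.75]}, index=["example_pre_key"])


if cache.exists():
    # reuse the cached results instead of recomputing them
    scores = pd.read_excel(cache, index_col=0)
else:
    scores = compute_scores()
    cache.parent.mkdir(parents=True, exist_ok=True)
    scores.to_excel(cache)
```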

Cache file for each step:

  • pre-processing: report/{city}/scores_preprocess.xlsx;
  • clustering: report/{city}/cluster_all.gpkg;
  • reporting: has no cache; the outputs are generated every time.

Explore pre-processing

The analysis performs a first exploration of different data pre-processing pipelines, preparing the data for the clustering task.

All the pre-processing pipelines are defined in the get_preprocs function available in justclust/explore/preprocs.py; more than 300 pre-processing pipelines are tested.
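As an illustration, a single pipeline could look like the following scikit-learn sketch; the actual pipelines and their keys are those defined by get_preprocs:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer, RobustScaler

# toy stand-in for a city's feature matrix
rng = np.random.default_rng(0)
data = rng.lognormal(size=(100, 3))

pipe = Pipeline([
    ("scaler", RobustScaler()),     # limit the influence of outliers
    ("power", PowerTransformer()),  # make the features more Gaussian-like
])
transformed = pipe.fit_transform(data)
```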

The results of the pre-processing tests are written to an excel file in report/{city}/scores_preprocess.xlsx (e.g. report/bolzano/scores_preprocess.xlsx).

The file is used as a cache, therefore the pre-processing is not computed again but the results are loaded from the excel file; if you need to re-execute the pre-processing analysis, move/rename/delete the excel file and launch the program again.

Once the pre-processing task is executed, choose the pre-processing pipeline that you want to apply to your data before exploring the results of the different clustering algorithms: open the excel file, compare the different Hopkins metrics, and select a pre-processing pipeline of your interest. Copy the index key from the excel file into the pre_key variable in justclust/analysis/bolzano.py.
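For instance, a quick way to rank the cached pipelines before picking one; the "hopkins" column name is an assumption based on the metric described above:

```python
import pandas as pd

scores = pd.read_excel("report/bolzano/scores_preprocess.xlsx", index_col=0)
# rank the tested pipelines by their Hopkins statistic
print(scores.sort_values("hopkins", ascending=False).head(10))

# then, in justclust/analysis/bolzano.py, set the chosen index key, e.g.:
# pre_key = "<index key copied from the excel file>"
```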

Please note that you do not necessarily need to select the pre-processing pipeline with the highest score; evaluate the transformations that make sense for your data.

Explore clustering algorithms and options

After you have written the pre-processing key that you want to use and launched the program again, the pre-processing result will be read from the excel file and the program will start exploring different clustering options and algorithms. As in the previous step, all the clustering algorithms that are tested are defined by the get_algclusters function in justclust/explore/algclusters.py.

For each clustering algorithm several metrics are computed by the compute_metrics function (i.e. the Silhouette, Davies-Bouldin, and Calinski-Harabasz scores).
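For reference, these scores can be computed with scikit-learn as in the sketch below; this illustrates the metrics only, not the repository's compute_metrics implementation:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# toy data set with four well-separated groups
X, _ = make_blobs(n_samples=200, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

print("Silhouette:", silhouette_score(X, labels))
print("Davies-Bouldin:", davies_bouldin_score(X, labels))
print("Calinski-Harabasz:", calinski_harabasz_score(X, labels))
```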

To limit the number of columns in the resulting geopackage file (report/{city}/cluster_all.gpkg) it is possible to define some criteria, for instance:

  • the range of numbers of clusters considered of interest;
  • the min or max threshold values for some specific metrics (e.g. the percentage of area with an assigned cluster, a minimum Silhouette score of 0.5, etc.). See justclust/analysis/bolzano.py for an example, and the sketch below.
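A hedged sketch of such filter conditions applied to a scores table; the column names ("n_clusters", "silhouette") are assumptions used for illustration:

```python
import pandas as pd

scores = pd.read_excel("report/bolzano/scores_clusters.xlsx", index_col=0)
mask = (
    scores["n_clusters"].between(4, 10)  # range of cluster counts of interest
    & (scores["silhouette"] >= 0.5)      # minimum Silhouette score
)
selected = scores[mask]
```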

Exploring the clustering algorithms will generate several files:

  • report/{city}/cluster_all.gpkg: used as a cache, containing all the clusters that satisfied the criteria defined by the user with the filter conditions. If you need to explore different cluster algorithms again, this is the file to be moved/deleted. Moreover, several smaller (in terms of number of columns) files will be generated to have lighter files to load in the GIS environment; the files follow this nomenclature: report/{city}/cluster_{cluster model}_k{number of clusters}.gpkg.
  • report/{city}/scores_clusters.xlsx: an excel file with all the scores computed for all the algorithms tested. This file might be useful to select the criteria to be used for the identification of the clusters;
  • report/{city}/feature_importances.xlsx: an excel file with the feature importance scores for each cluster option. To keep the GPKG files easier to handle and not too heavy, the columns are split by algorithm and by number of clusters.

Select and analyze the cluster of interest

As a last step you need to define the clusters that seem most interesting and that you want to analyse and compare.

Define in justclust/analysis/bolzano.py the variable sel_clsts, which is the list of cluster ids that you want to further explore. The complete list of cluster ids is available in the generated excel file report/{city}/scores_clusters.xlsx: from the file you can copy any label value that has the selected column set to TRUE into the sel_clsts list, as in the hypothetical example below.
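For example, in justclust/analysis/bolzano.py:

```python
# Hypothetical example: the label values below are placeholders; copy the
# real ones from report/{city}/scores_clusters.xlsx (selected == TRUE).
sel_clsts = [
    "<label of a first selected cluster>",
    "<label of a second selected cluster>",
]
```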

Once you have edited the file, launching the program again will start iterating over the selected clusters, creating a dedicated folder report/{city}/{cluster label} with all the graphs and tables.

All the graphs are generated with a default DPI of 600; if you need to reduce the size, on Linux you can use the ImageMagick command line tool convert -units PixelsPerInch -density {new DPI} report/{city}/{cluster label}/*.png, and further reduce the size with OptiPNG: optipng *.png.

Map of the clusters

The figure map_{cluster label}.png shows the clusters on a GIS map.

[Figure: map with the clusters]

Cluster cardinality

The figure cardinality.png shows the number of elements per cluster.

[Figure: cluster cardinality]

Cluster values

The figures lines_{number of clusters}_{metric}.png and lines_scaled_{number of clusters}_{metric}.png show the cluster values for each feature; the scaled (from 0 to 1) version is reported here.

[Figure: scaled values per cluster]

A boxplot showing the main statistics per feature per cluster, and reporting the Feature Importance computed using a RandomForest classifier, is saved in profiles.png.

[Figure: cluster boxplot profile]

Cluster comparison

To make it easier to compare the main differences among the identified clusters, the last step produces a heatmap with the difference of the values with respect to the median value. Moreover, we provide the values both as they are and in a normalized form:

N = \frac{\mu^{c}_{f} - \overline{\mu}_{f}}{\overline{\mu}_{f}}

where \mu^{c}_{f} is the median value of feature f for cluster c, while \overline{\mu}_{f} is the median value of the selected feature over the whole population. Therefore, the normalized difference is 0 if the value of the cluster is exactly equal to the median value of the whole distribution, negative if it is below the value of the distribution, and positive if it is above.
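A worked numerical example of this normalization; the feature name and values are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "cluster": [0, 0, 1, 1],
    "income": [10.0, 12.0, 20.0, 22.0],  # hypothetical feature
})

pop_median = df["income"].median()                      # population median: 16.0
clst_median = df.groupby("cluster")["income"].median()  # medians: 11.0, 21.0
normalized = (clst_median - pop_median) / pop_median
print(normalized)  # cluster 0: -0.3125 (below), cluster 1: +0.3125 (above)
```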

The values are reported in a heatmap table with lower values in blue and higher values in red.

[Figure: normalized median difference per cluster]

The same information is also available as a graph:

[Figure: normalized difference between clusters]

License

Distributed under the Apache 2 license. See LICENSE for more information.

Contact

Pietro Zambelli - pietro.zambelli@eurac.edu

Acknowledgements

[Figure: EU emblem]
This repository has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement No. 101003757