# JustClust: JustNature Clustering

## Introduction

This repository contains the data and code used to define ecological and socio-economic status and disparity profiles at the urban level, within the activities carried out by the JustNature H2020 project.
## Repository structure

The main structure of the repository is the following:

- `data`
  - `raw`
    - `bolzano`: directory with the data of Bolzano, copied directly from Selected indicators
    - `...`: data of the other CiPeLs
- `justclust`: directory containing the Python source code used for the analyses
  - `analysis`
    - `bolzano.py`: file defining the whole analysis for the municipality of Bolzano
    - `...`: add a file for each CiPeL
  - `data`
    - `bolzano.py`: variables and functions to read Bolzano's data
    - `...`: add a file for the other CiPeLs
  - `explore`
    - `algclusters.py`: defines the clustering algorithms to be explored
    - `preprocs.py`: defines the pre-processing pipelines to be explored
  - `paths`
    - `bolzano.py`: file defining all the Bolzano-related paths
    - `...`: add a file for the other CiPeLs
  - `hopkins.py`: file containing the function to compute the Hopkins statistic, used to measure the cluster tendency of a data set
  - `plots.py`: collection of functions used to plot data
  - `utils.py`: a collection of functions to work with the data
- `report`
  - `bolzano`: directory collecting the outputs produced for Bolzano
- `LICENSE`
- `poetry.lock`
- `poetry.toml`
- `pyproject.toml`
- `README.md`
## Install

The Python packages are managed by [poetry](https://python-poetry.org/), so ensure that poetry is properly installed on your machine.
Then:

- clone the repository: `git clone git@gitlab.inf.unibz.it:URS/justnature/clustering.git`;
- move inside the newly created folder: `cd clustering`;
- install the required Python packages with: `poetry install`;
- enable the virtualenv with `poetry shell`, or prefix each command with `poetry run`: instead of `python justclust/analysis/bolzano.py`, use `poetry run python justclust/analysis/bolzano.py`.
## Extend to other cities

To extend the analysis to other cities, follow the steps below, replacing `{city}` with the name of your place. Consider changing the content of the Python lists and dictionaries, but try to avoid changing the variables' names; otherwise you need to rename the variables consistently in all the subsequent steps.

- create a directory for the city's data with: `mkdir -p data/raw/{city}`;
- copy the paths file from Bolzano with `cp justclust/paths/bolzano.py justclust/paths/{city}.py` and adapt it to the new city's inputs and variables;
- copy the data file from Bolzano with `cp justclust/data/bolzano.py justclust/data/{city}.py` and (see the sketch after this list):
  - define the main columns for each justice dimension;
  - define the weights that you want to use in the `wcols` variable;
  - define the `conv` dictionary values to convert local variable names into more standardized and international names;
  - define a Python function that reads the files from the `data/raw/{city}` directory and returns a geopandas `GeoDataFrame`;
- copy the analysis file from Bolzano with `cp justclust/analysis/bolzano.py justclust/analysis/{city}.py` and fix the imports, substituting `bolzano` with `{city}`;
- do the analysis; the three main steps are explained in detail in the next section, but they can be summarized as follows:
  - run `python justclust/analysis/{city}.py` to explore the pre-processing options, define your own if you like, then select the `pre_key` to be used;
  - run `python justclust/analysis/{city}.py` to explore the clustering algorithms, define your own if you like, and then select the `sel_clsts` to be used;
  - run `python justclust/analysis/{city}.py` to generate the main reports and analysis.
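As a reference, here is a minimal sketch of what `justclust/data/{city}.py` could look like; the column names, weights, and file name below are purely hypothetical placeholders, not the real schema used for Bolzano:

```python
# Hypothetical sketch of justclust/data/{city}.py (here a fictional "newcity").
# All column names, weights and file names are placeholders to be adapted.
import geopandas as gpd

# main columns for each justice dimension (placeholder names)
eco_cols = ["green_area_pct", "air_quality_idx"]
soc_cols = ["income_median", "unemployment_rate"]

# weights to be used for the analysis
wcols = {col: 1.0 for col in eco_cols + soc_cols}

# convert local variable names into more standardized and international names
conv = {
    "sup_verde_pct": "green_area_pct",
    "reddito_mediano": "income_median",
}

def read_data() -> gpd.GeoDataFrame:
    """Read the raw data of the city and return a geopandas GeoDataFrame."""
    gdf = gpd.read_file("data/raw/newcity/indicators.gpkg")
    return gdf.rename(columns=conv)
```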
## Usage

Under the folder `justclust/analysis/` you find a Python file for each CiPeL (at the moment only `bolzano.py`). Once the Python packages are installed (see the previous section), you can launch the analysis with:
`python justclust/analysis/bolzano.py`
The analysis process is divided into three main steps:

- explore the main pre-processing alternatives;
- explore the possible clustering algorithms;
- extract and report the main features for the selected clusters.

The process requires modifying the file `justclust/analysis/{city}.py` with a text editor.
The file contains the procedural steps required by the analysis. To avoid repeating all the steps every time and to save processing time, the code produces a set of output files for each step; before executing a step, it checks whether the corresponding file already exists: if it does, the main variables are read from the file, otherwise the code is executed.
This caching approach is convenient, but be aware that if you need to redo something from the beginning or from some previous step, you have to delete or move the corresponding file before re-executing the code.
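A minimal sketch of this caching pattern, with hypothetical file and function names:

```python
# Hypothetical sketch of the per-step caching pattern; names are placeholders.
from pathlib import Path

import pandas as pd

def cached_step(cache: Path, compute) -> pd.DataFrame:
    """Read the step result from `cache` if it exists, otherwise compute and save it."""
    if cache.exists():
        return pd.read_excel(cache, index_col=0)
    result = compute()
    result.to_excel(cache)
    return result

# e.g. scores = cached_step(Path("report/bolzano/scores_preprocess.xlsx"), run_preprocessing)
```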
Cache file for each step:

- pre-processing: `report/{city}/scores_preprocess.xlsx`;
- clustering: `report/{city}/cluster_all.gpkg`;
- reporting: has no cache, the outputs are generated every time.
## Explore pre-processing

The analysis first explores different data pre-processing options, preparing the data for the clustering task.
All the pre-processing pipelines are defined in the `get_preprocs` function available in `justclust/explore/preprocs.py`; more than 300 pre-processing pipelines are tested.
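As an illustration, a single pipeline could look like the following scikit-learn sketch; the actual steps and their combinations are defined in `get_preprocs`, and the ones below are only assumed examples:

```python
# Hypothetical example of one pre-processing pipeline; the real combinations
# are generated by get_preprocs in justclust/explore/preprocs.py.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer, StandardScaler

pipe = Pipeline(
    steps=[
        ("power", PowerTransformer(method="yeo-johnson")),  # reduce skewness
        ("scale", StandardScaler()),                        # zero mean, unit variance
    ]
)
# transformed = pipe.fit_transform(df[selected_cols])
```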
The results of the pre-processing tests are written to an Excel file in `report/{city}/scores_preprocess.xlsx` (e.g. `report/bolzano/scores_preprocess.xlsx`).
The file is used as a cache: the pre-processing is not computed again, its results are loaded from the Excel file. If you need to re-execute the pre-processing analysis, move/rename/delete the Excel file and launch the program again.
Once the pre-processing task has been executed, choose the pre-processing pipeline that you want to apply to your data before exploring the results of the different clustering algorithms: open the Excel file, compare the different Hopkins metrics, and select a pre-processing pipeline of your interest. Copy its index key from the Excel file into the `pre_key` variable in `justclust/analysis/bolzano.py`.
Please note that you do not necessarily need to select the pre-processing pipeline with the highest score; evaluate which transformations make sense for your data.
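For reference, the Hopkins statistic compares the distances from random points to their nearest data point with the distances from sampled data points to their nearest neighbour: values close to 1 indicate a high cluster tendency, values around 0.5 a random distribution. A minimal sketch of the idea, not necessarily the implementation in `justclust/hopkins.py`:

```python
# Minimal sketch of the Hopkins statistic; justclust/hopkins.py may differ.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(X: np.ndarray, m: int = 50, seed: int = 0) -> float:
    """Return the Hopkins statistic of X (~1 clustered, ~0.5 random)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    m = min(m, n - 1)
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    # distances from m uniform random points in the data bounding box
    U = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, X.shape[1]))
    u_dist = nn.kneighbors(U, n_neighbors=1)[0].ravel()
    # distances from m sampled data points to their nearest *other* data point
    idx = rng.choice(n, size=m, replace=False)
    w_dist = nn.kneighbors(X[idx], n_neighbors=2)[0][:, 1]
    return u_dist.sum() / (u_dist.sum() + w_dist.sum())
```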
## Explore clustering algorithms and options

Once you have written the pre-processing key that you want to use and launched the program again, the pre-processing result is read from the Excel file and the program starts exploring different clustering options and algorithms.
As in the previous step, all the clustering algorithms to be tested are defined by the `get_algclusters` function in `justclust/explore/algclusters.py`.
For each clustering algorithm several metrics are computed by the `compute_metrics` function (i.e. Silhouette score, Davies-Bouldin score, Calinski-Harabasz score).
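All three metrics are available in scikit-learn; here is a sketch of what such a function could compute (the actual `compute_metrics` may differ):

```python
# Hypothetical sketch; the real compute_metrics in justclust/explore/algclusters.py may differ.
import numpy as np
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

def compute_metrics(X: np.ndarray, labels: np.ndarray) -> dict:
    """Compute internal cluster-validity scores for a clustering result."""
    return {
        "silhouette": silhouette_score(X, labels),                 # in [-1, 1], higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),         # lower is better
        "calinski_harabasz": calinski_harabasz_score(X, labels),   # higher is better
    }
```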
To limit the number of columns in the resulting geopackage file (`report/{city}/cluster_all.gpkg`) it is possible to define some criteria, for instance:

- the range of numbers of clusters considered of interest;
- the min or max threshold values for some specific metrics (e.g. the percentage of the area with an assigned cluster, a minimum Silhouette score of 0.5, etc.). See `justclust/analysis/bolzano.py` for an example, and the sketch below.
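A hypothetical sketch of such filter conditions (variable names and thresholds are assumptions, not the ones used in `justclust/analysis/bolzano.py`):

```python
# Hypothetical filter criteria; adapt names and thresholds to your analysis.
def keep_cluster(metrics: dict) -> bool:
    """Return True if a clustering result satisfies the selection criteria."""
    return (
        4 <= metrics["n_clusters"] <= 12       # range of cluster numbers of interest
        and metrics["silhouette"] >= 0.5       # minimum Silhouette score
        and metrics["pct_assigned"] >= 90.0    # % of area with an assigned cluster
    )
```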
Exploring the clustering algorithms will generate several files:

- `report/{city}/cluster_all.gpkg`: used as a cache, containing all the clusters that satisfied the criteria defined by the user with the filter conditions. If you need to explore different clustering algorithms again, this is the file to be moved/deleted. Moreover, several smaller (in terms of number of columns) files are generated, to have lighter files to load in a GIS environment; these files follow the nomenclature `report/{city}/cluster_{cluster model}_k{number of clusters}.gpkg`.
- `report/{city}/scores_clusters.xlsx`: an Excel file with all the scores computed for all the tested algorithms. This file can be useful to select the criteria to be used for the identification of the clusters.
- `report/{city}/feature_importances.xlsx`: an Excel file with the feature importance score for each cluster option. To make the GPKG easier to handle and not too heavy, the columns are split by algorithm and by number of clusters.
## Select and analyze the clusters of interest

As a last step, you need to define the clusters that seem more interesting and that you want to analyse and compare.
Define in `justclust/analysis/bolzano.py` the variable `sel_clsts`, a list of the cluster ids that you want to explore further. The complete list of cluster ids is available in the generated Excel file `report/{city}/scores_clusters.xlsx`: from the file you can copy any `label` value that has the `selected` column set to `TRUE` and copy it into the `sel_clsts` list, as in the example below.
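For example (the labels below are purely hypothetical; copy the real ones from your `scores_clusters.xlsx`):

```python
# Hypothetical cluster labels; use the `label` values from report/{city}/scores_clusters.xlsx.
sel_clsts = [
    "kmeans__k05",   # placeholder label
    "hdbscan__k07",  # placeholder label
]
```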
Once you have edited the file, launching the program again will start iterating over the selected clusters, creating a dedicated folder in `report/{city}/{cluster label}` with all the graphs and tables.
By default, all the graphs are generated with a DPI of 600. If you need to reduce their size, on Linux you can use the ImageMagick command-line tool with `convert -units PixelsPerInch -density {new DPI} report/{city}/{cluster label}/*.png`, and further reduce the size with OptiPNG using `optipng *.png`.
### Map of the clusters

The figure `map_{cluster label}.png` shows the clusters on a GIS map.
### Cluster cardinality

The figure `cardinality.png` shows the number of elements per cluster.
### Cluster values

The figures `lines_{number of clusters}_{metric}.png` and `lines_scaled_{number of clusters}_{metric}.png` show the cluster values for each feature; here the scaled (from 0 to 1) version is reported.
A boxplot showing the main statistics per feature per cluster, and reporting the feature importance computed with a RandomForest classifier, is saved in `profiles.png`.
### Cluster comparison

To make it easier to compare the main differences among the identified clusters, the last step produces a heatmap with the differences of the values with respect to the median value. Moreover, we provide the values both as they are and in a normalized form:

N = \frac{\mu^{c}_{f} - \overline{\mu}_{f}}{\overline{\mu}_{f}}

where \mu^{c}_{f} is the median value of feature f in cluster c, while \overline{\mu}_{f} is the median value of feature f over the whole population. Therefore, the normalized difference is 0 if the value of the cluster is exactly equal to the median value of the whole distribution, negative if it is below, and positive if it is above. A small worked example follows.
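A minimal sketch of this computation with pandas, assuming a DataFrame with a `cluster` column and one (hypothetical) feature column:

```python
# Worked example of the normalized difference; data and column names are made up.
import pandas as pd

df = pd.DataFrame({
    "cluster": ["a", "a", "b", "b"],
    "green_area_pct": [10.0, 12.0, 30.0, 34.0],
})

cluster_median = df.groupby("cluster").median()       # mu^c_f: median per feature per cluster
overall_median = df.drop(columns="cluster").median()  # median over the whole population
normalized = (cluster_median - overall_median) / overall_median
print(normalized)  # 0 -> equal to the population median, negative below, positive above
```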
The values are reported in a heatmap table with lower values in blue and higher values in red.
The same information is also available as a graph.
## License

Distributed under the Apache 2 license. See `LICENSE` for more information.

## Contact

Pietro Zambelli - pietro.zambelli@eurac.edu