JustClust: JustNature Clustering
Introduction
This repository contains the data and code used to define ecological and socio-economic status and disparity profiles at the urban level within the activities carried out by the JustNature H2020 project.
Repository structure
The main structure of the repository is as follows:
- `data`
  - `raw`
    - `bolzano`: directory with the data of Bolzano, copied directly from Selected indicators
    - `...`: data of the other CiPeLs
- `justclust`: directory containing the Python source code used for the analyses
  - `alg`
  - `analysis`
    - `bolzano.py`: file defining the whole analysis for the municipality of Bolzano
    - `...`: add a file for each CiPeL
  - `data`
    - `bolzano.py`: variables and functions to read Bolzano's data
    - `...`: add a file for the other CiPeLs
  - `explore`
    - `algclusters.py`: defines the clustering algorithms to be explored
    - `preprocs.py`: defines the pre-processing pipelines to be tested
  - `paths`
    - `bolzano.py`: file defining all the Bolzano-related paths
    - `...`: add a file for the other CiPeLs
  - `hopkins.py`: file containing the function to compute the Hopkins statistic, used to measure the cluster tendency of a data set (see the sketch after this list)
  - `plots.py`: collection of functions used to plot data
  - `utils.py`: collection of functions to work with the data
- `report`
  - `bolzano`: reports generated for Bolzano
- `LICENSE`
- `poetry.lock`
- `poetry.toml`
- `pyproject.toml`
- `README.md`
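As referenced in the tree above, here is a hedged sketch of how the Hopkins statistic can be computed; the actual implementation in `justclust/hopkins.py` may differ in sampling strategy and parameters:

```python
# Sketch of the Hopkins statistic: values close to 1 suggest a high cluster
# tendency, values around 0.5 suggest uniformly random data.
from typing import Optional

import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(X: np.ndarray, m: Optional[int] = None, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m or max(1, n // 10)  # size of the random sample
    nn = NearestNeighbors(n_neighbors=2).fit(X)

    # u: distances from m uniform random points to their nearest data point
    U = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = nn.kneighbors(U, n_neighbors=1)[0].ravel()

    # w: distances from m sampled data points to their nearest *other* data point
    # (the first neighbor of a data point is the point itself, at distance 0)
    idx = rng.choice(n, m, replace=False)
    w = nn.kneighbors(X[idx], n_neighbors=2)[0][:, 1]

    return float(u.sum() / (u.sum() + w.sum()))
```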
How to install and run the code
The Python packages are managed by [poetry](https://python-poetry.org/), so ensure that poetry is properly installed on your machine.
Then:

- clone the repository:

  ```
  git clone git@gitlab.inf.unibz.it:URS/justnature/clustering.git
  ```

- move inside the newly created folder:

  ```
  cd clustering
  ```

- install the required Python packages with:

  ```
  poetry install
  ```

- enable the virtualenv with:

  ```
  poetry shell
  ```

  or execute the Python commands through poetry: instead of `python justclust/analysis/bolzano.py` use `poetry run python justclust/analysis/bolzano.py`.
How to use
Under the folder `justclust/analysis/` you find a Python file for each CiPeL (at the moment only `bolzano.py`). Once the Python packages are installed (see the previous section), you can launch the analysis with:

```
python justclust/analysis/bolzano.py
```
The analysis first explores different data pre-processing options to prepare the data for the clustering task. All the pre-processing pipelines are defined by the `get_preprocs` function available in `justclust/explore/preprocs.py`; more than 300 pre-processing pipelines are tested.
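Such a family of pipelines can be generated by combining scalers and transformers, as in the following sketch; the function name `get_example_preprocs` and the chosen steps are illustrative, not the repository's actual definitions:

```python
# Illustrative enumeration of pre-processing pipelines by combining
# scalers and transformers; see justclust/explore/preprocs.py for the
# repository's actual get_preprocs definitions.
from itertools import product

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    MinMaxScaler,
    PowerTransformer,
    QuantileTransformer,
    RobustScaler,
    StandardScaler,
)

def get_example_preprocs():
    """Return a dict of {key: Pipeline} combining scalers and transformers."""
    scalers = {
        "minmax": MinMaxScaler(),
        "robust": RobustScaler(),
        "standard": StandardScaler(),
    }
    transformers = {
        "none": None,
        "power": PowerTransformer(),
        "quantile": QuantileTransformer(n_quantiles=50),
    }
    preprocs = {}
    for (skey, scaler), (tkey, transf) in product(scalers.items(), transformers.items()):
        steps = [("scaler", scaler)]
        if transf is not None:
            steps.append(("transformer", transf))
        preprocs[f"{skey}_{tkey}"] = Pipeline(steps)
    return preprocs
```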
The results of the pre-processing tests are written to an Excel file in `report/{city}/scores_preprocess.xlsx` (e.g. `report/bolzano/scores_preprocess.xlsx`).
The file is used as a cache: the pre-processing is not computed again, its results are loaded from the Excel file. If you need to re-execute the pre-processing analysis, move/rename/delete the Excel file and launch the program again.
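The caching behaviour boils down to a check on the file's existence, roughly as in this sketch (`run_preprocessing_exploration` is a stand-in, not the repository's actual function):

```python
# Sketch of the file-based caching described above.
from pathlib import Path

import pandas as pd

def run_preprocessing_exploration() -> pd.DataFrame:
    # Placeholder for the expensive exploration of the pre-processing pipelines.
    return pd.DataFrame({"hopkins": [0.75]}, index=["example_pipeline"])

scores_file = Path("report/bolzano/scores_preprocess.xlsx")

if scores_file.exists():
    scores = pd.read_excel(scores_file, index_col=0)  # reuse cached results
else:
    scores = run_preprocessing_exploration()
    scores_file.parent.mkdir(parents=True, exist_ok=True)
    scores.to_excel(scores_file)
```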
Once the pre-processing task is executed, choose a pre-processing pipeline that you want to apply to your data before exploring the results of the different clustering algorithms: open the Excel file, compare the Hopkins metrics, and select a pre-processing pipeline of interest. Copy the index key from the Excel file into the `pre_key` variable in `justclust/analysis/bolzano.py`.
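The scores can also be inspected programmatically; in this sketch the column name `hopkins` is an assumption, so check it against the actual spreadsheet headers:

```python
# Illustrative inspection of the pre-processing scores.
import pandas as pd

scores = pd.read_excel("report/bolzano/scores_preprocess.xlsx", index_col=0)

# A Hopkins statistic close to 1 indicates a high cluster tendency.
print(scores.sort_values("hopkins", ascending=False).head())

# The chosen index key is what goes into `pre_key` in justclust/analysis/bolzano.py.
pre_key = scores["hopkins"].idxmax()
```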
If you launch the program again, the pre-processing result will be read from the Excel file and the program will start exploring different clustering options and algorithms. As in the previous step, all the clustering algorithms that are tested are defined by the `get_algclusters` function in `justclust/explore/algclusters.py`; for each clustering algorithm several metrics are computed by the `compute_metrics` function (i.e. Silhouette score, Davies-Bouldin score, Calinski-Harabasz score).
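These three metrics are the standard internal validation scores available in scikit-learn; here is a minimal sketch on synthetic data (the repository's `compute_metrics` may differ in detail):

```python
# Minimal sketch of the three internal validation metrics named above,
# computed on synthetic data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

print("Silhouette:        ", silhouette_score(X, labels))         # higher is better
print("Davies-Bouldin:    ", davies_bouldin_score(X, labels))     # lower is better
print("Calinski-Harabasz: ", calinski_harabasz_score(X, labels))  # higher is better
```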
To limit the number of columns in the resulting geopackage file, it is possible to define some criteria, for instance the range of valid numbers of clusters to be identified, or thresholds on the minimum or maximum metric values. To keep the GPKG easier to handle and not too heavy, the columns are split by algorithm and by number of clusters.
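A sketch of such filtering criteria, with illustrative names, thresholds, and result structure that are not the repository's actual defaults:

```python
# Illustrative filtering of clustering results before export; names,
# thresholds, and the result structure are assumptions for this sketch.
results = [
    {"algorithm": "kmeans",       "n_clusters": 4,  "silhouette": 0.31},
    {"algorithm": "kmeans",       "n_clusters": 12, "silhouette": 0.18},
    {"algorithm": "hierarchical", "n_clusters": 6,  "silhouette": 0.27},
]

valid_k = range(3, 11)   # acceptable range of numbers of clusters
min_silhouette = 0.25    # discard weak partitions

selected = [
    r for r in results
    if r["n_clusters"] in valid_k and r["silhouette"] >= min_silhouette
]
print(selected)  # -> kmeans with k=4 and hierarchical with k=6
```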