# JustClust: JustNature Clustering
## Introduction
This repository contains the data and code used to define ecological and socio-economic status and disparity profiles at the urban level, within the activities carried out by the [JustNature H2020 project](https://justnatureproject.eu/).
The main structure of the repository is as follows:
- README.md
## How to install and run the code
### Install
The python packages are managed by [`poetry`](https://python-poetry.org/), so ensure that `poetry` is properly installed on your machine.
Then:
3. install the required python packages with: `poetry install`;
4. enable the virtualenv with `poetry shell`, or prefix the python commands with `poetry run` (e.g. instead of `python justclust/analysis/bolzano.py` use `poetry run python justclust/analysis/bolzano.py`).
## How to use
### Extend to other cities
To extend the analysis to other cities, follow the steps below, replacing `{city}` with the name of your place. Adapt the content of the python lists and dictionaries as needed, but try to avoid changing the variables' names; otherwise you need to rename them consistently in all the subsequent files.
1. create a directory with the city's data with `mkdir -p data/raw/{city}`;
2. Copy the file from Bolzano `cp justclust/paths/bolzano.py justclust/paths/{city}.py` and adapt it to the new city's inputs and variables;
3. Copy the file from Bolzano `cp justclust/data/bolzano.py justclust/data/{city}.py` (a hypothetical sketch of such a module is shown after this list) and:
    * define the main columns for each justice dimension;
    * define the weights that you want to use in the `wcols` variable;
    * define the `conv` dictionary values to convert the local variable names into more standardized, international names;
    * define a python function that reads the files from the `data/raw/{city}` directory and returns a geopandas `GeoDataFrame`.
4. Copy the analysis file for Bolzano with `cp justclust/analysis/bolzano.py justclust/analysis/{city}.py` and:
    * fix the imports, substituting `bolzano` with `{city}`.
5. Run the analysis; the three main steps are explained in detail in the next sections and can be summarized as follows:
    * run `python justclust/analysis/{city}.py` to explore the pre-processing options, define your own if you like, then select the `pre_key` to be used;
    * run `python justclust/analysis/{city}.py` again to explore the clustering algorithms, define your own if you like, and then select the `sel_clsts` to be used;
    * run `python justclust/analysis/{city}.py` once more to generate the main reports and analyses.
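As a reference for step 3, here is a minimal, hypothetical sketch of what `justclust/data/{city}.py` could contain; every column name, weight, and file name below is purely illustrative and must be adapted to your city's data (the repository's actual variables are the ones in `justclust/data/bolzano.py`):

```python
# Illustrative sketch of a justclust/data/{city}.py module.
# All column names, weights, and file names are hypothetical.
import geopandas as gpd

# main columns for each justice dimension (hypothetical names)
eco_cols = ["green_area_pct", "tree_canopy_pct"]
soc_cols = ["median_income", "unemployment_rate"]

# weights to be used for the clustering task (hypothetical structure)
wcols = {"green_area_pct": 1.0, "median_income": 2.0}

# convert local variable names into more standardized, international names
conv = {
    "perc_verde": "green_area_pct",
    "reddito_mediano": "median_income",
}


def read_data() -> gpd.GeoDataFrame:
    """Read the raw city files and return a single GeoDataFrame."""
    gdf = gpd.read_file("data/raw/{city}/statistical_units.gpkg")
    return gdf.rename(columns=conv)
```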
## Usage
Under the folder [`justclust/analysis/`](justclust/analysis/) you find a python file for each CiPeL (at the moment only [`bolzano.py`](justclust/analysis/bolzano.py)). Once the python packages are installed (see previous section), you can launch the analysis with:
`python justclust/analysis/bolzano.py`
The analysis process is divided into three main steps:
* explore the main pre-processing alternatives;
* explore the possible clustering algorithms;
* extract and report the main features for the selected clusters.
The process requires editing the file `justclust/analysis/{city}.py` with a text editor.
The file contains the procedural steps required by the analysis. To avoid repeating all the steps every time and to save processing time, each step writes a set of output files; before executing a step, the code checks whether the corresponding file already exists: if it does, the main variables are read from the file, otherwise the code is executed (a minimal sketch of this pattern is shown after the list of cache files below).
This approach is convenient, but be aware that if you need to redo something from the beginning or from a previous step, you have to delete or move the corresponding file before re-executing the code.
Cache file for each step:
* pre-processing: `report/{city}/scores_preprocess.xlsx`;
* clustering: `report/{city}/cluster_all.gpkg`;
* reporting: has no cache and generates the outputs every time.
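A minimal sketch of this caching pattern, assuming pandas and an excel cache (`explore_preprocessing` is a hypothetical stand-in for the repository's actual step, not its real code):

```python
from pathlib import Path

import pandas as pd

scores_file = Path("report/bolzano/scores_preprocess.xlsx")

if scores_file.exists():
    # cache hit: read the scores computed by a previous run
    scores = pd.read_excel(scores_file, index_col=0)
else:
    # cache miss: run the (expensive) exploration and store the result
    scores = explore_preprocessing()  # hypothetical helper
    scores.to_excel(scores_file)
```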
### Explore pre-processing
The analysis performs a first exploration of different data pre-processing pipelines to prepare the data for the clustering task.
All the pre-processing pipelines are defined in the `get_preprocs` function available in [`justclust/explore/preprocs.py`](justclust/explore/preprocs.py); more than 300 pre-processing pipelines are tested.
The results of the pre-processing tests are written to an excel file in `report/{city}/scores_preprocess.xlsx` (e.g. `report/bolzano/scores_preprocess.xlsx`).
The file is used as a cache, therefore the pre-processing is not computed again but read from the existing file on subsequent runs.
Once the pre-processing task has been executed, choose a pre-processing pipeline to apply to your data before exploring the results of the different clustering algorithms: open the excel file, compare the different hopkins metrics, and select a pre-processing pipeline of your interest. Copy the index key from the excel file into the `pre_key` variable in [`justclust/analysis/bolzano.py`](justclust/analysis/bolzano.py).
Please note that you do not necessarily need to select the pre-processing pipeline with the highest score; evaluate the transformations that make sense for your data.
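For example, you could inspect the spreadsheet with pandas before picking the key; the `hopkins` column name here is an assumption, check the actual header in your file:

```python
import pandas as pd

scores = pd.read_excel("report/bolzano/scores_preprocess.xlsx", index_col=0)

# rank the pipelines by their hopkins statistic (assumed column name)
print(scores.sort_values("hopkins", ascending=False).head(10))

# then copy the chosen index key into justclust/analysis/bolzano.py:
# pre_key = "<index key copied from the excel file>"
```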
### Explore clustering algorithms and options
After you have written the pre-processing key that you want to use and you launch the program again, the pre-processing result is read from the excel file and the program starts exploring different clustering options and algorithms.
As in the previous step, all the clustering algorithms that are tested are defined by the `get_algclusters` function in [`justclust/explore/algclusters.py`](justclust/explore/algclusters.py).
For each clustering algorithm, several metrics are computed by the [`compute_metrics`](justclust/explore/algclusters.py) function (i.e. the Silhouette, Davies-Bouldin, and Calinski-Harabasz scores).
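These are the standard scikit-learn metrics; here is a self-contained sketch of how they are computed on a clustering result (an illustration of the metrics themselves, not the repository's `compute_metrics` code):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# toy stand-in for the pre-processed data
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

print("silhouette:", silhouette_score(X, labels))          # higher is better
print("davies-bouldin:", davies_bouldin_score(X, labels))  # lower is better
print("calinski-harabasz:", calinski_harabasz_score(X, labels))
```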
To limit the number of columns in the resulting geopackage file (`report/{city}/cluster_all.gpkg`), it is possible to define some criteria, for instance:
* the range of numbers of clusters considered of interest;
* the min or max threshold values for some specific metrics (e.g. the percentage of area with an assigned cluster, a minimum Silhouette score of 0.5, etc.). See `justclust/analysis/bolzano.py` for an example, and the sketch below.
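As an illustration, such criteria amount to boolean conditions on the scores table; the column names below are assumptions, the real filter conditions are those in `justclust/analysis/bolzano.py`:

```python
import pandas as pd

scores = pd.read_excel("report/bolzano/scores_clusters.xlsx", index_col=0)

# hypothetical column names: keep clusterings with 2 to 12 clusters,
# a minimum silhouette score, and most of the area assigned to a cluster
mask = (
    scores["n_clusters"].between(2, 12)
    & (scores["silhouette"] >= 0.5)
    & (scores["perc_area"] >= 80.0)
)
print(scores.loc[mask])
```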
Exploring the clustering algorithms will generate several files:
* `report/{city}/cluster_all.gpkg`, used as a cache, containing all the clusters that satisfied the criteria defined by the user with the filter conditions. If you need to explore different clustering algorithms again, this is the file to be moved/deleted. Moreover, several smaller files (in terms of number of columns) are generated, so that lighter files can be loaded in the GIS environment; these files follow the nomenclature `report/{city}/cluster_{cluster model}_k{number of clusters}.gpkg`;
* `report/{city}/scores_clusters.xlsx`, an excel file with all the scores computed for all the algorithms tested. This file may be useful to select the criteria used to identify the clusters;
* `report/{city}/feature_importances.xlsx`, an excel file with the feature importance scores for each cluster option.
To make the GPKG files easier to handle and not too heavy, the columns are split by algorithm and by number of clusters.
### Select and analyze the cluster of interest
As a last step, you need to define the clusters that seem most interesting to analyse and compare.
Define in `justclust/analysis/bolzano.py` the variable `sel_clsts`, a list of the cluster ids that you want to explore further. The complete list of cluster ids is available in the generated excel file `report/{city}/scores_clusters.xlsx`: from the file you can copy any `label` value that has the `selected` column set to `TRUE` into the `sel_clsts` list.
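A small sketch of how the candidate labels can be listed with pandas (the example label in the comment is the one used for the figures below):

```python
import pandas as pd

scores = pd.read_excel("report/bolzano/scores_clusters.xlsx")

# labels of the clusterings that satisfied the filter conditions,
# assuming the `selected` column is read as a boolean
candidates = scores.loc[scores["selected"], "label"].tolist()
print(candidates)

# copy the labels of interest into justclust/analysis/bolzano.py, e.g.:
# sel_clsts = ["hdbscan__mcs-10_ms-05_m-sqeuclidean_csm-eom"]
```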
Once you have edited the file, launching the program again will start iterating over the selected clusters, creating a dedicated folder in `report/{city}/{cluster label}` with all the graphs and tables.
By default, all the graphs are generated with a DPI of 600. If you need to reduce their size, on linux you can use the [ImageMagick](https://imagemagick.org/) command line tool `convert -units PixelsPerInch -density {new DPI} report/{city}/{cluster label}/*.png`, and further reduce the size with [OptiPNG](https://optipng.sourceforge.net/) using `optipng *.png`.
#### Map of the clusters
The figure `map_{cluster label}.png` shows the clusters on a GIS map.
![Map with the clusters](report/example/hdbscan__mcs-10_ms-05_m-sqeuclidean_csm-eom/map_hdbscan__mcs-10_ms-05_m-sqeuclidean_csm-eom.png)
#### Cluster cardinality
The figure `cardinality.png` shows the number of elements per cluster.
![Cluster cardinality](report/example/hdbscan__mcs-10_ms-05_m-sqeuclidean_csm-eom/cardinality.png)
#### Cluster values
The figures `lines_{number of clusters}_{metric}.png` and `lines_scaled_{number of clusters}_{metric}.png` show the cluster values for each feature; the scaled (from 0 to 1) version is reported here.
![Scaled values per cluster](report/example/hdbscan__mcs-10_ms-05_m-sqeuclidean_csm-eom/lines_scaled_9_m-sqeuclidean.png)
A boxplot showing the main statistics per feature per cluster, together with the feature importance computed with a RandomForest classifier, is saved in `profiles.png`.
![Cluster boxplot profile](report/example/hdbscan__mcs-10_ms-05_m-sqeuclidean_csm-eom/profiles.png)
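The feature importance shown in the boxplot follows the usual scikit-learn pattern: fit a classifier that predicts the cluster labels from the features and read its impurity-based importances. A minimal sketch of the idea (not the repository's exact code):

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

# toy stand-in: X are the pre-processed features, y the cluster labels
X, y = make_blobs(n_samples=300, centers=4, n_features=5, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
for i, importance in enumerate(rf.feature_importances_):
    print(f"feature {i}: {importance:.3f}")
```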
#### Cluster comparison
To make it easier to compare the main differences among the identified clusters, the last step produces a heatmap with the difference of each value with respect to the median value. The values are provided both as they are and in a normalized form:
$N = \frac{\mu^{c}_{f} - \overline{\mu}_{f}}{\overline{\mu}_{f}}$
where $\mu^{c}_{f}$ is the median value of feature $f$ for cluster $c$, and $\overline{\mu}_{f}$ is the median value of feature $f$ over the whole population. Therefore, the normalized difference is 0 if the value of the cluster is exactly equal to the median value of the whole distribution, negative if it is below, and positive if it is above.
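In pandas terms, the computation amounts to a groupby over the cluster labels; a sketch with toy data and hypothetical feature names:

```python
import pandas as pd

# toy stand-in: one row per spatial unit, plus the cluster assignment
df = pd.DataFrame(
    {
        "green_area_pct": [10.0, 12.0, 30.0, 35.0],
        "median_income": [20.0, 22.0, 40.0, 38.0],
        "cluster": [0, 0, 1, 1],
    }
)

cluster_medians = df.groupby("cluster").median()          # mu^c_f
population_medians = df.drop(columns="cluster").median()  # mu-bar_f

# 0 where a cluster matches the population median,
# negative below it, positive above it
norm_diff = (cluster_medians - population_medians) / population_medians
print(norm_diff)
```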
The values are reported in a heatmap table with lower values in blue and higher values in red.
![Normalized median difference per cluster](report/example/hdbscan__mcs-10_ms-05_m-sqeuclidean_csm-eom/comparison_normal_median.png)
The same information is also available as a graph:
![Normalized difference between clusters](report/example/hdbscan__mcs-10_ms-05_m-sqeuclidean_csm-eom/norm_comparison_median.png)
## License
Distributed under the Apache 2 license. See [`LICENSE`](LICENSE) for more information.
## Contact
Pietro Zambelli - pietro.zambelli@eurac.edu
## Acknowledgements
<div class="row">
<div class="col-md-4" markdown="1">
<img src="https://ec.europa.eu/research/participants/docs/h2020-funding-guide/imgs/normal-reproduction-low-resolution.jpg" alt="EU emblem" width="50"/>
</div>
<div class="col-md-8" markdown="1">
This repository has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement No. <a href="https://cordis.europa.eu/project/id/101003757">101003757</a>
</div>
</div>