Commit cda56a11 authored by npedot

refactors appendices

parent 190dff4d
......@@ -14,6 +14,7 @@ ltnw.constant('b',[1, 17])
ltnw.constant('c',[1, 18])
ltnw.constant('d',[1, 22])
ltnw.constant('e',[1, 99])
ltnw.constant('f',[1, ])
ltnw.constant('maggiorenne',[1, 18])
ltnw.constant('roma',[12.,41.])
......@@ -32,6 +33,9 @@ def _dist(x,y):
ltnw.predicate("cleq",2,_crisp_leq)
ltnw.predicate("close",2,_dist)
print(ltnw.ask("eta(f)"))
print(ltnw.ask("cleq(eta(maggiorenne),eta(a))"))
print(ltnw.ask("cleq(eta(maggiorenne),eta(b))"))
print(ltnw.ask("cleq(eta(maggiorenne),eta(c))"))
......
# SEMINT APPENDICES
There are several kinds of dirty data and applicable cleaning activities, for example [http://wp.sigmod.org/?p=2288]:
* missing values due to ignorance
* out-of-range values due to incorrect input
* integrity-constraint violations due to the union of several schemas
* duplicate entities due to the merge of two tables with the same schema
* fragmented entities, i.e. the problem of identifying the unit
These errors can also interfere with one another in non-trivial ways. As pointed out in [http://www.vldb.org/pvldb/vol9/p993-abedjan.pdf],
there is no single best order in which to solve the problems.
The main problems to solve are:
1. providing a holistic solution to the cleaning problems
2. scaling to large datasets
3. making interactive use an essential part of the process
In addition, there is the problem that solutions either have quadratic complexity or require several passes.
Second, parameterizing the correction algorithms is difficult, or relies on information that is not available.
Third, errors that only emerge clearly in the late stages of processing require tracing back to the faulty sources.
Fourth, there are no fully automatic solutions: a human who validates the alternatives is an indispensable component.
Practical experience shows that the distance between ideal theoretical assumptions and real situations makes it mandatory to build tailor-made solutions, with high maintenance costs, by specialized technical staff and domain experts.
https://github.com/HoloClean/holoclean/blob/master/repair/learn/learn.py
Proposed solutions [https://sites.google.com/site/anhaidgroup/projects/magellan] provide step-by-step guides.
The key is to collect all the signals of the data model so that erroneous or missing data can be predicted and then replaced or completed.
These signals are expressed by the relationships among the available data.
## terminology
https://elitedatascience.com/birds-eye-view
Model - a set of patterns learned from data.
Algorithm - a specific ML process used to train a model.
Training data - the dataset from which the algorithm learns the model.
Test data - a new dataset for reliably evaluating model performance.
Features - Variables (columns) in the dataset used to train the model.
Target variable - A specific variable you're trying to predict.
Observations - Data points (rows) in the dataset.
You have 150 observations...
1 target variable (Height)...
3 features (Age, Gender, Weight)...
You might then separate your dataset into two subsets:
Set of 120 used to train several models (training set)
Set of 30 used to pick the best model (test set)
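A minimal sketch of that split in Python, assuming the 150-row dataset above is available as a CSV file (the file name and the use of scikit-learn are illustrative assumptions, not part of the quoted guide):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("people.csv")          # hypothetical file with the 150 observations

X = df[["Age", "Gender", "Weight"]]     # features
y = df["Height"]                        # target variable

# 120 observations to train the candidate models, 30 to pick the best one.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=30, random_state=42
)
```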
## data cleaning steps
https://elitedatascience.com/data-cleaning
The steps and techniques for data cleaning will vary from dataset to dataset. As a result, it's impossible for a single guide to cover everything you might run into.
However, this guide provides a reliable starting framework that can be used every time. We cover common steps such as fixing structural errors, handling missing data, and filtering observations.
Better data beats fancier algorithms.
garbage in gets you garbage out.
In fact, if you have a properly cleaned dataset, even simple algorithms can learn impressive insights from the data!
1. Remove Unwanted observations
The first step to data cleaning is removing unwanted observations from your dataset.
This includes duplicate or irrelevant observations.
Duplicate observations most frequently arise during data collection, such as when you:
Combine datasets from multiple places
Scrape data
Receive data from clients/other departments
Irrelevant observations are those that don’t actually fit the specific problem that you’re trying to solve.
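A minimal pandas sketch of this step; the column names and values are only illustrative:

```python
import pandas as pd

# Hypothetical observations combined from several sources.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "category": ["houses", "houses", "houses", "cars"],
    "price": [100, 200, 200, 50],
})

df = df.drop_duplicates()             # duplicate observations (the repeated id=2 row)
df = df[df["category"] == "houses"]   # drop observations irrelevant to the problem
```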
2. Fix Structural Errors (What is the SAME)
Structural errors are those that arise during measurement, data transfer, or other types of "poor housekeeping."
You can check for typos or inconsistent capitalization. This is mostly a concern for categorical features; for example, 'composition' is the same as 'Composition'.
check for mislabeled classes, i.e. separate classes that should really be the same.
’IT’ and ’information_technology’ should be a single class.
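A small sketch of the same fixes with pandas (the 'department' column and its labels are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"department": ["IT", "information_technology", " Composition", "composition"]})

# Normalize whitespace and capitalization, then merge mislabeled classes.
df["department"] = df["department"].str.strip().str.lower()
df["department"] = df["department"].replace({"information_technology": "it"})
print(df["department"].unique())   # ['it' 'composition']
```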
3. Filter Unwanted Outliers (What to filter OUT)
outliers are innocent until proven guilty. You should never remove an outlier just because it’s a "big number." That big number could be very informative for your model.
We can’t stress this enough: you must have a good reason for removing an outlier, such as suspicious measurements that are unlikely to be real data.
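A sketch of a justified filter; the threshold below stands in for whatever the domain says is physically impossible, and the data is invented:

```python
import pandas as pd

df = pd.DataFrame({"height_cm": [172, 168, 181, 9999]})

# 9999 is removed not because it is "big", but because a human height
# above 300 cm cannot be a real measurement (assumed domain rule).
df = df[df["height_cm"] <= 300]
```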
4. Handle Missing Data (What to take IN)
you cannot simply ignore missing values in your dataset. You must handle them in some way for the very practical reason that most algorithms do not accept missing values.
Dropping missing values is sub-optimal because when you drop observations, you drop information.
"Missingness" is almost always informative in itself, so you should always tell your algorithm when a value was missing.
Even if you build a model to impute your values, you’re not adding any real information. You’re just reinforcing the patterns already provided by other features.
Missing categorical data
The best way to handle missing data for categorical features is to simply label them as ’Missing’!
You’re essentially adding a new class for the feature.
Missing numeric data
flag and fill
Flag the observation with an indicator variable of missingness.
Then, fill the original missing value with 0 just to meet the technical requirement of no missing values.
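A minimal sketch of both treatments in pandas (column names and values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "color":  ["red", None, "blue"],        # categorical feature with a missing value
    "income": [30000.0, np.nan, 52000.0],   # numeric feature with a missing value
})

# Categorical: add an explicit 'Missing' class.
df["color"] = df["color"].fillna("Missing")

# Numeric: flag the missingness, then fill with 0 to satisfy the algorithm.
df["income_missing"] = df["income"].isna().astype(int)
df["income"] = df["income"].fillna(0)
```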
## What is Feature Engineering?
Feature engineering is about creating new input features from your existing ones.
we will introduce several heuristics to help spark new ideas.
Before moving on, we just want to note that this is not an exhaustive compendium of all feature engineering because there are limitless possibilities for this step.
The good news is that this skill will naturally improve as you gain more experience.
1. Infuse Domain Knowledge
2. Create Derived Features
3. Combine Sparse Classes
Sparse classes (in categorical features) are those that have very few total observations. They can be problematic for certain machine learning algorithms, causing models to overfit. We can group similar classes together.
4. Add Dummy Variables
Most machine learning algorithms cannot directly handle categorical features. Specifically, they cannot handle text values.
5. Remove Unused Features
Unused features are those that don’t make sense to pass into our machine learning algorithms. Examples include:
ID columns
Not all of the features you engineer need to be winners. In fact, you’ll often find that many of them don’t improve your model. That’s fine because one highly predictive feature makes up for 10 duds.
The key is choosing machine learning algorithms that can automatically select the best features among many options (built-in feature selection).
This will allow you to avoid overfitting your model despite providing many input features.
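A compact pandas sketch of points 3-5 above (grouping sparse classes, adding dummy variables, dropping ID columns); the data is invented:

```python
import pandas as pd

df = pd.DataFrame({
    "id":   [101, 102, 103, 104],                    # ID column, not predictive
    "roof": ["tile", "slate", "thatch", "tile"],     # 'thatch' is a sparse class
})

df["roof"] = df["roof"].replace({"thatch": "other"})  # 3. combine sparse classes
df = pd.get_dummies(df, columns=["roof"])             # 4. add dummy variables
df = df.drop(columns=["id"])                          # 5. remove unused features
```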
## The data selection problem (Simpson)
[https://it.wikipedia.org/wiki/Paradosso_di_Simpson]
Selecting the wrong data to read leads to incorrect interpretations.
In statistics, Simpson's paradox denotes a situation in which a relationship between two phenomena appears modified, or even reversed, by the available data because of other phenomena not taken into account in the analysis (hidden variables).
The data produced by Simpson's paradox are clearly not wrong in themselves; they simply have to be read differently from how a superficial reader or analyst would read them.
|            | Men  |   | Women |
|------------|------|---|-------|
| History    | 1/5  | < | 2/8   |
| Geography  | 6/8  | < | 4/5   |
| University | 7/13 | > | 6/13  |
The key to this puzzling example lies in the fact that more women are applying for jobs that are harder to get. It is harder to make your way into History than into Geography.
History hired only 3 out of 13 applicants, whereas Geography hired 10 out of 13 applicants. Hence the success rate was much higher in Geography, where there were more male applicants.
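The reversal can be checked directly; the script below only recomputes the numbers of the table above:

```python
# Per-department and aggregated hiring rates from the example.
hired = {
    ("History",   "men"):   (1, 5),    # (hired, applicants)
    ("History",   "women"): (2, 8),
    ("Geography", "men"):   (6, 8),
    ("Geography", "women"): (4, 5),
}

for dept in ("History", "Geography"):
    m_h, m_n = hired[(dept, "men")]
    w_h, w_n = hired[(dept, "women")]
    print(dept, "men:", round(m_h / m_n, 2), "women:", round(w_h / w_n, 2))

def total(sex):
    h = sum(v[0] for k, v in hired.items() if k[1] == sex)
    n = sum(v[1] for k, v in hired.items() if k[1] == sex)
    return h / n

# Aggregated over the whole university the relation is reversed: 7/13 > 6/13.
print("University men:", round(total("men"), 2), "women:", round(total("women"), 2))
```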
Attention must be paid to the underlying data. This is why excluding information can go as far as reversing the interpretation of the data.
they do have methodological significance insofar as substantive empirical assumptions are required to identify salient partitions for making inferences from statistical and probability relationships.
[https://plato.stanford.edu/entries/paradox-simpson/]
Are outlier data important, or just noise?
How do we choose a correct filter?
Only the domain expert can answer these questions.
## The data type and mapping problem
When we have to move from one set of data types to another, we must make responsible choices about naming and about value ranges.
Mapping the data schema between tables.
If a table is not in first normal form, a single field may be used to encode several pieces of information, for example an address or a telephone number:
the house number may or may not appear in a separate field, and the same holds for the prefix or postal (CAP) code.
It is up to the programmer to decode the content.
Sometimes a text data type is used to leave freedom of input, without guaranteeing any application-level check; with manual entry, typing errors are almost certain at scale.
There are fields that look alike because they share the same column name but refer to different meanings:
a YEAR field meaning year of birth in a person registry and year of hiring in an employees table.
There are also type translations into approximate floating-point representations.
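A sketch of what "decoding the content" of such a non-1NF field might look like; the address format and the regular expression are purely illustrative assumptions:

```python
import re

# Hypothetical field that packs street, house number, postal code and city together.
raw_address = "Via Roma 12, 39100 Bolzano"

pattern = r"(?P<street>.+?)\s+(?P<number>\d+\w*),\s*(?P<cap>\d{5})\s+(?P<city>.+)"
match = re.match(pattern, raw_address)
if match:
    print(match.groupdict())
    # {'street': 'Via Roma', 'number': '12', 'cap': '39100', 'city': 'Bolzano'}
# Real data will contain formats this pattern does not cover: that is exactly
# the decoding burden left to the programmer.
```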
## The null data problem
The ambiguity of emptiness: absence of a value, or ignorance?
Suppose we have a PRICE column with NULL values inside a hypothetical PRODUCT table; this does not mean that there are products without a price, but only that the price of some products is unknown or has not been set yet.
Null values will affect the results of an aggregate function.
Because Null is not a data value, but a marker for an absent value, using mathematical operators on Null gives an unknown result, which is represented by Null.
Including null values within your data can have an adverse effect when using this data within any mathematical operations. Any operation that includes a null value will result in a null; this being logical as if a value is unknown then the result of the operation will also be unknown.
NULL is the default for new columns!
[http://www.databasedev.co.uk/null_values_defined.html]
in practice Nulls also end up being used as a quick way to patch an existing schema when it needs to evolve beyond its original intent, coding not for missing but rather for inapplicable information; for example, a database that quickly needs to support electric cars while having a miles-per-gallon column
[https://en.wikipedia.org/wiki/Null_(SQL)]
The two most common ways in which NULL values are generated are related to:
FROM HAS-A JOINS
SQL outer joins, including left outer joins, right outer joins, and full outer joins, automatically produce Nulls as placeholders for missing values in related tables.
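Both behaviours (NULL placeholders from an outer join, NULLs silently skipped by aggregates) can be seen with a throwaway SQLite database; the tables are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE price   (product_id INTEGER, price REAL);
    INSERT INTO product VALUES (1, 'keyboard'), (2, 'mouse');
    INSERT INTO price   VALUES (1, 25.0);              -- no price for the mouse
""")

# The LEFT OUTER JOIN produces a NULL placeholder for the missing price...
join_sql = "SELECT p.name, pr.price FROM product p LEFT JOIN price pr ON pr.product_id = p.id"
print(con.execute(join_sql).fetchall())                 # [('keyboard', 25.0), ('mouse', None)]

# ...and the aggregate silently ignores it: AVG is 25.0, not 12.5.
print(con.execute(f"SELECT AVG(price) FROM ({join_sql})").fetchone())
```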
Such queries can be materialized into denormalized tables.
Techniques used in denormalization [https://it.wikipedia.org/wiki/Denormalizzazione] to obtain faster lookups eliminate joins.
FROM IS-A HIERARCHIES
[https://weblogs.asp.net/manavi/inheritance-mapping-strategies-with-entity-framework-code-first-ctp5-part-1-table-per-hierarchy-tph]
A simple strategy for mapping classes to database tables might be “one table for every entity persistent class.” This approach sounds simple enough and, indeed, works well until we encounter inheritance. Inheritance is such a visible structural mismatch between the object-oriented and relational worlds because object-oriented systems model both “is a” and “has a” relationships. SQL-based models provide only "has a" relationships between entities; SQL database management systems don’t support type inheritance—and even when it’s available, it’s usually proprietary or incomplete.
There are three different approaches to representing an inheritance hierarchy:
Table per Hierarchy (TPH): Enable polymorphism by denormalizing the SQL schema, and utilize a type discriminator column that holds type information.
Table per Type (TPT): Represent "is a" (inheritance) relationships as "has a" (foreign key) relationships.
Table per Concrete class (TPC): Discard polymorphism and inheritance relationships completely from the SQL schema.
Table per Hierarchy (TPH)
An entire class hierarchy can be mapped to a single table. This table includes columns for all properties of all classes in the hierarchy. The concrete subclass represented by a particular row is identified by the value of a type discriminator column. You don’t have to do anything special in Code First to enable TPH. It's the default inheritance mapping strategy!
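Outside of Entity Framework the same idea exists in other ORMs; as an illustration, SQLAlchemy's single-table inheritance maps a small hierarchy to one table with a discriminator column (the classes and columns below are assumptions tied to the miles-per-gallon example above, not code from the cited article):

```python
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Vehicle(Base):
    __tablename__ = "vehicle"
    id = Column(Integer, primary_key=True)
    kind = Column(String(30))                          # type discriminator column
    miles_per_gallon = Column(Integer, nullable=True)  # stays NULL for electric cars
    __mapper_args__ = {"polymorphic_on": kind, "polymorphic_identity": "vehicle"}

class ElectricCar(Vehicle):
    # No __tablename__: rows share the single 'vehicle' table (TPH).
    battery_kwh = Column(Integer, nullable=True)
    __mapper_args__ = {"polymorphic_identity": "electric_car"}
```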
FROM REFACTORING
Column fields are abandoned in favour of new column fields, but before they are removed one waits until the entire user base has moved to the new application version.
## Proposed solutions
For example OpenRefine (formerly Google Refine) [] or, more recently, probabilistic and neural-network-based algorithms such as HoloClean [].
## HoloClean as a framework and benchmark
Given a table,
Steps:
1. identify the rows to correct, choosing the individual error detectors.
2. apply correction algorithms to the rows to be corrected, choosing the individual algorithms to apply.
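A schematic version of these two steps (this is not the HoloClean API, only an illustration of the detect-then-repair structure; detectors and repairs are invented callables):

```python
def null_detector(row):
    """Step 1: an example error detector that flags rows with missing values."""
    return any(value is None for value in row.values())

def mean_price_repair(row, table):
    """Step 2: an example repair algorithm that fills a missing price with the mean."""
    prices = [r["price"] for r in table if r["price"] is not None]
    if row["price"] is None and prices:
        row["price"] = sum(prices) / len(prices)

def clean(table, detectors, repairs):
    dirty = [row for row in table if any(det(row) for det in detectors)]  # 1. detection
    for row in dirty:                                                     # 2. repair
        for repair in repairs:
            repair(row, table)
    return table

table = [{"name": "keyboard", "price": 25.0}, {"name": "mouse", "price": None}]
print(clean(table, [null_detector], [mean_price_repair]))
```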
......@@ -257,260 +257,3 @@ proposta
[Logic Tensor Networks: Deep Learning and Logical Reasoning from Data and Knowledge](https://www.researchgate.net/publication/303969790_Logic_Tensor_Networks_Deep_Learning_and_Logical_Reasoning_from_Data_and_Knowledge)
----------------------------------
# SEMINT OVERALL APPENDICES
https://en.wikipedia.org/wiki/Data_wrangling
Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. A data wrangler is a person who performs these transformation operations.
The data transformations are typically applied to distinct entities (e.g. fields, rows, columns, data values etc.) within a data set,
and could include such actions as
extractions,
parsing,
joining,
standardizing,
augmenting,
cleansing,
consolidating and
filtering
to create desired wrangling outputs that can be leveraged downstream.
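A toy pandas sketch touching several of the listed operations (parsing, standardizing, joining/augmenting, filtering); all values are invented:

```python
import pandas as pd

raw = pd.DataFrame({
    "company": ["Apple Inc. ", "ibm corp"],
    "revenue": ["35.1B", "0.9B"],
})
countries = pd.DataFrame({"company": ["apple inc.", "ibm corp"],
                          "country": ["US", "US"]})

raw["company"] = raw["company"].str.strip().str.lower()           # standardizing
raw["revenue_bn"] = raw["revenue"].str.rstrip("B").astype(float)  # parsing / extraction
wrangled = raw.merge(countries, on="company", how="left")         # joining / augmenting
wrangled = wrangled[wrangled["revenue_bn"] > 1.0]                 # filtering
print(wrangled)
```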
## WINTER
https://github.com/olehmberg/winter
Data Loading: WInte.r provides readers for standard data formats such as CSV, XML and JSON. In addition, WInte.r offers a specialized JSON format for representing tabular data from the Web together with meta-information about the origin and context of the data.
Pre-processing: During pre-processing you prepare your data for the methods that you are going to apply later on in the integration process. WInte.r WebTables provides you with specialized pre-processing methods for tabular data, such as:
Data type detection
Unit of measurement normalization
Header detection
Subject column detection (also known as entity name column detection)
Schema Matching: Schema matching methods find attributes in two schemata that have the same meaning. WInte.r provides three pre-implemented schema matching algorithms which either rely on attribute labels or data values, or exploit an existing mapping of records (duplicate-based schema matching) in order to find attribute correspondences.
Label-based schema matching
Instance-based schema matching
Duplicate-based schema matching
Identity Resolution: Identity resolution methods (also known as data matching or record linkage methods) identify records that describe the same real-world entity. The pre-implemented identity resolution methods can be applied to a single dataset for duplicate detection or to multiple datasets in order to find record-level correspondences. Besides manually defining identity resolution methods, WInte.r also allows you to learn matching rules from known correspondences. Identity resolution methods rely on blocking (also called indexing) in order to reduce the number of record comparisons. WInte.r provides the following pre-implemented blocking and identity resolution methods:
Blocking by single/multiple blocking key(s)
Sorted-Neighbourhood Method
Token-based identity resolution
Rule-based identity resolution
Data Fusion: Data fusion methods combine data from multiple sources into a single, consolidated dataset. For this, they rely on the schema- and record-level correspondences that were discovered in the previous steps of the integration process. However, different sources may provide conflicting data values. WInte.r allows you to resolve such data conflicts (decide which value to include in the final dataset) by applying different conflict resolution functions.
11 pre-defined conflict resolution functions for strings, numbers and lists of values as well as data type independent functions.
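WInte.r itself is a Java library, but the blocking plus rule-based matching idea can be sketched in a few lines of Python (the records, the blocking key and the rule are invented, not WInte.r's API):

```python
from collections import defaultdict

records_a = [{"id": "a1", "name": "Apple Inc.", "city": "Cupertino"}]
records_b = [{"id": "b1", "name": "Apple Incorporated", "city": "Cupertino"},
             {"id": "b2", "name": "IBM Corp.", "city": "Armonk"}]

def blocking_key(r):
    return r["name"][:3].lower()                # single blocking key: name prefix

def match_rule(r, s):
    # Rule-based identity resolution: same city and same first name token.
    return (r["city"] == s["city"]
            and r["name"].split()[0].lower() == s["name"].split()[0].lower())

blocks = defaultdict(list)
for s in records_b:
    blocks[blocking_key(s)].append(s)

correspondences = [(r["id"], s["id"])
                   for r in records_a
                   for s in blocks[blocking_key(r)]   # only compare within the same block
                   if match_rule(r, s)]
print(correspondences)                                # [('a1', 'b1')]
```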
https://github.com/olehmberg/winter/wiki/WInte.r-Tutorial
## BIG GORILLA
This example covers the “data wrangling” aspect of the data science pipeline. After data from different sources are integrated into a single database, a data scientist would like to perform analysis on the data through techniques such as classification, clustering, anomaly detection, correlation discovery, and OLAP style exploration.
steps may be carried out in a different order and some of the steps may even be repeated:
### Data Acquisition, Extraction, and Cleaning
when you wish to acquire data from other sources or extract structured data from text. Most tools in this component include data cleaning components to, for example, detect and/or correct inconsistent data.
DATA ACQUISITION AND DATA EXTRACTION:
Suppose we have acquired a table X(id, name, loc) from a relational database management system and we have extracted a table Y(id, cname, address, rev) from a set of news articles.
The table X contains information about the names and locations of companies while the table Y contains, for each company, its address and its quarterly revenue in billions of dollars.
DATA CLEANING:
We detect that revenue 351 of GE (the last row of Table Y) is an outlier. Upon closer inspection, we realize that it should have been 35.1 instead of 351, due to an extraction error. So we manually change this value to 35.1.
* Note that there are many other types of cleaning operations in general. We are currently missing a data cleaning component in BigGorilla.
### Schema Matching and Mapping
Use this component when you wish to match attributes across two schemas or when you wish to generate scripts (from schema matchings) that can be executed to transform data from one format into another.
SCHEMA MATCHING:
Next, we match the schemas of tables X and Y. We obtain the matches name⬌cname and loc⬌address. Intuitively, this means that the attribute name of table X is the same as cname of table Y, and the attribute loc of table X is the same as address of table Y.
SCHEMA MERGING:
Based on the matchings name⬌cname and loc⬌address between tables X and Y, the data scientist may choose to merge the two schemas X(id, name, loc) and Y(id, cname, address, rev) into a single schema Z(name, loc, rev). Note that the id attribute is omitted in the merge process and this is a conscious decision of the data scientist.
SCHEMA MAPPING:
The program that is used to transform data that resides in tables X and Y into table Z is called a schema mapping. Here, the schema mapping is developed based on understanding how tuples from X and Y should be migrated into Z. The program can be an SQL query that populates table Z based on tuples from tables X and Y. It uses table M to determine the matches and it uses the function merge_name(.) to apply the heuristic of selecting the longer string described earlier.
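A pandas sketch of this mapping, using the record-level matches x1 ≃ y2 and x2 ≃ y1 found by the entity-matching step described below, and a merge_name heuristic that keeps the longer string; the concrete values are invented:

```python
import pandas as pd

# Toy versions of the tables in the running example.
X = pd.DataFrame({"id": ["x1", "x2"],
                  "name": ["Apple", "IBM Corp."],
                  "loc": ["Cupertino, CA", "Armonk, NY"]})
Y = pd.DataFrame({"id": ["y1", "y2"],
                  "cname": ["IBM Corporation", "Apple Inc."],
                  "address": ["Armonk, NY", "Cupertino, CA"],
                  "rev": [20.5, 35.1]})

# Record-level matches: x1 ~ y2, x2 ~ y1.
M = pd.DataFrame({"x_id": ["x1", "x2"], "y_id": ["y2", "y1"]})

def merge_name(a, b):
    # Heuristic from the text: keep the longer of the two matched strings.
    return a if len(a) >= len(b) else b

pairs = (M.merge(X, left_on="x_id", right_on="id")
          .merge(Y, left_on="y_id", right_on="id", suffixes=("_x", "_y")))

Z = pd.DataFrame({
    "name": [merge_name(n, c) for n, c in zip(pairs["name"], pairs["cname"])],
    "loc":  pairs["loc"],          # loc <-> address matched; X's value kept for brevity
    "rev":  pairs["rev"],
})
print(Z)
```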
### Entity Matching
Use this component when you wish to identify when two entities are the same entity or when they are related in some ways.
DATA MATCHING:
Next, we match the tuples of tables X and Y. The matching process makes the following associations: x1 ≃ y2 and x2 ≃ y1. Intuitively, the first match x1 ≃ y2 indicates that tuple x1 and tuple y2 refer to the same real-world entity (in this case, they are the same company, Apple Inc.). Similarly, tuples x2 and y1 refer to the same company, IBM Corp. These matches are stored in a table M.
DATA MERGING: