Commit e4764a57 authored by npedot

updates overview with details

parent 4f1dada1
@@ -28,14 +28,16 @@ Iterative because it offers a pay-as-you-go [] approach that allows you to fragment
Interactive because the proposal places the designer's decision-making at the center, as the solution to the various choice problems that cannot be automated.
Steps:
1. choose your next business query to answer
2. acquire the existing datasets
3. model your database with ORM, or reverse engineer from the database to the conceptual level with semantic enrichment using standard vocabularies (e.g. schema.org); see the sketch after this list
4. map and integrate the many conceptual diagrams into a single overall conceptual model, gaining semantic services
5. map from the conceptual model to physical structured datasources
6. query the new virtual or materialized datasources with SQL
7. repeat from step 1 for each new datasource to integrate
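As a minimal illustration of the enrichment in step 3, the Python sketch below suggests schema.org terms for reverse-engineered column names; the mapping table is a hand-curated assumption, and in practice the designer reviews each suggestion interactively.

```python
# Minimal sketch: suggest schema.org annotations for reverse-engineered
# columns. SCHEMA_ORG is an illustrative hand-curated mapping, not part
# of the project's tooling.
SCHEMA_ORG = {
    "cname": "https://schema.org/name",
    "address": "https://schema.org/address",
    "zip": "https://schema.org/postalCode",
}

def enrich(columns):
    """Return {column: schema.org IRI or None} for the designer to review."""
    return {col: SCHEMA_ORG.get(col.lower()) for col in columns}

print(enrich(["cname", "address", "rev"]))
# {'cname': 'https://schema.org/name',
#  'address': 'https://schema.org/address', 'rev': None}
```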
The goal of this process is to gain:
@@ -48,6 +50,37 @@ For the description of the conceptual model we will use Object Role Modeling []
Each step has been designed to minimize the information loss caused by the semantic poverty of the physical levels with respect to the richness of the conceptual ones, while highlighting the necessary practical compromises.
## Identify team
Roles:
* Business Expert, who gives business value to data
* Knowledge Scientist, who designs the ontology
* IT Developer, who provides data access
## Identify targets
Stakeholders, domain experts, and knowledge scientists have to share the goals, the valuable business questions to be answered, and their priorities.
## Acquisition
Catalog the available tables (data sources):
* data ingestion
* data extraction guidelines (access, transfer, terms of use)
output: a selection of valuable tables for extraction, with guidelines
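The selection could be recorded as plain catalog entries; a minimal sketch, where the field names are illustrative rather than a prescribed format:

```python
# Illustrative source-catalog entry; the field names are assumptions.
catalog = [
    {
        "table": "customers",
        "source": "crm_db",
        "access": "read-only ODBC",
        "transfer": "nightly CSV export",
        "terms_of_use": "internal analytics only",
        "selected": True,  # judged valuable for extraction
    },
]

# The acquisition output: the valuable tables with their guidelines.
print([entry["table"] for entry in catalog if entry["selected"]])
```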
## Data cleaning
If required, detect outliers and missing values and choose how to fix them.
output: a projection of the valuable tables without noise
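A minimal cleaning sketch, assuming pandas is available (it is not the project's tooling): fill nulls with the median, then drop outliers with the interquartile-range rule.

```python
import pandas as pd

df = pd.DataFrame({"rev": [10.0, 12.0, None, 11.0, 900.0]})

# Fix nulls: fill missing values with the column median.
df["rev"] = df["rev"].fillna(df["rev"].median())

# Fix outliers: keep only values within the 1.5 * IQR fences.
q1, q3 = df["rev"].quantile([0.25, 0.75])
iqr = q3 - q1
clean = df[df["rev"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

print(clean)  # the 900.0 row is dropped as an outlier
```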
## Model top-down: ORM
Database design is a forward engineering process which systematically transforms a high-level conceptual schema into a relational database schema residing on a physical machine, via a series of tasks: requirement collection and analysis, logical
@@ -98,12 +131,17 @@ Business questions are answered via ORM facts.
This step is very difficult to automate: schemas may have over a hundred or a thousand entity attributes, and there is no direct mapping.
- schema matching
the attribute name of table X is the same as the attribute cname of table Y
output: an attribute mapping
- schema merging
choose to merge the two schemas X(id, name, loc) and Y(id, cname, address, rev) into a single schema Z(name, loc, rev)
add the required attributes and derived values (see the sketch after this list)
output: a new table
- schema enrichment
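A minimal sketch of the matching and merging steps above, using only name similarity from the standard library; real matchers also exploit types, instances, and designer feedback.

```python
from difflib import SequenceMatcher

X = ["id", "name", "loc"]
Y = ["id", "cname", "address", "rev"]

def similarity(a, b):
    """String similarity in [0, 1] between two attribute names."""
    return SequenceMatcher(None, a, b).ratio()

def match(xs, ys, threshold=0.6):
    """Map each attribute of one schema to its closest name in the other."""
    pairs = {}
    for x in xs:
        best = max(ys, key=lambda y: similarity(x, y))
        if similarity(x, best) >= threshold:
            pairs[x] = best
    return pairs

print(match(X, Y))  # {'id': 'id', 'name': 'cname'}; 'loc' stays
                    # unmatched, left to the designer to decide.

# Schema merging: the designer chooses the merged schema from the example.
Z = ["name", "loc", "rev"]
```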
## Map to ER from conceptual
@@ -111,10 +149,72 @@ Once given the overall conceptual schema described in ORM notation, there is a standard procedure to map it to a relational schema
[Halpin]
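As a rough sketch of the mapping idea (not Halpin's full procedure): a fact type that is functional on one role maps to a column of the entity's table, while a many-to-many fact type gets its own table. The fact types below are illustrative.

```python
# Toy ORM-to-relational mapping; the emitted DDL is schematic pseudo-SQL.
facts = [
    ("Person", "was_born_on", "Date", "many-to-one"),
    ("Person", "speaks", "Language", "many-to-many"),
]

for subject, predicate, obj, uniqueness in facts:
    if uniqueness == "many-to-one":
        # Functional fact type: absorb as a column of the subject's table.
        print(f"ALTER TABLE {subject} ADD COLUMN {obj.lower()};")
    else:
        # Many-to-many fact type: map to its own relation.
        print(f"CREATE TABLE {subject}_{predicate} "
              f"({subject.lower()}_id, {obj.lower()}_id);")
```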
## Data fusion
We can identify five steps:
- data matching
tuple x1 and tuple y2 refer to the same real-world entity
output: a tuple mapping
- data merging
merge the tuples that refer to the same entity, with value normalization (see the sketch after this list)
output: tuples without duplication, with the right values
- dataset license & privacy
input: a clean and rich dataset
output: a dataset filtered by license and visibility level
- dataset versioning
input: the final dataset
output: the final dataset, versioned
- dataset provisioning
when the dataset is delivered to its final storage.
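A minimal sketch of the matching and merging steps, with illustrative normalization rules: tuples that agree on the normalized name are taken to refer to the same entity and are merged by keeping the first non-null value per attribute.

```python
rows = [
    {"name": "ACME Inc.", "loc": "Rome", "rev": None},
    {"name": "acme inc",  "loc": None,   "rev": 120},
]

def norm(value):
    """Illustrative value normalization: trim punctuation, lowercase."""
    return value.strip(" .").lower() if isinstance(value, str) else value

merged = {}
for row in rows:
    key = norm(row["name"])                # data matching
    target = merged.setdefault(key, dict(row))
    for attr, value in row.items():        # data merging
        if target.get(attr) is None:
            target[attr] = value

print(list(merged.values()))
# [{'name': 'ACME Inc.', 'loc': 'Rome', 'rev': 120}]
```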
## Query
Once exported to a new database, the data can be queried with standard SQL and analytical tools.
Clean data sources are a requirement for valuable results. Many tools are dedicated to this time-consuming and critical preprocessing phase of data preparation.
We propose a novel data cleaning solution based on [Logic Tensor Network technology](semint-clean.md).
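For instance, with sqlite3 from the Python standard library and the illustrative merged table Z from the integration example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Z (name TEXT, loc TEXT, rev REAL)")
con.execute("INSERT INTO Z VALUES ('acme', 'Rome', 120.0)")

# A standard SQL query over the new, materialized datasource.
for name, rev in con.execute("SELECT name, rev FROM Z WHERE rev > 100"):
    print(name, rev)
```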
## Semantic Integration Checklist
A list of questions for semantic integration:
* Do you know which questions you need to answer, sorted by priority?
* Do you have examples of the answers you'd like to get?
* Do you have a catalog of accessible sources to draw from, with or without a schema diagram?
- Where does the data come from? Is it reliable?
- How old is the data?
- Is there a data versioning organization?
- What is the usage license?
- How stable is the access?
* Are the data in each source clean, or do they contain dirty or missing values?
- Values to normalize
- Values to aggregate
- Values to standardize
- Outliers to fix
- Nulls to fix
* Do sources have schematic metadata?
- Names
- The data schema and its syntactic format
- Foreign keys and functional dependency constraints
* Do sources have semantic metadata?
- Do you know what the entities, the attributes and the units of measurement in each source really mean?
- Semantic metadata indicating that a number is a postal code
* Do sources have conversational metadata?
- Information on the usefulness of the data and the choices applied during construction or use.
* Can you complete a source that lacks a schema with a schema?
* Can you reconstruct the entities from the tables?
- Removal of personal data
- Deduplicate rows
* Do you know how to connect entity data from different sources?
- What are the integrity constraints of the conceptual schema?
- What are the keys needed to relate the entities?
* Are there any derivable columns useful for answering questions?