# DB DATATYPES
https://it.wikipedia.org/wiki/Tipo_di_dato_(basi_di_dati)
## Article
This article proposes a data access methodology based on an ontology as a model with high semantic content, formally verifiable and amenable to automation. It also presents a composition of tools for maintaining a high-quality structured data schema over time at low cost.
The innovative contributions concern the use of Domain-Driven Design (DDD) as a first approach to vocabulary elicitation, context decomposition and business-process design, and the use of the Object Role Modeling (ORM) procedure for the formalization of multi-source structured data.
### Situation
In today's business environment many companies require tools to analyze their own data in order to respond to market challenges.
The Data Science (DS) market is experiencing steep growth []
DECLINE OF THE RELATIONAL MODEL
The decay of traditional databases [1] imposes costs that can lead to considering a rewrite of the archives in favor of specialized engines able to deliver performance one or two orders of magnitude higher [2].
SENTIMENT
Understanding what customers think is important for planning sales actions [].
It is increasingly clear that customer behavior is volatile []. Depending on the company's resources, the customer must be anticipated, by influencing them with marketing campaigns [], and/or followed in their day-to-day changes of taste.
ARCHITECTURAL PROBLEMS
Today a company's information system is often in one of two states:
1) the monolith
A centralized system in which information has accumulated in layers over time without improving in quality.
2) the fragment
A shattered system in which information is replicated and/or distributed poorly with respect to new requirements, across many subsystems that do not talk to each other.
Both present problems of scalability and management:
1) scalability
In the first case the system must sustain the whole load and becomes a bottleneck; in the second case the cost of retrieving the final data is excessive.
2) management
In the first case it is difficult to map roles onto resource-usage needs; in the second case the cost of replicating and aligning access credentials is excessive.
MICROSERVICES
The goal of microservices is to decentralize, to allow replication when needed, and to handle partial failures.
DDD
The goal is to reorganize through a balanced decomposition that maximizes cohesion and minimizes interaction between components that encapsulate the data, managing update and visibility policies for corporate security and confidentiality.
These architectures, costly to design, implement and maintain, have the benefit of scaling with the workload and of following requirements more quickly through specialized computing engines.
GLOBAL
Even for small companies the market is becoming global in terms of competition, supplier base and customers. Time to market and market testing are strategic. It becomes essential to remove the friction in accessing one's own information sources, so as to build dashboards that act as a responsive and sensitive rudder.
It is therefore natural that the Data Science sector is growing strongly.
It is necessary to handle peak workloads, not just average ones.
CLOUD
Modern infrastructures rely more and more on external cloud services.
The advantages are scalability and fast access to innovation.
The disadvantages are delegating components that may become essential to the company
and the difficulty of guaranteeing compliance with privacy regulations.
CHALLENGES
The traditional challenges of DS are the collaboration of several high-profile roles and the usability of all the available data sources.
MANY ROLES
Roles that Gardner [] identifies include:
market analyst, able to
data engineer,
metadata designer,
MANY STAGES and TOOLS
Achieving adequate access to information starting from raw data requires several steps: cataloging, cleaning, integration, analysis. [] Each of these steps involves different specialized tools, in constant evolution and calibrated to the size of the company.
Today the winning solutions are those that build a company-wide community [] and a suite of tools [] with a process that makes the correct interpretation of the data fluid. Companies that provide this are emerging (databricks, data.world, poolparty), while large vendors already offer cloud tools (amazon, google, microsoft, ibm, oracle).
# OBDA
The vocabulary becomes key. Having an unambiguous and maintainable model of the company's terms is strategic.
The Ontology Based Data Access paradigm [] enables:
1. a clear conceptual view of heterogeneous data sources
2. the ability to query the sources in business terms
## DDD elicitation
It enables a fluid dialogue, the identification of business contexts and the construction of a vocabulary.
## ORM formalization
It enables validation, export and maintenance of the ontology in the terms of the vocabulary.
## Lifecycle and tool portfolio
It keeps technical debt low and data quality high.
Open suites: gorilla; proprietary suites: poolparty
## Open challenges
mapping volumes
reducing the cost of the professional roles involved
## References
[1] Stonebraker, Data Decay Challenge
[2] Stonebraker, H-Store
Sequeda, Pay-as-you-go
Data science
- Gorilla
DDD
- EventStorming
ORM
# Full-Stack Data Integration Methodology
1. reverse
third normal form clean schema plus logic constraints
input: a database
tools: nony
output: a db schema
2. clean
detect and fix missing or faulty values
input: database
tools: holoclean (modular) + ltwn
http://holoclean.io/
output: database cleaned
[semint_clean](semint-clean.md)
3. merge mappings
define reversal entity mappings for a single external source:
input: multi schemas
tools: JedAI (modular) + manual binding
output: mappings (equivalence clusters)
4. integration
build a unified conceptual schema
input: multi-schema + mappings
tools: ??
output: integrated ORM-schema
5. fact based modeling
reasoning and query answering
input: integrated schema
tools: Visual Studio NORMA and ORMiE
NORMA - https://github.com/ormfoundation/NORMA-plus
ORMiE - overall consistency
output: new SQL and queries
## Competitors
https://www.devart.com/entitydeveloper/
## Pay as you go
https://files.zotero.net/12337925152/Sequeda%20et%20al.%20-%202018%20-%20The%20Pay-as-you-go%20Methodology%20as%20an%20enabler%20of%20Sel.pdf
## Pay-as-you-go Methodology - Sequeda
OBDA
The Ontology Based Data Access (OBDA) paradigm enables
(1) a clean separation between a conceptual business view and heterogeneous data sources and
(2) the ability to ask questions in terms of the business view, independent of how and where the data is physically stored.
FEDERATED MODEL
Ontologies serve as a uniform conceptual federated model describing the domain of interest.
...
The ontology serves as a business view, using business terminology, which is then connected to data sources.
VOLUMES
A common enterprise application's database schema consists of thousands of tables and tens of thousands of attributes.
PUTATIVE (reverse engineering)
A common approach is to bootstrap ontologies derived from the source database schemas, known also as putative ontologies [6, 7]. The putative ontologies can gradually be transformed into target ontologies, using existing ontology engineering methodologies
W3C DIRECT MAPPING
The direct mapping takes as input a relational database (data and schema), and generates an RDF graph that is called the direct graph.
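A minimal illustrative sketch of the direct-mapping idea (simplified, not the full W3C specification; rdflib, the example table and the sample rows are assumptions made purely for illustration):

```python
from rdflib import Graph, Literal, Namespace

# Simplified sketch: each row becomes a subject IRI, each column a predicate,
# each value a literal in the resulting direct graph.
BASE = Namespace("http://example.com/base/")

table = "orders"
primary_key = "order_id"
rows = [{"order_id": 1, "status": "shipped"}]  # stand-in for rows read from the database

g = Graph()
for row in rows:
    subject = BASE[f"{table}/{primary_key}={row[primary_key]}"]
    for column, value in row.items():
        g.add((subject, BASE[f"{table}#{column}"], Literal(value)))

print(g.serialize(format="turtle"))
```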
DERIVED FIELDS
In addition to 1-1 correspondences between classes and properties, mappings can be complex, involving calculations and rules that are part of business logic.
BUSINESS QUESTION DRIVEN
driven by a prioritized list of business questions.
BUSINESS QUESTION SOURCE
BQ are currently being answered by a small set of expert users by running multiple SQL queries to manually generate BI reports.
HYPOTHESIS
we should be able to extract enough information from a SQL query to generate an ontology and mappings.
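A toy heuristic to make the hypothesis concrete (this is not the paper's algorithm, which targets OWL and R2RML; the SQL string and the regex-based extraction are illustrative assumptions only):

```python
import re

# Guess putative classes and properties from the tables and columns a SQL query touches.
sql = """
SELECT o.order_id, o.order_date, c.name
FROM orders o JOIN customers c ON o.customer_id = c.customer_id
"""

tables = set(re.findall(r"(?:FROM|JOIN)\s+(\w+)", sql, flags=re.IGNORECASE))
columns = set(col for _, col in re.findall(r"(\w+)\.(\w+)", sql))

for t in tables:
    print(f"putative class:    {t}")
for c in columns:
    print(f"putative property: {c}")
```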
ITERATIVE
After a minimal set of business questions have been successfully modeled, mapped, answered and made into dashboards, then the set of business questions can be extended.
THREE ACTORS
Business User, IT Developer, Knowledge Engineer
PHASE 1 Discovery-Vocabulary-Ontology
PHASE 2 Mapping-Query-Validation
CHALLENGES
Automation:
Given a SQL query, how can we automatically generate an OWL ontology and R2RML mappings?
Iteration:
Manage new business questions that extend the ontology and mappings. What happens if a new query contradicts the current ontology and/or mappings, hence it is non-monotonic?
Tools:
There is a need for tools that can manage large database schemas at scale
SCENARIO 1: synonyms, homonyms
Executives of a large e-commerce company need to know how many orders were placed in a given month and the corresponding net sales. Depending on whom they ask they get different answers. The IT department managing the website records an order when a customer has checked out. The fulfillment department records an order when it has shipped. Yet the accounting department records an order when the funds charged against the credit card are actually transferred to the company's bank account, regardless of the shipping status. Unaware of the source of the problem, the executives have inconsistencies across their business reports.
SCENARIO 2: documentation
Business users ask IT developers to answer a business question. SQL queries are initially created by IT developers who are knowledgeable about the large database schema. Developers come and go within an organization. Queries get shared, altered, extended and combined. After time, business users are executing SQL queries without any understanding of what the queries actually do. Users rely on a description of what the SQL query is supposed to be returning.
----
### The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Edition
### The Kimball Group Reader
## How to Plan and Launch Your Modern Data Catalog_datadotworld
Step 1: CHOOSE A PILOT PROJECT
WHY THIS WORKS
Reduce risk:
Learn by doing
Build buzz:
WHO TO INVOLVE
a cross-functional selection team
Step 2: ENGAGE THE RIGHT PEOPLE
Data Manager: access to the data
Domain Expert: access to the domain rules
Data Practitioner: data query
Data Consumer: data interpretation
Step 3: SELECT AND CONNECT DATA SOURCES
Data is brought into the light:
Data is documented and trustworthy:
Data becomes instantly more useful:
Step 4: EDUCATE AND DRIVE USAGE
Step 5: MEASURE SUCCESS
Understand unique viewpoints: how the catalog helped them
Refine your process before onboarding others: working out the kinks
Provide proof: Quantifiable results showing ROI
Tracking the catalog’s impact on team productivity, organizational culture, and overall business results will be an ongoing feedback loop with continuous tinkering.
## DREMIO and PSQL
One approach frequently applied to optimize this kind of query is to execute it against a different database repository specifically designed for analytical queries, such as a data warehouse. For this approach we would need to provision a server, export the data, apply the necessary transformations and import the resulting data. All these tasks add up to a very complex, time-consuming and expensive project.
Reflections in Dremio are powered by a data acceleration technology based on Apache Arrow, which is a columnar in-memory analytics layer designed to accelerate big data. Dremio’s Reflections are optimized physical data structures that accelerate data and queries automatically.
Dremio provides two types of Reflections: Raw Reflections and Aggregation Reflections.
Note that both types of reflections consume less than 1MB combined. The size of the original table in PostgreSQL is over 1GB.
This is a subset of the columns in the table.
This is a subset of the rows based on the query predicates and group by statements.
Dremio columnarizes and compresses the data, which can reduce storage by 70% or more (depending on many factors)
## Essential steps for machine learning
https://elitedatascience.com/birds-eye-view
Exploratory Analysis
First, "get to know" the data. This step should be quick, efficient, and decisive.
Data Cleaning
Then, clean your data to avoid many common pitfalls. Better data beats fancier algorithms.
Feature Engineering
Next, help your algorithms "focus" on what's important by creating new features.
Algorithm Selection
Choose the best, most appropriate algorithms without wasting your time.
Model Training
Finally, train your models. This step is pretty formulaic once you've done the first 4.
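A compact sketch of the five steps on a toy dataset (scikit-learn and pandas assumed; all column names and values are invented for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Exploratory analysis: get to know shapes, types and distributions quickly.
df = pd.DataFrame({"amount": [10, 12, 11, 300, 9, 310],
                   "churned": [0, 0, 0, 1, 0, 1]})
print(df.describe())

# Data cleaning: drop duplicates and missing values.
df = df.drop_duplicates().dropna()

# Feature engineering: add a derived feature the algorithm can focus on.
df["is_large"] = (df["amount"] > 100).astype(int)

# Algorithm selection + model training: a simple scaled logistic regression.
X, y = df[["amount", "is_large"]], df["churned"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y)
model = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```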
## Additional steps
Project Scoping
Sometimes, you'll need to roadmap the project and anticipate data needs.
Data Wrangling
You may also need to restructure your dataset into a format that algorithms can handle.
Preprocessing
Often, transforming your features first can further improve performance.
Ensembling
You can squeeze out even more performance by combining multiple models.
# SemInt Overview
When companies expand by incorporating other companies, when new startups grow while trying to intercept and respond to user needs, and when data archives and their schemas evolve to take in new data sources, data integration becomes an essential part of a successful business strategy and often a prerequisite for survival itself.
Even the ordinary evolution of a single company that wants to respond to market challenges often involves complicated and costly access to the sources and to their management, driven by:
* growing business requirements
* technological alignment to the market
* big data challenges: volume, velocity, and variety
* safety and audit regulations
* technical debt
The evolutionary activity is a set of compromises and balancing choices made in order to keep the data schema efficient and readable, i.e. trying to:
1. keep technical debt to a minimum by allowing the applications that access it to evolve rapidly;
2. offer maximum efficiency in accessing and manipulating data to those who should and can access it
3. offer maximum rapid and correct interpretation of the data to all the roles that can benefit from it;
These requirements often conflict with each other.
Here we focus on keeping technical debt low, and we present a curated selection of steps to support this evolution as Ontology-Based Data Integration (OBDI) of structured data sources such as relational databases.
The methodology proposed here is interactive and iterative.
Iterative because it offers a pay-as-you-go [] approach that fragments the cost and delivers benefits more quickly as the work proceeds, as opposed to a single-step evolution.
Interactive because the proposal places the designer's decision making at the center, as the way to resolve the various choice problems that cannot be automated.
Steps:
1. choose your next business query to answer
2. acquire existing datasets
3. model with ORM your database or reverse engineer from database to conceptual level with semantic enrichment using standard vocabularies (eg. schema.org).
4. map and integrate from many conceptual diagrams to a single overall conceptual model gaining semantic services
5. map from the conceptual model to physical structured datasources
6. SQL query on virtual or materialized new data sources
7. repeat from step 1 for each new datasource to integrate
The goal of this process is to gain:
* a progressive integration
* a live sharable documentation in sync
* no intermediary for low level access
For the description of the conceptual model we will use Object Role Modeling [], a notation friendly to both the designer and the domain expert thanks to its verbalization [], while offering a well-established formal semantics from which an OWL2 [] description can be extracted, in order to use services that check consistency and make explicit rules that would otherwise remain implicit, and to improve data cleaning services by exporting the domain's conceptual constraints [HoloClean].
Each step has been designed to minimize the friction of information loss caused by the semantic poverty of the physical levels with respect to the richness of the conceptual ones, highlighting the necessary practical compromises.
## Identify team
Roles:
* Business Expert, giving business value to data
* Knowledge Scientist, bringing ontology-design competence
* IT Developer, providing data access
## Identify targets
Stakeholders, domain experts and knowledge scientists have to share goals, the valuable business questions to be answered, and their priority.
## Acquisition
Catalog available tables (data sources)
* data ingestion
* data extraction guidelines (access, transfer, terms of use)
output: selection of valuable tables for extraction with guidelines
## data cleaning
If required, detect outliers and missing values and choose how to fix them.
output: valuable table projections without noise
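A minimal sketch of this cleaning step (pandas assumed; the column, the z-score threshold and the fix strategy are illustrative choices, not a prescription):

```python
import pandas as pd

# Flag missing values and simple z-score outliers, then impute with the median.
df = pd.DataFrame({"price": [10, 12, 11, 9, 13, 14, 2000, None]})

missing = df["price"].isna()
z = (df["price"] - df["price"].mean()) / df["price"].std()
outliers = z.abs() > 2

df.loc[outliers, "price"] = None                      # treat flagged outliers as missing
df["price"] = df["price"].fillna(df["price"].median())

print(df.assign(was_missing=missing, was_outlier=outliers))
```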
## Model top-down: ORM
Database design is a forward engineering process which systematically transforms a high-level conceptual schema into a relational database schema residing on a physical machine, via a series of tasks: requirement collection and analysis, logical design, normalisation and the final physical implementation.
In ORM, the knowledge is structured into:
* Facts: a fact is a statement, or assertion, about some piece of information within the application domain (e.g. Professor works as Employee for Department).
* Predicates: a predicate is a verb, or verb phrase, that connects the object types in a fact, with one role each (e.g. ... works for ...).
* Roles: each role in a predicate is expressed by a role label and is played by one object type (e.g. Employee).
* Object Types: object types categorize data into different kinds of meaningful sets (e.g. Professor).
* Constraints: constraints restrict the set of values for a role.
The knowledge about the domain is then stated by means of a set of facts.
These facts may be verbalized using sample data, called fact instances.
e.g.
Instructor works for Department (fact type)
Instructor 100 works for Department “CS” (fact instance)
For a detailed guide [Guide to FORML]
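A small illustrative sketch (not a real ORM tool) of how object types, a predicate and its roles fit together in a binary fact type, verbalizing a fact instance like the one above:

```python
from dataclasses import dataclass

@dataclass
class FactType:
    subject_type: str   # object type playing the first role (e.g. Instructor)
    predicate: str      # verb phrase connecting the roles (e.g. "works for")
    object_type: str    # object type playing the second role (e.g. Department)

    def verbalize(self, subject_value, object_value):
        # a fact instance is the fact type populated with sample data
        return f'{self.subject_type} {subject_value} {self.predicate} {self.object_type} "{object_value}"'

works_for = FactType("Instructor", "works for", "Department")
print(works_for.verbalize(100, "CS"))   # Instructor 100 works for Department "CS"
```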
## Model bottom-up: Reverse
Database Reverse Engineering (DBRE) is a process typically used for creating an equivalent conceptual schema from an existing relational database. It is a translation between schemas of different data models and structures.
The application of a reverse engineering process over a relational database is beneficial for recovering hidden knowledge and representing it in a conceptual schema that provides a more expressive formulation of the domain.
[Nony]
## Merge
As with pay-as-you-go [], at the center of the methodology is a set of prioritized business questions that need to be answered. The business questions serve as competency questions and as a success metric.
Business questions are answered via ORM facts.
* Deduplicate Instances: distinct entities by artificial key, same attribute values
* Namespaces: same type name, distinct domains
* Domain Ranges: subtypes, same value constraints
* Redundancy: distinct type names, same constraints
* Derivable: empty types
This is very difficult to automate: there may be over a hundred or a thousand entity attributes, and there is no direct mapping.
- schema matching
the attribute name of table X denotes the same attribute as cname of table Y
output: attribute mapping
- schema merging
choose to merge the two schemas X(id, name, loc) and Y(id, cname, address, rev) into a single schema Z(name, loc, rev)
add required attributes and derived values
output: a new table (see the sketch after this list)
- schema enrich
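A minimal sketch of the schema matching and schema merging steps for the X/Y example above (pandas assumed; the name/cname and loc/address matches are treated as manual bindings):

```python
import pandas as pd

X = pd.DataFrame({"id": [1, 2], "name": ["Acme", "Beta"], "loc": ["Rome", "Trento"]})
Y = pd.DataFrame({"id": [10, 11], "cname": ["Acme", "Gamma"],
                  "address": ["Rome", "Milan"], "rev": [5.0, 7.5]})

# schema matching: attribute mapping between the two schemas
mapping = {"cname": "name", "address": "loc"}
Y = Y.rename(columns=mapping)

# schema merging: a single schema Z(name, loc, rev); X rows get a missing 'rev'
Z = pd.concat([X[["name", "loc"]], Y[["name", "loc", "rev"]]], ignore_index=True)
print(Z)
```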
## Map to ER from conceptual
Once the overall conceptual schema is described in ORM notation, there is a standard, automatable procedure to generate a normalized database schema.
[Halpin]
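A toy sketch of the flavor of this mapping (not Halpin's full relational mapping procedure): functional, many-to-one fact types become foreign-key columns on the subject's table, while many-to-many fact types become their own tables. The fact types, key names and uniqueness flags are assumptions made for illustration:

```python
fact_types = [
    # (subject type, predicate, object type, unique on subject role?)
    ("Instructor", "works for", "Department", True),
    ("Instructor", "teaches", "Course", False),
]

tables = {"Instructor": ["instructor_id"],
          "Department": ["department_id"],
          "Course": ["course_id"]}
ddl = []

for subject, predicate, obj, functional in fact_types:
    if functional:
        tables[subject].append(f"{obj.lower()}_id")   # absorbed as a column
    else:
        name = f"{subject.lower()}_{predicate.replace(' ', '_')}_{obj.lower()}"
        ddl.append(f"CREATE TABLE {name} ({subject.lower()}_id, {obj.lower()}_id);")

for table, columns in tables.items():
    ddl.append(f"CREATE TABLE {table.lower()} ({', '.join(columns)});")

print("\n".join(ddl))
```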
## Data fusion
We can identify five steps:
- data matching
tuple x1 and tuple y2 refer to the same real-world entity
output: tuple mapping
- data merging
merge the tuples that refer to the same entity,
value normalization
output: tuples without duplication, with the right values (see the sketch after this list)
- dataset license & privacy
input: dataset clean & rich
output: dataset with license and visibility level filter
- dataset versioning
input: final dataset
output: final dataset versioned
- dataset provisioning
when delivered to final storage.
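A minimal sketch of the data matching and data merging steps above (pandas assumed; a real pipeline would use blocking and similarity scoring rather than an exact normalized key):

```python
import pandas as pd

tuples = pd.DataFrame({
    "name": ["ACME S.p.A.", "acme spa", "Beta srl"],
    "city": ["Rome", "Rome", "Trento"],
    "revenue": [5.0, None, 7.5],
})

# data matching: tuples sharing the normalized key refer to the same entity
tuples["match_key"] = tuples["name"].str.lower().str.replace(r"[^a-z]", "", regex=True)

# data merging: one tuple per entity, keeping the first non-null value per column
fused = tuples.groupby("match_key", as_index=False).first()
print(fused)
```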
## Query
Once exported to a new database, the data can be queried using standard SQL and analytical tools.
Clean data sources are a prerequisite for valuable results. Many tools are dedicated to this time-consuming and critical preprocessing phase of data preparation.
We propose a novel data cleaning solution based on [Logic Tensor Network technology](semint-clean.md).
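A minimal illustration of the query step, with an in-memory SQLite database standing in for the materialized target (the table and values are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, status TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, "shipped", 10.0), (2, "shipped", 12.5), (3, "pending", 7.0)])

# a standard analytical SQL query over the new data source
for row in con.execute("SELECT status, COUNT(*), SUM(amount) FROM orders GROUP BY status"):
    print(row)
```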
## Semantic Integration Checklist
List of questions: semantic integration
* Do you know which questions you need to answer sorted by priority?
* Do you have examples of answers you'd like to get?
* Do you have a catalog of accessible sources, with or without a schema, from which to draw?
- Where does the data come from? Is it reliable?
- How old is the data?
- Is there a data versioning organization?
- User license?
- Which access stability?
* Are the data for each source clean or do they contain dirty or missing values?
- To normalize
- To aggregate
- To be standardized
- Fix Outliers
- Fix Nulls
* The schematic metadata of the sources are:
- Names
- Data schema and their syntactic format
- Foreign keys and functional dependency constraints
* Do sources have semantic metadata?
- Do you know what the entities, the attributes and the units of measurement in each source really mean?
- Semantic metadata indicating that a number is a postal code
* Do sources have conversational metadata?
- Information on the usefulness of the data and the choices applied during construction or use.
* Can you complete the schema-less sources with a schema?
* Can you reproduce the entities from the tables?
- Removal of people
- Deduplicate rows
* Do you know how to connect entity data from different sources?
- What are the integrity constraints of the conceptual scheme?
- What are the keys needed to relate the entities?
* Are there any derivable columns useful for answering questions?
# TOOLS
* dremio
* spark
* holoclean
* druid
* superset
* ignite