Commit 1561c665 authored by npedot

moves overall appendix into separate file

parent 9658ac9e
# SEMINT OVERALL APPENDIVI
# SEMINT OVERALL APPENDICI
https://en.wikipedia.org/wiki/Data_wrangling
@@ -134,3 +134,296 @@ Section 3.2. Aggregations
Section 3.3. Join and Concatenation
Section 3.4. Transformation: Conversion, Replacement, Standardization, and New Feature Generation
-------------------------------------
## APPENDICES
## Present ETL Challenge: Spaghetti Infrastructure Problem
"The average corporation has bought a portal system, has bought an enterprise application integration system, has bought an ETL (Extraction, Transformation, and Loading) system, has bought an application server, maybe has bought a federated data system. All of these are big pieces of system infrastructure that run in the middle tier; they have high overlap in functionality, and are complicated, and require system administrators. The average enterprise has more than one of all of these things, and so they have this spaghetti environment of middleware, big pieces of moving parts that are expensive to maintain and expensive to use. "
- Dr. Michael Stonebraker
## Present Analytics Challenge: Prediction Bias, Privacy
The trouble with predictive models is that they are built by humans, and humans by nature are prone to bias.
We have to protect data and give access only to authorized people: the right amount of data, no more, no less.
## Present Big Data Challenge: Volume, Velocity, Variety
"From my point of view, there are three potential problems with Big Data. These can be broken into the three “V’s.” It can be a volume problem, meaning you have too much data; the data is coming at you too fast and it’s a velocity problem; or there is data coming at you from too many sources and it’s a variety problem."
- Dr. Michael Stonebraker [2]
## Present DB Distribution Challenge: Distributed DB & Microservice
Breaking the monolith into fragments increases management costs and requires discipline in its evolution, along with strong automation support for monitoring and maintenance.
## Present DB Quality Challenge2: Database Decay [4]
"DBAs appear to attempt to minimize
application maintenance (and hence schema changes) instead
of maximizing schema quality. This leads to schemas which
quickly diverge from E-R or UML models and actual database
semantics tend to drift farther and farther from 3rd normal
form. We term this divergence of reality from 3rd normal form
principles database decay."
## Responses to Market Needs
## First Response: Vertical Scaling
Vertical growth: boosting the compute engine by expanding storage and computing power.
Heavy concentration in all-inclusive app servers (non-modular monoliths) does not scale enough and is not flexible enough to meet market needs.
## Second Response: Horizontal Scaling, Federated DB
### NewSQL: VoltDB
The database is partitioned into disjoint subsets each assigned to a single-threaded execution engine assigned to one core on one node. Each engine has exclusive access to all of the data in its partition.
Throughput is increased by increasing the number of nodes in the system and reducing partition sizes.
NewSQL systems provide the high throughput and high availability of NoSQL systems without giving up the transactional consistency of a traditional DBMS, known as ACID (atomicity, consistency, isolation, and durability). Such systems operate across multiple machines, as opposed to a single, more powerful, more expensive machine.
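The partitioning idea described above can be illustrated with a small sketch. It is not VoltDB code: `PartitionEngine` and `PartitionedStore` are hypothetical names, threads stand in for cores and nodes, and a CRC32 hash is an assumed routing rule. Each engine owns one disjoint partition and runs every transaction on its own single thread, so no locks are needed inside a partition.

```python
import queue
import threading
import zlib


class PartitionEngine:
    """Single-threaded engine with exclusive access to one partition (hypothetical)."""

    def __init__(self):
        self.data = {}              # the partition: touched only by this engine's thread
        self.inbox = queue.Queue()  # transactions wait here and run one at a time
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        while True:
            item = self.inbox.get()
            if item is None:        # shutdown signal
                return
            txn, reply = item
            try:
                reply.put((True, txn(self.data)))
            except Exception as exc:  # report the failure instead of killing the engine
                reply.put((False, exc))

    def submit(self, txn):
        """Run txn(data) on the engine thread and return its result."""
        reply = queue.Queue(maxsize=1)
        self.inbox.put((txn, reply))
        ok, result = reply.get()
        if not ok:
            raise result
        return result

    def stop(self):
        self.inbox.put(None)
        self.thread.join()


class PartitionedStore:
    """Routes each key to exactly one engine, so partitions are disjoint."""

    def __init__(self, partitions):
        self.engines = [PartitionEngine() for _ in range(partitions)]

    def _engine_for(self, key):
        # Assumed routing rule: stable hash of the key modulo the partition count.
        return self.engines[zlib.crc32(key.encode()) % len(self.engines)]

    def put(self, key, value):
        def txn(data):
            data[key] = value
        self._engine_for(key).submit(txn)

    def get(self, key):
        return self._engine_for(key).submit(lambda data: data.get(key))

    def add(self, key, delta):
        # Read-modify-write is atomic because the engine runs one transaction at a time.
        def txn(data):
            data[key] = data.get(key, 0) + delta
            return data[key]
        return self._engine_for(key).submit(txn)

    def close(self):
        for engine in self.engines:
            engine.stop()
```

In this sketch, adding engines spreads keys over more partitions, which mirrors the claim that throughput grows with the number of nodes. Transactions that span several partitions are deliberately left out.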
### Spanner: Google's Globally-Distributed Database
Spanner is Google's scalable, multi-version, globally-distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. This paper describes how Spanner is structured, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty. This API and its implementation are critical to supporting external consistency and a variety of powerful features: non-blocking reads in the past, lock-free read-only transactions, and atomic schema changes, across all of Spanner.
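To make the "time API that exposes clock uncertainty" more concrete, here is a minimal sketch of the commit-wait rule as it is commonly described for Spanner. It is an illustration only: `UncertainClock`, `TTInterval`, and `commit` are hypothetical names, and the fixed uncertainty `epsilon` is an assumption (a real deployment measures it continuously). The clock returns an interval that is assumed to contain the true time, and a commit is acknowledged only once its timestamp is certainly in the past.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class UncertainClock:
    """Hypothetical clock that reports an interval instead of a single instant."""

    def __init__(self, epsilon, source=time.monotonic):
        self.epsilon = epsilon   # assumed constant clock uncertainty, in seconds
        self._source = source

    def now(self):
        t = self._source()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, timestamp):
        """True only when the timestamp has definitely passed."""
        return self.now().earliest > timestamp


def commit(clock, apply_writes, sleep=time.sleep):
    """Pick a commit timestamp, apply the writes, then wait out the uncertainty."""
    timestamp = clock.now().latest       # no earlier than any possible current time
    apply_writes(timestamp)
    while not clock.after(timestamp):    # commit wait
        sleep(clock.epsilon / 10)
    return timestamp
```

Because `commit` returns only after the timestamp has definitely passed, a transaction that starts afterwards receives a strictly larger timestamp, which is the ordering property that external consistency relies on.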
### Cosmos DB
This globally distributed, multi-model database is built for low latency, elastic scalability, and high availability.
## Third Response: Hybrid Cloud Architectures
Use computing capacity in the quantity and for the time needed, exploiting the elasticity of Internet network services.
An agreement with a cloud service provider is needed. It must be possible to export data from the company in compliance with existing legislation.
## Proposal
Reverse
from single DB extract
Map
The data sources are many and change over time. A catalog that is quick to manage is needed.
The dialogue among designers, domain experts, and customers must be fluid. A vocabulary that is shared as widely as possible is needed.
Many systems must be put in communication. A mapping of the data interfaces between systems is needed.
## Limits
* Iteration speed must fit the time frames needed to respond to market volumes and demands.
* The designer is a human being who, compared with automation, is slow, unreliable, and inconsistent.
----------------------------------------------------------
## Present DB Challenge2: Database Decay [4]
"DBAs appear to attempt to minimize
application maintenance (and hence schema changes) instead
of maximizing schema quality. This leads to schemas which
quickly diverge from E-R or UML models and actual database
semantics tend to drift farther and farther from 3rd normal
form. We term this divergence of reality from 3rd normal form
principles database decay."
Problems with views over tables
Problem 1: The views, so defined, are unlikely to be updatable.
Problem 2: The semantics of the application may change.
Consequence #1: There is substantial risk. ALL of the
applications must be found and corrected.
Consequence #2: There is often no budget for global
maintenance.
As a result, a very popular tactic is to leave the schema
in Figure 4 intact. Instead, the changed business logic is
supported without changing the schema.
The result, which we call the kluge, is not in
3rd normal form, so professionals would consider this a bad
design. However, it has a huge advantage:
The applications A1 − A3 will continue to run.
There is no E-R diagram that will produce the kluge schema.
We have talked to nearly twenty DBAs, and all report that they do
not use E-R tools, because they do not reflect reality in
the database.
Problem 3: Applications have to avoid running certain
problematic SQL commands.
We explore how to lower the amount of required application
maintenance by using either defensive schemas or defensive
application development tactics.
Not a solution 1: coding at the E-R level. The view-over-table update problem remains, and so does the maintenance.
Not a solution 2: DB schema versioning. SQL commands may still have to be rewritten.
Proposed 1: defensive
Using defensive programming, defensive schemas and
appropriate kluge schemas will allow many popular changes
in business logic to NOT require application maintenance.
This will clearly result in lower maintenance costs than
using the traditional wisdom. Of course, there is no free
lunch and the ultimate cost is database decay.
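As an illustration of what "defensive programming" can mean in practice, the sketch below uses one widely known habit: naming columns explicitly instead of relying on `SELECT *` and column positions. This is an assumption chosen for illustration, not necessarily a tactic from the cited paper [4]; the `employee` table and the function names are hypothetical.

```python
import sqlite3


def make_db():
    """Create a tiny in-memory database with a hypothetical employee table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")
    conn.execute("INSERT INTO employee (id, name, salary) VALUES (1, 'Ada', 100.0)")
    return conn


def fragile_read(conn):
    # Depends on the number and order of columns in the table.
    emp_id, name, salary = conn.execute("SELECT * FROM employee WHERE id = 1").fetchone()
    return emp_id, name, salary


def defensive_read(conn):
    # Names the columns it needs, so added columns do not affect it.
    row = conn.execute("SELECT id, name, salary FROM employee WHERE id = 1").fetchone()
    return tuple(row)


def evolve_schema(conn):
    """A typical business change: a new attribute is added to the table."""
    conn.execute("ALTER TABLE employee ADD COLUMN department TEXT")
```

After `evolve_schema`, `defensive_read` keeps working unchanged, while `fragile_read` fails because the row now has four values. The defensive variant needs no application maintenance for this particular change.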
Proposed 2: a new app dev paradigm (DB gateway)
One problem that contributes significantly to database
decay is the decentralization of application development.
It is virtually impossible to figure out the implications of
schema changes on applications, since they are in multiple
departments in the enterprise.
...
However, instead of coding in ODBC/JDBC,
an application must send a message to a
middleware system in an agreed-upon format.
...
If the schema changes,
then the application logic that interacts with the DBMS is
centralized in the middleware message processing system.
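The gateway idea quoted above can be sketched as follows. Everything here is hypothetical: the `DbGateway` class, the `get_customer` message type, the dict-based message format, and the `customer` table are assumptions made for illustration, with SQLite standing in for the DBMS. Applications send messages in an agreed format and never write SQL, so a schema change is absorbed in one place.

```python
import sqlite3


class DbGateway:
    """Hypothetical middleware: applications send dict messages, never SQL."""

    def __init__(self, conn, customer_sql):
        self.conn = conn
        self.customer_sql = customer_sql  # the only place that knows the schema
        self.handlers = {"get_customer": self._get_customer}

    def handle(self, message):
        handler = self.handlers.get(message.get("type"))
        if handler is None:
            return {"ok": False, "error": "unknown message type"}
        return handler(message)

    def _get_customer(self, message):
        row = self.conn.execute(self.customer_sql, (message["id"],)).fetchone()
        if row is None:
            return {"ok": False, "error": "not found"}
        return {"ok": True, "customer": {"id": row[0], "name": row[1]}}


def old_schema_gateway():
    """Original schema: the name is stored in one column."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, full_name TEXT)")
    conn.execute("INSERT INTO customer (id, full_name) VALUES (1, 'Ada Lovelace')")
    return DbGateway(conn, "SELECT id, full_name FROM customer WHERE id = ?")


def new_schema_gateway():
    """Changed schema: the name is split in two; only the gateway SQL changes."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
    )
    conn.execute(
        "INSERT INTO customer (id, first_name, last_name) VALUES (1, 'Ada', 'Lovelace')"
    )
    return DbGateway(
        conn, "SELECT id, first_name || ' ' || last_name FROM customer WHERE id = ?"
    )
```

An application that sends `{"type": "get_customer", "id": 1}` receives the same reply from both gateways, so it does not need to be found and corrected when the schema changes.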
## Present Challenges: Wildly Competitive Fields
"We learned on the job.
These days wildly more is known and
I don't think we could have succeeded now what we did in the 70's,
just because the env is so much more competitive, 10x more competitive.
...
I think database design is completely broken
I think we build software that is ridiculously hard to use
Working on human factors I think no one wants because it's too hard
so I would look there.
I think it remains to be seen how or even if
complex analytics and machine learning are going to get integrated into
data and storage systems.
Cope with the cloud is ridiculously complicated env
and how to navigate in the cloud is going to be a real challenge."
[3]
## Today Outside The Company: Function As A Service, Cloud Serverless
Connection and transmission speed.
Infrastructure and services on demand.
## Today Inside The Company: Microservices, DevOps
The importance of distributing skills to optimize processing workflows, and of rapid feedback.
The importance of time. Reactive systems.
## NewSQL 5G revolution - VoltDB
VoltDB targets applications that require real-time intelligent decisions on stream processing for a connected world, without compromising on ACID requirements.
## Domain Driven Design
The importance of terminology.
The importance of the team.
The importance of communities.
The importance of the speed of evolution to survive the competition.
## Clean
The importance of cleanliness as a competitive asset for evolution.
Clean data, clear semantics
Clean architecture, clear modularity
Clean code, clear reading
## Steps to Obtain ONE Domain View of the Data
[see methodology](semint-metodology.md)
## Mapping Tool
https://dzone.com/articles/data-mapping-why-it-is-important-for-integration
A data mapping tool makes sure that there are no gaps in mapping and the destination data is getting populated in the right format/schema. It ensures output is free of errors, inconsistencies, and any kind of duplication, thus preserving the integrity of data integration.
It not only enables the mapping of two distinct elements but also governs the rule as to how the data would be mapped with each other. In a way, data mapping requires an understanding of the semantics of data schemas to ascertain relationships between source and destination fields.
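A minimal sketch of what such mapping rules could look like is shown below. It is not taken from any specific tool: `Rule`, `find_gaps`, and `apply_mapping` are hypothetical names. Each rule links a source field to a destination field and carries the transformation that governs how the value is mapped; a separate check reports destination fields that no rule populates.

```python
from dataclasses import dataclass
from typing import Any, Callable


def identity(value):
    return value


@dataclass(frozen=True)
class Rule:
    """One mapping rule: source field, destination field, and how to convert."""
    source: str
    target: str
    transform: Callable[[Any], Any] = identity


def find_gaps(rules, target_fields):
    """Return destination fields that no rule populates."""
    mapped = {rule.target for rule in rules}
    return sorted(set(target_fields) - mapped)


def apply_mapping(rules, record):
    """Build a destination record from a source record, rule by rule."""
    result = {}
    for rule in rules:
        if rule.source not in record:
            raise KeyError(f"source field missing: {rule.source}")
        result[rule.target] = rule.transform(record[rule.source])
    return result
```

For example, with rules `Rule("cust_name", "name", str.strip)` and `Rule("dob", "birth_date")`, a destination schema of `name`, `birth_date`, and `email` has one gap, `email`, which should be resolved before data is loaded.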
## References
1. [Michael Stonebraker Speaks Out](https://sigmod.org/publications/interviews/pdf/D1-DBP-stonebraker-final.pdf)
2. [Interview: Michael Stonebraker, Adjunct Professor, MIT Computer Science and AI Laboratory (CSAIL)](https://insidebigdata.com/2017/07/07/interview-michael-stonebraker-adjunct-professor-mit-computer-science-ai-laboratory-csail/)
3. [Michael Stonebraker at the 6th Heidelberg Laureate Forum, September 2018](https://tomgeller.com/accomplishment/stonebraker-interview/)
4. [Database Decay and How to Avoid It - Conference: 2016 IEEE International Conference on Big Data](https://www.researchgate.net/publication/311584152_Database_Decay_and_How_to_Avoid_It)
-------------------
## Academia and Market
The academic world has a role in using public funds to tackle research topics and develop prototypes.
A prototype focuses on demonstrating the feasibility of a product and omits the complementary aspects.
The company uses private funds to turn the prototype into a product.
"I would encourage academics to pay attention to the real world, at least in those fields where the ultimate arbiter is real-world applications."
- Dr. Michael Stonebraker
### Lesson 1: INGRES Success
"And, in retrospect, we made a bunch of very lucky accidental decisions.
I think another factor in our success was that we stumbled onto the dictum: If you get it wrong, just throw it away and rewrite it."
- Dr. Michael Stonebraker
### Lesson 2: OODB Failure
"OODBs are a deep tangent in the sense that it was interesting stuff that nobody wanted; and the fact that nobody wanted it was, I thought, fairly obvious up front. "
- Dr. Michael Stonebraker
Root problem: an absence of standards and not enough vendor push to cross the chasm beyond early adopters.
https://en.wikipedia.org/wiki/Crossing_the_Chasm
### Federated DB
"I mean, sooner or later, again it seems inevitable that federated database technology will have to be commercially important. "
"But I think in the commercial market, timing is everything. "
- Dr. Michael Stonebraker [1]
@@ -98,296 +98,3 @@ Via standard SQL query.