Hello Nexa,

I'd like to point you to a new thread [0], started by the OSI community manager, inviting the community to list the problems still present in draft 0.0.9 of the Open Source AI Definition [1]. The thread is titled "We heard you: let's focus on substantive discussion". [2] It has turned out to be rather interesting, at times surprising, especially for its analysis of the working group's decision-making processes, which are anything but conventional. [3]

For now, the problems that have emerged are:

- Data transparency: the data used to train an AI system should be openly available, as it is essential for understanding and improving the model.
- Pretraining dataset distribution: the dataset used for pre-training should also be accessible, to ensure transparency and allow for further development.
- Dataset documentation: the documentation of training datasets should be thorough and accurate, to address potential issues.
- Versioning: to maintain consistency and reproducibility, versioned data is crucial for training AI systems.
- Open licensing: data used to train Open Source AI systems should be licensed under an open license.
- Reproducibility: an Open Source AI must be reproducible using the original training data, scripts, logs and everything else used by the original developer.
- Inherent user (in)security: without access to the whole training data, it is possible to plant undetectable backdoors in machine learning models.
- Implicit or unspecified formal requirements: if ambiguities in the OSAID are to be resolved for each candidate AI system through a formal certificate issued by OSI, such a formal requirement should be explicitly stated in the OSAID.
- OSI as a single point of failure: since every new version of every candidate Open Source AI system worldwide would have to undergo the certification process again, this would turn OSI into a vulnerable bottleneck in AI development and the target of unprecedented lobbying from the industry.
- Open-washing AI: any definition that a black box could pass would both damage the credibility of the whole open source ecosystem and open a huge loophole in European legislation (the AI Act).

All these problems are extensively documented in the thread or in the other threads linked from it; still, if you have spotted further problems, or would like to comment on these, I suggest you raise them as soon as possible.

Giacomo

PS: As it happens, every one of the problems raised can be solved by requiring the availability of the training data, as proposed in the thread that the same community manager closed after silencing me [4]

[0] https://discuss.opensource.org/t/we-heard-you-lets-focus-on-substantive-disc...
[1] https://opensource.org/deepdive/drafts
[2] it really does say "heard you", yet some users are still silenced
[3] https://discuss.opensource.org/t/we-heard-you-lets-focus-on-substantive-disc...
[4] https://discuss.opensource.org/t/rfc-separating-concerns-between-source-data...