The task of Question Answering on factoid questions was the subject of the TREC QA challenge from 2002 to 2007.
We took part in 2003 with the PiQASso team, obtaining a score of 0.45.
Over the years the systems improved, to the point that the problem was no longer considered interesting.

TREC 2003: The main task was a composite of factoid, list, and definition questions. The best-performing system on the factoid component achieved an accuracy of 0.698. This year also introduced a "passages" task where systems had to return a text snippet containing the answer; the top accuracy here was 0.532.

TREC 2004: The format was further refined into question series. The top system for the factoid component demonstrated an accuracy of 0.770. This represents a significant jump and a high point in the TREC QA track for this specific task.

TREC 2005-2007: The main QA task continued to evolve, often with a focus on more complex, context-dependent questions within a series. While direct year-over-year comparisons of factoid accuracy became more nuanced due to changes in the task design and the introduction of different document collections (such as blogs in TREC 2007), the top systems continued to perform at a high level. For instance, in TREC 2006, the best factoid accuracy was 0.759. In TREC 2007, the top score for the factoid component was an accuracy of 0.748.

A few years ago, in the pre-transformer era, Google too launched a Natural Questions dataset and challenge.
In 2019 three of my students won the Fujitsu NLP Challenge:
https://www.systemscue.it/fujitsu-ai-nlp-unipi-zinrai/14490/

They were all ad hoc systems, heavily engineered for that specific task.
  
--

On 7 Jul 2025, at 09:50, nexa-request@server-nexa.polito.it wrote:

From: Daniela Tafani <daniela.tafani@unipi.it>

I assume these are the same models used for military purposes (<https://defensescoop.com/2025/01/16/openais-gpt-4o-gets-green-light-for-top-secret-use-in-microsofts-azure-cloud/>).

An OpenAI study shows the factual error rate for the 4 new ChatGPT models, with hallucinations getting much worse: 48-90%.

GPT-4o-mini: 8.6% correct answers, 0.9% unanswered, and 90.5% incorrect.

o1-mini: 8.1% correct answers, 28.5% unanswered, and 63.4% incorrect.

GPT-4o: 38.2% correct answers, 1.0% unanswered, and 60.8% incorrect.

o1-preview: the top performer, with 42.7% correct answers, 9.2% unanswered, and 48% incorrect.

Link here:
<https://openai.com/index/introducing-simpleqa/>

The report comes from Ewan Morrison:
<https://xcancel.com/MrEwanMorrison/status/1941627600096366662>