Open Datasets and Evaluation Protocols for AIED Systems
Lina Gomez, Takashi Mori, Wing Lam
Journal of Learning Analytics Infrastructure
500-Word Summary
This mock summary describes a paper focused on datasets and evaluation protocols for AIED systems. The authors argue that the field needs stronger infrastructure for comparison and replication. Many educational AI systems are evaluated on private datasets, local classroom pilots, or narrow benchmarks. That makes it difficult to know whether a model is robust across learner populations, subject areas, languages, and school contexts. The paper surveys available datasets and proposes a protocol for reporting and evaluating AIED systems more transparently.
The survey organizes datasets by task: knowledge tracing, hint generation, essay scoring, dialogue tutoring, affect detection, learning analytics, and content recommendation. For each category, the authors examine data modality, learner age, subject domain, language, privacy treatment, labels, and limitations. They find that some areas, such as knowledge tracing, have relatively mature benchmarks, while others, such as teacher workflow support and multimodal classroom analytics, have fewer reusable datasets.
The proposed evaluation protocol has four layers. The first layer is technical performance, including accuracy, calibration, robustness, and error analysis. The second layer is pedagogical validity, asking whether the measured outcome is meaningful for learning. The third layer is deployment fit, covering latency, interpretability, teacher workflow, and integration with school systems. The fourth layer is governance, including privacy, consent, bias analysis, and documentation. The authors recommend that papers report not only aggregate scores, but also subgroup results, failure cases, and dataset documentation.
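The recommendation to report subgroup results alongside aggregate scores can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the subgroup labels, and the toy records are all hypothetical, and the calibration measure is a simple binned gap between predicted probability and observed outcome rate, chosen here only for illustration.

```python
from collections import defaultdict


def accuracy(labels, probs, threshold=0.5):
    """Fraction of thresholded predictions that match the binary labels."""
    correct = sum((p >= threshold) == bool(y) for y, p in zip(labels, probs))
    return correct / len(labels)


def calibration_gap(labels, probs, n_bins=10):
    """Weighted mean gap between predicted probability and observed rate per bin."""
    bins = defaultdict(list)
    for y, p in zip(labels, probs):
        # Clamp p == 1.0 into the last bin.
        bins[min(int(p * n_bins), n_bins - 1)].append((y, p))
    total = len(labels)
    gap = 0.0
    for items in bins.values():
        mean_prob = sum(p for _, p in items) / len(items)
        observed = sum(y for y, _ in items) / len(items)
        gap += (len(items) / total) * abs(mean_prob - observed)
    return gap


def subgroup_report(records):
    """records: iterable of (subgroup, label, predicted_probability) tuples."""
    groups = defaultdict(lambda: ([], []))
    for group, y, p in records:
        # Every record counts toward the aggregate and toward its own subgroup.
        for key in ("all", group):
            groups[key][0].append(y)
            groups[key][1].append(p)
    return {
        name: {
            "n": len(labels),
            "accuracy": accuracy(labels, probs),
            "calibration_gap": calibration_gap(labels, probs),
        }
        for name, (labels, probs) in groups.items()
    }


# Hypothetical toy data: (subgroup, correct-answer label, predicted probability).
records = [
    ("primary", 1, 0.9), ("primary", 0, 0.2),
    ("primary", 1, 0.8), ("primary", 0, 0.1),
    ("secondary", 1, 0.4), ("secondary", 0, 0.7),
    ("secondary", 1, 0.6), ("secondary", 0, 0.3),
]
report = subgroup_report(records)
for name, metrics in report.items():
    print(name, metrics)
```

In this toy data the aggregate accuracy hides a large difference between the two subgroups, which is the kind of gap the protocol asks authors to surface.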
A useful contribution is the “evaluation card” template. It asks researchers and product teams to document target users, learning objectives, data sources, model assumptions, known limitations, human oversight, and appropriate use boundaries. This template can be attached to a system or dataset, making it easier for schools to understand what has and has not been validated.
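One way to picture the evaluation card is as a small structured record. The sketch below is an assumption about how such a card might be encoded, not the paper's actual template: the class name, field types, the `missing_fields` helper, and the example values are all hypothetical. Only the seven field topics come from the summary above.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class EvaluationCard:
    # Field topics follow the template described above; types are assumed.
    target_users: List[str]
    learning_objectives: List[str]
    data_sources: List[str]
    model_assumptions: List[str]
    known_limitations: List[str]
    human_oversight: str
    appropriate_use_boundaries: List[str]

    def missing_fields(self):
        """Names of fields left empty, so gaps in documentation are visible."""
        return [name for name, value in asdict(self).items() if not value]

    def to_json(self):
        """Serialize the card so it can be attached to a system or dataset."""
        return json.dumps(asdict(self), indent=2)


# Hypothetical example values for illustration only.
card = EvaluationCard(
    target_users=["secondary mathematics teachers"],
    learning_objectives=["practice solving linear equations"],
    data_sources=["classroom pilot interaction logs"],
    model_assumptions=["learners attempt items independently"],
    known_limitations=[],  # deliberately empty to show the gap check
    human_oversight="teacher reviews hints before release",
    appropriate_use_boundaries=["not validated for high-stakes grading"],
)
print(card.missing_fields())
print(card.to_json())
```

Checking for empty fields before publishing a card gives a school a quick signal of what has not yet been documented or validated.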
For AIEDHK, the paper is relevant because a knowledge hub should not only summarize exciting capabilities. It should also help readers understand evidence quality. When Dr. Peter Hu or future contributors summarize papers, a structured evaluation lens can make each summary more useful: What data was used? Who were the learners? What was measured? What risks remain? This paper also points to a future AIEDHK feature: an evidence and readiness score for research-to-product translation.