v1.1 (beta) - Continuous tests, Resuming, and ChapterBreak
Pre-release

What's Changed
Single conversation testing
All testing is now performed over the span of a single conversation. This stresses the LTM more: the agent performs multiple versions of the same test in sequence, and we never wipe its memory clean manually. Instead, we tell it to forget something, and it must do so or risk confusing old information with new information.
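The single-conversation flow can be sketched roughly as follows. This is an illustrative toy, not the benchmark's actual API: `run_versions`, the "forget" phrasing, and `EchoAgent` are all hypothetical, showing only the pattern of repeated test versions in one unbroken conversation.

```python
# Illustrative sketch: several versions of the same test run back-to-back
# in one conversation. The agent's memory is never reset externally; only
# an explicit "forget" message tells it to discard the previous value.
def run_versions(agent, versions):
    results = []
    for value in versions:
        agent.send(f"Forget my favourite colour. It is now {value}.")
        answer = agent.send("What is my favourite colour?")
        results.append(answer == value)
    return results


class EchoAgent:
    """Toy stand-in agent that remembers only the last value it was told."""

    def __init__(self):
        self.memory = None

    def send(self, message):
        if "It is now" in message:
            # Store the new value, overwriting the old one.
            self.memory = message.rsplit(" ", 1)[-1].rstrip(".")
            return "OK"
        return self.memory
```

A real LTM agent would pass this only if its "forget" handling actually supersedes the old value rather than letting both linger in memory.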
Resuming testing
When the testing process fails, it can now pick up right from where it left off. All testing events are logged to a master log, which serves as the authoritative record of what has happened so far in the test suite. When the tests are resumed, this log is used to reset tests back to where they were in their scripts, and the process continues. See the runner readme for more details.
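A minimal sketch of the replay idea, assuming a hypothetical event-log format (one JSON object per line recording which test emitted which scripted action; the field names are illustrative, not the benchmark's actual schema):

```python
import json
from collections import defaultdict


def restore_positions(log_lines):
    """Replay the master log and return, per test, the index of the next
    unexecuted action in its script."""
    positions = defaultdict(int)
    for line in log_lines:
        event = json.loads(line)
        # Each logged action advances that test's script by one step.
        positions[event["test_id"]] = event["action_index"] + 1
    return dict(positions)


log = [
    '{"test_id": "names_1", "action_index": 0}',
    '{"test_id": "names_1", "action_index": 1}',
    '{"test_id": "recall_3", "action_index": 0}',
]
resume_points = restore_positions(log)
# names_1 resumes at its third action, recall_3 at its second.
```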
Agents are now broadly expected to be persistent. See the models readme for more details.
Datasets:
- Added the ChapterBreak dataset: a set of long texts (8k tokens) for which your agent has to choose which continuation of the text is the correct one.
- Prospective memory generation produces correct tests more reliably.
- Instruction Recall tests no longer generate questions or instructions that an LLM could reasonably guess without memory. (e.g. no more questions like “What should you do to prepare a drone for its first flight?”)
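For the ChapterBreak-style items above, a single example might be shaped like this. The field names (`context`, `candidates`, `label`) and the scoring helper are assumptions for illustration, not the dataset's actual schema:

```python
# Hypothetical shape of a ChapterBreak-style item: a long preceding context
# followed by several candidate chapter beginnings, exactly one of which is
# the true continuation.
item = {
    "context": "... ~8k tokens of preceding chapters ...",
    "candidates": [
        "Chapter 12. The rain had stopped by the time ...",   # true continuation
        "Chapter 12. Morning found the garrison asleep ...",  # distractor
        "Chapter 12. She had never trusted the letter ...",   # distractor
    ],
    "label": 0,  # index of the correct continuation
}


def score(prediction, item):
    """The agent reads the context, picks a candidate index, and is scored
    on whether it matched the labelled continuation."""
    return int(prediction == item["label"])
```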
Full Changelog: v1-benchmark...v1.1