v1.1 (beta) - Continuous tests, Resuming, and ChapterBreak
Pre-release

What's Changed
Single conversation testing
All testing is now performed over the span of a single conversation. This stresses the LTM more: the agent performs multiple versions of the same test in sequence, and we never wipe its memory clean manually. Instead, we tell it to forget something, and it must do so or risk confusing old information with new information.
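The single-conversation flow can be sketched roughly as follows. This is an illustrative toy, not the benchmark's actual API: `run_versions`, the "forget" phrasing, and `EchoAgent` are all hypothetical, showing only the pattern of repeated test versions in one unbroken conversation.

```python
# Illustrative sketch: several versions of the same test run back-to-back
# in one conversation. The agent's memory is never reset externally; only
# an explicit "forget" message tells it to discard the previous value.
def run_versions(agent, versions):
    results = []
    for value in versions:
        agent.send(f"Forget my favourite colour. It is now {value}.")
        answer = agent.send("What is my favourite colour?")
        results.append(answer == value)
    return results


class EchoAgent:
    """Toy stand-in agent that remembers only the last value it was told."""

    def __init__(self):
        self.memory = None

    def send(self, message):
        if "It is now" in message:
            # Store the new value, overwriting the old one.
            self.memory = message.rsplit(" ", 1)[-1].rstrip(".")
            return "OK"
        return self.memory
```

A real LTM agent would pass this only if its "forget" handling actually supersedes the old value rather than letting both linger in memory.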
Resuming testing
When the testing process fails, it can now pick up right from where it left off. All testing events are logged to a master log, which serves as the authoritative record of what has happened so far in the test suite. When the tests are resumed, this log is used to reset tests back to where they were in their scripts, and the process continues. See the runner readme for more details.
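A minimal sketch of the replay idea, assuming a hypothetical event-log format (one JSON object per line recording which test emitted which scripted action; the field names are illustrative, not the benchmark's actual schema):

```python
import json
from collections import defaultdict


def restore_positions(log_lines):
    """Replay the master log and return, per test, the index of the next
    unexecuted action in its script."""
    positions = defaultdict(int)
    for line in log_lines:
        event = json.loads(line)
        # Each logged action advances that test's script by one step.
        positions[event["test_id"]] = event["action_index"] + 1
    return dict(positions)


log = [
    '{"test_id": "names_1", "action_index": 0}',
    '{"test_id": "names_1", "action_index": 1}',
    '{"test_id": "recall_3", "action_index": 0}',
]
resume_points = restore_positions(log)
# names_1 resumes at its third action, recall_3 at its second.
```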
Agents are now broadly expected to be persistent. See the models readme for more details.
Datasets:
- Added the ChapterBreak dataset: a set of long texts (8k tokens) for which your agent has to choose which continuation of the text is the correct one.
- Prospective memory generation produces correct tests more reliably.
- Instruction Recall tests no longer generate questions or instructions that an LLM could reasonably guess without memory. (e.g. no more questions like “What should you do to prepare a drone for its first flight?”)
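For the ChapterBreak-style items above, a single example might be shaped like this. The field names (`context`, `candidates`, `label`) and the scoring helper are assumptions for illustration, not the dataset's actual schema:

```python
# Hypothetical shape of a ChapterBreak-style item: a long preceding context
# followed by several candidate chapter beginnings, exactly one of which is
# the true continuation.
item = {
    "context": "... ~8k tokens of preceding chapters ...",
    "candidates": [
        "Chapter 12. The rain had stopped by the time ...",   # true continuation
        "Chapter 12. Morning found the garrison asleep ...",  # distractor
        "Chapter 12. She had never trusted the letter ...",   # distractor
    ],
    "label": 0,  # index of the correct continuation
}


def score(prediction, item):
    """The agent reads the context, picks a candidate index, and is scored
    on whether it matched the labelled continuation."""
    return int(prediction == item["label"])
```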
Full Changelog: v1-benchmark...v1.1