
Commit c24776a

Merge pull request #34 from GoodAI/dev
Benchmark 2
2 parents 9064184 + f232c45 commit c24776a

545 files changed: +139204 additions, -1024 deletions


LICENSE

Lines changed: 2 additions & 0 deletions
@@ -12,6 +12,8 @@ furnished to do so, subject to the following conditions:
 The above copyright notice and this permission notice shall be included in all
 copies or substantial portions of the Software.
 
+Appropriate credit is given to the original author and source of the Software.
+
 THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE

README.md

Lines changed: 41 additions & 35 deletions
@@ -1,19 +1,15 @@
-# GoodAI LTM Benchmark
+# GoodAI LTM Benchmark (v2)
 
-![GoodAI Logo. A cybernetic owl, which is half robot, half organic, and next to it the company name: GoodAI](logo.png "GoodAI Research s.r.o.")
+![GoodAI Logo. A cybernetic owl, which is half robot, half organic, and next to it the company name: GoodAI](reporting/templates/GoodAI_logo.png "GoodAI Research s.r.o.")
 
-This repository contains the code and data to supplement [our blogpost](https://www.goodai.com/introducing-goodai-ltm-benchmark/).
+This repository contains the code and data to replicate our experiments regarding the Long-Term Memory (LTM) abilities of conversational agents. This is the 2<sup>nd</sup> version of our LTM Benchmark. Check out [our blogpost](https://www.goodai.com/introducing-goodai-ltm-benchmark/) for more information about the benchmark and the related research goals.
 
 As part of our research efforts in the area of continual learning, we are open-sourcing this benchmark for testing agents’ ability to perform tasks involving the advanced use of the memory over very long conversations. Among others, we evaluate the agent’s performance on tasks that require dynamic upkeep of memories or integration of information over long periods of time.
 
 We are open-sourcing:
 * The living GoodAI LTM Benchmark (this repository).
 * Our [LTM agents](model_interfaces/).
-* Our experiment data and results
-
-This benchmark has demonstrated that our **LTM agents with 8k context are comparable to long context GPT-4-1106 with 128k**
-tokens when recalling and correctly using information in short form conversational contexts. In a longer benchmark, our agents
-outperform long context GPT by **13%** for **16% of the running costs.** See the [Benchmark section](#benchmark-1---022024) for the scores.
+* Our experiment [data](data/tests/Benchmark%202%20-%2010k%20Filler/definitions) and [results](data/tests/Benchmark%202%20-%2010k%20Filler/results).
 
 ## Running the Benchmarks
 

@@ -45,12 +41,14 @@ The agents currently implemented in this repository are the ones shown below.
 # OpenAI models
 gpt/gpt-4 # GPT4
 gpt-3.5-turbo # GPT3.5
-gpt-4-1106 # GPT4-turbo preview
+gpt-4-0125 # GPT4-turbo preview
 ts-gpt-3.5-turbo # GPT3.5 with timestamped messages
-ts-gpt-4-1106 # GPT4-turbo preview with timestamped messages
+ts-gpt-4-0125 # GPT4-turbo preview with timestamped messages
 
-# Anthopic Models
-claude # Claude-2.1 200k context model
+# Anthropic Models (200k context)
+claude-2.1 # Claude 2.1
+claude-3-sonnet # Claude 3 Sonnet
+claude-3-opus # Claude 3 Opus
 
 # Langchain Models
 langchain_sb_a # Using 3.5-turbo-instruct and a summary buffer memory

@@ -77,7 +75,7 @@ human # A CLI interface for a human to use the tests.
 
 ## Configurations
 
-The configuration used in the blogpost benchmark can be found in `./configurations/blogpost_tests/benchmark-1k.yml`, in which `1k` refers to the information gap between relevant statements. For the `10k` benchmark, we used the very same test definitions as for the `1k` benchmark, but we increased the amount of filler tokens directly in the test definition files. This way we ensured that the length of the information gap is the only thing that changes between both benchmarks.
+The configuration files used in the different versions of the benchmark can be found in `configurations`, in which `1k` or `10k` refers to the minimum distance in tokens between relevant statements. For the `10k` benchmarks, we used the very same test definitions as for the `1k` benchmarks, but we increased the amount of filler tokens directly in the test definition files. This way we ensured that the length of the distraction segment is the only thing that changes between both benchmark configurations.
 
 
 ## Datasets

@@ -98,8 +96,10 @@ locations_directions
 names
 name_list
 prospective_memory
+restaurant
 sallyanne
 shopping
+spy_meeting
 trigger_response
 ```
 

@@ -117,34 +117,40 @@ The repository consists of four parts:
 More details for each of these parts can be found here: [datasets](datasets/README.md), [models](model_interfaces/README.md), [runner](runner/README.md), [reports](reporting/README.md).
 
 
-## Benchmark 1 - 02/2024
-### Benchmark 1 - 1k Distraction tokens
+## Benchmark 2 - 03/2024
+
+### Benchmark 2 - 1k Distraction Tokens
+
+| Model | Context Tokens | Score / 101 | Time (m) | Cost ($) | LTM Score (tokens) |
+|------------------------|---------------:|------------:|---------:|---------:|-------------------:|
+| GPT-3.5-turbo | 16384 | 58 | 16 | 2.71 | 105349 |
+| GPT-4-0125 | 128000 | 61 | 45 | 150.36 | 115625 |
+| Claude 3 Opus | 200000 | 83 | 346 | 374.58 | 231807 |
+| LTMAgent1 (GPT-4-0125) | 8192 | 80 | 255 | 39.25 | 166056 |
+| LTMAgent2 (GPT-4-0125) | 8192 | 75.5 | 40 | 27.94 | 132454 |
+| MemGPT | 8192 | 43 | 78 | 100.04 | 78045 |
 
-| Model | Context Tokens | Score / 92 | Time (m) | Cost ($) | Mean Memory Span |
-|--------|----------------|------------|----------|----------| ---------------- |
-| LTMAgent1 | 4096 | 85 | 153 | 14.82 | 6579 |
-| LTMAgent1 | 8192 | 89 | 148.5 | 19.14 | 7253 |
-| LTMAgent2 | 8192 | 86 | 31 | 14.31 | 7347 |
-| MemGPT | 4096 | 7 | 150 | 81.24 | 5990 |
-| MemGPT | 8192 | 44 | 103.3 | 91.69 | 6767 |
-| Claude-2.1 | 200000 | 74 | 57.3 | 11.78 | 7291 |
-| GPT-4-1106 | 4096 | 49 | 44 | 8.80 | 7344 |
-| GPT-4-1106 | 8192 | 77 | 34.7 | 13.85 | 7344 |
-| GPT-4-1106 | 128000 | 82 | 42.56 | 15.99 | 7283 |
+### Benchmark 2 - 10k Distraction Tokens
 
+| Model | Context Tokens | Score / 101 | Time (m) | Cost ($) | LTM Score (tokens) |
+|------------------------|---------------:|------------:|---------:|---------:|-------------------:|
+| GPT-3.5-turbo | 16384 | 32 | 36 | 4.43 | 381452 |
+| GPT-4-0125 | 128000 | 63 | 244 | 515.74 | 1042531 |
+| Claude 3 Opus | 200000 | 79 | 836 | 1104.00 | 1331521 |
+| LTMAgent2 (GPT-4-1106) | 8192 | 64.5 | 100 | 46.75 | 978836 |
+| LTMAgent2 (GPT-4-0125) | 8192 | 61 | 99 | 45.85 | 1006972 |
 
-### Benchmark 1 - 10k Distraction Tokens
+## Previous versions
 
-| Model | Context Tokens | Score / 92 | Time (m) | Cost ($) | Mean Memory Span |
-|------------|----------------|------------|----------|----------|------------------|
-| LTMAgent1 | 8192 | 86 | 529 | 61.38 | 57095 |
-| LTMAgent2 | 8192 | 85 | 117 | 38.98 | 57231 |
-| Claude-2.1 | 200000 | 42 | 346 | 227 | 60488 |
-| GPT-4-1106 | 8192 | 11 | 90.2 | 37.38 | 58060 |
-| GPT-4-1106 | 128000 | 76 | 154.58 | 255.30 | 57343 |
+- [Benchmark 1](https://github.com/GoodAI/goodai-ltm-benchmark/tree/v1-benchmark) (02/2024)
 
 ## Licence and usage
-This code is licenced under MIT. Some datasets use data generated by GPT, so those specific tests are unsuitable for commercial purposes.
+This project is licensed under the MIT License - see the LICENSE file for details. Use of this software requires attribution to the original author and project, as detailed in the license.
+
+Some datasets use data generated by GPT, so those specific tests are unsuitable for commercial purposes.
+
+
 
 ## Acknowledgements
 * The filler is drawn from the [TriviaQA dataset](https://github.com/mandarjoshi90/triviaqa) which is licenced under Apache 2.0.

configurations/benchmark-v2-10k.yml

Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
+config:
+  debug: True
+  run_name: "Benchmark 2 - 10k Filler"
+  incompatibilities:
+    - - "names"
+      - "name_list"
+    - - "locations"
+      - "locations_directions"
+
+datasets:
+  args:
+    filler_tokens_low: 10000
+    filler_tokens_high: 10000
+    pre_question_filler: 10000
+    dataset_examples: 3
+
+  datasets:
+    - name: "colours"
+      args:
+        colour_changes: 3
+
+    - name: "shopping"
+      args:
+        item_changes: 6
+
+    - name: "locations_directions"
+      args:
+        known_locations: 6
+
+    - name: "name_list"
+      args:
+        name_changes: 5
+
+    - name: "jokes"
+      args:
+        jokes_told: 3
+
+    - name: "sallyanne"
+
+    - name: "delayed_recall"
+
+    - name: "prospective_memory"
+
+    - name: "instruction_recall"
+
+    - name: "trigger_response"
+
+    - name: "spy_meeting"
+      args:
+        dataset_examples: 1
+
+    - name: "chapterbreak"
+
+    - name: "restaurant"
+      args:
+        dataset_examples: 1

configurations/benchmark-v2-1k.yml

Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
+config:
+  debug: True
+  run_name: "Benchmark 2 - 1k Filler"
+  incompatibilities:
+    - - "names"
+      - "name_list"
+    - - "locations"
+      - "locations_directions"
+
+datasets:
+  args:
+    filler_tokens_low: 1000
+    filler_tokens_high: 1000
+    pre_question_filler: 1000
+    dataset_examples: 3
+
+  datasets:
+    - name: "colours"
+      args:
+        colour_changes: 3
+
+    - name: "shopping"
+      args:
+        item_changes: 6
+
+    - name: "locations_directions"
+      args:
+        known_locations: 6
+
+    - name: "name_list"
+      args:
+        name_changes: 5
+
+    - name: "jokes"
+      args:
+        jokes_told: 3
+
+    - name: "sallyanne"
+
+    - name: "delayed_recall"
+
+    - name: "prospective_memory"
+
+    - name: "instruction_recall"
+
+    - name: "trigger_response"
+
+    - name: "spy_meeting"
+      args:
+        dataset_examples: 1
+
+    - name: "chapterbreak"
+
+    - name: "restaurant"
+      args:
+        dataset_examples: 1
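
The 1k and 10k configurations are intentionally identical except for the filler settings (`filler_tokens_low`, `filler_tokens_high`, `pre_question_filler`). The sketch below is a quick, hypothetical way to confirm that, assuming PyYAML is installed, the files sit at the paths shown above, and the YAML nests the shared dataset arguments under `datasets: args:` as reconstructed here; it is not part of the benchmark code.

```python
import yaml  # PyYAML (pip install pyyaml)


def load_args(path: str) -> dict:
    """Return the shared dataset arguments from a benchmark configuration file."""
    with open(path, encoding="utf-8") as f:
        return yaml.safe_load(f)["datasets"]["args"]


args_1k = load_args("configurations/benchmark-v2-1k.yml")
args_10k = load_args("configurations/benchmark-v2-10k.yml")

# Print only the keys whose values differ between the two variants.
for key in sorted(set(args_1k) | set(args_10k)):
    if args_1k.get(key) != args_10k.get(key):
        print(f"{key}: 1k={args_1k.get(key)}, 10k={args_10k.get(key)}")
```

With the files as listed, only the three filler values should be reported; `dataset_examples` and the dataset list are shared between both variants.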

data/Restaurant/menu.json

Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
+{
+  "Appetizers": [
+    "Classic Caesar Salad",
+    "Crispy Calamari with Marinara Sauce",
+    "Bruschetta with Fresh Tomato, Basil, and Balsamic Glaze",
+    "Spinach and Artichoke Dip served with Tortilla Chips"
+  ],
+  "Soups and Salads": [
+    "Garden Salad with Mixed Greens, Cucumbers, Tomatoes, and Balsamic Vinaigrette",
+    "French Onion Soup with Gruyere Cheese Crouton",
+    "Caprese Salad with Fresh Mozzarella, Tomatoes, Basil, and Balsamic Reduction"
+  ],
+  "Entrees": [
+    "Grilled Salmon with Lemon Herb Butter, served with Roasted Vegetables and Rice Pilaf",
+    "Chicken Parmesan with Marinara Sauce and Melted Mozzarella, served with Spaghetti",
+    "Filet Mignon with Red Wine Demi-Glace, Garlic Mashed Potatoes, and Steamed Asparagus",
+    "Vegetarian Stir-Fry with Tofu, Mixed Vegetables, and Teriyaki Sauce over Steamed Rice"
+  ],
+  "Pasta": [
+    "Spaghetti Carbonara with Pancetta, Egg, and Parmesan Cheese",
+    "Penne alla Vodka with Creamy Tomato Vodka Sauce",
+    "Linguine with Clams in White Wine Garlic Sauce",
+    "Vegetable Primavera with Seasonal Vegetables in a Light Tomato Sauce"
+  ],
+  "Sandwiches": [
+    "Classic BLT with Crispy Bacon, Lettuce, Tomato, and Mayo on Toasted Sourdough",
+    "Grilled Chicken Club with Avocado, Bacon, Lettuce, Tomato, and Herb Mayo on a Brioche Bun",
+    "Turkey and Swiss Panini with Cranberry Aioli on Ciabatta Bread",
+    "Portobello Mushroom Burger with Roasted Red Peppers, Arugula, and Pesto Mayo on a Whole Wheat Bun"
+  ],
+  "Vegan Options": [
+    "Vegan Lentil Soup with Seasonal Vegetables",
+    "Vegan Buddha Bowl with Quinoa, Roasted Chickpeas, Avocado, and Mixed Greens, drizzled with Tahini Dressing",
+    "Vegan Mushroom and Spinach Risotto with Arborio Rice and Truffle Oil",
+    "Vegan Beyond Burger with Lettuce, Tomato, Pickles, and Vegan Mayo on a Whole Wheat Bun, served with Sweet Potato Fries"
+  ],
+  "Halal Options": [
+    "Halal Chicken Shawarma Plate with Grilled Chicken, Rice, Hummus, Salad, and Pita Bread",
+    "Halal Lamb Kebabs with Grilled Vegetables, Basmati Rice, and Tzatziki Sauce",
+    "Halal Beef Biryani with Fragrant Basmati Rice, Tender Beef, and Traditional Spices",
+    "Halal Falafel Wrap with Hummus, Lettuce, Tomato, Pickled Turnips, and Tahini Sauce in a Warm Pita"
+  ],
+  "Desserts": [
+    "New York Style Cheesecake with Strawberry Compote",
+    "Warm Chocolate Lava Cake with Vanilla Ice Cream",
+    "Tiramisu with Espresso Soaked Ladyfingers and Mascarpone Cream",
+    "Fruit Tart with Seasonal Fresh Fruit and Pastry Cream"
+  ],
+  "Beverages": [
+    "Soft Drinks (Coke, Diet Coke, Sprite, etc.)",
+    "Iced Tea (Sweetened or Unsweetened)",
+    "Lemonade",
+    "Selection of Teas",
+    "Beer (Domestic and Imported)",
+    "Wine (Red, White, and Rosé by the Glass or Bottle)"
+  ]
+}
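
The new menu is a flat JSON object that maps each menu section to a list of dish descriptions, presumably for use by the `restaurant` dataset added in this commit. A minimal sketch of reading it, assuming the repository-relative path shown above:

```python
import json

# Load the restaurant menu added in this commit.
# The structure is simply {section name: [dish description, ...]}.
with open("data/Restaurant/menu.json", encoding="utf-8") as f:
    menu = json.load(f)

# Summarise the sections and show one example dish from each.
for section, dishes in menu.items():
    print(f"{section} ({len(dishes)} items): {dishes[0]}")
```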
Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 {
 "content":"Formulate an entirely novel fictitious quote from a fictitious person in single quotes ''. Say that it is a quote from that person",
-"question":"Instruct the subject to recite the above quote by appending it to the response after the next `n` statements by the user. Where `n` is an integer between 1 and 10.",
+"question":"Instruct the subject to recite the above quote by appending it to the response of the `n`th new message after this one. Where `n` is an integer between 1 and 10.",
 "answer":"State the condition for completing the task, provide the value of `n`."
 }
