
Commit c24776a

Merge pull request #34 from GoodAI/dev
Benchmark 2
2 parents 9064184 + f232c45 commit c24776a

545 files changed: +139204 additions, -1024 deletions


LICENSE

Lines changed: 2 additions & 0 deletions
@@ -12,6 +12,8 @@ furnished to do so, subject to the following conditions:
 The above copyright notice and this permission notice shall be included in all
 copies or substantial portions of the Software.
 
+Appropriate credit is given to the original author and source of the Software.
+
 THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE

README.md

Lines changed: 41 additions & 35 deletions
@@ -1,19 +1,15 @@
-# GoodAI LTM Benchmark
+# GoodAI LTM Benchmark (v2)
 
-![GoodAI Logo. A cybernetic owl, which is half robot, half organic, and next to it the company name: GoodAI](logo.png "GoodAI Research s.r.o.")
+![GoodAI Logo. A cybernetic owl, which is half robot, half organic, and next to it the company name: GoodAI](reporting/templates/GoodAI_logo.png "GoodAI Research s.r.o.")
 
-This repository contains the code and data to supplement [our blogpost](https://www.goodai.com/introducing-goodai-ltm-benchmark/).
+This repository contains the code and data to replicate our experiments regarding the Long-Term Memory (LTM) abilities of conversational agents. This is the 2<sup>nd</sup> version of our LTM Benchmark. Check out [our blogpost](https://www.goodai.com/introducing-goodai-ltm-benchmark/) for more information about the benchmark and the related research goals.
 
 As part of our research efforts in the area of continual learning, we are open-sourcing this benchmark for testing agents’ ability to perform tasks involving the advanced use of the memory over very long conversations. Among others, we evaluate the agent’s performance on tasks that require dynamic upkeep of memories or integration of information over long periods of time.
 
 We are open-sourcing:
 * The living GoodAI LTM Benchmark (this repository).
 * Our [LTM agents](model_interfaces/).
-* Our experiment data and results
-
-This benchmark has demonstrated that our **LTM agents with 8k context are comparable to long context GPT-4-1106 with 128k**
-tokens when recalling and correctly using information in short form conversational contexts. In a longer benchmark, our agents
-outperform long context GPT by **13%** for **16% of the running costs.** See the [Benchmark section](#benchmark-1---022024) for the scores.
+* Our experiment [data](data/tests/Benchmark%202%20-%2010k%20Filler/definitions) and [results](data/tests/Benchmark%202%20-%2010k%20Filler/results).
 
 ## Running the Benchmarks
 

@@ -45,12 +41,14 @@ The agents currently implemented in this repository are the ones shown below.
 # OpenAI models
 gpt/gpt-4 # GPT4
 gpt-3.5-turbo # GPT3.5
-gpt-4-1106 # GPT4-turbo preview
+gpt-4-0125 # GPT4-turbo preview
 ts-gpt-3.5-turbo # GPT3.5 with timestamped messages
-ts-gpt-4-1106 # GPT4-turbo preview with timestamped messages
+ts-gpt-4-0125 # GPT4-turbo preview with timestamped messages
 
-# Anthopic Models
-claude # Claude-2.1 200k context model
+# Anthropic Models (200k context)
+claude-2.1 # Claude 2.1
+claude-3-sonnet # Claude 3 Sonnet
+claude-3-opus # Claude 3 Opus
 
 # Langchain Models
 langchain_sb_a # Using 3.5-turbo-instruct and a summary buffer memory

@@ -77,7 +75,7 @@ human # A CLI interface for a human to use the tests.
 
 ## Configurations
 
-The configuration used in the blogpost benchmark can be found in `./configurations/blogpost_tests/benchmark-1k.yml`, in which `1k` refers to the information gap between relevant statements. For the `10k` benchmark, we used the very same test definitions as for the `1k` benchmark, but we increased the amount of filler tokens directly in the test definition files. This way we ensured that the length of the information gap is the only thing that changes between both benchmarks.
+The configuration files used in the different versions of the benchmark can be found in `configurations`, in which `1k` or `10k` refers to the minimum distance in tokens between relevant statements. For the `10k` benchmarks, we used the very same test definitions as for the `1k` benchmarks, but we increased the amount of filler tokens directly in the test definition files. This way we ensured that the length of the distraction segment is the only thing that changes between both benchmark configurations.
 
 
 ## Datasets

@@ -98,8 +96,10 @@ locations_directions
 names
 name_list
 prospective_memory
+restaurant
 sallyanne
 shopping
+spy_meeting
 trigger_response
 ```
 

@@ -117,34 +117,40 @@ The repository consists of four parts:
 More details for each of these parts can be found here: [datasets](datasets/README.md), [models](model_interfaces/README.md), [runner](runner/README.md), [reports](reporting/README.md).
 
 
-## Benchmark 1 - 02/2024
-### Benchmark 1 - 1k Distraction tokens
+## Benchmark 2 - 03/2024
+
+### Benchmark 2 - 1k Distraction Tokens
+
+| Model | Context Tokens | Score / 101 | Time (m) | Cost ($) | LTM Score (tokens) |
+|------------------------|---------------:|------------:|---------:|---------:|-------------------:|
+| GPT-3.5-turbo | 16384 | 58 | 16 | 2.71 | 105349 |
+| GPT-4-0125 | 128000 | 61 | 45 | 150.36 | 115625 |
+| Claude 3 Opus | 200000 | 83 | 346 | 374.58 | 231807 |
+| LTMAgent1 (GPT-4-0125) | 8192 | 80 | 255 | 39.25 | 166056 |
+| LTMAgent2 (GPT-4-0125) | 8192 | 75.5 | 40 | 27.94 | 132454 |
+| MemGPT | 8192 | 43 | 78 | 100.04 | 78045 |
 
-| Model | Context Tokens | Score / 92 | Time (m) | Cost ($) | Mean Memory Span |
-|--------|----------------|------------|----------|----------| ---------------- |
-| LTMAgent1 | 4096 | 85 | 153 | 14.82 | 6579 |
-| LTMAgent1 | 8192 | 89 | 148.5 | 19.14 | 7253 |
-| LTMAgent2 | 8192 | 86 | 31 | 14.31 | 7347 |
-| MemGPT | 4096 | 7 | 150 | 81.24 | 5990 |
-| MemGPT | 8192 | 44 | 103.3 | 91.69 | 6767 |
-| Claude-2.1 | 200000 | 74 | 57.3 | 11.78 | 7291 |
-| GPT-4-1106 | 4096 | 49 | 44 | 8.80 | 7344 |
-| GPT-4-1106 | 8192 | 77 | 34.7 | 13.85 | 7344 |
-| GPT-4-1106 | 128000 | 82 | 42.56 | 15.99 | 7283 |
+### Benchmark 2 - 10k Distraction Tokens
 
+| Model | Context Tokens | Score / 101 | Time (m) | Cost ($) | LTM Score (tokens) |
+|------------------------|---------------:|------------:|---------:|---------:|-------------------:|
+| GPT-3.5-turbo | 16384 | 32 | 36 | 4.43 | 381452 |
+| GPT-4-0125 | 128000 | 63 | 244 | 515.74 | 1042531 |
+| Claude 3 Opus | 200000 | 79 | 836 | 1104.00 | 1331521 |
+| LTMAgent2 (GPT-4-1106) | 8192 | 64.5 | 100 | 46.75 | 978836 |
+| LTMAgent2 (GPT-4-0125) | 8192 | 61 | 99 | 45.85 | 1006972 |
 
-### Benchmark 1 - 10k Distraction Tokens
+## Previous versions
 
-| Model | Context Tokens | Score / 92 | Time (m) | Cost ($) | Mean Memory Span |
-|------------|----------------|------------|----------|----------|------------------|
-| LTMAgent1 | 8192 | 86 | 529 | 61.38 | 57095 |
-| LTMAgent2 | 8192 | 85 | 117 | 38.98 | 57231 |
-| Claude-2.1 | 200000 | 42 | 346 | 227 | 60488 |
-| GPT-4-1106 | 8192 | 11 | 90.2 | 37.38 | 58060 |
-| GPT-4-1106 | 128000 | 76 | 154.58 | 255.30 | 57343 |
+- [Benchmark 1](https://github.com/GoodAI/goodai-ltm-benchmark/tree/v1-benchmark) (02/2024)
 
 ## Licence and usage
-This code is licenced under MIT. Some datasets use data generated by GPT, so those specific tests are unsuitable for commercial purposes.
+This project is licensed under the MIT License - see the LICENSE file for details. Use of this software requires attribution to the original author and project, as detailed in the license.
+
+Some datasets use data generated by GPT, so those specific tests are unsuitable for commercial purposes.
+
+
 
 ## Acknowledgements
 * The filler is drawn from the [TriviaQA dataset](https://github.com/mandarjoshi90/triviaqa) which is licenced under Apache 2.0.

configurations/benchmark-v2-10k.yml

Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
+config:
+  debug: True
+  run_name: "Benchmark 2 - 10k Filler"
+  incompatibilities:
+    - - "names"
+      - "name_list"
+    - - "locations"
+      - "locations_directions"
+
+datasets:
+  args:
+    filler_tokens_low: 10000
+    filler_tokens_high: 10000
+    pre_question_filler: 10000
+    dataset_examples: 3
+
+  datasets:
+    - name: "colours"
+      args:
+        colour_changes: 3
+
+    - name: "shopping"
+      args:
+        item_changes: 6
+
+    - name: "locations_directions"
+      args:
+        known_locations: 6
+
+    - name: "name_list"
+      args:
+        name_changes: 5
+
+    - name: "jokes"
+      args:
+        jokes_told: 3
+
+    - name: "sallyanne"
+
+    - name: "delayed_recall"
+
+    - name: "prospective_memory"
+
+    - name: "instruction_recall"
+
+    - name: "trigger_response"
+
+    - name: "spy_meeting"
+      args:
+        dataset_examples: 1
+
+    - name: "chapterbreak"
+
+    - name: "restaurant"
+      args:
+        dataset_examples: 1

configurations/benchmark-v2-1k.yml

Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
+config:
+  debug: True
+  run_name: "Benchmark 2 - 1k Filler"
+  incompatibilities:
+    - - "names"
+      - "name_list"
+    - - "locations"
+      - "locations_directions"
+
+datasets:
+  args:
+    filler_tokens_low: 1000
+    filler_tokens_high: 1000
+    pre_question_filler: 1000
+    dataset_examples: 3
+
+  datasets:
+    - name: "colours"
+      args:
+        colour_changes: 3
+
+    - name: "shopping"
+      args:
+        item_changes: 6
+
+    - name: "locations_directions"
+      args:
+        known_locations: 6
+
+    - name: "name_list"
+      args:
+        name_changes: 5
+
+    - name: "jokes"
+      args:
+        jokes_told: 3
+
+    - name: "sallyanne"
+
+    - name: "delayed_recall"
+
+    - name: "prospective_memory"
+
+    - name: "instruction_recall"
+
+    - name: "trigger_response"
+
+    - name: "spy_meeting"
+      args:
+        dataset_examples: 1
+
+    - name: "chapterbreak"
+
+    - name: "restaurant"
+      args:
+        dataset_examples: 1
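
The 1k and 10k configurations are intentionally identical except for the filler settings (`filler_tokens_low`, `filler_tokens_high`, `pre_question_filler`). The sketch below is a quick, hypothetical way to confirm that, assuming PyYAML is installed, the files sit at the paths shown above, and the YAML nests the shared dataset arguments under `datasets: args:` as reconstructed here; it is not part of the benchmark code.

```python
import yaml  # PyYAML (pip install pyyaml)


def load_args(path: str) -> dict:
    """Return the shared dataset arguments from a benchmark configuration file."""
    with open(path, encoding="utf-8") as f:
        return yaml.safe_load(f)["datasets"]["args"]


args_1k = load_args("configurations/benchmark-v2-1k.yml")
args_10k = load_args("configurations/benchmark-v2-10k.yml")

# Print only the keys whose values differ between the two variants.
for key in sorted(set(args_1k) | set(args_10k)):
    if args_1k.get(key) != args_10k.get(key):
        print(f"{key}: 1k={args_1k.get(key)}, 10k={args_10k.get(key)}")
```

With the files as listed, only the three filler values should be reported; `dataset_examples` and the dataset list are shared between both variants.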

data/Restaurant/menu.json

Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
+{
+  "Appetizers": [
+    "Classic Caesar Salad",
+    "Crispy Calamari with Marinara Sauce",
+    "Bruschetta with Fresh Tomato, Basil, and Balsamic Glaze",
+    "Spinach and Artichoke Dip served with Tortilla Chips"
+  ],
+  "Soups and Salads": [
+    "Garden Salad with Mixed Greens, Cucumbers, Tomatoes, and Balsamic Vinaigrette",
+    "French Onion Soup with Gruyere Cheese Crouton",
+    "Caprese Salad with Fresh Mozzarella, Tomatoes, Basil, and Balsamic Reduction"
+  ],
+  "Entrees": [
+    "Grilled Salmon with Lemon Herb Butter, served with Roasted Vegetables and Rice Pilaf",
+    "Chicken Parmesan with Marinara Sauce and Melted Mozzarella, served with Spaghetti",
+    "Filet Mignon with Red Wine Demi-Glace, Garlic Mashed Potatoes, and Steamed Asparagus",
+    "Vegetarian Stir-Fry with Tofu, Mixed Vegetables, and Teriyaki Sauce over Steamed Rice"
+  ],
+  "Pasta": [
+    "Spaghetti Carbonara with Pancetta, Egg, and Parmesan Cheese",
+    "Penne alla Vodka with Creamy Tomato Vodka Sauce",
+    "Linguine with Clams in White Wine Garlic Sauce",
+    "Vegetable Primavera with Seasonal Vegetables in a Light Tomato Sauce"
+  ],
+  "Sandwiches": [
+    "Classic BLT with Crispy Bacon, Lettuce, Tomato, and Mayo on Toasted Sourdough",
+    "Grilled Chicken Club with Avocado, Bacon, Lettuce, Tomato, and Herb Mayo on a Brioche Bun",
+    "Turkey and Swiss Panini with Cranberry Aioli on Ciabatta Bread",
+    "Portobello Mushroom Burger with Roasted Red Peppers, Arugula, and Pesto Mayo on a Whole Wheat Bun"
+  ],
+  "Vegan Options": [
+    "Vegan Lentil Soup with Seasonal Vegetables",
+    "Vegan Buddha Bowl with Quinoa, Roasted Chickpeas, Avocado, and Mixed Greens, drizzled with Tahini Dressing",
+    "Vegan Mushroom and Spinach Risotto with Arborio Rice and Truffle Oil",
+    "Vegan Beyond Burger with Lettuce, Tomato, Pickles, and Vegan Mayo on a Whole Wheat Bun, served with Sweet Potato Fries"
+  ],
+  "Halal Options": [
+    "Halal Chicken Shawarma Plate with Grilled Chicken, Rice, Hummus, Salad, and Pita Bread",
+    "Halal Lamb Kebabs with Grilled Vegetables, Basmati Rice, and Tzatziki Sauce",
+    "Halal Beef Biryani with Fragrant Basmati Rice, Tender Beef, and Traditional Spices",
+    "Halal Falafel Wrap with Hummus, Lettuce, Tomato, Pickled Turnips, and Tahini Sauce in a Warm Pita"
+  ],
+  "Desserts": [
+    "New York Style Cheesecake with Strawberry Compote",
+    "Warm Chocolate Lava Cake with Vanilla Ice Cream",
+    "Tiramisu with Espresso Soaked Ladyfingers and Mascarpone Cream",
+    "Fruit Tart with Seasonal Fresh Fruit and Pastry Cream"
+  ],
+  "Beverages": [
+    "Soft Drinks (Coke, Diet Coke, Sprite, etc.)",
+    "Iced Tea (Sweetened or Unsweetened)",
+    "Lemonade",
+    "Selection of Teas",
+    "Beer (Domestic and Imported)",
+    "Wine (Red, White, and Rosé by the Glass or Bottle)"
+  ]
+}
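
The new menu is a flat JSON object that maps each menu section to a list of dish descriptions, presumably for use by the `restaurant` dataset added in this commit. A minimal sketch of reading it, assuming the repository-relative path shown above:

```python
import json

# Load the restaurant menu added in this commit.
# The structure is simply {section name: [dish description, ...]}.
with open("data/Restaurant/menu.json", encoding="utf-8") as f:
    menu = json.load(f)

# Summarise the sections and show one example dish from each.
for section, dishes in menu.items():
    print(f"{section} ({len(dishes)} items): {dishes[0]}")
```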
Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 {
 "content":"Formulate an entirely novel fictitious quote from a fictitious person in single quotes ''. Say that it is a quote from that person",
-"question":"Instruct the subject to recite the above quote by appending it to the response after the next `n` statements by the user. Where `n` is an integer between 1 and 10.",
+"question":"Instruct the subject to recite the above quote by appending it to the response of the `n`th new message after this one. Where `n` is an integer between 1 and 10.",
 "answer":"State the condition for completing the task, provide the value of `n`."
 }
