Commit bffbe4f

Support returning multi-modal content from tools
1 parent 1c009f3 commit bffbe4f

12 files changed: +974 −5 lines changed

docs/tools.md (+56 −3)

@@ -15,6 +15,8 @@ There are a number of ways to register tools with an agent:
 * via the [`@agent.tool_plain`][pydantic_ai.Agent.tool_plain] decorator — for tools that do not need access to the agent [context][pydantic_ai.tools.RunContext]
 * via the [`tools`][pydantic_ai.Agent.__init__] keyword argument to `Agent` which can take either plain functions, or instances of [`Tool`][pydantic_ai.tools.Tool]
 
+## Registering Function Tools via Decorator
+
 `@agent.tool` is considered the default decorator since in the majority of cases tools will need access to the agent context.
 
 Here's an example using both:
@@ -188,7 +190,7 @@ sequenceDiagram
     Note over Agent: Game session complete
 ```
 
-## Registering Function Tools via kwarg
+## Registering Function Tools via Agent Argument
 
 As well as using the decorators, we can register tools via the `tools` argument to the [`Agent` constructor][pydantic_ai.Agent.__init__]. This is useful when you want to reuse tools, and can also give more fine-grained control over the tools.
@@ -244,6 +246,59 @@ print(dice_result['b'].output)
 
 _(This example is complete, it can be run "as is")_
 
+## Function Tool Output
+
+Tools can return anything that Pydantic can serialize to JSON, as well as audio, video, image or document content depending on the types of [multi-modal input](input.md) the model supports:
+
+```python {title="function_tool_output.py"}
+from pydantic import BaseModel
+from pydantic_ai import Agent, ImageUrl, DocumentUrl
+from pydantic_ai.models.openai import OpenAIResponsesModel
+from datetime import datetime
+
+class User(BaseModel):
+    name: str
+    age: int
+
+agent = Agent(model=OpenAIResponsesModel('gpt-4o'))
+
+@agent.tool_plain
+def get_current_time() -> datetime:
+    return datetime.now()
+
+@agent.tool_plain
+def get_user() -> User:
+    return User(name='John', age=30)
+
+@agent.tool_plain
+def get_company_logo() -> ImageUrl:
+    return ImageUrl(url='https://iili.io/3Hs4FMg.png')
+
+@agent.tool_plain
+def get_document() -> DocumentUrl:
+    return DocumentUrl(url='https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf')
+
+result = agent.run_sync('What time is it?')
+print(result.output)
+#> The current time is 10:45 PM on April 17, 2025.
+
+result = agent.run_sync('What is the user name?')
+print(result.output)
+#> The user's name is John.
+
+result = agent.run_sync('What is the company name in the logo?')
+print(result.output)
+#> The company name in the logo is "Pydantic."
+
+result = agent.run_sync('What is the main content of the document?')
+print(result.output)
+#> The document contains just the text "Dummy PDF file."
+```
+
+Some models (e.g. Gemini) natively support semi-structured return values, while some expect text (OpenAI) but seem to be just as good at extracting meaning from the data. If a Python object is returned and the model expects a string, the value will be serialized to JSON.
+
+_(This example is complete, it can be run "as is")_
+
 ## Function Tools vs. Structured Outputs
 
 As the name suggests, function tools use the model's "tools" or "functions" API to let the model know what is available to call. Tools or functions are also used to define the schema(s) for structured responses, thus a model might have access to many tools, some of which call function tools while others end the run and produce a final output.
@@ -307,8 +362,6 @@ agent.run_sync('hello', model=FunctionModel(print_schema))
 
 _(This example is complete, it can be run "as is")_
 
-The return type of tool can be anything which Pydantic can serialize to JSON as some models (e.g. Gemini) support semi-structured return values, some expect text (OpenAI) but seem to be just as good at extracting meaning from the data. If a Python object is returned and the model expects a string, the value will be serialized to JSON.
-
 If a tool has a single parameter that can be represented as an object in JSON schema (e.g. dataclass, TypedDict, pydantic model), the schema for the tool is simplified to be just that object.
 
 Here's an example where we use [`TestModel.last_model_request_parameters`][pydantic_ai.models.test.TestModel.last_model_request_parameters] to inspect the tool schema that would be passed to the model.
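The paragraph added to docs/tools.md above states the fallback rule: if a Python object is returned and the model expects a string, the value is serialized to JSON before it reaches the model. Here is a minimal sketch of that rule in isolation, narrowed to pydantic models for simplicity (`tool_return_as_text` is an illustrative helper, not part of the pydantic-ai API; the library itself accepts anything Pydantic can serialize):

```python
from typing import Union

from pydantic import BaseModel


class User(BaseModel):
    name: str
    age: int


def tool_return_as_text(content: Union[str, BaseModel]) -> str:
    """Render a tool's return value the way a text-only model would receive it."""
    if isinstance(content, str):
        return content  # strings are passed through unchanged
    return content.model_dump_json()  # other objects are serialized to JSON


print(tool_return_as_text(User(name='John', age=30)))
#> {"name":"John","age":30}
print(tool_return_as_text('already text'))
#> already text
```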

pydantic_ai_slim/pydantic_ai/_agent_graph.py (+25 −2)

@@ -576,7 +576,7 @@ def build_run_context(ctx: GraphRunContext[GraphAgentState, GraphAgentDeps[DepsT
     )
 
 
-async def process_function_tools(
+async def process_function_tools(  # noqa C901
     tool_calls: list[_messages.ToolCallPart],
     output_tool_name: str | None,
     output_tool_call_id: str | None,
@@ -662,6 +662,8 @@ async def process_function_tools(
     if not calls_to_run:
         return
 
+    user_parts: list[_messages.UserPromptPart] = []
+
     # Run all tool tasks in parallel
     results_by_index: dict[int, _messages.ModelRequestPart] = {}
     with ctx.deps.tracer.start_as_current_span(
@@ -675,14 +677,33 @@ async def process_function_tools(
             asyncio.create_task(tool.run(call, run_context, ctx.deps.tracer), name=call.tool_name)
             for tool, call in calls_to_run
         ]
+
+        file_index = 1
+
         pending = tasks
         while pending:
             done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
             for task in done:
                 index = tasks.index(task)
                 result = task.result()
                 yield _messages.FunctionToolResultEvent(result, tool_call_id=call_index_to_event_id[index])
-                if isinstance(result, (_messages.ToolReturnPart, _messages.RetryPromptPart)):
+
+                if isinstance(result, _messages.RetryPromptPart):
+                    results_by_index[index] = result
+                elif isinstance(result, _messages.ToolReturnPart):
+                    if result.is_multi_modal:
+                        user_parts.append(
+                            _messages.UserPromptPart(
+                                content=[f'This is file {file_index}:', result.content],
+                                timestamp=result.timestamp,
+                                part_kind='user-prompt',
+                            )
+                        )
+
+                        result.content = f'See file {file_index}.'
+
+                        file_index += 1
+
                     results_by_index[index] = result
                 else:
                     assert_never(result)
@@ -692,6 +713,8 @@ async def process_function_tools(
     for k in sorted(results_by_index):
         output_parts.append(results_by_index[k])
 
+    output_parts.extend(user_parts)
+
 
 async def _tool_from_mcp_server(
     tool_name: str,
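The new branch above is the heart of the commit: most chat APIs accept only text in the tool-result slot, so a multi-modal tool return is split in two parts. The `ToolReturnPart` keeps a textual placeholder (`'See file N.'`), while the actual content travels in a synthetic `UserPromptPart` (`'This is file N:'`) appended after all tool results via `output_parts.extend(user_parts)`. Here is a standalone sketch of that transformation, using the `pydantic_ai.messages` names visible in this diff (the tool name, URL, and call id are illustrative):

```python
from pydantic_ai.messages import ImageUrl, ToolReturnPart, UserPromptPart

result = ToolReturnPart(
    tool_name='get_company_logo',
    content=ImageUrl(url='https://iili.io/3Hs4FMg.png'),
    tool_call_id='call_1',
)

user_parts: list[UserPromptPart] = []
file_index = 1

if result.is_multi_modal:  # property added in messages.py below
    # the file itself is delivered in a follow-up user prompt part...
    user_parts.append(UserPromptPart(content=[f'This is file {file_index}:', result.content]))
    # ...while the tool result only carries a textual pointer to it
    result.content = f'See file {file_index}.'
    file_index += 1

print(result.content)
#> See file 1.
```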

pydantic_ai_slim/pydantic_ai/messages.py (+8 −0)

@@ -253,6 +253,9 @@ def format(self) -> str:
 
 UserContent: TypeAlias = 'str | ImageUrl | AudioUrl | DocumentUrl | VideoUrl | BinaryContent'
 
+# Ideally this would be a Union of types, but Python 3.9 requires it to be a string, and strings don't work with `isinstance`.
+MultiModalContentTypes = (ImageUrl, AudioUrl, DocumentUrl, VideoUrl, BinaryContent)
+
 
 def _document_format(media_type: str) -> DocumentFormat:
     if media_type == 'application/pdf':
@@ -357,6 +360,11 @@ class ToolReturnPart:
     part_kind: Literal['tool-return'] = 'tool-return'
     """Part type identifier, this is available on all parts as a discriminator."""
 
+    @property
+    def is_multi_modal(self) -> bool:
+        """Return `True` if the content is multi-modal content."""
+        return isinstance(self.content, MultiModalContentTypes)
+
     def model_response_str(self) -> str:
         """Return a string representation of the content for the model."""
         if isinstance(self.content, str):
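As the new comment notes, `UserContent` is a string type alias (required for Python 3.9 compatibility) and a string can't be passed to `isinstance`, which is why the parallel runtime tuple exists: a tuple of classes is exactly what `isinstance` accepts. A quick sketch of the check `is_multi_modal` performs, using the names added above (the URL is illustrative):

```python
from pydantic_ai.messages import ImageUrl, MultiModalContentTypes

# one isinstance call covers every multi-modal content type
assert isinstance(ImageUrl(url='https://iili.io/3Hs4FMg.png'), MultiModalContentTypes)
assert not isinstance('plain text', MultiModalContentTypes)
```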

tests/models/cassettes/test_anthropic/test_image_as_binary_content_input.yaml (+62)
Large diffs are not rendered by default.

tests/models/cassettes/test_anthropic/test_image_as_binary_content_tool_response.yaml (+153)
Large diffs are not rendered by default.

tests/models/cassettes/test_gemini/test_image_as_binary_content_tool_response.yaml (+150)
Large diffs are not rendered by default.

tests/models/cassettes/test_openai/test_image_as_binary_content_tool_response.yaml (+204)
Large diffs are not rendered by default.

tests/models/cassettes/test_openai_responses/test_image_as_binary_content_tool_response.yaml (+237)
Large diffs are not rendered by default.

tests/models/test_anthropic.py (+30 −0)

@@ -589,6 +589,36 @@ async def test_image_url_input_invalid_mime_type(allow_model_requests: None, ant
     )
 
 
+@pytest.mark.vcr()
+async def test_image_as_binary_content_tool_response(
+    allow_model_requests: None, anthropic_api_key: str, image_content: BinaryContent
+):
+    m = AnthropicModel('claude-3-5-sonnet-latest', provider=AnthropicProvider(api_key=anthropic_api_key))
+    agent = Agent(m)
+
+    @agent.tool_plain
+    async def get_image() -> BinaryContent:
+        return image_content
+
+    result = await agent.run(['What fruit is in the image you have access to via the get_image tool?'])
+    assert result.output == snapshot(
+        "The image shows a kiwi fruit that has been cut in half, displaying its characteristic bright green flesh with small black seeds arranged in a circular pattern around a white center core. The kiwi's fuzzy brown skin is visible around the edges of the slice."
+    )
+
+
+@pytest.mark.vcr()
+async def test_image_as_binary_content_input(
+    allow_model_requests: None, anthropic_api_key: str, image_content: BinaryContent
+):
+    m = AnthropicModel('claude-3-5-sonnet-latest', provider=AnthropicProvider(api_key=anthropic_api_key))
+    agent = Agent(m)
+
+    result = await agent.run(['What is the name of this fruit?', image_content])
+    assert result.output == snapshot(
+        "This is a kiwi fruit (or simply kiwi). It's a slice showing the characteristic bright green flesh with tiny black seeds arranged in a circular pattern around a white center core. The fruit has a distinctive appearance with its fuzzy brown exterior (though only the inner flesh is shown in this cross-section image)."
+    )
+
+
 @pytest.mark.parametrize('media_type', ('audio/wav', 'audio/mpeg'))
 async def test_audio_as_binary_content_input(allow_model_requests: None, media_type: str):
     c = completion_message([TextBlock(text='world', type='text')], AnthropicUsage(input_tokens=5, output_tokens=10))

tests/models/test_gemini.py (+19 −0)

@@ -959,6 +959,25 @@ def handler(request: httpx.Request) -> httpx.Response:
     assert result.output == 'world'
 
 
+@pytest.mark.vcr()
+async def test_image_as_binary_content_tool_response(
+    allow_model_requests: None, gemini_api_key: str, image_content: BinaryContent
+) -> None:
+    m = GeminiModel('gemini-2.5-pro-preview-03-25', provider=GoogleGLAProvider(api_key=gemini_api_key))
+    agent = Agent(m)
+
+    @agent.tool_plain
+    async def get_image() -> BinaryContent:
+        return image_content
+
+    result = await agent.run(['What fruit is in the image you have access to via the get_image tool?'])
+    assert result.output == snapshot("""\
+Okay, I have retrieved the image.
+
+The fruit in the image is a kiwi, sliced in half.\
+""")
+
+
 @pytest.mark.vcr()
 async def test_image_as_binary_content_input(
     allow_model_requests: None, gemini_api_key: str, image_content: BinaryContent

tests/models/test_openai.py (+15 −0)

@@ -640,6 +640,21 @@ async def test_image_url_input(allow_model_requests: None):
     )
 
 
+@pytest.mark.vcr()
+async def test_image_as_binary_content_tool_response(
+    allow_model_requests: None, image_content: BinaryContent, openai_api_key: str
+):
+    m = OpenAIModel('gpt-4o', provider=OpenAIProvider(api_key=openai_api_key))
+    agent = Agent(m)
+
+    @agent.tool_plain
+    async def get_image() -> BinaryContent:
+        return image_content
+
+    result = await agent.run(['What fruit is in the image you have access to via the get_image tool?'])
+    assert result.output == snapshot('The fruit in the image is a kiwi.')
+
+
 @pytest.mark.vcr()
 async def test_image_as_binary_content_input(
     allow_model_requests: None, image_content: BinaryContent, openai_api_key: str

tests/models/test_openai_responses.py (+15 −0)

@@ -221,6 +221,21 @@ async def get_location(loc_name: str) -> str:
     )
 
 
+@pytest.mark.vcr()
+async def test_image_as_binary_content_tool_response(
+    allow_model_requests: None, image_content: BinaryContent, openai_api_key: str
+):
+    m = OpenAIResponsesModel('gpt-4o', provider=OpenAIProvider(api_key=openai_api_key))
+    agent = Agent(m)
+
+    @agent.tool_plain
+    async def get_image() -> BinaryContent:
+        return image_content
+
+    result = await agent.run(['What fruit is in the image you have access to via the get_image tool?'])
+    assert result.output == snapshot('The fruit in the image is a kiwi.')
+
+
 async def test_image_as_binary_content_input(
     allow_model_requests: None, image_content: BinaryContent, openai_api_key: str
 ):
