fix: Add getting started tutorial to git #870
base: dev
Conversation
Please make sure all the checkboxes are checked:
Walkthrough
This update introduces major enhancements and refactoring across the Cognee codebase. Notable changes include the addition of new graph retrievers with chain-of-thought and context extension capabilities, expanded support for Memgraph as a graph database, improved pipeline and task execution with context propagation, and a new OpenAI-compatible responses API. Numerous bug fixes, error handling improvements, and new example scripts and tests are also included.
Sequence Diagram(s)

sequenceDiagram
participant Client
participant API
participant Dispatcher
participant ToolFunction
Client->>API: POST /api/v1/responses (with model/input/tools/tool_choice)
API->>API: Call OpenAI-compatible endpoint
API->>Dispatcher: For each function call in response, dispatch_function(tool_call)
Dispatcher->>ToolFunction: handle_search / handle_cognify / handle_prune (async)
ToolFunction-->>Dispatcher: Return result or error
Dispatcher-->>API: Aggregate tool call outputs
API-->>Client: Return structured ResponseBody with tool call results and usage
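To make this flow concrete, here is a minimal client-side sketch of calling the responses endpoint described above. It assumes a locally running Cognee API on port 8000; the auth header, model name, and the exact tool schema (assumed to follow the OpenAI function-calling shape) are illustrative, not taken from this PR.

import requests

# Assumed local deployment; adjust the host, port and credentials for your setup.
BASE_URL = "http://localhost:8000"

payload = {
    "model": "gpt-4o-mini",  # illustrative model name
    "input": "What did we ingest about German car manufacturers?",
    # Tool schema assumed to follow the OpenAI function-calling shape.
    "tools": [{"type": "function", "function": {"name": "search"}}],
    "tool_choice": "auto",
}

response = requests.post(
    f"{BASE_URL}/api/v1/responses",
    json=payload,
    headers={"Authorization": "Bearer <token>"},  # placeholder credential
    timeout=60,
)
response.raise_for_status()

# Per the diagram, the body aggregates the dispatched tool-call outputs and usage.
print(response.json())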
sequenceDiagram
participant Pipeline
participant Task
participant User
participant Context
Pipeline->>Task: run_tasks(tasks, user, pipeline_name, context)
Task->>Task: handle_task(..., context)
Task->>Task: If executable accepts context, pass context
Task->>Task: Possibly recurse for sub-tasks with context
Task-->>Pipeline: Return result
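The second flow can be illustrated with a small standalone sketch: a dispatcher passes the shared context only to executables that declare a context parameter. The helper below is a simplified stand-in for Cognee's run_tasks machinery, not the actual implementation.

import asyncio
import inspect


async def enrich(chunk: str, context: dict):
    # Declares `context`, so the dispatcher passes the shared dict through.
    return f"{chunk} (user={context['user']})"


async def tokenize(chunk: str):
    # Does not declare `context`, so it is called without it.
    return chunk.split()


async def handle_task(executable, argument, context):
    # Mirrors the signature check discussed later in this review:
    # only pass context when the callable actually accepts it.
    if "context" in inspect.signature(executable).parameters:
        return await executable(argument, context=context)
    return await executable(argument)


async def main():
    context = {"user": "default_user", "pipeline_name": "demo_pipeline"}
    print(await handle_task(enrich, "hello world", context))
    print(await handle_task(tokenize, "hello world", context))


asyncio.run(main())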
✅ There are no secrets present in this pull request anymore. If these secrets were true positives and are still valid, we highly recommend you revoke them.
Actionable comments posted: 48
🔭 Outside diff range comments (3)
cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (1)
151-156: 💡 Verification agent 🧩 Analysis chain

with_vector and normalized parameters are ignored

The search signature exposes with_vector and normalized, yet the body never references them. This makes the public API misleading and forces callers to pay for functionality that is never delivered.

- limit: int = 15,
- with_vector: bool = False,
- normalized: bool = True,
+ limit: int = 15,
+ *,
+ with_vector: bool = False,  # ← either honour it …
+ normalized: bool = True,    # ← … or remove from the signature.

Follow-up actions:
- Either implement the feature (include vectors in the DataFrame / optionally skip normalize_distances), or
- Delete the unused parameters and update call-sites.
Let’s verify whether with_vector and normalized are actually referenced in LanceDBAdapter.py:
🏁 Script executed:

#!/bin/bash
# Search for any usage of the parameters within the file
grep -R "with_vector" -n cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py || echo "No with_vector usages"
grep -R "normalized" -n cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py || echo "No normalized usages"

Length of output: 469
🏁 Script executed:

#!/bin/bash
# Display the portion of LanceDBAdapter.py around the search function to inspect parameter usage
sed -n '130,240p' cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py

Length of output: 3855
Remove or Implement Unused Parameters in LanceDBAdapter.search

The search signature exposes with_vector and normalized, but neither is ever referenced in the method body:

- with_vector isn’t used to include vectors in the returned results.
- normalized isn’t used to skip or control the normalize_distances call.

Locations to fix:
- File: cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py
- Method: async def search(...), lines ~151–156

Suggested diff (make keyword-only or remove):

- async def search(
-     self,
-     collection_name: str,
-     query_text: str = None,
-     query_vector: List[float] = None,
-     limit: int = 15,
-     with_vector: bool = False,
-     normalized: bool = True,
- ):
+ async def search(
+     self,
+     collection_name: str,
+     query_text: str = None,
+     query_vector: List[float] = None,
+     limit: int = 15,
+     *,
+     with_vector: bool = False,  # ← honor or remove
+     normalized: bool = True,    # ← honor or remove
+ ):

Follow-up actions:
- To implement:
  - Return vectors in each ScoredResult when with_vector=True.
  - Skip or apply normalize_distances based on normalized=False.
- Otherwise, remove these parameters (and update all call-sites, including batch_search).

cognee/infrastructure/databases/vector/milvus/MilvusAdapter.py (1)
151-160: ⚠️ Potential issue

Retrieve leaks client handle and risks type mismatch

- client is never closed – leaking sockets/threads.
- UUID objects are implicitly coerced inside an f-string; Milvus expects plain strings.

  client = self.get_milvus_client()
  try:
-     filter_expression = f"""id in [{", ".join(f'"{id}"' for id in data_point_ids)}]"""
+     filter_expression = (
+         "id in [" + ", ".join(f'"{str(dp_id)}"' for dp_id in data_point_ids) + "]"
+     )
      results = client.query(
          collection_name=collection_name,
          expr=filter_expression,
          output_fields=["*"],
      )
      return results
  finally:
      client.close()

cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (1)
85-99: 🛠️ Refactor suggestion

convert_to_weaviate_data_points is O(n²)

Using data_points.index(data_point) within a list comprehension forces a linear scan for every element. Replace with an enumerate for O(n):

-data_vectors = await self.embed_data([...])
-
-def convert_to_weaviate_data_points(data_point: DataPoint):
-    vector = data_vectors[data_points.index(data_point)]
+data_vectors = await self.embed_data([...])
+
+def convert_to_weaviate_data_points(idx_dp: tuple[int, DataPoint]):
+    idx, data_point = idx_dp
+    vector = data_vectors[idx]
 ...
-
-data_points = [convert_to_weaviate_data_points(data_point) for data_point in data_points]
+data_points = [
+    convert_to_weaviate_data_points(t) for t in enumerate(data_points)
+]

This matters for large batch inserts.
🧹 Nitpick comments (62)
cognee/modules/retrieval/exceptions/__init__.py (1)
7-7
: Explicitly define the public API via __all__

To signal that SearchTypeNotSupported and CypherSearchError are intended exports (and silence the F401 lint warnings), add an __all__ declaration.

 from .exceptions import SearchTypeNotSupported, CypherSearchError
+
+__all__ = ["SearchTypeNotSupported", "CypherSearchError"]

🧰 Tools
🪛 Ruff (0.11.9)
7-7:
.exceptions.SearchTypeNotSupported
imported but unused; consider removing, adding to__all__
, or using a redundant alias(F401)
7-7:
.exceptions.CypherSearchError
imported but unused; consider removing, adding to__all__
, or using a redundant alias(F401)
examples/python/multimedia_example.py (1)
23-29
: Use pathlib for path construction

While os.path.join works, switching to pathlib.Path will make the code more readable and robust across platforms. For example:

from pathlib import Path

project_root = Path(__file__).parents[2]
mp3_file_path = project_root / "examples" / "data" / "multimedia" / "text_to_speech.mp3"
png_file_path = project_root / "examples" / "data" / "multimedia" / "example.png"

notebooks/cognee_multimedia_demo.ipynb (2)
25-27
: Automate notebook cleanup

To ensure consistency, consider adding nbstripout or a pre-commit hook that automatically strips outputs and execution counts on commit.

34-38: Improve path resolution in notebook

Relying on os.path.abspath("") and "../" can break depending on the working directory. A more reliable approach is to use pathlib.Path and derive the project root explicitly. For instance:

from pathlib import Path

notebook_dir = Path().resolve().parent
project_root = notebook_dir.parent
mp3_file_path = project_root / "examples" / "data" / "multimedia" / "text_to_speech.mp3"
png_file_path = project_root / "examples" / "data" / "multimedia" / "example.png"

Also applies to: 40-43
cognee/modules/retrieval/utils/brute_force_triplet_search.py (1)
66-73
: Consider using contextlib.suppress for cleaner exception handling

The error handling logic is correct – gracefully continuing with an empty memory fragment when entities are not found is appropriate. However, the static analysis suggestion to use contextlib.suppress would be more pythonic.

+from contextlib import suppress
+
-    try:
-        await memory_fragment.project_graph_from_db(
-            graph_engine,
-            node_properties_to_project=properties_to_project,
-            edge_properties_to_project=["relationship_name"],
-        )
-    except EntityNotFoundError:
-        pass
+    with suppress(EntityNotFoundError):
+        await memory_fragment.project_graph_from_db(
+            graph_engine,
+            node_properties_to_project=properties_to_project,
+            edge_properties_to_project=["relationship_name"],
+        )

🧰 Tools
🪛 Ruff (0.11.9)
66-73: Use
contextlib.suppress(EntityNotFoundError)
instead oftry
-except
-pass
Replace with
contextlib.suppress(EntityNotFoundError)
(SIM105)
cognee-starter-kit/.gitignore (1)
1-197
: Comprehensive .gitignore with minor duplication issue.This is a well-structured and comprehensive
.gitignore
file that covers Python, Node.js, various frameworks, and cognee-specific patterns. However, there's a redundant entry for.env
(appears on both line 2 and line 140).Remove the duplicate
.env
entry:.data -.env .local.env .prod.env cognee/.data/
The comprehensive coverage of development tools, frameworks, and cognee-specific directories is excellent for maintaining a clean repository.
cognee/tests/unit/modules/pipelines/run_tasks_test.py (1)
53-55
: Evaluate necessity ofif __name__ == "__main__"
guard in test
The entry‐point guard allows running this test script directly but isn’t needed for standard unittest or pytest discovery. Consider removing it to keep tests consistent and avoid divergence from test framework conventions.cognee-frontend/src/utils/fetch.ts (1)
4-4
: Externalize API base URL for flexibility
Hardcodinghttp://localhost:8000/api
may lead to environment-specific issues. Consider moving this into a configuration or environment variable (e.g.process.env.API_BASE_URL
) to centralize the endpoint and enable easier testing and deployment.cognee/modules/pipelines/operations/__init__.py (1)
1-1
: Consider adding__all__
to clarify the public API.The static analysis tool correctly identifies that
log_pipeline_run_initiated
is imported but not explicitly exported. To maintain consistency and clarity about the module's public interface, consider adding an__all__
declaration:+__all__ = [ + "log_pipeline_run_initiated", + "log_pipeline_run_start", + "log_pipeline_run_complete", + "log_pipeline_run_error", + "cognee_pipeline" +]This makes the intent to expose these functions at the package level explicit.
🧰 Tools
🪛 Ruff (0.11.9)
1-1:
.log_pipeline_run_initiated.log_pipeline_run_initiated
imported but unused; consider removing, adding to__all__
, or using a redundant alias(F401)
cognee/modules/engine/models/__init__.py (1)
6-6
: Consider adding__all__
to clarify the public API.The static analysis tool correctly identifies that
ColumnValue
is imported but not explicitly exported. To maintain consistency and clarity about the module's public interface, consider adding an__all__
declaration:+__all__ = [ + "Entity", + "EntityType", + "TableRow", + "TableType", + "NodeSet", + "ColumnValue" +]This makes the intent to expose these model classes at the package level explicit and follows Python best practices for module organization.
🧰 Tools
🪛 Ruff (0.11.9)
6-6:
.ColumnValue.ColumnValue
imported but unused; consider removing, adding to__all__
, or using a redundant alias(F401)
cognee/tests/test_neo4j.py (1)
77-85
: Consider conditional test skipping instead of commenting out

While the reasoning for disabling the test is valid, commenting out tests reduces coverage. Consider using conditional skipping based on LLM model capability or configuration.
Instead of commenting out, consider using pytest skip:
- # NOTE: Due to the test failing often on weak LLM models we've removed this test for now - # search_results = await cognee.search( - # query_type=SearchType.NATURAL_LANGUAGE, - # query_text=f"Find nodes connected to node with name {random_node_name}", - # ) - # assert len(search_results) != 0, "Query related natural language don't exist." - # print("\nExtracted results are:\n") - # for result in search_results: - # print(f"{result}\n") + # Skip natural language search test for weak LLM models + try: + search_results = await cognee.search( + query_type=SearchType.NATURAL_LANGUAGE, + query_text=f"Find nodes connected to node with name {random_node_name}", + ) + assert len(search_results) != 0, "Query related natural language don't exist." + print("\nExtracted results are:\n") + for result in search_results: + print(f"{result}\n") + expected_history_count = 8 + except Exception as e: + logger.warning(f"Natural language search test skipped due to: {e}") + expected_history_count = 6cognee/infrastructure/llm/prompts/cot_followup_user_prompt.txt (1)
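If the goal is an explicit skip rather than tolerating failures, a plain pytest skip keeps the coverage signal visible. A minimal sketch, assuming an opt-in environment flag (the flag name and the placeholder test body are illustrative, not part of this PR):

import os

import pytest

# Hypothetical opt-in flag: set COGNEE_RUN_NL_SEARCH_TESTS=1 to exercise the
# natural-language search assertions on a capable model; otherwise skip.
requires_capable_llm = pytest.mark.skipif(
    os.getenv("COGNEE_RUN_NL_SEARCH_TESTS") != "1",
    reason="Natural-language search is flaky on weak LLM models",
)


@requires_capable_llm
def test_natural_language_search_placeholder():
    # Placeholder body: the real test would call cognee.search(...) with
    # SearchType.NATURAL_LANGUAGE and assert on the returned results.
    assert True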
5-5
: Fix template syntax inconsistency.There's a missing space in the template placeholder syntax.
Apply this fix for consistency:
-`{{ query}}` +`{{ query }}`cognee/eval_framework/metrics_dashboard.py (1)
131-170
: Consider improving function documentation.The function now has dual responsibilities: writing to file AND returning HTML content. The docstring should be updated to reflect this behavior change.
Update the docstring to clarify the return value:
- """Create and save the dashboard with all visualizations.""" + """Create and save the dashboard with all visualizations. + + Returns: + str: The generated dashboard HTML content. + """cognee/infrastructure/llm/prompts/cot_validation_user_prompt.txt (1)
2-2
: Fix template syntax inconsistency.There's a missing space in the template placeholder syntax, similar to the issue in the follow-up prompt template.
Apply this fix for consistency:
-`{{ query}}` +`{{ query }}`cognee-starter-kit/.env.template (1)
1-19
: LGTM! Well-structured environment template.This
.env.template
file provides clear guidance for users setting up the cognee starter kit. The comments effectively explain which variables are required for different providers, and the structure logically groups related configuration options.Consider adding a comment explaining the model name format (e.g.,
"openai/gpt-4o-mini"
) to help users understand the expected syntax:# In case you choose to use OpenAI provider, just adjust the model and api_key. LLM_API_KEY="" +# Format: provider/model-name (e.g., "openai/gpt-4o-mini") LLM_MODEL="openai/gpt-4o-mini"
notebooks/cognee_demo.ipynb (1)
532-536
: Consider simplifying redundant user retrieval.The current pattern retrieves the default user and then immediately fetches the same user by ID. This appears redundant unless there's a specific reason for the additional lookup.
Consider simplifying to:
from cognee.modules.users.methods import get_default_user from cognee.modules.data.methods import get_datasets_by_name -from cognee.modules.users.methods import get_user -default_user = await get_default_user() - -user = await get_user(default_user.id) +user = await get_default_user()Unless the additional
get_user
call serves a specific purpose (like ensuring fresh data or different object properties), the direct use ofget_default_user()
should suffice.cognee-starter-kit/src/pipelines/default.py (3)
5-5
: Remove or uncomment the unused import.The commented import
from cognee.shared.utils import render_graph
is not used in the current code. Consider removing it entirely or uncommenting if it will be used in the future.-# from cognee.shared.utils import render_graph
36-38
: Consider adding configuration guidance for Graphistry integration.The commented Graphistry integration code provides a valuable alternative visualization option. Consider adding a comment explaining how users can enable this feature or pointing to documentation.
- # # Get a graphistry url (Register for a free account at https://www.graphistry.com) - # url = await render_graph() - # print(f"Graphistry URL: {url}") + # Alternative: Get a Graphistry URL for advanced graph visualization + # (Register for a free account at https://www.graphistry.com and uncomment the render_graph import) + # url = await render_graph() + # print(f"Graphistry URL: {url}")
69-69
: Remove the extra blank line.There's an unnecessary extra blank line before the main execution block.
- - + if __name__ == "__main__":cognee/version.py (2)
10-18
: Apply the static analysis suggestion to combinewith
statements.The static analysis tool correctly identified that the nested
with
statements can be combined for better readability and style.with suppress(FileNotFoundError, StopIteration): - with open( - os.path.join(Path(__file__).parent.parent, "pyproject.toml"), encoding="utf-8" - ) as pyproject_toml: - version = ( - next(line for line in pyproject_toml if line.startswith("version")) - .split("=")[1] - .strip("'\"\n ") - ) + pyproject_path = os.path.join(Path(__file__).parent.parent, "pyproject.toml") + with open(pyproject_path, encoding="utf-8") as pyproject_toml: + version = ( + next(line for line in pyproject_toml if line.startswith("version")) + .split("=")[1] + .strip("'\"\n ") + ) # Mark the version as a local Cognee library by appending "-dev" return f"{version}-dev"🧰 Tools
🪛 Ruff (0.11.9)
10-13: Use a single
with
statement with multiple contexts instead of nestedwith
statementsCombine
with
statements(SIM117)
14-18
: Improve robustness of version parsing

The current version parsing assumes a specific format and could fail if the pyproject.toml format changes. Consider using a proper TOML parser for more robust parsing.

- version = (
-     next(line for line in pyproject_toml if line.startswith("version"))
-     .split("=")[1]
-     .strip("'\"\n ")
- )
+ for line in pyproject_toml:
+     if line.strip().startswith("version"):
+         # Handle both version = "1.0.0" and version = '1.0.0' formats
+         version = line.split("=", 1)[1].strip()
+         version = version.strip("'\"\n ")
+         break
+ else:
+     # If no version line found, let StopIteration be raised
+     raise StopIteration("No version found in pyproject.toml")

Alternatively, consider using a proper TOML parser like tomllib (Python 3.11+) or tomli for more robust parsing.
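For reference, a minimal sketch of the tomllib route (Python 3.11+). Whether the version lives under [project] or [tool.poetry] depends on how the pyproject.toml is laid out, so checking both keys here is an assumption:

import tomllib
from pathlib import Path


def read_project_version(pyproject_path: Path) -> str:
    # tomllib requires the file to be opened in binary mode.
    with pyproject_path.open("rb") as f:
        data = tomllib.load(f)
    # The version key may live under [project] or [tool.poetry], depending on the build backend.
    if "project" in data and "version" in data["project"]:
        return data["project"]["version"]
    return data["tool"]["poetry"]["version"]


print(read_project_version(Path(__file__).parent.parent / "pyproject.toml"))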
cognee/modules/pipelines/operations/log_pipeline_run_initiated.py (1)
16-22
: Consider adding error handling for database operations.The database session management is correct, but consider adding try-catch blocks for better error handling and logging.
async def log_pipeline_run_initiated(pipeline_id: str, pipeline_name: str, dataset_id: UUID): pipeline_run = PipelineRun( pipeline_run_id=uuid4(), pipeline_name=pipeline_name, pipeline_id=pipeline_id, status=PipelineRunStatus.DATASET_PROCESSING_INITIATED, dataset_id=dataset_id, run_info={}, ) db_engine = get_relational_engine() + try: async with db_engine.get_async_session() as session: session.add(pipeline_run) await session.commit() + except Exception as e: + # Log the error and re-raise or handle appropriately + raise return pipeline_runcognee-frontend/src/modules/ingestion/DataView/DataView.tsx (1)
36-36
: Consider renaming the component to avoid shadowing the global DataView.The component name
DataView
shadows the globalDataView
property (Web API). Consider renaming it to avoid confusion.-export default function DataView({ datasetId, data, onClose, onDataAdd, onCognify }: DataViewProps) { +export default function DatasetView({ datasetId, data, onClose, onDataAdd, onCognify }: DataViewProps) {Also update the interface name:
-interface DataViewProps { +interface DatasetViewProps {🧰 Tools
🪛 Biome (1.9.4)
[error] 36-36: Do not shadow the global "DataView" property.
Consider renaming this variable. It's easy to confuse the origin of variables when they're named after a known global.
(lint/suspicious/noShadowRestrictedNames)
cognee/shared/logging_utils.py (1)
7-12
: Remove unused import and verify version retrieval.The
importlib.metadata
import appears unused according to static analysis. Thestructlog.__version__
andcognee.__version__
are used directly without importing metadata.-import importlib.metadata
If version information needs to be retrieved from other packages, use importlib.metadata appropriately or remove the unused import.
🧰 Tools
🪛 Ruff (0.11.9)
10-10:
importlib.metadata
imported but unusedRemove unused import:
importlib.metadata
(F401)
examples/database_examples/chromadb_example.py (1)
1-1
: Remove unused import.The
os
module is imported but never used in this file.-import os
🧰 Tools
🪛 Ruff (0.11.9)
1-1:
os
imported but unusedRemove unused import:
os
(F401)
cognee/exceptions/exceptions.py (1)
23-30
: Simplify the logging conditional structure.The current if/elif chain can be simplified using a more elegant approach.
- if log and (log_level == "ERROR"): - logger.error(f"{self.name}: {self.message} (Status code: {self.status_code})") - elif log and (log_level == "WARNING"): - logger.warning(f"{self.name}: {self.message} (Status code: {self.status_code})") - elif log and (log_level == "INFO"): - logger.info(f"{self.name}: {self.message} (Status code: {self.status_code})") - elif log and (log_level == "DEBUG"): - logger.debug(f"{self.name}: {self.message} (Status code: {self.status_code})") + if log: + log_method = getattr(logger, log_level.lower(), None) + if log_method: + log_method(f"{self.name}: {self.message} (Status code: {self.status_code})")examples/data/car_and_tech_companies.txt (1)
17-17
: Fix grammatical error in sample text.There's a subject-verb disagreement: "Each of these car manufacturer" should use the plural form to agree with "these".
-Each of these car manufacturer contributes to Germany's reputation as a leader in the global automotive industry, showcasing a blend of innovation, performance, and design excellence. +Each of these car manufacturers contributes to Germany's reputation as a leader in the global automotive industry, showcasing a blend of innovation, performance, and design excellence.🧰 Tools
🪛 LanguageTool
[grammar] ~17-~17: The plural determiner ‘these’ does not agree with the singular noun ‘car’.
Context: ...nce practicality with quality. Each of these car manufacturer contributes to Germany's r...(THIS_NNS)
[uncategorized] ~17-~17: The grammatical number of this noun doesn’t look right. Consider replacing it.
Context: ...cality with quality. Each of these car manufacturer contributes to Germany's reputation as ...(AI_EN_LECTOR_REPLACEMENT_NOUN_NUMBER)
cognee/tasks/temporal_awareness/index_graphiti_objects.py (1)
64-66
: Clarify edge data structure documentation.The comment indicates that the relationship name is at index 2, but it would be helpful to document the complete structure of the edge tuple for maintainability.
Consider adding a more comprehensive comment:
- edge_types = Counter( - edge[2] # The edge key (relationship name) is at index 2 - for edge in edges_data - ) + edge_types = Counter( + edge[2] # Edge tuple format: (source_id, target_id, relationship_name, ...) + for edge in edges_data + )cognee/tests/unit/modules/pipelines/run_tasks_with_context_test.py (1)
42-47
: Consider using pytest instead of manual test runner.While the current approach works, consider integrating with pytest for better test discovery and reporting capabilities.
-def test_run_tasks(): - asyncio.run(run_and_check_tasks()) - - -if __name__ == "__main__": - test_run_tasks() +@pytest.mark.asyncio +async def test_run_tasks(): + await run_and_check_tasks()examples/database_examples/qdrant_example.py (1)
51-56
: Consider extracting sample text to external file.For a cleaner example, consider moving the sample text to a separate file or making it more concise.
You could create a
sample_data.txt
file or use a shorter, more focused example text to improve readability.examples/database_examples/falkordb_example.py (1)
1-1
: Remove unused import.The
os
module is imported but not used in this file.-import os import pathlib
🧰 Tools
🪛 Ruff (0.11.9)
1-1:
os
imported but unusedRemove unused import:
os
(F401)
examples/database_examples/kuzu_example.py (1)
1-1
: Remove unused import.The
os
module is imported but not used in this file.-import os import pathlib
🧰 Tools
🪛 Ruff (0.11.9)
1-1:
os
imported but unusedRemove unused import:
os
(F401)
notebooks/cognee_openai_compatable_demo.ipynb (1)
14-16
: Consider adding authentication documentation.The comment suggests using
/api/v1/auth/login
to get a JWT, but the example uses a hardcoded API key. Consider adding a cell demonstrating proper authentication or clarifying when the hardcoded key is acceptable for local development.- "# Use /api/v1/auth/login to get JWT\n", + "# For production: Use /api/v1/auth/login to get JWT\n", + "# For local development: Use a placeholder API key as shown below\n",cognee/modules/retrieval/graph_completion_context_extension_retriever.py (1)
3-3
: Remove unused imports identified by static analysis.The static analysis correctly identifies unused imports that should be removed for cleaner code.
Apply this diff to remove the unused imports:
-from cognee.infrastructure.llm.get_llm_client import get_llm_client -from cognee.infrastructure.llm.prompts import read_query_prompt, render_promptAlso applies to: 6-6
🧰 Tools
🪛 Ruff (0.11.9)
3-3:
cognee.infrastructure.llm.get_llm_client.get_llm_client
imported but unusedRemove unused import:
cognee.infrastructure.llm.get_llm_client.get_llm_client
(F401)
examples/database_examples/pgvector_example.py (1)
1-1
: Remove unused import identified by static analysis.The
os
module is imported but never used in the script.Apply this diff to remove the unused import:
-import os
🧰 Tools
🪛 Ruff (0.11.9)
1-1:
os
imported but unusedRemove unused import:
os
(F401)
examples/database_examples/weaviate_example.py (1)
42-44
: Consider making cleanup optional with user confirmation.The automatic pruning of existing data might be unexpected for users. Consider making this optional or adding a confirmation mechanism.
- # Clean any existing data (optional) - await cognee.prune.prune_data() - await cognee.prune.prune_system(metadata=True) + # Clean any existing data (optional) + # Uncomment the following lines if you want to clean existing data + # await cognee.prune.prune_data() + # await cognee.prune.prune_system(metadata=True)cognee/modules/pipelines/operations/run_tasks_base.py (1)
31-36
: Optimize the parameter inspection logic.The logic for checking if a function accepts a context parameter works correctly, but can be simplified based on the static analysis hint.
- has_context = any( - [key == "context" for key in inspect.signature(running_task.executable).parameters.keys()] - ) + has_context = "context" in inspect.signature(running_task.executable).parameters🧰 Tools
🪛 Ruff (0.11.9)
32-32: Use
key in dict
instead ofkey in dict.keys()
Remove
.keys()
(SIM118)
cognee/tests/test_memgraph.py (2)
16-27
: Consider using more portable path construction.The current path construction using
os.path.join
andpathlib.Path
is functional but could be simplified and made more readable.Consider using pathlib consistently for cleaner path construction:
- data_directory_path = str( - pathlib.Path( - os.path.join(pathlib.Path(__file__).parent, ".data_storage/test_memgraph") - ).resolve() - ) + data_directory_path = str( + (pathlib.Path(__file__).parent / ".data_storage/test_memgraph").resolve() + )Apply the same pattern to the cognee_directory_path for consistency.
91-91
: Verify the expected search history count.The assertion expects exactly 8 search history entries, but this seems tightly coupled to the number of search operations in the test.
Consider making the assertion more descriptive and potentially more flexible:
- assert len(history) == 8, "Search history is not correct." + expected_searches = 4 # Number of search operations performed + assert len(history) == expected_searches, f"Expected {expected_searches} search history entries, got {len(history)}"This makes the relationship between operations and expected history clearer.
cognee/api/v1/responses/routers/get_responses_router.py (1)
115-121
: Enhance error logging for function dispatch failures.The error handling captures exceptions but could provide more detailed information for debugging.
Consider adding more context to the error logging:
try: function_result = await dispatch_function(tool_call) output_status = "success" except Exception as e: - logger.exception(f"Error executing function {function_name}: {e}") + logger.exception( + f"Error executing function {function_name} with arguments {arguments_str}: {e}", + extra={"function_name": function_name, "arguments": arguments_str, "call_id": call_id} + ) function_result = f"Error executing {function_name}: {str(e)}" output_status = "error"cognee-starter-kit/src/pipelines/low_level.py (1)
110-113
: Add error handling for graph visualization.The graph visualization operation could fail if the directory doesn't exist or there are permission issues.
Add error handling and ensure the directory exists:
# Or use our simple graph preview graph_file_path = str( os.path.join(os.path.dirname(__file__), ".artifacts/graph_visualization.html") ) + # Ensure the artifacts directory exists + os.makedirs(os.path.dirname(graph_file_path), exist_ok=True) + + try: await visualize_graph(graph_file_path) + print(f"Graph visualization saved to: {graph_file_path}") + except Exception as e: + print(f"Failed to generate graph visualization: {e}")cognee/eval_framework/modal_eval_dashboard.py (2)
35-41
: Consider security implications of disabled CORS and XSRF protection.Disabling CORS and XSRF protection can create security vulnerabilities, especially if the dashboard is exposed to external networks.
Evaluate whether these security settings are necessary and document the security implications:
cmd = ( "streamlit run /root/serve_dashboard.py " "--server.port 8000 " + # WARNING: These settings disable security protections + # Only use in trusted environments "--server.enableCORS=false " "--server.enableXsrfProtection=false" )Consider if these settings are truly necessary for your deployment environment.
41-41
: Consider using exec() instead of Popen() for better process management.Using
subprocess.Popen()
without waiting for the process can lead to resource leaks or unclear error states.Consider using a more explicit process management approach:
- subprocess.Popen(cmd, shell=True) + try: + subprocess.run(cmd, shell=True, check=True) + except subprocess.CalledProcessError as e: + raise RuntimeError(f"Failed to start Streamlit: {e}")Alternatively, if you need the process to run in the background, consider proper process management to handle cleanup.
cognee/tests/unit/modules/retrieval/graph_completion_retriever_context_extension_test.py (1)
181-183
: Fix incorrect method names in the manual test runner
test_graph_completion_context_simple / test_graph_completion_context_complex are not defined – the actual names include _extension_. Running this file directly will therefore raise AttributeError.

- await test.test_graph_completion_context_simple()
- await test.test_graph_completion_context_complex()
+ await test.test_graph_completion_extension_context_simple()
+ await test.test_graph_completion_extension_context_complex()

cognee/tests/unit/modules/retrieval/graph_completion_retriever_cot_test.py (1)
178-181
: Align manual runner with real test names

Names invoked here don’t exist (_context_ vs. _cot_context_), which will crash the ad-hoc runner.

- await test.test_graph_completion_context_simple()
- await test.test_graph_completion_context_complex()
+ await test.test_graph_completion_cot_context_simple()
+ await test.test_graph_completion_cot_context_complex()

cognee/modules/retrieval/graph_completion_cot_retriever.py (2)
36-40
: Initialiseanswer
asstr
, notList[str]
, to match later usage
answer
is first declared as a list and then replaced by the string returned fromgenerate_completion
.
Keeping the initial type consistent eliminates static-type noise and avoids accidental nested-list returns in the future.- followup_question = "" - triplets = [] - answer = [""] + followup_question: str = "" + triplets: list[str] = [] + answer: str = ""
84-85
: Return value can be simplified
get_completion
promisesList[str]
but currently returns[answer]
whereanswer
isstr
, causing a single-element list.
If multi-answer support is not planned, returning the string directly simplifies both API and consumers (tests would need a one-line update).cognee/modules/pipelines/operations/pipeline.py (1)
172-174
: Error message does not match validation ruleThe guard checks for
"."
and" "
but the exception says “spaces or underscores”.- raise ValueError("Dataset name cannot contain spaces or underscores") + raise ValueError("Dataset name cannot contain spaces or periods")cognee/api/v1/responses/dispatch_function.py (2)
31-34
: Un-guardedjson.loads
can explode on malformed argumentsA malformed
arguments
string (which is common with LLM tool calls) will raiseJSONDecodeError
and bubble all the way up. Consider a safe parse with graceful degradation:- arguments = json.loads(arguments_str) + try: + arguments = json.loads(arguments_str) + except json.JSONDecodeError as exc: + logger.error("Cannot parse tool-call arguments: %s", exc) + return f"Error: Invalid JSON in tool arguments – {exc}"
71-82
:top_k
should be coerced toint
to avoid silent fall-backIf
top_k
arrives as a string (e.g."5"
), the currentisinstance(top_k, int)
check forces the default10
, surprising the caller.- top_k = arguments.get("top_k") + try: + top_k = int(arguments["top_k"]) + except (KeyError, ValueError, TypeError): + top_k = 10cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py (2)
108-116
: O(n²) lookup while building points
data_points.index(data_point)
performs a linear scan for every element. Useenumerate
to keep the complexity linear:- points = [convert_to_qdrant_point(point) for point in data_points] + points = [ + models.PointStruct( + id=str(point.id), + payload=point.model_dump(), + vector={"text": data_vectors[i]}, + ) + for i, point in enumerate(data_points) + ]
195-216
: Client closed twice – minor but noisy
await client.close()
is executed inside thetry
block and thefinally
block, which logs a warning in recent qdrant-client versions. Remove the first one and keep thefinally
cleanup.- await client.close()
cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (2)
163-165
: Unboundedlimit == 0
can explode memory usage
limit == 0
is translated to “return all rows” (await collection.count_rows()
), which on large tables can load millions of vectors into memory and DOS the service.Consider:
- if limit == 0: - limit = await collection.count_rows() + if limit == 0: + total = await collection.count_rows() + limit = min(total, self.MAX_BULK_LIMIT) # e.g. 1_000 + logger.warning( + "Limit=0 detected → clamped to %s to avoid OOM (collection size: %s)", + limit, + total, + )
205-211
: Inefficient row-by-row deletionIterating and issuing one
DELETE
per ID results in N network round-trips:for data_point_id in data_point_ids: await collection.delete(f"id = '{data_point_id}'")Most vector stores (including LanceDB) support bulk predicates. Deleting in one statement is both atomic and ~N-fold faster:
-for data_point_id in data_point_ids: - await collection.delete(f"id = '{data_point_id}'") +ids = "', '".join(map(str, data_point_ids)) +await collection.delete(f"id IN ('{ids}')")If you retain the per-row strategy to avoid commit conflicts, add a short
await asyncio.sleep()
back-off to reduce lock contention.cognee/infrastructure/databases/vector/milvus/MilvusAdapter.py (2)
181-186
: Redundant guard makes laterlimit if limit > 0 else None
unreachableBecause you
return []
whenlimit <= 0
, the conditional on line 204 is dead code.-if limit <= 0: - return [] +if limit is None or limit <= 0: + logger.warning("Non-positive limit supplied (%s); returning [].", limit) + return [] ... -limit=limit if limit > 0 else None, +limit=limit,
219-232
: Exception filtering duplicates earlier existence checkYou already bail out if the collection is missing (lines 187-192).
Catching the sameCollectionNotExistException
again is harmless but adds noise. Consider collapsing the two branches into a single guard clause.cognee/infrastructure/databases/graph/networkx/adapter.py (2)
139-142
: Edge retrieval now limited to incoming/outgoing onlyChanging the identifier type is fine, but note that the helper combines
in_edges
+out_edges
without preserving multi-edge keys.
If multiple edges with differentkey
s exist between two nodes, only the first will be returned, losing information.
Consider:edges = list(self.graph.in_edges(node_id, data=True, keys=True)) edges += list(self.graph.out_edges(node_id, data=True, keys=True))so every parallel edge is captured.
218-226
: Potential performance hit in neighbor aggregation
await asyncio.gather(self.get_predecessors(), self.get_successors())
is neat, but both called methods perform synchronous NetworkX look-ups; there is no actual I/O to await.
You can simplify and avoid the overhead:predecessors = await self.get_predecessors(node_id) successors = await self.get_successors(node_id)or make the helper functions synchronous and call directly.
cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (2)
206-216
: Batch search executes sequentially
await query_search(...)
inside a comprehension performs the searches serially.
Leverage concurrency withasyncio.gather
:vectors = await self.embed_data(query_texts) tasks = [ self.search(collection_name, query_vector=v, limit=limit, with_vector=with_vectors) for v in vectors ] return await asyncio.gather(*tasks)Reduces total latency considerably for remote calls.
221-227
:delete_many
result handlingWeaviate’s
delete_many
returns an object containing status counts.
Consider returningresult.summary
(or similar) so callers can verify deletions instead of the raw SDK object.cognee-mcp/src/server.py (1)
87-94
: Background tasks may swallow exceptions silently
asyncio.create_task()
detaches the coroutine; if it raises, the error is logged only when the task is garbage-collected. Consider attaching a done-callback that logs unexpected exceptions or gather the tasks in a supervisor:task = asyncio.create_task(cognify_task(abs_path)) task.add_done_callback(lambda t: logger.error(t.exception()) if t.exception() else None) tasks.append(task)This makes operational issues visible without blocking the caller.
cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py (1)
128-133
: Double round-trip when the collection already exists
create_data_points()
callsawait self.create_collection()
, which in turn performs alist_collections
round-trip even when the collection is known to exist. Right after that you again callget_collection()
. For high-throughput ingestion this is two extra network calls per batch.If you optimistically try
get_collection()
first and fall back tocreate_collection()
only onCollectionNotFoundError
, you save one RTT in the common path:try: collection = await self.get_collection(collection_name) except CollectionNotFoundError: await self.create_collection(collection_name) collection = await self.get_collection(collection_name)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (30)
.data/short_stories/soldiers-home.pdf
is excluded by!**/*.pdf
,!**/*.pdf
.dlt/config.toml
is excluded by!**/*.toml
.github/actions/cognee_setup/action.yml
is excluded by!**/*.yml
.github/workflows/db_examples_tests.yml
is excluded by!**/*.yml
.github/workflows/python_version_tests.yml
is excluded by!**/*.yml
.github/workflows/test_memgraph.yml
is excluded by!**/*.yml
.github/workflows/test_suites.yml
is excluded by!**/*.yml
assets/community/cognee_benefits_zh.JPG
is excluded by!**/*.jpg
,!**/*.jpg
assets/community/cognee_diagram_zh.JPG
is excluded by!**/*.jpg
,!**/*.jpg
assets/community/graph_visualization_pt.png
is excluded by!**/*.png
,!**/*.png
assets/community/graph_visualization_ru.png
is excluded by!**/*.png
,!**/*.png
assets/graph_visualization.png
is excluded by!**/*.png
,!**/*.png
cognee-mcp/pyproject.toml
is excluded by!**/*.toml
cognee-mcp/uv.lock
is excluded by!**/*.lock
,!**/*.lock
cognee-starter-kit/pyproject.toml
is excluded by!**/*.toml
cognee-starter-kit/src/data/companies.json
is excluded by!**/*.json
cognee-starter-kit/src/data/people.json
is excluded by!**/*.json
deployment/helm/Chart.yaml
is excluded by!**/*.yaml
deployment/helm/docker-compose-helm.yml
is excluded by!**/*.yml
deployment/helm/templates/cognee_deployment.yaml
is excluded by!**/*.yaml
deployment/helm/templates/cognee_service.yaml
is excluded by!**/*.yaml
deployment/helm/templates/postgres_deployment.yaml
is excluded by!**/*.yaml
deployment/helm/templates/postgres_pvc.yaml
is excluded by!**/*.yaml
deployment/helm/templates/postgres_service.yaml
is excluded by!**/*.yaml
deployment/helm/values.yaml
is excluded by!**/*.yaml
examples/data/multimedia/example.png
is excluded by!**/*.png
,!**/*.png
examples/data/multimedia/text_to_speech.mp3
is excluded by!**/*.mp3
,!**/*.mp3
poetry.lock
is excluded by!**/*.lock
,!**/*.lock
pyproject.toml
is excluded by!**/*.toml
uv.lock
is excluded by!**/*.lock
,!**/*.lock
📒 Files selected for processing (140)
.data/code/example.txt
(0 hunks)CONTRIBUTING.md
(1 hunks)Dockerfile
(1 hunks)Dockerfile_modal
(1 hunks)README.md
(3 hunks)alembic/versions/1d0bb7fede17_add_pipeline_run_status.py
(1 hunks)alembic/versions/482cd6517ce4_add_default_user.py
(1 hunks)assets/graph_visualization.html
(0 hunks)cognee-frontend/src/app/page.tsx
(3 hunks)cognee-frontend/src/app/wizard/CognifyStep/CognifyStep.tsx
(1 hunks)cognee-frontend/src/app/wizard/ExploreStep/ExploreStep.tsx
(1 hunks)cognee-frontend/src/app/wizard/WizardPage.tsx
(1 hunks)cognee-frontend/src/modules/datasets/cognifyDataset.ts
(1 hunks)cognee-frontend/src/modules/exploration/getExplorationGraphUrl.ts
(1 hunks)cognee-frontend/src/modules/ingestion/DataView/DataView.tsx
(4 hunks)cognee-frontend/src/ui/Partials/Explorer/Explorer.tsx
(1 hunks)cognee-frontend/src/ui/Partials/SearchView/SearchView.tsx
(2 hunks)cognee-frontend/src/utils/fetch.ts
(1 hunks)cognee-mcp/src/server.py
(3 hunks)cognee-starter-kit/.env.template
(1 hunks)cognee-starter-kit/.gitignore
(1 hunks)cognee-starter-kit/README.md
(1 hunks)cognee-starter-kit/src/pipelines/custom-model.py
(1 hunks)cognee-starter-kit/src/pipelines/default.py
(1 hunks)cognee-starter-kit/src/pipelines/low_level.py
(1 hunks)cognee/__init__.py
(1 hunks)cognee/api/client.py
(2 hunks)cognee/api/v1/add/add.py
(1 hunks)cognee/api/v1/cognify/code_graph_pipeline.py
(3 hunks)cognee/api/v1/cognify/cognify.py
(1 hunks)cognee/api/v1/config/config.py
(1 hunks)cognee/api/v1/datasets/datasets.py
(1 hunks)cognee/api/v1/delete/routers/get_delete_router.py
(1 hunks)cognee/api/v1/responses/__init__.py
(1 hunks)cognee/api/v1/responses/default_tools.py
(1 hunks)cognee/api/v1/responses/dispatch_function.py
(1 hunks)cognee/api/v1/responses/models.py
(1 hunks)cognee/api/v1/responses/routers/__init__.py
(1 hunks)cognee/api/v1/responses/routers/default_tools.py
(1 hunks)cognee/api/v1/responses/routers/get_responses_router.py
(1 hunks)cognee/base_config.py
(1 hunks)cognee/eval_framework/answer_generation/answer_generation_executor.py
(2 hunks)cognee/eval_framework/eval_config.py
(4 hunks)cognee/eval_framework/metrics_dashboard.py
(1 hunks)cognee/eval_framework/modal_eval_dashboard.py
(1 hunks)cognee/eval_framework/modal_run_eval.py
(3 hunks)cognee/exceptions/exceptions.py
(1 hunks)cognee/infrastructure/databases/graph/get_graph_engine.py
(1 hunks)cognee/infrastructure/databases/graph/graph_db_interface.py
(2 hunks)cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py
(1 hunks)cognee/infrastructure/databases/graph/networkx/adapter.py
(7 hunks)cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py
(2 hunks)cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py
(8 hunks)cognee/infrastructure/databases/vector/embeddings/FastembedEmbeddingEngine.py
(1 hunks)cognee/infrastructure/databases/vector/exceptions/exceptions.py
(1 hunks)cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py
(8 hunks)cognee/infrastructure/databases/vector/milvus/MilvusAdapter.py
(9 hunks)cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py
(5 hunks)cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py
(6 hunks)cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py
(9 hunks)cognee/infrastructure/llm/anthropic/adapter.py
(1 hunks)cognee/infrastructure/llm/gemini/adapter.py
(1 hunks)cognee/infrastructure/llm/generic_llm_api/adapter.py
(1 hunks)cognee/infrastructure/llm/ollama/adapter.py
(1 hunks)cognee/infrastructure/llm/openai/adapter.py
(4 hunks)cognee/infrastructure/llm/prompts/cot_followup_system_prompt.txt
(1 hunks)cognee/infrastructure/llm/prompts/cot_followup_user_prompt.txt
(1 hunks)cognee/infrastructure/llm/prompts/cot_validation_system_prompt.txt
(1 hunks)cognee/infrastructure/llm/prompts/cot_validation_user_prompt.txt
(1 hunks)cognee/modules/data/methods/__init__.py
(1 hunks)cognee/modules/data/methods/create_dataset.py
(2 hunks)cognee/modules/data/methods/get_unique_dataset_id.py
(1 hunks)cognee/modules/engine/models/ColumnValue.py
(1 hunks)cognee/modules/engine/models/__init__.py
(1 hunks)cognee/modules/graph/cognee_graph/CogneeGraph.py
(1 hunks)cognee/modules/observability/get_observe.py
(1 hunks)cognee/modules/observability/observers.py
(1 hunks)cognee/modules/pipelines/models/PipelineRun.py
(1 hunks)cognee/modules/pipelines/operations/__init__.py
(1 hunks)cognee/modules/pipelines/operations/get_pipeline_status.py
(2 hunks)cognee/modules/pipelines/operations/log_pipeline_run_initiated.py
(1 hunks)cognee/modules/pipelines/operations/pipeline.py
(3 hunks)cognee/modules/pipelines/operations/run_tasks.py
(4 hunks)cognee/modules/pipelines/operations/run_tasks_base.py
(4 hunks)cognee/modules/retrieval/exceptions/__init__.py
(1 hunks)cognee/modules/retrieval/exceptions/exceptions.py
(0 hunks)cognee/modules/retrieval/graph_completion_context_extension_retriever.py
(1 hunks)cognee/modules/retrieval/graph_completion_cot_retriever.py
(1 hunks)cognee/modules/retrieval/graph_completion_retriever.py
(1 hunks)cognee/modules/retrieval/utils/brute_force_triplet_search.py
(4 hunks)cognee/modules/search/methods/search.py
(3 hunks)cognee/modules/search/types/SearchType.py
(1 hunks)cognee/modules/settings/get_settings.py
(2 hunks)cognee/modules/visualization/cognee_network_visualization.py
(1 hunks)cognee/shared/data_models.py
(0 hunks)cognee/shared/logging_utils.py
(4 hunks)cognee/tasks/ingestion/ingest_data.py
(1 hunks)cognee/tasks/ingestion/migrate_relational_database.py
(2 hunks)cognee/tasks/temporal_awareness/index_graphiti_objects.py
(2 hunks)cognee/tests/integration/run_toy_tasks/conftest.py
(0 hunks)cognee/tests/test_memgraph.py
(1 hunks)cognee/tests/test_neo4j.py
(1 hunks)cognee/tests/test_relational_db_migration.py
(3 hunks)cognee/tests/test_starter_pipelines.py
(1 hunks)cognee/tests/test_weaviate.py
(1 hunks)cognee/tests/unit/modules/pipelines/run_tasks_test.py
(1 hunks)cognee/tests/unit/modules/pipelines/run_tasks_with_context_test.py
(1 hunks)cognee/tests/unit/modules/retrieval/chunks_retriever_test.py
(4 hunks)cognee/tests/unit/modules/retrieval/graph_completion_retriever_context_extension_test.py
(1 hunks)cognee/tests/unit/modules/retrieval/graph_completion_retriever_cot_test.py
(1 hunks)cognee/tests/unit/modules/retrieval/graph_completion_retriever_test.py
(1 hunks)cognee/tests/unit/modules/retrieval/summaries_retriever_test.py
(1 hunks)cognee/tests/unit/modules/retrieval/utils/brute_force_triplet_search_test.py
(0 hunks)cognee/version.py
(1 hunks)entrypoint.sh
(3 hunks)examples/data/car_and_tech_companies.txt
(1 hunks)examples/database_examples/chromadb_example.py
(1 hunks)examples/database_examples/falkordb_example.py
(1 hunks)examples/database_examples/kuzu_example.py
(1 hunks)examples/database_examples/milvus_example.py
(1 hunks)examples/database_examples/neo4j_example.py
(1 hunks)examples/database_examples/pgvector_example.py
(1 hunks)examples/database_examples/qdrant_example.py
(1 hunks)examples/database_examples/weaviate_example.py
(1 hunks)examples/python/graphiti_example.py
(2 hunks)examples/python/multimedia_example.py
(1 hunks)notebooks/cognee_demo.ipynb
(3 hunks)notebooks/cognee_graphiti_demo.ipynb
(4 hunks)notebooks/cognee_llama_index.ipynb
(2 hunks)notebooks/cognee_multimedia_demo.ipynb
(2 hunks)notebooks/cognee_openai_compatable_demo.ipynb
(1 hunks)notebooks/cognee_simple_demo.ipynb
(7 hunks)notebooks/github_graph_visualization.html
(0 hunks)notebooks/graphrag_vs_rag.ipynb
(7 hunks)notebooks/hr_demo.ipynb
(0 hunks)notebooks/llama_index_cognee_integration.ipynb
(5 hunks)profiling/graph_pydantic_conversion/benchmark_function.py
(0 hunks)profiling/graph_pydantic_conversion/profile_graph_pydantic_conversion.py
(0 hunks)profiling/util/DummyEmbeddingEngine.py
(0 hunks)profiling/util/DummyLLMAdapter.py
(0 hunks)
💤 Files with no reviewable changes (12)
- cognee/shared/data_models.py
- cognee/modules/retrieval/exceptions/exceptions.py
- profiling/graph_pydantic_conversion/profile_graph_pydantic_conversion.py
- profiling/util/DummyEmbeddingEngine.py
- cognee/tests/integration/run_toy_tasks/conftest.py
- cognee/tests/unit/modules/retrieval/utils/brute_force_triplet_search_test.py
- .data/code/example.txt
- assets/graph_visualization.html
- profiling/graph_pydantic_conversion/benchmark_function.py
- notebooks/github_graph_visualization.html
- profiling/util/DummyLLMAdapter.py
- notebooks/hr_demo.ipynb
🧰 Additional context used
🧬 Code Graph Analysis (32)
cognee/tests/unit/modules/pipelines/run_tasks_test.py (1)
cognee/tests/unit/modules/pipelines/run_tasks_with_context_test.py (1)
test_run_tasks
(42-43)
cognee/tests/unit/modules/retrieval/summaries_retriever_test.py (1)
cognee/modules/retrieval/summaries_retriever.py (1)
SummariesRetriever
(9-33)
cognee/modules/data/methods/__init__.py (1)
cognee/modules/data/methods/get_unique_dataset_id.py (1)
get_unique_dataset_id
(5-6)
cognee/api/v1/responses/routers/__init__.py (1)
cognee/api/v1/responses/routers/get_responses_router.py (1)
get_responses_router
(25-149)
cognee/api/v1/add/add.py (4)
cognee/modules/users/models/User.py (1)
User
(12-39)cognee/modules/pipelines/operations/pipeline.py (1)
cognee_pipeline
(30-103)cognee/tasks/ingestion/ingest_data.py (1)
ingest_data
(20-192)cognee/tasks/ingestion/resolve_data_directories.py (1)
resolve_data_directories
(8-72)
cognee/tasks/ingestion/ingest_data.py (1)
cognee/modules/data/methods/create_dataset.py (1)
create_dataset
(11-33)
cognee/modules/data/methods/get_unique_dataset_id.py (1)
cognee/modules/users/models/User.py (1)
User
(12-39)
cognee-frontend/src/app/wizard/ExploreStep/ExploreStep.tsx (1)
cognee-frontend/src/ui/Partials/Explorer/Explorer.tsx (1)
Explorer
(15-61)
cognee/api/v1/responses/__init__.py (1)
cognee/api/v1/responses/routers/get_responses_router.py (1)
get_responses_router
(25-149)
cognee/__init__.py (1)
cognee/version.py (1)
get_cognee_version
(7-24)
cognee/modules/observability/get_observe.py (2)
cognee/base_config.py (1)
get_base_config
(29-30)cognee/modules/observability/observers.py (1)
Observer
(4-9)
cognee/api/client.py (1)
cognee/api/v1/responses/routers/get_responses_router.py (1)
get_responses_router
(25-149)
cognee/modules/engine/models/__init__.py (1)
cognee/modules/engine/models/ColumnValue.py (1)
ColumnValue
(4-9)
cognee/tests/test_neo4j.py (1)
cognee/modules/users/methods/get_default_user.py (1)
get_default_user
(12-37)
cognee/modules/pipelines/operations/get_pipeline_status.py (1)
cognee/modules/pipelines/models/PipelineRun.py (1)
PipelineRun
(15-27)
cognee/api/v1/datasets/datasets.py (1)
cognee/modules/pipelines/operations/get_pipeline_status.py (1)
get_pipeline_status
(8-35)
cognee-frontend/src/modules/datasets/cognifyDataset.ts (1)
cognee-frontend/src/utils/fetch.ts (1)
fetch
(3-12)
cognee/infrastructure/databases/graph/get_graph_engine.py (1)
cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py (1)
MemgraphAdapter
(20-690)
cognee/base_config.py (1)
cognee/modules/observability/observers.py (1)
Observer
(4-9)
cognee/modules/engine/models/ColumnValue.py (1)
cognee/infrastructure/engine/models/DataPoint.py (1)
DataPoint
(16-96)
cognee-frontend/src/app/page.tsx (1)
cognee/api/v1/datasets/datasets.py (1)
datasets
(7-40)
cognee/infrastructure/llm/openai/adapter.py (1)
cognee/modules/observability/get_observe.py (1)
get_observe
(5-11)
cognee/tests/test_weaviate.py (1)
cognee/infrastructure/databases/vector/get_vector_engine.py (1)
get_vector_engine
(5-6)
cognee/modules/data/methods/create_dataset.py (2)
cognee/modules/data/methods/get_unique_dataset_id.py (1)
get_unique_dataset_id
(5-6)cognee/modules/users/models/User.py (1)
User
(12-39)
examples/database_examples/neo4j_example.py (5)
cognee/modules/search/types/SearchType.py (1)
SearchType
(4-15)examples/database_examples/kuzu_example.py (1)
main
(8-81)examples/database_examples/falkordb_example.py (1)
main
(8-83)examples/database_examples/milvus_example.py (1)
main
(8-85)cognee/api/v1/config/config.py (4)
config
(15-194)set_graph_db_config
(149-158)data_root_directory
(32-34)system_root_directory
(17-29)
alembic/versions/482cd6517ce4_add_default_user.py (1)
cognee/modules/users/methods/create_default_user.py (1)
create_default_user
(5-19)
cognee/modules/retrieval/utils/brute_force_triplet_search.py (6)
cognee/infrastructure/databases/exceptions/exceptions.py (1)
EntityNotFoundError
(15-27)cognee/infrastructure/databases/vector/exceptions/exceptions.py (1)
CollectionNotFoundError
(5-14)cognee/modules/graph/cognee_graph/CogneeGraph.py (1)
project_graph_from_db
(55-113)cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py (1)
search
(155-216)cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (1)
search
(166-204)cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (1)
search
(146-182)
cognee/exceptions/exceptions.py (1)
cognee/shared/logging_utils.py (4)
error
(138-139)warning
(135-136)info
(132-133)debug
(144-145)
cognee/modules/pipelines/operations/run_tasks.py (3)
cognee/modules/pipelines/tasks/task.py (1)
Task
(5-97)cognee/modules/users/models/User.py (1)
User
(12-39)cognee/modules/pipelines/operations/run_tasks_base.py (1)
run_tasks_base
(66-82)
cognee/modules/pipelines/operations/__init__.py (1)
cognee/modules/pipelines/operations/log_pipeline_run_initiated.py (1)
log_pipeline_run_initiated
(6-22)
cognee/infrastructure/databases/vector/milvus/MilvusAdapter.py (10)
cognee/shared/logging_utils.py (1)
get_logger
(148-169)
cognee/infrastructure/engine/models/DataPoint.py (1)
DataPoint
(16-96)
cognee/infrastructure/engine/utils/parse_id.py (1)
parse_id
(4-10)
cognee/infrastructure/databases/vector/exceptions/exceptions.py (1)
CollectionNotFoundError
(5-14)
cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py (3)
retrieve
(149-153)
has_collection
(74-78)
delete_data_points
(259-262)
cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (3)
retrieve
(151-164)
has_collection
(51-53)
delete_data_points
(218-226)
cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (3)
retrieve
(129-144)
has_collection
(52-55)
delete_data_points
(205-210)
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (3)
retrieve
(190-203)
has_collection
(61-71)
delete_data_points
(285-293)
cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py (3)
retrieve
(165-177)
has_collection
(111-113)
delete_data_points
(300-304)
cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py (3)
retrieve
(226-233)
has_collection
(116-119)
delete_data_points
(354-360)
cognee/infrastructure/databases/graph/networkx/adapter.py (4)
cognee/infrastructure/databases/graph/neo4j_driver/adapter.py (7)
has_node
(66-75)
get_edges
(264-275)
extract_node
(121-124)
extract_nodes
(126-136)
get_neighbors
(381-383)
get_node
(385-392)
get_nodes
(394-402)
cognee/infrastructure/databases/graph/kuzu/adapter.py (7)
has_node
(167-171)
get_edges
(439-475)
extract_node
(284-304)
extract_nodes
(306-325)
get_neighbors
(479-481)
get_node
(483-502)
get_nodes
(504-521)
cognee/infrastructure/databases/graph/graph_db_interface.py (4)
get_edges
(177-179)
get_neighbors
(182-184)
get_node
(125-127)
get_nodes
(130-132)
cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py (2)
extract_node
(235-238)
extract_nodes
(240-241)
🪛 Ruff (0.11.9)
cognee/modules/data/methods/__init__.py
10-10: .get_unique_dataset_id.get_unique_dataset_id
imported but unused; consider removing, adding to __all__
, or using a redundant alias
(F401)
cognee/modules/retrieval/exceptions/__init__.py
7-7: .exceptions.SearchTypeNotSupported
imported but unused; consider removing, adding to __all__
, or using a redundant alias
(F401)
7-7: .exceptions.CypherSearchError
imported but unused; consider removing, adding to __all__
, or using a redundant alias
(F401)
cognee/modules/engine/models/__init__.py
6-6: .ColumnValue.ColumnValue
imported but unused; consider removing, adding to __all__
, or using a redundant alias
(F401)
examples/database_examples/chromadb_example.py
1-1: os
imported but unused
Remove unused import: os
(F401)
cognee/version.py
10-13: Use a single with
statement with multiple contexts instead of nested with
statements
Combine with
statements
(SIM117)
examples/database_examples/falkordb_example.py
1-1: os
imported but unused
Remove unused import: os
(F401)
cognee/shared/logging_utils.py
10-10: importlib.metadata
imported but unused
Remove unused import: importlib.metadata
(F401)
cognee/modules/retrieval/graph_completion_context_extension_retriever.py
3-3: cognee.infrastructure.llm.get_llm_client.get_llm_client
imported but unused
Remove unused import: cognee.infrastructure.llm.get_llm_client.get_llm_client
(F401)
6-6: cognee.infrastructure.llm.prompts.read_query_prompt
imported but unused
Remove unused import
(F401)
6-6: cognee.infrastructure.llm.prompts.render_prompt
imported but unused
Remove unused import
(F401)
alembic/versions/482cd6517ce4_add_default_user.py
24-27: Use contextlib.suppress(Exception)
instead of try
-except
-pass
Replace with contextlib.suppress(Exception)
(SIM105)
examples/database_examples/kuzu_example.py
1-1: os
imported but unused
Remove unused import: os
(F401)
examples/database_examples/pgvector_example.py
1-1: os
imported but unused
Remove unused import: os
(F401)
alembic/versions/1d0bb7fede17_add_pipeline_run_status.py
13-13: cognee.modules.pipelines.models.PipelineRun.PipelineRun
imported but unused
Remove unused import
(F401)
13-13: cognee.modules.pipelines.models.PipelineRun.PipelineRunStatus
imported but unused
Remove unused import
(F401)
cognee/modules/retrieval/utils/brute_force_triplet_search.py
66-73: Use contextlib.suppress(EntityNotFoundError)
instead of try
-except
-pass
Replace with contextlib.suppress(EntityNotFoundError)
(SIM105)
cognee/api/v1/responses/routers/get_responses_router.py
74-74: Do not perform function call Depends
in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable
(B008)
cognee-starter-kit/src/pipelines/low_level.py
37-37: Use a context manager for opening files
(SIM115)
40-40: Use a context manager for opening files
(SIM115)
cognee/modules/pipelines/operations/__init__.py
1-1: .log_pipeline_run_initiated.log_pipeline_run_initiated
imported but unused; consider removing, adding to __all__
, or using a redundant alias
(F401)
cognee/modules/pipelines/operations/run_tasks_base.py
32-32: Use key in dict
instead of key in dict.keys()
Remove .keys()
(SIM118)
cognee/eval_framework/modal_run_eval.py
5-5: hashlib
imported but unused
Remove unused import: hashlib
(F401)
cognee-mcp/src/server.py
166-166: Within an except
clause, raise exceptions with raise ... from err
or raise ... from None
to distinguish them from errors in exception handling
(B904)
cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py
430-430: Do not use mutable data structures for argument defaults
Replace with None
; initialize within function
(B006)
🪛 LanguageTool
cognee/infrastructure/llm/prompts/cot_validation_system_prompt.txt
[uncategorized] ~1-~1: This verb does not appear to agree with the subject. Consider using a different form.
Context: You are a helpful agent who are allowed to use only the provided questi...
(AI_EN_LECTOR_REPLACEMENT_VERB_AGREEMENT)
cognee-starter-kit/README.md
[uncategorized] ~85-~85: You might be missing the article “the” here.
Context: ... the following environment variables to .env
file: ``` GRAPHISTRY_USERNAME="" G...
(AI_EN_LECTOR_MISSING_DETERMINER_THE)
[uncategorized] ~90-~90: You might be missing the article “an” here.
Context: ...D="" ``` Note: GRAPHISTRY_PASSWORD
is API key. # What will you build with cogne...
(AI_EN_LECTOR_MISSING_DETERMINER_AN)
examples/data/car_and_tech_companies.txt
[duplication] ~2-~2: Possible typo: you repeated a word.
Context: text_1 = """ 1. Audi Audi is known for its modern designs and adv...
(ENGLISH_WORD_REPEAT_RULE)
[duplication] ~5-~5: Possible typo: you repeated a word.
Context: ...ns to high-performance sports cars. 2. BMW BMW, short for Bayerische Motoren Werke, is...
(ENGLISH_WORD_REPEAT_RULE)
[style] ~6-~6: Consider using a more concise synonym.
Context: ... reflects that commitment. BMW produces a variety of cars that combine luxury with sporty pe...
(A_VARIETY_OF)
[duplication] ~8-~8: Possible typo: you repeated a word.
Context: ...ine luxury with sporty performance. 3. Mercedes-Benz Mercedes-Benz is synonymous with luxury and quality. ...
(ENGLISH_WORD_REPEAT_RULE)
[duplication] ~11-~11: Possible typo: you repeated a word.
Context: ... catering to a wide range of needs. 4. Porsche Porsche is a name that stands for high-performa...
(ENGLISH_WORD_REPEAT_RULE)
[duplication] ~14-~14: Possible typo: you repeated a word.
Context: ...o value both performance and style. 5. Volkswagen Volkswagen, which means "people's car" in German, ...
(ENGLISH_WORD_REPEAT_RULE)
[grammar] ~17-~17: The plural determiner ‘these’ does not agree with the singular noun ‘car’.
Context: ...nce practicality with quality. Each of these car manufacturer contributes to Germany's r...
(THIS_NNS)
[uncategorized] ~17-~17: The grammatical number of this noun doesn’t look right. Consider replacing it.
Context: ...cality with quality. Each of these car manufacturer contributes to Germany's reputation as ...
(AI_EN_LECTOR_REPLACEMENT_NOUN_NUMBER)
[duplication] ~21-~21: Possible typo: you repeated a word.
Context: ...design excellence. """ text_2 = """ 1. Apple Apple is renowned for its innovative consumer...
(ENGLISH_WORD_REPEAT_RULE)
[duplication] ~27-~27: Possible typo: you repeated a word.
Context: ... in shaping the internet landscape. 3. Microsoft Microsoft Corporation has been a dominant force i...
(ENGLISH_WORD_REPEAT_RULE)
[style] ~28-~28: Consider using a synonym to be more concise.
Context: ...n both business and personal computing. In recent years, Microsoft has expanded into cloud comp...
(IN_RECENT_STYLE)
[uncategorized] ~31-~31: You might be missing the article “the” here.
Context: ...or innovation continues to reshape both retail and technology sectors. 5. Meta Meta, ...
(AI_EN_LECTOR_MISSING_DETERMINER_THE)
[duplication] ~33-~33: Possible typo: you repeated a word.
Context: ...both retail and technology sectors. 5. Meta Meta, originally known as Facebook, revoluti...
(ENGLISH_WORD_REPEAT_RULE)
🪛 Biome (1.9.4)
cognee-frontend/src/modules/ingestion/DataView/DataView.tsx
[error] 36-36: Do not shadow the global "DataView" property.
Consider renaming this variable. It's easy to confuse the origin of variables when they're named after a known global.
(lint/suspicious/noShadowRestrictedNames)
🪛 markdownlint-cli2 (0.17.2)
cognee-starter-kit/README.md
17-17: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
21-21: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
28-28: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
43-43: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
51-51: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
59-59: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
67-67: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
75-75: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
84-84: Bare URL used
null
(MD034, no-bare-urls)
86-86: Fenced code blocks should have a language specified
null
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Pensar Scan
🔇 Additional comments (174)
CONTRIBUTING.md (1)
100-100
: Direct and clear phrasing is a great improvement
Removing “make sure to” tightens the instruction and makes the step more actionable without changing its meaning.
notebooks/cognee_llama_index.ipynb (2)
15-17
: LGTM! Standard Jupyter notebook formatting improvement.The conversion of the
source
field from a single string to a list containing that string follows standard Jupyter notebook JSON formatting conventions. This change improves notebook compatibility across different Jupyter environments without affecting code execution.
123-124
: LGTM! Proper metadata and outputs field formatting.The reordering and formatting of the
metadata
andoutputs
fields aligns with standard Jupyter notebook structure conventions. These formatting improvements enhance notebook consistency without impacting functionality.notebooks/llama_index_cognee_integration.ipynb (5)
6-8
: Approve formatting update: Convertedsource
field to list
The change converts single-stringsource
entries to lists (standard Jupyter JSON format). No functional impact.
62-64
: Approve formatting update: Code cellsource
field as list
The pip install line is now correctly wrapped in a list for Jupyter compatibility.
199-201
: Approve formatting update: Markdownsource
field as list
The "### Answer prompt based on RAG approach:" cell now uses a list for itssource
.
219-221
: Approve formatting update: Conclusionsource
field as list
The conclusion paragraph is correctly formatted as a list.
282-283
: Approve metadata update: Python kernel version bump
Updatedlanguage_info.version
to "3.12.9" for reproducibility and clarity.notebooks/cognee_simple_demo.ipynb (8)
13-19
: Update installation command to Cognee v0.1.39
The tutorial now installscognee==0.1.39
, which aligns with the latest features mentioned in the PR summary. Please verify that this version is published on PyPI and free of regressions.
32-41
: Standardize code cell metadata (execution_count & outputs)
Adding"execution_count": null
and an emptyoutputs
array ensures a reproducible, clean notebook state across environments.
53-60
: Standardize code cell metadata (execution_count & outputs)
Consistent metadata scaffolding helps avoid stale outputs when rendering or testing the notebook.
72-80
: Standardize code cell metadata (execution_count & outputs)
This cell’s metadata updates maintain consistency with other notebook examples in the repo.
92-98
: Standardize code cell metadata (execution_count & outputs)
Cleaning execution state metadata here will prevent accidental output leakage in CI or PR previews.
102-108
: Standardize code cell metadata (execution_count & outputs)
Empty outputs and null execution counts make this cell ready for fresh runs.
112-118
: Standardize code cell metadata (execution_count & outputs)
Maintains a clean slate for the personality query example.
130-143
: Standardize code cell metadata (execution_count & outputs)
Ensures the visualization cell has no precomputed outputs and a fresh execution count.notebooks/cognee_multimedia_demo.ipynb (2)
25-27
: Cleaned notebook metadata: Strippingexecution_count
and outputs keeps the notebook lean and version-control friendly.
34-43
: Verify example data directory presence: Make sure theexamples/data/multimedia
directory—and the filestext_to_speech.mp3
andexample.png
—are committed to the repo so this demo runs without errors.cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py (1)
72-72
: LGTM: Comment typography corrected.The straight apostrophe replacement improves text consistency and readability.
cognee/api/v1/add/add.py (1)
2-6
: LGTM! Import reordering improves organization.The import reordering follows a logical structure (core modules → user models → task-specific imports) without affecting functionality. This aligns well with the broader dataset lifecycle management improvements mentioned in the PR context.
README.md (2)
133-139
: Great addition of cognee UI section!The new UI section enhances the documentation by showcasing the visual interface capabilities. The image and documentation link provide users with clear guidance on accessing the UI features.
38-42
:✅ Verification successful
LGTM! Path corrections look good.
The updated paths from "community/" to "assets/community/" appear correct and align with the repository structure improvements.
🏁 Script executed:
#!/bin/bash # Verify the corrected README paths exist fd "README.pt.md|README.zh.md|README.ru.md" assets/community/Length of output: 153
Verified: Localized README links are correct
All three locale files (
README.pt.md
,README.zh.md
,README.ru.md
) exist underassets/community/
as intended. No further changes needed.Dockerfile_modal (2)
21-21
: Good optimization! Streamlined file copying.Including README.md in the initial COPY command is more efficient than separate copy operations and follows Docker best practices for layer optimization.
25-27
: LGTM! Poetry configuration improvements.The changes appropriately:
- Disable virtualenv creation (suitable for containers)
- Install only specific required extras (neo4j, qdrant)
- Follow modern Poetry best practices
This aligns well with the streamlined dependency management approach mentioned in the AI summary.
cognee/modules/retrieval/utils/brute_force_triplet_search.py (5)
4-6
: LGTM! Import additions support robust error handling.The new imports for
EntityNotFoundError
andCollectionNotFoundError
are properly sourced and necessary for the enhanced error handling implemented in this file.
146-153
: Excellent encapsulation of collection search with error handling.The
search_in_collection
helper function properly isolates the error handling for individual collections, returning empty results when collections don't exist rather than propagating errors. This aligns with the vector adapter patterns shown in the codebase.
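To make the pattern concrete, here is a minimal, self-contained sketch; the adapter API, function names, and the exception class below are stand-ins for illustration, not cognee's actual implementation.

```python
# Minimal sketch of a per-collection guard plus parallel fan-out, assuming a
# vector engine exposing an async `search` method; names are illustrative.
import asyncio
from typing import Any, List


class CollectionNotFoundError(Exception):
    """Stand-in for the real CollectionNotFoundError."""


async def search_in_collection(vector_engine: Any, collection_name: str, query: str, limit: int) -> List[Any]:
    try:
        return await vector_engine.search(collection_name, query_text=query, limit=limit)
    except CollectionNotFoundError:
        # A missing collection is treated as "no results" instead of failing the whole query.
        return []


async def search_all_collections(vector_engine: Any, collection_names: List[str], query: str, limit: int) -> List[Any]:
    # The guard keeps asyncio.gather usable: one absent collection no longer
    # aborts the parallel searches against the others.
    results = await asyncio.gather(
        *[search_in_collection(vector_engine, name, query, limit) for name in collection_names]
    )
    return [hit for group in results for hit in group]
```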
156-156
: Good use of the helper function while maintaining parallelism.The refactored
asyncio.gather
call properly utilizes the new helper function while preserving parallel execution across collections.
159-160
: Appropriate graceful handling of empty search results.Returning an empty list when no results are found across all collections is more user-friendly than raising exceptions for data absence scenarios.
173-174
: Good defensive programming with outer exception handling.The outer
CollectionNotFoundError
handling provides a safety net and maintains consistency with the graceful error handling approach throughout the function.Dockerfile (3)
1-11
: Well-designed multi-stage build approach.The switch to
uv
with a multi-stage build is excellent for optimizing image size and build performance. TheUV_LINK_MODE=copy
setting is appropriate for mounted volumes, and the disabled bytecode compilation comment suggests thoughtful consideration of build options.
30-44
: Excellent dependency management and caching strategy.The approach of installing dependencies first (lines 33-34) followed by source code copy and final installation (lines 42-44) optimizes Docker layer caching. The explicit extras specification is more maintainable than ARG-based approaches, and using
--frozen
ensures reproducible builds.
46-60
: Clean runtime stage configuration.The runtime stage properly copies the built application and sets up the environment for execution. The PATH configuration correctly includes the uv virtual environment, and the PYTHONPATH setting ensures proper Python module resolution.
entrypoint.sh (1)
16-16
: Appropriate removal of Poetry wrapper.The direct
alembic upgrade head
call aligns with the Dockerfile migration away from Poetry to uv-based dependency management.cognee/modules/visualization/cognee_network_visualization.py (1)
24-24
: Approve addition ofColumnValue
color mapping
The new"ColumnValue": "#13613a"
entry aligns with the recently addedColumnValue
node type. Ensure any visualization tests or documentation are updated to include this type so it renders correctly.cognee/modules/search/types/SearchType.py (1)
14-15
: Validate newSearchType
enum members dispatch
The additionsGRAPH_COMPLETION_COT
andGRAPH_COMPLETION_CONTEXT_EXTENSION
enable the new graph retrievers. Please verify that all dispatch logic inspecific_search
(and related registration points) handles these values, and update any documentation or type hints accordingly.cognee/modules/pipelines/models/PipelineRun.py (1)
9-9
: Ensure database enum is in sync with Alembic migration
The newDATASET_PROCESSING_INITIATED
status must match the PostgreSQL enum via your migration. Please confirm that the Alembic revision adding this value has been applied and that existing pipeline run records handle the new status gracefully.examples/python/graphiti_example.py (3)
17-17
: Good addition of user context support.The import of
get_default_user
aligns with the broader codebase refactoring to explicitly handle user context in pipeline operations.
31-32
: Proper user initialization.Correctly initializing the default user before passing it to pipeline operations. This ensures proper user context throughout the example execution.
41-41
: Appropriate user context propagation.Passing the user object to
run_tasks
is consistent with the codebase refactoring to include user context in pipeline execution.cognee/tasks/ingestion/ingest_data.py (1)
107-107
: Correct update to match new create_dataset signature.The change from passing
user.id
to passing the fulluser
object aligns with the refactoredcreate_dataset
function incognee/modules/data/methods/create_dataset.py
which now accepts aUser
object and internally extracts the owner ID.cognee/modules/data/methods/__init__.py (1)
10-10
: Import is required for module's public API.The static analysis tool flagged this as unused, but this is a false positive. The
get_unique_dataset_id
function is part of the dataset management refactoring and is used bycreate_dataset
as mentioned in the AI summary. Including it in the module's public API is appropriate.🧰 Tools
🪛 Ruff (0.11.9)
10-10:
.get_unique_dataset_id.get_unique_dataset_id
imported but unused; consider removing, adding to__all__
, or using a redundant alias(F401)
cognee/tests/unit/modules/retrieval/summaries_retriever_test.py (1)
130-130
: Correct parameter name update.The change from
limit=20
totop_k=20
properly aligns with theSummariesRetriever
constructor signature shown incognee/modules/retrieval/summaries_retriever.py
(lines 12-14), which accepts atop_k
parameter.cognee/infrastructure/llm/prompts/cot_followup_system_prompt.txt (1)
1-3
: Well-structured prompt for chain-of-thought follow-up functionality.The prompt is clear, concise, and provides specific instructions for generating focused clarifying questions. The formatting guidelines ensure consistent output.
cognee/api/v1/delete/routers/get_delete_router.py (1)
61-62
:✅ Verification successful
Verify cognee_delete function compatibility with file.file input.
The change to pass
file.file
instead of the entireUploadFile
object is correct from a FastAPI perspective, as it provides access to the underlying file object. However, ensure thecognee_delete
function can properly handle this input type.
🏁 Script executed:
#!/bin/bash # Description: Verify cognee_delete function signature and usage # Expected: Function should accept file-like objects and handle them properly echo "Checking cognee_delete function signature..." ast-grep --pattern 'def cognee_delete($$$):' echo "Checking other usages of cognee_delete to ensure consistency..." rg -A 3 -B 3 "cognee_delete\(" cognee/api/v1/delete/ echo "Checking imports and function definition..." rg -A 10 -B 5 "from cognee.api.v1.delete import delete as cognee_delete"Length of output: 3387
🏁 Script executed:
#!/bin/bash # Locate the actual definition of the `delete` function aliased as `cognee_delete` echo "Searching for the delete function signature in cognee/api/v1/delete..." rg -n "def delete" -R cognee/api/v1/deleteLength of output: 600
🏁 Script executed:
#!/bin/bash # Correctly search for the delete function definition under cognee/api/v1/delete echo "Looking for delete function signature in cognee/api/v1/delete directory..." rg -n "def delete" cognee/api/v1/deleteLength of output: 563
🏁 Script executed:
#!/bin/bash # Retrieve the signature and docstring of the delete function to confirm accepted types echo "Fetching delete function signature and docstring..." rg -n "async def delete" -A5 -B5 cognee/api/v1/delete/delete.pyLength of output: 1609
Compatibility Verified: The
cognee_delete
function acceptsBinaryIO
inputs andfile.file
(aSpooledTemporaryFile
) satisfies that interface. No changes needed—this update is correct.cognee/tests/test_weaviate.py (1)
85-85
: LGTM! Correct async/await usage.The addition of
await
properly handles the asynchronouslist_all()
method call, aligning with the broader refactoring of the Weaviate adapter to use fully asynchronous operations.cognee/modules/settings/get_settings.py (2)
2-2
: LGTM! Proper import addition.Correctly imports
Optional
to support the type annotations changes below.
24-25
: LGTM! Improved configuration flexibility.Making
endpoint
andapi_version
optional enhances configuration flexibility, allowing LLM providers that don't require these fields to work seamlessly. This aligns well with the broader system enhancements for adaptable LLM provider configurations.cognee/tests/unit/modules/retrieval/graph_completion_retriever_test.py (1)
157-162
: LGTM! Improved async test execution.Consolidating the individual test method calls into a single
main()
function that sequentially awaits each test is a better practice. This ensures proper execution within one event loop instead of multiple separate runs, preventing potential async execution issues.cognee/infrastructure/llm/generic_llm_api/adapter.py (1)
43-43
: LGTM! Prompt simplification is consistent with other adapters.The removal of the explicit instruction prefix aligns with similar changes made in other LLM adapters as mentioned in the summary. This simplification maintains the core functionality while streamlining the prompt structure.
cognee/api/v1/responses/routers/__init__.py (1)
1-3
: LGTM! Proper module initialization pattern.The import and export structure follows Python best practices for package organization. The
__all__
declaration properly controls the public API of the routers subpackage.cognee/api/client.py (2)
17-17
: LGTM! Import follows established pattern.The import statement correctly follows the established pattern for importing routers in this file and is placed in the appropriate location with other router imports.
171-172
: LGTM! Router registration follows established conventions.The router registration correctly follows the established pattern used for other API endpoints in this file, with appropriate prefix and tags.
cognee-frontend/src/app/wizard/ExploreStep/ExploreStep.tsx (2)
5-5
: LGTM! Type change aligns with broader refactoring.The dataset type change from
{id: string}
to{name: string}
is consistent with the broader frontend refactoring described in the AI summary and aligns with similar changes in other components.
11-11
: LGTM! Simplified prop passing.The removal of the non-null assertion operator (
!
) when passing the dataset prop suggests improved type safety or guaranteed non-null values, which is a positive change.cognee/infrastructure/llm/ollama/adapter.py (1)
42-42
: LGTM! Prompt simplification aligns with standardization effort.The removal of the explicit instruction prefix and sending raw
text_input
directly aligns with the standardization effort across different LLM adapters mentioned in the AI summary. This simplification maintains functionality while creating consistency across the codebase.cognee/api/v1/responses/__init__.py (1)
1-3
: LGTM! Standard package initialization following best practices.This is a well-structured package initialization file that follows Python best practices:
- Correct import path for the router function
- Proper use of
__all__
to control public exports- Clean and simple design that facilitates convenient imports
cognee-frontend/src/ui/Partials/Explorer/Explorer.tsx (1)
10-10
:✅ Verification successful
Verify breaking change impact across the codebase.
The change from
dataset: { id: string }
todataset: { name: string }
is a breaking change that affects the component contract. While the AI summary indicates coordinated updates across related components, this needs thorough verification.Run the following script to ensure all usages of the Explorer component have been updated:
🏁 Script executed:
#!/bin/bash # Description: Find all usages of Explorer component to verify they pass the correct dataset prop structure # Search for Explorer component imports and usage echo "=== Explorer component imports ===" rg -A 3 -B 1 "import.*Explorer" echo -e "\n=== Explorer component usage ===" rg -A 5 -B 2 "<Explorer.*dataset" echo -e "\n=== Dataset prop with id property ===" rg -A 2 -B 2 'dataset.*id.*string'Length of output: 6116
🏁 Script executed:
#!/bin/bash # Inspect ExplorerProps and dataset usage in Explorer.tsx echo "=== interface ExplorerProps and nearby lines ===" rg -n "interface ExplorerProps" -A10 -B2 cognee-frontend/src/ui/Partials/Explorer/Explorer.tsx || true echo -e "\n=== getExplorationGraphUrl usage ===" rg -n "getExplorationGraphUrl" -A5 -B2 cognee-frontend/src/ui/Partials/Explorer/Explorer.tsx || true echo -e "\n=== dataset usages within Explorer.tsx ===" rg -n "dataset\." -A5 -B2 cognee-frontend/src/ui/Partials/Explorer/Explorer.tsx || trueLength of output: 1548
#!/bin/bash # Locate and inspect the definition of getExplorationGraphUrl echo "=== Searching for getExplorationGraphUrl definition ===" rg -n "export function getExplorationGraphUrl" -C 5 cognee-frontend/src/modules/exploration || true echo -e "\n=== Fallback: any function getExplorationGraphUrl ===" rg -n "function getExplorationGraphUrl" -C 5 cognee-frontend/src/modules/exploration || true
#!/bin/bash # Locate the file defining getExplorationGraphUrl echo "=== File(s) containing getExplorationGraphUrl ===" rg -l "getExplorationGraphUrl" -R cognee-frontend/src/modules/exploration || true echo -e "\n=== Inspect definition of getExplorationGraphUrl ===" FILE=$(rg -l "getExplorationGraphUrl" -R cognee-frontend/src/modules/exploration | head -n1) if [ -n "$FILE" ]; then rg -n "getExplorationGraphUrl" -C5 "$FILE" || true fi
🏁 Script executed:
#!/bin/bash # Search for definitions or exports of getExplorationGraphUrl across the repository echo "=== getExplorationGraphUrl references ===" rg -n "getExplorationGraphUrl" -C5Length of output: 3187
#!/bin/bash # Verify the dataset prop type in ExploreStep component echo "=== ExploreStepProps definition ===" rg -n "interface ExploreStepProps" -A5 -B2 cognee-frontend/src/app/wizard/ExploreStep/ExploreStep.tsx || true echo -e "\n=== dataset usage in ExploreStepProps ===" rg -n "dataset" -A2 -B0 cognee-frontend/src/app/wizard/ExploreStep/ExploreStep.tsx || true
#!/bin/bash # Inspect ExploreStepProps and dataset usage in ExploreStep component echo "=== ExploreStepProps definition ===" rg -n "interface ExploreStepProps" -A5 -B2 cognee-frontend/src/app/wizard/ExploreStep/ExploreStep.tsx || true echo -e "\n=== dataset prop in ExploreStepProps ===" rg -n "dataset" -A2 -B2 cognee-frontend/src/app/wizard/ExploreStep/ExploreStep.tsx || true
🏁 Script executed:
#!/bin/bash # Verify the dataset type in ExploreStepProps for ExploreStep component echo "=== ExploreStepProps definition ===" rg -n "interface ExploreStepProps" -A5 -B2 cognee-frontend/src/app/wizard/ExploreStep/ExploreStep.tsx || true echo -e "\n=== dataset property in ExploreStepProps ===" rg -n "dataset" -A3 -B0 cognee-frontend/src/app/wizard/ExploreStep/ExploreStep.tsx || trueLength of output: 786
No breaking change detected—all Explorer usages supply a
name
field
All instances of<Explorer>
now receive objects containing aname
property (e.g.{ id, name }
in DatasetsView and{ name }
in ExploreStep), which satisfies the updatedExplorerProps
signature. No further updates are needed.cognee/modules/retrieval/graph_completion_retriever.py (1)
78-78
:✅ Verification successful
Verify that EntityNotFoundError is handled upstream.
The removal of EntityNotFoundError handling from this method could introduce unhandled exceptions if the error can still be raised by
get_triplets()
or its dependencies.Run the following script to verify error handling has been properly centralized:
🏁 Script executed:
#!/bin/bash # Description: Check if EntityNotFoundError handling exists in get_triplets or brute_force_triplet_search echo "=== Checking for EntityNotFoundError in brute_force_triplet_search ===" ast-grep --pattern $'try: $$$ except EntityNotFoundError: $$$' echo -e "\n=== Checking if EntityNotFoundError can still be raised ===" rg -A 3 -B 3 "EntityNotFoundError" echo -e "\n=== Checking get_triplets method implementation ===" ast-grep --pattern $'async def get_triplets($$$): $$$'Length of output: 18573
I’ll locate and inspect the
graph_completion_retriever.py
file and allget_triplets
references to confirm where it’s defined and how errors are handled upstream.
🏁 Script executed:
#!/bin/bash echo "Finding graph_completion_retriever.py path..." find . -type f -name 'graph_completion_retriever.py' echo "Displaying graph_completion_retriever.py (first 300 lines)..." FILE=$(find . -type f -name 'graph_completion_retriever.py') sed -n '1,300p' "$FILE" echo "Searching for get_triplets usage and definition..." rg -n "get_triplets"Length of output: 6337
EntityNotFoundError is safely handled in brute_force_triplet_search
Thebrute_force_triplet_search
utility (cognee/modules/retrieval/utils/brute_force_triplet_search.py lines 66–73) already catches and suppressesEntityNotFoundError
. As a result, neitherget_triplets
nor its callers will propagate that exception, and no additional try/except is needed here.cognee/infrastructure/databases/graph/graph_db_interface.py (1)
61-61
:❓ Verification inconclusive
Reconsider downgrading database error logging severity.
Changing database error logs from
logger.error
tologger.debug
significantly reduces their visibility. Database operations failing (adding relationships, committing sessions) are typically serious issues that should remain at error level for production monitoring and debugging.Unless these errors are truly expected and recoverable, consider keeping them at error level:
- logger.debug(f"Error adding relationship: {e}") + logger.error(f"Error adding relationship: {e}")- logger.debug(f"Error adding relationship: {e}") + logger.error(f"Error adding relationship: {e}")- logger.debug(f"Error committing session: {e}") + logger.error(f"Error committing session: {e}")What was the rationale for downgrading these database errors to debug level? Are these errors expected to occur frequently in normal operation?
Also applies to: 81-81, 88-88
Confirm intent behind lowering database error logs to debug level
I noticed that in
- cognee/infrastructure/databases/graph/graph_db_interface.py
• Line 61:logger.error
→logger.debug
when adding node relationships
• Line 81:logger.error
→logger.debug
when adding edge relationships
• Line 88:logger.error
→logger.debug
when committing the sessionDatabase operation failures are typically critical and warrant error-level logging for production monitoring and post-mortem analysis. If these exceptions are truly expected and safely recoverable in normal operation, please clarify the rationale. Otherwise, consider reverting to:
- logger.debug(f"Error adding relationship: {e}") + logger.error(f"Error adding relationship: {e}")- logger.debug(f"Error committing session: {e}") + logger.error(f"Error committing session: {e}")Can you confirm whether these database errors should remain at debug level?
cognee/__init__.py (1)
1-5
: Well-implemented version management approach.The integration of dynamic version detection is well done. The placement at the top of the file properly avoids circular imports, and the approach supports both development (from
pyproject.toml
) and installed package scenarios as shown in theget_cognee_version
function.The comment clearly explains the positioning rationale, which will help future maintainers.
cognee/infrastructure/llm/gemini/adapter.py (2)
2-3
: LGTM! Good import organization.The reorganization of imports with pydantic and typing imports moved up follows Python conventions and improves readability.
6-7
: Excellent observability standardization.The replacement of conditional observe import with the standardized
get_observe()
approach improves consistency across LLM adapters and centralizes observability configuration.Also applies to: 17-17
cognee/modules/observability/observers.py (1)
4-9
: Well-implemented Observer enum.The enum design follows Python best practices by inheriting from both
str
andEnum
, allowing for easy string comparisons while maintaining type safety. The monitoring tool options are clearly defined and appropriately named.cognee/tests/test_neo4j.py (1)
90-90
: Verify the search history count calculation.The assertion expects 6 entries after commenting out the natural language search. Ensure this count matches the actual number of search operations performed above.
The count appears correct based on the 3 search operations (INSIGHTS, CHUNKS, SUMMARIES) that remain active, but consider making this count dynamic if more tests are added/removed.
cognee/api/v1/cognify/cognify.py (1)
37-39
: Good improvement for pipeline management.Adding the explicit
pipeline_name="cognify_pipeline"
parameter enhances pipeline tracking and aligns with the broader pipeline management improvements mentioned in the AI summary.This change improves observability and allows for better pipeline status tracking without breaking existing functionality.
cognee-frontend/src/modules/exploration/getExplorationGraphUrl.ts (1)
3-3
:❓ Verification inconclusive
Verify the dataset parameter usage.
The function parameter was changed from
{ id: string }
to{ name: string }
, but thedataset
parameter is not used anywhere in the function body. This raises several concerns:
- Unused parameter: The function fetches from a generic
/v1/visualize
endpoint without passing any dataset-specific information.- Potential API mismatch: The endpoint might expect dataset identification but isn't receiving it.
- Incomplete refactoring: This change might be part of incomplete dataset identification standardization.
Please verify the intended behavior by checking:
- Does the
/v1/visualize
endpoint expect dataset parameters?- Should the dataset name be included in the request URL or body?
🏁 Script executed:
#!/bin/bash # Search for other usages of getExplorationGraphUrl to understand expected behavior rg -A 5 -B 5 "getExplorationGraphUrl" # Check the backend visualize endpoint implementation rg -A 10 "\/v1\/visualize"Length of output: 3133
Confirm dataset usage in getExplorationGraphUrl
The
dataset
parameter ({ name: string }
) is declared but never used. The call to
fetch('/v1/visualize')
does not include any dataset-specific information, which suggests an incomplete refactoring or a mismatch with the API.Please verify:
- Does the
/v1/visualize
endpoint require a dataset identifier (e.g.name
orid
)?- If so, should
dataset.name
be included as a query parameter or in the request body?File:
cognee-frontend/src/modules/exploration/getExplorationGraphUrl.ts
cognee/infrastructure/llm/anthropic/adapter.py (1)
20-22
: Excellent async client upgrade!The change from
anthropic.Anthropic()
toanthropic.AsyncAnthropic()
is a great improvement that:
- Aligns with the async method signatures (
acreate_structured_output
)- Improves performance through asynchronous operations
- Maintains consistency with other LLM adapters in the codebase
- Properly integrates with the instructor.patch wrapper
cognee/api/v1/datasets/datasets.py (1)
31-31
: Good alignment with pipeline status API changes.Adding the explicit
pipeline_name="cognify_pipeline"
parameter is a positive change that:
- Aligns with the updated
get_pipeline_status
function signature (as shown in the relevant code snippets)- Makes the status query more specific and targeted to the cognify pipeline
- Improves code clarity by being explicit about which pipeline status is being retrieved
- Maintains consistency with pipeline naming conventions used throughout the system
cognee/infrastructure/llm/prompts/cot_followup_user_prompt.txt (1)
1-14
: Well-structured prompt template for chain-of-thought reasoning.The prompt effectively guides the model to generate targeted follow-up questions by:
- Providing clear context about knowledge graph exploration
- Using structured tags for different input components
- Emphasizing the need for exactly one question to resolve validation gaps
cognee/infrastructure/llm/prompts/cot_validation_user_prompt.txt (1)
1-11
: Clean and well-structured validation prompt template.The template provides a clear, organized structure for chain-of-thought validation with:
- Consistent XML-style tag formatting
- Logical separation of question, answer, and context
- Simple and effective layout for LLM processing
cognee/modules/pipelines/operations/get_pipeline_status.py (1)
8-8
: LGTM! Enhanced pipeline status filtering.The addition of the
pipeline_name
parameter provides better specificity for pipeline status queries. This change aligns well with the broader pipeline improvements mentioned in the PR and correctly utilizes thepipeline_name
column from thePipelineRun
model.Also applies to: 23-23
cognee/base_config.py (1)
5-5
: LGTM! Clean refactoring from MonitoringTool to Observer.The transition from
MonitoringTool
toObserver
maintains the same functionality while using the updated enum structure. TheObserver.LANGFUSE
value is properly defined in the new enum, ensuring consistency across the codebase.Also applies to: 11-11
cognee/tests/test_relational_db_migration.py (2)
115-118
: LGTM! Updated assertions for enhanced migration granularity.The increased expected counts (12 distinct nodes and 15 edges) reflect the enhanced migration logic that creates more granular graph structures, including individual column data as separate nodes.
161-162
:❓ Verification inconclusive
Verify the updated node and edge count expectations.
The substantial increases in expected counts (SQLite: 543 nodes/1317 edges, PostgreSQL: 522 nodes/961 edges) suggest significant changes to the migration process. While this aligns with the enhanced migration logic mentioned in the AI summary, please ensure these specific values are accurate.
Run the following script to validate the current migration output matches these expectations:
Also applies to: 192-193
🏁 Script executed:
#!/bin/bash # Description: Verify that the migration process produces the expected node/edge counts # Search for the migration function to understand the new logic ast-grep --pattern 'async def migrate_relational_database($$$) { $$$ }' # Look for ColumnValue model references to confirm the enhanced granularity rg -A 5 "ColumnValue" --type py # Check for any recent changes to migration logic rg -A 10 -B 5 "migrate_column_data" --type pyLength of output: 5530
Confirm the migration produces 543 nodes and 1317 edges
We’ve identified the updated migration logic (includingColumnValue
nodes), but haven’t yet validated that it yields exactly 543 nodes/1317 edges (and 522/961 for PostgreSQL). Please run the migration test against your representative SQLite and PostgreSQL datasets and adjust the assertions incognee/tests/test_relational_db_migration.py
(lines 161–162 and 192–193) if the actual counts differ.• Run
pytest cognee/tests/test_relational_db_migration.py
with your sample databases
• Verify the node/edge counts reported by the migration
• Update the hard-coded expected values to match realitynotebooks/cognee_demo.ipynb (2)
470-470
: LGTM! Import correction aligns with module structure.The import statement correctly references the task module with lowercase naming, which aligns with the broader codebase refactoring mentioned in the summary.
508-508
: LGTM! User context propagation implemented correctly.The explicit
user
argument in therun_tasks
call correctly aligns with the API changes that now require user context for pipeline execution.cognee-frontend/src/app/page.tsx (2)
46-46
: LGTM: Clear user guidance added.The updated notification message effectively guides users to run "Cognify" when ready, improving the user experience flow.
106-106
: LGTM: Proper prop integration.The
onCognify
prop is correctly passed to the DataView component, enabling the cognification trigger functionality.cognee/infrastructure/llm/openai/adapter.py (2)
18-20
: LGTM: Standardized observability integration.The import cleanup and direct use of
get_observe()
aligns with the observability framework standardization mentioned in the AI summary.
66-66
: Verify that simplified prompts maintain structured output quality.The removal of the instruction prefix "Use the given format to extract information from the following input:" from user messages may impact the LLM's ability to generate properly structured outputs, especially for complex extraction tasks.
Please test that structured output generation still works correctly with the simplified prompts, particularly for complex data extraction scenarios.
Also applies to: 93-93
cognee/modules/engine/models/ColumnValue.py (1)
1-10
: LGTM: Well-structured model class.The
ColumnValue
class correctly inherits fromDataPoint
and follows the established pattern with:
- Proper field typing for
name
,description
, andproperties
- Appropriate metadata configuration specifying
properties
as an index field- Clean integration with the existing engine model infrastructure
cognee-frontend/src/ui/Partials/SearchView/SearchView.tsx (2)
81-81
: LGTM: Improved user experience with immediate input clearing.Moving the input clearing to happen immediately after adding the user message improves responsiveness and user experience, rather than waiting for the fetch response.
108-110
: LGTM: Good error handling for input recovery.The error handling correctly restores the input value if the fetch request fails, preventing loss of user input and improving the overall user experience.
cognee/infrastructure/databases/graph/get_graph_engine.py (1)
70-81
: LGTM! Memgraph integration follows established patterns.The Memgraph integration is well-implemented and follows the same patterns as other database providers:
- Proper credential validation with descriptive error messages
- Consistent import and instantiation approach
- Matches the structure used for Neo4j, FalkorDB, and Kuzu
The implementation correctly validates required credentials and provides clear error messaging when they're missing.
cognee/infrastructure/databases/vector/exceptions/exceptions.py (2)
9-9
: Good fix for the incorrect default name parameter.The default
name
parameter was incorrectly set to"DatabaseNotCreatedError"
instead of"CollectionNotFoundError"
. This fix ensures the error name correctly reflects the exception type.
11-14
: Enhancement: Added configurable logging parameters.The addition of
log
andlog_level
parameters provides better control over error logging behavior. This allows callers to customize whether and at what level collection-not-found errors should be logged, which is particularly useful for scenarios where missing collections might be expected behavior.notebooks/cognee_graphiti_demo.ipynb (5)
18-18
: Good fix for the typo.The correction from "libraires" to "libraries" improves readability.
131-132
: Simplified prune calls are cleaner.The removal of commented logging setup and direct async calls improve code readability.
135-136
: Good addition of user context initialization.Adding explicit user context aligns with the broader codebase pattern of associating user information with pipeline operations.
143-143
: Pipeline execution improvements look good.The cleaned up task list formatting and addition of
user=user
parameter torun_tasks
properly integrates user context into pipeline execution.Also applies to: 145-145
148-148
: Simplified print statement is cleaner.Direct printing of the result instead of formatted message reduces unnecessary complexity.
cognee/modules/pipelines/operations/log_pipeline_run_initiated.py (1)
6-14
: Function signature and PipelineRun creation look correct.The function properly accepts the required parameters and creates a PipelineRun instance with appropriate fields including the new status enum value.
alembic/versions/1d0bb7fede17_add_pipeline_run_status.py (2)
23-29
: Migration logic is correct and database-safe.The use of
ADD VALUE IF NOT EXISTS
prevents errors if the enum value already exists, and limiting to PostgreSQL dialect is appropriate.
32-33
: Empty downgrade function is acceptable for enum additions.Enum value removal in PostgreSQL is complex and often not necessary for this type of change.
cognee/modules/search/methods/search.py (3)
14-17
: New retriever imports follow proper naming conventions.The imports for
GraphCompletionCotRetriever
andGraphCompletionContextExtensionRetriever
are properly structured and follow the existing pattern.
26-26
: Import path correction is good.Fixing the import to use absolute paths from
cognee.modules.search.operations
improves import clarity.
77-84
: New search type integrations are properly implemented.The new search types
GRAPH_COMPLETION_COT
andGRAPH_COMPLETION_CONTEXT_EXTENSION
are correctly integrated into the search_tasks dictionary with appropriate parameters.cognee/eval_framework/answer_generation/answer_generation_executor.py (2)
3-6
: LGTM! New retriever imports added correctly.The new retriever imports are properly formatted and follow the existing import pattern.
17-18
: LGTM! New retriever options added correctly.The new retriever entries follow the existing naming convention and are properly mapped to their respective classes.
cognee-frontend/src/modules/ingestion/DataView/DataView.tsx (4)
9-10
: LGTM! New imports added correctly.The CTAButton and useBoolean imports are properly added to support the new cognify functionality.
33-33
: LGTM! OnCognify prop added correctly.The new prop follows TypeScript conventions and properly defines the expected async function signature.
60-70
: LGTM! Cognify state management implemented correctly.The useBoolean hook provides clean state management for the disabled state, and the handleCognify function properly handles the async operation with loading state.
80-84
: LGTM! Cognify button implementation is correct.The CTAButton is properly configured with the disabled state and onClick handler, providing good UX during the cognify operation.
cognee/shared/logging_utils.py (3)
43-48
: LGTM! System information constants added appropriately.The version and OS information constants provide useful debugging context for log analysis.
326-326
: Verify the condition change for SQLAlchemy warning suppression.The condition changed from
log_level > logging.WARNING
tolog_level > logging.DEBUG
, which broadens when SQLAlchemy warnings are suppressed. This means warnings will now be suppressed at INFO level and above, not just ERROR and CRITICAL.Please confirm this change is intentional. The new condition will suppress SQLAlchemy warnings for most common log levels (INFO, WARNING, ERROR, CRITICAL), only showing them when DEBUG level is set.
340-351
: LGTM! System information logging enhances debugging capability.The addition of system information logging provides valuable context for troubleshooting and debugging. The implementation correctly logs Python version, structlog version, cognee version, and OS information.
cognee-starter-kit/src/pipelines/custom-model.py (5)
8-27
: LGTM! Proper directory setup and configuration.The directory setup correctly uses pathlib for cross-platform compatibility and properly configures both data and system directories relative to the script location.
25-27
: LGTM! Clean state management with pruning.The pruning operations ensure a fresh state for the example, which is appropriate for a starter pipeline that should be repeatable.
36-52
: LGTM! Well-designed custom data models. The custom DataPoint classes are properly structured (see the sketch after this list) with:
- Clear inheritance from DataPoint
- Appropriate type annotations
- Metadata configuration for indexing
- Logical relationships between entities
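A compact sketch in that spirit follows; the class names and fields are illustrative, not the starter kit's exact models.

```python
# Illustrative custom DataPoint models (not the starter kit's exact code); the
# DataPoint import path mirrors the snippet list referenced in this review.
from typing import List

from cognee.infrastructure.engine.models.DataPoint import DataPoint


class Person(DataPoint):
    name: str
    metadata: dict = {"index_fields": ["name"]}  # field cognee should embed/index


class Department(DataPoint):
    name: str
    employees: List[Person]  # relationship expressed as a typed field
    metadata: dict = {"index_fields": ["name"]}


class Company(DataPoint):
    name: str
    departments: List[Department]
    metadata: dict = {"index_fields": ["name"]}
```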
70-88
: LGTM! Comprehensive search examples.The script demonstrates all major search types (graph completion, RAG completion, summaries, chunks) with clear output formatting, providing excellent learning examples.
91-92
: LGTM! Proper async main pattern.The script follows Python best practices for async execution with the standard
if __name__ == "__main__":
guard andasyncio.run()
.cognee/tests/unit/modules/retrieval/chunks_retriever_test.py (2)
19-19
: Good improvement for test isolation.The updated directory paths with more specific test context names (e.g.,
test_chunks_context_simple
,test_chunk_context_complex
,test_chunk_context_empty
) enhance test isolation by ensuring each test uses its own dedicated storage directories. This prevents potential interference between test runs.Also applies to: 23-23, 76-76, 80-80, 165-165, 169-169
193-198
: Cleaner async test execution pattern.The consolidation of test execution into a single
main()
coroutine that sequentially awaits all test methods is a cleaner approach compared to multiple separaterun()
calls. This ensures proper sequential execution and better error handling.examples/database_examples/chromadb_example.py (1)
8-87
: Excellent example structure and documentation.This example follows a clear, well-documented pattern:
- Configuration setup for ChromaDB
- Directory configuration
- Data cleanup and setup
- Data ingestion and processing
- Multiple search demonstrations
The inline comments and docstring provide good guidance for users. The example effectively demonstrates Cognee's integration with ChromaDB across different search types.
cognee/exceptions/exceptions.py (1)
15-16
: Good enhancement for configurable logging.Adding
log
andlog_level
parameters provides valuable flexibility for controlling error logging behavior across the application.examples/data/car_and_tech_companies.txt (1)
1-37
: Good sample data for testing and examples.The content provides comprehensive descriptions of major companies in automotive and technology sectors, making it suitable for testing knowledge extraction and search functionality in Cognee examples.
🧰 Tools
🪛 LanguageTool
[duplication] ~2-~2: Possible typo: you repeated a word.
Context: text_1 = """ 1. Audi Audi is known for its modern designs and adv...(ENGLISH_WORD_REPEAT_RULE)
[duplication] ~5-~5: Possible typo: you repeated a word.
Context: ...ns to high-performance sports cars. 2. BMW BMW, short for Bayerische Motoren Werke, is...(ENGLISH_WORD_REPEAT_RULE)
[style] ~6-~6: Consider using a more concise synonym.
Context: ... reflects that commitment. BMW produces a variety of cars that combine luxury with sporty pe...(A_VARIETY_OF)
[duplication] ~8-~8: Possible typo: you repeated a word.
Context: ...ine luxury with sporty performance. 3. Mercedes-Benz Mercedes-Benz is synonymous with luxury and quality. ...(ENGLISH_WORD_REPEAT_RULE)
[duplication] ~11-~11: Possible typo: you repeated a word.
Context: ... catering to a wide range of needs. 4. Porsche Porsche is a name that stands for high-performa...(ENGLISH_WORD_REPEAT_RULE)
[duplication] ~14-~14: Possible typo: you repeated a word.
Context: ...o value both performance and style. 5. Volkswagen Volkswagen, which means "people's car" in German, ...(ENGLISH_WORD_REPEAT_RULE)
[grammar] ~17-~17: The plural determiner ‘these’ does not agree with the singular noun ‘car’.
Context: ...nce practicality with quality. Each of these car manufacturer contributes to Germany's r...(THIS_NNS)
[uncategorized] ~17-~17: The grammatical number of this noun doesn’t look right. Consider replacing it.
Context: ...cality with quality. Each of these car manufacturer contributes to Germany's reputation as ...(AI_EN_LECTOR_REPLACEMENT_NOUN_NUMBER)
[duplication] ~21-~21: Possible typo: you repeated a word.
Context: ...design excellence. """ text_2 = """ 1. Apple Apple is renowned for its innovative consumer...(ENGLISH_WORD_REPEAT_RULE)
[duplication] ~27-~27: Possible typo: you repeated a word.
Context: ... in shaping the internet landscape. 3. Microsoft Microsoft Corporation has been a dominant force i...(ENGLISH_WORD_REPEAT_RULE)
[style] ~28-~28: Consider using a synonym to be more concise.
Context: ...n both business and personal computing. In recent years, Microsoft has expanded into cloud comp...(IN_RECENT_STYLE)
[uncategorized] ~31-~31: You might be missing the article “the” here.
Context: ...or innovation continues to reshape both retail and technology sectors. 5. Meta Meta, ...(AI_EN_LECTOR_MISSING_DETERMINER_THE)
[duplication] ~33-~33: Possible typo: you repeated a word.
Context: ...both retail and technology sectors. 5. Meta Meta, originally known as Facebook, revoluti...(ENGLISH_WORD_REPEAT_RULE)
cognee/modules/data/methods/create_dataset.py (4)
7-8
: LGTM! Clean import additions for refactoring.The new imports properly support the function signature change and the extracted dataset ID generation logic.
11-11
: Excellent architectural improvement!Changing from primitive
owner_id
to the fullUser
object provides better type safety and richer context. This follows good OOP principles and makes the API more robust.
12-12
: Proper handling of the signature change.Correctly extracts the owner ID from the User object to maintain compatibility with existing database logic and relationships.
24-25
: Well-implemented separation of concerns.The async call to
get_unique_dataset_id
correctly extracts the UUID generation logic while maintaining the same functionality. The function parameters and async handling are properly implemented.notebooks/graphrag_vs_rag.ipynb (4)
56-56
: Appropriate version update.Updating to cognee 0.1.39 aligns with the API changes throughout the notebook and ensures compatibility with the new import paths and function signatures.
152-152
: Good API restructuring to versioned namespace.The import path change from internal modules to
cognee.api.v1.search
represents better API organization with proper versioning. This is a good practice for maintaining backward compatibility.
153-153
: Excellent API improvements with keyword arguments.The consistent use of keyword arguments (
query_type=
,query_text=
) across all search calls improves code clarity and maintainability. The enum name change fromCOMPLETION
toRAG_COMPLETION
is more descriptive and better communicates the search type's purpose.Also applies to: 173-173, 202-202
56-56
: Comprehensive and consistent API migration.All API changes are applied consistently throughout the notebook - package version, import paths, and function call patterns are all properly aligned. No mixed usage of old and new patterns detected.
Also applies to: 152-152, 153-153, 173-173, 202-202
cognee/tasks/temporal_awareness/index_graphiti_objects.py (2)
34-34
: Method name change looks correct.The update from
get_model_independent_graph_data()
toget_graph_data()
aligns with the API standardization mentioned in the summary.
36-40
:✅ Verification successful
Verify the new data structure format.
The iteration pattern change suggests
nodes_data
now returns tuples of(node_id, node_data)
instead of just node dictionaries. This is a significant change that should be verified across the codebase.
🏁 Script executed:
#!/bin/bash # Description: Search for other usages of get_graph_data or nodes_data iteration # Expected: All should use the new tuple format echo "Searching for other get_graph_data usages..." rg "get_graph_data|nodes_data" -A 3 -B 1Length of output: 14055
All graph_data consumers unpack nodes_data as (id, data) tuples
Verified that every call site ofget_graph_data
expects and correctly unpacks a list of(node_id, node_data)
tuples:
- cognee/tasks/temporal_awareness/index_graphiti_objects.py (L36–40)
- cognee/modules/visualization/cognee_network_visualization.py
- cognee/modules/graph/cognee_graph/CogneeGraph.py
No callers iterate over
nodes_data
as plain dictionaries. The new tuple format is applied consistently—no further changes needed.cognee/tests/unit/modules/pipelines/run_tasks_with_context_test.py (2)
14-22
: Task definitions look correct for context testing.The task definitions properly demonstrate different scenarios:
task_1
: Uses both input and contexttask_2
: Only uses input (no context parameter)task_3
: Uses both input and contextThis effectively tests the context propagation mechanism.
37-39
:✅ Verification successful
Verify the expected result calculation.
Let me trace through the calculation:
- Initial data: 5, context: 7
- task_1: 5 + 7 = 12
- task_2: 12 * 2 = 24
- task_3: 24 ** 7 = 4,586,471,424
The calculation appears correct.
🏁 Script executed:
#!/bin/bash # Description: Verify the mathematical calculation # Expected: 24^7 should equal 4586471424 python3 -c "print(f'24^7 = {24**7}')"Length of output: 63
Calculation Verified
The computation 24 ** 7 equals 4,586,471,424 as expected. No changes required.examples/database_examples/qdrant_example.py (1)
66-85
: Excellent demonstration of different search types.The example effectively demonstrates the various search capabilities with clear output formatting and proper async handling.
cognee/modules/pipelines/operations/run_tasks.py (3)
23-25
: Context parameter addition looks correct.The addition of the optional
context: dict = None
parameter maintains backward compatibility while enabling context propagation.
88-93
: Consider using keyword arguments for clarity.The function call uses keyword arguments for most parameters, which improves readability and reduces the risk of parameter ordering errors.
77-93
:✅ Verification successful
Verify parameter forwarding consistency.
The context parameter is properly added and forwarded to
run_tasks_with_telemetry
. The parameter positioning and default value are consistent.However, ensure that the parameter order in the function call matches the function signature:
🏁 Script executed:
#!/bin/bash # Description: Check if run_tasks_with_telemetry signature matches the call # Expected: Parameter order should be consistent echo "Checking run_tasks_with_telemetry function signature..." rg "async def run_tasks_with_telemetry" -A 2 echo -e "\nChecking function call..." rg "run_tasks_with_telemetry" -A 5 -B 1Length of output: 1682
Parameter forwarding verified: signature and call align
The
run_tasks_with_telemetry
function signature (tasks, data, user, pipeline_name, context=None
) matches the named arguments in the call. No further changes needed—approving as-is.cognee/tasks/ingestion/migrate_relational_database.py (3)
11-11
: LGTM! Import addition supports new functionality.The addition of
ColumnValue
to the imports correctly supports the new column data migration feature.
16-16
: Good backward compatibility approach.Adding the optional parameter
migrate_column_data
with a default value ofTrue
maintains backward compatibility while enabling the new functionality.
98-133
: Column data migration implementation looks solid.The new column data migration functionality is well-structured:
- Properly filters out primary keys and foreign keys to avoid duplication
- Creates appropriate ColumnValue nodes with meaningful IDs
- Establishes correct relationships between table rows and column values
- Uses consistent node ID generation patterns
The implementation integrates well with the existing migration logic.
cognee/api/v1/cognify/code_graph_pipeline.py (4)
5-6
: Good import organization and observability refactoring.The import reordering and move to centralized
get_observe()
improves code modularity and maintainability.
16-16
: Excellent addition for user-aware dataset handling.The import of
get_unique_dataset_id
supports the transition to user-aware dataset ID generation, which is a significant improvement over static UUIDs.
25-25
: Centralized observability management.Using
get_observe()
instead of conditional imports provides better consistency and maintainability across the codebase.
68-68
: Improved dataset ID generation.The change from static UUID generation to user-aware async method
get_unique_dataset_id("codebase", user)
enables better data isolation and user context handling.examples/database_examples/falkordb_example.py (1)
8-87
: Excellent comprehensive example for FalkorDB integration.This example script is well-structured and demonstrates:
- Proper FalkorDB configuration
- Clear data directory setup
- Complete workflow from data ingestion to search
- Multiple search types (insights, chunks, graph completion)
- Good documentation and comments
The implementation follows consistent patterns with other database examples and provides users with a ready-to-run reference.
examples/database_examples/kuzu_example.py (1)
8-85
: Excellent KuzuDB integration example.This example script provides a comprehensive demonstration of:
- Proper KuzuDB configuration
- Data directory management
- Complete cognify workflow
- Multiple search operations with different types
- Clear documentation and structure
The implementation is consistent with other database examples and serves as an excellent reference for users wanting to integrate with KuzuDB.
notebooks/cognee_openai_compatable_demo.ipynb (1)
1-109
: Excellent demonstration of the OpenAI-compatible API integration!This notebook effectively showcases the new OpenAI-compatible responses API with clear examples of both cognify and search operations. The execution outputs demonstrate successful function calls and proper API responses.
examples/database_examples/neo4j_example.py (2)
8-18
: Well-structured example with clear documentation.The function documentation clearly explains the workflow and the script follows the established pattern of other database examples in the repository.
67-86
: Consistent search pattern implementation.The three search operations demonstrate different SearchType values effectively and match the pattern used in other database examples.
cognee/tests/test_starter_pipelines.py (4)
7-9
: Clear test class with good documentation.The class name and docstring clearly indicate the purpose of testing pipeline execution.
10-27
: Excellent environment validation and setup.The setUp method properly validates required environment variables and skips tests when prerequisites aren't met, preventing false negatives.
28-51
: Robust pipeline execution with good error handling.The helper method includes proper timeout handling, virtual environment detection, and detailed error reporting.
52-66
: Comprehensive test coverage for all pipeline scripts.All three pipeline scripts are tested with consistent assertion patterns.
examples/database_examples/milvus_example.py (4)
8-18
: Clear documentation following established pattern.The function documentation is consistent with other database examples and clearly explains the workflow.
29-38
: Appropriate vector database configuration.The script correctly uses
set_vector_db_config
for Milvus (a vector database) instead ofset_graph_db_config
, and the local database path construction is suitable for an example.
47-52
: Relevant sample text for vector database demonstration.The sample text effectively describes Milvus features and capabilities, making it appropriate for testing vector search functionality.
60-81
: Consistent search pattern implementation.The three search operations match the pattern used in other database examples, demonstrating different SearchType capabilities effectively.
cognee/eval_framework/eval_config.py (4)
3-3
: LGTM!Good addition of the
Optional
import to support the newinstance_filter
field.
17-17
: LGTM!The expansion of QA engine options to include the new graph completion variants looks good and aligns well with the new retrievers mentioned in the AI summary.
44-44
: LGTM!The new
instance_filter
field is properly typed and provides useful filtering functionality for evaluations.
70-70
: LGTM!Good consistency in including the new
instance_filter
field in theto_dict
method output.cognee/api/v1/responses/default_tools.py (1)
1-63
: Well-structured tool definitions following OpenAI function calling standards.The DEFAULT_TOOLS constant is well-implemented with:
- Proper JSON schema structures for each tool
- Clear descriptions and parameter types
- Appropriate required/optional field definitions
- Good practice of commenting out the potentially dangerous
prune
functionThe search tool enum values and parameter structure look comprehensive for the various search types supported.
cognee/modules/retrieval/graph_completion_context_extension_retriever.py (1)
24-74
: Well-implemented context extension logic with proper early stopping.The
get_completion
method effectively implements iterative context extension with:
- Proper initialization and context handling when none is provided
- Smart early stopping when no new triplets are found (lines 54-58)
- Good logging for monitoring the extension process
- Deduplication of triplets using
set()
conversion (line 49)The implementation follows good async patterns and handles the iterative process cleanly.
examples/database_examples/pgvector_example.py (2)
8-99
: Excellent comprehensive example demonstrating PGVector integration.This example script effectively demonstrates:
- Proper configuration of PGVector and PostgreSQL settings
- Complete workflow from data addition to knowledge extraction
- Multiple search types (insights, chunks, graph completion)
- Good documentation and comments explaining each step
- Proper async/await usage throughout
The script serves as a clear reference for users wanting to integrate Cognee with PGVector.
8-18
: Inconsistency between PR objectives and actual changes.The PR objectives mention "add getting started tutorial to git" but the changes appear to be substantial enhancements to the Cognee codebase including new retrievers, API tools, and database examples. This seems inconsistent with the stated objectives.
Likely an incorrect or invalid review comment.
cognee/api/v1/responses/routers/default_tools.py (2)
1-37
: LGTM! Well-structured search tool definition.The search tool schema is well-defined with proper parameter types, descriptions, and validation. The enum values for search_type are comprehensive and the optional parameters are appropriately configured.
16-22
: Inconsistent with AI summary - missing search types.The AI summary mentions that "GRAPH_COMPLETION_COT" and "GRAPH_COMPLETION_CONTEXT_EXTENSION" were added to the search_type enum, but they're not present in the current code.
#!/bin/bash # Search for any references to the missing search types in the codebase rg -i "GRAPH_COMPLETION_COT|GRAPH_COMPLETION_CONTEXT_EXTENSION"Likely an incorrect or invalid review comment.
examples/database_examples/weaviate_example.py (2)
8-18
: Excellent documentation and structure.The function documentation clearly explains the workflow and the script is well-organized with clear step-by-step comments.
64-84
: Good variety of search examples.The search examples demonstrate different search types effectively and provide good coverage of the API capabilities.
cognee-starter-kit/README.md (1)
1-13
: Well-structured introduction and overview.The README provides a clear and comprehensive introduction to the starter kit with good organization and helpful feature highlights.
cognee/modules/pipelines/operations/run_tasks_base.py (2)
17-17
: Good implementation of context parameter propagation.The addition of the optional
context
parameter with proper default value handling enables flexible context passing through the pipeline execution.Also applies to: 66-66
35-36
: Well-implemented conditional context passing.The parameter inspection approach ensures context is only passed to functions that expect it, and the recursive propagation maintains context throughout the execution chain.
Also applies to: 40-40, 79-81
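For readers unfamiliar with the pattern, here is a minimal sketch of signature-based conditional argument passing; the helper name and call shape are illustrative and not the actual run_tasks_base code:

import inspect

async def call_with_optional_context(executable, args, context=None):
    # Inspect the callable's signature and forward `context` only when it declares that parameter.
    accepts_context = "context" in inspect.signature(executable).parameters
    if accepts_context:
        return await executable(*args, context=context)
    return await executable(*args)

async def task_with_context(num, context=None):
    return num + context["offset"]

# await call_with_optional_context(task_with_context, [5], context={"offset": 7})  # -> 12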
cognee/api/v1/responses/routers/get_responses_router.py (1)
74-74
:⚠️ Potential issueFix the static analysis warning for function call in argument defaults.
The static analysis tool correctly identifies that using
Depends()
in argument defaults can cause issues.Move the dependency call inside the function or use a module-level variable:
async def create_response(
    request: ResponseRequest,
-    user: User = Depends(get_authenticated_user),
+    user: User = Depends(get_authenticated_user),
) -> ResponseBody:
Actually, the current usage is correct for FastAPI. The static analysis tool is giving a false positive here since FastAPI specifically designed
Depends()
to be used this way in function signatures.🧰 Tools
🪛 Ruff (0.11.9)
74-74: Do not perform function call
Depends
in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable(B008)
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (3)
186-188
: Double-checkCollectionNotFoundError
signatureA new
log_level="DEBUG"
parameter is passed to the exception. IfCollectionNotFoundError
still only acceptsmessage: str
, this will raise aTypeError
and mask the real “collection not found” problem.-raise CollectionNotFoundError( - f"Collection '{collection_name}' not found!", log_level="DEBUG" -) +raise CollectionNotFoundError(f"Collection '{collection_name}' not found!", log_level="DEBUG") # or drop the extra arg if the class signature has not been updatedPlease confirm the updated constructor and adjust accordingly.
222-237
: Avoid shadowing & clarify variable semantics
closest_items
is first initialised as an empty list (line 223) and then immediately overwritten by theCursorResult
returned fromsession.execute
(line 236).
This shadowing:
- Obscures the real type held by the variable (list →
CursorResult
).- Makes the later
closest_items.all()
call slightly confusing to read.A tiny rename keeps intent crystal-clear:
-# NOTE: This needs to be initialized in case search doesn't return a value
-closest_items = []
+closest_items_result: CursorResult | None = None
...
-closest_items = await session.execute(query)
+closest_items_result = await session.execute(query)
...
-for row in closest_items.all():
+for row in closest_items_result.all():
241-252
: Graceful handling of empty result setsGood catch adding the explicit early-return on an empty
vector_list
(lines 250-252). ✔️
This prevents a division-by-zero insidenormalize_distances
.cognee/infrastructure/databases/graph/networkx/adapter.py (1)
624-636
: Return value consistency
get_nodes
now returns dictionaries containing"id"
as aUUID
.
Call-sites that previously expected strings (e.g., REST serialisation) will need to convert.
Suggest documenting the change in the adapter docstring and/or adding:{"id": str(node_id), **self.graph.nodes[node_id]}to minimise downstream breakage.
cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (2)
37-47
: Repeatedconnect()
calls can leak sockets
get_client()
callsawait self.client.connect()
every time.
The async client usually keeps an internal pool; reconnecting on every DAO method may open superfluous websockets.
async def get_client(self):
-    await self.client.connect()
-    return self.client
+    if not getattr(self.client, "is_connected", False):
+        await self.client.connect()
+    return self.client
[ suggest_essential_refactor ]
166-174
:search
parameter validation looks goodNice consolidation of distance handling and a higher default
limit = 15
.
Input validation correctly preventsNone/None
queries. ✔️
# Drop all tables from the public schema | ||
schema_list = ["public", "public_staging"] |
💡 Verification agent
🧩 Analysis chain
Verify the behavioral change from dynamic to static schema targeting.
The hardcoded schema list ["public", "public_staging"]
replaces the dynamic get_schema_list()
call, fundamentally changing the scope of the database deletion operation. This could be a breaking change if other schemas exist that need cleanup.
The method name delete_database()
suggests complete database cleanup, but now it only targets specific schemas. Consider either:
- Updating the method name to reflect the limited scope (e.g.,
delete_core_schemas()
)
- Adding documentation explaining why only these schemas are targeted
- Reverting to dynamic schema detection if complete cleanup is intended
🏁 Script executed:
#!/bin/bash
# Description: Verify if there are other usages of get_schema_list() and check for potential schemas that might be missed
# Expected: Find other usages of get_schema_list() and understand the impact
echo "=== Checking other usages of get_schema_list() ==="
rg -A 3 -B 3 "get_schema_list"
echo "=== Checking for other schema references in the codebase ==="
rg -A 2 -B 2 "schema.*=|schema_name.*="
echo "=== Checking for migration or setup files that might create additional schemas ==="
fd -e sql -e py | xargs rg -l "CREATE SCHEMA|create.*schema"
Length of output: 86031
Clarify delete_database() scope
Our search found no other schema definitions or uses of get_schema_list() outside of public and public_staging. The change from dynamic to ["public","public_staging"]
therefore limits cleanup to just those two schemas. To avoid confusion or future breakage if new schemas are introduced, please either:
• Update the delete_database()
docstring (in cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py) to explain that it only drops tables in the “public” and “public_staging” schemas, or
• Rename the method (e.g., to delete_core_schemas()
) to more accurately reflect its limited scope.
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py
around lines 332 to 333, the delete_database() method now only drops tables from
the hardcoded schemas "public" and "public_staging" instead of dynamically
targeting all schemas. To clarify this limited scope and prevent confusion,
either update the delete_database() method's docstring to explicitly state it
only affects these two schemas or rename the method to something like
delete_core_schemas() that better reflects this narrower functionality.
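As a rough illustration of the narrower behaviour described above (a sketch only; the adapter's real engine setup, session handling, and method names may differ, and it assumes a PostgreSQL pg_tables catalog):

from sqlalchemy import text

CORE_SCHEMAS = ["public", "public_staging"]

async def delete_core_schemas(engine):
    # Drops tables only in the two core schemas; any other schema is left untouched.
    async with engine.begin() as connection:
        for schema in CORE_SCHEMAS:
            tables = await connection.execute(
                text("SELECT tablename FROM pg_tables WHERE schemaname = :schema"),
                {"schema": schema},
            )
            for (table_name,) in tables:
                await connection.execute(
                    text(f'DROP TABLE IF EXISTS "{schema}"."{table_name}" CASCADE')
                )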
@@ -44,7 +44,7 @@ async def embed_text(self, text: List[str]) -> List[List[float]]: | |||
embeddings = self.embedding_model.embed( | |||
text, | |||
batch_size=len(text), | |||
parallel=0, | |||
parallel=None, |
💡 Verification agent
🧩 Analysis chain
Verify the correct parameter value for fastembed's parallel option.
The change from parallel=0
to parallel=None
could have different semantics depending on the fastembed library's API. Please ensure this aligns with the intended behavior and the library's documentation.
🌐 Web query:
What is the correct usage of the parallel parameter in fastembed TextEmbedding.embed() method?
💡 Result:
The parallel
parameter in FastEmbed's embed()
method is used to enable data-parallel processing for embedding generation, which can significantly improve performance when working with large datasets.
Correct Usage of the Parallel Parameter
The parallel
parameter accepts an integer value that determines how the embedding workload will be distributed:
- If set to a value greater than 1, data-parallel encoding will be used, which is recommended for offline encoding of large datasets[6][9].
- If set to 0, all available CPU cores will be used[9].
- If set to None (default), data-parallel processing won't be used, and the default ONNX runtime threading will be used instead[9].
When working with GPU acceleration, there are additional considerations:
- The value of
parallel
must match the number of GPUs specified in thedevice_ids
parameter[5]. - For example, if you specify
device_ids=[0,1,2,3]
, you should setparallel=4
[5]. - This ensures proper utilization by spawning child processes for each GPU.
Implementation Example
Here's how you can use the parallel parameter with FastEmbed:
from fastembed.embedding import DefaultEmbedding
# Initialize the embedding model
embedding_model = DefaultEmbedding()
# For CPU-based parallel processing
embeddings = list(embedding_model.embed(documents, parallel=4, batch_size=256))
# For GPU-based parallel processing
embedding_model_gpu = TextEmbedding(
model_name="intfloat/multilingual-e5-large",
cuda=True,
device_ids=[0, 1, 2, 3]
)
# Must match number of GPUs in device_ids
embeddings = list(embedding_model_gpu.embed(documents, parallel=4, batch_size=256))
Additional Considerations
When using the parallel
parameter:
- It works well with the
batch_size
parameter, which controls how many documents are processed in a single batch[6][7]. - For single GPU usage, the
parallel
parameter is not necessary[5]. - When using GPUs, you must use
cuda=True
and not explicitly specifyproviders
[5]. - The
cuda
andproviders
parameters are mutually exclusive[5].
This parameter is particularly useful for improving performance when embedding large datasets, allowing you to take advantage of multiple CPU cores or GPUs for parallel processing.
Citations:
- 1: https://docs.haystack.deepset.ai/docs/fastembedtextembedder
- 2: https://github.com/qdrant/fastembed
- 3: https://docs.haystack.deepset.ai/v2.0/docs/fastembedtextembedder
- 4: https://dev.to/qdrant/fastembed-fast-and-lightweight-embedding-generation-for-text-4i6c
- 5: [Bug]: Parallel Embedding is not working on Windows Servers qdrant/fastembed#414
- 6: https://python-client.qdrant.tech/qdrant_client.qdrant_fastembed
- 7: https://python-client.qdrant.tech/_modules/qdrant_client/qdrant_fastembed
- 8: https://docs.llamaindex.ai/en/stable/examples/embeddings/fastembed/
- 9: https://python.langchain.com/docs/integrations/text_embedding/fastembed/
- 10: https://docs.llamaindex.ai/en/v0.10.20/examples/embeddings/fastembed.html
Review the parallel
parameter change in FastembedEmbeddingEngine
The update from parallel=0
to parallel=None
is not equivalent:
parallel=0
⇒ enables data-parallel encoding across all CPU cores.parallel=None
(the default) ⇒ disables data-parallel processing and falls back to ONNX Runtime’s internal threading.
Please confirm whether you intended to drop data-parallel CPU processing:
- If you still want to leverage all cores for large-batch offline encoding, revert to
parallel=0
. - If you deliberately prefer ONNX Runtime threading only, keep
parallel=None
and consider documenting this behavioral change.
File:
- cognee/infrastructure/databases/vector/embeddings/FastembedEmbeddingEngine.py (line 47)
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/vector/embeddings/FastembedEmbeddingEngine.py
at line 47, the change from parallel=0 to parallel=None alters the behavior of
the embed method by disabling data-parallel CPU processing and relying on ONNX
Runtime threading instead. To fix this, decide if you want to keep data-parallel
processing across all CPU cores; if so, revert parallel back to 0. Otherwise,
keep parallel=None but add a comment explaining this intentional change in
threading behavior for clarity.
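A small sketch of the two settings discussed above, based on the fastembed behaviour quoted in the web query result (the model choice and batch size here are arbitrary):

from fastembed import TextEmbedding

model = TextEmbedding()
documents = ["first document", "second document"]

# parallel=0: data-parallel encoding across all available CPU cores (the previous behaviour)
embeddings_all_cores = list(model.embed(documents, batch_size=256, parallel=0))

# parallel=None: no data-parallel processing; rely on ONNX Runtime's own threading (the new default)
embeddings_onnx_threads = list(model.embed(documents, batch_size=256, parallel=None))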
@@ -61,8 +61,7 @@ More on [use-cases](https://docs.cognee.ai/use-cases) and [evals](https://github | |||
|
|||
## Get Started | |||
|
|||
Get started quickly with a Google Colab <a href="https://colab.research.google.com/drive/1jHbWVypDgCLwjE71GSXhRL3YxYhCZzG1?usp=sharing">notebook</a> or <a href="https://github.com/topoteretes/cognee-starter">starter repo</a> | |||
|
|||
Get started quickly with a Google Colab <a href="https://colab.research.google.com/drive/1jHbWVypDgCLwjE71GSXhRL3YxYhCZzG1?usp=sharing">notebook</a> or our starter folder</a> |
Fix incomplete link markup.
The link markup is incomplete and will break the README rendering. The opening <a>
tag is missing.
-Get started quickly with a Google Colab <a href="https://colab.research.google.com/drive/1jHbWVypDgCLwjE71GSXhRL3YxYhCZzG1?usp=sharing">notebook</a> or our starter folder</a>
+Get started quickly with a Google Colab <a href="https://colab.research.google.com/drive/1jHbWVypDgCLwjE71GSXhRL3YxYhCZzG1?usp=sharing">notebook</a> or <a href="PATH_TO_STARTER_FOLDER">our starter folder</a>
Note: Replace PATH_TO_STARTER_FOLDER
with the actual path to the starter folder.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Get started quickly with a Google Colab <a href="https://colab.research.google.com/drive/1jHbWVypDgCLwjE71GSXhRL3YxYhCZzG1?usp=sharing">notebook</a> or our starter folder</a> | |
Get started quickly with a Google Colab <a href="https://colab.research.google.com/drive/1jHbWVypDgCLwjE71GSXhRL3YxYhCZzG1?usp=sharing">notebook</a> or <a href="PATH_TO_STARTER_FOLDER">our starter folder</a> |
🤖 Prompt for AI Agents
In README.md at line 64, the anchor tag for the starter folder link is missing
its opening <a> tag, causing broken link markup. Add the opening <a> tag with
the correct href attribute pointing to the starter folder path before the text
"our starter folder" to complete the link markup and ensure proper rendering.
entrypoint.sh
Outdated
debugpy --wait-for-client --listen 0.0.0.0:5678 -m gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app | ||
else | ||
exec gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app | ||
gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app | ||
fi | ||
else | ||
exec gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level error cognee.api.client:app | ||
gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level error cognee.api.client:app | ||
fi |
🛠️ Refactor suggestion
Consider restoring exec
for proper signal handling in containers.
The removal of exec
from the gunicorn startup commands means the shell process remains as the parent, which can interfere with proper signal handling in containerized environments. Without exec
, SIGTERM signals sent by container orchestrators may not reach the gunicorn process directly, potentially causing issues with graceful shutdowns.
Restore exec
for proper signal propagation:
- debugpy --wait-for-client --listen 0.0.0.0:5678 -m gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app
+ exec debugpy --wait-for-client --listen 0.0.0.0:5678 -m gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app
- gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app
+ exec gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app
- gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level error cognee.api.client:app
+ exec gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level error cognee.api.client:app
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
debugpy --wait-for-client --listen 0.0.0.0:5678 -m gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app | |
else | |
exec gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app | |
gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app | |
fi | |
else | |
exec gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level error cognee.api.client:app | |
gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level error cognee.api.client:app | |
fi | |
exec debugpy --wait-for-client --listen 0.0.0.0:5678 -m gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app | |
else | |
exec gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app | |
fi | |
else | |
exec gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level error cognee.api.client:app | |
fi |
🤖 Prompt for AI Agents
In entrypoint.sh around lines 36 to 42, the gunicorn startup commands lack the
'exec' prefix, causing the shell to remain the parent process and interfering
with proper signal handling in containers. To fix this, add 'exec' before each
gunicorn command so that the shell process is replaced by gunicorn, ensuring
SIGTERM and other signals are correctly propagated for graceful shutdowns.
@@ -29,7 +29,7 @@ export default function WizardPage({ | |||
setFalse: closeSettingsModal, | |||
} = useBoolean(false); | |||
|
|||
const dataset = { id: 'main' }; | |||
const dataset = { name: 'main' }; |
💡 Verification agent
🧩 Analysis chain
Verify dataset property change coordination across components.
The change from id
to name
property looks correct based on the coordinated updates mentioned in the AI summary. However, ensure that all consuming components (CognifyStep
and ExploreStep
) have been updated to expect the name
property instead of id
.
🏁 Script executed:
#!/bin/bash
# Description: Verify that CognifyStep and ExploreStep components expect dataset.name property
# Expected: Components should reference dataset.name, not dataset.id
echo "Checking CognifyStep component..."
rg -A 5 -B 5 "dataset\.(id|name)" cognee-frontend/src/app/wizard/CognifyStep/
echo "Checking ExploreStep component..."
rg -A 5 -B 5 "dataset\.(id|name)" cognee-frontend/src/app/wizard/ExploreStep/
echo "Checking other potential references to dataset.id..."
rg -A 2 -B 2 "dataset\.id" cognee-frontend/src/
Length of output: 4262
Align dataset identifier usage across the codebase
The switch from dataset.id
to dataset.name
in WizardPage.tsx is incomplete—several modules still reference dataset.id
, leading to runtime errors. Please update or map these to use dataset.name
(or introduce a consistent adapter) in the following locations:
• src/modules/ingestion/useDatasets.ts
– statuses[dataset.id]
– removeDataset filter: dataset.id !== datasetId
• src/modules/ingestion/addData.ts
– formData.append('datasetId', dataset.id)
• src/modules/ingestion/DatasetsView/DatasetsView.tsx
– <DatasetItem key={dataset.id} ...>
– onClick(dataset) passing old id
• src/modules/datasets/getDatasetData.ts
– function signature getDatasetData(dataset: { id: string })
– URL: /v1/datasets/${dataset.id}/data
• src/modules/datasets/deleteDataset.ts
– function signature deleteDataset(dataset: { id: string })
– URL: /v1/datasets/${dataset.id}
• src/app/page.tsx
– datasets.find((d) => d.id === selectedDataset)
– onDatasetCognify({ id: dataset.id })
• src/modules/datasets/cognifyDataset.ts
– datasets: [dataset.id || dataset.name]
in request body
Suggested next steps:
- Update these functions/components to consume
dataset.name
. - Adjust API calls or backend adapters to accept name as identifier (or map name→id).
- Revise type definitions/interfaces to reflect the new
name
property.
This refactor must be coordinated end-to-end to prevent breaking changes.
🤖 Prompt for AI Agents
In cognee-frontend/src/app/wizard/WizardPage.tsx at line 32, the dataset object
uses the property name instead of id, but multiple other modules still reference
dataset.id causing runtime errors. To fix this, update all references in the
listed files (useDatasets.ts, addData.ts, DatasetsView.tsx, getDatasetData.ts,
deleteDataset.ts, page.tsx, cognifyDataset.ts) to use dataset.name consistently.
Adjust function signatures, API calls, and type definitions accordingly to
accept name as the identifier or implement a mapping from name to id to maintain
compatibility. Ensure all consuming components and backend interactions align
with this change to avoid breaking issues.
for neighbour in predecessors: | ||
neighbour = neighbour["relation"] | ||
connections.append((neighbour[0], {"relationship_name": neighbour[1]}, neighbour[2])) | ||
|
||
for neighbour in successors: | ||
neighbour = neighbour["relation"] | ||
connections.append((neighbour[0], {"relationship_name": neighbour[1]}, neighbour[2])) | ||
|
get_connections
indexes into a relationship instead of using its attributes
Same issue as above—relation[0]
, relation[1]
, relation[2]
are invalid. You already return the nodes in the Cypher query; simply forward them:
MATCH (node)<-[relation]-(neighbour)
...
RETURN neighbour.id AS src_id,
type(relation) AS rel_type,
node.id AS dst_id
and then:
connections.append(
(row["src_id"], {"relationship_name": row["rel_type"]}, row["dst_id"])
)
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py around
lines 387 to 394, the code incorrectly indexes into the relationship object
using relation[0], relation[1], and relation[2], which is invalid. Instead,
modify the code to use the attributes returned by the Cypher query directly by
appending connections with the source ID, a dictionary containing the
relationship type, and the destination ID, like
connections.append((row["src_id"], {"relationship_name": row["rel_type"]},
row["dst_id"])). This change ensures the correct use of relationship data
without indexing errors.
sanitized_id = node_id.replace(":", "_") | ||
|
||
query = "MATCH (node: {{id: $node_id}}) DETACH DELETE node" | ||
params = {"node_id": sanitized_id} | ||
|
Malformed Cypher pattern prevents node deletion
MATCH (node: {{id: $node_id}})
contains an extra :
and doubled braces, producing a syntax error. Replace with a property map:
- query = "MATCH (node: {{id: $node_id}}) DETACH DELETE node"
+ query = "MATCH (node {id: $node_id}) DETACH DELETE node"
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
sanitized_id = node_id.replace(":", "_") | |
query = "MATCH (node: {{id: $node_id}}) DETACH DELETE node" | |
params = {"node_id": sanitized_id} | |
sanitized_id = node_id.replace(":", "_") | |
query = "MATCH (node {id: $node_id}) DETACH DELETE node" | |
params = {"node_id": sanitized_id} |
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py around
lines 120 to 124, the Cypher query syntax is incorrect due to an extra colon and
doubled braces in the node pattern. Replace the pattern MATCH (node: {{id:
$node_id}}) with MATCH (node {id: $node_id}) to correctly specify the node by
its property map without the extra colon and braces, ensuring the query runs
without syntax errors.
return [ | ||
(result["n"]["id"], result["m"]["id"], {"relationship_name": result["r"][1]}) | ||
for result in results | ||
] |
Relationship object is not subscriptable
result["r"][1]
assumes a list/tuple; Neo4j’s Relationship
object exposes its type via .type
. Accessing it by index will raise TypeError
.
- (result["n"]["id"], result["m"]["id"], {"relationship_name": result["r"][1]})
+ (
+ result["n"]["id"],
+ result["m"]["id"],
+ {"relationship_name": result["r"].type},
+ )
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
return [ | |
(result["n"]["id"], result["m"]["id"], {"relationship_name": result["r"][1]}) | |
for result in results | |
] | |
return [ | |
( | |
result["n"]["id"], | |
result["m"]["id"], | |
{"relationship_name": result["r"].type}, | |
) | |
for result in results | |
] |
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py around
lines 255 to 258, the code incorrectly accesses the relationship type using
indexing on result["r"], which is a Relationship object and not subscriptable.
Replace result["r"][1] with result["r"].type to correctly access the
relationship type attribute without causing a TypeError.
def serialize_properties(self, properties=dict()): | ||
serialized_properties = {} | ||
|
||
for property_key, property_value in properties.items(): | ||
if isinstance(property_value, UUID): |
Avoid mutable default argument
properties=dict()
creates a shared dictionary across calls.
- def serialize_properties(self, properties=dict()):
+ def serialize_properties(self, properties: Optional[dict] = None):
serialized_properties = {}
-
- for property_key, property_value in properties.items():
+ properties = properties or {}
+ for property_key, property_value in properties.items():
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def serialize_properties(self, properties=dict()): | |
serialized_properties = {} | |
for property_key, property_value in properties.items(): | |
if isinstance(property_value, UUID): | |
def serialize_properties(self, properties: Optional[dict] = None): | |
serialized_properties = {} | |
properties = properties or {} | |
for property_key, property_value in properties.items(): | |
if isinstance(property_value, UUID): | |
... |
🧰 Tools
🪛 Ruff (0.11.9)
430-430: Do not use mutable data structures for argument defaults
Replace with None
; initialize within function
(B006)
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py around
lines 430 to 434, the method serialize_properties uses a mutable default
argument properties=dict(), which can lead to unexpected behavior due to shared
state across calls. Change the default value of properties to None and inside
the method initialize it to an empty dictionary if it is None, ensuring each
call gets a fresh dictionary.
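For context, a standalone illustration of why a mutable default leaks state between calls (not Cognee code):

def append_item(item, bucket=[]):  # the same list object is reused for every call
    bucket.append(item)
    return bucket

print(append_item(1))  # [1]
print(append_item(2))  # [1, 2]  <- value from the previous call is still there

def append_item_fixed(item, bucket=None):
    bucket = bucket if bucket is not None else []  # fresh list per call
    bucket.append(item)
    return bucket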
UNWIND $node_ids AS id | ||
MATCH (node:`{id}`)<-[r:{edge_label}]-(successor) | ||
DELETE r; | ||
""" |
Cypher pattern in remove_connection_to_successors_of
is invalid
Back-ticked {id}
and inline braces break parsing. Use a parameterised property map instead:
-MATCH (node:`{id}`)<-[r:{edge_label}]-(successor)
+MATCH (node {id: id})<-[r]-(successor)
+WHERE type(r) = $edge_label
(Assuming you meant the UNWIND variable id
.)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
UNWIND $node_ids AS id | |
MATCH (node:`{id}`)<-[r:{edge_label}]-(successor) | |
DELETE r; | |
""" | |
UNWIND $node_ids AS id | |
- MATCH (node:`{id}`)<-[r:{edge_label}]-(successor) | |
+ MATCH (node {id: id})<-[r]-(successor) | |
+ WHERE type(r) = $edge_label | |
DELETE r; | |
""" |
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py around
lines 415 to 418, the Cypher pattern uses back-ticked `{id}` and inline braces
which is invalid syntax. Replace the pattern `(node:`{id}`)` with a
parameterized property map using the UNWIND variable `id`, for example `(node
{id: id})`, to correctly match nodes by their id property without breaking
parsing.
entrypoint.sh
Outdated
if [ "$DEBUG" = "true" ]; then | ||
echo "Waiting for the debugger to attach..." | ||
exec python -m debugpy --wait-for-client --listen 0.0.0.0:5678 -m gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app | ||
debugpy --wait-for-client --listen 0.0.0.0:5678 -m gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app |
Exposed Debug Interface with Remote Code Execution Risk
Suggested Fix
@@ -32,11 +40,11 @@
# Modified Gunicorn startup with error handling
if [ "$ENVIRONMENT" = "dev" ] || [ "$ENVIRONMENT" = "local" ]; then
if [ "$DEBUG" = "true" ]; then
echo "Waiting for the debugger to attach..."
- debugpy --wait-for-client --listen 0.0.0.0:5678 -m gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app
+ debugpy --wait-for-client --listen 127.0.0.1:5678 -m gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app
else
gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app
fi
else
gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level error cognee.api.client:app
-fi
+fi
\ No newline at end of file
Explanation of Fix
Vulnerability and Fix Summary:
Vulnerability: The script exposes the debugpy
debugger on all interfaces (0.0.0.0:5678
) when DEBUG=true
and ENVIRONMENT
is set to dev
or local
. This enables anyone with network access to attach to the debugger and execute arbitrary code, leading to a critical security risk (CWE-489, CWE-284).
Fix: The patch restricts the debugpy debugger to only listen on 127.0.0.1
(localhost) rather than 0.0.0.0
, making it accessible only from within the container/pod itself and not from external hosts. This eliminates the risk of remote code execution by unauthorized users while preserving the intended debugging functionality for local development. No new dependencies are introduced and behavior remains unchanged except for this stricter binding.
Original Vulnerable Code:
if [ "$ENVIRONMENT" = "dev" ] || [ "$ENVIRONMENT" = "local" ]; then
if [ "$DEBUG" = "true" ]; then
echo "Waiting for the debugger to attach..."
debugpy --wait-for-client --listen 0.0.0.0:5678 -m gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app
else
gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app
fi
else
gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level error cognee.api.client:app
fi
Patched Code:
See below in the <patch>
tag.
Changes made: The only change is replacing --listen 0.0.0.0:5678
with --listen 127.0.0.1:5678
in the invocation of debugpy
, restricting access to the debugger port to localhost only.
Potential Impacts: Developers must now attach debuggers from inside the container or via kubectl exec
/docker exec
as the debugger is not accessible remotely. This does not impact existing application startup, migration, or gunicorn operations. No other code is affected.
Issues
Type | Identifier | Message | Severity | Link |
---|---|---|---|---|
Application | CWE-489, CWE-284 | When DEBUG=true in a dev/local environment, the script launches debugpy listening on 0.0.0.0:5678. Anyone who can reach the container’s network can attach to this debugger and execute arbitrary Python code, effectively granting remote code-execution rights. Exposing a debug interface in production or an otherwise reachable environment constitutes both leftover debug code (CWE-489) and improper access control (CWE-284). | critical | Link |
Arbitrary Python Code Execution via Unrestricted Module Loading
@@ -420,10 +420,38 @@
return "\n".join(edge_strings)
def load_class(model_file, model_name):
- model_file = os.path.abspath(model_file)
- spec = importlib.util.spec_from_file_location("graph_model", model_file)
+ """
+ Safely load a class from a Python file, restricting import to a safe subdirectory.
+ Only allows .py files within the same directory as this server.py or a designated safe folder.
+ """
+ # Base directory to allow imports from (e.g., the directory containing this file)
+ SAFE_BASE_DIR = os.path.dirname(os.path.realpath(__file__))
+ model_file_path = os.path.abspath(model_file)
+
+ # Check that it's a Python file
+ if not model_file_path.endswith('.py'):
+ raise ValueError("Only Python (.py) files are allowed for graph_model_file.")
+
+ # Prevent directory traversal and absolute path outside safe base
+ # Disallow any path containing .. segments that would escape the base dir
+ # Also disallow symlinks outside safe dir
+ try:
+ common = os.path.commonpath([SAFE_BASE_DIR, os.path.realpath(model_file_path)])
+ except ValueError:
+ # In case paths are on different drives/platforms
+ raise ValueError("The graph_model_file path is not permitted.")
+
+ if common != SAFE_BASE_DIR:
+ raise ValueError(
+ "graph_model_file must be located within the allowed directory."
+ )
+
+ if not os.path.isfile(model_file_path):
+ raise ValueError("The specified graph_model_file does not exist.")
+
+ spec = importlib.util.spec_from_file_location("graph_model", model_file_path)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
model_class = getattr(module, model_name)
@@ -457,5 +485,5 @@
try:
asyncio.run(main())
except Exception as e:
logger.error(f"Error initializing Cognee MCP server: {str(e)}")
- raise
+ raise
\ No newline at end of file
Explanation of Fix
Vulnerability summary:
Fix:
Changes made:
Potential impacts:
Unexpected import note:
This patch is production ready and does not introduce new dependencies.
Issues
Path Traversal Vulnerability in Local File Processing
@@ -1,8 +1,9 @@
import dlt
import s3fs
import json
import inspect
+import os # Added for secure path handling
from typing import Union, BinaryIO, Any, List, Optional
import cognee.modules.ingestion as ingestion
from cognee.infrastructure.databases.relational import get_relational_engine
from cognee.modules.data.methods import create_dataset, get_dataset_data, get_datasets_by_name
@@ -12,12 +13,10 @@
from cognee.modules.users.permissions.methods import give_permission_on_document
from .get_dlt_destination import get_dlt_destination
from .save_data_item_to_storage import save_data_item_to_storage
-
from cognee.api.v1.add.config import get_s3_config
-
async def ingest_data(
data: Any, dataset_name: str, user: User, node_set: Optional[List[str]] = None
):
destination = get_dlt_destination()
@@ -37,15 +36,27 @@
fs = s3fs.S3FileSystem(
key=s3_config.aws_access_key_id, secret=s3_config.aws_secret_access_key, anon=False
)
+ # Define a safe base directory for all local files (must be set to your data storage location)
+ base_dataset_dir = "/path/to/your/dataset/storage" # <-- Set this to the intended data storage directory
+
def open_data_file(file_path: str):
if file_path.startswith("s3://"):
return fs.open(file_path, mode="rb")
else:
local_path = file_path.replace("file://", "")
- return open(local_path, mode="rb")
+ # Secure absolute, canonical path
+ requested_path = os.path.abspath(os.path.realpath(local_path))
+ base_path = os.path.abspath(os.path.realpath(base_dataset_dir))
+
+ # Check that requested_path is a subpath of base_path
+ if not os.path.commonprefix([requested_path, base_path]) == base_path:
+ raise ValueError("Attempted file access outside of the permitted data directory.")
+
+ return open(requested_path, mode="rb")
+
def get_external_metadata_dict(data_item: Union[BinaryIO, str, Any]) -> dict[str, Any]:
if hasattr(data_item, "dict") and inspect.ismethod(getattr(data_item, "dict")):
return {"metadata": data_item.dict(), "origin": str(type(data_item))}
else:
@@ -188,5 +199,5 @@
if datasets:
dataset = datasets[0]
data_documents = await get_dataset_data(dataset_id=dataset.id)
return data_documents
- return []
+ return []
\ No newline at end of file
Explanation of Fix
Vulnerability and Fix Explanation: The vulnerability is a classic path traversal issue (CWE-22). In the original code, local file paths are accepted with a file:// prefix and opened directly, without being confined to a permitted base directory.
Fix:
Note:
Summary of impacts:
Issues
Default User Authorization Bypass in Dataset Operations
@@ -1,40 +1,82 @@
from uuid import UUID
-from cognee.modules.users.methods import get_default_user
from cognee.modules.ingestion import discover_directory_datasets
from cognee.modules.pipelines.operations.get_pipeline_status import get_pipeline_status
class datasets:
@staticmethod
- async def list_datasets():
+ async def list_datasets(user):
+ """
+ Lists datasets owned by the specified user.
+
+ Args:
+ user: The authenticated user object whose datasets are to be listed.
+
+ Returns:
+ List of datasets belonging to the user.
+ """
from cognee.modules.data.methods import get_datasets
- user = await get_default_user()
return await get_datasets(user.id)
@staticmethod
def discover_datasets(directory_path: str):
+ """
+ Discovers datasets in a directory. This is not user-context sensitive.
+
+ Args:
+ directory_path: Path to directory to discover datasets in.
+
+ Returns:
+ List of discovered dataset names.
+ """
return list(discover_directory_datasets(directory_path).keys())
@staticmethod
- async def list_data(dataset_id: str):
- from cognee.modules.data.methods import get_dataset, get_dataset_data
+ async def list_data(user, dataset_id: str):
+ """
+ Lists data in a specific dataset owned by the user.
- user = await get_default_user()
+ Args:
+ user: The authenticated user object.
+ dataset_id: The ID of the dataset to retrieve data from.
+ Returns:
+ Data in the specified dataset.
+ """
+ from cognee.modules.data.methods import get_dataset, get_dataset_data
+
dataset = await get_dataset(user.id, dataset_id)
return await get_dataset_data(dataset.id)
@staticmethod
async def get_status(dataset_ids: list[UUID]) -> dict:
+ """
+ Gets the status for the provided dataset IDs.
+
+ Args:
+ dataset_ids: List of UUIDs for datasets.
+
+ Returns:
+ Dictionary of statuses for each dataset.
+ """
return await get_pipeline_status(dataset_ids, pipeline_name="cognify_pipeline")
@staticmethod
- async def delete_dataset(dataset_id: str):
+ async def delete_dataset(user, dataset_id: str):
+ """
+ Deletes a dataset owned by the specified user.
+
+ Args:
+ user: The authenticated user object.
+ dataset_id: The ID of the dataset to be deleted.
+
+ Returns:
+ Result of delete operation.
+ """
from cognee.modules.data.methods import get_dataset, delete_dataset
- user = await get_default_user()
dataset = await get_dataset(user.id, dataset_id)
- return await delete_dataset(dataset)
+ return await delete_dataset(dataset)
\ No newline at end of file
Explanation of Fix
Vulnerability & Fix Explanation: The root cause of the vulnerability is that all sensitive dataset operations (listing, reading, and deleting datasets) resolve their user via get_default_user() instead of the authenticated caller.
Fix: The affected methods now take the authenticated user as an explicit argument.
Other class methods (such as discover_datasets and get_status) are unchanged.
Side Effects & Compatibility:
Issues
from cognee.api.v1.search import search | ||
from cognee.api.v1.cognify import cognify | ||
from cognee.api.v1.prune import prune | ||
|
||
|
||
from cognee.modules.users.methods import get_default_user | ||
from cognee.api.v1.responses.default_tools import DEFAULT_TOOLS | ||
|
||
logger = logging.getLogger(__name__) | ||
|
||
|
||
async def dispatch_function(tool_call: Union[ToolCall, Dict[str, Any]]) -> str: | ||
""" | ||
Dispatches a function call to the appropriate Cognee function. | ||
""" | ||
if isinstance(tool_call, dict): | ||
function_data = tool_call.get("function", {}) | ||
function_name = function_data.get("name", "") | ||
arguments_str = function_data.get("arguments", "{}") | ||
else: | ||
function_name = tool_call.function.name | ||
arguments_str = tool_call.function.arguments | ||
|
||
arguments = json.loads(arguments_str) | ||
|
||
logger.info(f"Dispatching function: {function_name} with args: {arguments}") | ||
|
||
user = await get_default_user() | ||
|
||
if function_name == "search": | ||
return await handle_search(arguments, user) | ||
elif function_name == "cognify": | ||
return await handle_cognify(arguments, user) |
Default User Privilege Escalation in Function Dispatcher
Explanation of Fix

Vulnerability:
The original code in `dispatch_function` always uses `get_default_user()` to supply the user context to sensitive downstream functions (search, add, cognify, prune). This means that any caller, regardless of their real identity or tenant, has all actions executed as the default user. This results in privilege escalation, unauthorized access, and cross-tenant data exposure (CWE-285). Furthermore, logs may include potentially sensitive input arguments (CWE-532).

Fix:
The fix introduces a mandatory `user` argument to `dispatch_function` and properly passes it to all handlers, ensuring privileged operations are performed with the identity of the actual caller, not a hardcoded default. The `get_default_user()` call is removed. The function signatures and usages are updated to require the user argument, making the requirement explicit for the dispatcher. Additionally, log messages are adjusted to avoid dumping the full arguments (partially mitigating the information leakage; the main fix is the privilege context).

Potential impacts:

- All calls to `dispatch_function` must now supply a valid authenticated user object. If not, the dispatcher will return an explicit error.
- Downstream functions requiring a user are unaffected (they already require a user), but the dispatcher no longer accidentally grants default user privileges.
- No new dependencies are introduced and existing APIs/handlers (`handle_search`, etc.) remain unchanged except for propagation of the real `user`.
Issues
| Type | Identifier | Message | Severity | Link |
|---|---|---|---|---|
| Application | CWE-532, CWE-285 | Every function call is executed using a hard-coded get_default_user() identity instead of the real caller. This bypasses any per-user authorization and effectively grants the privileges of the default (potentially privileged) account to anyone who can invoke the dispatcher, enabling privilege escalation or unauthorized access across tenant boundaries. | critical | Link |
Suggested Fix
Before:

from cognee.api.v1.search import search
from cognee.api.v1.cognify import cognify
from cognee.api.v1.prune import prune
from cognee.modules.users.methods import get_default_user
from cognee.api.v1.responses.default_tools import DEFAULT_TOOLS

logger = logging.getLogger(__name__)

async def dispatch_function(tool_call: Union[ToolCall, Dict[str, Any]]) -> str:
    """
    Dispatches a function call to the appropriate Cognee function.
    """
    if isinstance(tool_call, dict):
        function_data = tool_call.get("function", {})
        function_name = function_data.get("name", "")
        arguments_str = function_data.get("arguments", "{}")
    else:
        function_name = tool_call.function.name
        arguments_str = tool_call.function.arguments
    arguments = json.loads(arguments_str)
    logger.info(f"Dispatching function: {function_name} with args: {arguments}")
    user = await get_default_user()
    if function_name == "search":
        return await handle_search(arguments, user)
    elif function_name == "cognify":
        return await handle_cognify(arguments, user)

After:

from cognee.api.v1.search import search
from cognee.api.v1.cognify import cognify
from cognee.api.v1.prune import prune
# get_default_user import REMOVED
from cognee.api.v1.responses.default_tools import DEFAULT_TOOLS

logger = logging.getLogger(__name__)

async def dispatch_function(tool_call: Union[ToolCall, Dict[str, Any]], user: Any = None) -> str:
    """
    Dispatches a function call to the appropriate Cognee function.
    Requires the real acting user to be provided as an argument.
    """
    if user is None:
        logger.error("dispatch_function: Missing required 'user' argument.")
        return "Error: Authentication required. User context is missing."
    if isinstance(tool_call, dict):
        function_data = tool_call.get("function", {})
        function_name = function_data.get("name", "")
        arguments_str = function_data.get("arguments", "{}")
    else:
        function_name = tool_call.function.name
        arguments_str = tool_call.function.arguments
    try:
        arguments = json.loads(arguments_str)
    except Exception as e:
        logger.error(f"Failed to parse arguments JSON: {e}")
        return "Error: Invalid arguments format."
    logger.info(f"Dispatching function: {function_name}")  # Avoid logging all arguments for CWE-532
    if function_name == "search":
        return await handle_search(arguments, user)
    elif function_name == "cognify":
        return await handle_cognify(arguments, user)
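For illustration, a hedged sketch of how a call site could supply the real user under the suggested signature; the tool-call payload and the `"query"` argument are made-up examples, not the actual tool schema:

```python
# Hypothetical call site: the acting user comes from the request's auth context.
tool_call = {
    "id": "call_123",
    "type": "function",
    "function": {"name": "search", "arguments": '{"query": "what is cognee?"}'},
}

async def run_tool(authenticated_user):
    # Omitting `user` would make the patched dispatcher return an explicit error string.
    return await dispatch_function(tool_call, user=authenticated_user)
```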
async def handle_prune(arguments: Dict[str, Any], user) -> str:
    """Handle prune function call"""
    await prune()
    return "Memory has been pruned successfully."
Default User Privilege Escalation in Function Dispatcher
Explanation of Fix

Vulnerability:
The original code in `dispatch_function` always uses `get_default_user()` to supply the user context to sensitive downstream functions (search, add, cognify, prune). This means that any caller, regardless of their real identity or tenant, has all actions executed as the default user. This results in privilege escalation, unauthorized access, and cross-tenant data exposure (CWE-285). Furthermore, logs may include potentially sensitive input arguments (CWE-532).

Fix:
The fix introduces a mandatory `user` argument to `dispatch_function` and properly passes it to all handlers, ensuring privileged operations are performed with the identity of the actual caller, not a hardcoded default. The `get_default_user()` call is removed. The function signatures and usages are updated to require the user argument, making the requirement explicit for the dispatcher. Additionally, log messages are adjusted to avoid dumping the full arguments (partially mitigating the information leakage; the main fix is the privilege context).

Potential impacts:

- All calls to `dispatch_function` must now supply a valid authenticated user object. If not, the dispatcher will return an explicit error.
- Downstream functions requiring a user are unaffected (they already require a user), but the dispatcher no longer accidentally grants default user privileges.
- No new dependencies are introduced and existing APIs/handlers (`handle_search`, etc.) remain unchanged except for propagation of the real `user`.
Issues
| Type | Identifier | Message | Severity | Link |
|---|---|---|---|---|
| Application | CWE-532, CWE-285 | Every function call is executed using a hard-coded get_default_user() identity instead of the real caller. This bypasses any per-user authorization and effectively grants the privileges of the default (potentially privileged) account to anyone who can invoke the dispatcher, enabling privilege escalation or unauthorized access across tenant boundaries. | critical | Link |
Suggested Fix
Before:

async def handle_prune(arguments: Dict[str, Any], user) -> str:
    """Handle prune function call"""
    await prune()
    return "Memory has been pruned successfully."

After:

async def handle_prune(arguments: Dict[str, Any], user) -> str:
    """Handle prune function call"""
    await prune()
    return "Memory has been pruned successfully."
Sensitive Configuration Data Exposure via Telemetry Service

@@ -19,24 +19,55 @@
logger = get_logger("run_tasks(tasks: [Task], data)")
+def _sanitize_config_for_telemetry(config: dict) -> dict:
+ """
+ Returns a copy of the config with sensitive keys removed,
+ such as those containing secrets, passwords, tokens, or keys.
+ """
+ # Define keys that are considered sensitive
+ sensitive_keywords = {
+ "secret",
+ "password",
+ "passwd",
+ "token",
+ "apikey",
+ "api_key",
+ "access_key",
+ "private",
+ "credential",
+ "credentials",
+ "auth",
+ "session",
+ }
+ sanitized = {}
+ for k, v in config.items():
+ key_lower = str(k).lower()
+ # Remove keys if they match any sensitive keyword
+ if not any(word in key_lower for word in sensitive_keywords):
+ sanitized[k] = v
+ return sanitized
+
+
async def run_tasks_with_telemetry(
tasks: list[Task], data, user: User, pipeline_name: str, context: dict = None
):
config = get_current_settings()
-
logger.debug("\nRunning pipeline with configuration:\n%s\n", json.dumps(config, indent=1))
+ # Prepare sanitized config for telemetry
+ sanitized_config = _sanitize_config_for_telemetry(config)
+
try:
logger.info("Pipeline run started: `%s`", pipeline_name)
send_telemetry(
"Pipeline Run Started",
user.id,
additional_properties={
"pipeline_name": str(pipeline_name),
}
- | config,
+ | sanitized_config,
)
async for result in run_tasks_base(tasks, data, user, context):
yield result
@@ -61,9 +92,9 @@
user.id,
additional_properties={
"pipeline_name": str(pipeline_name),
}
- | config,
+ | sanitized_config,
)
raise error
@@ -100,5 +131,5 @@
except Exception as e:
yield await log_pipeline_run_error(
pipeline_run_id, pipeline_id, pipeline_name, dataset_id, data, e
)
- raise e
+ raise e
\ No newline at end of file
Explanation of Fix

Vulnerability Explanation and Fix: the pipeline attached the full runtime configuration returned by `get_current_settings()` to the "Pipeline Run Started" and pipeline-error telemetry events, so secrets, passwords, tokens, and API keys in the settings could be shipped to the telemetry service.

Fix: a `_sanitize_config_for_telemetry` helper strips every config key whose name contains a sensitive keyword (secret, password, token, api_key, credential, auth, session, and so on), and only the sanitized copy is merged into the telemetry properties.

Potential Impacts: telemetry events no longer include the filtered keys; the full configuration is still written to the local debug log.
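A self-contained sketch of the sanitizer's behavior; the helper body is condensed from the diff above and the config keys are illustrative:

```python
def _sanitize_config_for_telemetry(config: dict) -> dict:
    sensitive_keywords = {
        "secret", "password", "passwd", "token", "apikey", "api_key",
        "access_key", "private", "credential", "credentials", "auth", "session",
    }
    # Keep only keys whose names contain none of the sensitive keywords.
    return {
        k: v for k, v in config.items()
        if not any(word in str(k).lower() for word in sensitive_keywords)
    }


config = {
    "vector_db_provider": "lancedb",       # kept
    "llm_api_key": "sk-...",               # dropped ("api_key" substring)
    "graph_database_password": "hunter2",  # dropped ("password" substring)
}
print(_sanitize_config_for_telemetry(config))
# {'vector_db_provider': 'lancedb'}
```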
""" | ||
# Use default tools if none provided | ||
tools = request.tools or DEFAULT_TOOLS | ||
|
||
# Call the API | ||
response = await call_openai_api_for_model( | ||
input_text=request.input, | ||
model=request.model, |
Unvalidated LLM Function Call Execution
Explanation of Fix

Vulnerability & Fix Summary:
The vulnerability here is LLM Function-Call Injection (ML09) and Improper Control of Dynamically Evaluated Code (CWE-94): after calling the LLM API, the backend blindly trusts items in the LLM's `output` list where `type == "function_call"`; specifically, the `function_name` and `arguments_str` are passed directly to `dispatch_function`, which can be exploited to perform unauthorized or arbitrary backend calls.

Patch Approach:

- Allow-list: Only permit function names that come from registered, pre-approved tools, i.e. tools that should have been passed into the LLM as available options (via the `tools` argument). We extract these names by examining `tools`: for each tool, if its `type` is `"function"` and it provides a valid function name, then it is permitted.
- Validation: When iterating through LLM outputs, we now check that each function call's `name` is in our allow-list (`allowed_function_names`). If not, we skip dispatching and instead insert an error in the output for that call, preventing execution of arbitrary or malicious functions.

Potential Impacts:

- Any function call generated by the LLM which is not present in the explicit `tools` list provided as input will be ignored/skipped and result in an error message in the response object.
- This is compatible with established OpenAI tool conventions and does not break backward compatibility unless consumers were previously relying on the ability to call arbitrary backend functions (which is highly dangerous and should not be allowed).
- No new dependencies are introduced. Error handling is preserved.
Issues
| Type | Identifier | Message | Severity | Link |
|---|---|---|---|---|
| Application | CWE-94, ML09 | LLM-generated function calls are executed blindly via dispatch_function with no allow-list or validation of function_name or arguments_str. A user can influence the model's response (or provide custom tool definitions earlier in the request) to trigger unintended backend functions or supply malicious arguments, leading to privilege escalation or arbitrary code execution. This is an instance of LLM Function-Call Injection (ML09) and general Improper Control of Dynamically Evaluated Code (CWE-94). | critical | Link |
Suggested Fix
""" | |
# Use default tools if none provided | |
tools = request.tools or DEFAULT_TOOLS | |
# Call the API | |
response = await call_openai_api_for_model( | |
input_text=request.input, | |
model=request.model, | |
""" | |
# Use default tools if none provided | |
tools = request.tools or DEFAULT_TOOLS | |
# Compute allow-list of allowed function names from tools | |
allowed_function_names = set() | |
for tool in tools: | |
if isinstance(tool, dict): | |
# OpenAI tool convention: type, function (or name for older versions) | |
if tool.get("type") == "function": | |
function_def = tool.get("function") | |
if isinstance(function_def, dict): | |
# By OpenAI spec: function_def should have "name" | |
fname = function_def.get("name") | |
if isinstance(fname, str): | |
allowed_function_names.add(fname) | |
# Sometimes the name could be at the top level for backwards compatibility | |
fname = tool.get("name") | |
if isinstance(fname, str): | |
allowed_function_names.add(fname) | |
# Call the API | |
response = await call_openai_api_for_model( | |
input_text=request.input, | |
model=request.model, |
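A small standalone sketch of the allow-list check the suggestion introduces; the tool definitions and the hostile call name are illustrative:

```python
tools = [
    {"type": "function", "function": {"name": "search"}},
    {"type": "function", "function": {"name": "cognify"}},
]

# Same extraction rule as the suggested fix: only named function tools are allowed.
allowed_function_names = {
    tool["function"]["name"]
    for tool in tools
    if tool.get("type") == "function" and isinstance(tool.get("function"), dict)
}

for call in ({"name": "search"}, {"name": "delete_everything"}):
    if call["name"] in allowed_function_names:
        print(f"dispatch {call['name']}")
    else:
        print(f"reject {call['name']}: not an allowed tool")
# dispatch search
# reject delete_everything: not an allowed tool
```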
function_name = item.get("name", "")
arguments_str = item.get("arguments", "{}")
call_id = item.get("call_id", f"call_{uuid.uuid4().hex}")

# Create a format the dispatcher can handle
tool_call = {
    "id": call_id,
    "function": {"name": function_name, "arguments": arguments_str},
    "type": "function",
}

# Dispatch the function
try:
    function_result = await dispatch_function(tool_call)
    output_status = "success"
except Exception as e:
    logger.exception(f"Error executing function {function_name}: {e}")
    function_result = f"Error executing {function_name}: {str(e)}"
    output_status = "error"

processed_call = ResponseToolCall(
    id=call_id,
    type="function",
    function=FunctionCall(name=function_name, arguments=arguments_str),
Unvalidated LLM Function Call Execution
Explanation of Fix

Vulnerability & Fix Summary:
The vulnerability here is LLM Function-Call Injection (ML09) and Improper Control of Dynamically Evaluated Code (CWE-94): after calling the LLM API, the backend blindly trusts items in the LLM's `output` list where `type == "function_call"`; specifically, the `function_name` and `arguments_str` are passed directly to `dispatch_function`, which can be exploited to perform unauthorized or arbitrary backend calls.

Patch Approach:

- Allow-list: Only permit function names that come from registered, pre-approved tools, i.e. tools that should have been passed into the LLM as available options (via the `tools` argument). We extract these names by examining `tools`: for each tool, if its `type` is `"function"` and it provides a valid function name, then it is permitted.
- Validation: When iterating through LLM outputs, we now check that each function call's `name` is in our allow-list (`allowed_function_names`). If not, we skip dispatching and instead insert an error in the output for that call, preventing execution of arbitrary or malicious functions.

Potential Impacts:

- Any function call generated by the LLM which is not present in the explicit `tools` list provided as input will be ignored/skipped and result in an error message in the response object.
- This is compatible with established OpenAI tool conventions and does not break backward compatibility unless consumers were previously relying on the ability to call arbitrary backend functions (which is highly dangerous and should not be allowed).
- No new dependencies are introduced. Error handling is preserved.
Issues
| Type | Identifier | Message | Severity | Link |
|---|---|---|---|---|
| Application | CWE-94, ML09 | LLM-generated function calls are executed blindly via dispatch_function with no allow-list or validation of function_name or arguments_str. A user can influence the model's response (or provide custom tool definitions earlier in the request) to trigger unintended backend functions or supply malicious arguments, leading to privilege escalation or arbitrary code execution. This is an instance of LLM Function-Call Injection (ML09) and general Improper Control of Dynamically Evaluated Code (CWE-94). | critical | Link |
Suggested Fix
Before:

function_name = item.get("name", "")
arguments_str = item.get("arguments", "{}")
call_id = item.get("call_id", f"call_{uuid.uuid4().hex}")
# Create a format the dispatcher can handle
tool_call = {
    "id": call_id,
    "function": {"name": function_name, "arguments": arguments_str},
    "type": "function",
}
# Dispatch the function
try:
    function_result = await dispatch_function(tool_call)
    output_status = "success"
except Exception as e:
    logger.exception(f"Error executing function {function_name}: {e}")
    function_result = f"Error executing {function_name}: {str(e)}"
    output_status = "error"
processed_call = ResponseToolCall(
    id=call_id,
    type="function",
    function=FunctionCall(name=function_name, arguments=arguments_str),

After:

function_name = item.get("name", "")
arguments_str = item.get("arguments", "{}")
call_id = item.get("call_id", f"call_{uuid.uuid4().hex}")
# Check if the function_name is in the allowlist
if function_name not in allowed_function_names:
    logger.warning(
        f"Function call '{function_name}' not in allowed tools. Skipping dispatch."
    )
    output_status = "error"
    function_result = (
        f"Function '{function_name}' is not an allowed tool and will not be executed."
    )
else:
    # Create a format the dispatcher can handle
    tool_call = {
        "id": call_id,
        "function": {"name": function_name, "arguments": arguments_str},
        "type": "function",
    }
    # Dispatch the function
    try:
        function_result = await dispatch_function(tool_call)
        output_status = "success"
    except Exception as e:
        logger.exception(f"Error executing function {function_name}: {e}")
        function_result = f"Error executing {function_name}: {str(e)}"
        output_status = "error"
processed_call = ResponseToolCall(
    id=call_id,
    type="function",
    function=FunctionCall(name=function_name, arguments=arguments_str),
        )

        return response_obj

    return router
Unvalidated LLM Function Call Execution
Explanation of Fix

Vulnerability & Fix Summary:
The vulnerability here is LLM Function-Call Injection (ML09) and Improper Control of Dynamically Evaluated Code (CWE-94): after calling the LLM API, the backend blindly trusts items in the LLM's `output` list where `type == "function_call"`; specifically, the `function_name` and `arguments_str` are passed directly to `dispatch_function`, which can be exploited to perform unauthorized or arbitrary backend calls.

Patch Approach:

- Allow-list: Only permit function names that come from registered, pre-approved tools, i.e. tools that should have been passed into the LLM as available options (via the `tools` argument). We extract these names by examining `tools`: for each tool, if its `type` is `"function"` and it provides a valid function name, then it is permitted.
- Validation: When iterating through LLM outputs, we now check that each function call's `name` is in our allow-list (`allowed_function_names`). If not, we skip dispatching and instead insert an error in the output for that call, preventing execution of arbitrary or malicious functions.

Potential Impacts:

- Any function call generated by the LLM which is not present in the explicit `tools` list provided as input will be ignored/skipped and result in an error message in the response object.
- This is compatible with established OpenAI tool conventions and does not break backward compatibility unless consumers were previously relying on the ability to call arbitrary backend functions (which is highly dangerous and should not be allowed).
- No new dependencies are introduced. Error handling is preserved.
Issues
| Type | Identifier | Message | Severity | Link |
|---|---|---|---|---|
| Application | CWE-94, ML09 | LLM-generated function calls are executed blindly via dispatch_function with no allow-list or validation of function_name or arguments_str. A user can influence the model's response (or provide custom tool definitions earlier in the request) to trigger unintended backend functions or supply malicious arguments, leading to privilege escalation or arbitrary code execution. This is an instance of LLM Function-Call Injection (ML09) and general Improper Control of Dynamically Evaluated Code (CWE-94). | critical | Link |
Suggested Fix
Before:

        )
        return response_obj
    return router

After:

        )
        return response_obj
    return router
import asyncio
from typing import Generic, List, Optional, TypeVar, Union, get_args, get_origin, get_type_hints

import lancedb
from lancedb.pydantic import LanceModel, Vector
from pydantic import BaseModel
from lancedb.pydantic import LanceModel, Vector
from typing import Generic, List, Optional, TypeVar, Union, get_args, get_origin, get_type_hints
Unrestricted Path Traversal in Database Adapter Cleanup Operation
Explanation of Fix

Vulnerability Summary:
The vulnerability occurs because `self.url` (user/configuration supplied) is directly used in the `prune()` function to call `LocalStorage.remove_all(self.url)` if it starts with a "/". This allows malicious input such as "/" or "/etc" to trigger recursive deletion of critical filesystem locations, leading to catastrophic data loss.

Fix:
To mitigate this, the patch introduces strict validation of the `self.url` path in the `prune()` method before calling `LocalStorage.remove_all(self.url)`. Specifically, we ensure the path is:

- Not empty
- Not equal to "/" (the root directory) or any other extremely common dangerous paths (such as "/etc", "/bin", "/usr", "/root", etc.)
- Not a parent directory reference (".." anywhere in the path)
- Not a symlink (resolving a symlink could otherwise get around this defense)
- And, ideally, located within a known safe prefix (optional, but commented in the code to allow for future enforcement)

If the path fails any check, a RuntimeError is raised and the removal does not proceed.

We add an import of `os` at the top, which is acceptable given the standard-library constraint. If strict parent-directory protection is needed for your application, you may add further whitelisting.

No changes are made to other methods or logic.

Potential impact:
This patch will raise an error if `prune()` is called and the configured storage path is unsafe or suspicious, protecting system integrity. If previously valid installations relied on deletion of such critical system paths (highly unlikely and improper), their behavior will change; this is intentional. Normal user data directories (such as `/app/data` or `/var/app/lancedb_data`) are not affected and will continue to work.

Notable for reviewers: The validation routine is conservative but not foolproof against all possible clever bypasses. Adjust the safe prefix as needed for your deployment.
Issues
| Type | Identifier | Message | Severity | Link |
|---|---|---|---|---|
| Application | CWE-22 | self.url is accepted directly from external configuration and, if it begins with /, is passed to LocalStorage.remove_all without validation. Supplying a path such as "/" or "/etc" could trigger recursive deletion of critical filesystem directories (depending on runtime privileges), resulting in denial-of-service or full data loss. This is a classic path-traversal / arbitrary file operation issue. | critical | Link |
Suggested Fix
Before:

import asyncio
from typing import Generic, List, Optional, TypeVar, Union, get_args, get_origin, get_type_hints
import lancedb
from lancedb.pydantic import LanceModel, Vector
from pydantic import BaseModel
from lancedb.pydantic import LanceModel, Vector
from typing import Generic, List, Optional, TypeVar, Union, get_args, get_origin, get_type_hints

After:

import asyncio
import os
import lancedb
from pydantic import BaseModel
from lancedb.pydantic import LanceModel, Vector
from typing import Generic, List, Optional, TypeVar, Union, get_args, get_origin, get_type_hints
Unrestricted Path Traversal in Database Adapter Cleanup Operation

@@ -1,5 +1,6 @@
import asyncio
+import os
import lancedb
from pydantic import BaseModel
from lancedb.pydantic import LanceModel, Vector
from typing import Generic, List, Optional, TypeVar, Union, get_args, get_origin, get_type_hints
@@ -236,9 +237,26 @@
collection = await self.get_collection(collection_name)
await collection.delete("id IS NOT NULL")
await connection.drop_table(collection_name)
- if self.url.startswith("/"):
+ # Securely validate the path before *any* dangerous operation.
+ if self.url and self.url.startswith("/"):
+ # Reject obviously dangerous paths
+ dangerous_paths = [
+ "/", # root
+ "/bin", "/boot", "/dev", "/etc", "/lib", "/lib64", "/proc", "/root", "/run", "/sbin", "/sys", "/usr", "/var"
+ ]
+ normpath = os.path.normpath(os.path.abspath(self.url))
+ # Prevent deletion of dangerous system directories, directory traversal, and symlinks
+ if (
+ normpath in dangerous_paths or
+ normpath == "" or
+ ".." in normpath.split(os.sep) or
+ os.path.islink(normpath)
+ # Optionally ensure the path is under a known safe prefix, e.g. "/app/data/"
+ # not normpath.startswith("/app/data/")
+ ):
+ raise RuntimeError("Refusing to recursively remove unsafe or suspicious path: '%s'" % self.url)
LocalStorage.remove_all(self.url)
def get_data_point_schema(self, model_type: BaseModel):
related_models_fields = []
@@ -274,5 +292,5 @@
include_fields={
"id": (str, ...),
},
exclude_fields=["metadata"] + related_models_fields,
- )
+ )
\ No newline at end of file
Explanation of Fix

Vulnerability Summary: `self.url` comes from external configuration and, when it starts with "/", is passed straight to `LocalStorage.remove_all()` in `prune()`, so a value such as "/" or "/etc" could recursively delete critical directories.

Fix: the path is validated before removal: it must not be empty, must not be the root or another well-known system directory, must not contain a parent-directory reference, and must not be a symlink. If the path fails any check, a RuntimeError is raised and the removal does not proceed. We add an import of `os` at the top. No changes are made to other methods or logic.

Potential impact: unsafe or suspicious storage paths now raise an error instead of being removed; normal data directories keep working.

Notable for reviewers: The validation routine is conservative but not foolproof against all possible clever bypasses. Adjust the safe prefix as needed for your deployment.
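A standalone sketch of the path check from the diff; the helper name is ours and the candidate paths are examples:

```python
import os

DANGEROUS_PATHS = {
    "/", "/bin", "/boot", "/dev", "/etc", "/lib", "/lib64",
    "/proc", "/root", "/run", "/sbin", "/sys", "/usr", "/var",
}

def is_safe_to_remove(url: str) -> bool:
    normpath = os.path.normpath(os.path.abspath(url))
    return not (
        normpath in DANGEROUS_PATHS
        or normpath == ""
        or ".." in normpath.split(os.sep)
        or os.path.islink(normpath)
    )

for candidate in ("/", "/etc", "/var/app/lancedb_data", "/app/data/../../etc"):
    print(candidate, "->", "allowed" if is_safe_to_remove(candidate) else "rejected")
# "/" and "/etc" are rejected, "/var/app/lancedb_data" is allowed,
# and the traversal example normalizes to "/etc" and is rejected too.
```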
SQL Injection in Dynamic Table Creation Through Unsanitized Schema and Table Parameters

@@ -17,9 +17,43 @@
logger = get_logger()
+# Allow-list of basic, safe SQL types (expand as desired for your application)
+_SQL_TYPE_WHITELIST = {
+ # Integers
+ "integer", "int", "bigint", "smallint", "tinyint", "serial", "bigserial",
+ # Floating point
+ "float", "double", "real", "decimal", "numeric",
+ # Text types
+ "varchar", "character varying", "char", "character", "text", "string",
+ # Boolean
+ "boolean", "bool",
+ # Date/time
+ "date", "time", "timestamp", "timestamptz", "datetime",
+ # UUID
+ "uuid",
+ # JSON
+ "json", "jsonb",
+}
+
+def _is_safe_sql_identifier(identifier: str) -> bool:
+ return isinstance(identifier, str) and identifier.isidentifier()
+
+
+def _is_safe_sql_type(typ: str) -> bool:
+ """
+ Very basic: Accepts only known safe types (case-insensitive, ignores length/precision).
+ e.g. "varchar(255)" is parsed as "varchar".
+ """
+ # Extract base type, e.g. "varchar(255)" -> "varchar"
+ if not isinstance(typ, str):
+ return False
+ base_type = typ.split("(", 1)[0].strip().lower()
+ return base_type in _SQL_TYPE_WHITELIST
+
+
class SQLAlchemyAdapter:
def __init__(self, connection_string: str):
self.db_path: str = None
self.db_uri: str = connection_string
@@ -55,17 +89,40 @@
datasets = result.unique().scalars().all()
return datasets
async def create_table(self, schema_name: str, table_name: str, table_config: list[dict]):
- fields_query_parts = [f"{item['name']} {item['type']}" for item in table_config]
+ # --- PATCH START: strong identifier/type validation ---
+ # Validate schema_name
+ if not _is_safe_sql_identifier(schema_name):
+ raise ValueError(f"Invalid schema name: {schema_name!r}")
+
+ # Validate table_name
+ if not _is_safe_sql_identifier(table_name):
+ raise ValueError(f"Invalid table name: {table_name!r}")
+
+ # Validate and build field definition string securely
+ fields_query_parts = []
+ for item in table_config:
+ colname = item.get("name")
+ coltype = item.get("type")
+ if not _is_safe_sql_identifier(colname):
+ raise ValueError(f"Invalid column name: {colname!r}")
+ if not _is_safe_sql_type(coltype):
+ raise ValueError(f"Invalid or unsafe SQL type: {coltype!r}")
+ # Safe to quote colname and pass type as-is (since checked)
+ fields_query_parts.append(f'"{colname}" {coltype}')
+
async with self.engine.begin() as connection:
- await connection.execute(text(f"CREATE SCHEMA IF NOT EXISTS {schema_name};"))
await connection.execute(
+ text(f"CREATE SCHEMA IF NOT EXISTS {schema_name};")
+ )
+ await connection.execute(
text(
f'CREATE TABLE IF NOT EXISTS {schema_name}."{table_name}" ({", ".join(fields_query_parts)});'
)
)
await connection.close()
+ # --- PATCH END ---
async def delete_table(self, table_name: str, schema_name: Optional[str] = "public"):
async with self.engine.begin() as connection:
if self.engine.dialect.name == "sqlite":
@@ -434,5 +491,5 @@
logger.warning(
f"Missing value in foreign key information. \nColumn value: {col}\nReference column value: {ref_col}\n"
)
- return schema
+ return schema
\ No newline at end of file
Explanation of Fix

Explanation of the Vulnerability and the Fix: the original `create_table` interpolated schema, table, and column names plus column types directly into the CREATE SCHEMA / CREATE TABLE statements, so attacker-controlled identifiers could inject arbitrary SQL.

What is addressed in the fix: schema, table, and column names must now be valid identifiers, and column types must come from an allow-list of basic SQL types; validated column names are double-quoted before the statement is built.

The rest of the codebase is unaffected. Only the vulnerable code and any direct usages that could be impacted by stricter validation are addressed. User code passing invalid names will now raise an explicit `ValueError`.
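To make the validation rules concrete, a condensed sketch of the two helpers with example inputs (the whitelist here is abridged):

```python
_SQL_TYPE_WHITELIST = {"integer", "varchar", "text", "boolean", "uuid", "jsonb"}  # abridged

def _is_safe_sql_identifier(identifier) -> bool:
    return isinstance(identifier, str) and identifier.isidentifier()

def _is_safe_sql_type(typ) -> bool:
    # "varchar(255)" is reduced to its base type "varchar" before the lookup.
    return isinstance(typ, str) and typ.split("(", 1)[0].strip().lower() in _SQL_TYPE_WHITELIST

print(_is_safe_sql_identifier("user_events"))            # True
print(_is_safe_sql_identifier('data"; DROP TABLE x;'))   # False: not a plain identifier
print(_is_safe_sql_type("VARCHAR(255)"))                 # True: base type is whitelisted
print(_is_safe_sql_type("text; DROP TABLE users"))       # False: not a known type
```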
SQL Injection in Table Drop Operations via Unsanitized Identifiers

@@ -66,16 +66,22 @@
)
await connection.close()
async def delete_table(self, table_name: str, schema_name: Optional[str] = "public"):
+ # Validate table_name and schema_name to prevent SQL injection
+ if not table_name or not table_name.isidentifier():
+ raise ValueError("Invalid table name")
+ if schema_name and not schema_name.isidentifier():
+ raise ValueError("Invalid schema name")
async with self.engine.begin() as connection:
if self.engine.dialect.name == "sqlite":
# SQLite doesn't support schema namespaces and the CASCADE keyword.
# However, foreign key constraint can be defined with ON DELETE CASCADE during table creation.
await connection.execute(text(f'DROP TABLE IF EXISTS "{table_name}";'))
else:
+ # For PostgreSQL and similar, quote identifiers safely
await connection.execute(
- text(f'DROP TABLE IF EXISTS {schema_name}."{table_name}" CASCADE;')
+ text(f'DROP TABLE IF EXISTS "{schema_name}"."{table_name}" CASCADE;')
)
async def insert_data(
self,
@@ -434,5 +440,5 @@
logger.warning(
f"Missing value in foreign key information. \nColumn value: {col}\nReference column value: {ref_col}\n"
)
- return schema
+ return schema
\ No newline at end of file
Explanation of Fix

Explanation of the Vulnerability and the Fix: the original implementation of `delete_table` interpolated `table_name` and `schema_name` into the DROP TABLE statements without validation, allowing SQL injection through crafted identifiers.

What Has Been Changed: To fix this, the following changes were made: both identifiers are rejected unless they are valid Python identifiers, and on non-SQLite backends the schema name is now double-quoted alongside the table name.

Possible Impacts: No other parts of the codebase are directly affected, as only input validation and safe quoting were introduced within the function.
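A short illustration of the guard, using a hypothetical helper that only builds the statement:

```python
def build_drop_statement(table_name: str, schema_name: str = "public") -> str:
    if not table_name or not table_name.isidentifier():
        raise ValueError("Invalid table name")
    if schema_name and not schema_name.isidentifier():
        raise ValueError("Invalid schema name")
    return f'DROP TABLE IF EXISTS "{schema_name}"."{table_name}" CASCADE;'

print(build_drop_statement("documents"))
# DROP TABLE IF EXISTS "public"."documents" CASCADE;

try:
    build_drop_statement('x"; DROP SCHEMA public; --')
except ValueError as err:
    print(err)  # Invalid table name
```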
Unvalidated SQL Query Execution Allowing Direct SQL Injection

@@ -289,11 +289,47 @@
# Fetch all rows as a list of dictionaries
rows = result.mappings().all()
return rows
- async def execute_query(self, query):
+ async def execute_query(self, query, params=None):
+ """
+ Executes a parameterized, read-only SELECT SQL query in a safe way.
+
+ Args:
+ query (str): Must be a SELECT query with optional parameter placeholders (:param).
+ params (dict, optional): Parameters to bind to the query. Defaults to None.
+
+ Returns:
+ list[dict]: Query result rows.
+
+ Raises:
+ ValueError: If query is not a single SELECT statement, or appears unsafe/unsupported.
+ """
+ if params is None:
+ params = {}
+
+ # Remove leading/trailing whitespace.
+ stripped_query = query.strip()
+ # Forbid empty queries
+ if not stripped_query:
+ raise ValueError("Empty query not allowed.")
+
+ # Make sure it's a single SELECT statement
+ # Basic checks: startswith SELECT (SQL is case-insensitive)
+ # Only allow select statements. Doesn't allow CTE, multiple statements, DML, DDL, etc.
+ lowered = stripped_query.lower()
+ if not lowered.startswith("select"):
+ raise ValueError("Only SELECT statements are allowed.")
+
+ # Forbid any extra statements by checking for semicolons not at the end
+ # Allow semicolon only if it's the very last character after stripping, or not present at all.
+ if ";" in stripped_query[:-1]:
+ raise ValueError("Multiple or stacked SQL statements are not allowed.")
+
+ # (Optional) Could add more checks, like forbidding union/all, etc.
+
async with self.engine.begin() as connection:
- result = await connection.execute(text(query))
+ result = await connection.execute(text(stripped_query), params)
return [dict(row) for row in result]
async def drop_tables(self):
async with self.engine.begin() as connection:
@@ -434,5 +470,5 @@
logger.warning(
f"Missing value in foreign key information. \nColumn value: {col}\nReference column value: {ref_col}\n"
)
- return schema
+ return schema
\ No newline at end of file
Explanation of Fix

Vulnerability Summary: `execute_query` executed whatever SQL string it was given, so any caller-controlled input reached the database verbatim.

Fix Summary: the method now accepts bind parameters, requires a single SELECT statement (empty queries, non-SELECT statements, and stacked statements separated by semicolons are rejected with a ValueError), and passes the parameters to `connection.execute()` instead of relying on values interpolated into the query text.

Minimal Impact / Potential Impacts: existing read-only callers keep working; callers that relied on running DML/DDL or multiple statements through this helper will now get a ValueError.
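A hedged usage sketch of the hardened method; the table, column, and variable names are assumptions:

```python
async def list_datasets_for(adapter, owner_id: str):
    # Parameterized SELECT: the value is bound by the driver, never interpolated.
    return await adapter.execute_query(
        "SELECT id, name FROM datasets WHERE owner_id = :owner_id",
        params={"owner_id": owner_id},
    )

# Both of these are now rejected with ValueError before reaching the database:
#   await adapter.execute_query("DROP TABLE datasets")
#   await adapter.execute_query("SELECT 1; DROP TABLE datasets")
```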
Description
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.