Skip to content

fix: Add getting started tutorial to git #870

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: dev
Choose a base branch
from
Open

Conversation

Vasilije1990
Copy link
Contributor

Description

DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.

Copy link

Please make sure all the checkboxes are checked:

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have added end-to-end and unit tests (if applicable).
  • I have updated the documentation and README.md file (if necessary).
  • I have removed unnecessary code and debug statements.
  • PR title is clear and follows the convention.
  • I have tagged reviewers or team members for feedback.

Copy link
Contributor

coderabbitai bot commented May 24, 2025

Walkthrough

This update introduces major enhancements and refactoring across the Cognee codebase. Notable changes include the addition of new graph retrievers with chain-of-thought and context extension capabilities, expanded support for Memgraph as a graph database, improved pipeline and task execution with context propagation, and a new OpenAI-compatible responses API. Numerous bug fixes, error handling improvements, and new example scripts and tests are also included.

Changes

File(s) / Path(s) Change Summary
API & Pipeline Enhancements
cognee/api/v1/responses/*, cognee/api/v1/responses/routers/*, cognee/api/v1/responses/models.py Introduced a new OpenAI-compatible responses API with structured tool schemas, function dispatching, and tool call models. Added default tool definitions and response models.
cognee/modules/pipelines/operations/*, cognee/modules/data/methods/*, cognee/modules/pipelines/models/* Refactored pipeline execution: improved dataset resolution, added explicit pipeline run status refresh, introduced context propagation in task execution, and added new pipeline run status enum.
cognee/modules/observability/* Introduced a new Observer enum and a utility to get the correct observability decorator. Removed the old MonitoringTool enum.
Graph & Retrieval Improvements
cognee/modules/retrieval/graph_completion_cot_retriever.py, cognee/modules/retrieval/graph_completion_context_extension_retriever.py Added new retriever classes for graph completion with chain-of-thought and context extension.
cognee/modules/search/methods/search.py, cognee/modules/search/types/SearchType.py Integrated new retriever types into the search dispatch mechanism and search type enums.
cognee/modules/retrieval/graph_completion_retriever.py, cognee/modules/retrieval/utils/brute_force_triplet_search.py Improved error handling for missing entities and collections in graph and brute-force retrievers.
Database & Adapter Updates
cognee/infrastructure/databases/graph/get_graph_engine.py, cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py Added support for Memgraph as a graph database, including a new async adapter with comprehensive graph operations and metrics.
cognee/infrastructure/databases/vector/* Refactored collection retrieval and error handling for ChromaDB, LanceDB, Milvus, PGVector, Qdrant, and Weaviate adapters. Standardized search defaults and improved missing collection error reporting.
cognee/infrastructure/databases/graph/networkx/adapter.py Refined type annotations for node IDs to use UUIDs consistently.
LLM & Prompting
cognee/infrastructure/llm/* Standardized prompt handling, switched to async clients where appropriate, and updated prompt templates for chain-of-thought validation and follow-up.
Frontend & UI
cognee-frontend/src/app/*, cognee-frontend/src/modules/*, cognee-frontend/src/ui/* Updated dataset prop types from {id: string} to {name: string} across components, added "Cognify" button and callback, improved input clearing and error handling in search, and updated API base URL to localhost.
Tests & Examples
cognee/tests/*, examples/database_examples/*, cognee-starter-kit/* Added new integration and unit tests for Memgraph, new retriever types, and starter pipelines. Introduced database backend example scripts for ChromaDB, FalkorDB, Kuzu, Milvus, Neo4j, PGVector, Qdrant, and Weaviate.
Docs, Config, and Misc
README.md, cognee-starter-kit/README.md, Dockerfile, Dockerfile_modal, entrypoint.sh, .gitignore, .env.template Updated documentation, improved Docker builds (multi-stage, uv-based), revised entrypoint script, and added comprehensive .gitignore and environment templates.
Removals
profiling/*, notebooks/hr_demo.ipynb, notebooks/github_graph_visualization.html, assets/graph_visualization.html, .data/code/example.txt, cognee/tests/unit/modules/retrieval/utils/brute_force_triplet_search_test.py, cognee/tests/integration/run_toy_tasks/conftest.py Removed obsolete profiling scripts, test fixtures, and demo notebooks/HTML files.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant API
    participant Dispatcher
    participant ToolFunction

    Client->>API: POST /api/v1/responses (with model/input/tools/tool_choice)
    API->>API: Call OpenAI-compatible endpoint
    API->>Dispatcher: For each function call in response, dispatch_function(tool_call)
    Dispatcher->>ToolFunction: handle_search / handle_cognify / handle_prune (async)
    ToolFunction-->>Dispatcher: Return result or error
    Dispatcher-->>API: Aggregate tool call outputs
    API-->>Client: Return structured ResponseBody with tool call results and usage
Loading
sequenceDiagram
    participant Pipeline
    participant Task
    participant User
    participant Context

    Pipeline->>Task: run_tasks(tasks, user, pipeline_name, context)
    Task->>Task: handle_task(..., context)
    Task->>Task: If executable accepts context, pass context
    Task->>Task: Possibly recurse for sub-tasks with context
    Task-->>Pipeline: Return result
Loading

Possibly related PRs

  • topoteretes/cognee#501: Refactors pipeline run status logging and dataset ID handling, directly related to the new status and dataset resolution logic in this PR.
  • topoteretes/cognee#788: Adds a context parameter to task execution functions, matching the context propagation changes here.
  • topoteretes/cognee#760: Simplifies schema iteration in database deletion, which is also updated in this PR.

Poem

In Cognee's warren, the code did grow,
New retrievers hop with chain-of-thought flow.
Memgraph joins the garden, adapters bloom anew,
Pipelines pass their context, as rabbits love to do.
Tests and docs are polished, old scripts swept away—
A springtime leap for knowledge graphs,
Hip-hip-hooray!

((\
( -.-)
o_(")(")

Note

⚡️ AI Code Reviews for VS Code, Cursor, Windsurf

CodeRabbit now has a plugin for VS Code, Cursor and Windsurf. This brings AI code reviews directly in the code editor. Each commit is reviewed immediately, finding bugs before the PR is raised. Seamless context handoff to your AI code agent ensures that you can easily incorporate review feedback.
Learn more here.

✨ Finishing Touches
  • 📝 Generate Docstrings

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

gitguardian bot commented May 24, 2025

️✅ There are no secrets present in this pull request anymore.

If these secrets were true positive and are still valid, we highly recommend you to revoke them.
While these secrets were previously flagged, we no longer have a reference to the
specific commits where they were detected. Once a secret has been leaked into a git
repository, you should consider it compromised, even if it was deleted immediately.
Find here more information about risks.


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 48

🔭 Outside diff range comments (3)
cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (1)

151-156: 💡 Verification agent

🧩 Analysis chain

with_vector and normalized parameters are ignored

The search signature exposes with_vector and normalized, yet the body never references them.
This makes the public API misleading and forces callers to pay for functionality that is never delivered.

-        limit: int = 15,
-        with_vector: bool = False,
-        normalized: bool = True,
+        limit: int = 15,
+        *,
+        with_vector: bool = False,      # ← either honour it …
+        normalized: bool = True,        # ← … or remove from the signature.

Follow-up actions:

  1. Either implement the feature (include vectors in the DataFrame / optionally skip normalize_distances) or
  2. Delete the unused parameters and update call-sites.

Let’s verify whether with_vector and normalized are actually referenced in LanceDBAdapter.py:


🏁 Script executed:

#!/bin/bash
# Search for any usage of the parameters within the file
grep -R "with_vector" -n cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py || echo "No with_vector usages"
grep -R "normalized" -n cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py || echo "No normalized usages"

Length of output: 469


🏁 Script executed:

#!/bin/bash
# Display the portion of the LanceDBAdapter.py around the search function to inspect parameter usage
sed -n '130,240p' cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py

Length of output: 3855


Remove or Implement Unused Parameters in LanceDBAdapter.search

The search signature exposes with_vector and normalized, but neither is ever referenced in the method body:

  • with_vector isn’t used to include vectors in the returned results.
  • normalized isn’t used to skip or control the normalize_distances call.

Locations to fix:

  • File: cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py
  • Method: async def search(...), lines ~151–156

Suggested diff (make keyword-only or remove):

-    async def search(
-        self,
-        collection_name: str,
-        query_text: str = None,
-        query_vector: List[float] = None,
-        limit: int = 15,
-        with_vector: bool = False,
-        normalized: bool = True,
-    ):
+    async def search(
+        self,
+        collection_name: str,
+        query_text: str = None,
+        query_vector: List[float] = None,
+        limit: int = 15,
+        *,
+        with_vector: bool = False,   # ← honor or remove
+        normalized: bool = True,      # ← honor or remove
+    ):

Follow-up actions:

  1. To implement:
    • Return vectors in each ScoredResult when with_vector=True.
    • Skip or apply normalize_distances based on normalized=False.
  2. Otherwise, remove these parameters (and update all call-sites, including batch_search).
cognee/infrastructure/databases/vector/milvus/MilvusAdapter.py (1)

151-160: ⚠️ Potential issue

Retrieve leaks client handle and risks type mismatch

  1. client is never closed – leaking sockets/threads.
  2. UUID objects are implicitly coerced inside an f-string; Milvus expects plain strings.
client = self.get_milvus_client()
try:
-    filter_expression = f"""id in [{", ".join(f'"{id}"' for id in data_point_ids)}]"""
+    filter_expression = (
+        "id in [" + ", ".join(f'"{str(dp_id)}"' for dp_id in data_point_ids) + "]"
+    )
     results = client.query(
         collection_name=collection_name,
         expr=filter_expression,
         output_fields=["*"],
     )
     return results
 finally:
     client.close()
cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (1)

85-99: 🛠️ Refactor suggestion

convert_to_weaviate_data_points is O(n²)`

Using data_points.index(data_point) within a list-comprehension forces a linear scan for every element.
Replace with an enumerate for O(n):

-data_vectors = await self.embed_data([...])
-
-def convert_to_weaviate_data_points(data_point: DataPoint):
-    vector = data_vectors[data_points.index(data_point)]
+data_vectors = await self.embed_data([...])
+
+def convert_to_weaviate_data_points(idx_dp: tuple[int, DataPoint]):
+    idx, data_point = idx_dp
+    vector = data_vectors[idx]
     ...
-
-data_points = [convert_to_weaviate_data_points(data_point) for data_point in data_points]
+data_points = [
+    convert_to_weaviate_data_points(t) for t in enumerate(data_points)
+]

This matters for large batch inserts.

🧹 Nitpick comments (62)
cognee/modules/retrieval/exceptions/__init__.py (1)

7-7: Explicitly define the public API via __all__
To signal that SearchTypeNotSupported and CypherSearchError are intended exports (and silence the F401 lint warnings), add an __all__ declaration.

 from .exceptions import SearchTypeNotSupported, CypherSearchError
+
+__all__ = ["SearchTypeNotSupported", "CypherSearchError"]
🧰 Tools
🪛 Ruff (0.11.9)

7-7: .exceptions.SearchTypeNotSupported imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)


7-7: .exceptions.CypherSearchError imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)

examples/python/multimedia_example.py (1)

23-29: Use pathlib for path construction: While os.path.join works, switching to pathlib.Path will make the code more readable and robust across platforms. For example:

from pathlib import Path

project_root = Path(__file__).parents[2]
mp3_file_path = project_root / "examples" / "data" / "multimedia" / "text_to_speech.mp3"
png_file_path = project_root / "examples" / "data" / "multimedia" / "example.png"
notebooks/cognee_multimedia_demo.ipynb (2)

25-27: Automate notebook cleanup: To ensure consistency, consider adding nbstripout or a pre-commit hook that automatically strips outputs and execution counts on commit.


34-38: Improve path resolution in notebook: Relying on os.path.abspath("") and "../" can break depending on the working directory. A more reliable approach is to use pathlib.Path and derive the project root explicitly. For instance:

from pathlib import Path

notebook_dir = Path().resolve().parent
project_root = notebook_dir.parent
mp3_file_path = project_root / "examples" / "data" / "multimedia" / "text_to_speech.mp3"
png_file_path = project_root / "examples" / "data" / "multimedia" / "example.png"

Also applies to: 40-43

cognee/modules/retrieval/utils/brute_force_triplet_search.py (1)

66-73: Consider using contextlib.suppress for cleaner exception handling.

The error handling logic is correct - gracefully continuing with an empty memory fragment when entities are not found is appropriate. However, the static analysis suggestion to use contextlib.suppress would be more pythonic.

+from contextlib import suppress
+
-    try:
-        await memory_fragment.project_graph_from_db(
-            graph_engine,
-            node_properties_to_project=properties_to_project,
-            edge_properties_to_project=["relationship_name"],
-        )
-    except EntityNotFoundError:
-        pass
+    with suppress(EntityNotFoundError):
+        await memory_fragment.project_graph_from_db(
+            graph_engine,
+            node_properties_to_project=properties_to_project,
+            edge_properties_to_project=["relationship_name"],
+        )
🧰 Tools
🪛 Ruff (0.11.9)

66-73: Use contextlib.suppress(EntityNotFoundError) instead of try-except-pass

Replace with contextlib.suppress(EntityNotFoundError)

(SIM105)

cognee-starter-kit/.gitignore (1)

1-197: Comprehensive .gitignore with minor duplication issue.

This is a well-structured and comprehensive .gitignore file that covers Python, Node.js, various frameworks, and cognee-specific patterns. However, there's a redundant entry for .env (appears on both line 2 and line 140).

Remove the duplicate .env entry:

 .data
-.env
 .local.env
 .prod.env
 cognee/.data/

The comprehensive coverage of development tools, frameworks, and cognee-specific directories is excellent for maintaining a clean repository.

cognee/tests/unit/modules/pipelines/run_tasks_test.py (1)

53-55: Evaluate necessity of if __name__ == "__main__" guard in test
The entry‐point guard allows running this test script directly but isn’t needed for standard unittest or pytest discovery. Consider removing it to keep tests consistent and avoid divergence from test framework conventions.

cognee-frontend/src/utils/fetch.ts (1)

4-4: Externalize API base URL for flexibility
Hardcoding http://localhost:8000/api may lead to environment-specific issues. Consider moving this into a configuration or environment variable (e.g. process.env.API_BASE_URL) to centralize the endpoint and enable easier testing and deployment.

cognee/modules/pipelines/operations/__init__.py (1)

1-1: Consider adding __all__ to clarify the public API.

The static analysis tool correctly identifies that log_pipeline_run_initiated is imported but not explicitly exported. To maintain consistency and clarity about the module's public interface, consider adding an __all__ declaration:

+__all__ = [
+    "log_pipeline_run_initiated",
+    "log_pipeline_run_start", 
+    "log_pipeline_run_complete",
+    "log_pipeline_run_error",
+    "cognee_pipeline"
+]

This makes the intent to expose these functions at the package level explicit.

🧰 Tools
🪛 Ruff (0.11.9)

1-1: .log_pipeline_run_initiated.log_pipeline_run_initiated imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)

cognee/modules/engine/models/__init__.py (1)

6-6: Consider adding __all__ to clarify the public API.

The static analysis tool correctly identifies that ColumnValue is imported but not explicitly exported. To maintain consistency and clarity about the module's public interface, consider adding an __all__ declaration:

+__all__ = [
+    "Entity",
+    "EntityType", 
+    "TableRow",
+    "TableType",
+    "NodeSet",
+    "ColumnValue"
+]

This makes the intent to expose these model classes at the package level explicit and follows Python best practices for module organization.

🧰 Tools
🪛 Ruff (0.11.9)

6-6: .ColumnValue.ColumnValue imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)

cognee/tests/test_neo4j.py (1)

77-85: Consider conditional test skipping instead of commenting out.

While the reasoning for disabling the test is valid, commenting out tests reduces coverage. Consider using conditional skipping based on LLM model capability or configuration.

Instead of commenting out, consider using pytest skip:

-    # NOTE: Due to the test failing often on weak LLM models we've removed this test for now
-    # search_results = await cognee.search(
-    #     query_type=SearchType.NATURAL_LANGUAGE,
-    #     query_text=f"Find nodes connected to node with name {random_node_name}",
-    # )
-    # assert len(search_results) != 0, "Query related natural language don't exist."
-    # print("\nExtracted results are:\n")
-    # for result in search_results:
-    #     print(f"{result}\n")
+    # Skip natural language search test for weak LLM models
+    try:
+        search_results = await cognee.search(
+            query_type=SearchType.NATURAL_LANGUAGE,
+            query_text=f"Find nodes connected to node with name {random_node_name}",
+        )
+        assert len(search_results) != 0, "Query related natural language don't exist."
+        print("\nExtracted results are:\n")
+        for result in search_results:
+            print(f"{result}\n")
+        expected_history_count = 8
+    except Exception as e:
+        logger.warning(f"Natural language search test skipped due to: {e}")
+        expected_history_count = 6
cognee/infrastructure/llm/prompts/cot_followup_user_prompt.txt (1)

5-5: Fix template syntax inconsistency.

There's a missing space in the template placeholder syntax.

Apply this fix for consistency:

-`{{ query}}`
+`{{ query }}`
cognee/eval_framework/metrics_dashboard.py (1)

131-170: Consider improving function documentation.

The function now has dual responsibilities: writing to file AND returning HTML content. The docstring should be updated to reflect this behavior change.

Update the docstring to clarify the return value:

-    """Create and save the dashboard with all visualizations."""
+    """Create and save the dashboard with all visualizations.
+    
+    Returns:
+        str: The generated dashboard HTML content.
+    """
cognee/infrastructure/llm/prompts/cot_validation_user_prompt.txt (1)

2-2: Fix template syntax inconsistency.

There's a missing space in the template placeholder syntax, similar to the issue in the follow-up prompt template.

Apply this fix for consistency:

-`{{ query}}`
+`{{ query }}`
cognee-starter-kit/.env.template (1)

1-19: LGTM! Well-structured environment template.

This .env.template file provides clear guidance for users setting up the cognee starter kit. The comments effectively explain which variables are required for different providers, and the structure logically groups related configuration options.

Consider adding a comment explaining the model name format (e.g., "openai/gpt-4o-mini") to help users understand the expected syntax:

 # In case you choose to use OpenAI provider, just adjust the model and api_key.
 LLM_API_KEY=""
+# Format: provider/model-name (e.g., "openai/gpt-4o-mini")
 LLM_MODEL="openai/gpt-4o-mini"
notebooks/cognee_demo.ipynb (1)

532-536: Consider simplifying redundant user retrieval.

The current pattern retrieves the default user and then immediately fetches the same user by ID. This appears redundant unless there's a specific reason for the additional lookup.

Consider simplifying to:

 from cognee.modules.users.methods import get_default_user
 from cognee.modules.data.methods import get_datasets_by_name
-from cognee.modules.users.methods import get_user

-default_user = await get_default_user()
-
-user = await get_user(default_user.id)
+user = await get_default_user()

Unless the additional get_user call serves a specific purpose (like ensuring fresh data or different object properties), the direct use of get_default_user() should suffice.

cognee-starter-kit/src/pipelines/default.py (3)

5-5: Remove or uncomment the unused import.

The commented import from cognee.shared.utils import render_graph is not used in the current code. Consider removing it entirely or uncommenting if it will be used in the future.

-# from cognee.shared.utils import render_graph

36-38: Consider adding configuration guidance for Graphistry integration.

The commented Graphistry integration code provides a valuable alternative visualization option. Consider adding a comment explaining how users can enable this feature or pointing to documentation.

-    # # Get a graphistry url (Register for a free account at https://www.graphistry.com)
-    # url = await render_graph()
-    # print(f"Graphistry URL: {url}")
+    # Alternative: Get a Graphistry URL for advanced graph visualization
+    # (Register for a free account at https://www.graphistry.com and uncomment the render_graph import)
+    # url = await render_graph()
+    # print(f"Graphistry URL: {url}")

69-69: Remove the extra blank line.

There's an unnecessary extra blank line before the main execution block.

-

-
+
 if __name__ == "__main__":
cognee/version.py (2)

10-18: Apply the static analysis suggestion to combine with statements.

The static analysis tool correctly identified that the nested with statements can be combined for better readability and style.

     with suppress(FileNotFoundError, StopIteration):
-        with open(
-            os.path.join(Path(__file__).parent.parent, "pyproject.toml"), encoding="utf-8"
-        ) as pyproject_toml:
-            version = (
-                next(line for line in pyproject_toml if line.startswith("version"))
-                .split("=")[1]
-                .strip("'\"\n ")
-            )
+        pyproject_path = os.path.join(Path(__file__).parent.parent, "pyproject.toml")
+        with open(pyproject_path, encoding="utf-8") as pyproject_toml:
+            version = (
+                next(line for line in pyproject_toml if line.startswith("version"))
+                .split("=")[1]
+                .strip("'\"\n ")
+            )
             # Mark the version as a local Cognee library by appending "-dev"
             return f"{version}-dev"
🧰 Tools
🪛 Ruff (0.11.9)

10-13: Use a single with statement with multiple contexts instead of nested with statements

Combine with statements

(SIM117)


14-18: Improve robustness of version parsing.

The current version parsing assumes a specific format and could fail if the pyproject.toml format changes. Consider using a proper TOML parser for more robust parsing.

-            version = (
-                next(line for line in pyproject_toml if line.startswith("version"))
-                .split("=")[1]
-                .strip("'\"\n ")
-            )
+            for line in pyproject_toml:
+                if line.strip().startswith("version"):
+                    # Handle both version = "1.0.0" and version = '1.0.0' formats
+                    version = line.split("=", 1)[1].strip()
+                    version = version.strip("'\"\n ")
+                    break
+            else:
+                # If no version line found, let StopIteration be raised
+                raise StopIteration("No version found in pyproject.toml")

Alternatively, consider using a proper TOML parser like tomllib (Python 3.11+) or tomli for more robust parsing.

cognee/modules/pipelines/operations/log_pipeline_run_initiated.py (1)

16-22: Consider adding error handling for database operations.

The database session management is correct, but consider adding try-catch blocks for better error handling and logging.

 async def log_pipeline_run_initiated(pipeline_id: str, pipeline_name: str, dataset_id: UUID):
     pipeline_run = PipelineRun(
         pipeline_run_id=uuid4(),
         pipeline_name=pipeline_name,
         pipeline_id=pipeline_id,
         status=PipelineRunStatus.DATASET_PROCESSING_INITIATED,
         dataset_id=dataset_id,
         run_info={},
     )

     db_engine = get_relational_engine()

+    try:
         async with db_engine.get_async_session() as session:
             session.add(pipeline_run)
             await session.commit()
+    except Exception as e:
+        # Log the error and re-raise or handle appropriately
+        raise

     return pipeline_run
cognee-frontend/src/modules/ingestion/DataView/DataView.tsx (1)

36-36: Consider renaming the component to avoid shadowing the global DataView.

The component name DataView shadows the global DataView property (Web API). Consider renaming it to avoid confusion.

-export default function DataView({ datasetId, data, onClose, onDataAdd, onCognify }: DataViewProps) {
+export default function DatasetView({ datasetId, data, onClose, onDataAdd, onCognify }: DataViewProps) {

Also update the interface name:

-interface DataViewProps {
+interface DatasetViewProps {
🧰 Tools
🪛 Biome (1.9.4)

[error] 36-36: Do not shadow the global "DataView" property.

Consider renaming this variable. It's easy to confuse the origin of variables when they're named after a known global.

(lint/suspicious/noShadowRestrictedNames)

cognee/shared/logging_utils.py (1)

7-12: Remove unused import and verify version retrieval.

The importlib.metadata import appears unused according to static analysis. The structlog.__version__ and cognee.__version__ are used directly without importing metadata.

-import importlib.metadata

If version information needs to be retrieved from other packages, use importlib.metadata appropriately or remove the unused import.

🧰 Tools
🪛 Ruff (0.11.9)

10-10: importlib.metadata imported but unused

Remove unused import: importlib.metadata

(F401)

examples/database_examples/chromadb_example.py (1)

1-1: Remove unused import.

The os module is imported but never used in this file.

-import os
🧰 Tools
🪛 Ruff (0.11.9)

1-1: os imported but unused

Remove unused import: os

(F401)

cognee/exceptions/exceptions.py (1)

23-30: Simplify the logging conditional structure.

The current if/elif chain can be simplified using a more elegant approach.

-        if log and (log_level == "ERROR"):
-            logger.error(f"{self.name}: {self.message} (Status code: {self.status_code})")
-        elif log and (log_level == "WARNING"):
-            logger.warning(f"{self.name}: {self.message} (Status code: {self.status_code})")
-        elif log and (log_level == "INFO"):
-            logger.info(f"{self.name}: {self.message} (Status code: {self.status_code})")
-        elif log and (log_level == "DEBUG"):
-            logger.debug(f"{self.name}: {self.message} (Status code: {self.status_code})")
+        if log:
+            log_method = getattr(logger, log_level.lower(), None)
+            if log_method:
+                log_method(f"{self.name}: {self.message} (Status code: {self.status_code})")
examples/data/car_and_tech_companies.txt (1)

17-17: Fix grammatical error in sample text.

There's a subject-verb disagreement: "Each of these car manufacturer" should use the plural form to agree with "these".

-Each of these car manufacturer contributes to Germany's reputation as a leader in the global automotive industry, showcasing a blend of innovation, performance, and design excellence.
+Each of these car manufacturers contributes to Germany's reputation as a leader in the global automotive industry, showcasing a blend of innovation, performance, and design excellence.
🧰 Tools
🪛 LanguageTool

[grammar] ~17-~17: The plural determiner ‘these’ does not agree with the singular noun ‘car’.
Context: ...nce practicality with quality. Each of these car manufacturer contributes to Germany's r...

(THIS_NNS)


[uncategorized] ~17-~17: The grammatical number of this noun doesn’t look right. Consider replacing it.
Context: ...cality with quality. Each of these car manufacturer contributes to Germany's reputation as ...

(AI_EN_LECTOR_REPLACEMENT_NOUN_NUMBER)

cognee/tasks/temporal_awareness/index_graphiti_objects.py (1)

64-66: Clarify edge data structure documentation.

The comment indicates that the relationship name is at index 2, but it would be helpful to document the complete structure of the edge tuple for maintainability.

Consider adding a more comprehensive comment:

-    edge_types = Counter(
-        edge[2]  # The edge key (relationship name) is at index 2
-        for edge in edges_data
-    )
+    edge_types = Counter(
+        edge[2]  # Edge tuple format: (source_id, target_id, relationship_name, ...)
+        for edge in edges_data
+    )
cognee/tests/unit/modules/pipelines/run_tasks_with_context_test.py (1)

42-47: Consider using pytest instead of manual test runner.

While the current approach works, consider integrating with pytest for better test discovery and reporting capabilities.

-def test_run_tasks():
-    asyncio.run(run_and_check_tasks())
-
-
-if __name__ == "__main__":
-    test_run_tasks()
+@pytest.mark.asyncio
+async def test_run_tasks():
+    await run_and_check_tasks()
examples/database_examples/qdrant_example.py (1)

51-56: Consider extracting sample text to external file.

For a cleaner example, consider moving the sample text to a separate file or making it more concise.

You could create a sample_data.txt file or use a shorter, more focused example text to improve readability.

examples/database_examples/falkordb_example.py (1)

1-1: Remove unused import.

The os module is imported but not used in this file.

-import os
 import pathlib
🧰 Tools
🪛 Ruff (0.11.9)

1-1: os imported but unused

Remove unused import: os

(F401)

examples/database_examples/kuzu_example.py (1)

1-1: Remove unused import.

The os module is imported but not used in this file.

-import os
 import pathlib
🧰 Tools
🪛 Ruff (0.11.9)

1-1: os imported but unused

Remove unused import: os

(F401)

notebooks/cognee_openai_compatable_demo.ipynb (1)

14-16: Consider adding authentication documentation.

The comment suggests using /api/v1/auth/login to get a JWT, but the example uses a hardcoded API key. Consider adding a cell demonstrating proper authentication or clarifying when the hardcoded key is acceptable for local development.

-    "# Use /api/v1/auth/login to get JWT\n",
+    "# For production: Use /api/v1/auth/login to get JWT\n",
+    "# For local development: Use a placeholder API key as shown below\n",
cognee/modules/retrieval/graph_completion_context_extension_retriever.py (1)

3-3: Remove unused imports identified by static analysis.

The static analysis correctly identifies unused imports that should be removed for cleaner code.

Apply this diff to remove the unused imports:

-from cognee.infrastructure.llm.get_llm_client import get_llm_client
-from cognee.infrastructure.llm.prompts import read_query_prompt, render_prompt

Also applies to: 6-6

🧰 Tools
🪛 Ruff (0.11.9)

3-3: cognee.infrastructure.llm.get_llm_client.get_llm_client imported but unused

Remove unused import: cognee.infrastructure.llm.get_llm_client.get_llm_client

(F401)

examples/database_examples/pgvector_example.py (1)

1-1: Remove unused import identified by static analysis.

The os module is imported but never used in the script.

Apply this diff to remove the unused import:

-import os
🧰 Tools
🪛 Ruff (0.11.9)

1-1: os imported but unused

Remove unused import: os

(F401)

examples/database_examples/weaviate_example.py (1)

42-44: Consider making cleanup optional with user confirmation.

The automatic pruning of existing data might be unexpected for users. Consider making this optional or adding a confirmation mechanism.

-    # Clean any existing data (optional)
-    await cognee.prune.prune_data()
-    await cognee.prune.prune_system(metadata=True)
+    # Clean any existing data (optional)
+    # Uncomment the following lines if you want to clean existing data
+    # await cognee.prune.prune_data()
+    # await cognee.prune.prune_system(metadata=True)
cognee/modules/pipelines/operations/run_tasks_base.py (1)

31-36: Optimize the parameter inspection logic.

The logic for checking if a function accepts a context parameter works correctly, but can be simplified based on the static analysis hint.

-    has_context = any(
-        [key == "context" for key in inspect.signature(running_task.executable).parameters.keys()]
-    )
+    has_context = "context" in inspect.signature(running_task.executable).parameters
🧰 Tools
🪛 Ruff (0.11.9)

32-32: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

cognee/tests/test_memgraph.py (2)

16-27: Consider using more portable path construction.

The current path construction using os.path.join and pathlib.Path is functional but could be simplified and made more readable.

Consider using pathlib consistently for cleaner path construction:

-    data_directory_path = str(
-        pathlib.Path(
-            os.path.join(pathlib.Path(__file__).parent, ".data_storage/test_memgraph")
-        ).resolve()
-    )
+    data_directory_path = str(
+        (pathlib.Path(__file__).parent / ".data_storage/test_memgraph").resolve()
+    )

Apply the same pattern to the cognee_directory_path for consistency.


91-91: Verify the expected search history count.

The assertion expects exactly 8 search history entries, but this seems tightly coupled to the number of search operations in the test.

Consider making the assertion more descriptive and potentially more flexible:

-    assert len(history) == 8, "Search history is not correct."
+    expected_searches = 4  # Number of search operations performed
+    assert len(history) == expected_searches, f"Expected {expected_searches} search history entries, got {len(history)}"

This makes the relationship between operations and expected history clearer.

cognee/api/v1/responses/routers/get_responses_router.py (1)

115-121: Enhance error logging for function dispatch failures.

The error handling captures exceptions but could provide more detailed information for debugging.

Consider adding more context to the error logging:

                 try:
                     function_result = await dispatch_function(tool_call)
                     output_status = "success"
                 except Exception as e:
-                    logger.exception(f"Error executing function {function_name}: {e}")
+                    logger.exception(
+                        f"Error executing function {function_name} with arguments {arguments_str}: {e}",
+                        extra={"function_name": function_name, "arguments": arguments_str, "call_id": call_id}
+                    )
                     function_result = f"Error executing {function_name}: {str(e)}"
                     output_status = "error"
cognee-starter-kit/src/pipelines/low_level.py (1)

110-113: Add error handling for graph visualization.

The graph visualization operation could fail if the directory doesn't exist or there are permission issues.

Add error handling and ensure the directory exists:

     # Or use our simple graph preview
     graph_file_path = str(
         os.path.join(os.path.dirname(__file__), ".artifacts/graph_visualization.html")
     )
+    # Ensure the artifacts directory exists
+    os.makedirs(os.path.dirname(graph_file_path), exist_ok=True)
+    
+    try:
         await visualize_graph(graph_file_path)
+        print(f"Graph visualization saved to: {graph_file_path}")
+    except Exception as e:
+        print(f"Failed to generate graph visualization: {e}")
cognee/eval_framework/modal_eval_dashboard.py (2)

35-41: Consider security implications of disabled CORS and XSRF protection.

Disabling CORS and XSRF protection can create security vulnerabilities, especially if the dashboard is exposed to external networks.

Evaluate whether these security settings are necessary and document the security implications:

     cmd = (
         "streamlit run /root/serve_dashboard.py "
         "--server.port 8000 "
+        # WARNING: These settings disable security protections
+        # Only use in trusted environments
         "--server.enableCORS=false "
         "--server.enableXsrfProtection=false"
     )

Consider if these settings are truly necessary for your deployment environment.


41-41: Consider using exec() instead of Popen() for better process management.

Using subprocess.Popen() without waiting for the process can lead to resource leaks or unclear error states.

Consider using a more explicit process management approach:

-    subprocess.Popen(cmd, shell=True)
+    try:
+        subprocess.run(cmd, shell=True, check=True)
+    except subprocess.CalledProcessError as e:
+        raise RuntimeError(f"Failed to start Streamlit: {e}")

Alternatively, if you need the process to run in the background, consider proper process management to handle cleanup.

cognee/tests/unit/modules/retrieval/graph_completion_retriever_context_extension_test.py (1)

181-183: Fix incorrect method names in the manual test runner

test_graph_completion_context_simple / complex are not defined – the actual names include _extension_.
Running this file directly will therefore raise AttributeError.

-        await test.test_graph_completion_context_simple()
-        await test.test_graph_completion_context_complex()
+        await test.test_graph_completion_extension_context_simple()
+        await test.test_graph_completion_extension_context_complex()
cognee/tests/unit/modules/retrieval/graph_completion_retriever_cot_test.py (1)

178-181: Align manual runner with real test names

Names invoked here don’t exist (_context_ vs. _cot_context_) which will crash the ad-hoc runner.

-        await test.test_graph_completion_context_simple()
-        await test.test_graph_completion_context_complex()
+        await test.test_graph_completion_cot_context_simple()
+        await test.test_graph_completion_cot_context_complex()
cognee/modules/retrieval/graph_completion_cot_retriever.py (2)

36-40: Initialise answer as str, not List[str], to match later usage

answer is first declared as a list and then replaced by the string returned from generate_completion.
Keeping the initial type consistent eliminates static-type noise and avoids accidental nested-list returns in the future.

-        followup_question = ""
-        triplets = []
-        answer = [""]
+        followup_question: str = ""
+        triplets: list[str] = []
+        answer: str = ""

84-85: Return value can be simplified

get_completion promises List[str] but currently returns [answer] where answer is str, causing a single-element list.
If multi-answer support is not planned, returning the string directly simplifies both API and consumers (tests would need a one-line update).

cognee/modules/pipelines/operations/pipeline.py (1)

172-174: Error message does not match validation rule

The guard checks for "." and " " but the exception says “spaces or underscores”.

-        raise ValueError("Dataset name cannot contain spaces or underscores")
+        raise ValueError("Dataset name cannot contain spaces or periods")
cognee/api/v1/responses/dispatch_function.py (2)

31-34: Un-guarded json.loads can explode on malformed arguments

A malformed arguments string (which is common with LLM tool calls) will raise JSONDecodeError and bubble all the way up. Consider a safe parse with graceful degradation:

-    arguments = json.loads(arguments_str)
+    try:
+        arguments = json.loads(arguments_str)
+    except json.JSONDecodeError as exc:
+        logger.error("Cannot parse tool-call arguments: %s", exc)
+        return f"Error: Invalid JSON in tool arguments – {exc}"

71-82: top_k should be coerced to int to avoid silent fall-back

If top_k arrives as a string (e.g. "5"), the current isinstance(top_k, int) check forces the default 10, surprising the caller.

-    top_k = arguments.get("top_k")
+    try:
+        top_k = int(arguments["top_k"])
+    except (KeyError, ValueError, TypeError):
+        top_k = 10
cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py (2)

108-116: O(n²) lookup while building points

data_points.index(data_point) performs a linear scan for every element. Use enumerate to keep the complexity linear:

-        points = [convert_to_qdrant_point(point) for point in data_points]
+        points = [
+            models.PointStruct(
+                id=str(point.id),
+                payload=point.model_dump(),
+                vector={"text": data_vectors[i]},
+            )
+            for i, point in enumerate(data_points)
+        ]

195-216: Client closed twice – minor but noisy

await client.close() is executed inside the try block and the finally block, which logs a warning in recent qdrant-client versions. Remove the first one and keep the finally cleanup.

-            await client.close()
cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (2)

163-165: Unbounded limit == 0 can explode memory usage

limit == 0 is translated to “return all rows” (await collection.count_rows()), which on large tables can load millions of vectors into memory and DOS the service.

Consider:

-    if limit == 0:
-        limit = await collection.count_rows()
+    if limit == 0:
+        total = await collection.count_rows()
+        limit = min(total, self.MAX_BULK_LIMIT)  # e.g. 1_000
+        logger.warning(
+            "Limit=0 detected → clamped to %s to avoid OOM (collection size: %s)",
+            limit,
+            total,
+        )

205-211: Inefficient row-by-row deletion

Iterating and issuing one DELETE per ID results in N network round-trips:

for data_point_id in data_point_ids:
    await collection.delete(f"id = '{data_point_id}'")

Most vector stores (including LanceDB) support bulk predicates. Deleting in one statement is both atomic and ~N-fold faster:

-for data_point_id in data_point_ids:
-    await collection.delete(f"id = '{data_point_id}'")
+ids = "', '".join(map(str, data_point_ids))
+await collection.delete(f"id IN ('{ids}')")

If you retain the per-row strategy to avoid commit conflicts, add a short await asyncio.sleep() back-off to reduce lock contention.

cognee/infrastructure/databases/vector/milvus/MilvusAdapter.py (2)

181-186: Redundant guard makes later limit if limit > 0 else None unreachable

Because you return [] when limit <= 0, the conditional on line 204 is dead code.

-if limit <= 0:
-    return []
+if limit is None or limit <= 0:
+    logger.warning("Non-positive limit supplied (%s); returning [].", limit)
+    return []
...
-limit=limit if limit > 0 else None,
+limit=limit,

219-232: Exception filtering duplicates earlier existence check

You already bail out if the collection is missing (lines 187-192).
Catching the same CollectionNotExistException again is harmless but adds noise. Consider collapsing the two branches into a single guard clause.

cognee/infrastructure/databases/graph/networkx/adapter.py (2)

139-142: Edge retrieval now limited to incoming/outgoing only

Changing the identifier type is fine, but note that the helper combines in_edges + out_edges without preserving multi-edge keys.
If multiple edges with different keys exist between two nodes, only the first will be returned, losing information.
Consider:

edges = list(self.graph.in_edges(node_id, data=True, keys=True))
edges += list(self.graph.out_edges(node_id, data=True, keys=True))

so every parallel edge is captured.


218-226: Potential performance hit in neighbor aggregation

await asyncio.gather(self.get_predecessors(), self.get_successors()) is neat, but both called methods perform synchronous NetworkX look-ups; there is no actual I/O to await.
You can simplify and avoid the overhead:

predecessors = await self.get_predecessors(node_id)
successors   = await self.get_successors(node_id)

or make the helper functions synchronous and call directly.

cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (2)

206-216: Batch search executes sequentially

await query_search(...) inside a comprehension performs the searches serially.
Leverage concurrency with asyncio.gather:

vectors = await self.embed_data(query_texts)
tasks = [
    self.search(collection_name, query_vector=v, limit=limit, with_vector=with_vectors)
    for v in vectors
]
return await asyncio.gather(*tasks)

Reduces total latency considerably for remote calls.


221-227: delete_many result handling

Weaviate’s delete_many returns an object containing status counts.
Consider returning result.summary (or similar) so callers can verify deletions instead of the raw SDK object.

cognee-mcp/src/server.py (1)

87-94: Background tasks may swallow exceptions silently

asyncio.create_task() detaches the coroutine; if it raises, the error is logged only when the task is garbage-collected. Consider attaching a done-callback that logs unexpected exceptions or gather the tasks in a supervisor:

task = asyncio.create_task(cognify_task(abs_path))
task.add_done_callback(lambda t: logger.error(t.exception()) if t.exception() else None)
tasks.append(task)

This makes operational issues visible without blocking the caller.

cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py (1)

128-133: Double round-trip when the collection already exists

create_data_points() calls await self.create_collection(), which in turn performs a list_collections round-trip even when the collection is known to exist. Right after that you again call get_collection(). For high-throughput ingestion this is two extra network calls per batch.

If you optimistically try get_collection() first and fall back to create_collection() only on CollectionNotFoundError, you save one RTT in the common path:

try:
    collection = await self.get_collection(collection_name)
except CollectionNotFoundError:
    await self.create_collection(collection_name)
    collection = await self.get_collection(collection_name)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 822cc55 and 39b8a6a.

⛔ Files ignored due to path filters (30)
  • .data/short_stories/soldiers-home.pdf is excluded by !**/*.pdf, !**/*.pdf
  • .dlt/config.toml is excluded by !**/*.toml
  • .github/actions/cognee_setup/action.yml is excluded by !**/*.yml
  • .github/workflows/db_examples_tests.yml is excluded by !**/*.yml
  • .github/workflows/python_version_tests.yml is excluded by !**/*.yml
  • .github/workflows/test_memgraph.yml is excluded by !**/*.yml
  • .github/workflows/test_suites.yml is excluded by !**/*.yml
  • assets/community/cognee_benefits_zh.JPG is excluded by !**/*.jpg, !**/*.jpg
  • assets/community/cognee_diagram_zh.JPG is excluded by !**/*.jpg, !**/*.jpg
  • assets/community/graph_visualization_pt.png is excluded by !**/*.png, !**/*.png
  • assets/community/graph_visualization_ru.png is excluded by !**/*.png, !**/*.png
  • assets/graph_visualization.png is excluded by !**/*.png, !**/*.png
  • cognee-mcp/pyproject.toml is excluded by !**/*.toml
  • cognee-mcp/uv.lock is excluded by !**/*.lock, !**/*.lock
  • cognee-starter-kit/pyproject.toml is excluded by !**/*.toml
  • cognee-starter-kit/src/data/companies.json is excluded by !**/*.json
  • cognee-starter-kit/src/data/people.json is excluded by !**/*.json
  • deployment/helm/Chart.yaml is excluded by !**/*.yaml
  • deployment/helm/docker-compose-helm.yml is excluded by !**/*.yml
  • deployment/helm/templates/cognee_deployment.yaml is excluded by !**/*.yaml
  • deployment/helm/templates/cognee_service.yaml is excluded by !**/*.yaml
  • deployment/helm/templates/postgres_deployment.yaml is excluded by !**/*.yaml
  • deployment/helm/templates/postgres_pvc.yaml is excluded by !**/*.yaml
  • deployment/helm/templates/postgres_service.yaml is excluded by !**/*.yaml
  • deployment/helm/values.yaml is excluded by !**/*.yaml
  • examples/data/multimedia/example.png is excluded by !**/*.png, !**/*.png
  • examples/data/multimedia/text_to_speech.mp3 is excluded by !**/*.mp3, !**/*.mp3
  • poetry.lock is excluded by !**/*.lock, !**/*.lock
  • pyproject.toml is excluded by !**/*.toml
  • uv.lock is excluded by !**/*.lock, !**/*.lock
📒 Files selected for processing (140)
  • .data/code/example.txt (0 hunks)
  • CONTRIBUTING.md (1 hunks)
  • Dockerfile (1 hunks)
  • Dockerfile_modal (1 hunks)
  • README.md (3 hunks)
  • alembic/versions/1d0bb7fede17_add_pipeline_run_status.py (1 hunks)
  • alembic/versions/482cd6517ce4_add_default_user.py (1 hunks)
  • assets/graph_visualization.html (0 hunks)
  • cognee-frontend/src/app/page.tsx (3 hunks)
  • cognee-frontend/src/app/wizard/CognifyStep/CognifyStep.tsx (1 hunks)
  • cognee-frontend/src/app/wizard/ExploreStep/ExploreStep.tsx (1 hunks)
  • cognee-frontend/src/app/wizard/WizardPage.tsx (1 hunks)
  • cognee-frontend/src/modules/datasets/cognifyDataset.ts (1 hunks)
  • cognee-frontend/src/modules/exploration/getExplorationGraphUrl.ts (1 hunks)
  • cognee-frontend/src/modules/ingestion/DataView/DataView.tsx (4 hunks)
  • cognee-frontend/src/ui/Partials/Explorer/Explorer.tsx (1 hunks)
  • cognee-frontend/src/ui/Partials/SearchView/SearchView.tsx (2 hunks)
  • cognee-frontend/src/utils/fetch.ts (1 hunks)
  • cognee-mcp/src/server.py (3 hunks)
  • cognee-starter-kit/.env.template (1 hunks)
  • cognee-starter-kit/.gitignore (1 hunks)
  • cognee-starter-kit/README.md (1 hunks)
  • cognee-starter-kit/src/pipelines/custom-model.py (1 hunks)
  • cognee-starter-kit/src/pipelines/default.py (1 hunks)
  • cognee-starter-kit/src/pipelines/low_level.py (1 hunks)
  • cognee/__init__.py (1 hunks)
  • cognee/api/client.py (2 hunks)
  • cognee/api/v1/add/add.py (1 hunks)
  • cognee/api/v1/cognify/code_graph_pipeline.py (3 hunks)
  • cognee/api/v1/cognify/cognify.py (1 hunks)
  • cognee/api/v1/config/config.py (1 hunks)
  • cognee/api/v1/datasets/datasets.py (1 hunks)
  • cognee/api/v1/delete/routers/get_delete_router.py (1 hunks)
  • cognee/api/v1/responses/__init__.py (1 hunks)
  • cognee/api/v1/responses/default_tools.py (1 hunks)
  • cognee/api/v1/responses/dispatch_function.py (1 hunks)
  • cognee/api/v1/responses/models.py (1 hunks)
  • cognee/api/v1/responses/routers/__init__.py (1 hunks)
  • cognee/api/v1/responses/routers/default_tools.py (1 hunks)
  • cognee/api/v1/responses/routers/get_responses_router.py (1 hunks)
  • cognee/base_config.py (1 hunks)
  • cognee/eval_framework/answer_generation/answer_generation_executor.py (2 hunks)
  • cognee/eval_framework/eval_config.py (4 hunks)
  • cognee/eval_framework/metrics_dashboard.py (1 hunks)
  • cognee/eval_framework/modal_eval_dashboard.py (1 hunks)
  • cognee/eval_framework/modal_run_eval.py (3 hunks)
  • cognee/exceptions/exceptions.py (1 hunks)
  • cognee/infrastructure/databases/graph/get_graph_engine.py (1 hunks)
  • cognee/infrastructure/databases/graph/graph_db_interface.py (2 hunks)
  • cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py (1 hunks)
  • cognee/infrastructure/databases/graph/networkx/adapter.py (7 hunks)
  • cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py (2 hunks)
  • cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py (8 hunks)
  • cognee/infrastructure/databases/vector/embeddings/FastembedEmbeddingEngine.py (1 hunks)
  • cognee/infrastructure/databases/vector/exceptions/exceptions.py (1 hunks)
  • cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (8 hunks)
  • cognee/infrastructure/databases/vector/milvus/MilvusAdapter.py (9 hunks)
  • cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (5 hunks)
  • cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py (6 hunks)
  • cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (9 hunks)
  • cognee/infrastructure/llm/anthropic/adapter.py (1 hunks)
  • cognee/infrastructure/llm/gemini/adapter.py (1 hunks)
  • cognee/infrastructure/llm/generic_llm_api/adapter.py (1 hunks)
  • cognee/infrastructure/llm/ollama/adapter.py (1 hunks)
  • cognee/infrastructure/llm/openai/adapter.py (4 hunks)
  • cognee/infrastructure/llm/prompts/cot_followup_system_prompt.txt (1 hunks)
  • cognee/infrastructure/llm/prompts/cot_followup_user_prompt.txt (1 hunks)
  • cognee/infrastructure/llm/prompts/cot_validation_system_prompt.txt (1 hunks)
  • cognee/infrastructure/llm/prompts/cot_validation_user_prompt.txt (1 hunks)
  • cognee/modules/data/methods/__init__.py (1 hunks)
  • cognee/modules/data/methods/create_dataset.py (2 hunks)
  • cognee/modules/data/methods/get_unique_dataset_id.py (1 hunks)
  • cognee/modules/engine/models/ColumnValue.py (1 hunks)
  • cognee/modules/engine/models/__init__.py (1 hunks)
  • cognee/modules/graph/cognee_graph/CogneeGraph.py (1 hunks)
  • cognee/modules/observability/get_observe.py (1 hunks)
  • cognee/modules/observability/observers.py (1 hunks)
  • cognee/modules/pipelines/models/PipelineRun.py (1 hunks)
  • cognee/modules/pipelines/operations/__init__.py (1 hunks)
  • cognee/modules/pipelines/operations/get_pipeline_status.py (2 hunks)
  • cognee/modules/pipelines/operations/log_pipeline_run_initiated.py (1 hunks)
  • cognee/modules/pipelines/operations/pipeline.py (3 hunks)
  • cognee/modules/pipelines/operations/run_tasks.py (4 hunks)
  • cognee/modules/pipelines/operations/run_tasks_base.py (4 hunks)
  • cognee/modules/retrieval/exceptions/__init__.py (1 hunks)
  • cognee/modules/retrieval/exceptions/exceptions.py (0 hunks)
  • cognee/modules/retrieval/graph_completion_context_extension_retriever.py (1 hunks)
  • cognee/modules/retrieval/graph_completion_cot_retriever.py (1 hunks)
  • cognee/modules/retrieval/graph_completion_retriever.py (1 hunks)
  • cognee/modules/retrieval/utils/brute_force_triplet_search.py (4 hunks)
  • cognee/modules/search/methods/search.py (3 hunks)
  • cognee/modules/search/types/SearchType.py (1 hunks)
  • cognee/modules/settings/get_settings.py (2 hunks)
  • cognee/modules/visualization/cognee_network_visualization.py (1 hunks)
  • cognee/shared/data_models.py (0 hunks)
  • cognee/shared/logging_utils.py (4 hunks)
  • cognee/tasks/ingestion/ingest_data.py (1 hunks)
  • cognee/tasks/ingestion/migrate_relational_database.py (2 hunks)
  • cognee/tasks/temporal_awareness/index_graphiti_objects.py (2 hunks)
  • cognee/tests/integration/run_toy_tasks/conftest.py (0 hunks)
  • cognee/tests/test_memgraph.py (1 hunks)
  • cognee/tests/test_neo4j.py (1 hunks)
  • cognee/tests/test_relational_db_migration.py (3 hunks)
  • cognee/tests/test_starter_pipelines.py (1 hunks)
  • cognee/tests/test_weaviate.py (1 hunks)
  • cognee/tests/unit/modules/pipelines/run_tasks_test.py (1 hunks)
  • cognee/tests/unit/modules/pipelines/run_tasks_with_context_test.py (1 hunks)
  • cognee/tests/unit/modules/retrieval/chunks_retriever_test.py (4 hunks)
  • cognee/tests/unit/modules/retrieval/graph_completion_retriever_context_extension_test.py (1 hunks)
  • cognee/tests/unit/modules/retrieval/graph_completion_retriever_cot_test.py (1 hunks)
  • cognee/tests/unit/modules/retrieval/graph_completion_retriever_test.py (1 hunks)
  • cognee/tests/unit/modules/retrieval/summaries_retriever_test.py (1 hunks)
  • cognee/tests/unit/modules/retrieval/utils/brute_force_triplet_search_test.py (0 hunks)
  • cognee/version.py (1 hunks)
  • entrypoint.sh (3 hunks)
  • examples/data/car_and_tech_companies.txt (1 hunks)
  • examples/database_examples/chromadb_example.py (1 hunks)
  • examples/database_examples/falkordb_example.py (1 hunks)
  • examples/database_examples/kuzu_example.py (1 hunks)
  • examples/database_examples/milvus_example.py (1 hunks)
  • examples/database_examples/neo4j_example.py (1 hunks)
  • examples/database_examples/pgvector_example.py (1 hunks)
  • examples/database_examples/qdrant_example.py (1 hunks)
  • examples/database_examples/weaviate_example.py (1 hunks)
  • examples/python/graphiti_example.py (2 hunks)
  • examples/python/multimedia_example.py (1 hunks)
  • notebooks/cognee_demo.ipynb (3 hunks)
  • notebooks/cognee_graphiti_demo.ipynb (4 hunks)
  • notebooks/cognee_llama_index.ipynb (2 hunks)
  • notebooks/cognee_multimedia_demo.ipynb (2 hunks)
  • notebooks/cognee_openai_compatable_demo.ipynb (1 hunks)
  • notebooks/cognee_simple_demo.ipynb (7 hunks)
  • notebooks/github_graph_visualization.html (0 hunks)
  • notebooks/graphrag_vs_rag.ipynb (7 hunks)
  • notebooks/hr_demo.ipynb (0 hunks)
  • notebooks/llama_index_cognee_integration.ipynb (5 hunks)
  • profiling/graph_pydantic_conversion/benchmark_function.py (0 hunks)
  • profiling/graph_pydantic_conversion/profile_graph_pydantic_conversion.py (0 hunks)
  • profiling/util/DummyEmbeddingEngine.py (0 hunks)
  • profiling/util/DummyLLMAdapter.py (0 hunks)
💤 Files with no reviewable changes (12)
  • cognee/shared/data_models.py
  • cognee/modules/retrieval/exceptions/exceptions.py
  • profiling/graph_pydantic_conversion/profile_graph_pydantic_conversion.py
  • profiling/util/DummyEmbeddingEngine.py
  • cognee/tests/integration/run_toy_tasks/conftest.py
  • cognee/tests/unit/modules/retrieval/utils/brute_force_triplet_search_test.py
  • .data/code/example.txt
  • assets/graph_visualization.html
  • profiling/graph_pydantic_conversion/benchmark_function.py
  • notebooks/github_graph_visualization.html
  • profiling/util/DummyLLMAdapter.py
  • notebooks/hr_demo.ipynb
🧰 Additional context used
🧬 Code Graph Analysis (32)
cognee/tests/unit/modules/pipelines/run_tasks_test.py (1)
cognee/tests/unit/modules/pipelines/run_tasks_with_context_test.py (1)
  • test_run_tasks (42-43)
cognee/tests/unit/modules/retrieval/summaries_retriever_test.py (1)
cognee/modules/retrieval/summaries_retriever.py (1)
  • SummariesRetriever (9-33)
cognee/modules/data/methods/__init__.py (1)
cognee/modules/data/methods/get_unique_dataset_id.py (1)
  • get_unique_dataset_id (5-6)
cognee/api/v1/responses/routers/__init__.py (1)
cognee/api/v1/responses/routers/get_responses_router.py (1)
  • get_responses_router (25-149)
cognee/api/v1/add/add.py (4)
cognee/modules/users/models/User.py (1)
  • User (12-39)
cognee/modules/pipelines/operations/pipeline.py (1)
  • cognee_pipeline (30-103)
cognee/tasks/ingestion/ingest_data.py (1)
  • ingest_data (20-192)
cognee/tasks/ingestion/resolve_data_directories.py (1)
  • resolve_data_directories (8-72)
cognee/tasks/ingestion/ingest_data.py (1)
cognee/modules/data/methods/create_dataset.py (1)
  • create_dataset (11-33)
cognee/modules/data/methods/get_unique_dataset_id.py (1)
cognee/modules/users/models/User.py (1)
  • User (12-39)
cognee-frontend/src/app/wizard/ExploreStep/ExploreStep.tsx (1)
cognee-frontend/src/ui/Partials/Explorer/Explorer.tsx (1)
  • Explorer (15-61)
cognee/api/v1/responses/__init__.py (1)
cognee/api/v1/responses/routers/get_responses_router.py (1)
  • get_responses_router (25-149)
cognee/__init__.py (1)
cognee/version.py (1)
  • get_cognee_version (7-24)
cognee/modules/observability/get_observe.py (2)
cognee/base_config.py (1)
  • get_base_config (29-30)
cognee/modules/observability/observers.py (1)
  • Observer (4-9)
cognee/api/client.py (1)
cognee/api/v1/responses/routers/get_responses_router.py (1)
  • get_responses_router (25-149)
cognee/modules/engine/models/__init__.py (1)
cognee/modules/engine/models/ColumnValue.py (1)
  • ColumnValue (4-9)
cognee/tests/test_neo4j.py (1)
cognee/modules/users/methods/get_default_user.py (1)
  • get_default_user (12-37)
cognee/modules/pipelines/operations/get_pipeline_status.py (1)
cognee/modules/pipelines/models/PipelineRun.py (1)
  • PipelineRun (15-27)
cognee/api/v1/datasets/datasets.py (1)
cognee/modules/pipelines/operations/get_pipeline_status.py (1)
  • get_pipeline_status (8-35)
cognee-frontend/src/modules/datasets/cognifyDataset.ts (1)
cognee-frontend/src/utils/fetch.ts (1)
  • fetch (3-12)
cognee/infrastructure/databases/graph/get_graph_engine.py (1)
cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py (1)
  • MemgraphAdapter (20-690)
cognee/base_config.py (1)
cognee/modules/observability/observers.py (1)
  • Observer (4-9)
cognee/modules/engine/models/ColumnValue.py (1)
cognee/infrastructure/engine/models/DataPoint.py (1)
  • DataPoint (16-96)
cognee-frontend/src/app/page.tsx (1)
cognee/api/v1/datasets/datasets.py (1)
  • datasets (7-40)
cognee/infrastructure/llm/openai/adapter.py (1)
cognee/modules/observability/get_observe.py (1)
  • get_observe (5-11)
cognee/tests/test_weaviate.py (1)
cognee/infrastructure/databases/vector/get_vector_engine.py (1)
  • get_vector_engine (5-6)
cognee/modules/data/methods/create_dataset.py (2)
cognee/modules/data/methods/get_unique_dataset_id.py (1)
  • get_unique_dataset_id (5-6)
cognee/modules/users/models/User.py (1)
  • User (12-39)
examples/database_examples/neo4j_example.py (5)
cognee/modules/search/types/SearchType.py (1)
  • SearchType (4-15)
examples/database_examples/kuzu_example.py (1)
  • main (8-81)
examples/database_examples/falkordb_example.py (1)
  • main (8-83)
examples/database_examples/milvus_example.py (1)
  • main (8-85)
cognee/api/v1/config/config.py (4)
  • config (15-194)
  • set_graph_db_config (149-158)
  • data_root_directory (32-34)
  • system_root_directory (17-29)
alembic/versions/482cd6517ce4_add_default_user.py (1)
cognee/modules/users/methods/create_default_user.py (1)
  • create_default_user (5-19)
cognee/modules/retrieval/utils/brute_force_triplet_search.py (6)
cognee/infrastructure/databases/exceptions/exceptions.py (1)
  • EntityNotFoundError (15-27)
cognee/infrastructure/databases/vector/exceptions/exceptions.py (1)
  • CollectionNotFoundError (5-14)
cognee/modules/graph/cognee_graph/CogneeGraph.py (1)
  • project_graph_from_db (55-113)
cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py (1)
  • search (155-216)
cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (1)
  • search (166-204)
cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (1)
  • search (146-182)
cognee/exceptions/exceptions.py (1)
cognee/shared/logging_utils.py (4)
  • error (138-139)
  • warning (135-136)
  • info (132-133)
  • debug (144-145)
cognee/modules/pipelines/operations/run_tasks.py (3)
cognee/modules/pipelines/tasks/task.py (1)
  • Task (5-97)
cognee/modules/users/models/User.py (1)
  • User (12-39)
cognee/modules/pipelines/operations/run_tasks_base.py (1)
  • run_tasks_base (66-82)
cognee/modules/pipelines/operations/__init__.py (1)
cognee/modules/pipelines/operations/log_pipeline_run_initiated.py (1)
  • log_pipeline_run_initiated (6-22)
cognee/infrastructure/databases/vector/milvus/MilvusAdapter.py (10)
cognee/shared/logging_utils.py (1)
  • get_logger (148-169)
cognee/infrastructure/engine/models/DataPoint.py (1)
  • DataPoint (16-96)
cognee/infrastructure/engine/utils/parse_id.py (1)
  • parse_id (4-10)
cognee/infrastructure/databases/vector/exceptions/exceptions.py (1)
  • CollectionNotFoundError (5-14)
cognee/infrastructure/databases/vector/qdrant/QDrantAdapter.py (3)
  • retrieve (149-153)
  • has_collection (74-78)
  • delete_data_points (259-262)
cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (3)
  • retrieve (151-164)
  • has_collection (51-53)
  • delete_data_points (218-226)
cognee/infrastructure/databases/vector/lancedb/LanceDBAdapter.py (3)
  • retrieve (129-144)
  • has_collection (52-55)
  • delete_data_points (205-210)
cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (3)
  • retrieve (190-203)
  • has_collection (61-71)
  • delete_data_points (285-293)
cognee/infrastructure/databases/vector/chromadb/ChromaDBAdapter.py (3)
  • retrieve (165-177)
  • has_collection (111-113)
  • delete_data_points (300-304)
cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py (3)
  • retrieve (226-233)
  • has_collection (116-119)
  • delete_data_points (354-360)
cognee/infrastructure/databases/graph/networkx/adapter.py (4)
cognee/infrastructure/databases/graph/neo4j_driver/adapter.py (7)
  • has_node (66-75)
  • get_edges (264-275)
  • extract_node (121-124)
  • extract_nodes (126-136)
  • get_neighbors (381-383)
  • get_node (385-392)
  • get_nodes (394-402)
cognee/infrastructure/databases/graph/kuzu/adapter.py (7)
  • has_node (167-171)
  • get_edges (439-475)
  • extract_node (284-304)
  • extract_nodes (306-325)
  • get_neighbors (479-481)
  • get_node (483-502)
  • get_nodes (504-521)
cognee/infrastructure/databases/graph/graph_db_interface.py (4)
  • get_edges (177-179)
  • get_neighbors (182-184)
  • get_node (125-127)
  • get_nodes (130-132)
cognee/infrastructure/databases/hybrid/falkordb/FalkorDBAdapter.py (2)
  • extract_node (235-238)
  • extract_nodes (240-241)
🪛 Ruff (0.11.9)
cognee/modules/data/methods/__init__.py

10-10: .get_unique_dataset_id.get_unique_dataset_id imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)

cognee/modules/retrieval/exceptions/__init__.py

7-7: .exceptions.SearchTypeNotSupported imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)


7-7: .exceptions.CypherSearchError imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)

cognee/modules/engine/models/__init__.py

6-6: .ColumnValue.ColumnValue imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)

examples/database_examples/chromadb_example.py

1-1: os imported but unused

Remove unused import: os

(F401)

cognee/version.py

10-13: Use a single with statement with multiple contexts instead of nested with statements

Combine with statements

(SIM117)

examples/database_examples/falkordb_example.py

1-1: os imported but unused

Remove unused import: os

(F401)

cognee/shared/logging_utils.py

10-10: importlib.metadata imported but unused

Remove unused import: importlib.metadata

(F401)

cognee/modules/retrieval/graph_completion_context_extension_retriever.py

3-3: cognee.infrastructure.llm.get_llm_client.get_llm_client imported but unused

Remove unused import: cognee.infrastructure.llm.get_llm_client.get_llm_client

(F401)


6-6: cognee.infrastructure.llm.prompts.read_query_prompt imported but unused

Remove unused import

(F401)


6-6: cognee.infrastructure.llm.prompts.render_prompt imported but unused

Remove unused import

(F401)

alembic/versions/482cd6517ce4_add_default_user.py

24-27: Use contextlib.suppress(Exception) instead of try-except-pass

Replace with contextlib.suppress(Exception)

(SIM105)

examples/database_examples/kuzu_example.py

1-1: os imported but unused

Remove unused import: os

(F401)

examples/database_examples/pgvector_example.py

1-1: os imported but unused

Remove unused import: os

(F401)

alembic/versions/1d0bb7fede17_add_pipeline_run_status.py

13-13: cognee.modules.pipelines.models.PipelineRun.PipelineRun imported but unused

Remove unused import

(F401)


13-13: cognee.modules.pipelines.models.PipelineRun.PipelineRunStatus imported but unused

Remove unused import

(F401)

cognee/modules/retrieval/utils/brute_force_triplet_search.py

66-73: Use contextlib.suppress(EntityNotFoundError) instead of try-except-pass

Replace with contextlib.suppress(EntityNotFoundError)

(SIM105)

cognee/api/v1/responses/routers/get_responses_router.py

74-74: Do not perform function call Depends in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)

cognee-starter-kit/src/pipelines/low_level.py

37-37: Use a context manager for opening files

(SIM115)


40-40: Use a context manager for opening files

(SIM115)

cognee/modules/pipelines/operations/__init__.py

1-1: .log_pipeline_run_initiated.log_pipeline_run_initiated imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)

cognee/modules/pipelines/operations/run_tasks_base.py

32-32: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)

cognee/eval_framework/modal_run_eval.py

5-5: hashlib imported but unused

Remove unused import: hashlib

(F401)

cognee-mcp/src/server.py

166-166: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py

430-430: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

🪛 LanguageTool
cognee/infrastructure/llm/prompts/cot_validation_system_prompt.txt

[uncategorized] ~1-~1: This verb does not appear to agree with the subject. Consider using a different form.
Context: You are a helpful agent who are allowed to use only the provided questi...

(AI_EN_LECTOR_REPLACEMENT_VERB_AGREEMENT)

cognee-starter-kit/README.md

[uncategorized] ~85-~85: You might be missing the article “the” here.
Context: ... the following environment variables to .env file: ``` GRAPHISTRY_USERNAME="" G...

(AI_EN_LECTOR_MISSING_DETERMINER_THE)


[uncategorized] ~90-~90: You might be missing the article “an” here.
Context: ...D="" ``` Note: GRAPHISTRY_PASSWORD is API key. # What will you build with cogne...

(AI_EN_LECTOR_MISSING_DETERMINER_AN)

examples/data/car_and_tech_companies.txt

[duplication] ~2-~2: Possible typo: you repeated a word.
Context: text_1 = """ 1. Audi Audi is known for its modern designs and adv...

(ENGLISH_WORD_REPEAT_RULE)


[duplication] ~5-~5: Possible typo: you repeated a word.
Context: ...ns to high-performance sports cars. 2. BMW BMW, short for Bayerische Motoren Werke, is...

(ENGLISH_WORD_REPEAT_RULE)


[style] ~6-~6: Consider using a more concise synonym.
Context: ... reflects that commitment. BMW produces a variety of cars that combine luxury with sporty pe...

(A_VARIETY_OF)


[duplication] ~8-~8: Possible typo: you repeated a word.
Context: ...ine luxury with sporty performance. 3. Mercedes-Benz Mercedes-Benz is synonymous with luxury and quality. ...

(ENGLISH_WORD_REPEAT_RULE)


[duplication] ~11-~11: Possible typo: you repeated a word.
Context: ... catering to a wide range of needs. 4. Porsche Porsche is a name that stands for high-performa...

(ENGLISH_WORD_REPEAT_RULE)


[duplication] ~14-~14: Possible typo: you repeated a word.
Context: ...o value both performance and style. 5. Volkswagen Volkswagen, which means "people's car" in German, ...

(ENGLISH_WORD_REPEAT_RULE)


[grammar] ~17-~17: The plural determiner ‘these’ does not agree with the singular noun ‘car’.
Context: ...nce practicality with quality. Each of these car manufacturer contributes to Germany's r...

(THIS_NNS)


[uncategorized] ~17-~17: The grammatical number of this noun doesn’t look right. Consider replacing it.
Context: ...cality with quality. Each of these car manufacturer contributes to Germany's reputation as ...

(AI_EN_LECTOR_REPLACEMENT_NOUN_NUMBER)


[duplication] ~21-~21: Possible typo: you repeated a word.
Context: ...design excellence. """ text_2 = """ 1. Apple Apple is renowned for its innovative consumer...

(ENGLISH_WORD_REPEAT_RULE)


[duplication] ~27-~27: Possible typo: you repeated a word.
Context: ... in shaping the internet landscape. 3. Microsoft Microsoft Corporation has been a dominant force i...

(ENGLISH_WORD_REPEAT_RULE)


[style] ~28-~28: Consider using a synonym to be more concise.
Context: ...n both business and personal computing. In recent years, Microsoft has expanded into cloud comp...

(IN_RECENT_STYLE)


[uncategorized] ~31-~31: You might be missing the article “the” here.
Context: ...or innovation continues to reshape both retail and technology sectors. 5. Meta Meta, ...

(AI_EN_LECTOR_MISSING_DETERMINER_THE)


[duplication] ~33-~33: Possible typo: you repeated a word.
Context: ...both retail and technology sectors. 5. Meta Meta, originally known as Facebook, revoluti...

(ENGLISH_WORD_REPEAT_RULE)

🪛 Biome (1.9.4)
cognee-frontend/src/modules/ingestion/DataView/DataView.tsx

[error] 36-36: Do not shadow the global "DataView" property.

Consider renaming this variable. It's easy to confuse the origin of variables when they're named after a known global.

(lint/suspicious/noShadowRestrictedNames)

🪛 markdownlint-cli2 (0.17.2)
cognee-starter-kit/README.md

17-17: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)


21-21: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)


28-28: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)


43-43: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)


51-51: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)


59-59: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)


67-67: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)


75-75: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)


84-84: Bare URL used
null

(MD034, no-bare-urls)


86-86: Fenced code blocks should have a language specified
null

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Pensar Scan
🔇 Additional comments (174)
CONTRIBUTING.md (1)

100-100: Direct and clear phrasing is a great improvement
Removing “make sure to” tightens the instruction and makes the step more actionable without changing its meaning.

notebooks/cognee_llama_index.ipynb (2)

15-17: LGTM! Standard Jupyter notebook formatting improvement.

The conversion of the source field from a single string to a list containing that string follows standard Jupyter notebook JSON formatting conventions. This change improves notebook compatibility across different Jupyter environments without affecting code execution.


123-124: LGTM! Proper metadata and outputs field formatting.

The reordering and formatting of the metadata and outputs fields aligns with standard Jupyter notebook structure conventions. These formatting improvements enhance notebook consistency without impacting functionality.

notebooks/llama_index_cognee_integration.ipynb (5)

6-8: Approve formatting update: Converted source field to list
The change converts single-string source entries to lists (standard Jupyter JSON format). No functional impact.


62-64: Approve formatting update: Code cell source field as list
The pip install line is now correctly wrapped in a list for Jupyter compatibility.


199-201: Approve formatting update: Markdown source field as list
The "### Answer prompt based on RAG approach:" cell now uses a list for its source.


219-221: Approve formatting update: Conclusion source field as list
The conclusion paragraph is correctly formatted as a list.


282-283: Approve metadata update: Python kernel version bump
Updated language_info.version to "3.12.9" for reproducibility and clarity.

notebooks/cognee_simple_demo.ipynb (8)

13-19: Update installation command to Cognee v0.1.39
The tutorial now installs cognee==0.1.39, which aligns with the latest features mentioned in the PR summary. Please verify that this version is published on PyPI and free of regressions.


32-41: Standardize code cell metadata (execution_count & outputs)
Adding "execution_count": null and an empty outputs array ensures a reproducible, clean notebook state across environments.


53-60: Standardize code cell metadata (execution_count & outputs)
Consistent metadata scaffolding helps avoid stale outputs when rendering or testing the notebook.


72-80: Standardize code cell metadata (execution_count & outputs)
This cell’s metadata updates maintain consistency with other notebook examples in the repo.


92-98: Standardize code cell metadata (execution_count & outputs)
Cleaning execution state metadata here will prevent accidental output leakage in CI or PR previews.


102-108: Standardize code cell metadata (execution_count & outputs)
Empty outputs and null execution counts make this cell ready for fresh runs.


112-118: Standardize code cell metadata (execution_count & outputs)
Maintains a clean slate for the personality query example.


130-143: Standardize code cell metadata (execution_count & outputs)
Ensures the visualization cell has no precomputed outputs and a fresh execution count.

notebooks/cognee_multimedia_demo.ipynb (2)

25-27: Cleaned notebook metadata: Stripping execution_count and outputs keeps the notebook lean and version-control friendly.


34-43: Verify example data directory presence: Make sure the examples/data/multimedia directory—and the files text_to_speech.mp3 and example.png—are committed to the repo so this demo runs without errors.

cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py (1)

72-72: LGTM: Comment typography corrected.

The straight apostrophe replacement improves text consistency and readability.

cognee/api/v1/add/add.py (1)

2-6: LGTM! Import reordering improves organization.

The import reordering follows a logical structure (core modules → user models → task-specific imports) without affecting functionality. This aligns well with the broader dataset lifecycle management improvements mentioned in the PR context.

README.md (2)

133-139: Great addition of cognee UI section!

The new UI section enhances the documentation by showcasing the visual interface capabilities. The image and documentation link provide users with clear guidance on accessing the UI features.


38-42:

✅ Verification successful

LGTM! Path corrections look good.

The updated paths from "community/" to "assets/community/" appear correct and align with the repository structure improvements.


🏁 Script executed:

#!/bin/bash
# Verify the corrected README paths exist
fd "README.pt.md|README.zh.md|README.ru.md" assets/community/

Length of output: 153


Verified: Localized README links are correct

All three locale files (README.pt.md, README.zh.md, README.ru.md) exist under assets/community/ as intended. No further changes needed.

Dockerfile_modal (2)

21-21: Good optimization! Streamlined file copying.

Including README.md in the initial COPY command is more efficient than separate copy operations and follows Docker best practices for layer optimization.


25-27: LGTM! Poetry configuration improvements.

The changes appropriately:

  • Disable virtualenv creation (suitable for containers)
  • Install only specific required extras (neo4j, qdrant)
  • Follow modern Poetry best practices

This aligns well with the streamlined dependency management approach mentioned in the AI summary.

cognee/modules/retrieval/utils/brute_force_triplet_search.py (5)

4-6: LGTM! Import additions support robust error handling.

The new imports for EntityNotFoundError and CollectionNotFoundError are properly sourced and necessary for the enhanced error handling implemented in this file.


146-153: Excellent encapsulation of collection search with error handling.

The search_in_collection helper function properly isolates the error handling for individual collections, returning empty results when collections don't exist rather than propagating errors. This aligns with the vector adapter patterns shown in the codebase.


156-156: Good use of the helper function while maintaining parallelism.

The refactored asyncio.gather call properly utilizes the new helper function while preserving parallel execution across collections.


159-160: Appropriate graceful handling of empty search results.

Returning an empty list when no results are found across all collections is more user-friendly than raising exceptions for data absence scenarios.


173-174: Good defensive programming with outer exception handling.

The outer CollectionNotFoundError handling provides a safety net and maintains consistency with the graceful error handling approach throughout the function.

Dockerfile (3)

1-11: Well-designed multi-stage build approach.

The switch to uv with a multi-stage build is excellent for optimizing image size and build performance. The UV_LINK_MODE=copy setting is appropriate for mounted volumes, and the disabled bytecode compilation comment suggests thoughtful consideration of build options.


30-44: Excellent dependency management and caching strategy.

The approach of installing dependencies first (lines 33-34) followed by source code copy and final installation (lines 42-44) optimizes Docker layer caching. The explicit extras specification is more maintainable than ARG-based approaches, and using --frozen ensures reproducible builds.


46-60: Clean runtime stage configuration.

The runtime stage properly copies the built application and sets up the environment for execution. The PATH configuration correctly includes the uv virtual environment, and the PYTHONPATH setting ensures proper Python module resolution.

entrypoint.sh (1)

16-16: Appropriate removal of Poetry wrapper.

The direct alembic upgrade head call aligns with the Dockerfile migration away from Poetry to uv-based dependency management.

cognee/modules/visualization/cognee_network_visualization.py (1)

24-24: Approve addition of ColumnValue color mapping
The new "ColumnValue": "#13613a" entry aligns with the recently added ColumnValue node type. Ensure any visualization tests or documentation are updated to include this type so it renders correctly.

cognee/modules/search/types/SearchType.py (1)

14-15: Validate new SearchType enum members dispatch
The additions GRAPH_COMPLETION_COT and GRAPH_COMPLETION_CONTEXT_EXTENSION enable the new graph retrievers. Please verify that all dispatch logic in specific_search (and related registration points) handles these values, and update any documentation or type hints accordingly.

cognee/modules/pipelines/models/PipelineRun.py (1)

9-9: Ensure database enum is in sync with Alembic migration
The new DATASET_PROCESSING_INITIATED status must match the PostgreSQL enum via your migration. Please confirm that the Alembic revision adding this value has been applied and that existing pipeline run records handle the new status gracefully.

examples/python/graphiti_example.py (3)

17-17: Good addition of user context support.

The import of get_default_user aligns with the broader codebase refactoring to explicitly handle user context in pipeline operations.


31-32: Proper user initialization.

Correctly initializing the default user before passing it to pipeline operations. This ensures proper user context throughout the example execution.


41-41: Appropriate user context propagation.

Passing the user object to run_tasks is consistent with the codebase refactoring to include user context in pipeline execution.

cognee/tasks/ingestion/ingest_data.py (1)

107-107: Correct update to match new create_dataset signature.

The change from passing user.id to passing the full user object aligns with the refactored create_dataset function in cognee/modules/data/methods/create_dataset.py which now accepts a User object and internally extracts the owner ID.

cognee/modules/data/methods/__init__.py (1)

10-10: Import is required for module's public API.

The static analysis tool flagged this as unused, but this is a false positive. The get_unique_dataset_id function is part of the dataset management refactoring and is used by create_dataset as mentioned in the AI summary. Including it in the module's public API is appropriate.

🧰 Tools
🪛 Ruff (0.11.9)

10-10: .get_unique_dataset_id.get_unique_dataset_id imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)

cognee/tests/unit/modules/retrieval/summaries_retriever_test.py (1)

130-130: Correct parameter name update.

The change from limit=20 to top_k=20 properly aligns with the SummariesRetriever constructor signature shown in cognee/modules/retrieval/summaries_retriever.py (lines 12-14), which accepts a top_k parameter.

cognee/infrastructure/llm/prompts/cot_followup_system_prompt.txt (1)

1-3: Well-structured prompt for chain-of-thought follow-up functionality.

The prompt is clear, concise, and provides specific instructions for generating focused clarifying questions. The formatting guidelines ensure consistent output.

cognee/api/v1/delete/routers/get_delete_router.py (1)

61-62:

✅ Verification successful

Verify cognee_delete function compatibility with file.file input.

The change to pass file.file instead of the entire UploadFile object is correct from a FastAPI perspective, as it provides access to the underlying file object. However, ensure the cognee_delete function can properly handle this input type.


🏁 Script executed:

#!/bin/bash
# Description: Verify cognee_delete function signature and usage
# Expected: Function should accept file-like objects and handle them properly

echo "Checking cognee_delete function signature..."
ast-grep --pattern 'def cognee_delete($$$):'

echo "Checking other usages of cognee_delete to ensure consistency..."
rg -A 3 -B 3 "cognee_delete\(" cognee/api/v1/delete/

echo "Checking imports and function definition..."
rg -A 10 -B 5 "from cognee.api.v1.delete import delete as cognee_delete"

Length of output: 3387


🏁 Script executed:

#!/bin/bash
# Locate the actual definition of the `delete` function aliased as `cognee_delete`
echo "Searching for the delete function signature in cognee/api/v1/delete..."
rg -n "def delete" -R cognee/api/v1/delete

Length of output: 600


🏁 Script executed:

#!/bin/bash
# Correctly search for the delete function definition under cognee/api/v1/delete
echo "Looking for delete function signature in cognee/api/v1/delete directory..."
rg -n "def delete" cognee/api/v1/delete

Length of output: 563


🏁 Script executed:

#!/bin/bash
# Retrieve the signature and docstring of the delete function to confirm accepted types
echo "Fetching delete function signature and docstring..."
rg -n "async def delete" -A5 -B5 cognee/api/v1/delete/delete.py

Length of output: 1609


Compatibility Verified: The cognee_delete function accepts BinaryIO inputs and file.file (a SpooledTemporaryFile) satisfies that interface. No changes needed—this update is correct.

cognee/tests/test_weaviate.py (1)

85-85: LGTM! Correct async/await usage.

The addition of await properly handles the asynchronous list_all() method call, aligning with the broader refactoring of the Weaviate adapter to use fully asynchronous operations.

cognee/modules/settings/get_settings.py (2)

2-2: LGTM! Proper import addition.

Correctly imports Optional to support the type annotations changes below.


24-25: LGTM! Improved configuration flexibility.

Making endpoint and api_version optional enhances configuration flexibility, allowing LLM providers that don't require these fields to work seamlessly. This aligns well with the broader system enhancements for adaptable LLM provider configurations.

cognee/tests/unit/modules/retrieval/graph_completion_retriever_test.py (1)

157-162: LGTM! Improved async test execution.

Consolidating the individual test method calls into a single main() function that sequentially awaits each test is a better practice. This ensures proper execution within one event loop instead of multiple separate runs, preventing potential async execution issues.

cognee/infrastructure/llm/generic_llm_api/adapter.py (1)

43-43: LGTM! Prompt simplification is consistent with other adapters.

The removal of the explicit instruction prefix aligns with similar changes made in other LLM adapters as mentioned in the summary. This simplification maintains the core functionality while streamlining the prompt structure.

cognee/api/v1/responses/routers/__init__.py (1)

1-3: LGTM! Proper module initialization pattern.

The import and export structure follows Python best practices for package organization. The __all__ declaration properly controls the public API of the routers subpackage.

cognee/api/client.py (2)

17-17: LGTM! Import follows established pattern.

The import statement correctly follows the established pattern for importing routers in this file and is placed in the appropriate location with other router imports.


171-172: LGTM! Router registration follows established conventions.

The router registration correctly follows the established pattern used for other API endpoints in this file, with appropriate prefix and tags.

cognee-frontend/src/app/wizard/ExploreStep/ExploreStep.tsx (2)

5-5: LGTM! Type change aligns with broader refactoring.

The dataset type change from {id: string} to {name: string} is consistent with the broader frontend refactoring described in the AI summary and aligns with similar changes in other components.


11-11: LGTM! Simplified prop passing.

The removal of the non-null assertion operator (!) when passing the dataset prop suggests improved type safety or guaranteed non-null values, which is a positive change.

cognee/infrastructure/llm/ollama/adapter.py (1)

42-42: LGTM! Prompt simplification aligns with standardization effort.

The removal of the explicit instruction prefix and sending raw text_input directly aligns with the standardization effort across different LLM adapters mentioned in the AI summary. This simplification maintains functionality while creating consistency across the codebase.

cognee/api/v1/responses/__init__.py (1)

1-3: LGTM! Standard package initialization following best practices.

This is a well-structured package initialization file that follows Python best practices:

  • Correct import path for the router function
  • Proper use of __all__ to control public exports
  • Clean and simple design that facilitates convenient imports
cognee-frontend/src/ui/Partials/Explorer/Explorer.tsx (1)

10-10:

✅ Verification successful

Verify breaking change impact across the codebase.

The change from dataset: { id: string } to dataset: { name: string } is a breaking change that affects the component contract. While the AI summary indicates coordinated updates across related components, this needs thorough verification.

Run the following script to ensure all usages of the Explorer component have been updated:


🏁 Script executed:

#!/bin/bash
# Description: Find all usages of Explorer component to verify they pass the correct dataset prop structure

# Search for Explorer component imports and usage
echo "=== Explorer component imports ==="
rg -A 3 -B 1 "import.*Explorer"

echo -e "\n=== Explorer component usage ==="
rg -A 5 -B 2 "<Explorer.*dataset"

echo -e "\n=== Dataset prop with id property ==="
rg -A 2 -B 2 'dataset.*id.*string'

Length of output: 6116


🏁 Script executed:

#!/bin/bash
# Inspect ExplorerProps and dataset usage in Explorer.tsx

echo "=== interface ExplorerProps and nearby lines ==="
rg -n "interface ExplorerProps" -A10 -B2 cognee-frontend/src/ui/Partials/Explorer/Explorer.tsx || true

echo -e "\n=== getExplorationGraphUrl usage ==="
rg -n "getExplorationGraphUrl" -A5 -B2 cognee-frontend/src/ui/Partials/Explorer/Explorer.tsx || true

echo -e "\n=== dataset usages within Explorer.tsx ==="
rg -n "dataset\." -A5 -B2 cognee-frontend/src/ui/Partials/Explorer/Explorer.tsx || true

Length of output: 1548


#!/bin/bash
# Locate and inspect the definition of getExplorationGraphUrl
echo "=== Searching for getExplorationGraphUrl definition ==="
rg -n "export function getExplorationGraphUrl" -C 5 cognee-frontend/src/modules/exploration || true

echo -e "\n=== Fallback: any function getExplorationGraphUrl ==="
rg -n "function getExplorationGraphUrl" -C 5 cognee-frontend/src/modules/exploration || true

#!/bin/bash
# Locate the file defining getExplorationGraphUrl
echo "=== File(s) containing getExplorationGraphUrl ==="
rg -l "getExplorationGraphUrl" -R cognee-frontend/src/modules/exploration || true

echo -e "\n=== Inspect definition of getExplorationGraphUrl ==="
FILE=$(rg -l "getExplorationGraphUrl" -R cognee-frontend/src/modules/exploration | head -n1)
if [ -n "$FILE" ]; then
  rg -n "getExplorationGraphUrl" -C5 "$FILE" || true
fi

🏁 Script executed:

#!/bin/bash
# Search for definitions or exports of getExplorationGraphUrl across the repository
echo "=== getExplorationGraphUrl references ==="
rg -n "getExplorationGraphUrl" -C5

Length of output: 3187


#!/bin/bash
# Verify the dataset prop type in ExploreStep component
echo "=== ExploreStepProps definition ==="
rg -n "interface ExploreStepProps" -A5 -B2 cognee-frontend/src/app/wizard/ExploreStep/ExploreStep.tsx || true

echo -e "\n=== dataset usage in ExploreStepProps ==="
rg -n "dataset" -A2 -B0 cognee-frontend/src/app/wizard/ExploreStep/ExploreStep.tsx || true

#!/bin/bash
# Inspect ExploreStepProps and dataset usage in ExploreStep component
echo "=== ExploreStepProps definition ==="
rg -n "interface ExploreStepProps" -A5 -B2 cognee-frontend/src/app/wizard/ExploreStep/ExploreStep.tsx || true

echo -e "\n=== dataset prop in ExploreStepProps ==="
rg -n "dataset" -A2 -B2 cognee-frontend/src/app/wizard/ExploreStep/ExploreStep.tsx || true


🏁 Script executed:

#!/bin/bash
# Verify the dataset type in ExploreStepProps for ExploreStep component
echo "=== ExploreStepProps definition ==="
rg -n "interface ExploreStepProps" -A5 -B2 cognee-frontend/src/app/wizard/ExploreStep/ExploreStep.tsx || true

echo -e "\n=== dataset property in ExploreStepProps ==="
rg -n "dataset" -A3 -B0 cognee-frontend/src/app/wizard/ExploreStep/ExploreStep.tsx || true

Length of output: 786


No breaking change detected—all Explorer usages supply a name field
All instances of <Explorer> now receive objects containing a name property (e.g. { id, name } in DatasetsView and { name } in ExploreStep), which satisfies the updated ExplorerProps signature. No further updates are needed.

cognee/modules/retrieval/graph_completion_retriever.py (1)

78-78:

✅ Verification successful

Verify that EntityNotFoundError is handled upstream.

The removal of EntityNotFoundError handling from this method could introduce unhandled exceptions if the error can still be raised by get_triplets() or its dependencies.

Run the following script to verify error handling has been properly centralized:


🏁 Script executed:

#!/bin/bash
# Description: Check if EntityNotFoundError handling exists in get_triplets or brute_force_triplet_search

echo "=== Checking for EntityNotFoundError in brute_force_triplet_search ==="
ast-grep --pattern $'try:
  $$$
except EntityNotFoundError:
  $$$'

echo -e "\n=== Checking if EntityNotFoundError can still be raised ==="
rg -A 3 -B 3 "EntityNotFoundError"

echo -e "\n=== Checking get_triplets method implementation ==="
ast-grep --pattern $'async def get_triplets($$$):
  $$$'

Length of output: 18573


I’ll locate and inspect the graph_completion_retriever.py file and all get_triplets references to confirm where it’s defined and how errors are handled upstream.


🏁 Script executed:

#!/bin/bash
echo "Finding graph_completion_retriever.py path..."
find . -type f -name 'graph_completion_retriever.py'

echo "Displaying graph_completion_retriever.py (first 300 lines)..."
FILE=$(find . -type f -name 'graph_completion_retriever.py')
sed -n '1,300p' "$FILE"

echo "Searching for get_triplets usage and definition..."
rg -n "get_triplets"

Length of output: 6337


EntityNotFoundError is safely handled in brute_force_triplet_search
The brute_force_triplet_search utility (cognee/modules/retrieval/utils/brute_force_triplet_search.py lines 66–73) already catches and suppresses EntityNotFoundError. As a result, neither get_triplets nor its callers will propagate that exception, and no additional try/except is needed here.

cognee/infrastructure/databases/graph/graph_db_interface.py (1)

61-61:

❓ Verification inconclusive

Reconsider downgrading database error logging severity.

Changing database error logs from logger.error to logger.debug significantly reduces their visibility. Database operations failing (adding relationships, committing sessions) are typically serious issues that should remain at error level for production monitoring and debugging.

Unless these errors are truly expected and recoverable, consider keeping them at error level:

-                        logger.debug(f"Error adding relationship: {e}")
+                        logger.error(f"Error adding relationship: {e}")
-                        logger.debug(f"Error adding relationship: {e}")
+                        logger.error(f"Error adding relationship: {e}")
-                logger.debug(f"Error committing session: {e}")
+                logger.error(f"Error committing session: {e}")

What was the rationale for downgrading these database errors to debug level? Are these errors expected to occur frequently in normal operation?

Also applies to: 81-81, 88-88


Confirm intent behind lowering database error logs to debug level

I noticed that in

  • cognee/infrastructure/databases/graph/graph_db_interface.py
    • Line 61: logger.errorlogger.debug when adding node relationships
    • Line 81: logger.errorlogger.debug when adding edge relationships
    • Line 88: logger.errorlogger.debug when committing the session

Database operation failures are typically critical and warrant error-level logging for production monitoring and post-mortem analysis. If these exceptions are truly expected and safely recoverable in normal operation, please clarify the rationale. Otherwise, consider reverting to:

-    logger.debug(f"Error adding relationship: {e}")
+    logger.error(f"Error adding relationship: {e}")
-    logger.debug(f"Error committing session: {e}")
+    logger.error(f"Error committing session: {e}")

Can you confirm whether these database errors should remain at debug level?

cognee/__init__.py (1)

1-5: Well-implemented version management approach.

The integration of dynamic version detection is well done. The placement at the top of the file properly avoids circular imports, and the approach supports both development (from pyproject.toml) and installed package scenarios as shown in the get_cognee_version function.

The comment clearly explains the positioning rationale, which will help future maintainers.

cognee/infrastructure/llm/gemini/adapter.py (2)

2-3: LGTM! Good import organization.

The reorganization of imports with pydantic and typing imports moved up follows Python conventions and improves readability.


6-7: Excellent observability standardization.

The replacement of conditional observe import with the standardized get_observe() approach improves consistency across LLM adapters and centralizes observability configuration.

Also applies to: 17-17

cognee/modules/observability/observers.py (1)

4-9: Well-implemented Observer enum.

The enum design follows Python best practices by inheriting from both str and Enum, allowing for easy string comparisons while maintaining type safety. The monitoring tool options are clearly defined and appropriately named.

cognee/tests/test_neo4j.py (1)

90-90: Verify the search history count calculation.

The assertion expects 6 entries after commenting out the natural language search. Ensure this count matches the actual number of search operations performed above.

The count appears correct based on the 3 search operations (INSIGHTS, CHUNKS, SUMMARIES) that remain active, but consider making this count dynamic if more tests are added/removed.

cognee/api/v1/cognify/cognify.py (1)

37-39: Good improvement for pipeline management.

Adding the explicit pipeline_name="cognify_pipeline" parameter enhances pipeline tracking and aligns with the broader pipeline management improvements mentioned in the AI summary.

This change improves observability and allows for better pipeline status tracking without breaking existing functionality.

cognee-frontend/src/modules/exploration/getExplorationGraphUrl.ts (1)

3-3:

❓ Verification inconclusive

Verify the dataset parameter usage.

The function parameter was changed from { id: string } to { name: string }, but the dataset parameter is not used anywhere in the function body. This raises several concerns:

  1. Unused parameter: The function fetches from a generic /v1/visualize endpoint without passing any dataset-specific information.
  2. Potential API mismatch: The endpoint might expect dataset identification but isn't receiving it.
  3. Incomplete refactoring: This change might be part of incomplete dataset identification standardization.

Please verify the intended behavior by checking:

  1. Does the /v1/visualize endpoint expect dataset parameters?
  2. Should the dataset name be included in the request URL or body?

🏁 Script executed:

#!/bin/bash
# Search for other usages of getExplorationGraphUrl to understand expected behavior
rg -A 5 -B 5 "getExplorationGraphUrl"

# Check the backend visualize endpoint implementation
rg -A 10 "\/v1\/visualize"

Length of output: 3133


Confirm dataset usage in getExplorationGraphUrl

The dataset parameter ({ name: string }) is declared but never used. The call to
fetch('/v1/visualize') does not include any dataset-specific information, which suggests an incomplete refactoring or a mismatch with the API.

Please verify:

  • Does the /v1/visualize endpoint require a dataset identifier (e.g. name or id)?
  • If so, should dataset.name be included as a query parameter or in the request body?

File:

  • cognee-frontend/src/modules/exploration/getExplorationGraphUrl.ts
cognee/infrastructure/llm/anthropic/adapter.py (1)

20-22: Excellent async client upgrade!

The change from anthropic.Anthropic() to anthropic.AsyncAnthropic() is a great improvement that:

  • Aligns with the async method signatures (acreate_structured_output)
  • Improves performance through asynchronous operations
  • Maintains consistency with other LLM adapters in the codebase
  • Properly integrates with the instructor.patch wrapper
cognee/api/v1/datasets/datasets.py (1)

31-31: Good alignment with pipeline status API changes.

Adding the explicit pipeline_name="cognify_pipeline" parameter is a positive change that:

  • Aligns with the updated get_pipeline_status function signature (as shown in the relevant code snippets)
  • Makes the status query more specific and targeted to the cognify pipeline
  • Improves code clarity by being explicit about which pipeline status is being retrieved
  • Maintains consistency with pipeline naming conventions used throughout the system
cognee/infrastructure/llm/prompts/cot_followup_user_prompt.txt (1)

1-14: Well-structured prompt template for chain-of-thought reasoning.

The prompt effectively guides the model to generate targeted follow-up questions by:

  • Providing clear context about knowledge graph exploration
  • Using structured tags for different input components
  • Emphasizing the need for exactly one question to resolve validation gaps
cognee/infrastructure/llm/prompts/cot_validation_user_prompt.txt (1)

1-11: Clean and well-structured validation prompt template.

The template provides a clear, organized structure for chain-of-thought validation with:

  • Consistent XML-style tag formatting
  • Logical separation of question, answer, and context
  • Simple and effective layout for LLM processing
cognee/modules/pipelines/operations/get_pipeline_status.py (1)

8-8: LGTM! Enhanced pipeline status filtering.

The addition of the pipeline_name parameter provides better specificity for pipeline status queries. This change aligns well with the broader pipeline improvements mentioned in the PR and correctly utilizes the pipeline_name column from the PipelineRun model.

Also applies to: 23-23

cognee/base_config.py (1)

5-5: LGTM! Clean refactoring from MonitoringTool to Observer.

The transition from MonitoringTool to Observer maintains the same functionality while using the updated enum structure. The Observer.LANGFUSE value is properly defined in the new enum, ensuring consistency across the codebase.

Also applies to: 11-11

cognee/tests/test_relational_db_migration.py (2)

115-118: LGTM! Updated assertions for enhanced migration granularity.

The increased expected counts (12 distinct nodes and 15 edges) reflect the enhanced migration logic that creates more granular graph structures, including individual column data as separate nodes.


161-162:

❓ Verification inconclusive

Verify the updated node and edge count expectations.

The substantial increases in expected counts (SQLite: 543 nodes/1317 edges, PostgreSQL: 522 nodes/961 edges) suggest significant changes to the migration process. While this aligns with the enhanced migration logic mentioned in the AI summary, please ensure these specific values are accurate.

Run the following script to validate the current migration output matches these expectations:

Also applies to: 192-193


🏁 Script executed:

#!/bin/bash
# Description: Verify that the migration process produces the expected node/edge counts

# Search for the migration function to understand the new logic
ast-grep --pattern 'async def migrate_relational_database($$$) {
  $$$
}'

# Look for ColumnValue model references to confirm the enhanced granularity
rg -A 5 "ColumnValue" --type py

# Check for any recent changes to migration logic
rg -A 10 -B 5 "migrate_column_data" --type py

Length of output: 5530


Confirm the migration produces 543 nodes and 1317 edges
We’ve identified the updated migration logic (including ColumnValue nodes), but haven’t yet validated that it yields exactly 543 nodes/1317 edges (and 522/961 for PostgreSQL). Please run the migration test against your representative SQLite and PostgreSQL datasets and adjust the assertions in cognee/tests/test_relational_db_migration.py (lines 161–162 and 192–193) if the actual counts differ.

• Run pytest cognee/tests/test_relational_db_migration.py with your sample databases
• Verify the node/edge counts reported by the migration
• Update the hard-coded expected values to match reality

notebooks/cognee_demo.ipynb (2)

470-470: LGTM! Import correction aligns with module structure.

The import statement correctly references the task module with lowercase naming, which aligns with the broader codebase refactoring mentioned in the summary.


508-508: LGTM! User context propagation implemented correctly.

The explicit user argument in the run_tasks call correctly aligns with the API changes that now require user context for pipeline execution.

cognee-frontend/src/app/page.tsx (2)

46-46: LGTM: Clear user guidance added.

The updated notification message effectively guides users to run "Cognify" when ready, improving the user experience flow.


106-106: LGTM: Proper prop integration.

The onCognify prop is correctly passed to the DataView component, enabling the cognification trigger functionality.

cognee/infrastructure/llm/openai/adapter.py (2)

18-20: LGTM: Standardized observability integration.

The import cleanup and direct use of get_observe() aligns with the observability framework standardization mentioned in the AI summary.


66-66: Verify that simplified prompts maintain structured output quality.

The removal of the instruction prefix "Use the given format to extract information from the following input:" from user messages may impact the LLM's ability to generate properly structured outputs, especially for complex extraction tasks.

Please test that structured output generation still works correctly with the simplified prompts, particularly for complex data extraction scenarios.

Also applies to: 93-93

cognee/modules/engine/models/ColumnValue.py (1)

1-10: LGTM: Well-structured model class.

The ColumnValue class correctly inherits from DataPoint and follows the established pattern with:

  • Proper field typing for name, description, and properties
  • Appropriate metadata configuration specifying properties as an index field
  • Clean integration with the existing engine model infrastructure
cognee-frontend/src/ui/Partials/SearchView/SearchView.tsx (2)

81-81: LGTM: Improved user experience with immediate input clearing.

Moving the input clearing to happen immediately after adding the user message improves responsiveness and user experience, rather than waiting for the fetch response.


108-110: LGTM: Good error handling for input recovery.

The error handling correctly restores the input value if the fetch request fails, preventing loss of user input and improving the overall user experience.

cognee/infrastructure/databases/graph/get_graph_engine.py (1)

70-81: LGTM! Memgraph integration follows established patterns.

The Memgraph integration is well-implemented and follows the same patterns as other database providers:

  • Proper credential validation with descriptive error messages
  • Consistent import and instantiation approach
  • Matches the structure used for Neo4j, FalkorDB, and Kuzu

The implementation correctly validates required credentials and provides clear error messaging when they're missing.

cognee/infrastructure/databases/vector/exceptions/exceptions.py (2)

9-9: Good fix for the incorrect default name parameter.

The default name parameter was incorrectly set to "DatabaseNotCreatedError" instead of "CollectionNotFoundError". This fix ensures the error name correctly reflects the exception type.


11-14: Enhancement: Added configurable logging parameters.

The addition of log and log_level parameters provides better control over error logging behavior. This allows callers to customize whether and at what level collection-not-found errors should be logged, which is particularly useful for scenarios where missing collections might be expected behavior.

notebooks/cognee_graphiti_demo.ipynb (5)

18-18: Good fix for the typo.

The correction from "libraires" to "libraries" improves readability.


131-132: Simplified prune calls are cleaner.

The removal of commented logging setup and direct async calls improve code readability.


135-136: Good addition of user context initialization.

Adding explicit user context aligns with the broader codebase pattern of associating user information with pipeline operations.


143-143: Pipeline execution improvements look good.

The cleaned up task list formatting and addition of user=user parameter to run_tasks properly integrates user context into pipeline execution.

Also applies to: 145-145


148-148: Simplified print statement is cleaner.

Direct printing of the result instead of formatted message reduces unnecessary complexity.

cognee/modules/pipelines/operations/log_pipeline_run_initiated.py (1)

6-14: Function signature and PipelineRun creation look correct.

The function properly accepts the required parameters and creates a PipelineRun instance with appropriate fields including the new status enum value.

alembic/versions/1d0bb7fede17_add_pipeline_run_status.py (2)

23-29: Migration logic is correct and database-safe.

The use of ADD VALUE IF NOT EXISTS prevents errors if the enum value already exists, and limiting to PostgreSQL dialect is appropriate.


32-33: Empty downgrade function is acceptable for enum additions.

Enum value removal in PostgreSQL is complex and often not necessary for this type of change.

cognee/modules/search/methods/search.py (3)

14-17: New retriever imports follow proper naming conventions.

The imports for GraphCompletionCotRetriever and GraphCompletionContextExtensionRetriever are properly structured and follow the existing pattern.


26-26: Import path correction is good.

Fixing the import to use absolute paths from cognee.modules.search.operations improves import clarity.


77-84: New search type integrations are properly implemented.

The new search types GRAPH_COMPLETION_COT and GRAPH_COMPLETION_CONTEXT_EXTENSION are correctly integrated into the search_tasks dictionary with appropriate parameters.

cognee/eval_framework/answer_generation/answer_generation_executor.py (2)

3-6: LGTM! New retriever imports added correctly.

The new retriever imports are properly formatted and follow the existing import pattern.


17-18: LGTM! New retriever options added correctly.

The new retriever entries follow the existing naming convention and are properly mapped to their respective classes.

cognee-frontend/src/modules/ingestion/DataView/DataView.tsx (4)

9-10: LGTM! New imports added correctly.

The CTAButton and useBoolean imports are properly added to support the new cognify functionality.


33-33: LGTM! OnCognify prop added correctly.

The new prop follows TypeScript conventions and properly defines the expected async function signature.


60-70: LGTM! Cognify state management implemented correctly.

The useBoolean hook provides clean state management for the disabled state, and the handleCognify function properly handles the async operation with loading state.


80-84: LGTM! Cognify button implementation is correct.

The CTAButton is properly configured with the disabled state and onClick handler, providing good UX during the cognify operation.

cognee/shared/logging_utils.py (3)

43-48: LGTM! System information constants added appropriately.

The version and OS information constants provide useful debugging context for log analysis.


326-326: Verify the condition change for SQLAlchemy warning suppression.

The condition changed from log_level > logging.WARNING to log_level > logging.DEBUG, which broadens when SQLAlchemy warnings are suppressed. This means warnings will now be suppressed at INFO level and above, not just ERROR and CRITICAL.

Please confirm this change is intentional. The new condition will suppress SQLAlchemy warnings for most common log levels (INFO, WARNING, ERROR, CRITICAL), only showing them when DEBUG level is set.


340-351: LGTM! System information logging enhances debugging capability.

The addition of system information logging provides valuable context for troubleshooting and debugging. The implementation correctly logs Python version, structlog version, cognee version, and OS information.

cognee-starter-kit/src/pipelines/custom-model.py (5)

8-27: LGTM! Proper directory setup and configuration.

The directory setup correctly uses pathlib for cross-platform compatibility and properly configures both data and system directories relative to the script location.


25-27: LGTM! Clean state management with pruning.

The pruning operations ensure a fresh state for the example, which is appropriate for a starter pipeline that should be repeatable.


36-52: LGTM! Well-designed custom data models.

The custom DataPoint classes are properly structured with:

  • Clear inheritance from DataPoint
  • Appropriate type annotations
  • Metadata configuration for indexing
  • Logical relationships between entities

70-88: LGTM! Comprehensive search examples.

The script demonstrates all major search types (graph completion, RAG completion, summaries, chunks) with clear output formatting, providing excellent learning examples.


91-92: LGTM! Proper async main pattern.

The script follows Python best practices for async execution with the standard if __name__ == "__main__": guard and asyncio.run().

cognee/tests/unit/modules/retrieval/chunks_retriever_test.py (2)

19-19: Good improvement for test isolation.

The updated directory paths with more specific test context names (e.g., test_chunks_context_simple, test_chunk_context_complex, test_chunk_context_empty) enhance test isolation by ensuring each test uses its own dedicated storage directories. This prevents potential interference between test runs.

Also applies to: 23-23, 76-76, 80-80, 165-165, 169-169


193-198: Cleaner async test execution pattern.

The consolidation of test execution into a single main() coroutine that sequentially awaits all test methods is a cleaner approach compared to multiple separate run() calls. This ensures proper sequential execution and better error handling.

examples/database_examples/chromadb_example.py (1)

8-87: Excellent example structure and documentation.

This example follows a clear, well-documented pattern:

  1. Configuration setup for ChromaDB
  2. Directory configuration
  3. Data cleanup and setup
  4. Data ingestion and processing
  5. Multiple search demonstrations

The inline comments and docstring provide good guidance for users. The example effectively demonstrates Cognee's integration with ChromaDB across different search types.

cognee/exceptions/exceptions.py (1)

15-16: Good enhancement for configurable logging.

Adding log and log_level parameters provides valuable flexibility for controlling error logging behavior across the application.

examples/data/car_and_tech_companies.txt (1)

1-37: Good sample data for testing and examples.

The content provides comprehensive descriptions of major companies in automotive and technology sectors, making it suitable for testing knowledge extraction and search functionality in Cognee examples.

🧰 Tools
🪛 LanguageTool

[duplication] ~2-~2: Possible typo: you repeated a word.
Context: text_1 = """ 1. Audi Audi is known for its modern designs and adv...

(ENGLISH_WORD_REPEAT_RULE)


[duplication] ~5-~5: Possible typo: you repeated a word.
Context: ...ns to high-performance sports cars. 2. BMW BMW, short for Bayerische Motoren Werke, is...

(ENGLISH_WORD_REPEAT_RULE)


[style] ~6-~6: Consider using a more concise synonym.
Context: ... reflects that commitment. BMW produces a variety of cars that combine luxury with sporty pe...

(A_VARIETY_OF)


[duplication] ~8-~8: Possible typo: you repeated a word.
Context: ...ine luxury with sporty performance. 3. Mercedes-Benz Mercedes-Benz is synonymous with luxury and quality. ...

(ENGLISH_WORD_REPEAT_RULE)


[duplication] ~11-~11: Possible typo: you repeated a word.
Context: ... catering to a wide range of needs. 4. Porsche Porsche is a name that stands for high-performa...

(ENGLISH_WORD_REPEAT_RULE)


[duplication] ~14-~14: Possible typo: you repeated a word.
Context: ...o value both performance and style. 5. Volkswagen Volkswagen, which means "people's car" in German, ...

(ENGLISH_WORD_REPEAT_RULE)


[grammar] ~17-~17: The plural determiner ‘these’ does not agree with the singular noun ‘car’.
Context: ...nce practicality with quality. Each of these car manufacturer contributes to Germany's r...

(THIS_NNS)


[uncategorized] ~17-~17: The grammatical number of this noun doesn’t look right. Consider replacing it.
Context: ...cality with quality. Each of these car manufacturer contributes to Germany's reputation as ...

(AI_EN_LECTOR_REPLACEMENT_NOUN_NUMBER)


[duplication] ~21-~21: Possible typo: you repeated a word.
Context: ...design excellence. """ text_2 = """ 1. Apple Apple is renowned for its innovative consumer...

(ENGLISH_WORD_REPEAT_RULE)


[duplication] ~27-~27: Possible typo: you repeated a word.
Context: ... in shaping the internet landscape. 3. Microsoft Microsoft Corporation has been a dominant force i...

(ENGLISH_WORD_REPEAT_RULE)


[style] ~28-~28: Consider using a synonym to be more concise.
Context: ...n both business and personal computing. In recent years, Microsoft has expanded into cloud comp...

(IN_RECENT_STYLE)


[uncategorized] ~31-~31: You might be missing the article “the” here.
Context: ...or innovation continues to reshape both retail and technology sectors. 5. Meta Meta, ...

(AI_EN_LECTOR_MISSING_DETERMINER_THE)


[duplication] ~33-~33: Possible typo: you repeated a word.
Context: ...both retail and technology sectors. 5. Meta Meta, originally known as Facebook, revoluti...

(ENGLISH_WORD_REPEAT_RULE)

cognee/modules/data/methods/create_dataset.py (4)

7-8: LGTM! Clean import additions for refactoring.

The new imports properly support the function signature change and the extracted dataset ID generation logic.


11-11: Excellent architectural improvement!

Changing from primitive owner_id to the full User object provides better type safety and richer context. This follows good OOP principles and makes the API more robust.


12-12: Proper handling of the signature change.

Correctly extracts the owner ID from the User object to maintain compatibility with existing database logic and relationships.


24-25: Well-implemented separation of concerns.

The async call to get_unique_dataset_id correctly extracts the UUID generation logic while maintaining the same functionality. The function parameters and async handling are properly implemented.

notebooks/graphrag_vs_rag.ipynb (4)

56-56: Appropriate version update.

Updating to cognee 0.1.39 aligns with the API changes throughout the notebook and ensures compatibility with the new import paths and function signatures.


152-152: Good API restructuring to versioned namespace.

The import path change from internal modules to cognee.api.v1.search represents better API organization with proper versioning. This is a good practice for maintaining backward compatibility.


153-153: Excellent API improvements with keyword arguments.

The consistent use of keyword arguments (query_type=, query_text=) across all search calls improves code clarity and maintainability. The enum name change from COMPLETION to RAG_COMPLETION is more descriptive and better communicates the search type's purpose.

Also applies to: 173-173, 202-202


56-56: Comprehensive and consistent API migration.

All API changes are applied consistently throughout the notebook - package version, import paths, and function call patterns are all properly aligned. No mixed usage of old and new patterns detected.

Also applies to: 152-152, 153-153, 173-173, 202-202

cognee/tasks/temporal_awareness/index_graphiti_objects.py (2)

34-34: Method name change looks correct.

The update from get_model_independent_graph_data() to get_graph_data() aligns with the API standardization mentioned in the summary.


36-40:

✅ Verification successful

Verify the new data structure format.

The iteration pattern change suggests nodes_data now returns tuples of (node_id, node_data) instead of just node dictionaries. This is a significant change that should be verified across the codebase.


🏁 Script executed:

#!/bin/bash
# Description: Search for other usages of get_graph_data or nodes_data iteration
# Expected: All should use the new tuple format

echo "Searching for other get_graph_data usages..."
rg "get_graph_data|nodes_data" -A 3 -B 1

Length of output: 14055


All graph_data consumers unpack nodes_data as (id, data) tuples
Verified that every call site of get_graph_data expects and correctly unpacks a list of (node_id, node_data) tuples:

  • cognee/tasks/temporal_awareness/index_graphiti_objects.py (L36–40)
  • cognee/modules/visualization/cognee_network_visualization.py
  • cognee/modules/graph/cognee_graph/CogneeGraph.py

No callers iterate over nodes_data as plain dictionaries. The new tuple format is applied consistently—no further changes needed.

cognee/tests/unit/modules/pipelines/run_tasks_with_context_test.py (2)

14-22: Task definitions look correct for context testing.

The task definitions properly demonstrate different scenarios:

  • task_1: Uses both input and context
  • task_2: Only uses input (no context parameter)
  • task_3: Uses both input and context

This effectively tests the context propagation mechanism.


37-39:

✅ Verification successful

Verify the expected result calculation.

Let me trace through the calculation:

  • Initial data: 5, context: 7
  • task_1: 5 + 7 = 12
  • task_2: 12 * 2 = 24
  • task_3: 24 ** 7 = 4,586,471,424

The calculation appears correct.


🏁 Script executed:

#!/bin/bash
# Description: Verify the mathematical calculation
# Expected: 24^7 should equal 4586471424

python3 -c "print(f'24^7 = {24**7}')"

Length of output: 63


Calculation Verified
The computation 24 ** 7 equals 4,586,471,424 as expected. No changes required.

examples/database_examples/qdrant_example.py (1)

66-85: Excellent demonstration of different search types.

The example effectively demonstrates the various search capabilities with clear output formatting and proper async handling.

cognee/modules/pipelines/operations/run_tasks.py (3)

23-25: Context parameter addition looks correct.

The addition of the optional context: dict = None parameter maintains backward compatibility while enabling context propagation.


88-93: Consider using keyword arguments for clarity.

The function call uses keyword arguments for most parameters, which improves readability and reduces the risk of parameter ordering errors.


77-93:

✅ Verification successful

Verify parameter forwarding consistency.

The context parameter is properly added and forwarded to run_tasks_with_telemetry. The parameter positioning and default value are consistent.

However, ensure that the parameter order in the function call matches the function signature:


🏁 Script executed:

#!/bin/bash
# Description: Check if run_tasks_with_telemetry signature matches the call
# Expected: Parameter order should be consistent

echo "Checking run_tasks_with_telemetry function signature..."
rg "async def run_tasks_with_telemetry" -A 2

echo -e "\nChecking function call..."
rg "run_tasks_with_telemetry" -A 5 -B 1

Length of output: 1682


Parameter forwarding verified: signature and call align

The run_tasks_with_telemetry function signature (tasks, data, user, pipeline_name, context=None) matches the named arguments in the call. No further changes needed—approving as-is.

cognee/tasks/ingestion/migrate_relational_database.py (3)

11-11: LGTM! Import addition supports new functionality.

The addition of ColumnValue to the imports correctly supports the new column data migration feature.


16-16: Good backward compatibility approach.

Adding the optional parameter migrate_column_data with a default value of True maintains backward compatibility while enabling the new functionality.


98-133: Column data migration implementation looks solid.

The new column data migration functionality is well-structured:

  • Properly filters out primary keys and foreign keys to avoid duplication
  • Creates appropriate ColumnValue nodes with meaningful IDs
  • Establishes correct relationships between table rows and column values
  • Uses consistent node ID generation patterns

The implementation integrates well with the existing migration logic.

cognee/api/v1/cognify/code_graph_pipeline.py (4)

5-6: Good import organization and observability refactoring.

The import reordering and move to centralized get_observe() improves code modularity and maintainability.


16-16: Excellent addition for user-aware dataset handling.

The import of get_unique_dataset_id supports the transition to user-aware dataset ID generation, which is a significant improvement over static UUIDs.


25-25: Centralized observability management.

Using get_observe() instead of conditional imports provides better consistency and maintainability across the codebase.


68-68: Improved dataset ID generation.

The change from static UUID generation to user-aware async method get_unique_dataset_id("codebase", user) enables better data isolation and user context handling.

examples/database_examples/falkordb_example.py (1)

8-87: Excellent comprehensive example for FalkorDB integration.

This example script is well-structured and demonstrates:

  • Proper FalkorDB configuration
  • Clear data directory setup
  • Complete workflow from data ingestion to search
  • Multiple search types (insights, chunks, graph completion)
  • Good documentation and comments

The implementation follows consistent patterns with other database examples and provides users with a ready-to-run reference.

examples/database_examples/kuzu_example.py (1)

8-85: Excellent KuzuDB integration example.

This example script provides a comprehensive demonstration of:

  • Proper KuzuDB configuration
  • Data directory management
  • Complete cognify workflow
  • Multiple search operations with different types
  • Clear documentation and structure

The implementation is consistent with other database examples and serves as an excellent reference for users wanting to integrate with KuzuDB.

notebooks/cognee_openai_compatable_demo.ipynb (1)

1-109: Excellent demonstration of the OpenAI-compatible API integration!

This notebook effectively showcases the new OpenAI-compatible responses API with clear examples of both cognify and search operations. The execution outputs demonstrate successful function calls and proper API responses.

examples/database_examples/neo4j_example.py (2)

8-18: Well-structured example with clear documentation.

The function documentation clearly explains the workflow and the script follows the established pattern of other database examples in the repository.


67-86: Consistent search pattern implementation.

The three search operations demonstrate different SearchType values effectively and match the pattern used in other database examples.

cognee/tests/test_starter_pipelines.py (4)

7-9: Clear test class with good documentation.

The class name and docstring clearly indicate the purpose of testing pipeline execution.


10-27: Excellent environment validation and setup.

The setUp method properly validates required environment variables and skips tests when prerequisites aren't met, preventing false negatives.


28-51: Robust pipeline execution with good error handling.

The helper method includes proper timeout handling, virtual environment detection, and detailed error reporting.


52-66: Comprehensive test coverage for all pipeline scripts.

All three pipeline scripts are tested with consistent assertion patterns.

examples/database_examples/milvus_example.py (4)

8-18: Clear documentation following established pattern.

The function documentation is consistent with other database examples and clearly explains the workflow.


29-38: Appropriate vector database configuration.

The script correctly uses set_vector_db_config for Milvus (a vector database) instead of set_graph_db_config, and the local database path construction is suitable for an example.


47-52: Relevant sample text for vector database demonstration.

The sample text effectively describes Milvus features and capabilities, making it appropriate for testing vector search functionality.


60-81: Consistent search pattern implementation.

The three search operations match the pattern used in other database examples, demonstrating different SearchType capabilities effectively.

cognee/eval_framework/eval_config.py (4)

3-3: LGTM!

Good addition of the Optional import to support the new instance_filter field.


17-17: LGTM!

The expansion of QA engine options to include the new graph completion variants looks good and aligns well with the new retrievers mentioned in the AI summary.


44-44: LGTM!

The new instance_filter field is properly typed and provides useful filtering functionality for evaluations.


70-70: LGTM!

Good consistency in including the new instance_filter field in the to_dict method output.

cognee/api/v1/responses/default_tools.py (1)

1-63: Well-structured tool definitions following OpenAI function calling standards.

The DEFAULT_TOOLS constant is well-implemented with:

  • Proper JSON schema structures for each tool
  • Clear descriptions and parameter types
  • Appropriate required/optional field definitions
  • Good practice of commenting out the potentially dangerous prune function

The search tool enum values and parameter structure look comprehensive for the various search types supported.

cognee/modules/retrieval/graph_completion_context_extension_retriever.py (1)

24-74: Well-implemented context extension logic with proper early stopping.

The get_completion method effectively implements iterative context extension with:

  • Proper initialization and context handling when none is provided
  • Smart early stopping when no new triplets are found (lines 54-58)
  • Good logging for monitoring the extension process
  • Deduplication of triplets using set() conversion (line 49)

The implementation follows good async patterns and handles the iterative process cleanly.

examples/database_examples/pgvector_example.py (2)

8-99: Excellent comprehensive example demonstrating PGVector integration.

This example script effectively demonstrates:

  • Proper configuration of PGVector and PostgreSQL settings
  • Complete workflow from data addition to knowledge extraction
  • Multiple search types (insights, chunks, graph completion)
  • Good documentation and comments explaining each step
  • Proper async/await usage throughout

The script serves as a clear reference for users wanting to integrate Cognee with PGVector.


8-18: Inconsistency between PR objectives and actual changes.

The PR objectives mention "add getting started tutorial to git" but the changes appear to be substantial enhancements to the Cognee codebase including new retrievers, API tools, and database examples. This seems inconsistent with the stated objectives.

Likely an incorrect or invalid review comment.

cognee/api/v1/responses/routers/default_tools.py (2)

1-37: LGTM! Well-structured search tool definition.

The search tool schema is well-defined with proper parameter types, descriptions, and validation. The enum values for search_type are comprehensive and the optional parameters are appropriately configured.


16-22: Inconsistent with AI summary - missing search types.

The AI summary mentions that "GRAPH_COMPLETION_COT" and "GRAPH_COMPLETION_CONTEXT_EXTENSION" were added to the search_type enum, but they're not present in the current code.

#!/bin/bash
# Search for any references to the missing search types in the codebase
rg -i "GRAPH_COMPLETION_COT|GRAPH_COMPLETION_CONTEXT_EXTENSION"

Likely an incorrect or invalid review comment.

examples/database_examples/weaviate_example.py (2)

8-18: Excellent documentation and structure.

The function documentation clearly explains the workflow and the script is well-organized with clear step-by-step comments.


64-84: Good variety of search examples.

The search examples demonstrate different search types effectively and provide good coverage of the API capabilities.

cognee-starter-kit/README.md (1)

1-13: Well-structured introduction and overview.

The README provides a clear and comprehensive introduction to the starter kit with good organization and helpful feature highlights.

cognee/modules/pipelines/operations/run_tasks_base.py (2)

17-17: Good implementation of context parameter propagation.

The addition of the optional context parameter with proper default value handling enables flexible context passing through the pipeline execution.

Also applies to: 66-66


35-36: Well-implemented conditional context passing.

The parameter inspection approach ensures context is only passed to functions that expect it, and the recursive propagation maintains context throughout the execution chain.

Also applies to: 40-40, 79-81

cognee/api/v1/responses/routers/get_responses_router.py (1)

74-74: ⚠️ Potential issue

Fix the static analysis warning for function call in argument defaults.

The static analysis tool correctly identifies that using Depends() in argument defaults can cause issues.

Move the dependency call inside the function or use a module-level variable:

     async def create_response(
         request: ResponseRequest,
-        user: User = Depends(get_authenticated_user),
+        user: User = Depends(get_authenticated_user),
     ) -> ResponseBody:

Actually, the current usage is correct for FastAPI. The static analysis tool is giving a false positive here since FastAPI specifically designed Depends() to be used this way in function signatures.

🧰 Tools
🪛 Ruff (0.11.9)

74-74: Do not perform function call Depends in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable

(B008)

cognee/infrastructure/databases/vector/pgvector/PGVectorAdapter.py (3)

186-188: Double-check CollectionNotFoundError signature

A new log_level="DEBUG" parameter is passed to the exception. If CollectionNotFoundError still only accepts message: str, this will raise a TypeError and mask the real “collection not found” problem.

-raise CollectionNotFoundError(
-    f"Collection '{collection_name}' not found!", log_level="DEBUG"
-)
+raise CollectionNotFoundError(f"Collection '{collection_name}' not found!", log_level="DEBUG")
# or drop the extra arg if the class signature has not been updated

Please confirm the updated constructor and adjust accordingly.


222-237: Avoid shadowing & clarify variable semantics

closest_items is first initialised as an empty list (line 223) and then immediately overwritten by the CursorResult returned from session.execute (line 236).
This shadowing:

  1. Obscures the real type held by the variable (list → CursorResult).
  2. Makes the later closest_items.all() call slightly confusing to read.

A tiny rename keeps intent crystal-clear:

-# NOTE: This needs to be initialized in case search doesn't return a value
-closest_items = []
+closest_items_result: CursorResult | None = None
 ...
-closest_items = await session.execute(query)
+closest_items_result = await session.execute(query)
 ...
-for row in closest_items_result.all():
+for row in closest_items_result.all():

[ suggest_optional_refactor ]


241-252: Graceful handling of empty result sets

Good catch adding the explicit early-return on an empty vector_list (lines 250-252). ✔️
This prevents a division-by-zero inside normalize_distances.

cognee/infrastructure/databases/graph/networkx/adapter.py (1)

624-636: Return value consistency

get_nodes now returns dictionaries containing "id" as a UUID.
Call-sites that previously expected strings (e.g., REST serialisation) will need to convert.
Suggest documenting the change in the adapter docstring and/or adding:

{"id": str(node_id), **self.graph.nodes[node_id]}

to minimise downstream breakage.

cognee/infrastructure/databases/vector/weaviate_db/WeaviateAdapter.py (2)

37-47: Repeated connect() calls can leak sockets

get_client() calls await self.client.connect() every time.
The async client usually keeps an internal pool; reconnecting on every DAO method may open superfluous websockets.

async def get_client(self):
-    await self.client.connect()
-    return self.client
+    if not getattr(self.client, "is_connected", False):
+        await self.client.connect()
+    return self.client

(or cache a boolean flag).
[ suggest_essential_refactor ]


166-174: search parameter validation looks good

Nice consolidation of distance handling and a higher default limit = 15.
Input validation correctly prevents None/None queries. ✔️

Comment on lines 332 to 333
# Drop all tables from the public schema
schema_list = ["public", "public_staging"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Verify the behavioral change from dynamic to static schema targeting.

The hardcoded schema list ["public", "public_staging"] replaces the dynamic get_schema_list() call, fundamentally changing the scope of the database deletion operation. This could be a breaking change if other schemas exist that need cleanup.

The method name delete_database() suggests complete database cleanup, but now it only targets specific schemas. Consider either:

  1. Updating the method name to reflect the limited scope (e.g., delete_core_schemas())
  2. Adding documentation explaining why only these schemas are targeted
  3. Reverting to dynamic schema detection if complete cleanup is intended

🏁 Script executed:

#!/bin/bash
# Description: Verify if there are other usages of get_schema_list() and check for potential schemas that might be missed
# Expected: Find other usages of get_schema_list() and understand the impact

echo "=== Checking other usages of get_schema_list() ==="
rg -A 3 -B 3 "get_schema_list"

echo "=== Checking for other schema references in the codebase ==="
rg -A 2 -B 2 "schema.*=|schema_name.*="

echo "=== Checking for migration or setup files that might create additional schemas ==="
fd -e sql -e py | xargs rg -l "CREATE SCHEMA|create.*schema"

Length of output: 86031


Clarify delete_database() scope
Our search found no other schema definitions or uses of get_schema_list() outside of public and public_staging. The change from dynamic to ["public","public_staging"] therefore limits cleanup to just those two schemas. To avoid confusion or future breakage if new schemas are introduced, please either:

• Update the delete_database() docstring (in cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py) to explain that it only drops tables in the “public” and “public_staging” schemas, or
• Rename the method (e.g., to delete_core_schemas()) to more accurately reflect its limited scope.

🤖 Prompt for AI Agents
In cognee/infrastructure/databases/relational/sqlalchemy/SqlAlchemyAdapter.py
around lines 332 to 333, the delete_database() method now only drops tables from
the hardcoded schemas "public" and "public_staging" instead of dynamically
targeting all schemas. To clarify this limited scope and prevent confusion,
either update the delete_database() method's docstring to explicitly state it
only affects these two schemas or rename the method to something like
delete_core_schemas() that better reflects this narrower functionality.

@@ -44,7 +44,7 @@ async def embed_text(self, text: List[str]) -> List[List[float]]:
embeddings = self.embedding_model.embed(
text,
batch_size=len(text),
parallel=0,
parallel=None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Verify the correct parameter value for fastembed's parallel option.

The change from parallel=0 to parallel=None could have different semantics depending on the fastembed library's API. Please ensure this aligns with the intended behavior and the library's documentation.


🌐 Web query:

What is the correct usage of the parallel parameter in fastembed TextEmbedding.embed() method?

💡 Result:

The parallel parameter in FastEmbed's embed() method is used to enable data-parallel processing for embedding generation, which can significantly improve performance when working with large datasets.

Correct Usage of the Parallel Parameter

The parallel parameter accepts an integer value that determines how the embedding workload will be distributed:

  • If set to a value greater than 1, data-parallel encoding will be used, which is recommended for offline encoding of large datasets[6][9].
  • If set to 0, all available CPU cores will be used[9].
  • If set to None (default), data-parallel processing won't be used, and the default ONNX runtime threading will be used instead[9].

When working with GPU acceleration, there are additional considerations:

  • The value of parallel must match the number of GPUs specified in the device_ids parameter[5].
  • For example, if you specify device_ids=[0,1,2,3], you should set parallel=4[5].
  • This ensures proper utilization by spawning child processes for each GPU.

Implementation Example

Here's how you can use the parallel parameter with FastEmbed:

from fastembed.embedding import DefaultEmbedding

# Initialize the embedding model
embedding_model = DefaultEmbedding()

# For CPU-based parallel processing
embeddings = list(embedding_model.embed(documents, parallel=4, batch_size=256))

# For GPU-based parallel processing
embedding_model_gpu = TextEmbedding(
    model_name="intfloat/multilingual-e5-large",
    cuda=True,
    device_ids=[0, 1, 2, 3]
)

# Must match number of GPUs in device_ids
embeddings = list(embedding_model_gpu.embed(documents, parallel=4, batch_size=256))

Additional Considerations

When using the parallel parameter:

  • It works well with the batch_size parameter, which controls how many documents are processed in a single batch[6][7].
  • For single GPU usage, the parallel parameter is not necessary[5].
  • When using GPUs, you must use cuda=True and not explicitly specify providers[5].
  • The cuda and providers parameters are mutually exclusive[5].

This parameter is particularly useful for improving performance when embedding large datasets, allowing you to take advantage of multiple CPU cores or GPUs for parallel processing.

Citations:


Review the parallel parameter change in FastembedEmbeddingEngine

The update from parallel=0 to parallel=None is not equivalent:

  • parallel=0 ⇒ enables data-parallel encoding across all CPU cores.
  • parallel=None (the default) ⇒ disables data-parallel processing and falls back to ONNX Runtime’s internal threading.

Please confirm whether you intended to drop data-parallel CPU processing:

  • If you still want to leverage all cores for large-batch offline encoding, revert to parallel=0.
  • If you deliberately prefer ONNX Runtime threading only, keep parallel=None and consider documenting this behavioral change.

File:

  • cognee/infrastructure/databases/vector/embeddings/FastembedEmbeddingEngine.py (line 47)
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/vector/embeddings/FastembedEmbeddingEngine.py
at line 47, the change from parallel=0 to parallel=None alters the behavior of
the embed method by disabling data-parallel CPU processing and relying on ONNX
Runtime threading instead. To fix this, decide if you want to keep data-parallel
processing across all CPU cores; if so, revert parallel back to 0. Otherwise,
keep parallel=None but add a comment explaining this intentional change in
threading behavior for clarity.

@@ -61,8 +61,7 @@ More on [use-cases](https://docs.cognee.ai/use-cases) and [evals](https://github

## Get Started

Get started quickly with a Google Colab <a href="https://colab.research.google.com/drive/1jHbWVypDgCLwjE71GSXhRL3YxYhCZzG1?usp=sharing">notebook</a> or <a href="https://github.com/topoteretes/cognee-starter">starter repo</a>

Get started quickly with a Google Colab <a href="https://colab.research.google.com/drive/1jHbWVypDgCLwjE71GSXhRL3YxYhCZzG1?usp=sharing">notebook</a> or our starter folder</a>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix incomplete link markup.

The link markup is incomplete and will break the README rendering. The opening <a> tag is missing.

-Get started quickly with a Google Colab  <a href="https://colab.research.google.com/drive/1jHbWVypDgCLwjE71GSXhRL3YxYhCZzG1?usp=sharing">notebook</a>  or  our starter folder</a>
+Get started quickly with a Google Colab  <a href="https://colab.research.google.com/drive/1jHbWVypDgCLwjE71GSXhRL3YxYhCZzG1?usp=sharing">notebook</a>  or  <a href="PATH_TO_STARTER_FOLDER">our starter folder</a>

Note: Replace PATH_TO_STARTER_FOLDER with the actual path to the starter folder.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Get started quickly with a Google Colab <a href="https://colab.research.google.com/drive/1jHbWVypDgCLwjE71GSXhRL3YxYhCZzG1?usp=sharing">notebook</a> or our starter folder</a>
Get started quickly with a Google Colab <a href="https://colab.research.google.com/drive/1jHbWVypDgCLwjE71GSXhRL3YxYhCZzG1?usp=sharing">notebook</a> or <a href="PATH_TO_STARTER_FOLDER">our starter folder</a>
🤖 Prompt for AI Agents
In README.md at line 64, the anchor tag for the starter folder link is missing
its opening <a> tag, causing broken link markup. Add the opening <a> tag with
the correct href attribute pointing to the starter folder path before the text
"our starter folder" to complete the link markup and ensure proper rendering.

entrypoint.sh Outdated
Comment on lines 36 to 42
debugpy --wait-for-client --listen 0.0.0.0:5678 -m gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app
else
exec gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app
gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app
fi
else
exec gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level error cognee.api.client:app
gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level error cognee.api.client:app
fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Consider restoring exec for proper signal handling in containers.

The removal of exec from the gunicorn startup commands means the shell process remains as the parent, which can interfere with proper signal handling in containerized environments. Without exec, SIGTERM signals sent by container orchestrators may not reach the gunicorn process directly, potentially causing issues with graceful shutdowns.

Restore exec for proper signal propagation:

-        debugpy --wait-for-client --listen 0.0.0.0:5678 -m gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app
+        exec debugpy --wait-for-client --listen 0.0.0.0:5678 -m gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app
-        gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app
+        exec gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app
-    gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level error cognee.api.client:app 
+    exec gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level error cognee.api.client:app
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
debugpy --wait-for-client --listen 0.0.0.0:5678 -m gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app
else
exec gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app
gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app
fi
else
exec gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level error cognee.api.client:app
gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level error cognee.api.client:app
fi
exec debugpy --wait-for-client --listen 0.0.0.0:5678 -m gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app
else
exec gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app
fi
else
exec gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level error cognee.api.client:app
fi
🤖 Prompt for AI Agents
In entrypoint.sh around lines 36 to 42, the gunicorn startup commands lack the
'exec' prefix, causing the shell to remain the parent process and interfering
with proper signal handling in containers. To fix this, add 'exec' before each
gunicorn command so that the shell process is replaced by gunicorn, ensuring
SIGTERM and other signals are correctly propagated for graceful shutdowns.

@@ -29,7 +29,7 @@ export default function WizardPage({
setFalse: closeSettingsModal,
} = useBoolean(false);

const dataset = { id: 'main' };
const dataset = { name: 'main' };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Verify dataset property change coordination across components.

The change from id to name property looks correct based on the coordinated updates mentioned in the AI summary. However, ensure that all consuming components (CognifyStep and ExploreStep) have been updated to expect the name property instead of id.


🏁 Script executed:

#!/bin/bash
# Description: Verify that CognifyStep and ExploreStep components expect dataset.name property
# Expected: Components should reference dataset.name, not dataset.id

echo "Checking CognifyStep component..."
rg -A 5 -B 5 "dataset\.(id|name)" cognee-frontend/src/app/wizard/CognifyStep/

echo "Checking ExploreStep component..."
rg -A 5 -B 5 "dataset\.(id|name)" cognee-frontend/src/app/wizard/ExploreStep/

echo "Checking other potential references to dataset.id..."
rg -A 2 -B 2 "dataset\.id" cognee-frontend/src/

Length of output: 4262


Align dataset identifier usage across the codebase

The switch from dataset.id to dataset.name in WizardPage.tsx is incomplete—several modules still reference dataset.id, leading to runtime errors. Please update or map these to use dataset.name (or introduce a consistent adapter) in the following locations:

• src/modules/ingestion/useDatasets.ts
– statuses[dataset.id]
– removeDataset filter: dataset.id !== datasetId

• src/modules/ingestion/addData.ts
formData.append('datasetId', dataset.id)

• src/modules/ingestion/DatasetsView/DatasetsView.tsx
<DatasetItem key={dataset.id} ...>
– onClick(dataset) passing old id

• src/modules/datasets/getDatasetData.ts
– function signature getDatasetData(dataset: { id: string })
– URL: /v1/datasets/${dataset.id}/data

• src/modules/datasets/deleteDataset.ts
– function signature deleteDataset(dataset: { id: string })
– URL: /v1/datasets/${dataset.id}

• src/app/page.tsx
datasets.find((d) => d.id === selectedDataset)
onDatasetCognify({ id: dataset.id })

• src/modules/datasets/cognifyDataset.ts
datasets: [dataset.id || dataset.name] in request body

Suggested next steps:

  1. Update these functions/components to consume dataset.name.
  2. Adjust API calls or backend adapters to accept name as identifier (or map name→id).
  3. Revise type definitions/interfaces to reflect the new name property.

This refactor must be coordinated end-to-end to prevent breaking changes.

🤖 Prompt for AI Agents
In cognee-frontend/src/app/wizard/WizardPage.tsx at line 32, the dataset object
uses the property name instead of id, but multiple other modules still reference
dataset.id causing runtime errors. To fix this, update all references in the
listed files (useDatasets.ts, addData.ts, DatasetsView.tsx, getDatasetData.ts,
deleteDataset.ts, page.tsx, cognifyDataset.ts) to use dataset.name consistently.
Adjust function signatures, API calls, and type definitions accordingly to
accept name as the identifier or implement a mapping from name to id to maintain
compatibility. Ensure all consuming components and backend interactions align
with this change to avoid breaking issues.

Comment on lines 387 to 394
for neighbour in predecessors:
neighbour = neighbour["relation"]
connections.append((neighbour[0], {"relationship_name": neighbour[1]}, neighbour[2]))

for neighbour in successors:
neighbour = neighbour["relation"]
connections.append((neighbour[0], {"relationship_name": neighbour[1]}, neighbour[2]))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

get_connections indexes into a relationship instead of using its attributes

Same issue as above—relation[0], relation[1], relation[2] are invalid. You already return the nodes in the Cypher query; simply forward them:

MATCH (node)<-[relation]-(neighbour)
...
RETURN neighbour.id   AS src_id,
       type(relation) AS rel_type,
       node.id        AS dst_id

and then:

connections.append(
    (row["src_id"], {"relationship_name": row["rel_type"]}, row["dst_id"])
)
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py around
lines 387 to 394, the code incorrectly indexes into the relationship object
using relation[0], relation[1], and relation[2], which is invalid. Instead,
modify the code to use the attributes returned by the Cypher query directly by
appending connections with the source ID, a dictionary containing the
relationship type, and the destination ID, like
connections.append((row["src_id"], {"relationship_name": row["rel_type"]},
row["dst_id"])). This change ensures the correct use of relationship data
without indexing errors.

Comment on lines 120 to 124
sanitized_id = node_id.replace(":", "_")

query = "MATCH (node: {{id: $node_id}}) DETACH DELETE node"
params = {"node_id": sanitized_id}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Malformed Cypher pattern prevents node deletion

MATCH (node: {{id: $node_id}}) contains an extra : and doubled braces, producing a syntax error. Replace with a property map:

-        query = "MATCH (node: {{id: $node_id}}) DETACH DELETE node"
+        query = "MATCH (node {id: $node_id}) DETACH DELETE node"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
sanitized_id = node_id.replace(":", "_")
query = "MATCH (node: {{id: $node_id}}) DETACH DELETE node"
params = {"node_id": sanitized_id}
sanitized_id = node_id.replace(":", "_")
query = "MATCH (node {id: $node_id}) DETACH DELETE node"
params = {"node_id": sanitized_id}
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py around
lines 120 to 124, the Cypher query syntax is incorrect due to an extra colon and
doubled braces in the node pattern. Replace the pattern MATCH (node: {{id:
$node_id}}) with MATCH (node {id: $node_id}) to correctly specify the node by
its property map without the extra colon and braces, ensuring the query runs
without syntax errors.

Comment on lines 255 to 258
return [
(result["n"]["id"], result["m"]["id"], {"relationship_name": result["r"][1]})
for result in results
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Relationship object is not subscriptable

result["r"][1] assumes a list/tuple; Neo4j’s Relationship object exposes its type via .type. Accessing it by index will raise TypeError.

-        (result["n"]["id"], result["m"]["id"], {"relationship_name": result["r"][1]})
+        (
+            result["n"]["id"],
+            result["m"]["id"],
+            {"relationship_name": result["r"].type},
+        )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
return [
(result["n"]["id"], result["m"]["id"], {"relationship_name": result["r"][1]})
for result in results
]
return [
(
result["n"]["id"],
result["m"]["id"],
{"relationship_name": result["r"].type},
)
for result in results
]
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py around
lines 255 to 258, the code incorrectly accesses the relationship type using
indexing on result["r"], which is a Relationship object and not subscriptable.
Replace result["r"][1] with result["r"].type to correctly access the
relationship type attribute without causing a TypeError.

Comment on lines 430 to 434
def serialize_properties(self, properties=dict()):
serialized_properties = {}

for property_key, property_value in properties.items():
if isinstance(property_value, UUID):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Avoid mutable default argument

properties=dict() creates a shared dictionary across calls.

-    def serialize_properties(self, properties=dict()):
+    def serialize_properties(self, properties: Optional[dict] = None):
         serialized_properties = {}
-
-        for property_key, property_value in properties.items():
+        properties = properties or {}
+        for property_key, property_value in properties.items():
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def serialize_properties(self, properties=dict()):
serialized_properties = {}
for property_key, property_value in properties.items():
if isinstance(property_value, UUID):
def serialize_properties(self, properties: Optional[dict] = None):
serialized_properties = {}
properties = properties or {}
for property_key, property_value in properties.items():
if isinstance(property_value, UUID):
...
🧰 Tools
🪛 Ruff (0.11.9)

430-430: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

🤖 Prompt for AI Agents
In cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py around
lines 430 to 434, the method serialize_properties uses a mutable default
argument properties=dict(), which can lead to unexpected behavior due to shared
state across calls. Change the default value of properties to None and inside
the method initialize it to an empty dictionary if it is None, ensuring each
call gets a fresh dictionary.

Comment on lines 415 to 418
UNWIND $node_ids AS id
MATCH (node:`{id}`)<-[r:{edge_label}]-(successor)
DELETE r;
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Cypher pattern in remove_connection_to_successors_of is invalid

Back-ticked {id} and inline braces break parsing. Use a parameterised property map instead:

-MATCH (node:`{id}`)<-[r:{edge_label}]-(successor)
+MATCH (node {id: id})<-[r]-(successor)
+WHERE type(r) = $edge_label

(Assuming you meant the UNWIND variable id.)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
UNWIND $node_ids AS id
MATCH (node:`{id}`)<-[r:{edge_label}]-(successor)
DELETE r;
"""
UNWIND $node_ids AS id
- MATCH (node:`{id}`)<-[r:{edge_label}]-(successor)
+ MATCH (node {id: id})<-[r]-(successor)
+ WHERE type(r) = $edge_label
DELETE r;
"""
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/graph/memgraph/memgraph_adapter.py around
lines 415 to 418, the Cypher pattern uses back-ticked `{id}` and inline braces
which is invalid syntax. Replace the pattern `(node:`{id}`)` with a
parameterized property map using the UNWIND variable `id`, for example `(node
{id: id})`, to correctly match nodes by their id property without breaking
parsing.

entrypoint.sh Outdated
Comment on lines 34 to 36
if [ "$DEBUG" = "true" ]; then
echo "Waiting for the debugger to attach..."
exec python -m debugpy --wait-for-client --listen 0.0.0.0:5678 -m gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app
debugpy --wait-for-client --listen 0.0.0.0:5678 -m gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exposed Debug Interface with Remote Code Execution Risk
Suggested Fix

@@ -1,4 +1,12 @@
+` tag.
+
+*Changes made*: The only change is replacing `--listen 0.0.0.0:5678` with `--listen 127.0.0.1:5678` in the invocation of `debugpy`, restricting access to the debugger port to localhost only.
+
+*Potential Impacts*: Developers must now attach debuggers from inside the container or via `kubectl exec`/`docker exec` as the debugger is not accessible remotely. This does not impact existing application startup, migration, or gunicorn operations. No other code is affected.
+
+</explanation>
+<patch>
 #!/bin/bash
 
 set -e  # Exit on error
 echo "Debug mode: $DEBUG"
@@ -32,11 +40,11 @@
 # Modified Gunicorn startup with error handling
 if [ "$ENVIRONMENT" = "dev" ] || [ "$ENVIRONMENT" = "local" ]; then
     if [ "$DEBUG" = "true" ]; then
         echo "Waiting for the debugger to attach..."
-        debugpy --wait-for-client --listen 0.0.0.0:5678 -m gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app
+        debugpy --wait-for-client --listen 127.0.0.1:5678 -m gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app
     else
         gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app
     fi
 else
     gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level error cognee.api.client:app 
-fi
+fi
\ No newline at end of file
Explanation of Fix

Vulnerability and Fix Summary:

Vulnerability: The script exposes the debugpy debugger on all interfaces (0.0.0.0:5678) when DEBUG=true and ENVIRONMENT is set to dev or local. This enables anyone with network access to attach to the debugger and execute arbitrary code, leading to a critical security risk (CWE-489, CWE-284).

Fix: The patch restricts the debugpy debugger to only listen on 127.0.0.1 (localhost) rather than 0.0.0.0, making it accessible only from within the container/pod itself and not from external hosts. This eliminates the risk of remote code execution by unauthorized users while preserving the intended debugging functionality for local development. No new dependencies are introduced and behavior remains unchanged except for this stricter binding.

Original Vulnerable Code:
if [ "$ENVIRONMENT" = "dev" ] || [ "$ENVIRONMENT" = "local" ]; then
if [ "$DEBUG" = "true" ]; then
echo "Waiting for the debugger to attach..."
debugpy --wait-for-client --listen 0.0.0.0:5678 -m gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app
else
gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level debug --reload cognee.api.client:app
fi
else
gunicorn -w 3 -k uvicorn.workers.UvicornWorker -t 30000 --bind=0.0.0.0:8000 --log-level error cognee.api.client:app
fi

Patched Code:
See below in the <patch> tag.

Changes made: The only change is replacing --listen 0.0.0.0:5678 with --listen 127.0.0.1:5678 in the invocation of debugpy, restricting access to the debugger port to localhost only.

Potential Impacts: Developers must now attach debuggers from inside the container or via kubectl exec/docker exec as the debugger is not accessible remotely. This does not impact existing application startup, migration, or gunicorn operations. No other code is affected.

Issues
Type Identifier Message Severity Link
Application
CWE-489, CWE-284
When DEBUG=true in a dev/local environment, the script launches debugpy listening on 0.0.0.0:5678. Anyone who can reach the container’s network can attach to this debugger and execute arbitrary Python code, effectively granting remote code-execution rights. Exposing a debug interface in production or an otherwise reachable environment constitutes both leftover debug code (CWE-489) and improper access control (CWE-284).
critical
Link

Copy link

pensarapp bot commented May 24, 2025

Arbitrary Python Code Execution via Unrestricted Module Loading
Suggested Fix

@@ -420,10 +420,38 @@
     return "\n".join(edge_strings)
 
 
 def load_class(model_file, model_name):
-    model_file = os.path.abspath(model_file)
-    spec = importlib.util.spec_from_file_location("graph_model", model_file)
+    """
+    Safely load a class from a Python file, restricting import to a safe subdirectory.
+    Only allows .py files within the same directory as this server.py or a designated safe folder.
+    """
+    # Base directory to allow imports from (e.g., the directory containing this file)
+    SAFE_BASE_DIR = os.path.dirname(os.path.realpath(__file__))
+    model_file_path = os.path.abspath(model_file)
+
+    # Check that it's a Python file
+    if not model_file_path.endswith('.py'):
+        raise ValueError("Only Python (.py) files are allowed for graph_model_file.")
+
+    # Prevent directory traversal and absolute path outside safe base
+    # Disallow any path containing .. segments that would escape the base dir
+    # Also disallow symlinks outside safe dir
+    try:
+        common = os.path.commonpath([SAFE_BASE_DIR, os.path.realpath(model_file_path)])
+    except ValueError:
+        # In case paths are on different drives/platforms
+        raise ValueError("The graph_model_file path is not permitted.")
+
+    if common != SAFE_BASE_DIR:
+        raise ValueError(
+            "graph_model_file must be located within the allowed directory."
+        )
+
+    if not os.path.isfile(model_file_path):
+        raise ValueError("The specified graph_model_file does not exist.")
+
+    spec = importlib.util.spec_from_file_location("graph_model", model_file_path)
     module = importlib.util.module_from_spec(spec)
     spec.loader.exec_module(module)
 
     model_class = getattr(module, model_name)
@@ -457,5 +485,5 @@
     try:
         asyncio.run(main())
     except Exception as e:
         logger.error(f"Error initializing Cognee MCP server: {str(e)}")
-        raise
+        raise
\ No newline at end of file
Explanation of Fix

Vulnerability summary:
The load_class function in /cognee-mcp/src/server.py dynamically imports and executes a Python file from a user-supplied path (via the graph_model_file argument) without restriction. This allows an attacker to execute arbitrary Python code on the server by specifying a malicious file path, causing Remote Code Execution (CWE-470, CWE-94).

Fix:
This patch hardens load_class by restricting dynamic imports to only files located within an explicit allow-list of safe directories, such as the project's own module directory, or a specifically designated "models" or schema directory. It only allows importing files with .py extensions, and rejects absolute or parent-directory navigations (.., /, \). If graph_model_file is outside the allowed directory, a ValueError is raised. This prevents attackers from supplying locations such as /tmp/evil.py or arbitrary paths. These checks are implemented without introducing new dependencies.

Changes made:

  • Modified load_class to:
    • Only allow .py files
    • Restrict the path to a safe directory: e.g., must be under the project's own directory or a configurable models directory (for this patch, we use the directory containing the server.py file, but this can be adjusted as needed)
    • Disallow absolute paths, parent directory traversals, and symlinks to outside the allowed directory
    • Raise a clear exception if the file path is not allowed
  • Modified callers (if needed) to provide clearer error messages to users if the import is not allowed
  • Added comments to clarify why the validation is present

Potential impacts:

  • Users will now be prohibited from specifying arbitrary files outside the allowed directory for graph_model_file, which could break any existing workflows relying on importing arbitrary paths.
  • All dynamic imports must reside in a permitted, controlled directory with .py extension, reducing attack surface.
  • No runtime overhead except for the path checks when importing user-supplied modules.

Unexpected import note:

  • We use os.path.realpath and os.path.commonpath in the main module (not new imports). These are standard for path checks and are already used in the file.

This patch is production ready and does not introduce new dependencies.

Issues
Type Identifier Message Severity Link
Application
CWE-470, CWE-94
The load_class function imports and executes a Python module whose path is supplied directly by the user via the graph_model_file argument in the cognify and cognee_add_developer_rules tools. Because exec_module executes the file at import time, an attacker can supply the path to any malicious script and achieve Remote Code Execution (RCE) under the privileges of the running service. This is unsafe reflection / dynamic code execution with externally controlled input.
critical
Link

Copy link

pensarapp bot commented May 24, 2025

Path Traversal Vulnerability in Local File Processing
Suggested Fix

@@ -1,8 +1,9 @@
 import dlt
 import s3fs
 import json
 import inspect
+import os  # Added for secure path handling
 from typing import Union, BinaryIO, Any, List, Optional
 import cognee.modules.ingestion as ingestion
 from cognee.infrastructure.databases.relational import get_relational_engine
 from cognee.modules.data.methods import create_dataset, get_dataset_data, get_datasets_by_name
@@ -12,12 +13,10 @@
 from cognee.modules.users.permissions.methods import give_permission_on_document
 from .get_dlt_destination import get_dlt_destination
 from .save_data_item_to_storage import save_data_item_to_storage
 
-
 from cognee.api.v1.add.config import get_s3_config
 
-
 async def ingest_data(
     data: Any, dataset_name: str, user: User, node_set: Optional[List[str]] = None
 ):
     destination = get_dlt_destination()
@@ -37,15 +36,27 @@
         fs = s3fs.S3FileSystem(
             key=s3_config.aws_access_key_id, secret=s3_config.aws_secret_access_key, anon=False
         )
 
+    # Define a safe base directory for all local files (must be set to your data storage location)
+    base_dataset_dir = "/path/to/your/dataset/storage"  # <-- Set this to the intended data storage directory
+
     def open_data_file(file_path: str):
         if file_path.startswith("s3://"):
             return fs.open(file_path, mode="rb")
         else:
             local_path = file_path.replace("file://", "")
-            return open(local_path, mode="rb")
 
+            # Secure absolute, canonical path
+            requested_path = os.path.abspath(os.path.realpath(local_path))
+            base_path = os.path.abspath(os.path.realpath(base_dataset_dir))
+
+            # Check that requested_path is a subpath of base_path
+            if not os.path.commonprefix([requested_path, base_path]) == base_path:
+                raise ValueError("Attempted file access outside of the permitted data directory.")
+
+            return open(requested_path, mode="rb")
+
     def get_external_metadata_dict(data_item: Union[BinaryIO, str, Any]) -> dict[str, Any]:
         if hasattr(data_item, "dict") and inspect.ismethod(getattr(data_item, "dict")):
             return {"metadata": data_item.dict(), "origin": str(type(data_item))}
         else:
@@ -188,5 +199,5 @@
     if datasets:
         dataset = datasets[0]
         data_documents = await get_dataset_data(dataset_id=dataset.id)
         return data_documents
-    return []
+    return []
\ No newline at end of file
Explanation of Fix

Vulnerability and Fix Explanation:

The vulnerability is a classic path traversal issue (CWE-22). In the original code, local file paths are accepted with a file:// prefix, which is stripped and then passed directly to Python's open(). Because there is no directory or filename validation, an attacker could provide a crafted path such as file:///../../../etc/passwd to access arbitrary files on the system.

Fix:
To address this, the fix strictly restricts the files that may be opened via local paths. The code now requires all local file accesses to be under a designated safe base directory. The code canonicalizes the resolved path and checks that it is a subpath of the defined base dataset storage directory. If the check fails (i.e., path traversal is attempted), an exception is raised, preventing unauthorized file access.

Note:

  • The patch imports os and uses it in the main module. If users are not expecting an import here, this is necessary for secure path handling and does not introduce new dependencies.
  • You must set the value of the base_dataset_dir variable to your actual dataset storage directory path for the restriction to take effect as intended.

Summary of impacts:

  • Only files below the designated data directory may be accessed via the local path interface. Any path traversal (“../”) or absolute pathing is denied.
  • No breaking changes or new dependencies are introduced. All function interfaces remain the same.
Issues
Type Identifier Message Severity Link
Application
CWE-22
For local paths the code merely strips the file:// scheme and feeds the remaining string directly to open(). Because no directory or filename validation is performed, a crafted value like file:///../../../etc/passwd will be opened, leading to path-traversal and arbitrary file read on the host.
critical
Link

Copy link

pensarapp bot commented May 24, 2025

Default User Authorization Bypass in Dataset Operations
Suggested Fix

@@ -1,40 +1,82 @@
 from uuid import UUID
-from cognee.modules.users.methods import get_default_user
 from cognee.modules.ingestion import discover_directory_datasets
 from cognee.modules.pipelines.operations.get_pipeline_status import get_pipeline_status
 
 
 class datasets:
     @staticmethod
-    async def list_datasets():
+    async def list_datasets(user):
+        """
+        Lists datasets owned by the specified user.
+
+        Args:
+            user: The authenticated user object whose datasets are to be listed.
+
+        Returns:
+            List of datasets belonging to the user.
+        """
         from cognee.modules.data.methods import get_datasets
 
-        user = await get_default_user()
         return await get_datasets(user.id)
 
     @staticmethod
     def discover_datasets(directory_path: str):
+        """
+        Discovers datasets in a directory. This is not user-context sensitive.
+
+        Args:
+            directory_path: Path to directory to discover datasets in.
+
+        Returns:
+            List of discovered dataset names.
+        """
         return list(discover_directory_datasets(directory_path).keys())
 
     @staticmethod
-    async def list_data(dataset_id: str):
-        from cognee.modules.data.methods import get_dataset, get_dataset_data
+    async def list_data(user, dataset_id: str):
+        """
+        Lists data in a specific dataset owned by the user.
 
-        user = await get_default_user()
+        Args:
+            user: The authenticated user object.
+            dataset_id: The ID of the dataset to retrieve data from.
 
+        Returns:
+            Data in the specified dataset.
+        """
+        from cognee.modules.data.methods import get_dataset, get_dataset_data
+
         dataset = await get_dataset(user.id, dataset_id)
 
         return await get_dataset_data(dataset.id)
 
     @staticmethod
     async def get_status(dataset_ids: list[UUID]) -> dict:
+        """
+        Gets the status for the provided dataset IDs.
+
+        Args:
+            dataset_ids: List of UUIDs for datasets.
+
+        Returns:
+            Dictionary of statuses for each dataset.
+        """
         return await get_pipeline_status(dataset_ids, pipeline_name="cognify_pipeline")
 
     @staticmethod
-    async def delete_dataset(dataset_id: str):
+    async def delete_dataset(user, dataset_id: str):
+        """
+        Deletes a dataset owned by the specified user.
+
+        Args:
+            user: The authenticated user object.
+            dataset_id: The ID of the dataset to be deleted.
+
+        Returns:
+            Result of delete operation.
+        """
         from cognee.modules.data.methods import get_dataset, delete_dataset
 
-        user = await get_default_user()
         dataset = await get_dataset(user.id, dataset_id)
 
-        return await delete_dataset(dataset)
+        return await delete_dataset(dataset)
\ No newline at end of file
Explanation of Fix

Vulnerability & Fix Explanation:

The root cause of the vulnerability is that all sensitive dataset operations (list_datasets, list_data, delete_dataset) hardcode the use of get_default_user() to determine which user's datasets to manipulate. This ignores the currently authenticated user context and breaks proper authorization, exposing (or allowing deletion of) datasets improperly.

Fix:
We patch these methods to accept an explicit user parameter (expected to be the currently authenticated user object). API endpoints or callers interacting with these methods should provide the correct user context. We do not hardcode any default user, leaving it to the caller to enforce authentication. This resolves CWE-284 (Improper Access Control) and helps prevent unauthorized access or manipulation.

Other class methods (such as discover_datasets and get_status) do not depend on user context and are left unchanged, as the vulnerability is specific to dataset ownership and operations.

Side Effects & Compatibility:
This change is backward-compatible if API endpoints and internal callers are updated to provide the authenticated user explicitly. If not, old calls without the user argument will fail; hence, integration points must be audited. No dependencies are added.

Issues
Type Identifier Message Severity Link
Application
CWE-284, CWE-22
All dataset operations (list_datasets, list_data, delete_dataset) rely on get_default_user() instead of the currently authenticated user context. If these static methods are exposed through an API endpoint that should act on the caller's identity, this design bypasses proper authorization checks. An attacker could invoke these endpoints and operate on datasets belonging to the default/system user—or delete datasets—without possessing legitimate privileges, resulting in Improper Access Control (CWE-284).
critical
Link

Comment on lines 8 to 40
from cognee.api.v1.search import search
from cognee.api.v1.cognify import cognify
from cognee.api.v1.prune import prune


from cognee.modules.users.methods import get_default_user
from cognee.api.v1.responses.default_tools import DEFAULT_TOOLS

logger = logging.getLogger(__name__)


async def dispatch_function(tool_call: Union[ToolCall, Dict[str, Any]]) -> str:
"""
Dispatches a function call to the appropriate Cognee function.
"""
if isinstance(tool_call, dict):
function_data = tool_call.get("function", {})
function_name = function_data.get("name", "")
arguments_str = function_data.get("arguments", "{}")
else:
function_name = tool_call.function.name
arguments_str = tool_call.function.arguments

arguments = json.loads(arguments_str)

logger.info(f"Dispatching function: {function_name} with args: {arguments}")

user = await get_default_user()

if function_name == "search":
return await handle_search(arguments, user)
elif function_name == "cognify":
return await handle_cognify(arguments, user)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default User Privilege Escalation in Function Dispatcher

Explanation of Fix

Vulnerability:
The original code in dispatch_function always uses get_default_user() to supply the user context to sensitive downstream functions (search, add, cognify, prune). This means that any caller, regardless of their real identity or tenant, has all actions executed as the default user. This results in privilege escalation, unauthorized access, and cross-tenant data exposure (CWE-285). Furthermore, logs may include potentially sensitive input arguments (CWE-532).

Fix:
The fix introduces a mandatory user argument to dispatch_function and properly passes it to all handlers, ensuring privileged operations are performed with the identity of the actual caller, not a hardcoded default. The get_default_user() call is removed. The function signatures and usages are updated to require the user argument, providing backward compatibility for the dispatcher by making the requirement explicit. Additionally, log messages are adjusted to avoid dumping the full arguments (partially mitigating info leakage, but main fix is privilege context).

Potential impacts:

  • All calls to dispatch_function must now supply a valid authenticated user object. If not, the dispatcher will return an explicit error.
  • Downstream functions requiring a user are unaffected (they already require a user), but the dispatcher no longer accidentally grants default user privileges.
  • No new dependencies are introduced and existing APIs/handlers (handle_search, etc.) remain unchanged except for propagation of the real user.
Issues
Type Identifier Message Severity Link
Application
CWE-532, CWE-285
Every function call is executed using a hard-coded get_default_user() identity instead of the real caller. This bypasses any per-user authorization and effectively grants the privileges of the default (potentially privileged) account to anyone who can invoke the dispatcher, enabling privilege escalation or unauthorized access across tenant boundaries.
critical
Link

Suggested Fix

Suggested change
from cognee.api.v1.search import search
from cognee.api.v1.cognify import cognify
from cognee.api.v1.prune import prune
from cognee.modules.users.methods import get_default_user
from cognee.api.v1.responses.default_tools import DEFAULT_TOOLS
logger = logging.getLogger(__name__)
async def dispatch_function(tool_call: Union[ToolCall, Dict[str, Any]]) -> str:
"""
Dispatches a function call to the appropriate Cognee function.
"""
if isinstance(tool_call, dict):
function_data = tool_call.get("function", {})
function_name = function_data.get("name", "")
arguments_str = function_data.get("arguments", "{}")
else:
function_name = tool_call.function.name
arguments_str = tool_call.function.arguments
arguments = json.loads(arguments_str)
logger.info(f"Dispatching function: {function_name} with args: {arguments}")
user = await get_default_user()
if function_name == "search":
return await handle_search(arguments, user)
elif function_name == "cognify":
return await handle_cognify(arguments, user)
from cognee.api.v1.search import search
from cognee.api.v1.cognify import cognify
from cognee.api.v1.prune import prune
# get_default_user import REMOVED
from cognee.api.v1.responses.default_tools import DEFAULT_TOOLS
logger = logging.getLogger(__name__)
async def dispatch_function(tool_call: Union[ToolCall, Dict[str, Any]], user: Any = None) -> str:
"""
Dispatches a function call to the appropriate Cognee function.
Requires the real acting user to be provided as an argument.
"""
if user is None:
logger.error("dispatch_function: Missing required 'user' argument.")
return "Error: Authentication required. User context is missing."
if isinstance(tool_call, dict):
function_data = tool_call.get("function", {})
function_name = function_data.get("name", "")
arguments_str = function_data.get("arguments", "{}")
else:
function_name = tool_call.function.name
arguments_str = tool_call.function.arguments
try:
arguments = json.loads(arguments_str)
except Exception as e:
logger.error(f"Failed to parse arguments JSON: {e}")
return "Error: Invalid arguments format."
logger.info(f"Dispatching function: {function_name}") # Avoid logging all arguments for CWE-532
if function_name == "search":
return await handle_search(arguments, user)
elif function_name == "cognify":
return await handle_cognify(arguments, user)

Provide feedback with 👍 | 👎
Customize these alerts in project settings

Comment on lines 103 to 107

async def handle_prune(arguments: Dict[str, Any], user) -> str:
"""Handle prune function call"""
await prune()
return "Memory has been pruned successfully."
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default User Privilege Escalation in Function Dispatcher

Explanation of Fix

Vulnerability:
The original code in dispatch_function always uses get_default_user() to supply the user context to sensitive downstream functions (search, add, cognify, prune). This means that any caller, regardless of their real identity or tenant, has all actions executed as the default user. This results in privilege escalation, unauthorized access, and cross-tenant data exposure (CWE-285). Furthermore, logs may include potentially sensitive input arguments (CWE-532).

Fix:
The fix introduces a mandatory user argument to dispatch_function and properly passes it to all handlers, ensuring privileged operations are performed with the identity of the actual caller, not a hardcoded default. The get_default_user() call is removed. The function signatures and usages are updated to require the user argument, providing backward compatibility for the dispatcher by making the requirement explicit. Additionally, log messages are adjusted to avoid dumping the full arguments (partially mitigating info leakage, but main fix is privilege context).

Potential impacts:

  • All calls to dispatch_function must now supply a valid authenticated user object. If not, the dispatcher will return an explicit error.
  • Downstream functions requiring a user are unaffected (they already require a user), but the dispatcher no longer accidentally grants default user privileges.
  • No new dependencies are introduced and existing APIs/handlers (handle_search, etc.) remain unchanged except for propagation of the real user.
Issues
Type Identifier Message Severity Link
Application
CWE-532, CWE-285
Every function call is executed using a hard-coded get_default_user() identity instead of the real caller. This bypasses any per-user authorization and effectively grants the privileges of the default (potentially privileged) account to anyone who can invoke the dispatcher, enabling privilege escalation or unauthorized access across tenant boundaries.
critical
Link

Suggested Fix

Suggested change
async def handle_prune(arguments: Dict[str, Any], user) -> str:
"""Handle prune function call"""
await prune()
return "Memory has been pruned successfully."
async def handle_prune(arguments: Dict[str, Any], user) -> str:
"""Handle prune function call"""
await prune()
return "Memory has been pruned successfully."

Provide feedback with 👍 | 👎
Customize these alerts in project settings

Copy link

pensarapp bot commented May 24, 2025

Sensitive Configuration Data Exposure via Telemetry Service
Suggested Fix

@@ -19,24 +19,55 @@
 
 logger = get_logger("run_tasks(tasks: [Task], data)")
 
 
+def _sanitize_config_for_telemetry(config: dict) -> dict:
+    """
+    Returns a copy of the config with sensitive keys removed,
+    such as those containing secrets, passwords, tokens, or keys.
+    """
+    # Define keys that are considered sensitive
+    sensitive_keywords = {
+        "secret",
+        "password",
+        "passwd",
+        "token",
+        "apikey",
+        "api_key",
+        "access_key",
+        "private",
+        "credential",
+        "credentials",
+        "auth",
+        "session",
+    }
+    sanitized = {}
+    for k, v in config.items():
+        key_lower = str(k).lower()
+        # Remove keys if they match any sensitive keyword
+        if not any(word in key_lower for word in sensitive_keywords):
+            sanitized[k] = v
+    return sanitized
+
+
 async def run_tasks_with_telemetry(
     tasks: list[Task], data, user: User, pipeline_name: str, context: dict = None
 ):
     config = get_current_settings()
-
     logger.debug("\nRunning pipeline with configuration:\n%s\n", json.dumps(config, indent=1))
 
+    # Prepare sanitized config for telemetry
+    sanitized_config = _sanitize_config_for_telemetry(config)
+
     try:
         logger.info("Pipeline run started: `%s`", pipeline_name)
         send_telemetry(
             "Pipeline Run Started",
             user.id,
             additional_properties={
                 "pipeline_name": str(pipeline_name),
             }
-            | config,
+            | sanitized_config,
         )
 
         async for result in run_tasks_base(tasks, data, user, context):
             yield result
@@ -61,9 +92,9 @@
             user.id,
             additional_properties={
                 "pipeline_name": str(pipeline_name),
             }
-            | config,
+            | sanitized_config,
         )
 
         raise error
 
@@ -100,5 +131,5 @@
     except Exception as e:
         yield await log_pipeline_run_error(
             pipeline_run_id, pipeline_id, pipeline_name, dataset_id, data, e
         )
-        raise e
+        raise e
\ No newline at end of file
Explanation of Fix

Vulnerability Explanation and Fix:
The vulnerability arises because the entire config object (from get_current_settings()) is merged directly into the telemetry payload via send_telemetry. If config includes secrets like API keys, credentials, or private tokens, these are sent to an external telemetry endpoint, causing a potential information leak (CWE-201: Information Exposure Through Sent Data).

Fix:
The fix introduces a function _sanitize_config_for_telemetry(config: dict) -> dict that returns a sanitized copy of config by excluding sensitive keys. The keys considered sensitive include typical patterns: 'secret', 'password', 'key', 'token', 'credentials', etc. This sanitized dictionary is included in the telemetry payload instead of the full config, thus preventing the leakage of sensitive data.

Potential Impacts:

  • Only non-sensitive config fields will be sent to telemetry, removing the risk of secrets exposure.
  • If other code expects all config keys in telemetry, those fields will now be omitted.
  • No additional dependencies or major refactoring are introduced, preserving compatibility and function signatures.

Note:
No sensitive data is ever sent to the external telemetry service as a result of this patch. The sanitization approach is conservative and can be expanded if your config structure or naming requires.

Issues
Type Identifier Message Severity Link
Application
CWE-201
The entire config object is merged into the payload sent to the external send_telemetry service. If config contains secrets (API keys, database passwords, tokens, etc.), they are transmitted to a third-party endpoint, resulting in potential leakage of sensitive credentials outside the trust boundary of the application.
critical
Link

Comment on lines 78 to 85
"""
# Use default tools if none provided
tools = request.tools or DEFAULT_TOOLS

# Call the API
response = await call_openai_api_for_model(
input_text=request.input,
model=request.model,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unvalidated LLM Function Call Execution

Explanation of Fix

Vulnerability & Fix Summary:

The vulnerability here is LLM Function-Call Injection (ML09) and Improper Control of Dynamically Evaluated Code (CWE-94): after calling the LLM API, the backend blindly trusts items in the LLM's output list where type == "function_call"—specifically, the function_name and arguments_str are passed directly to dispatch_function, which can be exploited to perform unauthorized or arbitrary backend calls.

Patch Approach:

  • *Allow-list: Only permit function names that come from registered, pre-approved tools—tools that should have been passed into the LLM as available options (via the tools argument). We extract these names by examining tools: for each tool, if its type is "function" and it provides a valid function name, then it is permitted.

  • Validation: When iterating through LLM outputs, we now check that each function call’s name is in our allow-list (allowed_function_names). If not, we skip dispatching and instead insert an error in the output for that call, preventing execution of arbitrary or malicious functions.

Potential Impacts:

  • Any function call generated by the LLM which is not present in the explicit tools list provided as input will be ignored/skipped and result in an error message in the response object.
  • This is compatible with established OpenAI tool conventions and does not break backward compatibility unless consumers were previously relying on the ability to call arbitrary backend functions (which is highly dangerous and should not be allowed).
  • No new dependencies are introduced. Error handling is preserved.
Issues
Type Identifier Message Severity Link
Application
CWE-94, ML09
LLM-generated function calls are executed blindly via dispatch_function with no allow-list or validation of function_name or arguments_str. A user can influence the model’s response (or provide custom tool definitions earlier in the request) to trigger unintended backend functions or supply malicious arguments, leading to privilege escalation or arbitrary code execution. This is an instance of LLM Function-Call Injection (ML09) and general Improper Control of Dynamically Evaluated Code (CWE-94).
critical
Link

Suggested Fix

Suggested change
"""
# Use default tools if none provided
tools = request.tools or DEFAULT_TOOLS
# Call the API
response = await call_openai_api_for_model(
input_text=request.input,
model=request.model,
"""
# Use default tools if none provided
tools = request.tools or DEFAULT_TOOLS
# Compute allow-list of allowed function names from tools
allowed_function_names = set()
for tool in tools:
if isinstance(tool, dict):
# OpenAI tool convention: type, function (or name for older versions)
if tool.get("type") == "function":
function_def = tool.get("function")
if isinstance(function_def, dict):
# By OpenAI spec: function_def should have "name"
fname = function_def.get("name")
if isinstance(fname, str):
allowed_function_names.add(fname)
# Sometimes the name could be at the top level for backwards compatibility
fname = tool.get("name")
if isinstance(fname, str):
allowed_function_names.add(fname)
# Call the API
response = await call_openai_api_for_model(
input_text=request.input,
model=request.model,

Provide feedback with 👍 | 👎
Customize these alerts in project settings

Comment on lines 103 to 126
function_name = item.get("name", "")
arguments_str = item.get("arguments", "{}")
call_id = item.get("call_id", f"call_{uuid.uuid4().hex}")

# Create a format the dispatcher can handle
tool_call = {
"id": call_id,
"function": {"name": function_name, "arguments": arguments_str},
"type": "function",
}

# Dispatch the function
try:
function_result = await dispatch_function(tool_call)
output_status = "success"
except Exception as e:
logger.exception(f"Error executing function {function_name}: {e}")
function_result = f"Error executing {function_name}: {str(e)}"
output_status = "error"

processed_call = ResponseToolCall(
id=call_id,
type="function",
function=FunctionCall(name=function_name, arguments=arguments_str),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unvalidated LLM Function Call Execution

Explanation of Fix

Vulnerability & Fix Summary:

The vulnerability here is LLM Function-Call Injection (ML09) and Improper Control of Dynamically Evaluated Code (CWE-94): after calling the LLM API, the backend blindly trusts items in the LLM's output list where type == "function_call"—specifically, the function_name and arguments_str are passed directly to dispatch_function, which can be exploited to perform unauthorized or arbitrary backend calls.

Patch Approach:

  • *Allow-list: Only permit function names that come from registered, pre-approved tools—tools that should have been passed into the LLM as available options (via the tools argument). We extract these names by examining tools: for each tool, if its type is "function" and it provides a valid function name, then it is permitted.

  • Validation: When iterating through LLM outputs, we now check that each function call’s name is in our allow-list (allowed_function_names). If not, we skip dispatching and instead insert an error in the output for that call, preventing execution of arbitrary or malicious functions.

Potential Impacts:

  • Any function call generated by the LLM which is not present in the explicit tools list provided as input will be ignored/skipped and result in an error message in the response object.
  • This is compatible with established OpenAI tool conventions and does not break backward compatibility unless consumers were previously relying on the ability to call arbitrary backend functions (which is highly dangerous and should not be allowed).
  • No new dependencies are introduced. Error handling is preserved.
Issues
Type Identifier Message Severity Link
Application
CWE-94, ML09
LLM-generated function calls are executed blindly via dispatch_function with no allow-list or validation of function_name or arguments_str. A user can influence the model’s response (or provide custom tool definitions earlier in the request) to trigger unintended backend functions or supply malicious arguments, leading to privilege escalation or arbitrary code execution. This is an instance of LLM Function-Call Injection (ML09) and general Improper Control of Dynamically Evaluated Code (CWE-94).
critical
Link

Suggested Fix

Suggested change
function_name = item.get("name", "")
arguments_str = item.get("arguments", "{}")
call_id = item.get("call_id", f"call_{uuid.uuid4().hex}")
# Create a format the dispatcher can handle
tool_call = {
"id": call_id,
"function": {"name": function_name, "arguments": arguments_str},
"type": "function",
}
# Dispatch the function
try:
function_result = await dispatch_function(tool_call)
output_status = "success"
except Exception as e:
logger.exception(f"Error executing function {function_name}: {e}")
function_result = f"Error executing {function_name}: {str(e)}"
output_status = "error"
processed_call = ResponseToolCall(
id=call_id,
type="function",
function=FunctionCall(name=function_name, arguments=arguments_str),
function_name = item.get("name", "")
arguments_str = item.get("arguments", "{}")
call_id = item.get("call_id", f"call_{uuid.uuid4().hex}")
# Check if the function_name is in the allowlist
if function_name not in allowed_function_names:
logger.warning(
f"Function call '{function_name}' not in allowed tools. Skipping dispatch."
)
output_status = "error"
function_result = (
f"Function '{function_name}' is not an allowed tool and will not be executed."
)
else:
# Create a format the dispatcher can handle
tool_call = {
"id": call_id,
"function": {"name": function_name, "arguments": arguments_str},
"type": "function",
}
# Dispatch the function
try:
function_result = await dispatch_function(tool_call)
output_status = "success"
except Exception as e:
logger.exception(f"Error executing function {function_name}: {e}")
function_result = f"Error executing {function_name}: {str(e)}"
output_status = "error"
processed_call = ResponseToolCall(
id=call_id,
type="function",
function=FunctionCall(name=function_name, arguments=arguments_str),

Provide feedback with 👍 | 👎
Customize these alerts in project settings

Comment on lines 145 to 149
)

return response_obj

return router
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unvalidated LLM Function Call Execution

Explanation of Fix

Vulnerability & Fix Summary:

The vulnerability here is LLM Function-Call Injection (ML09) and Improper Control of Dynamically Evaluated Code (CWE-94): after calling the LLM API, the backend blindly trusts items in the LLM's output list where type == "function_call"—specifically, the function_name and arguments_str are passed directly to dispatch_function, which can be exploited to perform unauthorized or arbitrary backend calls.

Patch Approach:

  • *Allow-list: Only permit function names that come from registered, pre-approved tools—tools that should have been passed into the LLM as available options (via the tools argument). We extract these names by examining tools: for each tool, if its type is "function" and it provides a valid function name, then it is permitted.

  • Validation: When iterating through LLM outputs, we now check that each function call’s name is in our allow-list (allowed_function_names). If not, we skip dispatching and instead insert an error in the output for that call, preventing execution of arbitrary or malicious functions.

Potential Impacts:

  • Any function call generated by the LLM which is not present in the explicit tools list provided as input will be ignored/skipped and result in an error message in the response object.
  • This is compatible with established OpenAI tool conventions and does not break backward compatibility unless consumers were previously relying on the ability to call arbitrary backend functions (which is highly dangerous and should not be allowed).
  • No new dependencies are introduced. Error handling is preserved.
Issues
Type Identifier Message Severity Link
Application
CWE-94, ML09
LLM-generated function calls are executed blindly via dispatch_function with no allow-list or validation of function_name or arguments_str. A user can influence the model’s response (or provide custom tool definitions earlier in the request) to trigger unintended backend functions or supply malicious arguments, leading to privilege escalation or arbitrary code execution. This is an instance of LLM Function-Call Injection (ML09) and general Improper Control of Dynamically Evaluated Code (CWE-94).
critical
Link

Suggested Fix

Suggested change
)
return response_obj
return router
)
return response_obj
return router

Provide feedback with 👍 | 👎
Customize these alerts in project settings

Comment on lines 1 to 5
import asyncio
from typing import Generic, List, Optional, TypeVar, Union, get_args, get_origin, get_type_hints

import lancedb
from lancedb.pydantic import LanceModel, Vector
from pydantic import BaseModel
from lancedb.pydantic import LanceModel, Vector
from typing import Generic, List, Optional, TypeVar, Union, get_args, get_origin, get_type_hints
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrestricted Path Traversal in Database Adapter Cleanup Operation

Explanation of Fix

Vulnerability Summary:
The vulnerability occurs because self.url (user/configuration supplied) is directly used in the prune() function to call LocalStorage.remove_all(self.url) if it starts with a "/". This allows malicious input such as "/" or "/etc" to trigger recursive deletion of critical filesystem locations, leading to catastrophic data loss.

Fix:
To mitigate this, the patch introduces strict validation for the self.url path in the prune() method before calling LocalStorage.remove_all(self.url). Specifically, we ensure the path is:

  • Not empty
  • Not equal to "/" (the root directory) or any other extremely common dangerous paths (such as "/etc", "/bin", "/usr", "/root", etc.)
  • Not a parent directory reference (".." anywhere in the path)
  • Not a symlink (resolving symlink could otherwise potentially get around this defense)
  • And ideally, is located within a known safe prefix (optionally, but commented in the code to allow for future enforcement)

If the path fails any check, a RuntimeError is raised and the removal does not proceed.

We add an import of os at the top, which is acceptable given the standard library constraint. If strict parent directory protection is needed for your application, you may add further whitelisting.

No changes are made to other methods or logic.

Potential impact:
This patch will raise an error if prune() is called and the configured storage path is unsafe or suspicious, protecting system integrity. If previously valid installations relied on deletion of such critical system paths (highly unlikely and improper), their behavior will change—intentional. Normal user data directories (such as /app/data or /var/app/lancedb_data) are not affected and will continue to work.

Notable for reviewers: The validation routine is conservative but not foolproof against all possible clever bypasses. Adjust the safe prefix as needed for your deployment.

Issues
Type Identifier Message Severity Link
Application
CWE-22
self.url is accepted directly from external configuration and, if it begins with /, is passed to LocalStorage.remove_all without validation. Supplying a path such as "/" or "/etc" could trigger recursive deletion of critical filesystem directories (depending on runtime privileges), resulting in denial-of-service or full data loss. This is a classic path-traversal / arbitrary file operation issue.
critical
Link

Suggested Fix

Suggested change
import asyncio
from typing import Generic, List, Optional, TypeVar, Union, get_args, get_origin, get_type_hints
import lancedb
from lancedb.pydantic import LanceModel, Vector
from pydantic import BaseModel
from lancedb.pydantic import LanceModel, Vector
from typing import Generic, List, Optional, TypeVar, Union, get_args, get_origin, get_type_hints
import asyncio
import os
import lancedb
from pydantic import BaseModel
from lancedb.pydantic import LanceModel, Vector
from typing import Generic, List, Optional, TypeVar, Union, get_args, get_origin, get_type_hints

Provide feedback with 👍 | 👎
Customize these alerts in project settings

Copy link

pensarapp bot commented May 24, 2025

Unrestricted Path Traversal in Database Adapter Cleanup Operation
Suggested Fix

@@ -1,5 +1,6 @@
 import asyncio
+import os
 import lancedb
 from pydantic import BaseModel
 from lancedb.pydantic import LanceModel, Vector
 from typing import Generic, List, Optional, TypeVar, Union, get_args, get_origin, get_type_hints
@@ -236,9 +237,26 @@
             collection = await self.get_collection(collection_name)
             await collection.delete("id IS NOT NULL")
             await connection.drop_table(collection_name)
 
-        if self.url.startswith("/"):
+        # Securely validate the path before *any* dangerous operation.
+        if self.url and self.url.startswith("/"):
+            # Reject obviously dangerous paths
+            dangerous_paths = [
+                "/",     # root
+                "/bin", "/boot", "/dev", "/etc", "/lib", "/lib64", "/proc", "/root", "/run", "/sbin", "/sys", "/usr", "/var"
+            ]
+            normpath = os.path.normpath(os.path.abspath(self.url))
+            # Prevent deletion of dangerous system directories, directory traversal, and symlinks
+            if (
+                normpath in dangerous_paths or
+                normpath == "" or
+                ".." in normpath.split(os.sep) or
+                os.path.islink(normpath)
+                # Optionally ensure the path is under a known safe prefix, e.g. "/app/data/"
+                # not normpath.startswith("/app/data/")
+            ):
+                raise RuntimeError("Refusing to recursively remove unsafe or suspicious path: '%s'" % self.url)
             LocalStorage.remove_all(self.url)
 
     def get_data_point_schema(self, model_type: BaseModel):
         related_models_fields = []
@@ -274,5 +292,5 @@
             include_fields={
                 "id": (str, ...),
             },
             exclude_fields=["metadata"] + related_models_fields,
-        )
+        )
\ No newline at end of file
Explanation of Fix

Vulnerability Summary:
The vulnerability occurs because self.url (user/configuration supplied) is directly used in the prune() function to call LocalStorage.remove_all(self.url) if it starts with a "/". This allows malicious input such as "/" or "/etc" to trigger recursive deletion of critical filesystem locations, leading to catastrophic data loss.

Fix:
To mitigate this, the patch introduces strict validation for the self.url path in the prune() method before calling LocalStorage.remove_all(self.url). Specifically, we ensure the path is:

  • Not empty
  • Not equal to "/" (the root directory) or any other extremely common dangerous paths (such as "/etc", "/bin", "/usr", "/root", etc.)
  • Not a parent directory reference (".." anywhere in the path)
  • Not a symlink (resolving symlink could otherwise potentially get around this defense)
  • And ideally, is located within a known safe prefix (optionally, but commented in the code to allow for future enforcement)

If the path fails any check, a RuntimeError is raised and the removal does not proceed.

We add an import of os at the top, which is acceptable given the standard library constraint. If strict parent directory protection is needed for your application, you may add further whitelisting.

No changes are made to other methods or logic.

Potential impact:
This patch will raise an error if prune() is called and the configured storage path is unsafe or suspicious, protecting system integrity. If previously valid installations relied on deletion of such critical system paths (highly unlikely and improper), their behavior will change—intentional. Normal user data directories (such as /app/data or /var/app/lancedb_data) are not affected and will continue to work.

Notable for reviewers: The validation routine is conservative but not foolproof against all possible clever bypasses. Adjust the safe prefix as needed for your deployment.

Issues
Type Identifier Message Severity Link
Application
CWE-22
self.url is accepted directly from external configuration and, if it begins with /, is passed to LocalStorage.remove_all without validation. Supplying a path such as "/" or "/etc" could trigger recursive deletion of critical filesystem directories (depending on runtime privileges), resulting in denial-of-service or full data loss. This is a classic path-traversal / arbitrary file operation issue.
critical
Link

Copy link

pensarapp bot commented May 24, 2025

SQL Injection in Dynamic Table Creation Through Unsanitized Schema and Table Parameters
Suggested Fix

@@ -17,9 +17,43 @@
 
 
 logger = get_logger()
 
+# Allow-list of basic, safe SQL types (expand as desired for your application)
+_SQL_TYPE_WHITELIST = {
+    # Integers
+    "integer", "int", "bigint", "smallint", "tinyint", "serial", "bigserial",
+    # Floating point
+    "float", "double", "real", "decimal", "numeric",
+    # Text types
+    "varchar", "character varying", "char", "character", "text", "string",
+    # Boolean
+    "boolean", "bool",
+    # Date/time
+    "date", "time", "timestamp", "timestamptz", "datetime",
+    # UUID
+    "uuid",
+    # JSON
+    "json", "jsonb",
+}
 
+
+def _is_safe_sql_identifier(identifier: str) -> bool:
+    return isinstance(identifier, str) and identifier.isidentifier()
+
+
+def _is_safe_sql_type(typ: str) -> bool:
+    """
+    Very basic: Accepts only known safe types (case-insensitive, ignores length/precision).
+    e.g. "varchar(255)" is parsed as "varchar".
+    """
+    # Extract base type, e.g. "varchar(255)" -> "varchar"
+    if not isinstance(typ, str):
+        return False
+    base_type = typ.split("(", 1)[0].strip().lower()
+    return base_type in _SQL_TYPE_WHITELIST
+
+
 class SQLAlchemyAdapter:
     def __init__(self, connection_string: str):
         self.db_path: str = None
         self.db_uri: str = connection_string
@@ -55,17 +89,40 @@
             datasets = result.unique().scalars().all()
             return datasets
 
     async def create_table(self, schema_name: str, table_name: str, table_config: list[dict]):
-        fields_query_parts = [f"{item['name']} {item['type']}" for item in table_config]
+        # --- PATCH START: strong identifier/type validation ---
+        # Validate schema_name
+        if not _is_safe_sql_identifier(schema_name):
+            raise ValueError(f"Invalid schema name: {schema_name!r}")
+
+        # Validate table_name
+        if not _is_safe_sql_identifier(table_name):
+            raise ValueError(f"Invalid table name: {table_name!r}")
+
+        # Validate and build field definition string securely
+        fields_query_parts = []
+        for item in table_config:
+            colname = item.get("name")
+            coltype = item.get("type")
+            if not _is_safe_sql_identifier(colname):
+                raise ValueError(f"Invalid column name: {colname!r}")
+            if not _is_safe_sql_type(coltype):
+                raise ValueError(f"Invalid or unsafe SQL type: {coltype!r}")
+            # Safe to quote colname and pass type as-is (since checked)
+            fields_query_parts.append(f'"{colname}" {coltype}')
+
         async with self.engine.begin() as connection:
-            await connection.execute(text(f"CREATE SCHEMA IF NOT EXISTS {schema_name};"))
             await connection.execute(
+                text(f"CREATE SCHEMA IF NOT EXISTS {schema_name};")
+            )
+            await connection.execute(
                 text(
                     f'CREATE TABLE IF NOT EXISTS {schema_name}."{table_name}" ({", ".join(fields_query_parts)});'
                 )
             )
             await connection.close()
+        # --- PATCH END ---
 
     async def delete_table(self, table_name: str, schema_name: Optional[str] = "public"):
         async with self.engine.begin() as connection:
             if self.engine.dialect.name == "sqlite":
@@ -434,5 +491,5 @@
                                     logger.warning(
                                         f"Missing value in foreign key information. \nColumn value: {col}\nReference column value: {ref_col}\n"
                                     )
 
-            return schema
+            return schema
\ No newline at end of file
Explanation of Fix

Explanation of the Vulnerability and the Fix

The original create_table method in SQLAlchemyAdapter directly interpolates user-controlled values (schema_name, table_name, and column definition data from table_config) into a raw SQL string, which is executed using SQLAlchemy's text() construct. This is a direct instance of the classic SQL Injection vulnerability (CWE-89), as maliciously crafted input could manipulate the resulting SQL query and perform unauthorized actions (e.g., dropping tables).

What is addressed in the fix:

  • Strict validation is applied to schema_name and table_name (str.isidentifier()) before using them in SQL queries to ensure they are safe SQL identifiers (only letters, digits, and underscores, not starting with a digit).
  • Column names in table_config are also validated with isidentifier(). An exception is raised if validation fails.
  • Column types in table_config are compared against a fixed allow-list of recognized SQL types (case-insensitive, basic types only). This prevents injection via type fields such as "integer); DROP TABLE ...; --".
  • Detailed error messages are provided when invalid identifiers or column types are found.
  • These checks prevent potentially unsafe data from reaching the raw SQL builder, thus completely mitigating the risk of SQL injection.
  • No new dependencies are introduced, and method signatures remain unchanged.

The rest of the codebase is unaffected. Only the vulnerable code and any direct usages that could be impacted by stricter validation are addressed. User code passing invalid names will now raise an explicit ValueError.

Issues
Type Identifier Message Severity Link
Application
CWE-89
User-controlled identifiers (schema_name, table_name, column names/types) are directly interpolated into raw SQL. An attacker supplying crafted values can escape the query context (e.g., schema_name="public; DROP TABLE users; --") and execute arbitrary SQL. Because this is part of a generic adapter, any upstream consumer could unknowingly pass untrusted data, making this a classic and highly exploitable SQL-Injection vector.
critical
Link

Copy link

pensarapp bot commented May 24, 2025

SQL Injection in Table Drop Operations via Unsanitized Identifiers
Suggested Fix

@@ -66,16 +66,22 @@
             )
             await connection.close()
 
     async def delete_table(self, table_name: str, schema_name: Optional[str] = "public"):
+        # Validate table_name and schema_name to prevent SQL injection
+        if not table_name or not table_name.isidentifier():
+            raise ValueError("Invalid table name")
+        if schema_name and not schema_name.isidentifier():
+            raise ValueError("Invalid schema name")
         async with self.engine.begin() as connection:
             if self.engine.dialect.name == "sqlite":
                 # SQLite doesn't support schema namespaces and the CASCADE keyword.
                 # However, foreign key constraint can be defined with ON DELETE CASCADE during table creation.
                 await connection.execute(text(f'DROP TABLE IF EXISTS "{table_name}";'))
             else:
+                # For PostgreSQL and similar, quote identifiers safely
                 await connection.execute(
-                    text(f'DROP TABLE IF EXISTS {schema_name}."{table_name}" CASCADE;')
+                    text(f'DROP TABLE IF EXISTS "{schema_name}"."{table_name}" CASCADE;')
                 )
 
     async def insert_data(
         self,
@@ -434,5 +440,5 @@
                                     logger.warning(
                                         f"Missing value in foreign key information. \nColumn value: {col}\nReference column value: {ref_col}\n"
                                     )
 
-            return schema
+            return schema
\ No newline at end of file
Explanation of Fix

Explanation of the Vulnerability and the Fix

The original implementation of the delete_table method in SQLAlchemyAdapter directly interpolates the table_name and schema_name parameters into raw SQL statements using Python f-strings. This exposes the application to a SQL injection vulnerability (CWE-89). An attacker could provide specially crafted input for table_name or schema_name to execute arbitrary SQL, potentially dropping critical tables or schemas.

What Has Been Changed:

To fix this, the following changes were made:

  • Both table_name and schema_name are now validated using Python's str.isidentifier() method, ensuring only valid SQL identifiers are accepted. If the validation fails, a ValueError is raised.
  • For non-SQLite databases, schema and table names are quoted using double quotes, which is standard for PostgreSQL and most SQL dialects supporting case-sensitive identifiers. They are never interpolated directly.
  • For SQLite, the double quotes are retained as before (since only table_name is relevant, and quotes prevent most injection).
  • No new dependencies or breaking changes were introduced. All usages of these variables within the function are now safe.

Possible Impacts:

  • If client code previously relied on using invalid identifier names (e.g., names including spaces, punctuation, or SQL keywords), those names will now be rejected by validation, resulting in a ValueError. This is an intentional tradeoff prioritizing security. This pattern is also found in the get_all_data_from_table method, providing consistency.
  • All other code calling delete_table should now expect that invalid names will not be allowed.

No other parts of the codebase are directly affected, as only input validation and safe quoting were introduced within the function.

Issues
Type Identifier Message Severity Link
Application
CWE-89
The DROP TABLE statements embed schema_name and table_name directly into SQL without sanitisation or quoting. Malicious input such as table_name='victims" CASCADE; DROP SCHEMA public; --' would be executed with the adapter's privileges, leading to destructive data loss.
critical
Link

Copy link

pensarapp bot commented May 24, 2025

Unvalidated SQL Query Execution Allowing Direct SQL Injection
Suggested Fix

@@ -289,11 +289,47 @@
             # Fetch all rows as a list of dictionaries
             rows = result.mappings().all()
             return rows
 
-    async def execute_query(self, query):
+    async def execute_query(self, query, params=None):
+        """
+        Executes a parameterized, read-only SELECT SQL query in a safe way.
+
+        Args:
+            query (str): Must be a SELECT query with optional parameter placeholders (:param).
+            params (dict, optional): Parameters to bind to the query. Defaults to None.
+
+        Returns:
+            list[dict]: Query result rows.
+        
+        Raises:
+            ValueError: If query is not a single SELECT statement, or appears unsafe/unsupported.
+        """
+        if params is None:
+            params = {}
+
+        # Remove leading/trailing whitespace.
+        stripped_query = query.strip()
+        # Forbid empty queries
+        if not stripped_query:
+            raise ValueError("Empty query not allowed.")
+
+        # Make sure it's a single SELECT statement
+        # Basic checks: startswith SELECT (SQL is case-insensitive)
+        # Only allow select statements. Doesn't allow CTE, multiple statements, DML, DDL, etc.
+        lowered = stripped_query.lower()
+        if not lowered.startswith("select"):
+            raise ValueError("Only SELECT statements are allowed.")
+
+        # Forbid any extra statements by checking for semicolons not at the end
+        # Allow semicolon only if it's the very last character after stripping, or not present at all.
+        if ";" in stripped_query[:-1]:
+            raise ValueError("Multiple or stacked SQL statements are not allowed.")
+
+        # (Optional) Could add more checks, like forbidding union/all, etc.
+
         async with self.engine.begin() as connection:
-            result = await connection.execute(text(query))
+            result = await connection.execute(text(stripped_query), params)
             return [dict(row) for row in result]
 
     async def drop_tables(self):
         async with self.engine.begin() as connection:
@@ -434,5 +470,5 @@
                                     logger.warning(
                                         f"Missing value in foreign key information. \nColumn value: {col}\nReference column value: {ref_col}\n"
                                     )
 
-            return schema
+            return schema
\ No newline at end of file
Explanation of Fix

Vulnerability Summary:
The execute_query method in SQLAlchemyAdapter executes arbitrary SQL provided in the query argument using connection.execute(text(query)), with no validation or parameterization. This means an attacker could execute any SQL command if untrusted input reaches this method, resulting in a severe SQL injection (CWE-89) risk.

Fix Summary:
To address this, the method has been refactored to accept only parameterized SQL queries. Additionally, it enforces that only read-only SELECT queries can be executed via execute_query. This is achieved by:

  • Adding a new params argument for query parameters (defaulting to {}).
  • Checking that the query string starts with SELECT, case-insensitive, and stripping whitespace. If not, a ValueError is raised.
  • Forbidding semicolons in the SQL query (after the first) to block stacked/compound statements.
  • Disallowing multi-statement execution (single statement only).
  • Forcing the caller to use bound parameters via the params argument.
  • Updating the method signature and documentation string.

Minimal Impact:

  • No new dependencies introduced.
  • The method now enforces safe, parameterized SQL, suitable for user-driven or indirect access.
  • Existing functionality that previously expected to run any SQL query will now raise an exception if a non-SELECT or unparameterized statement is provided.
  • If the system needs arbitrary SQL execution elsewhere, it should be implemented via a separate, secured/internal-only method, not accessible from untrusted sources.

Potential Impacts:

  • Any code relying on execute_query for non-SELECT statements will now fail.
  • Callers must pass parameters via the new params argument, improving both safety and code clarity.
Issues
Type Identifier Message Severity Link
Application
CWE-89
This helper executes arbitrary SQL passed in via the query argument with no validation or parameterisation. If exposed (directly or indirectly) to untrusted input, it constitutes a powerful SQL-Injection / RCE gadget, as the caller can execute any command with the adapter’s database privileges.
critical
Link

@Vasilije1990 Vasilije1990 changed the base branch from main to dev May 24, 2025 10:24
@Vasilije1990 Vasilije1990 requested a review from hajdul88 June 2, 2025 22:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant