Skip to content

Question: File Counts and Dataset Size #44

Open
@darien-schettler

Description

@darien-schettler

I recently downloaded The Stack (the-stack-dedup) from Huggingface via GIT LFS. I have two questions that I need help with:

  1. The size on disk of the dedup datset is only around 900GB (much smaller than the 1.5TB indicated on the data card - https://huggingface.co/datasets/bigcode/admin/resolve/main/the-stack-infographic-v11.png)

  2. Is there somewhere were the file counts are listed in full for each dataset by language (dedup and full)?

Essentially I am looking to make sure that I have accessed the entirety of the dataset, so I either need to understand the dataset size difference, or know how many files there should be for each language so I can validate my download. Ideally both.

Thanks in advance!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions