Commit dc44501

Merge pull request #14 from huggingface/v2
v2
2 parents f2f0b83 + e284591 commit dc44501

2 files changed: 13 additions, 1 deletion

README.md

Lines changed: 12 additions & 0 deletions
@@ -16,6 +16,7 @@ A Spark Data Source for accessing [🤗 Hugging Face Datasets](https://huggingfa
 - Stream datasets from Hugging Face as Spark DataFrames
 - Select subsets and splits, apply projection and predicate filters
 - Save Spark DataFrames as Parquet files to Hugging Face
+- Fast deduped uploads
 - Fully distributed
 - Authentication via `huggingface-cli login` or tokens
 - Compatible with Spark 4 (with auto-import)
@@ -78,6 +79,17 @@ df = (
 )
 ```

+## Fast deduped uploads
+
+Hugging Face stores files on Xet, a deduplication-based storage system that enables fast deduped uploads.
+
+Unlike traditional remote storage, uploads are faster on Xet because duplicate data is uploaded only once.
+For example, if some or all of the data already exists in other files on Xet, it is not uploaded again, which saves bandwidth and speeds up uploads. Deduplication for Parquet is enabled through Content-Defined Chunking (CDC).
+
+Thanks to Parquet CDC and Xet deduplication, saving a dataset on Hugging Face is faster than on traditional remote storage whenever part of the data is already stored.
+
+For more information, see [https://huggingface.co/blog/parquet-cdc](https://huggingface.co/blog/parquet-cdc).
+
 ## Backport

 While the Data Source API was introduced in Spark 4, this package includes a backport for older versions.
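
The "Fast deduped uploads" section added in this diff relies on content-defined chunking, which may be easier to follow with a small example. The sketch below is a toy in plain Python, not the chunking algorithm used by Parquet or Xet. The function names (`fixed_chunks`, `cdc_chunks`, `bytes_to_upload`), the window size, and the boundary mask are all assumptions made for illustration, and production implementations typically use a rolling hash instead of rehashing every window. It compares how many bytes would need uploading after a 10-byte insertion when chunk boundaries are fixed-size versus content-defined.

```python
import hashlib
import random


def fixed_chunks(data: bytes, size: int = 64) -> list:
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def cdc_chunks(data: bytes, window: int = 16, mask: int = 0x3F) -> list:
    """Toy content-defined chunking (illustrative, not the Parquet/Xet algorithm).

    A chunk ends after byte i when a hash of the previous `window` bytes
    has its low bits equal to zero, so boundaries depend only on nearby
    content and not on absolute offsets.
    """
    chunks = []
    start = 0
    for i in range(window - 1, len(data)):
        digest = hashlib.blake2b(data[i + 1 - window:i + 1], digest_size=4).digest()
        if int.from_bytes(digest, "big") & mask == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last boundary
    return chunks


def bytes_to_upload(old: bytes, new: bytes, chunker) -> int:
    """Count bytes in chunks of `new` that are not already stored from `old`."""
    stored = {hashlib.sha256(c).digest() for c in chunker(old)}
    return sum(len(c) for c in chunker(new)
               if hashlib.sha256(c).digest() not in stored)


# Hypothetical data: version 2 is version 1 with 10 bytes inserted in the middle.
rng = random.Random(0)
v1 = bytes(rng.getrandbits(8) for _ in range(20_000))
v2 = v1[:5_000] + b"0123456789" + v1[5_000:]

fixed_upload = bytes_to_upload(v1, v2, fixed_chunks)
cdc_upload = bytes_to_upload(v1, v2, cdc_chunks)
print("fixed-size chunking, bytes to upload:", fixed_upload)
print("content-defined chunking, bytes to upload:", cdc_upload)
```

With fixed-size chunks, every chunk after the insertion point shifts and no longer matches what is stored. With content-defined boundaries, only the chunks around the edit change, which is the property that lets a deduplicating store skip most of the file.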

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [project]
 name = "pyspark_huggingface"
-version = "1.0.0"
+version = "2.0.0"
 description = "A DataSource for reading and writing HuggingFace Datasets in Spark"
 authors = [
 {name = "allisonwang-db", email = "[email protected]"},
