A misprint in the "Big data? 🤗 Datasets to the rescue!" chapter of the NLP Course? #767

Open
TopCoder2K opened this issue Dec 20, 2024 · 0 comments

There is the following code in the "The magic of memory mapping" section:

print(f"Number of files in dataset : {pubmed_dataset.dataset_size}")
size_gb = pubmed_dataset.dataset_size / (1024**3)
print(f"Dataset size (cache file) : {size_gb:.2f} GB")

It seems the label should read "Number of bytes in dataset" rather than "Number of files in dataset". The dataset has 15,518,009 rows, and dividing pubmed_dataset.dataset_size by 1024**3 only makes sense if the value is a size in bytes, not a count of files.
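
For illustration, the corrected output could look like the sketch below. `describe_dataset_size` is a hypothetical helper written for this issue, not part of 🤗 Datasets, and the byte count passed to it is made up rather than the real PubMed figure.

```python
def describe_dataset_size(num_bytes: int) -> str:
    """Return the two report lines with the first label corrected to 'bytes'.

    Hypothetical helper: in the course, `num_bytes` would be
    `pubmed_dataset.dataset_size`.
    """
    # 1024**3 bytes per unit, as in the course snippet (labelled "GB" there).
    size_gb = num_bytes / (1024**3)
    return (
        f"Number of bytes in dataset : {num_bytes}\n"
        f"Dataset size (cache file) : {size_gb:.2f} GB"
    )


# Made-up value of exactly 2 * 1024**3 bytes, only to show the format.
print(describe_dataset_size(2 * 1024**3))
```

With that input the first line reports 2147483648 bytes and the second reports 2.00 GB, so the two lines describe the same quantity in different units.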
