Eurosat #122
Hello @Prateek0xeo, thanks a lot for this contribution! I have a few comments on it:
Could you please fix those two as well? Thanks a lot in advance. |
Sure! Thank you for the feedback. I'll improve the PR and make sure that all other pull requests I make reflect the same. |
Then checking the use of
Thus |
I've contacted the website owner for a fix / workaround. |
@cregouby I have implemented all the changes you mentioned. |
Hello @cregouby, I referred to the PyTorch repository and tried implementing the new URL and MD5 checksum provided in the code, but it still didn't resolve the issue. While the SSL certificate problem was fixed there, the dataset still doesn't load as expected. I also took inspiration from the spam-loader review by @dfalbel and applied similar improvements to this PR. |
Hello @Prateek0xeo, I've received an answer from the website administrator:
I can see in the meantime that you switched to the Hugging Face dataset, which fails in my case with a wrong MD5.
We can find the root cause of the download failure in the zip file:
Browse[1]> readLines(zip_file)
[1] "Found. Redirecting to https://cdn-lfs.hf.co/repos/fc/1d/fc1dee780dee1dae2ad48856d0961ac6aa5dfcaaaa4fb3561be4aedf19b7ccc7/8ebea626349354c5328b142b96d0430e647051f26efc2dc974c843f25ecf70bd?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27EuroSAT.zip%3B+filename%3D%22EuroSAT.zip%22%3B&response-content-type=application%2Fzip&Expires=1737474034&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTczNzQ3NDAzNH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5oZi5jby9yZXBvcy9mYy8xZC9mYzFkZWU3ODBkZWUxZGFlMmFkNDg4NTZkMDk2MWFjNmFhNWRmY2FhYWE0ZmIzNTYxYmU0YWVkZjE5YjdjY2M3LzhlYmVhNjI2MzQ5MzU0YzUzMjhiMTQyYjk2ZDA0MzBlNjQ3MDUxZjI2ZWZjMmRjOTc0Yzg0M2YyNWVjZjcwYmQ%7EcmVzcG9uc2UtY29udGVudC1kaXNwb3NpdGlvbj0qJnJlc3BvbnNlLWNvbnRlbnQtdHlwZT0qIn1dfQ__&Signature=buJTsafUfhVxwvReQoUSJcNl3VwU74g1YghMxdzBwM6G25WA3-eyL3M6edcgNDBDCEBGjaoJ026dpGx9RZ%7EXsyirvlbOdND95c1jOqIZU5Y3Lh67bSvZcbbSbgGOjq5OvHvKbMSmXBWvPGzG3Ody8Z8Fm%7ExN9mhW6jqwi6LaQ84pH2VW-bWLLXTyrzexXmrmY2FBplfM3ZGD%7EmiTx1JHu97SbY7S8x9GYAPW16vV6w%7EVIWJ5KztnTpF%7Ea78q8tYWasUttVF9q9SbMw2DJMLq3s7MyFZlxXnO6cSM4jMN1fowTxbjbV5-O6rSpf0jp0-FtpK5DfVQWEvTAnBZ3ZOg%7Ew__&Key-Pair-Id=K3RPWS32NSSJCE" So see my remarks inline. |
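The `readLines(zip_file)` output above shows that the file saved as the archive actually holds a plain-text redirect notice. A minimal sketch of a guard that could catch this before unzipping, assuming a hypothetical helper name `is_zip_file()` that is not part of this PR; it only compares the first four bytes against the ZIP local-file-header signature:

```r
# Hypothetical helper (not from the PR): TRUE when `path` starts with the
# ZIP local-file-header signature "PK\003\004".
is_zip_file <- function(path) {
  if (!file.exists(path)) {
    return(FALSE)
  }
  magic <- readBin(path, what = "raw", n = 4L)
  identical(magic, as.raw(c(0x50, 0x4b, 0x03, 0x04)))
}

# A text file holding a redirect notice, similar to the failed download.
# The URL here is a placeholder, not the real signed URL.
not_a_zip <- tempfile(fileext = ".zip")
writeLines("Found. Redirecting to https://example.invalid/EuroSAT.zip", not_a_zip)

# A file that merely starts with the ZIP signature (rest is arbitrary).
looks_like_zip <- tempfile(fileext = ".zip")
writeBin(as.raw(c(0x50, 0x4b, 0x03, 0x04, 0x00)), looks_like_zip)

is_zip_file(not_a_zip)
is_zip_file(looks_like_zip)
```

Such a check does not replace the MD5 verification; it would only make the "redirect saved as zip" case fail with a clearer message.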
R/dataset-eurosat.R
Outdated
self$resources$url,
destfile = zip_file,
mode = "wb",
method = "curl",
todo design: the default libcurl method does not require a separate installation and follows redirections, whereas curl does not. So I would keep the default.
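A minimal sketch of what keeping the default could look like, assuming a hypothetical wrapper name `download_archive()` that is not part of this PR. It leaves `method` unset, so `download.file()` picks its own default and no external curl binary or extra flags are involved:

```r
# Hypothetical wrapper (not from the PR): no `method` and no `extra` argument,
# so download.file() uses its default method.
download_archive <- function(url, destfile) {
  status <- utils::download.file(url, destfile = destfile, mode = "wb", quiet = TRUE)
  # download.file() returns 0 (invisibly) on success.
  if (status != 0) {
    stop("Download failed with status ", status)
  }
  invisible(destfile)
}

# Offline demonstration with a local file:// URL (Unix-like path form),
# so the sketch can be run without network access.
src <- tempfile()
writeLines("payload", src)
dest <- tempfile()
download_archive(paste0("file://", normalizePath(src)), dest)
readLines(dest)
```

In the dataset code the `url` argument would be `self$resources$url` and `destfile` would be `zip_file`, as in the snippet under review.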
R/dataset-eurosat.R
Outdated
destfile = zip_file,
mode = "wb",
method = "curl",
extra = "--insecure"
todo security: as mentioned in https://curl.se/docs/manpage.html#-k, this should not be used. Please remove.
Hello @Prateek0xeo
Thanks for submitting,
Please fix the remarks, and fix the tests.
question: as the Hugging Face dataset already includes a train / test / val split, is there a way to make those splits available here through a dataset option, like in mnist_dataset()?
suggestion: My proposal would be
#' @param split (character, optional): If `train` (default), creates the training dataset. Otherwise
#' value should be either `test` or `val` for respectively the test set or the validation set.
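One way the proposed `split` argument could be validated is with `match.arg()`. This is only a sketch: the helper name `resolve_split()` is hypothetical, and the allowed values are taken from the roxygen proposal above.

```r
# Hypothetical helper (not from the PR): restrict `split` to the documented
# values, with "train" as the default because it is listed first.
resolve_split <- function(split = c("train", "test", "val")) {
  match.arg(split)
}

resolve_split()        # "train"
resolve_split("val")   # "val"

# Any other value raises an error instead of silently loading the wrong split.
tried <- try(resolve_split("validation_set"), silent = TRUE)
inherits(tried, "try-error")
```

Note that `match.arg()` also accepts unambiguous prefixes such as "te" for "test"; a stricter check would compare with `%in%` instead.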
expect_true(length(files) > 0, info = "Files should be downloaded in the temporary directory")

extracted_dir <- file.path(temp_root, "2750")
expect_true(dir.exists(extracted_dir), info = "Extracted data folder should exist")
This fails for me:
> expect_true(dir.exists(extracted_dir), info = "Extracted data folder should exist")
Error: dir.exists(extracted_dir) is not TRUE
`actual`: FALSE
`expected`: TRUE
Extracted data folder should exist
> head(files, 2)
[1] "/tmp/Rtmpmm3AfN/file6970544f1c340//eurosat/2750/AnnualCrop/AnnualCrop_1.jpg" "/tmp/Rtmpmm3AfN/file6970544f1c340//eurosat/2750/AnnualCrop/AnnualCrop_10.jpg"
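The two paths printed by `head(files, 2)` place the extracted folder under an `eurosat` subdirectory, whereas the test looks for `2750` directly under `temp_root`. A small sketch that recreates that layout in a temporary directory (the layout is inferred only from the two listed paths) and compares both candidate locations:

```r
# Recreate the directory layout suggested by the listed file paths.
temp_root <- tempfile()
class_dir <- file.path(temp_root, "eurosat", "2750", "AnnualCrop")
dir.create(class_dir, recursive = TRUE)
file.create(file.path(class_dir, "AnnualCrop_1.jpg"))

# Path checked by the failing test.
tested_dir <- file.path(temp_root, "2750")
# Path implied by the file listing.
listed_dir <- file.path(temp_root, "eurosat", "2750")

dir.exists(tested_dir)
dir.exists(listed_dir)
```

Whether the test or the extraction directory should change is a design decision for the PR; the sketch only illustrates the mismatch.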
R/dataset-eurosat.R
Outdated
name = "eurosat",

resources = list(
url = "https://huggingface.co/datasets/torchgeo/eurosat/resolve/c877bcd43f099cd0196738f714544e355477f3fd/EuroSAT.zip",
question: That URL points to a specific commit of the dataset on the Hugging Face platform. Is that what we want, or do we want to use the latest version of this dataset? In that case the URL would be the one accessible on the front page of the dataset: url = "https://huggingface.co/datasets/torchgeo/eurosat/resolve/main/EuroSAT.zip?download=true"
pytorch/vision#8563
I tried to implement the fix taking this PyTorch issue as a reference, so I copied the same URL from the code. I'll switch to the latest version of the dataset as you pointed out.
@cregouby I have already tried the data hosted on Zenodo; it is not working for me. I'll try to improvise on the Hugging Face dataset and make it work. |
[1,] 0.5843137 0.5764706 0.5725490 0.5764706 0.5843137 0.5921569 0.5843137
The dataloader seems to be working now. I'll push the code after a self-review. |
@cregouby In this commit the dataloader is working. I am still figuring out the split functionality; until then, please review the new commit. |
Hello @cregouby
Hi Prateek, All of the datasets in https://huggingface.co/datasets/torchgeo/ are intended for use with the TorchGeo library. You can use it like so:
from torchgeo.datasets import EuroSAT
train_dataset = EuroSAT('data', split='train', download=True)
So I implemented the functionality using these URL endpoints:
curl -X GET
but the val split was not working and was throwing an error. So I have contacted the owner again about this issue: https://huggingface.co/datasets/torchgeo/eurosat/discussions/2 |
@cregouby there is a problem with image decoding with this dataset |
Hello @Prateek0xeo 1 - The new download only gets the first 100 samples, which is not what we want. We want the full train dataset, i.e. 32.5k rows. I get the following error running the tests:
In order to ensure the tests run correctly on all platforms, you should activate GitHub Actions in the settings of your git repository, and then activate the R-CMD-check workflow, so that for each of your commits you will see test results for all platforms. All must be green, or you have to fix the code. |
Experimenting the R-CMD-check
Hello @Prateek0xeo |
@cregouby yes, actually I was trying to figure that out when I noticed it
was already done in the git repo, so I deleted the folder.
Update: I have made the split work, but the problem is that I can't download
more than 100 samples from the URL. I tried using Parquet but that also failed. I
am now trying to use the text file to split the data downloaded from the zip.
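The text-file approach mentioned above could look roughly like the following sketch. Everything here is an assumption for illustration: the helper name `filter_by_split()` is hypothetical, and the split file is assumed to list one image file name per line.

```r
# Hypothetical helper (not from the PR): keep only the image paths whose
# base name appears in `split_file` (assumed format: one file name per line).
filter_by_split <- function(image_paths, split_file) {
  wanted <- trimws(readLines(split_file))
  image_paths[basename(image_paths) %in% wanted]
}

# Illustrative inputs: three extracted image paths and a split file naming two.
image_paths <- c(
  "2750/AnnualCrop/AnnualCrop_1.jpg",
  "2750/AnnualCrop/AnnualCrop_10.jpg",
  "2750/Forest/Forest_7.jpg"
)
split_file <- tempfile(fileext = ".txt")
writeLines(c("AnnualCrop_1.jpg", "Forest_7.jpg"), split_file)

train_paths <- filter_by_split(image_paths, split_file)
train_paths
```

With this approach the full zip is downloaded once and each split is a filter over the extracted files, which avoids the 100-sample limit of the endpoint described above.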
…On Thu, 23 Jan, 2025, 00:27 cregouby, ***@***.***> wrote:
Hello @Prateek0xeo <https://github.com/Prateek0xeo>
There is no need to change anything in the .github/workflow/ folder as
everything is already here.
The setup I mentioned is to be done in *your* GitHub repository.
|

# Test train split
train_ds <- eurosat_dataset(root = "./data/eurosat", split = "train", download = TRUE)
expect_true(length(train_ds) > 0, info = "Train dataset should have a non-zero length")
expect_true(length(train_ds) > 0, info = "Train dataset should have a non-zero length")
expect_length(train_ds, 16200) # Train dataset should have the expected length

# Test test split
test_ds <- eurosat_dataset(root = "./data/eurosat", split = "test", download = TRUE)
expect_true(length(test_ds) > 0, info = "Test dataset should have a non-zero length")
expect_true(length(test_ds) > 0, info = "Test dataset should have a non-zero length")
expect_length(test_ds, 5400) # Test dataset should have the expected length

# Test validation split
validation_ds <- eurosat_dataset(root = "./data/eurosat", split = "validation", download = TRUE)
expect_true(length(validation_ds) > 0, info = "Validation dataset should have a non-zero length")
expect_true(length(validation_ds) > 0, info = "Validation dataset should have a non-zero length")
expect_length(validation_ds, 5400) # Validation dataset should have the expected length
expect_true(!is.null(sample$x), info = "Image should not be null")
expect_true(!is.null(sample$y), info = "Label should not be null")
todo: it would be much better to check the tensor shape and tensor dtype of each x and y object
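A sketch of what stricter checks could look like, under loud assumptions: the `sample` below is a hypothetical stand-in built from a base R array, the 64 x 64 x 3 shape and the integer label are assumed rather than taken from the PR, and base R `stopifnot()` is used so the snippet runs without extra packages. For torch tensors the same idea would compare `x$shape` and `x$dtype` instead of `dim()` and the storage type.

```r
# Hypothetical stand-in for one dataset item (shape and label type assumed).
sample <- list(
  x = array(runif(64 * 64 * 3), dim = c(64, 64, 3)),
  y = 1L
)

# Check shape and storage type rather than only non-NULL-ness.
stopifnot(
  identical(dim(sample$x), c(64L, 64L, 3L)),
  is.double(sample$x),
  is.integer(sample$y),
  length(sample$y) == 1L
)
```

In the test file these conditions would more naturally be written with `expect_equal()` and `expect_true()`, matching the existing tests.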
#104
md5 mismatch issue solved