Eurosat #122

Open
wants to merge 11 commits into base: main
Conversation

Prateek0xeo

#104
md5 mismatch issue solved

@cregouby
Contributor

Hello @Prateek0xeo,

Thanks a lot for this contribution!

I have a few comments on it:

  • We do not want to host archive files in the code repository, as it would make the package too heavy, up to the point it would not pass the CRAN check. Could you please remove those zip and tar.gz files from the pull request?
  • The check fails with issues that should be fixed:
==> devtools::check()

══ Documenting ══════════════════════════════════════════════════════════
ℹ Updating torchvision documentation
ℹ Loading torchvision
Warning: Objects listed as exports, but not present in namespace:
• inat2021_dataset
✖ dataset-eurosat.R:28: @export must be a single line, not 5.
ℹ The first line is "Download the EuroSAT dataset"
✖ transforms-magick.R:56: S3 method `get_image_size.magick-image` needs
  @export or @exportS3method tag.
Writing NAMESPACE
...
── R CMD check results ────────────────────── torchvision 0.6.0.9000 ────
Duration: 47.6s

❯ checking Rd \usage sections ... WARNING
  Documented arguments not in \usage in Rd file 'eurosat_dataset.Rd':
    ‘i’
  
  Functions with \usage entries need to have the appropriate \alias
  entries, and all their arguments documented.
  The \usage entries must correspond to syntactically valid R code.
  See chapter ‘Writing R documentation files’ in the ‘Writing R
  Extensions’ manual.

Could you please fix those two as well ?

Thanks a lot in advance.

@Prateek0xeo
Author

Sure! Thank you for the feedback. I'll improve the PR and make sure that all my other pull requests reflect the same changes.

@cregouby
Contributor

When checking ds <- eurosat_dataset(download = TRUE), I get an SSL error from the server certificate:

utils::download.file(self$resources$url, destfile = zip_file,  :
  URL 'https://madm.dfki.de/files/sentinel/EuroSAT.zip': status 'SSL peer certificate or SSH remote key was not OK'

Thus libcurl does not download the zip file and the dataloader fails.

The error is confirmed by online SSL checkers:
(screenshot of the SSL checker result)

@cregouby
Contributor

I've contacted the website owner for a fix / workaround.

@Prateek0xeo
Author

@cregouby I have implemented all the changes you mentioned.
I have now removed the data files from the PR, improved the documentation, and ran the check command, which gave no errors.
Please check the PR and tell me if there is anything else left, so that I can move forward with the rest of the dataset loaders.

@Prateek0xeo
Author

Prateek0xeo commented Jan 17, 2025

Hello @cregouby,

I referred to the PyTorch repository and tried implementing the new URL and MD5 checksum provided in the code, but it still didn't resolve the issue. While the SSL certificate problem was fixed there, the dataset still doesn't load as expected.

I also took inspiration from the spam-loader review by @dfalbel and applied similar improvements to this PR.

@cregouby
Contributor

Hello @Prateek0xeo

I've received an answer from the website administrator:

Dear Christophe,
First of all, thank you for reaching out and for your interest in our dataset.
Unfortunately, this is not the first time this issue has occurred, so I decided to host the dataset on Zenodo as well.
For a permanent fix, I would recommend using the version hosted on Zenodo (https://zenodo.org/records/7711810).
If you have any questions or need further information, please feel free to contact me.
Kind regards,

I can see in the meantime that you switched to the Hugging Face dataset, which fails in my case with a wrong MD5:

> ds <- eurosat_dataset(download = TRUE)
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1173  100  1173    0     0   1123      0  0:00:01  0:00:01 --:--:--  1124
Error in self$download() : 
  Downloaded file has an invalid MD5 checksum. Please try again.
Called from: self$download()

We can find the root cause of the download failure in the zip file:

Browse[1]> readLines(zip_file)
[1] "Found. Redirecting to https://cdn-lfs.hf.co/repos/fc/1d/fc1dee780dee1dae2ad48856d0961ac6aa5dfcaaaa4fb3561be4aedf19b7ccc7/8ebea626349354c5328b142b96d0430e647051f26efc2dc974c843f25ecf70bd?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27EuroSAT.zip%3B+filename%3D%22EuroSAT.zip%22%3B&response-content-type=application%2Fzip&Expires=1737474034&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTczNzQ3NDAzNH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5oZi5jby9yZXBvcy9mYy8xZC9mYzFkZWU3ODBkZWUxZGFlMmFkNDg4NTZkMDk2MWFjNmFhNWRmY2FhYWE0ZmIzNTYxYmU0YWVkZjE5YjdjY2M3LzhlYmVhNjI2MzQ5MzU0YzUzMjhiMTQyYjk2ZDA0MzBlNjQ3MDUxZjI2ZWZjMmRjOTc0Yzg0M2YyNWVjZjcwYmQ%7EcmVzcG9uc2UtY29udGVudC1kaXNwb3NpdGlvbj0qJnJlc3BvbnNlLWNvbnRlbnQtdHlwZT0qIn1dfQ__&Signature=buJTsafUfhVxwvReQoUSJcNl3VwU74g1YghMxdzBwM6G25WA3-eyL3M6edcgNDBDCEBGjaoJ026dpGx9RZ%7EXsyirvlbOdND95c1jOqIZU5Y3Lh67bSvZcbbSbgGOjq5OvHvKbMSmXBWvPGzG3Ody8Z8Fm%7ExN9mhW6jqwi6LaQ84pH2VW-bWLLXTyrzexXmrmY2FBplfM3ZGD%7EmiTx1JHu97SbY7S8x9GYAPW16vV6w%7EVIWJ5KztnTpF%7Ea78q8tYWasUttVF9q9SbMw2DJMLq3s7MyFZlxXnO6cSM4jMN1fowTxbjbV5-O6rSpf0jp0-FtpK5DfVQWEvTAnBZ3ZOg%7Ew__&Key-Pair-Id=K3RPWS32NSSJCE"
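A minimal R sketch of this failure mode (the file content and checksum below are placeholders, not the package's actual values): when the server answers with a redirect page instead of the archive, the saved file's MD5 no longer matches, and its first line reveals the redirect.

```r
# Hypothetical illustration: a "zip" that is actually a saved redirect page.
zip_file <- tempfile(fileext = ".zip")
writeLines("Found. Redirecting to https://cdn-lfs.hf.co/...", zip_file)

expected_md5 <- "c8fa014336c82ac7804f0398fcb19387"  # placeholder checksum
actual_md5 <- unname(tools::md5sum(zip_file))

if (!identical(actual_md5, expected_md5)) {
  first_line <- readLines(zip_file, n = 1, warn = FALSE)
  if (startsWith(first_line, "Found. Redirecting")) {
    message("Got a redirect page instead of the archive: ", first_line)
  }
}
```

Inspecting the first line on checksum failure, as the `readLines(zip_file)` call in the log above did, turns a generic MD5 error into an actionable diagnosis.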

See my remarks inline.

self$resources$url,
destfile = zip_file,
mode = "wb",
method = "curl",
Contributor

todo design: the default libcurl method does not require any specific installation and follows redirections, whereas curl does not. So I would keep the default.

destfile = zip_file,
mode = "wb",
method = "curl",
extra = "--insecure"
Contributor

todo security: as mentioned in https://curl.se/docs/manpage.html#-k, `--insecure` should not be used. Please remove it.

Contributor

@cregouby left a comment

Hello @Prateek0xeo

Thanks for submitting,
Please fix the remarks, and fix the tests.

question: as the huggingface dataset already includes a train / test / val split, is there a way to make those splits available here with an option to the dataset, like in mnist_dataset()?
suggestion: My proposal would be

#' @param split (character, optional): If `train` (default), creates the training dataset. Otherwise
#'   the value should be either `test` or `val` for the test set or validation set respectively.
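A small sketch of how such a `split` argument could be validated, mirroring the `mnist_dataset()` pattern (the helper name and file names below are hypothetical, not the package's code):

```r
# Hypothetical helper: validate `split` and pick the matching index file.
resolve_split <- function(split = c("train", "test", "val")) {
  split <- match.arg(split)  # errors on anything else, defaults to "train"
  switch(split,
    train = "eurosat-train.txt",  # assumed file names, for illustration only
    test  = "eurosat-test.txt",
    val   = "eurosat-val.txt"
  )
}

resolve_split("val")
```

`match.arg()` gives the free error message for invalid values and documents the accepted choices in the function signature itself.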

expect_true(length(files) > 0, info = "Files should be downloaded in the temporary directory")

extracted_dir <- file.path(temp_root, "2750")
expect_true(dir.exists(extracted_dir), info = "Extracted data folder should exist")
Contributor

This fails for me:

>   expect_true(dir.exists(extracted_dir), info = "Extracted data folder should exist")
Error: dir.exists(extracted_dir) is not TRUE

`actual`:   FALSE
`expected`: TRUE 
Extracted data folder should exist
> head(files, 2)
[1] "/tmp/Rtmpmm3AfN/file6970544f1c340//eurosat/2750/AnnualCrop/AnnualCrop_1.jpg"  "/tmp/Rtmpmm3AfN/file6970544f1c340//eurosat/2750/AnnualCrop/AnnualCrop_10.jpg"

name = "eurosat",

resources = list(
url = "https://huggingface.co/datasets/torchgeo/eurosat/resolve/c877bcd43f099cd0196738f714544e355477f3fd/EuroSAT.zip",
Contributor

@cregouby commented Jan 18, 2025
question: That URL points to a specific commit of the dataset on the Hugging Face platform. Is that what we want, or do we want to use the latest version of this dataset? In that case the URL would be the one accessible from the front page of the dataset: url = "https://huggingface.co/datasets/torchgeo/eurosat/resolve/main/EuroSAT.zip?download=true"

Author

pytorch/vision#8563
I tried to implement the fix taking this PyTorch issue as a reference, so I copied the same URL from the code. I'll switch to the latest version of the dataset as you pointed out.

@Prateek0xeo
Author

@cregouby I have already tried the data hosted on Zenodo; it is not working for me. I'll try to improve the Hugging Face approach and make it work.

@Prateek0xeo
Author

source("C:/Coding/torchvision/torchvision/R/dataset-eurosat.R")
ds <- eurosat_dataset(download = TRUE)
trying URL 'https://huggingface.co/datasets/torchgeo/eurosat/resolve/main/EuroSAT.zip?download=true'
Content type 'application/zip' length 94280567 bytes (89.9 MB)
downloaded 89.9 MB

sample <- ds[1]
print(sample$x)
, , 1

       [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]

[1,] 0.5843137 0.5764706 0.5725490 0.5764706 0.5843137 0.5921569 0.5843137
[2,] 0.5843137 0.5843137 0.5843137 0.5803922 0.5882353 0.5921569 0.5803922

The dataloader seems to be working now. I'll push the code after a self-review.

@Prateek0xeo
Author

@cregouby In this commit the dataloader is working. I am still working on the splits functionality; until then, please review the new commit.

@Prateek0xeo
Author

Prateek0xeo commented Jan 20, 2025

Hello @cregouby
I contacted the dataset owner and received this message

Hi Prateek,

All of the datasets in https://huggingface.co/datasets/torchgeo/ are intended for use with the TorchGeo library. You can use it like so:

from torchgeo.datasets import EuroSAT

train_dataset = EuroSAT('data', split='train', download=True)
val_dataset = EuroSAT('data', split='val', download=True)
test_dataset = EuroSAT('data', split='test', download=True)
See https://torchgeo.readthedocs.io/en/latest/api/datasets.html#eurosat for documentation.

So I implemented the functionality using these URL endpoints to list the split names for this dataset:

curl -X GET "https://datasets-server.huggingface.co/splits?dataset=torchgeo%2Feurosat"
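The JSON the endpoint returns could be parsed along these lines (the response below is a hand-written stand-in, not captured output, and jsonlite is an assumed dependency):

```r
library(jsonlite)

# Hand-written stand-in for the /splits response (illustrative only).
resp <- '{"splits":[
  {"dataset":"torchgeo/eurosat","config":"default","split":"train"},
  {"dataset":"torchgeo/eurosat","config":"default","split":"test"},
  {"dataset":"torchgeo/eurosat","config":"default","split":"val"}]}'

# fromJSON() flattens the array into a data frame; pull the split names out.
splits <- fromJSON(resp)$splits$split
splits  # character vector of available split names
```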

But the val split was not working and threw an error:
(screenshot of the error)

So I have contacted the owner again about this issue: https://huggingface.co/datasets/torchgeo/eurosat/discussions/2
The new PR should work once the dataset host is fixed.
Please go through the new commit and comment on any changes.

@Prateek0xeo
Author

@cregouby there is a problem with image decoding with this dataset:
ERROR - The dataset API returned a value for row$image that is not a valid Base64-encoded string, and the decoding step failed. This issue likely arises from the structure or format of the dataset returned by the API.
Because of this I can't access the image.

@cregouby
Contributor

cregouby commented Jan 21, 2025

Hello @Prateek0xeo

1 - The new download only gets the first 100 samples, which is not what we want. We want the full train dataset, i.e. 32.5k rows.
2 - You should not replace the previous tests, but rather add the new tests in the same file.
3 - Images are JPEG files, so the decoding cannot be done with base64. You should use jpeg::, which is already imported by the package.
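Point 3 could look like the sketch below (assuming the API delivers raw JPEG bytes; the round trip through writeJPEG() just fabricates valid input for illustration):

```r
# Hypothetical sketch: decode raw JPEG bytes with jpeg::readJPEG() instead of
# attempting base64 decoding. The jpeg package is already imported by torchvision.
library(jpeg)

# Fabricate valid JPEG bytes: a tiny 2x2 RGB image encoded with writeJPEG(),
# which returns a raw vector by default.
img <- array(runif(2 * 2 * 3), dim = c(2, 2, 3))
raw_bytes <- writeJPEG(img)

decoded <- readJPEG(raw_bytes)  # numeric array in [0, 1], dim = H x W x 3
dim(decoded)
```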

I get the following error running the tests :

[ FAIL 2 | WARN 1 | SKIP 0 | PASS 3 ]

── Warning (test-dataset-eurosat.R:18:3): eurosat_dataset API handles splits correctly via API ─────────────────────────────────────────────────────────────────────────────────────────────
Skipping row 1 due to image decoding failure: I can only decode base64 strings
Backtrace:
    ▆
 1. ├─train_ds[1] at test-dataset-eurosat.R:18:3
 2. └─torch:::`[.dataset`(train_ds, 1) at test-dataset-eurosat.R:18:3
 3.   └─x$.getitem(y)
 4.     └─base::tryCatch(...) at torchvision_prateek/R/dataset-eurosat.R:82:5
 5.       └─base (local) tryCatchList(expr, classes, parentenv, handlers)
 6.         └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
 7.           └─value[[3L]](cond)

── Failure (test-dataset-eurosat.R:19:3): eurosat_dataset API handles splits correctly via API ─────────────────────────────────────────────────────────────────────────────────────────────
!is.null(sample$x) is not TRUE

`actual`:   FALSE
`expected`: TRUE 
Image should not be null

── Failure (test-dataset-eurosat.R:20:3): eurosat_dataset API handles splits correctly via API ─────────────────────────────────────────────────────────────────────────────────────────────
!is.null(sample$y) is not TRUE

`actual`:   FALSE
`expected`: TRUE 
Label should not be null
[ FAIL 2 | WARN 1 | SKIP 0 | PASS 3 ]

In order to ensure the tests run correctly on all platforms, you should activate GitHub Actions in the settings of your git repository, and then activate the R-CMD-check workflow.
(screenshot of the GitHub Actions setup)

So for each of your commits, you will see test results across all platforms. All must be green, or you have to fix the code.

@cregouby
Contributor

Hello @Prateek0xeo
There is no need to change anything in the .github/workflows/ folder, as everything is already there.
The setup I mentioned is to be done in your GitHub repository.

@Prateek0xeo
Author

Prateek0xeo commented Jan 22, 2025 via email


# Test train split
train_ds <- eurosat_dataset(root = "./data/eurosat", split = "train", download = TRUE)
expect_true(length(train_ds) > 0, info = "Train dataset should have a non-zero length")
Contributor

Suggested change
expect_true(length(train_ds) > 0, info = "Train dataset should have a non-zero length")
expect_length(train_ds, 16200)


# Test test split
test_ds <- eurosat_dataset(root = "./data/eurosat", split = "test", download = TRUE)
expect_true(length(test_ds) > 0, info = "Test dataset should have a non-zero length")
Contributor

Suggested change
expect_true(length(test_ds) > 0, info = "Test dataset should have a non-zero length")
expect_length(test_ds, 5400)


# Test validation split
validation_ds <- eurosat_dataset(root = "./data/eurosat", split = "validation", download = TRUE)
expect_true(length(validation_ds) > 0, info = "Validation dataset should have a non-zero length")
Contributor

Suggested change
expect_true(length(validation_ds) > 0, info = "Validation dataset should have a non-zero length")
expect_length(validation_ds, 5400)

Comment on lines +19 to +20
expect_true(!is.null(sample$x), info = "Image should not be null")
expect_true(!is.null(sample$y), info = "Label should not be null")
Contributor

todo: it would be much better to check the tensor shape and tensor dtype of the x and y objects
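This suggestion might look like the sketch below (the 64x64x3 shape is an assumption based on EuroSAT's RGB image size, and the stand-in item only mimics what `sample$x` / `sample$y` could hold):

```r
library(testthat)

# Stand-in for a dataset item; the real eurosat_dataset() item may differ.
sample <- list(x = array(0, dim = c(64, 64, 3)), y = 1L)

# Check shape and type instead of mere non-nullness.
expect_equal(dim(sample$x), c(64, 64, 3))
expect_type(sample$y, "integer")
```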
