Eurosat #122

Open
wants to merge 11 commits into base: main
Conversation

Prateek0xeo

#104
md5 mismatch issue solved

@cregouby
Contributor

Hello @Prateek0xeo,

Thanks a lot for this contribution!

I have a few comments on it:

  • We do not want to host archive files in the code repository, as it would make the package too heavy, up to the point it would not pass the CRAN check. Could you please remove those zip and tar.gz files from the pull request?
  • The check fails with issues that should be fixed:
==> devtools::check()

══ Documenting ══════════════════════════════════════════════════════════
ℹ Updating torchvision documentation
ℹ Loading torchvision
Warning: Objects listed as exports, but not present in namespace:
• inat2021_dataset
✖ dataset-eurosat.R:28: @export must be a single line, not 5.
ℹ The first line is "Download the EuroSAT dataset"
✖ transforms-magick.R:56: S3 method `get_image_size.magick-image` needs
  @export or @exportS3method tag.
Writing NAMESPACE
...
── R CMD check results ────────────────────── torchvision 0.6.0.9000 ────
Duration: 47.6s

❯ checking Rd \usage sections ... WARNING
  Documented arguments not in \usage in Rd file 'eurosat_dataset.Rd':
    ‘i’
  
  Functions with \usage entries need to have the appropriate \alias
  entries, and all their arguments documented.
  The \usage entries must correspond to syntactically valid R code.
  See chapter ‘Writing R documentation files’ in the ‘Writing R
  Extensions’ manual.

Could you please fix those two as well ?

Thanks a lot in advance.

@Prateek0xeo
Author

Sure! Thank you for the feedback. I'll improve the PR and make sure that all my other pull requests reflect the same changes.

@cregouby
Contributor

When checking ds <- eurosat_dataset(download = TRUE), I get an SSL error from the server certificate:

utils::download.file(self$resources$url, destfile = zip_file,  :
  URL 'https://madm.dfki.de/files/sentinel/EuroSAT.zip': status 'SSL peer certificate or SSH remote key was not OK'

Thus libcurl does not download the zip file and the dataloader fails.

The error is confirmed by online SSL checkers:
(screenshot of the SSL checker result)

@cregouby
Contributor

I've contacted the website owner for a fix / workaround.

@Prateek0xeo
Author

@cregouby I have implemented all the changes you mentioned.
I have now removed the data files from the PR, improved the documentation, and ran the check command, which gave no errors.
Please check the PR and tell me if there is anything else left, so that I can move forward with the rest of the dataset loaders.

@Prateek0xeo
Author

Prateek0xeo commented Jan 17, 2025

Hello @cregouby,

I referred to the PyTorch repository and tried implementing the new URL and MD5 checksum provided in the code, but it still didn't resolve the issue. While the SSL certificate problem was fixed there, the dataset still doesn't load as expected.

I also took inspiration from the spam-loader review by @dfalbel and applied similar improvements to this PR.

@cregouby
Contributor

Hello @Prateek0xeo

I've received an answer from the website administrator:

Dear Christophe,
First of all, thank you for reaching out and for your interest in our dataset.
Unfortunately, this is not the first time this issue has occurred, so I decided to host the dataset on Zenodo as well.
For a permanent fix, I would recommend using the version hosted on Zenodo (https://zenodo.org/records/7711810).
If you have any questions or need further information, please feel free to contact me.
Kind regards,

I can see in the meantime that you switched to the Hugging Face dataset, which fails in my case with a wrong MD5:

> ds <- eurosat_dataset(download = TRUE)
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1173  100  1173    0     0   1123      0  0:00:01  0:00:01 --:--:--  1124
Error in self$download() : 
  Downloaded file has an invalid MD5 checksum. Please try again.
Called from: self$download()

We can find the root cause of the download failure in the zip file:

Browse[1]> readLines(zip_file)
[1] "Found. Redirecting to https://cdn-lfs.hf.co/repos/fc/1d/fc1dee780dee1dae2ad48856d0961ac6aa5dfcaaaa4fb3561be4aedf19b7ccc7/8ebea626349354c5328b142b96d0430e647051f26efc2dc974c843f25ecf70bd?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27EuroSAT.zip%3B+filename%3D%22EuroSAT.zip%22%3B&response-content-type=application%2Fzip&Expires=1737474034&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTczNzQ3NDAzNH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5oZi5jby9yZXBvcy9mYy8xZC9mYzFkZWU3ODBkZWUxZGFlMmFkNDg4NTZkMDk2MWFjNmFhNWRmY2FhYWE0ZmIzNTYxYmU0YWVkZjE5YjdjY2M3LzhlYmVhNjI2MzQ5MzU0YzUzMjhiMTQyYjk2ZDA0MzBlNjQ3MDUxZjI2ZWZjMmRjOTc0Yzg0M2YyNWVjZjcwYmQ%7EcmVzcG9uc2UtY29udGVudC1kaXNwb3NpdGlvbj0qJnJlc3BvbnNlLWNvbnRlbnQtdHlwZT0qIn1dfQ__&Signature=buJTsafUfhVxwvReQoUSJcNl3VwU74g1YghMxdzBwM6G25WA3-eyL3M6edcgNDBDCEBGjaoJ026dpGx9RZ%7EXsyirvlbOdND95c1jOqIZU5Y3Lh67bSvZcbbSbgGOjq5OvHvKbMSmXBWvPGzG3Ody8Z8Fm%7ExN9mhW6jqwi6LaQ84pH2VW-bWLLXTyrzexXmrmY2FBplfM3ZGD%7EmiTx1JHu97SbY7S8x9GYAPW16vV6w%7EVIWJ5KztnTpF%7Ea78q8tYWasUttVF9q9SbMw2DJMLq3s7MyFZlxXnO6cSM4jMN1fowTxbjbV5-O6rSpf0jp0-FtpK5DfVQWEvTAnBZ3ZOg%7Ew__&Key-Pair-Id=K3RPWS32NSSJCE"
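A minimal R sketch of this failure mode (the file content and checksum below are placeholders, not the package's actual values): when the server answers with a redirect page instead of the archive, the saved file's MD5 no longer matches, and its first line reveals the redirect.

```r
# Hypothetical illustration: a "zip" that is actually a saved redirect page.
zip_file <- tempfile(fileext = ".zip")
writeLines("Found. Redirecting to https://cdn-lfs.hf.co/...", zip_file)

expected_md5 <- "c8fa014336c82ac7804f0398fcb19387"  # placeholder checksum
actual_md5 <- unname(tools::md5sum(zip_file))

if (!identical(actual_md5, expected_md5)) {
  first_line <- readLines(zip_file, n = 1, warn = FALSE)
  if (startsWith(first_line, "Found. Redirecting")) {
    message("Got a redirect page instead of the archive: ", first_line)
  }
}
```

Inspecting the first line on checksum failure, as the `readLines(zip_file)` call in the log above did, turns a generic MD5 error into an actionable diagnosis.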

See my remarks inline.

self$resources$url,
destfile = zip_file,
mode = "wb",
method = "curl",
Contributor

todo design: the default libcurl method does not require any specific installation and follows redirections, whereas curl does not. So I would keep the default.

destfile = zip_file,
mode = "wb",
method = "curl",
extra = "--insecure"
Contributor

todo security: as mentioned in https://curl.se/docs/manpage.html#-k, `--insecure` should not be used. Please remove it.

Contributor

@cregouby left a comment

Hello @Prateek0xeo

Thanks for submitting,
Please fix the remarks, and fix the tests.

question: as the huggingface dataset already includes a train / test / val split, is there a way to make those splits available here with an option to the dataset, like in mnist_dataset()?
suggestion: My proposal would be

#' @param split (character, optional): If `train` (default), creates the training dataset. Otherwise
#'   the value should be either `test` or `val` for the test set or validation set respectively.
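A small sketch of how such a `split` argument could be validated, mirroring the `mnist_dataset()` pattern (the helper name and file names below are hypothetical, not the package's code):

```r
# Hypothetical helper: validate `split` and pick the matching index file.
resolve_split <- function(split = c("train", "test", "val")) {
  split <- match.arg(split)  # errors on anything else, defaults to "train"
  switch(split,
    train = "eurosat-train.txt",  # assumed file names, for illustration only
    test  = "eurosat-test.txt",
    val   = "eurosat-val.txt"
  )
}

resolve_split("val")
```

`match.arg()` gives the free error message for invalid values and documents the accepted choices in the function signature itself.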

expect_true(length(files) > 0, info = "Files should be downloaded in the temporary directory")

extracted_dir <- file.path(temp_root, "2750")
expect_true(dir.exists(extracted_dir), info = "Extracted data folder should exist")
Contributor

This fails for me:

>   expect_true(dir.exists(extracted_dir), info = "Extracted data folder should exist")
Error: dir.exists(extracted_dir) is not TRUE

`actual`:   FALSE
`expected`: TRUE 
Extracted data folder should exist
> head(files, 2)
[1] "/tmp/Rtmpmm3AfN/file6970544f1c340//eurosat/2750/AnnualCrop/AnnualCrop_1.jpg"  "/tmp/Rtmpmm3AfN/file6970544f1c340//eurosat/2750/AnnualCrop/AnnualCrop_10.jpg"

name = "eurosat",

resources = list(
url = "https://huggingface.co/datasets/torchgeo/eurosat/resolve/c877bcd43f099cd0196738f714544e355477f3fd/EuroSAT.zip",
Contributor

@cregouby commented Jan 18, 2025
question: That URL points to a specific commit of the dataset on the Hugging Face platform. Is that what we want, or do we want to use the latest version of this dataset? In that case the URL would be the one accessible from the front page of the dataset: url = "https://huggingface.co/datasets/torchgeo/eurosat/resolve/main/EuroSAT.zip?download=true"

Author

pytorch/vision#8563
I tried to implement the fix taking this PyTorch issue as a reference, so I copied the same URL from the code. I'll switch to the latest version of the dataset as you pointed out.

@Prateek0xeo
Author

@cregouby I have already tried the data hosted on Zenodo; it is not working for me. I'll try to improve the Hugging Face approach and make it work.

@Prateek0xeo
Author

source("C:/Coding/torchvision/torchvision/R/dataset-eurosat.R")
ds <- eurosat_dataset(download = TRUE)
trying URL 'https://huggingface.co/datasets/torchgeo/eurosat/resolve/main/EuroSAT.zip?download=true'
Content type 'application/zip' length 94280567 bytes (89.9 MB)
downloaded 89.9 MB

sample <- ds[1]
print(sample$x)
, , 1

       [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]

[1,] 0.5843137 0.5764706 0.5725490 0.5764706 0.5843137 0.5921569 0.5843137
[2,] 0.5843137 0.5843137 0.5843137 0.5803922 0.5882353 0.5921569 0.5803922

The dataloader seems to be working now. I'll push the code after a self-review.

@Prateek0xeo
Author

@cregouby In this commit the dataloader is working. I am still working on the splits functionality; until then, please review the new commit.

@Prateek0xeo
Author

Prateek0xeo commented Jan 20, 2025

Hello @cregouby
I contacted the dataset owner and received this message

Hi Prateek,

All of the datasets in https://huggingface.co/datasets/torchgeo/ are intended for use with the TorchGeo library. You can use it like so:

from torchgeo.datasets import EuroSAT

train_dataset = EuroSAT('data', split='train', download=True)
val_dataset = EuroSAT('data', split='val', download=True)
test_dataset = EuroSAT('data', split='test', download=True)
See https://torchgeo.readthedocs.io/en/latest/api/datasets.html#eurosat for documentation.

So I implemented the functionality using these URL endpoints to list the split names for this dataset:

curl -X GET "https://datasets-server.huggingface.co/splits?dataset=torchgeo%2Feurosat"
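The JSON the endpoint returns could be parsed along these lines (the response below is a hand-written stand-in, not captured output, and jsonlite is an assumed dependency):

```r
library(jsonlite)

# Hand-written stand-in for the /splits response (illustrative only).
resp <- '{"splits":[
  {"dataset":"torchgeo/eurosat","config":"default","split":"train"},
  {"dataset":"torchgeo/eurosat","config":"default","split":"test"},
  {"dataset":"torchgeo/eurosat","config":"default","split":"val"}]}'

# fromJSON() flattens the array into a data frame; pull the split names out.
splits <- fromJSON(resp)$splits$split
splits  # character vector of available split names
```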

But the val split was not working and threw an error:
(screenshot of the error)

So I have contacted the owner again about this issue: https://huggingface.co/datasets/torchgeo/eurosat/discussions/2
The new PR should work once the dataset host is fixed.
Please go through the new commit and comment on any changes.

@Prateek0xeo
Author

@cregouby there is a problem with image decoding with this dataset:
ERROR - The dataset API returned a value for row$image that is not a valid Base64-encoded string, and the decoding step failed. This issue likely arises from the structure or format of the dataset returned by the API.
Because of this I can't access the image.

@cregouby
Contributor

cregouby commented Jan 21, 2025

Hello @Prateek0xeo

1 - The new download only gets the first 100 samples, which is not what we want. We want the full train dataset, i.e. 32.5k rows.
2 - You should not replace the previous tests, but rather add the new tests in the same file.
3 - Images are JPEG files, so the decoding cannot be done with base64. You should use jpeg::, which is already imported by the package.
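Point 3 could look like the sketch below (assuming the API delivers raw JPEG bytes; the round trip through writeJPEG() just fabricates valid input for illustration):

```r
# Hypothetical sketch: decode raw JPEG bytes with jpeg::readJPEG() instead of
# attempting base64 decoding. The jpeg package is already imported by torchvision.
library(jpeg)

# Fabricate valid JPEG bytes: a tiny 2x2 RGB image encoded with writeJPEG(),
# which returns a raw vector by default.
img <- array(runif(2 * 2 * 3), dim = c(2, 2, 3))
raw_bytes <- writeJPEG(img)

decoded <- readJPEG(raw_bytes)  # numeric array in [0, 1], dim = H x W x 3
dim(decoded)
```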

I get the following error running the tests :

[ FAIL 2 | WARN 1 | SKIP 0 | PASS 3 ]

── Warning (test-dataset-eurosat.R:18:3): eurosat_dataset API handles splits correctly via API ─────────────────────────────────────────────────────────────────────────────────────────────
Skipping row 1 due to image decoding failure: I can only decode base64 strings
Backtrace:
    ▆
 1. ├─train_ds[1] at test-dataset-eurosat.R:18:3
 2. └─torch:::`[.dataset`(train_ds, 1) at test-dataset-eurosat.R:18:3
 3.   └─x$.getitem(y)
 4.     └─base::tryCatch(...) at torchvision_prateek/R/dataset-eurosat.R:82:5
 5.       └─base (local) tryCatchList(expr, classes, parentenv, handlers)
 6.         └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
 7.           └─value[[3L]](cond)

── Failure (test-dataset-eurosat.R:19:3): eurosat_dataset API handles splits correctly via API ─────────────────────────────────────────────────────────────────────────────────────────────
!is.null(sample$x) is not TRUE

`actual`:   FALSE
`expected`: TRUE 
Image should not be null

── Failure (test-dataset-eurosat.R:20:3): eurosat_dataset API handles splits correctly via API ─────────────────────────────────────────────────────────────────────────────────────────────
!is.null(sample$y) is not TRUE

`actual`:   FALSE
`expected`: TRUE 
Label should not be null
[ FAIL 2 | WARN 1 | SKIP 0 | PASS 3 ]

In order to ensure the tests run correctly on all platforms, you should activate GitHub Actions in the settings of your git repository, and then activate the R-CMD-check workflow.
(screenshot of the GitHub Actions setup)

So for each of your commits, you will see test results across all platforms. All must be green, or you have to fix the code.

@cregouby
Contributor

Hello @Prateek0xeo
There is no need to change anything in the .github/workflows/ folder, as everything is already there.
The setup I mentioned is to be done in your GitHub repository.

@Prateek0xeo
Author

Prateek0xeo commented Jan 22, 2025 via email


# Test train split
train_ds <- eurosat_dataset(root = "./data/eurosat", split = "train", download = TRUE)
expect_true(length(train_ds) > 0, info = "Train dataset should have a non-zero length")
Contributor

Suggested change
expect_true(length(train_ds) > 0, info = "Train dataset should have a non-zero length")
expect_length(train_ds, 16200)


# Test test split
test_ds <- eurosat_dataset(root = "./data/eurosat", split = "test", download = TRUE)
expect_true(length(test_ds) > 0, info = "Test dataset should have a non-zero length")
Contributor

Suggested change
expect_true(length(test_ds) > 0, info = "Test dataset should have a non-zero length")
expect_length(test_ds, 5400)


# Test validation split
validation_ds <- eurosat_dataset(root = "./data/eurosat", split = "validation", download = TRUE)
expect_true(length(validation_ds) > 0, info = "Validation dataset should have a non-zero length")
Contributor

Suggested change
expect_true(length(validation_ds) > 0, info = "Validation dataset should have a non-zero length")
expect_length(validation_ds, 5400)

Comment on lines +19 to +20
expect_true(!is.null(sample$x), info = "Image should not be null")
expect_true(!is.null(sample$y), info = "Label should not be null")
Contributor

todo: it would be much better to check the tensor shape and tensor dtype of the x and y objects
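This suggestion might look like the sketch below (the 64x64x3 shape is an assumption based on EuroSAT's RGB image size, and the stand-in item only mimics what `sample$x` / `sample$y` could hold):

```r
library(testthat)

# Stand-in for a dataset item; the real eurosat_dataset() item may differ.
sample <- list(x = array(0, dim = c(64, 64, 3)), y = 1L)

# Check shape and type instead of mere non-nullness.
expect_equal(dim(sample$x), c(64, 64, 3))
expect_type(sample$y, "integer")
```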
