Diff based culling #3

Open
maxnth opened this issue Jul 19, 2021 · 0 comments
Comments

Collaborator

maxnth commented Jul 19, 2021

Use case

When correcting ground truth in large datasets, it is often useful to check the diff between a very good prediction and the ground truth in LAREX and to correct the ground truth where necessary. Culling all files that show no diff between prediction and ground truth from the correction dataset makes this a lot easier.
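The culling step itself can be sketched in a few lines of Python. This is only an illustration of the idea, not existing code: `has_diff` and `cull` are hypothetical names, and the sketch assumes the text lines of prediction and ground truth have already been extracted per file.

```python
def has_diff(pred_lines, gt_lines):
    """Return True if prediction and ground truth differ in any line."""
    return list(pred_lines) != list(gt_lines)


def cull(pages):
    """Keep only the files whose prediction differs from the ground truth.

    `pages` maps a file name to a (prediction_lines, ground_truth_lines)
    pair; this input shape is an assumption made for the sketch.
    """
    return {name: pair for name, pair in pages.items() if has_diff(*pair)}
```

Only the files returned by `cull` would then need to be opened in LAREX for manual checking.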

Implementation

The CLI should accept:

  • a list of PAGE XML files and two indices (for TextEquiv/@index) that denote the prediction and the ground truth
  • alternatively, two lists of files with one index each, in case GT and prediction are stored in separate XML files
  • whether to apply Unicode normalization / regularization
  • an output directory
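A minimal sketch of such a CLI, using only the Python standard library (3.8+ for the `{*}` namespace wildcard). All flag names (`--files`, `--pred-files`, `--gt-files`, `--pred-index`, `--gt-index`, `--normalize`, `--output`) are hypothetical. The sketch also assumes that the two file lists are paired by position, that lines are matched by TextLine/@id, and that the file holding the ground truth is the one copied to the output directory. Regularization is left out; only Unicode normalization is shown.

```python
import argparse
import shutil
import unicodedata
import xml.etree.ElementTree as ET
from pathlib import Path


def read_lines(path, index, form=None):
    """Map TextLine/@id to the text of the TextEquiv with the given @index."""
    lines = {}
    for line in ET.parse(path).getroot().findall(".//{*}TextLine"):
        # Only direct children, so word-level TextEquivs are ignored.
        for equiv in line.findall("{*}TextEquiv"):
            if equiv.get("index") == str(index):
                text = equiv.findtext("{*}Unicode") or ""
                if form:
                    text = unicodedata.normalize(form, text)
                lines[line.get("id")] = text
    return lines


def build_parser():
    # Hypothetical flags; the issue does not prescribe any names.
    p = argparse.ArgumentParser(description="Cull PAGE XML files without a diff")
    p.add_argument("--files", nargs="+", type=Path,
                   help="files that contain both prediction and ground truth")
    p.add_argument("--pred-files", nargs="+", type=Path)
    p.add_argument("--gt-files", nargs="+", type=Path)
    p.add_argument("--pred-index", type=int, required=True)
    p.add_argument("--gt-index", type=int, required=True)
    p.add_argument("--normalize", choices=["NFC", "NFD", "NFKC", "NFKD"])
    p.add_argument("--output", type=Path, required=True)
    return p


def main(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.files:
        pairs = [(f, f) for f in args.files]
    elif (args.pred_files and args.gt_files
          and len(args.pred_files) == len(args.gt_files)):
        # Assumption: the two lists are paired by position.
        pairs = list(zip(args.pred_files, args.gt_files))
    else:
        parser.error("pass --files, or --pred-files and --gt-files of equal length")

    args.output.mkdir(parents=True, exist_ok=True)
    kept = []
    for pred_path, gt_path in pairs:
        pred = read_lines(pred_path, args.pred_index, args.normalize)
        gt = read_lines(gt_path, args.gt_index, args.normalize)
        if pred != gt:
            # Keep only files that still need manual correction.
            shutil.copy(gt_path, args.output / gt_path.name)
            kept.append(gt_path.name)
    return kept
```

For example, `main(["--files", "0001.xml", "0002.xml", "--pred-index", "1", "--gt-index", "0", "--normalize", "NFC", "--output", "to_correct"])` would copy only those files whose prediction and ground truth still differ after NFC normalization.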