"Big" RecordLink example #23
Comments
Would that example work if the canonical database is very large? The example just reads the entire canonical file into memory when indexing. Would a more memory-efficient solution involve creating an inverted index via server-side queries, as in the MySQL example?
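To make the question concrete, here is a minimal sketch of what a server-side inverted index could look like. It is not taken from the dedupe examples: it uses the standard-library `sqlite3` module in place of MySQL, and the `blocking_keys` function, the table names, and the sample rows are all hypothetical stand-ins for whatever blocking rules a trained model would actually produce. The point is only that candidate pairs come out of a join in the database, so neither file has to be held in Python memory.

```python
import sqlite3


def blocking_keys(name):
    """Hypothetical blocking rules; stand-ins for learned predicates."""
    name = name.lower().strip()
    if not name:
        return set()
    # Two toy rules: first three characters, and the last whitespace token.
    return {"prefix:" + name[:3], "last:" + name.split()[-1]}


def build_index(conn, table, rows):
    """Store records and their blocking keys in `table` and `table`_blocks."""
    conn.execute(f"CREATE TABLE {table} (record_id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(f"CREATE TABLE {table}_blocks (block_key TEXT, record_id INTEGER)")
    for record_id, name in rows:  # rows may be a generator streaming from disk
        conn.execute(f"INSERT INTO {table} VALUES (?, ?)", (record_id, name))
        conn.executemany(
            f"INSERT INTO {table}_blocks VALUES (?, ?)",
            [(key, record_id) for key in blocking_keys(name)],
        )
    # The index on block_key is what makes this an inverted index lookup.
    conn.execute(f"CREATE INDEX {table}_blocks_key ON {table}_blocks (block_key)")


def candidate_pairs(conn):
    """Yield (messy_id, canonical_id) pairs that share at least one block key."""
    query = """
        SELECT DISTINCT m.record_id, c.record_id
        FROM messy_blocks AS m
        JOIN canonical_blocks AS c USING (block_key)
        ORDER BY m.record_id, c.record_id
    """
    yield from conn.execute(query)


# Made-up sample data, for illustration only.
conn = sqlite3.connect(":memory:")
build_index(conn, "canonical", [(1, "Ada Lovelace"), (2, "Alan Turing"), (3, "Grace Hopper")])
build_index(conn, "messy", [(10, "ada lovelace"), (11, "A. Turing"), (12, "Edsger Dijkstra")])
pairs = list(candidate_pairs(conn))
```

Only the pairs returned by the join would then need to be fetched and scored, which is where the memory saving would come from.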
Hi! I started looking for data sets to produce a big record link example. Does dedupeio's RecordLink require training data with examples of matches? From the small record link example it appears not, since the data sets have no column for the target variable, and I see that the program may ask the user to verify matches while running. I've been looking into matching pre-prints with records of journal publications. That would have been really useful to me early in grad school, when it was hard to tell whether a pre-print on arxiv.org (the most popular math repository) had been published or not (perhaps under a different title, with different collaborators, etc.).
Here's a gist of what this could look like, if someone wants to take it and make it into a full example: https://gist.github.com/fgregg/e45280fa32a9eee8daab65a95f385656
Are there any examples of using RecordLink on larger data sets that do not fit into memory -- something similar to the MySQL or PostgreSQL big deduplication examples?