Issue #578 (RecordLink.blocker and Gazetteer.blocker create a huge number of blocks) is not fixed in version 1.7.0 #587

Closed
ofershar opened this issue Jul 10, 2017 · 4 comments

Comments

@ofershar

I've installed version 1.7.0 of Dedupe and re-ran the test code for issue #578 (see link in that issue's description).
The number of blocks created by RecordLink.blocker is still huge. In fact, it seems to be even larger than before. I stopped the run when the CSV file containing the blocks reached a size of 10G.

I use Python 3.5.3 and RHEL 6.5 (but that's probably irrelevant to the problem).

@fgregg
Contributor

fgregg commented Jul 10, 2017

You need to use the target=True argument when blocking your target dataset. Then you'll need to reproduce the logic of _blockGenerator with your db: https://github.com/dedupeio/dedupe/blob/master/dedupe/api.py#L398-L420

@fgregg
Contributor

fgregg commented Jul 10, 2017

This will be resolved when we have a big record link example dedupeio/dedupe-examples#23

@fgregg fgregg closed this as completed Jul 10, 2017
@ofershar
Author

ofershar commented Jul 11, 2017 via email

@fgregg
Contributor

fgregg commented Jul 11, 2017

> Then, I need to block my "messy" dataset with target=False. But calling it "as is" would still produce a huge number of blocks (as reported in issue #578).

This is right, but you don't need to store the blocks in your database or anywhere else. As soon as you generate a block key for a messy record, you can check whether it matches any stored block key of your target records. That's what's going on in the method I linked to. You don't have to subclass the dedupe class (though you can), but you do need that type of logic.
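
A minimal sketch of that kind of logic, for illustration only. It is not the code of _blockGenerator, and it does not import dedupe. It assumes the blocker yields (block_key, record_id) pairs, so that the target pairs would come from the blocker called with target=True and the messy pairs from the blocker called on the messy records. The helper names index_target_keys and match_messy_keys, and the sample keys and ids, are hypothetical.

```python
from collections import defaultdict


def index_target_keys(target_block_keys):
    """Build a lookup from block key to the ids of target records having it.

    `target_block_keys` is assumed to be an iterable of
    (block_key, record_id) pairs for the target dataset.
    """
    index = defaultdict(set)
    for block_key, target_id in target_block_keys:
        index[block_key].add(target_id)
    return index


def match_messy_keys(messy_block_keys, target_index):
    """Stream messy block keys and yield (messy_id, target_id) candidates.

    Messy block keys are never stored: each one is looked up against the
    target index as soon as it is produced, then discarded.
    """
    for block_key, messy_id in messy_block_keys:
        # .get() avoids adding empty entries to the defaultdict.
        for target_id in target_index.get(block_key, ()):
            yield messy_id, target_id


# Hypothetical block keys standing in for real blocker output.
target_block_keys = [("smith:0", "t1"), ("chicago:1", "t1"), ("jones:0", "t2")]
messy_block_keys = [("smith:0", "m1"), ("chicago:1", "m1"), ("nomatch:0", "m2")]

target_index = index_target_keys(target_block_keys)
# A set collapses pairs that share more than one block key.
pairs = set(match_messy_keys(messy_block_keys, target_index))
```

Only the target block keys are kept, here in memory, though a database table keyed on block key would play the same role. A messy record that shares several block keys with the same target record is yielded once per shared key, so the caller has to deduplicate the pairs, as the set does above.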

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 8, 2022