Issue #578 (RecordLink.blocker and Gazetteer.blocker create a huge number of blocks) is not fixed in version 1.7.0 #587
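You need to use the target=True argument for blocking your target dataset. Then you'll need to reproduce the logic of _blockGenerator with your db https://github.com/dedupeio/dedupe/blob/master/dedupe/api.py#L398-L420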
This will be resolved when we have a big record link example: dedupeio/dedupe-examples#23
Hi Forest,
Thanks for your reply (and for sending it so quickly).
I just wanted to make sure I understand correctly what you meant. Here is how I understand it:
Using the Gazetteer class, I first need to block my target dataset with the target=True argument. (I already tested this, and it seems to be working fine.)
Then, I need to block my "messy" dataset with target=False. But calling it "as is" would still produce a huge number of blocks (as reported in issue #578).
To prevent this, I need to create a derived class of Gazetteer, similar to DatabaseGazetteer in https://github.com/dedupeio/address-matching/blob/sqlclass/address_matching.py , with my own implementation of _blockRecords that would access my DB (similar to the code of _blockData in the above link). This should make the blocking work properly even with target=False.
Is that right?
Or maybe I should be using the same instance of Gazetteer / DatabaseGazetteer for blocking both the target and the messy data? If that is the case, wouldn't the overridden implementation of _blockRecords interfere with the proper blocking of the target dataset?
Thanks,
Ofer
From: Forest Gregg
Sent: Monday, July 10, 2017 3:34 PM
You need to use the target=True argument for blocking your target dataset. Then you'll need to reproduce the logic of _blockGenerator with your db https://github.com/dedupeio/dedupe/blob/master/dedupe/api.py#L398-L420
This is right, but you don't need to store the messy blocks in your database or anywhere else. As soon as you generate a block key for a messy record, you can check whether it matches any stored block key of your target records. That's what's going on in the method I linked to. You don't have to subclass the dedupe class (though you can), but you do need that type of logic.
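For anyone following along, here is a minimal sketch of that lookup in plain Python. It is not dedupe's actual _blockGenerator code. It assumes the blocker output can be consumed as an iterable of (block_key, record_id) pairs, and the helper names index_target_keys and candidate_pairs, the sample keys, and the record ids are all hypothetical. With a real database, the in-memory index would be replaced by a table of target block keys queried per messy key.

```python
from collections import defaultdict


def index_target_keys(target_pairs):
    """Map each block key to the set of target record ids that produced it.

    `target_pairs` is assumed to be an iterable of (block_key, record_id).
    """
    index = defaultdict(set)
    for block_key, record_id in target_pairs:
        index[block_key].add(record_id)
    return index


def candidate_pairs(messy_pairs, target_index):
    """Yield each (messy_id, target_id) pair that shares a block key, once.

    Messy block keys are checked against the target index as they stream
    in, so the messy blocks themselves are never written out. The `seen`
    set still grows with the number of candidate pairs.
    """
    seen = set()
    for block_key, messy_id in messy_pairs:
        # .get avoids adding unmatched messy keys to the defaultdict
        for target_id in target_index.get(block_key, ()):
            pair = (messy_id, target_id)
            if pair not in seen:
                seen.add(pair)
                yield pair


# Hypothetical block keys standing in for blocker output.
target_pairs = [("smith:0", "t1"), ("chicago:1", "t1"), ("smith:0", "t2")]
messy_pairs = [("smith:0", "m1"), ("chicago:1", "m1"), ("boston:1", "m2")]

index = index_target_keys(target_pairs)
pairs = list(candidate_pairs(messy_pairs, index))
```

Only messy keys that already exist on the target side produce output, so a messy key with no target match (like "boston:1" here) costs one lookup and nothing more.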
I've installed version 1.7.0 of Dedupe and re-ran the test code for issue #578 (see link in that issue's description).
The number of blocks created by RecordLink.blocker is still huge. In fact, it seems to be even larger than before. I stopped the run when the CSV file containing the blocks reached 10 GB.
I use Python 3.5.3 and RHEL 6.5 (but that's probably irrelevant to the problem).