Improve RecordLink blocking documentation #601

Closed
potash opened this issue Aug 3, 2017 · 3 comments

potash commented Aug 3, 2017

Can you please provide more documentation on RecordLink.blocker() and related methods?

Looking at the source, it seems that what happens is:

  1. In _blockData, data_2 gets indexed and then blocked with target=True.
  2. In _blockGenerator you block data_1 (which is confusingly referred to as messy_data even though in this context both datasets are clean) with target=False.
  3. You generate blocks containing one record from data_1 and all records of data_2 that share any of its block keys.
  4. You seem to manually assign an empty set as the covered-blocks set of every data_2 record.
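
To make the four steps above concrete, here is a minimal toy sketch of the flow as I read it. It is not dedupe's code: `block_keys`, `index_target`, `generate_blocks`, the record fields, and the `(id, record, covered_set)` tuple shape are all hypothetical names and shapes chosen for illustration.

```python
from collections import defaultdict


def block_keys(record):
    # Hypothetical predicates (an assumption, not dedupe's):
    # first three letters of the name, and the full zip code.
    return {"name:" + record["name"][:3].lower(), "zip:" + record["zip"]}


def index_target(data_2):
    """Step 1: map each block key to the ids of data_2 records that have it."""
    index = defaultdict(set)
    for id_2, rec_2 in data_2.items():
        for key in block_keys(rec_2):
            index[key].add(id_2)
    return index


def generate_blocks(data_1, data_2, index):
    """Steps 2-4: one block per data_1 record that shares a key with data_2."""
    for id_1, rec_1 in data_1.items():
        matched = set()
        for key in block_keys(rec_1):
            # .get avoids creating empty entries in the defaultdict
            matched |= index.get(key, set())
        if not matched:
            continue  # no shared block key, so no block for this record
        base = [(id_1, rec_1, set())]
        # Every data_2 record gets an empty covered set, as in step 4.
        targets = [(id_2, data_2[id_2], set()) for id_2 in sorted(matched)]
        yield base, targets
```

In this toy version each block holds exactly one data_1 record, so a data_2 record can show up in several blocks but always paired with a different data_1 record. Whether that is the reason the covered sets are left empty in the real code is exactly what question C asks.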

My concrete questions are:

A) What does the target argument to blocker() actually do?
B) Why don't you have to call self.blocker.indexAll on data_1 (a.k.a. messy_data) before blocking it, like you do with data_2?
C) Why are the covered block sets empty? Couldn't a data_2 record have appeared in an earlier block? I'm going by the matchBlocks() documentation for my understanding of the covered blocks set.
D) Gazetteer seems to inherit all of this. Can you point me to where the Gazetteer logic differs from RecordLink to allow multiple records from messy_data to match one record from data_2?

Thanks!

potash commented Aug 3, 2017

And a higher-level question:

E) Are the results of RecordLink invariant under swapping data_2 and data_1? It seems like in principle they should be, but I wonder whether their different roles in blocking affect that.
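
One way to probe this is with a toy model. The sketch below is not dedupe's code, and `block_keys`, `candidate_pairs`, and `same_pairs_when_swapped` are hypothetical names. It indexes one dataset, probes it with the other, and checks whether the unordered candidate pairs change when the roles are swapped. With simple key functions applied identically to both sides they do not. It does not model index-based predicates or the later scoring and one-to-one matching steps, which is where an asymmetry could plausibly come from.

```python
from collections import defaultdict


def block_keys(record):
    # Hypothetical predicates (an assumption, not dedupe's).
    return {"name:" + record["name"][:3].lower(), "zip:" + record["zip"]}


def candidate_pairs(probe, indexed):
    """Return (probe_id, indexed_id) pairs that share at least one block key."""
    index = defaultdict(set)
    for rid, rec in indexed.items():
        for key in block_keys(rec):
            index[key].add(rid)
    pairs = set()
    for pid, rec in probe.items():
        for key in block_keys(rec):
            for rid in index.get(key, ()):
                pairs.add((pid, rid))
    return pairs


def same_pairs_when_swapped(data_1, data_2):
    """Compare candidate pairs with the datasets' blocking roles exchanged."""
    forward = candidate_pairs(data_1, data_2)
    # Flip the swapped result back to (data_1 id, data_2 id) order.
    backward = {(a, b) for (b, a) in candidate_pairs(data_2, data_1)}
    return forward == backward
```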

fgregg commented Aug 3, 2017

It sounds like you are asking for documentation on the internal implementation of methods and classes.

While we do want to document the public API, these internal details are not something we want to expose and document.

If there's some particular problem that you are having please open up a separate issue, and I can try to point you to the relevant part of the code.

fgregg closed this as completed Aug 3, 2017

potash commented Aug 3, 2017

OK, I opened #602 with the pieces that pertain to the public API.

FYI, the reason I want to understand the private methods better is that I am writing an SQL version of RecordLink (cf. dedupeio/dedupe-examples#23).

github-actions bot locked as resolved and limited conversation to collaborators Feb 8, 2022