Can you please provide more documentation on RecordLink.blocker() and related methods?
Looking at the source, it seems that what happens is:
In _blockData, data_2 gets indexed and then blocked with target=True.
In _blockGenerator you block data_1 (which is confusingly referred to as messy_data even though in this context both datasets are clean) with target=False.
You generate blocks containing one record from data_1 and all records of data_2 that share any of its block keys.
You seem to manually assign an empty set as the covered blocks set for every data_2 record.
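To make sure I'm reading the steps above correctly, here is a toy sketch of what I think is going on. This is not the dedupe code: `block_keys` and `toy_block_generator` are hypothetical names I made up, and the predicate (first three characters of a `name` field) is only an assumption for illustration.

```python
from collections import defaultdict


def block_keys(record):
    # Hypothetical predicate: first three characters of the name, lowercased.
    return {record["name"][:3].lower()}


def toy_block_generator(data_1, data_2):
    # Step 1: index every data_2 record under each of its block keys.
    index = defaultdict(set)
    for id_2, record_2 in data_2.items():
        for key in block_keys(record_2):
            index[key].add(id_2)

    # Step 2: for each data_1 record, yield one block holding that record
    # plus every data_2 record that shares at least one block key.
    for id_1, record_1 in data_1.items():
        matches = set()
        for key in block_keys(record_1):
            matches |= index.get(key, set())
        if matches:
            # Each entry is (id, record, covered_blocks). The covered_blocks
            # set is always empty here, mirroring what I think the source does.
            block = [(id_1, record_1, set())]
            block += [(id_2, data_2[id_2], set()) for id_2 in sorted(matches)]
            yield block


data_1 = {"a": {"name": "Alice"}, "b": {"name": "Bob"}}
data_2 = {"x": {"name": "Alicia"}, "y": {"name": "Alina"}, "z": {"name": "Carl"}}
blocks = list(toy_block_generator(data_1, data_2))
```

In this sketch only data_2 needs an index, because each data_1 record is looked up against it one at a time. If that matches the real behaviour, it may already answer part of question B below.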
My concrete questions are:
A) what does the target argument to blocker() actually do?
B) why don't you have to do self.blocker.indexAll on data_1 a.k.a. messy_data before blocking it, like you do with data_2?
C) why are the covered block sets empty? Couldn't a data_2 record have appeared in an earlier block? I'm relying on the matchBlocks() documentation for my understanding of the covered blocks set.
D) Gazetteer seems to inherit all of this. Can you point me to where the Gazetteer logic differs from RecordLink to allow for multiple records from messy_data to match one record from data_2?
Thanks!
E) Are the results of RecordLink invariant under swapping data_2 and data_1? It seems like in principle they should be, but I wonder whether their different roles in blocking affect that.