db: extract entity names into separate table #152

aspiers · 2017-02-07T23:13:18Z

Description

As agreed at the 2017/2/7 meeting, we should move entity names into a separate database table which has a many:one relationship with the entity model (many names per entity).

Currently each entity's name is stored as a field within the entity's record in the relevant database table. This does not support tracking the multiple name variations / aliases an entity might have. For example, the ex-PM might appear as any of:

- David Cameron
- Mr David Cameron
- Mr. David Cameron
- Mr D Cameron
- Mr. D Cameron
- The Right Honourable David Cameron MP
- Right Honourable David Cameron MP

and lots more.
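
To make the many:one relationship concrete, here is a minimal sketch using Python's built-in `sqlite3` purely for illustration. The table names `entities` and `entity_names`, the column names, and the helpers `create_schema`, `add_entity` and `names_for` are all assumptions, not the project's actual schema or API.

```python
import sqlite3


def create_schema(conn):
    """Create a hypothetical entities / entity_names pair of tables."""
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE entities (
            id INTEGER PRIMARY KEY
        );
        CREATE TABLE entity_names (
            id        INTEGER PRIMARY KEY,
            entity_id INTEGER NOT NULL REFERENCES entities(id),
            name      TEXT NOT NULL,
            UNIQUE (entity_id, name)  -- assumption: no duplicate alias per entity
        );
        """
    )


def add_entity(conn, names):
    """Insert one entity plus all of its known name variations."""
    entity_id = conn.execute("INSERT INTO entities DEFAULT VALUES").lastrowid
    conn.executemany(
        "INSERT INTO entity_names (entity_id, name) VALUES (?, ?)",
        [(entity_id, name) for name in names],
    )
    return entity_id


def names_for(conn, entity_id):
    """Return every recorded name for an entity, in insertion order."""
    rows = conn.execute(
        "SELECT name FROM entity_names WHERE entity_id = ? ORDER BY id",
        (entity_id,),
    )
    return [row[0] for row in rows]
```

In this sketch the entity record no longer carries a name field at all; whether the real migration keeps a "primary" name on the entity is left open here.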

Comments, Questions and Considerations

This should be a fairly simple schema change and db migration. It would also require tweaking the API calls so that they include the relevant entity names.
There are several reasons why we need to be able to track all of these name variations in the backend as corresponding to the same entity:
- It means we only need to deduplicate each variation / alias once, since any further imports of entities can be cross-checked against the whole list. This makes the whole data cleansing process far more efficient (e.g. volunteers wouldn't need to manually deduplicate the same entity name pair over and over again).
- It means that we don't have to worry as much about our automated deduplication heuristics being 100% consistent over time. For example, if we make them more sophisticated but in the process accidentally break their ability to spot that "Mr. D Cameron" and "Mr D Cameron" actually refer to the same entity, it won't matter quite so much, because the database can still deduplicate based on the variations / aliases it already knows about.
- It provides a way to collect data about variations / aliases which can then be used to perform automated regression testing against our automated deduplication heuristics. This is important because even though the regression above would not break the overall system's ability to spot that "Mr. D Cameron" and "Mr D Cameron" actually refer to the same entity (since they are already equated in the database), it might break its ability to spot that "Mr. J Corbyn" and "Mr J Corbyn" refer to the same entity (assuming that these are not yet equated in the database).
- It could potentially allow us to track the frequency of usage of each alias, which might be helpful data for improving the automated deduplication heuristics.
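
The regression-testing idea above could be sketched roughly as follows. The `normalize` and `same_entity` functions are a stand-in heuristic invented for illustration (not the project's real deduplication logic), and `alias_groups` stands for alias lists that would be read from the entity name table.

```python
from itertools import combinations


def normalize(name):
    """Hypothetical heuristic: drop full stops, collapse whitespace, ignore case."""
    return " ".join(name.replace(".", "").split()).lower()


def same_entity(a, b):
    """Stand-in for the automated deduplication heuristic under test."""
    return normalize(a) == normalize(b)


def regression_failures(alias_groups, heuristic):
    """Return every known-equivalent name pair that the heuristic fails to equate.

    alias_groups is a list of lists; each inner list holds names that the
    database already records as belonging to the same entity.
    """
    failures = []
    for aliases in alias_groups:
        for a, b in combinations(aliases, 2):
            if not heuristic(a, b):
                failures.append((a, b))
    return failures
```

An empty result means the heuristic still equates every pair the database already knows about; any returned pair is a candidate regression (or an alias that only a human could have matched).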

Acceptance Criteria

This story can be considered done when the following acceptance tests are satisfied:

Given a data file to import containing at least one record which refers to an entity already in the database, but with a slightly different name,
When the entity is imported
Then no new record is created in the entity model
And a new record is created in the entity name model, which refers to the id of the entity which already existed in the database.
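
As a rough illustration of the behaviour these criteria describe, here is a self-contained sketch using Python's `sqlite3`. The schema, the `normalize` matching rule and the `import_entity` helper are all hypothetical; the real import code and heuristics may look quite different.

```python
import sqlite3

# Hypothetical schema: one row per entity, many name rows per entity.
SCHEMA = """
CREATE TABLE entities (id INTEGER PRIMARY KEY);
CREATE TABLE entity_names (
    id        INTEGER PRIMARY KEY,
    entity_id INTEGER NOT NULL REFERENCES entities(id),
    name      TEXT NOT NULL
);
"""


def normalize(name):
    """Hypothetical matching rule: drop full stops, collapse whitespace, ignore case."""
    return " ".join(name.replace(".", "").split()).lower()


def import_entity(conn, name):
    """Import one name and return the id of the entity it belongs to."""
    rows = conn.execute("SELECT entity_id, name FROM entity_names").fetchall()

    # Exact alias already recorded: nothing new to store.
    for entity_id, known in rows:
        if known == name:
            return entity_id

    # Slightly different name for a known entity: add a name record only.
    for entity_id, known in rows:
        if normalize(known) == normalize(name):
            conn.execute(
                "INSERT INTO entity_names (entity_id, name) VALUES (?, ?)",
                (entity_id, name),
            )
            return entity_id

    # Unknown name: create a new entity together with its first name record.
    entity_id = conn.execute("INSERT INTO entities DEFAULT VALUES").lastrowid
    conn.execute(
        "INSERT INTO entity_names (entity_id, name) VALUES (?, ?)",
        (entity_id, name),
    )
    return entity_id
```

Importing "Mr D Cameron" and then "Mr. D Cameron" with this sketch leaves one row in `entities` and two rows in `entity_names`, both pointing at the same entity id.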

aspiers added the Data Storage and API label Feb 7, 2017

aspiers assigned JohnSmall Feb 7, 2017

aspiers mentioned this issue Feb 7, 2017

rails: add ui for manual deduplication workflow #153

Open
