This repository has been archived by the owner on Mar 5, 2019. It is now read-only.

db: extract entity names into separate table #152

Open
aspiers opened this issue Feb 7, 2017 · 0 comments

Description

As agreed at the 2017/2/7 meeting, we should move entity names into a separate database table which has a many:one relationship with the entity model, so that each entity can have many names.

Currently each entity's name is stored as a field within the entity's record in the relevant database table. This does not support tracking the multiple variations / aliases an entity might have. For example, the ex-PM might appear as any of:

  • David Cameron
  • Mr David Cameron
  • Mr. David Cameron
  • Mr D Cameron
  • Mr. D Cameron
  • The Right Honourable David Cameron MP
  • Right Honourable David Cameron MP

and lots more.
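
A minimal sketch of what the many:one layout could look like, using Python's built-in `sqlite3` module purely for illustration. The table names `entity` and `entity_name`, the column names, and the `names_for` helper are all hypothetical; the project's actual schema and ORM may well differ.

```python
import sqlite3

# Hypothetical table and column names; the project's real schema may differ.
SCHEMA = """
CREATE TABLE entity (
    id INTEGER PRIMARY KEY
);
CREATE TABLE entity_name (
    id        INTEGER PRIMARY KEY,
    entity_id INTEGER NOT NULL REFERENCES entity(id),
    name      TEXT NOT NULL,
    UNIQUE (entity_id, name)  -- each alias is stored once per entity
);
"""


def make_db():
    """Create an in-memory database holding the two-table layout."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def names_for(conn, entity_id):
    """Return every known name for one entity, in insertion order."""
    rows = conn.execute(
        "SELECT name FROM entity_name WHERE entity_id = ? ORDER BY id",
        (entity_id,),
    )
    return [name for (name,) in rows]


conn = make_db()
# One entity row, many name rows pointing back at it.
entity_id = conn.execute("INSERT INTO entity DEFAULT VALUES").lastrowid
for alias in ("David Cameron", "Mr D Cameron", "Mr. D Cameron"):
    conn.execute(
        "INSERT INTO entity_name (entity_id, name) VALUES (?, ?)",
        (entity_id, alias),
    )
```

With this layout the entity record itself no longer carries a name; API calls that need names would join against the name table instead.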

Comments, Questions and Considerations

  • This should be a fairly simple schema change and db migration. It would require tweaking of API calls so that they include relevant entity names.
  • There are several reasons why we need to be able to track all of these in the backend as corresponding to the same entity:
    • It means we only need to deduplicate each variation / alias once, since any further imports of entities can be cross-checked against the whole list. This makes the whole data cleansing process far more efficient (e.g. volunteers wouldn't need to manually deduplicate the same entity name pair over and over again).
    • It means that we don't have to worry as much about our automated deduplication heuristics being 100% consistent over time. For example, if we make them more sophisticated but in the process accidentally break their ability to spot that "Mr. D Cameron" and "Mr D Cameron" actually refer to the same entity, it won't matter quite so much, because the database already knows that and can still deduplicate based on the variations / aliases it already knows about.
    • It provides a way to collect data about variations / aliases which can then be used to perform automated regression testing against our automated deduplication heuristics. This is important because even though the regression above would not break the overall system's ability to spot that "Mr. D Cameron" and "Mr D Cameron" actually refer to the same entity (since they are already equated in the database), it might break its ability to spot that "Mr. J Corbyn" and "Mr J Corbyn" refer to the same entity (assuming that these are not yet equated in the database).
    • It could potentially allow us to track the frequency of usage of each alias, which might be helpful data for improving the automated deduplication heuristics.
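
The regression-testing idea in the list above could be sketched roughly as follows. The `normalise` heuristic here is a deliberately simple stand-in, not the project's actual deduplication logic, and `same_entity` and `missed_pairs` are hypothetical helper names. The alias groups would in practice be read from the entity name table rather than hard-coded.

```python
import re
from itertools import combinations


def normalise(name):
    """Toy heuristic (an assumption): lowercase, drop full stops, collapse spaces."""
    return re.sub(r"\s+", " ", name.replace(".", "").lower()).strip()


def same_entity(a, b):
    """Return True if the toy heuristic thinks two names match."""
    return normalise(a) == normalise(b)


def missed_pairs(alias_groups, matcher=same_entity):
    """Return alias pairs known to be equivalent that the matcher fails to equate."""
    missed = []
    for aliases in alias_groups:
        for a, b in combinations(aliases, 2):
            if not matcher(a, b):
                missed.append((a, b))
    return missed


# Each inner list stands for the aliases already equated for one entity.
known = [
    ["Mr. D Cameron", "Mr D Cameron"],
    ["Mr. J Corbyn", "Mr J Corbyn"],
]
```

Running `missed_pairs(known)` after every change to the heuristics would flag any known-equivalent pair that the new version no longer recognises.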

Acceptance Criteria

This story can be considered done when the following acceptance tests are satisfied:

Given a data file to import containing at least one entity referring to an entity already in the database, but with a slightly different name,
When the entity is imported
Then a new record in the entity model is not created
And a new record in the entity name model is created, which refers to the id of the entity which already existed in the database.
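
The acceptance test above might be exercised along these lines. This is only a sketch under stated assumptions: the `entity` / `entity_name` tables, the `import_name` and `count` helpers, and the toy `normalise` matcher are all hypothetical, and it assumes for simplicity that a given name string belongs to at most one entity.

```python
import sqlite3


def make_db():
    """Create an in-memory database with hypothetical entity and name tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE entity (id INTEGER PRIMARY KEY);
        CREATE TABLE entity_name (
            id        INTEGER PRIMARY KEY,
            entity_id INTEGER NOT NULL REFERENCES entity(id),
            name      TEXT NOT NULL UNIQUE
        );
    """)
    return conn


def normalise(name):
    """Toy matcher (an assumption): lowercase, drop full stops, collapse spaces."""
    return " ".join(name.replace(".", "").lower().split())


def import_name(conn, name):
    """Import one name and return the id of the entity it resolves to."""
    row = conn.execute(
        "SELECT entity_id FROM entity_name WHERE name = ?", (name,)
    ).fetchone()
    if row:
        return row[0]  # exact alias already known, nothing to add
    known_names = conn.execute("SELECT entity_id, name FROM entity_name").fetchall()
    for entity_id, known in known_names:
        if normalise(known) == normalise(name):
            break  # slightly different name for an existing entity
    else:
        # No match at all, so this really is a new entity.
        entity_id = conn.execute("INSERT INTO entity DEFAULT VALUES").lastrowid
    conn.execute(
        "INSERT INTO entity_name (entity_id, name) VALUES (?, ?)",
        (entity_id, name),
    )
    return entity_id


def count(conn, table):
    """Count rows in one of the two fixed tables used in this sketch."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```

Importing "Mr D Cameron" and then "Mr. D Cameron" into a fresh database should leave one row in `entity` and two rows in `entity_name`, both pointing at the same entity id.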
