By Laura Gribble, Data Consultant, GeoPlace.
When matching data, one of the things you need to determine is how you (as an organisation) feel about false negative and/or false positive results. What would you deem to be a false negative or a false positive? What risks might each of these mean for your organisation? What is your attitude to these risks?
What is a false result?
A false result is when what is recorded is not true. By way of example, in the context of address matching:
- A positive result is where a match is made between a record and a UPRN. Therefore, a true positive result is where a match is made between a record and the correct UPRN. A false positive result is where a match is made to the wrong UPRN.
- A negative result is where no match can be made between a record and a UPRN. Therefore, a true negative result is where there is no appropriate UPRN to match to. A false negative result is where no match has been made between a record and the UPRN which it should have been matched to.
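The four outcomes above can be sketched in code. This is a minimal illustration only, assuming you already have a manually verified "correct" UPRN for each record in a reviewed sample; the function name, the sample data and the UPRN values are all hypothetical.

```python
from collections import Counter
from typing import Optional


def classify_outcome(matched_uprn: Optional[str], correct_uprn: Optional[str]) -> str:
    """Label one record's match result against a verified answer.

    None means "no UPRN": either no match was made (matched_uprn)
    or no appropriate UPRN exists (correct_uprn).
    """
    if matched_uprn is None:
        # No match was made: true only if there was nothing to match to.
        return "true negative" if correct_uprn is None else "false negative"
    # A match was made: true only if it is the correct UPRN.
    return "true positive" if matched_uprn == correct_uprn else "false positive"


# Hypothetical reviewed sample: (matched UPRN, verified correct UPRN).
reviewed = [
    ("100000000015", "100000000015"),  # matched to the right UPRN
    ("100000000015", "100000000014"),  # matched to the wrong UPRN
    (None, "100000000016"),            # should have been matched
    (None, None),                      # nothing to match to
]

counts = Counter(classify_outcome(m, c) for m, c in reviewed)
for outcome, n in sorted(counts.items()):
    print(f"{outcome}: {n}")
```

Counting the outcomes in a reviewed sample like this gives you a rough feel for how often each kind of false result occurs in your own data.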
What risks might each of these mean?
A false positive match will mean data is associated with the wrong property. The severity of this error will depend on your organisation's functions, and not all of the risks will fall on you. For example, if a delivery gets sent to the wrong place your organisation might need to provide a replacement, but the customer has also been inconvenienced.
A false negative result will mean data which should be associated with an address cannot be, so a service cannot be provided. Alternatively, you might find you are unable to conduct your business, such as to carry out a legally required inspection, and incur a fine.
What is your attitude to these risks?
Once you have established what your risks are, you need to determine if these are acceptable to your organisation. There are no hard and fast rules which can be applied here; only you can say if you think losing a percentage of custom is tolerable or if you are willing to shoulder the loss of reputation in the event of a mistake appearing on the front page of newspapers.
There are a number of risk management methodologies which can help you. Tip: ask your project managers for help, or whether your organisation has a preferred method.
Whatever method you choose, you will probably end up with a list of risks which are categorised as:
- Broadly Acceptable Risks - the risk is currently acceptable but it will be necessary to maintain assurance
- Tolerable Risk - the risk is as low as reasonably practicable (ALARP) given the resources it would take to mitigate further. A cost-benefit analysis may be required to determine if more can be done
- Unacceptable Risk - the risk cannot be justified.
What do you do next?
Any unacceptable risk needs to be mitigated. In the context of false negatives and positives, this means identifying and correcting the errors.
Negatives are almost always indicative of an error, as they are either a false negative that requires matching or a true negative (unmatchable) because the source data is wrong. False negatives are relatively common if you have only done an automated match. There are many advantages to running auto-matches (primarily time and cost), but there are always records which would benefit from a manual review, as humans are good at spotting patterns or silly data entry errors which machines can’t catch yet. Once you have completed a manual review, it might be possible to remove records from your source data if the address does not in fact exist (e.g. when only 12 houses were built, rather than the 16 originally planned).
False positives are harder to identify as they are hidden amongst true positives, and it would be resource heavy to bulk review all matches without making the same erroneous matches again. However, you might review a sample of matches. Alternatively, if you find errors in matches as part of your normal business processes, you might be able to query the data for similar issues.
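Drawing a sample for review could look something like the sketch below. It assumes your matched records are held in a simple list; the function name, the record identifiers and the sample size are hypothetical, and a plain random sample is only one of several sampling approaches you might choose.

```python
import random


def sample_for_review(matched_records, sample_size, seed=None):
    """Pick a random subset of matched records for a manual check.

    A fixed seed makes the same sample reproducible, which helps
    if the review is shared between several people.
    """
    rng = random.Random(seed)
    # Never ask for more records than exist.
    k = min(sample_size, len(matched_records))
    return rng.sample(matched_records, k)


# Hypothetical record identifiers for 500 matched records.
matched = [f"record-{i}" for i in range(1, 501)]

to_review = sample_for_review(matched, sample_size=25, seed=1)
print(len(to_review), "records selected for manual review")
```

The proportion of errors found in the sample gives an indication, not a guarantee, of how many false positives remain in the full set.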
A common pattern which indicates false positive address matches is a UPRN that has been matched to more than once. This can happen if there are duplicates in the source data, in which case one should be archived or removed. However, it can also be that a false match and a true match have been made to the same address, creating a duplicate set. For example, if Flat 14 and Flat 15 are both matched to the UPRN for Flat 15, you can confirm Flat 15 is a true positive and Flat 14 is a false positive. By un-matching Flat 14 from the Flat 15 UPRN and re-matching it to the Flat 14 UPRN, both records will now be true positives.
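A query for this pattern might be sketched as follows. It assumes your match results can be expressed as pairs of record identifier and matched UPRN; the function name, record identifiers and UPRN values are hypothetical.

```python
from collections import defaultdict


def uprns_matched_more_than_once(matches):
    """Group record IDs by matched UPRN and keep UPRNs with 2+ records.

    `matches` is an iterable of (record_id, uprn) pairs; a uprn of
    None means the record is unmatched and is ignored here.
    """
    by_uprn = defaultdict(list)
    for record_id, uprn in matches:
        if uprn is not None:
            by_uprn[uprn].append(record_id)
    return {uprn: ids for uprn, ids in by_uprn.items() if len(ids) > 1}


# Hypothetical match results, mirroring the Flat 14 / Flat 15 example.
matches = [
    ("flat-14", "100000000015"),  # suspect: matched to Flat 15's UPRN
    ("flat-15", "100000000015"),
    ("flat-16", "100000000016"),
    ("flat-17", None),            # unmatched record
]

for uprn, record_ids in uprns_matched_more_than_once(matches).items():
    print(uprn, "is matched by", record_ids)
```

Each UPRN this returns still needs a human decision: the records may be genuine duplicates in the source data, or one of them may be a false positive that needs re-matching.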
Tip: If you do not wish to fix all existing errors (i.e. they are broadly acceptable risks), it may still be helpful to run a root cause analysis on any you find. This will help you understand what went wrong in that instance and help prevent new errors being created in the same way. This can assure your risk management process and keep all risks at ALARP.
Help, I’m stuck!
If you have any more questions about address matching, we are always happy to talk. Email [email protected] and we can arrange a call to discuss your specific data issues.