The site’s crowdsourced editing model is prone to vandalism and bias. While its reputation for accuracy has improved, even Wikipedia doesn’t consider itself a reliable source. The Wikimedia Foundation, the non-profit that oversees Wikipedia, regularly explores new solutions for these shortcomings. The latest effort to address them turns to AI.

The AI team at Meta has launched a research initiative to improve Wikipedia’s citations. These references are used to corroborate crowdsourced information on the site, but they’re often missing, incomplete, or inaccurate. Wikipedia volunteers double-check the footnotes, but it’s hard for them to keep up when more than 17,000 new articles are added every month. That scale makes the problem a compelling use case for machine learning.

Meta’s system fact-checks the references. The team says it’s the first model that can automatically scan hundreds of thousands of citations at once to check their accuracy.
Source code
The model’s knowledge source is a new dataset of 134 million public web pages. Meta says the open-source library, dubbed Sphere, is larger and more complex than any corpus previously used for this kind of research.
To find appropriate sources in the dataset, the researchers trained their algorithms on 4 million Wikipedia citations. This taught the system to unearth a single source from the vast corpus to validate each statement.
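The article doesn’t spell out the training objective, but retrieval models of this kind are commonly trained contrastively on aligned pairs; here, each Wikipedia statement paired with the passage its citation points to. A minimal sketch of one such training step, assuming in-batch negatives (an assumption, not Meta’s published recipe):

```python
# Sketch of one contrastive training step for a dense retriever, using
# (claim, cited-passage) pairs with in-batch negatives. This is a standard
# dense-retrieval recipe, assumed here for illustration.
import torch
import torch.nn.functional as F

def contrastive_step(claim_emb: torch.Tensor, passage_emb: torch.Tensor) -> torch.Tensor:
    """claim_emb, passage_emb: (batch, dim) embeddings of aligned pairs.
    Row i of claim_emb should score highest against row i of passage_emb;
    every other passage in the batch serves as a negative."""
    scores = claim_emb @ passage_emb.T      # (batch, batch) similarity matrix
    targets = torch.arange(scores.size(0))  # the diagonal holds the true pairs
    return F.cross_entropy(scores, targets)
```

Trained this way, the encoder places a statement and a passage that supports it close together, so a nearest-neighbor search over the index can surface a best candidate for each claim.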
An evidence-ranking model then compares the alternative sources with the original reference.
If a citation appears irrelevant, the system will recommend a better source, alongside a specific passage that supports the claim. A human editor can then review and approve the suggestion.
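Putting the two preceding steps together, one plausible shape for the verify-and-recommend logic is below. The cross-encoder model name and the score margin are illustrative assumptions; Meta’s actual ranking model and thresholds aren’t given in the article:

```python
# Sketch of the verify-then-recommend step. The cross-encoder and the
# margin threshold are illustrative assumptions, not Meta's configuration.
from sentence_transformers import CrossEncoder

ranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # hypothetical ranker

def recommend(claim: str, original_passage: str, candidates: list[str], margin: float = 1.0):
    """Return a passage that supports the claim better than the original
    citation, or None if the original citation looks adequate."""
    if not candidates:
        return None
    pairs = [(claim, original_passage)] + [(claim, c) for c in candidates]
    scores = ranker.predict(pairs)  # one relevance score per (claim, passage) pair
    original_score, candidate_scores = scores[0], scores[1:]
    best = int(candidate_scores.argmax())
    if candidate_scores[best] > original_score + margin:
        # Surface the passage for a human editor to review and approve.
        return candidates[best], float(candidate_scores[best])
    return None
```

The margin keeps the system conservative: it only flags a citation when a retrieved passage clearly outscores the original, and even then the final call stays with a human editor.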
To illustrate how this works, the researchers used the example of a Wikipedia page on retired boxer Joe Hipp.
The entry describes the Blackfeet Tribe member as the first Native American to compete for the WBA World Heavyweight title. But the model found that the citation for this claim was a webpage that didn’t even mention Hipp or boxing.
The system then searched the Sphere corpus for a replacement reference. It unearthed a passage from a 2015 article in the Great Falls Tribune describing Hipp’s challenge for the world heavyweight title.
While the passage doesn’t explicitly mention boxing, the model inferred the context from clues. These included the term “heavyweight” and the use of “challenge” as a synonym for “compete,” the word that featured in the original Wikipedia entry.
Future fact-checking
The team now aims to turn its research into a comprehensive system. In time, they plan to build a platform that Wikipedia editors can use to systematically spot and resolve citation issues. Meta has also open-sourced the project, which could give external researchers new tools for developing their own AI language systems.

“Our results indicate that an AI-based system could be used, in tandem with humans, to improve the verifiability of Wikipedia,” the study authors wrote. “More generally, we hope that our work can be used to assist fact-checking efforts and increase the general trustworthiness of information online.”

The research may deepen fears about automated fact-checking and Big Tech firms becoming arbiters of truth. The more optimistic view is that Meta has finally found a way to use its experience with misinformation for good.