Feature extraction for the author name disambiguation problem in a bibliographic database

Thumbnail Image
Date
2017
Authors
Jorge Miguel Silva
Fernando Silva
Journal Title
Journal ISSN
Volume Title
Publisher
Abstract
Author name disambiguation in bibliographic databases has been, and still is, a challenging research task due to the high uncertainty there is when matching a publication author with a concrete researcher. Common approaches normally either resort to clustering to group author's publications, or use a binary classifier to decide whether a given publication is written by a specific author. Both approaches benefit from authors publishing similar works (e.g. subject areas and venues), from the previous publication history of an author (the higher, the better), and validated publicationauthor associations for model creation. However, whenever such an algorithm is confronted with different works from an author, or an author without publication history, often it makes wrong identifications. In this paper, we describe a feature extraction method that aims to avoid the previous problems. Instead of generally characterizing an author, it selectively uses features that associate the author to a certain publication. We build a Random Forest model to assess the quality of our set of features. Its goal is to predict whether a given author is the true author of a certain publication. We use a bibliographic database named Authenticus with more than 250, 000 validated author-publication associations to test model quality. Our model achieved a top result of 95.37% accuracy in predicting matches and 91.92% in a real test scenario. Furthermore, in the last case the model was able to correctly predict 61.86% of the cases where authors had no previous publication history. Copyright 2017 ACM.
Description
Keywords
Citation