Metalearning to Choose the Level of Analysis in Nested Data: A Case Study on Error Detection in Foreign Trade Statistics

Date
2015
Authors
Mohammad Nozari
Carlos Manuel Soares
Abstract
Traditionally, a single model is developed for a data mining task. As more data is collected at a more detailed level, organizations are becoming more interested in having specific models for distinct parts of the data (e.g. customer segments). From the business perspective, data can be divided naturally along different dimensions. Each of these dimensions is usually hierarchically organized (e.g. country, city, zip code), which means that, when developing a model for a given part of the problem (e.g. a zip code), the training data may be collected at different levels of this nested hierarchy (e.g. the same zip code, the city, or the country it is located in). The choice of granularity level may change the performance of the whole process, so the question is which level to use for a given part. We propose a metalearning model that recommends the level of granularity of the training data that is expected to yield the best-performing model. We apply decision tree and random forest algorithms for metalearning. At the base level, our experiment uses results obtained by outlier detection methods on the problem of detecting errors in foreign trade transactions. The results show that metalearning helps to find the best level of granularity.
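The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record fields, the meta-feature (number of records available at the zip-code level), the scores, and all function names are hypothetical, and a depth-1 decision tree (a stump) stands in for the decision tree and random forest metalearners named above. The sketch shows how the candidate training sets at each level of the hierarchy could be formed, how each part of the data could be labeled with its best-performing level, and how a metalearner could then recommend a level for a new part.

```python
from collections import Counter

# Hypothetical hierarchy, ordered from finest to coarsest level.
LEVELS = ("zip_code", "city", "country")


def training_set(records, target, level):
    """Return the records that share the target's value at the given level."""
    return [r for r in records if r[level] == target[level]]


def best_level(scores):
    """Meta-label: the level whose base-level model scored highest.

    Ties are resolved in favor of the finer level (earlier in LEVELS).
    """
    return max(LEVELS, key=lambda lvl: scores[lvl])


def fit_stump(meta_examples):
    """Fit a depth-1 decision tree on one numeric meta-feature.

    meta_examples is a list of (feature_value, best_level) pairs.
    Returns (threshold, label_if_at_or_below, label_if_above).
    """
    values = sorted({x for x, _ in meta_examples})
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in meta_examples if x <= threshold]
        right = [y for x, y in meta_examples if x > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = sum(y != left_label for y in left)
        errors += sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1:]


def recommend_level(stump, feature_value):
    """Recommend a training-data level for a new part of the data."""
    threshold, left_label, right_label = stump
    return left_label if feature_value <= threshold else right_label


# Toy records, invented for illustration only.
records = [
    {"zip_code": "4200", "city": "Porto", "country": "PT"},
    {"zip_code": "4200", "city": "Porto", "country": "PT"},
    {"zip_code": "4000", "city": "Porto", "country": "PT"},
    {"zip_code": "1000", "city": "Lisboa", "country": "PT"},
]
target = records[0]
sizes = {lvl: len(training_set(records, target, lvl)) for lvl in LEVELS}

# Invented meta-examples: (records at zip-code level, best level observed).
meta_examples = [(5, "city"), (8, "city"), (40, "zip_code"), (60, "zip_code")]
stump = fit_stump(meta_examples)
```

In this toy setting, `recommend_level(stump, 10)` suggests training on city-level data for a zip code with only ten records of its own. In practice the meta-features and the performance measure would come from the base-level outlier detection experiments.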