Sérgio Nunes; Cristina Ribeiro; Gabriel David
Documents on the World Wide Web are dynamic entities. Mainstream information retrieval systems and techniques focus primarily on the latest version of a document, generally ignoring its evolution over time. In this work, we study the term frequency dynamics of web documents over their lifespan. We use Wikipedia as a document collection because it is a broad and public resource and, more importantly, because it provides access to the complete revision history of each document. We investigate the progression of similarity values over two projection variables, namely revision order and revision date. Based on this investigation, we find that term frequency in encyclopedic documents (i.e., documents that are comprehensive and focused on a single topic) exhibits a rapid and steady progression towards the document’s current version. The content of early versions quickly becomes very similar to that of the present version of the document.
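As an illustration of the kind of measurement described above, the following minimal sketch compares the term-frequency vector of each revision with that of the current version. The abstract does not name the similarity measure, so cosine similarity is an assumption here, as are the simple word tokenizer and the function names `term_frequencies`, `cosine_similarity`, and `similarity_to_current`, which are hypothetical and not taken from the paper.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count lowercase word tokens in one revision's text (assumed tokenizer)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(tf_a, tf_b):
    """Cosine similarity of two term-frequency vectors; 0.0 if either is empty."""
    # A Counter returns 0 for missing terms, so only shared terms contribute.
    dot = sum(count * tf_b[term] for term, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_to_current(revisions):
    """Similarity of every revision to the last (current) one.

    `revisions` is a list of revision texts in revision order, oldest first.
    """
    current = term_frequencies(revisions[-1])
    return [cosine_similarity(term_frequencies(r), current) for r in revisions]
```

The returned list is indexed by revision order. To project the same values over revision date, each value can be paired with the timestamp of its revision instead of its index.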