Knowledge Aware Web Content

ETH Media Technology Center – Knowledge Aware Web Content

Abstract

This thesis proposes several lightweight mechanisms for detecting similar information between texts. The mechanisms are implemented in a stand-alone web browser extension, which builds a model of its user’s knowledge by collecting the content of websites the user has already visited. Based on this model, newly visited websites are examined for new information. If new information is detected, it is marked on the website to allow for efficient reading. The mechanisms were implemented with a focus on being lightweight; the measures for detecting known information are therefore based on the bag-of-words model. We apply measures such as cosine similarity and Jaccard similarity. In addition, we developed asymmetric measures based on cosine and Jaccard similarity, as well as methods that make use of a word embedding. All methods can compute the similarity between text elements such as sentences, paragraphs, and whole documents. To evaluate how accurately text elements are classified, we describe a newly developed benchmark. On this benchmark, which consists of 200 news articles, the best-performing methods, the asymmetric cosine similarity and the asymmetric Jaccard similarity, achieve an AUC of 0.94 out of 1.0 when comparing paragraphs to known paragraphs or known documents.
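To make the bag-of-words measures named in the abstract concrete, the following is a minimal Python sketch of cosine similarity and Jaccard similarity over word counts. The abstract does not define the thesis's asymmetric measures, so the `containment` function is only an assumption: it shows one common asymmetric variant of Jaccard (the share of a new text's distinct words that already occur in a known text) and should not be read as the author's formula. The tokenizer and all function names are hypothetical and chosen for illustration only.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count word occurrences (a simple bag of words)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(a: Counter, b: Counter) -> float:
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def containment(new: Counter, known: Counter) -> float:
    """ASSUMPTION: an illustrative asymmetric variant, not the thesis's definition.

    Returns the fraction of distinct words in `new` that also appear in `known`.
    """
    set_new = set(new)
    if not set_new:
        return 0.0
    return len(set_new & set(known)) / len(set_new)


known = bag_of_words("the cat sat on the mat")
new = bag_of_words("the cat sat")

print(jaccard_similarity(new, known))
print(containment(new, known))
print(containment(known, new))
print(cosine_similarity(new, known))
```

In this toy example the short text is fully covered by the known text, so `containment(new, known)` is 1.0 while the symmetric Jaccard similarity is only 0.6. This suggests why an asymmetric measure may suit the task of comparing a short paragraph against a longer known document: the extra words in the known document do not lower the score.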


Robin Bisping

Bachelor's Thesis

Status: Completed
