Paper: Topic Classification of Blog Posts Using Distant Supervision

ACL ID W12-0604
Title Topic Classification of Blog Posts Using Distant Supervision
Venue Workshop on Semantic Analysis in Social Media
Session  
Year 2012
Authors Stephanie Husby, Denilson Barbosa

Classifying blog posts by topics is useful for applications such as search and marketing. However, topic classification is time consuming and error prone, especially in an open domain such as the blogosphere. The state-of-the-art relies on supervised methods, requiring considerable training effort, that use the whole corpus vocabulary as features, demanding considerable memory to process. We show an effective alternative whereby distant supervision is used to obtain training data: we use Wikipedia articles labelled with Freebase domains. We address the memory requirements by using only named entities as features. We test our classifier on a sample of blog posts, and report up to 0.69 accuracy for multi-class labelling and 0.9 for binary classification.
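The pipeline outlined in the abstract (distantly labelled training articles, named entities as the only features) can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the training sentences are invented toy data standing in for Wikipedia articles with Freebase domain labels, the capitalised-word regular expression is a crude stand-in for a real named-entity recogniser, and multinomial Naive Bayes is chosen only because the abstract does not name the classifier.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stand-in for a named-entity recogniser: runs of capitalised words.
ENTITY_RE = re.compile(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*")

def extract_entities(text):
    """Return the capitalised word runs in text, treated here as named entities."""
    return ENTITY_RE.findall(text)

# Invented toy data standing in for articles labelled with a topic domain.
TRAIN = [
    ("the band Radiohead released an album produced by Nigel Godrich", "music"),
    ("the guitarist Jonny Greenwood plays in Radiohead", "music"),
    ("the club Arsenal signed a striker from Ajax", "sports"),
    ("the striker scored for Ajax against Arsenal", "sports"),
]

def train_nb(examples):
    """Count documents per label and entity occurrences per label."""
    class_docs = Counter()
    feat_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_docs[label] += 1
        for ent in extract_entities(text):
            feat_counts[label][ent] += 1
            vocab.add(ent)
    return class_docs, feat_counts, vocab

def classify(text, model):
    """Pick the label with the highest Laplace-smoothed Naive Bayes log score."""
    class_docs, feat_counts, vocab = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)
        denom = sum(feat_counts[label].values()) + len(vocab)
        for ent in extract_entities(text):
            if ent in vocab:  # entities never seen in training are ignored
                score += math.log((feat_counts[label][ent] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

model = train_nb(TRAIN)
print(classify("my favourite Radiohead song", model))
print(classify("we watched Ajax play last night", model))
```

Restricting the vocabulary to entities keeps the feature table small compared with the full corpus vocabulary, which is the memory argument the abstract makes. The trade-off in this sketch is that a post mentioning no known entity is classified by the class prior alone.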

@InProceedings{husby-barbosa:2012:SASM2012,
  author    = {Husby, Stephanie  and  Barbosa, Denilson},
  title     = {Topic Classification of Blog Posts Using Distant Supervision},
  booktitle = {Proceedings of the Workshop on Semantic Analysis in Social Media},
  month     = {April},
  year      = {2012},
  address   = {Avignon, France},
  publisher = {Association for Computational Linguistics},
  pages     = {28--36},
  url       = {http://www.aclweb.org/anthology/W12-0604}
}