Paper: Sketching Techniques for Large Scale NLP

ACL ID W10-1503
Title Sketching Techniques for Large Scale NLP
Venue Proceedings of the NAACL HLT 2010 Sixth Web as Corpus Workshop
Year 2010

In this paper, we address the challenges posed by large amounts of text data by exploiting the power of hashing in the context of streaming data. We explore sketch techniques, especially the Count-Min Sketch, which approximates the frequency of a word pair in the corpus without explicitly storing the word pairs themselves. We use the idea of a conservative update with the Count-Min Sketch to reduce the average relative error of its approximate counts by a factor of two. We show that it is possible to store all words and word pairs counts computed from 37 GB of web data in just 2 billion counters (8 GB RAM). The number of these counters is up to 30 times less than the stream size which is a big memory and space gain. In Semantic Orientation experiments, the PMI scores computed f...
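
The abstract names two ideas that are easier to follow in code: a Count-Min Sketch that estimates word-pair counts without storing the pairs, and a conservative update that raises only the counters that need raising. The following is a minimal Python sketch of the general technique, not the authors' implementation. The class name, the `width` and `depth` parameters, the salted BLAKE2b hashing, and the toy token stream are all assumptions made for illustration.

```python
import hashlib
from collections import Counter


class CountMinSketch:
    """Illustrative Count-Min Sketch with an optional conservative update."""

    def __init__(self, width, depth, conservative=True):
        self.width = width                # counters per row (assumed parameter)
        self.depth = depth                # number of hash rows (assumed parameter)
        self.conservative = conservative
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, key):
        # One column index per row. Salting BLAKE2b with the row number is an
        # illustrative choice of hash family, not taken from the paper.
        cols = []
        for row in range(self.depth):
            digest = hashlib.blake2b(
                key.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            cols.append(int.from_bytes(digest, "little") % self.width)
        return cols

    def update(self, key, count=1):
        cols = self._columns(key)
        if self.conservative:
            # Conservative update: lift each counter only as far as the new
            # minimum estimate, so colliding keys inflate it less.
            target = min(self.table[r][c] for r, c in enumerate(cols)) + count
            for r, c in enumerate(cols):
                if self.table[r][c] < target:
                    self.table[r][c] = target
        else:
            # Plain Count-Min update: increment every hashed counter.
            for r, c in enumerate(cols):
                self.table[r][c] += count

    def query(self, key):
        # The estimate is the minimum over rows; it never undercounts.
        return min(self.table[r][c] for r, c in enumerate(self._columns(key)))


# Toy stream of adjacent word pairs (hypothetical data).
tokens = "the cat sat on the mat and the cat sat on the rug".split()
pairs = [left + " " + right for left, right in zip(tokens, tokens[1:])]

exact = Counter(pairs)
plain = CountMinSketch(width=8, depth=3, conservative=False)
cu = CountMinSketch(width=8, depth=3, conservative=True)
for pair in pairs:
    plain.update(pair)
    cu.update(pair)

for pair, true_count in exact.items():
    print(pair, true_count, cu.query(pair), plain.query(pair))
```

With the same hash functions, the conservative estimate for any key lies between the true count and the plain Count-Min estimate, which is why it can lower the relative error. The table here is deliberately tiny so that collisions occur. The abstract's figures (2 billion counters in 8 GB) imply about 4 bytes per counter, which would call for a packed integer array rather than Python lists.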

  author    = {Goyal, Amit  and  Jagarlamudi, Jagadeesh  and  Daum\'{e} III, Hal  and  Venkatasubramanian, Suresh},
  title     = {Sketching Techniques for Large Scale NLP},
  booktitle = {Proceedings of the NAACL HLT 2010 Sixth Web as Corpus Workshop},
  month     = {June},
  year      = {2010},
  address   = {NAACL-HLT, Los Angeles},
  publisher = {Association for Computational Linguistics},
  pages     = {17--25},
  url       = {}