Sentimentality and Polarity for the ClueWeb09 Dataset

A. Gural Vural (1), B. Barla Cambazoglu (2), Pinar Karagoz (3)
(1) Middle East Technical University, Ankara, Turkey
(2) Yahoo! Labs, Barcelona, Spain
(3) Middle East Technical University, Ankara, Turkey

This site provides the sentimentality and polarity scores for the English documents in the ClueWeb09-B dataset.

The ClueWeb09 dataset is a crawl of 1 billion Web pages made available for information retrieval research. The ClueWeb09-B dataset consists of the first 50 million English pages of ClueWeb09.

The method by which the sentimentality scores were computed is described here:

   Sentiment-focused web crawling, by Vural, A. G., Cambazoglu, B. B., and Senkul, P. (2012)

The score file can be downloaded here: clueweb09B.sentiment.bz2 (405MB). The file has 44,218,678 lines in the following format:

   clueweb-docid sentimentality polarity spam-rank page-rank

Note 1: Scores for Wiki pages of ClueWeb09-B are currently excluded.
Note 2: The decompressed file is 2GB.
Note 3: Spam-rank scores are taken from the Waterloo Spam Rankings.
Note 4: Page-rank scores are taken from the Clueweb Wiki.
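
The five-field line format described above can be read directly from the compressed file without decompressing it to disk first. The following is a minimal Python sketch, not part of the original distribution. It assumes that fields are separated by whitespace and that all four scores parse as decimal numbers; neither detail is stated on this page, so check a few lines of the actual file before relying on it. The names ScoreRecord, parse_line, and read_scores are illustrative only.

```python
import bz2
from typing import Iterator, NamedTuple


class ScoreRecord(NamedTuple):
    """One line of the score file (field names mirror the format line)."""
    docid: str
    sentimentality: float
    polarity: float
    spam_rank: float
    page_rank: float


def parse_line(line: str) -> ScoreRecord:
    """Parse one line, assuming five whitespace-separated fields."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    docid, sentimentality, polarity, spam_rank, page_rank = fields
    # Assumption: every score is numeric; float() raises ValueError otherwise.
    return ScoreRecord(
        docid,
        float(sentimentality),
        float(polarity),
        float(spam_rank),
        float(page_rank),
    )


def read_scores(path: str) -> Iterator[ScoreRecord]:
    """Stream records from the bz2 file one line at a time."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines, if any
                yield parse_line(line)
```

Because read_scores is a generator, a loop such as "for record in read_scores('clueweb09B.sentiment.bz2')" processes the file in a streaming fashion and never holds the full 2GB of decompressed text in memory.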