Sentimentality and Polarity for the ClueWeb09 Dataset

A. Gural Vural1, B. Barla Cambazoglu2, Pinar Karagoz3

1gural@ceng.metu.edu.tr, Middle East Technical University, Ankara, Turkey

2barla@yahoo-inc.com, Yahoo! Labs, Barcelona, Spain

3karagoz@ceng.metu.edu.tr, Middle East Technical University, Ankara, Turkey

This site provides the sentimentality and polarity scores for the English documents in the ClueWeb09-B dataset.

The ClueWeb09 dataset is a crawl of 1 billion Web pages made available for information retrieval research. The ClueWeb09-B dataset consists of the first 50 million English pages of ClueWeb09.

The method by which the sentimentality scores were computed is described here:

   Sentiment-focused web crawling, by Vural, A. G., Cambazoglu, B. B., and Senkul, P. (2012)

The score file can be downloaded here: clueweb09B.sentiment.bz2 (405MB). The file contains 44,218,678 lines in the following format:

   clueweb-docid sentimentality polarity spam-rank page-rank

Note 1: Scores for Wiki pages of ClueWeb09-B are currently excluded.
Note 2: The decompressed file is 2GB.
Note 3: Spam-rank scores are taken from Waterloo Spam Rankings.
Note 4: Page-rank scores are taken from the ClueWeb Wiki.
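
Since the decompressed file is about 2GB, it may be convenient to stream it directly from the compressed archive rather than unpack it first. The following is a minimal Python sketch of one way to do that. It assumes the five fields on each line are separated by whitespace and that all four scores can be read as floating-point numbers; neither assumption is confirmed by the description above. The names ScoreRecord, parse_line, and read_scores are hypothetical helpers introduced only for this illustration.

```python
import bz2
from typing import Iterator, NamedTuple


class ScoreRecord(NamedTuple):
    """One line of the score file (hypothetical container type)."""
    docid: str
    sentimentality: float
    polarity: float
    spam_rank: float
    page_rank: float


def parse_line(line: str) -> ScoreRecord:
    """Parse one line, assuming five whitespace-separated fields."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    docid, sentimentality, polarity, spam_rank, page_rank = fields
    # Assumption: every score parses as a float.
    return ScoreRecord(
        docid,
        float(sentimentality),
        float(polarity),
        float(spam_rank),
        float(page_rank),
    )


def read_scores(path: str) -> Iterator[ScoreRecord]:
    """Yield records one at a time from the bz2-compressed score file."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines, if any
                yield parse_line(line)
```

Iterating over read_scores("clueweb09B.sentiment.bz2") then processes one record at a time, so memory use stays small regardless of the file size.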