Sentimentality and Polarity for the ClueWeb09 Dataset
A. Gural Vural1, B. Barla Cambazoglu2, Pinar Karagoz3
1gural@ceng.metu.edu.tr, Middle East Technical University, Ankara, Turkey
3karagoz@ceng.metu.edu.tr, Middle East Technical University, Ankara, Turkey
This site provides the sentimentality and polarity scores for the English documents in the ClueWeb09-B dataset.
The ClueWeb09 dataset is a crawl of 1 billion Web pages made available for information retrieval research. The ClueWeb09-B dataset consists of the first 50 million English pages of ClueWeb09.
The method by which the sentimentality scores were computed is described here:
Sentiment-focused web crawling, by Vural, A. G., Cambazoglu, B. B., and Senkul, P. (2012)
The score file can be downloaded here: clueweb09B.sentiment.bz2 (405MB)
The file has 44,218,678 lines, each in the following format:
clueweb-docid sentimentality polarity spam-rank page-rank
Note 1: Scores for Wiki pages of ClueWeb09-B are currently excluded.
Note 2: The decompressed file is 2GB.
Note 3: Spam-rank scores are taken from Waterloo Spam Rankings.
Note 4: Page-rank scores are taken from the Clueweb Wiki.
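As an illustration of how the score file might be read, the following Python sketch streams the compressed file and parses each line into the five fields listed above. It assumes that the fields are separated by whitespace and that the four scores can be read as floating-point numbers; neither is stated on this page. The names ScoreRecord, parse_line, and iter_scores are hypothetical, and the values in the sample line are made up.

```python
import bz2
from typing import Iterator, NamedTuple


class ScoreRecord(NamedTuple):
    """One line of the score file (field order follows the format above)."""
    docid: str
    sentimentality: float
    polarity: float
    spam_rank: float
    page_rank: float


def parse_line(line: str) -> ScoreRecord:
    """Parse one line into a ScoreRecord.

    Assumption: fields are whitespace-separated and all four scores
    are numeric. Lines that do not have exactly five fields are rejected.
    """
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    docid, *numbers = fields
    return ScoreRecord(docid, *(float(x) for x in numbers))


def iter_scores(path: str) -> Iterator[ScoreRecord]:
    """Yield records from the bz2-compressed file without decompressing it to disk."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines, if any
                yield parse_line(line)


# Made-up sample line, only to show the call; real values will differ.
sample = parse_line("clueweb09-en0000-00-00000 0.5 -0.25 70 0.001")
```

Reading through bz2.open processes the file line by line, so the 2GB decompressed file never has to be written out. A call such as iter_scores("clueweb09B.sentiment.bz2") could then be filtered or aggregated as needed.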