CS464 Project

Project topic This semester, all projects will be based on movie analysis and movie tag prediction. The movie tags will be obtained from the tag-genome dataset, which provides a number of movie names and their associated tags. Each project group is expected to define a particular task based on the tag-genome dataset. Some example project topics are as follows:

supervised tag prediction, where manually assigned tags for the movies in the training set are used to predict tags for novel movies (i.e. those in the test set),
clustering, where predefined tags are assigned to movies without any training samples, by analysing the contents of the movies and similarity across the movies,
tag discovery, where the task is to find novel tags using textual data (see below) for a large set of movies, and then to assign these tags to the target movies from the tag-genome dataset,
noisily-supervised tag prediction, where noisy and incomplete movie tag information may be available for the training set.

In exceptional cases, using a different (but related) dataset instead of the tag-genome dataset may be allowed. Overall, we expect novelty in your projects. Brainstorm and do your research so that you find a project that you will work on with enthusiasm and will be proud to put on your homepage/resume/blog.

Data type The tag-genome dataset does not provide any information about the movie contents. Therefore, you need to collect/use data to model movies in terms of their contents. For example, you may use one or more of the following data:

movie posters (image),
movie-related images collected automatically from the web (image),
movie frames (video),
script or subtitles (text),
per-user movie ratings (as in collaborative filtering).

Of course, any input data that would render the project task trivial shall be avoided. For example, while pre-training an unsupervised language model on articles is perfectly fine, extracting movie-specific features from the corresponding Wikipedia articles would be unacceptable, as the Wikipedia articles may explicitly state similar tags for the target movies.

Data collection You are required to find/collect data yourself, as needed for your project. Since data collection can be fun yet time-consuming, you may propose to collect data for a subset of the movies in the tag-genome dataset. Similarly, you will probably want to choose a subset of the tags, as some tags may be too specific, too noisy, etc. Also bear in mind the computational resources that you may need for data processing. Make sure you collect (and share) data using legal sources, and learn about “fair use”.
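One possible way to choose a subset of tags is to filter them by how many movies they appear on. The sketch below is only an illustration: it assumes the tags have already been loaded into a dictionary mapping each movie to a list of tag strings, and the function name `select_tags`, the thresholds `min_movies` and `max_fraction`, and the toy data are all hypothetical rather than part of the tag-genome dataset.

```python
from collections import Counter


def select_tags(tags_by_movie, min_movies=2, max_fraction=0.8):
    """Keep tags found on at least `min_movies` movies but on no more
    than `max_fraction` of all movies (both thresholds are assumptions)."""
    counts = Counter()
    for tags in tags_by_movie.values():
        counts.update(set(tags))  # count each tag at most once per movie
    limit = max_fraction * len(tags_by_movie)
    return sorted(t for t, c in counts.items() if min_movies <= c <= limit)


# Hypothetical toy data; real tags would come from the tag-genome dataset.
toy = {
    "movie_a": ["dark", "movie", "twist ending"],
    "movie_b": ["dark", "movie"],
    "movie_c": ["funny", "movie"],
    "movie_d": ["funny", "movie", "seen on a plane"],
}

kept = select_tags(toy)
print(kept)
```

Here very rare tags (likely too specific) and a tag attached to every movie (uninformative) are dropped. Suitable thresholds depend on the project and should be justified in the proposal.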

Project groups Each group is required to consist of exactly four members (there may be very few exceptions due to class size). Students without a proposed group membership will be randomly assigned to new project groups.

Deliverables (to be uploaded to Moodle)

Group proposal
Project proposal Each project group is required to send a one-page project proposal. Minor changes in the proposed project may be allowed throughout the semester, but you are required to get confirmation from your TA or me.
Progress report Each group will prepare a progress report.
Final report Each group will prepare a final report.
Code and data Together with the final report, the code and data will also be submitted by each group.

Deadlines are announced on Moodle (see the Key Dates page).

Implementation

You are allowed (and encouraged) to learn and use mainstream machine learning and deep learning libraries (TensorFlow, Torch, Caffe, Keras, Theano, MatConvNet, scikit-learn, etc. and other useful libraries like NLTK) as part of your project. You may still need to implement certain models from scratch.
Visual tools, like WEKA GUI, are not allowed without prior permission (you need to have a very good reason).

Experimental evaluation As will be discussed in class, experimental evaluation is a critically important part of machine learning research and practice. Therefore, each project report (progress and final) should contain a carefully prepared and detailed experimental analysis section. Some required but not sufficient experimental analysis guidelines are provided below:

Training and test set should not overlap for the supervised models (will be discussed in class).
The distribution of examples across the tag classes should be reported. Ideally, examples should be chosen such that the tags are balanced in the train/val/test sets.
For supervised tag prediction, the performance shall be analysed on a constant test set, with a varying number of training+validation examples.
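The guidelines above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a required procedure: the helper names (`split_by_movie`, `tag_distribution`, `training_subsets`), the 20% test fraction, and the toy data are hypothetical, and tags are assumed to be available as a dictionary from movie ID to a list of tag strings.

```python
import random
from collections import Counter


def split_by_movie(movie_ids, test_fraction=0.2, seed=0):
    """Shuffle unique movie IDs once and return disjoint (train_pool, test) lists."""
    ids = sorted(set(movie_ids))  # de-duplicate so no movie lands in both sets
    rng = random.Random(seed)     # fixed seed keeps the test set constant across runs
    rng.shuffle(ids)
    n_test = max(1, int(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]


def tag_distribution(movie_ids, tags_by_movie):
    """Count how many movies in the given split carry each tag."""
    counts = Counter()
    for m in movie_ids:
        counts.update(set(tags_by_movie[m]))
    return counts


def training_subsets(train_pool, sizes):
    """Yield nested training(+validation) subsets of increasing size."""
    for n in sizes:
        yield train_pool[:n]


# Hypothetical toy data: 20 movies, each carrying one of two tags.
tags_by_movie = {i: (["dark"] if i % 2 else ["funny"]) for i in range(20)}

train_pool, test_ids = split_by_movie(tags_by_movie.keys())
assert set(train_pool).isdisjoint(test_ids)  # training and test sets do not overlap

print("test tag distribution:", dict(tag_distribution(test_ids, tags_by_movie)))
subsets = list(training_subsets(train_pool, [4, 8, 16]))
for subset in subsets:
    # A model would be trained on `subset` here and always evaluated on `test_ids`.
    print(len(subset), dict(tag_distribution(subset, tags_by_movie)))
```

Because the test set is fixed before any training subset is drawn, results for different training sizes remain comparable. This simple random split does not balance the tags; a stratified split would be needed for that.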

Important Notes

Inter-group collaboration is not allowed without prior permission (again, you need to have a very good reason).
If you have taken or if you are currently taking a related course, make sure there is no overlap in projects (or homework submissions) between your work in another course and CS464.
The aforementioned or any other type of plagiarism will not be tolerated. See this link for more information.
Project reports should be treated as scientific papers and should therefore be prepared professionally. Again, consider these as materials that you will want to present when making job/MSc/PhD applications.
The final report can be an extension of your progress report. However, it should include a section that summarizes your progress since the progress report.
Project grading will depend on several criteria, including novelty and interestingness of the project topic and the proposed solution, proper machine learning practice, amount of work done, and the quality of reports and presentations.

We reserve the right to make changes in the course content.