This lesson is based on materials developed and made available by Tammy Piermann of Springfield Twp High School as part of her CS Principles course. The data sets were the result of an NSF-funded collaboration between Tammy and Slobodan Vucetic of Drexel University.
The data sets we are using are made available through the generosity of Steve Glassman of the Compaq Systems Research Center. The data were gathered as part of a late-1990s research project.
One of the growing trends in the Big Data field is that more and more organizations, particularly government agencies, are putting datasets online and making them available under open source licenses.
The U.S. government, NYC, and other cities have sponsored contests encouraging developers to write software that uses these data sets. In fact, NYC is running its BigApps 3.0 contest.
The data sets we are using were gathered as part of a research project conducted by Digital Equipment Corporation. The project gathered data from users about their movie preferences. There are three data files:
Source: Details about the data files are taken from here.
Person data (one record per user):
ID: Number -- primary key
Age: Number
Gender: Text -- one of "M", "F"
Zip_Code: Text

Movie data (one record per movie):
ID: Number -- primary key
Name: Text
PR_URL: Text -- URL of studio PR site
IMDb_URL: Text -- URL of Internet Movie Database entry
Theater_Status: Text -- either "old" or "current"
Theater_Release: Date/Time
Video_Status: Text -- either "old" or "current"
Video_Release: Date/Time
Action, Animation, Art_Foreign, Classic, Comedy, Drama, Family, Horror, Romance, Thriller: Yes/No

Vote data (one record per rating of a movie by a person):
Person_ID: Number
Movie_ID: Number
Score: Number -- 0 <= Score <= 1
Weight: Number -- 0 < Weight <= 1
Modified: Date/Time
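To make the stated ranges for Score and Weight concrete, here is a minimal Python sketch of a single rating record. The class name `Vote` and the lowercase attribute names are illustrative choices, not something defined by the data files; only the field list and the two range constraints come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Vote:
    """One rating record (hypothetical class, mirrors the field list above)."""
    person_id: int      # refers to a person's ID
    movie_id: int       # refers to a movie's ID
    score: float        # 0 <= score <= 1
    weight: float       # 0 < weight <= 1
    modified: datetime  # when the rating was last changed

    def __post_init__(self):
        # Reject values outside the ranges given in the field list.
        if not 0 <= self.score <= 1:
            raise ValueError(f"Score out of range: {self.score}")
        if not 0 < self.weight <= 1:
            raise ValueError(f"Weight out of range: {self.weight}")
```

A record such as `Vote(1, 2, 0.8, 1.0, datetime(1997, 1, 1))` is accepted, while a Score of 1.2 or a Weight of 0 raises `ValueError`.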
A movie's score is the rating provided by this person for this movie. The zero-to-five star rating used externally on EachMovie is mapped linearly to the interval [0,1]. Here's a histogram of the Score values:
Score Count
0 347191
0.2 150495
0.4 339718
0.6 701236
0.8 761676
1.0 511667
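The linear mapping between stars and Score, and the share of each rating in the histogram, can be worked out with a few lines of Python. This is only a sketch: the names `SCORE_COUNTS`, `score_to_stars`, and `stars_to_score` are made up for illustration, and the counts are copied from the table above.

```python
# Histogram copied from the table above: Score -> Count.
SCORE_COUNTS = {
    0.0: 347191,
    0.2: 150495,
    0.4: 339718,
    0.6: 701236,
    0.8: 761676,
    1.0: 511667,
}


def stars_to_score(stars):
    """Map a zero-to-five star rating linearly onto [0, 1]."""
    return stars / 5


def score_to_stars(score):
    """Invert the mapping; round() absorbs floating-point noise."""
    return round(score * 5)


total = sum(SCORE_COUNTS.values())  # 2811983 ratings in all

# Print each star level with its count and its share of all ratings.
for score, count in SCORE_COUNTS.items():
    print(f"{score_to_stars(score)} stars  {count:>7}  {count / total:6.1%}")
```

Note that the rows with a Score of 0 include both zero-star ratings and "sounds awful" declarations; the histogram alone does not separate them.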
Weight is only relevant in the case of a Score of zero, in which case it distinguishes whether the person rated a movie as zero stars (weight = 1) or "sounds awful" (weight < 1). (Most "sounds awful" weights are 0.2, but for historical reasons about 10% are 0.5.) The idea behind "sounds awful" was to let a user indicate he never planned to see a movie (hence we would omit it from future lists of predictions). Our collaborative filtering algorithm treated such a declaration as less authoritative than a regular rating of zero stars.
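The rule for reading Score and Weight together can be sketched as a small Python function. The function name `interpret_vote` and the label strings are hypothetical; the logic follows only what the paragraph above states.

```python
def interpret_vote(score, weight):
    """Return a readable label for one (Score, Weight) pair."""
    if score > 0:
        # Weight carries no extra meaning for nonzero scores.
        return f"{round(score * 5)} stars"
    if weight == 1:
        # A real rating of zero stars.
        return "0 stars"
    # Score is zero and weight < 1 (usually 0.2, sometimes 0.5).
    return "sounds awful"
```

For example, `interpret_vote(0, 1)` gives "0 stars", while `interpret_vote(0, 0.2)` and `interpret_vote(0, 0.5)` both give "sounds awful".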
Open each of the data files in a separate browser tab. You won't be editing the files, just browsing them and answering questions about them.
Answer each of the following questions: