CS Principles
This activity focuses on Big Idea 3: Big Data Sets. It is
associated with the following learning objectives:
- The student can use computers to process information to gain insight and knowledge.
- The student can communicate how computer programs are used to process information to gain insight and knowledge.
- The student can use computing to facilitate exploration and the discovery of connections in information.
- The student can use large datasets to explore and discover information and knowledge.
- The student can analyze the considerations involved in the computational manipulation of information.
Credits
This lesson is based on materials developed and made available
by Tammy Pirmann of Springfield Twp High School as part of
her CS Principles course. The data sets were the result of an NSF-funded
collaboration between Tammy and Slobodan Vucetic of Temple University.
The data sets we are using are made available through the generosity of
Steve Glassman of the Compaq Systems Research Center. The data were
gathered as part of a late-1990s research project.
Introduction
The data sets we are using were gathered as part of a research
project conducted by Digital Equipment Corporation. The project gathered data
from users about their movie preferences. There are three data
files.
Overview of the Data
Source: Details about the data are taken from here.
- The Person.txt
file provides optional, unaudited demographic data supplied by each
person. It contains 72,916 records giving the following information
about each user:
ID: Number -- primary key
Age: Number
Gender: Text -- one of "M", "F"
Zip_Code: Text
- The Movie.txt file
provides descriptive information about each movie. It contains the
following information for 1,628 different movies:
ID: Number -- primary key
Name: Text
PR_URL: Text -- URL of studio PR site
IMDb_URL: Text -- URL of Internet Movie Database entry
Theater_Status: Text -- either "old" or "current"
Theater_Release: Date/Time
Video_Status: Text -- either "old" or "current"
Video_Release: Date/Time
Action, Animation, Art_Foreign, Classic, Comedy, Drama, Family, Horror, Romance, Thriller: Yes/No
- The vote.txt file (too big to download) is the actual rating
data. It contains 2,811,983 user recommendations, each of which
contains the following information:
Person_ID: Number
Movie_ID: Number
Score: Number -- 0 <= Score <= 1
Weight: Number -- 0 < Weight <= 1
Modified: Date/Time
A movie's score is the rating provided by this person for this movie. The
zero-to-five star rating used externally on EachMovie is mapped
linearly to the interval [0,1]. Here's a histogram of the Score
values:
Score Count
0 347191
0.2 150495
0.4 339718
0.6 701236
0.8 761676
1.0 511667
In other words, voters were asked to rate movies from awful (0) to great (1),
with 4 intermediate ratings.
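The linear mapping and the histogram above can be checked with a short Python sketch. The function name `stars_to_score` is ours and is not part of the data set; the counts are copied directly from the histogram. The total should match the number of records reported for vote.txt.

```python
# Counts copied from the histogram above, keyed by Score.
SCORE_COUNTS = {
    0.0: 347191,
    0.2: 150495,
    0.4: 339718,
    0.6: 701236,
    0.8: 761676,
    1.0: 511667,
}


def stars_to_score(stars):
    """Map a 0-to-5 star rating linearly onto [0, 1] (hypothetical helper)."""
    if stars not in range(6):
        raise ValueError("stars must be a whole number from 0 to 5")
    return stars / 5


# Adding up every row of the histogram gives the number of votes.
total_votes = sum(SCORE_COUNTS.values())
print("Total votes:", total_votes)

# Show each star rating, its mapped Score, and how many votes it received.
for stars in range(6):
    score = stars_to_score(stars)
    print(stars, "stars ->", score, "votes:", SCORE_COUNTS[score])
```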
Exercises: In-class and Homework
Open each of the data files in separate browser tabs. You won't be
editing the files, just browsing them and answering questions about them.
Answer each of the following questions:
- Which scores (0 through 1) received the most and least votes according to the above histogram?
- What percentage of votes gave the highest rating, a 1?
- Open the Person.txt file. What information is being
gathered about each person? Is it optional?
- Given the Person data file, describe an algorithm for determining
what percentage of the raters were male and female. Would it add up
to 100%? Why or why not?
- Would it be possible to identify a person using just the data
in this file? If not, how much additional information would be
necessary before it became possible to personally identify some of the
participants?
- Given the data contained in these files, which of the following questions
might you be able to answer? Explain, briefly, how or why not.
- Do males and females like the same kinds of movies?
- What age cohort is most interested in Animation movies?
- Is there any difference in movie preference between the West Coast and the East Coast?
- In general, do ratings systems such as these benefit consumers
or the movie industry?
- Assuming that ratings are gathered through a website, do you see any
way that one's ratings could be associated with an individual? a household?
- Generally speaking, I think rating systems of this sort are good (bad).
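Several of the questions above come down to counting records in one file or combining files on their ID fields. The following is a minimal Python sketch of both ideas, not an answer key. The sample records are invented for illustration and are not taken from the data set, the function names are ours, and an empty Gender is only an assumed way that a skipped answer might appear in Person.txt.

```python
from collections import Counter, defaultdict

# Invented sample records (NOT real data), using field names from the
# file descriptions. An empty Gender is an ASSUMED encoding for a person
# who chose not to answer.
people = [
    {"ID": 1, "Gender": "M"},
    {"ID": 2, "Gender": "F"},
    {"ID": 3, "Gender": ""},
    {"ID": 4, "Gender": "F"},
]
movies = [
    {"ID": 10, "Name": "Movie A", "Animation": True},
    {"ID": 11, "Name": "Movie B", "Animation": False},
]
votes = [
    {"Person_ID": 1, "Movie_ID": 10, "Score": 0.8},
    {"Person_ID": 2, "Movie_ID": 10, "Score": 1.0},
    {"Person_ID": 4, "Movie_ID": 10, "Score": 0.0},
    {"Person_ID": 3, "Movie_ID": 10, "Score": 0.6},
    {"Person_ID": 1, "Movie_ID": 11, "Score": 0.2},
]


def gender_percentages(people):
    """Percentage of all records with each Gender value, blanks included."""
    counts = Counter(p["Gender"] or "unknown" for p in people)
    total = len(people)
    return {gender: 100 * n / total for gender, n in counts.items()}


def mean_score_by_gender(people, movies, votes, genre):
    """Average Score per Gender for movies flagged with the given genre."""
    # Look-up tables built from the Person and Movie records.
    gender_of = {p["ID"]: p["Gender"] for p in people}
    in_genre = {m["ID"] for m in movies if m[genre]}

    # Join each vote to its person and movie through the two ID fields.
    scores = defaultdict(list)
    for v in votes:
        gender = gender_of.get(v["Person_ID"])
        if gender and v["Movie_ID"] in in_genre:
            scores[gender].append(v["Score"])
    return {gender: sum(s) / len(s) for gender, s in scores.items()}


print(gender_percentages(people))
print(mean_score_by_gender(people, movies, votes, "Animation"))
```

Note that the second function silently drops votes from people whose Gender is blank. Whether that is the right choice is part of what the questions above ask you to consider.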