CPSC 110-08: Computing on Mobile Phones
Spring 2012

Big Data

CS Principles

This activity focuses on Big Idea 3: Big Data Sets. It is associated with the following learning objectives:

Credits

This lesson is based on materials developed and made available by Tammy Pirmann of Springfield Twp High School as part of her CS Principles course. The data sets were the result of an NSF-funded collaboration between Tammy and Slobodan Vucetic of Temple University.

The data sets we are using are made available through the generosity of Steve Glassman of the Compaq Systems Research Center. The data were gathered as part of a late-1990s research project.

Introduction

The data sets we are using were gathered as part of a research project conducted by Digital Equipment Corporation. The project gathered data from users about their movie preferences. There are three data files:

Overview of the Data

Source: Details about the data are taken from here.

  1. The Person.txt file provides optional, unaudited demographic data supplied by each person. It contains 72,916 records giving the following information about each user:
              ID: Number -- primary key 
              Age: Number 
              Gender: Text -- one of "M", "F" 
              Zip_Code: Text 
    

  2. The Movie.txt file provides descriptive information about each movie. It contains the following information for 1628 different movies:
              ID: Number -- primary key 
              Name: Text 
              PR_URL: Text -- URL of studio PR site 
              IMDb_URL: Text -- URL of Internet Movie Database entry 
              Theater_Status: Text -- either "old" or "current" 
              Theater_Release: Date/Time 
              Video_Status: Text -- either "old" or "current" 
              Video_Release: Date/Time 
              Action, Animation, Art_Foreign, Classic, Comedy, Drama, Family, Horror, Romance, Thriller: Yes/No 
    

  3. The vote.txt file (too big to download) is the actual rating data. It contains 2,811,983 user recommendations, each of which contains the following information:
              Person_ID: Number 
              Movie_ID: Number 
              Score: Number -- 0 <= Score <= 1 
              Weight: Number -- 0 < Weight <= 1
              Modified: Date/Time 
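
The vote record layout above can be sketched as a simple Python type. This is only an illustration: the class name `Vote`, the helper `validate_vote`, and the sample values are hypothetical and are not part of the data set. The field names and the allowed ranges for Score and Weight follow the listing above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Vote:
    """One vote record; field names follow the listing above."""
    person_id: int      # matches the ID of a Person record
    movie_id: int       # matches the ID of a Movie record
    score: float        # 0 <= score <= 1
    weight: float       # 0 < weight <= 1
    modified: datetime


def validate_vote(vote: Vote) -> bool:
    """Return True if score and weight fall in the documented ranges."""
    return 0 <= vote.score <= 1 and 0 < vote.weight <= 1


# Hypothetical sample values, invented for illustration only.
sample = Vote(person_id=1, movie_id=2, score=0.8, weight=1.0,
              modified=datetime(1997, 1, 1))
print(validate_vote(sample))
```

Because Person_ID and Movie_ID match the primary keys of the other two files, a vote can be connected to the person who cast it and to the movie it rates.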
    

The Score is the rating that this person gave this movie. The zero-to-five star rating used externally on EachMovie is mapped linearly to the interval [0,1]. Here's a histogram of the Score values:

          Score   Count
          0       347191
          0.2     150495
          0.4     339718
          0.6     701236
          0.8     761676
          1.0     511667

In other words, voters were asked to rate movies from awful (0) to great (1), with four intermediate ratings.
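
The linear mapping and the histogram can be checked with a few lines of Python. This is a minimal sketch, assuming the external rating is a whole number of stars from 0 to 5; the function name `stars_to_score` is hypothetical and chosen only for illustration.

```python
def stars_to_score(stars: int) -> float:
    """Map a 0-5 star rating linearly onto the interval [0, 1]."""
    if not 0 <= stars <= 5:
        raise ValueError("stars must be between 0 and 5")
    return stars / 5


# Counts copied from the histogram above, keyed by star rating (assumed 0-5).
histogram = {0: 347191, 1: 150495, 2: 339718, 3: 701236, 4: 761676, 5: 511667}

# The counts should add up to the number of records in vote.txt.
total_votes = sum(histogram.values())
print(total_votes)  # 2811983

# Show each star rating's score next to its count.
for stars, count in histogram.items():
    print(stars_to_score(stars), count)
```

The total matches the 2,811,983 recommendations reported for vote.txt, which suggests that every vote appears in the histogram exactly once.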

Exercises: In-class and Homework

Open each of the data files in separate Browser tabs. You won't be editing the files, just browsing them and answering questions about them.

Answer each of the following questions:

  1. Which scores (0 through 1) received the most and least votes according to the above histogram?

  2. What percentage of the votes gave the highest rating, a 1?

  3. Open the Person.txt file. What information is being gathered about each person? Is it optional?

  4. Given the Person data file, describe an algorithm for determining what percentage of the raters were male and what percentage were female. Would the two percentages add up to 100%? Why or why not?

  5. Would it be possible to identify a person using just the data in this file? If not, how much additional information would be necessary before it became possible to personally identify some of the participants?

  6. Given the data contained in these files, which of the following questions might you be able to answer? Explain briefly how, or why not.
    1. Do males and females like the same kinds of movies?
    2. What age cohort is most interested in Animation movies?
    3. Is there any difference in movie preference between the West coast and the East coast?

  7. In general, do ratings systems such as these benefit consumers or the movie industry?

  8. Assuming that ratings are gathered through a web site, do you see any way that one's ratings could be associated with an individual? With a household?

  9. Generally speaking, I think rating systems of this sort are good (bad).