110 HW 10

CPSC 110-08: Computing on Mobile Phones
Spring 2012

Big Data

CS Principles

This activity focuses on Big Idea 3: Big Data Sets. It is associated with the following learning objectives:

The student can use computers to process information to gain insight and knowledge.
The student can communicate how computer programs are used to process information to gain insight and knowledge.
The student can use computing to facilitate exploration and the discovery of connections in information.
The student can use large datasets to explore and discover information and knowledge.
The student can analyze the considerations involved in the computational manipulation of information.

Credits

This lesson is based on materials developed and made available by Tammy Pirmann of Springfield Twp High School as part of her CS Principles course. The data sets were the result of an NSF-funded collaboration between Tammy and Slobodan Vucetic of Temple University.

The data sets we are using are made available through the generosity of Steve Glassman of the Compaq Systems Research Center. The data were gathered as part of a late-1990s research project.

Introduction

The data sets we are using were gathered as part of a research project conducted by Digital Equipment Corporation, which collected data from users about their movie preferences. There are three data files: Person.txt, Movie.txt, and vote.txt.

Overview of the Data

Source: Details about the data are taken from here.

The Person.txt file provides optional, unaudited demographic data supplied by each person. It contains 72,916 records giving the following information about each user:
```
          ID: Number -- primary key 
          Age: Number 
          Gender: Text -- one of "M", "F" 
          Zip_Code: Text 
```
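As a minimal sketch of how one Person record could be represented in a program, the Python below parses a single line into a typed record. The tab separator, the field order, the rule that a blank field means "not supplied", and the sample line are all assumptions made for illustration; check the actual Person.txt before relying on any of them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    id: int                   # primary key
    age: Optional[int]        # demographic data is optional, so it may be missing
    gender: Optional[str]     # "M" or "F" when supplied
    zip_code: Optional[str]   # kept as text so leading zeros survive


def parse_person(line: str, sep: str = "\t") -> Person:
    """Parse one record; the separator and field order are assumed, not confirmed."""
    fields = [f.strip() for f in line.rstrip("\n").split(sep)]
    fields += [""] * (4 - len(fields))  # pad short lines so missing fields read as blank
    raw_id, raw_age, raw_gender, raw_zip = fields[:4]
    return Person(
        id=int(raw_id),
        age=int(raw_age) if raw_age else None,
        gender=raw_gender or None,
        zip_code=raw_zip or None,
    )


# Made-up sample line, not taken from the real file.
print(parse_person("1\t25\tF\t06106"))
```

Zip_Code is stored as text rather than as a number because a numeric type would drop the leading zero of a code such as 06106.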

The Movie.txt file provides descriptive information about each movie. It contains the following information for 1,628 different movies:

```
          ID: Number -- primary key
          Name: Text
          PR_URL: Text -- URL of studio PR site
          IMDb_URL: Text -- URL of Internet Movie Database entry
          Theater_Status: Text -- either "old" or "current"
          Theater_Release: Date/Time
          Video_Status: Text -- either "old" or "current"
          Video_Release: Date/Time
          Action, Animation, Art_Foreign, Classic, Comedy, Drama, Family, Horror, Romance, Thriller: Yes/No
```
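The ten Yes/No genre fields can be collapsed into a single list of genres per movie. The sketch below assumes a movie record has already been read into a dictionary keyed by field name and that the flags are stored as the text "Yes" or "No"; both are assumptions, and the example record is invented.

```python
GENRES = ["Action", "Animation", "Art_Foreign", "Classic", "Comedy",
          "Drama", "Family", "Horror", "Romance", "Thriller"]


def genres_of(movie: dict) -> list:
    """Return the genre names whose flag is "Yes" (assumed text encoding)."""
    return [g for g in GENRES if str(movie.get(g, "No")).strip().lower() == "yes"]


# Hypothetical record for illustration only; not taken from Movie.txt.
example = {"ID": 1, "Name": "Example Movie",
           "Comedy": "Yes", "Romance": "Yes", "Drama": "No"}
print(genres_of(example))  # ['Comedy', 'Romance']
```

A movie can carry several genre flags at once, which is why the function returns a list rather than a single value.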

The vote.txt file (too big to download) is the actual rating data. It contains 2,811,983 user recommendations, each of which contains the following information:

```
          Person_ID: Number
          Movie_ID: Number
          Score: Number -- 0 <= Score <= 1
          Weight: Number -- 0 < Weight <= 1
          Modified: Date/Time
```
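One question these records can answer is how well each movie was rated on average. The sketch below groups votes by Movie_ID and averages the Score values. It works on (person_id, movie_id, score) tuples rather than on the file itself, because the file's layout is not described here, and it ignores Weight and Modified because their meaning is not explained in this handout. The sample votes are made up.

```python
from collections import defaultdict


def average_scores(votes):
    """votes: iterable of (person_id, movie_id, score); returns {movie_id: mean score}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _person_id, movie_id, score in votes:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
        totals[movie_id] += score
        counts[movie_id] += 1
    return {m: totals[m] / counts[m] for m in totals}


# Made-up votes, not real data.
sample = [(1, 10, 0.8), (2, 10, 0.6), (1, 11, 1.0)]
print(average_scores(sample))
```

Because the function makes a single pass and keeps only two running numbers per movie, the same approach would scale to all 2,811,983 votes without holding them in memory at once.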

Score is the rating that this person gave this movie. The zero-to-five star rating used externally on EachMovie is mapped linearly to the interval [0,1]. Here's a histogram of the Score values:

```
          Score   Count
          0       347191
          0.2     150495
          0.4     339718
          0.6     701236
          0.8     761676
          1.0     511667
```

In other words, voters were asked to rate movies from awful (0) to great (1), with four intermediate ratings.
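The linear mapping and the histogram can be checked with a few lines of Python. The sketch below assumes the mapping is simply stars divided by 5, which is what "mapped linearly" from zero-to-five stars onto [0,1] implies, and it keys the histogram counts by star rating on that basis. The function and variable names are illustrative only.

```python
def stars_to_score(stars: int) -> float:
    """Map a 0-5 star rating linearly onto [0, 1] (assumed to be stars / 5)."""
    if not 0 <= stars <= 5:
        raise ValueError("stars must be between 0 and 5")
    return stars / 5


# Counts copied from the histogram above, keyed by star rating.
counts_by_stars = {0: 347191, 1: 150495, 2: 339718,
                   3: 701236, 4: 761676, 5: 511667}

total_votes = sum(counts_by_stars.values())
mean_score = sum(stars_to_score(s) * n
                 for s, n in counts_by_stars.items()) / total_votes

print(total_votes)           # 2811983, matching the record count given for vote.txt
print(round(mean_score, 3))
```

The total confirms that the histogram accounts for every record in vote.txt, and the mean gives a one-number summary of how generous the raters were overall.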

Exercises: In-class and Homework

Open each of the data files in a separate browser tab. You won't be editing the files, just browsing them and answering questions about them.

Answer each of the following questions: