110 HW 10

CPSC 110-08: Computing with Mobile Phones

Big Data

CS Principles

This activity focuses on Big Idea 3: Big Data Sets. It is associated with the following learning objectives:

The student can use computers to process information to gain insight and knowledge.
The student can communicate how computer programs are used to process information to gain insight and knowledge.
The student can use computing to facilitate exploration and the discovery of connections in information.
The student can use large datasets to explore and discover information and knowledge.
The student can analyze the considerations involved in the computational manipulation of information.

Credits

This lesson is based on materials developed and made available by Tammy Piermann of Springfield Twp High School as part of her CS Principles course. The data sets were the result of an NSF-funded collaboration between Tammy and Slobodan Vucetic of Drexel University.

The data sets we are using are made available through the generosity of Steve Glassman of the Compaq Systems Research Center. The data were gathered as part of a late-1990s research project.

Open Data Contest

One of the growing trends in the Big Data field is that more and more organizations, particularly government agencies, are putting data sets online and making them available under open licenses.

The U.S. government, NYC, and other cities have sponsored contests encouraging developers to write software that uses these data sets. In fact, NYC is running its BigApps 3.0 contest.

Introduction

The data sets we are using were gathered as part of a research project conducted by Digital Equipment Corporation. They record users' movie preferences. There are three data files: Person.txt, Movie.txt, and Vote.txt.

Overview of the Data

Source: Details about the data are taken from here.

The Person.txt file provides optional, unaudited demographic data supplied by each person. It contains 72,916 records giving the following information about each user:
```
          ID: Number -- primary key 
          Age: Number 
          Gender: Text -- one of "M", "F" 
          Zip_Code: Text 
```

The Movie.txt file provides descriptive information about each movie. It contains the following information for 1,628 different movies:

          ID: Number -- primary key 
          Name: Text 
          PR_URL: Text -- URL of studio PR site 
          IMDb_URL: Text -- URL of Internet Movie Database entry 
          Theater_Status: Text -- either "old" or "current" 
          Theater_Release: Date/Time 
          Video_Status: Text -- either "old" or "current" 
          Video_Release: Date/Time 
          Action, Animation, Art_Foreign, Classic, Comedy, Drama, Family, Horror, Romance, Thriller: Yes/No

The Vote.txt file is the actual rating data. It contains 2,811,983 user recommendations, each of which contains the following information:

          Person_ID: Number 
          Movie_ID: Number 
          Score: Number -- 0 <= Score <= 1 
          Weight: Number -- 0 < Weight <= 1 
          Modified: Date/Time
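The field constraints above can be written down as a small record type. This is only a sketch: the class name `Vote`, the Python field names, and the decision to keep `Modified` as raw text are assumptions made for illustration, since the handout does not say how the file is delimited or how dates are formatted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    """One rating record, mirroring the five fields listed above (hypothetical type)."""
    person_id: int
    movie_id: int
    score: float
    weight: float
    modified: str  # kept as raw text; the Date/Time format is not specified

    def __post_init__(self):
        # Enforce the documented ranges: 0 <= Score <= 1 and 0 < Weight <= 1.
        if not 0 <= self.score <= 1:
            raise ValueError(f"Score out of range: {self.score}")
        if not 0 < self.weight <= 1:
            raise ValueError(f"Weight out of range: {self.weight}")
```

Building a `Vote` with a weight of 0 or a score above 1 raises `ValueError`, which is one way to catch malformed records before analyzing them.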

A movie's score is the rating provided by this person for this movie. The zero-to-five star rating used externally on EachMovie is mapped linearly to the interval [0,1]. Here's a histogram of the Score values:

          Score   Count
          0       347191
          0.2     150495
          0.4     339718
          0.6     701236
          0.8     761676
          1.0     511667
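As a sanity check on the histogram, the sketch below copies the counts from the table, confirms that they add up to the stated number of records, and converts each Score back to a star rating. The function and variable names are illustrative, not part of the data set.

```python
# Histogram copied from the table above: Score value -> number of votes.
SCORE_COUNTS = {
    0.0: 347191,
    0.2: 150495,
    0.4: 339718,
    0.6: 701236,
    0.8: 761676,
    1.0: 511667,
}

TOTAL_VOTES = 2811983  # record count stated for the vote file


def stars_to_score(stars):
    """Map a zero-to-five star rating linearly onto [0, 1]."""
    return stars / 5


def score_to_stars(score):
    """Invert the linear mapping: 0 -> 0 stars, 1 -> 5 stars."""
    return round(score * 5)


# The six buckets account for every record in the file.
assert sum(SCORE_COUNTS.values()) == TOTAL_VOTES

# Re-key the histogram by star rating instead of by Score.
# Note: the 0 bucket also includes "sounds awful" votes (see Weight below).
stars_histogram = {score_to_stars(s): n for s, n in SCORE_COUNTS.items()}
```

With the histogram keyed by stars, a question about a particular rating becomes a lookup in `stars_histogram` divided by `TOTAL_VOTES`.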

Weight is only relevant in the case of a Score of zero, in which case it distinguishes whether the person rated a movie as zero stars (weight = 1) or "sounds awful" (weight < 1). (Most "sounds awful" weights are 0.2, but for historical reasons about 10% are 0.5.) The idea behind "sounds awful" was to let a user indicate he never planned to see a movie (hence we would omit it from future lists of predictions). Our collaborative filtering algorithm treated such a declaration as less authoritative than a regular rating of zero stars.
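The rule for reading Weight can be sketched as a small function. The name `classify_vote` and the three labels are hypothetical, chosen for illustration; only the rule itself (weight = 1 means zero stars, weight < 1 means "sounds awful") comes from the description above.

```python
def classify_vote(score, weight):
    """Label a vote using the Score/Weight rule described above."""
    if score != 0:
        # Weight carries no extra meaning for nonzero scores.
        return "star rating"
    if weight == 1:
        return "zero stars"
    # Any weight below 1 (typically 0.2, sometimes 0.5) marks "sounds awful".
    return "sounds awful"


# Example records as (score, weight) pairs.
examples = [(0.8, 1), (0, 1), (0, 0.2), (0, 0.5)]
labels = [classify_vote(s, w) for s, w in examples]
```

Separating the two kinds of zero matters when counting ratings: a "sounds awful" vote means the user never planned to see the movie, so it is not an opinion formed by watching it.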

Exercises: In-class and Homework

Open each of the data files in a separate browser tab. You won't be editing the files, just browsing them and answering questions about them.

Answer each of the following questions:

Describe how the scores in the histogram (above) correspond to user ratings.
What are the most common ratings?
What percentage of the ratings were the highest rating?
Open the Person.txt file. What information is being gathered about each person? Is it optional?
Given the Person data file, describe an algorithm for determining what percentage of the raters were male and female. Would it add up to 100%? Why or why not?
Would it be possible to identify a person using just the data in this file? If not, how much additional information would be necessary before it became possible to personally identify some of the participants?
Open the Vote.txt file. Describe an algorithm for finding how many movies were rated by a single individual, given the individual's ID number.
Given the data contained in these files, which of the following questions might you be able to answer? Explain, briefly, how or why not.
1. Do males and females like the same kinds of movies?
2. What age cohort is most interested in Animation movies?
3. Is there any difference in movie preference between the West Coast and the East Coast?
In general, do you find movie ratings to be helpful to consumers? Why or why not?
In general, do you find movie ratings to be helpful to producers? Why or why not?
Assuming that ratings are gathered through a website, do you see any way that one's ratings could be associated with an individual? With a household?
Generally speaking, I think rating systems of this sort are good (bad).