Project 1 - MapReduce

Overview

CommonCrawl is a free, publicly accessible "crawl" of the web - that is, an archive of web content that has been downloaded and saved for future analysis. As of late 2011, their archive is 40TB in size, encompassing over 5 billion pages and documents. In this project, you will work in groups of 1-2 people to process a subset of the CommonCrawl dataset using MapReduce. But what portion of the data you process, and how you process it, is up to you!

Part 1 - Project Idea

Rather than assign a specific project goal, for this graduate class you must determine the specifics of your project and propose the details to me! Your project must meet the following requirements:

The data source is CommonCrawl - you don't need to download the web first!

You should process a "significant fraction" of the full data set, but the exact amount will vary by project and the complexity of the data processing required.

The data processing mechanism is MapReduce
The compute resources are Amazon Elastic MapReduce (i.e. Amazon EC2 instances running customized Hadoop software)

Your project must answer a specific question about the dataset. For example:

What are the top 100 keywords used in a website title? (Or link, description, etc.)
What percentage of pages uses AJAX, dynamic HTML, or <insert new web tech trend here>?
What languages are present in the crawl? (across all HTML pages, only in PDF documents, etc.)
What percentage of documents are labeled with the incorrect content type? (Web servers return a header field ("Content-Type") specifying what type of document is being provided, such as text/html, application/pdf, etc. But, this information could be wrong.)
What pages/documents are duplicated the most times? (e.g. at least 90% of the content on page A appears on 1500 other pages in the crawl)
Of the 5+ billion items in the crawl, are they mostly large files or small files? (i.e. a histogram of the document sizes)
What are the most common viruses / spyware / malware that were captured in the crawl? (Note that the crawl is deliberately unfiltered for this!)
And...
Many....
More...
Possible...
Ideas....
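
To illustrate how one of these questions maps onto the MapReduce model, here is a minimal sketch of the document-size histogram idea, run locally in a single process. It does not use Hadoop or any CommonCrawl library; the function names, the power-of-two bucketing, and the sample sizes are all assumptions made for illustration, not part of the assignment.

```python
from collections import defaultdict


def size_bucket(num_bytes):
    """Return the lower bound of the power-of-two bucket containing num_bytes."""
    if num_bytes <= 0:
        return 0
    return 1 << (num_bytes.bit_length() - 1)


def map_document(doc_size):
    """Map step: emit one (bucket, 1) pair per document."""
    yield size_bucket(doc_size), 1


def reduce_bucket(bucket, counts):
    """Reduce step: sum the counts emitted for one bucket."""
    return bucket, sum(counts)


def run_local(doc_sizes):
    """Simulate map, shuffle (group by key), and reduce in one process."""
    grouped = defaultdict(list)
    for size in doc_sizes:
        for key, value in map_document(size):
            grouped[key].append(value)
    return dict(reduce_bucket(k, v) for k, v in sorted(grouped.items()))


if __name__ == "__main__":
    # Hypothetical document sizes in bytes, not real crawl data.
    sample_sizes = [0, 500, 700, 1024, 1500, 40000]
    for bucket, count in run_local(sample_sizes).items():
        print(bucket, count)
```

In a real job, the map step would read each document's size from the crawl records and the framework would perform the grouping; only the map and reduce functions would carry over from this sketch.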

Submission instructions: Upload a 1-page PDF document to Sakai. This document should name the group members, describe your idea, and provide a timeline for the project in the form of a Gantt chart. Also, for two-person groups, describe the division of labor between group members.

Part 2 - Project Proposal

Once you have a rough idea of the question you want your project to answer, it is time to begin working on a complete project proposal. Writing a good proposal will almost certainly require you to begin work on the project itself, perhaps by doing a rough implementation of the algorithm and running it on a small subset of data.

Your proposal should be approximately 4 pages in length and include the following elements:

Introduction

What specific question about the dataset are you answering?
Why is this question important and interesting?
Why is this question relevant to the broader public Internet?

Algorithm Details

What specific algorithm(s) will you be implementing in order to accomplish the high-level goals described in your introduction?
What open-source tools do you intend to use to accelerate project development? (This is encouraged!)
How do these algorithms work?
Why did you choose them?

Infrastructure

Dataset: How much of the 40TB (60TB according to Amazon in 2012) CommonCrawl dataset do you intend to process? How did you arrive at this number?
Computation Resources: How many EC2 nodes (of what size?) will be needed to process the data in parallel? How many hours do you estimate it will take to run the analysis to completion? How did you arrive at this estimate?

(For full credit, this should be based on first-hand experimentation, not merely a random guess!)
(I'm assuming the small EC2 nodes will be sufficient, but if your project has significant memory requirements, be sure to describe them and explain why)

Cost: How much $$$ will this project cost to execute trial runs on a small data subset and do a final "production" run? How did you calculate this total cost?

Given that the entire class has been allocated $2000 in Amazon credits for all projects (not just this one!), does your proposal fall within the overall class budget?
Note that the CommonCrawl dataset is located in Amazon's US / Eastern region. You should run your analysis in the same region in order to avoid data transfer charges.
Note that you are charged by the hour for a compute node, even if you only use it for 5 minutes and then terminate it. So, don't be greedy and spin up 100 nodes for 5 minutes just to finish your job a bit faster: that is billed as 100 node-hours. 10 nodes each used for 50 minutes is billed as only 10 node-hours, which is far more cost-efficient.
Can you use Amazon Spot Instances instead of the conventional On Demand instances to lower the cost of doing your final "production" run? What are the tradeoffs with this technique? You can use the --bid-price X argument in the Ruby client to request Spot instances if desired.
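
The estimation and billing reasoning in this section can be sketched as a small calculation. This is only a rough model under stated assumptions: it assumes runtime scales linearly with data size and node count (real jobs only approximate this), it assumes each node's usage is rounded up to a whole hour as described above, and the trial measurements and hourly rate are hypothetical placeholders rather than actual Amazon prices.

```python
import math


def node_hours_needed(trial_gb, trial_nodes, trial_minutes, target_gb):
    """Extrapolate the node-hours of work for target_gb from one trial run.

    Assumes linear scaling with data size, which is only an approximation.
    """
    trial_node_hours = trial_nodes * trial_minutes / 60
    return trial_node_hours * target_gb / trial_gb


def billed_node_hours(num_nodes, minutes_per_node):
    """Node-hours billed when each node is rounded up to a whole hour."""
    return num_nodes * math.ceil(minutes_per_node / 60)


def run_cost(num_nodes, minutes_per_node, hourly_rate):
    """Total cost of a run at a given per-node hourly rate."""
    return billed_node_hours(num_nodes, minutes_per_node) * hourly_rate


if __name__ == "__main__":
    # Hypothetical trial: 10 GB processed by 2 nodes in 30 minutes.
    work = node_hours_needed(trial_gb=10, trial_nodes=2,
                             trial_minutes=30, target_gb=1000)
    print("estimated node-hours of work:", work)

    # The two configurations compared in the note above.
    print("100 nodes x 5 min, billed node-hours:", billed_node_hours(100, 5))
    print("10 nodes x 50 min, billed node-hours:", billed_node_hours(10, 50))

    hypothetical_rate = 0.10  # placeholder dollars per node-hour
    print("cost of 10 nodes x 50 min:", run_cost(10, 50, hypothetical_rate))
```

Replacing the placeholder numbers with your own trial measurements and the current price for your chosen instance type gives a first-hand estimate of the kind the proposal asks for.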

Analysis

After running your final project on the data set, what results will you produce and how will they be presented? (Tables of ...? Graphs of ...? Lists of ...?) Will you need to use any other software tools to display or analyze the data collected from MapReduce?

Submission instructions: Upload the final proposal (in PDF format) to Sakai.

Part 3 - Project Implementation and Reporting

In this stage of the project, you will finish implementing everything you proposed. :-)

There are four deliverables for this part:

Full source code
Installation and execution instructions - what steps would a classmate need to take to reproduce your work? (For your instructions, it is safe to assume that the reader is already familiar with the CommonCrawl tutorial. Thus, you can be brief for those "obvious" steps)
A final report documenting your completed project. The report should contain the following sections:

Introduction (re-use and polish from the proposal. Address any comments from instructor...)
Algorithm details (update and polish from the proposal. Address any comments from instructor...)
Infrastructure actually used (update and polish from the proposal. Address any comments from instructor...)

How close were you to your estimate? Why was there a difference (if any)? (I'm not grading you on your accuracy. Rather, I'm curious and am looking for ways to make more accurate estimates the next time this course is taught.)

Results and Analysis - What is the answer to your question?

Double points for this section - Put some effort into writing a polished discussion of what you learned from the dataset! Point out anything interesting or surprising that was found. Make sure your tables and figures are neat, clearly labeled, and easy to read.

Final thoughts and feedback - What was the easiest and hardest part of the project? What suggestions would you have if this project is repeated for future students?
References - What sources did you use in building your project? Provide links to public source code, tutorials, discussion forums, mailing lists, etc.

A 6-minute in-class presentation describing your project, methods used, and results, with 2-3 PowerPoint slides (no more!).

Submission instructions: Upload the source code (in a compressed tarball or zip file, please), the final report (in PDF format), and presentation slides to Sakai.

Timeline

Week	Working On	Due	Deliverables at End
1	Part 1: Independent research and brainstorming for idea	Monday, Jan 30th by 11:55pm	1-page document describing idea
2-3	Part 2: Proposal writing and initial programming in order to better understand your own solution (and its performance)	Monday, Feb 13th by 11:55pm	4-5 page proposal document
4-6	Part 3: Final programming, data processing and analysis, and report writeup	Friday, Mar 2nd by 11:55pm	Source code, final report, and in-class presentation

Grading

10% - Project Idea Document [Grading Rubric]
30% - Project Proposal [Grading Rubric]
60% - Final Report, Source code, and In-class Presentation [Grading Rubric]

References and "Useful" Links

(Please suggest more links that you find useful!)

CommonCrawl
Accessing the CommonCrawl data
CommonCrawl Q&A
CommonCrawl source code repository (including some utility classes related to the Arc file format)
"Official" Tutorial #1 - WordCount applied to subset of data (project code repository)
"Unofficial" Tutorial #2 - Parsing data for language and splitting into sentences

Note that you don't need the "distributed copy" code near the beginning, because you will be processing data directly from S3

Arc file format specification
Companion libraries?

Boilerpipe - Full-text extraction from HTML pages
Jsoup - Java HTML parser
Apache Tika - Document parsing
Apache Mahout - Machine learning library

To download a sample of the CommonCrawl dataset for local processing, install s3curl, configure its local .s3curl file with your AWS account information, and then use the following command to add the "RequesterPays" header to the request:
./s3curl.pl --id=myaccount -- -H "x-amz-request-payer: requester" http://s3.amazonaws.com/commoncrawl-crawl-002/2010/01/07/18/1262876244253_18.arc.gz > 2010_01_07_18_1262876244253_18.arc.gz