Project 1 - MapReduce

Overview

CommonCrawl is a free, publicly accessible "crawl" of the web - that is, an archive of web content that has been downloaded and saved for future analysis. Their November 2015 crawl alone is over 151 TB in size and holds more than 1.82 billion URLs.  In this project, you will work in groups of 1-2 people to process a subset of the CommonCrawl dataset using MapReduce.  But what portion of the data you process, and how you process it, is up to you!

 

CommonCrawl Datasets

Full information on the CommonCrawl datasets is available on their Getting Started page, although late-breaking information on the newest crawls might only be documented on the CommonCrawl Blog.  Each recent dataset includes three different types of files:

  • WARC files (Full HTTP request and HTTP response. This is raw data exactly as the crawler sees it.)
  • WARC-encoded WAT files (computed metadata of each request/response in JSON format)
  • WARC-encoded WET files (plain text extracted from each response payload)
Note that each crawl (e.g. Sept 2015 vs. Nov 2015) is stored in a different directory on S3, so pay careful attention to your filenames.  For your projects you should concentrate on data from 2013 and newer, which uses the WARC (Web ARChive) format.  Earlier datasets used the ARC format, which requires a different parser to process.
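
To get oriented with the format, here is a minimal sketch of iterating over the response records in a single WARC file. It assumes the third-party warcio package and a crawl segment that has already been downloaded locally; the filename below is just a placeholder.

    from warcio.archiveiterator import ArchiveIterator  # third-party: pip install warcio

    # Placeholder path to one locally downloaded WARC segment
    WARC_PATH = "CC-MAIN-example-segment.warc.gz"

    with open(WARC_PATH, "rb") as stream:
        for record in ArchiveIterator(stream):
            # 'response' records contain the raw HTTP response exactly as the crawler saw it
            if record.rec_type == "response":
                url = record.rec_headers.get_header("WARC-Target-URI")
                content_type = record.http_headers.get_header("Content-Type")
                body = record.content_stream().read()
                print(url, content_type, len(body))

WAT and WET files are WARC-encoded too, so the same iteration approach applies; their record types and payloads differ (JSON metadata and plain text, respectively).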

  

Part 1 - Project Idea

Rather than assign a specific project goal, for this graduate-level class you must determine the specifics of your project and propose the details to me.  Your project must meet the following requirements:

  • The data source is CommonCrawl - you don't need to crawl and download the web yourself!
    • You should process a "significant fraction" of the full data set, but the exact amount will vary by project and the complexity of the data processing required.
  • The data processing mechanism is MapReduce
  • The compute resources are Amazon Elastic MapReduce (i.e. Amazon EC2 instances running customized Hadoop software)
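
For reference, the sketch below shows one way such a cluster can be launched programmatically with boto3 and handed a Hadoop Streaming step. The bucket names, script names, instance counts, and release label are placeholder assumptions; the aws emr create-cluster CLI command accomplishes the same thing.

    import boto3  # assumes AWS credentials are already configured locally

    emr = boto3.client("emr", region_name="us-east-1")  # same region as the CommonCrawl data

    # Hypothetical small cluster; every name, path, and size below is a placeholder.
    response = emr.run_job_flow(
        Name="commoncrawl-analysis",
        ReleaseLabel="emr-4.3.0",
        LogUri="s3://my-bucket/logs/",
        Instances={
            "MasterInstanceType": "m1.large",
            "SlaveInstanceType": "m1.large",
            "InstanceCount": 4,
            "KeepJobFlowAliveWhenNoSteps": False,  # terminate when the step finishes
        },
        Steps=[{
            "Name": "streaming-step",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["hadoop-streaming",
                         "-files", "s3://my-bucket/mapper.py,s3://my-bucket/reducer.py",
                         "-mapper", "mapper.py",
                         "-reducer", "reducer.py",
                         "-input", "s3://my-bucket/input/",
                         "-output", "s3://my-bucket/output/"],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print("Started cluster:", response["JobFlowId"])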

 

Your project must answer a specific question about the dataset.  For example:

  • What are the top 100 keywords used in website titles? (Or in links, descriptions, etc.)  A MapReduce skeleton for this kind of counting question is sketched after this list.
  • What percentage of pages use AJAX, dynamic HTML, or <insert new web tech trend here>?
  • What languages are present in the crawl? (across all HTML pages, only in PDF documents, etc..)
  • What percentage of documents are labeled with the incorrect content type?  (Web servers return a header field ("Content-Type") specifying what type of document is being provided, such as text/html, application/pdf, etc. But, this information could be wrong.)
  • What pages/documents are duplicated the most times? (i.e. at least 90% of the content on page A appears on 1500 other pages in the crawl)
  • What percentage of the web pages have advertisements on them?
  • What are the most popular JavaScript libraries used in web pages?
  • Of the 5+ billion items in the crawl, are they mostly large files or small files? (i.e. a histogram of the document sizes)
  • What are the most common viruses / spyware / malware that were captured in the crawl?  (Note that the crawl is deliberately unfiltered for this!)
  • And...
  • Many....
  • More...
  • Possible...
  • Ideas....
  • Note: I should not be able to Google your question, and find an analysis of the CommonCrawl data that already answers your question.
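
To make the first example concrete, here is a minimal keyword-count skeleton using the mrjob library (one option among many). It is only a sketch under simplifying assumptions: the page titles are assumed to have already been extracted from the WAT metadata into plain text files with one title per line, and the final top-100 selection step is not shown.

    import re

    from mrjob.job import MRJob  # third-party: pip install mrjob

    WORD_RE = re.compile(r"[\w']+")

    class TitleKeywordCount(MRJob):
        """Count keyword occurrences across page titles (one title per input line)."""

        def mapper(self, _, line):
            # Emit (keyword, 1) for every word in the title
            for word in WORD_RE.findall(line.lower()):
                yield word, 1

        def combiner(self, word, counts):
            # Local pre-aggregation to reduce shuffle traffic
            yield word, sum(counts)

        def reducer(self, word, counts):
            yield word, sum(counts)

    if __name__ == "__main__":
        TitleKeywordCount.run()

Run it locally on a small sample first (python title_keywords.py titles.txt); mrjob can also submit the same job to Elastic MapReduce with its -r emr runner once the logic looks right.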

 

Submission instructions: Upload a 1-page PDF document to Canvas. This document should name the group members, describe your idea, and provide a timeline for the project in the form of a Gantt chart. Also, for two-person groups, describe the division of labor between group members.

 

Part 2 - Project Proposal

Once you have a rough idea of the question you want your project to answer, it is time to begin working on a complete project proposal. Writing a good proposal will almost certainly require you to begin work on the project itself, perhaps by doing a rough implementation of the algorithm and running it on a small subset of data.

Your proposal should be approximately 4 pages in length and include the following elements:

Introduction

  • What specific question about the dataset are you answering?
  • Why is this question important and interesting?
  • Why is this question relevant to the broader public Internet?

 

Algorithm Details

  • What specific algorithm(s) will you be implementing in order to accomplish the high-level goals described in your introduction?
  • What open-source tools do you intend to use to accelerate project development?  (This is encouraged!)
  • How do these algorithms work?
  • Why did you choose them?

 

Infrastructure

  • Dataset: How much of the CommonCrawl dataset (and what year(s)) do you intend to process?  How did you arrive at this number?
  • Computation Resources: How many EC2 nodes (of what size?) will be needed to process the data in parallel?  How many hours do you estimate it will take to run the analysis to completion? How did you arrive at this estimate?
    • (For full credit, this should be based on first-hand experimentation, not merely a random guess!)
    • (I'm assuming that some number of "m1.large" EC2 nodes, each with 2 vCPUs and 7.5 GB of RAM, will be sufficient, but if your project has significant memory requirements and needs larger nodes, be sure to describe them and explain why)
  • Cost: How much $$$ will this project cost to execute trial runs on a small data subset and do a final "production" run? How did you calculate this total cost? (A worked example of this arithmetic is sketched after this list.)
    • Note that the CommonCrawl dataset is located in Amazon's US / Eastern region.  You should run your analysis in the same region in order to avoid data transfer charges.
    • Note that you are charged by the hour for a compute node, even if you only use it for 5 minutes and then terminate it. So, don't be greedy and spin up 100 nodes for 5 minutes just to finish your job a bit faster; 10 nodes each used for 50 minutes is more cost efficient.
    • Can you use Amazon Spot Instances instead of the conventional On Demand instances to lower the cost of doing your final "production" run? What are the tradeoffs with this technique?  You can use the --instance-groups parameter with the BidPrice argument in the CLI to request Spot instances if desired.
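
As an illustration of the cost arithmetic (not a price quote - look up current us-east-1 pricing yourself), here is a back-of-the-envelope calculation. The hourly rates below are placeholder assumptions, and note that partial instance-hours are billed as full hours.

    import math

    # Placeholder assumptions - substitute current pricing and your own measured runtime
    EC2_RATE_PER_HOUR = 0.175       # assumed on-demand price for one m1.large
    EMR_SURCHARGE_PER_HOUR = 0.044  # assumed EMR fee per m1.large
    NODES = 10
    ESTIMATED_RUNTIME_HOURS = 3.5   # e.g. measured on a small subset, then scaled up

    billed_hours_per_node = math.ceil(ESTIMATED_RUNTIME_HOURS)  # partial hours round up
    total = NODES * billed_hours_per_node * (EC2_RATE_PER_HOUR + EMR_SURCHARGE_PER_HOUR)
    print("Estimated production run: $%.2f for %d nodes x %d hours"
          % (total, NODES, billed_hours_per_node))

Spot instances can often cut the EC2 portion of this total substantially, but spot nodes can be reclaimed mid-run if you are outbid, so weigh the savings against the risk of losing capacity partway through a long job.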

 

Analysis

  • After running your final project on the data set, what results will you produce and how will they be presented?  (Tables of ...?  Graphs of ...?  Lists of ...?)  Will you need to use any other software tools to display or analyze the data collected from MapReduce?
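
For example, a few lines of matplotlib (a placeholder choice - any plotting or spreadsheet tool is fine) can turn reducer output into a readable figure. The keyword counts below are made-up sample values, not real results.

    import matplotlib.pyplot as plt

    # Made-up sample (keyword, count) pairs standing in for parsed MapReduce output
    top_keywords = [("news", 120345), ("home", 98213), ("blog", 76554), ("shop", 61230)]

    words = [w for w, _ in top_keywords]
    counts = [c for _, c in top_keywords]

    plt.figure(figsize=(6, 3))
    plt.bar(words, counts)
    plt.ylabel("Occurrences in page titles")
    plt.title("Most common title keywords (sample data)")
    plt.tight_layout()
    plt.savefig("top_keywords.png")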

     

Submission instructions: Upload the final proposal (in PDF format) to Canvas.

     

Part 3 - Project Implementation and Reporting

In this stage of the project, you will finish implementing everything you proposed.  :-)

     

There are four deliverables for this part:

1. Full source code
2. Installation and execution instructions - what steps would a classmate need to take to reproduce your work?  (For your instructions, it is safe to assume that the reader is already familiar with the CommonCrawl tutorial. Thus, you can be brief for those "obvious" steps.)
3. A final report documenting your completed project. The report should contain the following sections:
  1. Introduction (re-use and polish from the proposal; address any comments from the instructor)
  2. Algorithm details (update and polish from the proposal; address any comments from the instructor)
  3. Infrastructure actually used (update and polish from the proposal; address any comments from the instructor)
    1. How close were you to your estimate?  Why was there a difference (if any)?  (I'm not grading you on your accuracy. Rather, I'm curious and am looking for ways to make more accurate estimates the next time this course is taught.)
  4. Results and Analysis - What is the answer to your question?
    1. Double points for this section - put some effort into writing a polished discussion of what you learned from the dataset! Point out anything interesting or surprising that was found. Make sure your tables and figures are neat, clearly labeled, and easy to read.
  5. Final thoughts and feedback - What was the easiest and hardest part of the project? What suggestions would you have if this project is repeated for future students?
  6. References - What sources did you use in building your project?  Provide links to public source code, tutorials, discussion forums, mailing lists, etc.
4. An 8-minute in-class presentation describing your project, methods used, and results, with 3-4 PowerPoint slides (no more!).

     

Submission instructions: Upload the source code (in a compressed tarball or zip file, please), the final report (in PDF format), and presentation slides to Canvas.

     

Timeline

  • Week 1 - Part 1: Independent research and brainstorming for idea. Due: Tuesday, Feb 9th by 11:59pm. Deliverable at end: 1-page document describing idea.
  • Weeks 2-3 - Part 2: Proposal writing and initial programming in order to understand your own solution (and its performance) better. Due: Tuesday, Feb 23rd by 11:59pm. Deliverable at end: 4-5 page proposal document.
  • Weeks 4-5 - Part 3: Final programming, data processing and analysis, and report writeup. Due: Thursday, Mar 10th by 11:59pm. Deliverables at end: source code, final report, and in-class presentation.

       

Grading

 

References and "Useful" Links

(Please suggest more links that you find useful!)