Project 1

CommonCrawl is a free, publicly accessible "crawl" of the web - that is, an archive of web content that has been downloaded and saved for future analysis. As of late 2011, the archive is 40TB in size and encompasses over 5 billion pages and documents. In this project, you will work in groups of 1-2 people to process a subset of the CommonCrawl dataset using MapReduce. Which portion of the data you process, and how you process it, is up to you!
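
For anyone new to the model, here is a minimal, purely in-memory sketch of the map and reduce steps, counting pages per domain. It is an illustration only: the function names (map_page, reduce_counts, run_job) and the sample records are hypothetical, and it does not model the actual CommonCrawl file format or a real Hadoop job, which would distribute the same two steps across many machines.

```python
from collections import defaultdict
from urllib.parse import urlparse


def map_page(url, content):
    """Map step: emit one (domain, 1) pair per crawled page."""
    domain = urlparse(url).netloc.lower()
    if domain:  # skip records whose URL has no recognizable host
        yield domain, 1


def reduce_counts(key, values):
    """Reduce step: sum all counts emitted for one domain."""
    return key, sum(values)


def run_job(records):
    """Run map, group by key (the "shuffle"), then reduce, all in memory."""
    groups = defaultdict(list)
    for url, content in records:
        for key, value in map_page(url, content):
            groups[key].append(value)
    return dict(reduce_counts(key, values) for key, values in groups.items())


# Made-up sample records standing in for (url, page content) pairs.
sample = [
    ("http://example.com/a", "<html>...</html>"),
    ("http://example.com/b", "<html>...</html>"),
    ("http://example.org/", "<html>...</html>"),
]
result = run_job(sample)
```

Swapping in a different map_page (for example, emitting one pair per word or per outgoing link) changes what is analyzed without touching the grouping or reduce logic, which is the kind of choice the project leaves open.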



  • 10% - Project Idea Document - Grading Rubric
    • Due: Monday, Jan 30th by 11:55pm
  • 30% - Project Proposal - Grading Rubric
    • Due: Monday, Feb 13th by 11:55pm
  • 60% - Final Report, Source Code, and In-class Presentation - Grading Rubric
    • Due: Friday, Mar 2nd by 11:55pm


Project 2

Writing applications that scale is a central challenge in cloud computing. In this project, you will work in groups of 1-2 people to write an application for Amazon's cloud platform that takes advantage of its monitoring and management features.



  • 10% - Project Idea Document - Grading Rubric
    • Due: Monday, Mar 26th by 11:55pm
  • 15% - Initial Project Demonstration and Project Meeting - Grading Rubric
    • Due: Monday, Apr 9th (in class)
  • 60% - Final Project Demonstration - Grading Rubric
    • Due: Monday, Apr 23rd (in class)
  • 15% - Final Report and Source Code - Grading Rubric
    • Due: Wednesday, Apr 25th by 11:55pm