Project 1

CommonCrawl is a free, publicly accessible "crawl" of the web: an archive of web content that has been downloaded and saved for future analysis. Its 2013 archive alone is 250TB in size, encompassing over 4.3 billion pages and documents. In this project, you will work in groups of 1-2 people to process a subset of the CommonCrawl dataset using MapReduce. Which portion of the data you process, and how you process it, is up to you!
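
For orientation only, here is a minimal sketch of the MapReduce pattern run in memory with plain Python. It is not the required approach, and it does not use any particular MapReduce framework. The function names (`map_page`, `reduce_counts`, `run_mapreduce`) and the sample records are hypothetical; real CommonCrawl input arrives in archive files rather than as a Python list.

```python
from collections import defaultdict
from urllib.parse import urlparse


def map_page(url, content):
    """Map step: emit one (domain, 1) pair for a crawled page."""
    yield urlparse(url).netloc, 1


def reduce_counts(domain, counts):
    """Reduce step: sum all counts emitted for one domain."""
    return domain, sum(counts)


def run_mapreduce(pages):
    """Run map, shuffle, and reduce over an iterable of (url, content)."""
    # Shuffle: group mapper output by key, which a real framework
    # would do across machines between the map and reduce phases.
    grouped = defaultdict(list)
    for url, content in pages:
        for key, value in map_page(url, content):
            grouped[key].append(value)
    return dict(reduce_counts(k, v) for k, v in grouped.items())


# Hypothetical sample records, not real CommonCrawl data.
sample_pages = [
    ("http://example.com/a", "<html>...</html>"),
    ("http://example.com/b", "<html>...</html>"),
    ("http://example.org/", "<html>...</html>"),
]
result = run_mapreduce(sample_pages)
```

Counting pages per domain is only one possible analysis; the same map and reduce structure applies to whatever question your group chooses.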



    • 10% - Project Idea Document - Grading Rubric
      • Due: Tuesday, Feb 4th by 11:55pm
    • 30% - Project Proposal - Grading Rubric
      • Due: Tuesday, Feb 18th by 11:55pm
    • 60% - Final Report, Source Code, and In-Class Presentation - Grading Rubric
      • Due: Thursday, Mar 6th by 11:55pm


Project 2

In this project, you will work in groups of 1-2 people to write an application for Amazon's cloud platform that takes advantage of its monitoring and management features.



    • 10% - Project Idea Document - Grading Rubric
      • Due: Tuesday, Mar 25th by 11:55pm
    • 15% - Initial Project Demonstration - Grading Rubric
      • Due: Tuesday, Apr 8th (in class)
    • 60% - Final Project Demonstration - Grading Rubric
      • Due: Tuesday, Apr 29th (in class)
    • 15% - Final Report and Source Code - Grading Rubric
      • Due: Wednesday, Apr 30th by 11:55pm