
CommonCrawl Tutorial

See also:  New tutorial (updated for Spring 2016)

This tutorial is based on Steve Salevan's blog post MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl. The original tutorial used the web interface for Amazon Elastic MapReduce, while this new tutorial uses a command-line interface to Elastic MapReduce (link 1, link 2).  While it's a little more work to set up, it's a better method for dealing with larger, more complicated projects.

Last updated: January 2014 

 

Install Development Tools

Install Oracle Java 6 SE JDK
Note 1: Version 6 is CRITICAL!  Although you can compile the tutorial code with Java 7, it WILL NOT RUN on Amazon Elastic MapReduce. (Instead, you will get an unhelpful "Class not found" or "Unsupported major.minor version" error when Amazon tries to run your Java 7 class files under Java 6.)
Note 2: You want the JDK (Java Development Kit), not the JRE (Java Runtime Environment, which can only run Java code) 

 

Install Eclipse (for writing Hadoop code)
Latest version:  Eclipse Kepler 4.3.x 

  • Windows and OS X: Download the “Eclipse IDE for Java developers” installer package located at: http://www.eclipse.org/downloads/
  • Linux:  sudo apt-get install eclipse

Once Eclipse is installed, ensure that Java 6 is set as your default JRE: go to Window -> Preferences -> Java -> Installed JREs.  It is also helpful to set the compiler compliance level from 1.7 to 1.6 under Java -> Compiler.

 

Install Git
Any version 

 

Install Ruby
Version 1.8.x or 1.9.x

  • Windows: Download the latest installer from https://www.ruby-lang.org
    • Select the option to add Ruby to your path.
    • To use Ruby later (for the Amazon command-line interface), go to the Start Menu, find Ruby, and then choose "Start Command Prompt with Ruby".  Then, all of your commands will be of the format "ruby command", e.g. "ruby elastic-mapreduce --help"
  • Linux: sudo apt-get install ruby

 

Install Amazon Elastic MapReduce Ruby Client
Note: The Amazon-listed version is horribly out-of-date. (Why they don't update it is beyond me...) Don't use it, and instead pull from the developer's Github site at https://github.com/tc/elastic-mapreduce-ruby

  • git clone https://github.com/tc/elastic-mapreduce-ruby.git

 

Install a stand-alone Amazon S3 browser

Although you can use Amazon's web interface to upload/download from S3, a stand-alone S3 file browser on your computer might make this easier in the future. Further, it would allow you to explore outside of your own buckets. For example, you could explore the CommonCrawl datasets at s3://aws-publicdatasets/common-crawl/crawl-data

  

Get your Amazon Web Services Credentials

Login to Sakai and check the course Dropbox to find a text file containing your super-secret login information.  These accounts are linked to the master (instructor) account for centralized billing for the course.  This utilizes Amazon's Identity and Access Management (IAM) feature, which was designed to allow multiple employees of a company to work together on Amazon's cloud services without the need to share a single username and password.

Following the super-secret login link contained in the text file, log in to the Amazon Web Services console (via your web browser).

Configure a keypair for Amazon's Elastic Compute Cloud (EC2) service:

  1. Select the EC2 button in the web control panel.
  2. Create a keypair for your account. Remember the name of this keypair - you'll need it next!  Save the downloaded .pem file to your home directory or some other place where you will remember its location later.

Although we won't use it directly, this keypair would allow you to SSH directly into the EC2 Linux compute nodes spawned as part of this tutorial.

Create a bucket (aka "folder") in Amazon's Simple Storage Service (S3)

  1. Select the S3 button in the web control panel.
  2. Click "Create Bucket", give your bucket a name, and click the "Create" button.  Bucket names must be unique across both the whole class account and the entire world!  So, something like "yourusername-commoncrawl-tutorial" is a good choice. Remember this bucket name!   You can leave logging disabled for the bucket.

 

Configure Amazon Elastic MapReduce Ruby Client

The client needs access to your Amazon credentials in order to work.  Create a text file called credentials.json in the same directory as the ruby client, i.e. elastic-mapreduce-ruby/. Into this file, place the following information, customized for your specific account and directory setup.  Note that this example is pre-set to the us-east-1 region because the CommonCrawl data is stored in the corresponding "US Standard" region. (The us-west-1 and us-west-2 regions might also work, but have not been tested...)

{
  "access-id": "[YOUR AWS access key ID]",
  "private-key": "[YOUR AWS secret access key]",
  "key-pair": "[YOUR key pair name]",
  "key-pair-file": "[The path and name of YOUR PEM file]",
  "log-uri": "s3n://[YOUR bucket name in S3]/mylog-uri/",
  "region": "us-east-1"
}

 

Compile and Build CommonCrawl Example

Now that you’ve installed the packages, you need to play with the CommonCrawl example code. A special ECPE 293A version is provided to reduce installation and compilation problems.  Run the following command from a terminal/command prompt to pull down the code (Windows users - run this in your Git Bash shell):

cd
git clone https://bitbucket.org/shafer/2014_spring_commoncrawl_example.git

Now, start Eclipse and create a project.

  1. Open the File menu then select “Project” from the “New” menu. 
  2. Open the “Java” folder and select “Java Project from Existing Ant Buildfile”. 
  3. Click Browse, then locate the folder containing the code you just checked out (if you didn’t change the directory when you opened the terminal, it should be in your home directory) and select the “build.xml” file. Eclipse will find the right targets.
  4. Check the “Link to the buildfile in the file system” box, as this will enable you to share the edits you make to it in Eclipse with git.

Next, tell Eclipse how to build our JAR.

  1. Right-click on the base project folder (by default it’s named “Common Crawl Examples”) and select “Properties” from the menu that appears.
  2. Navigate to the Builders tab in the left hand panel of the Properties window
  3. Uncheck the existing "Java Builder" option
  4. Click “New”. 
  5. Select “Ant Builder” from the dialog which appears, then click OK.

Finally, configure our new Ant builder:

  1. Select the Main tab at top:
    1. "Buildfile": Click the “Browse File System” button under the “Buildfile:” field, and find the build.xml file which you found earlier. 
    2. "Base Directory": Click the “Browse File System” button under the “Base Directory:” field, and select the folder into which you checked out our code. 
    3. "Arguments": Leave blank
    4. Give the builder a name (at the top) such as Ant_Builder
  2. Select the Targets tab at the top and set the following options:
    1. "After a Clean": uncheck all the targets, so that it says "Builder is not set to run for this build kind"
    2. "Manual Build": ensure that it says "Default Target selected"
    3. "Auto Build": ensure that it says "Builder is not set to run for this build kind"
    4. "During a Clean": ensure that only the "clean" target runs

Once all configuration is completed, right-click on the base project folder and select “Build Project”.  Ant will assemble a JAR file ready for use in Elastic MapReduce.

 

Upload the Example JAR to Amazon S3

Go back to the web console interface to S3, or launch the stand-alone S3 file browser program you installed previously.   Upload the JAR file you built in Eclipse into the S3 bucket you created previously. The JAR file should be located here:
<your checkout dir>/dist/lib/commoncrawl-examples-1.0.1.jar

 

Create an Elastic MapReduce job based on your new JAR

Now that the JAR is uploaded into S3, all we need to do is to point Elastic MapReduce to it.  The original tutorial uses the AWS web interface to accomplish this, but as serious programmers we will do it via the command-line interface.

From inside the Ruby client directory you created previously, use the following commands:

Get a list of all commands - Helpful for future reference and debugging! (Windows users, run this in your Ruby shell)

./elastic-mapreduce --help

 

Create a job flow - A job can contain many different MapReduce programs to execute in sequence. We only need one today.

./elastic-mapreduce \
--create \
--alive \
--name "[Human-friendly name for this entire job]" \
--num-instances 3 \
--master-instance-type m1.small \
--slave-instance-type m1.small \
--log-uri s3n://[S3 bucket with program]/logs \
--hadoop-version "0.20.205"

Note 1: The --alive argument tells Amazon that you want to keep your EC2 servers running even when the specific tasks finish! While this saves a lot of time in starting up servers while doing development/testing, it will cost you $$ if you forget to turn them off when you truly are finished!

Note 2: A variety of virtual machine types are available for you to use with Elastic MapReduce.  This is a subset of the full list of EC2 machines available for rental. But, before you jump straight to the largest, fastest, most expensive machine, ask yourself:  Will this actually get my program done faster?  Or would I be better off using a larger number of small, cheap machines?   If the bottleneck is Amazon's S3 storage system, then renting an expensive machine is a waste of money, because its CPU and memory will sit idle much of the time, waiting for new data from the network.


Add a step to the job flow - This will add your specific MapReduce program to the job flow.

./elastic-mapreduce --jobflow [job flow ID produced from --create command previously] \
--jar s3n://[S3 bucket with program]/commoncrawl-examples-1.0.1.jar \
--main-class org.commoncrawl.examples.ExampleTextWordCount \
--arg -Dfs.s3n.awsAccessKeyId=[Your AWS access key ID] \
--arg -Dfs.s3n.awsSecretAccessKey=[Your AWS secret access key] \
--arg s3n://[S3 bucket with program]/emr/output/ExampleTextWordCount

 What are we doing with this step?  We're telling Amazon that we want to run a .jar file for our MapReduce program, and providing a number of other details, including:

  1. The main() method for Hadoop to run is found in org.commoncrawl.examples.ExampleTextWordCount
  2. Your access key ID and secret access key.  Why do we need these when we already created a credentials.json file for the Ruby script?  This information goes straight to Hadoop as program command-line arguments. Hadoop uses it to log into Amazon S3, because the CommonCrawl data is pay-per-access - Amazon needs to know who to bill!  (Note: It's free if you access it from inside Amazon, but a login is still required.)
  3. Output the results as a series of CSV files into your Amazon S3 bucket (in a directory called emr/output/ExampleTextWordCount)

Now, your Hadoop job will be sent to the Amazon cloud.
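
For reference, here is a minimal sketch of what a word-count job driver looks like with the older org.apache.hadoop.mapred API, showing where the step's arguments end up: ToolRunner absorbs the -Dfs.s3n.* options into the job's Configuration, and the last --arg becomes the output path. This is only an illustration, not the actual ECPE 293A example code; the class name and input path are assumptions.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountSketch extends Configured implements Tool {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, LongWritable> output, Reporter reporter) throws IOException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        output.collect(word, ONE);          // emit (word, 1) for every word seen
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, LongWritable, Text, LongWritable> {
    public void reduce(Text key, Iterator<LongWritable> values,
                       OutputCollector<Text, LongWritable> output, Reporter reporter) throws IOException {
      long sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new LongWritable(sum));
    }
  }

  public int run(String[] args) throws Exception {
    // args[0] is the final --arg passed to the Elastic MapReduce step: the s3n://
    // output directory. The -Dfs.s3n.awsAccessKeyId and -Dfs.s3n.awsSecretAccessKey
    // options were already parsed by ToolRunner and placed into getConf(), so
    // Hadoop can log into S3.
    JobConf job = new JobConf(getConf(), WordCountSketch.class);
    job.setJobName("word-count-sketch");

    // Illustrative input: one Common Crawl 2012 text-data file.
    FileInputFormat.addInputPath(job, new Path(
        "s3n://aws-publicdatasets/common-crawl/parse-output/segment/1341690166822/textData-00681"));
    FileOutputFormat.setOutputPath(job, new Path(args[0]));

    job.setInputFormat(TextInputFormat.class);
    job.setOutputFormat(TextOutputFormat.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);

    return JobClient.runJob(job).isSuccessful() ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new WordCountSketch(), args));
  }
}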

 

Watch the show

It will take Amazon some time (5+ minutes?) to spin up your requested instances and begin processing. You can watch the status of your job in several places:

  • Via command-line, i.e. ./elastic-mapreduce --list
  • Via web interface
    • Select the Elastic MapReduce tab. 
      • Is your job / step there?  Is it running, or if it failed, are there error messages / error logs available to view?
      • What does the cluster CPU / memory / I/O usage look like?
    • Select the EC2 tab.  Have your servers been started yet?
    • Select the S3 tab.
      • Are there log files (standard output, standard error) in the mylog-uri folder?
      • Has your output directory been created yet? (That's a very good way to know it finished!).  You can download the output files via the S3 web interface if you want to inspect them.
  • Via Hadoop web console -  NOT the Amazon web interface, but a Hadoop-specific webpage - See below

 

Advanced Topics - Watch the show with Hadoop web interface

Hadoop provides a web interface that is separate from the general Amazon Web Services interface.  (i.e. it works even if you run Hadoop on your own personal cluster, instead of Amazon's). To access it, first obtain the public DNS address of the Hadoop master node. It should be something like ec2-######.amazonaws.com:

./elastic-mapreduce --list

Using SSH, listen locally on port 8157 and create an encrypted tunnel that connects between your computer and the Hadoop master node. We'll use this tunnel to transfer web traffic to/from the cluster. Rather than using a username/password, this command uses the public/private key method to authenticate. (Windows users: You'll have to use the PuTTY client and check the right checkboxes to accomplish this same task)

ssh -i [path to .pem key file] -ND 8157 hadoop@ec2-#######.compute-1.amazonaws.com

You won't see any messages on the screen if this works. Leave SSH running for as long as you want the tunnel to exist.  Note that, if you left off the "-ND 8157" part of the command, this would give you SSH command-line access to the master server to do whatever you want! Remember, Elastic MapReduce isn't magic; it's just doing the work of configuring a virtual machine and installing software automatically in it.

You need to configure your web browser to access URLs of the format *ec2*.amazonaws.com* --or-- *ec2.internal* by using the SSH tunnel (port 8157) you just created as a SOCKSv5 proxy instead of accessing those URLs directly.  Setup details differ by browser.  One good option for Firefox is using the FoxyProxy plugin, since you can configure it to only proxy those specific URLs, not any website you access. Read installation instructions here (scroll to the bottom).

Finally, access the following URLs for different Hadoop services:

  • Hadoop Map/Reduce Administration (aka the "JobTracker"): http://ec2-########.amazonaws.com:9100
    • Here you can see all submitted (current and past) MapReduce jobs, watch their execution progress, and access log files to view status and error messages.
  • Hadoop NameNode:  http://ec2-########.amazonaws.com:9101 (this storage mechanism is not used for your projects, so it's supposed to be empty)

 

Advanced Topics - FAQs from Prior Projects

 

Q: Can I get a single output file?

A: Yes! To get a single output file at the end of processing, you need to run only *1* reducer across the entire cluster. This is a bottleneck for large datasets, but it might be worth it compared to the alternative (writing a simple program that merges your N sorted output files into 1 sorted output file).

Try inserting the following line of code in the run() method immediately before the line that creates a new JobConf:

this.getConf().set("mapred.reduce.tasks", "1");  // force a single reduce task
JobConf job = new JobConf(this.getConf());       // (existing line that creates the JobConf)

Or, alternatively, you can configure the job after it has been created, but before it runs:

job.setNumReduceTasks(1);
if (JobClient.runJob(job).isSuccessful())
etc... 

 

Q: Can I change the output data type of a map task, i.e. change a key/value pair so that the "value" is a string, not an int?

A: Yes! A discussion at StackOverflow notes that you need to change the type in three different places:

  • the map task
  • the reduce task
  • and the output value (or key) class in main().

In the StackOverflow example, the original poster wanted to change the value type from int (IntWritable) to double (DoubleWritable). You should be able to follow the same method to change from LongWritable to Text, as sketched below.
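
As a concrete (hypothetical) illustration, here are the three places a LongWritable-to-Text change shows up in a job written against the old mapred API. These are fragments meant to be spliced into the existing example, which already imports the relevant Hadoop classes; they are not a standalone program.

// 1. The map task: declare and emit Text output values instead of LongWritable.
public static class WordMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    output.collect(new Text("some-key"), new Text("some string value"));
  }
}

// 2. The reduce task: accept and emit Text values.
public static class WordReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {
  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    while (values.hasNext()) {
      output.collect(key, values.next());
    }
  }
}

// 3. The job configuration in run(): declare the new output value class.
job.setOutputValueClass(Text.class);   // was LongWritable.class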

 

Q: Can I run an external program from inside Java?

A: Yes! There are a few examples online that cover invoking an external binary, sending it data (via stdin), and capturing its output (via stdout) from Java.

http://www.linglom.com/2007/06/06/how-to-run-command-line-or-execute-external-application-from-java/

http://www.rgagnon.com/javadetails/java-0014.html

In fact, Hadoop even has a "Shell" class for this purpose: http://stackoverflow.com/questions/7291232/running-hadoop-mapreduce-is-it-possible-to-call-external-executables-outside-of
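
For a plain-Java starting point, here is a minimal, self-contained sketch using ProcessBuilder. The standard Unix sort utility stands in for whatever binary you actually want to run; it writes a few lines to the child's stdin and reads the sorted output back.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class ExternalProgramSketch {
  public static void main(String[] args) throws Exception {
    // Launch the external program ("sort" is just a stand-in).
    ProcessBuilder pb = new ProcessBuilder("sort");
    pb.redirectErrorStream(true);              // merge stderr into stdout
    Process p = pb.start();

    // Send input to the child's stdin, then close it so the child sees EOF.
    Writer stdin = new OutputStreamWriter(p.getOutputStream());
    stdin.write("banana\napple\ncherry\n");
    stdin.close();

    // Read whatever the child prints on stdout.
    BufferedReader stdout = new BufferedReader(new InputStreamReader(p.getInputStream()));
    String line;
    while ((line = stdout.readLine()) != null) {
      System.out.println(line);
    }
    System.out.println("exit code: " + p.waitFor());
  }
}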

 

Q: Can I use a Bootstrapping action (when booting the EC2 node via Elastic MapReduce) to access files in S3?

A: Yes!

Amazon provides an example bootstrap script at http://aws.amazon.com/articles/3938 (about halfway down the page). In that example, they use the 'wget' utility to download an example .tar.gz archive and uncompress it. Note, however, that while the downloaded file lives in S3, it is a PUBLIC file, which is what allows wget to work over plain HTTP. (You can mark files as public via the S3 web interface.)

A second example is posted at http://aws.typepad.com/aws/2010/04/new-elastic-mapreduce-feature-bootstrap-actions.html This one uses the command hadoop fs -copyToLocal to copy from S3 to the local file system. This *should* work even if your S3 files aren't marked as public. (After all, if Hadoop hadn't already been configured by Amazon with your credentials, it could not have loaded the .jar file containing your MapReduce program in the first place.)
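
If you would rather stay entirely in Java, a related (but different) approach is to skip the bootstrap action and copy a support file from S3 inside the job itself, using Hadoop's FileSystem API in the task's configure() method. This is only a sketch of that alternative; the S3 and local paths are hypothetical.

// Fetch a (non-public) support file from S3 onto the task node's local disk
// before any map() calls run. The fs.s3n.* credentials already in the job
// configuration are used to authenticate.
public static class MyMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  public void configure(JobConf job) {
    try {
      Path src = new Path("s3n://[S3 bucket with program]/support/mytool.tar.gz"); // hypothetical file
      Path dst = new Path("/mnt/var/tmp/mytool.tar.gz");                           // local path on the node
      FileSystem fs = FileSystem.get(src.toUri(), job);
      fs.copyToLocalFile(src, dst);
    } catch (IOException e) {
      throw new RuntimeException("Could not fetch support file from S3", e);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output, Reporter reporter) throws IOException {
    // ... normal map logic, which can now use the local copy of the tool ...
  }
}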

 

Cleaning Up

Don't forget to shut down the EC2 servers you (implicitly) started when you created your job flow!

./elastic-mapreduce --list
./elastic-mapreduce --terminate [job flow ID]

 

S3 charges you per gigabyte of storage used per month. Be tidy with your output files, and delete them when finished!

 

CommonCrawl Datasets

(Why they make it so hard to figure out the location and sizes of their various datasets is beyond me...)

Date | Location | Announcement | Size | Notes
2013 #2 | s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-48 | [Link] | 148TB, 2.3 billion pages | 48th week of 2013, NEW WARC FORMAT
2013 #1 | s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20 | [Link] | 102TB, 2 billion pages | 20th week of 2013, NEW WARC FORMAT
2012 | s3://aws-publicdatasets/common-crawl/parse-output | [Link] | 100TB, 3.8 billion pages | ARC + Text + Metadata files
2009/2010 | s3://aws-publicdatasets/common-crawl/crawl-002 | [Link] |  | ARC files only
2008/2010 | s3://aws-publicdatasets/common-crawl/crawl-001/ | [Link] |  | ARC files only

Data for the 2013 datasets is available in 3 different formats, listed below.  For more details on each of these data formats and the overall directory structure, see this blog post.

  • WARC file (HTTP request and HTTP response)
  • WARC-encoded WAT files (metadata of each request/response)
  • WARC-encoded WET files (text extracted from each response payload)
Note: Before deciding to use 2013 data, first ensure that the appropriate software libraries to process this data are available!  (The example code provided does not currently support WARC files...)
 
Data for the 2012 dataset is available in 3 different formats.  Each file is gzip compressed to save space.  For more details on each of these data formats, visit Wiki Documentation : About the Dataset 
  • ARC-formatted files - Contain the full HTTP response and payload provided by the server, in "ARC" format
    • For example:  s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690166822/1341804771845_1459.arc.gz
  • Text files - Contain only the parsed text from the payload, without any formatting. This is limited to HTML and RSS files only, not any other file type downloaded by the crawl.
    • For example: s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690166822/textData-00681
  • Metadata files - Contain the HTTP response code, file names, and offsets into the ARC files where the raw content can be found
    • For example: s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690166822/metadata-01096
  • The long number after the "segment" directory name indicates the UNIX timestamp (milliseconds since 1/1/1970) when the crawl first started.  If you're curious, you can use an online converter to see that 1341690166822 is really Saturday, July 7th, 2012 at 19:42 GMT.
  • Warning!  Not all the 2012 data is accessible.  There are many old segments with incomplete data that the crawl is in the process of either updating or removing.  For the moment, they have set the permissions on these files to block read access.  If your program tries to read them by mistake, it will crash.  To avoid this headache, check this file first:  s3://aws-publicdatasets/common-crawl/parse-output/valid_segments.txt.  It lists all the valid segment numbers, one number per line. For more information on this approach, and for some helpful Java code that can integrate this file into your overall program, view this discussion thread on the Common Crawl Google Group.  A rough Java sketch of the idea follows this list.
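
Here is that rough sketch. It is not the code from the Google Group thread; it assumes Hadoop's s3n filesystem has already been configured with your AWS credentials (for example, via the -Dfs.s3n.* arguments shown earlier), reads valid_segments.txt, and builds one input path per valid segment.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ValidSegments {
  // Returns one input path (glob) per valid 2012 segment, suitable for passing
  // to FileInputFormat.addInputPath(). Segment names are epoch-millisecond
  // timestamps (e.g. 1341690166822 = Saturday, July 7th, 2012 19:42 GMT).
  public static List<Path> listValidSegmentPaths(Configuration conf) throws Exception {
    Path validSegments =
        new Path("s3n://aws-publicdatasets/common-crawl/parse-output/valid_segments.txt");
    FileSystem fs = FileSystem.get(validSegments.toUri(), conf);

    List<Path> inputs = new ArrayList<Path>();
    BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(validSegments)));
    try {
      String segment;
      while ((segment = reader.readLine()) != null) {
        segment = segment.trim();
        if (segment.length() == 0) continue;   // skip blank lines
        inputs.add(new Path("s3n://aws-publicdatasets/common-crawl/parse-output/segment/"
            + segment + "/textData-*"));
      }
    } finally {
      reader.close();
    }
    return inputs;
  }
}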

 

Data for the 2008-2010 datasets is only available as ARC-formatted files; no text or metadata content is available. Each dataset is indexed using the directory scheme /YYYY/MM/DD/HH/*.arc.gz, where HH is the hour the crawler ran (in 24-hour format).

 

References and "Useful" Links

(Please suggest more links that you find useful!)