CommonCrawl Tutorial
This tutorial is based on Steve Salevan's (older) blog post MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl and the (newer) Common Crawl WARC examples. In addition, it has been updated for Hadoop 2.6.x / AWS EMR 4.2.x and is written to use the Amazon command-line interface (CLI) to Elastic MapReduce. Be sure to use the custom code for the tutorial.
Last updated: January 2016
Install Development Tools
Install Java 7 SE JDK (Oracle on Windows, OpenJDK on Linux)
Note: You want the JDK (Java Development Kit), not the JRE (Java Runtime Environment, which can only run Java code)
- Linux: sudo apt-get install openjdk-7-jdk openjdk-7-jre
- Windows: Download from Oracle, install, and manually set the JAVA_HOME environment variable afterwards.
Install Eclipse (for writing Hadoop code)
Latest version: Eclipse Mars
- Windows and Linux: Download the “Eclipse IDE for Java developers” installer package located at: http://www.eclipse.org/downloads/
- Linux installation:
tar -xzf eclipse-java-mars-1-linux-gtk-x86_64.tar.gz
cd eclipse
./eclipse &
Install Git
Any version
- Linux: sudo apt-get install git
- Windows: Install the latest .exe from: https://git-for-windows.github.io/
- To use Git later, go to the Start Menu, find Git, and then choose "Git Bash"
Install a stand-alone Amazon S3 browser. Examples include:
- S3Browser: http://s3browser.com/ (Windows)
- S3Fox Firefox extension: http://www.s3fox.net/ (Mac, Windows, Linux)
Although you can use Amazon's web interface to upload/download from S3, a stand-alone S3 file browser on your computer might make this easier in the future. Further, it would allow you to explore outside of your own buckets. For example, you could explore the CommonCrawl datasets at s3://aws-publicdatasets/common-crawl/crawl-data
Install the AWS Command Line Interface (CLI)
- Linux:
sudo apt-get install python-pip python-dev build-essential
sudo pip install awscli
- Windows: Download the latest installer from: http://aws.amazon.com/cli/
Configure Amazon CLI
Obtain your AWS Access Key ID and Secret Access Key. You should have saved a copy when you initially set up your account, because the secret key cannot be retrieved later. If you didn't, don't worry: you can always delete the old key and create a new one under the "Security Credentials" section of the AWS web console (under the dropdown menu with your name).
Configure the CLI with your credentials:
cd
aws configure
# Enter your 'AWS Access Key ID' when prompted
# Enter your 'AWS Secret Access Key' when prompted
# Enter your default region: us-east-1
# (this region is US East Coast - Virginia)
# Enter your default output format: Press ENTER for NONE
Configure a keypair for Amazon's Elastic Compute Cloud (EC2) service. This keypair is saved on all EC2 nodes you launch, and allows you to SSH directly into them as part of this tutorial.
cd
# Create key-pair
aws ec2 create-key-pair --key-name ECPE276KeyPair --query 'KeyMaterial' --output text > ~/ECPE276KeyPair.pem
# Make key-pair readable to only your user
chmod 400 ECPE276KeyPair.pem
Save this keypair file!
Create a bucket (aka "folder") in Amazon's Simple Storage Service (S3)
aws s3api create-bucket --bucket [TODO-BUCKET-NAME] --region us-east-1
aws s3api list-buckets # Should show your new bucket
Note that bucket names must be unique across the entire world! So, something like "yourusername-commoncrawl-tutorial" is a good choice. Remember this bucket name!
Compile and Build CommonCrawl Example
Now that you've installed the development tools, you can get the CommonCrawl example code. A special ECPE 276 version is provided to reduce installation and compilation problems. Run the following command from a terminal/command prompt to pull down the code:
cd
git clone https://bitbucket.org/shafer/2016_spring_commoncrawl_example.git
Now, start Eclipse and import the demo code into a new project.
- Open the File menu and then select "Import..."
- Open the "Maven" folder and select "Existing Maven Projects"
- Next to "Root Directory", select the "Browse" button and navigate to the "2016_spring_commoncrawl_example" folder you just checked out from Git.
- Ensure that the "pom.xml" file is checked. This Project Object Model (POM) is an XML configuration file for Maven that contains build information about the project.
- Click "Finish".
- Right-click on the base project folder ("ecpe276-cc-example") in the Package Explorer panel and select "Properties" from the menu that appears.
- Navigate to the Builders tab in the left hand panel of the Properties window.
- Uncheck the existing "Java Builder" option, leaving the "Maven" builder selected, then click OK.
To compile the demo code, you need to invoke Maven.
- Right-click on the base project folder ("ecpe276-cc-example") in the Package Explorer panel
- Choose "Run As..." and select "Maven Install"
- Maven should download all the project dependencies and then invoke the Java compiler. At the end, Maven will produce a JAR file ready for use in Elastic MapReduce.
Upload the JAR file to Amazon S3
Go to the web console interface to S3, or launch the stand-alone S3 file browser program you installed previously. Upload the JAR file you built in Eclipse into the S3 bucket you created previously. The JAR file should be located here:
<your checkout dir>/target/ecpe276-cc-example-1-jar-with-dependencies.jar
(The "with-dependencies.jar" file is the one that you want to upload)
Create an Elastic MapReduce job based on your new JAR
Now that the JAR is uploaded into S3, all we need to do is to point Elastic MapReduce to it. The original tutorial uses the AWS web interface to accomplish this, but as serious programmers we will do it via the command-line interface.
Get a list of all commands - Helpful for future reference and debugging!
(Web-friendly documentation is available at http://docs.aws.amazon.com/cli/latest/reference/emr/index.html)
aws emr help
Create a default set of roles before building your first cluster (only needs to be done once)
aws emr create-default-roles
Create a new cluster, choosing its name, number of nodes, node types, and software versions pre-installed on the node:
aws emr create-cluster \
--no-auto-terminate \
--name "TODO: ENTER HUMAN FRIENDLY NAME" \
--instance-count 3 \
--instance-type m1.large \
--log-uri s3n://[TODO: YOUR S3 BUCKET WITH PROGRAM]/logs \
--enable-debugging \
--ec2-attributes KeyName=ECPE276KeyPair \
--release-label emr-4.2.0
******************************************
WARNING WARNING WARNING!!!
The --no-auto-terminate argument tells Amazon that you want to keep your EC2 servers running even when the specific tasks finish! While this saves a lot of time in starting up servers while doing development/testing, it will cost you $$ if you forget to turn them off when you truly are finished!
WARNING WARNING WARNING!!!
******************************************
Note 1: A variety of virtual machine types are available for you to use with Elastic MapReduce. This is a subset of the full list of EC2 machines available for rental. But before you jump straight to the largest, fastest, most expensive machine, ask yourself: will this actually get my program done faster? Or would I be better off using a larger number of small, cheap machines? If the bottleneck is Amazon's S3 storage system, then renting an expensive machine is a waste of money, because its CPU and memory will sit idle much of the time, waiting for new data from the network.
Note 2: The --release-label option specifies the versions of pre-installed software. Amazon EMR 4.2.0 includes Hadoop 2.6.0, along with many other tools.
Note 3: Record the cluster-id output by this command - you need it in the following steps.
Add a step to your cluster, thus adding your specific MapReduce program:
Create a text file called "step.json" (or whatever name you choose) configuring this step:
[
{
"Name": "TODO: ENTER HUMAN FRIENDLY NAME",
"Args": ["s3n://[TODO: YOUR S3 BUCKET]/output/WETWordCount"],
"Jar": "s3n://[TODO: YOUR S3 BUCKET]/ecpe276-cc-example-1-jar-with-dependencies.jar",
"ActionOnFailure": "CANCEL_AND_WAIT",
"MainClass": "org.commoncrawl.examples.mapreduce.WETWordCount",
"Type": "CUSTOM_JAR"
}
]
What are we doing with this step? We're telling Amazon that we want to run a .jar file for our MapReduce program, and providing a number of other details, including:
- The main() method for Hadoop to run is found in org.commoncrawl.examples.mapreduce.WETWordCount
- Output the results as a series of text files into your Amazon S3 bucket (in a directory called /output/WETWordCount)
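To make the connection between "MainClass", "Args", and the MapReduce code concrete, here is a rough, hypothetical sketch of what such a driver's main() typically looks like. This is NOT the actual ecpe276-cc-example source; the class name, input path, and key/value types below are illustrative only.

// Hypothetical sketch (not the ecpe276-cc-example source) showing how a Hadoop
// driver typically consumes the "Args" entry from step.json: args[0] arrives as
// the S3 output path, and "MainClass" names the class whose main() is run.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class ExampleDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Example WET word count");
        job.setJarByClass(ExampleDriver.class);

        // Input: a WET path from the Common Crawl public dataset.
        // The exact path here is a placeholder; the real example supplies its own.
        FileInputFormat.addInputPath(job, new Path(
            "s3n://aws-publicdatasets/common-crawl/crawl-data/[SOME-CRAWL]/segments/[SOME-SEGMENT]"));

        // Output: the first command-line argument, i.e. the "Args" entry in step.json.
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        job.setOutputFormatClass(TextOutputFormat.class);

        // Mapper/Reducer classes would be configured here (omitted in this sketch).
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}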
Note 1: You can have multiple steps in a single JSON file; simply separate the { } blocks with commas.
Note 2: While debugging, you don't need to shut down your cluster each time. Just add a new step to run your revised code on the currently-active cluster.
Submit the step to EMR:
aws emr add-steps --cluster-id [TODO: CLUSTER-ID] --steps file://./step.json
Now, your Hadoop job will be sent to the Amazon cloud.
Watch the show
It will take Amazon some time (5+ minutes?) to spin up your requested instances and begin processing. You can watch the status of your job in several places:
- Via command-line, e.g.
aws emr list-clusters
aws emr list-steps --cluster-id [CLUSTER-ID]
- Via web interface
- Select the EMR option
- Is your cluster and step there? Is it running, or if it failed, are there error messages / error logs available to view? Log files may take 1-2 minutes to appear, so just keep clicking the refresh button.
- What does the cluster CPU / memory / I/O usage look like?
- Select the EC2 tab
- Have your servers been started yet?
- Select the S3 tab
- Is your new bucket there?
- Are there log files (standard output, standard error) there?
- Has your output directory been created yet? (That's a very good way to know it finished!). You can download the output files via the S3 web interface if you want to inspect them.
- Via Hadoop web console - NOT the Amazon web interface, but a Hadoop-specific webpage - See below
Cleaning Up
Don't forget to shut down the EC2 servers you (implicitly) started when you created your EMR cluster!
aws emr list-clusters
aws emr terminate-clusters --cluster-ids [CLUSTER-ID]
S3 charges you per gigabyte of storage used per month. Be tidy with your output files, and delete them when finished!
Advanced Topics - SSH to the Hadoop Master Node
You can obtain an SSH console on the Hadoop master node via the Amazon CLI.
First, edit the security group used for the MapReduce master node, and add a custom TCP rule for port 22 from anywhere (0.0.0.0/0). Otherwise, your SSH connection will be blocked. This rule is permanent for future connections.
aws ec2 authorize-security-group-ingress --group-name ElasticMapReduce-master --protocol tcp --port 22 --cidr 0.0.0.0/0
Then, proceed with the SSH connection:
aws emr ssh --cluster-id [CLUSTER-ID] --key-pair-file ~/ECPE276KeyPair.pem
Advanced Topics - Watch the show with Hadoop web interface
Hadoop provides a web interface that is separate from the general Amazon Web Services interface. (i.e. it works even if you run Hadoop on your own personal cluster, instead of Amazon's). To access it, first obtain the public DNS address of the Hadoop master node. It should be something like ec2-######.amazonaws.com:
aws emr describe-cluster --cluster-id [CLUSTER ID]
# Look for the line "MasterPublicDnsName"
Using SSH, listen locally on port 8157 and create an encrypted tunnel that connects between your computer and the Hadoop master node. We'll use this tunnel to transfer web traffic to/from the cluster. Rather than using a username/password, this command uses the public/private key method to authenticate.
aws emr socks --cluster-id [CLUSTER ID] --key-pair-file ~/ECPE276KeyPair.pem
You won't see any messages on the screen if this works. Leave SSH running for as long as you want the tunnel to exist. Elastic MapReduce isn't magic; it's just doing the work of configuring a virtual machine and installing software automatically in it.
You need to configure your web browser to access URLs of the format *ec2*.amazonaws.com* --or-- *ec2.internal* by using the SSH tunnel (port 8157) you just created as a SOCKSv5 proxy instead of accessing those URLs directly. Setup details differ by browser. AWS has proxy configuration instructions online.
Finally, access the following URLs for different Hadoop services:
- Hadoop ResourceManager (the YARN replacement for the old "JobTracker"): http://ec2-########.amazonaws.com:8088
- Here you can see all submitted (current and past) MapReduce jobs, watch their execution progress, and access log files to view status and error messages.
- Hadoop NameNode: http://ec2-########.amazonaws.com:50070 (HDFS is not used to store data for your projects, so it should be essentially empty)
Advanced Topics - FAQs from Prior Projects
Q: Can I get a single output file?
A: Yes! To get a single output file at the end of processing, you need to run only *1* reducer across the entire cluster. This is a bottleneck for large datasets, but it might be worth it compared to the alternative (writing a simple program that merges your N sorted output files into 1 sorted output file). Configure the job after it has been created, but before it runs:
job.setNumReduceTasks(1);   // force a single reducer, and thus a single output file
// ... remaining job configuration ...
job.waitForCompletion(true);
// ...
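If you would rather keep multiple reducers and merge their sorted part-r-##### files afterward, a small stand-alone merge program is enough. The sketch below is a generic, hypothetical example (not part of the tutorial code): it performs a k-way merge of any number of text files that are each already sorted by line, comparing whole lines lexicographically (which matches the default key ordering of typical word-count output).

// Hypothetical helper (not part of the tutorial code): k-way merge of N text
// files, each already sorted by line, into one sorted output file.
// Usage: java MergeSortedFiles out.txt part-r-00000 part-r-00001 ...
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.PriorityQueue;

public class MergeSortedFiles {
    // One entry per input file: the current line plus the reader it came from.
    private static class Entry implements Comparable<Entry> {
        final String line;
        final BufferedReader reader;
        Entry(String line, BufferedReader reader) { this.line = line; this.reader = reader; }
        public int compareTo(Entry other) { return this.line.compareTo(other.line); }
    }

    public static void main(String[] args) throws IOException {
        BufferedWriter out = new BufferedWriter(new FileWriter(args[0]));
        PriorityQueue<Entry> heap = new PriorityQueue<Entry>();

        // Prime the heap with the first line of each input file.
        for (int i = 1; i < args.length; i++) {
            BufferedReader reader = new BufferedReader(new FileReader(args[i]));
            String line = reader.readLine();
            if (line != null) heap.add(new Entry(line, reader));
            else reader.close();
        }

        // Repeatedly emit the smallest pending line and refill from the same file.
        while (!heap.isEmpty()) {
            Entry smallest = heap.poll();
            out.write(smallest.line);
            out.newLine();
            String next = smallest.reader.readLine();
            if (next != null) heap.add(new Entry(next, smallest.reader));
            else smallest.reader.close();
        }
        out.close();
    }
}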
Q: Can I change the output data type of a map task, i.e. changing a key/value pair such that the "value" is a string, not an int?
A: Yes! A discussion at StackOverflow notes that you need to change the type in three different places:
- the map task
- the reduce task
- and the output value (or key) class in main().
For this example, the original poster wanted to change from Int (IntWritable) to Double (DoubleWritable). You should be able to follow the same method to change from Long (LongWritable) to Text.
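As a concrete illustration, here is a hypothetical sketch (not the tutorial code; the class names are made up) showing the three places the value type appears when switching to Text values:

// Hypothetical illustration of changing the value type from LongWritable to Text.
// The type must agree in the Mapper, the Reducer, and the job configuration.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TextValueExample {
    // 1) Mapper: the fourth type parameter (output value) is now Text.
    public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text("some-key"), new Text(value));
        }
    }

    // 2) Reducer: the input value type and output value type are now Text.
    public static class MyReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text v : values) {
                context.write(key, v);
            }
        }
    }

    // 3) main(): tell the Job object about the new value class.
    static void configure(Job job) {
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);   // was LongWritable.class
    }
}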
Q: Can I run an external program from inside Java?
A: Yes! There are a few examples online that cover invoking an external binary, sending it data (via stdin), and capturing its output (via stdout) from Java.
http://www.linglom.com/2007/06/06/how-to-run-command-line-or-execute-external-application-from-java/
http://www.rgagnon.com/javadetails/java-0014.html
In fact, Hadoop even has a "Shell" class for this purpose: http://stackoverflow.com/questions/7291232/running-hadoop-mapreduce-is-it-possible-to-call-external-executables-outside-of
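A minimal, hypothetical sketch of the standard ProcessBuilder approach is shown below ('sort' is used purely as a stand-in for whatever external binary you actually need):

// Hypothetical sketch of running an external program from Java, feeding it
// data on stdin and capturing its stdout. "sort" is just a stand-in binary.
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

public class ExternalProgramExample {
    public static void main(String[] args) throws Exception {
        ProcessBuilder pb = new ProcessBuilder("sort");
        pb.redirectErrorStream(true);            // merge stderr into stdout
        Process process = pb.start();

        // Send input to the external program's stdin.
        BufferedWriter stdin = new BufferedWriter(
                new OutputStreamWriter(process.getOutputStream()));
        stdin.write("banana\napple\ncherry\n");
        stdin.close();                           // signal end-of-input

        // Read the external program's stdout.
        BufferedReader stdout = new BufferedReader(
                new InputStreamReader(process.getInputStream()));
        String line;
        while ((line = stdout.readLine()) != null) {
            System.out.println(line);
        }

        int exitCode = process.waitFor();
        System.out.println("Exit code: " + exitCode);
    }
}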
Q: Can I use a Bootstrapping action (when booting the EC2 node via Elastic MapReduce) to access files in S3?
A: Yes!
Amazon provides an example bootstrap script (about halfway down the page). In the example, they use the 'wget' utility to download an example .tar.gz archive and uncompress it. Note, however, that while the download file is in S3, it is a PUBLIC file, allowing wget to work in HTTP mode. (You can mark files as public via the S3 web interface).
A second example is posted at http://aws.typepad.com/aws/2010/04/new-elastic-mapreduce-feature-bootstrap-actions.html. This uses the command hadoop fs -copyToLocal to copy from S3 to the local file system. This *should* work even if your S3 files aren't marked as public. (After all, Hadoop must already be configured by Amazon with your credentials; otherwise it couldn't load the .jar file containing your MapReduce program in the first place.)
References and "Useful" Links
(Please suggest more links that you find useful!)
- Common Crawl Google Groups Discussion Forum
- Common Crawl GitHub repository (library to read data format + several project examples)
- Common Crawl Project Examples
- Amazon Elastic MapReduce guide - with step-by-step examples!
- Amazon Public Datasets - Common Crawl Corpus