Project 2 - Simple wget

This assignment can be completed either individually, or in groups with a maximum of 2 members. You can discuss problems and potential solutions with other students, but you cannot share completed programs or significant pieces of completed code. See the honor code in the syllabus for more details.

Project Objectives

Hands-on experience with HTTP
Hands-on experience with TCP sockets
Hands-on experience with C programming, including details such as file I/O, C string parsing, and command-line argument parsing.

To support these objectives, you are not allowed to use any pre-built HTTP, URL-parsing, or socket management libraries.

Project Description

GNU Wget is a free utility for non-interactive download of files from the Web. (See http://www.gnu.org/software/wget/) It supports HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies. In this project, you will be implementing a simplified version of wget (called swget) in C that supports a small subset of the features of the full program.

Given a target URL, your version of wget ("swget") will attempt to download that file using the HTTP 1.1 protocol over TCP/IP and save it to the desired directory on local disk. This file might be an HTML file (a web page), a binary image, a zip file containing a program, or other data. Your program does not need to implement recursive downloading of all linked HTML pages. (A common usage of the real wget utility is to mirror an entire website to local disk). When saving the file to disk, do not save the HTTP header information in the file. After downloading the requested file, your program should exit.

Your project should produce a single binary named swget that supports the following command-line arguments: (See tips section below for a standard method to parse arguments given in this format)

 swget [OPTIONS]
--url=<url to download>   This argument is REQUIRED.
--destdir=<destination directory to save files to>   This argument is REQUIRED.
--verbose   Turn on verbose output for debugging. This option should also print the headers sent by the HTTP server (see Desired Output section below)
--help  Display help message  (which should print this list of commands)

Example usage at the command line

swget --url=http://www.google.com --destdir=/path/to/myhome --verbose
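
The argument list above can be handled with glibc's argp interface. The following is only a minimal sketch, assuming glibc is available; the names swget_args, swget_options, swget_parse_opt, and swget_parse_args are hypothetical and not part of the assignment. argp generates --help on its own from the option table.

```c
#include <argp.h>
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the parsed options. */
struct swget_args {
    const char *url;
    const char *destdir;
    int verbose;
};

/* Option table: long name, short key, argument name, flags, help text, group. */
static const struct argp_option swget_options[] = {
    {"url",     'u', "URL", 0, "URL to download (required)", 0},
    {"destdir", 'd', "DIR", 0, "Directory to save the file in (required)", 0},
    {"verbose", 'v', 0,     0, "Print debugging output and server headers", 0},
    {0, 0, 0, 0, 0, 0}
};

/* Called by argp once per option, then once with ARGP_KEY_END. */
static error_t swget_parse_opt(int key, char *arg, struct argp_state *state)
{
    struct swget_args *args = state->input;

    switch (key) {
    case 'u': args->url = arg; break;
    case 'd': args->destdir = arg; break;
    case 'v': args->verbose = 1; break;
    case ARGP_KEY_END:
        /* argp_error() prints a usage hint and exits the program. */
        if (args->url == NULL || args->destdir == NULL)
            argp_error(state, "--url and --destdir are both required");
        break;
    default:
        return ARGP_ERR_UNKNOWN;
    }
    return 0;
}

/* Fills *out from argv; returns 0 on success. */
static int swget_parse_args(int argc, char **argv, struct swget_args *out)
{
    static const struct argp parser = {
        swget_options, swget_parse_opt, 0, "swget: a simplified wget", 0, 0, 0
    };

    assert(out != NULL);
    out->url = NULL;
    out->destdir = NULL;
    out->verbose = 0;
    return argp_parse(&parser, argc, argv, 0, NULL, out);
}
```

A main() function would call swget_parse_args(argc, argv, &args) first and then read args.url, args.destdir, and args.verbose. Because the keys in this sketch are printable characters, argp also accepts the short forms -u, -d, and -v.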

Example program flow for HTTP 1.1

Client opens TCP connection to server on port 80
Client sends a request to the server:
GET /filename HTTP/1.1
Host: www.server.com
Connection: close
Server sends response
Server closes TCP connection

In HTTP 1.1, the default behavior is for the client and server to keep the connection open for re-use. Here, however, the request explicitly asks for the connection to be closed (the "Connection: close" header). This simplifies program design, and it suits swget, which only downloads one file at a time.
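
The request in the flow above could be assembled as in the following sketch. The function name build_get_request is hypothetical. Note that every header line ends with "\r\n" and that an empty line terminates the request.

```c
#include <assert.h>
#include <stdio.h>

/* Writes a GET request for `path` on `host` into buf.
 * Returns the request length in bytes, or -1 if buf is too small. */
static int build_get_request(char *buf, size_t cap,
                             const char *host, const char *path)
{
    assert(buf != NULL && host != NULL && path != NULL);

    int n = snprintf(buf, cap,
                     "GET %s HTTP/1.1\r\n"
                     "Host: %s\r\n"
                     "Connection: close\r\n"
                     "\r\n",
                     path, host);
    if (n < 0 || (size_t)n >= cap)
        return -1;
    return n;
}
```

For example, build_get_request(req, sizeof req, "www.server.com", "/filename") fills req with the three request lines shown above followed by the terminating blank line.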

When attempting to download a URL, your program must parse the HTTP response to determine if the request succeeded. While there are many HTTP responses, your program only needs to support the following options and take the indicated action:

200 - OK - Request succeeded! Save this page to disk with the requested file name
301 - Moved Permanently - The file is no longer at the URL you specified, but the server is telling you the new URL where the file now permanently lives. Parse the response header, determine the new location, and re-request at that address. Note that you might have to follow several re-directs in a row in order to arrive at the actual file. Further, there is nothing to prevent several re-directs to eventually end at a 404 error page!
302 - Found - This works similarly to "301 Moved Permanently", with the change that in the future, this redirect might point to a different location. (i.e., it is not permanent)
400 - Bad Request - This indicates an error in your swget implementation. Your request to the server has a syntax error or is otherwise invalid.
404 - Not Found - The file you requested was not found. Do not save the error page (if any) to disk. Rather, display an error message on the console, and exit the program.

While there are many other possible status codes, these should be sufficient for your simplified wget program.
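
One possible way to turn the status line into one of the actions listed above is sketched here. The names parse_status_code, action_for_status, and the swget_action enum are hypothetical; the sketch treats 400, 404, and any unlisted code as a failure.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical summary of what swget should do next. */
enum swget_action { SWGET_SAVE, SWGET_REDIRECT, SWGET_FAIL };

/* Extracts the numeric code from a status line such as
 * "HTTP/1.1 200 OK". Returns -1 if the line is malformed. */
static int parse_status_code(const char *response)
{
    int code = 0;

    assert(response != NULL);
    /* %*d reads and discards the major and minor version numbers. */
    if (sscanf(response, "HTTP/%*d.%*d %d", &code) != 1)
        return -1;
    return code;
}

/* Maps a status code onto the action described in the list above. */
static enum swget_action action_for_status(int code)
{
    switch (code) {
    case 200: return SWGET_SAVE;
    case 301:
    case 302: return SWGET_REDIRECT;
    default:  return SWGET_FAIL;   /* 400, 404, and anything else */
    }
}
```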

Desired Output

The output of your program should be "inspired" by the real wget utility. Here is example output for both verbose and non-verbose mode. You do not have to match this output character-by-character, but the overall information conveyed should be similar.

Verbose output:

#> swget --url=http://www.pacific.edu/Documents/registrar/acrobat/2010-2011-catalog.pdf --destdir=/path/to/myhome --verbose
 
Downloading http://www.pacific.edu/Documents/registrar/acrobat/2010-2011-catalog.pdf
Resolving www.pacific.edu... 192.168.200.100
Connecting to www.pacific.edu|192.168.200.100|:80... connected.
HTTP request sent, awaiting response...
Received HTTP response header:
HTTP/1.1 200 OK
Content-Length: 2122207
Content-Type: application/pdf
Last-Modified: Wed, 09 Jun 2010 17:06:57 GMT
Accept-Ranges: bytes
ETag: "95ee2b2bf67cb1:239"
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
Date: Thu, 28 Oct 2010 16:27:10 GMT
Connection: keep-alive
Length: 2122207 (2.0M) [application/pdf]
Saving to: '/path/to/myhome/2010-2011-catalog.pdf'
Finished

Non-verbose output:

#> swget --url=http://www.pacific.edu/Documents/registrar/acrobat/2010-2011-catalog.pdf --destdir=/path/to/myhome
 
Downloading http://www.pacific.edu/Documents/registrar/acrobat/2010-2011-catalog.pdf
Length: 2122207 (2.0M) [application/pdf]
Saving to: `/path/to/myhome/2010-2011-catalog.pdf'
Finished
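
The "(2.0M)" part of the Length line can be produced from the Content-Length value. The sketch below is an approximation only: the function name format_size is hypothetical, and the exact rounding rules of the real wget are not reproduced here.

```c
#include <assert.h>
#include <stdio.h>

/* Writes a short human-readable size such as "2.0M" into out.
 * Sizes below 1024 bytes are written as a plain number. */
static void format_size(unsigned long bytes, char *out, size_t cap)
{
    static const char units[] = {'K', 'M', 'G'};
    double value = (double)bytes;
    int unit = -1;

    assert(out != NULL && cap > 0);
    /* Divide by 1024 until the value is small or units run out. */
    while (value >= 1024.0 && unit < 2) {
        value /= 1024.0;
        unit++;
    }
    if (unit < 0)
        snprintf(out, cap, "%lu", bytes);
    else
        snprintf(out, cap, "%.1f%c", value, units[unit]);
}
```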

Testing

You can use the following URLs as example user input when testing your program. Some will download files, and others will produce errors for testing various aspects of your program. (When grading your program, I reserve the right to test URLs not specified in this list, as your program should function equally well with any web server that supports HTTP 1.1).

Use in arguments of the form --url=<url to download>

http://www.google.com/images/logos/google_logo_41.png (should work and produce an image)
http://www.google.com/about/products/ (should work and produce an HTML file)
http://www.pacific.edu/Documents/registrar/acrobat/2010-2011-catalog.pdf (should produce a binary file)

For other big files to download, consider the PDF slides for this class in the Resources section!

http://www.google.com (should produce an HTML file)
www.google.com (should produce an HTML file)
google.com (should produce an HTML file)
google.com/about (should redirect to www.google.com/about)

74.125.224.144/about (should redirect to www.google.com/about)
www.google.com/non-existent-page.html (should produce a 404 error)

google/about (should not work! Either detect this error immediately, or try to do a DNS lookup on "google" and fail.)
www.yahoo.com/privacy (Does several redirects in a row)

Note about default files: Suppose you specify a URL of http://www.google.com/. What is the filename? There is none! The webserver takes the default file for that (root) directory, and returns that instead. Each webserver can make a different determination of what default file to send, but usually it's something like "index.html" or "default.htm". When using wget to download a URL with no file specified, wget chooses a default name for the file it creates on disk of "index.html". Your program should do the same. (Note that your request to the webserver does not specify a filename. This default is merely for the file saved to local disk, which does need a filename).
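
The default-file rule above might be implemented along these lines. This is a sketch with a hypothetical function name, local_filename; as a simplification it ignores query strings (anything after a '?').

```c
#include <assert.h>
#include <string.h>

/* Returns the name to use for the file on local disk: the text after
 * the last '/' in the URL path, or "index.html" if that text is empty. */
static const char *local_filename(const char *path)
{
    assert(path != NULL);

    const char *slash = strrchr(path, '/');
    const char *name = (slash != NULL) ? slash + 1 : path;
    return (*name != '\0') ? name : "index.html";
}
```

The returned pointer refers either to the tail of the path argument or to a string literal, so the caller should not free it.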

Reliability

Obviously, it does the user no good at all if the requested file is corrupted in the process of downloading it. Although TCP is reliable, once the bytes are given to your swget program, it's up to you to treat them properly! The question is, how do we know for sure if a file was corrupted or not? For testing, you should download a file using both the real wget and your swget. Then, run "md5sum <filename>" on Linux to calculate an MD5 fingerprint for each file. If the fingerprints match, you can be reasonably sure that your program did not corrupt the download. (I will also calculate MD5 fingerprints on output files when grading your project.)

For the 2010-2011-catalog.pdf document listed above in the testing section, the MD5 fingerprint reported is:

59a81a7460923f366295cd411ca9f408 2010-2011-catalog.pdf

When creating a reliable program, remember that calls to send() and recv() are not guaranteed to send/receive as much data as you desire! (i.e. they could do less work). Thus, your program should be sure to call them repeatedly until all the work is finished. For more discussion and an example wrapper function for a reliable send, see section 7.3 in Beej's guide, http://beej.us/guide/bgnet/output/html/multipage/advanced.html#sendall
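
The loop-until-done idea can be sketched as follows. The function names send_all and recv_all are hypothetical (Beej's guide calls its version sendall), and this sketch does not retry calls interrupted by a signal.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Calls send() repeatedly until all len bytes have been sent.
 * Returns 0 on success, -1 on error. */
static int send_all(int sock, const char *buf, size_t len)
{
    size_t sent = 0;

    assert(buf != NULL);
    while (sent < len) {
        ssize_t n = send(sock, buf + sent, len - sent, 0);
        if (n < 0)
            return -1;
        sent += (size_t)n;
    }
    return 0;
}

/* Calls recv() repeatedly until cap bytes have arrived or the peer
 * closes the connection. Returns the byte count, or -1 on error. */
static ssize_t recv_all(int sock, char *buf, size_t cap)
{
    size_t got = 0;

    assert(buf != NULL);
    while (got < cap) {
        ssize_t n = recv(sock, buf + got, cap - got, 0);
        if (n < 0)
            return -1;
        if (n == 0)          /* peer closed the connection */
            break;
        got += (size_t)n;
    }
    return (ssize_t)got;
}
```

For a large download, swget would more likely write each piece returned by recv() to the output file as it arrives, rather than collecting the whole response in one buffer as recv_all does.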

Requirements

You must implement the program in C and use the following GCC options in Eclipse to set the compiler to a very picky mode: -std=c99 -Wall -Wextra -D_POSIX_SOURCE

-std=c99 (Use the more modern C99 standard for the C language)
-Wall and -Wextra (Turn on all warnings. By viewing and fixing issues that generate warnings, you will produce better, safer, C code)
-D_POSIX_SOURCE (Defines the _POSIX_SOURCE feature-test macro, which makes the system headers expose the POSIX declarations that the socket functions rely on)
These options are set in: Project Menu-> Properties -> C/C++ Build (expand the category) -> Settings -> Tool Settings -> GCC C Compiler

Warnings tab: Ensure box for -Wall is checked
Warnings tab: Ensure box for -Wextra is checked
Miscellaneous tab: Type in -std=c99. Append this to what is already in the field! The completed line should look like this: -c -fmessage-length=0 -std=c99

Add a new entry for _POSIX_SOURCE (without the D, because Eclipse provides that for you) in the symbols/defined symbols category.

If your program produces any warnings during compilation (with these options), 5 points will be deducted.
If your program doesn't compile, zero points will be awarded.

All communication must be done using TCP sockets for reliable data communication.

Tips and Comments

If you have a question about the desired behavior of your program, a general rule to follow is "do what the real wget does in that situation".
The pre-written argp library (its entry point is the argp_parse() function) can greatly simplify the parsing of program arguments, and it even handles the --help command automatically! (You really don't want to have to do this all from scratch! Spending a few minutes understanding how to use argp is much better than writing the code from scratch.) For more information, see:

You can find many intro-to-HTTP tutorials on the web. Here is a useful one: http://www.jmarshall.com/easy/http/
When testing for new lines in the HTTP response headers, check for '\r' followed by '\n' (carriage return, then new line) in consecutive characters, not just '\n'.
The more error-checking you do in your program (e.g. response codes from socket and file system calls), the easier it will be to troubleshoot and debug your program.
You will need to tell Eclipse what command-line arguments to run when executing your program. Go to Run->Run Configurations to set the arguments. Or, you may prefer to simply run your program at the command line by opening up two terminal windows or tabs.
When writing and testing your program, compare its output to the output of telnetting to port 80 of the web server and typing in requests manually. This was demonstrated in class.
Use the Valgrind program to locate (and then fix) memory related errors in your project.
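
Following the tip about '\r' and '\n': the headers end at the first empty line, that is, at the byte sequence "\r\n\r\n". A minimal sketch of locating that boundary is shown here; the function name find_body_offset is hypothetical. It takes an explicit length instead of relying on a terminating '\0', because a binary body may contain zero bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the offset of the first body byte in buf (the byte right
 * after "\r\n\r\n"), or -1 if the end of the headers is not in buf yet. */
static long find_body_offset(const char *buf, size_t len)
{
    assert(buf != NULL);
    for (size_t i = 0; i + 4 <= len; i++) {
        if (buf[i] == '\r' && buf[i + 1] == '\n' &&
            buf[i + 2] == '\r' && buf[i + 3] == '\n')
            return (long)(i + 4);
    }
    return -1;
}
```

Everything before the returned offset is header text; everything from the offset onward belongs in the saved file.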

When writing and testing your program, compare its output to the output of the real wget program, configured to show server headers (using the --server-response argument). For example, examine the output of the following command. Here, the URL originally requested was "www.yahoo.com/privacy", but the server issued a series of redirects (a 301, then a 302) that pointed wget, in sequence, to "http://privacy.yahoo.com/", then "http://info.yahoo.com/privacy/us/yahoo/details.html", which finally held the requested file.

Command:

wget --server-response www.yahoo.com/privacy

Output:

--2012-12-10 17:40:39--  http://www.yahoo.com/privacy
Resolving www.yahoo.com (www.yahoo.com)... 72.30.38.140, 2001:4998:f011:1fe::3000, 2001:4998:c:401::c:9101, ...
Connecting to www.yahoo.com (www.yahoo.com)|72.30.38.140|:80... connected.
HTTP request sent, awaiting response... 
 HTTP/1.1 301 Moved Permanently
 Date: Tue, 11 Dec 2012 01:40:39 GMT
 P3P: policyref="http://info.yahoo.com/w3c/p3p.xml", CP="CAO DSP COR CUR ADM DEV TAI PSA PSD IVAi IVDi CONi TELo OTPi OUR DELi SAMi OTRi UNRi PUBi IND PHY ONL UNI PUR FIN COM NAV INT DEM CNT STA POL HEA PRE LOC GOV"
 Cache-Control: private
 Location: http://privacy.yahoo.com/
 Vary: Accept-Encoding
 Content-Type: text/html; charset=utf-8
 Age: 0
 Transfer-Encoding: chunked
 Connection: keep-alive
 Server: YTS/1.20.13
Location: http://privacy.yahoo.com/ [following]
--2012-12-10 17:40:39--  http://privacy.yahoo.com/
Resolving privacy.yahoo.com (privacy.yahoo.com)... 98.137.133.155
Connecting to privacy.yahoo.com (privacy.yahoo.com)|98.137.133.155|:80... connected.
HTTP request sent, awaiting response... 
 HTTP/1.1 302 Found
 Date: Tue, 11 Dec 2012 01:40:39 GMT
 P3P: policyref="http://info.yahoo.com/w3c/p3p.xml", CP="CAO DSP COR CUR ADM DEV TAI PSA PSD IVAi IVDi CONi TELo OTPi OUR DELi SAMi OTRi UNRi PUBi IND PHY ONL UNI PUR FIN COM NAV INT DEM CNT STA POL HEA PRE LOC GOV"
 Location: http://info.yahoo.com/privacy/us/yahoo/details.html
 Vary: Accept-Encoding
 Connection: close
 Transfer-Encoding: chunked
 Content-Type: text/html; charset=utf-8
 Cache-Control: private
Location: http://info.yahoo.com/privacy/us/yahoo/details.html [following]
--2012-12-10 17:40:39--  http://info.yahoo.com/privacy/us/yahoo/details.html
Resolving info.yahoo.com (info.yahoo.com)... 98.137.133.155
Connecting to info.yahoo.com (info.yahoo.com)|98.137.133.155|:80... connected.
HTTP request sent, awaiting response... 
 HTTP/1.1 200 OK
 Date: Tue, 11 Dec 2012 01:40:39 GMT
 P3P: policyref="http://info.yahoo.com/w3c/p3p.xml", CP="CAO DSP COR CUR ADM DEV TAI PSA PSD IVAi IVDi CONi TELo OTPi OUR DELi SAMi OTRi UNRi PUBi IND PHY ONL UNI PUR FIN COM NAV INT DEM CNT STA POL HEA PRE LOC GOV"
 Vary: Accept-Encoding
 Connection: close
 Transfer-Encoding: chunked
 Content-Type: text/html; charset=utf-8
 Cache-Control: private
Length: unspecified [text/html]
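
To follow redirects like the ones in this output, swget has to pull the Location value out of the response headers. The following sketch uses a hypothetical helper, get_header_value. It matches header names case-insensitively, silently truncates values that do not fit in the output buffer, and assumes it is given only the header portion of the response.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Copies the value of header `name` into out (always '\0'-terminated).
 * Returns 0 if the header was found, -1 otherwise. */
static int get_header_value(const char *headers, const char *name,
                            char *out, size_t cap)
{
    assert(headers != NULL && name != NULL && out != NULL && cap > 0);

    const char *line = headers;
    while (*line != '\0') {
        /* Compare the start of this line with the header name. */
        size_t i = 0;
        while (name[i] != '\0' &&
               tolower((unsigned char)line[i]) ==
               tolower((unsigned char)name[i]))
            i++;
        if (name[i] == '\0' && line[i] == ':') {
            const char *v = line + i + 1;
            while (*v == ' ' || *v == '\t')   /* skip leading blanks */
                v++;
            size_t n = 0;
            while (v[n] != '\0' && v[n] != '\r' && v[n] != '\n' &&
                   n + 1 < cap) {
                out[n] = v[n];
                n++;
            }
            out[n] = '\0';
            return 0;
        }
        /* Advance to the start of the next line. */
        while (*line != '\0' && *line != '\n')
            line++;
        if (*line == '\n')
            line++;
    }
    return -1;
}
```

With the first response above, get_header_value(headers, "Location", buf, sizeof buf) would place "http://privacy.yahoo.com/" in buf. A redirect loop would then parse that URL and issue a new request, ideally with an upper limit on the number of hops.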

Suggested Implementation Strategy

Start off trying to download one of the text web pages, not a binary image file. (The text will be easier to visually troubleshoot errors). This is a good process to follow:

Work on using argp() to parse the input arguments to the program (This task can be done by one group member)
Work on writing a custom URL parser to split a URL (saved as an array of characters) into host, path, and file components. You need the host name when creating your socket, you need the path+file name when sending the HTTP request, and you need the filename (plus the destination directory) when saving the file to disk. (This task can be done in parallel by the other group member)
Work on sending the HTTP request for the path and file to the specified server, and saving the response to disk. At this point, just save the header to the file too. (This should be similar to your previous socket program)
Work on parsing the headers of the HTTP response to identify success and error codes. At this point, you should stop saving the headers to disk with the rest of the file. Now you should be able to compare your program output with output from wget, and see if the files differ in any way. (The md5sum and the diff command-line utilities may be useful here in rapidly comparing files, and highlighting where files differ.)
Work on following 301 and 302 redirects and fetching the new location automatically.
Polish your output.
Test example links above for both text and binary files. Retest.
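
The URL-parsing step in the list above could start from something like this sketch. The swget_url struct and the split_url function are hypothetical, the buffer sizes are arbitrary, and the sketch assumes port 80: it does not handle an explicit port such as "host:8080", and it only recognizes a lowercase "http://" prefix.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical result of splitting a URL. */
struct swget_url {
    char host[256];    /* e.g. "www.google.com" */
    char path[1024];   /* e.g. "/about/products/", or "/" if none given */
};

/* Splits url into host and path. Returns 0 on success, -1 if the
 * host is empty or a component does not fit in its buffer. */
static int split_url(const char *url, struct swget_url *out)
{
    assert(url != NULL && out != NULL);

    const char *scheme = "http://";
    if (strncmp(url, scheme, strlen(scheme)) == 0)
        url += strlen(scheme);              /* the scheme is optional */

    size_t host_len = strcspn(url, "/");    /* host ends at first '/' */
    if (host_len == 0 || host_len >= sizeof out->host)
        return -1;
    memcpy(out->host, url, host_len);
    out->host[host_len] = '\0';

    const char *path = url + host_len;
    if (*path == '\0')
        path = "/";                         /* request the root document */
    if (strlen(path) >= sizeof out->path)
        return -1;
    strcpy(out->path, path);
    return 0;
}
```

The host goes to the DNS lookup and the Host header, the path goes into the GET line, and the text after the last '/' of the path gives the local file name.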

Extra Credit Opportunity

For an additional +20% on the project, provide a more full-featured implementation of the HTTP 1.1 protocol by your client. The following additional features must be supported by your client:

Support the "Transfer-Encoding: chunked" mode as a reply from the server, where each grouping of the reply file is preceded with a number (in hexadecimal) specifying the number of subsequent bytes.

See: http://en.wikipedia.org/wiki/Chunked_transfer_encoding
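
To illustrate the chunked format, here is a sketch that decodes a chunked body already held completely in memory. The function name decode_chunked is hypothetical. The sketch skips chunk extensions, ignores any trailer headers after the final chunk, and rejects chunk sizes longer than 8 hex digits. A real client would more likely decode incrementally while receiving.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Decodes the chunked body in[0..in_len) into out.
 * Returns the decoded length, or -1 on malformed or truncated input
 * or if out is too small. */
static long decode_chunked(const char *in, size_t in_len,
                           char *out, size_t cap)
{
    size_t pos = 0, written = 0;

    assert(in != NULL && out != NULL);
    for (;;) {
        /* Read the chunk size, written in hexadecimal. */
        size_t size = 0, digits = 0;
        while (pos < in_len && isxdigit((unsigned char)in[pos])) {
            int c = tolower((unsigned char)in[pos]);
            size = size * 16 + (size_t)(c <= '9' ? c - '0' : c - 'a' + 10);
            pos++;
            digits++;
        }
        if (digits == 0 || digits > 8)
            return -1;
        /* Skip anything else on the size line, up to its CRLF. */
        while (pos + 1 < in_len && !(in[pos] == '\r' && in[pos + 1] == '\n'))
            pos++;
        if (pos + 1 >= in_len)
            return -1;
        pos += 2;
        if (size == 0)                   /* a zero-size chunk ends the body */
            return (long)written;
        if (size > in_len - pos || size > cap - written)
            return -1;
        memcpy(out + written, in + pos, size);
        written += size;
        pos += size;
        /* Each chunk's data is followed by its own CRLF. */
        if (pos + 1 >= in_len || in[pos] != '\r' || in[pos + 1] != '\n')
            return -1;
        pos += 2;
    }
}
```

For instance, the input "4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n" decodes to the 9 bytes "Wikipedia".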

Support the "100 Continue" response from the server (by ignoring it and waiting for the final reply)
Gracefully handle persistent connections. Rather than telling the server to close your connection after each request (as specified above), you should instead leave the connection open. If the request leads to a redirect, you can request the new URL immediately. Only close the connection after the final file has been downloaded. Like before, swget only downloads a single file at a time.

Information on HTTP 1.1 can be found at http://www.apacheweek.com/features/http11 and http://www.jmarshall.com/easy/http/ , among many other online sources.

Example program flow for HTTP 1.1 using persistent connections

Client opens TCP connection to server on port 80
Client sends a request to the server:
GET /filename HTTP/1.1
Host: www.server.com
Server sends response indicating the file is now located at www.server.com/filename2, and leaves connection open
Client sends another request to the server:
GET /filename2 HTTP/1.1
Host: www.server.com
Server sends response
Client closes TCP connection

Resources

Need C programming resources? Visit the class Resources page
Interested in HTTP/1.1 protocol minutia? Bulk up on the FULL SPECIFICATIONS here! (W3 RFC 2616)

Submission

The Eclipse IDE can package up your entire project into a Zip-compressed archive file. Inside the archive are all of your source files along with a "Makefile" indicating how your project is to be compiled and executed. This is a common way in which Linux applications are distributed in source code form. I can uncompress, build, and run your project with the following commands

unzip yourproject.zip
cd yourproject/Debug
make
./yourprogram

To produce the archive, go to File->Export->General->Archive File. Select your project to include in the archive and provide a filename. Upload the resulting compressed file to Sakai.