Project 3 - Web Proxy Server

This assignment can be completed either individually, or in groups with a maximum of 2 members. You can discuss problems and potential solutions with other students, but you cannot share completed programs or significant pieces of completed code. See the honor code in the syllabus for more details.

Project Objectives

Hands-on experience with HTTP and HTTP proxy functionality
Hands-on experience with TCP sockets
Hands-on experience with C programming, including details such as file I/O, C string parsing, and command-line argument parsing
Hands-on experience with concurrent programming (see description of extra credit)

To support these objectives, you are not allowed to use any pre-built HTTP, URL-parsing, or socket management libraries.

Project Description

In this project, you will build a basic web proxy capable of accepting HTTP version 1.0 requests from clients (a web browser), making requests to remote servers, and returning data to the client. You shouldn’t assume that your server will be running on a particular IP address, or that clients will be coming from a pre-determined IP. Further, you shouldn't assume that the server will always be listening on a fixed port number. If the security option is specified, your proxy should only accept incoming requests from clients within the one specified subnet. (Note that a more sophisticated proxy server could allow for finer-grained filtering on many subnets.) Your proxy server should always be running and listening for more requests. Remember, when a web browser uses your proxy server to download a web page, many requests will be issued: a request for the original HTML file along with many subsequent requests for all of the images, videos, style sheets, JavaScript programs, and other objects contained on that page.

Your project should produce a single binary named "proxy" that supports the following command-line arguments: (See tips section below for a standard method to parse arguments given in this format)

proxy [OPTIONS]
--port=<port to listen on>   This argument is REQUIRED.
--security=<IP>,<NETMASK>     Limit incoming connections to clients with IP address within the subnet specified by IP, netmask pair.
--verbose   Turn on verbose output for debugging. This option should print the headers sent by the web browser and HTTP server (see Desired Output section below)
--help  Display help message  (which should print this list of commands)

Example usage at the command line to listen on port 4567 and only accept connections from the (very large, private) 10/8 subnet.

#> proxy --port=4567 --security=10.0.0.0,255.0.0.0 --verbose
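
As a rough sketch of how these options might be parsed with glibc's argp interface (the approach suggested in the tips section), the fragment below fills a small configuration struct. The names proxy_config, parse_opt, and parse_args are made up for this illustration, and the --security value is only stored as a raw string; splitting and validating it is left to you.

```c
#include <argp.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the parsed command-line options. */
struct proxy_config {
    int port;              /* 0 until --port is seen */
    const char *security;  /* raw "<IP>,<NETMASK>" text, or NULL if absent */
    int verbose;           /* 1 if --verbose was given */
};

static const struct argp_option proxy_options[] = {
    {"port",     'p', "PORT",       0, "Port to listen on (REQUIRED)", 0},
    {"security", 's', "IP,NETMASK", 0, "Only accept clients inside this subnet", 0},
    {"verbose",  'v', 0,            0, "Print HTTP headers for debugging", 0},
    {0, 0, 0, 0, 0, 0}
};

/* Called by argp once per option, then once more with ARGP_KEY_END. */
static error_t parse_opt(int key, char *arg, struct argp_state *state)
{
    struct proxy_config *cfg = state->input;

    switch (key) {
    case 'p': cfg->port = atoi(arg); break;
    case 's': cfg->security = arg;   break;
    case 'v': cfg->verbose = 1;      break;
    case ARGP_KEY_END:
        /* --port is mandatory; argp_error() prints usage and exits. */
        if (cfg->port <= 0 || cfg->port > 65535)
            argp_error(state, "--port=<1-65535> is required");
        break;
    default:
        return ARGP_ERR_UNKNOWN;
    }
    return 0;
}

static const struct argp proxy_argp = {
    proxy_options, parse_opt, 0, "A basic HTTP/1.0 web proxy", 0, 0, 0
};

/* Returns 0 on success and fills *cfg. */
int parse_args(int argc, char **argv, struct proxy_config *cfg)
{
    memset(cfg, 0, sizeof *cfg);
    return argp_parse(&proxy_argp, argc, argv, 0, 0, cfg);
}
```

Calling parse_args(argc, argv, &cfg) from main() with the example command line above should leave cfg.port set to 4567 and cfg.verbose set to 1. argp builds the --help text from the option table, so no separate help handler is needed in this sketch.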

About HTTP Proxies

Ordinarily, HTTP is a client-server protocol. The client (usually your web browser) communicates directly with the server (the web server software). However, in some circumstances it may be useful to introduce an intermediate entity called a proxy. Conceptually, the proxy sits between the client and the server. In the simplest case, instead of sending requests directly to the server, the client sends all its requests to the proxy. The proxy then opens a connection to the server, and passes on the client’s request. The proxy receives the reply from the server, and then sends that reply back to the client. Notice that the proxy is essentially acting as both an HTTP client (to the remote server) and an HTTP server (to the initial client).

Why use a proxy? There are a few possible reasons:

Performance: By saving a copy of the pages that it fetches, a proxy can reduce the need to create connections to remote servers. This can reduce the overall delay involved in retrieving a page, particularly if a server is remote or under heavy load.
Content Filtering and Transformation: While in the simplest case the proxy merely fetches a resource without inspecting it, there is nothing that says that a proxy is limited to blindly fetching and serving files. The proxy can inspect the requested URL and selectively block access to certain domains, reformat web pages (for instance, by stripping out images to make a page easier to display on a handheld or other limited-resource client), or perform other transformations and filtering.
Privacy: Normally, web servers log all incoming requests for resources. This information typically includes at least the IP address of the client, the browser or other client program that they are using (called the User-Agent), the date and time, and the requested file. If a client does not wish to have this personally identifiable information recorded, routing HTTP requests through a proxy is one solution. All requests coming from clients using the same proxy appear to come from the IP address and User-Agent of the proxy itself, rather than the individual clients. If a number of clients use the same proxy (say, an entire business or university), it becomes much harder to link a particular HTTP transaction to a single computer or individual.

When a client (your web browser) uses a proxy, the HTTP requests it sends to the proxy differ in at least one way from normal HTTP requests. In the first line of the request, the complete URL of the resource being requested is used, instead of just the path. As an example, assume you are fetching the object http://www.google.com/about. If you are not using a proxy, the request that is sent to the server is:

GET /about HTTP/1.0

But, if you are using a proxy, the request to the proxy for the same object is:

GET http://www.google.com/about HTTP/1.0

With this new request from the client, the proxy knows which server to forward the request to.
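
One possible way to pull the server name, port, and path out of such a request line with plain C string functions is sketched below. The function name split_request_line and its parameter layout are invented for this example; it only handles GET requests with an http:// URL and assumes port 80 when the URL does not name one.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: splits "GET http://host[:port]/path HTTP/1.0" into
 * host, port, and path. Returns 0 on success, or -1 if the line does not
 * start with "GET http://" or an output buffer is too small. */
int split_request_line(const char *line,
                       char *host, size_t host_len,
                       int *port,
                       char *path, size_t path_len)
{
    static const char prefix[] = "GET http://";
    size_t prefix_len = sizeof prefix - 1;

    if (strncmp(line, prefix, prefix_len) != 0)
        return -1;

    /* authority points at "host[:port]/path HTTP/1.0" */
    const char *authority = line + prefix_len;
    size_t auth_len = strcspn(authority, "/ \r\n");
    const char *rest = authority + auth_len;   /* "/path HTTP/1.0" or " HTTP/1.0" */

    /* Host name, up to an optional ':' */
    const char *colon = memchr(authority, ':', auth_len);
    size_t hlen = colon ? (size_t)(colon - authority) : auth_len;
    if (hlen == 0 || hlen >= host_len)
        return -1;
    memcpy(host, authority, hlen);
    host[hlen] = '\0';

    /* Port number, defaulting to 80 */
    *port = 80;
    if (colon) {
        if (sscanf(colon + 1, "%d", port) != 1 || *port <= 0 || *port > 65535)
            return -1;
    }

    /* Path: everything up to the next space, or "/" if the URL has none */
    size_t plen = strcspn(rest, " \r\n");
    if (plen == 0) {
        if (path_len < 2)
            return -1;
        strcpy(path, "/");
    } else {
        if (plen >= path_len)
            return -1;
        memcpy(path, rest, plen);
        path[plen] = '\0';
    }
    return 0;
}
```

For the request line shown above, this sketch would yield the host www.google.com, port 80, and path /about, which is enough to open the outgoing connection and rebuild the request line for the server.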

Web Browser Settings

Your web browser configuration must be modified to direct all HTTP requests not to the destination server IP and port (80), but rather to the IP address and port number that your proxy server is listening on. Your proxy server will then forward the request to its final destination. Instructions are provided below for the Firefox browser. If you are using a different web browser as a client, you will need to adapt these instructions for your own system.

Firefox (version 2.0 and up):

Open the main Firefox preferences window. (Select Edit->Preferences, Tools->Options, or File->Preferences from the menu, depending on your operating system and Firefox version).
Click on the ‘Advanced’ icon in the Options dialog.
Select the ‘Network’ tab, and click on ‘Settings’ in the ‘Connections’ area.
Select ‘Manual Proxy Configuration’ from the options available. In the boxes, enter the hostname (or IP address) and port where your proxy program is running.
- Note that this means that the proxy does not have to run on the same machine as the browser!

Because Firefox defaults to using HTTP/1.1 and your proxy speaks HTTP/1.0, there are a couple of minor changes that need to be made to Firefox’s advanced configuration to tweak the browser’s behavior.

Type ‘about:config’ in the address bar (the same place where you would enter a URL to access) to access the (hidden) advanced settings for Firefox.
Click the "I'll be careful" button to promise not to break anything.
In the search/filter bar, type ‘network.http.proxy’ to filter the list and only show a subset of the options
You should see three keys: network.http.proxy.keepalive, network.http.proxy.pipelining, and network.http.proxy.version
- Set network.http.proxy.keepalive to false (browser will expect socket to be closed after each object is received)
- Set network.http.proxy.version to 1.0 (browser will only send HTTP 1.0 requests)
- Set network.http.proxy.pipelining to false (browser will only send 1 request at a time)

By making these changes, Firefox should only open one connection to your proxy server at a time. Thus, objects on a page (HTML, images, videos) will download sequentially, one after another, each creating a new socket connection to your server (as specified by the HTTP/1.0 standard).

Proxy Header Manipulations

Your proxy, upon receiving an HTTP request from a web browser (client), needs to modify that request before forwarding it to the web server. Specifically, your proxy needs to modify 1 header line, remove 1 header line, and add 2 header lines.

Modify the GET request header
- In the GET request line, the hostname should be removed from the request. For example, the line sent by the browser might be "GET http://www.google.com/about HTTP/1.0\r\n". The GET request line sent by your proxy should then be "GET /about HTTP/1.0\r\n". (Only the end of the whole header block is marked by "\r\n\r\n".)
Delete the Proxy-Connection header
- The web browser might send a line to your proxy that reads "Proxy-Connection: Close". This line should be ignored and/or deleted by your proxy, and not be passed through to the web server. (It is a request to close the connection between the web browser and the proxy after the request is fulfilled).
Add the Via header
- This header allows the web server to identify the chain of proxy servers that handled the request, as well as their capabilities (i.e., HTTP protocol versions).
- The format of this header is "Via: <proxy1-version> <proxy1-IP> <(proxy1 name)>, <proxy2-version> <proxy2-IP> <(proxy2 name)>, ...". Each proxy version, IP, and name is appended to the end of the string. Thus, proxy1 was the first proxy, which forwarded data to proxy2, and so on...
- Tip: When building this header, just append the protocol version, proxy IP, and proxy name to the end of the header. (Create the header if it does not exist).
Add the X-Forwarded-For header
- This header allows the web server to identify the original requester (the web browser) as well as the chain of proxy servers (excluding the last) handling the request.
- The format of this header is: "X-Forwarded-For: <client1>, <proxy1>, <proxy2>". Each proxy IP address is appended to the end of the string. Thus, proxy1 was the first proxy, which forwarded data to proxy2, and so on... Note that the last proxy IP address in the chain does NOT appear in this header, as it is implicit in the socket connection.
- Tip: When building this header, just append the IP address of the client connecting to your proxy to the end of this header. (Create the header if it does not exist).

All other lines in the original HTTP header (from the web browser) can be safely passed by the proxy to the destination web server.
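
The per-line decisions described above (drop, append, or pass through) could be sketched as follows. The function names has_header_name and rewrite_header, and the idea of passing in a pre-formatted Via entry, are assumptions made for this illustration. The sketch handles one header line at a time; the caller would still need to add fresh Via and X-Forwarded-For lines when the browser's request did not contain them.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Case-insensitive check: does `line` begin with the header name `name`
 * (for example "Via:")? Header names are not case-sensitive in HTTP. */
static int has_header_name(const char *line, const char *name)
{
    while (*name) {
        if (tolower((unsigned char)*line) != tolower((unsigned char)*name))
            return 0;
        line++;
        name++;
    }
    return 1;
}

/* Hypothetical helper: rewrites one header line (without its trailing CRLF).
 *   via_entry  e.g. "1.0 192.168.1.92 (my proxy name)"
 *   client_ip  address of the host that connected to this proxy
 * Returns 0 if the line must be dropped, 1 if `out` holds the line to
 * forward, and -1 if `out` is too small. */
int rewrite_header(const char *line, const char *via_entry,
                   const char *client_ip, char *out, size_t out_len)
{
    int n;

    if (has_header_name(line, "Proxy-Connection:"))
        return 0;                                   /* never forwarded */

    if (has_header_name(line, "Via:"))
        n = snprintf(out, out_len, "%s, %s", line, via_entry);
    else if (has_header_name(line, "X-Forwarded-For:"))
        n = snprintf(out, out_len, "%s, %s", line, client_ip);
    else
        n = snprintf(out, out_len, "%s", line);     /* passed through as-is */

    return (n < 0 || (size_t)n >= out_len) ? -1 : 1;
}
```

Keeping the rewrite line-oriented means the same helper works whether the headers arrive in one recv() call or several, as long as the caller splits the input on "\r\n" first.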

Example: Assume that the web browser (located at 192.168.182.15) sends a request to the proxy server at 192.168.1.92 using HTTP/1.0, and the proxy server sends the request to the web server. Thus, the headers sent by the proxy server to the web server should be as follows: (The GET request line and the Via and X-Forwarded-For headers are the fields changed or added by the proxy.)

GET /about HTTP/1.0 
Host: www.google.com
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.10) Gecko/20100914 SUSE/3.6.10-0.3.1 Firefox/3.6.10
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: close
Cookie: PREF=ID=26f64f4db9ded6b0:TM=1282757759:LM=1282757759:S=Let1OpLRTDqORqvq
Via: 1.0 192.168.1.92 (Pacific comp177 proxy server v1.0)
X-Forwarded-For: 192.168.182.15

Proxy Security

A web proxy should not simply forward requests for any client on the Internet who happens to find its address. Rather, the proxy should compare the IP address of the connecting client against a list of approved (or denied) addresses or ranges and make a decision accordingly. For this (simplified) project, security is an optional argument on the command line. If the argument is not specified, your proxy will accept connections from all hosts. If the security argument is specified (with an ip-address,netmask value), your proxy should only allow the client to connect if the client's IP address falls within the subnet specified by the security option. (Do you remember how to determine if an IP address is within your local subnet for an Ethernet network? The process should be similar here).
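
The subnet test itself comes down to masking both addresses and comparing the results. A minimal sketch, assuming IPv4 only and using a made-up function name client_in_subnet that takes all three values as dotted-quad strings:

```c
#include <arpa/inet.h>
#include <netinet/in.h>

/* Hypothetical helper. Returns 1 if client_ip lies inside the subnet given by
 * subnet_ip and netmask, 0 if it does not, and -1 if any of the three strings
 * is not a valid dotted-quad IPv4 address. */
int client_in_subnet(const char *client_ip, const char *subnet_ip,
                     const char *netmask)
{
    struct in_addr client, subnet, mask;

    if (inet_pton(AF_INET, client_ip, &client) != 1 ||
        inet_pton(AF_INET, subnet_ip, &subnet) != 1 ||
        inet_pton(AF_INET, netmask, &mask) != 1)
        return -1;

    /* All three values are in network byte order, so a bitwise AND followed
     * by an equality test works without any byte swapping. */
    return (client.s_addr & mask.s_addr) == (subnet.s_addr & mask.s_addr);
}
```

In the proxy, the client address would more likely come straight from the struct sockaddr_in filled in by accept(), in which case its sin_addr.s_addr field can be masked and compared the same way without converting to a string first.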

Buffering

Your proxy should not download the entire requested object from the server to memory or disk, and then (after the download is finished) start sending the object to the client. Imagine that the client has requested a 3GB+ ISO file to download. It is unreasonable to expect the proxy server to store that much data locally, and unreasonable to expect the client to wait for the proxy to completely download the full file before the client is sent the first byte.

Instead, your proxy server should have a small buffer. (The definition of "small" is up to you, but 64kB would be a reasonable number). You should download some data from the web server, send it to the web browser, download more data, send more data to the web browser, and repeat until no more data is available.
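
The download-some, send-some loop might look roughly like the sketch below. The names send_all and relay_response are chosen for this example, the 64kB size simply follows the suggestion above, and the code does not retry calls interrupted by a signal (EINTR), which a complete proxy would need to consider.

```c
#include <sys/types.h>
#include <sys/socket.h>

#define RELAY_BUF_SIZE (64 * 1024)

/* Keeps calling send() until all `len` bytes have been accepted, because a
 * single send() may take only part of the buffer. Returns 0 or -1. */
static int send_all(int fd, const char *buf, size_t len)
{
    size_t sent = 0;

    while (sent < len) {
        ssize_t n = send(fd, buf + sent, len - sent, 0);
        if (n < 0)
            return -1;
        sent += (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: copies data from server_fd to client_fd one buffer at
 * a time until the server closes its side of the connection.
 * Returns 0 on success or -1 on a socket error. */
int relay_response(int server_fd, int client_fd)
{
    char buf[RELAY_BUF_SIZE];

    for (;;) {
        ssize_t n = recv(server_fd, buf, sizeof buf, 0);
        if (n < 0)
            return -1;
        if (n == 0)          /* server closed the connection: all done */
            return 0;
        if (send_all(client_fd, buf, (size_t)n) < 0)
            return -1;
    }
}
```

Because at most one buffer of data is held at any moment, memory use stays constant no matter how large the requested object is, and the client starts receiving bytes as soon as the first recv() returns.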

Desired Output

During normal operation (i.e. without the verbose flag), your proxy should be completely silent after an initial launch message, except for fatal errors that prevent the proxy from continuing operation.

#> ./proxy --port=4567

Proxy launched and awaiting requests on port 4567...

During verbose operation, your proxy should print out status messages for every step of the transaction process. So, you should output a message when the web browser sends the initial request, when your proxy sends a request to the server, when the server returns a response to your proxy, and when your proxy returns the response to the web browser. These messages should include the full HTTP headers sent, but their exact formatting is up to you.

#> proxy --port=4567 --verbose
 
Proxy launched 
Creating server socket...
Binding socket to port 4567...
Listening for incoming requests...
 
Accepted a request from client!
Client sent request to proxy with headers:
connect to [127.0.0.1] from localhost [127.0.0.1] 58449
----------------------------------------------------------------------
GET http://www.google.com/about HTTP/1.0
Host: www.google.com
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.10) Gecko/20100914 SUSE/3.6.10-0.3.1 Firefox/3.6.10
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: close
Proxy-Connection: close
Cookie: PREF=ID=26f64f4db9ded6b0:TM=1282757759:LM=1282757759:S=Let1OpLRTDqORqvq
------------------------------------------------------------------------
Proxy opening connection to server www.google.com [74.125.19.104]... Connection opened.
Proxy sent request to server with headers:
<headers go here...>
Server sent response to proxy with headers:
<headers go here>
Proxy sent response to client with headers:
<headers go here>

Requirements

I will compile and test your program on the ecs-network server using Eclipse. It must work there!
You must implement the program in C and use the following GCC options in Eclipse to set the compiler to a very picky mode: -std=c99 -Wall -Wextra -D_POSIX_SOURCE

-std=c99 (Use the more modern C99 standard for the C language)
-Wall and -Wextra (Turn on all warnings. By viewing and fixing issues that generate warnings, you will produce better, safer, C code)
-D_POSIX_SOURCE (Exposes the POSIX declarations in the system headers, which the socket functions depend on)
These options are set in Project -> Properties -> C/C++ Build -> Settings -> Tool Settings -> GCC C Compiler.

Check the box for Wall in the "Warnings" category
Add -std=c99 -Wextra in the box in the "Miscellaneous" category
Add a new entry for _POSIX_SOURCE (without the D, because Eclipse provides that for you) in the symbols/defined symbols category.

If your program produces any warning during compilation (with these options), 5 points will be deducted.
If your program doesn't compile on the class server, zero points will be awarded.

All communication must be done using TCP sockets for reliable data communication
Your proxy should not be terminated abruptly when the user enters CTRL-C. Just think of all the temporary data stored in memory that might be lost if that happened. Instead, your server should "capture" the user's keystroke (technically, the SIGINT interrupt triggered by CTRL-C), call a function that properly shuts down and cleans up the proxy by closing the sockets, and then exit gracefully.

Tips and Comments

The proxy application combines much of the functionality that you have previously written into a single program. It includes the server functionality from the first programming assignment (listening on a port, spinning off a new socket to handle an incoming connection, etc...) as well as the client and HTTP header parsing functionality from the simple-wget programming project.
You cannot assume that calls to send() are fully successful. You might call send() with 512 bytes, and it only takes 400 bytes! It is up to you to re-send the remaining 112 bytes. For a helpful wrapper function for send() that will fully send the data, see Beej's guide section 7.3.
You can use the "netcat" utility (as demonstrated in the class) to listen to a port and print out data received. This could be useful, for example, to view the data that the web client is sending to the proxy server. As an example, this command allows netcat to listen to port 4567:
- #> netcat -l -p 4567 -v
- Note: depending on your OS, Netcat might either be netcat or nc
The pre-written argp library (its entry point is the argp_parse() function) can greatly simplify the parsing of program arguments, and it even handles the --help command automatically! (You really don't want to have to do this all from scratch!) For more information, see:
- http://www.gnu.org/s/libc/manual/html_node/Argp.html#Argp
- http://www.crasseux.com/books/ctutorial/argp-example.html
You can find many intro-to-HTTP tutorials on the web. Here is a useful one: http://www.jmarshall.com/easy/http/
When testing for new lines in the HTTP response headers, check for '\r' and '\n' (carriage return and new line) in consecutive characters, not just '\n'.
The more error-checking you do in your program (e.g. response codes from socket and file system calls), the easier it will be to troubleshoot and debug your program.

Use the Valgrind program to locate (and then fix) memory related errors in your project.

You will need to tell Eclipse what command-line arguments to run when executing your program. Go to Run->Run Configurations to set the arguments. Or, you may prefer to simply run your program at the command line by opening up two terminal windows or tabs.
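
To illustrate the tip about line endings: the end of an HTTP header block is the first blank line, that is, the byte sequence "\r\n\r\n". A small sketch of a scanner for it follows; the name header_block_length is hypothetical, and the function works on a length-delimited buffer because data from recv() is not NUL-terminated.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the number of bytes in `buf` up to and
 * including the "\r\n\r\n" that ends the header block, or 0 if the first
 * `len` bytes do not yet contain a complete header block. */
size_t header_block_length(const char *buf, size_t len)
{
    for (size_t i = 0; i + 4 <= len; i++) {
        if (memcmp(buf + i, "\r\n\r\n", 4) == 0)
            return i + 4;
    }
    return 0;
}
```

A return value of 0 tells the caller to receive more data and try again; any bytes past the returned length already belong to the message body.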

Extra Credit Option - Concurrent Programming

For an additional +20% on the project (which translates to +4% in your overall course grade), modify your proxy server to support multiple concurrent download requests using either multiple processes (i.e. fork()) or multiple threads.

For more details on this, please see Dr. Shafer.

Submission

The Eclipse IDE can package up your entire project into a Zip-compressed archive file. Inside the archive are all of your source files along with a "Makefile" indicating how your project is to be compiled and executed. This is a common way in which Linux applications are distributed in source code form. I can uncompress, build, and run your project with the following commands:

unzip yourproject.zip
cd yourproject/Debug
make
./yourprogram

To produce the archive, go to File->Export->General->Archive File. Select your project to include in the archive and provide a filename. Upload the resulting compressed file to Sakai.

With thanks to the Stanford Virtual Network System (VNS) project team - http://yuba.stanford.edu/vns/assignments/web-proxy/