Simplified HTTPS
The full HTTP protocol is complex. You will be implementing a very small subset. The basic syntax of a command is:
<verb> <url> <version>
<option-lines>
<newline>
Where <verb> will be either GET or POST, <option-lines> is zero or more non-empty lines, and <newline> is an empty line. Note that lines may be terminated by \r\n; you should always simply ignore \r characters. The server will read a command line, and an arbitrary number of non-empty lines terminated by a null line. In real HTTP, those non-empty lines are used to pass assorted options. By accepting and ignoring them, you can use a standard web browser to talk to your stripped-down server. One of these lines must be accepted and interpreted: Content-Length: (and yes, the colon is mandatory). Note that the option name is case-insensitive; you must accept, e.g., conTent-lEngth: as well.
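As a rough illustration of the reading loop described above, here is a minimal Python sketch. The function name read_request_head and its return shape are assumptions made for this example, not part of the assignment; it assumes a blocking binary stream such as the file object wrapped around a connection.

```python
import io


def read_request_head(stream):
    """Read the command line and option lines from a binary stream.

    Returns (verb, url, version, content_length); content_length is None
    when no Content-Length: line was seen. (Hypothetical helper.)
    """
    def read_line():
        raw = stream.readline()
        if raw == b"":
            raise EOFError("connection closed before the empty line")
        # Ignore every \r, then drop the terminating \n.
        return raw.replace(b"\r", b"").rstrip(b"\n").decode("ascii", "replace")

    parts = read_line().split()
    if len(parts) != 3 or parts[0] not in ("GET", "POST"):
        raise ValueError("malformed command line")
    verb, url, version = parts  # the version is accepted but not checked

    content_length = None
    while True:
        line = read_line()
        if line == "":
            break  # the empty line ends the header
        name, colon, value = line.partition(":")
        # The colon is mandatory and the option name is case-insensitive.
        if colon and name.strip().lower() == "content-length":
            value = value.strip()
            if not value.isdigit():
                raise ValueError("Content-Length must be a sequence of digits")
            content_length = int(value)
        # Every other option line is accepted and ignored.
    return verb, url, version, content_length


demo = io.BytesIO(b"POST https://host/msg HTTP/1.0\r\nconTent-lEngth: 5\r\n\r\nhello")
print(read_request_head(demo))
```

After the call returns, the stream is positioned at the first byte of any data that follows the empty line.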
<version> is, of course, the version number; always send HTTP/1.0 but ignore the version on receipt.
A <url> is a URL. All URLs for this project will be of one of two forms:
https://host/file/path[?parameters]
https://host:port/file/path[?parameters]
The host field is, of course, a hostname or IP address. Always assume port 443 unless a port field (a string of digits) is present. Why? You may find it useful to use a second port number when doing client-side authentication. The meaning of the /file/path string is up to you, but you can (and should) impose reasonable length restrictions.
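One possible way to split the two URL forms is sketched below in Python. The name parse_url and the MAX_PATH_LEN value are assumptions for illustration; the assignment only asks for some reasonable length restriction. The sketch does not handle bracketed IPv6 literals.

```python
MAX_PATH_LEN = 1024  # hypothetical limit; pick whatever is reasonable for your design


def parse_url(url):
    """Split a URL into (host, port, path, params).

    params is the raw text after '?', or None if there is no '?'.
    (Hypothetical helper, not a full URL parser.)
    """
    prefix = "https://"
    if not url.startswith(prefix):
        raise ValueError("only https:// URLs are supported")
    authority, slash, tail = url[len(prefix):].partition("/")
    if not slash:
        raise ValueError("URL has no /file/path part")

    host, colon, port_text = authority.partition(":")
    if not host:
        raise ValueError("missing host")
    if colon:
        # A port field, when present, must be a string of digits.
        if not (port_text.isascii() and port_text.isdigit()):
            raise ValueError("port must be a string of digits")
        port = int(port_text)
        if not 0 < port < 65536:
            raise ValueError("port out of range")
    else:
        port = 443  # default when no port field is present

    path, qmark, params = ("/" + tail).partition("?")
    if len(path) > MAX_PATH_LEN:
        raise ValueError("path too long")
    return host, port, path, (params if qmark else None)


print(parse_url("https://host/file/path"))
print(parse_url("https://host:8443/file/path?a=1&b=2"))
```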
The parameter list is a ? followed by &-separated keyword=value sequences; we've all seen these. It is up to you if you use parameters; I suspect you will find it easier not to. Note that passwords or other secret values must not be passed in URLs: URLs are often logged.
A GET request simply asks the server to send some data; the URL identifies which data. A POST request is used when uploading data. When using POST, the Content-Length: line must be used. After the colon and optional white space, there is a length in bytes given as a sequence of digits; that denotes the number of bytes of data to read. You will use this, for example, when uploading a message. If the server receives an end-of-file indication before that number of bytes, it may discard everything.
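Reading the uploaded data might look like the following Python sketch. The name read_body is an assumption for this example; it assumes a blocking binary stream and chooses to discard a short upload, which the text above permits but does not require.

```python
import io


def read_body(stream, content_length):
    """Read exactly content_length bytes from a binary stream.

    Returns the bytes, or None if end-of-file arrives first, in which
    case everything read so far is discarded. (Hypothetical helper.)
    """
    chunks = []
    remaining = content_length
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:
            return None  # premature end-of-file: discard everything
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


print(read_body(io.BytesIO(b"hello world"), 5))
print(read_body(io.BytesIO(b"hel"), 5))
```

The loop matters because a single read on some stream types may return fewer bytes than requested even when more are on the way.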
I do not require any particular format for input; I strongly suggest that you make it as simple as possible, e.g., a line for the username, a line for the password, etc.
A real web server can receive and send ASCII or binary data back, depending on option lines; don't do that. Instead, have your programs know from context what's coming and decode things accordingly. I strongly suggest using simple hexadecimal to send binary, though you can use base64 if you're really concerned about efficiency (you shouldn't be for this project).
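For the hexadecimal suggestion, Python's built-in bytes methods are enough. The wrapper names below are assumptions for illustration only.

```python
def encode_binary(data: bytes) -> str:
    """Encode binary data as lowercase hexadecimal text."""
    return data.hex()


def decode_binary(text: str) -> bytes:
    """Decode hexadecimal text back to bytes.

    bytes.fromhex raises ValueError on anything that is not valid hex.
    """
    return bytes.fromhex(text)


encoded = encode_binary(b"\x00\xffAB")
print(encoded)
print(decode_binary(encoded))
```

Hex doubles the size of the data, which affects the value you put in Content-Length: the length counts the encoded bytes actually sent, not the original binary.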
A response from the server is:
<version> <status-code> <text>
<option-lines>
<newline>
<body>
where <version> will be HTTP/1.0 (but accept anything), <status-code> is a 3-digit number, and <text> should be ignored for non-error situations and displayed to the user for error situations.
Always send 200 as the status code, but accept any status code whose first (decimal) digit is a 2 as indicating success. If the first digit is a 3, there must be a Location: option line showing a new URL to go to instead. Why? This allows web servers to redirect you to a different URL, e.g., one that takes a port number as shown above. A Content-Length: option indicates that the server is sending back data, e.g., a message or a certificate; interpret it as above.
Status codes beginning with 4 or 5 are error codes; display <text> to the user and exit.
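The client-side status rules can be sketched as a small Python function. The name classify_response, the tuple it returns, and the convention that options is a dict keyed by lowercased option name are all assumptions made for this example.

```python
def classify_response(status_line, options):
    """Classify a response by the first digit of its status code.

    Returns ("ok", None), ("redirect", url) or ("error", text).
    options maps lowercased option names to values (an assumption).
    """
    parts = status_line.replace("\r", "").split(None, 2)
    if len(parts) < 2:
        raise ValueError("malformed status line")
    code = parts[1]
    if len(code) != 3 or not (code.isascii() and code.isdigit()):
        raise ValueError("status code must be a 3-digit number")
    text = parts[2] if len(parts) == 3 else ""

    first = code[0]
    if first == "2":
        return ("ok", None)  # <text> is ignored on success
    if first == "3":
        location = options.get("location")
        if location is None:
            raise ValueError("3xx response without a Location: line")
        return ("redirect", location)
    if first in ("4", "5"):
        return ("error", text)  # caller displays text and exits
    raise ValueError("unexpected status code " + code)


print(classify_response("HTTP/1.0 404 Not Found", {}))
```

The version field is split off but never inspected, matching the rule to accept anything there.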
Here is an actual transcript of me connecting to the CS department web server and getting a 301 redirect:
$ telnet www.cs.columbia.edu 80
Trying 128.59.11.206...
Connected to webcluster.cs.columbia.edu.
Escape character is '^]'.
GET / HTTP/1.0
HTTP/1.1 301 Moved Permanently
content-length: 0
location: https:///
connection: close
Connection closed by foreign host.
Everything up to and including the "Escape character" line, as well as the last line, is output from the telnet command itself.
Servers need not send Content-Length:; if they don't, read until end-of-file. As before, ignore all \r characters.
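A client's choice between the two ways of finding the end of the body might look like this Python sketch. The name read_response and its return shape are assumptions; in this sketch \r is stripped from the status and option lines only, and the body bytes are returned untouched.

```python
import io


def read_response(stream):
    """Read a response from a blocking binary stream.

    Returns (status_line, options, body), where options maps lowercased
    option names to values. (Hypothetical helper.)
    """
    def read_line():
        raw = stream.readline()
        return raw.replace(b"\r", b"").rstrip(b"\n").decode("ascii", "replace")

    status_line = read_line()
    options = {}
    while True:
        line = read_line()
        if line == "":
            break  # empty line (or end-of-file) ends the header
        name, colon, value = line.partition(":")
        if colon:
            options[name.strip().lower()] = value.strip()

    length = options.get("content-length")
    if length is None:
        body = stream.read()  # no Content-Length: read until end-of-file
    elif length.isdigit():
        body = stream.read(int(length))
    else:
        raise ValueError("Content-Length must be a sequence of digits")
    return status_line, options, body


demo = io.BytesIO(b"HTTP/1.0 200 OK\r\n\r\nhello\nworld")
print(read_response(demo))
```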