The Web and HTTP
The World Wide Web is a global information system that allows users to access and exchange documents and other web resources. It's powered by the HyperText Transfer Protocol (HTTP), the application-layer protocol that defines how web clients (like browsers) and web servers communicate with each other.
Overview of HTTP
HTTP is the foundational application-layer protocol of the web. HTTP/1.0 is defined in RFC 1945 and HTTP/1.1 in RFC 2616.
A web page itself consists of a base HTML file and multiple referenced objects, like images, scripts, and stylesheets, each accessible via a unique URL.
- A URL has two components: the hostname (e.g., www.someSchool.edu) and the path name (e.g., /someDepartment/picture.gif).
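As a small illustration of the two URL components, the sketch below splits a URL with Python's standard `urllib.parse` module. The helper name `split_url` and the `http://` scheme prepended to the example hostname are assumptions made for the example.

```python
from urllib.parse import urlsplit


def split_url(url: str) -> tuple[str, str]:
    """Return (hostname, path) for an absolute URL (hypothetical helper)."""
    parts = urlsplit(url)
    # hostname is None when the URL has no network location, so fall back to "".
    return parts.hostname or "", parts.path


host, path = split_url("http://www.someSchool.edu/someDepartment/picture.gif")
print(host)  # urlsplit lowercases the hostname; the path keeps its case
print(path)
```

Hostnames are case-insensitive, which is why the parser normalizes them, while the path is passed through unchanged.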
Key characteristics of HTTP include:
Client-Server Model: Communication follows a strict request-response pattern. A client (e.g., a web browser) sends a request for a resource, and a server responds with the requested data or an error message.
Runs on TCP: HTTP uses the TCP protocol for its transport layer services. This provides a reliable data transfer service, ensuring that all data arrives in order and without errors, so HTTP doesn't have to manage this itself.
Stateless Protocol: By default, HTTP is stateless, meaning the server does not store any information about past client requests. Each request is treated as a completely independent transaction.
HTTP Connections: Persistent vs. Non-Persistent
HTTP manages the underlying TCP connections in one of two ways.
Non-Persistent Connections
In this older approach, each object on a web page requires a separate TCP connection. For a page with 10 images, this would mean establishing and tearing down 11 separate TCP connections (1 for the HTML file, 10 for the images).
After sending an object, the server closes the TCP connection, so the next object needs a new one. This method is inefficient and incurs significant overhead: each object costs at least two round-trip times (RTTs), one to set up the TCP connection and one for the HTTP request and the start of the response, plus the time to transmit the object.
Persistent Connections
Modern web applications use persistent connections, which are the default in HTTP/1.1.
This method allows multiple requests and responses to be sent over a single TCP connection, which remains open until a timeout period or until it is explicitly closed. This dramatically reduces the latency and overhead associated with establishing multiple connections, leading to faster page load times.
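To make the difference concrete, here is a back-of-the-envelope sketch comparing the two modes for the 11-object page mentioned above. It assumes objects are fetched strictly one after another (no parallel connections, no pipelining) and that every object takes the same time to transmit. The function names and the 100 ms and 10 ms figures are illustrative assumptions, not measurements.

```python
def non_persistent_ms(num_objects: int, rtt_ms: int, transmit_ms: int) -> int:
    """Each object pays one RTT for TCP setup, one RTT for the request, plus transmission."""
    return num_objects * (2 * rtt_ms + transmit_ms)


def persistent_ms(num_objects: int, rtt_ms: int, transmit_ms: int) -> int:
    """TCP setup is paid once; each object then costs one RTT plus transmission."""
    return rtt_ms + num_objects * (rtt_ms + transmit_ms)


# 1 HTML file + 10 images, with an assumed 100 ms RTT and 10 ms per object.
print(non_persistent_ms(11, 100, 10))  # 2310
print(persistent_ms(11, 100, 10))      # 1310
```

Under these assumptions the persistent connection saves one RTT for every object after the first, which is where most of the gain comes from on high-latency links.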
The Structure of HTTP Messages
HTTP defines two types of messages for communication: Request Messages (from Client) and Response Messages (from Server).
HTTP Request Message
A request message is sent by the client to trigger an action on the server. Its structure includes:
- Request Line: Contains the HTTP method (e.g., GET, POST), the URL of the resource, and the HTTP version (e.g., GET /index.html HTTP/1.1).
- Header Lines: Key-value pairs that provide additional information, such as the host, the user's browser (User-agent), and the types of content the client can accept.
- A Blank Line to separate headers from the entity body.
- Entity Body: An optional section that contains data, typically used with POST requests to submit form data.
Common HTTP methods include GET (to retrieve data), POST (to submit data), HEAD (to retrieve only the headers), PUT (to upload a resource), and DELETE (to remove a resource).
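The request structure described above can be sketched as a function that assembles the raw bytes of a message. `build_request` is a hypothetical helper written for this illustration and handles only the simple cases discussed here. Lines are terminated with CRLF, and a `Content-Length` header is added whenever a body is present.

```python
def build_request(method, path, host, headers=None, body=b""):
    """Assemble a minimal HTTP/1.1 request message as bytes (illustrative only)."""
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]  # request line + Host header
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    if body:
        lines.append(f"Content-Length: {len(body)}")
    # The empty line (CRLF CRLF) separates the headers from the entity body.
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("ascii") + body


request = build_request("GET", "/index.html", "example.com", {"Connection": "close"})
print(request.decode("ascii"))

post = build_request("POST", "/form", "example.com", body=b"name=alice")
```

The GET request ends right after the blank line, while the POST request carries its form data after it.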
HTTP Response Message
A response message is sent by the server back to the client. Its structure includes:
- Status Line: Contains the HTTP version, a status code, and a status message (e.g., HTTP/1.1 200 OK).
- Header Lines: Key-value pairs with metadata about the response, such as the date, server type, content length, and content type.
- A blank line
- Entity Body: The actual content of the requested resource, such as the HTML of a web page or the data of an image.
Common status codes include 200 OK (success), 301 Moved Permanently, 400 Bad Request, and 404 Not Found.
HTTP headers help with caching, language negotiation, user-agent detection, and connection control.
POST requests use the entity body to send user-inputted form data, while GET requests may embed form data in the URL.
HTTP Request and Response Message Example
Client IP: 192.168.1.10
Server IP: 93.184.216.34 (e.g., example.com)
The client wants to request a webpage: /index.html
HTTP Request Message Example (sent from client to server):
Line | Type |
---|---|
GET /index.html HTTP/1.1 | Request Line |
Host: example.com | Header Line |
Connection: close | Header Line |
User-agent: Mozilla/5.0 | Header Line |
Accept-language: en-US | Header Line |
(blank line) | Blank Line (separates headers from body) |
(empty for GET) | Entity Body |
HTTP Response Message Example (sent from server to client):
Line | Type |
---|---|
HTTP/1.1 200 OK | Status Line |
Date: Sun, 22 Jun 2025 12:00:00 GMT | Header Line |
Server: Apache/2.4.41 (Ubuntu) | Header Line |
Last-Modified: Sun, 22 Jun 2025 10:00:00 GMT | Header Line |
Content-Length: 1256 | Header Line |
Content-Type: text/html | Header Line |
Connection: close | Header Line |
(blank line) | Blank Line |
<html>...</html> | Entity Body starts here |
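Going the other way, a client has to split a response like the one above into its status line, headers, and body. The following is a minimal sketch of that step. `parse_response` is a hypothetical helper that assumes the whole message is already in memory and ignores details such as chunked encoding and repeated headers.

```python
def parse_response(raw: bytes):
    """Split a raw HTTP response into (version, code, reason, headers, body)."""
    # The first blank line (CRLF CRLF) marks the end of the headers.
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them lowercased.
        headers[name.strip().lower()] = value.strip()
    return version, int(code), reason, headers, body


raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Date: Sun, 22 Jun 2025 12:00:00 GMT\r\n"
    b"Content-Type: text/html\r\n"
    b"Connection: close\r\n"
    b"\r\n"
    b"<html>...</html>"
)
version, code, reason, headers, body = parse_response(raw)
print(code, reason, headers["content-type"])
```

Splitting each header at the first colon only matters for values such as the `Date` header, which contain colons themselves.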
User-Server Interaction: Cookies for Maintaining State
Because HTTP is a stateless protocol, it cannot natively remember a user across multiple requests. On its own, the server retains no user information from one request or session to the next.
To solve this, websites use cookies. A cookie is a small piece of data that a server sends to a user's browser, which the browser then sends back with every subsequent request to that server. These are used to maintain user-specific information across HTTP requests.
- When a user first visits a site, the server includes a Set-Cookie header in its HTTP response, containing a unique ID for the client.
- The browser stores this cookie.
- On subsequent visits, the browser includes a Cookie header with that same unique ID in its requests.
- The server can then use this ID to retrieve the user's session information, such as login status, shopping cart items, or personalized preferences.
This adds a session layer over HTTP, allowing persistent user identity across multiple visits.
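The cookie exchange can be sketched with a toy server-side handler. Everything here is an assumption made for illustration: the cookie name `session_id`, the in-memory `sessions` dictionary, and the `handle_request` function, which works on plain header dictionaries rather than real network traffic. Cookie parsing uses `http.cookies.SimpleCookie` from the standard library.

```python
import secrets
from http.cookies import SimpleCookie

# Server-side state, keyed by the unique ID stored in the cookie.
sessions: dict[str, dict] = {}


def handle_request(request_headers: dict[str, str]) -> dict[str, str]:
    """Return response headers, issuing a cookie on the first visit."""
    cookie = SimpleCookie(request_headers.get("Cookie", ""))
    morsel = cookie.get("session_id")
    if morsel is not None and morsel.value in sessions:
        sessions[morsel.value]["visits"] += 1  # known user: update their state
        return {}
    session_id = secrets.token_hex(8)          # new user: mint a unique ID
    sessions[session_id] = {"visits": 1}
    return {"Set-Cookie": f"session_id={session_id}"}


first = handle_request({})                          # first visit, no Cookie header
cookie_value = first["Set-Cookie"]                  # the browser stores this
second = handle_request({"Cookie": cookie_value})   # and sends it back later
session_id = cookie_value.split("=", 1)[1]
print(sessions[session_id]["visits"])               # 2
```

The protocol itself stays stateless. All of the state lives in the server's `sessions` table, and the cookie only carries the key.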
While cookies are essential for a modern, personalized web experience, they also raise privacy concerns, as they can be used to track and share user activity across different sites.
Web Caching and Conditional GET: Improving Performance
To improve performance and reduce network traffic, the web relies heavily on caching.
Web Caching (Proxy Servers)
A web cache, or proxy server, is a server that sits between a user and the origin server and saves copies of recently requested web objects (like images and pages) to reduce redundant data transfer.
When a user requests a resource, the request goes to the cache first.
If the object is in the cache (a "cache hit"), the cache returns it directly to the user, which is much faster than fetching it from the origin server.
If the object is not in the cache (a "cache miss"), the cache requests it from the origin server, saves a copy for future requests, and forwards it to the user.
By serving content from a nearby copy, caching reduces response time for users and decreases traffic on the internet.
A cache acts as both a client (to the origin server) and a server (to the requesting browser).
Caching is a cost-effective alternative to upgrading network links, especially when many users request the same content.
Web caches are often installed by ISPs, universities, or enterprises to reduce bandwidth usage and improve response time.
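The hit and miss behavior described above can be sketched as a small in-memory class. `WebCache` and `fake_origin` are hypothetical names for this illustration, and the sketch deliberately leaves out expiry and validation, which a real proxy cache needs.

```python
from typing import Callable


class WebCache:
    """Toy proxy cache: serve stored objects, fetch from the origin on a miss."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_origin
        self._store: dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def get(self, url: str) -> bytes:
        if url in self._store:      # cache hit: answer without contacting the origin
            self.hits += 1
            return self._store[url]
        self.misses += 1            # cache miss: act as a client to the origin
        body = self._fetch(url)
        self._store[url] = body     # keep a copy for future requests
        return body


origin_calls = []


def fake_origin(url: str) -> bytes:
    """Stand-in for the origin server; records each request it receives."""
    origin_calls.append(url)
    return b"<html>...</html>"


cache = WebCache(fake_origin)
cache.get("http://example.com/index.html")
cache.get("http://example.com/index.html")
print(cache.hits, cache.misses, len(origin_calls))  # 1 1 1
```

Two client requests produce only one request to the origin, which is the traffic reduction the section describes.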
Conditional GET
When using caching, a browser or proxy may store web objects (e.g., HTML pages, images) locally. If the origin server updates an object, the cached copy may become stale.
To check that a cached object is still valid (i.e., not outdated), HTTP uses a mechanism called a conditional GET.
- To verify a cached object, the cache or browser sends a request to the origin server with an If-Modified-Since: <timestamp> header, containing the timestamp of its cached version.
- If the object has not changed, the server sends back a minimal response: 304 Not Modified. The browser can then safely use its cached copy.
- If the object has changed, the server sends back a 200 OK response along with the new, updated version of the object.
This process saves bandwidth by avoiding the re-download of content that has not changed and improves performance without compromising on content freshness.
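The server side of a conditional GET can be sketched as follows. `respond` is a hypothetical function that takes the object's last-modified time and the request headers as plain values. It uses `email.utils` from the standard library to parse and format HTTP dates, and the timestamps are illustrative.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def respond(last_modified: datetime, request_headers: dict[str, str], body: bytes):
    """Return (status, headers, body) for a GET that may carry If-Modified-Since."""
    since = request_headers.get("If-Modified-Since")
    if since is not None and last_modified <= parsedate_to_datetime(since):
        return 304, {}, b""  # unchanged: no body is sent
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    return 200, headers, body


page = b"<html>...</html>"
modified = datetime(2025, 6, 22, 10, 0, 0, tzinfo=timezone.utc)

# First request: no validator, so the full object is returned.
status1, headers1, _ = respond(modified, {}, page)
stamp = headers1["Last-Modified"]

# Revalidation while the object is unchanged.
status2, _, body2 = respond(modified, {"If-Modified-Since": stamp}, page)

# Revalidation after the origin updated the object an hour later.
updated = datetime(2025, 6, 22, 11, 0, 0, tzinfo=timezone.utc)
status3, _, body3 = respond(updated, {"If-Modified-Since": stamp}, page)

print(status1, status2, status3)  # 200 304 200
```

The 304 response carries no entity body, so revalidating an unchanged object costs a round trip but almost no bandwidth.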