When a client on the internet uses a browser to request a URL from a server, more information than the URL itself is passed along in the form of HTTP request headers. This includes information such as the name and version of the web browser (user-agent), the language the client prefers (accept-language), and the DNS name of the server it wants to access (host). Applications and web servers often rely on these fields to operate properly.
For example, the (host) header field has been relied upon by web hosting companies for over 20 years – it allows them to host dozens or hundreds of websites on a single public IP address. Before including the (host) header became standard, web servers had to use a unique public IP for each website.
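To make the idea concrete, here is a minimal sketch of how a server might pick a website based on the (host) header when many sites share one IP address. The hostnames, directory paths, and the `document_root` function are hypothetical and exist only for illustration; real web servers implement this through their own virtual-host configuration.

```python
# Hypothetical mapping of Host header values to per-site document roots.
SITES = {
    "www.example.com": "/var/www/example",
    "blog.example.org": "/var/www/blog",
}
DEFAULT_ROOT = "/var/www/default"  # hypothetical fallback site


def document_root(headers):
    """Choose a site's document root from the Host header.

    Header names are matched case-insensitively and any ":port" suffix
    is ignored. IPv6 literals are not handled in this sketch.
    """
    host = ""
    for name, value in headers.items():
        if name.lower() == "host":
            host = value
            break
    hostname = host.split(":", 1)[0].strip().lower()
    return SITES.get(hostname, DEFAULT_ROOT)
```

Both sites can sit behind the same public IP address, because the server only decides which one to serve after reading the header.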
Unfortunately, a server is completely reliant upon the client to provide accurate information in these fields. Nothing is stopping a client from claiming to be a different type of device than it is (purposely or mistakenly). While this isn’t a crime, it can break server-side mechanisms that do things such as hand out one video resolution for mobile devices and another for desktops.
With Amazon CloudFront, these request headers can be overwritten at the edge so that when a request reaches the origin server, it sees whatever headers, URLs, and query strings the edge wants it to see, which is not necessarily what the original request contained.
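As a rough sketch of what such a rewrite could look like, the function below is written in the style of a Lambda@Edge request handler in Python. It assumes the event carries the request under `Records[0].cf.request`, with headers stored as lowercase keys that map to lists of key/value pairs. The `x-device-class` header name and the "mobi" substring check are illustrative assumptions, not something prescribed by CloudFront.

```python
def handler(event, context):
    """Rewrite a request at the edge before the origin sees it."""
    request = event["Records"][0]["cf"]["request"]
    headers = request["headers"]

    # Read the client-supplied User-Agent, defaulting to an empty string.
    user_agent = headers.get("user-agent", [{"value": ""}])[0]["value"]

    # Assumption: a crude substring check is enough to spot mobile browsers.
    device = "mobile" if "mobi" in user_agent.lower() else "desktop"

    # Hypothetical custom header carrying the normalized device class.
    headers["x-device-class"] = [{"key": "X-Device-Class", "value": device}]

    # Drop the query string so the origin sees a single canonical URL.
    request["querystring"] = ""
    return request
```

The origin then receives the rewritten request: a small, predictable device class instead of a free-form user-agent string, and no query string.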
Let’s imagine a website that has grown over the years, such that JPGs in its /images folder are referred to in different ways by different parts of the site. One page might point to /images/myimage.jpg, while another might point to the same object as /images/myimage.jpg?user=bob. Those are going to be seen as different objects by a CDN such as Amazon CloudFront. This means they take up twice as much space in the edge cache and result in a lower overall cache hit rate.
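One way to picture the fix is a function that normalizes URLs before they are used as cache keys, so that both references collapse to the same object. This is a minimal sketch using Python's standard library. The `IGNORED_PARAMS` set and the `cache_key` function are hypothetical, and it assumes the `user` parameter has no effect on the image that is returned.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical: query parameters that do not change the returned object.
IGNORED_PARAMS = {"user"}


def cache_key(url):
    """Return a normalized path-plus-query string to use as a cache key."""
    parts = urlsplit(url)
    # Keep only parameters that matter, sorted so ordering is irrelevant.
    kept = sorted(
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in IGNORED_PARAMS
    )
    return urlunsplit(("", "", parts.path, urlencode(kept), ""))
```

With this normalization, `/images/myimage.jpg` and `/images/myimage.jpg?user=bob` produce the same key, so the image is cached once rather than twice.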
For a moment, imagine that myimage.jpg isn’t a flat file. Instead, it is dynamically generated by a plugin on each request. In this case, the application relies on the (user-agent) header field to know which version of myimage.jpg to return. Should it return the high-resolution one or the low-resolution one? Maybe it relies on the (accept-language) field to know whether it should hand out the one with French text in the Latin alphabet or the one with Bulgarian in Cyrillic. Here we have the opposite problem from the query string case: if the cache key doesn’t take those headers into account, the French version of myimage.jpg will be cached as if it were the same object as the Bulgarian one. Whichever version is cached first will then be returned to every later client, including those that needed the other one.
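The sketch below illustrates one way to keep such variants apart: reduce the (accept-language) header to a small set of supported languages and fold the result into the query string, so that each variant gets its own cache key. The supported languages, the default, the `lang` parameter name, and both function names are assumptions made for this example. The same idea could be applied to a device class derived from (user-agent).

```python
SUPPORTED_LANGUAGES = ("fr", "bg")  # hypothetical set of available variants
DEFAULT_LANGUAGE = "fr"             # hypothetical fallback


def preferred_language(accept_language):
    """Pick the supported language with the highest q-value.

    Simplified parsing: only the primary subtag is compared, and entries
    with a malformed q-value are skipped.
    """
    best, best_q = DEFAULT_LANGUAGE, 0.0
    for item in accept_language.split(","):
        tag, _, params = item.partition(";")
        primary = tag.strip().lower().split("-", 1)[0]
        quality = 1.0  # an entry without q= counts as fully preferred
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                continue
        if primary in SUPPORTED_LANGUAGES and quality > best_q:
            best, best_q = primary, quality
    return best


def variant_cache_key(path, accept_language):
    """Build a cache key that separates the language variants of one path."""
    return f"{path}?lang={preferred_language(accept_language)}"
```

A client sending `en-US,en;q=0.9,bg;q=0.8,fr;q=0.7` maps to `/images/myimage.jpg?lang=bg`, while a client sending `fr-CA` maps to `/images/myimage.jpg?lang=fr`, so the two versions no longer overwrite each other in the cache.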
These are both examples of situations where you might want to alter the URL query string to force a caching behavior you prefer.