Content based HTTP Cache

  robbin        2013-05-24 05:12:59

Browsers may cache the webpages we visit. When a user types a URL in the address bar, the browser may cache the page returned by the server while displaying it. If the page has not changed, then the next time the browser requests it, the browser does not download the page again; it loads the cached copy instead. If the website explicitly indicates that the page has been updated, the browser downloads the page from the server again.

What's HTTP Cache?

You may be familiar with the cache mechanism of web browsers. For example, the RSS feed page for JavaEye is http://www.iteye.com/rss/news. When a web browser or RSS feed reader accesses this page, the JavaEye server sends a response that includes two headers as status markers:

Etag    "427fe7b6442f2096dff4f92339305444"
Last-Modified   Fri, 04 Sep 2009 05:55:43 GMT

These tell the browser the Etag and the last modification time of the page. The browser saves these two values along with the page content. When you access the page again later, the browser may send the following two headers to the JavaEye server:

If-None-Match   "427fe7b6442f2096dff4f92339305444"
If-Modified-Since   Fri, 04 Sep 2009 05:55:43 GMT

These tell the server the Etag and the last modification time of the locally cached copy. The server checks whether the resource has been updated since then. If it has not, the server does not generate the RSS again; it sends a 304 Not Modified response, which tells the browser to use its cached page.
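The server-side decision can be sketched in plain Ruby, outside of any framework. This is a simplified illustration, not JavaEye's actual code: `respond`, `fresh?`, and the `feed` hash are hypothetical names, and only an exact match on a single Etag is handled (no `*` wildcard, tag lists, or weak validators). When If-None-Match is present it takes precedence over If-Modified-Since, as the HTTP specification requires.

```ruby
require "time"

# Hypothetical helper: returns [status, headers, body] for a conditional GET.
# `resource` is a Hash with :etag (quoted String), :last_modified (Time),
# and :render (a callable that produces the body).
def respond(request_headers, resource)
  validators = {
    "ETag"          => resource[:etag],
    "Last-Modified" => resource[:last_modified].httpdate
  }

  if fresh?(request_headers, resource)
    [304, validators, ""]                      # client copy is still valid
  else
    [200, validators, resource[:render].call]  # generate the page again
  end
end

# True when the validators sent by the client still describe the resource.
def fresh?(request_headers, resource)
  if_none_match     = request_headers["If-None-Match"]
  if_modified_since = request_headers["If-Modified-Since"]

  # If-None-Match wins when it is present.
  return if_none_match == resource[:etag] if if_none_match
  return false unless if_modified_since

  resource[:last_modified].to_i <= Time.httpdate(if_modified_since).to_i
rescue ArgumentError
  false # unparseable date: fall back to a full response
end

# Values taken from the example headers above; the body is a placeholder.
feed = {
  etag: '"427fe7b6442f2096dff4f92339305444"',
  last_modified: Time.httpdate("Fri, 04 Sep 2009 05:55:43 GMT"),
  render: -> { "<rss>...</rss>" }
}

# First visit: no validators yet, so the full page is returned.
first_status, first_headers, _body = respond({}, feed)

# Second visit: the client echoes the validators it saved.
second_status, _headers, _body = respond(
  { "If-None-Match"     => first_headers["ETag"],
    "If-Modified-Since" => first_headers["Last-Modified"] },
  feed
)
```

Here the first call yields status 200 and the second yields 304 with an empty body, so the expensive `render` step is skipped on the repeat visit.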

This is HTTP Cache. This kind of content based cache mechanism can save a great deal of server resources, and it also reduces the number of downloads, which in turn reduces the bandwidth cost.

What's the use of HTTP Cache?

On a dynamic website, the server side program usually does not handle the If-None-Match and If-Modified-Since headers; it generates the page and sends it to the browser every time there is a request. Since users generally do not refresh a page frequently, people may think this content based cache mechanism is of little use, but that is not the case.

1. Web crawlers such as Google's crawler are smart enough to recognize the status information of a resource. If you use the cache mechanism, you can reduce the number of full page downloads by the crawler.

For example, Google's crawler crawls JavaEye 150,000 times per day, but in fact no more than 10,000 pages on JavaEye are updated each day. Because some pages are updated frequently, Google's crawler revisits them frequently, and this wastes a lot of server resources. With HTTP Cache, a page needs to be downloaded only when its content has been updated; otherwise the server can send a 304 Not Modified response to the crawler. This not only reduces the server load but also improves the efficiency of Google's crawler.

2. For pages that are not updated frequently, we can use HTTP Cache.

Some historical discussion pages were active a few months ago and are now rarely updated. Users may still reach such a page through search; after they visit it once, subsequent visits should not need to download the page from the server again. The local cache is enough.
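To put the crawler figures from point 1 in perspective, here is a rough calculation. It rests on a simplifying assumption that is not in the original text: each updated page is fetched in full once per day, and every other crawler request can be answered with 304 Not Modified.

```ruby
# Figures quoted in point 1 above.
crawls_per_day        = 150_000
updated_pages_per_day = 10_000 # upper bound given in the text

# Assumption for illustration: one full response per updated page per day,
# and a 304 Not Modified for every remaining crawler request.
full_responses = updated_pages_per_day
not_modified   = crawls_per_day - full_responses
share          = (not_modified * 100.0 / crawls_per_day).round(1)

puts "#{not_modified} of #{crawls_per_day} requests (#{share}%) could be 304 responses"
```

Under that assumption, about 93% of the crawler's requests would not require the page to be generated at all.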

How to use HTTP Cache?

Implementing HTTP Cache is actually a very simple task, especially in Rails. Taking the JavaEye RSS feed page above as an example, only a few lines of code are needed:

def news
  fresh_when(:last_modified => News.last.created_at, :etag => News.last)
end

This uses the latest news item as the Etag and its creation time as the last modification time of the resource. That is all it takes. If the status sent by the browser is the same as the status on the server, the content has not been updated and the server sends 304 Not Modified directly; if they don't match, the server generates the page and sends it to the browser.

The above is just a simple example. If the status needs to take more into account, it is still not very difficult. Take, for example, the blog RSS feed page of JavaEye (http://robbin.iteye.com/rss):

@blogs = @blog_owner.last_blogs
@hash = @blogs.collect{|b| {b.id => b.post.modified_at.to_i + b.posts_count}}.hash
if stale?(:last_modified => (@blog_owner.last_blog.post.modified_at || @blog_owner.last_blog.post.created_at), :etag => @hash)
  render :template => "rss/blog"
end

This implementation is a bit more complicated. We need to check whether the blog content has been updated, so we compute a hash over each post's last modification time and comment count and use this hash value as the Etag. If any content or comment changes, the Etag value changes as well.
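The same idea can be tried in plain Ruby without Rails. The sketch below is a variation, not the original implementation: `Entry` and `feed_etag` are hypothetical names, the three fields are joined into a string rather than summed, and an MD5 digest replaces `Object#hash`, because Ruby does not guarantee that `hash` returns the same value in different processes.

```ruby
require "digest/md5"

# Hypothetical stand-in for a blog entry; only the fields the Etag depends on.
Entry = Struct.new(:id, :modified_at, :posts_count)

# Build a quoted Etag from each entry's id, modification time, and comment count.
# A digest is used so that every server process computes the same value.
def feed_etag(entries)
  fingerprint = entries
    .map { |e| "#{e.id}:#{e.modified_at.to_i}:#{e.posts_count}" }
    .join(",")
  %("#{Digest::MD5.hexdigest(fingerprint)}")
end

modified = Time.at(1_252_043_743) # arbitrary fixed timestamp for the example

before = feed_etag([Entry.new(1, modified, 3), Entry.new(2, modified, 0)])
after  = feed_etag([Entry.new(1, modified, 4), Entry.new(2, modified, 0)]) # one new comment

puts before == after # the two Etags differ, so a cached feed would be refreshed
```

A single new comment changes the fingerprint, so a client holding the old Etag receives the regenerated feed instead of a 304.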

Reference : http://robbinfan.com/blog/13/http-cache-implement
