If you follow my blog, you know I don't usually discuss posts from other bloggers' blogs.
But Greg Linden opened a discussion about caching in search systems, which was also picked up on Chad's newly created blog.
Both discuss a comparison of caching methods in search engines, based on research recently done at Yahoo!: "The Impact of Caching on Search Engines" (PDF).
It is super interesting to read about static vs. dynamic caching, caching of search queries vs. terms, and the conclusions they reach.
Although some of this discussion is not so relevant to our work at outbrain, caching in general is something I have been thinking about lately as a way to reduce load on back-end servers.
My previous experience at Shopping.com showed that you are better off investing in a fast, scalable back-end built on commodity hardware than in caching solutions that, in some cases, yield a very low hit rate. But that was true for the Shopping.com case.
The outbrain case is different: the data is very dynamic, it changes very frequently, and changes should be reflected in query results within less than a minute. This forces a very high rate of "dirty flagging" of cached results, especially the most frequently used ones.
This brings up the question of where to locate the caching layer. During my reading on this subject I bumped into a few solutions:
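One crude way to bound staleness under a sub-minute freshness requirement, without explicit dirty flagging of every cached result, is a short per-entry TTL. A minimal in-process sketch (class and names here are hypothetical, and Python is just for illustration):

```python
import time


class TTLCache:
    """Tiny in-process cache where every entry expires after a fixed TTL.

    A short TTL (e.g. 60s) caps how stale a cached query result can get,
    trading some extra back-end load for much simpler invalidation.
    """

    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None  # miss
        expires_at, value = entry
        if time.time() >= expires_at:
            del self.store[key]  # expired: treat as a miss
            return None
        return value

    def put(self, key, value):
        self.store[key] = (time.time() + self.ttl, value)


cache = TTLCache(ttl_seconds=60)
cache.put("query:foo", ["result1", "result2"])
fresh = cache.get("query:foo")  # served from cache while still fresh
```

The obvious downside is that a hot result may be recomputed once a minute on every server holding its own cache, which is exactly the trade-off the options below differ on.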
- Caching at the web application level.
- + with a good hit rate it can significantly reduce load on the DB servers.
- + on a cache hit, usually no network call to the second tier is needed.
- - in a distributed architecture the chances of a high hit rate are fairly low, since each server caches independently.
- - dirty flagging is not an easy task in a higher-level cache.
- A network caching solution such as memcached.
- + each cached object is held only once across the whole system, so the chances for a high hit rate are fairly good.
- + this solution uses the web servers' memory fairly efficiently.
- + the cached objects are serialized objects of the programming language you use, so there is no need to build objects from raw data.
- - still involves a network call to the server holding the data you are looking for.
- - dirty flagging is still an issue, although a smaller one, since each cached object has a virtual "single" location.
- Caching data in the DB (e.g., using the MySQL query cache).
- + the most responsive to dirty flagging, since invalidation happens right where the data changes; this is a very important argument in systems whose data changes all the time.
- + easier in terms of development effort.
- - a network call to the DB tier is needed.
- - objects must be built from raw data.
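For the memcached option, the usual shape is a read-through helper: check the shared cache, deserialize on a hit, compute and store on a miss. A minimal sketch (the function name and the 60-second expiry are my own assumptions; the client can be any memcached-style client, e.g. pymemcache's `Client`):

```python
import pickle


def get_or_compute(client, key, compute):
    """Read-through helper over a memcached-style client.

    `client` needs get(key) -> bytes|None and set(key, value, expire=...),
    the interface exposed by common memcached clients.
    """
    raw = client.get(key)
    if raw is not None:
        # Cache hit: the stored value is a pickled object of our own
        # programming language, so no rebuilding from raw DB rows.
        return pickle.loads(raw)
    value = compute()
    # Miss: compute from the back end and cache with a short expiry
    # to bound staleness, since explicit dirty flagging is still hard.
    client.set(key, pickle.dumps(value), expire=60)
    return value
```

With pymemcache this would be called as `get_or_compute(Client(('host', 11211)), 'query:foo', run_query)`, but the same shape works with any client exposing get/set.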
So far we are far from hitting any bottleneck, but we are observing the system to see the trends and determine which solution will be the most appropriate.
Any thoughts or hints from readers?