Django Cacheback¶
Cacheback is an extensible caching library that refreshes stale cache items asynchronously using a Celery or RQ task (via django-rq). The key idea is that it's better to serve a stale item (and repopulate the cache asynchronously) than to block the response process in order to populate the cache synchronously.
Using this library, you can rework your views so that all reads are from cache - which can be a significant performance boost.
A corollary of this technique is that cache stampedes are easily avoided, preventing sudden surges of expensive reads when cached items become stale.
Cacheback provides a decorator for simple usage, a subclassable base class for more fine-grained control and helper classes for working with querysets.
Example¶
Consider a view for showing a user’s tweets:
from django.shortcuts import render

from myproject.twitter import fetch_tweets


def show_tweets(request, username):
    return render(
        request,
        'tweets.html',
        {'tweets': fetch_tweets(username)}
    )
This works fine but the fetch_tweets function involves an HTTP round-trip and is slow.
Performance can be improved by using Django’s low-level cache API:
from django.shortcuts import render
from django.core.cache import cache

from myproject.twitter import fetch_tweets


def show_tweets(request, username):
    return render(
        request,
        'tweets.html',
        {'tweets': fetch_cached_tweets(username)}
    )


def fetch_cached_tweets(username):
    tweets = cache.get(username)
    if tweets is None:
        tweets = fetch_tweets(username)
        cache.set(username, tweets, 60*15)
    return tweets
Now tweets are cached for 15 minutes after they are first fetched, using the twitter username as a key. This is obviously a performance improvement but the shortcomings of this approach are:
For a cache miss, the tweets are fetched synchronously, blocking code execution and leading to a slow response time.
This in turn exposes the view to a ‘cache stampede’ where multiple expensive reads run simultaneously when the cached item expires. Under heavy load, this can bring your site down and make you sad.
Now, consider an alternative implementation that uses a Celery task to repopulate the cache asynchronously instead of during the request/response cycle:
import datetime

from django.shortcuts import render
from django.core.cache import cache

from myproject.tasks import update_tweets


def show_tweets(request, username):
    return render(
        request,
        'tweets.html',
        {'tweets': fetch_cached_tweets(username)}
    )


def fetch_cached_tweets(username):
    item = cache.get(username)
    if item is None:
        # Scenario 1: Cache miss - return an empty result set and trigger a refresh
        update_tweets.delay(username, 60*15)
        return []
    tweets, expiry = item
    if expiry < datetime.datetime.now():
        # Scenario 2: Cached item is stale - return it but trigger a refresh
        update_tweets.delay(username, 60*15)
    return tweets
where the myproject.tasks.update_tweets task is implemented as:
import datetime

from celery import task
from django.core.cache import cache

from myproject.twitter import fetch_tweets


@task()
def update_tweets(username, ttl):
    tweets = fetch_tweets(username)
    expiry = datetime.datetime.now() + datetime.timedelta(seconds=ttl)
    cache.set(username, (tweets, expiry), 2592000)
Some things to note:

Items are stored in the cache as (data, expiry_timestamp) tuples, using memcache's maximum expiry setting (2592000 seconds, i.e. 30 days). By using this value, we are effectively bypassing memcache's replacement policy in favour of our own.

As the comments indicate, there are two scenarios to consider:

Cache miss. In this case, we don't have any data (stale or otherwise) to return. In the example above, we trigger an asynchronous refresh and return an empty result set. In other scenarios, it may make sense to perform a synchronous refresh.
Cache hit but with stale data. Here we return the stale data but trigger a Celery task to refresh the cached item.
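The (data, expiry_timestamp) convention used above can be sketched with a couple of stdlib-only helpers (an illustration of the pattern, not library code; make_item and is_stale are hypothetical names):

```python
import datetime

MEMCACHE_MAX_EXPIRY = 2592000  # memcache's maximum expiry: 30 days


def make_item(data, lifetime_seconds):
    # Pair the data with the timestamp after which it should count as stale
    expiry = datetime.datetime.now() + datetime.timedelta(seconds=lifetime_seconds)
    return (data, expiry)


def is_stale(item):
    # An item is stale once its recorded expiry timestamp has passed
    data, expiry = item
    return expiry < datetime.datetime.now()
```

Because the item is stored with memcache's maximum expiry, staleness is decided entirely by this timestamp rather than by the cache backend.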
This pattern of re-populating the cache asynchronously works well. Indeed, it is the basis for the cacheback library.
Here’s the same functionality implemented using a django-cacheback decorator:
from django.shortcuts import render

from myproject.twitter import fetch_tweets
from cacheback.decorators import cacheback


def show_tweets(request, username):
    return render(
        request,
        'tweets.html',
        {'tweets': cacheback(60*15, fetch_on_miss=False)(fetch_tweets)(username)}
    )
Here the decorator simply wraps the fetch_tweets
function - nothing else is
needed. Cacheback ships with a flexible Celery task that can run any function
asynchronously.
To be clear, the behaviour of this implementation is as follows:
The first request for a particular user's tweets will be a cache miss. The default behaviour of Cacheback is to fetch the data synchronously in this situation, but by passing fetch_on_miss=False, we indicate that it's ok to return None in this situation and to trigger an asynchronous refresh.
A Celery worker will pick up the job to refresh the cache for this user's tweets. It will import the fetch_tweets function and execute it with the correct username. The resulting data will be added to the cache with a lifetime of 15 minutes.
Any requests for this user's tweets while Celery is refreshing the cache will also return None. However, Cacheback is aware of cache stampedes and does not trigger any additional jobs for refreshing the cached item.
Once the cached item is refreshed, any subsequent requests within the next 15 minutes will be served from cache.
The first request after 15 minutes has elapsed will serve the (now stale) cached result but will trigger a Celery task to fetch the user's tweets and repopulate the cache.
Much of this behaviour can be configured by using a subclass of
cacheback.Job
. The decorator is only intended for simple use-cases. See
the Sample usage and API documentation for more information.
All of the worker-related things above can also be done using RQ instead of Celery.
Contents¶
Installation¶
You need to do three things:
Install django-cacheback¶
To install with Celery support, run:
$ pip install django-cacheback[celery]
If you want to install with RQ support, just use:
$ pip install django-cacheback[rq]
After installing the package and dependencies, add cacheback
to your INSTALLED_APPS
.
If you want to use RQ as your task queue, you need to set CACHEBACK_TASK_QUEUE
in your settings to rq
.
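Putting these together, a minimal settings.py fragment might look like this (a sketch; the other INSTALLED_APPS entries are placeholders, and CACHEBACK_TASK_QUEUE is only needed when using RQ):

```python
# settings.py (fragment)
INSTALLED_APPS = [
    'django.contrib.contenttypes',
    'django.contrib.auth',
    # ... your other apps ...
    'cacheback',
]

# Only required when using RQ as the task queue; omit for Celery
CACHEBACK_TASK_QUEUE = 'rq'
```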
Install a message broker¶
Celery requires a message broker. Use Celery's tutorial to help set one up. I recommend RabbitMQ.
For RQ you need to set up a redis-server and configure django-rq
. Please look
up the django-rq installation guide for more details.
Set up a cache¶
You also need to ensure you have a cache set up. Most likely, you’ll be using memcache so your settings will include something like:
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
        'LOCATION': '127.0.0.1:11211',
    }
}
Logging¶
You may also want to configure logging handlers for the ‘cacheback’ named logger. To set up console logging, use something like:
LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'filters': {
        'require_debug_false': {
            '()': 'django.utils.log.RequireDebugFalse'
        }
    },
    'handlers': {
        'console': {
            'level': 'DEBUG',
            'class': 'logging.StreamHandler',
        }
    },
    'loggers': {
        'cacheback': {
            'handlers': ['console'],
            'level': 'DEBUG',
            'propagate': False,
        },
    }
}
Sample usage¶
As a decorator¶
Simply wrap the function whose results you want to cache:
import requests

from cacheback.decorators import cacheback


@cacheback()
def fetch_tweets(username):
    url = "https://twitter.com/statuses/user_timeline.json?screen_name=%s"
    return requests.get(url % username).json()
The default behaviour of the cacheback
decorator is to:
Cache items for 10 minutes.
When the cache is empty for a given key, the data will be fetched synchronously.
You can parameterise the decorator to cache items for longer and also to not block on a cache miss:
import requests

from cacheback.decorators import cacheback


@cacheback(lifetime=1200, fetch_on_miss=False)
def fetch_tweets(username):
    url = "https://twitter.com/statuses/user_timeline.json?screen_name=%s"
    return requests.get(url % username).json()
Now:
Items will be cached for 20 minutes;
For a cache miss,
None
will be returned and the cache refreshed asynchronously.
As an instance of cacheback.Job
¶
Subclassing cacheback.Job
gives you complete control over the caching
behaviour. The only method that must be overridden is fetch
which is
responsible for fetching the data to be cached:
import requests

from cacheback.base import Job


class UserTweets(Job):
    def fetch(self, username):
        url = "https://twitter.com/statuses/user_timeline.json?screen_name=%s"
        return requests.get(url % username).json()
Client code only needs to be aware of the get
method which returns the
cached data. For example:
from django.shortcuts import render


def tweets(request, username):
    return render(request,
                  'tweets.html',
                  {'tweets': UserTweets().get(username)})
You can control the lifetime and behaviour on cache miss using either class attributes:
import requests

from cacheback.base import Job


class UserTweets(Job):
    lifetime = 60*20
    fetch_on_miss = False

    def fetch(self, username):
        url = "https://twitter.com/statuses/user_timeline.json?screen_name=%s"
        return requests.get(url % username).json()
or by overriding methods:
import time

import requests
from cacheback.base import Job


class UserTweets(Job):
    def fetch(self, username):
        url = "https://twitter.com/statuses/user_timeline.json?screen_name=%s"
        return requests.get(url % username).json()

    def expiry(self, username):
        now = time.time()
        if username.startswith('a'):
            return now + 60*20
        return now + 60*10

    def should_item_be_fetched_synchronously(self, username):
        return username.startswith('a')
In the above toy example, the cache behaviour will be different for usernames starting with ‘a’.
Invalidation¶
If you want to programmatically invalidate a cached item, use the invalidate
method on a job instance:
job = UserTweets()
job.invalidate(username)
This will trigger a new asynchronous refresh of the item.
You can also simply remove an item from the cache so that the next request will trigger the refresh:
job.delete(username)
Setting cache values¶
If you want to update the cache programmatically, use the set method on a job instance (this can be useful when your program discovers updates through a separate mechanism, for example, or for caching partial or derived data):
tweets_job = UserTweets()
user_tweets = tweets_job.get(username)
new_tweet = PostTweet(username, 'Trying out Cacheback!')
# Naive example, assuming no other process would have updated the tweets
tweets_job.set(username, user_tweets + [new_tweet])
The data to be cached can be specified in a few ways. Firstly it can be the last
positional argument, as above. If that is unclear, you can also use the keyword data
:
tweets_job.set(username, data=(current_tweets + [new_tweet]))
And if your cache method already uses a keyword argument called data
you can specify
the name of a different parameter as a class variable called set_data_kwarg
:
class CustomKwUserTweets(UserTweets):
    set_data_kwarg = 'my_cache_data'

custom_tweets_job = CustomKwUserTweets()
custom_tweets_job.set(username, my_cache_data=(user_tweets + [new_tweet]))
This also works with a decorated function:
@cacheback()
def fetch_tweets(username):
    url = "https://twitter.com/statuses/user_timeline.json?screen_name=%s"
    return requests.get(url % username).json()

user_tweets = fetch_tweets(username)
new_tweet = PostTweet(username, 'Trying out Cacheback!')
fetch_tweets.job.set(fetch_tweets, username, (user_tweets + [new_tweet]))
or:
fetch_tweets.job.set(fetch_tweets, username, data=(current_tweets + [new_tweet]))
And you can specify the set_data_kwarg
in the decorator params as you’d expect:
@cacheback(set_data_kwarg='my_cache_data')
def fetch_tweets(username):
    url = "https://twitter.com/statuses/user_timeline.json?screen_name=%s"
    return requests.get(url % username).json()

fetch_tweets.job.set(fetch_tweets, username, my_cache_data=(user_tweets + [new_tweet]))
NOTE: If your fetch method or cacheback-decorated function takes a named parameter called data and you wish to use the set method, you must provide a new value for the set_data_kwarg parameter and not pass the data to cache as the last positional argument. Otherwise the value of the data parameter will be used as the data to cache.
Checking what’s in the cache¶
On occasion you may wish to check exactly what Cacheback has stored in the cache without triggering a refresh; this is usually useful for seeing if values have been updated since they were last retrieved. The raw_get method allows you to do that, and uses the same semantics as get, set, etc. It returns the value that's actually stored in the cache, i.e. the (expiry, data) tuple, or None if no value has yet been set:
# Don't want to trigger a refetch at this point
raw_cache_value = fetch_tweets.job.raw_get(fetch_tweets, username)
if raw_cache_value is not None:
    expiry, cached_tweets = raw_cache_value
Post-processing¶
The cacheback.Job
instance provides a process_result method that can be
overridden to modify the result value being returned. You can use this to append
information about whether the result is being returned from cache or not.
API¶
Jobs¶
The main class is cacheback.base.Job
. The methods that are intended to be
called from client code are:
class cacheback.base.Job
A cached read job. This is the core class for the package; it is intended to be subclassed so that the caching behaviour can be customised.

delete(*raw_args, **raw_kwargs)
Remove an item from the cache.

get(*raw_args, **raw_kwargs)
Return the data for this function (using the cache if possible). This method is not intended to be overridden.

invalidate(*raw_args, **raw_kwargs)
Mark a cached item invalid and trigger an asynchronous job to refresh the cache.
It has some class properties that can be used to configure simple behaviour:

cache_alias = None
Specifies which cache to use from your CACHES setting. It defaults to default.

fetch_on_miss = True
Whether to perform a synchronous refresh when a result is missing from the cache. The default behaviour is to fetch synchronously when the cache is empty: stale results are generally acceptable, but no results are not.

fetch_on_stale_threshold = None
Age threshold beyond which a stale result triggers a synchronous refresh. The default behaviour is never to fetch synchronously for stale items, but there will be times when an item is too stale to be returned.

lifetime = 600
Default cache lifetime is 10 minutes. After this time, the result will be considered stale and requests will trigger a job to refresh it.

refresh_timeout = 60
Timeout period during which no new tasks will be created for a single cache item. This time should cover the normal time required to refresh the cache.

task_options = None
Overrides options for refresh_cache.apply_async (e.g. queue).
There are also several methods intended to be overridden and customised:

empty()
Return the appropriate value for a cache miss (and when repopulation of the cache is deferred).

expiry(*args, **kwargs)
Return the expiry timestamp for this item.

fetch(*args, **kwargs)
Return the data for this job; this is where the expensive work should be done.

key(*args, **kwargs)
Return the cache key to use. If you're passing anything but primitive types to the get method, it's likely that you'll need to override this method.

process_result(result, call, cache_status, sync_fetch=None)
Transform the fetched data right before it is returned from get(...).
Parameters:
result - the result to be returned
call - a named tuple with properties 'args' and 'kwargs' that holds the call args and kwargs
cache_status - a status integer, accessible as the class constants self.MISS, self.HIT and self.STALE
sync_fetch - a boolean indicating whether a synchronous fetch was performed; a value of None indicates that no fetch was required (i.e. the result was a cache hit)

should_missing_item_be_fetched_synchronously(*args, **kwargs)
Return whether to refresh an item synchronously when it is missing from the cache.

should_stale_item_be_fetched_synchronously(delta, *args, **kwargs)
Return whether to refresh an item synchronously when it is found in the cache but stale.

timeout(*args, **kwargs)
Return the refresh timeout for this item.
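When a job is called with non-primitive arguments, an overridden key method could delegate to a small hashing helper along these lines (a hypothetical sketch; make_key is not part of the library):

```python
import hashlib
import json


def make_key(prefix, *args, **kwargs):
    # Serialise the call arguments deterministically (default=str covers
    # objects such as dates or model instances) and hash the result so the
    # key is short and stable regardless of argument complexity
    payload = json.dumps([args, sorted(kwargs.items())], default=str)
    return '%s:%s' % (prefix, hashlib.md5(payload.encode('utf-8')).hexdigest())
```

A Job subclass's key() override could then return make_key('user-tweets', *args, **kwargs), giving identical keys for identical calls.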
Queryset jobs¶
There are two classes for easy caching of ORM reads. These don't need subclassing; instead they take the model class as an __init__ parameter.

class cacheback.jobs.QuerySetFilterJob(model, lifetime=None, fetch_on_miss=None, cache_alias=None, task_options=None)
For ORM reads that use the filter method.
fetch(*args, **kwargs)
Return the data for this job; this is where the expensive work should be done.

class cacheback.jobs.QuerySetGetJob(model, lifetime=None, fetch_on_miss=None, cache_alias=None, task_options=None)
For ORM reads that use the get method.
fetch(*args, **kwargs)
Return the data for this job; this is where the expensive work should be done.
Example usage:
from django.contrib.auth import models
from django.shortcuts import render

from cacheback.jobs import QuerySetGetJob, QuerySetFilterJob


def user_detail(request, username):
    user = QuerySetGetJob(models.User).get(username=username)
    return render(request, 'user.html', {'user': user})


def staff(request):
    staff = QuerySetFilterJob(models.User).get(is_staff=True)
    return render(request, 'staff.html', {'users': staff})
These classes are helpful for simple ORM reads but won’t be suitable for more
complicated queries where filter
is chained together with exclude
.
Settings¶
CACHEBACK_CACHE_ALIAS
¶
This specifies which cache to use from your CACHES
setting. It defaults to
default
.
CACHEBACK_VERIFY_CACHE_WRITE
¶
This verifies the data is correctly written to memcache. If not, then a
RuntimeError
is raised. Defaults to True
.
CACHEBACK_TASK_QUEUE
¶
This defines the task queue to use. Valid options are rq
and celery
.
Make sure that the corresponding task queue is configured too.
CACHEBACK_TASK_IGNORE_RESULT
¶
This specifies whether to ignore the result of the refresh_cache
task
and prevent Celery/RQ from storing it into its results backend.
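For reference, these settings could be stated explicitly in settings.py as follows (a sketch; the values shown reflect the behaviour described above, and CACHEBACK_TASK_QUEUE should match whichever queue you configured):

```python
# settings.py (fragment) - explicit Cacheback configuration
CACHEBACK_CACHE_ALIAS = 'default'        # which entry of CACHES to use
CACHEBACK_TASK_QUEUE = 'celery'          # or 'rq'
CACHEBACK_VERIFY_CACHE_WRITE = True      # raise RuntimeError on failed writes
CACHEBACK_TASK_IGNORE_RESULT = False     # let the backend store task results
```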
Advanced usage¶
Three thresholds for cache invalidation¶
It’s possible to employ three threshold times to control cache behaviour:
A time after which the cached item is considered 'stale'. When a stale item is requested, it is returned, but an async job is triggered to refresh it. This is controlled by the lifetime attribute of the Job class; the default value is 600 seconds (10 minutes).

A time after which the cached item is removed (a cache miss). If you have fetch_on_miss=True, this will trigger a synchronous data fetch. This is controlled by the cache_ttl attribute of the Job class; the default value is 2592000 seconds, which is the maximum TTL that memcached supports.

A timeout value for the refresh job. If the cached item is not refreshed within this time, another async refresh job will be triggered. This is controlled by the refresh_timeout attribute of the Job class and defaults to 60 seconds.
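The interplay of the first two thresholds can be sketched as a pure-Python decision function (an illustration of the policy described above, not library code):

```python
def cache_action(age_seconds, lifetime=600, cache_ttl=2592000):
    # age_seconds: how long ago the cached item was written
    if age_seconds < lifetime:
        return 'hit'    # fresh: serve straight from cache
    if age_seconds < cache_ttl:
        return 'stale'  # serve stale data and trigger an async refresh
    return 'miss'       # the backend has evicted the item: a cache miss
```

The third threshold, refresh_timeout, then governs how long duplicate refresh jobs are suppressed once an async refresh has been triggered.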
Contributing¶
Make sure to have poetry installed. Then, start by cloning the repo, and installing the dependencies:
$ pip install poetry  # if not already installed
$ cd <repository directory>
$ poetry install
Running tests¶
Use:
# only runs actual tests
$ make pytests
or:
# runs tests but also linters like black, isort and flake8
$ make tests
To generate html coverage:
$ make coverage-html
Finally, you can also use tox to run tests against all supported Django and Python versions:
$ tox
Sandbox VM¶
There is a VagrantFile
for setting up a sandbox VM where you can play around
with the functionality. Bring up the Vagrant box:
$ vagrant up
This may take a while but will set up an Ubuntu Precise64 VM with RabbitMQ installed. You can then SSH into the machine:
$ vagrant ssh
$ cd /vagrant/sandbox
You can now decide to run the Celery implementation:
$ honcho -f Procfile.celery start
Or you can run the RQ implementation:
$ honcho -f Procfile.rq start
The above commands will start a Django runserver and the selected task worker.
The dummy site will be available at http://localhost:8080
on your host
machine. There are some sample views in sandbox/dummyapp/views.py
that
exercise django-cacheback.