Caching
An opt-in, in-process LRU cache attached per deployment. Disabled by
default — turn it on per ModelSpec if your workload has repeated
inputs (forecasting the same series multiple times in an hour, the
same prompt with the same seed, etc.).
from sheaf import ModelSpec
from sheaf.api.base import ModelType
from sheaf.cache import CacheConfig

spec = ModelSpec(
    name="forecaster",
    model_type=ModelType.TIME_SERIES,
    backend="chronos2",
    cache=CacheConfig(
        enabled=True,
        max_size=1024,    # entries; LRU evicts beyond this
        ttl_seconds=300,  # optional; entries expire after 5 min
    ),
)
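To make the two settings concrete, here is a minimal sketch of how an LRU cache with an optional TTL typically behaves. It is an illustration only, not sheaf's implementation: the class name `LRUCacheSketch` and the injectable `clock` parameter are hypothetical, and whether sheaf expires entries lazily on read (as here) is an assumption.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional


class LRUCacheSketch:
    """Illustrative LRU cache with optional TTL (hypothetical, not sheaf's code)."""

    def __init__(
        self,
        max_size: int,
        ttl_seconds: Optional[float] = None,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_size = max_size
        self.ttl_seconds = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: "OrderedDict[Hashable, tuple[float, Any]]" = OrderedDict()

    def get(self, key: Hashable) -> Any:
        """Return the cached value, or None on a miss or an expired entry."""
        item = self._entries.get(key)
        if item is None:
            return None
        stored_at, value = item
        if self.ttl_seconds is not None and self._clock() - stored_at > self.ttl_seconds:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key: Hashable, value: Any) -> None:
        """Store a value, evicting least recently used entries beyond max_size."""
        self._entries[key] = (self._clock(), value)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # oldest entry sits at the front
```

With `max_size=2`, storing a third key evicts whichever of the first two was read least recently; with `ttl_seconds=300`, an entry older than five minutes is treated as a miss even if it was never evicted.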
The cache key is SHA-256(deployment_name || JSON-canonical request),
with request_id always excluded so that two calls differing only in
their client-generated UUID share a hit.
You can exclude additional fields via CacheConfig.exclude_fields.
Do not exclude fields that change the output. A common case is the
diffusion seed: the same prompt with a different seed should miss,
while the same prompt with the same seed should hit on a retry, so
leave seed in the key.
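The key derivation described above can be sketched with the standard library. This is a hedged approximation: the function name `cache_key`, the NUL separator between the deployment name and the request, the exact JSON canonicalization (sorted keys, compact separators), and the handling of top-level fields only are all assumptions, not sheaf's documented behavior.

```python
import hashlib
import json
from typing import Any, Iterable, Mapping


def cache_key(
    deployment_name: str,
    request: Mapping[str, Any],
    exclude_fields: Iterable[str] = (),
) -> str:
    """Hash the deployment name plus a canonical JSON form of the request."""
    # request_id is always dropped; exclude_fields adds to that set.
    skipped = {"request_id", *exclude_fields}
    payload = {k: v for k, v in request.items() if k not in skipped}

    # Sorting keys makes the encoding independent of dict insertion order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))

    digest = hashlib.sha256()
    digest.update(deployment_name.encode("utf-8"))
    digest.update(b"\x00")  # assumed separator so name and body cannot blur together
    digest.update(canonical.encode("utf-8"))
    return digest.hexdigest()


retry_a = cache_key("diffuser", {"prompt": "a cat", "seed": 7, "request_id": "u1"})
retry_b = cache_key("diffuser", {"seed": 7, "prompt": "a cat", "request_id": "u2"})
new_seed = cache_key("diffuser", {"prompt": "a cat", "seed": 8, "request_id": "u3"})
```

Here `retry_a` and `retry_b` are equal because they differ only in `request_id` and key order, while `new_seed` differs because the seed is part of the key. Passing `exclude_fields=["seed"]` would make all three collide, which is why the seed should stay in.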
Where the cache sits in the request path
HTTP request
│
├─▶ Feast resolution (if feature_ref)
│
├─▶ Cache lookup (key includes resolved features)
│ │
│ └── HIT → return decoded response
│
├─▶ Batch dispatch (@serve.batch + bucket_by)
│
├─▶ backend.batch_predict
│
└─▶ Cache store + HTTP response
The lookup happens after Feast resolves features, so the cache key reflects actual values, not a feature reference. Two requests for the same entity at different times (with different resolved features) produce distinct entries, as they should.
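The ordering can be illustrated with a small sketch in which feature resolution runs before the key is computed. Everything here is hypothetical scaffolding: `handle_request`, the injected `resolve_features`, `make_key`, and `predict` callables, and the plain dict used as the cache are stand-ins, and the batch dispatch step is omitted for brevity.

```python
from typing import Any, Callable, Dict, MutableMapping


def handle_request(
    request: Dict[str, Any],
    resolve_features: Callable[[str], Any],
    make_key: Callable[[Dict[str, Any]], str],
    cache: MutableMapping[str, Any],
    predict: Callable[[Dict[str, Any]], Any],
) -> Any:
    """Resolve features, then look up, then fall through to the backend."""
    resolved = dict(request)  # copy so the caller's request is left untouched

    # 1. Feature resolution: swap the reference for the actual values.
    if "feature_ref" in resolved:
        resolved["features"] = resolve_features(resolved.pop("feature_ref"))

    # 2. Cache lookup: the key is built from resolved values, not the reference.
    key = make_key(resolved)
    if key in cache:
        return cache[key]

    # 3. Backend call (batching omitted), then 4. cache store.
    response = predict(resolved)
    cache[key] = response
    return response
```

Because the key is computed in step 2, a request that carries the same `feature_ref` but resolves to new values after the feature store is updated misses the cache and creates a second entry.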
Process-wide disable
Set SHEAF_CACHE_DISABLED=1 to skip every cache regardless of spec
config. Useful in integration tests where you want every request to
exercise the backend.
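In a test suite, one way to scope the variable to a single test is a small context manager built on the standard library. The helper name `cache_disabled` is hypothetical. This sketch also assumes the variable is read in the same process after it is set; if it is read once at startup, set it before the deployment is created.

```python
import os
from contextlib import contextmanager
from typing import Iterator
from unittest import mock


@contextmanager
def cache_disabled() -> Iterator[None]:
    """Set SHEAF_CACHE_DISABLED=1 inside the with-block, then restore the environment."""
    # patch.dict puts os.environ back to its previous state on exit,
    # including removing the variable if it was not set before.
    with mock.patch.dict(os.environ, {"SHEAF_CACHE_DISABLED": "1"}):
        yield
```

Wrapping the body of an integration test in `with cache_disabled():` keeps the setting from leaking into other tests that rely on caching.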
Reference
Full schema in the Caching API reference.