Writing a simple query system in Rust

Published on 27 July 2020 About 22 min reading time

Topics: #rust #walk-through

What I mean by query system here is a pattern for requesting and computing data in a program in on-demand way. The data don’t need to be computed ahead of time at the start of the program, but only when they are actually needed. In this post, I will walk you through a solution to this approach I came up with in Rust, which happened to be a good fit in one of my projects. If not yet, you will eventually be able to implement your own, tailored to your needs.

An illustration of a query system. Photo by Pixabay on Pexels.com

In the explanations, I assume that you know the basics of Rust, especially those which are specific to this language. We are going to encounter issues regarding some of those. To solve them, a couple of interesting types from the standard library will be used and hence the post may be also seen as a discovery journey. On the other hand, if you are a professional Rustacean, you might find this article a bit boring and perhaps even know a more efficient and ergonomic solution. But anyway, let’s dive in.

Table of contents

TL;DR #

This is quite a long post. Thus I am going to summarize it here, and if you remain interested I will be more than happy if you continue reading.

The queries will be actually types implementing a trait and stored in a cache using TypeId type and Any trait for dynamic typing. To enable memoization, interior mutability (RefCell) will be utilized, as well as shared ownership (Rc) to satisfy lifetimes.

To simplify you random access in this article, here is an organization: The first section is just an introduction to the pattern with examples of use cases. The second one is for introducing a toy example. The third section further talks about the desired API of our system. Next three sections discuss individual topics: dynamic typing, interior mutability and shared ownership. Before conclusion, some interesting facts about a secret-for-now poem are revealed using implemented queries.

Introduction and context #

We can split the phrase “query system” into two parts – “query” and “system”. The system (or database) encapsulates some underlying data in a form specific to given application and it provides a way how to make queries to these data to obtain some high-level information. Conceptually, a query can be imagined as a (pure) function taking the data and computing a value. We will introduce a concrete toy example very soon, but let’s first look at the pattern in a code:

// Our underlying data.
let data = String::from("Hello world!");

// The encapsulating system.
let system = System::new(data);

// Making a defined query.
let n_alpha = system.query_num_of_alpha_chars();

To make queries more useful, they could accept arguments to parametrize them. For example, there might be another query computing the number of occurrences of a substring: &str argument. We will leave this extension for another article – this decision will be more clear as we go.

But what will not be avoided in our implementation is memoization. The queries could be (and in practice indeed are) computationally expensive and thus we want to store the results internally when they are computed for the first time. To simplify things substantially, the data will be immutable to us. Otherwise, we would need to handle modifications and introduce cache invalidation and that sounds scary.

There are only two hard things in Computer Science: cache invalidation and naming things.

— Phil Karlton

I would like to pause now for a while and reveal some context for this pattern. My discovery of this approach happened when I read that Rust compiler has been transitioning to using it internally. And interestingly, it is not the only one.

Why? Because it enables convenient IDE integration and incremental compilation. When you slightly change a single file foo.rs and save it, then the compiler sets the content to its database and calls “compile query”. Under the hood a long chain of queries is spawned but only few of them must be recomputed – those which are affected by the change in that file. The rest can be provided from the cache. Structuring the codebase in such queries makes things much more simple.

If you want to get your hands dirty and try this pattern while using a library not really developed by beginners, then take a look at salsa. I also do recommend the talk Responsive compilers by Niko Matsakis, where he explains (among others) this approach and how it is used within Rust compiler.

Setting up our scenario #

For not being lost in a sea of abstractness, we need a simple yet interesting example. So we will do some analysis on a text. And for the text, let’s grab an obvious choice – The Raven by Edgar Allan Poe.

The first stanza of The Raven. A screenshot from Poetry Foundation — The first stanza of *The Raven*. A screenshot from Poetry Foundation

What could interest us on this classic piece of poetry is for example the count of occurrences of the word “Raven”, the number of verses and stanzas, or the average number of verses per stanza. Notice that the queries can be even nested, because while calling our advanced math skills from our studies we get that the average number of verse per stanza is computed as the number of verses divided by the number of stanzas. We certainly want to support this requirement.

Let’s get the content of the poem

curl -s http://www.gutenberg.org/cache/epub/1065/pg1065.txt | tail -n +43 | head -n 126 > raven.txt

and load it with a few lines of Rust.

use std::env;
use std::fs::File;
use std::io::prelude::*;

fn main() {
    // The file can be specified as the first argument of our program
    let filename = match env::args().nth(1) {
        Some(filename) => filename,
        None => panic!("provide a filename as the first argument"),
    };

    let mut file = File::open(filename).expect("file does not exist");
    let mut text = String::new();

    file.read_to_string(&mut text)
        .expect("an error during reading the file");

    print!("{}", text);
}

Now, run the program using cargo run -- raven.txt and enjoy the art for a while.

Desired goal #

We will make our structure a bit more flexible than the code introduced in the beginning of this post. Instead of defining a method for every and each query, our system will allow any query represented by a type that implements a Query trait of ours.

The only task required by this trait is to initialize the query itself from the underlying data. Thanks to this, the set of queries will not be limited by our implementation and can be possibly extended by external code (think of plugins for example). A modified snippet from the beginning of this post would then look like this:

// Our underlying data.
let data = String::from("Hello world!");

// The encapsulating system.
let system = System::new(data);

// Making a defined query.
let n_alpha = system.query::<NumberOfAlphaChars>();

However, this design choice has several implications as we will see. One of them can be revealed right now: it makes parameterization of the queries somewhat difficult. And that’s why we are going to ignore it and leave the solution for the future.

Even though The Raven is not an enormously long text, it’s still not a representative of short instapoetry pieces. To satisfy one of the advantages of query system pattern, the queries will be cached so they are computed at most once. There is one caveat though. We would like to hide the memoization as an implementation detail and so making the query should take immutable reference.

First attempt #

A straightforward idea for memoization is to maintain a cache in the form of HashMap. Since we want to store queries of any type (as long as they implement a trait we require), we need to exploit reflection capabilities of Rust. In particular, instead of an actual type, a reference to Any trait will be stored. As such, we will be later able to cast the reference to the original type. The key for HashMap is going to be TypeId structure which represents a globally unique identifier for any type.

Our first naive attempt could look like this:

use std::any::{Any, TypeId};
use std::collections::HashMap;

// Query needs to be 'static so we can cast it to `Any`.
pub trait Query: 'static {
    // The arguments are &String for computing the query and &System for calling
    // other queries.
    fn init_query(text: &String, system: &System) -> Self;
}

pub struct System {
    // Our underlying data.
    text: String,
    // Our cache where the key is the type of the query and the value is ts
    // value. We use `Any` type for "dynamic" typing here.
    queries: HashMap<TypeId, Box<dyn Any>>,
}

impl System {
    pub fn new(text: String) -> Self {
        System {
            text,
            queries: HashMap::new(),
        }
    }

    // Usage: `let result = system.query::<MyQuery>()`
    pub fn query<Q: Query>(&self) -> &Q {
        let type_id = TypeId::of::<Q>();

        // We first check if the query has been already computed in the past.
        if !self.queries.contains_key(&type_id) {
            // The query is not in the cache so we need to compute it.
            let query = Q::init_query(&self.text, &self);
            // And store it into our cache.
            self.queries.insert(type_id, Box::new(query));
        }

        // The previous conditioned block ensures that the cache already
        // contains the query so we can obtain it (and safely call unwrap).
        let any = self.queries.get(&type_id).unwrap();
        // Cast the `Any` type to the type of this query. Since we maintain
        // our cache by the type id, we are sure that this cast will
        // succeed.
        any.downcast_ref::<Q>().unwrap()
    }
}

But this code gives us one of the most iconic errors for Rust. It is depicted in the following figure.

The compiler error. — *The* compiler error.

We borrow self by non-exclusive & reference, but try to modify the hash map using insert method. Well, this is not possible in Rust (for good reasons). The justification for our immutable borrow choice is our requirement that the caching must be invisible to the user. That is, from their point of view it is just taking readonly text and computing the query.

Note that to be more idiomatic in our naivety, we could use Entry API. But we would still remain naive as it uses mutable reference as well.

I would like to emphasize one very important implication of using Any. And it is explicitly visible in the definition of our Query trait – everything that implements it is constrained by 'static lifetime. This effectively means that the queries can not hold a non-static reference to some data. For those who are not afraid of reading about compiler internals and terms such as “soundness” and “unsound”, I refer you to this RFC and related discussion for a detailed reasoning.

To give you a vague explanation that is even related to our use case: lifetimes are actually erased during compile-time since they are used to prove data availability only at this stage, and it would be from difficult to impossible (and in most cases unnecessary) to have this information at runtime. Therefore the TypeId structure that we use doesn’t have access to lifetime information.

Now, imagine that we store a query which holds a reference bounded by lifetime 'a. We store it into our hash map, but the lifetime is not encoded anywhere in the key. What could then guarantee whether the data behind the lifetime 'a is still valid when we retrieve the value from the cache some time later? Nothing. And that’s why it’s not possible to do it.

Interior mutability #

What we need is so-called interior mutability. It will allow us to insert values into the hash map while still borrowing &self immutably. Recall that one of the most important rules in Rust enforced by the compiler is that any value can have only one of the following (but not both):

Several immutable & references.
One mutable &mut reference.

Fortunately, the standard library provides us with some ways how to silence this compile error by postponing the check of this rule to runtime. This should look to you as a stinky workaround to evade Rust core (and greatly useful) concepts, but luckily we can find our excuse right in the documentation itself. In the section When to choose interior mutability we can read that it’s fine to use it for

Implementation details of logically-immutable methods.

And that is exactly our situation.

As HashMap does not implement Copy trait, we are predestined to use RefCell type. It offers us borrow method as well as precious borrow_mut method, both taking (from static point of view) immutable reference. And this is what we need.

At this point, it’s time for another side note. Choosing to use RefCell (or any other std::cell type) imposes a limitation to our code. The reference of RefCell cannot be sent across threads because the validation of the rules in these types is not thread-safe. If you are interested in the details, see this chapter of the Rust Book. Let’s accept this and leave thread synchronization for the future – the circumstances are complicated enough already.

Let’s try our second chance, this time with interior mutability:

+use std::cell::RefCell;

 pub struct System {
     text: String,
-    queries: HashMap<TypeId, Box<dyn Any>>,
+    queries: RefCell<HashMap<TypeId, Box<dyn Any>>>,
 }

 impl System {
     pub fn new(text: String) -> Self {
         System {
             text,
-            queries: HashMap::new(),
+            queries: RefCell::new(HashMap::new()),
         }
     }

     pub fn query<Q: Query>(&self) -> &Q {
         let type_id = TypeId::of::<Q>();

-        if !self.queries.contains_key(&type_id) {
+        if !self.queries.borrow().contains_key(&type_id) {
             let query= Q::init_query(&self.text, &self);
-            self.queries.insert(type_id, Box::new(query));
+            self.queries.borrow_mut().insert(type_id, Box::new(query));
         }

-        let any = self.queries.get(&type_id).unwrap();
+        let any = self.queries.borrow().get(&type_id).unwrap();
         any.downcast_ref::<Q>().unwrap()
     }
 }

How these borrowing methods work? They return a guard value and, for the lifetime of this guard, RefCell is keeping a note in its diary that there exists a reference of the corresponding kind. When other borrows happen (both “immutable” or “mutable”), it checks for a violation of the rules. As soon as the guard, which provides a way how to access the borrowed value, is not needed, it is released and RefCell clears the corresponding note in its diary.

Good, we resolved our issue with “taking immutable reference mutably”. Since we ignore multi-threading we can be quite confident in telling that our program will not break Rust aliasing rules in runtime. When we are borrow_muting queries, there is no active borrow reference. So we are fine.

Unfortunately, no. A new issue has been born: the line with any.downcast_ref::<Q>().unwrap() shouts at us that it returns a value referencing data owned by the current function. Another Rust error that everyone comes across quite frequently.

The parents of this issue are these guards discussed few moments ago. The guards are values owned by the function and are intended to be dropped as soon as possible to release the borrow. When we downcast_ref the value held by one such borrow, we take a reference of it (with lifetime let’s say 'guard) and then we return the casted reference having the 'guard lifetime as well. However, as the guard is owned by the function that comes to its end, we cannot return this reference, because the guard will be released at the latest at the end of the function. Ugh…

An important observation here: we can’t use aforementioned Entry API anymore. Consider the following snippet:

let any = &*self.queries.borrow_mut().entry(type_id).or_insert_with(|| {
    let query = Q::init_query(&self.text, &self);
    Box::new(query)
});

The lifetime of the guard produced by borrow_mut spans over all four lines here. In particular, it’s still active when the closure is executed. If in the query initialization a dependency query is requested, it would borrow_mut the cache for the second time. This would result in a panic during runtime. Let’s now spend a minute for acknowledging the dangers of using RefCell and praising safe and early compile errors.

It is also important to really split the query initialization and storing the query value into the cache via borrow_mut. If we put the initialization call as the argument of insert, calling nested queries would violate aliasing rules, too.

At this point, it seems that there is no way for returning a &Q reference while using RefCell semantics. And honestly, I don’t know any indeed. Hence we need an alternative.

Shared ownership #

A straightforward way is imposing the Clone requirement for Query trait and cloning the query value every time. This might not be that bad if our queries are not large in regards of space and only need to be cached due to their computational cost. Fortunately, there is a bit smarter way of achieving (sort of) what we want. It’s Rc type.

Reference counting (hence the name Rc) provides us with shared ownership of a value allocated on the heap. When clone is invoked, it does not clone the value itself, but produces a new pointer to the same allocation instead and increments an internal counter. This operation is therefore very cheap both time- and memory-wise.

We could use the reference-counted pointer directly, but it is a good idea to introduce a new type – let’s call it QueryRef. It will just wrap Rc<Q> and implement Deref<Target = Q> trait. The purpose of these ceremonies is two-fold:

It forbids to use functions implemented for Rc that do not make sense in our case. The best example is get_mut function. Calling this on a value returned by our query method would always result in None as we are holding another reference in our cache – that is, there would be two values pointing to the same memory. No, Rc cannot be used to fool Rust’s aliasing rules either. Fortunately, we don’t need it.
It hides the use of Rc as an implementation detail. Who knows, we may come up with a better implementation in the future. For example, if we happen to resolve thread unsafety of RefCell, we need to replace Rc with its atomic variant Arc.

Ok, third attempt:

+use std::rc::Rc;

+// We may need to derive some traits for convenience, but be careful not to
+// impose too restrictive requirements for underlying implementation.
+pub struct QueryRef<Q>(Rc<Q>);
+
+impl<Q> std::ops::Deref for QueryRef<Q> {
+    type Target = Q;
+
+    fn deref(&self) -> &Self::Target {
+        // This works because `Rc` implements `Deref` too
+        &*self.0
+    }
+}

 pub struct System {
     text: String,
-    queries: RefCell<HashMap<TypeId, Box<dyn Any>>>,
+    queries: RefCell<HashMap<TypeId, Rc<dyn Any>>>,
 }

-    pub fn query<Q: Query>(&self) -> &Q {
+    pub fn query<Q: Query>(&self) -> QueryRef<Q> {
         let type_id = TypeId::of::<Q>();

         if !self.queries.borrow().contains_key(&type_id) {
             let query = Q::init_query(&self.text, &self);
-            self.queries.borrow_mut().insert(type_id, Box::new(query));
+            self.queries.borrow_mut().insert(type_id, Rc::new(query));
         }

-        let any = self.queries.borrow().get(&type_id).unwrap();
+        let any = self.queries.borrow().get(&type_id).unwrap().clone();
-        any.downcast_ref::<Q>().unwrap()
+        QueryRef(any.downcast::<Q>().unwrap())
     }

All these changes are rather mechanic work. But there are two exceptions residing at the end of the query method. After obtaining the reference from the cache, we need to clone it to release the guard (but as we already know, in case of Rc it’s very cheap). On the next line, instead of casting Any into a reference to Q, we need to cast to Q itself. And that is why downcast_ref becomes downcast. By the way, the former is defined on Any itself, while the latter is a method of Rc.

At this moment, we finally compile without errors!

A moderately populated city in France celebrating that we compiled our program without errors.
Photo by Nicolas Tissot on Unsplash. — A moderately populated city in France celebrating that we compiled our program without errors. Photo by Nicolas Tissot on Unsplash.

Short recap #

So what we have so far? We have a System structure that provides us with query method to be used for querying the underlying string. It takes immutable reference to the system as querying is immutable computation in our eyes.

However, the values are actually cached internally using a HashMap with TypeId → Any structure. We use these two to be able to store different types in the cache. In order to insert new values into the hash map – despite the fact that we have access only to immutable reference of System – we employ interior mutability technique provided by RefCell.

The return value of a query Q is wrapped to a QueryRef<Q> which allows to dereference itself to &Q. We use this approach to hide implementation details. So far, the implementation detail is that we hold the value inside Rc smart pointer, because we are forced to return an owned value from type system point of view.

Writing queries #

Now is finally the time when we start to write our actual queries. Let’s use those mentioned when we introduced the The Raven example:

the count of occurrences of the word “Raven”,
the number of verses,
the number of stanzas,
the average number of verses per stanza.

We start with the first one:

use crate::system::{Query, System};

pub struct RavenCount {
    count: usize,
}

impl RavenCount {
    pub fn get(&self) -> usize {
        self.count
    }
}

impl Query for RavenCount {
    fn init_query(text: &String, _system: &System) -> Self {
        let count = text
            // Iterate over all characters of the text.
            .char_indices()
            .filter(|(idx, _)| {
                // Check if all characters of the word "Raven" match their
                // counterparts in substring of the text beginning at `idx`.
                text[*idx..]
                    .chars()
                    .zip("Raven".chars())
                    .all(|(lhs, rhs)| lhs == rhs)
            })
            .count();

        RavenCount { count }
    }
}

To simulate computational expensiveness of the query, we use brute force algorithm running in $mathcal{O}(n m)$ time. If you are interested in speeding it up, you can consult this reference of well-known string matching algorithms.

I try to have the implementation as idiomatic as possible and that’s why various iterator-based techniques are used instead of “C-idiomatic” nested for loops. We don’t use the &System argument in this case so it gets prefixed with the underscore to silence “unused variable” warning.

Now the query can be called as simply as that:

use crate::queries::RavenCount;
use crate::system::System;

fn main() {
    // ...

    let system = System::new(text);
    let raven_count = system.query::<RavenCount>();
    println!("raven count: {}", raven_count.get());
}

Are you excited to see the result?

raven count: 10

Let’s look at the remaining queries to see if calling dependencies work. First we implement the two counts, one for verses and the second for stanzas:

pub struct VersesCount {
    count: usize,
}

impl VersesCount {
    pub fn get(&self) -> usize {
        self.count
    }
}

impl Query for VersesCount {
    fn init_query(text: &String, _system: &System) -> Self {
        let count = text
            .split('\n')
            .map(|line| line.trim())
            .filter(|line| !line.is_empty())
            .count();

        VersesCount { count }
    }
}

pub struct StanzasCount {
    count: usize,
}

impl StanzasCount {
    pub fn get(&self) -> usize {
        self.count
    }
}

impl Query for StanzasCount {
    fn init_query(text: &String, _system: &System) -> Self {
        let count = 1 + text
            .trim()
            .split('\n')
            .map(|line| line.trim())
            .filter(|line| line.is_empty())
            .count();

        StanzasCount { count }
    }
}

For the purposes of this article, it’s not important how these are actually implemented. I hope you excuse me if I omit the description. Because what we should be more curious about is the next query:

pub struct AvgVersesPerStanza {
    avg: f64,
}

impl AvgVersesPerStanza {
    pub fn get(&self) -> f64 {
        self.avg
    }
}

impl Query for AvgVersesPerStanza {
    fn init_query(_text: &String, system: &System) -> Self {
        let avg = (system.query::<VersesCount>().get() as f64)
            / (system.query::<StanzasCount>().get() as f64);

        AvgVersesPerStanza { avg }
    }
}

Here we even don’t touch the original text and just use &System to compute two queries which are used to compute the result of the query itself. This dependency calling is a test for our RefCell implementation and whether we did everything right. By the way, this “framework” does not prevent or check circular dependencies in any way. They just lead to a stack overflow crash.

When we use the new queries in the identical way as we did in RavenCount case, we get the following:

verses count: 108
stanzas count: 18
avg verses per stanza: 6

Well, the last number is a bit boring. The Raven is actually a carefully structured poem with regular stanzas, each of them having exactly six verses. In the end, that’s how poetry was written at those times. But a nice thing is that our statistics do not contradict the numbers presented on Wikipedia.

Conclusion #

At this point, we have something that works. On the way, we introduced several simplifications – in particular, we are limited to a single thread and there is no way how queries can be parametrized. We may look at these features in the future. Also, in practice there might be a need to support fallible queries that return Result.

My hope is that this solution can be considered nice and simple in the way that it only uses types from the standard library and no dirty tricks. However, the concepts that we used are not trivial to understand! It actually took me quite a lot of time to walk the path I have been telling you in this post.

I would not be that surprised if you can think of some improvements or even a completely different approach. In that case, it would be really appreciated if you get in touch with me somehow, as this is not just for blogging purposes, but something that I actually need.

I guess that this pattern is quite niche in the set of possible use cases. But I personally used it in Aardwolf to forbid direct access to raw data in plugins and encourage to compute (possibly custom) high-level structures from it instead. If we add compilers mentioned in the beginning of this post, we already have two concrete use cases, and that’s more than one. I believe it is worth to keep this approach in mind and an application may occur later.

Or you already have one in your mind! Or you will come up with a project where it can be utilized. The example used in this post was interesting, but not that much useful. But I am sure you can think of a scenario where this is a perfect fit.

Update: See u/frondeus's improvements, polishes and additions in this playground! An especially amazing idea is adding type Output associated to the Query trait such that the output of a query can be some arbitrary type, even one that does not belong to you (like usize, for which we needed to write get function on each query type). But there is more…