Bitfield Consulting

CUE is an exciting configuration language

There’s a future beyond YAML engineering

You may already be familiar with JSON or YAML data, possibly to the point of exasperation. It might represent Kubernetes services, API schemas, or cloud infrastructure of some kind. Configuration data specifies how things should behave or be arranged, and there's plenty of it around these days. So what's my problem?

The problem

JSON is easy for machines to generate or parse, but it seems verbose and pernickety to us. YAML is more flexible, but this can make it very difficult to parse mechanically. It's probably fair to say that JSON is the de facto standard wire format for exchanging data between machines, while YAML does better at human-to-machine communication. They're both okay, but not exactly a joy to write or maintain. Shouldn't we be able to do better?

Can we start by fixing JSON?

Since most programs or APIs already read and write JSON, and whatever data format we choose as a source of truth must be equivalent to some JSON, let's start there. Consider the following JSON data:

{
    "john": {
        "age": 29,
        "hobbies": [
            "physics",
            "reading"
        ]
    }
}

There are several annoying things about this. We have to supply useless outer curly braces, because {REASONS}. We're not allowed to add any comments to explain what things are. We have to quote the field names, and while we must put a comma after every element of the hobbies array except the last, we equally mustn't put one after the last element. This is guaranteed to irritate the next person who has to add or remove an item from the array.

Note that YAML doesn't get a pass here, either: while it doesn't require quite as much curly boilerplate as JSON, it achieves this by making whitespace significant. How many more thousands of engineer-hours must we waste painfully tweaking spaces and tabs until it's just right? This madness needs to end.

Let's just fix all these nits right now, seeing as we have our magic wand handy. We can write something like this:

// John can be grumpy before he's had his coffee.
john: {
    age: 29
    hobbies: [
        "physics",
        "reading",
    ]
}

This is no less readable than the JSON, and considerably more writable. It's trivial to transform this into the equivalent JSON, too, if that's what the machines want.

That's a good start; can we do more?

Type checking

Let's think about types, for example. It's clear that age here is a number (indeed, it's just a number). Similarly, hobbies is an array of strings, and with a little more thought we can see that whatever john is, he's an instance of some kind of struct (that is, a structured data record) with a bunch of fields.

These types are implied by the data, but there's nothing in JSON that lets us actually specify them. Clearly, there are many ways to write syntactically correct JSON that is nonetheless semantically invalid: giving a string for age, for example.

The ideal thing would be to write some kind of type definition for whatever kind of thing john is, including all its fields (a schema), against which we could validate any given data. That is to say, not just checking its syntax, but ensuring that each value is the type it's supposed to be. This would go a long way towards reducing errors and problems in our configuration.

Types are values

Of course, since a schema is just another kind of data, we should be able to express the schema in the same form as our existing data. Can we?

#Person: {
  age: number
  hobbies?: [...string]
}

We're saying that there's a kind of struct called #Person (let's adopt the # naming convention to make it clear that this is a definition, not a piece of actual data that should result in some JSON output). A #Person, then, has an age which is a number, and this field is mandatory.

Not everyone has hobbies, though, so we've made that field optional by using the trailing ? character (hobbies?). If hobbies is present, it must be a list of strings.

We've already given the data for john previously, so there's only one more step needed before we can automatically validate this data: to say that john is a #Person. All we've done so far is to give two structs, without specifying any relationship between them. We can fix that now:

john: #Person

Combining values

Rather than write two separate values for john, we can just combine them. The & operator would make sense for this:

john: #Person & {
    age: 29
    hobbies: [
        "physics",
        "reading",
    ]
}

In other words, john is a #Person, and he has exactly these properties.

It's trivial to mechanically check that these two statements about john are valid, that is to say, consistent. (I mean "trivial" in the engineer's sense of "astoundingly complicated in practice, but in a way we can easily cope with".) If John's age were a string, or if his hobbies were the value 41, or if he had some unexpected field such as phone, then this data would be invalid.

Similarly, we can check that the data is complete: if we had omitted John's age, for example, the two structs john and #Person would no longer match, because #Person.age is a required field. That's a different way for data to be invalid, and we can catch that too, thanks to our definition.
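
To make those failure modes concrete, here is a sketch using the same definition. The names badAge, badHobbies, badField, and noAge are invented for illustration, as is the phone value. The first three contradict #Person outright. The last is a different case: nothing in it conflicts with the definition, but age never receives a concrete value, so we'd expect the problem to surface at the point where we ask for concrete output such as JSON.

```cue
#Person: {
    age:      number
    hobbies?: [...string]
}

// Inconsistent: each of these conflicts with #Person.
badAge:     #Person & {age: "29"}               // a string where a number is required
badHobbies: #Person & {age: 29, hobbies: 41}    // 41 is not a list of strings
badField:   #Person & {age: 29, phone: "555"}   // phone is not a field of #Person

// Incomplete: consistent with #Person, but age is still just "number",
// so there is no concrete value to output for it.
noAge: #Person & {hobbies: ["chess"]}
```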

(In strict fairness to JSON, it is possible to do something like this using JSON Schema, but that's nowhere near as elegant as our "types are values" idea, and it's not clear that the best solution to the JSON problem is more JSON.)

Constraints

We're making good progress, but let's not stop there. It's possible to have consistent and complete data that is nonetheless still incorrect in the context where we need to use it, because it doesn't meet some constraint.

For example, imagine a situation where people have to be 18 or over to participate in something. We'll have to be able to constrain the age field along the following lines:

#Adult: #Person & {
  age: >=18
}

That is, an #Adult is a #Person, but not just that: they must also have an age field matching the constraint >=18.

This is true for john, so we can confidently declare:

john: #Adult

// valid

But if we try it with someone a little younger, we'd expect this not to validate:

anusha: #Adult & {
    age: 17
}

// anusha.age: invalid value 17 (out of bound >=18)

We could write more complex constraint expressions if necessary. For example:

#WorkingAgePerson: #Person & {
    age: >=16 & <65
}

While we're designing our dream language, why don't we also say that we can apply constraints to strings, in the form of regular expressions:

#Phone: string & =~ "[0-9]+"

This would require that a valid #Phone be a string containing at least one digit. We can read the =~ operator as "matches regular expression", and let's also allow the inverse operator, !~ ("doesn't match").
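
One thing worth noticing is that the pattern above is unanchored, so a string like "call 555" would also match, because it contains a digit somewhere. If we want digits and nothing else, we can anchor the pattern. The following is a small sketch; the names #DigitsOnly, #NoSpaces, phone, and username are assumptions made up for this example.

```cue
// Anchored with ^ and $: the whole string must consist of digits.
#DigitsOnly: string & =~"^[0-9]+$"

// !~ rejects any string that matches: here, anything containing a space.
#NoSpaces: string & !~" "

phone:    #DigitsOnly & "5550123"
username: #NoSpaces & "leroy"

// This one would be rejected, because of the hyphen:
// badPhone: #DigitsOnly & "555-0123"
```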

Enums and sum types

It would be nice to be able to constrain things to a set of allowed values (such types are sometimes called enums, or enumerated types). For example, suppose we have a service where only a specific set of users are allowed to log in:

#Allowed: "mary" | "leroy" | "abby"

For a user to be allowed to log in, it must be exactly "mary" OR "leroy" OR "abby", and nothing else.

Since types, too, are values, we can also constrain something to a set of types (programmers would call this kind of thing a sum type):

#Port: string | int

This says that a #Port can be either a string or an integer (but nothing else).
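
Here's a brief sketch of how values might be checked against both of these definitions. The field names user, httpPort, and namedPort, and the rejected examples in the comments, are hypothetical.

```cue
#Allowed: "mary" | "leroy" | "abby"
#Port:    string | int

user:      #Allowed & "leroy" // one of the permitted names
httpPort:  #Port & 8080       // an int
namedPort: #Port & "https"    // a string

// Each of these would be rejected:
// intruder: #Allowed & "mallory"   (not one of the three names)
// badPort:  #Port & 80.5           (a float is neither a string nor an int)
```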

Defaults and references

We'd like to be able to specify a default if no specific value is provided for a field:

port: int | *8080

Here, the * indicates that port will default to 8080 if not otherwise specified.

By itself, this single feature can eliminate a lot of boilerplate. Because similar things tend to share similar configuration, we'll never need to specify a value that is the same as the default.
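
As a sketch of how that plays out, suppose we put the default inside a definition and reuse it. The names #Server, web, and admin are invented for this example.

```cue
#Server: {
    port: int | *8080
}

web:   #Server                  // no port given, so the default applies
admin: #Server & {port: 9090}   // an explicit value replaces the default

// web.port:   8080
// admin.port: 9090
```

Only the server that differs from the norm has to say so, which is where the saving in boilerplate comes from.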

We can also avoid repeating the same information, by simply referencing a field that's already been specified, using its name:

port:        6666
ingressPort: port

// ingressPort: 6666

It'll be convenient to interpolate references in strings, too:

port: 8000
url:  "https://localhost:\(port)"

// url: "https://localhost:8000"

Maps

Lookup tables, or maps, are extremely handy for reducing duplication. Actually, a map is just another kind of struct, so let's use the same syntax:

instanceType: {
    web: "small"
    app: "medium"
    db:  "large"
}

Since we already know how to refer to fields by name, looking up a key in a map should work roughly the same way:

server1: {
    role:     "app"
    instance: instanceType[role]
}

// server1.instance: "medium"

When our startup takes off, and we need to upgrade all our app servers to a "large" instance type, we don't have to go through the data changing every occurrence manually. We can change it once in the instanceType map, and every server will get the updated value automatically.

Generating config

If, as often happens, we have a large number of things that have configuration in common, it should be possible to generate them automatically, shouldn't it? Let's try.

Suppose we have three services, a, b, and c, and we want to generate a similar server configuration for each of them:

for s in ["a", "b", "c"] {
    "www_\(s)": {
        service: s
        role:    "web"
    }
}

This specifies a struct for each value in the list ["a", "b", "c"] (a comprehension over the list), and we would expect it to generate the following JSON:

{
    "www_a": {
        "service": "a",
        "role": "web"
    },
    "www_b": {
        "service": "b",
        "role": "web"
    },
    "www_c": {
        "service": "c",
        "role": "web"
    }
}

If we can generate data from lists, we might also want to be able to filter those lists by some expression:

nums: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
evens: [ for n in nums if mod(n, 2) == 0 {n} ]

// evens: [2, 4, 6, 8, 10]

The if clause here is known as a guard, because it guards against non-matching values being included in the result. Let's also grant ourselves a few standard functions such as mod (integer modulus), because they're bound to come in useful.
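
There's no obvious reason a guard shouldn't work when we're generating fields, too. A minimal sketch, assuming the same for and if syntax applies there, with the services and skip fields made up for illustration:

```cue
services: ["a", "b", "c"]
skip:     "b"

// Generate a server for every service except the one we want to skip.
for s in services if s != skip {
    "www_\(s)": {
        service: s
        role:    "web"
    }
}

// This produces www_a and www_c; www_b is filtered out by the guard.
```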

Packages

Speaking of which, a great language needs a great standard library. Let's include a helpful set of builtin packages that we can import for specific jobs, such as list operations:

import "list"

jumbled: [4, 10, 1, 3, 7, 9, 6, 2, 5, 8]
sorted: list.Sort(jumbled, list.Ascending)

// sorted: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Since there are builtin packages, we should be able to create our own user-defined packages, too, by adding a declaration like this to the top of our data file:

package person

We can now split data across multiple files, for ease of editing and collaboration. We can evaluate a whole set of files as a single configuration, provided that all the files are members of the same package.
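
For instance, we might keep the definition in one file and the data in another. This is only a sketch, and the file names schema.cue and john.cue are assumptions. First, the schema:

```cue
// schema.cue (hypothetical file name)
package person

#Person: {
    age:      number
    hobbies?: [...string]
}
```

Then the data, in a second file that declares the same package and can therefore refer to #Person directly:

```cue
// john.cue (hypothetical file name)
package person

john: #Person & {
    age: 29
}
```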

While thinking about packages, it occurs to us that our hypothetical language is also its own testing framework. In order to test any of our code, all we need to do is specify the expected result as a value, and the evaluator will tell us if that's consistent with the actual result:

sorted: list.Sort(jumbled, list.Ascending)
sorted: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

// valid
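
It's just as instructive to see a test fail. In this sketch the expected value is deliberately wrong (the shorter list and the bad expectation are invented for illustration), so the two statements about sorted can't both be true, and we'd expect the evaluator to report a conflict between 3 and 4 in the last element.

```cue
import "list"

jumbled: [3, 1, 2]
sorted:  list.Sort(jumbled, list.Ascending)

// A deliberately wrong expectation: the actual result is [1, 2, 3].
sorted: [1, 2, 4]
```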

What we left out

It seems like we gained a lot of features with very little extra syntax, which is elegant (exciting, even). So what don't we have? Well, we deliberately left out inheritance (that is to say, more than one layer of defaults). We know from bitter experience in the YAML mines that this can quickly lead to complex and hard-to-debug problems.

And as powerful as it is, this is still just a data language, not a full-fledged programming language. And that's by design; a programming language has more power (thus complexity) than we need to express data. Not using a programming language keeps us out of the worst kinds of trouble. We can't, for example, write loops that potentially never terminate, and that means, given valid data, the configuration is guaranteed to converge on a consistent set of values in a reasonable time.

Something else we can't do is exchange data with the outside world, such as user input, disk files, or network requests. The evaluated configuration is entirely predictable and stable. This hermetic quality of data makes it much easier to check its correctness, though we might want to grant ourselves an exemption when writing tools—for example, automating configuration workflows such as testing, provisioning, or deployment. Let's imagine a special opt-in "scripting mode" which relaxes these restrictions, and lets us talk to the outside world, in at least a limited way.

Living the dream

Sure, it's a nice dream, but could it really happen? Are we condemned to remain YAML engineers, or JSON wranglers, for life? Or is there a way to make this hypothetical language real?

Reader, in a twist I dare say you saw coming about two thousand words ago, it is already real. The language is called CUE (or "cuelang" for searchability) and you can use it today. All the examples in this article (including the original JSON) are valid CUE, and you can try them out in the CUE playground right in your browser.

Although it's still quite young, and developing fast, CUE already has excellent tooling, as you'd expect from a language strongly inspired by Go, and can import data from JSON, JSON Schema, YAML, Protobuf, Go packages, and OpenAPI schemas. It can export data in most of these formats natively, and you can generate whatever format you want by means of CUE's built-in text templating feature. Istio, for example, uses CUE to generate OpenAPI and CRDs for Kubernetes.

Where do I start?

So, assuming you're excited about CUE, how can you start using it to live your best life, and ease the burden of your fellow YAML engineers? It's probably unwise to propose replacing all your existing config data with CUE, at least for now, but we can be more subtle about it. We can do CUE by stealth, if you like.

You could start by validating your existing config. CUE can easily import all the YAML and JSON you have and tell you if there are any syntax errors in it. There probably are, so you're already winning. And with a little more effort, you can write some simple definitions to check if the data is also semantically correct.

You can use CUE to gradually introduce policies into your configuration. Also, CUE data could become your single source of truth, and you could simply generate all your configuration from it in whatever format is required. No need to modify your existing tooling: as long as it speaks JSON, YAML, or equivalent, we're good. Indeed, provided they get the data they want, there's no reason for anyone to even know it was produced by CUE.

No doubt you'll also be able to do a lot of boilerplate reduction, by using references, setting defaults, and generating config, as we've seen. And maybe there's also a bunch of schemas that you just won't have to write in the first place, as CUE can import them from Protobuf files or Go packages (Kubernetes, for example).
