
Taking a Crash Course in YAML
Taking a Crash Course in YAML 관련


Taking a Crash Course in YAML
In this section, you’re going to learn the basic facts about YAML, including its uses, syntax, and some of its unique and powerful features. If you’ve worked with YAML before, then you can skip ahead and continue reading from the next section, which covers using YAML in Python.
Historical Context
YAML, which rhymes with camel, is a recursive acronym that stands for YAML Ain’t Markup Language because it’s not a markup language! Interestingly enough, the original draft for the YAML specification defined the language as Yet Another Markup Language, but later the current backronym was adopted to more accurately describe the language’s purpose.
An actual markup language, such as Markdown or HTML, lets you annotate text with formatting or processing instructions intermixed with your content. Markup languages are, therefore, primarily concerned with text documents, whereas YAML is a data serialization format that integrates well with common data types native to many programming languages. There’s no inherent text in YAML, only data to represent.
YAML was originally meant to simplify Extensible Markup Language (XML), but in reality, it has a lot more in common with JavaScript Object Notation (JSON). In fact, it’s a superset of JSON.
Even though XML was initially designed to be a metalanguage for creating markup languages for documents, people quickly adopted it as the standard data serialization format. The HTML-like syntax of angle brackets made XML look familiar. Suddenly, everyone wanted to use XML as their configuration, persistence, or messaging format.
As the first kid on the block, XML dominated the scene for many years. It became a mature and trusted data interchange format and helped shape new concepts like building interactive web applications. After all, the letter X in AJAX, a technique for getting data from the server without reloading the page, stands for none other than XML.
Ironically, it was AJAX that ultimately led to XML’s decline in popularity. The verbose, complex, and redundant XML syntax wasted a lot of bandwidth when data was sent over the network. Parsing XML documents in JavaScript was slow and tedious because of XML’s fixed document object model (DOM), which wouldn’t match the application’s data model. The community finally acknowledged that they’d been using the wrong tool for the job.
That’s when JSON entered the picture. It was built from the ground up with data serialization in mind. Web browsers could parse it effortlessly because JSON is a subset of JavaScript, which they already supported. Not only was JSON’s minimalistic syntax appealing to developers, but it also made porting to other platforms easier than XML did. To this day, JSON remains the slimmest, fastest, and most versatile textual data interchange format on the Internet.
YAML came into existence the same year as JSON, and by pure coincidence, it was almost a complete superset of JSON on the syntactical and semantic levels. Starting from YAML 1.2, the format officially became a strict superset of JSON, meaning that every valid JSON document also happens to be a YAML document.
In practice, however, the two formats look different, as the YAML specification puts more emphasis on human readability by adding a lot more syntactic sugar and features on top of JSON. As a result, YAML is more applicable to configuration files edited by hand rather than as a transport layer.
Comparison With XML and JSON
If you’re familiar with XML or JSON, then you might be wondering what YAML brings to the table. All three are major data interchange formats, which share some overlapping features. For example, they’re all text based and more or less human readable. At the same time, they differ in many respects, which you’ll find out next.
Note
There are other, popular textual data formats like TOML, which the new build system in Python is based on. Currently, only external packaging and dependency management tools like Poetry can read TOML, but since Python 3.11 there’s a TOML parser in the standard library.
Common binary data serialization formats you’d find in the wild include Google’s Protocol Buffers and Apache’s Avro.
Now have a look at a sample document expressed in all three data formats but representing the same person. You can click to expand the collapsible sections and reveal data serialized in those formats:
<?xml version="1.0" encoding="UTF-8" ?>
<person firstName="John" lastName="Doe">
<dateOfBirth>1969-12-31</dateOfBirth>
<married>true</married>
<spouse>
<person firstName="Jane" lastName="Doe">
<dateOfBirth />
<!- This is a comment -->
</person>
</spouse>
</person>
{
"person": {
"dateOfBirth": "1969-12-31",
"firstName": "John",
"lastName": "Doe",
"married": true,
"spouse": {
"dateOfBirth": null,
"firstName": "Jane",
"lastName": "Doe"
}
}
}
---
person:
dateOfBirth: 1969-12-31
firstName: John
lastName: Doe
married: true
spouse:
dateOfBirth: null # This is a comment
firstName: Jane
lastName: Doe
At first glance, XML appears to have the most intimidating syntax, which adds a lot of noise. JSON greatly improves the situation in terms of simplicity, but it still buries information beneath mandatory delimiters. On the other hand, YAML uses Python-style block indentation to define the structure, making it look clean and straightforward. The downside is that you can’t collapse whitespace to reduce size when transferring messages over the wire.
Note
JSON is the only data format that doesn’t support comments. They were removed from the specification on purpose to simplify the parsers and prevent people from abusing them for custom processing instructions.
Here’s a somewhat subjective comparison of XML, JSON, and YAML to give you an idea of how they stack up against each other as today’s data interchange formats:
XML | JSON | YAML | |
---|---|---|---|
Adoption and support | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐ |
Readability | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
Read and Write Speed | ⭐⭐ | ⭐⭐⭐⭐ | ⭐ |
File Size | ⭐ | ⭐⭐⭐ | ⭐⭐ |
When you look at Google Trends to track interest in the three search phrases, then you’ll conclude that JSON is the current winner. However, XML isn’t far behind, with YAML attracting the least interest. Also, it looks like XML’s popularity has been on a steady decline ever since Google started collecting data.
Note
Of these three formats, XML and JSON are the only ones that Python supports out of the box, whereas if you wish to work with YAML, then you must find and install a corresponding third-party library. Python isn’t the only language that has better support for XML and JSON than YAML, though. You’re likely to find this tendency across programming languages.
YAML is arguably the easiest on the eyes, as readability has always been one of its core principles, but JSON isn’t bad either. Some might even find JSON less cluttered and noisy due to its minimalistic syntax and resemblance to Python lists and dictionaries. XML is the most verbose, as it requires wrapping every piece of information in a pair of opening and closing tags.
In Python, performance when working with these data formats will vary, and it’ll be highly sensitive to the implementation you choose. A pure Python implementation will always lose to a compiled C library. In addition to this, using different XML processing models—(DOM, SAX, or StAX)—will also impact performance.
Implementations aside, YAML’s versatile, liberal, and complex syntax makes it by far the slowest to parse and serialize. On the other side of the spectrum, you’ll find JSON, whose grammar can fit on a business card (jeresig
). In contrast, YAML’s own grammar documentation (yaml/yaml-grammar
) claims that creating a fully compliant parser has proven almost impossible.
Fun Fact
The official yaml.org website is written as a valid YAML document.
When it comes to document size, JSON is a clear winner again. While the extra quotation marks, commas, and curly brackets take up valuable space, you can remove all the whitespace between the individual elements. You can do the same with XML documents, but it won’t overcome the overhead of the opening and closing tags. YAML sits somewhere in the middle by having a comparatively medium size.
Historically, XML has had the best support across a wide range of technologies. JSON is an unbeatable all-around interchange format for transferring data on the Internet. So, who uses YAML and why?
Practical Uses of YAML
As noted earlier, YAML is mostly praised for its readability, which makes it perfect for storing all kinds of configuration data in a human-readable format. It became especially popular among DevOps engineers, who’ve built automation tools around it. A few examples of such tools include:
- Ansible: Uses YAML to describe the desired state of the remote infrastructure, manage the configuration, and orchestrate IT processes
- Docker Compose: Uses YAML to describe the microservices comprising your Dockerized application
- Kubernetes: Uses YAML to define various objects in a computer cluster to orchestrate and manage
Apart from that, some general-purpose tools, libraries, and services give you the option to configure them through YAML, which you may find more convenient than other data formats. For example, platforms like CircleCI and GitHub frequently turn to YAML to define continuous integration, deployment, and delivery (CI/CD) pipelines. The OpenAPI Specification allows for code stub generation based on a YAML description of RESTful APIs.
Note
The documentation for Python’s logging framework mentions YAML despite the fact that the language doesn’t support YAML natively.
Maybe you’ll decide to adopt YAML in your future project after completing this tutorial!
YAML Syntax
YAML draws heavy inspiration from other data formats and languages that you might have heard of before. Perhaps the most striking and familiar element of YAML’s syntax is its block indentation, which resembles Python code. The leading whitespace on each line defines the scope of a block, eliminating the need for any special characters or tags to denote where it begins or ends:
grandparent:
parent:
child:
name: Bobby
sibling:
name: Molly
This sample document defines a family tree with grandparent
as the root element, whose immediate child is the parent
element, which has two children with the name
attribute on the lowest level in the tree. You can think of each element as an opening statement followed by a colon (:
) in Python.
Note
The YAML specification forbids using tabs for indentation and considers their use a syntax error. This coincides with Python’s PEP 8 recommendation about preferring spaces over tabs.
At the same time, YAML lets you leverage an alternative inline-block syntax borrowed from JSON. You can rewrite the same document in the following way:
grandparent:
parent:
child: {name: Bobby}
sibling: {'name': "Molly"}
Notice how you can mix the indented and inline blocks in one document. Also, you’re free to enclose both the attributes and their values in single quotes ('
) or double quotes ("
) if you want to. Doing so enables one of two interpolation methods of the special character sequences, which are otherwise escaped with another backslash (``) for you. Below, you’ll find Python equivalents alongside the YAML:
YAML | Python | Description |
---|---|---|
Don''t\n | Don''t\\n | Unquoted strings are parsed literally so that escape sequences like \n become \\n . |
'Don''t\n' | Don't\\n | Single-quoted strings only interpolate the double apostrophe ('' ), but not the traditional escape sequences like \n . |
"Don''t\n" | Don''t\n | Double-quoted (" ) strings interpolate escape sequences like \n , \r , or \t , known from the C programming language, but not the double apostrophe ('' ). |
Don’t worry if that looks confusing. You’ll want to specify unquoted string literals in YAML for the most part, anyway. One notable exception would be declaring a string that the parser could misinterpret as the wrong data type. For example, True
written without any quotation marks might be treated as a Python Boolean.
The three fundamental data structures in YAML are essentially the same as in Perl, which was once a popular scripting language. They’re the following:
- Scalars: Simple values like numbers, strings, or Booleans
- Arrays: Sequences of scalars or other collections
- Hashes: Associative arrays, also known as maps, dictionaries, objects, or records comprised of key-value pairs
You can define a YAML scalar similarly to a corresponding Python literal. Here are a few examples:
Data Type | YAML |
---|---|
Null | null , ~ |
Boolean | true , false (Before YAML 1.2: yes , no , on , off ) |
Integer | 10 , 0b10 , 0x10 , 0o10 (Before YAML 1.2: 010 ) |
Floating-Point | 3.14 , 12.5e-9 , .inf , .nan |
String | Lorem ipsum |
Date and Time | 2022-01-16 , 23:59 , 2022-01-16 23:59:59 |
You can write reserved words in YAML in lowercase (null
), uppercase (NULL
), or title case (Null
) to parse them into the desired data type. Any other case variant of such words will be treated as plain text. The null
constant or its tilde (~
) alias lets you explicitly state the lack of a value, but you can also leave the value blank for the same effect.
Note
This implicit typing in YAML seems convenient, but it’s like playing with fire, and it can cause serious problems in edge cases. As a result, the YAML 1.2 specification dropped support for some built-in literals like yes
and no
.
Sequences in YAML are just like Python lists or JSON arrays. They use the standard square bracket syntax ([]
) in the inline-block mode or the leading dashes (-
) at the start of each line when they’re block indented:
fruits: [apple, banana, orange]
veggies:
- tomato
- cucumber
- onion
mushrooms:
- champignon
- truffle
You can keep your list items at the same indentation level as their property name or add more indentation if that improves readability for you.
Finally, YAML has hashes analogous to Python dicts or JavaScript objects. They’re made of keys, also known as attribute or property names, followed by a colon (:
) and a value. You’ve seen a few examples of YAML hashes in this section already, but here’s a more involved one:
person:
firstName: John
lastName: Doe
dateOfBirth: 1969-12-31
married: true
spouse:
firstName: Jane
lastName: Smith
children:
- firstName: Bobby
dateOfBirth: 1995-01-17
- firstName: Molly
dateOfBirth: 2001-05-14
You’ve just defined a person, John Doe, who’s married to Jane Smith and has two children, Bobby and Molly. Notice how the list of children contains anonymous objects, unlike, for example, the spouse defined under a property named "spouse"
. When anonymous or unnamed objects appear as list items, you can recognize them by their properties, which are aligned with an opening dash character (-
).
Note
Property names in YAML are pretty flexible, as they can contain whitespace characters and span multiple lines. What’s more, you’re not limited to using only strings. Unlike JSON, but similar to Python dictionaries, a YAML hash allows you to use almost any data type for a key!
Naturally, you’ve only scratched the surface here, as YAML has plenty of much more advanced features to offer. You’ll learn about some of them now.
Unique Features
In this section, you’ll check out some of YAML’s most unique features, including:
- Data types
- Tags
- Anchors and aliases
- Merged attributes
- Flow and block styles
- Multiple-document streams
While XML is all about text, and JSON inherits JavaScript’s few data types, YAML’s defining feature is tight integration with the type systems of modern programming languages. For example, you can use YAML to serialize and deserialize data types built into Python, such as date and time:
YAML | Python |
---|---|
2022-01-16 23:59:59 | datetime.datetime(2022, 1, 16, 23, 59, 59) |
2022-01-16 | datetime.date(2022, 1, 16) |
23:59:59 | 86399 |
59:59 | 3599 |
YAML understands various date and time formats, including the ISO 8601 standard, and can work with optional time zones. Timestamps such as 23:59:59 get deserialized into the number of seconds elapsed since midnight.
Note
The PyYAML library used in this tutorial is based on the older YAML 1.1 specification, which supports base-60 numbers through the colon (:
) notation. This means that a literal like 59:59
gets interpreted as an integer value of because adds up to it.
On the surface, this looks like a convenient way of converting timestamps to the number of elapsed seconds. Unfortunately, the rules governing the parsing process of such literals in YAML 1.1 can be confusing and potentially error-prone. When your literal starts with a leading zero (05:59
), a YAML 1.1 parser will interpret it as a string instead of a Python datetime
object!
Therefore, you might consider using a different library than PyYAML in your production code for peace of mind.
To resolve potential ambiguities, you can cast values to specific data types by using YAML tags, which start with the double exclamation point (!!
). There are a few language-independent tags, but different parsers might provide additional extensions only relevant to your programming language. For example, the library that you’ll be using later lets you convert values to native Python types and even serialize your custom classes:
text: !!str 2022-01-16
numbers: !!set
? 5
? 8
? 13
image: !!binary
R0lGODdhCAAIAPAAAAIGAfr4+SwAA
AAACAAIAAACDIyPeWCsClxDMsZ3CgA7
pair: !!python/tuple
- black
- white
center_at: !!python/complex 3.14+2.72j
person: !!python/object:package_name.module_name.ClassName
age: 42
first_name: John
last_name: Doe
The use of the !!str
tag next to a date object makes YAML treat it as a regular string. Question marks (?
) denote a mapping key in YAML. They’re usually unnecessary but can help you define a compound key from another collection or a key that contains reserved characters. In this case, you want to define blank keys to create a set data structure, which is equivalent to a mapping without the keys.
Moreover, you can use the !!binary
tag to embed Base64-encoded binary files such as images or other resources, which will become instances of bytes
in Python. The tags prefixed with !!python/
are provided by PyYAML.
The YAML document above would translate into the following Python dictionary:
{
"text": "2022-01-16",
"numbers": {8, 13, 5},
"image": b"GIF87a\x08\x00\x08\x00\xf0\x00…",
"pair": ("black", "white"),
"center_at": (3.14+2.72j),
"person": <package_name.module_name.ClassName object at 0x7f08bf528fd0>
}
Notice how, with the help of YAML tags, the parser turned property values into various Python data types, including a string, a set, a bytes object, a tuple, a complex number, and even a custom class instance.
Other powerful features of YAML are anchors and aliases, which let you define an element once and then refer to it many times within the same document. Potential use cases include:
- Reusing the shipping address for invoicing
- Rotating meals in a meal plan
- Referencing exercises in a training program
To declare an anchor, which you can think of as a named variable, you’d use the ampersand (&
) symbol, while to dereference that anchor later on, you’d use the asterisk (*
) symbol:
recursive: &cycle [*cycle]
exercises:
- muscles: &push-up
- pectoral
- triceps
- biceps
- muscles: &squat
- glutes
- quadriceps
- hamstrings
- muscles: &plank
- abs
- core
- shoulders
schedule:
monday:
- *push-up
- *squat
tuesday:
- *plank
wednesday:
- *push-up
- *plank
Here, you’ve created a workout plan from the exercises that you defined earlier, repeating them across various daily routines. Additionally, the recursive
property demonstrates an example of a circular reference. This property is a sequence whose only element is the sequence itself. In other words, recursive[0]
is the same as recursive
.
Note
Unlike plain XML and JSON, which can only represent tree-like hierarchies with a single root element, YAML also makes it possible to describe directed graph structures with recursive cycles. Cross-referencing in XML and JSON can be possible, though, with the help of custom extensions or dialects.
You can also merge (<<
) or override attributes defined elsewhere by combining two or more objects:
shape: &shape
color: blue
square: &square
a: 5
rectangle:
<< : *shape
<< : *square
b: 3
color: green
The rectangle
object inherits properties of shape
and square
while adding a new attribute, b
, and changing the value of color
.
Scalars in YAML support either a flow style or a block style, which give you different levels of control over the newline handling in multiline strings. Flow scalars can start on the same line as their property name and may span multiple lines:
text: Lorem ipsum
dolor sit amet
Lorem ipsum
dolor sit amet
In such a case, each line’s leading and trailing whitespace will always be folded into a single space, turning paragraphs into lines. This works a bit like HTML or Markdown, resulting in the following piece of text:
Lorem ipsum dolor sit amet
Lorem ipsum dolor sit amet
And, in case you were wondering, Lorem ipsum is a common placeholder text used in writing and web design to fill up the available space. It doesn’t carry any meaning, as it’s deliberately nonsensical and written in improper Latin to let you focus on the form rather than the content.
In contrast to flow scalars, block scalars allow for changing how to deal with the newlines, trailing newlines, or indentation. For example, the pipe (|
) indicator placed right after the property name preserves the newlines literally, which can be handy for embedding shell scripts in your YAML file:
script: |
#!/usr/bin/env python
def main():
print("Hello world")
if __name__ == "__main__":
main()
The YAML document above defines a property named script
, which holds a short Python script consisting of a few lines of code. Without the pipe indicator, a YAML parser would’ve treated the following lines as nested elements rather than as a whole. Ansible is a notable example that takes advantage of this feature of YAML.
If you’d like to only fold lines with indentation determined by the first line in a paragraph, then use the greater than sign (>
) indicator:
text: >
Lorem
ipsum
dolor
sit
amet
Lorem ipsum
dolor sit amet
This will produce the following output:
Lorem
ipsum
dolor sit amet
Lorem ipsum dolor sit amet
Finally, you can have multiple YAML documents stored in a single file separated with the triple dash (---
). You can optionally use the triple dot (...
) to mark the end of each document.