Parsing JSON: Part 1 - The Core

Part 1 of 2 in a series on creating a JSON parser in Rust

  • Jacob
  • 16 min read

Introduction


Hey! This blog post is the first in a 2 part series that details how I went about creating a JSON parser in Rust. In this series, I tie together all my previous posts. This post takes the theoretical compiler concepts discussed in Creating a Language and turns them into a real, working JSON parser.

In the second part, we will look into supporting parsing into a user-defined struct, whilst ensuring type safety. This is implemented using procedural macros (my first dive into writing these in Rust!). At the end, to show it all working, we will swap out the JSON parsing in my Hand-rolled Auth post with my very own parser. So buckle in, because this series is going to cover a lot!

Overview

The goal of this parser is to turn a JSON string into a useful data structure, similar to an Abstract Syntax Tree (AST). This is exactly the role of the front-end of a compiler, so we can use a fairly standard approach to parsing here - scanning/lexing, followed by parsing. This is potentially over-engineered, as the JSON specification outlines how to do the whole parsing in one phase (i.e. without a scanner/lexer). However, I decided to take my own approach to parsing - hopefully learning more in the process. I did still use the JSON specification, but only as a reference for what syntax is possible in JSON.

I have written this parser in Rust, but everything in this first blog post should be applicable to any language! Here, we will parse the JSON into a generic JsonValue type that can hold any JSON data, whatever its structure. In the next part, we will look into populating a user-defined type and validating the types using macros - this part is Rust-specific.

Scanning

Let’s start with the first phase of the parser - the scanner. The purpose of this phase is to turn the raw string into a sequence of tokens, which will make the parsing easier later.

Tokens

A token here can mean a symbol (e.g. {, [, :, etc.), or a literal (strings, numbers, Booleans, or nulls). The tokens store their “kind”, along with a line number and the lexeme (original string value). In Rust, we can represent this with the following data structures:

pub struct Token {
    pub kind: TokenKind,
    pub line: usize,
    pub lexeme: String,
}

pub enum TokenKind {
    LCurlyBracket,
    RCurlyBracket,

    LBracket,
    RBracket,

    Colon,
    Comma,

    // Stores unescaped, dequoted value
    String(String),

    Number,
    Bool,
    Null,
}

The TokenKind::String variant here stores an extra value - the result of removing the surrounding quotes, and handling escape sequences.

The parsing of numbers is handled in the later parsing phase, so we don’t store any extra information on TokenKind::Number. This is because we don’t know what type of number we want (e.g. i32, f64, u8, etc.) until the parsing phase.

The Scanner

The scanner will look at each character in the string, one by one, and segment it into tokens. It holds the entire source code, the start of the token we are currently scanning, the current line, and the position of the current character the scanner is looking at.

pub struct Scanner<'a> {
    source: &'a str,
    token_start: usize,
    line: usize,
    current: usize,
}

The scanner has some utility methods that simplify scanning. Here are the most important ones:

| Method Signature | What It Does |
| --- | --- |
| advance() -> Result<char, ScannerErr> | Returns the current character, and advances the current position to the next char. This takes multibyte Unicode characters into account |
| peek() -> Result<char, ScannerErr> | Returns the current character without advancing |
| matches(c: char) -> bool | Returns true and advances if the current char matches c; otherwise does not advance and returns false |
| make_token(kind: TokenKind) -> Token | Creates a token of the given kind, using the line number from the scanner, and calculating the lexeme as a sub-string of the source code |
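As a rough sketch of how these helpers can be implemented (my best guess at a minimal version, not necessarily the repo's exact code, with the struct cut down to the fields involved and the error type simplified to a unit struct), peek and advance can be built on top of byte indexing into the source string:

```rust
// Sketch only - error type simplified to a unit struct for brevity
#[derive(Debug, PartialEq)]
pub struct ScannerErr;

pub struct Scanner<'a> {
    source: &'a str,
    current: usize,
}

impl<'a> Scanner<'a> {
    pub fn init(source: &'a str) -> Self {
        Scanner { source, current: 0 }
    }

    // Return the current character without advancing, erroring at end of input
    fn peek(&self) -> Result<char, ScannerErr> {
        self.source[self.current..].chars().next().ok_or(ScannerErr)
    }

    // Return the current character and step past it - advancing by the
    // character's UTF-8 width is what handles multibyte characters correctly
    fn advance(&mut self) -> Result<char, ScannerErr> {
        let c = self.peek()?;
        self.current += c.len_utf8();
        Ok(c)
    }

    // Advance only if the current character matches `c`
    fn matches(&mut self, c: char) -> bool {
        if self.peek() == Ok(c) {
            self.advance().unwrap();
            true
        } else {
            false
        }
    }
}
```

Advancing by len_utf8 rather than by 1 is the detail that makes this safe on non-ASCII input - indexing a &str mid-character would panic.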

To interact with the scanner, we have a next_token method, which just obtains the next token in the source code. This is the only public method (other than the constructor) - all other methods are just to assist in scanning the next token.

pub fn next_token(&mut self) -> Result<Option<Token>, ScannerErr> {
    self.skip_whitespace();

    if self.is_at_end() {
        return Ok(None);
    }

    // Set the start of the token
    self.token_start = self.current;

    // Get the next character
    let c = self.advance()?;

    // Number
    if c.is_ascii_digit() || c == '-' {
        return self.number().map(Some);
    }

    // Literal
    if c.is_alphabetic() {
        return self.literal().map(Some);
    }

    // String
    if c == '"' {
        return self.string().map(Some);
    }

    // Symbol
    self.symbol().map(Some)
}

The .map(Some) notation is used because number, literal, string, and symbol all return Result<Token, ScannerErr>, but we need to turn the Token into an Option<Token>. map(Some) wraps the inner value in Some(_) if the Result is Ok.
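One helper worth spelling out is skip_whitespace, which next_token calls first. A minimal sketch (assuming the fields shown earlier; the repo's version may differ) consumes spaces, tabs, and carriage returns, and bumps the line counter on each newline:

```rust
pub struct Scanner<'a> {
    source: &'a str,
    line: usize,
    current: usize,
}

impl<'a> Scanner<'a> {
    // Consume whitespace until the next meaningful character
    fn skip_whitespace(&mut self) {
        while let Some(c) = self.source[self.current..].chars().next() {
            match c {
                // JSON whitespace is exactly these four characters
                ' ' | '\t' | '\r' => self.current += 1,
                '\n' => {
                    // Track the line number for error reporting
                    self.line += 1;
                    self.current += 1;
                }
                _ => break,
            }
        }
    }
}
```

Counting newlines here is what lets every token carry an accurate line number for error messages.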

The scanner identifies the type of token to scan from the first character of that token. Here are the options:

| First Character | Token Type |
| --- | --- |
| Digit or - | Number |
| " | String |
| Alphabetic (a-z, A-Z) | Literal (null, true, false) |
| Anything else | Symbol (any of []{}:,) |

Scanning Strings

The majority of string scanning is easy - the scanner just loops through until it reaches a closing ". As JSON doesn’t allow multiline strings, if the scanner reaches a newline it will return an error saying the string is not terminated.

let mut str_val = String::new();
while self.peek()? != '"' {
    let chr = self.advance().unwrap();
    if chr == '\n' {
        return Err(self.make_err(ScannerErrKind::UnterminatedString));
    }

    str_val.push(chr);
}

// Consume the closing `"`
self.advance().unwrap();
Ok(self.make_token(TokenKind::String(str_val)))

When the scanner hits a \ character, it is at the start of an escape sequence. It must convert this escape sequence into a single character. For example, the two characters \n are converted into a single newline character.

Unicode characters involve a bit more work. These are in the form \uXXXX where X is a hex digit. After the u, the scanner first collects 4 hex digits, then it converts them to an integer (u32). From here, it can convert the integer into a character, giving us the Unicode character.

If there is an error at any point in this process, the scanner reports an invalid escape sequence.

This code is inserted just after checking the character isn’t a newline:

// Check for a backslash
if chr == '\\' {
    // Then look at the following escape sequence
    let value = match self.advance()? {
        '"' => '"',
        '/' => '/',
        'b' => '\x08',
        'f' => '\x0C',
        'n' => '\n',
        'r' => '\r',
        't' => '\t',
        '\\' => '\\',
        'u' => {
            // Unicode character - read next 4 hex values and parse
            let mut hex = String::with_capacity(4);
            for _ in 0..4 {
                hex.push(self.advance()?);
            }

            // Convert hex string to unicode char
            let digit = u32::from_str_radix(&hex, 16)
                .map_err(|_| self.make_err(ScannerErrKind::InvalidEscapeSequence))?;

            char::from_u32(digit)
                .ok_or(self.make_err(ScannerErrKind::InvalidEscapeSequence))?
        }
        _ => return Err(self.make_err(ScannerErrKind::InvalidEscapeSequence)),
    };

    // Add the processed character to the string
    str_val.push(value);
    continue;
}

Scanning Numbers

Numbers may start with a digit or -. They may have decimal points and also use scientific notation. Examples of valid numbers are:

  • 12
  • -12
  • 12E3
  • 12.34
  • 12.34e5
  • 12.34e+5
  • -12.34e-5

The scanner handles this by first looping until it reaches a non-digit. At this point, the number may have a decimal point followed by more digits, so it continues consuming those:

fn number(&mut self) -> Result<Token, ScannerErr> {
    // Consume digits - we already know we've got an initial one
    while matches!(self.peek(), Ok(c) if c.is_ascii_digit()) {
        self.advance().unwrap();
    }

    // If reach a `.`, include it and continue matching digits
    // We know it is a float at this point
    if self.matches('.') {
        while matches!(self.peek(), Ok(c) if c.is_ascii_digit()) {
            self.advance().unwrap();
        }
    }

    // ...
}

Next, the scanner checks for scientific notation - if the number has an e or E, search for an optional - or + followed by more digits. To handle this, we add this extra code just before returning the token:

let next_char = self.peek();

// Allow scientific notation e.g. 10e5
if let Ok(c) = next_char {
    if c == 'e' || c == 'E' {
        self.advance().unwrap();

        // Consume `-` or `+` if it exists
        self.matches_any(&['-', '+']);

        // Loop until not a digit
        while matches!(self.peek(), Ok(c) if c.is_ascii_digit()) {
            self.advance().unwrap();
        }
    } else if c.is_alphabetic() {
        return Err(self.make_err(ScannerErrKind::InvalidNumber));
    }
}

This leaves out some edge cases to make it easier to read. For example, it does not ensure that there is at least one digit after an e. For the full number parsing code, look here in my GitHub repo.
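If you did want that check, one way to express it (hypothetical helper code, not from the repo) is to validate the exponent part of the finished lexeme: after the e/E and an optional sign, there must be at least one digit.

```rust
// Hypothetical helper: check that an exponent, if present, has at least
// one digit after the `e`/`E` and the optional sign
fn exponent_is_valid(lexeme: &str) -> bool {
    match lexeme.find(|c: char| c == 'e' || c == 'E') {
        None => true, // no exponent at all is fine
        Some(i) => {
            let mut rest = &lexeme[i + 1..];
            // Skip the optional sign
            if rest.starts_with('-') || rest.starts_with('+') {
                rest = &rest[1..];
            }
            // At least one digit must follow, and nothing else
            !rest.is_empty() && rest.chars().all(|c| c.is_ascii_digit())
        }
    }
}
```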

Scanning Other Literals

If the scanner encounters an alphabetic character, this can be one of 3 literals: null, true, or false. It loops until the end of the alphabetic characters, then checks the scanned lexeme:

fn literal(&mut self) -> Result<Token, ScannerErr> {
    // Loop until not alphabetic character
    while matches!(self.peek(), Ok(c) if c.is_alphabetic()) {
        self.advance().unwrap();
    }

    // Check lexeme
    let literal = &self.source[self.token_start..self.current];
    let kind = match literal {
        "null" => TokenKind::Null,
        "true" | "false" => TokenKind::Bool,
        _ => Err(self.make_err(ScannerErrKind::UnrecognisedLiteral))?,
    };

    Ok(self.make_token(kind))
}

Scanning Symbols

Finally, there are 6 valid symbols in JSON, which are one character long, making them easy to scan:

fn symbol(&mut self) -> Result<Token, ScannerErr> {
    let char = self.prev();
    let kind = match char {
        '{' => TokenKind::LCurlyBracket,
        '}' => TokenKind::RCurlyBracket,
        '[' => TokenKind::LBracket,
        ']' => TokenKind::RBracket,
        ':' => TokenKind::Colon,
        ',' => TokenKind::Comma,
        _ => Err(self.make_err(ScannerErrKind::UnrecognisedSymbol))?,
    };

    Ok(self.make_token(kind))
}

Parsing

Now let’s take a step back to look at the API we are developing - how should parsing JSON look to the programmer using this library?

Generic JSON Value

JSON is very flexible, and an object does not have to conform to a fixed layout, so we may not know the structure of the data we are parsing. To support this in Rust, we need a type-safe way to represent any possible JSON value.

Enums are the perfect way to represent this in Rust, as they support payloads. This is very similar to a tagged union in other languages. Each variant can have a payload (or associated data), and the payloads do not have to be the same type across variants.

For example, we can have a Number enum, which allows us to store either an integer OR a float:

pub enum Number {
    Int(i32),
    Float(f64),
}

fn main() {
    // n1 and n2 are the same type, but one contains an integer and one a float
    let n1 = Number::Int(42);
    let n2 = Number::Float(3.14);
}
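Getting the value back out is done by pattern matching on the variant. For example, a small helper (my own illustration, not from the parser) that converts either variant to an f64:

```rust
pub enum Number {
    Int(i32),
    Float(f64),
}

// Pattern matching recovers the payload; here both variants become an f64
fn as_f64(n: &Number) -> f64 {
    match n {
        Number::Int(i) => *i as f64,
        Number::Float(f) => *f,
    }
}
```

The compiler forces the match to cover every variant, which is exactly the safety property we want when handling arbitrary JSON.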

Let’s now apply this to JSON values - a JSON value can be one of a few different types:

  • Number (always represented as a float)
  • String
  • Boolean
  • Object (key-value pairs, where the key is a string and the value can be any valid JSON value)
  • Array (a list of JSON values. Note that this supports different element types in the same list)
  • Null (no payload required)

pub enum JsonValue {
    Number(f64),
    String(String),
    Bool(bool),

    Object(HashMap<String, JsonValue>),
    Array(Vec<JsonValue>),
    Null,
}

With this enum, we can represent any possible JSON value, including nested objects and arrays. This is because it is a recursive data structure - the Object and Array variants both contain JsonValues, which themselves can be Objects or Arrays.
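To see the recursion in action, here is the value that a small document like {"scores": [1, null]} corresponds to, built by hand (a sketch using the enum above):

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum JsonValue {
    Number(f64),
    String(String),
    Bool(bool),
    Object(HashMap<String, JsonValue>),
    Array(Vec<JsonValue>),
    Null,
}

// The JsonValue for: {"scores": [1, null]}
fn example() -> JsonValue {
    let mut obj = HashMap::new();
    obj.insert(
        "scores".to_string(),
        // An Array nested inside an Object - the recursive case
        JsonValue::Array(vec![JsonValue::Number(1.0), JsonValue::Null]),
    );
    JsonValue::Object(obj)
}
```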

API

I wanted the API to be relatively simple - just call Parser::parse on a JSON string and get back a JsonValue. For example:

fn main() {
    let json = r#"{"name": "John Smith", "age": 42}"#;
    let result: JsonValue = Parser::parse(json).unwrap();
    println!("{result:#?}");

    // Prints:
    // Object(
    //     {
    //         "name": String(
    //             "John Smith",
    //         ),
    //         "age": Number(
    //             42.0,
    //         ),
    //     },
    // )
}

It would also be nice to parse other types - if we expect the JSON to be a specific type, we could specify that. Say we expect a float that could be null; we could do:

fn main() {
    let json = "42";
    let result: Option<f32> = Parser::parse(json).unwrap();
    println!("{result:?}");

    // Prints: Some(42.0)
}

To support this, we can create a Parse trait (a trait is similar to an interface in other languages), which has a parse function. This can be implemented on any type, and defines a function that returns the parsed value as that type. We can provide implementations for the Rust equivalents of the basic JSON types (including strings, lists, integers, floats, Booleans, etc.), and users may provide implementations for other types, including ones they define. We will also define a parse function for JsonValue.

Here is the trait:

pub trait Parse {
    fn parse(parser: &mut Parser) -> Result<Self, ParserErr>
    where
        Self: Sized;
}

It is possible that an error occurs when parsing, so the return type allows returning either the type we are implementing for, or a parser error.

This function takes a Parser, which does not do the parsing itself, but provides utility functions that make processing the list of tokens easier. Let’s have a look at it in more depth.

The Parser

The parser holds a scanner, which it can use to request the next token. It also stores the current and previous tokens for later use.

pub struct Parser<'a> {
    scanner: Scanner<'a>,

    prev: Option<Token>,
    current: Option<Token>,
}

This parser struct has some utility methods that assist in parsing each type. Below are the main ones. Note that any of these methods can return a ParserErr, but I removed that from the signatures for better readability.

| Method Signature | What It Does |
| --- | --- |
| consume(kind: TokenKind) -> Token | Advances and returns the current token if it matches kind. Otherwise, returns an error |
| check(kind: TokenKind) -> bool | Returns true if the current token matches kind, otherwise false. Does not advance |
| peek() -> Token | Returns the current token without advancing |
| advance() -> Token | Returns the current token and advances to the next |
| previous() -> Token | Returns the previous token |
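One subtlety: TokenKind::String carries a payload, so check cannot simply compare kinds with ==, or a String("a") token would never match String(""). One way to compare variants while ignoring payloads (an assumption on my part - the repo may do this differently) is std::mem::discriminant:

```rust
use std::mem::discriminant;

// Cut-down TokenKind, just for illustration
#[derive(Debug)]
pub enum TokenKind {
    LCurlyBracket,
    String(String),
    Null,
}

// Compare only the enum variant, ignoring any payload - this lets a
// check against String(...) match any string token
fn same_kind(a: &TokenKind, b: &TokenKind) -> bool {
    discriminant(a) == discriminant(b)
}
```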

Now let’s look at the parse function. This is a static function that takes the JSON source string, and returns the parsed value, or an error. In order to support parsing different types, we make this a generic function, taking type T that implements Parse (which ensures it has a parse function).

pub fn parse<T: Parse>(source: &str) -> Result<T, ParserErr> {
    let mut scanner = Scanner::init(source);
    let current = scanner.next_token()?;

    let mut parser = Parser {
        scanner,
        current,
        prev: None,
    };

    // Call parse on the provided type `T`
    let result = T::parse(&mut parser)?;
    if parser.current.is_some() {
        return Err(parser.make_err(ParserErrKind::ExpectedEndOfSource));
    }

    Ok(result)
}

This function sets up the scanner and parser, and reads the first token, ready for parsing. Following this, it calls parse on the provided generic type, and returns the result. If there are any tokens left over, it will return an error, as it expected the source to end. This allows us to report invalid JSON strings such as the following:

{
    "something": 1,
}
"this is an unexpected token"

Parsing Basic Types

We now have the building blocks to parse any type we like. We can start by looking at a simple case - a string. The scanner has done the heavy lifting for us here, so we just need to check we have a TokenKind::String token, returning the value if so, or an error if not:

impl Parse for String {
    fn parse(parser: &mut Parser) -> Result<Self, ParserErr> {
        // If we have a string, return the value captured by the scanner
        // Otherwise, we expected a string, but didn't get one - error
        match parser.advance()?.kind {
            TokenKind::String(val) => Ok(val),
            _ => Err(parser.make_err_prev(ParserErrKind::UnexpectedToken)),
        }
    }
}

Integers are only slightly more complicated, having to parse the lexeme string as the integer type:

impl Parse for i32 {
    fn parse(parser: &mut Parser) -> Result<Self, ParserErr> {
        let token = parser.advance()?;

        // If we have a number, parse the lexeme (string) as an i32, and return that, or error if failed
        match token.kind {
            TokenKind::Number => token
                .lexeme
                .parse::<i32>()
                .map_err(|_| parser.make_err_prev(ParserErrKind::InvalidNumber)),
            _ => Err(parser.make_err_prev(ParserErrKind::UnexpectedToken)),
        }
    }
}

We can also write very similar code for other number types as well, swapping the i32s with the desired type (e.g. f64).
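The shared core of all these numeric impls is just parsing the lexeme into the target type. As a standalone sketch (using a String error in place of the post's ParserErr), it generalises over any type implementing FromStr:

```rust
// Standalone sketch of the numeric conversion at the heart of each impl.
// A String error stands in for the post's ParserErr type.
fn parse_number_lexeme<T: std::str::FromStr>(lexeme: &str) -> Result<T, String> {
    lexeme
        .parse::<T>()
        .map_err(|_| format!("invalid number: {lexeme}"))
}
```

Note how the same lexeme can succeed or fail depending on the requested type - "12.5" is a valid f64 but not a valid i32, which is why the scanner defers number parsing to this phase.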

Options are an interesting case. These can be None (i.e. null), or a specific type. We want to support a generic optional (i.e. Option<T>), where the inner type T implements Parse. Therefore, the parsing code works as follows:

  1. If the current token is null, return None
  2. Otherwise, call the parse function of the inner type T

Here is the code:

impl<T: Parse> Parse for Option<T> {
    fn parse(parser: &mut Parser) -> Result<Self, ParserErr> {
        // If null, return `None`
        if parser.check(TokenKind::Null)? {
            parser.consume(TokenKind::Null)?;
            return Ok(None);
        }

        // Otherwise parse the inner type `T`
        Ok(Some(T::parse(parser)?))
    }
}

Parsing JsonValues

As mentioned earlier, JsonValues can be one of several types. We need to know which type to parse. Fortunately, each type starts with a different token kind, so we can just check the current token kind, and parse the corresponding type for that:

impl Parse for JsonValue {
    fn parse(parser: &mut Parser) -> Result<Self, ParserErr> {
        let token = parser.peek()?;
        let ast = match token.kind {
            TokenKind::LCurlyBracket => Self::Object(<HashMap<String, JsonValue>>::parse(parser)?),
            TokenKind::LBracket => Self::Array(<Vec<JsonValue>>::parse(parser)?),
            TokenKind::String(_) => Self::String(String::parse(parser)?),
            TokenKind::Number => Self::Number(f64::parse(parser)?),
            TokenKind::Bool => Self::Bool(bool::parse(parser)?),
            TokenKind::Null => {
                parser.advance()?;
                Self::Null
            }
            _ => return Err(parser.make_err(ParserErrKind::UnexpectedToken)),
        };

        Ok(ast)
    }
}

Parsing Objects

Objects are the most complicated type to parse. First, here is an example of a JSON object:

{
    "name": "John Smith",
    "age": 42
}

These are a key-value store which, in Rust, we can represent as a HashMap<String, T> where T is a type that implements Parse (e.g. JsonValue). Here’s the parsing approach:

  1. Consume a {
  2. Loop through all key-value pairs, stopping after reaching a } or if there isn’t a comma after a pair
    1. Consume a string token - the key/name
    2. Consume a colon
    3. Parse the value as type T
    4. Check for a comma
  3. Check there isn’t a trailing comma and consume the }

impl<T: Parse> Parse for HashMap<String, T> {
    fn parse(parser: &mut Parser) -> Result<Self, ParserErr> {
        parser.consume(TokenKind::LCurlyBracket)?;

        let mut props = HashMap::new();
        let mut had_comma = false;

        // Loop through all properties, until reaching closing bracket
        while !parser.check(TokenKind::RCurlyBracket)? {
            let token = parser.advance()?;
            // Expect the key
            match token.kind {
                TokenKind::String(name) => {
                    // Followed by a colon
                    parser.consume(TokenKind::Colon)?;

                    // Followed by the value
                    let value = T::parse(parser)?;
                    props.insert(name, value);

                    // Once no comma at end, we have reached end of object
                    had_comma = parser.check(TokenKind::Comma)?;
                    if had_comma {
                        parser.advance()?;
                    } else {
                        break;
                    }
                }
                _ => return Err(parser.make_err_prev(ParserErrKind::UnexpectedToken)),
            }
        }

        // No trailing comma
        if had_comma {
            return Err(parser.make_err_prev(ParserErrKind::UnexpectedToken));
        }

        parser.consume(TokenKind::RCurlyBracket)?;

        Ok(props)
    }
}

Arrays are parsed similarly, just using square brackets, and checking for a single value instead of key, colon, value. The code for this can be seen in my GitHub repo.
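To give a flavour of the array case without reproducing the repo's code, here is the same loop shape against a hypothetical MiniParser over a pre-scanned token list (a mock of my own, not the post's actual Parser or TokenKind):

```rust
// A minimal mock of the parser, just enough to show the array loop.
// `Tok` stands in for TokenKind; errors are plain Strings here.
#[derive(Debug, Clone, PartialEq)]
enum Tok {
    LBracket,
    RBracket,
    Comma,
    Number(f64),
}

struct MiniParser {
    tokens: Vec<Tok>,
    pos: usize,
}

impl MiniParser {
    fn peek(&self) -> Option<&Tok> {
        self.tokens.get(self.pos)
    }

    fn advance(&mut self) -> Result<Tok, String> {
        let t = self.tokens.get(self.pos).cloned().ok_or("unexpected end")?;
        self.pos += 1;
        Ok(t)
    }

    fn consume(&mut self, t: Tok) -> Result<(), String> {
        if self.advance()? == t {
            Ok(())
        } else {
            Err("unexpected token".to_string())
        }
    }

    // Mirrors the object loop: `[`, values separated by commas with no
    // trailing comma, then `]`
    fn parse_array(&mut self) -> Result<Vec<f64>, String> {
        self.consume(Tok::LBracket)?;
        let mut items = Vec::new();
        let mut had_comma = false;

        while self.peek() != Some(&Tok::RBracket) {
            match self.advance()? {
                Tok::Number(n) => items.push(n),
                _ => return Err("expected a value".to_string()),
            }
            // Once there is no comma, we have reached the end of the array
            had_comma = self.peek() == Some(&Tok::Comma);
            if had_comma {
                self.advance()?;
            } else {
                break;
            }
        }

        // No trailing comma allowed
        if had_comma {
            return Err("trailing comma".to_string());
        }
        self.consume(Tok::RBracket)?;
        Ok(items)
    }
}
```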

The Result

We now have a fully functioning JSON parser! This works just as outlined in the API section - we can provide a JSON string, such as {"name": "John Smith", "age": 42}, and it will turn this into a JsonValue. Let’s have a look at the example again:

fn main() {
    let json = r#"{"name": "John Smith", "age": 42}"#;
    let result: JsonValue = Parser::parse(json).unwrap();
}

Here, result has the following value:

Object(
    {
        "name": String(
            "John Smith",
        ),
        "age": Number(
            42.0,
        ),
    },
)

This parser can now handle any JSON string you throw at it, producing a JsonValue. However, this is not the most practical representation - JsonValue is an enum, which means we must first check the variant (e.g. string, object, etc.) before being able to access the data. This makes accessing data in nested structures really cumbersome. It also gives us no way to enforce a particular structure/layout for the data.

We can solve both of these problems by supporting parsing to a user-defined type (i.e. a struct). We can validate that the JSON data matches the structure of the struct, and populate it at the same time. There is no need for JsonValue, and we can enforce the layout we expect. For example, we could have the struct:

struct Person {
    name: String,
    age: Option<u32>,
}

This would only allow JSON objects with a name that is a String and an age that is either an integer or null.

To do this in Rust, we can use a derive macro, and this will be the focus of the next post in this series - stay tuned!


Written by: Jacob

I am a software engineer who loves exploring tech and figuring out how things work. I am passionate about learning and hope to share some of that passion with you!