Guide to adding new AST readers

Pyndoc’s reader works in a way that makes it independent of the language being processed. The structure of the syntax tree makes it simple to add new blocks or modify existing ones.

AST Blocks

AST blocks are divided into two subcategories:

  • Atom blocks - cannot hold other blocks inside of them

  • Composite blocks - can hold other blocks inside of them

The reader treats these two types completely independently and reads each of them in a different way.
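The distinction can be pictured with a minimal sketch. The classes below are stand-ins written for this guide, not pyndoc’s own ASTAtomBlock and ASTCompositeBlock; the attribute names (contents, children) are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AtomSketch:
    """Stand-in for an atom block: holds at most a plain value, never other blocks."""
    name: str
    contents: Optional[str] = None


@dataclass
class CompositeSketch:
    """Stand-in for a composite block: holds a list of child blocks."""
    name: str
    children: list = field(default_factory=list)


# An emphasised "hi there": one composite block holding three atom blocks.
emph = CompositeSketch("Emph")
emph.children += [AtomSketch("Str", "hi"), AtomSketch("Space"), AtomSketch("Str", "there")]
```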

Default AST Blocks

The default representation of AST Blocks is defined in pyndoc.ast.blocks and contains definitions for most blocks found in any markup language. Any language parser or reader added to pyndoc should define these blocks.

The blocks defined in that file all derive from either the ASTAtomBlock or the ASTCompositeBlock class, found in pyndoc.ast.basic_blocks. These classes define the default behaviour and contents of Atom and Composite blocks.

Read Handler

A read handler is a class containing default methods for parsing tokens related to certain AST Blocks. Every AST Block derives from a Read Handler, which allows for custom reading functions and definitions.

CompositeReadHandler

A class defining the Read Handler for all composite blocks. It contains attributes related to the start and end patterns of a Composite Block, as well as a flag indicating whether the block is an inline block (that is, whether it can exist on its own, not wrapped in any other composite block).

CompositeReadHandler contains the following methods important for creating new readers:

  • process_read - invoked after a block is created; it can process additional arguments that follow a block’s definition. By default it does nothing

  • start - matches a token against a start pattern

  • end - matches a token against an end pattern

  • handle_premature_closure - handles the situation in which the file has ended while the block is still open and the block needs extra processing
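The four methods can be sketched roughly as follows. This is a hypothetical stand-in, not pyndoc’s CompositeReadHandler: the bracket delimiters, the attribute names, the _match_tail helper, and the idea that the "new token" is the token with the matched text removed are all assumptions made for illustration.

```python
import re
from typing import Optional, Tuple


class CompositeHandlerSketch:
    """Hypothetical stand-in, not pyndoc's CompositeReadHandler."""

    start_pattern = r"\["   # assumed delimiter, chosen only for illustration
    end_pattern = r"\]"     # assumed delimiter, chosen only for illustration
    is_inline = True

    @classmethod
    def _match_tail(cls, pattern: str, token: str) -> Tuple[Optional[re.Match], str]:
        # Look for the pattern at the end of the token; on success, return the
        # match together with the token stripped of the matched text.
        match = re.search(f"(?:{pattern})$", token)
        return match, (token[: match.start()] if match else token)

    @classmethod
    def start(cls, token: str) -> Tuple[Optional[re.Match], str]:
        return cls._match_tail(cls.start_pattern, token)

    @classmethod
    def end(cls, token: str) -> Tuple[Optional[re.Match], str]:
        return cls._match_tail(cls.end_pattern, token)

    def process_read(self, **kwargs) -> None:
        # Default: nothing to do after the block is created.
        pass

    @classmethod
    def handle_premature_closure(cls, token: str) -> str:
        # Default in this sketch: leave the token untouched at end of file.
        return token
```

A subclass would then only override the patterns, or the individual methods whose default behaviour does not fit the block.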

AtomReadHandler

A class defining the Read Handler for atom blocks. It contains attributes related to the pattern that matches an atom block, and a boolean indicating whether the block has any content (for example, a Str block has a string as its content, while a Space block has none).

AtomReadHandler contains the following method important for creating new readers:

  • match_pattern - matches the token against the block’s pattern
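As an illustration, an atom handler could be as small as the sketch below. The class names, the attribute names (pattern, has_content) and the two regular expressions are assumptions, not pyndoc’s actual definitions.

```python
import re
from typing import Optional


class AtomHandlerSketch:
    """Hypothetical stand-in, not pyndoc's AtomReadHandler."""

    pattern = ""          # regex that the whole token must match
    has_content = False   # whether the block stores the matched text

    @classmethod
    def match_pattern(cls, token: str) -> Optional[re.Match]:
        # The token is an instance of this atom only if all of it matches.
        return re.fullmatch(cls.pattern, token)


class StrSketch(AtomHandlerSketch):
    pattern = r"\S+"      # a run of non-whitespace characters
    has_content = True    # a Str keeps its text


class SpaceSketch(AtomHandlerSketch):
    pattern = r" +"       # one or more spaces
    has_content = False   # a Space carries no text
```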

Atom Wrapper

An atom wrapper is a block that catches and wraps any atom or inline blocks that are defined while no context exists. Most of the time ast.Para is used for this, but any other block can be used as well.
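A minimal sketch of this wrapping, assuming the context is a plain list used as a stack of open blocks. BlockSketch and insert_block are hypothetical names and are not part of pyndoc.

```python
class BlockSketch:
    """Hypothetical stand-in for a composite block."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.children = []


def insert_block(block, context, wrapper_name="Para"):
    """Append block to the innermost open block, opening a wrapper if none exists."""
    if not context:
        context.append(BlockSketch(wrapper_name))
    context[-1].children.append(block)


context = []
insert_block("Str('hello')", context)   # no context yet, so a Para is opened first
insert_block("Space()", context)        # reuses the Para that is already open
```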

Defining a reader

New readers can be defined in the src/pyndoc/readers directory.

First, create a directory with the language’s name. Inside that directory, create an empty __init__.py file.
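The same two steps can be scripted. The sketch below runs in a temporary directory rather than a real checkout; scaffold_reader and the language name "mylang" are hypothetical.

```python
import tempfile
from pathlib import Path


def scaffold_reader(root: Path, language: str) -> Path:
    """Create src/pyndoc/readers/<language>/ with an empty __init__.py inside."""
    reader_dir = root / "src" / "pyndoc" / "readers" / language
    reader_dir.mkdir(parents=True, exist_ok=True)
    (reader_dir / "__init__.py").touch()
    return reader_dir


# Demonstrated against a throwaway directory standing in for the repository root.
with tempfile.TemporaryDirectory() as tmp:
    created = scaffold_reader(Path(tmp), "mylang")
    relative = created.relative_to(tmp).as_posix()
    init_exists = (created / "__init__.py").is_file()
```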

tokens.py

tokens.py is a required file for each language reader. It contains details on all tokens and their start and end patterns, and is used to define attribute values for AST Blocks.

A tokens.py file should contain the following definitions:

  • A declared_tokens dict, which maps each AST Block class (the key) to a tuple (the value) containing a regex pattern defining the block’s start and a boolean. The boolean is used as the is_inline attribute

  • A declared_ends dict, which holds the declared end patterns. Keys are the same as above; each value is just the regex pattern

  • An atom_wrapper variable, containing the class name for the atom wrapper

  • A declared_atomic_patterns dict. Keys are as above; each value is a tuple containing a regex string for the atomic pattern and a boolean indicating whether the block has any contents
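Put together, a tokens.py might look roughly like the sketch below. The block classes are empty stand-ins so that the example runs on its own; a real tokens.py would use pyndoc’s block classes instead. All regex patterns are illustrative assumptions, not the patterns shipped with any pyndoc reader, and assigning the class itself to atom_wrapper (rather than its name as a string) is also an assumption.

```python
import re


# Empty stand-ins for AST Block classes, defined here only to keep the sketch runnable.
class Header: ...
class Emph: ...
class Para: ...
class Str: ...
class Space: ...


# block class -> (start regex, is_inline)
declared_tokens = {
    Header: (r"(?P<h>#{1,6}) ", False),
    Emph: (r"\*", True),
}

# block class -> end regex
declared_ends = {
    Header: r"\n",
    Emph: r"\*",
}

# block that wraps atoms and inline blocks found outside any context
atom_wrapper = Para

# block class -> (regex, has_content)
declared_atomic_patterns = {
    Str: (r"[^\s*#]+", True),
    Space: (r" ", False),
}

# Sanity check: every declared pattern must be a valid regular expression.
for pattern, _ in list(declared_tokens.values()) + list(declared_atomic_patterns.values()):
    re.compile(pattern)
for pattern in declared_ends.values():
    re.compile(pattern)
```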

Default block processing

The reader goes over a file character by character and forms tokens, which the parser then matches against the patterns defined in tokens.py. With the default behaviour of all read handlers, the reader does the following for each character read:

  1. Check if the currently processed block has ended:

    • Run the end method of the read handler, which returns a match and a new token

    • If there is a match, pop the current block from the context tree, and place it into the parsed tree if the context is empty, or into the block below it otherwise

  2. Check if a new block has started:

    • Run the start method of the read handler, which returns a match and a new token

    • If the block is an inline block and no context exists, it is first wrapped in the atom wrapper

    • Add the block to the context tree

  3. Check if an atom block has ended:

    • Using the match_pattern method, check whether an atom block matched in a previous iteration and no longer matches now

    • If so, the atom block has ended

    • Insert the atom block into the current context, or wrap it in the atom wrapper if there is no context
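The three steps above can be condensed into a toy reader. This is a simplified sketch and not pyndoc’s implementation: it knows a single hypothetical composite block ("Span", delimited by brackets), matches delimiters on single characters instead of calling start and end on whole tokens, and represents blocks as plain lists and tuples. All names and patterns are assumptions.

```python
import re

START, END = r"\[", r"\]"          # hypothetical delimiters of a "Span" composite
ATOMS = {                          # name -> (pattern, has_content), all assumed
    "Str": (r"[a-z]+", True),
    "Space": (r" +", False),
}
WRAPPER = "Para"                   # atom wrapper used when no context exists


def match_atom(token):
    """Return the name of the atom whose pattern matches the whole token, if any."""
    for name, (pattern, _) in ATOMS.items():
        if re.fullmatch(pattern, token):
            return name
    return None


def read(text):
    tree, context, token = [], [], ""   # parsed tree, stack of open blocks, current token

    def close_block():
        # Pop the current block into the block below it, or into the parsed tree.
        block = context.pop()
        (context[-1] if context else tree).append(block)

    def flush_atom():
        # Insert a finished atom into the current context, wrapping it if there is none.
        # Tokens that match no atom are silently dropped in this sketch.
        nonlocal token
        name = match_atom(token)
        if name:
            if not context:
                context.append([WRAPPER])
            context[-1].append((name, token if ATOMS[name][1] else None))
        token = ""

    for char in text:
        if context and context[-1][0] == "Span" and re.fullmatch(END, char):
            flush_atom()                  # step 1: the current block has ended
            close_block()
        elif re.fullmatch(START, char):
            flush_atom()                  # step 2: a new block has started
            if not context:               # inline block without context: wrap it first
                context.append([WRAPPER])
            context.append(["Span"])
        elif match_atom(token) and not match_atom(token + char):
            flush_atom()                  # step 3: the atom matched before, not anymore
            token = char
        else:
            token += char

    flush_atom()                          # end of input: finish the pending atom
    while context:                        # and close whatever is still open
        close_block()
    return tree
```

For the input "hi [yo] x", the sketch produces a single Para holding a Str, a Space, a Span with one Str inside it, another Space and a final Str.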

Defining custom blocks

Custom blocks can be defined within a language module (a directory under pyndoc/readers) in its own blocks.py file. Custom behaviour, such as overridden start(), end() and process_read() methods, can be defined there.

Examples

All of these can be found under ``pyndoc.ast.gfm.blocks``

Getting a header level from a matched string

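class Header(ast.Header):
    def __init__(self) -> None:
        super().__init__()

    def process_read(self, **kwargs: Unpack[ast_helpers.ProcessParams]) -> None:
        # The length of the named group "h" in the start match gives the header level.
        match = kwargs["match"]
        level = len(match.group("h"))
        # Store the level as the block's metadata.
        self.contents.metadata = [level]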
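The named group "h" has to come from the start pattern declared for the block. As a standalone illustration, with a hypothetical pattern that is not necessarily the one used by the GFM reader:

```python
import re

# Hypothetical start pattern: one to six "#" characters captured as group "h".
header_start = re.compile(r"(?P<h>#{1,6}) ")

match = header_start.match("### Section title")
level = len(match.group("h"))
print(level)  # 3
```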

Handling premature closure of an Emph

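@classmethod
def handle_premature_closure(cls, token: str) -> str:
    # Drop a trailing "*" left over from an unclosed Emph.
    # endswith() also copes with an empty token, where token[-1] would raise.
    return token[:-1] if token.endswith("*") else token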

Adding a Plain inside of a newly created bullet list

def process_read(self, **kwargs: Unpack[ast_helpers.ProcessParams]) -> None:
    match = kwargs["match"]
    indent = len(match.group("s"))
    self.contents.metadata = [indent]
    self.add_plain(kwargs["context"])

@staticmethod
def add_plain(context: list) -> None:
    plain = ast.Plain()
    context.append(plain)