ATHN markup language specification

Version: 0.1.5 (alpha)

Project ATHN is still very early in development. Suggestions and discussion are very welcome. Expect everything to change at any time.

Introduction

ATHN markup language is the backbone of the project, it's the plain text format that makes up every ATHN site. It's intended to be as protocol agnostic as reasonable. Right now https is used to exchange the markup files, but in theory any other protocol, maybe even a custom one specifically tailored to the needs of this project, could be used for that purpose.

Generally complexity is put on the client software, making the server software simpler to ease site development

Whenever an example of a keyword is given in this specification it's case sensitive unless otherwise noted and has the bytes that make it up written in paretheses next to it. The bytes MUST match exactly, whitespace characters are not ignored.

1 Line types

An ATHN markup document can be analyzed line by line. Each line, except text lines, start with a 3 byte line type identifier (LTI) followed by the content of the line. Every line type has a unique LTI so that the type of any line can be determined unambiguously by knowing the section of the line and inspecting its first 3 bytes. If the first 3 bytes dont match any valid LTI, the line is considered a text line. Each line type has its own rules about how it should be processed, and not all of them should be presented to humans. Blank lines are ignored.

1.1 Text lines

As the name suggests text lines just contain plain text. Text lines are rendered as is. But they have one special feature: formatting

Available in: 'Main', 'Footer' and 'Form'

If the line type cannot be determined to be something else it's a text line.

1.1.1 Text line formatting

Text lines can contain a few special formatting character sequences. They should not be rendered, but MAY be ignored. When the line ends all formatting is implicitly reset. The special sequences are as follows:

\r (0x5c 0x72)

Resets all formatting

\b (0x5c 0x62)

Enables bold text formatting

\i (0x5c 0x69)

Enables italic text formatting

\p (0x5c 0x70)

Enables monospaced formatting

On the rare occation that you do need to write any of the control sequences in a text line you can write it like this \​b (0x5c 0xe2 0x80 0x8b 0x62) with a zero width space or similar in between the backslash character and the letter

1.2 Section lines

Section lines MUST NOT be rendered. Section lines change the section of every following line (until a new section line is encountered). The section starts off as 'Meta' on the first line of any document. Any of the other sections can occur on any line after the 'Meta' section.

LTI: +++ (0x2d 0x2d 0x2d)

The content of a section line is only valid if it is one of the following:

Nothing

Changes the section to 'Main'. It is required and must occur 1 or more times per document.

Header (0x20 0x48 0x65 0x61 0x64 0x65 0x72)

Changes the section to 'Header'. Is optional and only allowed 1 or 0 times per document.

Footer (0x20 0x46 0x6f 0x6f 0x74 0x65 0x72)

Changes the section to 'Footer'. Is optional and only allowed 1 or 0 times per document.

Form (0x20 0x46 0x6f 0x72 0x6d)

Changes the section to 'Form'. Is optional and allowed any number of times per document.

1.3 Metadata tag lines

Metadata tag lines is a collection of line types, used for metadata, each type of metadata tag line is called a tag. Some tags can occur more than once, their order matters and MUST be preserved unless otherwise noted.

Available in: 'Meta'. In fact they are the only permitted line type in the 'Meta' section.

LTI: A letter referring to the tag name, followed by an M (0x4d) character, followed by a space (0x20) character.

These are the valid metadata tags and their rules, they MUST occur in this order in the document:

1.3.1 Title tag

TM (0x54 0x4d 0x20)

The title tag is REQUIRED, it can only occur once, maximum size is 2048 bytes. It SHOULD be shown to the human, typically in a big font at the top of the site, but can usually just be ignored by automatic clients.

1.3.2 Subtitle tag

SM (0x53 0x4d 0x20)

The subtitle tag can only occur once, the maximum size is 16384 bytes. It SHOULD be shown to the human.

1.3.3 Author tag

AM (0x41 0x4d 0x20)

The author tag can occur up to 16 times. Maximum size is 1024 bytes.

1.3.5 Language tag

LM (0x4c 0x41 0x20)

The first occurence of the language tag tells the client what language the current document is (primarily) written in. All subsequent occurences mean that the document is also available in that language. Can occur up to 256 times. The content of the tag is a language string as defined in BCP47. It's important that the first occurence comes first, but the order of the rest is not important.

1.3.4 License (usage _Rights) tag

RM (0x4c 0x4d 0x20)

The license tag specifies what license the whole document is under, it applies to the whole document. It can be any identifier in the SPDX license list or a URL to a custom license. It can occur up to 4 times. The maximum size is 2048 bytes.

1.3.6 Cache duration tag

CM (0x43 0x4d 0x20)

The cache duration tag is an unsigned 32 bit integer, it can only occur once. It is the number of seconds that the page should be cached by clients that support caching. The special value of 0 is used for an infinite cache time.

1.4 Link lines

Link lines are used to redirect to any other document or file of any sort. Link lines are the primary (only) way to have images, video, other multimedia, tables and anything else not supported by ATHN markup in an ATHN markup document. They are of course also used to link to other ATHN documents. They are typically underlined, colored or have some other formatting that indicates that they're interactive. When activated by a human the link SHOULD open in their prefered app, the exception being other ATHN documents which MAY be opened/redirected to in the client app.

Available in: 'Main', 'Header', 'Footer' and 'Form'. They are the only allowed line type in the 'Header' section.

LTI: @@@ (0x40 0x40 0x40)

The content of a link line always starts with a URL (absolute or relative) with spaces and other reserved characters percent-encoded as per RFC 3986. Optionally followed by the | (0x20 0x7c 0x20) delimiter, followed by a label. When no label is specified, the url is shown, otherwise only the label is shown. If a label is present the client MAY still show the url to the human upon request.

1.5 Preformatted lines

Preformatted lines are rendered in a monospaced font with unaltered whitespace. Preformatted lines SHOULD NOT be wrapped. Preformatted lines are used for things like source code, terminal outputs, ASCII art or anything else where it's important to preserve whitespacing or use a fixed width font.

Available in: 'Main' and 'Form'

LTI: ;;; (0x3b 0x3b 0x3b)

Alternative LTI: ''' (0x27 0x27 0x27) can be used to indicate that the content of the preformatted line is textual and makes sense to read (this would mean that a screen reader would read it out loud). The difference between the 2 LTIs is purely a matter of accessibility role.

2.5.1 Code blocks

Advanced clients MAY choose to render continous sequences of preformatted lines into a single "code" block. If they do that they MUST not differentiate between the 2 different types of preformatted lines.

If an author makes a code block containing for example ASCII art, they can chose to start the block with a line beginning with ''' and containing a title which would be read aloud by the screen reader. Then make the rest of the block's lines start with ;;; which would, correctly, not be read by the screen reader since it isnt text.

1.5.2 Source code highlighting

It would be unreasonable to assume of clients that they can correctly highlight source code of any kind. If you want source code to be highlighted you can use ASCII control codes to do the highlighting, but dont expect all clients to show the highlighting. This has the benefit that you can be sure that your source code is always correctly highlighted (if the client supports it). With the drawback that it's more difficult for you to write it. Another drawback is that the client wont get to apply its own color scheme.

The reasoning behind this decision is that if there's one thing that's worse than unhighlighted source code, it's incorrectly highligted source code, and you cant expect clients to know every code language in the world.

1.6 Separator lines

The separator line type is used to separate paragraphs, code blocks or in other circumstances where you might otherwise use a blank line. Since blank lines are ignored and using text lines containing nothing but whitespace is hacky. Separator lines are rendered as a horizontal line, extra spacing or something similar.

Available in: 'Main' and 'Form'

LTI: === (0x3d 0x3d 0x3d)

Separator lines MUST NOT contain any content

1.7 Unordered list lines

Unordered list lines are lines rendered starting with a bullet (the specific symbol is up to the client and could differ depending on the level of indentation) and indented to indicate that it's part of a list. Unordered lists support 6 levels of indentation.

Available in: 'Main' and 'Form'

LTI: a number corresponding to the level of indentation followed by a - (0x2d) character followed by a space character. That makes the following LTIs specify unordered list lines:

1- (0x31 0x2d 0x20): Unordered list line indented by 1

2- (0x32 0x2d 0x20): Unordered list line indented by 2

3- (0x33 0x2d 0x20): Unordered list line indented by 3

4- (0x34 0x2d 0x20): Unordered list line indented by 4

5- (0x35 0x2d 0x20): Unordered list line indented by 5

6- (0x36 0x2d 0x20): Unordered list line indented by 6

The content of an unordered list line can be anything

1.8 Ordered list lines

Ordered list lines are identical to unordered list lines except the author gets to choose the bullet.

Available in: 'Main' and 'Form'

LTI: a number corresponding to the level of indentation followed by a * (0x2a) character followed by a space character. That makes the following LTIs specify unordered list lines:

1* (0x31 0x2a 0x20): Ordered list line indented by 1

2* (0x32 0x2a 0x20): Ordered list line indented by 2

3* (0x33 0x2a 0x20): Ordered list line indented by 3

4* (0x34 0x2a 0x20): Ordered list line indented by 4

5* (0x35 0x2a 0x20): Ordered list line indented by 5

6* (0x36 0x2a 0x20): Ordered list line indented by 6

The content of an ordered list line starts with the bullet symbol(s) followed by the | (0x20 0x7c 0x20) delimiter followed by the content of the list item

1.8.1 Considerations about custom bullets

Requiring the author to number the list items themselves is a bit more difficult on the author. But it does also provide a lot of flexibility in terms of how you choose to number them. You could number them with numbers, letters, roman numerals or not number them at all but just use a special bullet symbol (this would be how you do checklists). It also allows you to stick a text line in between 2 list items in an ordered list without breaking the numbering flow.

1.9 Dropdown lines

A dropdown is used to hide a long and not neccesary to read paragraph behind a shorter label. Only the label is shown until the human requests to see the hidden content of the dropdown line.

Available in: 'Main' and 'Form'

LTI: ... (0x2e 0x2e 0x2e)

Dropdown lines start with the label followed by the | (0x20 0x7c 0x20) delimiter, followed by the hidden content

1.10 Admonition lines

Admonitions are callout boxes that directs your attention to warnings, notes, tips etc. They're often used in software documentation. ATHN markup has 3 types of admonitions used to convey different levels of importance.

Available in: 'Main' and 'Form'

LTIs:

_! (0x5f 0x21 0x20): Note admonition. Is no more important than the other lines in the document and doesnt need any special emphasis.

*! (0x2a 0x21 0x20): Warning admonition. Is important and needs to be emphasized accordingly. It's recommended to only have a few of these in each document.

!! (0x21 0x21 0x20): Danger admonition. Is the single most important piece of information in the entire document and needs strong emphasis. It's recommended not to have more than one of these in each document.

The content of an admonition line can be anything

1.11 Heading lines

There are 6 levels of heading lines. They are typically rendered in a bigger and differently weighted font than text lines. Simple clients MAY choose to treat them identically to text lines (with the LTI shown to the human). Headings do not support formatting, only text lines do. Conventionally the title metadata tag is used as the title of the page, not a level 1 heading.

Available in: 'Main' and 'Form'

LTI: a number corresponding to the level of the heading, followed by a # (0x23) character, followed by a space (0x20) character. That makes the following LTIs specify heading lines:

1# (0x31 0x23 0x20): Level 1 heading line

2# (0x32 0x23 0x20): Level 2 heading line

3# (0x33 0x23 0x20): Level 3 heading line

4# (0x34 0x23 0x20): Level 4 heading line

5# (0x35 0x23 0x20): Level 5 heading line

6# (0x36 0x23 0x20): Level 6 heading line

The content of a heading line can be anything

1.12 Quote lines

Quote lines are used for quotes. They are usually formatted in a certain way to indicate that the line is a quote, or might just by wrapped in quotation marks. Simple clients MAY choose to treat quote lines identically to text lines. Multiline quotes are not supported.

Available in: 'Main' and 'Form'

LTI: /// (0x2f 0x2f 0x2f)

The content of quote lines can be anything

1.13 A note on LTI patterns

All 35 LTIs have been selected carefully to follow specific patterns, whose purpose is to make it easer to parse (mostly for machines). They all fall into one of these two catergories:

  1. There is only 1 sub type for the line type that corresponds to the LTI
  2. There is more than 1 sub type (like heading lines) for the line type that corresponds to the LTI

For LTIs that fall into the 1st category all 3 characters are identical.

For LTIs that fall into the 2nd category, the first character defines the sub type, the second character defines the line type and the third character is a space.

Also, there are 3 line types where the content consists of 2 parts separated by the | (0x20 0x7c 0x20) delimiter. The second character in the LTI of those line types has an ASCII code that's divisible by 2 (which none of the others are).

2 Section rules

The 'Main' section

The 'Main' section contains the primary content of the page and MUST be presented to the human at all times in an appropriate manner.

The 'Header' section

The 'Header' section is a navigation bar of the site as a whole. It can only contain link lines. It contains the most important links to other documents that are relevant to the current one. It should be accessible, but can be hidden away sometimes, only be at the top of the page or be presented in some other way that makes sense given its role.

The 'Footer' section

The 'Footer' section is typically located at the bottom of the page and can contain link lines and regular text lines. It's typically used for persistent information about the page or site that isnt handled my the 'Meta' section. It could for example link to a privacy policy, or explain briefly what the company behind a site does. It isnt very important and doesnt need to be presented prominently to the human, but could be of interest and should be viewable.

3 Forms

The 'Form' section behaves much like the 'Main' section. You can have several form sections in one document. A 'Form' section has all the same line types as the 'Main' section, and they're all rendered in the same way. The difference is that it has one additional line type available: the form field line.

A form field collects user input. All form field lines in a single 'Form' section are grouped together. All 'Form' sections MUST have a submit field, that when activated, if all fields are validly filled, sends the input of all of the fields in the 'Form' section to a specified server. The specifics of how such a write request is sent and encoded depends on the protocol used, see the companion specification for the protocol for that.

The fields of different types should be presented as interactive elements that make sense to choose that type of data. For example the date type could have a calendar picker and the string type a text box.

Clients SHOULD save the inputs in forms until the form data has been successfully sent or the user decides to clear it. That means that if the user partially fills out a form, then leaves the page to come back to it later, the input should still be filled into the form. It also means that if the form data is sent unsuccessfully, whether that's because the client fails to send it or because the server reports an error in the response, the input in the form should not be cleared but saved.

3.1 Form field lines

Available in: 'Form'

LTI: ??? (0x3f 0x3f 0x3f)

The content of a form field line goes like this:

  1. The ID of the field, this can consist of the ASCII letters a-z (0x61-0x7a), A-Z (0x41-0x5a) and the _ (0x5f) character. Snake case is recommended.
  2. The : (0x3a) character
  3. The name of the type of the field, see section 3.2 for a list
  4. Any number of properties. Properties are specified in a way similar to command line arguments. Like this:
    1. The \ (0x20 0x5c) characters.
    2. The name of the property, the valid property names will depend on the type of the field and can also be seen in section 3.2.
    3. A space (0x20) character
    4. The value of the property. The property value continues until a new property is encountered or the line ends. Some properties are boolean and have a default behavior that can be overridden by just applying the property to a field, they therefore dont have any value. The expected property type is written in the definitions of the properties.

Here is an example of a correctly formatted form field

???password:string \secret \min 12 \max 64 \label Choose your password

3.2 Form field types

3.2.1 Global properties

These properties can be used on all (except when noted) form field types.

3.2.2 The submit type

Type name: submit (0x73 0x75 0x62 0x6d 0x69 0x74)

This is the special submit type, it is the only type that doesnt take any input. Instead it behaves like a button and can only be used once per 'Form' section. When activated it verifies that all of the form fields in the 'Form' section that it's in are validly filled out. If they're not it will refuse to submit, notifying the user of the error. If they are valid it will submit the form data, this process depends on what protocol is used. The ID of the submit button is only used to set its default label, it doesnt matter otherwise.

Valid properties:

3.2.3 The string type

Type name: string (0x73 0x74 0x72 0x69 0x6e 0x67)

Input type: string

The string type contains a string of UTF-8 characters. By default only a single line is allowed. It's very generic so it has quite a few options to narrow its meaning down.

Valid properties:

3.2.4 The integer type

Type name: int (0x69 0x6e 0x74)

Input type: 64bit signed integer

The integer type represents a whole number.

Valid properties:

3.2.5 The floating point number type

Type name: float (0x66 0x6c 0x6f 0x61 0x74)

Input type: IEEE 754 double precision floating point (binary64) number

The floating point number type represents (almost) any number.

Valid properties:

3.2.6 The boolean type

Type name: bool (0x62 0x6f 0x6f 0x6c)

Input type: boolean

A boolean is either true or false. If the boolean is optional it can also be undecided. Optional booleans SHOULD have some way to be reset back to the undecided state. If a default value hasnt been specified with the default property the boolean field defaults to false if the field is required and undecided if the field is optional.

Valid properties: Only the global properties. Default property can have the values true (0x74 0x72 0x75 0x65) or false (0x66 0x61 0x6c 0x73 0x65)

3.2.7 The file (binary) type

Type name: file (0x66 0x69 0x6c 0x65)

Input type: arbitrary binary data

Allows a file (including binary files) to be uploaded.

Valid properties:

Remember: The file type doesnt support the default property

3.2.8 The list type

Type name: list (0x6c 0x69 0x73 0x74)

Input type: (non zero) unsigned int* (the list field doesnt technically take any input)

The list type is a bit special because it doesnt contain any value in itself. Instead it has a number of child fields that contain the actual values. It can contain any number of elements which each has a value for each child element, like an array or vector in most programming languages. Clients should somehow indicate that the child fields are related to each other and to the list, they should also let the user create and remove elements from the list.

Valid properties:

Note: This is the only case where the document flow can be broken, meaning that the order of the lines in the source document doesnt correspond to the order in the rendered document. That might become a problem, but it's currently unsolved. For now the best thing to do is to always put child fields on the line immedeatly after their parent field.

3.2.9 The date type

Type name: date (0x64 0x61 0x74 0x65)

Input type: RFC 3339 formatted timestamp where the timezone is always UTC. The special value now (0x6e 0x6f 0x77) refers to the time of filling out the form field.

The date type represents a specific point in time. The actual value is always in UTC but it should be presented to the human in local time.

Valid properties:

3.2.10 The email address type

Type name: email (0x65 0x6d 0x61 0x69 0x6c)

Input type: An email address

The email type represents an email address.

Valid properties: Only the global properties.

3.2.11 The phone number type

Type name: tel (0x74 0x65 0x6c)

Input type: International ITU-T E.164 formatted phone number

The phone number type represents a phone number. Like with the date, the phone number MAY be presented to the human, and the human MAY input the phone number in another format, in which case the client would have to convert it to E.164 prior to sending it.

Valid properties:

4 Uncompliant document behavior

4.1 Client behavior

A client MUST be able to parse a fully compliant ATHN markup document correctly. However, some of the things that might make a document uncompliant might not cause a client like that to catch the error. If an uncompliant document is given to a client it MAY try to parse it to the best of its abilities, erroring (in a user friendly manner) if it cant make sense of it, this includes trying to fix the errors, but that would be advanced to implement and wont always work.

An example of such an uncompliance would be if the metadata tags are in the wrong order. Some clients might rely on the order of the metadata tags, and would interpret the document wrong or fail to interpret it entirely if the metadata tags are in the wrong order. While other clients might be completely oblivious to the order of the metadata tags and would interpret the document just fine. Both cases would be an acceptable way to handle the error.

Some clients, might be designed to validate the compliance of documents and provide helpful feedback if it finds any problems with them. These clients should of course always be completely accurate and correctly flag all possible uncompliant documents. They may also show recommendations for following best practices.

4.2 Server behavior

It's important to note here that a server refers to any piece of software that transmits ATHN documents, which includes CGI scripts and other ways of doing server side rendering.

Servers SHOULD have some mechanism in place to ensure that the documents they transmit are always compliant. There are many different types of server software and it's up to the developer of the server to determine how to best ensure this.

Here are some examples of how you might implement such a check:

Appendix: Design decisions that need to be discussed