Unit formats

Unit is a self-similar higher-order notation for hypermedia. It can be manifested in various mediums through multiple formats. Specifically, Unit has a text format, a code format, a binary format, a graphical format, and a logical format. This document presents a brief overview of these formats. As explained in the rationale, the purpose of this notation is to power a structured hypermedia environment. Through this unified notation, it is possible to engineer a coherent and tightly integrated full-stack information system where every layer, from persistence to communication to presentation, works in unison, thus affording great economics and ergonomics for both the engineering and usability aspects of the system.

Text Format

Text was the medium through which the design of Unit emerged. This format offers a clear picture of the notation, aiding the comprehension of its semantic and structural properties.

Article "Unit – A Self-Similar Higher-Order Semantic Notation"
  .Status "Created at"
    Datetime "2024-01-29T15:11:54.840Z"
  .Property "ID"
    UUIDv4 "35F102D8-A967-49FA-BF70-897503CDDCBA"
  .Credit "Author"
    Name "Jon Secchis"
  .Attribute "Tags"
    Tag "notation"
    Tag "infosys"
  Section* "4A342773-C7D1-4ACF-8768-F35E4541AF8C"
  Section* "716F791E-289B-4B31-8315-7BB11A3CB0A3"
  Section* "259D161A-3564-4F81-B115-1526C9D9EF30"

The structure presented in the code block above is sufficient to demonstrate all components of the notation. Now, let’s relate them.

Minor Unit

A Type-Value ordered pair

Major Unit

A rooted Minor Unit subtree

Meta Unit

A prefixed-root Major Unit

Data Unit

An unprefixed-root Major Unit

Relation

A sequence of sibling Major Units

Meta Relation

A sequence of sibling Meta Units

Data Relation

A sequence of sibling Data Units

That’s about it. Through the orderly and straightforward hierarchical composition of simple constructs, the notation is sufficient to express complex information in a semantic and typed fashion.
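As a rough illustration of how little machinery the text format needs, the sketch below reads it into flat (depth, type, value) records. It assumes two spaces per nesting level, double-quoted values without escaped quotes, and a leading `.` on the type as the meta marker; `Node` and `parse_text` are hypothetical names, not part of Unit.

```rust
/// One line of the text format: nesting depth, type, value.
/// `Node` is a hypothetical name used only for this sketch.
#[derive(Debug, PartialEq)]
struct Node {
    depth: usize,
    ty: String,
    value: String,
}

impl Node {
    /// Meta units are recognised by the `.` prefix on their type.
    fn is_meta(&self) -> bool {
        self.ty.starts_with('.')
    }
}

/// Parses lines shaped like `  Type "Value"`.
/// Assumes two spaces per level and no escaped quotes inside values;
/// lines that do not match this shape are skipped.
fn parse_text(src: &str) -> Vec<Node> {
    src.lines()
        .filter(|line| !line.trim().is_empty())
        .filter_map(|line| {
            let indent = line.len() - line.trim_start().len();
            let body = line.trim();
            // The type sits before the first quote, the value after it.
            let (ty, rest) = body.split_once('"')?;
            let value = rest.strip_suffix('"')?;
            Some(Node {
                depth: indent / 2,
                ty: ty.trim_end().to_string(),
                value: value.to_string(),
            })
        })
        .collect()
}
```

The output is already in depth-first order, so it maps directly onto the numbered rows used by the code format.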

Code Format

The code block below depicts the instantiation of a unit structure within the Rust programming language. It’s the format I use in my experiments.

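// Assumed definition (not part of the original listing): a tuple struct
// wrapping depth-first rows of (nesting level, type, value).
struct Unit(Vec<(u8, &'static str, &'static str)>);

fn main() {
    let unit = Unit(vec![
        (0, "Article", "Unit – A Self-Similar Higher-Order Semantic Notation"),
        (1, ".Status", "Created at"),
        (2, "Datetime", "2024-01-29T15:11:54.840Z"),
        (1, ".Property", "ID"),
        (2, "UUIDv4", "35F102D8-A967-49FA-BF70-897503CDDCBA"),
        (1, ".Credit", "Author"),
        (2, "Name", "Jon Secchis"),
        (1, ".Attribute", "Tags"),
        (2, "Tag", "notation"),
        (2, "Tag", "infosys"),
        (1, "Section*", "4A342773-C7D1-4ACF-8768-F35E4541AF8C"),
        (1, "Section*", "716F791E-289B-4B31-8315-7BB11A3CB0A3"),
        (1, "Section*", "259D161A-3564-4F81-B115-1526C9D9EF30"),
    ]);
    // Report how many minor units the structure holds.
    println!("{} minor units", unit.0.len());
}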

As you can see, nesting levels are encoded as numbers. I aimed for a baseline format that could function similarly in any programming language, as virtually all languages provide some type of ordered collection. Representing hierarchies with nested constructs by default would require additional complexity for modeling types and constructing values, particularly in strongly typed languages. Admittedly, manipulating these flat structures requires special procedures, but they should be simple. So far, my experiments have only involved assembling, serializing and rendering immutable units. I may be biased by the limited scope of the current developments, but I find it rather convenient to have all nodes already laid out in a depth-first arrangement, which is the natural order of the notation. Furthermore, by not interning the instances into language-specific constructs, the structures are made more amenable to dispatching and functional processing.
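To give a feel for the kind of "special procedure" a flat layout calls for, here is a minimal sketch that recovers each row's parent from the depth numbers alone. The `Row` alias and the `parents` function are hypothetical, and the sketch assumes well-formed input where depth never increases by more than one between consecutive rows.

```rust
/// A flat, depth-first row: (depth, type, value).
/// `Row` is a hypothetical alias for this sketch.
type Row = (usize, &'static str, &'static str);

/// Returns, for each row, the index of its parent row (None for the root).
/// Relies on depth-first order: the parent is the nearest preceding row
/// whose depth is one less than the current row's depth.
fn parents(rows: &[Row]) -> Vec<Option<usize>> {
    // Indices of the current chain of ancestors, root first.
    let mut stack: Vec<usize> = Vec::new();
    let mut out = Vec::with_capacity(rows.len());
    for (i, &(depth, _, _)) in rows.iter().enumerate() {
        // Drop every ancestor at this depth or deeper.
        stack.truncate(depth);
        out.push(stack.last().copied());
        stack.push(i);
    }
    out
}
```

A single pass with a small stack is enough, which is one reason the depth-first arrangement is convenient to work with.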

Binary Format

A serialized unit structure in binary format is organized as a sequence of fixed-length rows. Currently, I’m utilizing 64-bit rows.

0       4       8      16      32      48      64
┆       ┆       ┆       ┆       ┆       ┆       ┆
┌───────┬───────┬───────┬───────┬───────┬───────┐
│ TPADD │ VPADD │ TROWS │ VROWS │ MROWS │ DROWS │
├───────┴───────┴───────┴───────┴───────┴───────┤
│ TYPE                                          │
├───────────────────────────────────────────────┤
│ VALUE                                         │
├───────┬───────┬───────┬───────┬───────┬───────┤
│ TPADD │ VPADD │ TROWS │ VROWS │ MROWS │ DROWS │
├───────┴───────┴───────┴───────┴───────┴───────┤
│ TYPE                                          │
├───────────────────────────────────────────────┤
│ VALUE                                         │
├───────┬───────┬───────┬───────┬───────┬───────┤
│ TPADD │ VPADD │ TROWS │ VROWS │ MROWS │ DROWS │
├───────┴───────┴───────┴───────┴───────┴───────┤
│ TYPE                                          │
├───────────────────────────────────────────────┤
│ VALUE                                         │
├───────────────────────────────────────────────┤
................................................. 

The diagram above illustrates the binary layout. In this form, each minor unit consists of:

  1. a header;
  2. one or more rows for the type string;
  3. zero or more rows for the value string;

Both the type and value strings are padded to span an integral number of rows. As unit complexes consist of hierarchical and ordered compositions of minor units, a serialized structure is essentially a concatenation of these minor units. To account for the hierarchy, the header of each minor unit stores the lengths of its meta and data relations. The header layout is structured as follows:

  1. TPADD – a 4-bit unsigned integer storing the byte length of the type padding;
  2. VPADD – a 4-bit unsigned integer storing the byte length of the value padding;
  3. TROWS – an 8-bit unsigned integer storing the row count of the type string;
  4. VROWS – a 16-bit unsigned integer storing the row count of the value string;
  5. MROWS – a 16-bit unsigned integer storing the row count of the meta relation;
  6. DROWS – a 16-bit unsigned integer storing the row count of the data relation;
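The header fields add up to exactly 64 bits (4 + 4 + 8 + 16 + 16 + 16), so a header fits in one row. The sketch below packs and unpacks such a row. The field widths come from the list above, but the bit order (first field in the most significant bits) is an assumption, as is the `Header` type itself.

```rust
/// Header fields of a serialized minor unit.
/// `Header` is a hypothetical type; widths follow the list above.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Header {
    tpadd: u8,  // 4 bits: bytes of type padding
    vpadd: u8,  // 4 bits: bytes of value padding
    trows: u8,  // 8 bits: rows of the type string
    vrows: u16, // 16 bits: rows of the value string
    mrows: u16, // 16 bits: rows of the meta relation
    drows: u16, // 16 bits: rows of the data relation
}

impl Header {
    /// Packs the fields into one 64-bit row.
    /// Assumption: the first field occupies the highest bits.
    fn pack(self) -> u64 {
        ((self.tpadd as u64 & 0xF) << 60)
            | ((self.vpadd as u64 & 0xF) << 56)
            | ((self.trows as u64) << 48)
            | ((self.vrows as u64) << 32)
            | ((self.mrows as u64) << 16)
            | (self.drows as u64)
    }

    /// Inverse of `pack`, under the same bit-order assumption.
    fn unpack(row: u64) -> Header {
        Header {
            tpadd: ((row >> 60) & 0xF) as u8,
            vpadd: ((row >> 56) & 0xF) as u8,
            trows: ((row >> 48) & 0xFF) as u8,
            vrows: ((row >> 32) & 0xFFFF) as u16,
            mrows: ((row >> 16) & 0xFFFF) as u16,
            drows: (row & 0xFFFF) as u16,
        }
    }
}
```

Byte order on disk or on the wire is a separate decision that this sketch leaves open.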

The lengths stored in MROWS and DROWS represent the cumulative number of rows across all minor units within each relation. For example, consider a data unit c with empty relations, and both type and value shorter than the row width. The total row count for c would be 3: 1 for the header, 1 for the type, and 1 for the value. Now, suppose there’s another data unit p that includes c as its only child in its data relation. In this case, the DROWS value in p’s header will store the value 3. If additional child units are added to c, whether meta or data units, the combined row count of all children of c needs to be added to p’s DROWS. Therefore, in broad terms, the serialization process of a complex entails (1) sequentially processing minor units, (2) concatenating their serialized forms, and (3) propagating updates to the appropriate relation lengths up through the hierarchy. With the specified field widths, the following limits are established:

  1. type – up to 255 rows (2,040 bytes);
  2. value – up to 65,535 rows (524,280 bytes);
  3. meta relation – up to 65,535 rows;
  4. data relation – up to 65,535 rows;

I designed units to be small, allowing a significant number of them to fit in memory. However, there are cases where units need to be big. Storing lengths as multiples of rows, rather than as byte counts, reduces header overhead and accommodates larger payloads. Although the limits above are more than sufficient to store entire articles and even thumbnails in a single value field, they are not sufficient to store large blobs, as 65,535 rows of 64 bits each amount to 524,280 bytes. To support payloads larger than that, special strategies are employed.
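The row accounting behind MROWS and DROWS can be sketched as follows. This is only an illustration under stated assumptions: units arrive as a depth-first list of (depth, type, value), meta units are those whose type starts with `.`, and type strings are counted as written. The function names are hypothetical, and the parent lookup is a simple linear scan rather than anything optimized.

```rust
const ROW_BYTES: usize = 8; // 64-bit rows

/// Rows needed to hold `len` bytes, padded up to a whole row.
fn rows_for(len: usize) -> usize {
    (len + ROW_BYTES - 1) / ROW_BYTES
}

/// Rows of one minor unit on its own: header + type rows + value rows.
fn own_rows(ty: &str, value: &str) -> usize {
    1 + rows_for(ty.len()) + rows_for(value.len())
}

/// For a depth-first list of (depth, type, value), returns (MROWS, DROWS)
/// per unit: the cumulative rows of its meta and data children, including
/// everything nested below them.
fn relation_rows(units: &[(usize, &str, &str)]) -> Vec<(usize, usize)> {
    let n = units.len();
    let mut total = vec![0usize; n]; // rows of each unit's whole subtree
    let mut rel = vec![(0usize, 0usize); n];
    // Walk backwards so every child is finished before its parent.
    for i in (0..n).rev() {
        let (depth, ty, value) = units[i];
        total[i] = own_rows(ty, value) + rel[i].0 + rel[i].1;
        // The nearest preceding unit one level up is the parent.
        if let Some(p) = (0..i).rev().find(|&j| units[j].0 + 1 == depth) {
            if ty.starts_with('.') {
                rel[p].0 += total[i];
            } else {
                rel[p].1 += total[i];
            }
        }
    }
    rel
}
```

For the p and c example, `relation_rows(&[(0, "List", "p"), (1, "Item", "c")])` yields a DROWS of 3 for p. The value ceiling follows from the same arithmetic: 65,535 rows times 8 bytes is 524,280 bytes.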

Large Types

Since the 255 rows addressable by TROWS are far more than enough to store types, larger types are simply not supported.

Large Values

The byte lengths of paddings are stored as 4-bit integers, which support numbers from 0 to 15. But since rows are 8 bytes wide, the maximum padding a value can have is 7. So, to encode values larger than 65,535 rows, the extra room available in VPADD is leveraged. Specifically, (1) VROWS is set to its maximum supported value of 0xFFFF, (2) VPADD is set to the actual padding plus 8, and (3) the correct row count for the value is stored as a 64-bit unsigned integer just before the value contents. These adjustments extend the format to values of practically unlimited size.

VPADD = 8..15;
VROWS = 65535;

0       4       8      16      32      48      64
┆       ┆       ┆       ┆       ┆       ┆       ┆
┌───────┬───────┬───────┬───────┬───────┬───────┐
│ TPADD │ VPADD │ TROWS │ VROWS │ MROWS │ DROWS │
├───────┴───────┴───────┴───────┴───────┴───────┤
│ TYPE                                          │
├───────────────────────────────────────────────┤
│ VROWS 64                                      │
├───────────────────────────────────────────────┤
│ LARGE VALUE                                   │
................................................. 
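A minimal sketch of the large-value rule, assuming 8-byte rows as above. `ValueFields`, `encode_value_fields` and `decode_value_fields` are hypothetical names; the sketch only computes the header fields and the optional 64-bit row count, and does not write any bytes.

```rust
const ROW_BYTES: u64 = 8;
const VROWS_MAX: u64 = 0xFFFF;

/// Value-related header fields, plus the optional 64-bit row count that
/// precedes large values. `ValueFields` is a hypothetical type.
#[derive(Debug, PartialEq)]
struct ValueFields {
    vpadd: u8,
    vrows: u16,
    vrows64: Option<u64>,
}

/// Computes the fields for a value of `byte_len` bytes.
fn encode_value_fields(byte_len: u64) -> ValueFields {
    let rows = (byte_len + ROW_BYTES - 1) / ROW_BYTES;
    let pad = (rows * ROW_BYTES - byte_len) as u8; // always 0..=7
    if rows <= VROWS_MAX {
        ValueFields { vpadd: pad, vrows: rows as u16, vrows64: None }
    } else {
        // Large value: flag it by adding 8 to the padding and pinning VROWS.
        ValueFields { vpadd: pad + 8, vrows: 0xFFFF, vrows64: Some(rows) }
    }
}

/// Recovers (row count, padding bytes). Returns None if the large-value
/// flag is set but the 64-bit row count is missing.
fn decode_value_fields(f: &ValueFields) -> Option<(u64, u8)> {
    if f.vpadd >= 8 {
        Some((f.vrows64?, f.vpadd - 8))
    } else {
        Some((u64::from(f.vrows), f.vpadd))
    }
}
```

Note that a value of exactly 65,535 rows still uses the regular form here, since the text reserves the extended form for values larger than that and VPADD alone distinguishes the two cases.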

Large Relations

To handle large relations, yet another device is employed. Given that a minor unit spans at least 2 rows (1 for the header and 1 for the type), a non-empty relation can never be exactly 1 row long, so the value 1 in either MROWS or DROWS is reserved to mean that the relation is larger than 65,535 rows. When that is the case, the correct row count for the relation is stored as a 64-bit unsigned integer just before its contents.

MROWS = 1;

0       4       8      16      32      48      64
┆       ┆       ┆       ┆       ┆       ┆       ┆
┌───────┬───────┬───────┬───────┬───────┬───────┐
│ TPADD │ VPADD │ TROWS │ VROWS │ MROWS │ DROWS │
├───────┴───────┴───────┴───────┴───────┴───────┤
│ TYPE                                          │
├───────────────────────────────────────────────┤
│ VALUE                                         │
├───────────────────────────────────────────────┤
│ MROWS 64                                      │
├───────┬───────┬───────┬───────┬───────┬───────┤
│ TPADD │ VPADD │ TROWS │ VROWS │ MROWS │ DROWS │
├───────┴───────┴───────┴───────┴───────┴───────┤
│ TYPE                                          │
├───────────────────────────────────────────────┤
│ VALUE                                         │
................................................. 

MROWS = 1;
DROWS = 1;

0       4       8      16      32      48      64
┆       ┆       ┆       ┆       ┆       ┆       ┆
┌───────┬───────┬───────┬───────┬───────┬───────┐
│ TPADD │ VPADD │ TROWS │ VROWS │ MROWS │ DROWS │
├───────┴───────┴───────┴───────┴───────┴───────┤
│ TYPE                                          │
├───────────────────────────────────────────────┤
│ VALUE                                         │
├───────────────────────────────────────────────┤
│ MROWS 64                                      │
├───────┬───────┬───────┬───────┬───────┬───────┤
│ TPADD │ VPADD │ TROWS │ VROWS │ MROWS │ DROWS │
├───────┴───────┴───────┴───────┴───────┴───────┤
│ TYPE                                          │
├───────────────────────────────────────────────┤
│ VALUE                                         │
................................................. 
├───────────────────────────────────────────────┤
│ DROWS 64                                      │
├───────┬───────┬───────┬───────┬───────┬───────┤
│ TPADD │ VPADD │ TROWS │ VROWS │ MROWS │ DROWS │
├───────┴───────┴───────┴───────┴───────┴───────┤
│ TYPE                                          │
├───────────────────────────────────────────────┤
│ VALUE                                         │
.................................................
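The reserved value 1 can be handled with a pair of small helpers, sketched below. The names `relation_field` and `relation_len` are hypothetical. Whether the extra 64-bit row itself counts toward any enclosing relation's length is not stated in the text, so the sketch leaves that question aside.

```rust
/// Chooses the 16-bit header field (MROWS or DROWS) for a relation of
/// `rows` rows, plus the optional 64-bit count stored before its contents.
fn relation_field(rows: u64) -> (u16, Option<u64>) {
    // A relation is either empty or at least 2 rows long, never exactly 1.
    debug_assert!(rows != 1, "a non-empty relation spans at least 2 rows");
    if rows > 0xFFFF {
        (1, Some(rows)) // reserved marker: real count lives in the extra row
    } else {
        (rows as u16, None)
    }
}

/// Resolves the actual row count from a header field. `ext` is the 64-bit
/// row that precedes the relation when the field holds the reserved value 1.
fn relation_len(field: u16, ext: Option<u64>) -> Option<u64> {
    match field {
        1 => ext, // None here means the extended count is missing
        n => Some(u64::from(n)),
    }
}
```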

The guiding principles behind this design are intended to keep the format simple, lightweight, general, and consistent. The notation’s inherent malleability and native structural and semantic affordances allow it to handle various protocol requirements as data within the format. This approach aims to guarantee that changes to the format are never necessary, ensuring high compatibility and longevity for the design.

Graphical Format

Since the notation yields hypermedia graphs, the graphical format is navigable and hence has a time component to it. For this reason, it cannot be fully comprehended through images or diagrams. I’m still perfecting an interactive proof-of-concept to showcase the notation working as a fully-featured hypermedia environment. In the meantime, I’ll provide this screen recording of a contrived design prototype, demonstrating the primitive input and navigation affordances of the graphical format.

Logical Format

The logical format defines Unit formally. Given that the notation is defined recursively, each element is presented separately. Beyond defining Unit logically, I’ll briefly outline its semantic properties.

Definition

Minor Unit (u)

An ordered pair of Type (t) and Value (v). Both components are concretely expressed by primitives, that is, opaque and uninterpreted values. The function of u is to define the identity of Major Units.

u ⟹ (t,v)

Major Unit (U)

An ordered pair of Major Type (T) and Major Value (V). U is the Root Unit of a complex.

U ⟹ (T,V)

Major Type (T)

A Minor Unit (u).

T ⟹ u

Major Value (V)

An ordered pair of Meta Relation (MR) and Data Relation (DR).

V ⟹ (MR , DR)

Meta Relation (MR)

A sequence of Major Units with Meta semantics (UM), or simply Meta Units. UM is structurally identical to U.

MR ⟹ [ UM , UM , UM … ] : UM ≡ U

Data Relation (DR)

A sequence of Major Units with Data semantics (UD), or simply Data Units. UD is structurally identical to U.

DR ⟹ [ UD , UD , UD … ] : UD ≡ U
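The recursive definition above maps naturally onto a nested data type. The sketch below is one possible rendering, not the representation used in my experiments, which rely on the flat form shown in the Code Format section. The type and field names are hypothetical, and types and values are held as opaque byte strings.

```rust
/// Minor Unit: u ⟹ (t, v). Both parts are opaque byte strings.
#[derive(Debug, Clone, PartialEq)]
struct MinorUnit {
    t: Vec<u8>,
    v: Vec<u8>,
}

/// Major Unit: U ⟹ (T, V), with T ⟹ u and V ⟹ (MR, DR).
/// Meta and data units share this one structure (UM ≡ U, UD ≡ U);
/// only the relation they sit in tells them apart.
#[derive(Debug, Clone, PartialEq)]
struct MajorUnit {
    major_type: MinorUnit, // T
    meta: Vec<MajorUnit>,  // MR
    data: Vec<MajorUnit>,  // DR
}

impl MajorUnit {
    /// Builds a unit with empty relations.
    fn new(t: &str, v: &str) -> Self {
        MajorUnit {
            major_type: MinorUnit {
                t: t.as_bytes().to_vec(),
                v: v.as_bytes().to_vec(),
            },
            meta: Vec::new(),
            data: Vec::new(),
        }
    }

    /// Counts the minor units in this complex, the root included.
    fn minor_count(&self) -> usize {
        1 + self
            .meta
            .iter()
            .chain(self.data.iter())
            .map(|u| u.minor_count())
            .sum::<usize>()
    }
}
```

Because the meta or data role is carried by position, no prefix is needed on the type in this form.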

Semantic Properties

Meta Semantics. Because Meta Units function as extensions to their parent unit, akin to edges on a graph, the Major Type (T) component of a Meta Unit (UM) is expected to express an identity proper for relationships. In other words, the identity of UM should convey an intangible or immaterially-inclined concept. In practice, in the context (t,v) ∈ T ∈ UM, the Type (t) component should express the identity of an association class and the Value (v) component should express the nominal value of that association class.

Unit "U"
  .Attribute "Nominal Value"
    …
  .Dependency "Nominal Value"
    …
  .Property "Nominal Value"
    …
  .Relationship "Nominal Value"
    …
  .Rule "Nominal Value"
    …
  .Status "Nominal Value"
    …

Data Semantics. Because Data Units function as top-level units, as attachments to data units, and as relation targets of meta units, the Major Type (T) component of a Data Unit (UD) is expected to express an identity proper for entities. In other words, the identity of UD should convey a tangible or materially-inclined concept. In practice, in the context (t,v) ∈ T ∈ UD, the Type (t) component should express the identity of a data format and the Value (v) component should express a nominal value for that format.

Directory "Samples"
  Hex String "0x0F"
  HTML "<p>Hello</p>"
  Integer "12"
  Date "1970-01-01"
  Paragraph "Hello"
  RGBA Color "(255,255,255,1)"

Q&A

Q: Are people supposed to write content with this notation?

No. At least not directly. The notation was designed so that a structured editor with a chat-like input-and-control affordance can be offered to users of an information system built upon it. This screen recording of a prototype editor hints at how the input process should happen. I have found chat-like interfaces to be the most accessible, scalable and easily learnable form of user interaction.

Q: What is the formal relation between types and values?

First, it should be clear that the notation defines a system of nominal typing. For simple types, that is, types that are terminal and can be fully represented by a primitive, the type defines the string syntax for its immediate value. For compound types, those that define a domain of values of, possibly, other types, the type defines a structural syntax, which can also be regarded as a dependency tree or as a template. Simple types will be simple validation functions built on a value parser, most commonly thin wrappers around a regular expression that return a boolean. However, it is not clear how compound types should be structured as a function. Categorically, there are static compound types, where the domain is fixed and is captured by a fully determined and unchanging dependency tree, and there are dynamic compound types, which have multiple valid dependency trees. In dynamic compound types, the validity requirements change in response to changes within the current state of the dependency tree, which is a requirement for the implementation of complex parametric structures. In this case, the requirements for presence, absence, cardinality or value of a given field or subtree are potentially determined by the presence, absence, cardinality, or value of another field or subtree.

Q: How is this notation self-similar?

Through the prism of mediation, we say that (1) a type mediates a value; (2) a typed-value mediates a meta relation; and (3) a type-valued-meta-relation mediates a data relation. Conversely, (1) a value is data subject to a type; (2) a meta relation is data subject to a typed-value; and (3) a data relation is data subject to a type-valued-meta-relation. Given that these pair-wise mediation relationships are symmetric and propagate recursively through every component within unit complexes, we can assert that the notation is self-similar, both concretely through its spatiotemporal disposition and abstractly through its logic.

Q: What’s the value of self-similarity?

Self-similarity provides complete predictability of structural forms. This, in turn, allows systems to develop robust data-independent heuristics.

Q: How are types enforced?

Types are supposed to be enforced through shareable functional dictionaries at the application level. The notation has no role in enforcing types and there is no concept of built-in types.
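One way to picture such a functional dictionary is a map from type names to validation functions, sketched below. This is an assumption about shape rather than a description of an existing implementation: the `Validator` alias, the function names, and the specific rules for "Integer" and "Hex String" are all hypothetical, with the two type names borrowed from the samples above.

```rust
use std::collections::HashMap;

/// A simple-type validator: takes the value string, says whether it is valid.
/// `Validator` is a hypothetical alias for this sketch.
type Validator = fn(&str) -> bool;

/// Assumed rule: an "Integer" is anything that parses as a 64-bit integer.
fn is_integer(v: &str) -> bool {
    v.parse::<i64>().is_ok()
}

/// Assumed rule: a "Hex String" is `0x` followed by one or more hex digits.
fn is_hex_string(v: &str) -> bool {
    match v.strip_prefix("0x") {
        Some(digits) => !digits.is_empty() && digits.chars().all(|c| c.is_ascii_hexdigit()),
        None => false,
    }
}

/// Builds an application-level dictionary of known simple types.
fn dictionary() -> HashMap<&'static str, Validator> {
    let mut dict: HashMap<&'static str, Validator> = HashMap::new();
    dict.insert("Integer", is_integer);
    dict.insert("Hex String", is_hex_string);
    dict
}

/// Checks a (type, value) pair. Returns None when the type is unknown,
/// since the notation itself has no built-in types to fall back on.
fn check(dict: &HashMap<&'static str, Validator>, ty: &str, value: &str) -> Option<bool> {
    dict.get(ty).map(|validate| validate(value))
}
```

Returning `None` for unknown types keeps the decision about unrecognized types with the application, which matches the notation having no role in enforcement.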

Q: Are types and values required to be strings?

The notation does not impose any encoding restrictions. Contingent on the medium used to instantiate the structures, types and values can be expressed by arbitrary byte strings.

Q: What’s with the verbose metadata?

Concise constructs commonly used to express metadata come with their own trade-offs. In Unit, metadata is first-class. For instance, the meta unit .Status "Created at" has edge semantics, extending its parent unit with a Status-type relation named “Created at”. At the cost of compactness, this design offers several advantages, including (1) the openness to more intuitive and humane identifiers, (2) enhanced expressivity for modeling through the explicit typing of metadata, (3) support for multi-valued and typed metadata, and (4) the possibility for metadata to have its own metadata, as meta units are structurally identical to data units, i.e. both can have meta and data relations simultaneously.

I’d like to improve this Q&A, so if you happen to have some questions, email them to me through jon at this domain.