Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define fragment identifiers for application/yaml #38

Merged
merged 6 commits into from
May 12, 2022

Conversation

eemeli
Copy link
Collaborator

@eemeli eemeli commented Apr 5, 2022

Closes #21

This defines application/yaml fragment identifiers to be parsed as YAML aliases, which currently means that they must point to an explicitly defined anchor in the document, a feature natively supported by YAML.

The definition intentionally allows for changes in later editions of the YAML spec to be automatically supported, e.g. as we're working towards supporting something like JSON pointers as well.

The language in the +yaml fragment identifier section seems a bit complex, and I'm not sure if it should be updated as well. Formats xxx/yyy+yaml should be allowed to define their own rules for fragment identifiers. Is this currently the case?

@eemeli eemeli requested a review from ioggstream April 5, 2022 09:32
@ioggstream
Copy link
Collaborator

@eemeli Can you please add an example YAML URI with fragment identifier, so that we could find a way to provide it?

@eemeli
Copy link
Collaborator Author

eemeli commented Apr 5, 2022

Can you please add an example YAML URI with fragment identifier, so that we could find a way to provide it?

Struggling a bit to figure out just what you're looking for here. For an example of how this would work, let's presume that we have file.yaml with the following contents:

%YAML 1.2
---
one: &foo scalar
two:
  - some
  - sequence
  - &bar items

Then, path/to/file.yaml#foo would be pointing at the node with the value scalar, while path/to/file.yaml#bar would point to the node with the value items.

Do you want this sort of example to be included in the RFC?

@ioggstream
Copy link
Collaborator

ioggstream commented Apr 5, 2022

@eemeli probably it could be useful to add either an example section or a normative section that explicits that. An alternative could be to add this information in the YAML spec. WDYT?

I imagine something brief like https://datatracker.ietf.org/doc/html/rfc6901#section-6 which includes some considerations on percent-encoding and some examples. A couple of question, for example:

  • yaml can be encoded in UTF-8, 16, 32. Can anchor/alias nodes identifier be non-ascii / non-utf8 encoded ?

@ioggstream
Copy link
Collaborator

The language in the +yaml fragment identifier section seems a bit complex, and I'm not sure if it should be updated as well.
Formats xxx/yyy+yaml should be allowed to define their own rules for fragment identifiers. Is this currently the case?

Let's discuss this topic in the issue #21

@eemeli
Copy link
Collaborator Author

eemeli commented Apr 5, 2022

@eemeli probably it could be useful to add either an example section or a normative section that explicits that. An alternative could be to add this information in the YAML spec. WDYT?

The YAML spec already includes sections on Node Anchors and Alias Nodes, which then include some examples of them in use. The intent here is to defer to that spec's definition of alias nodes.

yaml can be encoded in UTF-8, 16, 32. Can anchor/alias nodes identifier be non-ascii / non-utf8 encoded ?

At the point where the YAML spec defines anchors and aliases, it's treating its input as a sequence of Unicode code points, i.e. it doesn't care about their encoding. The YAML 1.2 set of acceptable characters for these is tbh far too wide, as it allows for nearly all printable Unicode code points.

@ioggstream
Copy link
Collaborator

ioggstream commented Apr 5, 2022

it's treating its input as a sequence of Unicode code points

foo: &però ciao
bar: *però

reading https://www.rfc-editor.org/rfc/rfc3986#section-3.5 iiuc I need to %encode the però string, right? In this case I am not sure how this should work with UTF-8, 16, 32... Can you make some examples?

(maybe related to ingydotnet/yaml-pm#127)

@ioggstream
Copy link
Collaborator

ioggstream commented Apr 12, 2022

## Fragment identification {#application-yaml-fragment}

This section describes how to use
named anchors (see Section 3.2.2.2 of [YAML])
as fragment identifier to designate a node.

A YAML named anchor can be represented in a URI fragment identifier
by encoding it into octects using UTF-8 {{!UTF-8==RFC3629}},
while percent-encoding those characters not allowed by the fragment rule
in {{Section 3.5 of URI}}. 

If multiple nodes would match a fragment identifier,
the first such match is selected.

Users concerned with interoperability of fragment identifiers:

- SHOULD limit named anchors to a set of characters
  that do not require encoding 
  to be expressed as URI fragment identifiers:
  this is always possible since named anchors are a serialization
  detail;
- SHOULD NOT use a named anchor that matches multiple nodes.

In the example resource below, the URL `file.yaml#foo`
references the anchor `foo` pointing to the node with value `scalar`;
whereas
the URL `file.yaml#bar` references the anchor `bar` pointing to the node
with value `[ some, sequence, items ]`.

~~~ example
%YAML 1.2
---
one: &foo scalar
two: &bar
  - some
  - sequence
  - items
~~~

## Fragment identification {#application-yaml-fragment}

This section describes how to use
named anchors (see Section 3.2.2.2 of [YAML])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using named anchors because they can be defined even when no alias is defined.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, this works for now. The intent of the original language was to account for an expected expansion of alias functionality in the YAML spec, which would allow with a document like this:

- &foo 
  bar:
    - 1
    - 2
    - 42

for an alias *foo/bar/2 to point at the value 42.

By referring directly to anchors here, such a later change to the YAML spec would not be reflected in the mediatype's fragment id.

To be clear, pathlike aliases are not yet valid and there's no fixed schedule for when we might get a YAML 1.3 spec out, so defining the mediatype according to current reality is an entirely valid thing to do.

Copy link
Collaborator

@ioggstream ioggstream Apr 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eemeli thanks for the insight! Some question then

  1. is / a valid character for a named anchor? If / is used as a pathlike separator, isn't using / in named anchors problematic?
  2. since keys can contain non-string characters, how can I address pathlike alias node such as *foo/bar/1 or *fizz/buzz/baz ?
- &foo
  bar:
    1: "integer"
   "1": "string"
- baz: *foo/bar/1
- &fizz
  "buzz/baz": "a"
  "buzz":
    "baz": "b"
- roc: *fizz/buzz/baz

If we want to achieve publication quickly, I think that using "named anchors" is easier, It is always possible to amend the media type registration and the according fragment identifiers interacting directly with IANA.

It is ok to spend some time trying to use alias nodes, provided that:

  1. we need to specify that the fragment identifier "should be interpreted as an alias node": this is because a named anchor might not be referenced by an alias node;
  2. since / is a valid key character, we need to encode it properly like it is done in json pointers. This is probably valid independently on the fragment identifier;
  3. for the sake of interoperability, I suggest to at least having an idea of how to handle the behavior of the above yaml document with the following fragments:
  • file.yaml#foo
  • file.yaml#foo/bar
  • file.yaml#foo/bar/1

Copy link
Collaborator Author

@eemeli eemeli Apr 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. is / a valid character for a named anchor? If / is used as a pathlike separator, isn't using / in named anchors problematic?

Yes, it's a valid character, and yes it's potentially problematic. Not fatally so, though, as preferentially matching the longest substring allows for a deterministic and pretty sensible resolution. I've a prototypical implementation of how this could work here: eemeli/yaml#380.

  1. since keys can contain non-string characters, how can I address pathlike alias node such as *foo/bar/1 or *fizz/buzz/baz ?

Badly. The resolution algorithm can end up pretty straightforward, but with degenerate cases like this that'll mean resolving one of the possible nodes while making the other one unaddressable via a pathlike alias. But it's possible to attach an anchor to a node, which circumvents the problem. That's also the solution for addressing e.g. nodes that are in a mapping and have a non-scalar key like { [ foo, bar ]: value }.

If we want to achieve publication quickly, I think that using "named anchors" is easier, It is always possible to amend the media type registration and the according fragment identifiers interacting directly with IANA.

Yeah, that's why I said referring to anchors directly should be ok for now. They'll need to continue working in the future as well, and any changes should just make expressions that currently fail potentially start resolving, rather than changing the meaning of anything that's currently valid.

If multiple nodes would match a fragment identifier,
the first such match is selected.

Users concerned with interoperability of fragment identifiers:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure whether anchor names allows all possible UTF-32, so here we suggest an interoperale behavior.

pyyaml for example only supports [a-zA-Z0-9\-_]+ for anchor names; I didn't test other implementations.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the above snippet looks odd. we're working on a media type registration. the above text seems to define/describe behavior that hopefully is well-defined for the format, and if it's not, then that's too bad but nothing that a media registration should attempt to change.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the YAML spec, the allowed characters in ns-anchor-name is literally everything up to \x10FFFF except for:

  • C0 and C1 control codes, though the Next Line character \x85 is allowed
  • \x20 | ',' | '[' | ']' | '{' | '}'
  • Surrogates [\xD800-\xD8FF]
  • the BOM character \xFEFF

That range is rather silly, and as @ioggstream noted, not supported by all implementations. Sticking to [\w-]+ is indeed recommended for interoperability.

Copy link
Collaborator

@ioggstream ioggstream Apr 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eemeli

  • do you think it's worth reducing the possible values of ns-anchor-name in a future YAML revision ? @dret 's comment is relevant.
  • what happens if I have something like *foo/bar/baz ?
- &foo
  "bar/baz": "a"
  "bar":
    "baz": "b"

for example, json pointers encodes them in a special way

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, reducing the valid space of anchor names is definitely planned. I even wrote up a proposal for it (yaml/yaml-spec#64), but the spec update progress has been a bit stop-and-go-ish.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eemeli it would be a major improvement :)


~~~ example
%YAML 1.2
---
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's a bug in kramdown: it seems it can't process --- in code blocks. I kludged it adding a space to each line. Alternatively, we could remove the %YAML header.

Comment on lines +116 to +119
A YAML named anchor can be represented in a URI fragment identifier
by encoding it into octects using UTF-8 {{!UTF-8=RFC3629}},
while percent-encoding those characters not allowed by the fragment rule
in {{Section 3.5 of URI}}.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lines taken from json pointer rfc

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't that read "referenced" instead of "represented"?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote "represent" because the serialization is different from the YAML's one.

@ioggstream ioggstream requested a review from dret April 12, 2022 15:10
@ioggstream
Copy link
Collaborator

Merging and moving discussion in #41

@ioggstream ioggstream merged commit 4ff8647 into ietf-wg-httpapi:main May 12, 2022
@eemeli eemeli deleted the anchor-fragments branch May 12, 2022 16:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fragment identifier for yaml and +yaml
3 participants