add utf16 string types to golang generator #244

lucebac · 2022-02-21T16:05:47Z

I needed UTF16 strings for some Kaitai structs and since Golang supports them similarly to the other encoding, I just added them to the generator.

generalmimon · 2022-02-21T22:18:26Z

shared/src/main/scala/io/kaitai/struct/translators/GoTranslator.scala

+    "utf16-auto" -> ("unicode.UTF16(unicode.LittleEndian, unicode.UseBOM)", "golang.org/x/text/encoding/unicode"),
+    "utf16-le" -> ("unicode.UTF16(unicode.LittleEndian, unicode.IgnoreBOM)", "golang.org/x/text/encoding/unicode"),
+    "utf16-be" -> ("unicode.UTF16(unicode.BigEndian, unicode.IgnoreBOM)", "golang.org/x/text/encoding/unicode")


"utf16" is almost certainly not the canonical name in any language - in @GreyCat's table of encodings, it's referred to as "UTF-16{,BE,LE}" (notice the hyphen), the same for IANA list of character sets, RFC 2781, list of supported encodings in JavaScript (utf16 is not even recognized as an encoding¹), in Python (utf16 is recognized, but ~~utf16-be~~ or ~~utf16-le~~ are not) etc.

You must only use utf-16, utf-16be or utf-16le.

"utf16-auto" is controversial. The -auto suffix is almost certainly non-standard, I cannot find a single environment using such identifier. It should probably be simply "utf-16". But I'm not even sure about the behavior - you used unicode.LittleEndian as the default encoding, which will be used if BOM is not present (as governed by unicode.UseBOM) - see https://pkg.go.dev/golang.org/x/[email protected]/encoding/unicode#UTF16:

For UseBOM, if there is no starting BOM, it will proceed with the default Endianness.

Should UTF-16LE be the default? I checked RFC 2781 and it suggests otherwise (yeah, it's an old memo, but how do you find up-to-date info of this kind):

If the first two octets of the text is not 0xFE followed by
0xFF, and is not 0xFF followed by 0xFE, then the text SHOULD be
interpreted as being big-endian.

This looks funny, because not everyone seems to care about such recommendations - JavaScript's "utf-16" is an alias for "utf-16le". I don't know why that is, but hey (https://developer.mozilla.org/en-US/docs/Web/API/Encoding_API/Encodings):

Label Encoding

"utf-16", "utf-16le" 'utf-16le'

So JavaScript behaves exactly like it MUST NOT behave 😞

Applications that process text with the "UTF-16"
charset label MUST NOT assume the serialization without first
checking the first two octets to see if they are a big-endian BOM, a
little-endian BOM, or not a BOM.

I haven't yet checked what other languages do when they decode UTF-16 from bytes without BOM, but it would be probably worth checking so that we know what to do. I've also found this official FAQ about UTF-16 and BOM - however, this particular case is not included there.

Footnotes

This JavaScript snippet

(new TextDecoder('utf16')).decode(new Uint8Array([0xFF, 0xFE, 0x3D, 0xD8, 0xDC, 0xDC]))

raises the following error (the same goes for ~~utf16-be~~ and ~~utf16-le~~):

RangeError: Failed to construct 'TextDecoder': The encoding label provided ('utf16') is invalid.
↩

Thanks for your comment - and I am sorry for the long delay. There were other projects which had a higher priority but finally I have the time to work on this again.

I removed the -auto encoding and adjusted the other names to what is expected by the standard. Since you edited your comment: is there anything left to do then? As far as I can see there was only the naming left, which should be fixed now. The rest seems to only be caused by my erroneous naming.

2. I haven't yet checked what other languages do when they decode UTF-16 from bytes without BOM, but it would be probably worth checking so that we know what to do. I've also found this official FAQ about UTF-16 and BOM - however, this particular case is not included there.

I checked a few languages and the behavior of UTF-16 encoding (without explicit byte order) is largely inconsistent. Most languages respect BOM if present (Java, Python) but some do not (JavaScript, where utf-16 is an alias for utf-16le - little-endian BOM is ignored and big-endian BOM is simply copied as if it were a normal character, but the bytes are still interpreted in LE order), Python will use native endianness for UTF-16 if there's no BOM, Java defaults to big-endian if there is no BOM, etc.

Java code

import java.io.UnsupportedEncodingException; public class Main { public static void main(String[] args) throws UnsupportedEncodingException { // UTF-16BE without BOM System.out.println(new String(new byte[]{/* */0x30,0x53,0x30,(byte)0x93,0x30,0x6B,0x30,0x61,0x30,0x6F}, "UTF-16")); // UTF-16BE with BOM System.out.println(new String(new byte[]{(byte)0xFE,(byte)0xFF,0x30,0x53,0x30,(byte)0x93,0x30,0x6B,0x30,0x61,0x30,0x6F}, "UTF-16")); // UTF-16LE with BOM System.out.println(new String(new byte[]{(byte)0xFF,(byte)0xFE,0x53,0x30,(byte)0x93,0x30,0x6B,0x30,0x61,0x30,0x6F,0x30}, "UTF-16")); // Output: "こんにちは\n" three times } }

Python code (Windows x86_64 => little-endian)

# UTF-16BE with BOM print(b'\xfe\xff0S0\x930k0a0o'.decode('utf-16')) # UTF-16LE with BOM print(b'\xff\xfeS0\x930k0a0o0'.decode('utf-16')) # UTF-16LE without BOM print(b'S0\x930k0a0o0'.decode('utf-16')) # Output: "こんにちは\n" three times

JavaScript code

var strUtf16Le = (new TextDecoder('utf-16')).decode(new Uint8Array([0xFF, 0xFE, 0x3D, 0xD8, 0xDC, 0xDC])) // => '📜' var strUtf16Be = (new TextDecoder('utf-16')).decode(new Uint8Array([0xFE, 0xFF, 0xD8, 0x3D, 0xDC, 0xDC])) console.log(strUtf16Be) // => '�㷘�' for (let codePoint of strUtf16Be) { console.log(codePoint.codePointAt(0).toString(16)) } // => fffe, 3dd8, fffd

It would be nice to standardize it, but currently there's even not any format in KSF repo using encoding: UTF-16 without explicit LE or BE order (except nt_mdt_pal.ksy, where encoding: UTF-16 is an error and it should be UTF-16LE), so there's probably no point in making such an effort. But we can at least warn about it or ban it altogether, as @GreyCat suggests in kaitai-io/kaitai_struct#391 (comment).

generalmimon · 2022-03-21T20:33:46Z

@lucebac Note: don't worry about failing tests, that's normal 🙂

generalmimon

Thanks for the contribution!

add utf16 string types

b308535

generalmimon requested changes Feb 21, 2022

View reviewed changes

apply correct encoding names; remove utf16-auto

0e0d656

generalmimon approved these changes Apr 7, 2022

View reviewed changes

generalmimon merged commit 61981a8 into kaitai-io:master Apr 7, 2022

lucebac deleted the add-golang-strings branch April 8, 2022 15:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add utf16 string types to golang generator #244

add utf16 string types to golang generator #244

lucebac commented Feb 21, 2022 •

edited

Loading

generalmimon Feb 21, 2022 •

edited

Loading

lucebac Mar 21, 2022 •

edited

Loading

generalmimon Mar 28, 2022

generalmimon commented Mar 21, 2022

generalmimon left a comment

add utf16 string types to golang generator #244

add utf16 string types to golang generator #244

Conversation

lucebac commented Feb 21, 2022 • edited Loading

generalmimon Feb 21, 2022 • edited Loading

Choose a reason for hiding this comment

Footnotes

lucebac Mar 21, 2022 • edited Loading

Choose a reason for hiding this comment

generalmimon Mar 28, 2022

Choose a reason for hiding this comment

generalmimon commented Mar 21, 2022

generalmimon left a comment

Choose a reason for hiding this comment

lucebac commented Feb 21, 2022 •

edited

Loading

generalmimon Feb 21, 2022 •

edited

Loading

lucebac Mar 21, 2022 •

edited

Loading