Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add utf16 string types to golang generator #244

Merged

Conversation

lucebac
Copy link
Contributor

@lucebac lucebac commented Feb 21, 2022

I needed UTF16 strings for some Kaitai structs and since Golang supports them similarly to the other encoding, I just added them to the generator.

Comment on lines 280 to 282
"utf16-auto" -> ("unicode.UTF16(unicode.LittleEndian, unicode.UseBOM)", "golang.org/x/text/encoding/unicode"),
"utf16-le" -> ("unicode.UTF16(unicode.LittleEndian, unicode.IgnoreBOM)", "golang.org/x/text/encoding/unicode"),
"utf16-be" -> ("unicode.UTF16(unicode.BigEndian, unicode.IgnoreBOM)", "golang.org/x/text/encoding/unicode")
Copy link
Member

@generalmimon generalmimon Feb 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. "utf16" is almost certainly not the canonical name in any language - in @GreyCat's table of encodings, it's referred to as "UTF-16{,BE,LE}" (notice the hyphen), the same for IANA list of character sets, RFC 2781, list of supported encodings in JavaScript (utf16 is not even recognized as an encoding1), in Python (utf16 is recognized, but utf16-be or utf16-le are not) etc.

    You must only use utf-16, utf-16be or utf-16le.

  2. "utf16-auto" is controversial. The -auto suffix is almost certainly non-standard, I cannot find a single environment using such identifier. It should probably be simply "utf-16". But I'm not even sure about the behavior - you used unicode.LittleEndian as the default encoding, which will be used if BOM is not present (as governed by unicode.UseBOM) - see https://pkg.go.dev/golang.org/x/[email protected]/encoding/unicode#UTF16:

    For UseBOM, if there is no starting BOM, it will proceed with the default Endianness.

    Should UTF-16LE be the default? I checked RFC 2781 and it suggests otherwise (yeah, it's an old memo, but how do you find up-to-date info of this kind):

    If the first two octets of the text is not 0xFE followed by
    0xFF, and is not 0xFF followed by 0xFE, then the text SHOULD be
    interpreted as being big-endian.

    This looks funny, because not everyone seems to care about such recommendations - JavaScript's "utf-16" is an alias for "utf-16le". I don't know why that is, but hey (https://developer.mozilla.org/en-US/docs/Web/API/Encoding_API/Encodings):

    Label Encoding
    "utf-16", "utf-16le" 'utf-16le'

    So JavaScript behaves exactly like it MUST NOT behave 😞

    Applications that process text with the "UTF-16"
    charset label MUST NOT assume the serialization without first
    checking the first two octets to see if they are a big-endian BOM, a
    little-endian BOM, or not a BOM.

    I haven't yet checked what other languages do when they decode UTF-16 from bytes without BOM, but it would be probably worth checking so that we know what to do. I've also found this official FAQ about UTF-16 and BOM - however, this particular case is not included there.

Footnotes

  1. This JavaScript snippet

    (new TextDecoder('utf16')).decode(new Uint8Array([0xFF, 0xFE, 0x3D, 0xD8, 0xDC, 0xDC]))
    

    raises the following error (the same goes for utf16-be and utf16-le):

    RangeError: Failed to construct 'TextDecoder': The encoding label provided ('utf16') is invalid.
    

Copy link
Contributor Author

@lucebac lucebac Mar 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your comment - and I am sorry for the long delay. There were other projects which had a higher priority but finally I have the time to work on this again.

I removed the -auto encoding and adjusted the other names to what is expected by the standard. Since you edited your comment: is there anything left to do then? As far as I can see there was only the naming left, which should be fixed now. The rest seems to only be caused by my erroneous naming.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2. I haven't yet checked what other languages do when they decode UTF-16 from bytes without BOM, but it would be probably worth checking so that we know what to do. I've also found this official FAQ about UTF-16 and BOM - however, this particular case is not included there.

I checked a few languages and the behavior of UTF-16 encoding (without explicit byte order) is largely inconsistent. Most languages respect BOM if present (Java, Python) but some do not (JavaScript, where utf-16 is an alias for utf-16le - little-endian BOM is ignored and big-endian BOM is simply copied as if it were a normal character, but the bytes are still interpreted in LE order), Python will use native endianness for UTF-16 if there's no BOM, Java defaults to big-endian if there is no BOM, etc.

Java code
import java.io.UnsupportedEncodingException;

public class Main {
	public static void main(String[] args) throws UnsupportedEncodingException {
		// UTF-16BE without BOM
		System.out.println(new String(new byte[]{/*                  */0x30,0x53,0x30,(byte)0x93,0x30,0x6B,0x30,0x61,0x30,0x6F}, "UTF-16"));
		// UTF-16BE with BOM
		System.out.println(new String(new byte[]{(byte)0xFE,(byte)0xFF,0x30,0x53,0x30,(byte)0x93,0x30,0x6B,0x30,0x61,0x30,0x6F}, "UTF-16"));
		// UTF-16LE with BOM
		System.out.println(new String(new byte[]{(byte)0xFF,(byte)0xFE,0x53,0x30,(byte)0x93,0x30,0x6B,0x30,0x61,0x30,0x6F,0x30}, "UTF-16"));

		// Output: "こんにちは\n" three times
	}
}
Python code (Windows x86_64 => little-endian)
# UTF-16BE with BOM
print(b'\xfe\xff0S0\x930k0a0o'.decode('utf-16'))
# UTF-16LE with BOM
print(b'\xff\xfeS0\x930k0a0o0'.decode('utf-16'))
# UTF-16LE without BOM
print(b'S0\x930k0a0o0'.decode('utf-16'))

# Output: "こんにちは\n" three times
JavaScript code
var strUtf16Le = (new TextDecoder('utf-16')).decode(new Uint8Array([0xFF, 0xFE, 0x3D, 0xD8, 0xDC, 0xDC]))
// => '📜'
var strUtf16Be = (new TextDecoder('utf-16')).decode(new Uint8Array([0xFE, 0xFF, 0xD8, 0x3D, 0xDC, 0xDC]))
console.log(strUtf16Be)
// => '�㷘�'
for (let codePoint of strUtf16Be) {
   console.log(codePoint.codePointAt(0).toString(16))
}
// => fffe, 3dd8, fffd

It would be nice to standardize it, but currently there's even not any format in KSF repo using encoding: UTF-16 without explicit LE or BE order (except nt_mdt_pal.ksy, where encoding: UTF-16 is an error and it should be UTF-16LE), so there's probably no point in making such an effort. But we can at least warn about it or ban it altogether, as @GreyCat suggests in kaitai-io/kaitai_struct#391 (comment).

@generalmimon
Copy link
Member

@lucebac Note: don't worry about failing tests, that's normal 🙂

Copy link
Member

@generalmimon generalmimon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution!

@generalmimon generalmimon merged commit 61981a8 into kaitai-io:master Apr 7, 2022
@lucebac lucebac deleted the add-golang-strings branch April 8, 2022 15:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants