add utf16 string types to golang generator #244
Conversation
"utf16-auto" -> ("unicode.UTF16(unicode.LittleEndian, unicode.UseBOM)", "golang.org/x/text/encoding/unicode"), | ||
"utf16-le" -> ("unicode.UTF16(unicode.LittleEndian, unicode.IgnoreBOM)", "golang.org/x/text/encoding/unicode"), | ||
"utf16-be" -> ("unicode.UTF16(unicode.BigEndian, unicode.IgnoreBOM)", "golang.org/x/text/encoding/unicode") |
- "utf16" is almost certainly not the canonical name in any language: in @GreyCat's table of encodings it is referred to as "UTF-16{,BE,LE}" (notice the hyphen), and the same goes for the IANA list of character sets, RFC 2781, the list of supported encodings in JavaScript ("utf16" is not even recognized as an encoding [1]), in Python ("utf16" is recognized, but "utf16-le" or "utf16-be" are not), etc. You must only use "utf-16", "utf-16be" or "utf-16le".
- "utf16-auto" is controversial. The "-auto" suffix is almost certainly non-standard; I cannot find a single environment using such an identifier. It should probably be simply "utf-16". But I'm not even sure about the behavior: you used unicode.LittleEndian as the default endianness, which is applied when no BOM is present (as governed by unicode.UseBOM) - see https://pkg.go.dev/golang.org/x/[email protected]/encoding/unicode#UTF16:

  "For UseBOM, if there is no starting BOM, it will proceed with the default Endianness."

  Should UTF-16LE be the default? I checked RFC 2781 and it suggests otherwise (yes, it's an old memo, but how else do you find up-to-date info of this kind?):

  "If the first two octets of the text is not 0xFE followed by 0xFF, and is not 0xFF followed by 0xFE, then the text SHOULD be interpreted as being big-endian."

  This looks funny, because not everyone seems to care about such recommendations - JavaScript's "utf-16" is an alias for "utf-16le". I don't know why that is, but hey (https://developer.mozilla.org/en-US/docs/Web/API/Encoding_API/Encodings):

  Label: "utf-16", "utf-16le" => Encoding: 'utf-16le'

  So JavaScript behaves exactly like it MUST NOT behave 😞:

  "Applications that process text with the 'UTF-16' charset label MUST NOT assume the serialization without first checking the first two octets to see if they are a big-endian BOM, a little-endian BOM, or not a BOM."

  I haven't yet checked what other languages do when they decode UTF-16 from bytes without a BOM, but it would probably be worth checking so that we know what to do. I've also found the official Unicode FAQ about UTF-16 and the BOM; however, this particular case is not covered there.
Footnotes

[1] This JavaScript snippet

    (new TextDecoder('utf16')).decode(new Uint8Array([0xFF, 0xFE, 0x3D, 0xD8, 0xDC, 0xDC]))

raises the following error (the same goes for 'utf16-le' and 'utf16-be'):

    RangeError: Failed to construct 'TextDecoder': The encoding label provided ('utf16') is invalid.
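To make RFC 2781's recommendation concrete: the sketch below is a hypothetical helper (not the generator's actual code, which goes through golang.org/x/text) that decodes UTF-16 the way the RFC suggests - honor a BOM if present, otherwise assume big-endian. Note how this differs from unicode.UseBOM with a unicode.LittleEndian default, which falls back to little-endian when no BOM is found.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unicode/utf16"
)

// decodeUTF16 is a hypothetical helper following RFC 2781:
// honor a BOM if present, otherwise assume big-endian.
func decodeUTF16(b []byte) string {
	var order binary.ByteOrder = binary.BigEndian // RFC 2781 default when no BOM
	if len(b) >= 2 {
		switch {
		case b[0] == 0xFE && b[1] == 0xFF:
			order, b = binary.BigEndian, b[2:] // big-endian BOM, strip it
		case b[0] == 0xFF && b[1] == 0xFE:
			order, b = binary.LittleEndian, b[2:] // little-endian BOM, strip it
		}
	}
	units := make([]uint16, 0, len(b)/2)
	for i := 0; i+1 < len(b); i += 2 {
		units = append(units, order.Uint16(b[i:i+2]))
	}
	return string(utf16.Decode(units))
}

func main() {
	// "こ" (U+3053): BE with BOM, LE with BOM, and bare BE (no BOM).
	fmt.Println(decodeUTF16([]byte{0xFE, 0xFF, 0x30, 0x53})) // こ
	fmt.Println(decodeUTF16([]byte{0xFF, 0xFE, 0x53, 0x30})) // こ
	fmt.Println(decodeUTF16([]byte{0x30, 0x53}))             // こ
}
```

All three inputs decode to the same string under the RFC's rules; under a little-endian no-BOM default, the third one would not.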
Thanks for your comment - and I am sorry for the long delay. There were other projects which had a higher priority but finally I have the time to work on this again.
I removed the "-auto" encoding and adjusted the other names to what the standard expects. Since you edited your comment: is there anything left to do? As far as I can see, only the naming remained, which should be fixed now. The rest seems to have been caused by my erroneous naming.
> 2. I haven't yet checked what other languages do when they decode UTF-16 from bytes without BOM, but it would be probably worth checking so that we know what to do. I've also found this official FAQ about UTF-16 and BOM - however, this particular case is not included there.
I checked a few languages, and the behavior of the "UTF-16" encoding (without explicit byte order) is largely inconsistent. Most languages respect the BOM if present (Java, Python), but some do not (JavaScript, where "utf-16" is an alias for "utf-16le": a little-endian BOM is stripped, and a big-endian BOM is simply copied as if it were a normal character, while the bytes are still interpreted in LE order). Python will use native endianness for "UTF-16" if there's no BOM, Java defaults to big-endian if there is no BOM, etc.
Java code
import java.io.UnsupportedEncodingException;
public class Main {
public static void main(String[] args) throws UnsupportedEncodingException {
// UTF-16BE without BOM
System.out.println(new String(new byte[]{/* */0x30,0x53,0x30,(byte)0x93,0x30,0x6B,0x30,0x61,0x30,0x6F}, "UTF-16"));
// UTF-16BE with BOM
System.out.println(new String(new byte[]{(byte)0xFE,(byte)0xFF,0x30,0x53,0x30,(byte)0x93,0x30,0x6B,0x30,0x61,0x30,0x6F}, "UTF-16"));
// UTF-16LE with BOM
System.out.println(new String(new byte[]{(byte)0xFF,(byte)0xFE,0x53,0x30,(byte)0x93,0x30,0x6B,0x30,0x61,0x30,0x6F,0x30}, "UTF-16"));
// Output: "こんにちは\n" three times
}
}
Python code (Windows x86_64 => little-endian)
# UTF-16BE with BOM
print(b'\xfe\xff0S0\x930k0a0o'.decode('utf-16'))
# UTF-16LE with BOM
print(b'\xff\xfeS0\x930k0a0o0'.decode('utf-16'))
# UTF-16LE without BOM
print(b'S0\x930k0a0o0'.decode('utf-16'))
# Output: "こんにちは\n" three times
JavaScript code
var strUtf16Le = (new TextDecoder('utf-16')).decode(new Uint8Array([0xFF, 0xFE, 0x3D, 0xD8, 0xDC, 0xDC]))
// => '📜'
var strUtf16Be = (new TextDecoder('utf-16')).decode(new Uint8Array([0xFE, 0xFF, 0xD8, 0x3D, 0xDC, 0xDC]))
console.log(strUtf16Be)
// => '�㷘�'
for (let codePoint of strUtf16Be) {
console.log(codePoint.codePointAt(0).toString(16))
}
// => fffe, 3dd8, fffd
It would be nice to standardize this, but currently there isn't even a single format in the KSF repo using encoding: UTF-16 without an explicit LE or BE order (except nt_mdt_pal.ksy, where encoding: UTF-16 is an error and should be UTF-16LE), so there's probably no point in making such an effort. But we can at least warn about it or ban it altogether, as @GreyCat suggests in kaitai-io/kaitai_struct#391 (comment).
@lucebac Note: don't worry about failing tests, that's normal 🙂
Thanks for the contribution!
I needed UTF-16 strings for some Kaitai structs, and since Golang supports them similarly to the other encodings, I just added them to the generator.