Can't combine output types for strings with different encodings #810
Unfortunately, the encoding key does not support expressions. And note that even if it did, any encoding identifier would be a string, and every string literal in KS has to be enclosed in quotes:
    encoding: 'is_wide ? "utf16" : "utf8"'
    encoding: '"utf-8"' # ' for YAML, " for KS expression
This bug has been fixed in 0.9 in commit kaitai-io/kaitai_struct_compiler@375a140. Make sure you have the latest development 0.9 KS compiler installed (https://kaitai.io/#download), or make your life easier and use the devel Web IDE, which always has the latest KSC.
The other thing is that the expression value: 'bytes.to_s(utf8)' is incorrect, because the string utf8 must be enclosed in quotes to be a string literal (unquoted, it is an identifier that would refer to an attribute or instance like this one):
    instances:
      utf8:
        value: 'true ? "utf8" : "ascii"'
If any attribute called […]
The […]
    case class BytesLimitType(
      size: Ast.expr,
      terminator: Option[Int],
      include: Boolean,
      padRight: Option[Int],
      override val process: Option[ProcessExpr]
    ) extends BytesType
But if everything works correctly, this type diversity should not be a problem when you're e.g. working with […]
This is an instance of issue #318, and again, the solution is to download the latest development (unstable) 0.9 version of KSC or use the devel Web IDE.
This isn't directly related to the original question, but you should be aware that strz currently doesn't work correctly in combination with "wide" encodings like UTF-16 - see #187. In your case it should be easy to work around this bug though - because you have the exact length of the string, you could use str instead of strz and remove the zero terminator by reducing the length by 1 character, or using substring after reading the string.
Thanks. I'll do that
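The workaround described above (read a fixed-length string, then drop the terminator) can be sketched outside of Kaitai Struct in plain Python. This is only an illustration, not KSC output: the function name `read_fixed_utf16le` is hypothetical, and UTF-16LE is assumed as the wide encoding.

```python
def read_fixed_utf16le(raw: bytes) -> str:
    """Decode a fixed-size UTF-16LE field and drop one trailing NUL character.

    Hypothetical helper for illustration; not part of any Kaitai Struct runtime.
    """
    text = raw.decode("utf-16-le")
    # Strip the terminator after decoding, so the 2-byte NUL is handled as
    # one character rather than as two independent zero bytes.
    return text[:-1] if text.endswith("\x00") else text


# "hi" followed by a 2-byte terminator, as it would appear in the file.
raw = "hi\x00".encode("utf-16-le")
decoded = read_fixed_utf16le(raw)

# A byte-wise search for a single zero byte stops inside the first character,
# because every ASCII character in UTF-16LE carries a zero high byte.
first_zero_byte = raw.find(b"\x00")
```

Here `first_zero_byte` lands in the middle of the first character, which illustrates why a byte-oriented terminator scan is a poor fit for wide encodings.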
@hallipr, I'd just like to test your expectations here. UTF-8 and UTF-16 strings really are different data types: you cannot trivially cast from one to the other -- they have to be converted. So, even if you could get a single field to contain either one, I suspect that it wouldn't be long before the downstream code ran into problems. So, it strikes me that working hard to get the KSY to paper over the differences isn't actually going to save you anything.
@webbnh You're right that the string representation in the raw data is different, but immediately after the data is read, the generated parser will convert it into the target language's native string representation. So even though the raw data may have different encodings, the value stored in the field will always have the same type (e.g. […]).
@dgelessus, how does that work out in C++?
It's a bit more confusing when KS targets C++, because it represents both byte arrays and "true strings" (KS […]
Probably utf-16le strings should be wstrings in C++. BTW, how about std::span?
> KS_STR_ENCODING_ICONV causes all "true strings" to be converted to a common encoding (by default UTF-8)
It's unclear whether that will serve @hallipr -- it depends on whether his downstream code is prepared to handle UTF-8 instead of ASCII....
> BTW, how about std::span?
That's only available as of C++20; I, for one, am using C++11....
The downstream code works on "expected values" and opaque strings. For expected values, i.e. switch statements with constant cases, they're currently all ASCII, but there isn't any code expecting ASCII
I feel like this may be my issue also? I wanted to write this:

    text:
      doc: |
        A text string. Can be UTF-8 or UTF-16LE depending on the setting
        in the header.
      seq:
        - id: length
          type: u4
        - id: value
          type: str
          size: length
          encoding:
            switch-on: _root.header.encoding
            cases:
              0: UTF-16LE
              1: UTF-8

But I ended up having to hoist it up to the type:

    text:
      doc: |
        A text string. Can be UTF-8 or UTF-16LE depending on the setting
        in the header.
      seq:
        - id: value
          type:
            switch-on: _root.header.encoding
            cases:
              0: text_utf16
              1: text_utf8
    text_utf16:
      seq:
        - id: length
          type: u4
        - id: value
          type: str
          size: length
          encoding: UTF-16LE
    text_utf8:
      seq:
        - id: length
          type: u4
        - id: value
          type: str
          size: length
          encoding: UTF-8
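For comparison, here is a rough Python sketch of what the two hoisted types do when taken together: read a u4 length, read that many bytes, and decode them with the encoding chosen by the header flag. It is not generated parser code. The names `ENCODINGS` and `read_text` are hypothetical, and little-endian byte order for the u4 field is an assumption, since the spec's endianness is not shown in the thread.

```python
import struct
from typing import Tuple

# Hypothetical mapping that mirrors the switch cases in the KSY above.
ENCODINGS = {0: "utf-16-le", 1: "utf-8"}


def read_text(buf: bytes, offset: int, header_encoding: int) -> Tuple[str, int]:
    """Read a u4 byte length plus payload and decode it per the header flag.

    Returns the decoded string and the offset just past the payload.
    Assumes a little-endian u4 length (not confirmed by the thread).
    """
    (length,) = struct.unpack_from("<I", buf, offset)
    start = offset + 4
    raw = buf[start:start + length]
    return raw.decode(ENCODINGS[header_encoding]), start + length


# Build one sample record per encoding and parse both.
wide_payload = "héllo".encode("utf-16-le")
wide_buf = struct.pack("<I", len(wide_payload)) + wide_payload
narrow_payload = "héllo".encode("utf-8")
narrow_buf = struct.pack("<I", len(narrow_payload)) + narrow_payload

wide_result = read_text(wide_buf, 0, 0)
narrow_result = read_text(narrow_buf, 0, 1)
```

Both calls return an ordinary Python `str`, so the decoded values compare equal even though the raw bytes differ.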
I need to parse null-terminated strings with conditional encoding, UTF-16 or UTF-8.
The strings are stored as Length + Value, with length negated for UTF-16.
If I use separate types with type switching, wide_string (utf16) and narrow_string (utf8), then when comparing strings I'm forced to continually cast them to the same type. I'm not sure I can do this when I compile to C#.
I've tried several approaches that avoid type switching and casting:
ternary operators in encoding
    error: The encoding label provided ... is invalid.
byte array with ternary to_s
    error: don't know how to call method 'identifier(to_s)'
    I was surprised that field "bytes" was BytesLimitType and not a raw byte array.
conditional fields and a ternary instance value
    error: can't combine output types
This works but is hacky: conditional fields and a ternary instance, forcing string type using .reverse.reverse or .substring(0, length).
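To make the layout described above concrete, here is a minimal Python sketch of a reader for it. Several details are assumptions that the post does not state: the length is taken to be a signed 32-bit little-endian integer, it is taken to count characters (UTF-16 code units when negative) including the terminator, and the wide encoding is taken to be UTF-16LE. The name `read_string` is hypothetical.

```python
import struct
from typing import Tuple


def read_string(buf: bytes, offset: int = 0) -> Tuple[str, int]:
    """Read a length-prefixed, null-terminated string.

    Assumed layout (not confirmed by the thread): s4 little-endian length,
    negative for UTF-16LE, counting characters including the terminator.
    Returns the decoded string and the offset just past the payload.
    """
    (length,) = struct.unpack_from("<i", buf, offset)
    start = offset + 4
    if length < 0:
        # Negated length marks a wide string: two bytes per code unit.
        encoding, nbytes = "utf-16-le", -length * 2
    else:
        encoding, nbytes = "utf-8", length
    text = buf[start:start + nbytes].decode(encoding)
    # Drop the terminator after decoding, whichever encoding was used.
    if text.endswith("\x00"):
        text = text[:-1]
    return text, start + nbytes


# The same text stored once as a wide string and once as a narrow string.
wide = struct.pack("<i", -3) + "hi\x00".encode("utf-16-le")
narrow = struct.pack("<i", 3) + b"hi\x00"

wide_text, wide_end = read_string(wide)
narrow_text, narrow_end = read_string(narrow)
```

Because both branches end in the same native string type, the two results can be compared directly without any casting.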