Add built-in process hex and base64 #668

Open
Mingun opened this issue Jan 10, 2020 · 17 comments

Comments

@Mingun

Mingun commented Jan 10, 2020

These are two widely used encoding schemes, so it would be great if Kaitai had built-in primitives for them.

@GreyCat
Member

GreyCat commented Jan 10, 2020

We're not adding any more built-in processes, given that we have pluggable modules now. We'll have a series of libraries that provide these widely used procedures instead. See https://github.com/kaitai-io/kaitai_compress, for example, for popular compression algorithms.

Please consider contributing something like that, but for hex and base64?

@Mingun
Author

Mingun commented Jan 10, 2020

OK, that is an appropriate solution (but when it is implemented, it would be good to have them available in WebIDE).

Are there any recommendations on how to create a processor for all supported languages and how end users should get these algorithms in their applications?

@GreyCat
Member

GreyCat commented Jan 10, 2020

but when it is implemented, it would be good to have them available in WebIDE

Yep, that's the plan: all these "common" libraries will be automatically available in WebIDE together with all their dependencies.

Are there any recommendations on how to create a processor for all supported languages

Custom processors are documented in http://doc.kaitai.io/user_guide.html#custom-process. Per-language specifics are supposed to be documented in the per-language notes on https://doc.kaitai.io, but in reality we're lagging behind on those documentation updates. Probably your best bet would be to copy the existing layout of kaitai_compress and start something like a "kaitai_common" or "kaitai_misc" collection of algorithms.
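
For Python, a processor in that kind of collection might look like the sketch below. It assumes the convention that a custom processor is a class exposing a `decode(data)` method that takes and returns bytes; the class names `Hex` and `Base64` are hypothetical and are not taken from any published Kaitai library.

```python
import base64
import binascii


class Hex:
    """Hypothetical processor: ASCII hex digits -> raw bytes."""

    def decode(self, data: bytes) -> bytes:
        # unhexlify accepts ASCII bytes directly, so no str round-trip is needed
        return binascii.unhexlify(data)


class Base64:
    """Hypothetical processor: base64 ASCII text -> raw bytes."""

    def decode(self, data: bytes) -> bytes:
        # validate=True rejects characters outside the base64 alphabet
        return base64.b64decode(data, validate=True)
```

With these definitions, `Hex().decode(b"cafe")` yields the two bytes `0xCA 0xFE`.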

and how end users should get these algorithms in their applications?

Installation is obviously language-dependent and is outlined in the Usage section of kaitai_compress.

@KOLANICH

KOLANICH commented Jan 10, 2020

process works on raw bytes. Hex and base64-encoded values are strings. I mean they may be utf-32be, or utf-16be, or utf-32le... So I guess process is a bit unsuitable here.

@GreyCat
Member

GreyCat commented Jan 10, 2020

Makes sense, but in reality 100% of the hex dumps I've seen so far were in ASCII. I can imagine a hex dump in UTF-16, but we might just introduce a special parameter for that in the processing routine, or maybe a special routine for these purposes.

Even from a performance standpoint, it doesn't make much sense to do a "real" conversion of that data to strings first and then do string-to-integer conversions.
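
A minimal sketch of that idea in Python: ASCII input is decoded directly at the byte level, and an optional parameter covers wider encodings. The helper name `unhex` and its `encoding` parameter are assumptions made for illustration, not an existing Kaitai API.

```python
import binascii


def unhex(data: bytes, encoding: str = "ascii") -> bytes:
    """Decode a hex dump stored as raw bytes (hypothetical helper)."""
    if encoding.lower() in ("ascii", "utf-8"):
        # Fast path: each hex digit is a single byte, decode without building a str
        return binascii.unhexlify(data)
    # Wide encodings: transcode to text first, then parse the digits
    return bytes.fromhex(data.decode(encoding))
```

For example, `unhex(b"4b53")` and `unhex("4b53".encode("utf-16-be"), "utf-16-be")` both give `b"KS"`.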

@KOLANICH

KOLANICH commented Jan 10, 2020 via email

@Mingun
Author

Mingun commented Jan 11, 2020

Hex and base64-encoded values are strings.

Not entirely. By definition, these conversions turn any byte sequence into a 7-bit byte sequence (i.e. an ASCII-encoded string) that can be safely transferred through some old protocols. They are presented as strings only for stupid humans (glory to robots!).

However, this problem can be solved if we represent those byte sequences in ksy as strings with a hex or base64 encoding, in the same way that we represent strings with ASCII or UTF-8 encoding (by the way, what encodings should be guaranteed to be supported by any kaitai-struct runtime?).

@GreyCat
Member

GreyCat commented Jan 11, 2020

what encodings should be guaranteed to be supported

See #116 and #393.

@dgelessus
Contributor

process works on raw bytes.

Any reason not to support process for strings?

The performance of the bytes-to-string conversion is unlikely to be an issue for ASCII - any decent language has optimizations for that common case (I know at least Java and Python do).

Conceptually I think hex/base64-encoded data should count as text strings. Hex is usually used to store arbitrary binary data in a format that can be read by humans (i. e. text), and nowadays base64 is almost exclusively used to convert arbitrary binary data to printable, ASCII-compatible text.

(Yes, base64 was originally developed to transfer 8-bit data over channels that might only be 7-bit and could clobber the 8th bit, but if you're parsing that kind of data you probably need to strip the 8th bit beforehand anyway.)

@KOLANICH

KOLANICH commented Jan 11, 2020

Any reason not to support process for strings?

Because process by definition works before any parsing of a field is done. The generated code

  1. carves the field
  2. processes it
  3. does parsing on processing result

It is a bytes-level operation.
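
The three steps above can be sketched in Python roughly as follows. This is not actual generated code: the function name, the use of hex as the process step, and the big-endian 32-bit integer being parsed are all assumptions chosen for illustration.

```python
import binascii
import io
import struct


def read_processed_field(stream: io.BytesIO, size: int) -> int:
    raw = stream.read(size)              # 1. carve the field
    processed = binascii.unhexlify(raw)  # 2. process it (bytes in, bytes out)
    sub = io.BytesIO(processed)          # 3. parse the processing result
    (value,) = struct.unpack(">I", sub.read(4))
    return value
```

Calling it on a stream that starts with the ASCII text `0000002A` and a size of 8 parses the integer 42 from the four decoded bytes.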

@dgelessus
Contributor

Good point, you still need to be able to use a regular byte process on string fields.

Perhaps the hex/base64 decoding should be done using string methods instead (i. e. something like string_field.decode_hex, which returns a byte array). There should be no need for an attribute ("process-str") here - a method call in a value instance would work just as well.

@KOLANICH

KOLANICH commented Jan 11, 2020

Making it a method will require it to be a part of every runtime. It'd be better to make it a separate auxiliary package. So IMHO it is better to have it as just a function.

@GreyCat
Member

GreyCat commented Jan 11, 2020

"Function" is actually the worst possible choice for such stuff: it's imperative, you basically show how to do the transformation one way, and it's far from trivial to do it the other way around. Things like process make it much more declarative:

  • you clearly determine that there is a transformation,
  • it's always in one predetermined position,
  • it's decoupled from the specification,
  • it's easily reversible: to implement serialization, you just need to provide an "encode" implementation alongside the "decode" one.
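
As a rough illustration of the reversibility point, a two-way processor in Python could pair the two directions in one class. The class name and the `encode` method are assumptions here; nothing in this thread confirms that any runtime defines such an interface.

```python
import base64


class Base64Processor:
    """Hypothetical two-way processor; not part of any Kaitai runtime."""

    def decode(self, data: bytes) -> bytes:
        # Used when parsing: encoded field -> raw bytes
        return base64.b64decode(data)

    def encode(self, data: bytes) -> bytes:
        # Used when serializing: raw bytes -> encoded field
        return base64.b64encode(data)
```

Because both directions live in one place, a serializer could call `encode` at exactly the position where the parser calls `decode`.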

@Mingun
Author

Mingun commented Jan 12, 2020

I think we can add another process phase. Right now, process would more accurately be named pre-process. So we just need to add post-process, which will transform the parsed result into its final form.

Then, we can write:

  - id: mac
    doc: Message Authentication Code (HEX)
    size: 8
    post-process: hex
    expect: _.size == 4 # valid from #435 , but that name is better, IMHO

This means: read 8 bytes, then apply the hex transformation (which, by convention, actually applies the unhex transformation, from HEX to bytes). Finally, assert that the size of the resulting array is 4 bytes, as it should be, just to make the intent clear.
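
In Python terms, the proposed semantics would amount to something like the following. The sample value `DEADBEEF` is made up, and `post-process` and `expect` remain proposed keys rather than existing ksy syntax.

```python
import binascii

raw = b"DEADBEEF"              # the 8 bytes read because of `size: 8`
mac = binascii.unhexlify(raw)  # the hex step: 8 ASCII digits -> 4 bytes
assert len(mac) == 4           # the `expect: _.size == 4` check
```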

@KOLANICH

instances are already present.

@Mingun
Author

Mingun commented Jan 17, 2020

Yes. Actually, in the case of hex and base64 even post-process is not required, because:

  • the parser creates a stream of size bytes
  • the parser feeds it into the process function
  • the parser does the actual parsing of the process result (not needed in this case or, equivalently, it is a 1-to-1 transformation)

As you think, can that algorithms to be added to the https://github.com/kaitai-io/kaitai_compress (and maybe rename it to more generic kaitai_algorithms), or better implement them in a separate repository?

@KOLANICH

KOLANICH commented Jan 18, 2020

As you think, can that algorithms to be added

you may have meant

How do you think, if that algorithms can be added

.

I'm personally pretty sure that it will never be merged that way. I mean, IMHO we don't need hex and base64 in decoders. We need them, but on other layers. These other layers are custom types. So feel free to create a repo of custom types with processors that cannot be implemented in KS alone. Also look at my PRs into KSF; they contain code for some such types.

and maybe rename it to more generic kaitai_algorithms

I have thought about renaming the kaitai_compress repo to kaitai_processors (and I have my own extended and refactored fork of that repo, not yet merged), but we strictly need interfaces (#314) first because of serialization.
