Scanning (seek, search) for attribute start & end positions #538
These new properties are in fact doing a job too similar to pos:

pos: _io.size
scan-start: kaitai.byte_reverse(0x42)

can be

pos: _io.size - kaitai.byte_reverse(0x42)

A scanning op is translated into an offset from the current position. Arithmetic ops are not commutative:

for i in range(x):
    self._io.seek(self._io.pos() + a)
    f(self._io)  # seeks internally
    self._io.seek(self._io.pos() + b)

An instance with pos: ... and value: kaitai.byte_reverse(0x42) would just result in an offset of the first match.

Algo ideas:
For value instances with pos, the initial pos is remembered, then the function doing the work is called, then pos is retrieved, the original one is subtracted, the value is stored, and the original pos is restored. So terminator may also be eliminated.
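A rough Python sketch of that idea (purely illustrative: the function name is made up, and the io object is assumed to offer the same pos()/seek() calls as in the pseudo-code above, not any specific runtime API):

def scan_as_offset(io, scan_fn):
    orig_pos = io.pos()           # remember the initial position
    scan_fn(io)                   # the scanning function seeks internally
    offset = io.pos() - orig_pos  # how far the scan moved the stream
    io.seek(orig_pos)             # restore the original position
    return offset                 # usable as an offset in a pos: expression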
Do you mean alignment? I think the modes you have given as examples should be hidden behind a single interface.
What I want to emphasize here is that scanning typically involves a lot of reading (trial and error). This process eventually moves the IO stream position, and this is actually what we want. There's no point in going back and forth like "memorize old position, scan, get & return new position, jump back to old position, realize that we've got a new position, jump to the new position".
This is much, much more complex to implement, has lots of restrictions due to these symbolic ops, potentially leads to more arithmetic, jumps back and forth, etc. With all that, I don't really see any upsides to this solution.
Effectively, yes.
There will be a separate namespace, as all these monikers are processed in separate YAML keys, i.e.:

For processing routines, it can be:
The proposed syntax is just an interface: it behaves as if the funcs were returning an offset from the current position, where the current position is given by a preceding part of the expression. The underlying impl doesn't have to actually compute any expressions matching this pattern.
In fact, not all arithmetic ops, only the ones involved in non-commutative function calls. So part of the ops in the expression are commutative, part are not.
Yes, I have thought about it. But it feels like an inconsistency. Maybe third-party packages must also contain the needed namespace, and we should add the info about where it lies into the KS "header" for opaque funcs?
I agree with the big proposal. Just a question: what is the point of having two different implementations for Byte and Bytes? We could merge those. Following the example of @GreyCat in Java, a possible implementation in Python would be:

class Scanner:
    def __init__(self, pattern):
        self._pattern = pattern
        self._len = len(pattern)
        self._border = self._kmppre()

    def _kmppre(self):
        # Knuth-Morris-Pratt failure function ("border" table)
        border = [-1]
        match_index = -1
        for i in range(0, self._len):
            while match_index >= 0 and self._pattern[i] != self._pattern[match_index]:
                match_index = border[match_index]
            match_index += 1
            border.append(match_index)
        return border

    def _scan(self, io):
        match_index = 0  # for KMP search
        while True:
            c = io.read(1)
            if not c:  # end of stream, pattern not found
                raise EOFError("pattern not found before end of stream")
            while match_index >= 0 and c[0] != self._pattern[match_index]:
                match_index = self._border[match_index]
            match_index += 1
            if match_index == self._len:  # it's a match
                return

    def scanToStart(self, io):
        self._scan(io)
        io.seek(io.pos() - self._len)

    def scanToEnd(self, io):
        self._scan(io)

The complexity here is O(M), where M is the number of bytes between the initial position of the IO and its end position. The implementation I used is from my team's algorithmic competition notebook, icpc-notebook-vasito.
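For illustration, a small usage sketch of the Scanner class above (the _BytesIOWithPos wrapper is only there to provide the pos() method the class expects; it is not part of any Kaitai runtime):

import io

class _BytesIOWithPos(io.BytesIO):
    # minimal stand-in for a Kaitai-style stream: adds pos() on top of BytesIO
    def pos(self):
        return self.tell()

stream = _BytesIOWithPos(b"header***payload")
Scanner(b"***").scanToEnd(stream)        # leaves the stream right after "***"
print(stream.pos())                      # 9
Scanner(b"payload").scanToStart(stream)  # leaves the stream at the start of "payload"
print(stream.read(7))                    # b'payload'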
I don't know what that means or what its purpose is. I read a little bit but couldn't figure out the usage.
Just synced with @tinrodriguez8 in the chat. Here's the broad plan: we start with simple and very basic stuff. Let's start with, say, 2 or 3 languages (Python and Java sounds like a good plan), and eventually add more as it grows.

Before you start
Please check that:
Tests
We'll need to create some tests and everything needed for them to run. Tests are generally to be put in the tests repo. Given that this whole thing is somewhat similar to custom processing, it's best to look up the relevant tests and their implementations, e.g. the process_custom.ksy test. In general, for this to run, we'll need:
Ideally, we'd like to move away from writing specs manually for every single language, and there's a thing called KST translator, but I'm not sure at this stage it will be possible to make use of it, given that these specs would likely require loading/importing of external stuff. Or maybe it will work?
A Python custom scanner probably won't require anything extra, but a Java custom scanner will probably need to implement a certain formal interface. If we're happy with the proposed interface, then we can just start by putting it into the java runtime repo, along with CustomDecoder (which is the interface for custom processing routines). For the contents of the test custom scanner, I'd start, again, with something very basic and simple, for example a scanner that reproduces terminator-like behavior.

After that we'll have the tests, which would obviously fail at first due to lack of support in the compiler. Then we can start working on the compiler.

Compiler
It will need updates in 2 places:
Format parsing is generally done in shared/src/main/scala/io/kaitai/struct/format, namely AttrSpec / InstanceSpec. The code generation update would probably be the hardest part of all this, as it involves several different files in the code gen pipeline. It would probably be easier to reach out to me at that stage, and I can explain in more detail.
For visibility: @tinrodriguez8 started a PR here: kaitai-io/kaitai_struct_compiler#166
Driving scenarios
Normally, binary formats are well-defined and do not require the parser to employ trial-and-error tactics to pinpoint exact locations of the data.
However, in practice, in quite a few cases we need to go to a location identified not by some sort of exact address, but by some arbitrary properties of what's around that location. The simplest case of that is "read the string until the null byte appears", which is currently taken care of by terminator / consume / include semantics. Much more complicated use cases exist, for example scanning until a 00 00 group is encountered.

The proposal is to make a unified framework that could be used in both:
- a pos: ... substitute in instances, to be executed before parsing of the element to find the starting position
- a terminator: ... or size: ... substitute, to determine the ending position for the element

Terminology proposal
Let's call it "scanning", as this is actually what would be going on. "Seeking" is a bad choice of word, as typically seek is an API call which moves the stream pointer to a certain pinpointed location and does not involve any scanning actions. "Searching" is a more generic term, which does not reflect the fact that this is a pretty expensive operation.

KSY syntax proposal
In instances, one would be able to specify a scan-start key with the "scan specification" described below. When such an instance is invoked, the first thing it will do before the actual reading is invoke the scan algorithm to position the stream pointer properly.

Instead of size or terminator, one would be able to specify a scan-end key with a "scan specification" as well.

The contents of this "scan specification" would be akin to current "process" specifications, i.e. scan_mode_name(param1, param2, ...). We can start with a few "scan modes" and add more of them later.

Initial ideas for scan modes to implement:
- kaitai.byte(0) — reimplement single-byte terminator, scan until a certain byte is encountered (0 in this example)
- kaitai.bytes([0x45, 0x4e, 0x44]) — scan until a sequence of 3 bytes is encountered
- kaitai.bytes_group(2, [0, 0]) — two-byte terminator working at groups of 2 bytes every time
- kaitai.byte_reverse(0) — same as "byte", but scans in reverse direction

Examples of usage
Scans for the *** [42, 42, 42] 3-byte sequence, parses u4 after that:

The same as terminator: 0:

Scans until OrbShdr is encountered, catches everything up to that byte sequence as a substream and parses it using the my_custom_body type:

The same, but includes OrbShdr into the substream as well:

For instances, scan-start can be combined with pos. In this case, pos is executed first, and scanning starts working after that. This scans from the end of the file backwards, searching for the first 0x42:

Implementation proposal
All scan specifications will internally invoke a certain implementation class, very similar to the current custom processing classes implementation. Scan mode names will map to these class names. The special prefix kaitai. would be used to designate scan mode implementations bundled with the runtime library.

For Java, for example, this could be:
Implementation for kaitai.byte might be:
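A minimal Python sketch of what such a bundled scan mode could look like; the class name KaitaiScanByte and the scan_to_start/scan_to_end methods are assumptions modelled on the Scanner class above, not an agreed-upon runtime API:

class KaitaiScanByte:
    # hypothetical bundled scan mode: scan forward until a given byte is found
    def __init__(self, term):
        self._term = term

    def scan_to_end(self, io):
        # leaves the stream positioned right after the terminator byte
        while True:
            c = io.read(1)
            if not c:
                raise EOFError("terminator byte not found")
            if c[0] == self._term:
                return

    def scan_to_start(self, io):
        # leaves the stream positioned right at the terminator byte
        self.scan_to_end(io)
        io.seek(io.pos() - 1)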
scan-start: byte(42) + type: u1 will be translated into something like:
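A sketch of the parsing logic this could translate to, reusing the hypothetical KaitaiScanByte above and assuming, as in the *** example, that the attribute starts right after the scanned marker; this is not actual compiler output, and the io object is assumed to offer read()/pos()/seek() as in the Scanner class:

def read_foo(io):
    # scan-start: byte(42): skip forward until the 0x2a marker has been consumed
    KaitaiScanByte(42).scan_to_end(io)
    # type: u1: then parse the attribute itself at the new position
    return io.read(1)[0]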
scan-end: byte(42) + include: true for a byte array will be something like:
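A corresponding sketch for this case, under the same assumptions as the previous snippet:

def read_bar(io):
    start = io.pos()
    KaitaiScanByte(42).scan_to_end(io)  # stop right after the 0x2a terminator
    end = io.pos()                      # include: true keeps the terminator
    io.seek(start)
    return io.read(end - start)         # byte array including the trailing 0x2a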
Finally, scan-end: byte(42) + include: false (which is the default) will be something like:
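And a sketch along the same lines for the default include: false case; whether the terminator byte itself is consumed afterwards is an assumption here, mirroring the existing consume semantics:

def read_baz(io):
    start = io.pos()
    KaitaiScanByte(42).scan_to_start(io)  # stop right at the 0x2a terminator
    end = io.pos()                        # include: false leaves it out
    io.seek(start)
    data = io.read(end - start)           # byte array without the terminator
    io.seek(end + 1)                      # consume the terminator itself
    return data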