Joachim Breitner's Homepage
arbtt goes Binary
Three weeks ago, I announced the automatic rule-based time tracker here, and it seems that there are actually users out there :-). Since then, it has recorded more than 240 hours of my computer’s uptime in about 15000 samples. Until now, this data was stored in a simple text file, one line per entry, relying on Haskell’s Show/Read instances for the serialization. Although not quite unexpected, this turned out to be a severe bottleneck: it already takes more than 25 seconds to parse the log file on my computer.
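For illustration, the old scheme boils down to something like the following sketch; the type and function names (Sample, appendSample, readSamples) are made up for illustration and not the actual arbtt code. Each entry is written with show, read back with reads, and lines that do not parse are simply skipped.

    -- Sketch of the old line-based format, with a made-up Sample type
    -- standing in for the real data type.
    data Sample = Sample { timestamp :: Integer, windowTitle :: String }
        deriving (Show, Read)

    appendSample :: FilePath -> Sample -> IO ()
    appendSample file s = appendFile file (show s ++ "\n")

    -- Lines that do not parse (e.g. from a partial write) are simply skipped.
    readSamples :: FilePath -> IO [Sample]
    readSamples file = do
        ls <- lines `fmap` readFile file
        return [ s | l <- ls, [(s, "")] <- [reads l] ]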
Following advice given to me on the #haskell IRC channel, I switched to a binary representation of the data, using the very nice Data.Binary module. The capture program will automatically detect whether the log file is in the old format, move it away and convert the entries to binary data. And voilà, the statistics program evaluates my data in about two seconds! This should be fast enough for quite a while, I hope.
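In code, the basic idea looks roughly like this sketch, again with the made-up Sample type rather than the actual arbtt code: the capture program appends one encoded entry per sample, and the statistics program reads the whole file back in a single Get run.

    -- Sketch of the binary log, using Data.Binary. Sample is again a
    -- made-up stand-in type, not the actual arbtt data type.
    import Data.Binary (Binary(..), encode)
    import Data.Binary.Get (runGet, isEmpty)
    import qualified Data.ByteString.Lazy as BL

    data Sample = Sample { timestamp :: Integer, windowTitle :: String }

    instance Binary Sample where
        put (Sample t w) = put t >> put w
        get              = do t <- get; w <- get; return (Sample t w)

    -- The capture program appends one encoded entry per sample.
    appendSample :: FilePath -> Sample -> IO ()
    appendSample file s = BL.appendFile file (encode s)

    -- The statistics program decodes entries until the file is exhausted.
    readSamples :: FilePath -> IO [Sample]
    readSamples file = runGet getAll `fmap` BL.readFile file
      where
        getAll = do
            done <- isEmpty
            if done
              then return []
              else do s <- get; rest <- getAll; return (s : rest)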
In my Binary instances, which you can find in Data.hs, I prepended a version tag to each entry. This hopefully allows me to add more fields to the log file later, while still being able to parse the old data directly and without conversion. To still be able to manually inspect the recorded data, I added the program arbtt-dump to the package. The new version is uploaded to Hackage as 0.3.0.
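In sketch form, such a version-tagged instance looks roughly like this, extending the made-up Sample instance from above; the real instances live in Data.hs.

    -- Sketch of a version-tagged Binary instance, extending the sketch above.
    import Data.Binary (Binary(..))
    import Data.Binary.Get (getWord8)
    import Data.Binary.Put (putWord8)

    data Sample = Sample { timestamp :: Integer, windowTitle :: String }

    instance Binary Sample where
        put (Sample t w) = do
            putWord8 1        -- version tag, to be bumped when fields are added
            put t
            put w
        get = do
            version <- getWord8
            case version of
                1 -> do t <- get; w <- get; return (Sample t w)
                _ -> fail ("unknown log entry version: " ++ show version)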
One thing still worries me: With the old format, I could easily throw away unparseable lines in the log file (e.g. from a partial write) and still read the rest of it. With the new code, there is no such robustness: If the file breaks somewhere, it will be unreadable as a whole, or at least from the broken point onwards. I’m not sure what to do about that problem, but given the very low number of bytes written each time, I hope that it will just not happen.
Comments
A TLV scheme may eliminate your need to escape values, though you may still wish to have sigil T values to aid in well-formedness checks. This scheme will work with any data whatsoever. If you have more information about the format, e.g. the records are a fixed size, they have a structured format where errors can be detected, or there are certain bit patterns that can't occur, you could potentially simplify the above scheme.
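For concreteness, a minimal type-length-value framing could be sketched on top of Data.Binary's Put and Get monads like this; the names (putTLV, getTLV, frameRecord) are generic illustrations and not part of arbtt.

    -- Generic sketch of type-length-value (TLV) framing on top of
    -- Data.Binary's Put/Get monads; not tied to arbtt's types.
    import Data.Binary.Put (Put, putWord8, putWord32be, putLazyByteString, runPut)
    import Data.Binary.Get (Get, getWord8, getWord32be, getLazyByteString)
    import Data.Word (Word8)
    import qualified Data.ByteString.Lazy as BL

    -- T: a type byte (the "sigil"), L: the payload length, V: the payload.
    putTLV :: Word8 -> BL.ByteString -> Put
    putTLV tag value = do
        putWord8 tag
        putWord32be (fromIntegral (BL.length value))
        putLazyByteString value

    getTLV :: Get (Word8, BL.ByteString)
    getTLV = do
        tag   <- getWord8
        len   <- getWord32be
        value <- getLazyByteString (fromIntegral len)
        return (tag, value)

    -- Frame a record: run its serialiser and wrap the resulting bytes.
    frameRecord :: Word8 -> Put -> BL.ByteString
    frameRecord tag body = runPut (putTLV tag (runPut body))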
OTOH, given that I find it unlikely that breakage will occur, I could really try to find the next intact record at the 0x01 markers, and then check the data for sensibility. Since the first value in the record is a timestamp, I could check whether it is monotonically increasing and not in the future, and if it is not, throw the record away. This would be sufficient.
So if you happen to have a broken file, contact me, I might implement an arbtt-restore program. (And I should probably move all these programs into one "arbtt command" style program :-))
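Sketched out, such a recovery pass could look roughly like this. It uses decodeOrFail, which is only available in newer versions of the binary package, and once more the made-up Sample type and a hypothetical recover function, not actual arbtt code.

    -- Sketch of the recovery idea: resynchronise at the next 0x01 byte (the
    -- version tag of each record), try to decode there, and keep the record
    -- only if its timestamp is increasing and not in the future.
    import Data.Binary (Binary(..), decodeOrFail)
    import Data.Binary.Get (getWord8)
    import Data.Binary.Put (putWord8)
    import qualified Data.ByteString.Lazy as BL

    data Sample = Sample { timestamp :: Integer, windowTitle :: String }

    instance Binary Sample where
        put (Sample t w) = putWord8 1 >> put t >> put w
        get = do
            version <- getWord8
            case version of
                1 -> do t <- get; w <- get; return (Sample t w)
                _ -> fail "unknown log entry version"

    recover :: Integer          -- current time, as an upper bound on timestamps
            -> BL.ByteString -> [Sample]
    recover now = go 0
      where
        go lastT bs
          | BL.null bs = []
          | otherwise  = case decodeOrFail bs of
              Right (rest, _, s)
                | timestamp s > lastT && timestamp s <= now
                    -> s : go (timestamp s) rest
              _     -> go lastT (resync (BL.drop 1 bs))
        -- skip ahead to the next candidate record start
        resync = BL.dropWhile (/= 0x01)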
Have something to say? You can post a comment by sending an e-mail to me at <mail@joachim-breitner.de>, and I will include it here.