This year is the first year that the JLPT (Japanese Language Proficiency Test) will be run in Ireland. I’ve submitted my application form and now just have to wait until December. I searched the App Store for some apps that might help me brush up on my kanji and vocabulary for level 2, but I couldn’t find anything that worked the way I wanted it to. That left me with a great opportunity to make my own!
I’m using Core Data to store all the vocabulary. This lets me search it reasonably quickly by specifying appropriate NSPredicates. A predicate might look something like this:
(isKanji == YES OR isKanjiHiragana == YES) AND level == 'JLPT 2'
This will give every JLPT level 2 word with a kanji character in it.
The app has three types of test, or seven once variations are counted. The three types are:
- Reading input: you must type in the correct reading.
- Choose correct answer: you must select the correct answer from a list of potential answers.
- Hide answer: the answer is hidden under a button. This one is pretty much a flash-card.
These three become seven because each type can be configured differently. A Choose Correct Answer test, for example, can ask the user to select the correct meaning of a word, the correct reading of a word, or the correct word for a particular meaning. The same applies to the other types, and we end up with seven. There are three classes, but seven objects, each with different instance variables determining how it behaves.
There are various categories of vocabulary: Katakana words, Hiragana words, Kanji words, Kanji+Hiragana words, prefixes, suffixes, miscellaneous (mix) and untestable (no reading or meaning available).
There are four JLPT levels (for now), and each of these levels has the categories above.
Every time a particular test has to generate a question, it has to find a word in the database suitable for the particular test. For example, if it’s Reading Input, there’s no point in testing a katakana word because it will be phonetic to begin with. It only makes sense to pick words with kanji in them somewhere. This means choosing a random managed object (i.e., a random row) that matches a particular NSPredicate from Core Data.
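The selection step can be sketched outside Core Data. The snippet below is only an illustration in Python, not the app's code: the sample rows (and their levels) are made up, and `random_match` is a hypothetical helper that filters first and then picks at random.

```python
import random

# Made-up sample rows; the real app reads these from Core Data.
WORDS = [
    {"text": "食べる", "reading": "たべる", "level": "JLPT 2", "isKanjiHiragana": True},
    {"text": "テレビ", "reading": "テレビ", "level": "JLPT 2", "isKatakana": True},
    {"text": "学校", "reading": "がっこう", "level": "JLPT 2", "isKanji": True},
    {"text": "猫", "reading": "ねこ", "level": "JLPT 3", "isKanji": True},
]

def reading_input_predicate(word):
    """Mirror of: (isKanji == YES OR isKanjiHiragana == YES) AND level == 'JLPT 2'."""
    has_kanji = word.get("isKanji", False) or word.get("isKanjiHiragana", False)
    return has_kanji and word["level"] == "JLPT 2"

def random_match(words, predicate, rng=random):
    """Return a random word satisfying predicate, or None if nothing matches."""
    candidates = [w for w in words if predicate(w)]
    return rng.choice(candidates) if candidates else None

question = random_match(WORDS, reading_input_predicate)
```

The expensive part is building `candidates`; the random pick itself is cheap.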
It turns out that this is slow. When I loaded a test, the screen would freeze for a couple of seconds while it sifted through the roughly 5,000 rows of the database with a complex predicate. I did a few things to speed this up.
Cache query results for each test
The first thing I tried was caching the array of possible questions/answers for each test in the test object. When a test queries Core Data for a list of objects matching a predicate, it saves that list so it won’t have to search again next time. A test’s predicate does not change throughout its life, so caching everything with no expiry is safe.
This meant that the second time a particular search ran, it was fast. The first time, however, it was just as slow as before. And with seven different tests, that meant the user would have to experience seven slow-downs.
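A per-test cache of this kind might look roughly like the following. This is a Python sketch under assumptions: `VocabularyTest` and `slow_fetch` are hypothetical names, and the fetch is a stand-in for the real Core Data query.

```python
import random

class VocabularyTest:
    """A test that remembers its candidate list after the first fetch."""

    def __init__(self, fetch):
        self._fetch = fetch        # callable that runs the slow query
        self._candidates = None    # filled lazily, never expires

    def next_question(self, rng=random):
        if self._candidates is None:
            self._candidates = self._fetch()
        return rng.choice(self._candidates)

fetch_log = []

def slow_fetch():
    """Stand-in for a Core Data fetch; records each time it runs."""
    fetch_log.append("fetch")
    return ["学校", "食べる"]

test = VocabularyTest(slow_fetch)
first = test.next_question()
second = test.next_question()
```

Only the first call to `next_question` pays for the fetch, which is exactly why the first question of each test stays slow.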
Load all data into a shared array and search manually
Next I decided to try to centralise the slow-downs to one place. I loaded everything in the database into an array at startup, which takes a couple of seconds but no longer than it took for each individual test.
When a test needed a value matching its predicate, it would pull a random one from the shared array and check it against the predicate, repeating until it found one. Except sometimes it never found one, or it took a really long time to come across one that would work. If no match existed, the test had no way of knowing that, so this was starting to look like a dead end.
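The problem is easy to reproduce. In this Python sketch (not the app's code) the `max_attempts` cap is my own addition so that the example terminates; the approach described above had no such cap.

```python
import random

def pick_by_rejection(words, predicate, max_attempts=1000, rng=random):
    """Draw random words until one satisfies predicate.

    Without max_attempts this loops forever when nothing matches. With it,
    a None result is ambiguous: either no match exists or we were unlucky.
    """
    for _ in range(max_attempts):
        candidate = rng.choice(words)
        if predicate(candidate):
            return candidate
    return None

words = ["テレビ", "カメラ", "学校"]
found = pick_by_rejection(words, lambda w: w == "学校")
missing = pick_by_rejection(words, lambda w: w == "猫")
```

The rarer the matching rows, the more draws are wasted, and a cap only trades an endless loop for an answer that cannot be trusted.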
Preload all data
Trying to centralise the slow-down without using a big, dumb shared array of vocabulary meant that each test would just fetch its own set of vocabulary matching its predicate at application start-up. Of course, this took many times longer than the previous method… but I had a plan…
Use a disk cache
The plan was to use a disk cache so that the hit wasn’t repeated across application launches. I implemented this, adding NSCoding support to my Core Data object, but it was slow. Really slow. Writing the file to disk or reading it back took many times longer than querying the database to generate the same array dynamically.
This was going to require a bit more intelligence…
Shared memory cache
Now knowing that querying the database directly was faster than loading query results saved to disk, I went about creating a memory cache shared across the whole app. This is basically an NSMutableDictionary with some fancy-pants methods around it. When the database is queried with a particular predicate, that predicate is stored as a key in the dictionary and the results from the database as the key’s object.
This meant that if you ran queries (or “fetch requests”, if you will… still not 100% comfortable with Core Data terminology) from different objects in the application but with the same predicates, the first would take time but the others would return immediately.
This meant that out of my seven tests, only five had to query the database and the other two could just pull from the cache.
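Stripped of the Core Data details, the idea is a dictionary keyed by the predicate. The Python sketch below is illustrative only: `PredicateCache` and `run_query` are hypothetical names, and predicates are represented as plain strings.

```python
class PredicateCache:
    """Maps a predicate string to the rows it returned."""

    def __init__(self, run_query):
        self._run_query = run_query   # callable: predicate string -> list of rows
        self._results = {}

    def fetch(self, predicate):
        # Only hit the database the first time a given predicate is seen.
        if predicate not in self._results:
            self._results[predicate] = self._run_query(predicate)
        return self._results[predicate]

queries_run = []

def run_query(predicate):
    """Stand-in for the slow database query; records what was asked."""
    queries_run.append(predicate)
    return ["row for " + predicate]

cache = PredicateCache(run_query)
a = cache.fetch("level == 'JLPT 2'")
b = cache.fetch("level == 'JLPT 2'")
c = cache.fetch("isKatakana == YES")
```

Because the key is compared for equality, two predicates only share a cache entry when they are written identically.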
Eliminate redundancy in level checking
The predicates I was generating at first were a total mess. To check for all vocabulary in JLPT 1, it would look something like this:
level == 'JLPT 1' AND (
isKatakana == YES OR
isHiragana == YES OR
isKanji == YES OR
isKanjiHiragana == YES OR
isPrefix == YES OR
isSuffix == YES OR
isMiscellaneous == YES OR
isUntestable == YES
)
Out of all those properties beginning with “is”, every single object will have at least one set. Therefore, when the user has not disabled any category (i.e., checking for all vocabulary as above), we might as well just use
level == 'JLPT 1'
This meant that to check for everything in the database, rather than having that big mess above multiplied by 4 (as there are four levels), the following can be used instead:
level == 'JLPT 1' OR level == 'JLPT 2' OR level == 'JLPT 3' OR level == 'JLPT 4'
Right? Right… but wait a minute! When the user has not disabled any level, every row matches one of these four alternatives, so the level test can be dropped altogether.
So when I was building a predicate to select everything in the database, it went from this
(level == 'JLPT 1' AND ( isKatakana == YES OR isHiragana == YES OR isKanji == YES OR isKanjiHiragana == YES OR isPrefix == YES OR isSuffix == YES OR isMiscellaneous == YES OR isUntestable == YES )) OR (level == 'JLPT 2' AND ( isKatakana == YES OR isHiragana == YES OR isKanji == YES OR isKanjiHiragana == YES OR isPrefix == YES OR isSuffix == YES OR isMiscellaneous == YES OR isUntestable == YES )) OR (level == 'JLPT 3' AND ( isKatakana == YES OR isHiragana == YES OR isKanji == YES OR isKanjiHiragana == YES OR isPrefix == YES OR isSuffix == YES OR isMiscellaneous == YES OR isUntestable == YES )) OR (level == 'JLPT 4' AND ( isKatakana == YES OR isHiragana == YES OR isKanji == YES OR isKanjiHiragana == YES OR isPrefix == YES OR isSuffix == YES OR isMiscellaneous == YES OR isUntestable == YES ))
to a predicate that simply matches every row, with no level or category tests at all.
This made loading roughly three times faster with the default settings. If a user deselects one category from each level, the speedup is lost entirely. However, I do not expect users to do this much: the default settings have everything on, and I expect that users may wish to disable a full category (possibly the extremes: too easy and/or too hard).
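The simplification rules can be written down as a small predicate builder. This is a Python sketch rather than the app's code: `build_predicate` and `level_clause` are hypothetical names, and `TRUEPREDICATE` / `FALSEPREDICATE` (the literals the NSPredicate format syntax offers for "always true" and "always false") are my assumption for the match-everything and match-nothing cases.

```python
LEVELS = ["JLPT 1", "JLPT 2", "JLPT 3", "JLPT 4"]
CATEGORIES = [
    "isKatakana", "isHiragana", "isKanji", "isKanjiHiragana",
    "isPrefix", "isSuffix", "isMiscellaneous", "isUntestable",
]

def level_clause(level, enabled):
    """Predicate for one level, dropping the category test when all are on."""
    if set(enabled) == set(CATEGORIES):
        return f"level == '{level}'"
    cats = " OR ".join(f"{c} == YES" for c in CATEGORIES if c in enabled)
    return f"(level == '{level}' AND ({cats}))"

def build_predicate(selection):
    """selection maps each level to the categories the user left enabled."""
    active = {lvl: cats for lvl, cats in selection.items() if cats}
    if not active:
        return "FALSEPREDICATE"   # assumption: nothing enabled, nothing matches
    all_levels = set(active) == set(LEVELS)
    all_cats = all(set(c) == set(CATEGORIES) for c in active.values())
    if all_levels and all_cats:
        return "TRUEPREDICATE"    # assumption: stands in for "match every row"
    return " OR ".join(level_clause(lvl, active[lvl]) for lvl in LEVELS if lvl in active)

everything = {lvl: set(CATEGORIES) for lvl in LEVELS}
no_level_4 = {lvl: set(CATEGORIES) for lvl in LEVELS[:3]}
partial = {"JLPT 1": {"isKanji", "isKatakana"}}
```

With `everything` the builder collapses to a single match-all predicate, while `partial` falls back to the long form for that level.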
Order operands by property name in logic operators
As I said earlier, by caching query results against their predicates, I only had to run five of the seven queries. However, I noticed that a couple of these looked the same, except with the order of “sub-predicates” reversed. What I mean is that I saw
text != nil AND reading != nil
together with
reading != nil AND text != nil
These are obviously the same thing logically, but as strings they are completely different. As far as I can tell there is no way to prove that two predicates are logically equivalent on the iPhone, so to solve this I made sure that when I construct predicates with logic operators such as AND and OR, I order the operands alphabetically by property name. That means either of the above will become
reading != nil AND text != nil
This is because “reading” < “text”.
This reduced the number of queries from five down to three!
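A canonical ordering like this takes only a few lines. In the Python sketch below, `combine` is a hypothetical helper; it sorts whole operand strings, which amounts to sorting by property name as long as each operand starts with its property.

```python
def combine(operator, *operands):
    """Join sub-predicates with AND/OR in a canonical, case-insensitive order."""
    # Duplicates are dropped; the secondary key keeps the order deterministic.
    ordered = sorted(set(operands), key=lambda s: (s.lower(), s))
    return f" {operator} ".join(ordered)

first = combine("AND", "text != nil", "reading != nil")
second = combine("AND", "reading != nil", "text != nil")
```

Both calls produce the same string, so they would map to the same cache entry.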
Skip != nil if the property is not optional
Finally, I noticed that many of my predicates contained
text != nil
The text property is not “optional”. That’s the same as having a “not null” column in the normal database world. So why bother checking if it’s nil if it’s never going to be nil?
NSEntityDescription can provide an NSDictionary of NSPropertyDescriptions, each of which tells you whether its property is optional (isOptional). I used this to eliminate these redundant tests.
And this brought the number of unique queries down from three to just two!!
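The filtering step can be modelled like this. It is a Python sketch, not Core Data code: the `OPTIONAL` table is a hypothetical stand-in for what the entity description reports, the optionality of `reading` and `meaning` is assumed, and `TRUEPREDICATE` is my placeholder for "nothing left to check".

```python
# Hypothetical model description: property name -> is it optional?
# In Core Data this information comes from the entity's property descriptions.
OPTIONAL = {"text": False, "reading": True, "meaning": True}

def drop_redundant_nil_checks(properties, optional=OPTIONAL):
    """Keep a '!= nil' test only for properties that can actually be nil."""
    # Unknown properties are treated as optional, so their checks are kept.
    needed = sorted(p for p in set(properties) if optional.get(p, True))
    return " AND ".join(f"{p} != nil" for p in needed) or "TRUEPREDICATE"

with_text = drop_redundant_nil_checks(["text", "reading"])
without_text = drop_redundant_nil_checks(["reading"])
```

Two predicates that differed only in a redundant check on `text` now come out identical.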
Conclusions
By simplifying predicates to logically equivalent but more sensible versions where possible, I reduced loading time to a third with default settings.
By normalising operand order and removing redundant comparisons, predicates that were logically equivalent in less obvious ways were made identical: for the seven tests, only two predicates are actually required.
By using a shared memory cache there is the overhead of a few seconds at application startup, but this means that queries with equal (not equivalent) predicates are only run once. It’s up to the rest of the application to make sure that equivalent predicates end up being equal as much as possible.