Chinese parsing issues
-
Here's an example of the type of parsing issue I see regularly. I don't expect parsing to be perfect (of course it won't be, especially for a language like Chinese), but in my opinion this kind of error is really annoying. The reason is that the first two characters are not part of the word that follows them: 不能 appears very often in Chinese before a verb, because it means "can't". I don't think there's a single situation where 不能 should be treated as part of the word that comes directly after it.
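For reference, here is a minimal sketch using the open-source jieba segmenter (just an assumption for illustration; I don't know which parser Migaku actually uses) showing 不能 kept as its own token before a verb. Exact output depends on the jieba version and dictionary.

```python
# Minimal sketch with the open-source jieba segmenter (not Migaku's parser).
# 不能 should come out as its own token, separate from the verb that follows.
import jieba

for phrase in ["我不能去", "他不能说话"]:
    print(phrase, "->", jieba.lcut(phrase))
    # expected, roughly: 我不能去 -> ['我', '不能', '去']
    # (exact segmentation can vary with the jieba version and dictionary in use)
```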
-
@anon-fn Another user suggests that Pleco uses a more accurate parser; is it possible to use the same one, or otherwise find a more accurate parser library?
-
I've seen a number of parsing issues too. I reported a few of them, but it's not very motivating: it's fairly time-consuming and I never get any feedback when I do. Is there a way to set up an automated feedback loop where we can correct the errors from within Anki, and the parser then automatically learns to make fewer mistakes in the future?
I think it's important to get this working really well, since parsing errors contaminate the list of known words. Then, when I go to make new cards, Migaku sometimes thinks I know a word I actually don't (or vice versa), and sometimes it gives me an incorrect list of new words because one or more word boundaries were identified incorrectly.
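To make the feedback-loop idea concrete, here's a rough sketch of one way corrections could feed back into a segmenter, assuming a jieba-style user dictionary. The filename and the whole workflow are hypothetical; Migaku's parser may work completely differently.

```python
# Hypothetical correction feedback loop using jieba's user-dictionary API
# (an assumption about tooling, not a description of how Migaku works).
import jieba

# Load user-confirmed corrections from a plain-text file, one entry per line
# in jieba's "word frequency" format, e.g. "奇怪 10". The filename is made up.
jieba.load_userdict("user_corrections.txt")

# A correction can also force a split: keep 不能 separate from the verb 去.
jieba.suggest_freq(("不能", "去"), tune=True)

print(jieba.lcut("我不能去"))  # ideally ['我', '不能', '去']
```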
-
I'm seeing the same issue too, and it happens frequently. I think it's fixable: at least when I use Pleco (particularly the reader), it parses accurately.
-
@istangel I think Migaku uses an external reference for parsing, so I'm not sure there is much that can be done. Even Morphman messes up a lot. At least you cited a real phrase. What's super annoying is things like “真奇怪的風” ("a really strange wind") being parsed as 真奇 / 怪 / 的 / 風, where 真奇 isn't even a real word (at least according to 8 dictionaries). I'm never sure whether to say I "know" it or not; if I don't, then I'm missing out on identifying T1s. Adding a feature that lets us manually correct parsing errors would be great. A less ideal but still helpful option would be an "Ignore" option (alongside learning/known/unknown), which we could also use on these non-words and names. Left as unknown, they hide opportunities for T1 sentence flagging. On the bright side, so far I have found Migaku's Chinese parsing at least on par with, if not better than, Language Reactor's.
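For what it's worth, here's a rough sketch of what a manual-correction feature could do under the hood, assuming a jieba-style dictionary-based segmenter (an assumption on my part; I don't know what Migaku uses internally): delete the bogus entry and make sure the real word is present.

```python
# Hypothetical manual-correction sketch with jieba (an assumption about the
# underlying segmenter; not Migaku's actual implementation).
import jieba

print(jieba.lcut("真奇怪的風"))   # before correction; output depends on the dictionary

jieba.del_word("真奇")            # stop the non-word 真奇 from being produced
jieba.add_word("奇怪")            # ensure the real word 奇怪 is in the dictionary

print(jieba.lcut("真奇怪的風"))   # ideally: ['真', '奇怪', '的', '風']
```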