Full Text Search in Chinese and Japanese doesn't work as well as English
Problem Description: Tried full text search for a variety of text but didn't get the expected results.
试样(sample)
块状(block)
模拟(simulate)
测绘(measure)
have unrelated meanings. So when we search for 模块 测试, we expect the WIs returned for those keywords to be related to module test, but the results can include WIs that are not relevant at all, because the search is done on the single characters 模, 块, 测, 试.
Accepted answer
Thai, Chinese, and Japanese all have a similar issue.
The tokenizer, the component in full text search that reduces words into searchable segments (e.g. for English, iteration is reduced to iter, so searches for iteration, iterations, and iterate all match), doesn't work well for these languages.
Instead, the tokenizer breaks words down into individual characters. Putting double quotes ("") around the characters helps somewhat: the characters being searched for then have to appear near one another and in order.
For the example above, here's what I saw.
I created WIs with one character each:
模
块
测
试
I also created WIs with the following entries, including the ones from the question above:
模块 测试
模块测试
块模试测
试样(sample)
块状(block)
模拟(simulate)
测绘(measure)
Single-character searches always matched any word containing that character. So a search for 模 matched 模, 模块, 模块测试, etc.
Multi-character searches matched words but didn't care about the order of the characters, or about spaces. A search for 模块 matched 块模试测, 模块测试, etc.
This would be expected given the single character tokens, and a matching algorithm not sensitive to order.
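The behavior described above can be reproduced with a small sketch. This is not the product's actual search code; the function names `tokenize` and `unordered_match` and the `require_all` switch are hypothetical, and whether the real matcher requires all of the query characters or only some of them is an assumption left open here.

```python
def tokenize(text: str) -> list[str]:
    """Split text into single-character tokens, dropping whitespace."""
    return [ch for ch in text if not ch.isspace()]


def unordered_match(query: str, document: str, require_all: bool = True) -> bool:
    """Order-insensitive match on single-character tokens.

    require_all=True  -> every query character must occur in the document.
    require_all=False -> at least one query character must occur.
    Which of the two the real search uses is an assumption of this sketch.
    """
    query_tokens = set(tokenize(query))
    doc_tokens = set(tokenize(document))
    if require_all:
        return query_tokens <= doc_tokens
    return bool(query_tokens & doc_tokens)


# Work item texts from the thread, without the English glosses.
docs = ["模块 测试", "模块测试", "块模试测", "试样", "块状", "模拟", "测绘"]

all_hits = [d for d in docs if unordered_match("模块", d)]
any_hits = [d for d in docs if unordered_match("模块", d, require_all=False)]

print(all_hits)
print(any_hits)
```

In both modes the scrambled entry 块模试测 is returned for the query 模块, because character order plays no part in the comparison. With `require_all=False`, unrelated entries such as 块状 and 模拟 are returned as well.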
I found slightly better behavior when I put "" around words (e.g. "模块测试"). Then order mattered, but spaces didn't.
This is because the space character is dropped during matching; only the tokens and their order matter. The search for "模块测试" didn't match 块模试测, but did match 模块 测试.
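A quoted search can be sketched the same way. The names `tokenize` and `phrase_match` are hypothetical, and the sketch simplifies "near one another" to "directly adjacent once spaces are dropped", which is an assumption and may be stricter than what the real search does.

```python
def tokenize(text: str) -> list[str]:
    """Split text into single-character tokens, dropping whitespace."""
    return [ch for ch in text if not ch.isspace()]


def phrase_match(phrase: str, document: str) -> bool:
    """Return True if the phrase's tokens occur in the document in order.

    Spaces are dropped before comparing, so only tokens and their order
    matter. Adjacency is required here as a simplification (assumption).
    """
    phrase_tokens = tokenize(phrase)
    doc_tokens = tokenize(document)
    n = len(phrase_tokens)
    if n == 0:
        return False
    # Slide a window of length n over the document tokens.
    return any(
        doc_tokens[i:i + n] == phrase_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


docs = ["模块 测试", "模块测试", "块模试测"]
phrase_hits = [d for d in docs if phrase_match("模块测试", d)]
print(phrase_hits)
```

Here 模块 测试 and 模块测试 both match the quoted phrase, while 块模试测 does not, which lines up with the observations above.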