Full Text Search in Chinese and Japanese doesn't work as well as English


Glenn Bardwell (58621527) | asked Jul 30 '12, 10:22 a.m.
JAZZ DEVELOPER

Problem description: I tried full text search on a variety of text but didn't get the expected results.
试样(sample)
块状(block)
模拟(simulate)
测绘(measure)
These words have unrelated meanings. When we search for 模块 测试, we expect to get only the WIs related to module testing, but the results can include WIs that are completely irrelevant, because the search is done on the single characters 模, 块, 测, 试.

Accepted answer


Glenn Bardwell (58621527) | answered Jul 30 '12, 10:27 a.m.
JAZZ DEVELOPER
Thai, Chinese, and Japanese all have a similar issue.

The tokenizer, the component in full text search that reduces words to searchable segments (e.g. for English, "iteration" is reduced to "iter", so searches for iteration, iterations, and iterate all match), doesn't work well for these languages.
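To illustrate the English case only, here is a toy suffix-stripping sketch. It is not the stemmer the product uses; the suffix list and the name `toy_stem` are made up for this example.

```python
# Toy stemmer: strips a few hand-picked English suffixes.
# Assumption: this only mimics the "iteration -> iter" example above;
# a real search engine uses a far more complete algorithm.
SUFFIXES = ("ations", "ation", "ates", "ate")  # longest first


def toy_stem(word):
    """Return the word with the first matching suffix removed."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters of stem so short words survive.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for w in ("iteration", "iterations", "iterate"):
        print(w, "->", toy_stem(w))
```

Because all three forms reduce to the same segment, a search for any one of them finds the others. Chinese, Japanese, and Thai text has no spaces or suffixes to work from in this way.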

The tokenizer breaks words down into individual characters. Putting double quotes ("") around the characters helps somewhat: the characters being searched for then have to appear near one another and in order.

So for the above, here's what I saw.
I created WIs with one character each, and WIs with the entries above:
模块 测试
模块测试
块模试测
试样(sample)
块状(block)
模拟(simulate)
测绘(measure)

Single-character searches always matched any word containing that character. So a search for 模 matched 模, 模块, 模块测试, etc.
Multi-character searches matched words but ignored both the order of the characters and any spaces. A search for 模块 matched 块模试测, 模块测试, etc.
This is what you would expect given single-character tokens and a matching algorithm that is not sensitive to order.
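A minimal sketch of that behavior, assuming each non-space character becomes one token and a WI matches when it contains every query token in any order. The function names and the all-tokens-required rule are assumptions for illustration, not the product's actual implementation.

```python
def tokenize(text):
    """Split text into single-character tokens, dropping whitespace."""
    return [ch for ch in text if not ch.isspace()]


def search(query, work_items):
    """Return work items containing every query token, ignoring order."""
    wanted = set(tokenize(query))
    return [wi for wi in work_items if wanted <= set(tokenize(wi))]


WORK_ITEMS = ["模块 测试", "模块测试", "块模试测", "试样", "块状", "模拟", "测绘"]

if __name__ == "__main__":
    print(search("模", WORK_ITEMS))
    print(search("模块", WORK_ITEMS))
```

With these sample entries, the search for 模 also returns 模拟 (simulate), and the search for 模块 returns the scrambled 块模试测, which is the kind of irrelevant hit described in the question.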

I found slightly better behavior when I put "" around the words (e.g. "模块测试"). Then order mattered, but spaces didn't.
This is because the space character is dropped during matching. Only the tokens and their order matter. The search for "模块测试" didn't match 块模试测, but did match 模块 测试.
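The quoted search can be sketched the same way, assuming the query tokens must appear as one consecutive run in the WI after spaces are dropped. The real matcher may only require the characters to be near one another, so treat the strict adjacency here, and the function names, as assumptions.

```python
def tokenize(text):
    """Split text into single-character tokens, dropping whitespace."""
    return [ch for ch in text if not ch.isspace()]


def phrase_match(query, work_item):
    """True if the query tokens occur in order, back to back, in the work item."""
    q = tokenize(query)
    d = tokenize(work_item)
    if not q:
        return False
    # Slide a window of len(q) tokens across the work item's tokens.
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


if __name__ == "__main__":
    for wi in ("模块 测试", "模块测试", "块模试测"):
        print(wi, phrase_match("模块测试", wi))
```

Because the space is removed before comparison, 模块 测试 and 模块测试 produce the same token sequence, while 块模试测 fails on order.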
Seth Packham selected this answer as the correct answer

Comments
Sunil Kumar R commented Oct 24 '13, 10:42 a.m.
JAZZ DEVELOPER

Hello! I found this after a good amount of searching.

Does the reported behavior also apply to full text search with Turkish characters, and is it relevant to CLM v4.0.2?

Best Regards, Sunil
