Jazz Forum Welcome to the Jazz Community Forum Connect and collaborate with IBM Engineering experts and users

Full Text Search in Chinese and Japanese doesn't work as well as English


Problem description: I tried full text search with a variety of text but didn't get the expected results. For example, the words
试样(sample)
块状(block)
模拟(simulate)
测绘(measure)
have unrelated meanings. When we search for 模块 测试, we expect to get only WIs whose keywords relate to module test, but the results can include WIs that are not relevant at all, because the search is done on the single characters 模, 块, 测, 试.



Accepted answer

Thai, Chinese, and Japanese all have a similar issue.

The tokenizer, the component in full text search that reduces words to searchable segments (e.g. for English, iteration is reduced to iter, so searches for iteration, iterations, and iterate all match), doesn't work well for these languages.

The tokenizer breaks words down into individual characters. Putting double quotes ("") around the characters helps somewhat: in that case the characters being searched for have to appear near one another and in order.

Here's what I saw for the example above.
I created WIs with one character each, and ones with the entries below (including those from the question above):
模块 测试
模块测试
块模试测
试样(sample)
块状(block)
模拟(simulate)
测绘(measure)

Single-character searches always matched any word containing that character. So a search for 模 matched 模, 模块, 模块测试, etc.
Multi-character searches matched words regardless of the order of the characters or of spaces. A search for 模块 matched 块模试测, 模块测试, etc.
This is to be expected given single-character tokens and a matching algorithm that is not sensitive to order.
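To make the behavior above concrete, here is a minimal Python sketch of single-character tokenization with order-insensitive matching. It is not the product's actual code: the function names `char_tokens` and `matches_unordered` are made up for illustration, and requiring every query character to be present (rather than any one of them) is an assumption.

```python
def char_tokens(text):
    """Split text into single-character tokens, dropping spaces and punctuation."""
    return [ch for ch in text if ch.isalnum()]


def matches_unordered(query, document):
    """Return True if every query token occurs somewhere in the document.

    Order and adjacency are ignored (assumed model of the unquoted search).
    """
    doc_tokens = set(char_tokens(document))
    return all(tok in doc_tokens for tok in char_tokens(query))


if __name__ == "__main__":
    work_items = ["模块 测试", "模块测试", "块模试测", "试样(sample)", "模拟(simulate)"]
    for query in ["模", "模块"]:
        hits = [wi for wi in work_items if matches_unordered(query, wi)]
        print(query, "->", hits)
```

Under this model 模块 matches 块模试测 just as well as 模块测试, which is the same kind of result I saw.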

I found slightly better behavior when I put double quotes around the words (e.g. "模块测试"). Then order mattered, but spaces didn't.
This is because the space character is dropped during matching; only the tokens and their order matter. The search for "模块测试" didn't match 块模试测, but did match 模块 测试.
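The quoted search can be sketched the same way. This is only an illustration under assumptions: `phrase_tokens` and `matches_phrase` are hypothetical names, and the tokens are required to be strictly adjacent, which may be stricter than the real "near one another" rule.

```python
def phrase_tokens(text):
    """Split text into single-character tokens, dropping whitespace only."""
    return [ch for ch in text if not ch.isspace()]


def matches_phrase(phrase, document):
    """Return True if the phrase tokens appear consecutively and in order.

    Spaces are dropped before comparing, so only tokens and order matter.
    """
    query = phrase_tokens(phrase)
    doc = phrase_tokens(document)
    n = len(query)
    if n == 0:
        return False
    # Slide a window of the query's length over the document tokens.
    return any(doc[i:i + n] == query for i in range(len(doc) - n + 1))


if __name__ == "__main__":
    for wi in ["模块 测试", "模块测试", "块模试测"]:
        print(wi, "->", matches_phrase("模块测试", wi))
```

Because whitespace is removed before the comparison, 模块 测试 and 模块测试 look identical to the matcher, while 块模试测 fails the order check.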
Seth Packham selected this answer as the correct answer


Comments

Hello! I found this after a good amount of searching.

Does the reported behavior also apply to full text search with Turkish characters, and is it relevant to CLM v4.0.2?

Best Regards, Sunil

Question details
Question asked: Jul 30 '12, 10:22 a.m.

Question was seen: 5,923 times

Last updated: Oct 24 '13, 10:42 a.m.
