Full Text Search in Chinese and Japanese doesn't work as well as English

Glenn Bardwell

JAZZ DEVELOPER (586●2●15●27) Jul 30 '12, 10:22 a.m.

Problem Description: Tried full text search for a variety of text but didn't get the expected results.
试样(sample)
块状(block)
模拟(simulate)
测绘(measure)
The terms above have unrelated meanings. So when we search for 模块测试, we expect only WIs whose keywords relate to module test, but the results can include WIs that are not relevant at all, because the search is done on the single characters 模, 块, 测, 试.

0 votes

1 answer

5,923 views

0 votes

Accepted answer

Glenn Bardwell

JAZZ DEVELOPER (586●2●15●27) Jul 30 '12, 10:27 a.m.

Thai, Chinese, and Japanese all have a similar issue.

The tokenizer is the component in full text search that reduces words into searchable segments. For English, for example, "iteration" is reduced to "iter", so searches for iteration, iterations, and iterate all match. For these languages it doesn't work well.
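To make the English reduction concrete, here is a minimal sketch of suffix stripping. The function name `toy_stem` and the suffix list are hypothetical and chosen only for this example; the analyzer actually used by the product is certainly more elaborate.

```python
# Toy suffix stripper: an illustration only, not the product's analyzer.
# The suffix list is an assumption picked to cover the example words.
SUFFIXES = ("ations", "ation", "ates", "ate")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three leading characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


for w in ("iteration", "iterations", "iterate"):
    print(w, "->", toy_stem(w))
```

All three words reduce to the same segment, which is why a search for any one of them finds the others. Nothing comparable happens for Chinese text, where there are no spaces to mark word boundaries.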

The tokenizer breaks words down into individual characters. Putting double quotes ("") around the characters helps somewhat. In that case the characters being searched for have to appear near one another and in order.

Here's what I saw for the example above.
I created WIs with one character each:
模
块
测
试

I also created WIs with these entries:
模块测试
模块测试
块模试测
试样(sample)
块状(block)
模拟(simulate)
测绘(measure)

Single-character searches always matched any word containing that character. So a search for 模 matched 模, 模块, 模块测试, etc.
Multi-character searches matched words regardless of the order of the characters or of spaces. A search for 模块 matched 块模试测, 模块测试, etc.
This is expected given single-character tokens and a matching algorithm that is not sensitive to order.
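The unquoted behavior can be reproduced with a small sketch. It assumes one token per non-whitespace character and a match whenever every query character occurs somewhere in the work item; both are assumptions inferred from the observations above, not the product's actual code. The names `char_tokens` and `matches_unordered` are hypothetical.

```python
def char_tokens(text: str) -> list[str]:
    """One token per non-whitespace character (assumed tokenizer behavior)."""
    return [ch for ch in text if not ch.isspace()]


def matches_unordered(query: str, document: str) -> bool:
    """True when every query character occurs in the document, in any order."""
    return set(char_tokens(query)) <= set(char_tokens(document))


# Work item texts from the experiment, without the English glosses.
work_items = ["模", "块", "测", "试", "模块测试", "块模试测",
              "试样", "块状", "模拟", "测绘"]

print([wi for wi in work_items if matches_unordered("模", wi)])
print([wi for wi in work_items if matches_unordered("模块", wi)])
```

Under these assumptions the search for 模 also returns 模拟, and the search for 模块 returns both 模块测试 and 块模试测, because character order plays no part in the match.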

I found slightly better behavior when I put double quotes around the words (e.g. "模块测试"). Then order mattered, but spaces didn't.
This is because the space character is dropped during matching. Only the tokens and their order matter. The search for "模块测试" didn't match 块模试测, but did match 模块测试.
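The quoted search can be sketched the same way. This assumes single-character tokens with whitespace dropped, and treats "near one another and in order" as strictly adjacent, which is a simplification. The names `char_tokens` and `matches_phrase` are hypothetical, not part of the product.

```python
def char_tokens(text: str) -> list[str]:
    """One token per non-whitespace character; spaces are dropped (assumed)."""
    return [ch for ch in text if not ch.isspace()]


def matches_phrase(query: str, document: str) -> bool:
    """True when the query tokens appear in the document consecutively and in order."""
    wanted = char_tokens(query.strip('"'))
    found = char_tokens(document)
    if not wanted:
        return False
    last_start = len(found) - len(wanted)
    return any(found[i:i + len(wanted)] == wanted for i in range(last_start + 1))


for text in ("模块测试", "模块 测试", "块模试测"):
    print(text, matches_phrase('"模块测试"', text))
```

With this sketch the quoted query matches 模块测试 with or without a space in the middle, and does not match 块模试测, which is in line with the results described above.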

Seth Packham selected this answer as the correct answer

0 votes

Comments

Sunil Kumar R

JAZZ DEVELOPER Oct 24 '13, 10:42 a.m.

Hello! I found this after a good amount of search.

Does the reported behavior also apply to full text search with Turkish characters, and is it relevant to CLM v4.0.2?

Best Regards, Sunil

Question details

Question asked: Jul 30 '12, 10:22 a.m.

Question was seen: 5,923 times

Last updated: Oct 24 '13, 10:42 a.m.