Ticket #7254 (closed defect: fixed)

Opened 5 months ago

Last modified 5 months ago

WP Diff shouldn't split words in the middle of UTF-8 characters

Reported by: nbachiyski
Assigned to: anonymous
Priority: high
Milestone: 2.6
Component: General
Version:
Severity: normal
Keywords: has-patch
Cc:

Description

Expected:

When we compare Грещките and Грешките we should get the following HTML code for the deleted part:

Гре<del>щ</del>ките

However, we get:

Гре�<del>�</del>ките

WP_Text_Diff_Renderer_inline::_splitOnWords() uses the following regular expression to split words: /([^\w])/. Without the /u modifier the pattern is matched byte by byte, and \w in this case matches only [a-zA-Z0-9_]; everything else, including each byte of a multi-byte UTF-8 character, counts as outside a word. This is a poor definition of a word, and it allows a word to end in the middle of a UTF-8 character, which is the case above.
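To illustrate the byte-level behaviour described above, here is a minimal sketch rather than the WordPress code itself. It assumes PHP 7 or later (for the "\u{...}" escapes) and the default C locale. The PREG_SPLIT_NO_EMPTY flag is added here only to keep the output short. The two letters that differ, щ and ш, share their first UTF-8 byte, which is why one stray byte ends up outside the <del> tag and one inside it.

```php
<?php
// U+0449 (щ) and U+0448 (ш), written as escapes so the file encoding does not matter.
$old = "\u{0449}";
$new = "\u{0448}";

echo bin2hex($old), "\n"; // d189
echo bin2hex($new), "\n"; // d188

// Without /u the pattern sees bytes. In the default C locale neither byte
// is matched by \w, so each byte becomes its own delimiter token.
$tokens = preg_split('/([^\w])/', $old, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

echo count($tokens), "\n";                              // 2
echo implode(' ', array_map('bin2hex', $tokens)), "\n"; // d1 89
```

A diff over these tokens treats the shared byte d1 as unchanged and only the second byte as deleted, which matches the broken markup reported above.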

The solution is to make the regular expression treat its input as a UTF-8 string by adding the /u modifier (available from PHP 4.1.0), so that it matches whole characters rather than single bytes.
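The effect of the /u modifier can be checked with a short sketch like the one below. The helpers split_on_words() and all_tokens_valid_utf8() are hypothetical names introduced for illustration and are not part of WordPress. The attached patch may differ in detail. The sketch assumes PHP 7 or later and the default C locale.

```php
<?php
// Hypothetical stand-in for the splitting step; not the WordPress method itself.
function split_on_words($text, $utf8)
{
    $pattern = $utf8 ? '/([^\w])/u' : '/([^\w])/';
    return preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
}

// True when every token is valid UTF-8 on its own.
// preg_match() with /u fails on subjects that are not valid UTF-8.
function all_tokens_valid_utf8(array $tokens)
{
    foreach ($tokens as $token) {
        if (preg_match('//u', $token) !== 1) {
            return false;
        }
    }
    return true;
}

// "Грещките", written as escapes so the file encoding does not matter.
$word = "\u{0413}\u{0440}\u{0435}\u{0449}\u{043a}\u{0438}\u{0442}\u{0435}";

var_dump(all_tokens_valid_utf8(split_on_words($word, false))); // bool(false)
var_dump(all_tokens_valid_utf8(split_on_words($word, true)));  // bool(true)
```

Whether \w also matches Cyrillic letters under /u depends on the PHP and PCRE build, so the sketch checks only that no token is cut inside a character, not how many tokens are produced.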

Patch attached.

Attachments

diff-utf8.diff (0.5 kB) - added by nbachiyski on 07/06/08 20:57:59.

Change History

07/06/08 20:57:59 changed by nbachiyski

  • attachment diff-utf8.diff added.

07/06/08 21:14:47 changed by ryan

  • status changed from new to closed.
  • resolution set to fixed.

(In [8264]) Don't split in the middle of a UTF-8 character. Props nbachiyski. fixes #7254