Ticket #6077 (closed defect: fixed)

Opened 6 months ago

Last modified 6 months ago

UTF-8 strings are sometimes cut in the middle of a character

Reported by: nbachiyski Assigned to: anonymous
Priority: normal Milestone: 2.5
Component: General Version:
Severity: normal Keywords: unicode utf-8 excerpt has-patch
Cc:

Description

Using substr on UTF-8 strings can cut some characters in the middle, because substr counts bytes, while in UTF-8 a single character can take more than one byte.
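To make the failure concrete, here is a minimal sketch in Python rather than PHP. The function names are hypothetical and are not part of the patch: the first mimics a byte-counting cut, the second backs off to the previous character boundary, which is roughly the behaviour one would expect from mb_strcut.

```python
def cut_bytes_naive(text: str, length: int) -> bytes:
    """Cut the UTF-8 encoding of text at a raw byte offset, as a byte-counting substr would."""
    return text.encode("utf-8")[:length]


def cut_bytes_safe(text: str, length: int) -> str:
    """Keep at most `length` bytes, backing off to the previous character boundary."""
    data = text.encode("utf-8")
    if length >= len(data):
        return text
    end = max(length, 0)
    # UTF-8 continuation bytes have the form 0b10xxxxxx; a cut must not land on one.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end].decode("utf-8")


sample = "naïve"                      # "ï" takes two bytes in UTF-8
broken = cut_bytes_naive(sample, 3)   # ends with the first half of "ï"
safe = cut_bytes_safe(sample, 3)      # stops before "ï" instead of splitting it
```

The naive result is not valid UTF-8 and fails to decode, while the safe cut returns one character fewer instead of half a character.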

Here is a patch, which:

  • Defines mb_strcut in compat.php for users who don't have the mbstring extension.
  • Introduces a new wp_html_excerpt function, which uses mb_strcut and works well with HTML strings: it counts each entity as one character (&amp; is one character, not five) and strips tags.
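The idea behind the second bullet can be sketched as follows. This is an illustration in Python under stated assumptions, not the code from the attached patch: html_excerpt is a hypothetical name, tag stripping is done with a simple regular expression, and Python's code-point slicing stands in for mb_strcut.

```python
import html
import re

# Simplistic tag matcher, good enough for an illustration only.
TAG_RE = re.compile(r"<[^>]*>")


def html_excerpt(markup: str, count: int) -> str:
    """Strip tags, then keep `count` characters, counting each entity as one."""
    text = TAG_RE.sub("", markup)
    decoded = html.unescape(text)          # "&amp;" becomes "&", a single character
    cut = decoded[:count]                  # slices by code point, never mid-character
    return html.escape(cut, quote=False)   # re-encode so the result is still safe HTML


excerpt = html_excerpt("<b>Fish &amp; chips</b>", 6)
```

Here the excerpt is "Fish &amp;": six visible characters, with the entity restored after the cut.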

There are some tests for the two functions.

Attachments

safe-excerpts.diff (5.3 kB) - added by nbachiyski on 03/03/08 16:08:50.
html_entity_decode.diff (0.5 kB) - added by tenpura on 03/06/08 02:34:47.
safe-excerpts-no-decode.diff (1.2 kB) - added by nbachiyski on 03/09/08 14:58:47.

Change History

03/03/08 16:08:50 changed by nbachiyski

  • attachment safe-excerpts.diff added.

03/03/08 16:48:25 changed by nbachiyski

  • keywords changed from unicode utf-8 excerpt to unicode utf-8 excerpt has-patch.

03/03/08 21:05:23 changed by ryan

  • status changed from new to closed.
  • resolution set to fixed.

(In [7140]) Multi-byte character safe excerpting from nbachiyski. fixes #6077

03/04/08 14:12:55 changed by tenpura

  • status changed from closed to reopened.
  • resolution deleted.

03/04/08 21:02:17 changed by nbachiyski

Oh, I was misled by the html_entity_decode manual, which says:

Version   Description
5.0.0     Support for multi-byte character sets was added.

I didn't notice that the text above this entry says most of the encodings we need are supported as of 4.3.0. So, let's add the encoding then.

03/06/08 02:34:47 changed by tenpura

  • attachment html_entity_decode.diff added.

03/06/08 02:38:58 changed by tenpura

The manual says:

Any other character sets are not recognized and ISO-8859-1 will be used instead.

This is a good thing, but it emits a warning in that case, so I updated the patch to add "@" to suppress it anyway.

03/06/08 04:08:17 changed by tenpura

According to the PHP user notes, html_entity_decode() has a bug with UTF-8. Maybe we should create a substitute function?

The bug should be reproducible with this code on PHP versions before 5.0.1:

echo html_entity_decode('&euro;', ENT_QUOTES, 'UTF-8');

03/06/08 06:59:07 changed by nbachiyski

We can just drop the entity decoding part. Yes -- the excerpt could be a couple of characters shorter than the specified length, but that's how it has worked up to now and nobody complained.
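For comparison, a no-decode variant might look like the sketch below. Again this is Python with hypothetical names, and the ticket does not show the patch itself: in particular, trimming an entity that the cut left unfinished is my assumption about how to keep the output well-formed.

```python
import re

# Simplistic tag matcher, good enough for an illustration only.
TAG_RE = re.compile(r"<[^>]*>")
# An "&" followed by a few non-terminated characters at the very end of the string.
PARTIAL_ENTITY_RE = re.compile(r"&[^;\s]{0,6}$")


def html_excerpt_no_decode(markup: str, count: int) -> str:
    """Strip tags and cut to `count` characters without decoding entities."""
    text = TAG_RE.sub("", markup)
    cut = text[:count]                      # entities count as several characters here
    return PARTIAL_ENTITY_RE.sub("", cut)   # drop an entity that was cut in half


short = html_excerpt_no_decode("<b>Fish &amp; chips</b>", 7)
```

With a count of 7 the cut lands inside "&amp;", so the fragment is dropped and only "Fish " remains, which is the "couple of characters shorter" effect described above.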

03/09/08 14:58:30 changed by nbachiyski

Here is a patch that removes entity decoding. The documentation and tests are also updated.

03/09/08 14:58:47 changed by nbachiyski

  • attachment safe-excerpts-no-decode.diff added.

03/09/08 22:11:21 changed by westi

  • status changed from reopened to closed.
  • resolution set to fixed.

(In [7190]) Remove the entity decoding and recoding from wp_html_excerpt. Fixes #6077 props nbachiyski.