Ticket #3843 (new defect)

Opened 1 year ago

Last modified 3 days ago

Smart quote apostrophe ’ results in a permalink URL with %e2%80%99

Reported by: foolswisdom
Assigned to: ryan
Priority: normal
Milestone: 2.6
Component: Administration
Version: 2.2
Severity: minor
Keywords: has-patch slug permalink dev-feedback
Cc: drmike

Description

Smart quote apostrophe ’ results in a permalink URL (slug) with %e2%80%99

ENV: WP trunk r4915

smart quote apostrophe ’
Mac shortcut: Shift-Option-]

ADDITIONAL DETAILS
My guess is that a solution should identify which characters are allowed and which should be translated to a hyphen (-), and strip all the others.

Attachments

unicode-punctuation-removal.diff (1.6 kB) - added by noel on 06/20/08 00:47:19.
Permalink filter for unicode punctuation.
formatting-7-2-4am.diff (1.0 kB) - added by noel on 07/02/08 08:02:07.
Unicode fixes that do not produce a 404.

Change History

02/23/07 00:03:39 changed by rob1n

This is proper behavior. The curly quote isn't plaintext -- it's a symbol and has to be translated. The same goes for other UTF-8 symbols such as Chinese characters (some other bug was about that) -- they are, and should be, turned into URL-safe percent-encoded octets.

02/25/07 11:56:38 changed by markjaquith

A solution would have to deal with this case specifically. Note that the URL, while ugly, is functional. Also note that in 2.1, people should be able to edit their post slug and have the old one redirect to the current one.

02/26/07 02:58:13 changed by jhodgdon

Just a little clarification. The function being used to create the slug is sanitize_title_with_dashes in wp-includes/formatting.php.

The sequence of events is currently:

1) Post title becomes the slug candidate

2) Accents are removed (replaced by un-accented letters)

3) Characters that still look like UTF-8 are encoded into octets (%e2, etc.) with utf8_uri_encode; this is what creates the reported behavior

4) HTML entities and any character except letters, numbers, underscores, spaces, octets, and hyphens are removed (this is where other punctuation is removed)

5) Spaces are turned into hyphens, and the whole thing is lower-cased
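
To make the sequence above concrete, here is a rough Python sketch of the five steps. It is not the actual PHP code in wp-includes/formatting.php; the helper names (remove_accents, encode_non_ascii, slug_sketch) and the exact regular expressions are assumptions chosen for illustration, and lower-casing is done before the punctuation filter to keep the regex short.

```python
import re
import unicodedata
from urllib.parse import quote


def remove_accents(text):
    # Step 2: decompose accented letters and drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def encode_non_ascii(text):
    # Step 3: percent-encode every remaining non-ASCII character
    # as lower-case UTF-8 octets (%e2, etc.).
    return "".join(ch if ord(ch) < 128 else quote(ch).lower() for ch in text)


def slug_sketch(title):
    slug = remove_accents(title)                # steps 1 and 2
    slug = encode_non_ascii(slug)               # step 3
    slug = slug.lower()                         # lower-casing (part of step 5)
    slug = re.sub(r"&.+?;", "", slug)           # step 4: HTML entities
    slug = re.sub(r"[^%a-z0-9 _-]", "", slug)   # step 4: other punctuation
    slug = re.sub(r"\s+", "-", slug)            # step 5: spaces to hyphens
    return slug.strip("-")


print(slug_sketch("Don\u2019t Stop"))  # don%e2%80%99t-stop
```

The curly apostrophe survives step 2 because it is not an accented letter, so step 3 encodes it, and step 4 keeps the resulting octets because "%" is an allowed character.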

So, to fix this, we would have to add a step 2.5:

2.5: Translate into hyphens, or remove (more consistent with what happens to other punctuation), a specific list of special (but common) punctuation characters.
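
A minimal Python sketch of what such a step 2.5 might look like, assuming a hand-picked list of characters. The list below (curly single and double quotes, en and em dashes, ellipsis) is only a hypothetical example, since the actual list is exactly what question (b) below leaves open, and strip_special_punctuation is an illustrative name, not an existing WordPress function.

```python
# Hypothetical list: curly single/double quotes, en dash, em dash, ellipsis.
SPECIAL_PUNCTUATION = "\u2018\u2019\u201c\u201d\u2013\u2014\u2026"


def strip_special_punctuation(text, replacement=""):
    # Map each listed character to the replacement: "" removes it,
    # "-" translates it into a hyphen instead.
    table = {ord(ch): replacement for ch in SPECIAL_PUNCTUATION}
    return text.translate(table)


print(strip_special_punctuation("Don\u2019t Stop"))      # Dont Stop
print(strip_special_punctuation("Before\u2014After", "-"))  # Before-After
```

Running this before the UTF-8 encoding step means the listed characters never reach utf8_uri_encode, so they cannot end up as octets in the slug.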

Questions:

a) Is this worth doing, considering that the current behavior makes a usable slug, and that you can always edit your slug by hand if you want to?

b) If it is worth doing, what should the list of special punctuation characters be, and should they be removed or translated into hyphens?

03/27/07 02:25:35 changed by foolswisdom

  • milestone changed from 2.2 to 2.4.

04/03/07 23:26:54 changed by drmike

  • cc set to drmike.

Issue exists over in wp.com land as well:

http://en.forums.wordpress.com/topic.php?id=9645

04/03/07 23:27:37 changed by drmike

Also why not just strip it out?

04/20/07 04:09:29 changed by rob1n

Well, we strip *regular* quotes out, but not fancy quotes. I think this is really not going to be fixed easily -- we can strip out UTF-8 quotes, but what about other encodings?

02/14/08 14:40:26 changed by thee17

  • status changed from new to closed.
  • resolution set to wontfix.
  • milestone deleted.

02/14/08 14:52:12 changed by pishmishy

  • status changed from closed to reopened.
  • resolution deleted.

Could you please leave a comment explaining why you closed the ticket?

02/14/08 20:23:12 changed by lloydbudd

  • milestone set to 2.7.

06/20/08 00:45:38 changed by noel

  • priority changed from low to normal.

This patch should fix the problem. We were treating all Unicode characters as equal, when we should have been defining the different categories and removing the characters that belong to punctuation and similar categories.
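
The category idea can be sketched in Python using the Unicode general category of each character. This is an illustration only and not the contents of unicode-punctuation-removal.diff, which is not reproduced in this ticket; strip_unicode_punctuation is a hypothetical name, and the choice to leave ASCII untouched is an assumption so that the existing ASCII rules keep applying.

```python
import unicodedata


def strip_unicode_punctuation(text):
    # Drop non-ASCII characters whose general category starts with "P"
    # (Pi, Pf, Po, Pd, ...). Letters such as Chinese characters (category Lo)
    # are kept, as is all ASCII.
    return "".join(
        ch for ch in text
        if ord(ch) < 128 or not unicodedata.category(ch).startswith("P")
    )


print(strip_unicode_punctuation("Don\u2019t Stop"))  # Dont Stop
```

Unlike a fixed list of characters, this covers every non-ASCII punctuation mark the Unicode database knows about, at the cost of also removing marks someone might have wanted turned into hyphens.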

This patch simply attaches onto the other sanitize_title functions and will probably need to be integrated more fully in the future. For now, it works great for me on all the test cases I threw at it.

In the future, when all browsers support full Unicode characters in the URL, shouldn't we stop converting them at all? ;)

06/20/08 00:47:19 changed by noel

  • attachment unicode-punctuation-removal.diff added.

Permalink filter for unicode punctuation.

06/20/08 00:48:01 changed by noel

  • owner changed from anonymous to ryan.
  • status changed from reopened to new.
  • milestone changed from 2.7 to 2.6.

06/23/08 04:43:59 changed by noel

  • keywords changed from slug permalink to slug permalink dev-feedback.

06/23/08 04:44:13 changed by noel

  • keywords changed from slug permalink dev-feedback to has-patch slug permalink dev-feedback.

06/26/08 22:04:02 changed by ryan

Any changes to the sanitizer will lead to 404s for slugs made with the old sanitizer.
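
To illustrate the concern, here is a small Python sketch. It assumes the requested slug is run through the sanitizer before the lookup and that a patched sanitizer would also strip the already-encoded octets; both sanitizers are simplified stand-ins with hypothetical names, and the fallback in find_post is just one possible workaround, not necessarily what any attached patch does.

```python
from urllib.parse import quote


def old_sanitize(name):
    # Stand-in for the current behavior: non-ASCII characters become
    # lower-case percent-encoded octets; existing octets are left alone.
    lowered = name.lower()
    encoded = "".join(ch if ord(ch) < 128 else quote(ch).lower() for ch in lowered)
    return encoded.replace(" ", "-")


def new_sanitize(name):
    # Stand-in for a patched sanitizer that removes the curly apostrophe,
    # including its already-encoded form.
    return old_sanitize(name).replace("%e2%80%99", "")


def find_post(stored, requested):
    # Try the new sanitizer first, then fall back to the old one so that
    # slugs saved before the change still resolve.
    for candidate in (new_sanitize(requested), old_sanitize(requested)):
        if candidate in stored:
            return stored[candidate]
    return None


stored = {"don%e2%80%99t-stop": 1, "dont-panic": 2}
print(find_post(stored, "don%e2%80%99t-stop"))  # 1
```

With only new_sanitize in the lookup, the request for the first post would be turned into "dont-stop", which matches no stored slug, and that is the 404 case.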

06/26/08 22:04:29 changed by noel

  • owner changed from ryan to noel.

I'll get that sorted out and resubmit a patch.

07/02/08 08:02:07 changed by noel

  • attachment formatting-7-2-4am.diff added.

Unicode fixes that do not produce a 404.

07/02/08 08:19:45 changed by noel

  • owner changed from noel to ryan.