Ticket #1955 (new defect)

Opened 3 years ago

Last modified 10 months ago

Outgoing trackback ping could include 'charset' attribute for international trackbacks

Reported by: matopc Assigned to: anonymous
Priority: normal Milestone: 2.9
Component: Optimization Version: 2.0
Severity: normal Keywords: trackback charset has-patch
Cc:

Description

Although the original trackback API (Six Apart) requires only title, excerpt, url and blog_name attributes in trackback pings, 'charset' attribute is getting appreciated to correclty parse trackbacks encoded in multibyte character sets.

Current WordPress does parse 'charset' attribute in incoming trackbacks (please see wp-trackback.php), but 'charset' attribute is lacking in outgoing trackbacks. Recieving trackbacks without 'charset' attribute, wp-trackback.php tries to convert characters using PHP mb_convert_encoding function, and this sometimes fails even for UTF-8 trackbacks causing broken characters.

Proposal: /wp-includes/functions.php, line 776-

// Send a Trackback
function trackback($trackback_url, $title, $excerpt, $ID) {
	global $wpdb, $wp_version;

	if ( empty($trackback_url) )
		return;

	$title = urlencode($title);
	$excerpt = urlencode($excerpt);
	$blog_name = urlencode(get_settings('blogname'));
	$tb_url = $trackback_url;
	$url = urlencode(get_permalink($ID));
	$query_string = "title=$title&url=$url&blog_name=$blog_name&excerpt=$excerpt";

could be like this

// Send a Trackback
function trackback($trackback_url, $title, $excerpt, $ID) {
	global $wpdb, $wp_version;
 
	if (empty($trackback_url))
	return;
 
	$title = urlencode($title);
	$excerpt = urlencode($excerpt);
	$blog_name = urlencode(get_settings('blogname'));
	$tb_url = $trackback_url;
	$url = urlencode(get_permalink($ID));
	$charset = get_settings('blog_charset');
	$query_string = "title=$title&url=$url&blog_name=$blog_name&excerpt=$excerpt&charset=$charset";

When 'charset' argument is received by other blog systems that do not require charset attribute, it will be just ignored. This modification may help many of multibyte character users. Thank you.

Attachments

trackback.patch (1.5 kB) - added by drssay on 09/05/06 13:10:01.
1955.diff (1.3 kB) - added by majelbstoat on 09/05/06 15:31:04.
Filter trackback charset.
trackback2.patch (1.8 kB) - added by drssay on 09/05/06 15:56:26.

Change History

11/26/05 18:04:44 changed by davidhouse

  • keywords changed from trackback charset to trackback charset bg|has-patch.

11/29/05 01:16:04 changed by philor

Hmm. Isn't the bug that we ignore an incoming charset param on the content-type header, rather than that we don't send a charset param in the post data that hasn't ever seen the light of any spec? According to [1734] the WP Japan folks wanted it that way, but I'm not sure quite why, since that was after the August 2004 change in the Trackback spec that called for it on the content-type header, where we do send it.

11/30/05 05:09:57 changed by matopc

Hi philor, Thanks for the comment. I agree that it is reasonable to parse charset in the header instead of the post. I may have misunderstood about the change of trackback specification seeing outdated websites. So we'd better to tweak wp-trackback.php, right?

12/01/05 02:36:35 changed by matopc

I modified wp-trackback.php to parse charset parameter in the request header. Below codes need to be verified, works for me though. Thank you.

/wp-trackback.php, lines 33-

$tb_url    = $_POST['url'];
$title     = $_POST['title'];
$excerpt   = $_POST['excerpt'];
$blog_name = $_POST['blog_name'];
$charset   = $_POST['charset'];

if ($charset)
	$charset = strtoupper( trim($charset) );
else
	$charset = 'ASCII, UTF-8, ISO-8859-1, JIS, EUC-JP, SJIS';

change to

$tb_url    = $_POST['url'];
$title     = $_POST['title'];
$excerpt   = $_POST['excerpt'];
$blog_name = $_POST['blog_name'];
$headers = apache_request_headers();
$charset = stristr ($headers['Content-Type'], 'charset=');

if ($charset) {
    $charset = (strpos ($charset, ';')) ? substr ($charset, 8, (strpos($charset, ';') - 8)) : substr ($charset, 8);
    $charset = strtoupper (trim ($charset));
    } else {
    $charset = 'ASCII, UTF-8, ISO-8859-1, JIS, EUC-JP, SJIS';
    }

08/19/06 01:02:58 changed by drssay

  • version changed from 1.6 to 2.0.
  • milestone set to 2.1.
$headers = apache_request_headers();
$charset = stristr ($headers['Content-Type'], 'charset=');

change to below:

$charset = stristr ($_SERVER['Content-Type'], 'charset=');

is possible?

09/05/06 07:42:12 changed by Tae-young

  • type changed from defect to enhancement.
  • 'ASCII' is 100% compatible with 'UTF-8' so it will be better to remove 'ASCII' from $charset;
  • I've never seen error in conversion string from 'iso8859-1' to 'utf-8' (or other) if the string was encoded with 'euc-kr' or whatever. I don't know how mb_convert_encoding works, but I think mb_convert_encoding will judge 'iso8859-1' is the appropriate encoding because there's no error in conversion 'iso8859-1' to any other encoding. How about move 'iso8859-1' to the end of $charset.
  • Why there's no Korean encoding like 'euc-kr', 'cp949' in $charset? (cp949 is compatible with euc-kr, and cp949 has every character in euc-kr, so it will be enough adding 'cp949' to $chraset)
  • For system which doesn't support mbstrings, how about using iconv if there not exist mbstring module.
    if ($_POST['charset'])
        $charset = strtoupper( trim($_POST['charset']) );
    else
        $charset = 'UTF-8, EUC-KR, JIS, EUC-JP, SJIS, ISO-8859-1';

    $passed = array( 0, 0, 0 );
    if ( function_exists('mb_convert_encoding') ) {

        $title     = mb_convert_encoding($title, get_settings('blog_charset'), $charset);
        $excerpt   = mb_convert_encoding($excerpt, get_settings('blog_charset'), $charset);
        $blog_name = mb_convert_encoding($blog_name, get_settings('blog_charset'), $charset);
    }
    else if ( function_exists('iconv') ) {
        $charsets = explode( ', ', $charset  );
        
        for( $i = 0 ; $i < count($charsets); $i++ ){
            if ( !$passed[0] )
                $passed[0] = iconv($charsets[$i], get_settings('blog_charset'), $title);
            if ( !$passed[1] )
                $passed[1] = iconv($charsets[$i], get_settings('blog_charset'), $excerpt);
            if ( !$passed[2] )
                $passed[2] = iconv($charsets[$i], get_settings('blog_charset'), $blog_name);
        }
        $title = $passed[0];
        $excerpt = $passed[1];
        $blog_name = $passed[2];

    }

09/05/06 07:43:07 changed by aqua0125

  • type changed from enhancement to defect.

09/05/06 11:46:32 changed by majelbstoat

Is there a compelling reason not to just add a filter and make this pluggable? Then plugins, especially multilingual ones, can take advantage and send whatever trackback charset is required. (I'm assuming this area of the code isn't already pluggable, but I haven't checked).

09/05/06 13:09:00 changed by drssay

First patch upload

include: 1. matopc's update 2. iconv fallback

ps. trackback response is always UTF-8. so in LINE9, charset is set fixed value utf-8.

need : validate

09/05/06 13:10:01 changed by drssay

  • attachment trackback.patch added.

09/05/06 14:34:49 changed by Tae-young

majelbstoat// Unfortunately we can't make this pluggable. there's no do_action(...) in trackback function. I made korean trackback plugin, but it's almost hack! do_action('trackback_post') called after inserting trackback to the database. (in wp-trackback.php).

drssay// here's utf-8 validation code

// Returns true if $string is valid UTF-8 and false otherwise.
// utf-8 validation with regex expression
function is_utf8($string) {
return preg_match('%^(?:
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$%xs', $string);
}

Here's another one. I referenced glib's utf-8 validating interface .

function utf8_validation( $str, &$i ){

    $i = 0;
    $len = strlen($str);

    while( $i < $len ){
    
        if( (ord($str[$i]) & 0xF0) == 0xE0 ){
            if( $i <= ($len-3) && 
                (ord($str[$i+1])&0x80 == 0x80) &&
                (ord($str[$i+2])&0x80 == 0x80) )
                    $i += 3;
            else
                return;
        }
        // 2Byte
        else if( (ord($str[$i]) & 0xE0) == 0xC0 ){
            if( $i <= ($len-2) && 
                (ord($str[$i+1])&0x80 == 0x80) )
                    $i += 2;
            else
                return;
        }
        // 1Byte
        else {
            $i++;
        }
        
    }
    
}

you can test the second code in the url below. http://mytears.org/resources/mysrc/php/unicode/

09/05/06 14:44:47 changed by Tae-young

drssay// I find bug in the patch :)

iconv's charset argument must be a single encoding like 'utf-8' or 'euc-kr', it can't be multiple encodings like 'utf-8, jis, euc-jp, ...'. So we should explode $charset to a single encoding.

09/05/06 15:21:49 changed by drssay

iconv charset error resolved

09/05/06 15:29:28 changed by majelbstoat

My suggestion was to make it pluggable!

This patch (1955.diff) adds the filter 'trackback_charset', so that any plugin can do whatever it wants to to the charset. Admittedly, it doesn't allow for iconv, but that whole section of the code could be plugged too if it was really needed.

09/05/06 15:31:04 changed by majelbstoat

  • attachment 1955.diff added.

Filter trackback charset.

09/05/06 15:33:38 changed by majelbstoat

Also, regardless, we don't need to do any of that stuff if the tb_id isn't valid, so that check should come first.

09/05/06 15:56:26 changed by drssay

  • attachment trackback2.patch added.

09/06/06 12:12:25 changed by raghos

Hi,

I'm finding an error in the way wordpress is sending trackbacks in my blog, and I'd like to know if it's the same issue described here, so I don't file a duplicate bug report.

The article here: http://eldiabloenlosdetalles.net/2006/08/16/%c2%bfque-es-the-long-tail-parte-i/

has a a trackback from: http://eldiabloenlosdetalles.net/2006/08/20/%c2%bfque-es-the-long-tail-parte-ii/

but it shows up as: http://eldiabloenlosdetalles.net/2006/08/20/c2bfque-es-the-long-tail-parte-ii/

so the link is broken. I appreciatte any feedback. Is that the same issue?

Carlos

09/06/06 19:20:10 changed by ryan

raghos, that is a different issue that has already been fixed in trunk.

11/29/06 22:56:03 changed by matt

  • milestone changed from 2.1 to 2.2.

12/18/06 21:53:56 changed by tenpura

majelbstoat,
It is already pluggable. You can bypass or recode the whole part through preprocess_comment hook.

Note to matt and ryan:
I think this encoding conversion section is one of the worst code in the WP. Currently only UTF-8, ISO-8859-1 and JIS input are seemingly converted safely and many of the rest would possibly be destroyed or result in error (Notice that even EUC-JP and SJIS are destroyed!). It seems that the demerit of the existence of the code is bigger than the merit.

Suggested Solution:
Simply remove the input encoding conversion section and let local people write their own plugins.

Reasons:

  • It is pluggable.
  • Current incomming pingback section does not have this kind of engine and people seem fine without it.
  • Implementing almighty input encoding conversion logic is almost impossible.
  • If one has proper MySQL character set settings (like skip-character-set-client-handshake), theoretically any foreign encoding input can be retrievable later on.
  • Nobody seems to like the current code.

03/27/07 19:57:58 changed by foolswisdom

  • keywords changed from trackback charset bg|has-patch to trackback charset has-patch.
  • milestone changed from 2.2 to 2.3.

09/13/07 21:50:16 changed by Nazgul

  • milestone changed from 2.3 to 2.4.

02/12/08 09:32:35 changed by pishmishy

  • milestone changed from 2.5 to 2.6.

Bumping milestone for feature freeze.