03.02.02

tags balancer  -  @ 02:20:42 347
In addition to the HTML entities to Unicode converter, there is now an HTML corrector in b2 (in v0.6pre).
What does it do? It closes all unclosed tags in posts and comments, so that a single mistake does not, for example, italicize your whole page.
This balanceTags code is courtesy of Leonard Lin @ http://randomfoo.net - praises to him.
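
A minimal sketch of the idea, written in Python rather than b2's PHP. It is not Leonard Lin's actual balanceTags code. It keeps a stack of open tags, closes whatever is still open at the end, and drops closing tags that have no opener. The names `balance_tags` and `VOID_TAGS` and the regular expression are assumptions made for illustration; HTML comments and attribute values containing `>` are not handled.

```python
import re

# Elements that never take a closing tag (an assumed, incomplete list).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)([^<>]*)>")

def balance_tags(text):
    """Close unclosed tags and drop closing tags that were never opened."""
    out = []      # pieces of the corrected output
    stack = []    # names of tags that are currently open
    pos = 0
    for m in TAG_RE.finditer(text):
        out.append(text[pos:m.start()])   # plain text before this tag
        pos = m.end()
        closing, name, rest = m.group(1), m.group(2).lower(), m.group(3)
        if name in VOID_TAGS or rest.rstrip().endswith("/"):
            out.append(m.group(0))        # void or self-closed: pass through
        elif not closing:
            stack.append(name)            # opening tag: remember it
            out.append(m.group(0))
        elif name in stack:
            # Close everything opened after the matching tag, then the tag itself.
            while stack:
                top = stack.pop()
                out.append("</%s>" % top)
                if top == name:
                    break
        # Otherwise: a stray closing tag, silently dropped.
    out.append(text[pos:])
    while stack:                          # close whatever is still open
        out.append("</%s>" % stack.pop())
    return "".join(out)

print(balance_tags("This is <i>important"))
```

With this sketch, a comment that forgets its closing `</i>` gets one appended, so the italics stop at the end of that comment instead of running down the page.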
7 comments

 

:: comments

 

tj_edit - email - url
Hi, Michel.

What charsets does the HTML entities to Unicode converter support?


05.02.02 @ 02:05:09 336

 

michel v - email - url
It converts named HTML entities to their Unicode character reference; and that's about it. I guess this means ISO-8859-1, but it surely covers more than that ; ) 
TJ, if you could test typing some Japanese text, I would be glad. I'm not sure about that.
05.02.02 @ 10:31:26 688
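
The conversion described in the comment above (named HTML entities to numeric character references) can be sketched in Python as follows. This is an illustration under assumptions, not b2's PHP code: the function name `named_to_numeric` and the choice to leave `&amp;`, `&lt;`, `&gt;` and `&quot;` untouched are hypothetical. The entity table comes from Python's standard `html.entities` module.

```python
import re
from html.entities import name2codepoint

ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")
# Assumption: entities that are significant to the markup itself are kept as-is.
KEEP = {"amp", "lt", "gt", "quot"}

def named_to_numeric(text):
    """Replace known named entities with decimal numeric character references."""
    def repl(m):
        name = m.group(1)
        if name in KEEP or name not in name2codepoint:
            return m.group(0)             # unknown or protected: leave alone
        return "&#%d;" % name2codepoint[name]
    return ENTITY_RE.sub(repl, text)

print(named_to_numeric("caf&eacute; &mdash; &amp;"))
```

Note that nothing here touches the charset of the text itself; only the entity names are rewritten.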

 

tj_edit - email - url
Michel V:

I ask because mapping the charsets to Unicode (utf-8, in this case) is --in theory-- easily done. But in practice, not so easily done.

Some of the big boys [IBM] have developed applications in C or Perl to do just this: convert documents from charset_x to utf-8 (or utf-16 | 32).

But those applications don't work as interfaces to content management systems. They just batch process the selected docs.

Likewise, you can download various charset converters as shareware and/or open source.

But in my experience thus far, and I would happily be corrected, the ones that work right work only for Latin-1 and the major European languages | charsets. So that's a bummer.

The Asian | subcontinent languages/charsets are not truly supported in the versions I've tried | tested so far. Again, I would be happily corrected / given new apps to try.

But even that would not give us an interface that would convert charset-x to utf-8 on the fly. That would be really cool!

But it seems to be quite a challenge.

Here are some common encodings for B2/Cafelog sites:
iso-8859-1 (Western Europe),
iso-8859-2 (Central Europe),
iso-8859-4 (Baltic Rim),
iso-8859-5 (Cyrillic),
windows-1250 (Central Europe),
windows-1251 (Cyrillic),
windows-1252 (Western Europe),
windows-1257 (Baltic Rim).

Latin-1 -- iso-8859-1 -- is relatively transparent (a non-issue).

How could these other charsets be automatically mapped/converted to utf-8 (Unicode)?

That's the question for which I do NOT have an answer.

It might depend on how much Unicode support PHP has [at this time].

If it covers those cases, great! If not, I have no idea how to proceed. ;-(

Below, some shift_jis | Japanese. This, I'm told, is a particularly difficult/ugly charset to convert. If it looks like pure gibberish, don't be surprised!

ˆâ“`žq‘gŠ·H•i Genetically Engineered Food
05.02.02 @ 17:46:17 990
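
For the single-byte charsets listed in the comment above, the mapping to utf-8 is mechanical once the source charset is known: decode the bytes with that charset, then encode the result as utf-8. The sketch below uses Python's built-in codecs only as an illustration; it says nothing about what PHP offered at the time, and the function name `to_utf8` is hypothetical. It does not address the harder part of the problem, which is finding out which charset the incoming bytes are actually in.

```python
def to_utf8(raw: bytes, charset: str) -> bytes:
    """Re-encode bytes from a known source charset to utf-8."""
    # Assumption: the caller knows the charset; a wrong guess either raises
    # UnicodeDecodeError or silently produces the wrong characters.
    return raw.decode(charset).encode("utf-8")

# The same bytes mean different things depending on the declared charset.
cyrillic = to_utf8(b"\xcf\xf0\xe8\xe2\xe5\xf2", "windows-1251")
polish = to_utf8(b"\xb3", "iso-8859-2")
print(cyrillic.decode("utf-8"), polish.decode("utf-8"))
```

A web form would have to carry the charset alongside the submitted bytes (or fix it site-wide) for a step like this to be applied on the fly.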

 

michel v - email - url
Actually there are 2 misunderstatements here ; ) 
1. I don't mean it translates between charsets: I gave up on that at some point. It translates to stuff like &#8212; (HTML numeric entities).
2. Errrr, I meant posting Japanese text in the v0.6 testdrive blog. I've just tested, though, and the Unicode filter kills the Japanese text I pasted from sega.jp, so it looks like I'll have to provide an option to disable it for users of non-Latin charsets. It could also be a bug on my end, since I'm using windows-1252 on this computer. I hope it's the latter.
05.02.02 @ 19:24:01 058
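
The windows-1252 suspicion in the comment above can be illustrated with a small Python sketch: Shift_JIS bytes that are read as windows-1252 turn into accented Latin characters and punctuation. This only demonstrates the general effect; it is not a diagnosis of what the b2 filter did, and the helper name `misread` is hypothetical.

```python
def misread(text, actual="shift_jis", assumed="cp1252"):
    """Encode text in one charset, then decode the bytes as a different one."""
    return text.encode(actual).decode(assumed, errors="replace")

original = "日本語"
garbled = misread(original)

# For this sample every byte happens to be defined in windows-1252,
# so the damage can be undone by reversing the two steps.
restored = garbled.encode("cp1252").decode("shift_jis")

print(garbled)
print(restored == original)
```

When a byte has no windows-1252 meaning, it is replaced or dropped and the original text can no longer be recovered this way.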

 

michel v - email - url
by the way TJ I'm online these days on Y!IM as 'cafelog', but I never see you : ( 
05.02.02 @ 19:24:38 058

 

tj_edit - email - url
Michel:

My fault! I was jumping from entities to charsets.

As for your "giving up on it" [translating between charsets for web-based applications], HEY:

it's an internationalization (i18n) issue. If you solve it by yourself, great! You'll be famous.

But I suspect that it will take a team of well-funded smart people to crack it. I know of some bits and pieces of related projects, but nothing that would apply directly to content management systems | blogs. [That is, the problems described above].

I was just hoping you had heard something I had not.

As for YIM, my bad. I'll make sure it's on.

Best to all,
TJH

06.02.02 @ 00:09:25 256

 

Chief - email - url
Please forgive me for typing this here. I've been having a comment-posting bug on my install of b2, and you had suggested it might be a browser problem. So I thought I would post here to see if the problem occurs here as well. That way I can let you know in the forum.
15.02.02 @ 15:18:56 888

 

