
Save ANSI file as UTF-8 doesn't work


#11 Re: Save ANSI file as UTF-8 doesn't work

Posted by: naputtelija | Date: 2008-03-05 06:31

I tested this too by saving the www.wikipedia.org front page and noticed that the result depends on the browser. Firefox saves the file as UTF-8 but without a BOM, so PSPad and EditPlus both open the file as ANSI.

Internet Explorer also saves the file as UTF-8, and it does write a BOM, but PSPad still does not recognise the encoding: the byte-order mark is clearly visible at the start of the first line. EditPlus shows the file correctly.

To check, try searching for Español, for example (or search for just Espa, because you will not find Español).

I'm using build 2309.

I also used a script to verify the encoding:

PS:4 > M:\Skriptit\PowerShell\Get-FileEncoding.ps1 Wikipedia.htm

BodyName : utf-8
EncodingName : Unicode (UTF-8)
HeaderName : utf-8
WebName : utf-8
WindowsCodePage : 1200
IsBrowserDisplay : True
IsBrowserSave : True
IsMailNewsDisplay : True
IsMailNewsSave : True
IsSingleByte : False
EncoderFallback : System.Text.EncoderReplacementFallback
DecoderFallback : System.Text.DecoderReplacementFallback
IsReadOnly : True
CodePage : 65001
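
The same check can be sketched without PowerShell. This is a minimal illustration in Python, not the Get-FileEncoding.ps1 script and not PSPad's detection logic; the function name sniff_utf8 and the three sample byte strings are assumptions made up for the example. It looks for the UTF-8 BOM (EF BB BF) first and otherwise falls back to a strict UTF-8 decode.

```python
import codecs


def sniff_utf8(data: bytes) -> str:
    """Classify raw bytes as 'utf-8-bom', 'utf-8' or 'not-utf-8'."""
    if data.startswith(codecs.BOM_UTF8):  # the three bytes EF BB BF
        return "utf-8-bom"
    try:
        data.decode("utf-8")  # strict mode: raises on any invalid sequence
    except UnicodeDecodeError:
        return "not-utf-8"
    return "utf-8"


# Hypothetical stand-ins for the three cases discussed in the post.
firefox_like = "Español".encode("utf-8")       # UTF-8, no BOM
ie_like = codecs.BOM_UTF8 + firefox_like       # UTF-8 with BOM
ansi_like = "Español".encode("cp1252")         # single-byte ANSI

for name, raw in [("firefox", firefox_like), ("ie", ie_like), ("ansi", ansi_like)]:
    print(name, sniff_utf8(raw))
```

Note that a pure ASCII file is also valid UTF-8, so without a BOM the fallback can only say that the bytes are consistent with UTF-8.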



#12 Re: Save ANSI file as UTF-8 doesn't work

Posted by: pspad | Date: 2008-03-05 06:46

UTF-8 is encoded mathematically; it doesn't use a character table like other encodings.
If something in the file is wrong, as in the case of the Wikipedia page, PSPad isn't able to open it as UTF-8.
Instead of showing you a blank page, PSPad shows you the content as ANSI.
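
To make "encoded mathematically" concrete, here is a minimal sketch in Python of how a single code point is turned into UTF-8 bytes with shifts and masks, following the standard UTF-8 bit layout. The function name utf8_encode_code_point is an assumption for illustration; this is not PSPad code.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 using only arithmetic."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: 0x%X" % cp)
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# The "ñ" in Español is U+00F1 and needs two bytes.
print(utf8_encode_code_point(0xF1).hex())  # c3b1
```

Because the byte pattern is fixed by this layout, a decoder can tell when a sequence is malformed, which is why a single bad sequence can make a whole file fail to open as UTF-8.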



#13 Re: Save ANSI file as UTF-8 doesn't work

Posted by: vbr | Date: 2008-03-05 10:22

It seems that there might be some problems with wide Unicode characters (above U+FFFF).
If I open the source code of the main Wikipedia page in another editor as UTF-8, paste the content into PSPad and save it there as UTF-8, the resulting file can be opened normally in this encoding in PSPad.

The only difference from the original source seems to be the line containing
lang="got" xml:lang="got" title="Gutisk">

Edited 1 time(s). Last edit at 2008-03-05 10:35 by vbr.



#14 Re: Save ANSI file as UTF-8 doesn't work

Posted by: Freeman | Date: 2008-07-15 00:03

vbr:
It seems that PSPad can handle wide Unicode characters somehow, though maybe the internal encoding is different (?); in any case, there are probably no appropriate (fixed-width) fonts for these Unicode ranges either.

I think all the Unicode support used by PSPad is provided by Windows.

Windows NT originally used UCS-2 (a subset of UTF-16), which cannot represent characters above U+FFFF. Windows 2000 added full UTF-16, including characters encoded as two 16-bit words. IMHO, your sample demonstrates this effect.
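
The difference between UCS-2 and UTF-16 can be sketched as follows, in Python for brevity. The function name to_utf16_units is an assumption, and the Gothic letter U+10330 is only an illustrative code point above U+FFFF (the thread does not say which Gothic characters the page contained).

```python
from typing import List


def to_utf16_units(cp: int) -> List[int]:
    """Return the UTF-16 code units (16-bit words) for one code point."""
    if cp < 0x10000:
        # Fits in a single 16-bit word; this is all that UCS-2 can express.
        return [cp]
    v = cp - 0x10000              # 20 bits remain
    high = 0xD800 | (v >> 10)     # top 10 bits -> high (lead) surrogate
    low = 0xDC00 | (v & 0x3FF)    # bottom 10 bits -> low (trail) surrogate
    return [high, low]


for cp in (0x00F1, 0x10330):
    units = to_utf16_units(cp)
    print("U+%04X -> %s" % (cp, " ".join("%04X" % u for u in units)))
```

Software that treats every 16-bit word as one complete character (the UCS-2 view) sees the second result as two separate, meaningless characters.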



#15 Re: Save ANSI file as UTF-8 doesn't work

Posted by: pspad | Date: 2008-07-20 15:25

Extended code page support is my priority now.



#16 Re: Save ANSI file as UTF-8 doesn't work

Posted by: Freeman | Date: 2015-03-19 19:10

vbr:
It seems that there might be some problems with wide Unicode characters (above U+FFFF).

The problem still exists in build 2655. Some sites, like VK.com, use Unicode emoticon characters for smileys, and most of them have code points above U+FFFF. My program can save messages using 4-byte UTF-8 sequences for the smileys, but PSPad cannot open such a file.

AFAIK, the problem is in the System.Utf8ToUnicode function under Delphi 6/7. Jan, you could use MultiByteToWideChar with CP_UTF8 instead.



#17 Re: Save ANSI file as UTF-8 doesn't work

Posted by: pspad | Date: 2015-03-19 20:15

PSPad works internally with 2-byte Unicode only. When you open a file, everything is converted to UTF-16. Changing this isn't possible, at least not without rewriting nearly the whole program.



#18 Re: Save ANSI file as UTF-8 doesn't work

Posted by: Freeman | Date: 2015-03-20 18:33

pspad:
When you open a file, everything is converted to UTF-16. Changing this isn't possible, at least not without rewriting nearly the whole program.

UTF-16 supports surrogate pairs for characters above U+FFFF. Windows XP can handle these two-WideChar combinations, and it works fine!

For example:

const
  Smile: array[0..2] of WideChar = (WideChar($D83D), WideChar($DE0A), #0);
begin
  MessageBoxW(Application.Handle, @Smile, nil, 0);
end;

This code shows only one "rectangular character", because the surrogate pair consists of two UTF-16 code units (WideChars) that together encode a single Unicode character.
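
The arithmetic behind that pair can be sketched like this, in Python for brevity. The function name combine_surrogates is an assumption for illustration; it applies the standard UTF-16 rule to the two WideChar values used in the Delphi snippet.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into the code point it encodes."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: 0x%04X" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: 0x%04X" % low)
    # Each surrogate carries 10 bits of the value minus 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


cp = combine_surrogates(0xD83D, 0xDE0A)
print("U+%04X" % cp)  # U+1F60A
```

So the two WideChars in the array are one character as far as Windows is concerned, even though the array holds two 16-bit values before the terminating #0.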

Unfortunately, the MultiByteToWideChar function cannot decode 4-byte UTF-8 under Windows XP; that was my mistake. I will investigate further. My program uses its own UTF-8 decoder, written from scratch, but it is a little too complex to use outside the library that contains it.

I can write new Utf8Decode/Utf8Encode functions for PSPad, if you are ready to use them. :)



#19 Re: Save ANSI file as UTF-8 doesn't work

Posted by: pspad | Date: 2015-03-20 18:54

Freeman:
I can write new Utf8Decode/Utf8Encode functions for PSPad, if you are ready to use them. :)

If this function converts your UTF-8 into UTF-16 correctly and is able to handle a 4-byte character in 2-byte Unicode characters...
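
What such a function has to do can be sketched in a few lines. The following is a minimal from-scratch decoder in Python, not Freeman's library and not PSPad code; the name utf8_to_utf16_units is an assumption. It turns UTF-8 bytes into a list of 16-bit UTF-16 code units and emits a surrogate pair for every 4-byte sequence.

```python
from typing import List


def utf8_to_utf16_units(data: bytes) -> List[int]:
    """Decode strict UTF-8 into UTF-16 code units (surrogate pairs included)."""
    units: List[int] = []
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        # Lead byte decides the sequence length and the smallest legal value.
        if b0 < 0x80:
            cp, extra, min_cp = b0, 0, 0
        elif 0xC2 <= b0 <= 0xDF:
            cp, extra, min_cp = b0 & 0x1F, 1, 0x80
        elif 0xE0 <= b0 <= 0xEF:
            cp, extra, min_cp = b0 & 0x0F, 2, 0x800
        elif 0xF0 <= b0 <= 0xF4:
            cp, extra, min_cp = b0 & 0x07, 3, 0x10000
        else:
            raise ValueError("invalid lead byte 0x%02X at offset %d" % (b0, i))
        if i + extra >= n:
            raise ValueError("truncated sequence at offset %d" % i)
        for k in range(1, extra + 1):
            b = data[i + k]
            if (b & 0xC0) != 0x80:
                raise ValueError("invalid continuation byte at offset %d" % (i + k))
            cp = (cp << 6) | (b & 0x3F)
        # Reject overlong forms, out-of-range values and encoded surrogates.
        if cp < min_cp or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("invalid code point at offset %d" % i)
        if cp < 0x10000:
            units.append(cp)
        else:
            v = cp - 0x10000
            units.append(0xD800 | (v >> 10))    # high surrogate
            units.append(0xDC00 | (v & 0x3FF))  # low surrogate
        i += extra + 1
    return units


print(["%04X" % u for u in utf8_to_utf16_units(b"\xF0\x9F\x98\x8A")])
```

The output buffer never needs more 16-bit units than there are input bytes, so an editor that stores text as UTF-16 can keep its internal representation unchanged.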



#20 Re: Save ANSI file as UTF-8 doesn't work

Posted by: Freeman | Date: 2015-03-20 21:36

Okay, take the first Unicode emoticon from the VK.com list, U+1F60A (128522 decimal). I can copy it and paste it into PSPad, where it appears as a "rectangle", then save the file as UTF-16. This produces two UTF-16 code units in the file, a surrogate pair:
D83D DE0A

Great! I can reopen this file in PSPad and save it as UTF-8. This produces two 3-byte sequences:
ED A0 BD ED B8 8A

This is not true UTF-8 but CESU-8 (Compatibility Encoding Scheme for UTF-16: 8-Bit), in which each surrogate is encoded separately as a 3-byte sequence. CESU-8 is a documented encoding, but a strict UTF-8 decoder rejects it. Other Unicode-aware editors, like the one built into Far Manager, produce true UTF-8, a single 4-byte sequence starting with $Fx:
F0 9F 98 8A

PSPad cannot decode that file.
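
The difference between the two byte sequences can be checked with a short sketch, here in Python, whose strict UTF-8 codec rejects encoded surrogates. The helper name cesu8_to_utf8 is an assumption for illustration and handles only the conversion shown here; it is not a full CESU-8 codec.

```python
cesu8 = bytes.fromhex("EDA0BDEDB88A")  # two 3-byte sequences, one per surrogate
utf8 = bytes.fromhex("F09F988A")       # one 4-byte sequence


def cesu8_to_utf8(data: bytes) -> bytes:
    """Re-encode CESU-8 style bytes as true UTF-8 (illustrative helper)."""
    # 'surrogatepass' lets the decoder accept the separately encoded surrogates.
    text = data.decode("utf-8", "surrogatepass")
    # A round trip through UTF-16 joins each surrogate pair into one character.
    text = text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
    return text.encode("utf-8")


try:
    cesu8.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

print("strict decode of CESU-8 bytes succeeded:", strict_ok)
print("repaired:", cesu8_to_utf8(cesu8).hex())
print("same character:", utf8.decode("utf-8") == "\U0001F60A")
```

An editor that writes the first form and can only read it back itself is in effect using a private encoding, which is why files exchanged with other tools need the 4-byte form.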

Because Windows XP and newer fully support UTF-16, the problem is in Delphi itself:

// Requires the Windows unit (MultiByteToWideChar, WideCharToMultiByte, CP_UTF8).

function Utf8Decode(const Source: UTF8String): WideString; // my function
var
  L: Integer;
  Dest: WideString;
begin
  L := Length(Source);
  SetLength(Dest, L); // worst case: one WideChar per input byte
  // Flags must be 0 for CP_UTF8; the result is the number of WideChars written
  SetLength(Dest, MultiByteToWideChar(CP_UTF8, 0, Pointer(Source), L, Pointer(Dest), L));
  Result := Dest;
end;

function Utf8Encode(const Source: WideString): UTF8String; // my function
var
  L: Integer;
  Dest: UTF8String;
begin
  L := Length(Source);
  SetLength(Dest, L * 3); // worst case: three bytes per WideChar
  // The result is the number of bytes written
  SetLength(Dest, WideCharToMultiByte(CP_UTF8, 0, Pointer(Source), L, Pointer(Dest), Length(Dest), nil, nil));
  Result := Dest;
end;

procedure TMainForm.Button1Click(Sender: TObject);
const
  Smile = #$F0#$9F#$98#$8A; // 4-byte UTF-8
begin
  MessageBoxW(Handle, Pointer(Utf8Decode(Smile)), nil, 0); // one char, good
  MessageBoxW(Handle, Pointer(System.Utf8Decode(Smile)), nil, 0); // null string, wrong
end;

When compiled under Delphi XE2, the built-in function also produces a surrogate pair, because Embarcadero fixed System.Utf8Decode in the Unicode-aware versions of Delphi.

I was a little wrong about UTF-8 in Windows XP: both surrogates and 4-byte UTF-8 work fine, but we have to pass zero as the Flags parameter to the MultiByteToWideChar function, as documented in MSDN.







