Old codepage autodetection behaviour

#1 Old codepage autodetection behaviour

Posted by: jojo | Date: 2022-01-21 22:41

I'm using the latest development version of PSPad. In older PSPad versions (maybe a year ago), PSPad could automatically detect whether a file was valid UTF-8 and, if it wasn't, fall back to Windows-1252.

Then, about a year ago I think, PSPad started warning me about every file that wasn't UTF-8 that it is, in fact, not UTF-8. That is a rather pointless warning when working with lots of legacy files that indeed are not valid UTF-8. I figured out that I can get rid of this warning by choosing the "autodetect codepage" option. Now PSPad no longer warns me about those old files, but it seems to guess the codepage randomly. Most of my files are old DOS or Windows files with European character sets, yet most of the time PSPad seems to guess that they use East Asian character sets. Is there a way to restore the old behaviour, where PSPad first tries to decode a file as UTF-8 and, if that fails, silently falls back to whatever codepage I select, without any guesswork by PSPad itself?
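
To make the requested behaviour concrete, here is a minimal Python sketch of "try UTF-8 first, then silently fall back to a user-selected codepage". It is only an illustration of the logic, not PSPad's code; the function name `decode_with_fallback` and the `cp1252` default are assumptions made for the example.

```python
def decode_with_fallback(data: bytes, fallback: str = "cp1252") -> tuple[str, str]:
    """Decode as UTF-8 if the bytes are valid UTF-8, else use the fallback.

    Returns (text, encoding_used). No statistical guessing is involved.
    """
    try:
        # "utf-8-sig" accepts UTF-8 with or without a BOM and strips the BOM.
        return data.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        # errors="replace" keeps the fallback silent even for bytes that the
        # chosen codepage leaves undefined (cp1252 has a few such bytes).
        return data.decode(fallback, errors="replace"), fallback


if __name__ == "__main__":
    print(decode_with_fallback("Grüße".encode("utf-8")))
    print(decode_with_fallback("Grüße".encode("cp1252")))
    print(decode_with_fallback("Grüße".encode("cp437"), fallback="cp437"))
```

The point of this design is that the only decision made automatically is "valid UTF-8 or not"; everything else is the user's explicit choice.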


#2 Re: Old codepage autodetection behaviour

Posted by: pspad | Date: 2022-01-22 04:19

In the past, PSPad detected ANSI/UTF-8 files only. For East European charsets it detected DOS codepages too.
Now it is able to detect a wide range of charsets. What you call "guessing" means that it counts the characters in each file, evaluates the character weights in each charset, and selects the winner.
I agree it isn't 100% reliable, but it isn't random.

If you want to help improve autodetection, I need files that are detected incorrectly, along with information about the correct code page.

If a file contains only a few accented characters, detection is hard.
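
As a rough illustration of weight-based detection in general (not PSPad's actual algorithm, whose weights and candidate list are not described in this thread), the Python sketch below scores each candidate codepage by how plausible the decoded non-ASCII characters look and picks the highest score. The weights, the candidate list, and the names `score` and `guess` are all invented for the example.

```python
import unicodedata


def score(data: bytes, encoding: str) -> float:
    """Score how plausible `data` looks when decoded with `encoding`."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return float("-inf")  # bytes are not valid in this codepage at all
    total = 0.0
    for ch in text:
        if ord(ch) < 128:
            continue  # ASCII decodes the same in these candidates: no evidence
        category = unicodedata.category(ch)
        if category.startswith("L"):
            total += 1.0  # letters are plausible in text files
        elif category.startswith("C"):
            total -= 2.0  # control or unassigned characters are suspicious
        else:
            total -= 0.5  # symbols and punctuation are weak evidence
    return total


def guess(data: bytes, candidates=("cp1252", "cp437", "cp850")) -> str:
    """Return the candidate with the highest score (first one wins ties)."""
    return max(candidates, key=lambda enc: score(data, enc))


if __name__ == "__main__":
    for sample in (b"caf\xe9", b"\xc9\xcd\xcd\xbb box", b"plain ascii"):
        print(sample, {enc: score(sample, enc) for enc in ("cp1252", "cp437", "cp850")})
```

With a single accented byte, several codepages can decode it to a letter and score the same, which is one way to see why files with few accented characters are hard to classify.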


#3 Re: Old codepage autodetection behaviour

Posted by: pspad | Date: 2022-01-22 04:22

Try switching off autodetection and selecting your "favourite" (default for you) codepage; PSPad will then work as you want.
It will still detect UTF-8 (if there are any UTF-8 encoded characters) and other Unicode encodings.


#4 Re: Old codepage autodetection behaviour

Posted by: jojo | Date: 2022-01-22 13:23

Autodetection seems to have the most trouble with files in old DOS codepages that contain ANSI art. See for example github.com or the ASM files in github.com and github.com

pspad:
Try switching off autodetection and selecting your "favourite" (default for you) codepage; PSPad will then work as you want.
It will still detect UTF-8 (if there are any UTF-8 encoded characters) and other Unicode encodings.

The problem with changing my favourite codepage is that I want my own newly created files to be all UTF-8 without BOM. So maybe a potential fix could be to have two different settings for 1) default encoding to use for newly created files and 2) default encoding to use for reading files with unknown encoding.


#5 Re: Old codepage autodetection behaviour

Posted by: pspad | Date: 2022-01-22 13:30

jojo:
The problem with changing my favourite codepage is that I want my own newly created files to be all UTF-8 without BOM. So maybe a potential fix could be to have two different settings for 1) default encoding to use for newly created files and 2) default encoding to use for reading files with unknown encoding.

If you want UTF-8 as the default, set it in the program settings / Files or in your project settings.
In this case PSPad will automatically create UTF-8 files.
The problem arises if your file doesn't contain any accented characters: in that case the content of a UTF-8 file without BOM is identical to that of an ANSI file. There is no way to tell whether the file is UTF-8 or ANSI.

If you want to be sure, I suggest adding a few accented characters (above #128) in a comment. In this case all editors with autodetection will identify your files as UTF-8 even though there is no BOM.
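
The ambiguity described here is easy to check. The short Python sketch below compares the bytes a text produces in UTF-8 (without BOM) and in an ANSI codepage; Windows-1252 is assumed as the ANSI codepage, and the helper name `same_bytes_in_both` is made up for the example.

```python
def same_bytes_in_both(text: str, ansi: str = "cp1252") -> bool:
    """True if `text` encodes to identical bytes in UTF-8 and the ANSI codepage."""
    try:
        return text.encode("utf-8") == text.encode(ansi)
    except UnicodeEncodeError:
        # The text cannot be written in the ANSI codepage at all.
        return False


if __name__ == "__main__":
    print(same_bytes_in_both("plain ASCII text"))  # nothing distinguishes the two
    print(same_bytes_in_both("// café"))           # the accented char differs
    print("é".encode("utf-8"), "é".encode("cp1252"))
```

A pure-ASCII file gives a detector nothing to work with, while a single accented character in a comment already makes the UTF-8 and ANSI byte sequences differ.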


#6 Re: Old codepage autodetection behaviour

Posted by: jojo | Date: 2022-01-22 13:37

pspad:

If you want to have UTF-8 as default, set it in the program settings / Files or in your project settings.

Maybe I misunderstood but isn't that exactly the same setting as the "favourite" setting? Because if I set that to UTF-8, I get the annoying "this file is not valid UTF-8" message any time I open legacy documents, which is not what I want.


#7 Re: Old codepage autodetection behaviour

Posted by: pspad | Date: 2022-01-22 14:09

Hello

No. This setting is for new files, not for existing files.


#8 Re: Old codepage autodetection behaviour

Posted by: jojo | Date: 2022-01-22 14:17

pspad:
Hello

No. This setting is for new files, not for existing files.

Where can I change the setting for existing files? I cannot seem to find it.


#9 Re: Old codepage autodetection behaviour

Posted by: pspad | Date: 2022-01-22 15:09

If you want to open a file in a different code page (the content is displayed incorrectly), go to menu Codepage, click the codepage you want, and reload the file (Ctrl+R, or menu File / Reload).
If you want to save a file in a different code page (the content is displayed correctly), go to menu Code page, choose the code page you want, and save the file.

The current file's code page is always shown on the status bar. You can also click the code page info on the status bar for quick access.

Edited 1 time(s). Last edit at 2022-01-22 15:10 by pspad.


#10 Re: Old codepage autodetection behaviour

Posted by: jojo | Date: 2022-01-23 13:55

All of those suggestions just hinder my workflow, I'm afraid. Maybe I should explain better what matters most to me here:

- I do not want to be warned every time I open an ANSI file that it contains an invalid UTF-8 sequence. That just slows down my work.
- I do not want the codepage of those legacy ANSI files to be autodetected, because Chinese or Japanese characters don't have the same width as, e.g., the characters in Windows-1252. For me it doesn't matter much if DOS ANSI characters are decoded incorrectly (in my head I can pretty much map the incorrect characters anyway), but if the formatting of tables breaks because some characters get turned into Chinese glyphs, it's rather annoying.
- I want all of that without having to manually reopen a file with a different encoding.
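
The width concern in the second point can be checked programmatically. The Python sketch below uses the Unicode East Asian Width property to flag wide characters in a decoded line and to estimate its width in monospace cells. Counting wide characters as two cells is a common terminal convention and only an assumption here; how PSPad actually renders such glyphs depends on its font handling. The names `display_width` and `has_wide_chars` are invented for the example.

```python
import unicodedata

WIDE = ("W", "F")  # East Asian Wide and Fullwidth


def has_wide_chars(text: str) -> bool:
    """True if any character is East Asian Wide or Fullwidth."""
    return any(unicodedata.east_asian_width(ch) in WIDE for ch in text)


def display_width(text: str) -> int:
    """Estimate the width of `text` in monospace cells."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks take no cell of their own
        width += 2 if unicodedata.east_asian_width(ch) in WIDE else 1
    return width


if __name__ == "__main__":
    for line in ("| a | b |", "│ a │ b │", "│ 漢 │ b │"):
        print(repr(line), display_width(line), has_wide_chars(line))
```

A check like `has_wide_chars` on the decoded text would be one way to notice that a legacy table has been decoded with an East Asian codepage.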

