Posted by: vbr | Date: 2019-03-21 18:50 | IP: IP Logged
maki:Example Blank Char:
The character in your above sample is:
(dec.: 8203) (hex.: 0x200b) # ZERO WIDTH SPACE (Other, Format) (General Punctuation [8192-8303] [0x2000-0x206f])
In the previous one it was:
⠀ (dec.: 10240) (hex.: 0x2800) # ⠀ BRAILLE PATTERN BLANK (Symbol, Other) (Braille Patterns [10240-10495] [0x2800-0x28ff])
However, it is not clear to me from these posts what problem you are actually trying to solve here.
The chained regex patterns are rather hard to survey: are the respective characters (codepoints) contained in them, or is the search not matching where you think it should?
Posted by: maki | Date: 2019-03-27 23:34 | IP: IP Logged
I copied some texts from various pages into a text file, and it turns out that the file contains various unknown and hard-to-analyze characters.
I see normal text in Polish (for me that's 1250),
but the text also contains some Unicode encoding etc.
How can I fix all of this? (Without manipulating or damaging the sentences, words, spaces, etc. of the text.)
I gave only one or a few examples; what should I do with the other encodings, and how do I know which other encodings the text contains? There could be 100 different ones, maybe a thousand, or certainly more "Unicode" characters.
Edited 2 time(s). Last edit at 2019-03-27 23:38 by maki.
Posted by: pspad | Date: 2019-03-28 04:26 | IP: IP Logged
maki: I copied some texts from various pages into a text file, and it turns out that the file contains various unknown and hard-to-analyze characters.
Can you provide the page you copied the text from?
Posted by: vbr | Date: 2019-03-28 16:07 | IP: IP Logged
It is still not quite clear what the actual problems to be solved are, but maybe some guesses or hints based on the above data will help clarify things a bit.
If you need to work with Unicode data like character names, there are dedicated tools for this, such as BabelPad:
If you just want to detect any characters outside a given charset like windows-1250, you can use a negated character set in a regex, [^...], containing the respective characters, e.g. according to:
it might be approximately:
(There might be some corner cases with control characters that are not handled completely in text editors.)
If this works for your needs, you can also replace the found non-cp-1250 characters with a space, an empty string, or some other replacement character.
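Since the concrete regex is not quoted here, the same check can be sketched another way: instead of enumerating the allowed characters in a negated class, simply test each character for encodability in cp1250. A minimal Python sketch (the helper name `non_cp1250_chars` is my own, not from the thread):

```python
def non_cp1250_chars(text: str) -> set[str]:
    """Return the set of characters that have no mapping in windows-1250."""
    bad = set()
    for ch in text:
        try:
            ch.encode("cp1250")  # raises if the char is outside the charset
        except UnicodeEncodeError:
            bad.add(ch)
    return bad

# Polish letters like "ż" pass, the zero width space U+200B does not:
print(non_cp1250_chars("ża\u200bb"))  # {'\u200b'}
```

The found characters can then be removed or replaced one by one, which avoids building (and maintaining) a very long character class by hand.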
The "personal list" posted in
might evaluate to something like:
[some characters were replaced with ..., as they apparently also cause problems and cannot be handled in this form either]
However, there are some quirks that prevent editors from displaying it completely, as the data is "illegal" in some respects with regard to the Unicode encoding standard:
there are lone surrogate "pseudocharacters", appearing in isolation, which are actually only expected (allowed) to be used in pairs, in UTF-16, to encode Unicode characters in the additional Unicode planes above U+FFFF.
In your data there are multiple invalid surrogates, but some of them combine into existing codepoints, partly rather "exotic" ones, cf.
# (dec.: 1059933) (hex.: 0x102c5d) # ? (Other, Private Use) surrog. (Supplementary Private Use Area-B [1048576-1114111] [0x100000-0x10ffff])
# (dec.: 65581) (hex.: 0x1002d) # LINEAR B SYLLABLE B031 SA (Letter, Other) surrog. (Linear B Syllabary [65536-65663] [0x10000-0x1007f])
I am not sure whether this is intended or rather randomly built combinations, as the underlying logic of the character sets is not clear to me.
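For reference, the way a UTF-16 surrogate pair maps to a supplementary-plane codepoint can be sketched with the standard formula (Python; the helper name is my own):

```python
import unicodedata

def combine_surrogates(hi: int, lo: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one codepoint."""
    assert 0xD800 <= hi <= 0xDBFF, "high surrogate expected"
    assert 0xDC00 <= lo <= 0xDFFF, "low surrogate expected"
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

# The pair D800 DC2D yields U+1002D, the Linear B character quoted above:
cp = combine_surrogates(0xD800, 0xDC2D)
print(hex(cp), unicodedata.name(chr(cp)))
# 0x1002d LINEAR B SYLLABLE B031 SA
```

A lone surrogate outside such a pair has no valid interpretation, which is why editors choke on the pasted list.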
Apart from that, I believe PSPad's regex engine currently doesn't support Unicode literals like \u00b5, only the hexadecimal ones, \xb5 (up to \xff), hence this notation would only work in programs supporting it.
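To illustrate the difference, here is a sketch in Python, whose re module does accept \uXXXX escapes (the sample string is made up):

```python
import re

text = "zero\u200bwidth"

# An engine with \uXXXX support can name the character directly:
assert re.sub(r"\u200b", "", text) == "zerowidth"

# \xHH notation only reaches U+00FF, so U+200B cannot be written that
# way; with such an engine you would have to paste the literal
# (invisible) character into the pattern instead:
assert re.sub("\u200b", "", text) == "zerowidth"
```

In PSPad, that means copying the actual character into the search field rather than typing an escape sequence.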
If you have text files with unknown encoding, there are some detection heuristics available, like the one in PSPad, but in some cases the encoding can't be determined reliably if the data is ambiguous.
If a mix of multiple encodings is used in one file (e.g. database dumps, joined files with different encoding properties, etc.), generalized handling of such corrupt data may be rather hard.
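A very rough sketch of such a heuristic in Python (not PSPad's actual algorithm): try UTF-8 first, because well-formed UTF-8 rarely occurs by accident, then fall back to windows-1250 as discussed in this thread, and finally to latin-1, which maps every byte:

```python
def guess_decode(data: bytes) -> tuple[str, str]:
    """Try encodings in order of decreasing strictness; return (text, encoding)."""
    for enc in ("utf-8", "cp1250"):
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            pass
    # latin-1 accepts any byte sequence, so this never fails (but may be wrong)
    return data.decode("latin-1"), "latin-1"

# cp1250 bytes of Polish text are not valid UTF-8, so the fallback triggers:
text, enc = guess_decode("zażółć".encode("cp1250"))
print(enc)  # cp1250
```

This only works per file; if the encodings are mixed inside one file, no whole-file guess can be correct.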
Posted by: maki | Date: 2019-03-28 16:55 | IP: IP Logged
Some Russian webpages (Russian words translated into Polish words) contain a non-standard space in some places in the text.
The text gets damaged and changed incorrectly! Please help.
Edited 6 time(s). Last edit at 2019-03-28 17:04 by maki.
Posted by: vbr | Date: 2019-03-28 21:09 | IP: IP Logged
maki: Some Russian webpages (Russian words translated into Polish words) contain a non-standard space in some places in the text.
The text gets damaged and changed incorrectly! Please help.
That whitespace character is:
# (dec.: 8203) (hex.: 0x200b) # ZERO WIDTH SPACE (Other, Format) (General Punctuation [8192-8303] [0x2000-0x206f])
It depends on your use case how you need to replace it; it should normally be invisible, it only delimits word boundaries internally, e.g. for potential line breaks.
It might be suitable to replace it with an empty string, i.e. joining the surrounding parts into one "word", if inserting a normal space in these places is not correct in your case.
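Both options are one-liners; a Python sketch with a made-up sample word:

```python
text = "сло\u200bво"  # U+200B ZERO WIDTH SPACE hidden between syllables

joined = text.replace("\u200b", "")    # join surrounding parts into one word
spaced = text.replace("\u200b", " ")   # or turn it into a visible space

print(joined)  # слово
```

Which variant is right depends on whether the zero width space was separating words (use a space) or just marking a break opportunity inside one word (use the empty string).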
Posted by: maki | Date: 2019-03-28 22:29 | IP: IP Logged
In general, when selecting this Unicode character, the text moves (left-right) strangely in PSPad.
But I used a different editor and it behaves better there.