Re: character or text to identify.

#11 Re: character or text to identify.

Posted by: vbr | Date: 2019-03-21 18:50 | IP: IP Logged

maki:
Example Blank Char:

​​

...

Hi,
the character in your above sample is:
​ (dec.: 8203) (hex.: 0x200b) # ​ ZERO WIDTH SPACE (Other, Format) (General Punctuation [8192-8303] [0x2000-0x206f])

in the previous one it was:
⠀ (dec.: 10240) (hex.: 0x2800) # ⠀ BRAILLE PATTERN BLANK (Symbol, Other) (Braille Patterns [10240-10495] [0x2800-0x28ff])

However, it is not clear to me from these posts what problem you are trying to solve here.
The chained regex patterns are rather hard to follow: are the respective characters (codepoints) contained in them, or is the search not matching where you think it should?
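
For reference, a listing like the one above can be reproduced outside the editor. This is only a minimal sketch, assuming Python 3 and its standard unicodedata module; the helper name describe is made up for illustration.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return codepoint, Unicode name and general category of one character."""
    cp = ord(ch)
    name = unicodedata.name(ch, "<unnamed>")   # fallback for unnamed codepoints
    category = unicodedata.category(ch)        # e.g. "Cf" = Other, Format
    return f"U+{cp:04X} (dec.: {cp}) {name} [{category}]"

# The two "blank" characters identified in this thread
for ch in "\u200b\u2800":
    print(describe(ch))
# U+200B (dec.: 8203) ZERO WIDTH SPACE [Cf]
# U+2800 (dec.: 10240) BRAILLE PATTERN BLANK [So]
```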

vbr


#12 Re: character or text to identify.

Posted by: maki | Date: 2019-03-27 23:34 | IP: IP Logged

I copied various texts from various web pages into a text file, and it turns out that the file contains various unknown, hard-to-analyze characters.
I see normal text in Polish (for me that is code page 1250),
but the text also contains some Unicode characters, etc.
How can I fix all of this without altering or damaging the sentences, words, spaces, etc. of the text?

I gave only one or a few examples. What should I do about other encodings, and how do I know which other encodings the text contains? There could be 100 different ones, maybe a thousand, or certainly more "Unicode" ones.

Edited 2 time(s). Last edit at 2019-03-27 23:38 by maki.


#13 Re: character or text to identify.

Posted by: pspad | Date: 2019-03-28 04:26 | IP: IP Logged

maki:
I copied various texts from various web pages into a text file, and it turns out that the file contains various unknown, hard-to-analyze characters.

Can you provide the page you copied the text from?


#14 Re: character or text to identify.

Posted by: vbr | Date: 2019-03-28 16:07 | IP: IP Logged

Hi,
it is still not quite clear what problems are actually to be solved, but maybe some guesses or hints based on the above data will help to clarify things a bit.

If you need to work with Unicode data such as character names, there are dedicated tools for this, like BabelPad:
www.babelstone.co.uk

==

If you just want to detect any characters outside a given charset such as windows-1250, you can use a negated character set in a regex, [^...], containing the respective characters, e.g. according to:
en.wikipedia.org

it might be approximately:

[^-\t !"#$%&'()*+,./0-9:;<=>?@A-Z\[\\\]\^_`a-z{|}~€‚„…†‡‰Š‹ŚŤŽŹ‘’“”•–—™š›śťžźˇ˘Ł¤Ą¦§¨©Ş«¬®Ż°±˛ł´µ¶·¸ąş»Ľ˝ľżŔÁÂĂÄĹĆÇČÉĘËĚÍÎĎĐŃŇÓÔŐÖ×ŘŮÚŰÜÝŢßŕáâăäĺćçčéęëěíîďđńňóôőö÷řůúűüýţ˙\x00-\x1f\x7f\xa0\xad]

(there might be some corner cases with control characters, which might not be handled completely in text editors).
If this works for your needs, you can also replace the non-cp-1250 characters it finds with a space, an empty string, or another replacement character.
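
The same check can also be done without a hand-written character class, by asking a codec whether each character is encodable. This is just a sketch, assuming Python 3 and its built-in "cp1250" codec; that codec's table may differ from the Wikipedia table in a few unassigned positions, and the function names are hypothetical.

```python
def is_cp1250(ch: str) -> bool:
    """True if the character can be encoded in windows-1250 (Python's codec)."""
    try:
        ch.encode("cp1250")
        return True
    except UnicodeEncodeError:
        return False

def find_foreign(text: str) -> list:
    """List (index, codepoint) for every character outside cp1250."""
    return [(i, f"U+{ord(ch):04X}") for i, ch in enumerate(text) if not is_cp1250(ch)]

def replace_foreign(text: str, replacement: str = "") -> str:
    """Replace every non-cp1250 character with the given replacement string."""
    return "".join(ch if is_cp1250(ch) else replacement for ch in text)

sample = "a\u200bb\u2800c"             # hypothetical sample text
print(find_foreign(sample))            # [(1, 'U+200B'), (3, 'U+2800')]
print(replace_foreign(sample))         # abc
print(replace_foreign(sample, " "))    # a b c
```

Whether an empty string or a space is the right replacement depends on the character; a blanket replacement can still join or split words in the wrong places.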

==

The "personal list" posted in
forum.pspad.com

[A-Za-z\u00aa\u00b5\u00ba ...

might evaluate to something like:

[A-Za-zªµºÀ-ÖØ-öø-ˁˆ-ˑˠ-ˤˬˮͰ-ʹͶ-ͷͺ-ͽΆΈ-ΊΌΎ-ΡΣ-ϵϷ-ҁҊ-ԣԱ-Ֆՙա-ևא-תװ-ײء-يٮ-ٯٱ-ۓەۥ-ۦۮ-ۯۺ-ۼۿܐܒ-ܯݍ-ޥޱߊ-ߪߴ-ߵߺऄ-हऽॐक़-ॡॱ-ॲॻ-ॿঅ-ঌএ-ঐও-নপ-রলশ-হঽৎড়-ঢ়য়-ৡৰ-ৱਅ-ਊਏ-ਐਓ-ਨਪ-ਰਲ-ਲ਼ਵ-ਸ਼ਸ-ਹਖ਼-ੜਫ਼ੲ-ੴઅ-ઍએ-ઑઓ-નપ-રલ-ળવ-હઽૐૠ-ૡଅ-ଌଏ-ଐଓ-ନପ-ରଲ-ଳଵ-ହଽଡ଼-ଢ଼ୟ-ୡୱஃஅ-ஊஎ-ஐஒ-கங-சஜஞ-டண-தந-பம-ஹௐఅ-ఌఎ-ఐఒ-నప-ళవ-హఽౘ-ౙౠ-ౡಅ-ಌಎ-ಐಒ-ನಪ-ಳವ-ಹಽೞೠ-ೡഅ-ഌഎ-ഐഒ-നപ-ഹഽൠ-ൡൺ-ൿඅ-ඖක-නඳ-රලව-ෆก-ะา-ำเ-ๆກ-ຂຄງ-ຈຊຍດ-ທນ-ຟມ-ຣລວສ-ຫອ-ະາ-ຳຽເ-ໄໆໜ-ໝༀཀ-ཇཉ-...ྈ-ྋ...Ⴀ-Ⴥა-ჺჼᄀ-ᅙᅟ-ᆢᆨ-ᇹሀ-ቈቊ-ቍቐ-ቖቘቚ-ቝበ-ኈኊ-ኍነ-ኰኲ-ኵኸ-ኾዀዂ-ዅወ-ዖዘ-ጐጒ-ጕጘ-ፚᎀ-ᎏᎠ-Ᏼᐁ-ᙬᙯ-ᙶᚁ-ᚚᚠ-ᛪ...ក-ឳៗៜᠠ-ᡷᢀ-ᢨᢪᤀ-ᤜᥐ-ᥭᥰ-ᥴᦀ-ᦩᧁ-ᧇ...ᴀ-ᶿḀ-ἕἘ-Ἕἠ-ὅὈ-Ὅὐ-ὗὙὛὝὟ-ώᾀ-ᾴᾶ-ᾼιῂ-ῄῆ-ῌῐ-ΐῖ-Ίῠ-Ῥῲ-ῴῶ-ῼⁱⁿₐ-ₔℂℇℊ-ℓℕℙ-ℝℤΩℨK-ℭℯ-.ⵥⵯⶀ-ⶖⶠ-ⶦⶨ-ⶮⶰ-ⶶⶸ-ⶾⷀ-ⷆⷈ-ⷎⷐ-ⷖⷘ-ⷞ...々-〆〱-〵〻-〼ぁ-ゖゝ-ゟァ-ヺー-ヿㄅ-...ㄱ-ㆎㆠ-ㆷㇰ-㐀-䶵一-ꀀ-ꒌꔀ-ꘌꘐ-ꘟꘪ-ꘫ...Ꙣ-ꙮ...ꜗ-ꜟꜢ-ꞈꞋ-ꞌꟻ-...ꡀ-ꡳ...가-힣豈-鶴侮-頻...ff-

[some characters were replaced with ..., as they apparently also cause problems and cannot be handled in this form either]

However, there are some quirks which prevent editors from displaying it completely, as the data is "illegal" in some respects with regard to the Unicode encoding standards:
there are lone surrogate "pseudocharacters", occurring individually, which are only expected (allowed) to be used in pairs in UTF-16 to encode Unicode characters in the supplementary planes, above U+FFFF.

In your data there are multiple invalid surrogates, but some of them combine into existing codepoints, some of them rather "exotic", cf.

# (dec.: 1059933) (hex.: 0x102c5d) # ? (Other, Private Use) surrog. (Supplementary Private Use Area-B [1048576-1114111] [0x100000-0x10ffff])
# (dec.: 65581) (hex.: 0x1002d) # LINEAR B SYLLABLE B031 SA (Letter, Other) surrog. (Linear B Syllabary [65536-65663] [0x10000-0x1007f])

I am not sure whether this is intended or whether these are rather randomly built combinations, as the underlying logic of the character sets is not clear to me.

Apart from that, I believe the PSPad regex engine currently doesn't support Unicode literals like \u00b5, only the hexadecimal ones like \xb5 (up to \xff); hence this notation could only work in programs supporting it.

==
If you have text files with unknown encoding, there are detection heuristics available, like the one in PSPad, but in some cases the encoding can't be determined reliably if the data is ambiguous.
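
A very simple heuristic of this kind is to try strict decoding with a list of candidate encodings. The sketch below assumes Python 3; it is not what PSPad does internally, and the candidate list and the name guess_decode are only assumptions for illustration.

```python
def guess_decode(data: bytes, candidates=("utf-8", "cp1250")):
    """Return (encoding, text) for the first candidate that decodes strictly.

    Order matters: UTF-8 is tried first because random non-ASCII bytes rarely
    form valid UTF-8, while almost any byte sequence "decodes" as cp1250.
    """
    for encoding in candidates:
        try:
            return encoding, data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Nothing matched: report no encoding and a lossy UTF-8 rendering
    return None, data.decode("utf-8", errors="replace")

polish = "za\u017c\u00f3\u0142\u0107"              # "zażółć"
print(guess_decode(polish.encode("utf-8"))[0])     # utf-8
print(guess_decode(polish.encode("cp1250"))[0])    # cp1250
```

This also shows the ambiguity: bytes that are valid in cp1250 are usually valid in other single-byte code pages too, so a successful decode only proves the data is possible in that encoding, not that it was written in it.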

==
If there is a mix of multiple encodings used in one file (e.g. database dumps, joined files with different encoding properties, etc.), handling such corrupt data in a general way might be rather hard.

==

hth,
vbr


#15 Re: character or text to identify.

Posted by: maki | Date: 2019-03-28 16:55 | IP: IP Logged

Some Russian web pages (Russian word translated into a Polish word) contain a non-standard space in some places in the text.

.\x{200b}.
Replace with:
Space

The text gets damaged, the replacement is incorrect! Please help

:(

Edited 6 time(s). Last edit at 2019-03-28 17:04 by maki.


#16 Re: character or text to identify.

Posted by: vbr | Date: 2019-03-28 21:09 | IP: IP Logged

maki:
Some Russian web pages (Russian word translated into a Polish word) contain a non-standard space in some places in the text.

.\x{200b}.
Replace with:
Space

The text gets damaged, the replacement is incorrect! Please help
...

Hi,
that whitespace character is:
# ​ (dec.: 8203) (hex.: 0x200b) # ​ ZERO WIDTH SPACE (Other, Format) (General Punctuation [8192-8303] [0x2000-0x206f])
en.wikipedia.org

How you need to replace it depends on your use case. It should normally be invisible; it only marks word boundaries internally, e.g. for potential line breaks.
It might be suitable to replace it with an empty string, i.e. joining the surrounding parts into one "word", if inserting a normal space in these places is not correct in your case.
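
One likely reason for the damage: in the pattern .\x{200b}. each dot also matches one neighbouring character, so the replacement swallows the letters on both sides of the zero width space. A short Python sketch (Python's re module stands in for the PSPad engine here, and the sample word is made up) shows the difference:

```python
import re

text = "sło\u200bwo"   # hypothetical word with a ZERO WIDTH SPACE inside

# The dots consume the letters around U+200B, so they are replaced as well
damaged = re.sub(".\u200b.", " ", text)
print(repr(damaged))   # 'sł o'

# Matching only U+200B leaves the surrounding letters untouched
joined = re.sub("\u200b", "", text)    # join the parts into one word
spaced = re.sub("\u200b", " ", text)   # or put a normal space there
print(repr(joined))    # 'słowo'
print(repr(spaced))    # 'sło wo'
```

In an engine that supports the \x{...} notation, the equivalent search pattern would be just \x{200b}, without the surrounding dots.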

hth,
vbr


#17 Re: character or text to identify.

Posted by: maki | Date: 2019-03-28 22:29 | IP: IP Logged

In general, when this Unicode character is selected, the text moves strangely (left and right) in PSPad.
But I used a different editor, and it works better there.
