Detection of coding of any characters, even those invisible.

#1 Detection of coding of any characters, even those invisible.

Posted by: maki | Date: 2019-03-20 16:48 | IP: IP Logged

How can I detect any character in a text file that requires an encoding other than the default, Windows-1250?
I have tested various regex ranges, but I still cannot find them.


#2 Re: Detection of coding of any characters, even those invisible.

Posted by: pspad | Date: 2019-03-20 16:57 | IP: IP Logged

When I implemented encoding autodetection for PSPad, I took large texts (several sources for each encoding, such as books, web pages, ...).
I counted the occurrences of every character in those texts and assigned a weight to each character.

Now I take roughly the first 10,000 characters, sum the weights of those characters, and decide which encoding it is most likely to be...
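
The weighting idea described above can be sketched roughly as follows. This is not PSPad's actual code: the function names `build_weights` and `guess_encoding`, the tiny reference text, and the use of plain relative frequency as the weight are all assumptions made for illustration.

```python
from collections import Counter

def build_weights(reference_text):
    """Weight of each character = its relative frequency in a reference text."""
    counts = Counter(reference_text)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

def guess_encoding(data, candidates, weights, limit=10_000):
    """Decode the first `limit` bytes with each candidate encoding and
    return the candidate whose decoded characters have the highest total weight."""
    best, best_score = None, float("-inf")
    for encoding in candidates:
        # Bytes that the code page does not define become U+FFFD and score 0.
        text = data[:limit].decode(encoding, errors="replace")
        score = sum(weights.get(ch, 0.0) for ch in text)
        if score > best_score:
            best, best_score = encoding, score
    return best

# Hypothetical reference text; a real table would be built from large corpora.
polish_weights = build_weights("zażółć gęślą jaźń")
sample = "gęślą jaźń".encode("cp1250")
print(guess_encoding(sample, ["cp1250", "iso-8859-2", "cp1252"], polish_weights))
```

With single-byte code pages the first 10,000 bytes are also the first 10,000 characters; a multi-byte encoding would need the limit applied after decoding.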


#3 Re: Detection of coding of any characters, even those invisible.

Posted by: maki | Date: 2019-03-20 17:04 | IP: IP Logged

I do not know what you are talking about.
I have tried various online tools and different regex ranges, but they have not found anything. How can a function detect any character that is not Polish?
I have no other ideas.
I probably need a universal Unicode regex with the option of excluding Polish letters, etc.
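
One way to approach this without a regex at all, shown here as a minimal Python sketch outside PSPad: try to encode each character with the target code page and report the ones that fail. The function name `find_non_cp1250` is made up for this example, and `cp1250` is Python's codec name for Windows-1250.

```python
def find_non_cp1250(text, encoding="cp1250"):
    """Return (line, column, character) for every character that the
    given encoding cannot represent. Lines and columns are 1-based."""
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col_no, ch in enumerate(line, start=1):
            try:
                ch.encode(encoding)
            except UnicodeEncodeError:
                hits.append((line_no, col_no, ch))
    return hits

# Polish letters pass; the zero-width space and the Braille blank do not.
sample = "zażółć\u200b gęślą\nabc\u2800"
for line_no, col_no, ch in find_non_cp1250(sample):
    print(f"line {line_no}, column {col_no}: U+{ord(ch):04X}")
```

Because the check relies on the codec itself, invisible characters are reported in the same way as visible ones.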

Edited 1 time(s). Last edit at 2019-03-20 17:06 by maki.


#4 character or text to identify.

Posted by: maki | Date: 2019-03-20 18:24 | IP: IP Logged

Please support all 137,928 named characters defined in Unicode 12.0.

Character or text to identify:

Example:

U+006D : LATIN SMALL LETTER M
U+0119 : LATIN SMALL LETTER E WITH OGONEK
U+017C : LATIN SMALL LETTER Z WITH DOT ABOVE
U+0063 : LATIN SMALL LETTER C
U+007A : LATIN SMALL LETTER Z
U+0079 : LATIN SMALL LETTER Y
U+017A : LATIN SMALL LETTER Z WITH ACUTE
U+006E : LATIN SMALL LETTER N
U+0069 : LATIN SMALL LETTER I
U+0020 : SPACE [SP]
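
A listing like the one above can be produced with Python's standard `unicodedata` module. This is only a sketch: the helper name `describe` is made up, and the set of names available depends on the Unicode version bundled with the Python build in use.

```python
import unicodedata

def describe(text):
    """Return one 'U+XXXX : NAME' line per character in `text`."""
    lines = []
    for ch in text:
        # Control characters have no formal name, so supply a placeholder.
        name = unicodedata.name(ch, "<unnamed>")
        lines.append(f"U+{ord(ch):04X} : {name}")
    return lines

print("\n".join(describe("mężczyźni ")))
```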


#5 Re: character or text to identify.

Posted by: pspad | Date: 2019-03-20 18:34 | IP: IP Logged

I don't know what you are talking about. Support it where? You can use hex values in regular expressions.


#6 Re: character or text to identify.

Posted by: maki | Date: 2019-03-21 05:51 | IP: IP Logged

I have used regex, but I cannot detect this:

Blank char / invisible character

My personal list:

[A-Za-z\u00aa\u00b5\u00ba\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u02c1\u02c6-\u02d1\u02e0-\u02e4\u02ec\u02ee\u0370-\u0374\u0376-\u0377\u037a-\u037d\u0386\u0388-\u038a\u038c\u038e-\u03a1\u03a3-\u03f5\u03f7-\u0481\u048a-\u0523\u0531-\u0556\u0559\u0561-\u0587\u05d0-\u05ea\u05f0-\u05f2\u0621-\u064a\u066e-\u066f\u0671-\u06d3\u06d5\u06e5-\u06e6\u06ee-\u06ef\u06fa-\u06fc\u06ff\u0710\u0712-\u072f\u074d-\u07a5\u07b1\u07ca-\u07ea\u07f4-\u07f5\u07fa\u0904-\u0939\u093d\u0950\u0958-\u0961\u0971-\u0972\u097b-\u097f\u0985-\u098c\u098f-\u0990\u0993-\u09a8\u09aa-\u09b0\u09b2\u09b6-\u09b9\u09bd\u09ce\u09dc-\u09dd\u09df-\u09e1\u09f0-\u09f1\u0a05-\u0a0a\u0a0f-\u0a10\u0a13-\u0a28\u0a2a-\u0a30\u0a32-\u0a33\u0a35-\u0a36\u0a38-\u0a39\u0a59-\u0a5c\u0a5e\u0a72-\u0a74\u0a85-\u0a8d\u0a8f-\u0a91\u0a93-\u0aa8\u0aaa-\u0ab0\u0ab2-\u0ab3\u0ab5-\u0ab9\u0abd\u0ad0\u0ae0-\u0ae1\u0b05-\u0b0c\u0b0f-\u0b10\u0b13-\u0b28\u0b2a-\u0b30\u0b32-\u0b33\u0b35-\u0b39\u0b3d\u0b5c-\u0b5d\u0b5f-\u0b61\u0b71\u0b83\u0b85-\u0b8a\u0b8e-\u0b90\u0b92-\u0b95\u0b99-\u0b9a\u0b9c\u0b9e-\u0b9f\u0ba3-\u0ba4\u0ba8-\u0baa\u0bae-\u0bb9\u0bd0\u0c05-\u0c0c\u0c0e-\u0c10\u0c12-\u0c28\u0c2a-\u0c33\u0c35-\u0c39\u0c3d\u0c58-\u0c59\u0c60-\u0c61\u0c85-\u0c8c\u0c8e-\u0c90\u0c92-\u0ca8\u0caa-\u0cb3\u0cb5-\u0cb9\u0cbd\u0cde\u0ce0-\u0ce1\u0d05-\u0d0c\u0d0e-\u0d10\u0d12-\u0d28\u0d2a-\u0d39\u0d3d\u0d60-\u0d61\u0d7a-\u0d7f\u0d85-\u0d96\u0d9a-\u0db1\u0db3-\u0dbb\u0dbd\u0dc0-\u0dc6\u0e01-\u0e30\u0e32-\u0e33\u0e40-\u0e46\u0e81-\u0e82\u0e84\u0e87-\u0e88\u0e8a\u0e8d\u0e94-\u0e97\u0e99-\u0e9f\u0ea1-\u0ea3\u0ea5\u0ea7\u0eaa-\u0eab\u0ead-\u0eb0\u0eb2-\u0eb3\u0ebd\u0ec0-\u0ec4\u0ec6\u0edc-\u0edd\u0f00\u0f40-\u0f47\u0f49-\u0f6c\u0f88-\u0f8b\u1000-\u102a\u103f\u1050-\u1055\u105a-\u105d\u1061\u1065-\u1066\u106e-\u1070\u1075-\u1081\u108e\u10a0-\u10c5\u10d0-\u10fa\u10fc\u1100-\u1159\u115f-\u11a2\u11a8-\u11f9\u1200-\u1248\u124a-\u124d\u1250-\u1256\u1258\u125a-\u125d\u1260-\u1288\u128a-\u128d\u1290-\u12b0\u12b2-\u12b5\u12b8-\u12be\u12c0\u12c2-\u12c5\u12c8-\u12d6\u12d8-
\u1310\u1312-\u1315\u1318-\u135a\u1380-\u138f\u13a0-\u13f4\u1401-\u166c\u166f-\u1676\u1681-\u169a\u16a0-\u16ea\u1700-\u170c\u170e-\u1711\u1720-\u1731\u1740-\u1751\u1760-\u176c\u176e-\u1770\u1780-\u17b3\u17d7\u17dc\u1820-\u1877\u1880-\u18a8\u18aa\u1900-\u191c\u1950-\u196d\u1970-\u1974\u1980-\u19a9\u19c1-\u19c7\u1a00-\u1a16\u1b05-\u1b33\u1b45-\u1b4b\u1b83-\u1ba0\u1bae-\u1baf\u1c00-\u1c23\u1c4d-\u1c4f\u1c5a-\u1c7d\u1d00-\u1dbf\u1e00-\u1f15\u1f18-\u1f1d\u1f20-\u1f45\u1f48-\u1f4d\u1f50-\u1f57\u1f59\u1f5b\u1f5d\u1f5f-\u1f7d\u1f80-\u1fb4\u1fb6-\u1fbc\u1fbe\u1fc2-\u1fc4\u1fc6-\u1fcc\u1fd0-\u1fd3\u1fd6-\u1fdb\u1fe0-\u1fec\u1ff2-\u1ff4\u1ff6-\u1ffc\u2071\u207f\u2090-\u2094\u2102\u2107\u210a-\u2113\u2115\u2119-\u211d\u2124\u2126\u2128\u212a-\u212d\u212f-\u2139\u213c-\u213f\u2145-\u2149\u214e\u2183-\u2184\u2c00-\u2c2e\u2c30-\u2c5e\u2c60-\u2c6f\u2c71-\u2c7d\u2c80-\u2ce4\u2d00-\u2d25\u2d30-\u2d65\u2d6f\u2d80-\u2d96\u2da0-\u2da6\u2da8-\u2dae\u2db0-\u2db6\u2db8-\u2dbe\u2dc0-\u2dc6\u2dc8-\u2dce\u2dd0-\u2dd6\u2dd8-\u2dde\u2e2f\u3005-\u3006\u3031-\u3035\u303b-\u303c\u3041-\u3096\u309d-\u309f\u30a1-\u30fa\u30fc-\u30ff\u3105-\u312d\u3131-\u318e\u31a0-\u31b7\u31f0-\u31ff\u3400-\u4db5\u4e00-\u9fc3\ua000-\ua48c\ua500-\ua60c\ua610-\ua61f\ua62a-\ua62b\ua640-\ua65f\ua662-\ua66e\ua67f-\ua697\ua717-\ua71f\ua722-\ua788\ua78b-\ua78c\ua7fb-\ua801\ua803-\ua805\ua807-\ua80a\ua80c-\ua822\ua840-\ua873\ua882-\ua8b3\ua90a-\ua925\ua930-\ua946\uaa00-\uaa28\uaa40-\uaa42\uaa44-\uaa4b\uac00-\ud7a3\uf900-\ufa2d\ufa30-\ufa6a\ufa70-\ufad9\ufb00-\ufb06\ufb13-\ufb17\ufb1d\ufb1f-\ufb28\ufb2a-\ufb36\ufb38-\ufb3c\ufb3e\ufb40-\ufb41\ufb43-\ufb44\ufb46-\ufbb1\ufbd3-\ufd3d\ufd50-\ufd8f\ufd92-\ufdc7\ufdf0-\ufdfb\ufe70-\ufe74\ufe76-\ufefc\uff21-\uff3a\uff41-\uff5a\uff66-\uffbe\uffc2-\uffc7\uffca-\uffcf\uffd2-\uffd7\uffda-\uffdc]|[\ud840-\ud868][\udc00-\udfff]|\ud800[\udc00-\udc0b\udc0d-\udc26\udc28-\udc3a\udc3c-\udc3d\udc3f-\udc4d\udc50-\udc5d\udc80-\udcfa\ude80-\ude9c\udea0-\uded0\udf00-\udf1e\udf30-\udf40\udf42-\udf49\
udf80-\udf9d\udfa0-\udfc3\udfc8-\udfcf]|\ud801[\udc00-\udc9d]|\ud802[\udc00-\udc05\udc08\udc0a-\udc35\udc37-\udc38\udc3c\udc3f\udd00-\udd15\udd20-\udd39\ude00\ude10-\ude13\ude15-\ude17\ude19-\ude33]|\ud808[\udc00-\udf6e]|\ud835[\udc00-\udc54\udc56-\udc9c\udc9e-\udc9f\udca2\udca5-\udca6\udca9-\udcac\udcae-\udcb9\udcbb\udcbd-\udcc3\udcc5-\udd05\udd07-\udd0a\udd0d-\udd14\udd16-\udd1c\udd1e-\udd39\udd3b-\udd3e\udd40-\udd44\udd46\udd4a-\udd50\udd52-\udea5\udea8-\udec0\udec2-\udeda\udedc-\udefa\udefc-\udf14\udf16-\udf34\udf36-\udf4e\udf50-\udf6e\udf70-\udf88\udf8a-\udfa8\udfaa-\udfc2\udfc4-\udfcb]|\ud869[\udc00-\uded6]|\ud87e[\udc00-\ude1d]

Edited 2 time(s). Last edit at 2019-03-21 05:55 by maki.


#7 Re: character or text to identify.

Posted by: pspad | Date: 2019-03-21 05:55 | IP: IP Logged

Your blank and invisible characters each have their own character code. Use it.


#8 Re: character or text to identify.

Posted by: maki | Date: 2019-03-21 06:00 | IP: IP Logged

This does not work:

\x{2800}
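
For comparison outside PSPad, here is a small Python sketch that searches for a few blank-looking characters by code point. Python's `re` module writes code points as `\uXXXX` rather than `\x{...}`. The list of code points is hand-picked for this example and is deliberately incomplete, and the name `find_blanks` is made up.

```python
import re
import unicodedata

# A few characters that usually render as nothing or as empty space:
# no-break space, soft hyphen, zero-width characters and marks,
# word joiner, Braille pattern blank, Hangul filler, byte order mark.
BLANKS = re.compile(r"[\u00a0\u00ad\u200b-\u200f\u2060\u2800\u3164\ufeff]")

def find_blanks(text):
    """Return (offset, code point, name) for each blank-looking character."""
    return [
        (m.start(), f"U+{ord(m.group()):04X}", unicodedata.name(m.group(), "<unnamed>"))
        for m in BLANKS.finditer(text)
    ]

for hit in find_blanks("a\u2800b\u200bc"):
    print(hit)
```

U+2800 is classified as a symbol, not as whitespace or a format character, so a search based only on whitespace classes would not find it. Naming the code point explicitly avoids that problem.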


#9 Re: character or text to identify.

Posted by: pspad | Date: 2019-03-21 06:01 | IP: IP Logged

Send me a sample text containing your blank/invisible characters, together with the regular expression you are using, to the support mail.


#10 Re: character or text to identify.

Posted by: maki | Date: 2019-03-21 06:04 | IP: IP Logged

Example of a blank character:

​​

[^\x{0000}\x{0001}\x{0002}\x{0003}\x{0004}\x{0005}\x{0006}\x{0007}\x{0008}\x{0009}\x{000a}\x{000b}\x{000c}\x{000d}\x{000e}\x{000f}\x{0010}\x{0011}\x{0012}\x{0013}\x{0014}\x{0015}\x{0016}\x{0017}\x{0018}\x{0019}\x{001a}\x{001b}\x{001c}\x{001d}\x{001e}\x{001f}\x{0020}\x{0021}\x{0022}\x{0023}\x{0024}\x{0025}\x{0026}\x{0027}\x{0028}\x{0029}\x{002A}\x{002B}\x{002C}\x{002D}\x{002E}\x{002F}\x{030}\x{0031}\x{0032}\x{0033}\x{0034}\x{0035}\x{0036}\x{0037}\x{0038}\x{0039}\x{003A}\x{003B}\x{003C}\x{003D}\x{003E}\x{003F}\x{040}\x{0041}\x{0042}\x{0043}\x{0044}\x{0045}\x{0046}\x{0047}\x{0048}\x{0049}\x{004A}\x{004B}\x{004C}\x{004D}\x{004E}\x{004F}\x{050}\x{0051}\x{0052}\x{0053}\x{0054}\x{0055}\x{0056}\x{0057}\x{0058}\x{0059}\x{005A}\x{005B}\x{005C}\x{005D}\x{005E}\x{005F}\x{060}\x{0061}\x{0062}\x{0063}\x{0064}\x{0065}\x{0066}\x{0067}\x{0068}\x{0069}\x{006A}\x{006B}\x{006C}\x{006D}\x{006E}\x{006F}\x{070}\x{0071}\x{0072}\x{0073}\x{0074}\x{0075}\x{0076}\x{0077}\x{0078}\x{0079}\x{007A}\x{007B}\x{007C}\x{007D}\x{007E}\x{007F}\x{0AC}\x{201A}\x{201E}\x{2026}\x{2020}\x{2021}\x{2030}\x{0160}\x{2039}\x{015A}\x{0164}\x{017D}\x{0179}\x{018}\x{2019}\x{201C}\x{201D}\x{2022}\x{2013}\x{2014}\x{2122}\x{0161}\x{203A}\x{015B}\x{0165}\x{017E}\x{017A}\x{0A0}\x{02C7}\x{02D8}\x{0141}\x{00A4}\x{0104}\x{00A6}\x{00A7}\x{00A8}\x{00A9}\x{015E}\x{00AB}\x{00AC}\x{00AD}\x{00AE}\x{017B}\x{0B0}\x{00B1}\x{02DB}\x{0142}\x{00B4}\x{00B5}\x{00B6}\x{00B7}\x{00B8}\x{0105}\x{015F}\x{00BB}\x{013D}\x{02DD}\x{013E}\x{017C}\x{154}\x{00C1}\x{00C2}\x{0102}\x{00C4}\x{0139}\x{0106}\x{00C7}\x{010C}\x{00C9}\x{0118}\x{00CB}\x{011A}\x{00CD}\x{00CE}\x{010E}\x{110}\x{0143}\x{0147}\x{00D3}\x{00D4}\x{0150}\x{00D6}\x{00D7}\x{0158}\x{016E}\x{00DA}\x{0170}\x{00DC}\x{00DD}\x{0162}\x{00DF}\x{155}\x{00E1}\x{00E2}\x{0103}\x{00E4}\x{013A}\x{0107}\x{00E7}\x{010D}\x{00E9}\x{0119}\x{00EB}\x{011B}\x{00ED}\x{00EE}\x{010F}\x{111}\x{0144}\x{0148}\x{00F3}\x{00F4}\x{0151}\x{00F6}\x{00F7}\x{0159}\x{016F}\x{00FA}\x{0171}\x{00FC}\x{00FD}\x{0163}\x{02D9}]
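
A negated class like the one above is easy to get wrong when typed by hand. For example, Windows-1250 byte 0x80 is U+20AC (euro sign) and byte 0x91 is U+2018, so both belong in the list. As a hedged alternative, the same class can be generated from the codec. The sketch below uses Python, not PSPad's regex engine, and `build_outside_class` is a made-up name.

```python
import re

def build_outside_class(encoding="cp1250"):
    """Compile a regex that matches any character NOT representable
    in the given single-byte encoding."""
    allowed = set()
    for byte in range(256):
        try:
            allowed.add(bytes([byte]).decode(encoding))
        except UnicodeDecodeError:
            pass  # this byte has no assigned character in the code page
    # Spell every allowed character as a \uXXXX escape inside [^...].
    body = "".join(f"\\u{ord(ch):04x}" for ch in sorted(allowed))
    return re.compile(f"[^{body}]")

outside_cp1250 = build_outside_class()
found = outside_cp1250.findall("zażółć\u200b €\u2800")
print([f"U+{ord(ch):04X}" for ch in found])
```

The generated pattern uses `\uXXXX` escapes, which is Python syntax. Other engines may expect `\x{XXXX}` instead.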

Edited 2 time(s). Last edit at 2019-03-21 06:05 by maki.

