Posted by: maki | Date: 05/01/2018 09:50 | IP: IP Logged
How do I find at least one Unicode character in the text?
Posted by: pspad | Date: 05/01/2018 09:53 | IP: IP Logged
Can you be more specific? When you edit text, all characters are Unicode.
What does "Unicode character" mean to you?
Posted by: maki | Date: 05/01/2018 10:39 | IP: IP Logged
I have a very large text file. It seems that 99.9% of the characters are normal plain text, but at least one Unicode character makes the editor warn about the encoding when the file is saved without Unicode. How do I find this character, or these characters?
Posted by: pspad | Date: 05/01/2018 11:07 | IP: IP Logged
Plain text means text without formatting. Any text file (no matter what encoding it uses) that doesn't contain formatting such as RTF or DOC is plain text.
Unicode, ANSI and UTF-8 are encodings.
The whole file is stored in a single encoding, whichever one it is.
I still don't understand what you mean. An example would explain it better.
Posted by: vbr | Date: 05/01/2018 16:09 | IP: IP Logged
maki: I have a very large text file. It seems that 99.9% of the characters are normal plain text, but at least one Unicode character makes the editor warn about the encoding when the file is saved without Unicode. How do I find this character, or these characters?
If you are able to list the characters you want to accept as "normal", you can search with a regular expression for any other character, using a [^...] pattern for a "negated" (complement) character set, e.g. the pattern:
should match any character except basic Latin letters, numbers, a space and the listed diacritics and symbols (some of them are escaped in the pattern because they are metacharacters with a special meaning in regular expressions).
The list can of course be adjusted for specific needs.
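To illustrate the negated-set idea outside PSPad, here is a minimal Python sketch. The allowed set is only an assumption for demonstration (it is not the pattern from the post above), and find_unexpected is a hypothetical helper name.

```python
import re

# Assumed "normal" set: basic Latin letters, digits, whitespace and common punctuation.
# Metacharacters such as [ ] \ - are escaped inside the character class.
ALLOWED = r"A-Za-z0-9\s.,;:!?'\"()\[\]\\/\-"
UNEXPECTED = re.compile(f"[^{ALLOWED}]")

def find_unexpected(text):
    """Return (offset, character) pairs for every character outside the allowed set."""
    return [(m.start(), m.group()) for m in UNEXPECTED.finditer(text)]

sample = "Hell\u043e, world!"  # the fifth letter is CYRILLIC SMALL LETTER O
print(find_unexpected(sample))
```

The allowed set would need extending with whatever diacritics and symbols count as "normal" in the file at hand.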
Posted by: maki | Date: 05/01/2018 18:00 | IP: IP Logged
Currently, the file is saved in UTF-8. The file contains 200,000 ordinary characters(!!!) (not Unicode)
and some Unicode ones. How do I find only the Unicode characters?
Posted by: maki | Date: 05/01/2018 18:24 | IP: IP Logged
Posted by: vbr | Date: 05/02/2018 09:53 | IP: IP Logged
Well, this one is
о (dec.: 1086) (hex.: 0x43e) - U+043E # о CYRILLIC SMALL LETTER O (Letter, Lowercase) (Cyrillic [1024-1279] [0x400-0x4ff])
You might encounter some other visually similar o-characters:
o (dec: 111; hex: 0x6f) LATIN SMALL LETTER O
ͦ (dec: 870; hex: 0x366) COMBINING LATIN SMALL LETTER O
ο (dec: 959; hex: 0x3bf) GREEK SMALL LETTER OMICRON
о (dec: 1086; hex: 0x43e) CYRILLIC SMALL LETTER O
օ (dec: 1413; hex: 0x585) ARMENIAN SMALL LETTER OH
ₒ (dec: 8338; hex: 0x2092) LATIN SUBSCRIPT SMALL LETTER O
⒪ (dec: 9386; hex: 0x24aa) PARENTHESIZED LATIN SMALL LETTER O
ⓞ (dec: 9438; hex: 0x24de) CIRCLED LATIN SMALL LETTER O
ⲟ (dec: 11423; hex: 0x2c9f) COPTIC SMALL LETTER O
ｏ (dec: 65359; hex: 0xff4f) FULLWIDTH LATIN SMALL LETTER O
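A listing in this style can be produced with Python's standard unicodedata module. This is only a sketch: describe and describe_non_ascii are hypothetical helper names, and "non-ASCII" (code point above 127) is an assumed definition of "exotic".

```python
import unicodedata

def describe(ch):
    """Format one character as: char (dec: N; hex: 0x...) UNICODE NAME."""
    name = unicodedata.name(ch, "<unnamed>")
    return f"{ch} (dec: {ord(ch)}; hex: {hex(ord(ch))}) {name}"

def describe_non_ascii(text):
    """Describe each distinct non-ASCII character in text, in code point order."""
    return [describe(ch) for ch in sorted(set(text)) if ord(ch) > 127]

# Sample string mixing Latin o with Cyrillic and Greek lookalikes.
for line in describe_non_ascii("fo\u043e b\u03bfr"):
    print(line)
```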
However, I can think of yet another way to check for such "exotic" characters in PSPad: you can use File :: File Info and switch to the [Chars] tab.
It lists the characters in the currently open file; because of the sorting, the non-Latin characters are placed further down in the list, so you will find the "suspicious" characters near the end of the listing.
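The same kind of inventory can be approximated in a script. The sketch below is not how PSPad builds its list; char_inventory is a hypothetical helper that counts characters and sorts them by code point so the unusual ones end up last.

```python
from collections import Counter

def char_inventory(text):
    """Return (character, code point, count) tuples sorted by code point."""
    counts = Counter(text)
    return [(ch, ord(ch), n) for ch, n in sorted(counts.items())]

# Assumption: in practice the text would come from
# open(path, encoding="utf-8").read(); a short sample is used here.
text = "foo b\u043er"
for ch, code, n in char_inventory(text)[-3:]:
    print(f"U+{code:04X} {ch!r} x{n}")
```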
Posted by: pspad | Date: 05/02/2018 10:57 | IP: IP Logged
Save your file in the Unicode UTF-7 or US_ASCII 7-bit encoding. That will eliminate all characters with diacritical marks and other exotic characters. Then save it back as UTF-8.
Edited 1 time(s). Last edit at 05/02/2018 10:58 by pspad.
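Before converting the file, it may be worth checking which characters a 7-bit or ANSI encoding cannot hold. A minimal Python sketch, with hypothetical helper names; "cp1252" in the usage note is only an assumed example of an ANSI code page:

```python
def ascii_losses(text):
    """Return (line, column, character) for each character outside 7-bit ASCII."""
    losses = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                losses.append((line_no, col, ch))
    return losses

def first_encode_error(text, encoding="ascii"):
    """Return the offset of the first character that fails to encode, or None."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError as err:
        return err.start
    return None
```

Calling first_encode_error(text, "cp1252") instead of the default would report the first character that a Western European ANSI code page cannot represent.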