
How do find in the text - at least one Unicode character?

#1 How do find in the text - at least one Unicode character?

Posted by: maki | Date: 05/01/2018 09:50 | IP: IP Logged

How do find in the text - at least one Unicode character?



#2 Re: How do find in the text - at least one Unicode character?

Posted by: pspad | Date: 05/01/2018 09:53 | IP: IP Logged

Hello

Can you be more specific? When you edit text, all characters are Unicode.
What does "Unicode character" mean to you?



#3 Re: How do find in the text - at least one Unicode character?

Posted by: maki | Date: 05/01/2018 10:39 | IP: IP Logged

I have a very large text file. About 99.9% of the characters are normal plain text, but at least one Unicode character triggers a warning that it will be lost if the file is saved without a Unicode encoding. How do I find this character (or these characters)?



#4 Re: How do find in the text - at least one Unicode character?

Posted by: pspad | Date: 05/01/2018 11:07 | IP: IP Logged

Hello
Plaintext means text without formatting. Any text file (no matter what encoding it uses) that doesn't contain formatting like RTF or DOC is plaintext.
Unicode, ANSI and UTF-8 are encodings.
The whole file is encoded in one encoding.

I still don't understand what you mean. An example would explain more.



#5 Re: How do find in the text - at least one Unicode character?

Posted by: vbr | Date: 05/01/2018 16:09 | IP: IP Logged

maki:
I have a very large text file. About 99.9% of the characters are normal plain text, but at least one Unicode character triggers a warning that it will be lost if the file is saved without a Unicode encoding. How do I find this character (or these characters)?

Hi,
if you are able to list the characters you want to accept as "normal", you can search with a regular expression for any other character, using a [^...] pattern for a "negated" (complement) character set, e.g. the pattern:
[^0-9a-zA-Z !"#$%&'()*+,./:;<=>?@\[\\\]\^_`\{\|\}\-]

should match any character except basic Latin letters, digits, a space and the listed punctuation marks and symbols (some of them are escaped in the pattern because they are metacharacters with special meaning in regular expressions).
The list can of course be adjusted for specific needs.

hth,
vbr
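
For anyone who prefers to run the same complement-set search outside the editor, here is a minimal Python sketch. It is not a PSPad feature. The function name `find_unusual` and the choice of "normal" characters (printable ASCII plus tab, CR and LF) are assumptions made for illustration. It reports the line and column of every character outside that set.

```python
import re

# Complement set: anything that is NOT printable ASCII, tab, CR or LF.
# The accepted set is an assumption; widen it if your "normal" text
# legitimately contains other characters.
NOT_NORMAL = re.compile(r"[^\x20-\x7E\t\r\n]")


def find_unusual(text):
    """Return (line, column, character, code point) for each match, 1-based."""
    hits = []
    # Split on "\n" only, so that exotic line separators such as U+2028
    # stay inside a line and are reported as well.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for m in NOT_NORMAL.finditer(line):
            ch = m.group()
            hits.append((line_no, m.start() + 1, ch, f"U+{ord(ch):04X}"))
    return hits


if __name__ == "__main__":
    # The second line contains a Cyrillic "о" that looks like a Latin "o".
    sample = "hello world\nfo\u043e bar\n"
    for hit in find_unusual(sample):
        print(hit)
```

For a real file, read it with `open(path, encoding="utf-8").read()` and pass the result to `find_unusual`.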



#6 Re: How do find in the text - at least one Unicode character?

Posted by: maki | Date: 05/01/2018 18:00 | IP: IP Logged

The file is currently saved as UTF-8. It contains 200,000 ordinary characters (!!!) (not Unicode)
and also a few Unicode ones. How do I find only the Unicode characters?
What is unclear about that?



#7 Re: How do find in the text - at least one Unicode character?

Posted by: maki | Date: 05/01/2018 18:24 | IP: IP Logged

OK, it works.

Unicode:
о



#8 Re: How do find in the text - at least one Unicode character?

Posted by: vbr | Date: 05/02/2018 09:53 | IP: IP Logged

maki:
OK, it works.

Unicode:
о

Well, this one is
о (dec.: 1086) (hex.: 0x43e) - U+043E # о CYRILLIC SMALL LETTER O (Letter, Lowercase) (Cyrillic [1024-1279] [0x400-0x4ff])

You might also encounter some other visually similar o-characters:

o (dec: 111; hex: 0x6f) LATIN SMALL LETTER O
ͦ (dec: 870; hex: 0x366) COMBINING LATIN SMALL LETTER O
ο (dec: 959; hex: 0x3bf) GREEK SMALL LETTER OMICRON
о (dec: 1086; hex: 0x43e) CYRILLIC SMALL LETTER O
օ (dec: 1413; hex: 0x585) ARMENIAN SMALL LETTER OH
ₒ (dec: 8338; hex: 0x2092) LATIN SUBSCRIPT SMALL LETTER O
⒪ (dec: 9386; hex: 0x24aa) PARENTHESIZED LATIN SMALL LETTER O
ⓞ (dec: 9438; hex: 0x24de) CIRCLED LATIN SMALL LETTER O
ⲟ (dec: 11423; hex: 0x2c9f) COPTIC SMALL LETTER O
ｏ (dec: 65359; hex: 0xff4f) FULLWIDTH LATIN SMALL LETTER O
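
A listing in this format can be reproduced for any character with Python's standard `unicodedata` module. This is only a sketch; the helper name `describe` is made up for illustration.

```python
import unicodedata


def describe(ch):
    """Format one character as: <char> (dec: N; hex: 0x...) UNICODE NAME."""
    code = ord(ch)
    # Fall back to a placeholder for code points without an assigned name.
    name = unicodedata.name(ch, "<no name>")
    return f"{ch} (dec: {code}; hex: {hex(code)}) {name}"


if __name__ == "__main__":
    # Latin, Greek, Cyrillic and fullwidth look-alikes of the letter "o".
    for ch in "o\u03bf\u043e\uff4f":
        print(describe(ch))
```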

However, I can think of yet another way to check for such "exotic" characters in PSPad: you can use File :: File Info and switch to the [Chars] tab.
It lists the characters in the currently open file; because of the sorting, the non-Latin characters are placed further down in the list, so you will find the "suspicious" characters near the end of the listing.

hth,
vbr
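
A similar inventory can be sketched in Python for use outside PSPad. This is not how the File Info dialog is implemented; the function name `char_inventory` is an assumption. It counts each distinct character and sorts by code point, so the non-Latin characters again end up at the bottom of the list.

```python
from collections import Counter
import unicodedata


def char_inventory(text):
    """Return (code point, character, name, count) rows sorted by code point."""
    counts = Counter(text)
    rows = []
    for ch in sorted(counts):  # single characters sort by code point
        name = unicodedata.name(ch, "<no name>")  # control chars have no name
        rows.append((f"U+{ord(ch):04X}", ch, name, counts[ch]))
    return rows


if __name__ == "__main__":
    # "foo" followed by a Cyrillic "о"; the odd one out is listed last.
    for row in char_inventory("foo\u043e"):
        print(row)
```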



#9 Re: How do find in the text - at least one Unicode character?

Posted by: pspad | Date: 05/02/2018 10:57 | IP: IP Logged

Save your file in the Unicode UTF-7 or US_ASCII 7-bit encoding. This will eliminate all characters with diacritical marks and other exotic characters. Then save it back as UTF-8.

Edited 1 time(s). Last edit at 05/02/2018 10:58 by pspad.
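
To preview what a round trip through 7-bit ASCII would discard before doing it in the editor, something like the following Python sketch may help. It covers the US-ASCII route only and says nothing about how PSPad performs the conversion; the names `lost_in_ascii` and `to_ascii` are made up for illustration.

```python
def lost_in_ascii(text):
    """Return the distinct non-ASCII characters, in order of first appearance."""
    lost = []
    for ch in text:
        if ord(ch) > 0x7F and ch not in lost:
            lost.append(ch)
    return lost


def to_ascii(text):
    """Simulate a lossy save: each non-ASCII character becomes '?'."""
    return text.encode("ascii", errors="replace").decode("ascii")


if __name__ == "__main__":
    sample = "fo\u043e b\u00e4r"
    print(lost_in_ascii(sample))  # the characters that would not survive
    print(to_ascii(sample))       # what the text looks like afterwards
```

Note that the conversion is destructive, so it is safer to try it on a copy of the file.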



#10 Re: How do find in the text - at least one Unicode character?

Posted by: MarkJohnson | Date: 06/12/2018 12:05 | IP: IP Logged

I didn't understand the question at the very beginning, but the comments were really helpful. Thanks, guys.

--
ph375usa.com






