Text Regex
Posted by: maki | Date: 2019-04-08 10:06
My regex is incorrect; what should I change?
It should find text that contains the characters below within a single line.
^([\p{Cyrillic}]+[\-\.\,\!\…\?\(\)\„\”—0-9\s]+[\p{Cyrillic}\-\.\,\!\…\?\(\)\„\”—0-9\s]*|[\-\.\,\…\?\(\)\„\”—0-9]+[\p{Cyrillic}]+[\p{Cyrillic}\-\.\,\!\…\?\(\)\„\”—0-9\s]*)$
Text included:
\p{Cyrillic}
!
!!
!!!
!!!!
?
??
???
… (unicode)
— (unicode)
-
.
..
...
,
0-9
(
)
„ (unicode)
” (unicode)
"
\s (space)
\
/
\x{200b}
*
#
@
%
Edited 1 time(s). Last edit at 2019-04-08 10:09 by maki.
Posted by: maki | Date: 2019-04-08 10:54
Still wrong:
^([\p{Cyrillic}]+[\-\.\,\!\…\?\(\)\„\”\,\;\\\/\*\#\@\&\:\.x{200B}—0-9\s]+[\p{Cyrillic}\-\.\,\!\…\?\(\)\„\”\,\;\\\/\*\#\@\&\:\.x{200B}—0-9\s]*|[\-\.\,\…\?\(\)\„\”\,\;\\\/\*\#\@\&\:\.x{200B}—0-9\s]+[\p{Cyrillic}]+[\p{Cyrillic}\-\.\,\!\…\?\(\)\„\”\,\;\\\/\*\#\@\&\:\.x{200B}—0-9\s]*)$
Edited 2 time(s). Last edit at 2019-04-08 11:00 by maki.
Posted by: vbr | Date: 2019-04-08 14:37
Hi,
PSPad regex engine doesn't currently support several features you are using in your pattern:
e.g. the Unicode properties \p{...} (e.g. \p{Cyrillic}), which refer to a Unicode block or script, or code point matching using \x{...}. There must be a backslash before the x in this notation, otherwise it is handled literally; i.e. it would have to be \x{200B}, in a tool that supports it, to match:
# (dec.: 8203) (hex.: 0x200b) # ZERO WIDTH SPACE (Other, Format) (General Punctuation [8192-8303] [0x2000-0x206f])
You may use other tools that currently support these features, or it might be possible to replace the properties with character sets,
e.g. with [Ѐ-ԯ] you can match all code points in the ranges 0x0400-0x04FF (Cyrillic) and 0x0500-0x052F (Cyrillic Supplement),
which might be sufficient for some use cases.
The code point or Unicode notation is not supported; the respective characters must be entered directly in the pattern.
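To illustrate the replacement outside PSPad, here is a minimal sketch in Python, whose standard `re` module also lacks \p{...}. The names `CYRILLIC_RANGE`, `cyrillic_words` and `zero_width_space_positions` are made up for this example, and the range only approximates \p{Cyrillic}, since the extended Cyrillic blocks elsewhere in Unicode are left out.

```python
import re

# Assumption: U+0400..U+052F (Cyrillic + Cyrillic Supplement) stands in for
# \p{Cyrillic}; the extended Cyrillic blocks are not covered by this range.
CYRILLIC_RANGE = "\u0400-\u052F"
ZERO_WIDTH_SPACE = "\u200b"  # the character written as \x{200B} in the thread

cyrillic_run = re.compile(f"[{CYRILLIC_RANGE}]+")


def cyrillic_words(text):
    """Return every run of characters from the U+0400..U+052F range."""
    return cyrillic_run.findall(text)


def zero_width_space_positions(text):
    """Return the index of every ZERO WIDTH SPACE in text."""
    return [m.start() for m in re.finditer(ZERO_WIDTH_SPACE, text)]
```

In PSPad itself the same idea would mean typing the literal characters into the pattern, since the escape notation is not available there.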
hth,
vbr
Posted by: maki | Date: 2019-04-08 15:11
I have no problem with Cyrillic; I can replace it with a standard regex that works in PSPad or another editor. But what about the next code? I cannot deal with it.
Edited 2 time(s). Last edit at 2019-04-08 15:15 by maki.
Posted by: vbr | Date: 2019-04-08 17:03
Ok, the zero width space can be problematic too, but it can be copied directly and it can be handled in PSPad, even in the search dialog. It should be the "invisible" character between the following parentheses:
()
Unfortunately, the editor component in PSPad has some problems with selecting such characters, but it is generally possible to select and copy a larger part of the surrounding text and then (carefully) delete the other parts as needed.
In PSPad it is also displayed the same way as a regular space, as a dot marking whitespace.
vbr
Posted by: maki | Date: 2019-04-08 17:11
Ok, let's leave out "ZERO WIDTH SPACE". I'm talking about normal regex, so let's not talk about Unicode anymore.
\char = match a literal
How do I match all the characters I listed above in this thread?
Posted by: vbr | Date: 2019-04-09 08:02
Hi, what is the pattern supposed to match? Some (preferably short) samples of real text strings would probably make this clearer.
If Cyrillic or some "special", i.e. non-ASCII, punctuation should be matched, Unicode definitely needs to be taken into account.
The current patterns are most likely more complicated than needed - e.g. you don't need to escape most of the literals in a character class (between [...]) - most likely only: - ^ [ and ] (and even those depend on their position in the class).
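A small sketch of that simplification, using Python's `re` module as a stand-in (PSPad's engine may treat some of these characters differently, so this is an assumption rather than a statement about PSPad). The names `ESCAPED`, `PLAIN` and `same_verdict` are only illustrative.

```python
import re

# Heavily escaped class, in the style of the patterns posted above.
ESCAPED = re.compile(r"[\-\.\,\!\…\?\(\)\„\”—0-9\s]+")

# Same set with the escapes dropped; "-" is placed first so it stays literal
# instead of being read as a range operator.
PLAIN = re.compile(r"[-.,!…?()„”—0-9\s]+")


def same_verdict(text):
    """True if both classes agree on whether the whole text matches."""
    return bool(ESCAPED.fullmatch(text)) == bool(PLAIN.fullmatch(text))
```

Both classes accept a string such as `-.,!…?()„”— 2019` and reject `abc`, so under these assumptions the shorter form loses nothing.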
vbr
Posted by: maki | Date: 2019-04-09 09:19
I will not give examples, because there are too many different ones and a more precise match would be needed. There is no point in matching everything; that would be very bad. Also, this is my work, and I will not post private texts.
It simply has to contain Russian words and all the characters that I gave earlier. That's all.
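For reference, one possible reading of this requirement and of the pattern in the first post is: a line built only from Cyrillic letters and the listed characters, containing at least one of each. The sketch below expresses that reading in Python with two lookaheads. It is an interpretation, not a confirmed specification: the names `CYR`, `EXTRA`, `LINE` and `matching_lines` are hypothetical, the character list is abbreviated to the set from the first pattern, and whether PSPad's engine accepts lookaheads is not established in this thread.

```python
import re

# Assumption: U+0400..U+052F approximates \p{Cyrillic}, as suggested above.
CYR = "\u0400-\u052F"
# Abbreviated set from the first pattern; "-" is escaped because it does not
# sit at the edge of the combined class below.
EXTRA = r".,!?()…„”—0-9\s\-"

LINE = re.compile(
    rf"(?=.*[{CYR}])"      # at least one Cyrillic letter somewhere in the line
    rf"(?=.*[{EXTRA}])"    # at least one of the listed characters
    rf"[{CYR}{EXTRA}]+"    # and nothing outside the two sets
)


def matching_lines(text):
    """Return the lines of text that satisfy LINE as a whole."""
    return [line for line in text.splitlines() if LINE.fullmatch(line)]
```

Under this reading, `Привет, мир!` is accepted, while `Привет` (no listed character), `Hello, мир` (Latin letters) and `123 ...` (no Cyrillic) are rejected.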