Re: Cyrillic character problem with accented characters.

#1 Cyrillic character problem with accented characters.

Posted by: BrentMartin720 | Date: 10/30/2016 22:38 | IP: IP Logged

I've been having crashes while editing Cyrillic text.

I noticed that certain characters cause problems with my column alignments.
The equal signs should be lined up perfectly, but they seem to be off by one
space on lines containing accented characters.

Example lines with alignment problems follow. Note that the accented characters shift the right column one space to the left, which misaligns the column and makes it difficult to paste in new information
from other sources such as Google Translate.


баловатьбалую =баловатьбалую/Nc
балу́ =балу́/Nc
балуюсь =балуюсь/Nc
балы́ =балы́/Nc
бреюсь =бреюсь/Nc
броня́ =броня́/Nc
бро́ня =бро́ня/Nc
брошек =брошек/Nc
вгляжусь =вгляжусь/Nc
веде́ние =веде́ние/Nc
ве́дение =ве́дение/Nc
ведущий =ведущий/Nc

Also, when I try to paste a new list into the right column, the
editor crashes. It has happened repeatedly, and I can no longer use
PSPad for editing my Russian dictionary files because of the crashes.

I've been using PSPad for years and I've done this
thousands of times, with lots of different languages. This problem
seems to be with certain Unicode characters that involve acute accents
on Russian words in the Cyrillic alphabet (stored in UTF-8 format).

I'm using version 4.6.0 (2700) dated 10/2/2015. I am running
Windows 10 Pro, Version 1607 (OS Build 14955.1000)

I use Google Translate to translate the left column of words and then I use
cut/paste to change the words on the right to English or some other language like Esperanto.

I've been doing this for many years. This seems to be a new problem, and I'm not sure why it is cropping up now. Perhaps the accent marks are the real problem.
The accent marks come from a list of the 10,000 most frequently used Russian words.

That's about all I can report. I can attach a file that causes crashes when I try to paste new definitions in the right column of my "dictionary" file.
I'm not sure how to post it here, but can email it.

Edited 2 time(s). Last edit at 10/30/2016 22:43 by BrentMartin720.


#2 Re: Cyrillic character problem with accented characters.

Posted by: vbr | Date: 10/31/2016 13:16 | IP: IP Logged

Hi,
I can't reproduce the crashes, but it seems that a possible cause is the decomposed accented characters, which might not be handled properly in the editor.
These consist of separate "base letters" followed by the combining acute accent.
Some glitches can be seen with Latin letters too:
the first line contains an e followed by the combining acute accent, the second one contains the precomposed é directly:
-é-
-é-

The character count (when selected) shows 4 and 3 chars respectively; there are some display artefacts (doubling some characters) while making the text selection near the combined accent.
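
The 4 vs. 3 count can be reproduced outside the editor. Below is a minimal Python sketch using the standard `unicodedata` module; it only illustrates the two Unicode forms and says nothing about how PSPad counts internally. The variable names are made up for the example.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT (decomposed form)
decomposed = "-e\u0301-"
# U+00E9 LATIN SMALL LETTER E WITH ACUTE (precomposed form)
precomposed = "-\u00e9-"

print(len(decomposed))   # 4 code points
print(len(precomposed))  # 3 code points

# NFC composes base letter + mark into the single precomposed character;
# NFD splits it back into base letter + combining mark.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```

Both strings look the same on screen, which is why the difference only shows up in counts and column positions.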

It is possible that Unicode handling differs between OS versions (I am using Windows 7, where this might be supported or implemented differently than in Windows 10).

regards,
vbr

Edited 1 time(s). Last edit at 10/31/2016 13:17 by vbr.


#3 Re: Cyrillic character problem with accented characters.

Posted by: BrentMartin720 | Date: 11/04/2016 04:46 | IP: IP Logged

Thanks for your reply.

I've continued processing many other languages and found similar
problems in many of them, with regard to loading dictionary lines
which mix Latin and non-Latin alphabets.

Some of the files load fine in Notepad and look drastically different
when displayed in PSPad, with unwanted "display artifacts" such as what
look like sequences of spaces on the screen where you want no spaces
at all.

All these problems seem to be related.

For now I can just work around the problem or use another editor,
but PSPad is great for the work I am doing, which is mainly working with columnar
data, mostly foreign language dictionaries.

Something definitely seems to be wrong, but it apparently doesn't cause anyone
else any problems.

I don't think it's a Windows 10 issue. I haven't had any big issues since upgrading to Delphi XE7 and Windows 8, 8.1, and 10.

I do think the character counts are off somehow, possibly in Delphi, or possibly in the Unicode handling within Windows, but the files display "properly" (as expected) in Notepad. I have not tried Notepad Plus, but do have it installed on my system. My daughter was using it.

Thanks again for your reply.

P.S. I am currently working on translating text in the following languages.
(Some display correctly, some have spacing problems, and at least one works
but its characters show as little squares in the editor, while they show up as
foreign characters in Notepad. This could point to a font issue.)

type
  LangType = (UnknownLanguage,
    Afrikaans, Albanian, Arabic, Aramaic, Armenian, Azeri,
    Bashkir, BasicEnglish, Basque, Belarusian, Bengali,
    Bosnian, Breton, Bulgarian,
    Cantonese, Catalan, Cebuano, Cherokee, Chichewa,
    Chinese, Corsican, Croatian, Czech,
    Danish, Dutch, Elvish, English, Ergane,
    Esperanto, Estonian,
    Filipino, Finnish, French, Frisian,
    Galician, Georgian, German, Greek, Gujarati,
    Haitian, Hausa, Hebrew, Hindi, Hmong, Hungarian,
    Icelandic, Ido, Igbo, Indonesian,
    Interlingua, Interlingue, Irish, Italian,
    Japanese, Javanese,
    Kannada, Kazakh, Khmer, Klingon, Korean, Kyrgyz,
    Ladino, Lakota, Lao, Latin, Latvian, Lithuanian, LowGerman,
    Macedonian, Malagasy, Malay, Malayalam, Maltese,
    Maori, Marathi, Mayan, MiddleEnglish, Mongolian, Myanmar,
    Nahuatl, Navi, Nepali, Niedersachsen, Norwegian, Novial,
    Occidental, OldChurchSlavonic, OldEnglish, Ostfriesisch,
    Persian, Polish, Portuguese, Punjabi,
    Romanian, Romansh, Russian,
    Scottish, Serbian, Sesotho, Sinhala, Slovak, Slovene,
    Slovianski, Slovio, Slovioski, Somali,
    Spanish, Sundanese, Swahili, Swedish,
    Tagalog, Tajik, Tamil, Tatar, Telugu,
    Thai, Turkish, Turkmen,
    Ukrainian, Urdu, Uzbek,
    Vietnamese,
    Welsh,
    Yiddish, Yoruba,
    Zulu);


#4 Re: Cyrillic character problem with accented characters.

Posted by: vbr | Date: 11/04/2016 14:28 | IP: IP Logged

Hi,
thanks for the further information; we will have to wait for the PSPad developer's response regarding combining diacritics in connection with character counting, column editing, etc.

In my usage, PSPad generally supports Unicode characters in the Basic Multilingual Plane (BMP), i.e. characters up to U+FFFF.

The support for characters above this limit seems to be non-standard (in the internally used UTF-16 encoding, the surrogate characters are used in a different way, incompatible with some other editors).
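
For reference, this is how standard UTF-16 represents a character above U+FFFF as a surrogate pair. It is a short Python sketch only; the helper name `utf16_units` is hypothetical, and it does not show how PSPad itself stores text.

```python
def utf16_units(text):
    """Return the UTF-16 code units of `text` as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

bmp_char = "\u0431"          # CYRILLIC SMALL LETTER BE, inside the BMP
astral_char = "\U00010400"   # DESERET CAPITAL LETTER LONG I, above U+FFFF

print([hex(u) for u in utf16_units(bmp_char)])     # ['0x431']
print([hex(u) for u in utf16_units(astral_char)])  # ['0xd801', '0xdc00']
```

A program that treats each 16-bit unit as one character will see two "characters" for the second string, which is one way counts and cursor positions can drift between editors.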

There may be other problems with displaying some features of Unicode texts correctly, such as text direction (e.g. right-to-left) or even the combining diacritics currently under discussion.

However, if you get square symbols instead of the correct characters (glyphs), it is most likely a matter of font support.

The editor is limited to fixed-width (monospace) fonts, and it might be tricky to find one supporting the needed character ranges.
For example, DejaVu Sans Mono is probably worth a try:
dejavu-fonts.org
but there are many others.

In retrospect, I realise that in my previous post my example with combining diacritics, é vs é, was somehow normalised, probably by the browser or on the server side. The combining acute accent may be written via Alt+769 in PSPad (and some other programs in Windows):
- ́- (dec.: 769) (hex.: 0x301) # ́ COMBINING ACUTE ACCENT (Mark, Nonspacing) (Combining Diacritical Marks [768-879] [0x300-0x36f])
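
The properties listed above can be checked with Python's standard `unicodedata` module. This is just a quick sketch for verification; the variable name `accent` is arbitrary.

```python
import unicodedata

accent = chr(769)  # the code point entered via Alt+769, i.e. U+0301

print(hex(ord(accent)))               # 0x301
print(unicodedata.name(accent))       # COMBINING ACUTE ACCENT
print(unicodedata.category(accent))   # Mn (Mark, Nonspacing)
print(unicodedata.combining(accent))  # 230, a non-zero combining class
print(0x300 <= ord(accent) <= 0x36F)  # True: Combining Diacritical Marks block
```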

If I use this character (displayed above the previous character, such as "e"), I can see an alignment error with column insertion (the accent is counted as a character even though it does not visually occupy a character position of its own).
(These accent characters are often used for additional information, such as quantity or stress, but in many cases are not part of the official orthography of the language. On the other hand, accented characters that are used "officially" are generally predefined in Unicode as single characters.)
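
As a possible workaround outside the editor, a dictionary file could be re-aligned by a small script that counts visible cells instead of code points. The Python sketch below assumes one cell per code point that is not a nonspacing or enclosing mark (it ignores double-width East Asian characters); the helpers `display_width` and `pad_to` are hypothetical names, not part of PSPad or any library.

```python
import unicodedata

def display_width(text):
    """Count code points that occupy a cell, skipping nonspacing/enclosing marks."""
    return sum(1 for ch in text if unicodedata.category(ch) not in ("Mn", "Me"))

def pad_to(text, width):
    """Pad with spaces so that `text` fills `width` visible cells."""
    return text + " " * max(0, width - display_width(text))

# Sample words from the report, with the stress mark written as U+0301.
words = ["балу\u0301", "балуюсь", "броня\u0301"]
column = max(display_width(w) for w in words) + 1

for w in words:
    print(pad_to(w, column) + "=" + w + "/Nc")

# NFC normalization does not help here: Unicode has no precomposed
# Cyrillic "у with acute", so the string keeps its 5 code points.
print(len(unicodedata.normalize("NFC", "балу\u0301")))  # 5
print(display_width("балу\u0301"))                      # 4
```

Padding by `display_width` rather than `len` keeps the equal signs in one column in any renderer that draws the accent over the preceding letter.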

regards,
vbr

Editor PSPad - freeware editor, © 2001 - 2017 Jan Fiala