Hello, When I copy and paste text from Word that contains German umlauts (ü, ä), they are not replaced by their corresponding HTML entities (&auml;, ...). I noticed that some of them consist of two characters: you need to press delete twice, first to remove the ¨ and then to remove the letter. Is there a plugin for that? Or a way to parse and replace such letters? Thank you
Tue, 10/28/2014 - 07:57
#1
Word uses extended unicode
A problem exists in pasting text from Microsoft Word into a browser window, because JavaScript itself does not implement the full Unicode character set, but omits some rarely used high-numbered codes. Surprisingly, this does not affect plain HTML, but it affects all JavaScript apps. Microsoft, in their infinite wisdom, have chosen to use some of these extended character codes in Word output. The 'smart quote' is a frequently encountered example.
The way to find out whether this is the problem would be to paste from Word into a hex editor and examine the Unicode byte sequences. Pasting Arial Unicode umlauted characters from charmap into CK works normally, so I suspect this is the problem.
The character is ü
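The character is ü, which gives me 75 cc 88 (hex) in UTF-8. Normally it would be the single character ü, which should give me c3 bc (hex).
The Unicode sequence is \u0075\u0308, i.e. a plain u followed by a combining diaeresis (U+0308).
What I do now is parse the text when I receive the data.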
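For anyone who wants to reproduce the two byte sequences, here is a minimal sketch for a browser console or Node.js. It assumes an environment that provides `TextEncoder` and `String.prototype.normalize` (ES2015 or later); the helper name `utf8Hex` is made up for this illustration.

```javascript
// Hypothetical helper: UTF-8 bytes of a string as space-separated hex.
function utf8Hex(text) {
  return Array.from(new TextEncoder().encode(text))
    .map((byte) => byte.toString(16).padStart(2, "0"))
    .join(" ");
}

const decomposed = "\u0075\u0308"; // u + combining diaeresis, as in the paste
const precomposed = "\u00fc";      // the single code point for u-umlaut

console.log(utf8Hex(decomposed));  // 75 cc 88
console.log(utf8Hex(precomposed)); // c3 bc

// NFC normalization composes the pair into the single code point.
const composed = decomposed.normalize("NFC");
console.log(composed === precomposed); // true
console.log(composed.length);          // 1
```

Normalizing to NFC is one possible way to turn the two-character form into the single character before any further processing.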
For those who are in the same
For those in the same situation, I found the solution here: http://stackoverflow.com/a/1144788/2379208.
Thank you iwrconsultancy for your time.
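A parsing step of this kind could look roughly like the sketch below. It is only an illustration under stated assumptions, not necessarily what the linked answer does: it assumes `String.prototype.normalize` is available, and the names `GERMAN_ENTITIES` and `umlautsToEntities` are invented for this example.

```javascript
// Assumed mapping for the German letters only; extend it as needed.
const GERMAN_ENTITIES = {
  "\u00e4": "&auml;",
  "\u00f6": "&ouml;",
  "\u00fc": "&uuml;",
  "\u00c4": "&Auml;",
  "\u00d6": "&Ouml;",
  "\u00dc": "&Uuml;",
  "\u00df": "&szlig;",
};

// Hypothetical helper: compose letter + combining mark pairs first,
// then replace each precomposed letter with its HTML entity.
function umlautsToEntities(text) {
  const composed = text.normalize("NFC");
  return composed.replace(
    /[\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc\u00df]/g,
    (ch) => GERMAN_ENTITIES[ch]
  );
}

// "Mu\u0308ller" contains u followed by the combining diaeresis.
console.log(umlautsToEntities("Mu\u0308ller")); // M&uuml;ller
console.log(umlautsToEntities("Gr\u00fc\u00dfe")); // Gr&uuml;&szlig;e
```

Normalizing first means a single lookup table covers both the two-character form and the ordinary precomposed letters.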
Unsupported unicode
Yes, as I suspected, it's UTF-32 (or UTF-24, if such an encoding exists). There is no fix possible in any JavaScript app, since JavaScript itself cannot handle these codes, being limited to UTF-16. Any fix would have to be at the clipboard level, before the data is pasted.
String replacement may work in specific situations, but not in the general case, because once the data is in JavaScript storage there is no way of telling whether three bytes make up a nonstandard code or parts of two valid codes.
This explains the situation quite well. Basically, although the shortcoming is in JavaScript, Microsoft should have had more sense than to create this problem. There are plenty of situations where Web coders have to forgo JavaScript functions that do not work in all browsers and use a more compatible function instead. Why Microsoft thought they could disregard such concerns is beyond me.