I am having a problem copying/pasting text to a Rich Text Editor from MS Word into a mySQL db.
Basically, special characters such as single and double quotes, long dashes, etc. show up as little square boxes on the output page if the text was originally copied/pasted from MS Word, but displays perfectly if the same text was keyed in to the RTE directly.
To start with, I have an existing application that stores the output of the RTE into an MS Access db memo field. Even if I copy/paste to the RTE from MS Word, It works perfectly and the special characters get displayed as they should when called from a web page. It should be noted that the text is being copied from MS Word and stored in a MS Access, both Microsoft Office products with presumably the same character encoding.
However, using the exact same RTE with the exact same text copied from MS Word, but stored in a mySQL db, causes the special characters to display as little square boxes on the same web page. The only difference here is storing the data in mySQL vs. MS Access. BTW, mySQL is using the UTF-8 character set.
I have tried several things to fix the problem and get the special characters to display correctly, with limited success:
1. Tried using a different RTE (the RTE "Cross-Browser Rich Text Editor by Kevin Roth") and got the same results. This seems to eliminate the RTE as the problem.
2. Copied the contents of the MS Access memo field that displays correctly directly to the text field of the mySQL db (ie copied directly from MS Access to msSQL) and the special characters that display normally when called from MS Access, do not display properly when called from mySQL. So, when the same text is copied/pasted from MS Word is called from MS Access it displays correctly, but THE EXACT SAME DATA called from mySQL's text field the special character display as little square boxes. This also seems to eliminate the output of the RTE as a problem.
3. Tried replacing the single and double quotes with their HTML equivalents before the RTE text was stored in the mySQl db , but the "replaceAwithBinC" function would not recognize the single or double quotes if they were copied from MS Word. This indicates that MS Word seems to produce special characters that are proprietary to Microsoft Office (???).
4. So, I then went into MS Word and turned off the "Smart Quote" function that replaces the straight quote marks with the curved ones you see in a formatted Word document. I then did a search on all curved single and double quote marks and replaced them with straight ones, then re-copied/pasted the text into the RTE and stored it in mySQL. All of the sudden, the single and double quote marks displayed from the mySQL db correctly. However, this did not fix the long dashes and certain other special characters that I din't search/replace.. This corroborates that MS Word's curvy quote marks (and other special characters) seem to be causing the problem (perhaps because they are not encoded to UTF-8 standards ???).
5. This leads me to believe that MS Word produces proprietary special characters, such as the curvy quote marks, that are not in the UTF-8 character set. To test this, I copied text with the curvy quote marks from MS Word to Dreamweaver and looked at the code and it properly showed “ and ” … But I am not sure these are in the UTF-8 set or just the straight quotes are??? Since MS Access is an Office product, it does not seem to have a problem handling these special characters produced by Word.
So, I then copied/pasted text with left and right quote marks from Adobe Indesign into the RTE and stored it in mySQL and it worked perfectly. In direct contrast to MS Word, Adobe Indesign seems to produce encoded quote marks that mySQl can recognize.
In any event, my application must allow anyone to copy/paste an existing document from MS Word (without turning Smart Quotes off and performing a tedious find/replace of all problem characters), and is too large to use a MS Access db. So, I need to find a solution so I can copy/paste a MS Word document into the RTE, store it in mySQL and have the special characters produced by MS Word to display properly.
Any thoughts?
Basically, special characters such as single and double quotes, long dashes, etc. show up as little square boxes on the output page if the text was originally copied/pasted from MS Word, but displays perfectly if the same text was keyed in to the RTE directly.
To start with, I have an existing application that stores the output of the RTE into an MS Access db memo field. Even if I copy/paste to the RTE from MS Word, It works perfectly and the special characters get displayed as they should when called from a web page. It should be noted that the text is being copied from MS Word and stored in a MS Access, both Microsoft Office products with presumably the same character encoding.
However, using the exact same RTE with the exact same text copied from MS Word, but stored in a mySQL db, causes the special characters to display as little square boxes on the same web page. The only difference here is storing the data in mySQL vs. MS Access. BTW, mySQL is using the UTF-8 character set.
I have tried several things to fix the problem and get the special characters to display correctly, with limited success:
1. Tried using a different RTE (the RTE "Cross-Browser Rich Text Editor by Kevin Roth") and got the same results. This seems to eliminate the RTE as the problem.
2. Copied the contents of the MS Access memo field that displays correctly directly to the text field of the mySQL db (ie copied directly from MS Access to msSQL) and the special characters that display normally when called from MS Access, do not display properly when called from mySQL. So, when the same text is copied/pasted from MS Word is called from MS Access it displays correctly, but THE EXACT SAME DATA called from mySQL's text field the special character display as little square boxes. This also seems to eliminate the output of the RTE as a problem.
3. Tried replacing the single and double quotes with their HTML equivalents before the RTE text was stored in the mySQl db , but the "replaceAwithBinC" function would not recognize the single or double quotes if they were copied from MS Word. This indicates that MS Word seems to produce special characters that are proprietary to Microsoft Office (???).
4. So, I then went into MS Word and turned off the "Smart Quote" function that replaces the straight quote marks with the curved ones you see in a formatted Word document. I then did a search on all curved single and double quote marks and replaced them with straight ones, then re-copied/pasted the text into the RTE and stored it in mySQL. All of the sudden, the single and double quote marks displayed from the mySQL db correctly. However, this did not fix the long dashes and certain other special characters that I din't search/replace.. This corroborates that MS Word's curvy quote marks (and other special characters) seem to be causing the problem (perhaps because they are not encoded to UTF-8 standards ???).
5. This leads me to believe that MS Word produces proprietary special characters, such as the curvy quote marks, that are not in the UTF-8 character set. To test this, I copied text with the curvy quote marks from MS Word to Dreamweaver and looked at the code and it properly showed “ and ” … But I am not sure these are in the UTF-8 set or just the straight quotes are??? Since MS Access is an Office product, it does not seem to have a problem handling these special characters produced by Word.
So, I then copied/pasted text with left and right quote marks from Adobe Indesign into the RTE and stored it in mySQL and it worked perfectly. In direct contrast to MS Word, Adobe Indesign seems to produce encoded quote marks that mySQl can recognize.
In any event, my application must allow anyone to copy/paste an existing document from MS Word (without turning Smart Quotes off and performing a tedious find/replace of all problem characters), and is too large to use a MS Access db. So, I need to find a solution so I can copy/paste a MS Word document into the RTE, store it in mySQL and have the special characters produced by MS Word to display properly.
Any thoughts?