As I understand it, CKEditor works by simply replacing a <textarea>, and the HTML is then sent in the POST request when a user hits submit. Of course, a user could then simply submit his own HTML in a custom POST request, and this HTML could contain JavaScript, tags I don't allow, and invalid XHTML.
So clearly I need something on the server side to ensure that only certain XHTML tags have been submitted and that the XHTML is valid and free of JavaScript. How do others handle this? Is there a good library for this, preferably in Java?

Re: How best to deal with security issues and CKEditor?
Probably the simplest solution would be to attempt to prevent the user from doing anything custom. For instance: remove the Source plugin. The editor, by default, can produce a number of different dtds, including HTML 4.01 Strict and Transitional as well as XHTML 1.0 Strict and Transitional.
Using those two in concert with each other should produce a bit more favorable results.
Re: How best to deal with security issues and CKEditor?
I wonder how common it is for web apps to rely completely on a client-side library like CKEditor to do all the validation. How easy it would be for others to take advantage of that.
Re: How best to deal with security issues and CKEditor?
PHP
strip_tags function
HTML Tidy
devshed article
ASP.NET
WebProNews article
public static string StripHtml(string html, bool allowHarmlessTags)
{
    if (html == null || html == string.Empty)
        return string.Empty;

    if (allowHarmlessTags)
        return System.Text.RegularExpressions.Regex.Replace(html, "", string.Empty);

    return System.Text.RegularExpressions.Regex.Replace(html, "<[^>]*>", string.Empty);
}
Re: How best to deal with security issues and CKEditor?
Checks based on regular expressions are usually too simple to stop a real attacker. The best way to do it is to take the incoming data, parse it into a DOM tree, and then filter out any node or attribute that isn't whitelisted. Then you serialize the tree back to an HTML string, and that's the final data.
Any other approach is vulnerable to browser bugs that allow code to execute in ways you didn't expect.
For example, I think some site suffered an attack because something like
<scri pt>alert('hello');</scri pt>
is executed in IE. Maybe that's not exactly the correct syntax, but it was something that a simple regexp looking for "script" couldn't find. If the cleanup is based only on allowing a set of tags and attributes, then it will be safe (or at least much safer; nothing is perfect).
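To make the parse, filter, and serialize steps concrete, here is a minimal sketch in Java using only the JDK's built-in XML APIs. It assumes the submitted content is well-formed XHTML (anything that does not parse is rejected outright), so real-world tag soup or entities such as &nbsp; would need a lenient HTML parser instead. The class name WhitelistFilter, the whitelist contents, and the set of accepted URL schemes are all hypothetical choices for illustration, not anything CKEditor prescribes.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class WhitelistFilter {
    // Hypothetical whitelist: element name -> attribute names allowed on it.
    private static final Map<String, Set<String>> ALLOWED = Map.of(
            "p", Set.of(), "strong", Set.of(), "ul", Set.of(),
            "ol", Set.of(), "li", Set.of(), "a", Set.of("href"));

    /** Returns the fragment with everything outside the whitelist removed. */
    static String sanitize(String fragment) {
        Element root;
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations so the parser cannot be abused (XXE).
            f.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            // Wrap the fragment so several top-level elements still parse.
            root = f.newDocumentBuilder()
                    .parse(new InputSource(new StringReader("<root>" + fragment + "</root>")))
                    .getDocumentElement();
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
        filterChildren(root);
        return serializeWithoutWrapper(root);
    }

    private static void filterChildren(Node parent) {
        // Copy the child list first, because nodes are removed while walking.
        List<Node> children = new ArrayList<>();
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            children.add(c);
        }
        for (Node child : children) {
            if (child.getNodeType() == Node.TEXT_NODE) {
                continue; // plain text is kept; the serializer escapes it
            }
            Set<String> attrs = child.getNodeType() == Node.ELEMENT_NODE
                    ? ALLOWED.get(child.getNodeName()) : null;
            if (attrs == null) {
                parent.removeChild(child); // unknown element, comment, CDATA, PI
            } else {
                filterAttributes((Element) child, attrs);
                filterChildren(child);
            }
        }
    }

    private static void filterAttributes(Element el, Set<String> allowed) {
        NamedNodeMap map = el.getAttributes();
        List<String> toRemove = new ArrayList<>();
        for (int i = 0; i < map.getLength(); i++) {
            String name = map.item(i).getNodeName();
            boolean unsafeHref = name.equals("href") && !isSafeUrl(map.item(i).getNodeValue());
            if (!allowed.contains(name) || unsafeHref) {
                toRemove.add(name);
            }
        }
        toRemove.forEach(el::removeAttribute);
    }

    // Assumption: only these schemes are acceptable in links.
    private static boolean isSafeUrl(String url) {
        String u = url.trim().toLowerCase(Locale.ROOT);
        return u.startsWith("http://") || u.startsWith("https://") || u.startsWith("mailto:");
    }

    private static String serializeWithoutWrapper(Element root) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(root), new StreamResult(out));
            String xml = out.toString();
            if (xml.equals("<root/>")) {
                return ""; // nothing survived the filter
            }
            return xml.substring("<root>".length(), xml.length() - "</root>".length());
        } catch (TransformerException e) {
            throw new IllegalStateException("could not serialize filtered DOM", e);
        }
    }
}
```

Calling WhitelistFilter.sanitize on a paragraph that carries an onclick attribute, a nested script element, and a javascript: link should return the paragraph with all three removed, while the text and the allowed tags remain. Because the decision is made on parsed element and attribute names rather than on raw text, spelling tricks that fool a regexp either fail to parse or produce a name that is simply not on the list.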
Re: How best to deal with security issues and CKEditor?
What are good DOM parsers for this approach in Java?
Re: How best to deal with security issues and CKEditor?
If CKEditor can do this, then I think the best approach is simply to design an XML schema and validate the submitted XHTML document against it on the server side. If validation fails, I report an error back to the browser, since the user very likely circumvented CKEditor.
If CKEditor cannot do this, then I still like the XML schema approach, but I want to tell the user exactly why their document failed (e.g. we don't support the script tag or the onsubmit attribute). Is there a decent Java XML schema validator that can report the reasons for a validation failure?
Does this make sense?
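For what it's worth, the JDK's own javax.xml.validation package can collect the reasons rather than stopping at the first one, if you give the validator an ErrorHandler that records each error. The sketch below assumes a deliberately tiny, hypothetical schema (a body holding p elements that may contain text and strong); the class name SchemaCheck and the message format are made up for illustration, and the message text itself comes from the underlying parser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

final class SchemaCheck {
    // Hypothetical schema: <body> holds <p> elements; <p> allows text and <strong>.
    private static final String XSD =
            "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
          + "<xs:element name='body'><xs:complexType><xs:sequence>"
          + "<xs:element name='p' minOccurs='0' maxOccurs='unbounded'>"
          + "<xs:complexType mixed='true'><xs:sequence>"
          + "<xs:element name='strong' type='xs:string' minOccurs='0' maxOccurs='unbounded'/>"
          + "</xs:sequence></xs:complexType></xs:element>"
          + "</xs:sequence></xs:complexType></xs:element>"
          + "</xs:schema>";

    /** Returns one message per validation problem; an empty list means valid. */
    static List<String> validate(String xml) {
        List<String> problems = new ArrayList<>();
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            // Do not let user-supplied documents pull in external DTDs or schemas.
            validator.setProperty(XMLConstants.ACCESS_EXTERNAL_DTD, "");
            validator.setProperty(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");
            validator.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { /* ignored */ }
                // Record the error and keep going so later problems are found too.
                @Override public void error(SAXParseException e) { problems.add(format(e)); }
                // Not well-formed: nothing more can be checked, so stop here.
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            validator.validate(new StreamSource(new StringReader(xml)));
        } catch (SAXParseException e) {
            problems.add(format(e));
        } catch (SAXException | IOException e) {
            problems.add(String.valueOf(e.getMessage()));
        }
        return problems;
    }

    private static String format(SAXParseException e) {
        return "line " + e.getLineNumber() + ", column " + e.getColumnNumber()
                + ": " + e.getMessage();
    }
}
```

A document that uses an attribute or element the schema does not declare comes back with a message naming the offending attribute or element, which is the kind of detail that could be passed back to the user. The messages are written for developers, so they may need rewording before end users see them.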
Re: How best to deal with security issues and CKEditor?
Could you please elaborate on what exactly this means: "take the incoming data, parse it into a DOM tree and then filter out any node or attribute that it isn't whitelisted"?
On the server side I do the following to validate the submitted values:
$allowed_html="<strong><p><a><ul><ol>";
$value = mysql_real_escape_string(strip_tags(html_entity_decode($_POST['value']),$allowed_html));
I have disabled the SOURCE button in CKEditor. But users can still type/paste in html tags. The editor encodes these so I use the decode, strip all non allowed tags, make it database safe and write it into the database. Am I missing anything essential to make it secure?
Thanks, Jens