I decided to write my own XML parser when I was faced with the task of loading non-well-formed XML files into an object model in C#.
You might rightfully wonder why on earth I wasted time creating an XML parser when the .NET Framework can already parse XML perfectly well in several ways. My answer:
- To enable loading of not-so-well-formed XML into a tree structure.
- To make insertion of character entities in a document tree possible.
- To have the power to change how markup is parsed.
- Because it was a very fun thing to do.
The ideal solution is obviously to get the producer of the markup to fix the existing well-formedness issues, but in reality this is not always an option.
In my case, however, fixing the bad XML is very much an option, because I produce the markup myself. I just don't want to! I write markup in a simplified HTML syntax for a flat-file-based content management system. This markup is in turn embedded in an XML file containing metadata. At a later stage, the metadata is extracted and the markup is transformed into well-formed XHTML. It is not possible to handle such a file using System.Xml, nor is it possible (or at least practical) to do so using a library such as HTML Agility Pack.
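As a minimal sketch of why System.Xml is ruled out, the following feeds an HTML-style fragment with an unclosed `<br>` tag to `XmlDocument`. The sample markup and the names `StrictParsingDemo` and `TryLoad` are made up for illustration; they are not taken from the content management system described above.

```csharp
using System;
using System.Xml;

public static class StrictParsingDemo
{
    // Returns true if System.Xml accepts the markup, false if it rejects it.
    public static bool TryLoad(string markup)
    {
        try
        {
            var document = new XmlDocument();
            document.LoadXml(markup);
            return true;
        }
        catch (XmlException)
        {
            // LoadXml throws XmlException for markup that is not well formed.
            return false;
        }
    }

    public static void Main()
    {
        // Well formed: every start tag has a matching end tag.
        Console.WriteLine(TryLoad("<p>First line<br/>Second line</p>"));

        // HTML-style markup (hypothetical sample): <br> is never closed.
        Console.WriteLine(TryLoad("<p>First line<br>Second line</p>"));
    }
}
```

The first call succeeds and the second fails, so a single unclosed tag is enough to make the whole file unloadable.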
One important reason for writing my own parser is the trouble I have had inserting character entities into an XmlDocument instance.
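The exact issues are not spelled out here, but one behavior that is easy to reproduce, and that may be part of the picture, is that `XmlDocument` escapes the ampersand when an entity is written into a text node as plain text. The sketch below assumes that is the goal; `EntityEscapingDemo` and `SerializeWithText` are illustrative names only.

```csharp
using System;
using System.Xml;

public static class EntityEscapingDemo
{
    // Appends the given string as a text node and returns the serialized markup.
    public static string SerializeWithText(string text)
    {
        var document = new XmlDocument();
        XmlElement root = document.CreateElement("root");
        document.AppendChild(root);
        root.AppendChild(document.CreateTextNode(text));
        return document.OuterXml;
    }

    public static void Main()
    {
        // The ampersand is escaped on output, so the entity reference is lost.
        Console.WriteLine(SerializeWithText("This is a text &hellip;"));
        // prints: <root>This is a text &amp;hellip;</root>
    }
}
```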
Strictly speaking, it is not an XML parser. It is only an XML-ish parser capable of parsing XML to some extent, hence the name QuasiXML.
- Preserve or normalize whitespace in attribute values.
- Throw exceptions or continue on errors.
- Render indented or non-indented markup.
This simple C# example demonstrates how the string "..." can be replaced with the HTML character entity &hellip; (…) in a text node:

    string markup = @"<root> <element>This is a text ... </element> </root>";

    // Parse markup
    var root = new QuasiXmlNode();
    root.OuterMarkup = markup;

    // Modify object model
    QuasiXmlNode textNode = root["element"].Children;
    textNode.Value = textNode.Value.Replace("...", "&hellip;");
The following code snippet demonstrates how all links in an XHTML document can be modified:

    ...
    var root = new QuasiXmlNode();
    root.OuterMarkup = markup;

    var links = root.Descendants
        .Where(node => node.Name == "a" && node.Attributes.ContainsKey("href"))
        .ToList();

    foreach (QuasiXmlNode node in links)
        node.Attributes["href"] = Foo(node);

Visit http://quasixml.codeplex.com/ to download binaries and source code if you want to try it out.