Java XML/DOM question

HMS_Irruncible · January 22, 2008, 2:58am

For some reason Xerces is parsing whitespace as text elements as I’m traversing the tree… pls help a noob out here…

Here’s a fragment of the XML file:



	<key>Tracks</key>
	<dict>
		<key>1185</key>
		<dict>
			<key>Track ID</key><integer>1185</integer>

When I call getChildNodes on the <key> node, the first 3 nodes I get are text nodes consisting of newline, tab, and tab (which beautifully describes the document indentation but is not remotely what I’m interested in).

I gather I’m missing some bit of information that causes the whitespace to be parsed, but damned if I know what that might be. Any takers?

PS - if anyone is curious I’m trying to write a utility to manipulate the iTunes library and educate myself on DOM whilst I’m at it.

LSLGuy · January 22, 2008, 3:16am

IANA Java guy, but generally speaking that’s correct XML DOM behavior.

I am a .Net guy, and in .Net the object which represents the entire XML document has a property you can set which controls whether whitespace nodes between elements are ignored or presented through the various child collections.

I suggest you look at your document object for a similar property. There may be a separate navigation object to permit XPath access or to expose an event interface like SAX. Those would be another place to look for the property.

Finally, it might be a parameter to the method you use to populate the DOM object from the xml-formatted text stream in the first place.

HMS_Irruncible · January 22, 2008, 1:55pm

Thanks for that info. I’ve been poking around for that information, and it seems all we have is the option to ignore all whitespace, or not. So if I want to ignore these friendly-formatting tabs and newlines, I also have to discard every element that has spaces. :smack:

I just can’t believe they failed to foresee an XML document formatted with newlines and tab indents for easy human readability, yet still containing space characters within the elements. That would mean overlooking what is arguably the most common use case of XML. Surely there must be something else I’m missing.

Also - just curious - is SAX just used for parsing, or can it be used for editing and saving a document as well?

LSLGuy · January 23, 2008, 4:31am

I think you misunderstand the operation of the whitespace filter.

“whitespace” is a magic word in XML, with a clearly defined meaning in the standard. It is NOT just a synonym for invisible characters.

Again, I have zero experience with Java, but the idea that the whitespace filter would crush the embedded blank out of


<name>John Smith</name>

is ludicrous. I suggest you try it to see what really happens.

SAX parsers are mostly for reading. I’m not an expert on them, and there are many open source implementations. I have to imagine at least one exposes methods for editing, but from my distance that feels like it’d be pretty awkward.
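
The whitespace point above can be checked directly in Java. What follows is only a sketch under stated assumptions: the sample document, its internal DTD, and the class name IgnorableWhitespaceDemo are made up for illustration and are not the iTunes file. In JAXP the relevant switch is DocumentBuilderFactory.setIgnoringElementContentWhitespace, and it relies on a DTD content model to tell which whitespace sits between elements, so the parser is put in validating mode here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class IgnorableWhitespaceDemo {
    // Hypothetical sample document. The internal DTD declares that <people>
    // holds only elements, so the indentation between its children is
    // whitespace in element content.
    static final String XML =
        "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE people [\n"
        + "  <!ELEMENT people (name*)>\n"
        + "  <!ELEMENT name (#PCDATA)>\n"
        + "]>\n"
        + "<people>\n"
        + "\t<name>John Smith</name>\n"
        + "\t<name>Jane Doe</name>\n"
        + "</people>\n";

    static Document parse(boolean ignoreWhitespace) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // The flag below depends on the DTD content model, hence validation.
        factory.setValidating(true);
        factory.setIgnoringElementContentWhitespace(ignoreWhitespace);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(XML)));
    }

    public static void main(String[] args) throws Exception {
        Element kept = parse(false).getDocumentElement();
        Element dropped = parse(true).getDocumentElement();
        System.out.println("child nodes, whitespace kept:    " + kept.getChildNodes().getLength());
        System.out.println("child nodes, whitespace ignored: " + dropped.getChildNodes().getLength());
        // The space inside the element text is character data, not ignorable whitespace.
        System.out.println("first name: " + dropped.getFirstChild().getTextContent());
    }
}
```

Whether the flag helps with a particular file depends on the parser having a DTD for it; without one, the parser has no way to know that the indentation is not data.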

Civil_Guy · January 23, 2008, 5:10am

Another amateur here. I’ve had some success with using the Java code for XML / DOM, although it’s been a slight struggle.

For my application, for input, I wrote up a generic document reader function:



	import java.io.File;
	import java.io.FileNotFoundException;
	import java.io.IOException;
	import org.w3c.dom.Document;

	// Parses f_in into a DOM Document; every failure is rethrown as an IOException.
	public static Document makeDoc(File f_in) throws IOException {
		javax.xml.parsers.DocumentBuilder docBldr;
		Document doc = null;
		String strErrMsg = null;

		try
		{
			docBldr = javax.xml.parsers.DocumentBuilderFactory.newInstance().newDocumentBuilder();
			doc = docBldr.parse(f_in);
		}
		catch (FileNotFoundException err1)
		{
			strErrMsg = "File not found - no codes definitions available.\n" + err1.toString();
		}
		catch (javax.xml.parsers.FactoryConfigurationError err2)
		{
			strErrMsg = "XML Reader not available.\n" + err2.toString();
		}
		catch (javax.xml.parsers.ParserConfigurationException err3)
		{
			strErrMsg = "XML Parser not available.\n" + err3.toString();
		}
		catch (org.xml.sax.SAXException err4)
		{
			// The file was read but is not well-formed XML
			strErrMsg = "Corrupted configuration file.\n" + err4.toString();
		}
		catch (IOException err5)
		{
			// Any other I/O failure while reading the file
			strErrMsg = "Corrupted configuration file.\n" + err5.toString();
		}

		if (strErrMsg != null) throw new IOException(strErrMsg);
		return doc;
	}

From there, I’ve got pretty brute-force code for traversing the expected schema within the return document. Still, as far as the schema goes, I’ve figured that I know too little myself to go presuming much about miscellaneous nodes that might have gotten put into the source file - that is, that somehow there could be extra processing nodes, comment nodes, or what have you.

For text nodes, I don’t think I’d make any assumptions about the quantity or type of whitespace; it only serves as a delimiter between words. AFAICT.
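
One way to traverse without presuming anything about stray nodes is to look only at element children and skip everything else (whitespace text, comments, processing instructions). A minimal sketch, assuming a made-up, well-formed completion of the fragment quoted at the top of the thread; the class name ElementChildren and its helpers are hypothetical:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class ElementChildren {
    // Return only the element children of a node; text nodes (including
    // indentation), comments and processing instructions are skipped.
    static List<Element> of(Node parent) {
        List<Element> result = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                result.add((Element) n);
            }
        }
        return result;
    }

    // Helper for the demo: parse an XML string with default factory settings.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample, loosely modelled on the fragment in the question.
        String xml = "<dict>\n\t<key>Tracks</key>\n\t<!-- a stray comment -->\n"
                + "\t<dict>\n\t\t<key>1185</key>\n\t</dict>\n</dict>";
        for (Element e : of(parse(xml).getDocumentElement())) {
            System.out.println(e.getTagName());
        }
    }
}
```

Because nothing is removed from the tree, the original indentation nodes are still there when the document is written back out.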

HMS_Irruncible · January 23, 2008, 1:04pm

LSLGuy:

I think you misunderstand the operation of the whitespace filter.

“whitespace” is a magic word in XML, with a clearly defined meaning in the standard. It is NOT just a synonym for invisible characters.

Again, I have zero experience with Java, but the idea that the whitespace filter would crush the embedded blank out of
<name>John Smith</name>
is ludicrous. I suggest you try it to see what really happens.

OK, you’re right, it seems I was thinking of the Perl whitespace definition which does not apply here.

This raises yet another problem for me in that having parsed the doc and edited it, when I write it back to disk all the whitespace is missing! No doubt this is according to the design but it is counter to my expectation. I was hoping to read the doc, traverse it freely without regard to pretty-printed whitespace, and then write it back to disk with the whitespace intact. Any pointers there? Seems like there might be some pattern or technique suited to this task.

arseNal · January 23, 2008, 2:24pm

I seem to recall that when you want to output it, you have the option to pretty-print it. This wouldn’t necessarily keep the original tabs and spaces intact but it will look like a well-indented document.
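
A rough sketch of the pretty-print option using the standard javax.xml.transform API. The class name PrettyPrint and the sample document are made up for illustration. OutputKeys.INDENT is part of the standard API; the indent-amount property is a Xalan-specific hint that other transformer implementations may ignore, and the exact layout of the output can differ between Java versions.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class PrettyPrint {
    // Serialize a DOM tree with indentation requested from the transformer.
    static String toIndentedString(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        // Xalan-specific hint; implementations that do not know it may ignore it.
        transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    // Helper for the demo: parse an XML string with default factory settings.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical input with no whitespace between the elements.
        String compact = "<people><name>John Smith</name><name>Jane Doe</name></people>";
        System.out.println(toIndentedString(parse(compact)));
    }
}
```

To write to disk instead of a string, a StreamResult wrapping a File or an OutputStream can be passed to transform in the same way.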

I don’t have time right now but if you can’t google it, I’ll see if I can dig it up later.

MrSquishy · January 23, 2008, 7:17pm

Just an idea: once you’re happy with your DOM learning experience, and you’re ready to do it the “easy” way (depending on exactly what you’re trying to do, I guess), you might want to look into using XPath/XQuery. Once you get the hang of it, it’s much more fun than parsing the DOM yourself.
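
For a taste of the XPath route, here is a small sketch using the javax.xml.xpath API that ships with Java (XPath 1.0; XQuery would need a separate library). The sample document is a made-up, well-formed completion of the fragment in the question, and the class name TrackIdLookup is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class TrackIdLookup {
    // Find the <integer> that directly follows <key>Track ID</key>.
    // The indentation text nodes never need to be handled by hand.
    static String trackId(Document doc) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate("//key[.='Track ID']/following-sibling::integer[1]", doc);
    }

    // Helper for the demo: parse an XML string with default factory settings.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample modelled on the fragment in the question.
        String xml = "<dict>\n\t<key>Tracks</key>\n\t<dict>\n\t\t<key>1185</key>\n\t\t<dict>\n"
                + "\t\t\t<key>Track ID</key><integer>1185</integer>\n"
                + "\t\t</dict>\n\t</dict>\n</dict>";
        System.out.println(trackId(parse(xml)));
    }
}
```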
