Java XML/DOM question

For some reason Xerces is returning whitespace as text nodes as I’m traversing the tree… pls help a noob out here…

Here’s a fragment of the XML file:

			<key>Track ID</key><integer>1185</integer>

When I call getChildNodes on the <key> node, the first 3 nodes I get are text nodes consisting of newline, tab, and tab (which beautifully describes the document indentation but is not remotely what I’m interested in).

I gather I’m missing some bit of information that causes the whitespace to be parsed, but damned if I know what that might be. Any takers?

PS - if anyone is curious I’m trying to write a utility to manipulate the iTunes library and educate myself on DOM whilst I’m at it.

IANA Java guy, but generally speaking that’s correct XML DOM behavior.

I am a .Net guy, and in .Net the object which represents the entire XML document has a property you can set which controls whether whitespace nodes between elements are ignored or presented through the various child collections.

I suggest you look at your document object for a similar property. There may be a separate navigation object to permit XPath access or to expose an event interface like SAX. Those would be another place to look for the property.

Finally, it might be a parameter to the method you use to populate the DOM object from the xml-formatted text stream in the first place.
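
For the Java side, the closest match in the standard JAXP API appears to be DocumentBuilderFactory.setIgnoringElementContentWhitespace(true), set on the factory before the builder is created. Below is a minimal sketch, not a definitive recipe. Assumptions: the inline DTD and the <dict> wrapper are made up for the demo (the real iTunes file points at its own DTD), validation is switched on because the JAXP docs say the setting depends on the content model, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; the DTD and <dict> wrapper are assumptions for illustration.
class IgnoreWhitespaceSketch {

    // The inline DTD tells the parser that <dict> has element-only content,
    // which is what makes the newline and tab between elements "ignorable".
    static final String XML =
        "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE dict [\n"
        + "  <!ELEMENT dict (key, integer)*>\n"
        + "  <!ELEMENT key (#PCDATA)>\n"
        + "  <!ELEMENT integer (#PCDATA)>\n"
        + "]>\n"
        + "<dict>\n"
        + "\t<key>Track ID</key><integer>1185</integer>\n"
        + "</dict>\n";

    static Document parse(boolean ignoreWhitespace) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // the setting relies on the DTD content model
            factory.setIgnoringElementContentWhitespace(ignoreWhitespace);
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(XML)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse demo document", e);
        }
    }

    public static void main(String[] args) {
        Document kept = parse(false);
        Document stripped = parse(true);
        // Child count of <dict> with and without the indentation text nodes.
        System.out.println(kept.getDocumentElement().getChildNodes().getLength());
        System.out.println(stripped.getDocumentElement().getChildNodes().getLength());
        // The blank inside "Track ID" is ordinary character data and is kept.
        System.out.println(stripped.getDocumentElement().getFirstChild().getTextContent());
    }
}
```

Without a DTD the parser has no way of knowing which whitespace is ignorable, so in that case the setting would not be expected to remove anything.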

Thanks for that info. I’ve been poking around for that information, and it seems all we have is the option to ignore all whitespace, or not. So if I want to ignore these friendly-formatting tabs and newlines, I also have to discard every element that has spaces.

I just can’t believe they failed to foresee an XML document formatted with newlines and tab indents for easy human readability, yet still containing space characters within the elements. That would mean overlooking what is arguably the most common use case of XML. Surely there must be something else I’m missing.

Also - just curious - is SAX just used for parsing, or can it be used for editing and saving a document as well?

I think you misunderstand the operation of the whitespace filter.

“whitespace” is a magic word in XML with a clearly defined meaning in the standard. It is NOT just a synonym for invisible characters.

Again, I have zero experience with Java, but the idea that the whitespace filter would crush the embedded blank out of

<name>John Smith</name>

is ludicrous. I suggest you try it to see what really happens.

SAX parsers are mostly for reading. I’m not an expert on them, and there are many open source implementations. I have to imagine at least one exposes methods for editing, but from my distance that feels like it’d be pretty awkward.
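
To make the read-only flavor of SAX concrete, here is a minimal sketch using the JAXP SAX classes. The parser pushes events at a handler and never builds a tree, so there is nothing to edit in place. The <people> wrapper in the usage line and the class and method names are assumptions for illustration; only the <name>John Smith</name> element comes from the post above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: collects the text of every <name> element from SAX events.
class SaxReadSketch extends DefaultHandler {
    private final List<String> names = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a <name> element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("name".equals(qName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver one text run in several chunks, so accumulate.
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("name".equals(qName)) {
            names.add(current.toString());
            current = null;
        }
    }

    // Parses the given XML text and returns the content of each <name> element.
    static List<String> readNames(String xml) {
        try {
            SaxReadSketch handler = new SaxReadSketch();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.names;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse demo document", e);
        }
    }
}
```

Calling SaxReadSketch.readNames("<people>\n\t<name>John Smith</name>\n</people>") returns a one-element list holding "John Smith", blank included. The indentation around <name> is simply never collected because the handler only records text while inside that element.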

Another amateur here. I’ve had some success with using the Java code for XML / DOM, although it’s been a slight struggle.

For my application, for input, I wrote up a generic document reader function:

	// Needs imports: java.io.File, java.io.FileNotFoundException, java.io.IOException, org.w3c.dom.Document
	public static Document makeDoc(File f_in) throws IOException {
		javax.xml.parsers.DocumentBuilder DocBldr;
		Document doc = null;
		String strErrMsg = null;

		try {
			DocBldr = (javax.xml.parsers.DocumentBuilderFactory.newInstance()).newDocumentBuilder();
			doc = DocBldr.parse(f_in);
		} catch (FileNotFoundException err1) {
			strErrMsg = "File not found - no codes definitions available.\n\n" + err1.toString();
		} catch (javax.xml.parsers.FactoryConfigurationError err2) {
			strErrMsg = "XML Reader not available.\n\n" + err2.toString();
		} catch (javax.xml.parsers.ParserConfigurationException err3) {
			strErrMsg = "XML Parser not available.\n\n" + err3.toString();
		} catch (org.xml.sax.SAXException err4) {
			strErrMsg = "Corrupted configuration file.\n\n" + err4.toString();
		} catch (IOException err5) {
			strErrMsg = "Corrupted configuration file.\n\n" + err5.toString();
		}

		// Report every failure to the caller as a single IOException.
		if (strErrMsg != null) throw new IOException(strErrMsg);
		/* else */
		return doc;
	}

From there, I’ve got pretty brute-force code for traversing the expected schema within the returned document. Still, as far as the schema goes, I’ve figured that I know too little myself to go presuming much about miscellaneous nodes that might have gotten put into the source file - that is, that somehow there could be extra processing-instruction nodes, comment nodes, or what have you.

For text nodes, I don’t think I’d make any assumptions about the quantity or type of whitespace; it only serves as a delimiter between words. AFAICT.
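
One way to avoid presuming anything about stray nodes is to walk the children and keep only the elements, skipping text, comments, and processing instructions alike. A minimal sketch of that idea follows. The class and method names are hypothetical, and the sample document in main (including its comment) is made up for the demo.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class for element-only traversal.
class ElementChildrenSketch {

    // Returns only the element children of parent, in document order.
    // Whitespace text, comments, and processing instructions are skipped,
    // but they stay in the tree untouched.
    static List<Element> childElements(Node parent) {
        List<Element> result = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                result.add((Element) n);
            }
        }
        return result;
    }

    // Parses XML text with a default (non-validating) builder.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse demo document", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<dict>\n\t<!-- a comment -->\n\t<key>Track ID</key><integer>1185</integer>\n</dict>");
        for (Element e : childElements(doc.getDocumentElement())) {
            System.out.println(e.getTagName() + " = " + e.getTextContent());
        }
    }
}
```

Because nothing is removed from the tree, the original indentation nodes are still part of the document afterwards.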

OK, you’re right, it seems I was thinking of the Perl whitespace definition which does not apply here.

This raises yet another problem for me in that having parsed the doc and edited it, when I write it back to disk all the whitespace is missing! No doubt this is according to the design but it is counter to my expectation. I was hoping to read the doc, traverse it freely without regard to pretty-printed whitespace, and then write it back to disk with the whitespace intact. Any pointers there? Seems like there might be some pattern or technique suited to this task.

I seem to recall that when you want to output it, you have the option to pretty-print it. This wouldn’t necessarily keep the original tabs and spaces intact but it will look like a well-indented document.
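
In Java the pretty-print option is probably the INDENT output property on a javax.xml.transform.Transformer. Here is a minimal sketch under a few assumptions: the class and method names are hypothetical, the sample document is built in code just for the demo, and the indent-amount property is specific to Xalan-derived serializers such as the one bundled with the JDK, so other implementations may ignore it.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class for writing a DOM tree back out with indentation.
class PrettyPrintSketch {

    // Serializes the document through an identity transform with indenting on.
    static String toIndentedString(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            // Assumption: implementation-specific property, honored by Xalan-derived serializers.
            transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not serialize document", e);
        }
    }

    // Builds <dict><key>Track ID</key><integer>1185</integer></dict> with no whitespace nodes.
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element dict = doc.createElement("dict");
            doc.appendChild(dict);
            Element key = doc.createElement("key");
            key.setTextContent("Track ID");
            dict.appendChild(key);
            Element integer = doc.createElement("integer");
            integer.setTextContent("1185");
            dict.appendChild(integer);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Could not create document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toIndentedString(sampleDocument()));
    }
}
```

To write to disk instead, a StreamResult wrapping a File can be passed in place of the StringWriter. Indenting tends to behave best on a tree that has no leftover whitespace-only text nodes, since the serializer adds its own line breaks.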

I don’t have time right now but if you can’t google it, I’ll see if I can dig it up later.

Just an idea: once you’re happy with your DOM learning experience, and you’re ready to do it the “easy” way (depending on exactly what you’re trying to do, I guess), you might want to look into using XPath/XQuery. Once you get the hang of it, it’s much more fun than parsing the DOM yourself.
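
For a taste of the XPath route in Java, here is a minimal sketch with the standard javax.xml.xpath package. It looks up the value that immediately follows a given <key>, which is how the fragment in the original question pairs keys with values. The class and method names are hypothetical, and the second key/value pair in the usage note is invented for the demo.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class: finds the value element that follows a named <key>.
class XPathSketch {

    // Returns the text of the first element sibling after <key>keyName</key>,
    // or an empty string when no such key exists.
    static String valueForKey(String xml, String keyName) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // keyName is spliced in naively: fine for a demo, not safe for names containing quotes.
            String expr = "//key[. = '" + keyName + "']/following-sibling::*[1]";
            return xpath.evaluate(expr, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<dict>\n"
            + "\t<key>Track ID</key><integer>1185</integer>\n"
            + "\t<key>Name</key><string>Some Song</string>\n"
            + "</dict>";
        System.out.println(valueForKey(xml, "Track ID"));
        System.out.println(valueForKey(xml, "Name"));
    }
}
```

The expression selects elements only, so the newline and tab text nodes between them never get in the way.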