OpenOffice to HTML trick [Sun, 09 May 2010 15:52:37 +0000]
I've recently been working on some documentation for a MusicXML platform I wrote called MXMLiszt - I'll be releasing the files/source in a few weeks.
In the past I've used the W3C's Amaya editor [http://www.w3.org/Amaya/] to write the HTML documentation for AudioRegent [http://blog.humaneguitarist.org/tag/audioregent/] but that's a really laborious process and requires a good bit of coding by hand even as I worked with the WYSIWYG environment.
So this time I decided to just use OpenOffice [http://www.openoffice.org/]. Problem is, when I did a Save As to HTML, the W3C's Validator [http://validator.w3.org/] gave me nearly 300 errors. Even worse, the export dumped the .html file and all the images in the document into the same directory. Boo. Ideally, the images should be in a subfolder for compartmentalization purposes.
Here's what worked better: instead of saving to HTML, I used the File>Preview in Web Browser option in OpenOffice. Since Firefox is my default browser, it opened in Firefox. Then I just used Firefox's Save Page As option with the "Web Page, complete" type. Firefox saved the file (let's call it "foo") as foo.htm and created a folder called "foo" that contained all the images. Sweet!
This time: only 13 errors from the W3C's p.o.v., all very minor - who knows where the other ~275 went?!
One way to cut down on errors is to make sure any image you embed in your OpenOffice .odt document has an alternative (alt) text attribute since that attribute is technically required for all images in HTML docs.
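If you'd rather catch those before running the validator, here's a rough Python sketch that lists the images in a saved page that have no alt attribute at all. The function and class names are just ones I made up for illustration, and it only uses Python's built-in html.parser module, so treat it as a starting point rather than a proper checker.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Collect the src of every <img> tag that has no alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <IMG SRC=...>
        # is handled the same way as <img src=...>.
        if tag != "img":
            return
        attr_map = dict(attrs)
        if "alt" not in attr_map:
            self.missing.append(attr_map.get("src", "(no src)"))


def images_missing_alt(html_text):
    """Return the src values of images lacking an alt attribute (hypothetical helper)."""
    finder = MissingAltFinder()
    finder.feed(html_text)
    finder.close()
    return finder.missing


if __name__ == "__main__":
    sample = '<p><IMG SRC="a.png" ALT="chart"><img src="b.png"></p>'
    print(images_missing_alt(sample))
```

Note that this only checks whether the attribute is present; an empty alt="" is treated as fine.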
I should mention that it's better to do this from a Linux box rather than Windows, as the former defaults to the UTF-8 encoding and the latter to Windows-1252. That's no huge deal for non-critical documents, but it's probably better to go with UTF-8 if you can.
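If you're stuck on Windows, one possible workaround is to re-encode the saved file afterwards. Here's a minimal Python sketch, assuming the page really is Windows-1252 and declares its encoding with a "charset=windows-1252" string somewhere in the markup. The function name is made up for illustration, and I haven't checked exactly what declaration OpenOffice or Firefox writes, so look at your own file first.

```python
import re


def reencode_html(raw_bytes, source_encoding="cp1252"):
    """Decode HTML bytes from source_encoding and return them as UTF-8 bytes.

    Also rewrites any 'charset=windows-1252' declaration so the page
    does not contradict its new encoding. Hypothetical helper.
    """
    text = raw_bytes.decode(source_encoding)
    text = re.sub(r"charset=windows-1252", "charset=utf-8", text,
                  flags=re.IGNORECASE)
    return text.encode("utf-8")


if __name__ == "__main__":
    # Build a tiny Windows-1252 page in memory rather than touching real files.
    page = ('<meta http-equiv="content-type" '
            'content="text/html; charset=windows-1252"><p>caf\u00e9</p>')
    converted = reencode_html(page.encode("cp1252"))
    print(converted.decode("utf-8"))
```

To use it on a real file, you'd read the bytes in binary mode, pass them through the function, and write the result back out in binary mode.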
Now I'm not going to waste my time hand-correcting a perfectly usable HTML doc for things like this. That's putting the cart before the horse. I'm just an average dude trying to share some info with some folks. I'm not an institution charged with a preservation mindset. These WordPress blog entries aren't valid either BTW ...
I realize the importance of standards and sustainability but the Open Document format (foo.odt) is what I would argue is the thing to save, the HTML version being just a convenient manifestation. Secondly, if all major browsers have no trouble with the document, then from a certain p.o.v. the HTML is valid. In a sense, it's much ado about nothing.
ps: If you're wondering about exporting to xHTML, don't.
... that's a much bigger pain.