dealing with a PubMed2XL bug [Wed, 16 Mar 2011 23:26:34 +0000]
Björn from Sweden has been using PubMed2XL [http://blog.humaneguitarist.org/projects/pubmed2xl/] and has suggested some additional features that we are working on. More on that some other time ...
But he also found a bug, or rather an oversight on my part. That needs to be dealt with first.
I didn't realize that some of the data in the PubMed.gov XML elements can be insanely long. We encountered an abstract in one article [http://www.ncbi.nlm.nih.gov/pubmed/20737003?report=abstract] that is nearly 50,000 characters long. That didn't crash PubMed2XL, but the resulting spreadsheet had all kinds of problems: values in the wrong column, the wrong cell, etc. I guess this is because, as I now know, Excel/OpenOffice don't let a cell carry more than about 32k characters. I don't know if this is true of newer versions of MS Excel, but whatever. 32k is enough!
So in a test version of the application, I added a length check and cutoff. If the data headed for a cell is longer than 32,000 characters, it gets truncated to the first 30,000 characters.
Eventually, I'll make it so that if the data is longer than 32k characters, the cell will contain colored text to tell the user, "Hey, this data is incomplete because it's so darn long!"
Anyway, as a note to myself, here's a code snippet that seems to be a quick patch. I'll upload the fixed version in a week or so. I'm moving and all, so my schedule's a bit wonky.
# Truncate overly long values so they fit within the spreadsheet's cell limit.
cell = getElement.text
if len(cell) > 32000:
    cell = cell[0:30000]
writeExcel.write(rowIter, columnIter, cell)
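For the colored-text idea, the truncation step would need to report whether it actually cut anything. Here's a rough sketch of how that might look as a standalone helper. The function name `fit_to_cell` and both constants are made up for illustration and are not part of PubMed2XL; the 32,000 and 30,000 values are just the ones used in this post (Excel's documented per-cell limit is 32,767 characters). Treating a missing value as an empty string is also my assumption.

```python
# Hypothetical helper, not PubMed2XL's actual code.
MAX_CELL_CHARS = 32000    # threshold used in this post
TRUNCATED_LENGTH = 30000  # how much text to keep when over the threshold


def fit_to_cell(text, limit=MAX_CELL_CHARS, keep=TRUNCATED_LENGTH):
    """Return (cell_text, was_truncated) for a value headed into a cell."""
    if text is None:
        # Assumption: an element with no text becomes an empty cell.
        return "", False
    if len(text) > limit:
        # Keep only the first `keep` characters and flag the cut.
        return text[:keep], True
    return text, False


# Usage: the flag could later decide whether the cell gets colored text.
cell, was_truncated = fit_to_cell("x" * 50000)
```

The writer could then check `was_truncated` and pick a different cell style for values that were cut short.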