fun with lxml [Mon, 07 Mar 2011 17:13:08 +0000]
First off, I don't consider myself a programmer. I know just enough to dabble, though I try to learn new stuff all the time in the hope that I, as someone in digital libraries, can occasionally write something that serves the needs of others rather than my ego. Don't get me started on people who write software that has no utility other than patting themselves on the back ... Anyway, that's another post for another time.

So, the other day I got some questions/feature requests for PubMed2XL [http://vimeo.com/15098984], and I started thinking about ways to tackle a few of the issues. It kinda makes me feel like a real programmer when people in the real world are asking about the software, but only for a few minutes before I make myself come back down to earth. :/

Currently, the software places the value of one XML element into a spreadsheet cell; which occurrence of that element gets used is defined by the user in the setup file [http://blog.humaneguitarist.org/projects/pubmed2xl/#Changing]. But it seems there may be a need to concatenate ALL the values for a given element into one spreadsheet cell, so I wrote a little function to help me get started with that.

The code uses this [http://www.w3schools.com/xml/simple.xml] simple restaurant-based XML file from W3Schools and the awesome lxml [http://lxml.de/] Python library. When run, it yields the following:
Calories for the first entree:
650
Calories for all entrees:
650; 900; 900; 600; 950
And here's the code:
#import required modules (lxml is non-standard; it likely needs to be installed)
import urllib #makes it easy to read documents from the web!
from lxml import etree #great XML parser and more!
#retrieve values from an XML file
def ElementCherryPicker(xpathArg, positionArg):
    '''
    This places all the element values for the element passed as the
    "xpathArg" argument into a list called "elementBox". It then returns
    the list item at the 1-based position given by the "positionArg" argument.
    This means passing a "1" equates to the first item in the list instead
    of the traditional "0". If "0" is passed then the entire list will be
    returned as a string with a delimiter of '; '.
    '''
    positionArg = positionArg - 1
    elements = parseUrl.findall(xpathArg) #make list of all matching elements
    elementBox = [] #create empty list
    for element in elements:
        elementBox.append(element.text) #place element values into the list
    if positionArg != -1:
        try:
            elementBox = elementBox[positionArg]
        except IndexError:
            elementBox = [] #if no element at stated position exists,
                            #then make the list empty again
    else:
        delimiter = '; '
        elementBox = delimiter.join(elementBox)
    return elementBox
#define, open, read, and parse an XML file
url = 'http://www.w3schools.com/xml/simple.xml'
readUrl = urllib.urlopen(url).read() #urllib.urlopen() is Python 2 only
parseUrl = etree.XML(readUrl)
#print header and the values returned from ElementCherryPicker()
print 'Calories for the first entree:'
print ElementCherryPicker('.//calories', 1)
print 'Calories for all entrees:'
print ElementCherryPicker('.//calories', 0)
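
For anyone on Python 3, here's a rough sketch of the same idea. It is not what PubMed2XL does, just one way the function could look. A few things in it are my own assumptions: the name cherry_pick, passing the parsed tree in as an argument instead of relying on a global, returning an empty string when nothing sits at the requested position, and the inline SAMPLE_XML, which is a trimmed stand-in for the W3Schools file so the example runs without a network connection. It also falls back to the standard library's ElementTree if lxml isn't installed, since both offer XML() and findall() for this simple case.

```python
# lxml is preferred; the standard library's ElementTree is a fallback
# (an assumption for portability, both expose XML() and findall()).
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Stand-in data: only the calories values from the output shown above.
SAMPLE_XML = b"""<breakfast_menu>
  <food><calories>650</calories></food>
  <food><calories>900</calories></food>
  <food><calories>900</calories></food>
  <food><calories>600</calories></food>
  <food><calories>950</calories></food>
</breakfast_menu>"""


def cherry_pick(root, xpath, position=0, delimiter='; '):
    """Return element text for every match of xpath under root.

    position is 1-based: 1 returns the first match. A position of 0
    returns all matches joined by delimiter. A position with no
    matching element returns an empty string.
    """
    # Elements with no text yield None, so substitute '' to keep join() happy.
    values = [element.text or '' for element in root.findall(xpath)]
    if position == 0:
        return delimiter.join(values)
    if 1 <= position <= len(values):
        return values[position - 1]
    return ''


root = etree.XML(SAMPLE_XML)
print('Calories for the first entree:')
print(cherry_pick(root, './/calories', 1))
print('Calories for all entrees:')
print(cherry_pick(root, './/calories', 0))
```

Run against the stand-in data, this prints 650 for the first entree and 650; 900; 900; 600; 950 for all of them. Passing the tree in as root means the function doesn't depend on a variable defined later in the script, which should make it easier to reuse.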