PubMed to spreadsheet made easy [Mon, 16 Aug 2010 01:31:23 +0000]
Update, September 2010: This post refers to an Alpha version of PubMed2XL. You can get the latest version of the software here [http://blog.humaneguitarist.org/projects/pubmed2xl/].
Some time ago - exactly a year ago, actually! - I shared a post [http://blog.humaneguitarist.org/2009/08/15/xslt-a-practical-usage-example-with-pubmed-records/] on how to use XSLT to turn a PubMed XML file into an HTML table and in turn paste that into Microsoft Excel or OpenOffice Calc.
That's fine and all but that's still too "techy" for the average bear who just wants to get a list of articles into a spreadsheet. So, I've been working on some software called PubMed2XL to make the job super simple.
PubMed2XL's a GUI program written in Python and it uses PyQT [http://www.riverbankcomputing.co.uk/software/pyqt/intro]:
... a set of Python bindings for Nokia's [http://qt.nokia.com] Qt application framework and runs on all platforms supported by Qt including Windows, MacOS/X and Linux.
Since the program's still in early stages there's no real documentation but if you want to just play around with it and you use Windows you can get it here [http://blog.humaneguitarist.org/projects/pubmed2xl/]. [DEL: If it doesn't work, it's probably because you need a file called MSVCR71.dll which I can't legally distribute but I think you can find it if you are resourceful :DEL] .
Basically all you need to do is this:
1. Conduct searches in PubMed [http://www.ncbi.nlm.nih.gov/pubmed/].
2. Send your articles to the Clipboard [http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=helppubmed&part=pubmedhelp#pubmedhelp.Saving_citations_tem].
3. Send the results to "File" as XML.
4. Save the file as "pubmed_results.txt" which is the default name - of course, you can call the file something else if you want as long as it ends in ".txt" or ".xml".
5. Click on the file called PubMed2XL.exe and then choose FILE>SELECT PUBMED FILE as below:
IMAGE: "PubMed2XL screenshot"[http://blog.humaneguitarist.org/uploads/PubMed2XL/pubmed2xl_081510.png]
6. Then "open" the file you downloaded from PubMed (pubmed_results.txt).
7. You should now see an XLS (Microsoft Excel) file in the same folder as pubmed_results.txt.
That should pretty much be it. And by the way the Help currently just points your browser to blog.humaneguitarist.org [http://blog.humaneguitarist.org/] because, um, there's no help documentation yet.
If you're curious how this all works in the very general sense, I'm using a home-grown XML setup file (see below) that tells PubMed2XL which element or attribute value to extract from the pubmed_results.txt file. Then, the script uses the awesome pyExcelerator [http://sourceforge.net/projects/pyexcelerator/] module to write the data to an XLS file.
By using this XML file advanced users can change the data as well as the spreadsheet column names that are generated in the resultant XLS file. I'm trying to make this software as open and mutable as possible but casual users won't have to worry about anything since the defaults should eventually work just fine.
Right now, the main work I have left to do is to overcome one glaring weakness. PubMed2XL can currently only retrieve data from non-repeating XML elements. In other words, elements like an author's can't be extracted because there may be more than one author. What I'll eventually do is incorporate something in the setup file that tells PubMed2XL which occurrence of a repeating element to get data from: i.e. the last name of the primary author, etc.
If you are bored enough to download the zip file containing the program files, you'll notice the main executable file, PubMed2XL.exe, but also another file called PubMed2XL_CL.exe. Now this is exactly the same application but if you click on it you will see an ugly console window pop up in addition to PubMed2XL. The only reason I've included that file is to demonstrate that PubMed2XL can support command line arguments [http://en.wikipedia.org/wiki/Command-line_interface]. In other words if you were to go to the command line and type in
$ PubMed2XL_CL -h
you would see a message pop up on the command line showing you the options for passing arguments to the software via the command line.
Basically what this means is that you can tell PubMed2XL which PubMed file to process and what to call the resultant spreadsheet while bypassing the program's graphical interface. Now if you're working on just one file, the GUI version is definitely the way to go, but by incorporating command line functionality the program becomes instantly usable for batch-processing multiple files and also becomes a viable tool to incorporate on a server. In other words, it could be used on the back end of a website. For example, users could just upload their PubMed file to a website while having the XLS file emailed to them or something like that.
Anyway, there's still lots to do and when I've taken care of the issues I mentioned I'll release the source code if anyone's interested - or if Linux or MAC users want to get this up and running on their systems.
Ideally, I'd like this to become a nifty tool reference librarians could use to help their patrons with. Now if something like this is already out there, please let me know. No need to re-invent the wheel.
<?xml version="1.0" encoding="UTF-8" ?>
<config xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="PubMed2XL-0.8.9.xsd">
<column xPath="PubmedArticle/MedlineCitation/PMID" type="element" linkPrefix="http://www.ncbi.nlm.nih.gov/pubmed/">PMID</column>
<column xPath="PubmedArticle/MedlineCitation" type="attribute" attributeName="Owner" linkPrefix="none">Owner</column>
<column xPath="PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/Year" type="element" linkPrefix="none">Publication Year</column>
<column xPath="PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/Month" type="element" linkPrefix="none">Publication Month</column>
<column xPath="PubmedArticle/MedlineCitation/Article/Journal/Title" type="element" linkPrefix="http://www.ncbi.nlm.nih.gov/pubmed?term=">Journal</column>
<column xPath="PubmedArticle/MedlineCitation/MedlineJournalInfo/NlmUniqueID" type="element" linkPrefix="none">NLM ID</column>
<column xPath="PubmedArticle/MedlineCitation/Article/ArticleTitle" type="element" linkPrefix="none">Article Title</column>
<column xPath="PubmedArticle/MedlineCitation/Article/Abstract/AbstractText" type="element" linkPrefix="none">Abstract</column>
<column xPath="PubmedArticle/MedlineCitation/Article/Language" type="element" linkPrefix="none">Language</column>