blog.humaneguitarist.org

too many brackets are being typed in the dark: XPath-like expressions for Python dictionaries and JSON

[Sun, 10 Nov 2013 16:29:19 +0000]
When I first started learning how to program and parse data, the data formats I first got acquainted with were delimited text files and XML. I avoided JSON as long as I could because I found (find?) it far less elegant than XML. Actually, whenever I read programmers putting down XML because they think JSON has killed it, I'm not sure they have worked with anything where XML is probably a far, far better fit, like MusicXML. I find XML infinitely easier to read. Not to mention that XSDs and schemas are a good thing, even if they are a pain to write.

But XPath [http://en.wikipedia.org/wiki/XPath] is what I really like about XML. I hate it when people try to regex all over XML. I feel like saying, "Dude: this problem's already been solved!"

I prefer to write this for XML:

    "foo/bar[1]"

to this, which is how I'd parse a JSON string after it's converted to a Python dictionary:

    ["foo"]["bar"][1]

Why? Because I start typing really slowly when intermixing brackets and quotation marks. Maybe I should just work on my typing.

But for now, I'm playing with a little Python function called "jpath" that will allow me to use simple XPath-like expressions to return data from a Python dictionary, a la:

    >>> d = {"foo":{"bar":1}, "baz":[2,3]}
    >>> jpath(d, "foo")
    {'bar': 1}
    >>> jpath(d, "foo", "type")
    'dict'
    >>> jpath(d, "foo", "text")
    'bar'
    >>> jpath(d, "foo/bar", "text")
    '1'
    >>> jpath(d, "baz", "type")
    'list'
    >>> len(jpath(d, "baz"))
    2
    >>> jpath(d, "baz", "text")
    '[2, 3]'
    >>> jpath(d, "baz[1]")
    3
    >>>

Below is a slightly more elaborate example where, after pulling some data from the HathiTrust [http://www.hathitrust.org], the code parses the data using the jpath() function.

    from jpath import jpath
    from json import loads
    from urllib import urlopen

    # make request for "Edgar Allen Poe" (sic) to Hathi Trust.
    url = "http://chinkapin.pti.indiana.edu:9994/solr/meta/select/?q='edgar+allen+poe'&wt=json"
    hathi = urlopen(url) #get response
    hathi = hathi.read() #read response
    hathi = loads(hathi) #convert to dict

    ### iterate through nodes within first item; print fields and values.
    doc = jpath(hathi, "response/docs[0]") #get first item.
    del doc["fullrecord"] #delete MARC data: too much XML to post in blog for a small example!
    print ("*** Printing fields and values (as strings) for first item ...")
    print
    for each in doc:
        print str(each) + ": " + jpath(doc, each, "text")
    print

    # print Title of first item.
    print ("*** Printing Title of first item (as %s item) ...") %(jpath(hathi, "response/docs[0]/title", "type"))
    print jpath(hathi, "response/docs[0]/title")

The script will output this:

    >>>
    *** Printing fields and values (as strings) for first item ...

    mainauthor: [u'Poe, Edgar Allan, 1809-1849.']
    htrc_gender: [u'male']
    htrc_charCount: 346672
    htrc_pageCount: 274
    title_a: [u"Poe's poems"]
    title_c: [u'[by] Edgar Allen [!] Poe.']
    htrc_volumePageCountBin: M
    htrc_volumeWordCountBin: M
    oclc: [u'(OCoLC)16566920']
    id: uc2.ark:/13960/t8pc2vk16
    author: [u'Poe, Edgar Allan, 1809-1849.']
    sdrnum: [u'sdr-ia-srlf2759906']
    topicStr: [u'Poetics.']
    title_top: [u"Poe's poems [by] Edgar Allen [!] Poe."]
    publishDateRange: [u'1890']
    htrc_wordCount: 58434
    publisher: [u'The Henneberry Company']
    author_top: [u'Poe, Edgar Allan, 1809-1849.', u'[by] Edgar Allen [!] Poe.']
    publishDate: [u'1890']
    countryOfPubStr: [u'United States']
    htsource: [u'University of California']
    language: [u'English']
    htrc_genderMale: [u'Poe, Edgar Allan, 1809-1849']
    title_ab: [u"Poe's poems"]
    published: [u'Chicago, New York : The Henneberry Company, [189-?]']
    title: [u"Poe's poems [by] Edgar Allen [!] Poe."]

    *** Printing Title of first item (as list item) ...
    [u"Poe's poems [by] Edgar Allen [!] Poe."]
    >>>

And here's the "jpath.py" file containing the function.
    # jpath.py

    '''
    to do:
        - add support for leading double slashes a la XPath:
            >>> d = {"foo":{"bar":[1,2]}}
            >>> jpath(d, "//bar[1]")
            2
    '''

    #####
    def _jparse(_path):
        ''' Takes an XPath-like expression (nodes and positions only); returns a snippet
        used to get data from a dictionary.

        example:
            >>> _jparse("foo[1]/bar[2]")
            "['foo'][1]['bar'][2]"
        '''

        from re import split

        # start final output string.
        outpath = ""

        # split by square brackets.
        expression = split(("[\[|\]]"),_path) #see: http://stackoverflow.com/a/4998688

        # work on writing "outpath".
        for position in expression:
            try: #leave integers in brackets alone.
                test = int(position)
                open_quote, close_quote = "[", "]"
            except:
                open_quote, close_quote = "['", "']"

            # make string by splitting "_path" argument on forward slashes.
            slash = "".join([(open_quote + slashes + close_quote) for slashes in
                            position.split("/") if len(slashes) > 0])
            outpath = outpath + slash #append to string.

        return outpath

    #####
    def jpath(_dict, _path="", _format=""):
        ''' Takes a dictionary (name or string literal), an XPath-like expression
        (nodes and positions only), and an optional output format ("type" or "text")
        and returns the value. Returns an empty string if the expression fails. '''

        # parse "_path" to dictionary/brackets syntax.
        outpath = _jparse(_path)

        try:
            # make final string to be evaluated;
            # ex. >>> eval('{"foo":"bar"}["foo"]') #yields "bar".
            dict_expression = str(_dict) + outpath
            evaluated = eval(dict_expression)

            # prepare output format per "_format" parameter.
            _type = type(evaluated)
            if _format == "type":
                results = str(_type).split("'")[1]
            elif _format == "text":
                if _type == int:
                    pass #evaluated = str(evaluated)
                results = "".join(str(evaluated))
            else:
                results = evaluated
        except:
            results = ""

        # return results.
        return results

Yes, it's silly. Yes, I should just work on my typing. But I like it [http://en.wikipedia.org/wiki/Too_Many_Puppies] all the same.

...
Update, November 11, 2013: I guess I should have taken a look at these first, BTW: https://pypi.python.org/pypi/jsonpath/ and JSONPath - XPath for JSON [http://goessner.net/articles/JsonPath/]. I gotta remember to ask The Google first! :)
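Goessner's JSONPath article lists ".." as its recursive-descent operator, the counterpart of XPath's "//", which is exactly the to-do item at the top of jpath.py. For the curious, the core of that idea fits in a few lines of plain Python. This is only a rough sketch under stated assumptions: `find_all` is a hypothetical name, it matches a single key name rather than a full path, and it is neither how the jsonpath package does it nor part of jpath.py.

```python
def find_all(data, key):
    """Yield every value stored under `key` at any depth of nested dicts and lists.

    Results come out in whatever order the dictionaries are iterated.
    """
    if isinstance(data, dict):
        for name, value in data.items():
            if name == key:
                yield value
            # keep descending: the same key may appear again further down.
            for found in find_all(value, key):
                yield found
    elif isinstance(data, list):
        for item in data:
            for found in find_all(item, key):
                yield found


d = {"foo": {"bar": [1, 2]}}
matches = list(find_all(d, "bar"))
print(matches)        # [[1, 2]]
print(matches[0][1])  # 2, the result the "//bar[1]" to-do example asks for
```

A full "//" implementation would still need to feed each match back into the rest of the path, which is where the existing libraries earn their keep.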