Make petrarch2 output more JSON friendly #44

@cegme

When using the petrarch2 output, it would be helpful if it were valid JSON as well as Python-friendly. Currently, the output is a Python-specific literal. Can we make it abide by JSON rules? That way the data can still be read in Python (via the json package), and other languages and tools (e.g. Mongo, Redis) can read and use the output without custom converters.

Below is a snippet of petrarch2 output that shows the problem. The main issue is that actorroot, actortext, and eventtext contain dictionaries whose keys are tuples, and the enclosing meta dictionary has a tuple key of its own. JSON object keys must be strings.

python3
>>> s = """{u'nytasiapacific20160622.0002': {'sents': {1: {'geo-location': [{u'placename': u'Beirut', u'countrycode': u'LBN', u'lon': 35.49442, u'admin1': u'Beyrouth', u'lat': 33.88894, u'searchterm': u'Beirut'}], u'events': [(u'TUNJUD', u'NGAEDU', u'173')], 'content': u'A Tunisian court has jailed a Nigerian student for two years for helping young militants join an armed Islamic group in Beirut, his lawyer said Wednesday.', u'meta': {u'actorroot': {(u'TUNJUD', u'NGAEDU', u'173'): [u'', u'']}, (u'TUNJUD', u'NGAEDU', u'173'): [[u'JAILED'], [u'HAS']], u'eventtext': {(u'TUNJUD', u'NGAEDU', u'173'): u'has jailed'}, u'nouns': [([u' TUNISIAN', u' COURT'], [u'TUNJUD'], [(u'TUN', []), [u'~']]), ([u' NIGERIAN', u' STUDENT'], [u'NGAEDU'], [(u'NGA', []), [u'~']]), ([u' MILITANTS', u' ARMED ISLAMIC GROUP', u' BEIRUT'], [u'DZAREBUAF', u'LBNUAF'], [[u'~'], (u'DZAREB', []), (u'LBN', [])]), ([u' LAWYER'], [u'~JUD'], [[u'~']])], u'actortext': {(u'TUNJUD', u'NGAEDU', u'173'): [u'Tunisian court', u'Nigerian student']}}, 'parsed': u'(S (S (NP (DT A )  (NNP TUNISIAN )  (NN COURT )  )  (VP (VBZ HAS )  (VP (VBN JAILED )  (NP (DT A )  (NNP NIGERIAN )  (NN STUDENT )  )  (PP (IN FOR )  (NP (NP (CD TWO )  (NNS YEARS )  )  (PP (IN FOR )  (S (VP (VBG HELPING )  (S (NP (JJ YOUNG )  (NNS MILITANTS )  )  (VP (VB JOIN )  (NP (DT AN )  (JJ ARMED )  (JJ ISLAMIC )  (NN GROUP )  )  (PP (IN IN )  (NP (NNP BEIRUT )  )  )  )  )  )  )  )  )  )  )  )  )  (, , )  (NP (PRP$ HIS )  (NN LAWYER )  )  (VP (VBD SAID )  (NP (NNP WEDNESDAY )  )  )  (. . 
)  )  ', u'issues': [[u'STUDENTS', 1], [u'NAMED_TERROR_GROUP', 1]]}}, 'meta': {'date': '20160621', 'headline': u'Lightning Ridge Journal: An Amateur Undertaking in Australian Mining Town With No Funeral Home', u'verbs': {u'actorroot': {(u'TUNJUD', u'NGAEDU', u'173'): [u'', u'']}, (u'TUNJUD', u'NGAEDU', u'173'): [[u'JAILED'], [u'HAS']], u'eventtext': {(u'TUNJUD', u'NGAEDU', u'173'): u'has jailed'}, u'nouns': [([u' TUNISIAN', u' COURT'], [u'TUNJUD'], [(u'TUN', []), [u'~']]), ([u' NIGERIAN', u' STUDENT'], [u'NGAEDU'], [(u'NGA', []), [u'~']]), ([u' MILITANTS', u' ARMED ISLAMIC GROUP', u' BEIRUT'], [u'DZAREBUAF', u'LBNUAF'], [[u'~'], (u'DZAREB', []), (u'LBN', [])]), ([u' LAWYER'], [u'~JUD'], [[u'~']])], u'actortext': {(u'TUNJUD', u'NGAEDU', u'173'): [u'Tunisian court', u'Nigerian student']}}}}}"""
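>>> import ast, pprint
>>> z = ast.literal_eval(s.replace("\n", ""))  # drop any stray line breaks, then parse the Python literal
>>> pprint.pprint(z)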
{'nytasiapacific20160622.0002':
  {'meta': 
    {'date': '20160621',
             'headline': 'Lightning Ridge Journal: An Amateur Undertaking in Australian Mining Town With No Funeral Home',
             'verbs': {'actorroot': {('TUNJUD', 'NGAEDU', '173'): ['', '']},
                      'actortext': {('TUNJUD', 'NGAEDU', '173'): ['Tunisian court', 'Nigerian student']},
                      'eventtext': {('TUNJUD', 'NGAEDU', '173'): 'has jailed'},
                      'nouns': [([' TUNISIAN', ' COURT'], ['TUNJUD'], [('TUN', []), ['~']]),
                               ([' NIGERIAN', ' STUDENT'], ['NGAEDU'], [('NGA', []), ['~']]),
                               ([' MILITANTS', ' ARMED ISLAMIC GROUP', ' BEIRUT'], ['DZAREBUAF', 'LBNUAF'],
                               [['~'], ('DZAREB', []), ('LBN', [])]),
                               ([' LAWYER'], ['~JUD'], [['~']])],
                      ('TUNJUD', 'NGAEDU', '173'): [['JAILED'], ['HAS']]}},
   'sents': {1: {'content': 'A Tunisian court has jailed a Nigerian student for two years for helping young militants join an armed '
                          'Islamic group in Beirut, his lawyer said Wednesday.',
               'events': [('TUNJUD', 'NGAEDU', '173')],
               'geo-location': [{'admin1': 'Beyrouth',
                               'countrycode': 'LBN',
                               'lat': 33.88894,
                               'lon': 35.49442,
                               'placename': 'Beirut',
                               'searchterm': 'Beirut'}],
               'issues': [['STUDENTS', 1], ['NAMED_TERROR_GROUP', 1]],
               'meta': {'actorroot': {('TUNJUD', 'NGAEDU', '173'): ['', '']},
                       'actortext': {('TUNJUD', 'NGAEDU', '173'): ['Tunisian court', 'Nigerian student']},
                       'eventtext': {('TUNJUD', 'NGAEDU', '173'): 'has jailed'},
                       'nouns': [([' TUNISIAN', ' COURT'], ['TUNJUD'], [('TUN', []), ['~']]),
                                ([' NIGERIAN', ' STUDENT'], ['NGAEDU'], [('NGA', []), ['~']]),
                                ([' MILITANTS', ' ARMED ISLAMIC GROUP', ' BEIRUT'], ['DZAREBUAF', 'LBNUAF'],
                                [['~'], ('DZAREB', []), ('LBN', [])]),
                                ([' LAWYER'], ['~JUD'], [['~']])],
                       ('TUNJUD', 'NGAEDU', '173'): [['JAILED'], ['HAS']]},
               'parsed': '(S (S (NP (DT A )  (NNP TUNISIAN )  (NN COURT )  )  (VP (VBZ HAS )  (VP (VBN JAILED )  (NP (DT A )  (NNP '
                         'NIGERIAN )  (NN STUDENT )  )  (PP (IN FOR )  (NP (NP (CD TWO )  (NNS YEARS )  )  (PP (IN FOR )  (S (VP '
                         '(VBG HELPING )  (S (NP (JJ YOUNG )  (NNS MILITANTS )  )  (VP (VB JOIN )  (NP (DT AN )  (JJ ARMED )  (JJ '
                         'ISLAMIC )  (NN GROUP )  )  (PP (IN IN )  (NP (NNP BEIRUT )  )  )  )  )  )  )  )  )  )  )  )  )  (, , )  '
                         '(NP (PRP$ HIS )  (NN LAWYER )  )  (VP (VBD SAID )  (NP (NNP WEDNESDAY )  )  )  (. . )  )  '}}}}
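
The problem can be reproduced with a much smaller dictionary than the full output above. This is only a minimal sketch using the standard json package; the variable names are made up for illustration. It shows that tuples used as values are serialized as arrays, while a tuple used as a key makes json.dumps raise a TypeError.

```python
import json

# One event triple, copied from the output above.
event = ("TUNJUD", "NGAEDU", "173")

# Tuples as values are fine: json writes them out as arrays.
as_value = json.dumps({"events": [event]})

# Tuples as keys are rejected, because JSON object keys must be strings.
try:
    json.dumps({"eventtext": {event: "has jailed"}})
    key_error = None
except TypeError as exc:
    key_error = exc

print(as_value)
print(type(key_error).__name__)
```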

Three alternatives are:

  1. Quotify the key:
    'actortext': {('TUNJUD', 'NGAEDU', '173'): ['Tunisian court', 'Nigerian student']},
    to
    'actortext': {"['TUNJUD', 'NGAEDU', '173']": ["Tunisian court", "Nigerian student"]},

  2. Use arrays instead of tuples/dictionaries:
    'actortext': {('TUNJUD', 'NGAEDU', '173'): ['Tunisian court', 'Nigerian student']},
    to
    'actortext': [["TUNJUD", "NGAEDU", "173"], ["Tunisian court", "Nigerian student"]],

  3. Use more descriptive dictionaries (code, text key pairs):
    'actortext': {('TUNJUD', 'NGAEDU', '173'): ['Tunisian court', 'Nigerian student']},
    to
    'actortext': {"code" : ["TUNJUD", "NGAEDU", "173"], "text": ["Tunisian court", "Nigerian student"]},

This output structure is decided during the do_coding phase of petrarch2, so it seems likely that changing it would break a lot of existing code.
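
One way to avoid touching do_coding would be to convert the dictionary just before it is written out. The sketch below follows the spirit of alternative 1, with one assumption that differs from the example above: the tuple key is encoded with json.dumps rather than Python's repr, so that a reader can recover the triple with json.loads. The helper name to_json_friendly is hypothetical and is not part of petrarch2.

```python
import json


def to_json_friendly(obj):
    """Return a copy of obj in which tuple dict keys become JSON-array strings.

    Hypothetical helper, not part of petrarch2. Tuples used as values are
    turned into lists, which is what the json package would do anyway.
    """
    if isinstance(obj, dict):
        return {
            (json.dumps(list(key)) if isinstance(key, tuple) else key):
                to_json_friendly(value)
            for key, value in obj.items()
        }
    if isinstance(obj, (list, tuple)):
        return [to_json_friendly(item) for item in obj]
    return obj


# A small piece of the output shown above.
sample = {
    "actortext": {("TUNJUD", "NGAEDU", "173"): ["Tunisian court", "Nigerian student"]},
    "events": [("TUNJUD", "NGAEDU", "173")],
}

encoded = json.dumps(to_json_friendly(sample))
decoded = json.loads(encoded)

# The original triple can be recovered from the string key.
first_key = next(iter(decoded["actortext"]))
recovered = json.loads(first_key)
```

Existing Python consumers that index by tuple would still need to change, but only at the point where they read the file. Note that integer keys such as the sentence number 1 are written as the string "1" by the json package, so they do not round-trip as integers either.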
