PDF conversion method for CCA
For each PDF paper available in the common research information space (RePEc + Socionet), we run a conversion and create text versions of the paper. This lets us parse citation data and perform citation content analysis (CCA). The resulting text versions are publicly available, which makes the data source for our citation content analysis transparent. Anyone can also take this open data source and use it for their own citation data extraction and analysis.
A conversion utility for PDF papers should produce data that allows correct parsing of all citation-related data specified by the CCA concept. At the same time, the parsed citation data should carry attributes that allow the data, and the results of its analysis, to be visualized for readers of the PDF papers. Visualization is needed to make the results of our citation content analysis transparent and to allow public control over their correctness. As the visualization method for citation data, we have chosen PDF annotation tools based on the PDF.js library and additional Hypothes.is modules.
To meet these requirements, including the necessary visualization attributes, we created a conversion tool based on the open source node.js and PDF.js software. The conversion of each PDF paper produces two files: 1) the paper as a JSON file, which allows finding and selecting text strings related to citation data, and 2) the paper as a plain text file, which allows proper calculation of the start and end coordinates of the selected data in the PDF paper, as needed for its visualization.
The JSON version of a PDF paper includes some formatting attributes that make it possible to find and select the paper's sections, their titles, page headers and footers, etc. Below is a fragment of one paper's JSON version. The paper's text is located in the "str" fields.
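As a rough illustration of how the "str" fields can be searched for citation data, the sketch below scans a list of items for bracketed numeric in-text markers such as [3] or [7, 12]. The sample items, the `page` field, the function name `findCitationMarkers`, and the regular expression are all assumptions made for this example; they are not taken from the actual JSON files or from the project's parser.

```javascript
// Hypothetical sample data: each item holds one text run in "str".
// The "page" field and all values are invented for illustration only.
const items = [
  { page: 1, str: "Citation analysis is widely used [3]." },
  { page: 1, str: "See also earlier work [7, 12]." },
  { page: 2, str: "No markers in this line." },
];

// Assumed marker style: bracketed numbers, optionally comma-separated.
const MARKER = /\[(\d+(?:\s*,\s*\d+)*)\]/g;

// Collect every marker together with the item it was found in.
function findCitationMarkers(items) {
  const found = [];
  items.forEach((item, itemIndex) => {
    for (const match of item.str.matchAll(MARKER)) {
      found.push({
        itemIndex,
        page: item.page,
        marker: match[0],
        // Reference numbers listed inside the brackets.
        refs: match[1].split(",").map((s) => Number(s.trim())),
      });
    }
  });
  return found;
}

console.log(findCitationMarkers(items));
```

A real parser would also need to handle author-year citations and markers split across several items, which this sketch does not attempt.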
The plain text version of a PDF paper includes the text only; a fragment for the same paper is shown below. The text is organized to match the way PDF.js specifies the coordinates of selected strings in the text.
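To make the idea of start and end coordinates concrete, here is a minimal sketch that assumes the coordinates are character offsets into the plain text version, similar in spirit to the start/end pair of a text position selector used by annotation tools. The function name `locate`, the sample text, and the offset convention (start inclusive, end exclusive) are assumptions for this example, not the project's documented behavior.

```javascript
// Assumption: a "coordinate" is a character offset into the plain text file.
// Returns the offsets of the first occurrence of `quote` at or after
// `fromIndex`, or null when the quote does not occur.
function locate(plainText, quote, fromIndex = 0) {
  const start = plainText.indexOf(quote, fromIndex);
  if (start === -1) return null;
  return { start, end: start + quote.length, exact: quote };
}

// Hypothetical plain text, invented for illustration only.
const plainText =
  "Citation analysis is widely used [3]. See also earlier work [7, 12].";

const position = locate(plainText, "[7, 12]");
console.log(position);

// The offsets can be checked by slicing the same text again.
console.log(plainText.slice(position.start, position.end));
```

Because the offsets are only valid for the exact text they were computed from, the plain text file and the text that the viewer extracts from the PDF would need to match character for character for this approach to work.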
For the paper http://dspacecris.eurocris.org/bitstream/11366/526/1/CRIS2016_paper_40_Parinov.pdf (available for annotation at https://socionet.ru/pdfviewer.xml?file=http://dspacecris.eurocris.org/bitstream/11366/526/1/CRIS2016_paper_40_Parinov.pdf), the conversion to text produced two files: