Creating a Reference Tree from an existing PDF Collection

I am currently in the snowballing phase of the systematic review approach (explained here). I have already accumulated a number of papers (most of them as PDF files) for my current research goals using keyword searches on several databases (Econis, EBSCOhost, ScienceDirect, etc.).


Now I have been thinking about how to proceed with analyzing the references cited in the papers I have already found. I was afraid that the work needed to go through every paper in my library by hand would be overwhelming.
My calculation was as follows: my library for this goal consists of 200+ papers; let’s assume each has 15 references, seven of which are already in my library. Of the remaining eight references per paper, three are going to be relevant for my research.

So in total I would have to screen over 3,000 references (for relevance to my research goal and for presence in my library) and then manually copy and paste 600 references into a database search engine.
Even if I am very fast, that sounds like hours of monotonous, uninteresting labor.
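
The estimate above can be written down as a few lines of Python. This is only a back-of-the-envelope sketch: the variable names are my own, and 200 is used as the lower bound of the "200+" papers.

```python
# Assumptions taken from the estimate above; 200 is the lower bound of "200+".
papers_in_library = 200
references_per_paper = 15
already_in_library = 7   # references per paper that I already own
relevant_new = 3         # of the remaining references, those worth searching for

# Every reference has to be screened once.
references_to_screen = papers_in_library * references_per_paper

# References per paper that are not yet in the library.
remaining_per_paper = references_per_paper - already_in_library

# Only the relevant new references need a manual database search.
searches_needed = papers_in_library * relevant_new

print(references_to_screen, remaining_per_paper, searches_needed)
```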


Usually in such situations I rely on the internet to point me to a tool, but after a thorough search I did not find anything that could help with my problem.

So at the moment I am reactivating my Python skills and learning about regular expressions in order to turn something like this excerpt from a PDF:

Spira, L.F. and Page, M. (2002), “Risk management: the reinvention of internal control and the changing role of internal audit”, Accounting, Auditing & Accountability Journal, Vol. 16 No. 4, pp. 640-61.
Steele, P. and Court, B. (1996), Profitable Purchasing Strategies: A Manager’s Guide for Improving Organizational Competitiveness through the Skills of Purchasing, McGraw-Hill, London.
Tchankova, L. (2002), “Risk identification – basic stage in risk management”, Environmental Management and Health, Vol. 13 No. 3, pp. 290-7.
Treleven, M. and Schweikhart, S.B. (1988), “A risk/benefit analysis of sourcing strategies: single vs multiple sourcing”, Journal of Operations Management, Vol. 7 No. 4, pp. 93-114.

into a list of author-year-title combinations that can be automatically compared to my existing library (using Papers) and turned into a search query for Google Scholar or a similar database.
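
To give an idea of the pattern matching, here is a minimal sketch that handles the citation style of the excerpt above. It is not the final script. The names `REF_PATTERN`, `parse_reference` and `search_query` are made up for this illustration. The sketch assumes that article titles sit in curly quotes and that book titles contain no comma, so other citation styles will need their own patterns.

```python
import re

# Matches "Authors (Year), “Article title”, ..." or "Authors (Year), Book title, ...".
# Assumption: article titles are wrapped in curly quotes, and an unquoted
# (book) title ends at the first comma.
REF_PATTERN = re.compile(
    r"^(?P<authors>.+?)\s+\((?P<year>\d{4})[a-z]?\),\s+"
    r"(?:“(?P<article>.+?)”|(?P<book>[^,]+))"
)


def parse_reference(line):
    """Return a dict with authors, year and title, or None if nothing matches."""
    match = REF_PATTERN.match(line.strip())
    if match is None:
        return None
    title = match.group("article") or match.group("book")
    return {
        "authors": match.group("authors"),
        "year": int(match.group("year")),
        "title": title.strip(),
    }


def search_query(reference):
    """Build a plain-text query (quoted title plus year) for a search engine."""
    return '"{}" {}'.format(reference["title"], reference["year"])


sample = (
    "Tchankova, L. (2002), “Risk identification – basic stage in risk "
    "management”, Environmental Management and Health, Vol. 13 No. 3, pp. 290-7."
)
parsed = parse_reference(sample)
print(parsed)
print(search_query(parsed))
```

Lines that do not fit the pattern return None, so they can be collected and checked by hand instead of being silently dropped.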


At the moment I am making good progress with the core pattern matching. The open questions are how to read the text from the PDF files and how to compare the results to my existing library (using title and author, or only the title).
I will post an update.
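
For the comparison question, one option would be a fuzzy match on the title alone. The sketch below uses `difflib` from the Python standard library. The helper names and the threshold of 0.9 are my own assumptions, and the library is assumed to be available as a plain list of title strings (how to export that list from Papers is still open).

```python
import re
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase the title and collapse punctuation and whitespace to single spaces.

    Note: this keeps only a-z and 0-9, so accented letters are dropped too.
    """
    cleaned = re.sub(r"[^a-z0-9]+", " ", title.lower())
    return " ".join(cleaned.split())


def is_in_library(title, library_titles, threshold=0.9):
    """Return True if any library title is at least `threshold` similar."""
    candidate = normalize_title(title)
    for known in library_titles:
        ratio = SequenceMatcher(None, candidate, normalize_title(known)).ratio()
        if ratio >= threshold:
            return True
    return False


# Hypothetical library export: same paper, but different dash and capitalization.
library_titles = ["Risk Identification - Basic Stage in Risk Management"]

found = is_in_library(
    "Risk identification – basic stage in risk management", library_titles
)
missing = is_in_library(
    "A risk/benefit analysis of sourcing strategies: single vs multiple sourcing",
    library_titles,
)
print(found, missing)
```

Matching on the title only avoids the problem that author names are abbreviated differently from one journal to the next. Adding the year as a second check would be a cheap way to reduce false positives.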

Future work could then explore building a reference tree that connects the literature within a library to show “connections of thought” between the papers.
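
As a rough idea of what such a reference tree could look like in code, here is a small sketch. It assumes the citations have already been extracted as (citing paper, cited paper) pairs. The function name `build_reference_tree` and the short paper keys are purely illustrative.

```python
from collections import defaultdict


def build_reference_tree(citations, library):
    """Map each library paper to the set of library papers it cites.

    `citations` is an iterable of (citing, cited) pairs. Pairs that point
    outside the library are ignored.
    """
    tree = defaultdict(set)
    for citing, cited in citations:
        if citing in library and cited in library:
            tree[citing].add(cited)
    return dict(tree)


# Hypothetical keys standing in for papers in the library.
library = {"paper_a", "paper_b", "paper_c"}
citations = [
    ("paper_a", "paper_b"),
    ("paper_a", "paper_c"),
    ("paper_b", "paper_c"),
    ("paper_c", "paper_x"),  # paper_x is not in the library
]

tree = build_reference_tree(citations, library)
print(tree)
```

Strictly speaking the result is a directed graph rather than a tree, because one paper can be cited by several others.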


Gidday Daniel,
I have been looking for something like this as I too go through the 'Snowball phase'. Did the software work? Can I try it?


I ended up using another, more manual approach. However, I put my last working copy on GitHub.

If you have any questions or suggestions, just contact me.

