Recursive-citeproc

Pandoc/Quarto filter for self-citing BibTeX bibliographies.

Overview

BibTeX bibliographies can self-cite: one bibliography entry may cite another. This is done in two ways: with the crossref field, which cites the collection from which an entry is extracted (see BibTeX’s documentation), or with citation commands entered in a field, e.g. in a note field:

@incollection{Doe:2000,
   author = {Jane Doe},
   title = {What are Fish Even Doing Down There},
   crossref = {Snow:2000},
}
@book{Snow:2010,
   editor = {Jane Snow},
   title = {Fishy Works},
   note = {Reprint of~\citet{Snow:2000}},
}
@collection{Snow:2000,
   editor = {Jane Snow},
   title = {Fishy Works},
}

LaTeX’s bibliography engines (natbib, biblatex) handle self-citations of both kinds.

Pandoc and Quarto can use those engines, but only for PDF output. They come with their own engine, Citeproc, which conveniently uses citation style files and covers all output formats.

However, Citeproc only handles crossref self-citations. It fails to process citation commands in bibliographies.

This filter enables Citeproc to process cite commands in the bibliography. It ensures that the self-cited entries are displayed in the document’s bibliography.

Are self-citing bibliographies a good idea? They ensure consistency by avoiding multiple copies of the same data, but create dependencies between entries. The Citation Style Language doesn’t seem to permit them. Be that as it may, many of us have legacy self-citing bibliographies, so we may as well handle them.

Usage

The filter modifies the internal document representation; it can be used with many publishing systems that are based on Pandoc.

When using several filters on a document, this filter must be placed:

* after any filter that adds citations to the document,
* before Citeproc or Quarto.

The filter must be used in combination with Citeproc.

Plain pandoc

Pass the filter to pandoc via the --lua-filter (or -L) command line option, followed by Citeproc (--citeproc or -C):

pandoc --lua-filter recursive-citeproc.lua -C ...

Or via a defaults file:

filters:
- recursive-citeproc.lua
- citeproc

Copy the file into your Pandoc user data directory to make it available to Pandoc anywhere. Run pandoc -v to see where your Pandoc user data directory is.

Quarto

Users of Quarto can install this filter as an extension with

quarto install extension tarleb/recursive-citeproc.git

and use it by adding recursive-citeproc to the filters entry in their YAML header, before quarto.

---
filters:
- recursive-citeproc
- quarto
---

You must explicitly specify that the filter comes before Quarto’s own filters: by default, Quarto runs its own (including Citeproc) first.

R Markdown

Use pandoc_args to invoke the filter, followed by Citeproc. See the R Markdown Cookbook for details.

---
output:
  word_document:
    pandoc_args: ['--lua-filter=recursive-citeproc.lua', '--citeproc']
---

Options

You can specify the filter’s maximum recursion depth in the document’s metadata. The default is 100; use 0 for no limit:

recursive-citeproc:
  max-depth: 5

A max-depth of 2, for instance, means that the filter adds references cited only by entries that the document’s body cites (one level down), but not references cited only by those added entries (two levels down).

If the maximum depth is reached before all recursive citations are processed, PDF output may fail with an error.
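
The depth rule can be sketched in plain Lua, independently of Pandoc. This is only an illustration: the helper name make_depth_check is hypothetical, and it assumes the behaviour suggested by the filter's source, namely that missing, non-numeric or negative values fall back to the default of 100, and that 0 removes the limit.

```lua
-- Hypothetical helper (not part of the filter's interface): build a
-- predicate telling whether a given Citeproc run (1-based) is allowed.
local DEFAULT_MAX_DEPTH = 100

local function make_depth_check(user_value)
  local n = tonumber(user_value)
  -- missing, non-numeric or negative values fall back to the default
  local max_depth = (n and n >= 0) and n or DEFAULT_MAX_DEPTH
  return function(depth)
    -- 0 means "no limit"
    return max_depth == 0 or depth <= max_depth
  end
end

local allow = make_depth_check(2)
print(allow(1), allow(2), allow(3))  --> true  true  false
```

Under these assumptions the first run is always allowed, whatever value the user provides.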

Testing

To try the filter with Pandoc or Quarto, clone the repository.

Pandoc

Generate Pandoc outputs with make generate. Change the output format with make generate FORMAT=docx. Use FORMAT=latex for LaTeX output. You can list multiple formats: make generate FORMAT="docx pdf". The outputs will be in the test folder, named expected.<format>.

Requires Pandoc.

Quarto

As above, replacing generate with qgenerate.

Requires Quarto.

Pandoc within Quarto

With Quarto installed, you can also use the Pandoc engine embedded in Quarto: add the argument PANDOC="quarto pandoc" to the Pandoc commands above, e.g. make generate FORMAT=docx PANDOC="quarto pandoc".

How the filter works

The filter adds a Citeproc-generated bibliography to the document, which may contain citation commands, and sets the metadata key suppress-bibliography to true. When Citeproc itself is run on the result, the bibliography’s citation commands are converted to text.

The filter’s main task is to ensure that its Citeproc-generated bibliography contains all the document’s citations, including those that may only appear in the bibliography itself. To do that, it checks whether generating a bibliography with Citeproc introduces new citations. If it does, the filter adds those new citations to the metadata nocite field and generates the bibliography again. It repeats this until generating the bibliography no longer produces any citation that is not already present in the bibliography.
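
The loop can be sketched in plain Lua on a toy citation graph. This is an illustration under stated assumptions, not the filter's code: the cites table stands in for what Citeproc would reveal when rendering each entry (only note-field citations are recorded, since Citeproc resolves crossref on its own), and expand_citations is a hypothetical name. The sketch takes a finite run limit and leaves out the "0 means unlimited" case.

```lua
-- Toy stand-in for Citeproc: which entries each entry's text cites.
-- Keys mirror the BibTeX example in the overview.
local cites = {
  ['Doe:2000']  = {},
  ['Snow:2010'] = { 'Snow:2000' },
  ['Snow:2000'] = {},
}

-- Hypothetical sketch: return the ids listed in the final bibliography
-- when starting from the ids cited in the document body.
local function expand_citations(body_ids, max_runs)
  local known, cited = {}, {}   -- set of ids, and the same ids in order
  local function add(id)
    if not known[id] then
      known[id] = true
      cited[#cited + 1] = id
    end
  end
  for _, id in ipairs(body_ids) do add(id) end

  local rendered = {}
  for _ = 1, max_runs do
    -- one "Citeproc run": the bibliography lists every id cited so far
    rendered = {}
    for i, id in ipairs(cited) do rendered[i] = id end
    -- look for citations inside the rendered entries that are new
    local found = {}
    for _, id in ipairs(rendered) do
      for _, target in ipairs(cites[id] or {}) do
        if not known[target] then found[#found + 1] = target end
      end
    end
    if #found == 0 then break end   -- nothing new: fixed point reached
    -- the filter would append these ids to `nocite` before the next run
    for _, id in ipairs(found) do add(id) end
  end
  return rendered
end

print(table.concat(expand_citations({ 'Snow:2010' }, 100), ', '))
--> Snow:2010, Snow:2000
```

With a limit of 1 the same call returns only Snow:2010: the citation found during the last allowed run is recorded but never rendered, which matches the max-depth behaviour described under Options.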

Credits

Based on an idea given by John MacFarlane on the pandoc-discuss mailing list.

License

This Pandoc Lua filter is published under the MIT license; see the LICENSE file for details.

Example

input.md

---
title: 'Self-citing bibliography example'
author: Julien Dutant
recursive-citeproc: 100 # optional, specify max recursive depth
nocite: 
- '@Smith2001'
- |
 @Smith2003, @Smith2005, 
---

[@Doe2020].

# References

output.html

Code

recursive-citeproc.lua

--[[-- # Recursive-citeproc - Self-citing BibTeX 
bibliographies in Pandoc and Quarto

@author Julien Dutant <julien.dutant@kcl.ac.uk>
@copyright 2021-2023 Julien Dutant
@license MIT - see LICENSE file for details.
]]

-- 2.17 for relying on `elem:walk()`, `pandoc.Inlines`, pandoc.utils.type
PANDOC_VERSION:must_be_at_least '2.17'

--- # Global Setting
local DEFAULT_MAX_DEPTH = 100

--- # Helper functions

local stringify = pandoc.utils.stringify
local run_json_filter = pandoc.utils.run_json_filter
-- shadow Lua's `type` with pandoc's, which also names pandoc types
-- ('Inlines', 'Blocks', 'List', ...)
local type = pandoc.utils.type
local blocks_to_inlines = pandoc.utils.blocks_to_inlines
-- kept for reference only: collecting ids with pandoc.utils.references
-- was twice slower on benchmark (see listRefIds)
local references = pandoc.utils.references

-- metatype: type of a Meta element
local metatype = type

-- run citeproc
local function run_citeproc (doc)
  if PANDOC_VERSION >= '2.19.1' then
    return pandoc.utils.citeproc(doc)
  elseif PANDOC_VERSION >= '2.11' then
    local args = {'--from=json', '--to=json', '--citeproc'}
    return run_json_filter(doc, 'pandoc', args)
  else
    return run_json_filter(doc, 'pandoc-citeproc', {FORMAT, '-q'})
  end
end

--- listConcat: concatenate a List of lists
---@param list pandoc.List[] list of pandoc.Lists
---@return pandoc.List result concatenated List
local function listConcat(list)
  local result = pandoc.List:new()
  for _,sublist in ipairs(list) do
    result:extend(sublist)
  end
  return result
end

---Flatten a meta value into Inlines
---in pandoc < 2.17 we only return a pandoc.List of Inline elements
---@param elem pandoc.Inlines|string|number|pandoc.Blocks|pandoc.List
---@return pandoc.Inlines result possibly empty Inlines
local function flattenToInlines(elem)
  local elemType = type(elem)
  return elemType == 'Inlines' and elem
    or elemType == 'string' 
      and pandoc.Inlines(pandoc.Str(elem))
    or elemType == 'number' 
      and pandoc.Inlines(pandoc.Str(tostring(elem)))
    or elemType == 'Blocks' and blocks_to_inlines(elem)
    or elemType == 'List' and listConcat(
      elem:map(flattenToInlines)
    )
    or pandoc.Inlines({})
end

--- # Options object

---@class Options
---@field new fun(meta: pandoc.Meta):Options create Options object
---@field allowDepth fun(depth: number):boolean depth is allowed
local Options = {}

---create an Options object
---@param meta pandoc.Meta
---@return object Options
function Options:new(meta)
  local o = {}
  setmetatable(o,self)
  self.__index = self

  o:read(meta)
  
  return o
end

--- normalize: normalize user options
--- a single value (string or Inlines) is assumed to be max-depth
--- maxdepth is an alias of max-depth
---@param meta metaObject
---@return pandoc.MetaMap
function Options:normalize(meta)
  -- ensure it's a map; single value assumed to be max-depth
  meta = (metatype(meta) == 'table' and meta)
    or (metatype(meta) == 'string' and
    pandoc.MetaMap({ ['max-depth'] = meta}))
    or (metatype(meta) == 'Inlines' and 
    pandoc.MetaMap({ ['max-depth'] = stringify(meta)}))
    or pandoc.MetaMap({}) -- any other value: no options set

  -- provide alias(es): key = canonical name, value = alias
  local aliases = { ['max-depth'] = 'maxdepth' }

  for key,alias in pairs(aliases) do
    if meta[key] == nil and meta[alias] ~= nil then
      meta[key] = meta[alias]
    end
  end

  return meta
end

---read: read options from doc's meta
---treat maxdepth as alias for max-depth
---@param meta pandoc.Meta
function Options:read(meta)
  local opts = meta['recursive-citeproc']
    and Options:normalize(meta['recursive-citeproc'])
    or nil

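  -- allowDepth(depth) must return true when depth = 1
  -- values read from a map may be Inlines: stringify before converting
  local rawDepth = opts and opts['max-depth']
  local userMaxDepth = rawDepth and tonumber(stringify(rawDepth))
  local maxDepth = userMaxDepth and userMaxDepth >= 0 and userMaxDepth
    or DEFAULT_MAX_DEPTH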
  self.allowDepth = function (depth)
    return maxDepth == 0 or maxDepth >= depth
  end

end

--- # Avoid crash with empty bibliography key
---@param meta pandoc.Meta
---@return pandoc.Meta meta always returned, modified or not
local function fixEmptyBiblio(meta)
  if meta.bibliography and stringify(meta.bibliography) == '' then
    meta.bibliography = nil
  end
  return meta
end

--- # Functions to handle lists of strings
--- could be an object that extends pandoc.List

---@alias CitationIds pandoc.List pandoc.List of strings

---create
---@param list CitationIds|nil
---@return CitationIds cids
local function cids_create(list)
  local cids = pandoc.List:new()
  if type(list) == 'table' or type(list) == 'List' then
    for _,item in ipairs(list) do
      if type(item) == 'string' then cids:insert(item) end
    end
  end
  return cids
end

---add Id if not already included
---@param cids CitationIds
---@param id string citation Id
---@return CitationIds
local function cids_addId(cids, id)
  if not cids:find(id) then cids:insert(id) end
  return cids
end

---add citation Ids from Cite elements in blocks
---@param cids CitationIds
---@param blocks pandoc.Blocks|pandoc.Block walkable element
---@return CitationIds
local function cids_addFromBlocks(cids, blocks)
  blocks:walk({
    Cite = function(cite)
      for _,citation in ipairs(cite.citations) do
        cids_addId(cids, citation.id)
      end
    end
  })
  return cids
end

---add citation Ids from Cite elements in doc's meta
--- (fields `nocite`, `abstract`, `thanks`)
---@param cids CitationIds
---@param doc any
---@return CitationIds
local function cids_addFromMeta(cids, doc)
  for _,key in ipairs {'nocite', 'abstract', 'thanks' } do
    if doc.meta[key] then
      cids_addFromBlocks(cids, 
        pandoc.Plain(flattenToInlines(doc.meta[key]))
      )
    end
  end
  return cids
end

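---add citation Ids from pandoc.utils.references(doc)
---(currently unused, see listRefIds)
---@param cids CitationIds
---@param doc pandoc.Pandoc
---@return CitationIds
local function cids_addFromReferences(cids, doc)
  for _,item in ipairs(references(doc)) do
    cids_addId(cids, item.id)
  end
  return cids
end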

--- # Filter

---listRefIds: returns doc's references as a list of ids
--- we do not use pandoc.utils.references: twice slower 
--- than collecting ref ID strings manually on benchmark.
---@param doc pandoc.Pandoc
---@return string[] refsList list of ids
local function listRefIds(doc)
  local cids = cids_create()
  -- if references then
  --   cids_addFromReferences(cids, doc)
  -- else
    cids_addFromBlocks(cids, doc.blocks)
    cids_addFromMeta(cids, doc)
  -- end
  return cids
end

---listNewRefIds: list references in newDoc not present in oldDoc
---@param newDoc pandoc.Pandoc
---@param oldDoc pandoc.Pandoc
---@return CitationIds cids list of ids
local function listNewRefIds(newDoc, oldDoc)
  local oldRefs, newRefs = listRefIds(oldDoc), listRefIds(newDoc)
  local cids = cids_create()
  for _,ref in ipairs(newRefs) do
    if not oldRefs:find(ref) then cids_addId(cids, ref) end
  end
  return cids
end

---addToNocite: add ref ids list to doc's nocite metadata
---@param doc pandoc.Pandoc
---@param newRefs string[]
---@return pandoc.Pandoc
local function addToNocite(doc, newRefs)
  local inlines = flattenToInlines(doc.meta.nocite)
  for _,ref in ipairs(newRefs) do
    inlines:insert(pandoc.Space())
    inlines:insert(pandoc.Cite(
      pandoc.Str('@'..ref),
      {
        pandoc.Citation(ref, 'AuthorInText')
      }
    ))
  end
  doc.meta.nocite = pandoc.MetaInlines(inlines)
  return doc
end

---recursiveCiteproc: fill in `nocite` field
---until producing a bibliography adds no new citations
---returns document with bibliography, expanded no-cite 
---field, and suppress-bibliography=true
---citeproc will later convert the citations in the biblio
local function recursiveCiteproc(doc)
  local options = Options:new(doc.meta)
  local depth = 1
  local newDoc

  -- avoid "File not found" error with empty 'bibliography'
  -- (`or doc.meta` guards against a nil return value)
  doc.meta = fixEmptyBiblio(doc.meta) or doc.meta

  while options.allowDepth(depth) do
    depth = depth + 1
    -- DEBUG display runs
    -- print('RUN', tostring(depth-1))
    newDoc = run_citeproc(doc)
    local newRefs = listNewRefIds(newDoc, doc)
    if #newRefs > 0 then
      doc = addToNocite(doc, newRefs)
    else
      break
    end
  end

  newDoc.meta['suppress-bibliography'] = true
  return newDoc

end

--- # return filter

return {
  {
    Pandoc = recursiveCiteproc
  }
}