Pandoc/Quarto filter for self-citing BibTeX bibliographies.
BibTeX bibliographies can self-cite: one bibliography entry may cite another entry. This is done in two ways: with the crossref field, which cites the collection from which an entry is extracted (see BibTeX’s documentation), or by entering citation commands, e.g. in a note field:
@incollection{Doe:2000,
  author = {Jane Doe},
  title = {What are Fish Even Doing Down There},
  crossref = {Snow:2000},
}

@book{Snow:2010,
  editor = {Jane Snow},
  title = {Fishy Works},
  note = {Reprint of~\citet{Snow:2000}},
}

@collection{Snow:2000,
  editor = {Jane Snow},
  title = {Fishy Works},
}
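To make the two kinds of self-citation concrete, here is a minimal pure-Lua sketch, independent of Pandoc and not part of the filter, that lists the keys an entry refers to. The entry table layout and the name `cited_keys` are assumptions made for illustration. It only recognises `crossref` and simple `\cite...{}` commands without optional arguments.

```lua
-- Hypothetical helper: collect the keys that a bibliography entry cites,
-- either through `crossref` or through \cite-like commands in a field.
local function cited_keys(entry)
  local keys = {}
  if entry.crossref then
    keys[#keys + 1] = entry.crossref
  end
  for _, field in ipairs({ 'note', 'title' }) do
    local value = entry[field] or ''
    -- match \cite{...}, \citet{...}, \citep{...} and similar commands
    for group in value:gmatch('\\cite%a*%s*{([^}]*)}') do
      -- one command may hold several comma-separated keys
      for key in group:gmatch('[^,%s]+') do
        keys[#keys + 1] = key
      end
    end
  end
  return keys
end

-- Entries modelled on the example above (assumed table layout).
local doe = { crossref = 'Snow:2000' }
local snow = { note = 'Reprint of~\\citet{Snow:2000}' }
print(table.concat(cited_keys(doe), ', '))
print(table.concat(cited_keys(snow), ', '))
```

Both calls report Snow:2000, once through the crossref field and once through the citation command in the note.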
LaTeX’s bibliography engines (natbib, biblatex) handle self-citations of both kinds.
Pandoc and Quarto can use those engines, but for PDF output only. Instead, they come with their own engine, Citeproc, which conveniently uses citation style files and covers all output formats.
However, Citeproc only handles crossref self-citations. It fails to process citation commands in bibliographies.
This filter enables Citeproc to process cite commands in the bibliography. It ensures that the self-cited entries are displayed in the document’s bibliography.
Are self-citing bibliographies a good idea? They ensure consistency by avoiding multiple copies of the same data, but create dependencies between entries. The Citation Style Language doesn’t seem to permit them. Be that as it may, many of us have legacy self-citing bibliographies, so we may as well handle them.
The filter modifies the internal document representation; it can be used with many publishing systems that are based on Pandoc.
When using several filters on a document, this filter must be placed:
* after any filter that adds citations to the document,
* before Citeproc or Quarto.
The filter must be used in combination with Citeproc.
Pass the filter to pandoc via the --lua-filter (or -L) command line option, followed by Citeproc (--citeproc or -C):
pandoc --lua-filter recursive-citeproc.lua -C ...
Or via a defaults file:
filters:
- recursive-citeproc.lua
- citeproc
Copy the file into your Pandoc user data directory to make it available to Pandoc anywhere. Run pandoc -v to see where your Pandoc user data directory is.
Users of Quarto can install this filter as an extension with
quarto install extension tarleb/recursive-citeproc.git
and use it by adding recursive-citeproc to the filters entry in their YAML header, before quarto.
---
filters:
- recursive-citeproc
- quarto
---
You must explicitly specify that the filter comes before Quarto’s own filters: by default, Quarto runs its own (including Citeproc) first.
In R Markdown, use pandoc_args to invoke the filter, followed by Citeproc. See the R Markdown Cookbook for details.
---
output:
  word_document:
    pandoc_args: ['--lua-filter=recursive-citeproc.lua', '--citeproc']
---
You can specify the filter’s maximum recursion depth in the document’s metadata. Use 0 for infinite (the default is 100):
recursive-citeproc:
  max-depth: 5
A max-depth of 2, for instance, means that the filter inserts references that are only cited by references cited in the document’s body, but not references that are only cited by references that are themselves only cited by references cited in the document’s body.
If the max depth is reached before all self-recursive citations are processed, PDF output may generate an error.
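The depth rule can be sketched in plain Lua as follows. This is an illustrative reconstruction of the check described above, not the filter's code verbatim: `make_allow_depth` is a hypothetical name, and runs are assumed to be counted from 1.

```lua
-- Documented default when no (valid) max-depth is given.
local DEFAULT_MAX_DEPTH = 100

-- Hypothetical helper: build a predicate telling whether the
-- n-th Citeproc run (counted from 1) is still allowed.
local function make_allow_depth(user_value)
  local n = tonumber(user_value)
  -- negative or non-numeric values fall back to the default
  local max_depth = (n and n >= 0) and n or DEFAULT_MAX_DEPTH
  return function(depth)
    -- 0 means "no limit"
    return max_depth == 0 or depth <= max_depth
  end
end

local allow = make_allow_depth('2')
print(allow(1), allow(2), allow(3))
```

With a max-depth of 2, the first and second runs are allowed and the third is not; with 0, every run is allowed.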
To try the filter with Pandoc or Quarto, clone the repository.
Generate Pandoc outputs with make generate. Change the output format with make generate FORMAT=docx. Use FORMAT=latex for LaTeX outputs. You can list multiple formats, e.g. make generate FORMAT="docx pdf". The outputs will be in the test folder, named expected.<format>.
Requires Pandoc.
As above, replacing generate with qgenerate.
Requires Quarto.
With Quarto installed, you can also use the Pandoc engine embedded in Quarto: add the argument PANDOC="quarto pandoc" to the Pandoc commands above, e.g. make generate FORMAT=docx PANDOC="quarto pandoc".
The filter adds a Citeproc-generated bibliography to the document, which may contain citation commands, and sets the metadata key suppress-bibliography to true. When Citeproc itself is run on the result, the bibliography’s citation commands are converted to text.
The filter’s main task is to ensure that its Citeproc-generated bibliography contains all the document’s citations, including those that may only appear in the bibliography itself. To do that, it checks whether the result of generating a bibliography with Citeproc adds new citations. If it does, the filter adds those new citations to the metadata nocite field and tries to generate the bibliography again, and so on until generating the bibliography doesn’t produce any citation that is not already present in the bibliography.
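The loop described above is a fixed-point iteration. The sketch below models it in plain Lua on a toy citation graph, with no depth limit. The `cites` table and the function `bibliography_citations` are hypothetical stand-ins for a real Citeproc run, and the set `cited` plays the role of the body citations plus the nocite field.

```lua
-- Toy data (assumed): which keys each formatted entry cites.
local cites = {
  Doe2020   = { 'Smith2001' },
  Smith2001 = { 'Smith2003' },
  Smith2003 = {},
}

-- Stand-in for one Citeproc run: every key cited by the
-- bibliography items generated for the keys in `cited`.
local function bibliography_citations(cited)
  local found = {}
  for key in pairs(cited) do
    for _, target in ipairs(cites[key] or {}) do
      found[target] = true
    end
  end
  return found
end

-- Repeat runs until one yields no key that is not already cited.
local function expand(body_keys)
  local cited = {}
  for _, key in ipairs(body_keys) do cited[key] = true end
  local runs = 0
  repeat
    runs = runs + 1
    local added = false
    for key in pairs(bibliography_citations(cited)) do
      if not cited[key] then
        cited[key] = true -- corresponds to adding the key to nocite
        added = true
      end
    end
  until not added
  return cited, runs
end

local result, runs = expand({ 'Doe2020' })
print(runs, result.Smith2001, result.Smith2003)
```

Starting from a single body citation, the sketch needs three runs: two that each discover a new entry and a final one that confirms nothing new appears. The set of keys is finite, so the loop always terminates, even when entries cite each other in a cycle.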
Based on an idea given by John MacFarlane on the pandoc-discuss mailing list.
This Pandoc Lua filter is published under the MIT license; see file LICENSE for details.
---
title: 'Self-citing bibliography example'
author: Julien Dutant
recursive-citeproc: 100 # optional, specify max recursive depth
nocite:
  - '@Smith2001'
  - |
    @Smith2003, @Smith2005,
---
[@Doe2020].
# References
--[[-- # Recursive-citeproc - Self-citing BibTeX
bibliographies in Pandoc and Quarto
@author Julien Dutant <julien.dutant@kcl.ac.uk>
@copyright 2021-2023 Julien Dutant
@license MIT - see LICENSE file for details.
]]
-- 2.17 for relying on `elem:walk()`, `pandoc.Inlines`, pandoc.utils.type
PANDOC_VERSION:must_be_at_least '2.17'
--- # Global Setting
DEFAULT_MAX_DEPTH = 100
--- # Helper functions
local stringify = pandoc.utils.stringify
local run_json_filter = pandoc.utils.run_json_filter
type = pandoc.utils.type
local blocks_to_inlines = pandoc.utils.blocks_to_inlines
-- pandoc.utils.references is kept available, but listRefIds collects
-- ids manually instead: references() was twice slower on benchmark
local references = pandoc.utils.references
-- metatype: type of a Meta element
metatype = type
-- run citeproc
local function run_citeproc (doc)
if PANDOC_VERSION >= '2.19.1' then
return pandoc.utils.citeproc(doc)
elseif PANDOC_VERSION >= '2.11' then
local args = {'--from=json', '--to=json', '--citeproc'}
return run_json_filter(doc, 'pandoc', args)
else
return run_json_filter(doc, 'pandoc-citeproc', {FORMAT, '-q'})
end
end
--- listConcat: concatenate a List of lists
---@param list pandoc.List[] list of pandoc.Lists
---@return pandoc.List result concatenated List
local function listConcat(list)
local result = pandoc.List:new()
for _,sublist in ipairs(list) do
result:extend(sublist)
end
return result
end
---Flatten a meta value into Inlines
---in pandoc < 2.17 we only return a pandoc.List of Inline elements
---@param elem pandoc.Inlines|string|number|pandoc.Blocks|pandoc.List
---@return pandoc.Inlines result possibly empty Inlines
local function flattenToInlines(elem)
local elemType = type(elem)
return elemType == 'Inlines' and elem
or elemType == 'string'
and pandoc.Inlines(pandoc.Str(elem))
or elemType == 'number'
and pandoc.Inlines(pandoc.Str(tostring(elem)))
or elemType == 'Blocks' and blocks_to_inlines(elem)
or elemType == 'List' and listConcat(
elem:map(flattenToInlines)
)
or pandoc.Inlines({})
end
--- # Options object
---@class Options
---@field new fun(meta: pandoc.Meta):Options create Options object
---@field allowDepth fun(depth: number):boolean depth is allowed
local Options = {}
---create an Options object
---@param meta pandoc.Meta
---@return Options
function Options:new(meta)
local o = {} -- local, so the new object does not leak into globals
setmetatable(o,self)
self.__index = self
o:read(meta)
return o
end
--- normalize: normalize user options
--- simple string is assumed to be max-depth
--- maxdepth alias of max-depth
---@param meta metaObject
---@return pandoc.MetaMap
function Options:normalize(meta)
--- ensure it's a map; single value assumed to be max-depth
meta = (metatype(meta) == 'table' and meta)
or (metatype(meta) == 'string' and
pandoc.MetaMap({ ['max-depth'] = meta}))
or (metatype(meta) == 'Inlines' and
pandoc.MetaMap({ ['max-depth'] = stringify(meta)}))
or pandoc.MetaMap({}) -- any other value: fall back to defaults
--- provide alias(es)
local aliases = { ['max-depth'] = 'maxdepth' }
for key,alias in pairs(aliases) do
meta[key] = meta[key] == nil and meta[alias] ~= nil and meta[alias]
or meta[key]
end
---
return meta
end
---read: read options from doc's meta
---treat maxdepth as alias for max-depth
---@param meta pandoc.Meta
function Options:read(meta)
local opts = meta['recursive-citeproc']
and Options:normalize(meta['recursive-citeproc'])
or nil
-- allowDepth(depth) must return true when depth = 1
local userMaXDepth = opts and tonumber(opts['max-depth'])
local maxDepth = userMaXDepth and userMaXDepth >= 0 and userMaXDepth
or DEFAULT_MAX_DEPTH
self.allowDepth = function (depth)
return maxDepth == 0 or maxDepth >= depth
end
end
--- # Avoid crash with empty bibliography key
local function fixEmptyBiblio(meta)
if meta.bibliography and stringify(meta.bibliography) == '' then
meta.bibliography = nil
end
-- always return meta, so the caller never receives nil
return meta
end
--- # Functions to handle lists of strings
--- could be an object that extends pandoc.List
---@alias CitationIds pandoc.List pandoc.List of strings
---create
---@param list CitationIds|nil
---@return CitationIds cids
local function cids_create(list)
local cids = pandoc.List:new()
if list and type(list) == 'table' or type(list) == 'List' then
for _,item in ipairs(list) do
if type(item) == 'string' then cids:insert(item) end
end
end
return cids
end
---add Id if not already included
---@param cids CitationIds
---@param id string citation Id
---@return CitationIds
local function cids_addId(cids, id)
if not cids:find(id) then cids:insert(id) end
return cids
end
---add citation Ids from Cite elements in blocks
---@param cids CitationIds
---@param blocks pandoc.Blocks|pandoc.Block walkable element
---@return CitationIds
local function cids_addFromBlocks(cids, blocks)
blocks:walk({
Cite = function(cite)
for _,citation in ipairs(cite.citations) do
cids_addId(cids, citation.id)
end
end
})
return cids
end
---add citation Ids from Cite elements in doc's meta
--- (fields `nocite`, `abstract`, `thanks`)
---@param cids CitationIds
---@param doc any
---@return CitationIds
local function cids_addFromMeta(cids, doc)
for _,key in ipairs {'nocite', 'abstract', 'thanks' } do
if doc.meta[key] then
cids_addFromBlocks(cids,
pandoc.Plain(flattenToInlines(doc.meta[key]))
)
end
end
return cids
end
---add citation Ids from pandoc.utils.references(doc)
local function cids_addFromReferences(cids, doc)
for _,item in ipairs(references(doc)) do
cids_addId(cids, item.id)
end
end
--- # Filter
---listRefIds: returns doc's references as a list of ids
--- we do not use pandoc.utils.references: twice slower
--- than collecting ref ID strings manually on benchmark.
---@param doc pandoc.Pandoc
---@return string[] refsList list of ids
local function listRefIds(doc)
local cids = cids_create()
-- if references then
-- cids_addFromReferences(cids, doc)
-- else
cids_addFromBlocks(cids, doc.blocks)
cids_addFromMeta(cids, doc)
-- end
return cids
end
---listNewRefIds: list references in newDoc not present in oldDoc
---@param newDoc pandoc.Pandoc
---@param oldDoc pandoc.Pandoc
---@return CitationIds cids list of ids
local function listNewRefIds(newDoc, oldDoc)
local oldRefs, newRefs = listRefIds(oldDoc), listRefIds(newDoc)
local cids = cids_create()
for _,ref in ipairs(newRefs) do
if not oldRefs:find(ref) then cids_addId(cids, ref) end
end
return cids
end
---addToNocite: add ref ids list to doc's nocite metadata
---@param doc pandoc.Pandoc
---@param newRefs string[]
---@return pandoc.Pandoc
local function addToNocite(doc, newRefs)
local inlines = flattenToInlines(doc.meta.nocite)
for _,ref in ipairs(newRefs) do
inlines:insert(pandoc.Space())
inlines:insert(pandoc.Cite(
pandoc.Str('@'..ref),
{
pandoc.Citation(ref, 'AuthorInText')
}
))
end
doc.meta.nocite = pandoc.MetaInlines(inlines)
return doc
end
---recursiveCiteproc: fill in `nocite` field
---until producing a bibliography adds no new citations
---returns document with bibliography, expanded no-cite
---field, and suppress-bibliography=true
---citeproc will later convert the citations in the biblio
local function recursiveCiteproc(doc)
local options = Options:new(doc.meta)
local depth = 1
local newDoc
-- avoid "File not found" error with empty 'bibliography'
doc.meta = fixEmptyBiblio(doc.meta)
while options.allowDepth(depth) do
depth = depth + 1
-- DEBUG display runs
-- print('RUN', tostring(depth-1))
newDoc = run_citeproc(doc)
local newRefs = listNewRefIds(newDoc, doc)
if #newRefs > 0 then
doc = addToNocite(doc, newRefs)
else
break
end
end
newDoc.meta['suppress-bibliography'] = true
return newDoc
end
--- # return filter
return {
{
Pandoc = recursiveCiteproc
}
}