Pandoc/Quarto filter for self-citing BibTeX bibliographies.
BibTeX’s documentation allows self-citing bibliographies, that is bibliography entries citing other bibliography entries in note, title or abstract fields. These aren’t handled properly by Pandoc’s and Quarto’s internal bibliography engine, Citeproc. This filter extends Citeproc’s abilities to cover self-citing bibliographies.
The filter acts as drop-in replacement for Citeproc. It still runs Citeproc in the background: bibliography style files are applied as expected.
BibTeX bibliographies can self-cite: one bibliography entry
may cite another entry. That is done in two ways: the
crossref
field to cite a collection from which an entry is
extracted (see the BibTeX’s
documentation), or by entering citation commands, e.g. in a note
field:
@incollection{Doe:2000,
author = 'Jane Doe',
title = 'What are Fish Even Doing Down There',
crossref = 'Snow:2000',
}@book{Snow:2010,
editor = 'Jane Snow',
title = 'Fishy Works',
note = 'Reprint of~\citet{Snow:2000}',
}@collection{Snow:2000,
editor = 'Jane Snow',
title = 'Fishy Works',
}
LaTeX’s bibliography engines (natbib
,
biblatex
) handle self-citations of both kinds.
Pandoc and Quarto can use those engines but only for PDF output. They come instead with their own engine, Citeproc, which conveniently uses citation styles files and covers all output formats.
However, Citeproc only handles crossref
self-citations.
It fails to process citation commands in bibliographies.
This filter enables Citeproc to process cite commands in the bibliography. It ensures that the self-cited entries are displayed in the document’s bibliography.
Are self-citing bibliographies a good idea? It ensures consistency by avoiding multiple copies of the same data, but creates dependencies between entries. The citation sytle language doesn’t seem to permit it. Be that as it may, many of us have legacy self-citing bibliographies, so we may as well handle them.
Pandoc 2.17+ or Quarto 1.4+
Note. Version 1 of this filter does not work with Pandoc
3.1.10+ and Quarto 1.4.530+. If switching from version 1 to current
version, make sure you do not call -C
or
--citeproc
in Pandoc or set citeproc: false
in
Quarto. See below for details.
This filter remplaces Citeproc.
The filter modifies the internal document representation; it can be used with many publishing systems that are based on Pandoc.
Pass the filter to pandoc via the --lua-filter
(or
-L
) command line option:
pandoc --lua-filter recursive-citeproc.lua ...
Or via a defaults file:
filters:
- recursive-citeproc.lua
Copy the file in your Pandoc user data directory to make it available
to Pandoc anywhere. Run pandoc -v
to see where your Pandoc
user data directory is.
Do not use Citeproc. Do not use the
--citeproc
or -C
option in combination with
this filter. If applied before the filter, it is redundant; if after, it
generates a duplicate bibliography.
Users of Quarto can install this filter as an extension with
quarto install extension dialoa/recursive-citeproc.git
and use it by adding recursive-citeproc
to the
filters
entry in their YAML header. You should also
deactivate Citeproc:
---
citeproc: false
filters:
- recursive-citeproc
---
If you use other filters and specify their order relative to Quarto, it is safer to run this filter after Quarto’s own:
---
citeproc: false
filters:
- ...
- quarto
- recursive-citeproc
---
Use pandoc_args
to invoke the filter. See the R
Markdown Cookbook for details.
---
output:
word_document:
pandoc_args: ['--lua-filter=recursive-citeproc.lua']
---
Do not use Citeproc. Before this filter, it is redundant; after, it duplicates the bibliography.
You can specify the filter’s maximum recursive depth in the document’s metadata. Use 0 for infinte (default 10):
recursive-citeproc:
max-depth: 5
A max-depth
of 2, for instance, means that the filter
inserts references that are only cited by references cited in the
document’s body, but not references that are only cited by references
that are themselves only cited by references cited in the document.
If the max depth is reached before all self-recursive citations are processed, PDF output may generate an error.
To try the filter with Pandoc or Quarto, clone the directory.
Generate Pandoc outputs with make generate
. Change the
output format with make generate FORMAT=docx
. Use
FORMAT=latex
for latex outputs. You can list multiple
formats, make generate FORMAT="docx pdf"
. The outputs will
be in the test
folder, named
expected.<format>
.
Requires Pandoc.
As above, replacing generate
with
qgenerate
.
Requires Quarto.
With Quarto installed, you can also
use the Pandoc engine embedded in Quarto: add the argument
PANDOC="quarto pandoc"
to the Pandoc commands above,
e.g. make generate FORMAT=docx PANDOC="quarto pandoc"
.
Version 2 is meant to replace Citeproc. It returns the document
appended with a refs
Div containing Citeproc bibliography
output.
The filter runs Citeproc on the document and checks whether the generated bibliography contains citations. If not, it simply returns the document with bibliography.
If the bibliography contains citations, the filter recursively runs
Citeproc on those citations, generated citations, and so on recursively
until all needed citations are identified. They are then added to the
document’s nocite
metadata field.
Citeproc is then run on the document, which typesets Cite elements in the document body and adds a bibliography with all needed entries to cover self-citations. However, Cite elements in the bibliography may still contain LaTeX cite commands that aren’t typeset yet. To ensure these are typeset, we run Citeproc on the bibliography itself, and update the document’s bibliography with the result.
The last step of the process generates a duplicate bilbiography which
we discard. There is no way around it since Pandoc 3.1.10: if we ran
Citeproc on the bibliography with suppress-bibliography
the
Cite commands couldn’t be converted to links. To ensure
link-references
adds links to citations even in the
bibliography, we must leave suppress-bibliography
to
false.
Version 1 of this filter was supposed to be run in combination with and before Citeproc.
It added a Citeproc-generated bibliography to the document, which
could contain Cite elements
whose content
could contain a LaTeX citation commands, and
exited with the document’s metadata key
suppress-bibliography
to true
. Citeproc
running after this would:
content
of Cite
elements in the the bibliography.content
of Cite elements, if
document’s metadata key link-references
was
true
,The filter’s main task was to ensure that the Citeproc-generated
bibliography contained all entries cited in bibliography entries, and
entries cited in bibliography entries cited in other bibliographies
entries, and so on. That was done by generating a the bibliography a
first time, checking whether it added citations, adding them to the
metadata nocite
key and trying again until no new citations
was added or the maximal depth was reached.
Since Pandoc 3.1.10, suppress-bibliography
deactivates
link-references
. The filter would still handle self-citing
bibliographies but link-references
would have no effect:
citations would not be linked to bibliographies. To let Citeproc link
references, we would need to remove suppress-bibliography
,
but we would then get a duplicate bibliography.
The solution in version 2 was to incorporate the last Citeproc step
within the filter; we run it witout suppress-bibliography
for the references to be linked if link-references
is set
and we take out the duplicate bibliography it outputs.
Based on an idea given by John MacFarlane on the pandoc-discuss mailing list.
This pandoc Lua filter is published under the MIT license, see file
LICENSE
for details.
---
title: 'Self-citing bibliography example'
author: Julien Dutant
recursive-citeproc: 100 # optional, specify max recursive depth
bibliography: references.bib
nocite:
- '@Smith2001'
- |
@Smith2003, @Smith2005
---
[@Doe2020].
# References
---------------------------------------------------------
----------------Auto generated code block----------------
---------------------------------------------------------
do
local searchers = package.searchers or package.loaders
local origin_seacher = searchers[2]
searchers[2] = function(path)
local files =
{
------------------------
-- Modules part begin --
------------------------
["CitationIdList"] = function()
--------------------
-- Module: 'CitationIdList'
--------------------
--[[ CitationIdList class
Hold and manipulate lists of citations Ids.
]]
--- # Helper functions
local type = pandoc.utils.type
---Concatenate a List of lists
---@param list pandoc.List[] list of pandoc.Lists
---@return pandoc.List result concatenated List
local function listConcat(list)
local result = pandoc.List:new()
for _,sublist in ipairs(list) do
result:extend(sublist)
end
return result
end
---Flatten a meta value to Inlines
---in pandoc < 2.17 we only return a pandoc.List of Inline elements
---@param elem pandoc.Inlines|string|number|pandoc.Blocks|pandoc.List
---@return pandoc.Inlines result possibly empty Inlines
local function flattenToInlines(elem)
local elemType = type(elem)
return elemType == 'Inlines' and elem
or elemType == 'string'
and pandoc.Inlines(pandoc.Str(elem))
or elemType == 'number'
and pandoc.Inlines(pandoc.Str(tonumber(elem)))
or elemType == 'Blocks' and pandoc.utils.blocks_to_inlines(elem)
or elemType == 'List' and listConcat(
elem:map(flattenToInlines)
)
or pandoc.Inlines{}
end
-- # CitationIdList object
---@alias CitationId string Citation Identifier
---@class CitationIdList
---@field data CitationId[] list of citation ids
---@field new fun(self: CitationIdList, source?:pandoc.Pandoc|pandoc.Meta|pandoc.Blocks|pandoc.Block|CitationId[]):CitationIdList
---@field isEmpty fun(self: CitationIdList): boolean
---@field find fun(self: CitationIdList, citationId: CitationId):boolean
---@field includes fun(self: CitationIdList, citationIdList: CitationIdList):boolean
---@field insert fun(self: CitationIdList, citationId: CitationId):nil
---@field clone fun(self: CitationIdList):CitationIdList
---@field minus fun(self: CitationIdList, citationIdList: CitationIdList):CitationIdList
---@field plus fun(self: CitationIdList, citationIdList: CitationIdList):CitationIdList
---@field addFromCitationIds fun(self: CitationIdList, list: CitationId[]):nil
---@field addFromBlocks fun(self: CitationIdList, blocks: pandoc.Blocks):nil
---@field addFromMeta fun(self: CitationIdList, meta: pandoc.Meta):nil
---@field addFromPandoc fun(self: CitationIdList, doc: pandoc.Pandoc):nil
---@field addFromReferences fun(self: CitationIdList, doc: pandoc.Pandoc):nil
---@field insertInNocite fun(self: CitationIdList, meta: pandoc.Meta):pandoc.Meta
local CitationIdList = {}
---Create an CitationIdList object
---@param source? pandoc.Pandoc|pandoc.Meta|pandoc.Blocks|pandoc.Block|CitationId[]
---@return CitationIdList
function CitationIdList:new(source)
o = {}
setmetatable(o,self)
self.__index = self
o.data = {}
if source then
srcType = type(source)
if srcType == 'Pandoc' then
o:addFromPandoc(source)
elseif srcType == 'Meta' then
o:addFromMeta(source)
elseif srcType == 'Blocks' or srcType == 'Block' then
o:addFromBlocks(source)
elseif srcType == 'table' then
o:addFromCitationIds(source)
end
end
return o
end
---Whether the list of citations is empty
---@return boolean
function CitationIdList:isEmpty()
return #self.data == 0
end
---Whether citationId is in the list
---@param citationId CitationId
---@return boolean
function CitationIdList:find(citationId)
for _,id in ipairs(self.data) do
if citationId == id then
return true
end
end
return false
end
---Whether the list includes all items from citationIdList
---@param citationIdList CitationIdList
function CitationIdList:includes(citationIdList)
result = true
for _,id in ipairs(citationIdList.data) do
if not self:find(id) then
result = false
break
end
end
return result
end
---Insert citation in the list if not already present
---@param citationId CitationId
function CitationIdList:insert(citationId)
if not self:find(citationId) then
table.insert(self.data, citationId)
end
end
---Get a copy of the list
---@return CitationIdList
function CitationIdList:clone()
result = CitationIdList:new(self.data)
return result
end
---Get a new list of citations minus those already in citationIdList
---@param citationIdList CitationIdList list of citations to remove
---@return CitationIdList result new CitationIdList
function CitationIdList:minus(citationIdList)
result = self:clone()
for _,id in ipairs(citationIdList) do
result:insert(id)
end
return result
end
---Get a new list of citations plus those in citationIdList
---@param citationIdList CitationIdList list of citations to add
---@return CitationIdList result new CitationIdList
function CitationIdList:plus(citationIdList)
result = CitationIdList:new()
result:addFromCitationIds(self.data)
result:addFromCitationIds(citationIdList.data)
return result
end
---Add from a list of citation Ids
---@param list CitationId[]
function CitationIdList:addFromCitationIds(list)
for _,item in ipairs(list) do
if item and type(item) == 'string' then
self:insert(item)
end
end
end
---Add citation ids found in blocks
---@param blocks pandoc.Blocks
function CitationIdList:addFromBlocks(blocks)
blocks:walk{
Cite = function(cite)
for _,citation in ipairs(cite.citations) do
self:insert(citation.id)
end
end
}
end
---Add citation ids found in selected metadata fields
---namely `title`, `subtitle`, `nocite`, `abstract`, and `thanks`
---@param meta pandoc.Meta
function CitationIdList:addFromMeta(meta)
local keys = {'title', 'subtitle', 'nocite', 'abstract', 'thanks'}
for _,key in ipairs(keys) do
if meta[key] then
self:addFromBlocks(pandoc.Plain(
(meta[key])
flattenToInlines))
end
end
end
---Add citation Ids from a Pandoc document
---@param doc pandoc.Pandoc
function CitationIdList:addFromPandoc(doc)
if doc.meta then
self:addFromMeta(doc.meta)
end
self:addFromBlocks(doc.blocks)
end
---Add citation Ids from a Pandoc document using pandoc.utils.references
---Differences between addFromReferences and addFromPandoc:
---addFromReferences only adds citations present in the bibliography
---addFromPandoc adds citations from any cite element
---addFromReferences adds citations present in any metadata field
---addFromPandoc only adds citations in selected metadata fields
function CitationIdList:addFromReferences(doc)
for _,item in pairs(pandoc.utils.references(doc)) do
self:insert(item.id)
end
end
---Insert citations in the nocite metadata field
---@param meta pandoc.Meta metadata block to modify
---@return pandoc.Meta
function CitationIdList:insertInNocite(meta)
local inlines = flattenToInlines(meta.nocite)
for _,id in ipairs(self.data) do
inlines:insert(pandoc.Space())
inlines:insert(pandoc.Cite(
pandoc.Str('@'..id),
pandoc.List{
pandoc.Citation(id, 'AuthorInText')
}
))
end
meta.nocite = pandoc.MetaInlines(inlines)
return meta
end
--- Use this to run command line tests (with pandoc lua)
if arg and arg[0] == debug.getinfo(1, "S").source:sub(2) then
local SRC =
[[---
bibliography: ../test/references.bib
dummy: |
@Doe2020, @Doe2018
---
Hello.
]]
local mymeta = pandoc.Meta{thanks =
pandoc.Cite('nothing',{pandoc.Citation('Smith','AuthorInText')})
}
local myblocks = pandoc.Para{
pandoc.Cite('nothing',{pandoc.Citation('Jones','AuthorInText')})
}
local mydoc = pandoc.Pandoc(myblocks,mymeta)
local mycites = CitationIdList:new({'hello','john'})
mycites:addFromReferences(pandoc.read(SRC, 'markdown'))
mycites:addFromCitationIds(CitationIdList:new(mydoc).data)
local newcites = CitationIdList:new{'john','Doe2020'}
test = mycites:minus(newcites)
for _,item in ipairs(test.data) do
print(item)
end
else
return CitationIdList
end
end,
["Options"] = function()
--------------------
-- Module: 'Options'
--------------------
local stringify = pandoc.utils.stringify
local metatype = pandoc.utils.type
--- # Options object
---@class Options
---@field new fun(meta: pandoc.Meta):Options create Options object
---@field allowDepth fun(depth: number):boolean depth is allowed
local Options = {}
---create an Options object
---@param meta pandoc.Meta
---@return object Options
function Options:new(meta)
o = {}
setmetatable(o,self)
self.__index = self
o:read(meta)
return o
end
--- normalize: normalize user options
--- simple string is assumed to be max-depth
--- maxdepth alias of max-depth
---@param meta metaObject
---@return pandoc.MetaMap
function Options:normalize(meta)
--- ensure its a map; single value assumed to be max-depth
meta = (metatype(meta) == 'table' and meta)
or (metatype(meta) == 'string' and
pandoc.MetaMap({ ['max-depth'] = meta}))
or (metatype(meta) == 'Inlines' and
pandoc.MetaMap({ ['max-depth'] = stringify(meta)}))
--- provide alias(es)
aliases = { ['max-depth'] = 'maxdepth' }
for key,alias in pairs(aliases) do
meta[key] = meta[key] == nil and meta[alias] ~= nil and meta[alias]
or meta[key]
end
---
return meta
end
---read: read options from doc's meta
---treat maxdepth as alias for max-depth
---@param meta pandoc.Meta
function Options:read(meta)
local opts = meta['recursive-citeproc']
and Options:normalize(meta['recursive-citeproc'])
or nil
-- allowDepth(depth) must return true when depth = 1
local userMaXDepth = opts and tonumber(opts['max-depth'])
local maxDepth = userMaXDepth and userMaXDepth >= 0 and userMaXDepth
or DEFAULT_MAX_DEPTH
self.allowDepth = function (depth)
return maxDepth == 0 or maxDepth >= depth
end
self.getDepth = function()
return maxDepth
end
end
return Options
end,
["log"] = function()
--------------------
-- Module: 'log'
--------------------
local FILTER_NAME = 'Recursive-Citeproc'
---log: send message to std_error
---@param type 'INFO'|'WARNING'|'ERROR'
---@param text string error message
local function log(type, text)
local level = {INFO = 0, WARNING = 1, ERROR = 2}
if level[type] == nil then type = 'ERROR' end
if level[PANDOC_STATE.verbosity] <= level[type] then
local message = '[' .. type .. '] '..FILTER_NAME..': '.. text .. '\n'
if quarto then
quarto.log.output(message)
else
io.stderr:write(message)
end
end
end
return log
end,
----------------------
-- Modules part end --
----------------------
}
if files[path] then
return files[path]
else
return origin_seacher(path)
end
end
end
---------------------------------------------------------
----------------Auto generated code block----------------
---------------------------------------------------------
--[[-- # Recursive-citeproc - Self-citing BibTeX
bibliographies in Pandoc and Quarto
@author Julien Dutant <julien.dutant@kcl.ac.uk>
@copyright 2021-2024 Julien Dutant
@license MIT - see LICENSE file for details.
@release 2.0.1
]]
local log = require('log')
local Options = require('Options')
local CitationIdList = require('CitationIdList')
local stringify = pandoc.utils.stringify
--- # Settings
-- Pandoc 2.17 for relying on `elem:walk()`, `pandoc.Inlines`, pandoc.utils.type
PANDOC_VERSION:must_be_at_least '2.17'
-- Limit recursion depth; 10 should do and avoid the appearance of freezing
DEFAULT_MAX_DEPTH = 10
-- Error messages
ERROR_MESSAGES = {
REFS_FOUND = 'I found a Div block with identifier `refs`. This probably means'
.." that you are running Citeproc alongside this filter. If you are, don't:"
.." this filter replaces Citeproc. If you aren't, you are using `refs` as an"
.." identifier on some Div element. That is a bad idea, as this interferes"
.." with Citeproc and this filter. I'm removing that element from the output.",
MAX_DEPTH = function (depth) return 'Reached maximum depth of self-citations '
..'('.. tostring(depth) ..').'
..'Check if there are circular self-citations in your bibligraphy.'
end
}
--- # Helper functions
---runCiteproc: run citeproc on a document
---@param doc pandoc.Pandoc
---@return pandoc.Pandoc
local function runCiteproc (doc)
if PANDOC_VERSION >= '2.19.1' then
return pandoc.utils.citeproc(doc)
else
local args = {'--from=json', '--to=json', '--citeproc'}
local result = pandoc.utils.run_json_filter(doc, 'pandoc', args)
return result and result
or pandoc.Pandoc({})
end
end
---Avoid crash with empty bibliography key
---@param meta pandoc.Meta
---@return pandoc.Meta meta
local function fixEmptyBiblio(meta)
if meta.bibliography and stringify(meta.bibliography) == '' then
meta.bibliography = nil
return meta
else
return meta
end
end
---Extract a Div block with a certain id from blocks.
---If found, the Div is removed from the blocks.
---@param blocks pandoc.Blocks
---@param identifier string
---@return pandoc.Blocks blocks blocks with the Div removed if found
---@return pandoc.Div|nil div Div if found, or nil
local function extractDivById(blocks, identifier)
if not identifier or identifier == '' then
return blocks, nil
end
local result = nil
return blocks:walk{
Div = function(div)
if div.identifier and div.identifier == identifier then
result = div
return {}
end
end
}, result
end
---Generate a bibliography from a document's meta and citation list
---@param meta pandoc.Meta
---@param citationIdList? CitationIdList
local function makeBibliography(meta, citationIdList)
minidoc = pandoc.Pandoc({}, meta)
if citationIdList then
minidoc.meta = citationIdList:insertInNocite(minidoc.meta)
end
minidoc = runCiteproc(minidoc)
if minidoc.blocks[1] then
return minidoc.blocks[1]
end
end
---Typeset citations in the `refs` Div of a document
---@param doc pandoc.Pandoc document
---@return pandoc.Pandoc|nil result updated document or nil
local function typesetCitationsInRefs(doc)
local blocks, refs = extractDivById(doc.blocks, 'refs')
if not refs then
return nil
end
-- Change identifier, otherwise Citeproc adds to this Div
refs.identifier = 'oldRefs'
-- run Citeprof on refs and extract result
local tmpdoc = runCiteproc(pandoc.Pandoc(pandoc.Blocks{refs}, doc.meta))
local _, newRefs = extractDivById(tmpdoc.blocks, 'oldRefs')
-- Restore identifier
newRefs.identifier = 'refs'
-- Recreate doc
blocks:insert(newRefs)
doc.blocks = blocks
return doc
end
--- # Filter
---recursiveCiteproc: fill in `nocite` field
---until producing a bibliography adds no new citations
---returns document with expanded no-cite field.
local function recursiveCiteproc(doc)
local options = Options:new(doc.meta)
doc.meta = fixEmptyBiblio(doc.meta) -- avoid crash on empty `bibliography` key
-- Check if Citeproc has been applied, otherwise run it; extract bibliography.
-- Quarto users can't avoid it but warn Pandoc users that it's redundant.
local refs
doc.blocks, refs = extractDivById(doc.blocks, 'refs')
if refs then
if not quarto then
('WARNING', ERROR_MESSAGES.REFS_FOUND)
logend
else
doc = runCiteproc(doc)
doc.blocks, refs = extractDivById(doc.blocks, 'refs')
end
-- if no bibliography or no citations in the bibliography, quick exit
if not refs then
return
elseif CitationIdList:new(refs):isEmpty() then
doc.blocks:insert(refs)
return doc
end
-- Second part: the bibliography contains citations, recursion needed
-- store citations already present in the original
originalCites = CitationIdList:new(doc)
-- establish extra citations by recursion.
-- Depends on options, doc.meta, originalCites.
---@param cites CitationIdList
---@param depth number
---@return CitationIdList
local function recursion(cites, depth)
if not options.allowDepth(depth) then
('WARNING', ERROR_MESSAGES.MAX_DEPTH(options.getDepth()))
logreturn cites
end
local bib = makeBibliography(doc.meta, originalCites:plus(cites))
newCites = CitationIdList:new(bib):minus(originalCites)
if cites:includes(newCites) then
return cites
else
return recursion(newCites, depth + 1)
end
end
extraCites = recursion(CitationIdList:new(), 1)
-- Citeproc the doc. Typesets citations *in the body* and adds bibliography.
-- Citations in the bibliography aren't typeset yet.
doc.meta = extraCites:insertInNocite(doc.meta)
doc = runCiteproc(doc)
-- Typeset citations in the bibliography
doc = typesetCitationsInRefs(doc)
return doc
end
--- # return filter
return {
{
Pandoc = recursiveCiteproc
}
}