For some reason (I assume intentionally), Substack doesn't offer a way to add a table of contents to your post.
That's a bummer, because sometimes a post is long enough to need some sort of index, or at least an overview of its contents, so you end up building the table of contents manually (for example, in this way).
Building a TOC by copying links and text from the post itself is quite mechanical work, and anything mechanical seems like something you could write a script for. I put that thought together with some time off, and with the fact that I hadn't written a line of code in several months (or maybe more), and I ended up creating a script that generates a very simple table of contents.
You can see an example in my post ChatGPT basics:
Usage
1. Make sure you have Python installed, or learn how to get started here.
2. Install lxml with `sudo pip3 install lxml`, or check the instructions here.
3. Download my script and save it to any folder you want.
4. Execute it with:

   ```
   python3 substack_toc.py URL [output_filename]
   ```

   `URL` is the full URL of the post. `output_filename` is optional: the generated content is saved to a file with that name in the folder you run the script from (and overwritten if it already exists); if you omit it, the content is printed to the terminal.
5. Open the output file in your browser.
6. Copy the content and paste it into your Substack post!
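To give an idea of what ends up on your clipboard, the script builds a nested `<ul>` of links from the post's headers. Here is a minimal, self-contained sketch of that nesting logic, using hypothetical header titles, anchor ids, and post URL instead of a live fetch:

```python
# Hypothetical header data (level, title, anchor id), standing in for
# what the real script extracts from the live Substack page.
headers = [(2, "Intro", "intro"), (3, "Setup", "setup"), (2, "Usage", "usage")]
url = "https://example.substack.com/p/my-post"  # hypothetical URL

current_level = 100  # sentinel: higher than any real header level
ul_open = 0
output = "<ul>"
for level, title, anchor in headers:
    if level > current_level:
        # a subheader: reopen the parent <li> and start a nested list
        output = output[:-5] + "<ul>"
        ul_open += 1
    elif level < current_level and ul_open > 0:
        # back to a higher-rank header: close any nested lists
        while ul_open > 0:
            ul_open -= 1
            output += "</ul></li>"
    current_level = level
    output += "<li><a href='" + url + "#" + anchor + "'>" + title + "</a></li>"
# close any lists still open before closing the main <ul>
while ul_open > 0:
    ul_open -= 1
    output += "</ul></li>"
output += "</ul>"
print(output)
```

Pasted into the Substack editor, that markup renders as a two-level bulleted list, with Setup indented under Intro.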
The script
You can find the script as a GitHub Gist, or right here:
```python
import requests
# you can install lxml with `sudo pip3 install lxml`
from lxml.html import fromstring
import sys
from urllib.parse import urlparse, urlunparse, urlencode, parse_qs
import time

if len(sys.argv) < 2:
    print("No arguments were given. Use: URL [output_file_name]")
    quit()

# To avoid caching we add a query parameter with the timestamp
# (the dict merge operator `|` requires Python 3.9+)
def add_timestamp_to_url(url):
    parsed_url = urlparse(url)
    query_params = parse_qs(parsed_url.query) | {"timestamp": int(time.time())}
    return urlunparse(parsed_url._replace(query=urlencode(query_params, doseq=True)))

url = sys.argv[1]
filename = sys.argv[2] if len(sys.argv) > 2 else None

# fetch the HTML
tree = fromstring(requests.get(add_timestamp_to_url(url)).content)
path = "//*[@class='header-with-anchor-widget']"
current_level = 100
ul_open = 0
output = "<ul>"
# get all headers with the class defined above
for header in tree.xpath(path):
    header_level = int(header.tag[1])
    print("H" + str(header_level) + " - " + str(header.text))
    # we will nest subheaders inside their parents
    if header_level > current_level:
        print("nesting")
        output = output[:-5] + "<ul>"  # drop the trailing </li> and open a nested list
        ul_open = ul_open + 1
    # close the current tree and go back to a higher-rank header
    elif header_level < current_level and ul_open > 0:
        while ul_open > 0:
            print("unnesting " + str(ul_open))
            ul_open = ul_open - 1
            output = output + "</ul></li>"
    current_level = header_level
    # create the link
    link = header[0].get('id')
    output = output + "<li><a href='" + url + "#" + str(link) + "'>"
    output = output + str(header.text) + "</a></li>"
# if that was the last header, close the current tree before closing the main UL
while ul_open > 0:
    print("unnesting " + str(ul_open))
    ul_open = ul_open - 1
    output = output + "</ul></li>"
output = output + "</ul>"

if filename:
    with open(filename, 'w') as file:
        file.write(output)
    print("\nSaved to ./" + filename)
else:
    print(output)
```
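As a quick sanity check of the cache-busting helper, the snippet below (standard library only; the dict merge operator `|` needs Python 3.9+) shows that the original query parameters survive and a `timestamp` parameter is appended. The URL is a hypothetical example:

```python
from urllib.parse import urlparse, urlunparse, urlencode, parse_qs
import time

def add_timestamp_to_url(url):
    # merge the existing query parameters with a fresh timestamp
    parsed_url = urlparse(url)
    query_params = parse_qs(parsed_url.query) | {"timestamp": int(time.time())}
    return urlunparse(parsed_url._replace(query=urlencode(query_params, doseq=True)))

stamped = add_timestamp_to_url("https://example.substack.com/p/my-post?ref=home")
print(stamped)  # e.g. https://example.substack.com/p/my-post?ref=home&timestamp=1700000000
```

Because the timestamp changes on every run, each request looks like a new URL to any cache in between, so you always fetch the latest version of the post.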