# mdcode: extract C code from a _markdown_ file.
3 _markdown_ is a popular format for simple text markup which can easily
4 be converted to HTML. As it allows easy indication of sections of
5 code, it is quite suitable for use in literate programming. This file
6 is an example of that usage.
8 The code included below provides two related functionalities.
9 Firstly it provides a library routine for extracting code out of a
10 _markdown_ file, so that other routines might make use of it.
Secondly it provides a simple client of this routine which extracts
one or more C-language files from a markdown document so they can be
passed to a C compiler. These two combine to make a tool that is needed
to compile this tool. Yes, this is circular. A prototype tool was
used for the first extraction.
The tool provided is described as specific to the C language as it generates
##### Example: a _line_ command

	#line __line-number__ __file-name__

25 lines so that the C compiler will report where in the markdown file
26 any error is found. This tool is suitable for any other language
27 which allows the same directive, or will treat it as a comment.
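##### Example: formatting a _line_ directive

As a minimal sketch of the directive described above, the following helper formats one `#line` line. The name `format_line_directive` and its signature are assumptions made for illustration and are not part of mdcode. The `Example:` heading keeps this block out of the generated files.

```c
#include <stdio.h>

/* Format a C "#line" directive into 'buf'.
 * Returns the snprintf() result: the number of characters needed,
 * not counting the terminating nul. */
int format_line_directive(char *buf, size_t size, int line_no,
			  const char *fname)
{
	return snprintf(buf, size, "#line %d \"%s\"\n", line_no, fname);
}
```

Calling `format_line_directive(buf, sizeof(buf), 42, "mdcode.mdc")` leaves `#line 42 "mdcode.mdc"` followed by a newline in `buf`, so a compiler error in the code that follows would be reported against the markdown file rather than the generated C file.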
31 Literate programming is more than just including comments with the
32 code, even nicely formatted comments. It also involves presenting the
33 code in an order that makes sense to a human, rather than an order
34 that makes sense to a compiler. For this reason a core part of any
35 literate programming tool is the ability to re-arrange the code found
36 in the document into a different order in the final code file - or
37 files. This requires some form of linkage to be encoded.
The approach taken here is focused around section headings - of any depth.
42 All the code in any section is treated as a single sequential
43 collection of code, and is named by the section that it is in. If
44 multiple sections have the same name, then the code blocks in all of
45 them are joined together in the order they appear in the document.
A code section can contain a special marker which starts with 2
hash characters (__##__).
49 The text after the marker must be the name of some section which
50 contains code. Code from that section will be interpolated in place
51 of the marker, and will be indented to match the indent of the marker.
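##### Example: indenting interpolated code

To illustrate what "indented to match the indent of the marker" means, here is a small sketch that prefixes every non-blank line of a text with a given number of spaces. The function `indent_block` is hypothetical and works on nul-terminated strings for brevity, whereas the tool itself works on pointer-and-length text.

```c
#include <stdio.h>
#include <string.h>

/* Copy 'text' into 'out', prefixing every non-blank line with
 * 'indent' spaces. Returns 0 on success, or -1 if 'out' (which
 * holds 'space' bytes) is too small. */
int indent_block(char *out, size_t space, const char *text, int indent)
{
	size_t used = 0;

	if (space == 0)
		return -1;
	out[0] = '\0';
	while (*text) {
		size_t linelen = strcspn(text, "\n");
		int pad = linelen ? indent : 0;	/* leave blank lines empty */
		int n;

		if (text[linelen] == '\n')
			linelen++;	/* keep the newline with its line */
		n = snprintf(out + used, space - used, "%*s%.*s",
			     pad, "", (int)linelen, text);
		if (n < 0 || (size_t)n >= space - used)
			return -1;
		used += (size_t)n;
		text += linelen;
	}
	return 0;
}
```

With an indent of 4, the two statements `a = 1;` and `b = 2;` separated by a blank line come out with four leading spaces each, and the blank line stays empty.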
It is not permitted for the same code to be interpolated multiple
times. Allowing this might make some sense, but it is probably a
mistake, and prohibiting it makes some of the code a bit cleaner.
57 Equally, every section of code should be interpolated at least once -
58 with two exceptions. These exceptions are imposed by the tool, not
59 the library. A different client could impose different rules on the
60 names of top-level code sections.
62 The first exception we have already seen. A section name starting
63 __Example:__ indicates code that is not to be included in the final product.
The second exception is for the top level code sections which will be
written to files. Again these are identified by their section name.
This must start with __File:__; the following text (after optional
spaces) will be used as a file name.
70 Any section containing code that does not start __Example:__ or
71 __File:__ must be included in some other section exactly once.
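##### Example: classifying section names

The naming rules above can be sketched as a small classifier. The enum and the function `classify_section` are illustrative names only. The name is passed as a pointer and a length because section names in this document are not nul-terminated.

```c
#include <string.h>

enum section_kind {
	SECTION_EXAMPLE,	/* never written out */
	SECTION_FILE,		/* top level, written to a file */
	SECTION_FRAGMENT	/* must be referenced exactly once */
};

/* Classify a section name by its prefix. */
enum section_kind classify_section(const char *name, size_t len)
{
	if (len >= 8 && strncmp(name, "Example:", 8) == 0)
		return SECTION_EXAMPLE;
	if (len >= 5 && strncmp(name, "File:", 5) == 0)
		return SECTION_FILE;
	return SECTION_FRAGMENT;
}
```

A name such as `File: mdcode.h` is classified as a file, while `internal functions` is a fragment that some other section has to reference.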
75 Allowing multiple top level code sections which name different files
76 means that one _markdown_ document can describe several files. This
77 is very useful with the C language where a program file and a header
78 file might be related. For the present document we will have a header
79 file and two code files, one with the library content and one for the
82 It will also be very convenient to create a `makefile` fragment to
83 ensure the code is compiled correctly. A simple `make -f mdcode.mk`
84 will "do the right thing".
90 mdcode.h libmdcode.c md2c.c mdcode.mk : mdcode.mdc
108 ## internal functions
113 libmdcode.o : libmdcode.c mdcode.h
114 $(CC) $(CFLAGS) -c libmdcode.c
130 md2c : md2c.o libmdcode.o
131 $(CC) $(CFLAGS) -o md2c md2c.o libmdcode.o
132 md2c.o : md2c.c mdcode.h
133 $(CC) $(CFLAGS) -c md2c.c
137 As the core purpose of _mdcode_ is to discover and re-arrange blocks
138 of text, it makes sense to map the whole document file into memory and
139 produce a data structure which lists various parts of the file in the
140 appropriate order. Each node in this structure will have some text
141 from the document, a child pointer, and a next pointer, any of which
142 might not be present. The text is most easily stored as a pointer and a
143 length. We'll call this a `text`
145 A list of these `code_nodes` will belong to each section and it will
146 be useful to have a separate `section` data structure to store the
147 list of `code_nodes`, the section name, and some other information.
149 This other information will include a reference counter so we can
150 ensure proper referencing, and an `indent` depth. As referenced
151 content can have an extra indent added, we need to know what that is.
152 The `code_node` will also have an `indent` depth which eventually gets
153 set to the sum for the indents from all references on the path from
Finally we need to know if the `code_node` was recognised by being
indented or not. If it was, the client of this data will want to
strip off the leading tab or 4 spaces. Hence a `needs_strip` flag is
needed.
170 struct code_node *code;
171 struct section *next;
179 struct code_node *next;
180 struct section *child;
187 struct code_node *last;
192 You will note that the `struct psection` contains an anonymous `struct
193 section` embedded at the start. To make this work right, GCC
194 requires the `-fplan9-extensions` flag.
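##### Example: embedding without compiler extensions

For comparison only, standard C can achieve a similar layout with a named first member, at the cost of writing `p->pub.id` rather than `p->id`. This is a sketch with hypothetical names (`pub_part`, `priv_part`), not the approach this document's code takes. It relies on the C rule that a pointer to a structure can be converted to a pointer to its first member and back.

```c
/* The part a client is allowed to see. */
struct pub_part {
	int id;
};

/* The library's private view: the public part comes first. */
struct priv_part {
	struct pub_part pub;
	int refcnt;
};

/* Hand the public view to a client. */
struct pub_part *to_public(struct priv_part *p)
{
	return &p->pub;
}

/* Recover the private view; valid only for objects that really
 * are a struct priv_part. */
struct priv_part *to_private(struct pub_part *s)
{
	return (struct priv_part *)s;
}
```

The anonymous embedding used in this document avoids the extra `.pub` step in every field access, which is the convenience the extension buys.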
##### File: mdcode.mk

	CFLAGS += -fplan9-extensions

### Manipulating the node
202 Though a tree with `next` and `child` links is the easiest way to
203 assemble the various code sections, it is not the easiest form for
204 using them. For that a simple list would be best.
So once we have a fully linked File section we will want to linearize
it, so that the `child` links become `NULL` and the `next` links will
find everything required. It is at this stage that the requirement
that each section is linked only once becomes important.
211 `code_linearize` will merge the `code_node`s from any child into the
212 given `code_node`. As it does this it sets the 'indent' field for
215 Note that we don't clear the section's `last` pointer, even though
216 it no longer owns any code. This allows subsequent code to see if a
217 section ever had any code, and to report an error if a section is
218 referenced but not defined.
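##### Example: linearizing a small tree

The splice described above can be tried on simplified structures. In this sketch `struct node` and `struct list` are stand-ins invented for the example; they keep only the `next`, `child` and `indent` fields and drop the text, line numbers and reference counts.

```c
#include <stddef.h>

struct list;

struct node {
	const char *label;
	int indent;
	struct node *next;
	struct list *child;	/* referenced list, or NULL */
};

struct list {
	int indent;		/* extra indent for referenced content */
	struct node *first;
};

/* Splice every child list into the chain starting at 'n'. */
void linearize(struct node *n)
{
	for (; n; n = n->next) {
		struct node *after, *t;
		struct list *child = n->child;

		if (!child)
			continue;
		after = n->next;
		n->next = child->first;
		child->first = NULL;	/* the child no longer owns nodes */
		n->child = NULL;
		/* spliced nodes get this node's indent plus the child's */
		for (t = n; t->next; t = t->next)
			t->next->indent = n->indent + child->indent;
		t->next = after;
	}
}
```

Because the outer loop continues into the nodes it has just spliced in, nested references are flattened in the same pass and their indents accumulate along the path.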
##### internal functions
222 static void code_linearize(struct code_node *code)
225 for (t = code; t; t = t->next)
227 for (; code; code = code->next)
229 struct code_node *next = code->next;
230 struct psection *pchild =
231 (struct psection *)code->child;
232 int indent = pchild->indent;
233 code->next = code->child->code;
234 code->child->code = NULL;
236 for (t = code; t->next; t = t->next)
237 t->next->indent = code->indent + indent;
242 Once a client has made use of a linearized code set, it will probably
245 void code_free(struct code_node *code)
248 struct code_node *this;
250 code_linearize(code);
##### exported functions
259 void code_free(struct code_node *code);
### Building the tree
263 As we parse the document there are two things we will want to do to
264 node trees: add some text or add a reference. We'll assume for now
265 that the relevant section structures have been found, and will just
266 deal with the `code_node`.
Adding text simply means adding another node. We will never have
empty nodes; even if the last node only has a child, new text must go
in a new node.
##### internal functions
274 static void code_add_text(struct psection *where, struct text txt,
275 int line_no, int needs_strip)
280 n = malloc(sizeof(*n));
283 n->line_no = line_no;
284 n->needs_strip = needs_strip;
288 where->last->next = n;
294 However when adding a link, we might be able to include it in the last
295 `code_node` if it currently only has text.
297 void code_add_link(struct psection *where, struct psection *to,
303 to->refcnt++; // this will be checked elsewhere
304 if (where->last && where->last->child == NULL) {
305 where->last->child = to;
308 n = malloc(sizeof(*n));
315 where->last->next = n;
Now we need a lookup table to be able to find sections by name.
Something that provides a `log(N)` search time is probably
justified, but for now I want a minimal stand-alone program so a
linked list managed by insertion-sort will do. As a comparison
function it is easiest to sort based on length before content. So
sections won't be in standard lexical order, but that isn't important.
330 If we cannot find a section, we simply want to create it. This allows
331 sections and references to be created in any order. Sections with
332 no references or no content will cause a warning eventually.
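##### Example: a length-first sorted list

The lookup strategy described above can be sketched on its own. The types `struct name` and `struct entry` and the function `entry_find` are simplified stand-ins invented for this example; the real table stores sections rather than bare names.

```c
#include <stdlib.h>
#include <string.h>

struct name {
	const char *txt;
	int len;
};

/* Order by length first, then by content. */
int name_cmp(struct name a, struct name b)
{
	if (a.len != b.len)
		return a.len - b.len;
	return strncmp(a.txt, b.txt, a.len);
}

struct entry {
	struct name name;
	struct entry *next;
};

/* Return the entry called 'name', creating it in its sorted
 * position if it is absent. Returns NULL only if malloc() fails. */
struct entry *entry_find(struct entry **list, struct name name)
{
	struct entry *e;

	while (*list) {
		int cmp = name_cmp((*list)->name, name);
		if (cmp == 0)
			return *list;
		if (cmp > 0)
			break;
		list = &(*list)->next;
	}
	e = malloc(sizeof(*e));
	if (!e)
		return NULL;
	e->name = name;
	e->next = *list;
	*list = e;
	return e;
}
```

Inserting `aa`, `b` and `ab` in that order yields the list `b`, `aa`, `ab`: the shorter name sorts first even though it is lexically later.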
#### internal functions
336 static int text_cmp(struct text a, struct text b)
339 return a.len - b.len;
340 return strncmp(a.txt, b.txt, a.len);
343 static struct psection *section_find(struct psection **list, struct text name)
345 struct psection *new;
347 int cmp = text_cmp((*list)->section, name);
352 list = (struct psection **)&((*list)->next);
354 /* Add this section */
355 new = malloc(sizeof(*new));
## Parsing the _markdown_
368 Parsing markdown is fairly easy, though there are complications.
370 The document is divided into "paragraphs" which are mostly separated by blank
371 lines (which may contain white space). The first few characters of
372 the first line of a paragraph determine the type of paragraph. For
373 our purposes we are only interested in list paragraphs, code
374 paragraphs, section headings, and everything else. Section headings
375 are single-line paragraphs and so do not require a preceding or
376 following blank line.
378 Section headings start with 1 or more hash characters (__#__). List
379 paragraphs start with hyphen, asterisk, plus, or digits followed by a
380 period. Code paragraphs aren't quite so easy.
The "standard" code paragraph starts with 4 or more spaces, or a tab.
However if the previous paragraph was a list paragraph, then those
spaces indicate another paragraph in the same list item, and 8 or
more spaces are required, unless a nested list is in effect, in
which case 12 or more are needed. Unfortunately not all _markdown_
parsers agree on nested lists.
Two alternate styles for marking code are in active use. "GitHub" uses
three backticks (_`` ``` ``_), while "pandoc" uses three or more tildes
(_~~~_). In these cases the code should not be indented.
393 Trying to please everyone as much as possible, this parser will handle
394 everything except for code inside lists.
396 So an indented (4+) paragraph after a list paragraph is always a list
397 paragraph, otherwise it is a code paragraph. A paragraph that starts
398 with three backticks or three tildes is code which continues until a
399 matching string of backticks or tildes.
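##### Example: classifying a paragraph

The rules above can be summarised as a classifier over the first characters of a paragraph. This is only a sketch under simplifying assumptions: `classify_para` and its enum are hypothetical names, the line is nul-terminated, and the caller tracks whether the previous paragraph was a list.

```c
#include <ctype.h>

enum para_kind {
	PARA_HEADING, PARA_LIST, PARA_FENCE, PARA_INDENTED, PARA_OTHER
};

/* True if 'p' starts with 'n' copies of 'ch'. */
static int starts_with_run(const char *p, char ch, int n)
{
	while (n-- > 0)
		if (*p++ != ch)
			return 0;
	return 1;
}

/* Classify a paragraph from the start of its first line. */
enum para_kind classify_para(const char *line, int in_list)
{
	const char *p = line;

	if (*p == '#')
		return PARA_HEADING;
	if (*p == '\t' || starts_with_run(p, ' ', 4))
		/* indented text after a list continues the list */
		return in_list ? PARA_LIST : PARA_INDENTED;
	if (starts_with_run(p, '`', 3) || starts_with_run(p, '~', 3))
		return PARA_FENCE;
	if (*p == '-' || *p == '*' || *p == '+')
		return PARA_LIST;
	if (isdigit((unsigned char)*p)) {
		while (isdigit((unsigned char)*p))
			p++;
		if (*p == '.')
			return PARA_LIST;
	}
	return PARA_OTHER;
}
```

Like the parser described here, this treats any leading hyphen, asterisk or plus as a list marker, so it is deliberately more permissive than a full _markdown_ parser.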
403 While walking the document looking for various markers we will *not*
404 use the `struct text` introduced earlier as advancing that requires
405 updating both start and length which feels clumsy. Instead we will
406 carry `pos` and `end` pointers, only the first of which needs to
So to start, we need to skip various parts of the document. `lws`
stands for "Linear White Space" and is a term that comes from the
Email RFCs (e.g. RFC822). `line` and `para` are self-explanatory.
Note that `skip_para` needs to update the current line number.
`skip_line` doesn't, but every caller should.
#### internal functions
417 static char *skip_lws(char *pos, char *end)
419 while (pos < end && (*pos == ' ' || *pos == '\t'))
424 static char *skip_line(char *pos, char *end)
426 while (pos < end && *pos != '\n')
433 static char *skip_para(char *pos, char *end, int *line_no)
435 /* Might return a pointer to a blank line, as only
436 * one trailing blank line is skipped
439 pos = skip_line(pos, end);
445 *(pos = skip_lws(pos, end)) != '\n') {
446 pos = skip_line(pos, end);
449 if (pos < end && *pos == '\n') {
### Recognising things
Recognising a section header is trivial and doesn't require a
function. However we need to extract the content of a section header
as a `struct text` for passing to `section_find`.
Recognising the start of a new list is fairly easy. Recognising the
start (and end) of code is a little messy so we provide a function for
matching the first few characters, which has a special case for "4
spaces or tab".
#### internal includes
#### internal functions
473 static struct text take_header(char *pos, char *end)
477 while (pos < end && *pos == '#')
479 while (pos < end && *pos == ' ')
482 while (pos < end && *pos != '\n')
484 while (pos > section.txt &&
485 (pos[-1] == '#' || pos[-1] == ' '))
487 section.len = pos - section.txt;
491 static int is_list(char *pos, char *end)
493 if (strchr("-*+", *pos))
496 while (pos < end && isdigit(*pos))
498 if (pos < end && *pos == '.')
504 static int matches(char *start, char *pos, char *end)
507 return matches("\t", pos, end) ||
508 matches(" ", pos, end);
509 return (pos + strlen(start) < end &&
510 strncmp(pos, start, strlen(start)) == 0);
### Extracting the code
515 Now that we can skip paragraphs and recognise what type each paragraph
516 is, it is time to parse the file and extract the code. We'll do this
517 in two parts, first we look at what to do with some code once we
518 find it, and then how to actually find it.
520 When we have some code, we know where it is, what the end marker
521 should look like, and which section it is in.
523 There are two sorts of end markers: the presence of a particular
524 string, or the absence of an indent. We will use a string to
525 represent a presence, and a `NULL` to represent the absence.
While looking at code we don't think about paragraphs at all - just
look for a line that starts with the right thing.
Every line that is still code then needs to be examined to see if it
is a section reference.
532 When a section reference is found, all preceding code (if any) must be
533 added to the current section, then the reference is added.
535 When we do find the end of the code, all text that we have found but
536 not processed needs to be saved too.
When adding a reference we need to set the `indent`. This is the
number of spaces (counting 8 for tabs) after the natural indent of the
code (which is a tab or 4 spaces). We use a separate function `count_space`
for that.
#### internal functions
545 static int count_space(char *sol, char *p)
559 static char *take_code(char *pos, char *end, char *marker,
560 struct psection **table, struct text section,
564 int line_no = *line_nop;
565 int start_line = line_no;
566 struct psection *sect;
568 sect = section_find(table, section);
574 if (marker && matches(marker, pos, end))
577 (skip_lws(pos, end))[0] != '\n' &&
578 !matches(NULL, pos, end))
579 /* Paragraph not indented */
582 /* Still in code - check for reference */
587 else if (strcmp(sol, " ") == 0)
590 t = skip_lws(sol, end);
591 if (t[0] != '#' || t[1] != '#') {
592 /* Just regular code here */
593 pos = skip_line(sol, end);
601 txt.len = pos - start;
602 code_add_text(sect, txt, start_line,
605 ref = take_header(t, end);
607 struct psection *refsec = section_find(table, ref);
608 code_add_link(sect, refsec, count_space(sol, t));
610 pos = skip_line(t, end);
613 start_line = line_no;
618 txt.len = pos - start;
619 code_add_text(sect, txt, start_line,
623 pos = skip_line(pos, end);
632 It is when looking for the code that we actually use the paragraph
633 structure. We need to recognise section headings so we can record the
634 name, list paragraphs so we can ignore indented follow-on paragraphs,
635 and the three different markings for code.
#### internal functions
639 static struct psection *code_find(char *pos, char *end)
641 struct psection *table = NULL;
644 struct text section = {0};
648 section = take_header(pos, end);
650 pos = skip_line(pos, end);
652 } else if (is_list(pos, end)) {
654 pos = skip_para(pos, end, &line_no);
655 } else if (!in_list && matches(NULL, pos, end)) {
656 pos = take_code(pos, end, NULL, &table,
658 } else if (matches("```", pos, end)) {
660 pos = skip_line(pos, end);
662 pos = take_code(pos, end, "```", &table,
664 } else if (matches("~~~", pos, end)) {
666 pos = skip_line(pos, end);
668 pos = take_code(pos, end, "~~~", &table,
673 pos = skip_para(pos, end, &line_no);
### Returning the code
Having found all the code blocks and gathered them into a list of
sections, we are now ready to return them to the caller. This is where
to perform consistency checks, like at most one reference and at least
one definition for each section.

All the sections with no references are returned in a list for the
caller to consider. They are linearized first so that the substructure
is completely hidden -- except for the small amount of structure
displayed in the line numbers.
691 To return errors, we have the caller pass a function which takes an
692 error message - a `code_err_fn`.
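##### Example: an error callback

To show how a caller-supplied error function might be used, here is a sketch in which the library side formats a message and the client side decides what to do with it. The typedef `err_fn` has the same shape as `code_err_fn`; `remember_error` and `report_undefined` are hypothetical helpers, and the fixed 128-byte buffers are an assumption made to keep the example short.

```c
#include <stdio.h>

typedef void (*err_fn)(char *msg);	/* same shape as code_err_fn */

int error_count;
char last_error[128];

/* Client side: remember the latest message and count errors. */
void remember_error(char *msg)
{
	snprintf(last_error, sizeof(last_error), "%s", msg);
	error_count++;
}

/* Library side: format a message about a section whose name is
 * given as pointer plus length, then hand it to the callback. */
void report_undefined(err_fn error, const char *name, int len)
{
	char msg[128];

	snprintf(msg, sizeof(msg),
		 "Section \"%.*s\" is referenced but not declared",
		 len, name);
	error(msg);
}
```

A client that only wants to know whether extraction succeeded can check `error_count` afterwards, while another client could print each message as it arrives.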
696 typedef void (*code_err_fn)(char *msg);
#### internal functions
699 struct section *code_extract(char *pos, char *end, code_err_fn error)
701 struct psection *table;
702 struct section *result = NULL;
703 struct section *tofree = NULL;
705 table = code_find(pos, end);
708 struct psection *t = (struct psection*)table->next;
709 if (table->last == NULL) {
712 "Section \"%.*s\" is referenced but not declared",
713 table->section.len, table->section.txt);
717 if (table->refcnt == 0) {
718 /* Root-section, return it */
719 table->next = result;
721 code_linearize(result->code);
723 table->next = tofree;
725 if (table->refcnt > 1) {
728 "Section \"%.*s\" referenced multiple times (%d).",
729 table->section.len, table->section.txt,
738 struct section *t = tofree->next;
##### exported functions
747 struct section *code_extract(char *pos, char *end, code_err_fn error);
752 Now that we can extract code from a document and link it all together
753 it is time to do something with that code. Firstly we need to print
### Printing the Code
Printing is mostly straightforward - we just walk the list and print
the code sections, adding whatever indent is required for each line.
However there is a complication (isn't there always?).
762 For code that was recognised because the paragraph was indented, we
763 need to strip that indent first. For other code, we don't.
765 The approach taken here is simple, though it could arguably be wrong
766 in some unlikely cases. So it might need to be fixed later.
If the first line of a code block is indented, then either one tab or
4 spaces are stripped from every non-blank line.
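##### Example: stripping the natural indent

The stripping rule just described, applied to a single line, might look like the sketch below. The helper `strip_indent` is a hypothetical name. It takes a pointer and a length, as the printing code does, and leaves a line alone if nothing would remain after the indent.

```c
#include <string.h>

/* Return a pointer past one leading tab or four leading spaces,
 * if present, and reduce '*len' to match. A line with nothing
 * after the indent is returned unchanged. */
const char *strip_indent(const char *line, int *len)
{
	if (*len > 1 && line[0] == '\t') {
		*len -= 1;
		return line + 1;
	}
	if (*len > 4 && strncmp(line, "    ", 4) == 0) {
		*len -= 4;
		return line + 4;
	}
	return line;
}
```

Only one level of indent is removed, so code that is indented further inside an indented block keeps its relative layout.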
This could go wrong if the first line of a code block marked by
_`` ``` ``_ is indented. To overcome this we would need to
record some extra state in each `code_node`. For now we won't bother.
775 The indents we insert will all be spaces. This might not work well
##### client functions
780 static void code_print(FILE *out, struct code_node *node,
783 for (; node; node = node->next) {
784 char *c = node->code.txt;
785 int len = node->code.len;
790 fprintf(out, "#line %d \"%s\"\n",
791 node->line_no, fname);
793 fprintf(out, "%*s", node->indent, "");
794 if (node->needs_strip) {
795 if (*c == '\t' && len > 1) {
798 } else if (strncmp(c, " ", 4) == 0 && len > 4) {
807 } while (len && c[-1] != '\n');
### Bringing it all together
We are just about ready for the `main` function of the tool which will
extract all this lovely code and compile it. Just one helper is still
needed.
#### Handling filenames
Section names are stored in `struct text` which is not `nul`
terminated. Filenames passed to `fopen` need to be `nul` terminated.
So we need to convert one to the other, and strip the leading `File:`
off while we are at it.
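##### Example: converting a section name to a file name

The conversion described above can be sketched independently of the tool's own data types. Here `section_to_fname` is a hypothetical helper that takes the section title as pointer plus length, and the choice to return -1 for a missing prefix, an empty name or an over-long name is an assumption of this example.

```c
#include <string.h>

/* Copy the file name out of a "File: name" section title.
 * 'sec' and 'len' describe the title, which is not nul-terminated.
 * Returns 0 on success, or -1 if the prefix is missing, the name
 * is empty, or it does not fit in 'name' ('space' bytes). */
int section_to_fname(char *name, size_t space,
		     const char *sec, size_t len)
{
	if (len < 5 || strncmp(sec, "File:", 5) != 0)
		return -1;
	sec += 5;
	len -= 5;
	while (len && *sec == ' ') {	/* skip optional spaces */
		sec++;
		len--;
	}
	if (len == 0 || len >= space)
		return -1;
	memcpy(name, sec, len);
	name[len] = '\0';
	return 0;
}
```

Using `memcpy` followed by an explicit terminator avoids relying on `strncpy` to terminate the result, which it does not do when the source fills the buffer.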
##### client functions
827 static void copy_fname(char *name, int space, struct text t)
832 if (len < 5 || strncmp(sec, "File:", 5) != 0)
836 while (len && sec[0] == ' ') {
842 strncpy(name, sec, len);
And now we take a single file name, extract the code, and if there are
no errors we write out a file for each appropriate code section. And
##### client includes
857 #include <sys/mman.h>
##### client functions
864 static void pr_err(char *msg)
867 fprintf(stderr, "%s\n", msg);
870 int main(int argc, char *argv[])
875 struct section *table, *s, *prev;
879 fprintf(stderr, "Usage: mdcode file.mdc\n");
882 fd = open(argv[1], O_RDONLY);
884 fprintf(stderr, "mdcode: cannot open %s: %s\n",
885 argv[1], strerror(errno));
888 len = lseek(fd, 0, 2);
889 file = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
890 table = code_extract(file, file+len, pr_err);
893 (code_free(s->code), prev = s, s = s->next, free(prev))) {
896 if (strncmp(s->section.txt, "Example:", 8) == 0)
898 if (strncmp(s->section.txt, "File:", 5) != 0) {
899 fprintf(stderr, "Unreferenced section is not a file name: %.*s\n",
900 s->section.len, s->section.txt);
904 copy_fname(fname, sizeof(fname), s->section);
906 fprintf(stderr, "Missing file name at:%.*s\n",
907 s->section.len, s->section.txt);
911 fl = fopen(fname, "w");
913 fprintf(stderr, "Cannot create %s: %s\n",
914 fname, strerror(errno));
918 code_print(fl, s->code, argv[1]);