1 # mdcode: extract C code from a _markdown_ file.
3 _markdown_ is a popular format for simple text markup which can easily
4 be converted to HTML. As it allows easy indication of sections of
5 code, it is quite suitable for use in literate programming. This file
6 is an example of that usage.
8 The code included below provides two related functionalities.
9 Firstly it provides a library routine for extracting code out of a
10 _markdown_ file, so that other routines might make use of it.
12 Secondly it provides a simple client of this routine which extracts
13 1 or more C-language files from a markdown document so they can be
14 passed to a C compiler. These two combined to make a tool that is needed
15 to compile this tool. Yes, this is circular. A prototype tool was
16 used for the first extraction.
18 The tool provided is described as specific to the C language as it
21 ##### Example: a _line_ command
23 #line __line-number__ __file-name__
25 lines so that the C compiler will report where in the markdown file
26 any error is found. This tool is suitable for any other language
27 which allows the same directive, or will treat it as a comment.
31 Literate programming is more than just including comments with the
32 code, even nicely formatted comments. It also involves presenting the
33 code in an order that makes sense to a human, rather than an order
34 that makes sense to a compiler. For this reason a core part of any
35 literate programming tool is the ability to re-arrange the code found
36 in the document into a different order in the final code file - or
37 files. This requires some form of linkage to be encoded.
39 The approach taken here is focused around section headings - of any
42 All the code in any section is treated as a single sequential
43 collection of code, and is named by the section that it is in. If
44 multiple sections have the same name, then the code blocks in all of
45 them are joined together in the order they appear in the document.
47 A code section can contain a special marker which starts with 2
49 The text after the marker must be the name of some section which
50 contains code. Code from that section will be interpolated in place
51 of the marker, and will be indented to match the indent of the marker.
53 It is not permitted for the same code to be interpolated multiple
54 times. Allowing this might make some sense, but it is probably a
55 mistake, and prohibiting it make some of the code a bit cleaner.
57 Equally, every section of code should be interpolated at least once -
58 with two exceptions. These exceptions are imposed by the tool, not
59 the library. A different client could impose different rules on the
60 names of top-level code sections.
62 The first exception we have already seen. A section name starting
63 __Example:__ indicates code that is not to be included in the final product.
65 The second exception is for the top level code sections which will be
66 written to files. Again these are identified by their section name.
67 This must start with __File:__ the following text (after optional
68 spaces) will be used as a file name.
70 Any section containing code that does not start __Example:__ or
71 __File:__ must be included in some other section exactly once.
75 Allowing multiple top level code sections which name different files
76 means that one _markdown_ document can describe several files. This
77 is very useful with the C language where a program file and a header
78 file might be related. For the present document we will have a header
79 file and two code files, one with the library content and one for the
82 It will also be very convenient to create a `makefile` fragment to
83 ensure the code is compiled correctly. A simple `make -f mdcode.mk`
84 will "do the right thing".
90 mdcode.h libmdcode.c md2c.c mdcode.mk : mdcode.mdc
100 ### File: libmdcode.c
109 ## internal functions
114 libmdcode.o : libmdcode.c mdcode.h
115 $(CC) $(CFLAGS) -c libmdcode.c
132 md2c : md2c.o libmdcode.o
133 $(CC) $(CFLAGS) -o md2c md2c.o libmdcode.o
134 md2c.o : md2c.c mdcode.h
135 $(CC) $(CFLAGS) -c md2c.c
139 As the core purpose of _mdcode_ is to discover and re-arrange blocks
140 of text, it makes sense to map the whole document file into memory and
141 produce a data structure which lists various parts of the file in the
142 appropriate order. Each node in this structure will have some text
143 from the document, a child pointer, and a next pointer, any of which
144 might not be present. The text is most easily stored as a pointer and a
145 length. We'll call this a `text`
147 A list of these `code_nodes` will belong to each section and it will
148 be useful to have a separate `section` data structure to store the
149 list of `code_nodes`, the section name, and some other information.
151 This other information will include a reference counter so we can
152 ensure proper referencing, and an `indent` depth. As referenced
153 content can have an extra indent added, we need to know what that is.
154 The `code_node` will also have an `indent` depth which eventually gets
155 set to the sum for the indents from all references on the path from
158 Finally we need to know if the `code_node` was recognised by being
159 indented or not. If it was, the client of this data will want to
160 strip of the leading tab or 4 spaces. Hence a `needs_strip` flag is
172 struct code_node *code;
173 struct section *next;
181 struct code_node *next;
182 struct section *child;
189 struct code_node *last;
194 You will note that the `struct psection` contains an anonymous `struct
195 section` embedded at the start. To make this work right, GCC
196 requires the `-fplan9-extensions` flag.
198 ##### File: mdcode.mk
200 CFLAGS += -fplan9-extensions
202 ### Manipulating the node
204 Though a tree with `next` and `child` links is the easiest way to
205 assemble the various code sections, it is not the easiest form for
206 using them. For that a simple list would be best.
208 So once we have a fully linked File section we will want to linearize
209 it, so that the `child` links become `NULL` and the `next` links will
210 find everything required. It is at this stage that the requirements
211 that each section is linked only once becomes import.
213 `code_linearize` will merge the `code_node`s from any child into the
214 given `code_node`. As it does this it sets the 'indent' field for
217 Note that we don't clear the section's `last` pointer, even though
218 it no longer owns any code. This allows subsequent code to see if a
219 section ever had any code, and to report an error if a section is
220 referenced but not defined.
222 ##### internal functions
224 static void code_linearize(struct code_node *code)
227 for (t = code; t; t = t->next)
229 for (; code; code = code->next)
231 struct code_node *next = code->next;
232 struct psection *pchild =
233 (struct psection *)code->child;
234 int indent = pchild->indent;
235 code->next = code->child->code;
236 code->child->code = NULL;
238 for (t = code; t->next; t = t->next)
239 t->next->indent = code->indent + indent;
244 Once a client has made use of a linearized code set, it will probably
247 void code_free(struct code_node *code)
250 struct code_node *this;
252 code_linearize(code);
259 ##### exported functions
261 void code_free(struct code_node *code);
263 ### Building the tree
265 As we parse the document there are two things we will want to do to
266 node trees: add some text or add a reference. We'll assume for now
267 that the relevant section structures have been found, and will just
268 deal with the `code_node`.
270 Adding text simply means adding another node. We will never have
271 empty nodes, even if the last node only has a child, new text must go
274 ##### internal functions
276 static void code_add_text(struct psection *where, struct text txt,
277 int line_no, int needs_strip)
282 n = malloc(sizeof(*n));
285 n->line_no = line_no;
286 n->needs_strip = needs_strip;
290 where->last->next = n;
296 However when adding a link, we might be able to include it in the last
297 `code_node` if it currently only has text.
299 void code_add_link(struct psection *where, struct psection *to,
305 to->refcnt++; // this will be checked elsewhere
306 if (where->last && where->last->child == NULL) {
307 where->last->child = to;
310 n = malloc(sizeof(*n));
317 where->last->next = n;
325 Now we need a lookup table to be able to find sections by name.
326 Something that provides an `n*log(N)` search time is probably
327 justified, but for now I want a minimal stand-alone program so a
328 linked list managed by insertion-sort will do. As a comparison
329 function it is easiest to sort based on length before content. So
330 sections won't be in standard lexical order, but that isn't important.
332 If we cannot find a section, we simply want to create it. This allows
333 sections and references to be created in any order. Sections with
334 no references or no content will cause a warning eventually.
336 #### internal functions
338 static int text_cmp(struct text a, struct text b)
341 return a.len - b.len;
342 return strncmp(a.txt, b.txt, a.len);
345 static struct psection *section_find(struct psection **list, struct text name)
347 struct psection *new;
349 int cmp = text_cmp((*list)->section, name);
354 list = (struct psection **)&((*list)->next);
356 /* Add this section */
357 new = malloc(sizeof(*new));
368 ## Parsing the _markdown_
370 Parsing markdown is fairly easy, though there are complications.
372 The document is divided into "paragraphs" which are mostly separated by blank
373 lines (which may contain white space). The first few characters of
374 the first line of a paragraph determine the type of paragraph. For
375 our purposes we are only interested in list paragraphs, code
376 paragraphs, section headings, and everything else. Section headings
377 are single-line paragraphs and so do not require a preceding or
378 following blank line.
380 Section headings start with 1 or more hash characters (__#__). List
381 paragraphs start with hyphen, asterisk, plus, or digits followed by a
382 period. Code paragraphs aren't quite so easy.
384 The "standard" code paragraph starts with 4 or more spaces, or a tab.
385 However if the previous paragraph was a list paragraph, then those
386 spaces indicate another paragraph in the same list item, and 8 or
387 more spaces are required. Unless a nested list is in effect, in
388 which case 12 or more are need. Unfortunately not all _markdown_
389 parsers agree on nested lists.
391 Two alternate styles for marking code are in active use. "Github" uses
392 three backticks(_`` ``` ``_), while "pandoc" uses three or more tildes
393 (_~~~_). In these cases the code should not be indented.
395 Trying to please everyone as much as possible, this parser will handle
396 everything except for code inside lists.
398 So an indented (4+) paragraph after a list paragraph is always a list
399 paragraph, otherwise it is a code paragraph. A paragraph that starts
400 with three backticks or three tildes is code which continues until a
401 matching string of backticks or tildes.
405 While walking the document looking for various markers we will *not*
406 use the `struct text` introduced earlier as advancing that requires
407 updating both start and length which feels clumsy. Instead we will
408 carry `pos` and `end` pointers, only the first of which needs to
411 So to start, we need to skip various parts of the document. `lws`
412 stands for "Linear White Space" and is a term that comes from the
413 Email RFCs (e.g. RFC822). `line` and `para` are self explanatory.
414 Note that `skip_para` needs to update the current line number.
415 `skip_line` doesn't but every caller should.
417 #### internal functions
419 static char *skip_lws(char *pos, char *end)
421 while (pos < end && (*pos == ' ' || *pos == '\t'))
426 static char *skip_line(char *pos, char *end)
428 while (pos < end && *pos != '\n')
435 static char *skip_para(char *pos, char *end, int *line_no)
437 /* Might return a pointer to a blank line, as only
438 * one trailing blank line is skipped
441 pos = skip_line(pos, end);
447 *(pos = skip_lws(pos, end)) != '\n') {
448 pos = skip_line(pos, end);
451 if (pos < end && *pos == '\n') {
458 ### Recognising things
460 Recognising a section header is trivial and doesn't require a
461 function. However we need to extract the content of a section header
462 as a `struct text` for passing to `section_find`.
463 Recognising the start of a new list is fairly easy. Recognising the
464 start (and end) of code is a little messy so we provide a function for
465 matching the first few characters, which has a special case for "4
468 #### internal includes
473 #### internal functions
475 static struct text take_header(char *pos, char *end)
479 while (pos < end && *pos == '#')
481 while (pos < end && *pos == ' ')
484 while (pos < end && *pos != '\n')
486 while (pos > section.txt &&
487 (pos[-1] == '#' || pos[-1] == ' '))
489 section.len = pos - section.txt;
493 static int is_list(char *pos, char *end)
495 if (strchr("-*+", *pos))
498 while (pos < end && isdigit(*pos))
500 if (pos < end && *pos == '.')
506 static int matches(char *start, char *pos, char *end)
509 return matches("\t", pos, end) ||
510 matches(" ", pos, end);
511 return (pos + strlen(start) < end &&
512 strncmp(pos, start, strlen(start)) == 0);
515 ### Extracting the code
517 Now that we can skip paragraphs and recognise what type each paragraph
518 is, it is time to parse the file and extract the code. We'll do this
519 in two parts, first we look at what to do with some code once we
520 find it, and then how to actually find it.
522 When we have some code, we know where it is, what the end marker
523 should look like, and which section it is in.
525 There are two sorts of end markers: the presence of a particular
526 string, or the absence of an indent. We will use a string to
527 represent a presence, and a `NULL` to represent the absence.
529 While looking at code we don't think about paragraphs are all - just
530 look for a line that starts with the right thing.
531 Every line that is still code then needs to be examined to see if it
532 is a section reference.
534 When a section reference is found, all preceding code (if any) must be
535 added to the current section, then the reference is added.
537 When we do find the end of the code, all text that we have found but
538 not processed needs to be saved too.
540 When adding a reference we need to set the `indent`. This is the
541 number of spaces (counting 8 for tabs) after the natural indent of the
542 code (which is a tab or 4 spaces). We use a separate function `count_spaces`
545 #### internal functions
547 static int count_space(char *sol, char *p)
561 static char *take_code(char *pos, char *end, char *marker,
562 struct psection **table, struct text section,
566 int line_no = *line_nop;
567 int start_line = line_no;
568 struct psection *sect;
570 sect = section_find(table, section);
576 if (marker && matches(marker, pos, end))
579 (skip_lws(pos, end))[0] != '\n' &&
580 !matches(NULL, pos, end))
581 /* Paragraph not indented */
584 /* Still in code - check for reference */
589 else if (strcmp(sol, " ") == 0)
592 t = skip_lws(sol, end);
593 if (t[0] != '#' || t[1] != '#') {
594 /* Just regular code here */
595 pos = skip_line(sol, end);
603 txt.len = pos - start;
604 code_add_text(sect, txt, start_line,
607 ref = take_header(t, end);
609 struct psection *refsec = section_find(table, ref);
610 code_add_link(sect, refsec, count_space(sol, t));
612 pos = skip_line(t, end);
615 start_line = line_no;
620 txt.len = pos - start;
621 code_add_text(sect, txt, start_line,
625 pos = skip_line(pos, end);
634 It is when looking for the code that we actually use the paragraph
635 structure. We need to recognise section headings so we can record the
636 name, list paragraphs so we can ignore indented follow-on paragraphs,
637 and the three different markings for code.
639 #### internal functions
641 static struct psection *code_find(char *pos, char *end)
643 struct psection *table = NULL;
646 struct text section = {0};
650 section = take_header(pos, end);
652 pos = skip_line(pos, end);
654 } else if (is_list(pos, end)) {
656 pos = skip_para(pos, end, &line_no);
657 } else if (!in_list && matches(NULL, pos, end)) {
658 pos = take_code(pos, end, NULL, &table,
660 } else if (matches("```", pos, end)) {
662 pos = skip_line(pos, end);
664 pos = take_code(pos, end, "```", &table,
666 } else if (matches("~~~", pos, end)) {
668 pos = skip_line(pos, end);
670 pos = take_code(pos, end, "~~~", &table,
675 pos = skip_para(pos, end, &line_no);
681 ### Returning the code
683 Having found all the code blocks and gathered them into a list of
684 section, we are now ready to return them to the caller. This is where
685 to perform consistency checks, like at most one reference and at least
686 one definition for each section.
688 All the sections with no references are returned in a list for the
689 caller to consider. The are linearized first so that the substructure
690 is completely hidden -- except for the small amount of structure
691 displayed in the line numbers.
693 To return errors, we have the caller pass a function which takes an
694 error message - a `code_err_fn`.
698 typedef void (*code_err_fn)(char *msg);
700 #### internal functions
701 struct section *code_extract(char *pos, char *end, code_err_fn error)
703 struct psection *table;
704 struct section *result = NULL;
705 struct section *tofree = NULL;
707 table = code_find(pos, end);
710 struct psection *t = (struct psection*)table->next;
711 if (table->last == NULL) {
714 "Section \"%.*s\" is referenced but not declared",
715 table->section.len, table->section.txt);
719 if (table->refcnt == 0) {
720 /* Root-section, return it */
721 table->next = result;
723 code_linearize(result->code);
725 table->next = tofree;
727 if (table->refcnt > 1) {
730 "Section \"%.*s\" referenced multiple times (%d).",
731 table->section.len, table->section.txt,
740 struct section *t = tofree->next;
747 ##### exported functions
749 struct section *code_extract(char *pos, char *end, code_err_fn error);
754 Now that we can extract code from a document and link it all together
755 it is time to do something with that code. Firstly we need to print
758 ### Printing the Code
760 Printing is mostly straight forward - we just walk the list and print
761 the code sections, adding whatever indent is required for each line.
762 However there is a complication (isn't there always)?
764 For code that was recognised because the paragraph was indented, we
765 need to strip that indent first. For other code, we don't.
767 The approach taken here is simple, though it could arguably be wrong
768 in some unlikely cases. So it might need to be fixed later.
770 If the first line of a code block is indented, then either one tab or
771 4 spaces are striped from every non-blank line.
773 This could go wrong if the first line of a code block marked by
774 _`` ``` ``_ is indented. To overcome this we would need to
775 record some extra state in each `code_node`. For now we won't bother.
777 The indents we insert will all be spaces. This might not work well
780 ##### internal functions
782 void code_node_print(FILE *out, struct code_node *node,
785 for (; node; node = node->next) {
786 char *c = node->code.txt;
787 int len = node->code.len;
792 fprintf(out, "#line %d \"%s\"\n",
793 node->line_no, fname);
795 fprintf(out, "%*s", node->indent, "");
796 if (node->needs_strip) {
797 if (*c == '\t' && len > 1) {
800 } else if (strncmp(c, " ", 4) == 0 && len > 4) {
809 } while (len && c[-1] != '\n');
814 ###### exported functions
815 void code_node_print(FILE *out, struct code_node *node, char *fname);
817 ### Bringing it all together
819 We are just about ready for the `main` function of the tool which will
820 extract all this lovely code and compile it. Just one helper is still
823 #### Handling filenames
825 Section names are stored in `struct text` which is not `nul`
826 terminated. Filenames passed to `open` need to be null terminated.
827 So we need to convert one to the other, and strip the leading `File:`
828 of while we are at it.
830 ##### client functions
832 static void copy_fname(char *name, int space, struct text t)
837 if (len < 5 || strncmp(sec, "File:", 5) != 0)
841 while (len && sec[0] == ' ') {
847 strncpy(name, sec, len);
853 And now we take a single file name, extract the code, and if there are
854 no error we write out a file for each appropriate code section. And
858 ##### client includes
862 #include <sys/mman.h>
865 ##### client functions
868 static void pr_err(char *msg)
871 fprintf(stderr, "%s\n", msg);
874 int main(int argc, char *argv[])
879 struct section *table, *s, *prev;
883 fprintf(stderr, "Usage: mdcode file.mdc\n");
886 fd = open(argv[1], O_RDONLY);
888 fprintf(stderr, "mdcode: cannot open %s: %s\n",
889 argv[1], strerror(errno));
892 len = lseek(fd, 0, 2);
893 file = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
894 table = code_extract(file, file+len, pr_err);
897 (code_free(s->code), prev = s, s = s->next, free(prev))) {
900 if (strncmp(s->section.txt, "Example:", 8) == 0)
902 if (strncmp(s->section.txt, "File:", 5) != 0) {
903 fprintf(stderr, "Unreferenced section is not a file name: %.*s\n",
904 s->section.len, s->section.txt);
908 copy_fname(fname, sizeof(fname), s->section);
910 fprintf(stderr, "Missing file name at:%.*s\n",
911 s->section.len, s->section.txt);
915 fl = fopen(fname, "w");
917 fprintf(stderr, "Cannot create %s: %s\n",
918 fname, strerror(errno));
922 code_node_print(fl, s->code, argv[1]);