# mdcode: extract C code from a _markdown_ file.

_markdown_ is a popular format for simple text markup which can easily
be converted to HTML.  As it allows easy indication of sections of
code, it is quite suitable for use in literate programming.  This file
is an example of that usage.

The code included below provides two related functionalities.
Firstly it provides a library routine for extracting code out of a
_markdown_ file, so that other routines might make use of it.

Secondly it provides a simple client of this routine which extracts
1 or more C-language files from a markdown document so they can be
passed to a C compiler.  These two combine to make a tool that is needed
to compile this tool.  Yes, this is circular.  A prototype tool was
used for the first extraction.

The tool provided is described as specific to the C language as it
generates

##### Example: a _line_ command

        #line __line-number__ __file-name__

lines so that the C compiler will report where in the markdown file
any error is found.  This tool is suitable for any other language
which allows the same directive, or will treat it as a comment.

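##### Example: how a _line_ directive redirects diagnostics

As a minimal sketch of what such a directive does, the fragment below
places one before two small functions.  The file name `example.mdc`,
the line number 100 and both function names are hypothetical, chosen
only for illustration; they are not output of this tool.

```c
/* Pretend the next source line is line 100 of "example.mdc". */
#line 100 "example.mdc"
int reported_line(void) { return __LINE__; }
const char *reported_file(void) { return __FILE__; }
```

Any diagnostic the compiler issued for these lines would name
`example.mdc` and its line numbers rather than the generated C file.
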
## Literate Details

Literate programming is more than just including comments with the
code, even nicely formatted comments.  It also involves presenting the
code in an order that makes sense to a human, rather than an order
that makes sense to a compiler.  For this reason a core part of any
literate programming tool is the ability to re-arrange the code found
in the document into a different order in the final code file - or
files.  This requires some form of linkage to be encoded.

The approach taken here is focused around section headings - of any
depth.

All the code in any section is treated as a single sequential
collection of code, and is named by the section that it is in.  If
multiple sections have the same name, then the code blocks in all of
them are joined together in the order they appear in the document.

A code section can contain a special marker which starts with 2
hashes: __##__.
The text after the marker must be the name of some section which
contains code.  Code from that section will be interpolated in place
of the marker, and will be indented to match the indent of the marker.

It is not permitted for the same code to be interpolated multiple
times.  Allowing this might make some sense, but it is probably a
mistake, and prohibiting it makes some of the code a bit cleaner.

Equally, every section of code should be interpolated at least once -
with one exception.  This exception is imposed by the
tool, not the library.  A different client could impose different
rules on the names of top-level code sections.

One example of the exception we have already seen.  A section name
starting __Example:__ indicates code that is not to be included in the
final product.  Any leading word will do: provided the name contains a
space and the first space is preceded by a colon, that section will be
ignored.

A special case of this exception exists for the leading word
__File__.  These sections are the top level code sections and they
will be written to the named file.  Thus a section named
__File: foo__ should not be referenced by another section, and its
contents after all references are expanded will be written to the file
__foo__.

Any section containing code that does not start __Word:__
must be included in some other section exactly once.

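##### Example: classifying section names

The naming rules above can be sketched as a small classifier.  This is
only an illustration under stated assumptions: the enum, the function
name `classify_section` and its signature are hypothetical and are not
part of the library or the tool.

```c
#include <stddef.h>
#include <string.h>

enum section_kind {
        SECTION_REFERENCED,     /* must be included exactly once */
        SECTION_IGNORED,        /* "Word: ..." other than "File: ..." */
        SECTION_FILE            /* "File: name" is written to a file */
};

/* Classify a section name of 'len' bytes (not NUL terminated). */
enum section_kind classify_section(const char *name, size_t len)
{
        const char *sp = memchr(name, ' ', len);

        /* No space, or first space not preceded by a colon. */
        if (sp == NULL || sp == name || sp[-1] != ':')
                return SECTION_REFERENCED;
        if (sp - name == 5 && strncmp(name, "File:", 5) == 0)
                return SECTION_FILE;
        return SECTION_IGNORED;
}
```

With this sketch `File: foo` is a file section, `Example: a line` is
ignored, and `exported types` must be referenced from elsewhere.
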
### Multiple files

Allowing multiple top level code sections which name different files
means that one _markdown_ document can describe several files.  This
is very useful with the C language where a program file and a header
file might be related.  For the present document we will have a header
file and two code files, one with the library content and one for the
tool.

It will also be very convenient to create a `makefile` fragment to
ensure the code is compiled correctly.  A simple `make -f mdcode.mk`
will "do the right thing".

### File: mdcode.mk

        CFLAGS += -Wall -g
        all::
        mdcode.h libmdcode.c md2c.c mdcode.mk :  mdcode.mdc
                ./md2c mdcode.mdc


### File: mdcode.h

        #include <stdio.h>
        ## exported types
        ## exported functions

### File: libmdcode.c
        #define _GNU_SOURCE
        #include <unistd.h>
        #include <stdlib.h>
        #include <stdio.h>

        #include "mdcode.h"
        ## internal includes
        ## private types
        ## internal functions

### File: mdcode.mk

        all :: libmdcode.o
        libmdcode.o : libmdcode.c mdcode.h
                $(CC) $(CFLAGS) -c libmdcode.c


### File: md2c.c

        #include <unistd.h>
        #include <stdlib.h>
        #include <stdio.h>

        #include "mdcode.h"

        ## client includes
        ## client functions

### File: mdcode.mk

        all :: md2c
        md2c : md2c.o libmdcode.o
                $(CC) $(CFLAGS) -o md2c md2c.o libmdcode.o
        md2c.o : md2c.c mdcode.h
                $(CC) $(CFLAGS) -c md2c.c

## Data Structures

As the core purpose of _mdcode_ is to discover and re-arrange blocks
of text, it makes sense to map the whole document file into memory and
produce a data structure which lists various parts of the file in the
appropriate order.  Each node in this structure will have some text
from the document, a child pointer, and a next pointer, any of which
might not be present.  The text is most easily stored as a pointer and a
length.  We'll call this a `text`.

A list of these `code_nodes` will belong to each section and it will
be useful to have a separate `section` data structure to store the
list of `code_nodes`, the section name, and some other information.

This other information will include a reference counter so we can
ensure proper referencing, and an `indent` depth.  As referenced
content can have an extra indent added, we need to know what that is.
The `code_node` will also have an `indent` depth which eventually gets
set to the sum for the indents from all references on the path from
the root.

Finally we need to know if the `code_node` was recognised by being
indented or not.  If it was, the client of this data will want to
strip off the leading tab or 4 spaces.  Hence a `needs_strip` flag is
needed.

##### exported types

        struct text {
                char *txt;
                int len;
        };

        struct section {
                struct text section;
                struct code_node *code;
                struct section *next;
        };

        struct code_node {
                struct text code;
                int indent;
                int line_no;
                int needs_strip;
                struct code_node *next;
                struct section *child;
        };

##### private types

        struct psection {
                struct section;
                struct code_node *last;
                int refcnt;
                int indent;
        };

You will note that the `struct psection` contains an anonymous `struct
section` embedded at the start.  To make this work right, GCC
requires the `-fplan9-extensions` flag.

##### File: mdcode.mk

        CFLAGS += -fplan9-extensions

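##### Example: embedding with a named first member

For comparison only, a similar public/private layering can be sketched
in standard C by giving the embedded structure a name and converting
pointers explicitly.  This is not what this document does; the names
`pub`, `priv`, `to_pub` and `to_priv` are hypothetical.

```c
#include <stddef.h>

/* Public part, as a client of a library would see it. */
struct pub {
        const char *name;
        struct pub *next;
};

/* Private part: the public structure is the first, named, member. */
struct priv {
        struct pub pub;
        int refcnt;
};

/* A pointer to a structure can be converted to a pointer to its
 * first member and back again, so both conversions are safe. */
struct pub *to_pub(struct priv *p) { return &p->pub; }
struct priv *to_priv(struct pub *p) { return (struct priv *)p; }
```

The cost of this form is that every access to a public field from
private code needs the extra `.pub` step, which the anonymous
embedding avoids.
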
### Manipulating the node

Though a tree with `next` and `child` links is the easiest way to
assemble the various code sections, it is not the easiest form for
using them.  For that a simple list would be best.

So once we have a fully linked File section we will want to linearize
it, so that the `child` links become `NULL` and the `next` links will
find everything required.  It is at this stage that the requirement
that each section is linked only once becomes important.

`code_linearize` will merge the `code_node`s from any child into the
given `code_node`.  As it does this it sets the `indent` field for
each `code_node`.

Note that we don't clear the section's `last` pointer, even though
it no longer owns any code.  This allows subsequent code to see if a
section ever had any code, and to report an error if a section is
referenced but not defined.

##### internal functions

        static void code_linearize(struct code_node *code)
        {
                struct code_node *t;
                for (t = code; t; t = t->next)
                        t->indent = 0;
                for (; code; code = code->next)
                        if (code->child) {
                                struct code_node *next = code->next;
                                struct psection *pchild =
                                        (struct psection *)code->child;
                                int indent = pchild->indent;
                                code->next = code->child->code;
                                code->child->code = NULL;
                                code->child = NULL;
                                for (t = code; t->next; t = t->next)
                                        t->next->indent = code->indent + indent;
                                t->next = next;
                        }
        }

Once a client has made use of a linearized code set, it will probably
want to free it.

        void code_free(struct code_node *code)
        {
                while (code) {
                        struct code_node *this;
                        if (code->child)
                                code_linearize(code);
                        this = code;
                        code = code->next;
                        free(this);
                }
        }

##### exported functions

        void code_free(struct code_node *code);

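##### Example: linearizing a small tree

The effect of linearizing can be seen on a tiny tree.  The sketch
below mirrors the logic of `code_linearize` using simplified,
hypothetical types (`struct node`, `struct sect`) so that it stands
alone; it is an illustration, not the library code.

```c
#include <stddef.h>

struct sect;
struct node {
        const char *text;
        int indent;
        struct node *next;
        struct sect *child;
};
struct sect {
        struct node *code;
        int indent;             /* extra indent given by the reference */
};

void linearize(struct node *code)
{
        struct node *t;

        for (t = code; t; t = t->next)
                t->indent = 0;
        for (; code; code = code->next) {
                struct node *rest;
                int extra;

                if (!code->child)
                        continue;
                rest = code->next;
                extra = code->child->indent;
                /* Splice the child's nodes in after this node. */
                code->next = code->child->code;
                code->child->code = NULL;
                code->child = NULL;
                for (t = code; t->next; t = t->next)
                        t->next->indent = code->indent + extra;
                t->next = rest;
        }
}
```

If node `a` references a section (indent 8) holding `b`, and `b`
references a section (indent 4) holding `c`, the list `a, d` becomes
`a, b, c, d` with indents 0, 8, 12 and 0.
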
### Building the tree

As we parse the document there are two things we will want to do to
node trees: add some text or add a reference.  We'll assume for now
that the relevant section structures have been found, and will just
deal with the `code_node`.

Adding text simply means adding another node.  We will never add text
to an existing node: even if the last node only has a child, new text
must go in a new node.

##### internal functions

        static void code_add_text(struct psection *where, struct text txt,
                                  int line_no, int needs_strip)
        {
                struct code_node *n;
                if (txt.len == 0)
                        return;
                n = malloc(sizeof(*n));
                n->code = txt;
                n->indent = 0;
                n->line_no = line_no;
                n->needs_strip = needs_strip;
                n->next = NULL;
                n->child = NULL;
                if (where->last)
                        where->last->next = n;
                else
                        where->code = n;
                where->last = n;
        }

However when adding a link, we might be able to include it in the last
`code_node` if it currently only has text.

        void code_add_link(struct psection *where, struct psection *to,
                           int indent)
        {
                struct code_node *n;

                to->indent = indent;
                to->refcnt++;   // this will be checked elsewhere
                if (where->last && where->last->child == NULL) {
                        where->last->child = to;
                        return;
                }
                n = malloc(sizeof(*n));
                n->code.txt = NULL;     /* a pure link node carries no text */
                n->code.len = 0;
                n->indent = 0;
                n->line_no = 0;
                n->needs_strip = 0;
                n->next = NULL;
                n->child = to;
                if (where->last)
                        where->last->next = n;
                else
                        where->code = n;
                where->last = n;
        }

### Finding sections

Now we need a lookup table to be able to find sections by name.
Something that provides an `n*log(N)` search time is probably
justified, but for now I want a minimal stand-alone program so a
linked list managed by insertion-sort will do.

The text compare function will likely be useful for any clients of our
library, so we may as well export it.

If we cannot find a section, we simply want to create it.  This allows
sections and references to be created in any order.  Sections with
no references or no content will cause a warning eventually.

#### exported functions

        int text_cmp(struct text a, struct text b);

#### internal functions

        int text_cmp(struct text a, struct text b)
        {
                int len = a.len;
                if (len > b.len)
                        len = b.len;
                int cmp = strncmp(a.txt, b.txt, len);
                if (cmp)
                        return cmp;
                else
                        return a.len - b.len;
        }

        static struct psection *section_find(struct psection **list, struct text name)
        {
                struct psection *new;
                while (*list) {
                        int cmp = text_cmp((*list)->section, name);
                        if (cmp == 0)
                                return *list;
                        if (cmp > 0)
                                break;
                        list = (struct psection **)&((*list)->next);
                }
                /* Add this section */
                new = malloc(sizeof(*new));
                new->next = *list;
                *list = new;
                new->section = name;
                new->code = NULL;
                new->last = NULL;
                new->refcnt = 0;
                new->indent = 0;
                return new;
        }

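##### Example: find-or-create in a sorted list

The find-or-create pattern used by `section_find` can be sketched on
its own with ordinary C strings.  The type `struct entry` and the
function `find_or_add` are hypothetical stand-ins, used only so the
example is self-contained.

```c
#include <stdlib.h>
#include <string.h>

struct entry {
        const char *name;
        struct entry *next;
};

/* Return the entry called 'name', inserting it in sorted
 * position if it is not already present. */
struct entry *find_or_add(struct entry **list, const char *name)
{
        struct entry *e;

        while (*list) {
                int cmp = strcmp((*list)->name, name);
                if (cmp == 0)
                        return *list;
                if (cmp > 0)
                        break;          /* passed the insertion point */
                list = &(*list)->next;
        }
        e = malloc(sizeof(*e));
        if (e == NULL)
                return NULL;
        e->name = name;
        e->next = *list;
        *list = e;
        return e;
}
```

Walking with a pointer-to-pointer means insertion at the head, in the
middle and at the tail are all handled by the same two assignments.
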
## Parsing the _markdown_

Parsing markdown is fairly easy, though there are complications.

The document is divided into "paragraphs" which are mostly separated by blank
lines (which may contain white space).  The first few characters of
the first line of a paragraph determine the type of paragraph.  For
our purposes we are only interested in list paragraphs, code
paragraphs, section headings, and everything else.  Section headings
are single-line paragraphs and so do not require a preceding or
following blank line.

Section headings start with 1 or more hash characters (__#__).  List
paragraphs start with hyphen, asterisk, plus, or digits followed by a
period.  Code paragraphs aren't quite so easy.

The "standard" code paragraph starts with 4 or more spaces, or a tab.
However if the previous paragraph was a list paragraph, then those
spaces indicate another paragraph in the same list item, and 8 or
more spaces are required.  Unless a nested list is in effect, in
which case 12 or more are needed.  Unfortunately not all _markdown_
parsers agree on nested lists.

Two alternate styles for marking code are in active use.  "GitHub" uses
three backticks (_`` ``` ``_), while "pandoc" uses three or more tildes
(_~~~_).  In these cases the code should not be indented.

Trying to please everyone as much as possible, this parser will handle
everything except for code inside lists.

So an indented (4+) paragraph after a list paragraph is always a list
paragraph, otherwise it is a code paragraph.  A paragraph that starts
with three backticks or three tildes is code which continues until a
matching string of backticks or tildes.

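##### Example: classifying the first line of a paragraph

The rules above can be summarised as a decision on the first line of a
paragraph plus one bit of state.  The following is a rough sketch
only: the enum and `classify_para` are hypothetical, it works on a
NUL-terminated line for simplicity, and like the rules above it
treats any leading hyphen, asterisk or plus as a list marker.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum para_kind {
        PARA_HEADING,
        PARA_LIST,
        PARA_CODE_INDENTED,
        PARA_CODE_FENCED,
        PARA_OTHER
};

/* 'in_list' is non-zero if the previous paragraph was a list. */
enum para_kind classify_para(const char *line, int in_list)
{
        size_t i = 0;

        if (line[0] == '#')
                return PARA_HEADING;
        if (line[0] != '\0' && strchr("-*+", line[0]))
                return PARA_LIST;
        while (isdigit((unsigned char)line[i]))
                i++;
        if (i > 0 && line[i] == '.')
                return PARA_LIST;
        if (strncmp(line, "```", 3) == 0 || strncmp(line, "~~~", 3) == 0)
                return PARA_CODE_FENCED;
        if (line[0] == '\t' || strncmp(line, "    ", 4) == 0)
                /* Indented text after a list continues the list. */
                return in_list ? PARA_LIST : PARA_CODE_INDENTED;
        return PARA_OTHER;
}
```

Note that a fenced block is recognised as code even directly after a
list, while an indented block is not.
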
### Skipping bits

While walking the document looking for various markers we will *not*
use the `struct text` introduced earlier as advancing that requires
updating both start and length which feels clumsy.  Instead we will
carry `pos` and `end` pointers, only the first of which needs to
change.

So to start, we need to skip various parts of the document.  `lws`
stands for "Linear White Space" and is a term that comes from the
Email RFCs (e.g. RFC822).  `line` and `para` are self explanatory.
Note that `skip_para` needs to update the current line number.
`skip_line` doesn't but every caller should.

#### internal functions

        static char *skip_lws(char *pos, char *end)
        {
                while (pos < end && (*pos == ' ' || *pos == '\t'))
                        pos++;
                return pos;
        }

        static char *skip_line(char *pos, char *end)
        {
                while (pos < end && *pos != '\n')
                        pos++;
                if (pos < end)
                        pos++;
                return pos;
        }

        static char *skip_para(char *pos, char *end, int *line_no)
        {
                /* Might return a pointer to a blank line, as only
                 * one trailing blank line is skipped
                 */
                if (*pos == '#') {
                        pos = skip_line(pos, end);
                        (*line_no) += 1;
                        return pos;
                }
                while (pos < end &&
                       *pos != '#' &&
                       *(pos = skip_lws(pos, end)) != '\n') {
                        pos = skip_line(pos, end);
                        (*line_no) += 1;
                }
                if (pos < end && *pos == '\n') {
                        pos++;
                        (*line_no) += 1;
                }
                return pos;
        }

### Recognising things

Recognising a section header is trivial and doesn't require a
function.  However we need to extract the content of a section header
as a `struct text` for passing to `section_find`.
Recognising the start of a new list is fairly easy.  Recognising the
start (and end) of code is a little messy so we provide a function for
matching the first few characters, which has a special case for "4
spaces or tab".

#### internal includes

        #include  <ctype.h>
        #include  <string.h>

#### internal functions

        static struct text take_header(char *pos, char *end)
        {
                struct text section;

                while (pos < end && *pos == '#')
                        pos++;
                while (pos < end && *pos == ' ')
                        pos++;
                section.txt = pos;
                while (pos < end && *pos != '\n')
                        pos++;
                while (pos > section.txt &&
                       (pos[-1] == '#' || pos[-1] == ' '))
                        pos--;
                section.len = pos - section.txt;
                return section;
        }

        static int is_list(char *pos, char *end)
        {
                if (strchr("-*+", *pos))
                        return 1;
                if (isdigit(*pos)) {
                        while (pos < end && isdigit(*pos))
                                pos += 1;
                        if  (pos < end && *pos == '.')
                                return 1;
                }
                return 0;
        }

        static int matches(char *start, char *pos, char *end)
        {
                if (start == NULL)
                        return matches("\t", pos, end) ||
                               matches("    ", pos, end);
                return (pos + strlen(start) < end &&
                        strncmp(pos, start, strlen(start)) == 0);
        }

### Extracting the code

Now that we can skip paragraphs and recognise what type each paragraph
is, it is time to parse the file and extract the code.  We'll do this
in two parts, first we look at what to do with some code once we
find it, and then how to actually find it.

When we have some code, we know where it is, what the end marker
should look like, and which section it is in.

There are two sorts of end markers: the presence of a particular
string, or the absence of an indent.  We will use a string to
represent a presence, and a `NULL` to represent the absence.

While looking at code we don't think about paragraphs at all - just
look for a line that starts with the right thing.
Every line that is still code then needs to be examined to see if it
is a section reference.

When a section reference is found, all preceding code (if any) must be
added to the current section, then the reference is added.

When we do find the end of the code, all text that we have found but
not processed needs to be saved too.

When adding a reference we need to set the `indent`.  This is the
number of spaces (counting 8 for tabs) after the natural indent of the
code (which is a tab or 4 spaces).  We use a separate function `count_space`
for that.

If there are completely blank lines (no indent) at the end of the found code,
these should be considered to be spacing between the code and the next section,
and so not included in the code.  When a marker is used to explicitly mark the
end of the code, we don't need to check for these blank lines.

#### internal functions

        static int count_space(char *sol, char *p)
        {
                int c = 0;
                while (sol < p) {
                        if (sol[0] == ' ')
                                c++;
                        if (sol[0] == '\t')
                                c+= 8;
                        sol++;
                }
                return c;
        }


        static char *take_code(char *pos, char *end, char *marker,
                               struct psection **table, struct text section,
                               int *line_nop)
        {
                char *start = pos;
                int line_no = *line_nop;
                int start_line = line_no;
                struct psection *sect;

                sect = section_find(table, section);

                while (pos < end) {
                        char *sol, *t;
                        struct text ref;

                        if (marker && matches(marker, pos, end))
                                break;
                        if (!marker &&
                            (skip_lws(pos, end))[0] != '\n' &&
                            !matches(NULL, pos, end))
                                /* Paragraph not indented */
                                break;

                        /* Still in code - check for reference */
                        sol = pos;
                        if (!marker) {
                                /* Skip the natural indent: a tab or 4 spaces */
                                if (*sol == '\t')
                                        sol++;
                                else if (strncmp(sol, "    ", 4) == 0)
                                        sol += 4;
                        }
                        t = skip_lws(sol, end);
                        if (t[0] != '#' || t[1] != '#') {
                                /* Just regular code here */
                                pos = skip_line(sol, end);
                                line_no++;
                                continue;
                        }

                        if (pos > start) {
                                struct text txt;
                                txt.txt = start;
                                txt.len = pos - start;
                                code_add_text(sect, txt, start_line,
                                              marker == NULL);
                        }
                        ref = take_header(t, end);
                        if (ref.len) {
                                struct psection *refsec = section_find(table, ref);
                                code_add_link(sect, refsec, count_space(sol, t));
                        }
                        pos = skip_line(t, end);
                        line_no++;
                        start = pos;
                        start_line = line_no;
                }
                if (pos > start) {
                        struct text txt;
                        txt.txt = start;
                        txt.len = pos - start;
                        /* strip trailing blank lines */
                        while (!marker && txt.len > 2 &&
                               start[txt.len-1] == '\n' &&
                               start[txt.len-2] == '\n')
                                txt.len -= 1;

                        code_add_text(sect, txt, start_line,
                                      marker == NULL);
                }
                if (marker) {
                        pos = skip_line(pos, end);
                        line_no++;
                }
                *line_nop = line_no;
                return pos;
        }

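##### Example: recognising a reference line

The treatment of one line of indented code can be sketched in
isolation: drop the natural indent, count the extra indent, then look
for the two hashes.  The function `parse_ref` below is hypothetical
and works on a NUL-terminated line for simplicity; the library itself
works on `pos` and `end` pointers as shown above.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 and fills in the outputs if 'line' is a section
 * reference inside an indented code block, otherwise returns 0. */
int parse_ref(const char *line, int *indent,
              const char **name, size_t *len)
{
        const char *p = line;
        const char *e;
        int spaces = 0;

        /* Drop the natural indent of an indented code block. */
        if (*p == '\t')
                p++;
        else if (strncmp(p, "    ", 4) == 0)
                p += 4;
        /* Count the extra indent: 1 per space, 8 per tab. */
        for (; *p == ' ' || *p == '\t'; p++)
                spaces += (*p == '\t') ? 8 : 1;
        if (p[0] != '#' || p[1] != '#')
                return 0;
        while (*p == '#')
                p++;
        while (*p == ' ')
                p++;
        /* The name runs to end of line, less trailing '#' and ' '. */
        for (e = p; *e != '\0' && *e != '\n'; e++)
                ;
        while (e > p && (e[-1] == '#' || e[-1] == ' '))
                e--;
        *indent = spaces;
        *name = p;
        *len = (size_t)(e - p);
        return 1;
}
```

A line made of 4 spaces, a tab and a reference to `internal includes`
therefore yields that name with an indent of 8.
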
### Finding the code

It is when looking for the code that we actually use the paragraph
structure.  We need to recognise section headings so we can record the
name, list paragraphs so we can ignore indented follow-on paragraphs,
and the three different markings for code.

#### internal functions

        static struct psection *code_find(char *pos, char *end)
        {
                struct psection *table = NULL;
                int in_list = 0;
                int line_no = 1;
                struct text section = {0};

                while (pos < end) {
                        if (pos[0] == '#') {
                                section = take_header(pos, end);
                                in_list = 0;
                                pos = skip_line(pos, end);
                                line_no++;
                        } else if (is_list(pos, end)) {
                                in_list = 1;
                                pos = skip_para(pos, end, &line_no);
                        } else if (!in_list && matches(NULL, pos, end)) {
                                pos = take_code(pos, end, NULL, &table,
                                                section, &line_no);
                        } else if (matches("```", pos, end)) {
                                in_list = 0;
                                pos = skip_line(pos, end);
                                line_no++;
                                pos = take_code(pos, end, "```", &table,
                                                section, &line_no);
                        } else if (matches("~~~", pos, end)) {
                                in_list = 0;
                                pos = skip_line(pos, end);
                                line_no++;
                                pos = take_code(pos, end, "~~~", &table,
                                                section, &line_no);
                        } else {
                                if (!isspace(*pos))
                                        in_list = 0;
                                pos = skip_para(pos, end, &line_no);
                        }
                }
                return table;
        }

### Returning the code

Having found all the code blocks and gathered them into a list of
sections, we are now ready to return them to the caller.  This is where
to perform consistency checks, like at most one reference and at least
one definition for each section.

All the sections with no references are returned in a list for the
caller to consider.  They are linearized first so that the substructure
is completely hidden -- except for the small amount of structure
displayed in the line numbers.

To return errors, we have the caller pass a function which takes an
error message - a `code_err_fn`.

#### exported types

        typedef void (*code_err_fn)(char *msg);

#### internal functions
        struct section *code_extract(char *pos, char *end, code_err_fn error)
        {
                struct psection *table;
                struct section *result = NULL;
                struct section *tofree = NULL;

                table = code_find(pos, end);

                while (table) {
                        struct psection *t = (struct psection*)table->next;
                        if (table->last == NULL) {
                                char *msg;
                                asprintf(&msg,
                                        "Section \"%.*s\" is referenced but not declared",
                                         table->section.len, table->section.txt);
                                error(msg);
                                free(msg);
                        }
                        if (table->refcnt == 0) {
                                /* Root-section,  return it */
                                table->next = result;
                                result = table;
                                code_linearize(result->code);
                        } else {
                                table->next = tofree;
                                tofree = table;
                                if (table->refcnt > 1) {
                                        char *msg;
                                        asprintf(&msg,
                                                 "Section \"%.*s\" referenced multiple times (%d).",
                                                 table->section.len, table->section.txt,
                                                 table->refcnt);
                                        error(msg);
                                        free(msg);
                                }
                        }
                        table = t;
                }
                while (tofree) {
                        struct section *t = tofree->next;
                        free(tofree);
                        tofree = t;
                }
                return result;
        }

##### exported functions

        struct section *code_extract(char *pos, char *end, code_err_fn error);


## Using the library

Now that we can extract code from a document and link it all together,
it is time to do something with that code.  Firstly we need to print
it out.

### Printing the Code

Printing is mostly straightforward - we just walk the list and print
the code sections, adding whatever indent is required for each line.
However, there is a complication (isn't there always?).

For code that was recognised because the paragraph was indented, we
need to strip that indent first.  For other code, we don't.

The approach taken here is simple, though it could arguably be wrong
in some unlikely cases.  So it might need to be fixed later.

If the first line of a code block is indented, then either one tab or
4 spaces are stripped from every non-blank line.

This could go wrong if the first line of a code block marked by
_`` ``` ``_ is indented.  To overcome this we would need to
record some extra state in each `code_node`.  For now we won't bother.

The indents we insert will mostly be spaces.  All-spaces doesn't work
for `Makefiles`, so if the indent is 8 or more, we use a TAB first.

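##### Example: the indent rules in isolation

The two rules above can be sketched as a pair of small helpers.  This
is only an illustration of the rules as described in the text;
`strip_len` and `indent_prefix` are hypothetical names that do not
appear elsewhere in this document.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: how many leading bytes to drop from one line.
 * One tab or four spaces are removed, but never the whole line. */
static int strip_len(const char *c, int len)
{
        if (len > 1 && c[0] == '\t')
                return 1;
        if (len > 4 && memcmp(c, "    ", 4) == 0)
                return 4;
        return 0;
}

/* Hypothetical helper: build the indent placed in front of a line.
 * From 8 columns up, a TAB replaces the first 8 spaces. */
static int indent_prefix(char *buf, size_t size, int indent)
{
        assert(indent >= 0);
        if (indent >= 8)
                return snprintf(buf, size, "\t%*s", indent - 8, "");
        return snprintf(buf, size, "%*s", indent, "");
}
```

So an indent of 4 is four spaces, while an indent of 10 is a TAB
followed by two spaces, which keeps recipe lines in a `Makefile`
valid.
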
##### internal functions

        void code_node_print(FILE *out, struct code_node *node,
                             char *fname)
        {
                for (; node; node = node->next) {
                        char *c = node->code.txt;
                        int len = node->code.len;

                        if (!len)
                                continue;

                        fprintf(out, "#line %d \"%s\"\n",
                                node->line_no, fname);
                        while (len && *c) {
                                /* A TAB stands in for the first 8 columns of indent */
                                if (node->indent >= 8)
                                        fprintf(out, "\t%*s", node->indent - 8, "");
                                else
                                        fprintf(out, "%*s", node->indent, "");
                                if (node->needs_strip) {
                                        /* Check len first so we never read past the text */
                                        if (*c == '\t' && len > 1) {
                                                c++;
                                                len--;
                                        } else if (len > 4 && strncmp(c, "    ", 4) == 0) {
                                                c += 4;
                                                len -= 4;
                                        }
                                }
                                /* Copy one line, up to and including the newline */
                                do {
                                        fputc(*c, out);
                                        c++;
                                        len--;
                                } while (len && c[-1] != '\n');
                        }
                }
        }

###### exported functions

        void code_node_print(FILE *out, struct code_node *node, char *fname);

### Bringing it all together

We are just about ready for the `main` function of the tool which will
extract all this lovely code so that it can be compiled.  Just one
helper is still needed.

#### Handling filenames

Section names are stored in `struct text` which is not `nul`
terminated.  Filenames passed to `fopen` need to be `nul` terminated.
So we need to convert one to the other, and strip the leading `File:`
off while we are at it.

##### client functions

        static void copy_fname(char *name, int space, struct text t)
        {
                char *sec = t.txt;
                int len = t.len;
                name[0] = 0;
                if (len < 5 || strncmp(sec, "File:", 5) != 0)
                        return;
                sec += 5;
                len -= 5;
                while (len && sec[0] == ' ') {
                        sec++;
                        len--;
                }
                if (len >= space)
                        len = space - 1;
                strncpy(name, sec, len);
                name[len] = 0;
        }

#### Main

And now we take a single file name, extract the code, and if there are
no errors we write out a file for each appropriate code section.  If a
section name is also given, that section is printed to standard output
instead.  And we are done.

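##### Example: which unreferenced sections become files

The way `main` treats the name of each unreferenced section can be
sketched as a small classifier.  This is an illustration only: it
ignores the optional section argument, it takes a plain string and
length rather than a `struct text`, and `classify` and the `KIND_*`
names are hypothetical.

```c
#include <assert.h>
#include <string.h>

enum kind { KIND_FILE, KIND_IGNORED, KIND_ERROR };  /* hypothetical names */

/* Hypothetical helper: classify an unreferenced section by its name. */
static enum kind classify(const char *name, int len)
{
        const char *spc;

        assert(len >= 0);
        spc = memchr(name, ' ', (size_t)len);
        /* The first word must end in ':' for the section to be accepted */
        if (!spc || spc == name || spc[-1] != ':')
                return KIND_ERROR;
        if (len >= 6 && strncmp(name, "File: ", 6) == 0)
                return KIND_FILE;       /* written out under that file name */
        return KIND_IGNORED;            /* for instance an "Example: ..." section */
}
```

Under this rule a section named `File: foo.c` (a made-up name) is
written out, one named `Example: a line` is skipped, and an
unreferenced section whose first word does not end in a colon is
reported as an error.
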
##### client includes

        #include <fcntl.h>
        #include <errno.h>
        #include <sys/mman.h>
        #include <string.h>

##### client functions

        static int errs;
        static void pr_err(char *msg)
        {
                errs++;
                fprintf(stderr, "%s\n", msg);
        }

        static char *strnchr(char *haystack, int len, char needle)
        {
                while (len > 0 && *haystack && *haystack != needle) {
                        haystack++;
                        len--;
                }
                return len > 0 && *haystack == needle ? haystack : NULL;
        }

        int main(int argc, char *argv[])
        {
                int fd;
                size_t len;
                char *file;
                struct text section = {NULL, 0};
                struct section *table, *s, *prev;

                errs = 0;
                if (argc != 2 && argc != 3) {
                        fprintf(stderr, "Usage: mdcode file.mdc [section]\n");
                        exit(2);
                }
                if (argc == 3) {
                        section.txt = argv[2];
                        section.len = strlen(argv[2]);
                }

                fd = open(argv[1], O_RDONLY);
                if (fd < 0) {
                        fprintf(stderr, "mdcode: cannot open %s: %s\n",
                                argv[1], strerror(errno));
                        exit(1);
                }
                len = lseek(fd, 0, SEEK_END);
                file = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
                if (file == MAP_FAILED) {
                        fprintf(stderr, "mdcode: cannot map %s: %s\n",
                                argv[1], strerror(errno));
                        exit(1);
                }
                table = code_extract(file, file+len, pr_err);

                for (s = table; s;
                        (code_free(s->code), prev = s, s = s->next, free(prev))) {
                        FILE *fl;
                        char fname[1024];
                        char *spc = strnchr(s->section.txt, s->section.len, ' ');

                        if (spc && spc > s->section.txt && spc[-1] == ':') {
                                if (strncmp(s->section.txt, "File: ", 6) != 0 &&
                                    (section.txt == NULL ||
                                     text_cmp(s->section, section) != 0))
                                        /* Ignore this section */
                                        continue;
                        } else {
                                fprintf(stderr, "Code in unreferenced section that is not ignored or a file name: %.*s\n",
                                        s->section.len, s->section.txt);
                                errs++;
                                continue;
                        }
                        if (section.txt) {
                                /* Only the requested section is wanted; keep
                                 * looking past any other "File:" sections */
                                if (text_cmp(s->section, section) != 0)
                                        continue;
                                code_node_print(stdout, s->code, argv[1]);
                                break;
                        }
                        copy_fname(fname, sizeof(fname), s->section);
                        if (fname[0] == 0) {
                                fprintf(stderr, "Missing file name at:%.*s\n",
                                        s->section.len, s->section.txt);
                                errs++;
                                continue;
                        }
                        fl = fopen(fname, "w");
                        if (!fl) {
                                fprintf(stderr, "Cannot create %s: %s\n",
                                        fname, strerror(errno));
                                errs++;
                                continue;
                        }
                        code_node_print(fl, s->code, argv[1]);
                        fclose(fl);
                }
                exit(!!errs);
        }