   1 # mdcode: extract C code from a _markdown_ file.
   2
   3 _markdown_ is a popular format for simple text markup which can easily
   4 be converted to HTML.  As it allows easy indication of sections of
   5 code, it is quite suitable for use in literate programming.  This file
   6 is an example of that usage.
   7
   8 The code included below provides two related functionalities.
   9 Firstly it provides a library routine for extracting code out of a
  10 _markdown_ file, so that other routines might make use of it.
  11
Secondly it provides a simple client of this routine which extracts
one or more C-language files from a markdown document so they can be
passed to a C compiler.  These two combine to make a tool that is needed
to compile this tool.  Yes, this is circular.  A prototype tool was
used for the first extraction.
  17
  18 The tool provided is described as specific to the C language as it
  19 generates
  20
  21 ##### Example: a _line_ command
  22
  23         #line __line-number__ __file-name__
  24
  25 lines so that the C compiler will report where in the markdown file
  26 any error is found.  This tool is suitable for any other language
  27 which allows the same directive, or will treat it as a comment.
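
As a rough illustration of such a directive being produced, here is a
minimal sketch.  The helper name `format_line_directive` is hypothetical
and is not part of the tool.  Quoting the file name is what the C
preprocessor expects for a `#line` directive.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, not part of mdcode: format a directive that
 * points the compiler at line `line_no` of `file_name`.
 * Returns the number of characters written, or -1 if `buf` is too small.
 */
int format_line_directive(char *buf, size_t size,
                          int line_no, const char *file_name)
{
        int n;

        assert(buf != NULL && file_name != NULL);
        n = snprintf(buf, size, "#line %d \"%s\"\n", line_no, file_name);
        if (n < 0 || (size_t)n >= size)
                return -1;
        return n;
}
```

A client would emit one of these before each block of extracted code, so
that compiler messages name the markdown file rather than the generated
file.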
  28
  29 ## Literate Details
  30
  31 Literate programming is more than just including comments with the
  32 code, even nicely formatted comments.  It also involves presenting the
  33 code in an order that makes sense to a human, rather than an order
  34 that makes sense to a compiler.  For this reason a core part of any
  35 literate programming tool is the ability to re-arrange the code found
  36 in the document into a different order in the final code file - or
  37 files.  This requires some form of linkage to be encoded.
  38
  39 The approach taken here is focused around section headings - of any
  40 depth.
  41
  42 All the code in any section is treated as a single sequential
  43 collection of code, and is named by the section that it is in.  If
  44 multiple sections have the same name, then the code blocks in all of
  45 them are joined together in the order they appear in the document.
  46
  47 A code section can contain a special marker which starts with 2
  48 hashes: __##__.
  49 The text after the marker must be the name of some section which
  50 contains code.  Code from that section will be interpolated in place
  51 of the marker, and will be indented to match the indent of the marker.
  52
It is not permitted for the same code to be interpolated multiple
times.  Allowing this might make some sense, but it is probably a
mistake, and prohibiting it makes some of the code a bit cleaner.
  56
  57 Equally, every section of code should be interpolated at least once -
  58 with one exception.  This exception is imposed by the
  59 tool, not the library.  A different client could impose different
  60 rules on the names of top-level code sections.
  61
We have already seen one example of the exception.  A section name
starting __Example:__ indicates code that is not to be included in the
final product.  Any leading word will do: provided the name contains a
space, and the first space is preceded by a colon, that section will be
ignored.
  67
  68 A special case of this exception exists for the leading word
  69 __File__.  These sections are the top level code sections and they
  70 will be written to the named file.  Thus a section named
  71 __File: foo__ should not be referenced by another section, and its
  72 contents after all references are expanded will be written to the file
  73 __foo__.
  74
  75 Any section containing code that does not start __Word:__
  76 must be included in some other section exactly once.
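
##### Example: classifying section names

The naming rules above can be summarised as a small classifier.  This is
only a sketch of the rules as stated, not the code the tool uses.  The
names `name_kind` and `classify_name` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names, used only for this illustration. */
enum name_kind {
        NAME_PLAIN,     /* must be referenced by exactly one other section */
        NAME_FILE,      /* "File: foo": written to the file "foo" */
        NAME_IGNORED,   /* any other "Word: ...": not part of the output */
};

/* Classify a section name given as a pointer and a length. */
enum name_kind classify_name(const char *name, size_t len)
{
        const char *space;

        assert(name != NULL);
        space = memchr(name, ' ', len);
        /* A "Word:" prefix needs a space, and a colon just before it. */
        if (space == NULL || space == name || space[-1] != ':')
                return NAME_PLAIN;
        if (space - name == 5 && strncmp(name, "File:", 5) == 0)
                return NAME_FILE;
        return NAME_IGNORED;
}
```

Note that a name such as `File:foo`, with no space, counts as a plain
section under these rules.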
  77
  78 ### Multiple files
  79
  80 Allowing multiple top level code sections which name different files
  81 means that one _markdown_ document can describe several files.  This
  82 is very useful with the C language where a program file and a header
  83 file might be related.  For the present document we will have a header
  84 file and two code files, one with the library content and one for the
  85 tool.
  86
  87 It will also be very convenient to create a `makefile` fragment to
  88 ensure the code is compiled correctly.  A simple `make -f mdcode.mk`
  89 will "do the right thing".
  90
  91 ### File: mdcode.mk
  92
  93         CFLAGS += -Wall -g
  94         all::
  95         mdcode.h libmdcode.c md2c.c mdcode.mk :  mdcode.mdc
  96                 ./md2c mdcode.mdc
  97
  98
  99 ### File: mdcode.h
 100
 101         #include <stdio.h>
 102         ## exported types
 103         ## exported functions
 104
 105 ### File: libmdcode.c
 106         #define _GNU_SOURCE
 107         #include <unistd.h>
 108         #include <stdlib.h>
 109         #include <stdio.h>
 110
 111         #include "mdcode.h"
 112         ## internal includes
 113         ## private types
 114         ## internal functions
 115
 116 ### File: mdcode.mk
 117
 118         all :: libmdcode.o
 119         libmdcode.o : libmdcode.c mdcode.h
 120                 $(CC) $(CFLAGS) -c libmdcode.c
 121
 122 ### File: md2c.c
 123
 124         #include <unistd.h>
 125         #include <stdlib.h>
 126         #include <stdio.h>
 127
 128         #include "mdcode.h"
 129
 130         ## client includes
 131         ## client functions
 132
 133 ### File: mdcode.mk
 134
 135         all :: md2c
 136         md2c : md2c.o libmdcode.o
 137                 $(CC) $(CFLAGS) -o md2c md2c.o libmdcode.o
 138         md2c.o : md2c.c mdcode.h
 139                 $(CC) $(CFLAGS) -c md2c.c
 140
 141 ## Data Structures
 142
As the core purpose of _mdcode_ is to discover and re-arrange blocks
of text, it makes sense to map the whole document file into memory and
produce a data structure which lists various parts of the file in the
appropriate order.  Each node in this structure, a `code_node`, will have
some text from the document, a child pointer, and a next pointer, any of
which might not be present.  The text is most easily stored as a pointer
and a length.  We'll call this a `text`.
 150
 151 A list of these `code_nodes` will belong to each section and it will
 152 be useful to have a separate `section` data structure to store the
 153 list of `code_nodes`, the section name, and some other information.
 154
 155 This other information will include a reference counter so we can
 156 ensure proper referencing, and an `indent` depth.  As referenced
 157 content can have an extra indent added, we need to know what that is.
 158 The `code_node` will also have an `indent` depth which eventually gets
 159 set to the sum for the indents from all references on the path from
 160 the root.
 161
Finally we need to know if the `code_node` was recognised by being
indented or not.  If it was, the client of this data will want to
strip off the leading tab or 4 spaces.  Hence a `needs_strip` flag is
needed.
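
##### Example: a text as pointer and length

To show why a pointer and a length are convenient, here is a small sketch
that views part of a buffer without copying or terminating it.  The
helpers `text_slice` and `text_format` are hypothetical.  The `struct
text` is repeated only so the sketch stands alone.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Mirrors the `struct text` described in this section; repeated here
 * only so the sketch is self-contained. */
struct text {
        char *txt;
        int len;
};

/* Hypothetical helper: view `len` bytes of `doc` starting at `start`. */
struct text text_slice(char *doc, int start, int len)
{
        struct text t;

        assert(doc != NULL && start >= 0 && len >= 0);
        t.txt = doc + start;
        t.len = len;
        return t;
}

/* Hypothetical helper: copy a text into `buf` as a C string.
 * "%.*s" reads at most `len` bytes, so the mapped document itself
 * needs no terminator at the end of the text. */
int text_format(char *buf, size_t size, struct text t)
{
        return snprintf(buf, size, "%.*s", t.len, t.txt);
}
```

The same `"%.*s"` idiom is what makes it cheap to print section names
straight out of the mapped file.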
 166
 167 ##### exported types
 168
 169         struct text {
 170                 char *txt;
 171                 int len;
 172         };
 173
 174         struct section {
 175                 struct text section;
 176                 struct code_node *code;
 177                 struct section *next;
 178         };
 179
 180         struct code_node {
 181                 struct text code;
 182                 int indent;
 183                 int line_no;
 184                 int needs_strip;
 185                 struct code_node *next;
 186                 struct section *child;
 187         };
 188
 189 ##### private types
 190
 191         struct psection {
 192                 struct section;
 193                 struct code_node *last;
 194                 int refcnt;
 195                 int indent;
 196         };
 197
 198 You will note that the `struct psection` contains an anonymous `struct
 199 section` embedded at the start.  To make this work right, GCC
 200 requires the `-fplan9-extensions` flag.
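
##### Example: embedding without the extension

For comparison, a similar effect is available in standard C by giving the
embedded structure a name and relying on the rule that a pointer to a
structure also points to its first member.  This is a sketch of that
alternative, not what this document does.  The names `sect`, `psect`,
`to_public` and `to_private` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down, hypothetical stand-ins for `struct section` and
 * `struct psection`; only the embedding is of interest here. */
struct sect {
        const char *name;
        struct sect *next;
};

struct psect {
        struct sect base;       /* named first member, not anonymous */
        int refcnt;
};

/* Going from the private type to the public one needs no cast. */
struct sect *to_public(struct psect *p)
{
        assert(p != NULL);
        return &p->base;
}

/* Going back is a cast, valid only when `s` really is the `base`
 * member of a `struct psect`. */
struct psect *to_private(struct sect *s)
{
        assert(s != NULL);
        return (struct psect *)s;
}
```

The cost of this form is that fields must be written as `p->base.next`
rather than `p->next`, which is the convenience the extension buys.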
 201
 202 ##### File: mdcode.mk
 203
 204         CFLAGS += -fplan9-extensions
 205
 206 ### Manipulating the node
 207
 208 Though a tree with `next` and `child` links is the easiest way to
 209 assemble the various code sections, it is not the easiest form for
 210 using them.  For that a simple list would be best.
 211
So once we have a fully linked File section we will want to linearize
it, so that the `child` links become `NULL` and the `next` links will
find everything required.  It is at this stage that the requirement
that each section is linked only once becomes important.
 216
`code_linearize` will merge the `code_node`s from any child into the
given `code_node`.  As it does this it sets the `indent` field for
each `code_node`.
 220
 221 Note that we don't clear the section's `last` pointer, even though
 222 it no longer owns any code.  This allows subsequent code to see if a
 223 section ever had any code, and to report an error if a section is
 224 referenced but not defined.
 225
 226 ##### internal functions
 227
 228         static void code_linearize(struct code_node *code)
 229         {
 230                 struct code_node *t;
 231                 for (t = code; t; t = t->next)
 232                         t->indent = 0;
 233                 for (; code; code = code->next)
 234                         if (code->child) {
 235                                 struct code_node *next = code->next;
 236                                 struct psection *pchild =
 237                                         (struct psection *)code->child;
 238                                 int indent = pchild->indent;
 239                                 code->next = code->child->code;
 240                                 code->child->code = NULL;
 241                                 code->child = NULL;
 242                                 for (t = code; t->next; t = t->next)
 243                                         t->next->indent = code->indent + indent;
 244                                 t->next = next;
 245                         }
 246         }
 247
 248 Once a client has made use of a linearized code set, it will probably
 249 want to free it.
 250
 251         void code_free(struct code_node *code)
 252         {
 253                 while (code) {
 254                         struct code_node *this;
 255                         if (code->child)
 256                                 code_linearize(code);
 257                         this = code;
 258                         code = code->next;
 259                         free(this);
 260                 }
 261         }
 262
 263 ##### exported functions
 264
 265         void code_free(struct code_node *code);
 266
 267 ### Building the tree
 268
 269 As we parse the document there are two things we will want to do to
 270 node trees: add some text or add a reference.  We'll assume for now
 271 that the relevant section structures have been found, and will just
 272 deal with the `code_node`.
 273
Adding text simply means adding another node.  We will never have
empty nodes.  Even if the last node only has a child, new text must go
in a new node.
 277
 278 ##### internal functions
 279
 280         static void code_add_text(struct psection *where, struct text txt,
 281                                   int line_no, int needs_strip)
 282         {
 283                 struct code_node *n;
 284                 if (txt.len == 0)
 285                         return;
 286                 n = malloc(sizeof(*n));
 287                 n->code = txt;
 288                 n->indent = 0;
 289                 n->line_no = line_no;
 290                 n->needs_strip = needs_strip;
 291                 n->next = NULL;
 292                 n->child = NULL;
 293                 if (where->last)
 294                         where->last->next = n;
 295                 else
 296                         where->code = n;
 297                 where->last = n;
 298         }
 299
 300 However when adding a link, we might be able to include it in the last
 301 `code_node` if it currently only has text.
 302
        void code_add_link(struct psection *where, struct psection *to,
                           int indent)
        {
                struct code_node *n;

                to->indent = indent;
                to->refcnt++;   // this will be checked elsewhere
                if (where->last && where->last->child == NULL) {
                        /* Last node has text only: hang the link on it */
                        where->last->child = to;
                        return;
                }
                n = malloc(sizeof(*n));
                /* A link-only node carries no text of its own */
                n->code.txt = NULL;
                n->code.len = 0;
                n->indent = 0;
                n->line_no = 0;
                n->needs_strip = 0;
                n->next = NULL;
                n->child = to;
                if (where->last)
                        where->last->next = n;
                else
                        where->code = n;
                where->last = n;
        }
 326
 327 ### Finding sections
 328
Now we need a lookup table to be able to find sections by name.
Something that provides a `log(N)` search time is probably
justified, but for now I want a minimal stand-alone program so a
linked list managed by insertion-sort will do.
 333
 334 The text compare function will likely be useful for any clients of our
 335 library, so we may as well export it.
 336
 337 If we cannot find a section, we simply want to create it.  This allows
 338 sections and references to be created in any order.  Sections with
 339 no references or no content will cause a warning eventually.
 340
 341 #### exported functions
 342
 343         int text_cmp(struct text a, struct text b);
 344
 345 #### internal functions
 346
 347         int text_cmp(struct text a, struct text b)
 348         {
 349                 int len = a.len;
 350                 if (len > b.len)
 351                         len = b.len;
 352                 int cmp = strncmp(a.txt, b.txt, len);
 353                 if (cmp)
 354                         return cmp;
 355                 else
 356                         return a.len - b.len;
 357         }
 358
 359         static struct psection *section_find(struct psection **list, struct text name)
 360         {
 361                 struct psection *new;
 362                 while (*list) {
 363                         int cmp = text_cmp((*list)->section, name);
 364                         if (cmp == 0)
 365                                 return *list;
 366                         if (cmp > 0)
 367                                 break;
 368                         list = (struct psection **)&((*list)->next);
 369                 }
 370                 /* Add this section */
 371                 new = malloc(sizeof(*new));
 372                 new->next = *list;
 373                 *list = new;
 374                 new->section = name;
 375                 new->code = NULL;
 376                 new->last = NULL;
 377                 new->refcnt = 0;
 378                 new->indent = 0;
 379                 return new;
 380         }
 381
 382 ## Parsing the _markdown_
 383
 384 Parsing markdown is fairly easy, though there are complications.
 385
 386 The document is divided into "paragraphs" which are mostly separated by blank
 387 lines (which may contain white space).  The first few characters of
 388 the first line of a paragraph determine the type of paragraph.  For
 389 our purposes we are only interested in list paragraphs, code
 390 paragraphs, section headings, and everything else.  Section headings
 391 are single-line paragraphs and so do not require a preceding or
 392 following blank line.
 393
 394 Section headings start with 1 or more hash characters (__#__).  List
 395 paragraphs start with hyphen, asterisk, plus, or digits followed by a
 396 period.  Code paragraphs aren't quite so easy.
 397
The "standard" code paragraph starts with 4 or more spaces, or a tab.
However if the previous paragraph was a list paragraph, then those
spaces indicate another paragraph in the same list item, and 8 or
more spaces are required, unless a nested list is in effect, in
which case 12 or more are needed.  Unfortunately not all _markdown_
parsers agree on nested lists.
 404
Two alternate styles for marking code are in active use.  "Github" uses
three backticks (_`` ``` ``_), while "pandoc" uses three or more tildes
(_~~~_).  In these cases the code should not be indented.
 408
 409 Trying to please everyone as much as possible, this parser will handle
 410 everything except for code inside lists.
 411
 412 So an indented (4+) paragraph after a list paragraph is always a list
 413 paragraph, otherwise it is a code paragraph.  A paragraph that starts
 414 with three backticks or three tildes is code which continues until a
 415 matching string of backticks or tildes.
 416
 417 ### Skipping bits
 418
 419 While walking the document looking for various markers we will *not*
 420 use the `struct text` introduced earlier as advancing that requires
 421 updating both start and length which feels clumsy.  Instead we will
 422 carry `pos` and `end` pointers, only the first of which needs to
 423 change.
 424
So to start, we need to skip various parts of the document.  `lws`
stands for "Linear White Space" and is a term that comes from the
Email RFCs (e.g. RFC822).  `line` and `para` are self-explanatory.
Note that `skip_para` needs to update the current line number.
`skip_line` doesn't, so every caller must do so itself.
 430
 431 #### internal functions
 432
 433         static char *skip_lws(char *pos, char *end)
 434         {
 435                 while (pos < end && (*pos == ' ' || *pos == '\t'))
 436                         pos++;
 437                 return pos;
 438         }
 439
 440         static char *skip_line(char *pos, char *end)
 441         {
 442                 while (pos < end && *pos != '\n')
 443                         pos++;
 444                 if (pos < end)
 445                         pos++;
 446                 return pos;
 447         }
 448
 449         static char *skip_para(char *pos, char *end, int *line_no)
 450         {
 451                 /* Might return a pointer to a blank line, as only
 452                  * one trailing blank line is skipped
 453                  */
 454                 if (*pos == '#') {
 455                         pos = skip_line(pos, end);
 456                         (*line_no) += 1;
 457                         return pos;
 458                 }
 459                 while (pos < end &&
 460                        *pos != '#' &&
 461                        *(pos = skip_lws(pos, end)) != '\n') {
 462                         pos = skip_line(pos, end);
 463                         (*line_no) += 1;
 464                 }
 465                 if (pos < end && *pos == '\n') {
 466                         pos++;
 467                         (*line_no) += 1;
 468                 }
 469                 return pos;
 470         }
 471
 472 ### Recognising things
 473
 474 Recognising a section header is trivial and doesn't require a
 475 function.  However we need to extract the content of a section header
 476 as a `struct text` for passing to `section_find`.
 477 Recognising the start of a new list is fairly easy.  Recognising the
 478 start (and end) of code is a little messy so we provide a function for
 479 matching the first few characters, which has a special case for "4
 480 spaces or tab".
 481
 482 #### internal includes
 483
 484         #include  <ctype.h>
 485         #include  <string.h>
 486
 487 #### internal functions
 488
 489         static struct text take_header(char *pos, char *end)
 490         {
 491                 struct text section;
 492
 493                 while (pos < end && *pos == '#')
 494                         pos++;
 495                 while (pos < end && *pos == ' ')
 496                         pos++;
 497                 section.txt = pos;
 498                 while (pos < end && *pos != '\n')
 499                         pos++;
 500                 while (pos > section.txt &&
 501                        (pos[-1] == '#' || pos[-1] == ' '))
 502                         pos--;
 503                 section.len = pos - section.txt;
 504                 return section;
 505         }
 506
 507         static int is_list(char *pos, char *end)
 508         {
 509                 if (strchr("-*+", *pos))
 510                         return 1;
 511                 if (isdigit(*pos)) {
 512                         while (pos < end && isdigit(*pos))
 513                                 pos += 1;
 514                         if  (pos < end && *pos == '.')
 515                                 return 1;
 516                 }
 517                 return 0;
 518         }
 519
 520         static int matches(char *start, char *pos, char *end)
 521         {
 522                 if (start == NULL)
 523                         return matches("\t", pos, end) ||
 524                                matches("    ", pos, end);
 525                 return (pos + strlen(start) < end &&
 526                         strncmp(pos, start, strlen(start)) == 0);
 527         }
 528
 529 ### Extracting the code
 530
Now that we can skip paragraphs and recognise what type each paragraph
is, it is time to parse the file and extract the code.  We'll do this
in two parts: first we look at what to do with some code once we
find it, and then at how to actually find it.
 535
 536 When we have some code, we know where it is, what the end marker
 537 should look like, and which section it is in.
 538
 539 There are two sorts of end markers: the presence of a particular
 540 string, or the absence of an indent.  We will use a string to
 541 represent a presence, and a `NULL` to represent the absence.
 542
 543 While looking at code we don't think about paragraphs at all - just
 544 look for a line that starts with the right thing.
 545 Every line that is still code then needs to be examined to see if it
 546 is a section reference.
 547
 548 When a section reference is found, all preceding code (if any) must be
 549 added to the current section, then the reference is added.
 550
 551 When we do find the end of the code, all text that we have found but
 552 not processed needs to be saved too.
 553
When adding a reference we need to set the `indent`.  This is the
number of spaces (counting 8 for tabs) after the natural indent of the
code (which is a tab or 4 spaces).  We use a separate function `count_space`
for that.
 558
If there are completely blank lines (no indent) at the end of the found code,
these should be considered to be spacing between the code and the next section,
and so not included in the code.  When a marker is used to explicitly mark the
end of the code, we don't need to check for these blank lines.
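
##### Example: dropping trailing blank lines

The treatment of trailing blank lines can be shown on its own.  This is a
sketch of the idea only, working on a buffer and a length.  The helper
name `trim_blank_lines` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: given `len` bytes of indented code at `code`,
 * return the length with trailing completely blank lines removed.
 * The newline that ends the last real line is kept. */
size_t trim_blank_lines(const char *code, size_t len)
{
        assert(code != NULL || len == 0);
        /* Two newlines in a row at the end mean an empty last line. */
        while (len >= 2 && code[len - 1] == '\n' && code[len - 2] == '\n')
                len--;
        return len;
}
```

Blank lines in the middle of the code are untouched, as only the end of
the buffer is examined.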
 563
 564 #### internal functions
 565
 566         static int count_space(char *sol, char *p)
 567         {
 568                 int c = 0;
 569                 while (sol < p) {
 570                         if (sol[0] == ' ')
 571                                 c++;
 572                         if (sol[0] == '\t')
 573                                 c+= 8;
 574                         sol++;
 575                 }
 576                 return c;
 577         }
 578
        static char *take_code(char *pos, char *end, char *marker,
                               struct psection **table, struct text section,
                               int *line_nop)
        {
                char *start = pos;
                int line_no = *line_nop;
                int start_line = line_no;
                struct psection *sect;

                sect = section_find(table, section);

                while (pos < end) {
                        char *sol, *t;
                        struct text ref;

                        if (marker && matches(marker, pos, end))
                                break;
                        if (!marker &&
                            (skip_lws(pos, end))[0] != '\n' &&
                            !matches(NULL, pos, end))
                                /* Paragraph not indented */
                                break;

                        /* Still in code - check for reference */
                        sol = pos;
                        if (!marker) {
                                /* Step over the paragraph indent */
                                if (*sol == '\t')
                                        sol++;
                                else if (strncmp(sol, "    ", 4) == 0)
                                        sol += 4;
                        }
                        t = skip_lws(sol, end);
                        if (t[0] != '#' || t[1] != '#') {
                                /* Just regular code here */
                                pos = skip_line(sol, end);
                                line_no++;
                                continue;
                        }

                        /* Save any code seen before this reference */
                        if (pos > start) {
                                struct text txt;
                                txt.txt = start;
                                txt.len = pos - start;
                                code_add_text(sect, txt, start_line,
                                              marker == NULL);
                        }
                        ref = take_header(t, end);
                        if (ref.len) {
                                struct psection *refsec = section_find(table, ref);
                                code_add_link(sect, refsec, count_space(sol, t));
                        }
                        pos = skip_line(t, end);
                        line_no++;
                        start = pos;
                        start_line = line_no;
                }
                if (pos > start) {
                        struct text txt;
                        txt.txt = start;
                        txt.len = pos - start;
                        /* strip trailing blank lines */
                        while (!marker && txt.len > 2 &&
                               start[txt.len-1] == '\n' &&
                               start[txt.len-2] == '\n')
                                txt.len -= 1;

                        code_add_text(sect, txt, start_line,
                                      marker == NULL);
                }
                if (marker) {
                        /* Skip the line holding the end marker */
                        pos = skip_line(pos, end);
                        line_no++;
                }
                *line_nop = line_no;
                return pos;
        }
 655
 656 ### Finding the code
 657
 658 It is when looking for the code that we actually use the paragraph
 659 structure.  We need to recognise section headings so we can record the
 660 name, list paragraphs so we can ignore indented follow-on paragraphs,
 661 and the three different markings for code.
 662
 663 #### internal functions
 664
 665         static struct psection *code_find(char *pos, char *end)
 666         {
 667                 struct psection *table = NULL;
 668                 int in_list = 0;
 669                 int line_no = 1;
 670                 struct text section = {0};
 671
 672                 while (pos < end) {
 673                         if (pos[0] == '#') {
 674                                 section = take_header(pos, end);
 675                                 in_list = 0;
 676                                 pos = skip_line(pos, end);
 677                                 line_no++;
 678                         } else if (is_list(pos, end)) {
 679                                 in_list = 1;
 680                                 pos = skip_para(pos, end, &line_no);
 681                         } else if (!in_list && matches(NULL, pos, end)) {
 682                                 pos = take_code(pos, end, NULL, &table,
 683                                                 section, &line_no);
 684                         } else if (matches("```", pos, end)) {
 685                                 in_list = 0;
 686                                 pos = skip_line(pos, end);
 687                                 line_no++;
 688                                 pos = take_code(pos, end, "```", &table,
 689                                                 section, &line_no);
 690                         } else if (matches("~~~", pos, end)) {
 691                                 in_list = 0;
 692                                 pos = skip_line(pos, end);
 693                                 line_no++;
 694                                 pos = take_code(pos, end, "~~~", &table,
 695                                                 section, &line_no);
 696                         } else {
 697                                 if (!isspace(*pos))
 698                                         in_list = 0;
 699                                 pos = skip_para(pos, end, &line_no);
 700                         }
 701                 }
 702                 return table;
 703         }
 704
 705 ### Returning the code
 706
Having found all the code blocks and gathered them into a list of
sections, we are now ready to return them to the caller.  This is where
we perform consistency checks, such as at most one reference and at least
one definition for each section.
 711
All the sections with no references are returned in a list for the
caller to consider.  They are linearized first so that the substructure
is completely hidden -- except for the small amount of structure
displayed in the line numbers.
 716
 717 To return errors, we have the caller pass a function which takes an
 718 error message - a `code_err_fn`.
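
##### Example: an error callback

A client might supply a callback along the following lines.  This is a
sketch only.  `report_error`, `error_count` and `check_sections` are
hypothetical names, and `check_sections` merely stands in for a library
routine that finds two problems.

```c
#include <assert.h>
#include <stdio.h>

/* Same shape as the `code_err_fn` type described in this section. */
typedef void (*code_err_fn)(char *msg);

/* Hypothetical client state: how many errors have been reported. */
static int error_count;

/* Hypothetical client callback: print the message and remember that
 * something went wrong, so the client can exit with a failure status. */
static void report_error(char *msg)
{
        assert(msg != NULL);
        fprintf(stderr, "mdcode: %s\n", msg);
        error_count++;
}

/* Stand-in for a library routine that finds two problems. */
static void check_sections(code_err_fn error)
{
        error("Section \"a\" is referenced but not declared");
        error("Section \"b\" referenced multiple times (2).");
}
```

Passing a function keeps the library free of any policy about where
messages go or whether an error is fatal.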
 719
#### exported types

        typedef void (*code_err_fn)(char *msg);

#### internal functions

        struct section *code_extract(char *pos, char *end, code_err_fn error)
        {
                struct psection *table;
                struct section *result = NULL;
                struct section *tofree = NULL;

                table = code_find(pos, end);

                while (table) {
                        struct psection *t = (struct psection*)table->next;
                        if (table->last == NULL) {
                                /* Referenced somewhere, but no code was found */
                                char *msg;
                                asprintf(&msg,
                                         "Section \"%.*s\" is referenced but not declared",
                                         table->section.len, table->section.txt);
                                error(msg);
                                free(msg);
                        }
                        if (table->refcnt == 0) {
                                /* Root-section,  return it */
                                table->next = result;
                                result = table;
                                code_linearize(result->code);
                        } else {
                                /* Not a root, so not returned: queue it to be freed */
                                table->next = tofree;
                                tofree = table;
                                if (table->refcnt > 1) {
                                        char *msg;
                                        asprintf(&msg,
                                                 "Section \"%.*s\" referenced multiple times (%d).",
                                                 table->section.len, table->section.txt,
                                                 table->refcnt);
                                        error(msg);
                                        free(msg);
                                }
                        }
                        table = t;
                }
                /* Release the sections that are not being returned */
                while (tofree) {
                        struct section *t = tofree->next;
                        free(tofree);
                        tofree = t;
                }
                return result;
        }

##### exported functions

        struct section *code_extract(char *pos, char *end, code_err_fn error);

## Using the library

Now that we can extract code from a document and link it all together
it is time to do something with that code.  Firstly we need to print
it out.

### Printing the Code

Printing is mostly straightforward - we just walk the list and print
the code sections, adding whatever indent is required for each line.
However there is a complication (isn't there always?).

For code that was recognised because the paragraph was indented, we
need to strip that indent first.  For other code, we don't.

The approach taken here is simple, though it could arguably be wrong
in some unlikely cases.  So it might need to be fixed later.

If the first line of a code block is indented, then either one tab or
4 spaces are stripped from every non-blank line.

This could go wrong if the first line of a code block marked by
_`` ``` ``_ is indented.  To overcome this we would need to
record some extra state in each `code_node`.  For now we won't bother.

The indents we insert will mostly be spaces.  All-spaces doesn't work
for `Makefiles`, so if the indent is 8 or more, we use a TAB first.

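##### Example: stripping and inserting indents

The two rules above can be sketched in isolation.  This is only an
illustration under the assumptions stated in the comments, not the code
the tool uses: `strip_len` and `make_indent` are hypothetical helpers
that mirror the tests made when printing a line.  The heading starts
with `Example:` so that the tool ignores this block.

```c
#include <string.h>

/*
 * Hypothetical helper: how many leading bytes to drop from a line of
 * len bytes.  One tab or four spaces are removed, and only when
 * something is left on the line afterwards.
 */
static int strip_len(const char *line, int len)
{
        if (len > 1 && line[0] == '\t')
                return 1;
        if (len > 4 && strncmp(line, "    ", 4) == 0)
                return 4;
        return 0;
}

/*
 * Hypothetical helper: build the indent to insert.  buf must have room
 * for indent + 1 bytes.  An indent of 8 or more starts with a TAB, the
 * remainder is spaces.  Returns the number of bytes written.
 */
static int make_indent(char *buf, int indent)
{
        int n = 0;

        if (indent >= 8) {
                buf[n++] = '\t';
                indent -= 8;
        }
        while (indent-- > 0)
                buf[n++] = ' ';
        buf[n] = 0;
        return n;
}
```

Writing a leading TAB means that a `Makefile` recipe nested one level
deep still starts with the TAB that `make` requires.
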
##### internal functions

        void code_node_print(FILE *out, struct code_node *node,
                             char *fname)
        {
                for (; node; node = node->next) {
                        char *c = node->code.txt;
                        int len = node->code.len;

                        if (!len)
                                continue;

                        fprintf(out, "#line %d \"%s\"\n",
                                node->line_no, fname);
                        while (len && *c) {
                                if (node->indent >= 8)
                                        fprintf(out, "\t%*s", node->indent - 8, "");
                                else
                                        fprintf(out, "%*s", node->indent, "");
                                if (node->needs_strip) {
                                        /* Test len first so we never read past the text */
                                        if (len > 1 && *c == '\t') {
                                                c++;
                                                len--;
                                        } else if (len > 4 && strncmp(c, "    ", 4) == 0) {
                                                c += 4;
                                                len -= 4;
                                        }
                                }
                                /* Copy the rest of the line, newline included */
                                do {
                                        fputc(*c, out);
                                        c++;
                                        len--;
                                } while (len && c[-1] != '\n');
                        }
                }
        }

###### exported functions

        void code_node_print(FILE *out, struct code_node *node, char *fname);

### Bringing it all together

We are just about ready for the `main` function of the tool which will
extract all this lovely code so that it can be compiled.  Just one
helper is still needed.

#### Handling filenames

Section names are stored in a `struct text` which is not `nul`
terminated.  Filenames passed to `fopen` need to be `nul` terminated.
So we need to convert one to the other, and strip the leading `File:`
off while we are at it.

##### client functions

        /*
         * Copy the file name from a "File: name" section title into
         * name[], which has room for 'space' bytes.  name is left empty
         * if the title does not start with "File:".
         */
        static void copy_fname(char *name, int space, struct text t)
        {
                char *sec = t.txt;
                int len = t.len;
                name[0] = 0;
                if (len < 5 || strncmp(sec, "File:", 5) != 0)
                        return;
                sec += 5;
                len -= 5;
                while (len && sec[0] == ' ') {
                        sec++;
                        len--;
                }
                /* Truncate rather than overflow name[] */
                if (len >= space)
                        len = space - 1;
                strncpy(name, sec, len);
                name[len] = 0;
        }

#### Main

And now we take a single file name, extract the code, and if there are
no errors we write out a file for each appropriate code section.  If a
section name is given as a second argument, only that section is
printed, to standard output.  And we are done.

##### client includes

        #include <fcntl.h>
        #include <errno.h>
        #include <sys/mman.h>
        #include <string.h>
        #include <unistd.h>     /* lseek(), in case no other section includes it */

##### client functions

        static int errs;
        static void pr_err(char *msg)
        {
                errs++;
                fprintf(stderr, "%s\n", msg);
        }

        /* Like strchr(), but looks at no more than len bytes */
        static char *strnchr(char *haystack, int len, char needle)
        {
                while (len > 0 && *haystack && *haystack != needle) {
                        haystack++;
                        len--;
                }
                return len > 0 && *haystack == needle ? haystack : NULL;
        }

        int main(int argc, char *argv[])
        {
                int fd;
                size_t len;
                char *file;
                struct text section = {NULL, 0};
                struct section *table, *s, *prev;

                errs = 0;
                if (argc != 2 && argc != 3) {
                        fprintf(stderr, "Usage: mdcode file.mdc [section]\n");
                        exit(2);
                }
                if (argc == 3) {
                        section.txt = argv[2];
                        section.len = strlen(argv[2]);
                }

                fd = open(argv[1], O_RDONLY);
                if (fd < 0) {
                        fprintf(stderr, "mdcode: cannot open %s: %s\n",
                                argv[1], strerror(errno));
                        exit(1);
                }
                len = lseek(fd, 0, SEEK_END);
                file = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
                if (file == MAP_FAILED) {
                        /* mmap() fails for an empty file, among other cases */
                        fprintf(stderr, "mdcode: cannot map %s: %s\n",
                                argv[1], strerror(errno));
                        exit(1);
                }
                table = code_extract(file, file+len, pr_err);

                for (s = table; s;
                        (code_free(s->code), prev = s, s = s->next, free(prev))) {
                        FILE *fl;
                        char fname[1024];
                        char *spc = strnchr(s->section.txt, s->section.len, ' ');

                        if (spc > s->section.txt && spc[-1] == ':') {
                                if (strncmp(s->section.txt, "File: ", 6) != 0 &&
                                    (section.txt == NULL ||
                                     text_cmp(s->section, section) != 0))
                                        /* Ignore this section */
                                        continue;
                        } else {
                                fprintf(stderr, "Code in unreferenced section that is not ignored or a file name: %.*s\n",
                                        s->section.len, s->section.txt);
                                errs++;
                                continue;
                        }
                        if (section.txt) {
                                if (text_cmp(s->section, section) != 0)
                                        /* A "File:" section, but not the one asked for */
                                        continue;
                                code_node_print(stdout, s->code, argv[1]);
                                break;
                        }
                        copy_fname(fname, sizeof(fname), s->section);
                        if (fname[0] == 0) {
                                fprintf(stderr, "Missing file name at:%.*s\n",
                                        s->section.len, s->section.txt);
                                errs++;
                                continue;
                        }
                        fl = fopen(fname, "w");
                        if (!fl) {
                                fprintf(stderr, "Cannot create %s: %s\n",
                                        fname, strerror(errno));
                                errs++;
                                continue;
                        }
                        code_node_print(fl, s->code, argv[1]);
                        fclose(fl);
                }
                exit(!!errs);
        }