   1 # mdcode: extract C code from a _markdown_ file.
   2
   3 _markdown_ is a popular format for simple text markup which can easily
   4 be converted to HTML.  As it allows easy indication of sections of
   5 code, it is quite suitable for use in literate programming.  This file
   6 is an example of that usage.
   7
   8 The code included below provides two related functionalities.
   9 Firstly it provides a library routine for extracting code out of a
  10 _markdown_ file, so that other routines might make use of it.
  11
  12 Secondly it provides a simple client of this routine which extracts
  13 1 or more C-language files from a markdown document so they can be
  14 passed to a C compiler.  These two combine to make a tool that is needed
  15 to compile this tool.  Yes, this is circular.  A prototype tool was
  16 used for the first extraction.
  17
  18 The tool provided is described as specific to the C language as it
  19 generates
  20
  21 ##### Example: a _line_ command
  22
  23         #line __line-number__ __file-name__
  24
  25 lines so that the C compiler will report where in the markdown file
  26 any error is found.  This tool is suitable for any other language
  27 which allows the same directive, or will treat it as a comment.
  28
  29 ## Literate Details
  30
  31 Literate programming is more than just including comments with the
  32 code, even nicely formatted comments.  It also involves presenting the
  33 code in an order that makes sense to a human, rather than an order
  34 that makes sense to a compiler.  For this reason a core part of any
  35 literate programming tool is the ability to re-arrange the code found
  36 in the document into a different order in the final code file - or
  37 files.  This requires some form of linkage to be encoded.
  38
  39 The approach taken here is focused around section headings - of any
  40 depth.
  41
  42 All the code in any section is treated as a single sequential
  43 collection of code, and is named by the section that it is in.  If
  44 multiple sections have the same name, then the code blocks in all of
  45 them are joined together in the order they appear in the document.
  46
  47 A code section can contain a special marker which starts with 2
  48 hashes: __##__.
  49 The text after the marker must be the name of some section which
  50 contains code.  Code from that section will be interpolated in place
  51 of the marker, and will be indented to match the indent of the marker.
  52
  53 It is not permitted for the same code to be interpolated multiple
  54 times.  Allowing this might make some sense, but it is probably a
  55 mistake, and prohibiting it makes some of the code a bit cleaner.
  56
  57 Equally, every section of code should be interpolated at least once -
  58 with two exceptions.  These exceptions are imposed by the tool, not
  59 the library.  A different client could impose different rules on the
  60 names of top-level code sections.
  61
  62 The first exception we have already seen.  A section name starting
  63 __Example:__ indicates code that is not to be included in the final product.
  64
  65 The second exception is for the top level code sections which will be
  66 written to files.  Again these are identified by their section name.
  67 This must start with __File:__; the following text (after optional
  68 spaces) will be used as a file name.
  69
  70 Any section containing code that does not start __Example:__ or
  71 __File:__ must be included in some other section exactly once.
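
##### Example: classifying a section name

The naming rules above can be sketched as a small classifier.  This is
only an illustration: `classify_section` and `enum section_kind` are
hypothetical names, not part of the library or the tool described in
this document.

```c
#include <string.h>

enum section_kind { SEC_FILE, SEC_EXAMPLE, SEC_FRAGMENT };

/* Decide how a section is treated, based only on its name.
 * "File:" sections are written out, "Example:" sections are ignored,
 * and anything else must be referenced from some other section.
 */
static enum section_kind classify_section(const char *name, size_t len)
{
        if (len >= 5 && strncmp(name, "File:", 5) == 0)
                return SEC_FILE;
        if (len >= 8 && strncmp(name, "Example:", 8) == 0)
                return SEC_EXAMPLE;
        return SEC_FRAGMENT;
}
```

The name is passed with an explicit length because section names in
this document are slices of the mapped file, not NUL-terminated strings.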
  72
  73 ### Multiple files
  74
  75 Allowing multiple top level code sections which name different files
  76 means that one _markdown_ document can describe several files.  This
  77 is very useful with the C language where a program file and a header
  78 file might be related.  For the present document we will have a header
  79 file and two code files, one with the library content and one for the
  80 tool.
  81
  82 It will also be very convenient to create a `makefile` fragment to
  83 ensure the code is compiled correctly.  A simple `make -f mdcode.mk`
  84 will "do the right thing".
  85
  86 ### File: mdcode.mk
  87
  88         CFLAGS += -Wall -g
  89         all::
  90         mdcode.h libmdcode.c md2c.c mdcode.mk :  mdcode.mdc
  91                 ./md2c mdcode.mdc
  92
  93
  94 ### File: mdcode.h
  95
  96         ## exported types
  97         ## exported functions
  98
  99 ### File: libmdcode.c
 100         #define _GNU_SOURCE
 101         #include <unistd.h>
 102         #include <stdlib.h>
 103         #include <stdio.h>
 104
 105         #include "mdcode.h"
 106         ## internal includes
 107         ## private types
 108         ## internal functions
 109
 110 ### File: mdcode.mk
 111
 112         all :: libmdcode.o
 113         libmdcode.o : libmdcode.c mdcode.h
 114                 $(CC) $(CFLAGS) -c libmdcode.c
 115
 116
 117 ### File: md2c.c
 118
 119         #include <unistd.h>
 120         #include <stdlib.h>
 121
 122         #include "mdcode.h"
 123
 124         ## client includes
 125         ## client functions
 126
 127 ### File: mdcode.mk
 128
 129         all :: md2c
 130         md2c : md2c.o libmdcode.o
 131                 $(CC) $(CFLAGS) -o md2c md2c.o libmdcode.o
 132         md2c.o : md2c.c mdcode.h
 133                 $(CC) $(CFLAGS) -c md2c.c
 134
 135 ## Data Structures
 136
 137 As the core purpose of _mdcode_ is to discover and re-arrange blocks
 138 of text, it makes sense to map the whole document file into memory and
 139 produce a data structure which lists various parts of the file in the
 140 appropriate order.  Each node in this structure will have some text
 141 from the document, a child pointer, and a next pointer, any of which
 142 might not be present.  The text is most easily stored as a pointer and a
 143 length.  We'll call this a `text`.
 144
 145 A list of these `code_nodes` will belong to each section and it will
 146 be useful to have a separate `section` data structure to store the
 147 list of `code_nodes`, the section name, and some other information.
 148
 149 This other information will include a reference counter so we can
 150 ensure proper referencing, and an `indent` depth.  As referenced
 151 content can have an extra indent added, we need to know what that is.
 152 The `code_node` will also have an `indent` depth which eventually gets
 153 set to the sum of the indents from all references on the path from
 154 the root.
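
##### Example: a text is a slice of the document

A pointer-and-length `text` never owns or copies anything; it only
points into the mapped document.  The sketch below shows what that
means in practice.  `first_line` and `text_print` are hypothetical
helpers written for this example only.

```c
#include <stdio.h>
#include <string.h>

struct text {
        char *txt;
        int len;
};

/* Take the first line of `doc` (without the newline) as a text.
 * Nothing is copied: the text just points into the document.
 */
static struct text first_line(char *doc)
{
        struct text t;
        t.txt = doc;
        t.len = (int)strcspn(doc, "\n");
        return t;
}

/* A text is not NUL-terminated, so print it with an explicit length. */
static int text_print(char *buf, size_t size, struct text t)
{
        return snprintf(buf, size, "%.*s", t.len, t.txt);
}
```

The `%.*s` conversion used here is the same one the library later uses
when it formats section names into error messages.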
 155
 156 ##### exported types
 157
 158         struct text {
 159                 char *txt;
 160                 int len;
 161         };
 162
 163         struct section {
 164                 struct text section;
 165                 struct code_node *code;
 166                 struct section *next;
 167         };
 168
 169         struct code_node {
 170                 struct text code;
 171                 int indent;
 172                 int line_no;
 173                 struct code_node *next;
 174                 struct section *child;
 175         };
 176
 177 ##### private types
 178
 179         struct psection {
 180                 struct section;
 181                 struct code_node *last;
 182                 int refcnt;
 183                 int indent;
 184         };
 185
 186 You will note that the `struct psection` contains an anonymous `struct
 187 section` embedded at the start.  To make this work right, GCC
 188 requires the `-fplan9-extensions` flag.
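
##### Example: embedding without the extension

For comparison, here is a sketch of the portable alternative, which
this document does not use: give the embedded struct a member name.
The cost is that fields are reached as `p->s.name` rather than
`p->name`.  The types below are cut-down stand-ins and all the names
(`psection_portable`, `to_public`, `to_private`) are hypothetical.

```c
#include <stddef.h>

struct section {
        const char *name;
        struct section *next;
};

/* Portable variant: the embedded struct gets a name ("s"). */
struct psection_portable {
        struct section s;       /* must stay the first member */
        int refcnt;
};

/* A pointer to a struct may be converted to a pointer to its first
 * member and back, which is what casts between the public and
 * private views rely on.
 */
static struct section *to_public(struct psection_portable *p)
{
        return &p->s;
}

static struct psection_portable *to_private(struct section *s)
{
        return (struct psection_portable *)s;
}
```

Either way the public struct sits at offset zero, so a client holding a
`struct section *` never needs to know about the private fields.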
 189
 190 ##### File: mdcode.mk
 191
 192         CFLAGS += -fplan9-extensions
 193
 194 ### Manipulating the node
 195
 196 Though a tree with `next` and `child` links is the easiest way to
 197 assemble the various code sections, it is not the easiest form for
 198 using them.  For that a simple list would be best.
 199
 200 So once we have a fully linked File section we will want to linearize
 201 it, so that the `child` links become `NULL` and the `next` links will
 202 find everything required.  It is at this stage that the requirement
 203 that each section is linked only once becomes important.
 204
 205 `code_linearize` will merge the `code_node`s from any child into the
 206 given `code_node`.  As it does this it sets the `indent` field for
 207 each `code_node`.
 208
 209 Note that we don't clear the section's `last` pointer, even though
 210 it no longer owns any code.  This allows subsequent code to see if a
 211 section ever had any code, and to report an error if a section is
 212 referenced but not defined.
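
##### Example: linearizing a toy tree

The splice described above can be tried on a cut-down structure.  This
is a sketch under simplifying assumptions: `struct node`, `struct sect`
and `linearize` are hypothetical stand-ins that keep only the fields
needed to show the idea.

```c
#include <stddef.h>

struct sect;

struct node {
        const char *text;
        int indent;
        struct node *next;
        struct sect *child;
};

struct sect {
        struct node *code;
        int indent;     /* extra indent where this section is referenced */
};

/* Splice each child list in right after the node that refers to it.
 * Spliced nodes are visited later by the same loop, so nested
 * references are flattened too.
 */
static void linearize(struct node *code)
{
        for (; code; code = code->next) {
                struct node *rest, *t;
                int indent;

                if (!code->child)
                        continue;
                rest = code->next;
                indent = code->indent + code->child->indent;
                code->next = code->child->code;
                code->child->code = NULL;
                code->child = NULL;
                for (t = code; t->next; t = t->next)
                        t->next->indent = indent;
                t->next = rest;
        }
}
```

If node `a` refers to a section holding `b`, and `a` is followed by
`c`, the list reads `a`, `b`, `c` afterwards and `b` carries the indent
of the reference.  Because the child's list is moved rather than
copied, a section referenced twice could not be spliced in a second
time.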
 213
 214 ##### internal functions
 215
 216         static void code_linearize(struct code_node *code)
 217         {
 218                 struct code_node *t;
 219                 for (t = code; t; t = t->next)
 220                         t->indent = 0;
 221                 for (; code; code = code->next)
 222                         if (code->child) {
 223                                 struct code_node *next = code->next;
 224                                 struct psection *pchild =
 225                                         (struct psection *)code->child;
 226                                 int indent = pchild->indent;
 227                                 code->next = code->child->code;
 228                                 code->child->code = NULL;
 229                                 code->child = NULL;
 230                                 for (t = code; t->next; t = t->next)
 231                                         t->next->indent = code->indent + indent;
 232                                 t->next = next;
 233                         }
 234         }
 235
 236 Once a client has made use of a linearized code set, it will probably
 237 want to free it.
 238
 239         void code_free(struct code_node *code)
 240         {
 241                 while (code) {
 242                         struct code_node *this;
 243                         if (code->child)
 244                                 code_linearize(code);
 245                         this = code;
 246                         code = code->next;
 247                         free(this);
 248                 }
 249         }
 250
 251 ##### exported functions
 252
 253         void code_free(struct code_node *code);
 254
 255 ### Building the tree
 256
 257 As we parse the document there are two things we will want to do to
 258 node trees: add some text or add a reference.  We'll assume for now
 259 that the relevant section structures have been found, and will just
 260 deal with the `code_node`.
 261
 262 Adding text simply means adding another node.  We will never have
 263 empty nodes: even if the last node only has a child, new text must go
 264 in a new node.
 265
 266 ##### internal functions
 267
 268         static void code_add_text(struct psection *where, struct text txt,
 269                                   int line_no)
 270         {
 271                 struct code_node *n;
 272                 if (txt.len == 0)
 273                         return;
 274                 n = malloc(sizeof(*n));
 275                 n->code = txt;
 276                 n->indent = 0;
 277                 n->line_no = line_no;
 278                 n->next = NULL;
 279                 n->child = NULL;
 280                 if (where->last)
 281                         where->last->next = n;
 282                 else
 283                         where->code = n;
 284                 where->last = n;
 285         }
 286
 287 However when adding a link, we might be able to include it in the last
 288 `code_node` if it currently only has text.
 289
 290         void code_add_link(struct psection *where, struct psection *to,
 291                            int indent)
 292         {
 293                 struct code_node *n;
 294
 295                 to->indent = indent;
 296                 to->refcnt++;   // this will be checked elsewhere
 297                 if (where->last && where->last->child == NULL) {
 298                         where->last->child = to;
 299                         return;
 300                 }
 301                 n = malloc(sizeof(*n));
 302                 n->code.len = 0;
 303                 n->indent = 0;
 304                 n->line_no = 0;
 305                 n->next = NULL;
 306                 n->child = to;
 307                 if (where->last)
 308                         where->last->next = n;
 309                 else
 310                         where->code = n;
 311                 where->last = n;
 312         }
 313
 314 ### Finding sections
 315
 316 Now we need a lookup table to be able to find sections by name.
 317 Something that provides a `log(N)` search time is probably
 318 justified, but for now I want a minimal stand-alone program so a
 319 linked list managed by insertion-sort will do.  As a comparison
 320 function it is easiest to sort based on length before content.  So
 321 sections won't be in standard lexical order, but that isn't important.
 322
 323 If we cannot find a section, we simply want to create it.  This allows
 324 sections and references to be created in any order.  Sections with
 325 no references or no content will cause a warning eventually.
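
##### Example: find-or-create in a sorted list

The lookup described above can be sketched on plain C strings.  This is
an illustration only: `struct entry`, `name_cmp` and `find_or_add` are
hypothetical names, and unlike the library code this version checks the
result of `malloc()`.

```c
#include <stdlib.h>
#include <string.h>

struct entry {
        const char *name;
        struct entry *next;
};

/* Shorter names sort first; equal lengths fall back to strcmp(). */
static int name_cmp(const char *a, const char *b)
{
        size_t la = strlen(a), lb = strlen(b);
        if (la != lb)
                return la < lb ? -1 : 1;
        return strcmp(a, b);
}

/* Return the entry called `name`, creating it in sorted position if
 * it is not there yet.  Returns NULL only if malloc() fails.
 */
static struct entry *find_or_add(struct entry **list, const char *name)
{
        struct entry *e;

        while (*list) {
                int cmp = name_cmp((*list)->name, name);
                if (cmp == 0)
                        return *list;
                if (cmp > 0)
                        break;
                list = &(*list)->next;
        }
        e = malloc(sizeof(*e));
        if (!e)
                return NULL;
        e->name = name;
        e->next = *list;
        *list = e;
        return e;
}
```

Walking a pointer to the link, rather than a pointer to the entry,
means insertion at the head, in the middle and at the tail are all the
same two assignments.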
 326
 327 #### internal functions
 328
 329         static int text_cmp(struct text a, struct text b)
 330         {
 331                 if (a.len != b.len)
 332                         return a.len - b.len;
 333                 return strncmp(a.txt, b.txt, a.len);
 334         }
 335
 336         static struct psection *section_find(struct psection **list, struct text name)
 337         {
 338                 struct psection *new;
 339                 while (*list) {
 340                         int cmp = text_cmp((*list)->section, name);
 341                         if (cmp == 0)
 342                                 return *list;
 343                         if (cmp > 0)
 344                                 break;
 345                         list = (struct psection **)&((*list)->next);
 346                 }
 347                 /* Add this section */
 348                 new = malloc(sizeof(*new));
 349                 new->next = *list;
 350                 *list = new;
 351                 new->section = name;
 352                 new->code = NULL;
 353                 new->last = NULL;
 354                 new->refcnt = 0;
 355                 new->indent = 0;
 356                 return new;
 357         }
 358
 359 ## Parsing the _markdown_
 360
 361 Parsing markdown is fairly easy, though there are complications.
 362
 363 The document is divided into "paragraphs" which are mostly separated by blank
 364 lines (which may contain white space).  The first few characters of
 365 the first line of a paragraph determine the type of paragraph.  For
 366 our purposes we are only interested in list paragraphs, code
 367 paragraphs, section headings, and everything else.  Section headings
 368 are single-line paragraphs and so do not require a preceding or
 369 following blank line.
 370
 371 Section headings start with 1 or more hash characters (__#__).  List
 372 paragraphs start with hyphen, asterisk, plus, or digits followed by a
 373 period.  Code paragraphs aren't quite so easy.
 374
 375 The "standard" code paragraph starts with 4 or more spaces, or a tab.
 376 However if the previous paragraph was a list paragraph, then those
 377 spaces indicate another  paragraph in the same list item, and 8 or
 378 more spaces are required.  Unless a nested list is in effect, in
 379 which case 12 or more are needed.  Unfortunately not all _markdown_
 380 parsers agree on nested lists.
 381
 382 Two alternate styles for marking code are in active use.  "Github" uses
 383 three backticks (_`` ``` ``_), while "pandoc" uses three or more tildes
 384 (_~~~_).  In these cases the code should not be indented.
 385
 386 Trying to please everyone as much as possible, this parser will handle
 387 everything except for code inside lists.
 388
 389 So an indented (4+) paragraph after a list paragraph is always a list
 390 paragraph, otherwise it is a code paragraph.  A paragraph that starts
 391 with three backticks or three tildes is code which continues until a
 392 matching string of backticks or tildes.
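
##### Example: classifying a paragraph by its first line

The rules above can be sketched as a single function that looks only
at the start of a line.  This is an illustration, not the parser
itself: `classify_para` and `enum para_kind` are hypothetical, and the
line is assumed to be NUL-terminated.

```c
#include <ctype.h>
#include <string.h>

enum para_kind {
        PARA_HEADING, PARA_LIST, PARA_INDENTED, PARA_FENCE, PARA_OTHER
};

/* Classify a paragraph by the first characters of its first line.
 * Whether PARA_INDENTED means code or a list continuation depends on
 * whether the previous paragraph was a list; the caller decides that.
 */
static enum para_kind classify_para(const char *line)
{
        const char *p = line;

        if (*p == '#')
                return PARA_HEADING;
        if (*p && strchr("-*+", *p))
                return PARA_LIST;
        if (isdigit((unsigned char)*p)) {
                while (isdigit((unsigned char)*p))
                        p++;
                return *p == '.' ? PARA_LIST : PARA_OTHER;
        }
        if (*p == '\t' || strncmp(p, "    ", 4) == 0)
                return PARA_INDENTED;
        if (strncmp(p, "```", 3) == 0 || strncmp(p, "~~~", 3) == 0)
                return PARA_FENCE;
        return PARA_OTHER;
}
```

Note that a line starting with an asterisk counts as a list here even
if it is really emphasised text; that matches the simple rule stated
above rather than full _markdown_ semantics.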
 393
 394 ### Skipping bits
 395
 396 While walking the document looking for various markers we will *not*
 397 use the `struct text` introduced earlier as advancing that requires
 398 updating both start and length which feels clumsy.  Instead we will
 399 carry `pos` and `end` pointers, only the first of which needs to
 400 change.
 401
 402 So to start, we need to skip various parts of the document.  `lws`
 403 stands for "Linear White Space" and is a term that comes from the
 404 Email RFCs (e.g. RFC822).  `line` and `para` are self-explanatory.
 405 Note that `skip_para` needs to update the current line number.
 406 `skip_line` doesn't but every caller should.
 407
 408 #### internal functions
 409
 410         static char *skip_lws(char *pos, char *end)
 411         {
 412                 while (pos < end && (*pos == ' ' || *pos == '\t'))
 413                         pos++;
 414                 return pos;
 415         }
 416
 417         static char *skip_line(char *pos, char *end)
 418         {
 419                 while (pos < end && *pos != '\n')
 420                         pos++;
 421                 if (pos < end)
 422                         pos++;
 423                 return pos;
 424         }
 425
 426         static char *skip_para(char *pos, char *end, int *line_no)
 427         {
 428                 /* Might return a pointer to a blank line, as only
 429                  * one trailing blank line is skipped
 430                  */
 431                 if (*pos == '#') {
 432                         pos = skip_line(pos, end);
 433                         (*line_no) += 1;
 434                         return pos;
 435                 }
 436                 while (pos < end &&
 437                        *pos != '#' &&
 438                        *(pos = skip_lws(pos, end)) != '\n') {
 439                         pos = skip_line(pos, end);
 440                         (*line_no) += 1;
 441                 }
 442                 if (pos < end && *pos == '\n') {
 443                         pos++;
 444                         (*line_no) += 1;
 445                 }
 446                 return pos;
 447         }
 448
 449 ### Recognising things
 450
 451 Recognising a section header is trivial and doesn't require a
 452 function.  However we need to extract the content of a section header
 453 as a `struct text` for passing to `section_find`.
 454 Recognising the start of a new list is fairly easy.  Recognising the
 455 start (and end) of code is a little messy so we provide a function for
 456 matching the first few characters, which has a special case for "4
 457 spaces or tab".
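
##### Example: the name inside a heading line

Extracting a heading's content means dropping the leading hashes and
spaces and any trailing hashes and spaces.  The sketch below shows the
expected result on a NUL-terminated line; `heading_name` is a
hypothetical helper for this example and returns a start pointer and a
length rather than a `struct text`.

```c
#include <string.h>

/* Find the name inside a heading line such as "### File: md2c.c ###".
 * Stores the start of the name in *start and returns its length.
 * Assumes `line` is NUL-terminated and holds a single line.
 */
static int heading_name(const char *line, const char **start)
{
        const char *p = line;
        const char *e;

        while (*p == '#')
                p++;
        while (*p == ' ')
                p++;
        e = p + strcspn(p, "\n");
        while (e > p && (e[-1] == '#' || e[-1] == ' '))
                e--;
        *start = p;
        return (int)(e - p);
}
```

A heading made only of hashes yields a length of zero, which callers
can treat as "no name".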
 458
 459 #### internal includes
 460
 461         #include  <ctype.h>
 462         #include  <string.h>
 463
 464 #### internal functions
 465
 466         static struct text take_header(char *pos, char *end)
 467         {
 468                 struct text section;
 469
 470                 while (pos < end && *pos == '#')
 471                         pos++;
 472                 while (pos < end && *pos == ' ')
 473                         pos++;
 474                 section.txt = pos;
 475                 while (pos < end && *pos != '\n')
 476                         pos++;
 477                 while (pos > section.txt &&
 478                        (pos[-1] == '#' || pos[-1] == ' '))
 479                         pos--;
 480                 section.len = pos - section.txt;
 481                 return section;
 482         }
 483
 484         static int is_list(char *pos, char *end)
 485         {
 486                 if (strchr("-*+", *pos))
 487                         return 1;
 488                 if (isdigit(*pos)) {
 489                         while (pos < end && isdigit(*pos))
 490                                 pos += 1;
 491                         if  (pos < end && *pos == '.')
 492                                 return 1;
 493                 }
 494                 return 0;
 495         }
 496
 497         static int matches(char *start, char *pos, char *end)
 498         {
 499                 if (start == NULL)
 500                         return matches("\t", pos, end) ||
 501                                matches("    ", pos, end);
 502                 return (pos + strlen(start) < end &&
 503                         strncmp(pos, start, strlen(start)) == 0);
 504         }
 505
 506 ### Extracting the code
 507
 508 Now that we can skip paragraphs and recognise what type each paragraph
 509 is, it is time to parse the file and extract the code.  We'll do this
 510 in two parts: first we look at what to do with some code once we
 511 find it, and then how to actually find it.
 512
 513 When we have some code, we know where it is, what the end marker
 514 should look like, and which section it is in.
 515
 516 There are two sorts of end markers: the presence of a particular
 517 string, or the absence of an indent.  We will use a string to
 518 represent a presence, and a `NULL` to represent the absence.
 519
 520 While looking at code we don't think about paragraphs at all - just
 521 look for a line that starts with the right thing.
 522 Every line that is still code then needs to be examined to see if it
 523 is a section reference.
 524
 525 When a section reference is found, all preceding code (if any) must be
 526 added to the current section, then the reference is added.
 527
 528 When we do find the end of the code, all text that we have found but
 529 not processed needs to be saved too.
 530
 531 When adding a reference we need to set the `indent`.  This is the
 532 number of spaces (counting 8 for tabs) after the natural indent of the
 533 code (which is a tab or 4 spaces).  We use a separate function `count_space`
 534 for that.
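
##### Example: the indent of a reference

To make the indent rule concrete, the sketch below takes one line from
an indented code paragraph and reports the extra indent of its
reference marker, if it has one.  `reference_indent` is a hypothetical
helper that assumes a NUL-terminated line; it folds together steps the
real parser performs separately.

```c
#include <string.h>

/* If `line` is a section reference, return the extra indent of its
 * marker, counting a tab as 8.  Return -1 for ordinary code.
 */
static int reference_indent(const char *line)
{
        const char *p = line;
        int indent = 0;

        /* Drop the natural indent: one tab or four spaces. */
        if (*p == '\t')
                p++;
        else if (strncmp(p, "    ", 4) == 0)
                p += 4;

        for (; *p == ' ' || *p == '\t'; p++)
                indent += (*p == '\t') ? 8 : 1;

        if (p[0] == '#' && p[1] == '#')
                return indent;
        return -1;
}
```

A marker sitting directly at the natural indent gives 0, while one that
is a further four spaces in gives 4.  A preprocessor line such as
`#include` has only one hash and so is ordinary code.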
 535
 536 #### internal functions
 537
 538         static int count_space(char *sol, char *p)
 539         {
 540                 int c = 0;
 541                 while (sol < p) {
 542                         if (sol[0] == ' ')
 543                                 c++;
 544                         if (sol[0] == '\t')
 545                                 c+= 8;
 546                         sol++;
 547                 }
 548                 return c;
 549         }
 550
 551
 552         static char *take_code(char *pos, char *end, char *marker,
 553                                struct psection **table, struct text section,
 554                                int *line_nop)
 555         {
 556                 char *start = pos;
 557                 int line_no = *line_nop;
 558                 int start_line = line_no;
 559                 struct psection *sect;
 560
 561                 sect = section_find(table, section);
 562
 563                 while (pos < end) {
 564                         char *sol, *t;
 565                         struct text ref;
 566
 567                         if (marker && matches(marker, pos, end))
 568                                 break;
 569                         if (!marker &&
 570                             (skip_lws(pos, end))[0] != '\n' &&
 571                             !matches(NULL, pos, end))
 572                                 /* Paragraph not indented */
 573                                 break;
 574
 575                         /* Still in code - check for reference */
 576                         sol = pos;
 577                         if (!marker) {
 578                                 if (*sol == '\t')
 579                                         sol++;
 580                                 else if (strncmp(sol, "    ", 4) == 0)
 581                                         sol += 4;
 582                         }
 583                         t = skip_lws(sol, end);
 584                         if (t[0] != '#' || t[1] != '#') {
 585                                 /* Just regular code here */
 586                                 pos = skip_line(sol, end);
 587                                 line_no++;
 588                                 continue;
 589                         }
 590
 591                         if (pos > start) {
 592                                 struct text txt;
 593                                 txt.txt = start;
 594                                 txt.len = pos - start;
 595                                 code_add_text(sect, txt, start_line);
 596                         }
 597                         ref = take_header(t, end);
 598                         if (ref.len) {
 599                                 struct psection *refsec = section_find(table, ref);
 600                                 code_add_link(sect, refsec, count_space(sol, t));
 601                         }
 602                         pos = skip_line(t, end);
 603                         line_no++;
 604                         start = pos;
 605                         start_line = line_no;
 606                 }
 607                 if (pos > start) {
 608                         struct text txt;
 609                         txt.txt = start;
 610                         txt.len = pos - start;
 611                         code_add_text(sect, txt, start_line);
 612                 }
 613                 if (marker) {
 614                         pos = skip_line(pos, end);
 615                         line_no++;
 616                 }
 617                 *line_nop = line_no;
 618                 return pos;
 619         }
 620
 621 ### Finding the code
 622
 623 It is when looking for the code that we actually use the paragraph
 624 structure.  We need to recognise section headings so we can record the
 625 name, list paragraphs so we can ignore indented follow-on paragraphs,
 626 and the three different markings for code.
 627
 628 #### internal functions
 629
 630         static struct psection *code_find(char *pos, char *end)
 631         {
 632                 struct psection *table = NULL;
 633                 int in_list = 0;
 634                 int line_no = 1;
 635                 struct text section = {0};
 636
 637                 while (pos < end) {
 638                         if (pos[0] == '#') {
 639                                 section = take_header(pos, end);
 640                                 in_list = 0;
 641                                 pos = skip_line(pos, end);
 642                                 line_no++;
 643                         } else if (is_list(pos, end)) {
 644                                 in_list = 1;
 645                                 pos = skip_para(pos, end, &line_no);
 646                         } else if (!in_list && matches(NULL, pos, end)) {
 647                                 pos = take_code(pos, end, NULL, &table,
 648                                                 section, &line_no);
 649                         } else if (matches("```", pos, end)) {
 650                                 in_list = 0;
 651                                 pos = skip_line(pos, end);
 652                                 line_no++;
 653                                 pos = take_code(pos, end, "```", &table,
 654                                                 section, &line_no);
 655                         } else if (matches("~~~", pos, end)) {
 656                                 in_list = 0;
 657                                 pos = skip_line(pos, end);
 658                                 line_no++;
 659                                 pos = take_code(pos, end, "~~~", &table,
 660                                                 section, &line_no);
 661                         } else {
 662                                 if (!isspace(*pos))
 663                                         in_list = 0;
 664                                 pos = skip_para(pos, end, &line_no);
 665                         }
 666                 }
 667                 return table;
 668         }
 669
 670 ### Returning the code
 671
 672 Having found all the code blocks and gathered them into a list of
 673 sections, we are now ready to return them to the caller.  This is where
 674 to perform consistency checks, like at most one reference and at least
 675 one definition for each section.
 676
 677 All the sections with no references are returned in a list for the
 678 caller to consider.  They are linearized first so that the substructure
 679 is completely hidden -- except for the small amount of structure
 680 displayed in the line numbers.
 681
 682 To return errors, we have the caller pass a function which takes an
 683 error message - a `code_err_fn`.
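
##### Example: an error callback

A client might supply a callback that prints each message and counts
how many there were, so that it can choose an exit status later.  This
is a sketch only: `report`, `error_count` and `check_refcnt` are
hypothetical, and `check_refcnt` merely stands in for a library routine
that has found a problem.

```c
#include <stdio.h>

typedef void (*code_err_fn)(char *msg);

static int error_count;

/* Report one problem and remember that something went wrong. */
static void report(char *msg)
{
        fprintf(stderr, "mdcode: %s\n", msg);
        error_count++;
}

/* Stand-in for a library routine: it only knows the callback type,
 * not what the client does with the message.
 */
static void check_refcnt(int refcnt, code_err_fn error)
{
        if (refcnt > 1)
                error("section referenced multiple times");
}
```

Passing a function keeps policy out of the library: one client can
abort on the first message while another collects them all.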
 684
 685 #### exported types
 686
 687         typedef void (*code_err_fn)(char *msg);
 688
 689 #### internal functions
 690         struct section *code_extract(char *pos, char *end, code_err_fn error)
 691         {
 692                 struct psection *table;
 693                 struct section *result = NULL;
 694                 struct section *tofree = NULL;
 695
 696                 table = code_find(pos, end);
 697
 698                 while (table) {
 699                         struct psection *t = (struct psection*)table->next;
 700                         if (table->last == NULL) {
 701                                 char *msg;
 702                                 asprintf(&msg,
 703                                         "Section \"%.*s\" is referenced but not declared",
 704                                          table->section.len, table->section.txt);
 705                                 error(msg);
 706                                 free(msg);
 707                         }
 708                         if (table->refcnt == 0) {
 709                                 /* Root-section,  return it */
 710                                 table->next = result;
 711                                 result = table;
 712                                 code_linearize(result->code);
 713                         } else {
 714                                 table->next = tofree;
 715                                 tofree = table;
 716                                 if (table->refcnt > 1) {
 717                                         char *msg;
 718                                         asprintf(&msg,
 719                                                  "Section \"%.*s\" referenced multiple times (%d).",
 720                                                  table->section.len, table->section.txt,
 721                                                  table->refcnt);
 722                                         error(msg);
 723                                         free(msg);
 724                                 }
 725                         }
 726                         table = t;
 727                 }
 728                 while (tofree) {
 729                         struct section *t = tofree->next;
 730                         free(tofree);
 731                         tofree = t;
 732                 }
 733                 return result;
 734         }
 735
 736 ##### exported functions
 737
 738         struct section *code_extract(char *pos, char *end, code_err_fn error);
 739
 740
 741 ## Using the library
 742
 743 Now that we can extract code from a document and link it all together
 744 it is time to do something with that code.  Firstly we need to print
 745 it out.
 746
 747 ### Printing the Code
 748
 749 Printing is mostly straightforward - we just walk the list and print
 750 the code sections, adding whatever indent is required for each line.
 751 However there is a complication (isn't there always?).
 752
 753 For code that was recognised because the paragraph was indented, we
 754 need to strip that indent first.  For other code, we don't.
 755
 756 The approach taken here is simple, though it could arguably be wrong
 757 in some unlikely cases.  So it might need to be fixed later.
 758
 759 If the first line of a code block is indented, then either one tab or
 760 4 spaces are striped from every non-blank line.
 761
 762 This could go wrong if the first line of a code block marked by
 763 _`` ``` ``_ is indented.  To overcome this we would need to
 764 record someextra state in each `code_node`.  For now we won't bother.
 765
 766 The indents we insert will all be spaces.  This might not work well
 767 for `Makefiles`.
 768
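##### Example: stripping the indent

As a rough illustration of the rule above, here is a minimal sketch that
applies it to a buffer in memory rather than to a `FILE`.  The helper
`undent_copy` is hypothetical and is not part of the tool.  It assumes the
caller supplies a destination with room for `len + 1` bytes, and it ignores
the extra indent that the real printing code adds.  As the section name
starts with `Example:`, none of this is written to any file.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, for illustration only: copy the 'len' bytes at
 * 'src' to 'dst', stripping one tab or four spaces from the start of
 * each line if the block starts with white space.  'dst' must have
 * room for len + 1 bytes.  Returns the number of bytes written, not
 * counting the terminating nul.
 */
static size_t undent_copy(char *dst, const char *src, size_t len)
{
        size_t out = 0;
        int undent = len > 0 && (src[0] == ' ' || src[0] == '\t');

        while (len) {
                if (undent) {
                        if (src[0] == '\t') {
                                src += 1;
                                len -= 1;
                        } else if (len >= 4 && strncmp(src, "    ", 4) == 0) {
                                src += 4;
                                len -= 4;
                        }
                }
                /* Copy the rest of the line, including its newline. */
                while (len) {
                        char ch = *src++;

                        len--;
                        dst[out++] = ch;
                        if (ch == '\n')
                                break;
                }
        }
        dst[out] = 0;
        return out;
}
```

Given `"\tint a;\n\n    int b;\n"` this produces `"int a;\n\nint b;\n"`, while a
block whose first character is not white space is copied unchanged.
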
##### client functions

        static void code_print(FILE *out, struct code_node *node,
                               char *fname)
        {
                for (; node; node = node->next) {
                        char *c = node->code.txt;
                        int len = node->code.len;
                        int undent = 0;

                        if (!len)
                                continue;

                        /* Tell the compiler where this code came from. */
                        fprintf(out, "#line %d \"%s\"\n",
                                node->line_no, fname);
                        /* An indented first line means the block was
                         * recognised by its indent, so strip it.
                         */
                        if (*c == ' ' || *c == '\t')
                                undent = 1;
                        while (len && *c) {
                                fprintf(out, "%*s", node->indent, "");
                                if (undent) {
                                        if (*c == '\t' && len > 1) {
                                                c++;
                                                len--;
                                        } else if (len > 4 && strncmp(c, "    ", 4) == 0) {
                                                c += 4;
                                                len -= 4;
                                        }
                                }
                                /* Copy the rest of the line, newline included. */
                                do {
                                        fputc(*c, out);
                                        c++;
                                        len--;
                                } while (len && c[-1] != '\n');
                        }
                }
        }

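##### Example: what the compiler sees

To see why the `#line` output matters, the following sketch shows its
effect on `__LINE__` and `__FILE__`, which is where a compiler gets the
location it reports.  The line number `100` and the name `doc.mdc` are
made-up values, and both functions are hypothetical, written only for this
illustration.

```c
/* After a #line directive the compiler numbers the next source line
 * as given and reports the given file name from then on.
 */
static int reported_line(void)
{
#line 100 "doc.mdc"
        return __LINE__;        /* the compiler treats this as line 100 */
}

static const char *reported_file(void)
{
        return __FILE__;        /* now "doc.mdc" */
}
```

Here `reported_line()` returns 100 and `reported_file()` returns
`"doc.mdc"`, whatever the real name of the file being compiled.
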
### Bringing it all together

We are just about ready for the `main` function of the tool which will
extract all this lovely code so that it can be compiled.  Just one
helper is still needed.

#### Handling filenames

Section names are stored in a `struct text`, which is not `nul`
terminated.  Filenames passed to `fopen` need to be `nul` terminated.
So we need to convert one to the other, and strip the leading `File:`
off while we are at it.

##### client functions

        static void copy_fname(char *name, int space, struct text t)
        {
                char *sec = t.txt;
                int len = t.len;
                name[0] = 0;
                /* Leave 'name' empty unless this is a "File:" section. */
                if (len < 5 || strncmp(sec, "File:", 5) != 0)
                        return;
                sec += 5;
                len -= 5;
                /* Skip spaces between "File:" and the name. */
                while (len && sec[0] == ' ') {
                        sec++;
                        len--;
                }
                /* Truncate if needed, leaving room for the nul. */
                if (len >= space)
                        len = space - 1;
                strncpy(name, sec, len);
                name[len] = 0;
        }

#### Main

And now we take a single file name, extract the code, and write out a
file for each appropriate code section.  If any errors were reported
along the way, the exit status is non-zero.  And we are done.

##### client includes

        #include <fcntl.h>
        #include <errno.h>
        #include <sys/mman.h>
        #include <string.h>
        #include <stdio.h>

##### client functions

        static int errs;
        static void pr_err(char *msg)
        {
                errs++;
                fprintf(stderr, "%s\n", msg);
        }

        int main(int argc, char *argv[])
        {
                int fd;
                size_t len;
                char *file;
                struct section *table, *s, *prev;

                errs = 0;
                if (argc != 2) {
                        fprintf(stderr, "Usage: mdcode file.mdc\n");
                        exit(2);
                }
                fd = open(argv[1], O_RDONLY);
                if (fd < 0) {
                        fprintf(stderr, "mdcode: cannot open %s: %s\n",
                                argv[1], strerror(errno));
                        exit(1);
                }
                len = lseek(fd, 0, SEEK_END);
                file = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
                if (file == MAP_FAILED) {
                        /* Also catches a failed lseek and an empty file. */
                        fprintf(stderr, "mdcode: cannot map %s: %s\n",
                                argv[1], strerror(errno));
                        exit(1);
                }
                table = code_extract(file, file+len, pr_err);

                /* The loop increment frees each section, even after 'continue'. */
                for (s = table; s;
                        (code_free(s->code), prev = s, s = s->next, free(prev))) {
                        FILE *fl;
                        char fname[1024];
                        if (strncmp(s->section.txt, "Example:", 8) == 0)
                                continue;
                        if (strncmp(s->section.txt, "File:", 5) != 0) {
                                fprintf(stderr, "Unreferenced section is not a file name: %.*s\n",
                                        s->section.len, s->section.txt);
                                errs++;
                                continue;
                        }
                        copy_fname(fname, sizeof(fname), s->section);
                        if (fname[0] == 0) {
                                fprintf(stderr, "Missing file name at: %.*s\n",
                                        s->section.len, s->section.txt);
                                errs++;
                                continue;
                        }
                        fl = fopen(fname, "w");
                        if (!fl) {
                                fprintf(stderr, "Cannot create %s: %s\n",
                                        fname, strerror(errno));
                                errs++;
                                continue;
                        }
                        code_print(fl, s->code, argv[1]);
                        fclose(fl);
                }
                exit(!!errs);
        }

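##### Example: mapping a file with error checks

The `lseek` and `mmap` calls in `main` could be gathered into one routine
that reports failure to its caller.  The sketch below is one possible way
to do that, not something the library provides: `map_file` is a
hypothetical name, and it uses `fstat` rather than `lseek` to find the
size.  It treats an empty file as a failure because a zero-length mapping
is not allowed.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper, not part of mdcode: map 'path' read-only.
 * On success returns the start of the mapping and stores its size
 * in *len.  On any failure, including an empty file, returns NULL.
 */
static char *map_file(const char *path, size_t *len)
{
        struct stat st;
        char *file;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
                return NULL;
        if (fstat(fd, &st) < 0 || st.st_size == 0) {
                close(fd);
                return NULL;
        }
        file = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);      /* the mapping stays valid after the close */
        if (file == MAP_FAILED)
                return NULL;
        *len = (size_t)st.st_size;
        return file;
}
```

A caller would pass the result on as the range `file` to `file + len`, and
release it with `munmap(file, len)` once the extracted code is no longer
needed.
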