   1 # mdcode: extract C code from a _markdown_ file.
   2
   3 _markdown_ is a popular format for simple text markup which can easily
   4 be converted to HTML.  As it allows easy indication of sections of
   5 code, it is quite suitable for use in literate programming.  This file
   6 is an example of that usage.
   7
   8 The code included below provides two related functionalities.
   9 Firstly it provides a library routine for extracting code out of a
  10 _markdown_ file, so that other routines might make use of it.
  11
  12 Secondly it provides a simple client of this routine which extracts
   13 one or more C-language files from a markdown document so they can be
   14 passed to a C compiler.  These two combine to make a tool that is needed
  15 to compile this tool.  Yes, this is circular.  A prototype tool was
  16 used for the first extraction.
  17
  18 The tool provided is described as specific to the C language as it
  19 generates
  20
  21 ##### Example: a _line_ command
  22
  23         #line __line-number__ __file-name__
  24
  25 lines so that the C compiler will report where in the markdown file
  26 any error is found.  This tool is suitable for any other language
  27 which allows the same directive, or will treat it as a comment.
  28
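The effect of such a directive can be seen in a few lines of C.  This is
only a sketch: the file name `example.mdc`, the line number 100, and the
function names are invented for illustration and are not part of the tool.

```c
#include <assert.h>
#include <string.h>

/* Illustration only: the file name and line number are invented. */
#line 100 "example.mdc"
int reported_line(void) { return __LINE__; }    /* this is "line 100" */
int reported_in(const char *file) { return strcmp(__FILE__, file) == 0; }

/* Usage: both facts hold once the directive has been seen. */
void line_demo(void)
{
        assert(reported_line() == 100);
        assert(reported_in("example.mdc"));
}
```

Any diagnostic the compiler issues for the lines after the directive is
attributed to `example.mdc` in the same way.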
  29 ## Literate Details
  30
  31 Literate programming is more than just including comments with the
  32 code, even nicely formatted comments.  It also involves presenting the
  33 code in an order that makes sense to a human, rather than an order
  34 that makes sense to a compiler.  For this reason a core part of any
  35 literate programming tool is the ability to re-arrange the code found
  36 in the document into a different order in the final code file - or
  37 files.  This requires some form of linkage to be encoded.
  38
  39 The approach taken here is focused around section headings - of any
  40 depth.
  41
  42 All the code in any section is treated as a single sequential
  43 collection of code, and is named by the section that it is in.  If
  44 multiple sections have the same name, then the code blocks in all of
  45 them are joined together in the order they appear in the document.
  46
  47 A code section can contain a special marker which starts with 2
  48 hashes: __##__.
  49 The text after the marker must be the name of some section which
  50 contains code.  Code from that section will be interpolated in place
  51 of the marker, and will be indented to match the indent of the marker.
  52
  53 It is not permitted for the same code to be interpolated multiple
  54 times.  Allowing this might make some sense, but it is probably a
   55 mistake, and prohibiting it makes some of the code a bit cleaner.
  56
  57 Equally, every section of code should be interpolated at least once -
  58 with two exceptions.  These exceptions are imposed by the tool, not
  59 the library.  A different client could impose different rules on the
  60 names of top-level code sections.
  61
  62 The first exception we have already seen.  A section name starting
  63 __Example:__ indicates code that is not to be included in the final product.
  64
  65 The second exception is for the top level code sections which will be
  66 written to files.  Again these are identified by their section name.
   67 This must start with __File:__; the following text (after optional
  68 spaces) will be used as a file name.
  69
  70 Any section containing code that does not start __Example:__ or
  71 __File:__ must be included in some other section exactly once.
  72
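##### Example: classifying section names

The naming rules above could be checked roughly as follows.  This is a
sketch under assumptions, not the client's actual code: `kind_of`,
`file_name_of` and the `KIND_*` names are hypothetical, and plain
NUL-terminated strings are used to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum section_kind { KIND_FILE, KIND_EXAMPLE, KIND_FRAGMENT };

/* Return the file name after "File:" (skipping optional spaces),
 * or NULL when the section name does not start with "File:". */
const char *file_name_of(const char *name)
{
        if (strncmp(name, "File:", 5) != 0)
                return NULL;
        name += 5;
        while (*name == ' ')
                name++;
        return name;
}

enum section_kind kind_of(const char *name)
{
        if (file_name_of(name))
                return KIND_FILE;
        if (strncmp(name, "Example:", 8) == 0)
                return KIND_EXAMPLE;
        return KIND_FRAGMENT;   /* must be referenced exactly once */
}

/* Usage: section names taken from this document. */
void kind_demo(void)
{
        assert(kind_of("File: mdcode.h") == KIND_FILE);
        assert(strcmp(file_name_of("File:   md2c.c"), "md2c.c") == 0);
        assert(kind_of("Example: a _line_ command") == KIND_EXAMPLE);
        assert(kind_of("exported types") == KIND_FRAGMENT);
}
```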
  73 ### Multiple files
  74
  75 Allowing multiple top level code sections which name different files
  76 means that one _markdown_ document can describe several files.  This
  77 is very useful with the C language where a program file and a header
  78 file might be related.  For the present document we will have a header
  79 file and two code files, one with the library content and one for the
  80 tool.
  81
  82 It will also be very convenient to create a `makefile` fragment to
  83 ensure the code is compiled correctly.  A simple `make -f mdcode.mk`
  84 will "do the right thing".
  85
  86 ### File: mdcode.mk
  87
  88         CFLAGS += -Wall -g
  89         all::
  90         mdcode.h libmdcode.c md2c.c mdcode.mk :  mdcode.mdc
  91                 ./md2c mdcode.mdc
  92
  93
  94 ### File: mdcode.h
  95
  96         #include <stdio.h>
  97         ## exported types
  98         ## exported functions
  99
 100 ### File: libmdcode.c
 101         #define _GNU_SOURCE
 102         #include <unistd.h>
 103         #include <stdlib.h>
 104         #include <stdio.h>
 105
 106         #include "mdcode.h"
 107         ## internal includes
 108         ## private types
 109         ## internal functions
 110
 111 ### File: mdcode.mk
 112
 113         all :: libmdcode.o
 114         libmdcode.o : libmdcode.c mdcode.h
 115                 $(CC) $(CFLAGS) -c libmdcode.c
 116
 117
 118 ### File: md2c.c
 119
 120         #include <unistd.h>
 121         #include <stdlib.h>
 122         #include <stdio.h>
 123
 124         #include "mdcode.h"
 125
 126         ## client includes
 127         ## client functions
 128
 129 ### File: mdcode.mk
 130
 131         all :: md2c
 132         md2c : md2c.o libmdcode.o
 133                 $(CC) $(CFLAGS) -o md2c md2c.o libmdcode.o
 134         md2c.o : md2c.c mdcode.h
 135                 $(CC) $(CFLAGS) -c md2c.c
 136
 137 ## Data Structures
 138
 139 As the core purpose of _mdcode_ is to discover and re-arrange blocks
 140 of text, it makes sense to map the whole document file into memory and
 141 produce a data structure which lists various parts of the file in the
 142 appropriate order.  Each node in this structure will have some text
 143 from the document, a child pointer, and a next pointer, any of which
 144 might not be present.  The text is most easily stored as a pointer and a
  145 length.  We'll call this a `text`.
 146
 147 A list of these `code_nodes` will belong to each section and it will
 148 be useful to have a separate `section` data structure to store the
 149 list of `code_nodes`, the section name, and some other information.
 150
 151 This other information will include a reference counter so we can
 152 ensure proper referencing, and an `indent` depth.  As referenced
 153 content can have an extra indent added, we need to know what that is.
 154 The `code_node` will also have an `indent` depth which eventually gets
 155 set to the sum for the indents from all references on the path from
 156 the root.
 157
 158 Finally we need to know if the `code_node` was recognised by being
 159 indented or not.  If it was, the client of this data will want to
  160 strip off the leading tab or 4 spaces.  Hence a `needs_strip` flag is
 161 needed.
 162
 163 ##### exported types
 164
 165         struct text {
 166                 char *txt;
 167                 int len;
 168         };
 169
 170         struct section {
 171                 struct text section;
 172                 struct code_node *code;
 173                 struct section *next;
 174         };
 175
 176         struct code_node {
 177                 struct text code;
 178                 int indent;
 179                 int line_no;
 180                 int needs_strip;
 181                 struct code_node *next;
 182                 struct section *child;
 183         };
 184
 185 ##### private types
 186
 187         struct psection {
 188                 struct section;
 189                 struct code_node *last;
 190                 int refcnt;
 191                 int indent;
 192         };
 193
 194 You will note that the `struct psection` contains an anonymous `struct
 195 section` embedded at the start.  To make this work right, GCC
 196 requires the `-fplan9-extensions` flag.
 197
 198 ##### File: mdcode.mk
 199
 200         CFLAGS += -fplan9-extensions
 201
 202 ### Manipulating the node
 203
 204 Though a tree with `next` and `child` links is the easiest way to
 205 assemble the various code sections, it is not the easiest form for
 206 using them.  For that a simple list would be best.
 207
 208 So once we have a fully linked File section we will want to linearize
 209 it, so that the `child` links become `NULL` and the `next` links will
  210 find everything required.  It is at this stage that the requirement
  211 that each section is linked only once becomes important.
 212
 213 `code_linearize` will merge the `code_node`s from any child into the
  214 given `code_node`.  As it does this it sets the `indent` field for
 215 each `code_node`.
 216
 217 Note that we don't clear the section's `last` pointer, even though
 218 it no longer owns any code.  This allows subsequent code to see if a
 219 section ever had any code, and to report an error if a section is
 220 referenced but not defined.
 221
 222 ##### internal functions
 223
 224         static void code_linearize(struct code_node *code)
 225         {
 226                 struct code_node *t;
 227                 for (t = code; t; t = t->next)
 228                         t->indent = 0;
 229                 for (; code; code = code->next)
 230                         if (code->child) {
 231                                 struct code_node *next = code->next;
 232                                 struct psection *pchild =
 233                                         (struct psection *)code->child;
 234                                 int indent = pchild->indent;
 235                                 code->next = code->child->code;
 236                                 code->child->code = NULL;
 237                                 code->child = NULL;
 238                                 for (t = code; t->next; t = t->next)
 239                                         t->next->indent = code->indent + indent;
 240                                 t->next = next;
 241                         }
 242         }
 243
 244 Once a client has made use of a linearized code set, it will probably
 245 want to free it.
 246
 247         void code_free(struct code_node *code)
 248         {
 249                 while (code) {
 250                         struct code_node *this;
 251                         if (code->child)
 252                                 code_linearize(code);
 253                         this = code;
 254                         code = code->next;
 255                         free(this);
 256                 }
 257         }
 258
 259 ##### exported functions
 260
 261         void code_free(struct code_node *code);
 262
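##### Example: linearizing a small tree

To make the splicing concrete, here is a self-contained sketch of the same
idea on a simplified node type.  It is an assumption-laden illustration,
not the library code: `struct node`, `linearize` and `child_indent` are
hypothetical, and the extra indent is stored on the referencing node
rather than in a separate section structure.

```c
#include <assert.h>
#include <stddef.h>

struct node {
        const char *text;
        int indent;
        struct node *next;
        struct node *child;     /* head of the referenced list, or NULL */
        int child_indent;       /* extra indent applied to that list */
};

/* Splice every child list into the main list, accumulating indents. */
void linearize(struct node *n)
{
        struct node *t;

        for (t = n; t; t = t->next)
                t->indent = 0;
        for (; n; n = n->next)
                if (n->child) {
                        struct node *next = n->next;

                        n->next = n->child;
                        n->child = NULL;
                        for (t = n; t->next; t = t->next)
                                t->next->indent = n->indent + n->child_indent;
                        t->next = next;
                }
}

/* Usage: a references b (indent 4), b references c (indent 2), then d. */
int linearize_demo(void)
{
        struct node c = { "c", 0, NULL, NULL, 0 };
        struct node b = { "b", 0, NULL, &c, 2 };
        struct node d = { "d", 0, NULL, NULL, 0 };
        struct node a = { "a", 0, &d, &b, 4 };

        linearize(&a);
        /* The list is now a, b, c, d with indents 0, 4, 6, 0. */
        return a.next == &b && b.next == &c && c.next == &d && !d.next &&
               a.indent == 0 && b.indent == 4 && c.indent == 6 &&
               d.indent == 0 && !a.child && !b.child;
}
```

Because spliced nodes are visited later by the same outer loop, nested
references are flattened without recursion.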
 263 ### Building the tree
 264
 265 As we parse the document there are two things we will want to do to
 266 node trees: add some text or add a reference.  We'll assume for now
 267 that the relevant section structures have been found, and will just
 268 deal with the `code_node`.
 269
  270 Adding text simply means adding another node.  We will never have
  271 empty nodes.  Even if the last node only has a child, new text must go
  272 in a new node.
 273
 274 ##### internal functions
 275
 276         static void code_add_text(struct psection *where, struct text txt,
 277                                   int line_no, int needs_strip)
 278         {
 279                 struct code_node *n;
 280                 if (txt.len == 0)
 281                         return;
 282                 n = malloc(sizeof(*n));
 283                 n->code = txt;
 284                 n->indent = 0;
 285                 n->line_no = line_no;
 286                 n->needs_strip = needs_strip;
 287                 n->next = NULL;
 288                 n->child = NULL;
 289                 if (where->last)
 290                         where->last->next = n;
 291                 else
 292                         where->code = n;
 293                 where->last = n;
 294         }
 295
 296 However when adding a link, we might be able to include it in the last
 297 `code_node` if it currently only has text.
 298
 299         void code_add_link(struct psection *where, struct psection *to,
 300                            int indent)
 301         {
 302                 struct code_node *n;
 303
 304                 to->indent = indent;
 305                 to->refcnt++;   // this will be checked elsewhere
 306                 if (where->last && where->last->child == NULL) {
 307                         where->last->child = to;
 308                         return;
 309                 }
  310                 n = calloc(1, sizeof(*n)); /* also zeroes code.txt and needs_strip */
 311                 n->code.len = 0;
 312                 n->indent = 0;
 313                 n->line_no = 0;
 314                 n->next = NULL;
 315                 n->child = to;
 316                 if (where->last)
 317                         where->last->next = n;
 318                 else
 319                         where->code = n;
 320                 where->last = n;
 321         }
 322
 323 ### Finding sections
 324
 325 Now we need a lookup table to be able to find sections by name.
  326 Something that provides a `log(N)` search time is probably
 327 justified, but for now I want a minimal stand-alone program so a
 328 linked list managed by insertion-sort will do.  As a comparison
 329 function it is easiest to sort based on length before content.  So
 330 sections won't be in standard lexical order, but that isn't important.
 331
 332 If we cannot find a section, we simply want to create it.  This allows
 333 sections and references to be created in any order.  Sections with
 334 no references or no content will cause a warning eventually.
 335
 336 #### internal functions
 337
 338         static int text_cmp(struct text a, struct text b)
 339         {
 340                 if (a.len != b.len)
 341                         return a.len - b.len;
 342                 return strncmp(a.txt, b.txt, a.len);
 343         }
 344
 345         static struct psection *section_find(struct psection **list, struct text name)
 346         {
 347                 struct psection *new;
 348                 while (*list) {
 349                         int cmp = text_cmp((*list)->section, name);
 350                         if (cmp == 0)
 351                                 return *list;
 352                         if (cmp > 0)
 353                                 break;
 354                         list = (struct psection **)&((*list)->next);
 355                 }
 356                 /* Add this section */
 357                 new = malloc(sizeof(*new));
 358                 new->next = *list;
 359                 *list = new;
 360                 new->section = name;
 361                 new->code = NULL;
 362                 new->last = NULL;
 363                 new->refcnt = 0;
 364                 new->indent = 0;
 365                 return new;
 366         }
 367
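##### Example: length-first ordering

The effect of comparing lengths before content can be seen with a small
sketch.  `name_cmp` is a hypothetical restatement of the rule for
NUL-terminated strings, written only for this illustration.

```c
#include <assert.h>
#include <string.h>

/* Same rule as the comparison above: shorter names sort first,
 * and content only matters between names of equal length. */
int name_cmp(const char *a, const char *b)
{
        size_t la = strlen(a), lb = strlen(b);

        if (la != lb)
                return la < lb ? -1 : 1;
        return strncmp(a, b, la);
}

/* Usage: the order differs from plain strcmp() order. */
void name_cmp_demo(void)
{
        assert(strcmp("private types", "exported types") > 0);
        assert(name_cmp("private types", "exported types") < 0);
        assert(name_cmp("File: a", "File: b") < 0);
        assert(name_cmp("md2c.c", "md2c.c") == 0);
}
```

Only equality and a consistent ordering matter for the lookup, so the
unusual order is harmless.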
 368 ## Parsing the _markdown_
 369
 370 Parsing markdown is fairly easy, though there are complications.
 371
 372 The document is divided into "paragraphs" which are mostly separated by blank
 373 lines (which may contain white space).  The first few characters of
 374 the first line of a paragraph determine the type of paragraph.  For
 375 our purposes we are only interested in list paragraphs, code
 376 paragraphs, section headings, and everything else.  Section headings
 377 are single-line paragraphs and so do not require a preceding or
 378 following blank line.
 379
 380 Section headings start with 1 or more hash characters (__#__).  List
 381 paragraphs start with hyphen, asterisk, plus, or digits followed by a
 382 period.  Code paragraphs aren't quite so easy.
 383
 384 The "standard" code paragraph starts with 4 or more spaces, or a tab.
 385 However if the previous paragraph was a list paragraph, then those
  386 spaces indicate another paragraph in the same list item, and 8 or
  387 more spaces are required.  Unless a nested list is in effect, in
  388 which case 12 or more are needed.  Unfortunately not all _markdown_
 389 parsers agree on nested lists.
 390
 391 Two alternate styles for marking code are in active use.  "Github" uses
  392 three backticks (_`` ``` ``_), while "pandoc" uses three or more tildes
 393 (_~~~_).  In these cases the code should not be indented.
 394
 395 Trying to please everyone as much as possible, this parser will handle
 396 everything except for code inside lists.
 397
 398 So an indented (4+) paragraph after a list paragraph is always a list
 399 paragraph, otherwise it is a code paragraph.  A paragraph that starts
 400 with three backticks or three tildes is code which continues until a
 401 matching string of backticks or tildes.
 402
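##### Example: classifying the first line of a paragraph

The rules above amount to a decision on the first few characters of a
paragraph.  The following is a rough sketch of that decision, assuming a
NUL-terminated line; `para_kind` and the `PARA_*` names are hypothetical
and the real parser is structured differently.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

enum para_kind { PARA_HEADING, PARA_LIST, PARA_INDENTED, PARA_FENCE, PARA_OTHER };

enum para_kind para_kind(const char *p)
{
        if (*p == '#')
                return PARA_HEADING;
        if (*p != '\0' && strchr("-*+", *p))
                return PARA_LIST;
        if (isdigit((unsigned char)*p)) {
                while (isdigit((unsigned char)*p))
                        p++;
                return *p == '.' ? PARA_LIST : PARA_OTHER;
        }
        if (*p == '\t' || strncmp(p, "    ", 4) == 0)
                return PARA_INDENTED;   /* code, unless it follows a list */
        /* Three backticks or three tildes open a fenced block. */
        if ((p[0] == '`' || p[0] == '~') && p[1] == p[0] && p[2] == p[0])
                return PARA_FENCE;
        return PARA_OTHER;
}

/* Usage: one sample line for each kind. */
void para_kind_demo(void)
{
        assert(para_kind("## Parsing") == PARA_HEADING);
        assert(para_kind("- item") == PARA_LIST);
        assert(para_kind("12. item") == PARA_LIST);
        assert(para_kind("\tint x;") == PARA_INDENTED);
        assert(para_kind("    int x;") == PARA_INDENTED);
        assert(para_kind("~~~") == PARA_FENCE);
        assert(para_kind("2024 was a year") == PARA_OTHER);
}
```

Whether a `PARA_INDENTED` paragraph is really code still depends on
whether the previous paragraph was a list, which this function cannot see.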
 403 ### Skipping bits
 404
 405 While walking the document looking for various markers we will *not*
 406 use the `struct text` introduced earlier as advancing that requires
 407 updating both start and length which feels clumsy.  Instead we will
 408 carry `pos` and `end` pointers, only the first of which needs to
 409 change.
 410
 411 So to start, we need to skip various parts of the document.  `lws`
 412 stands for "Linear White Space" and is a term that comes from the
  413 Email RFCs (e.g. RFC822).  `line` and `para` are self-explanatory.
 414 Note that `skip_para` needs to update the current line number.
 415 `skip_line` doesn't but every caller should.
 416
 417 #### internal functions
 418
 419         static char *skip_lws(char *pos, char *end)
 420         {
 421                 while (pos < end && (*pos == ' ' || *pos == '\t'))
 422                         pos++;
 423                 return pos;
 424         }
 425
 426         static char *skip_line(char *pos, char *end)
 427         {
 428                 while (pos < end && *pos != '\n')
 429                         pos++;
 430                 if (pos < end)
 431                         pos++;
 432                 return pos;
 433         }
 434
 435         static char *skip_para(char *pos, char *end, int *line_no)
 436         {
 437                 /* Might return a pointer to a blank line, as only
 438                  * one trailing blank line is skipped
 439                  */
 440                 if (*pos == '#') {
 441                         pos = skip_line(pos, end);
 442                         (*line_no) += 1;
 443                         return pos;
 444                 }
 445                 while (pos < end &&
 446                        *pos != '#' &&
 447                        *(pos = skip_lws(pos, end)) != '\n') {
 448                         pos = skip_line(pos, end);
 449                         (*line_no) += 1;
 450                 }
 451                 if (pos < end && *pos == '\n') {
 452                         pos++;
 453                         (*line_no) += 1;
 454                 }
 455                 return pos;
 456         }
 457
 458 ### Recognising things
 459
 460 Recognising a section header is trivial and doesn't require a
 461 function.  However we need to extract the content of a section header
 462 as a `struct text` for passing to `section_find`.
 463 Recognising the start of a new list is fairly easy.  Recognising the
 464 start (and end) of code is a little messy so we provide a function for
 465 matching the first few characters, which has a special case for "4
 466 spaces or tab".
 467
 468 #### internal includes
 469
 470         #include  <ctype.h>
 471         #include  <string.h>
 472
 473 #### internal functions
 474
 475         static struct text take_header(char *pos, char *end)
 476         {
 477                 struct text section;
 478
 479                 while (pos < end && *pos == '#')
 480                         pos++;
 481                 while (pos < end && *pos == ' ')
 482                         pos++;
 483                 section.txt = pos;
 484                 while (pos < end && *pos != '\n')
 485                         pos++;
 486                 while (pos > section.txt &&
 487                        (pos[-1] == '#' || pos[-1] == ' '))
 488                         pos--;
 489                 section.len = pos - section.txt;
 490                 return section;
 491         }
 492
 493         static int is_list(char *pos, char *end)
 494         {
 495                 if (strchr("-*+", *pos))
 496                         return 1;
 497                 if (isdigit(*pos)) {
 498                         while (pos < end && isdigit(*pos))
 499                                 pos += 1;
 500                         if  (pos < end && *pos == '.')
 501                                 return 1;
 502                 }
 503                 return 0;
 504         }
 505
 506         static int matches(char *start, char *pos, char *end)
 507         {
 508                 if (start == NULL)
 509                         return matches("\t", pos, end) ||
 510                                matches("    ", pos, end);
 511                 return (pos + strlen(start) < end &&
 512                         strncmp(pos, start, strlen(start)) == 0);
 513         }
 514
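##### Example: names extracted from headings

The trimming done when taking a header means that heading depth and
trailing hashes do not affect the section name.  A minimal sketch of the
same trimming, assuming NUL-terminated input; `header_is` is a
hypothetical helper written for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Does the heading `line` name the section `expected`? */
int header_is(const char *line, const char *expected)
{
        const char *start, *stop;

        while (*line == '#')
                line++;
        while (*line == ' ')
                line++;
        start = line;
        stop = start + strcspn(start, "\n");
        /* Drop trailing hashes and spaces, as in "## name ##". */
        while (stop > start && (stop[-1] == '#' || stop[-1] == ' '))
                stop--;
        return strlen(expected) == (size_t)(stop - start) &&
               strncmp(start, expected, (size_t)(stop - start)) == 0;
}

/* Usage: different depths and decorations give the same name. */
void header_demo(void)
{
        assert(header_is("### File: mdcode.h\n", "File: mdcode.h"));
        assert(header_is("##### exported types\n", "exported types"));
        assert(header_is("## exported types ##\n", "exported types"));
        assert(!header_is("## private types\n", "exported types"));
}
```

A reference marker inside code is passed through the same trimming, which
is why a two-hash marker can name a section introduced with five hashes.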
 515 ### Extracting the code
 516
 517 Now that we can skip paragraphs and recognise what type each paragraph
  518 is, it is time to parse the file and extract the code.  We'll do this
  519 in two parts: first we look at what to do with some code once we
 520 find it, and then how to actually find it.
 521
 522 When we have some code, we know where it is, what the end marker
 523 should look like, and which section it is in.
 524
 525 There are two sorts of end markers: the presence of a particular
 526 string, or the absence of an indent.  We will use a string to
 527 represent a presence, and a `NULL` to represent the absence.
 528
  529 While looking at code we don't think about paragraphs at all - just
 530 look for a line that starts with the right thing.
 531 Every line that is still code then needs to be examined to see if it
 532 is a section reference.
 533
 534 When a section reference is found, all preceding code (if any) must be
 535 added to the current section, then the reference is added.
 536
 537 When we do find the end of the code, all text that we have found but
 538 not processed needs to be saved too.
 539
 540 When adding a reference we need to set the `indent`.  This is the
 541 number of spaces (counting 8 for tabs) after the natural indent of the
  542 code (which is a tab or 4 spaces).  We use a separate function `count_space`
 543 for that.
 544
 545 #### internal functions
 546
 547         static int count_space(char *sol, char *p)
 548         {
 549                 int c = 0;
 550                 while (sol < p) {
 551                         if (sol[0] == ' ')
 552                                 c++;
 553                         if (sol[0] == '\t')
 554                                 c+= 8;
 555                         sol++;
 556                 }
 557                 return c;
 558         }
 559
 560
 561         static char *take_code(char *pos, char *end, char *marker,
 562                                struct psection **table, struct text section,
 563                                int *line_nop)
 564         {
 565                 char *start = pos;
 566                 int line_no = *line_nop;
 567                 int start_line = line_no;
 568                 struct psection *sect;
 569
 570                 sect = section_find(table, section);
 571
 572                 while (pos < end) {
 573                         char *sol, *t;
 574                         struct text ref;
 575
 576                         if (marker && matches(marker, pos, end))
 577                                 break;
 578                         if (!marker &&
 579                             (skip_lws(pos, end))[0] != '\n' &&
 580                             !matches(NULL, pos, end))
 581                                 /* Paragraph not indented */
 582                                 break;
 583
 584                         /* Still in code - check for reference */
 585                         sol = pos;
 586                         if (!marker) {
 587                                 if (*sol == '\t')
 588                                         sol++;
  589                                 else if (matches("    ", sol, end))
 590                                         sol += 4;
 591                         }
 592                         t = skip_lws(sol, end);
 593                         if (t[0] != '#' || t[1] != '#') {
 594                                 /* Just regular code here */
 595                                 pos = skip_line(sol, end);
 596                                 line_no++;
 597                                 continue;
 598                         }
 599
 600                         if (pos > start) {
 601                                 struct text txt;
 602                                 txt.txt = start;
 603                                 txt.len = pos - start;
 604                                 code_add_text(sect, txt, start_line,
 605                                               marker == NULL);
 606                         }
 607                         ref = take_header(t, end);
 608                         if (ref.len) {
 609                                 struct psection *refsec = section_find(table, ref);
 610                                 code_add_link(sect, refsec, count_space(sol, t));
 611                         }
 612                         pos = skip_line(t, end);
 613                         line_no++;
 614                         start = pos;
 615                         start_line = line_no;
 616                 }
 617                 if (pos > start) {
 618                         struct text txt;
 619                         txt.txt = start;
 620                         txt.len = pos - start;
 621                         code_add_text(sect, txt, start_line,
 622                                       marker == NULL);
 623                 }
 624                 if (marker) {
 625                         pos = skip_line(pos, end);
 626                         line_no++;
 627                 }
 628                 *line_nop = line_no;
 629                 return pos;
 630         }
 631
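##### Example: the indent of a reference

The indent recorded for a reference is what remains after the natural
indent of the code block has been removed.  The following sketch shows the
arithmetic for a single line of an indented code block; `ref_indent` is a
hypothetical helper, and it assumes a NUL-terminated line.

```c
#include <assert.h>
#include <string.h>

/* Return the indent of a reference line, or -1 if it is not one. */
int ref_indent(const char *line)
{
        int c = 0;

        /* Drop the natural indent: one tab or four spaces. */
        if (*line == '\t')
                line++;
        else if (strncmp(line, "    ", 4) == 0)
                line += 4;
        /* Count what is left: 1 for a space, 8 for a tab. */
        for (; *line == ' ' || *line == '\t'; line++)
                c += (*line == '\t') ? 8 : 1;
        if (line[0] != '#' || line[1] != '#')
                return -1;
        return c;
}

/* Usage: the same marker at different depths. */
void ref_indent_demo(void)
{
        assert(ref_indent("\t## exported types") == 0);
        assert(ref_indent("\t\t## client functions") == 8);
        assert(ref_indent("        ## private types") == 4);
        assert(ref_indent("\tint x;") == -1);
}
```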
 632 ### Finding the code
 633
 634 It is when looking for the code that we actually use the paragraph
 635 structure.  We need to recognise section headings so we can record the
 636 name, list paragraphs so we can ignore indented follow-on paragraphs,
 637 and the three different markings for code.
 638
 639 #### internal functions
 640
 641         static struct psection *code_find(char *pos, char *end)
 642         {
 643                 struct psection *table = NULL;
 644                 int in_list = 0;
 645                 int line_no = 1;
 646                 struct text section = {0};
 647
 648                 while (pos < end) {
 649                         if (pos[0] == '#') {
 650                                 section = take_header(pos, end);
 651                                 in_list = 0;
 652                                 pos = skip_line(pos, end);
 653                                 line_no++;
 654                         } else if (is_list(pos, end)) {
 655                                 in_list = 1;
 656                                 pos = skip_para(pos, end, &line_no);
 657                         } else if (!in_list && matches(NULL, pos, end)) {
 658                                 pos = take_code(pos, end, NULL, &table,
 659                                                 section, &line_no);
 660                         } else if (matches("```", pos, end)) {
 661                                 in_list = 0;
 662                                 pos = skip_line(pos, end);
 663                                 line_no++;
 664                                 pos = take_code(pos, end, "```", &table,
 665                                                 section, &line_no);
 666                         } else if (matches("~~~", pos, end)) {
 667                                 in_list = 0;
 668                                 pos = skip_line(pos, end);
 669                                 line_no++;
 670                                 pos = take_code(pos, end, "~~~", &table,
 671                                                 section, &line_no);
 672                         } else {
 673                                 if (!isspace(*pos))
 674                                         in_list = 0;
 675                                 pos = skip_para(pos, end, &line_no);
 676                         }
 677                 }
 678                 return table;
 679         }
 680
 681 ### Returning the code
 682
  683 Having found all the code blocks and gathered them into a list of
  684 sections, we are now ready to return them to the caller.  This is where
  685 we perform consistency checks, such as at most one reference and at least
  686 one definition for each section.
 687
 688 All the sections with no references are returned in a list for the
  689 caller to consider.  They are linearized first so that the substructure
 690 is completely hidden -- except for the small amount of structure
 691 displayed in the line numbers.
 692
 693 To return errors, we have the caller pass a function which takes an
 694 error message - a `code_err_fn`.
 695
 696 #### exported types
 697
 698         typedef void (*code_err_fn)(char *msg);
 699
 700 #### internal functions
 701         struct section *code_extract(char *pos, char *end, code_err_fn error)
 702         {
 703                 struct psection *table;
 704                 struct section *result = NULL;
 705                 struct section *tofree = NULL;
 706
 707                 table = code_find(pos, end);
 708
 709                 while (table) {
 710                         struct psection *t = (struct psection*)table->next;
 711                         if (table->last == NULL) {
 712                                 char *msg;
 713                                 asprintf(&msg,
 714                                         "Section \"%.*s\" is referenced but not declared",
 715                                          table->section.len, table->section.txt);
 716                                 error(msg);
 717                                 free(msg);
 718                         }
 719                         if (table->refcnt == 0) {
 720                                 /* Root-section,  return it */
 721                                 table->next = result;
 722                                 result = table;
 723                                 code_linearize(result->code);
 724                         } else {
 725                                 table->next = tofree;
 726                                 tofree = table;
 727                                 if (table->refcnt > 1) {
 728                                         char *msg;
 729                                         asprintf(&msg,
 730                                                  "Section \"%.*s\" referenced multiple times (%d).",
 731                                                  table->section.len, table->section.txt,
 732                                                  table->refcnt);
 733                                         error(msg);
 734                                         free(msg);
 735                                 }
 736                         }
 737                         table = t;
 738                 }
 739                 while (tofree) {
 740                         struct section *t = tofree->next;
 741                         free(tofree);
 742                         tofree = t;
 743                 }
 744                 return result;
 745         }
 746
 747 ##### exported functions
 748
 749         struct section *code_extract(char *pos, char *end, code_err_fn error);
 750
 751
## Using the library

Now that we can extract code from a document and link it all together
it is time to do something with that code.  Firstly we need to print
it out.

### Printing the Code

Printing is mostly straightforward - we just walk the list and print
the code sections, adding whatever indent is required for each line.
However there is a complication (isn't there always?).

For code that was recognised because the paragraph was indented, we
need to strip that indent first.  For other code, we don't.

The approach taken here is simple, though it could arguably be wrong
in some unlikely cases.  So it might need to be fixed later.

If the first line of a code block is indented, then either one tab or
4 spaces are stripped from every non-blank line.

This could go wrong if the first line of a code block marked by
_`` ``` ``_ is indented.  To overcome this we would need to
record some extra state in each `code_node`.  For now we won't bother.

The indents we insert will all be spaces.  This might not work well
for `Makefile`s.

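The stripping rule described above can be sketched on its own, away
from the `code_node` list.  This is an illustration under stated
assumptions rather than the tool's implementation: `indent_to_strip`
and `strip_block` are hypothetical helpers, and they write to a
caller-supplied buffer instead of a `FILE`.

##### Example: stripping one indent level

```c
#include <stddef.h>
#include <string.h>

/* Return how many leading bytes to drop from one line of `len` bytes
 * (including any trailing newline): 1 for a tab, 4 for four spaces,
 * and 0 for a blank or unindented line. */
size_t indent_to_strip(const char *line, size_t len)
{
        if (len > 1 && line[0] == '\t')
                return 1;
        if (len > 4 && strncmp(line, "    ", 4) == 0)
                return 4;
        return 0;
}

/* Copy `len` bytes of `in` to `out`, stripping one indent level from
 * each line.  `out` must have room for len + 1 bytes.  Returns the
 * number of bytes written, not counting the terminating nul. */
size_t strip_block(const char *in, size_t len, char *out)
{
        size_t o = 0;

        while (len) {
                const char *nl = memchr(in, '\n', len);
                size_t line_len = nl ? (size_t)(nl - in) + 1 : len;
                size_t skip = indent_to_strip(in, line_len);

                memcpy(out + o, in + skip, line_len - skip);
                o += line_len - skip;
                in += line_len;
                len -= line_len;
        }
        out[o] = '\0';
        return o;
}
```

Each length test comes before the matching comparison, so a short
final line is never read past its end.  A line holding only a newline
is left untouched.
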
##### internal functions

        void code_node_print(FILE *out, struct code_node *node,
                             char *fname)
        {
                for (; node; node = node->next) {
                        char *c = node->code.txt;
                        int len = node->code.len;

                        if (!len)
                                continue;

                        fprintf(out, "#line %d \"%s\"\n",
                                node->line_no, fname);
                        while (len && *c) {
                                fprintf(out, "%*s", node->indent, "");
                                if (node->needs_strip) {
                                        /* Drop one indent level: a tab or 4 spaces.
                                         * Check the length first so strncmp() never
                                         * reads past the end of the text. */
                                        if (*c == '\t' && len > 1) {
                                                c++;
                                                len--;
                                        } else if (len > 4 && strncmp(c, "    ", 4) == 0) {
                                                c += 4;
                                                len -= 4;
                                        }
                                }
                                /* Copy the rest of the line, including its newline. */
                                do {
                                        fputc(*c, out);
                                        c++;
                                        len--;
                                } while (len && c[-1] != '\n');
                        }
                }
        }

##### exported functions

        void code_node_print(FILE *out, struct code_node *node, char *fname);

### Bringing it all together

We are just about ready for the `main` function of the tool which will
extract all this lovely code so that it can be compiled.  Just one
helper is still needed.

#### Handling filenames

Section names are stored in `struct text` which is not `nul`
terminated.  Filenames passed to `fopen` need to be `nul` terminated.
So we need to convert one to the other, and strip the leading `File:`
off while we are at it.

##### client functions

        static void copy_fname(char *name, int space, struct text t)
        {
                char *sec = t.txt;
                int len = t.len;
                name[0] = 0;
                if (len < 5 || strncmp(sec, "File:", 5) != 0)
                        return;
                sec += 5;
                len -= 5;
                while (len && sec[0] == ' ') {
                        sec++;
                        len--;
                }
                if (len >= space)
                        len = space - 1;
                strncpy(name, sec, len);
                name[len] = 0;
        }

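The same conversion can also be written with `snprintf`, whose
precision and size arguments take care of both the missing `nul` and
the truncation.  This is a sketch of an alternative, not the helper
the tool uses: `fname_from_section` is a hypothetical name, and the
layout of `struct text` is assumed from the two members used in this
document.

##### Example: a file name from a section heading

```c
#include <stdio.h>
#include <string.h>

/* Assumed layout: only the `txt` and `len` members appear in this document. */
struct text {
        char *txt;
        int len;
};

/* Hypothetical variant of copy_fname() built on snprintf(). */
void fname_from_section(char *name, int space, struct text t)
{
        const char *sec = t.txt;
        int len = t.len;

        if (space <= 0)
                return;
        name[0] = 0;
        if (len < 5 || strncmp(sec, "File:", 5) != 0)
                return;
        sec += 5;
        len -= 5;
        while (len && sec[0] == ' ') {
                sec++;
                len--;
        }
        /* The precision limits how many bytes are read from `sec`;
         * the size limits how many are written to `name`. */
        snprintf(name, space, "%.*s", len, sec);
}
```

For a heading of `File: md2c.c` this leaves `md2c.c` in `name`.  A
heading that does not start with `File:` leaves `name` empty, which is
how the caller can detect a section that does not name a file.
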
#### Main

And now we take a single file name, extract the code, and if there are
no errors we write out a file for each appropriate code section.  And
we are done.

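Getting the whole document into memory is the first step, and it is
worth seeing with every failure path spelled out.  The following is a
sketch under stated assumptions, not the code used below: `map_file`
is a hypothetical helper, it uses `fstat` where `main` uses `lseek`,
and it maps with `MAP_PRIVATE`.

##### Example: mapping a file read-only

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map `path` read-only.  On success return 0 and set *data and *len;
 * an empty file yields *data == NULL and *len == 0.  On failure
 * return -1 with errno set by the failing call. */
int map_file(const char *path, char **data, size_t *len)
{
        struct stat st;
        int fd = open(path, O_RDONLY);

        *data = NULL;
        *len = 0;
        if (fd < 0)
                return -1;
        if (fstat(fd, &st) < 0) {
                close(fd);
                return -1;
        }
        if (st.st_size > 0) {
                /* mmap() rejects a zero length, so only map non-empty files. */
                void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

                if (p == MAP_FAILED) {
                        close(fd);
                        return -1;
                }
                *data = p;
                *len = st.st_size;
        }
        close(fd);      /* the mapping remains valid after close() */
        return 0;
}
```

A caller would pass `data` and `data + len` to `code_extract` and
release the mapping with `munmap(data, len)` when finished.
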
##### client includes

        #include <fcntl.h>
        #include <errno.h>
        #include <sys/mman.h>
        #include <string.h>

##### client functions

        static int errs;
        static void pr_err(char *msg)
        {
                errs++;
                fprintf(stderr, "%s\n", msg);
        }

        int main(int argc, char *argv[])
        {
                int fd;
                size_t len;
                char *file;
                struct section *table, *s, *prev;

                errs = 0;
                if (argc != 2) {
                        fprintf(stderr, "Usage: mdcode file.mdc\n");
                        exit(2);
                }
                fd = open(argv[1], O_RDONLY);
                if (fd < 0) {
                        fprintf(stderr, "mdcode: cannot open %s: %s\n",
                                argv[1], strerror(errno));
                        exit(1);
                }
                len = lseek(fd, 0, SEEK_END);
                file = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
                if (file == MAP_FAILED) {
                        /* This also catches an empty file, which cannot be mapped. */
                        fprintf(stderr, "mdcode: cannot map %s: %s\n",
                                argv[1], strerror(errno));
                        exit(1);
                }
                table = code_extract(file, file+len, pr_err);

                /* The loop step frees each section, even after a `continue`. */
                for (s = table; s;
                        (code_free(s->code), prev = s, s = s->next, free(prev))) {
                        FILE *fl;
                        char fname[1024];
                        if (strncmp(s->section.txt, "Example:", 8) == 0)
                                continue;
                        if (strncmp(s->section.txt, "File:", 5) != 0) {
                                fprintf(stderr, "Unreferenced section is not a file name: %.*s\n",
                                        s->section.len, s->section.txt);
                                errs++;
                                continue;
                        }
                        copy_fname(fname, sizeof(fname), s->section);
                        if (fname[0] == 0) {
                                fprintf(stderr, "Missing file name at:%.*s\n",
                                        s->section.len, s->section.txt);
                                errs++;
                                continue;
                        }
                        fl = fopen(fname, "w");
                        if (!fl) {
                                fprintf(stderr, "Cannot create %s: %s\n",
                                        fname, strerror(errno));
                                errs++;
                                continue;
                        }
                        code_node_print(fl, s->code, argv[1]);
                        fclose(fl);
                }
                exit(!!errs);
        }
