dbay-rustybuzz

1.5.0 • Public • Published

𓆤DBay 𓏞RustyBuzz

Table of Contents generated with DocToc

𓆤DBay 𓏞RustyBuzz

There's no documentation, yet, just demos, tests, benchmarks, and a few scattered notes

Notes

General Procedure

  • Prepare

    • Initialize
    • Normalize text (replacing e.g. symbols like ­, &wbr; with their Unicode equivalents etc.)
    • Split into paragraphs
    • Hyphenate
    • Register fonts in DB and in rustybuzz-wasm
  • Arrange (shape_text()). Translate text to a series of ( x, y )-positioned glyfs (a.k.a. shapes or outlines, identified by numerical Glyf IDs (GIDs)). Input is (possibly hyphenated) text, output is a series of positioned Arrangement Data items (ADs) on a single, long line.

  • Distribute (distribute()). Given the metrics of a paragraph (i.e. line lengths) and shaped text (in the form of ADs), break text into individual lines by looking for good points in the ADs (spaces, hyphens, WBRs) where line wrapping may occur. Input is ADs and shape of paragraph, output is in tables lines, line_ads with line numbers and corrected coordinates.

  • Compose (compose())

    • XXX Update table outlines with the outlines of all glyfs needed for composition
    • YYY write HTML+SVG
  • Arrangement and composition is currently done per glyf / outline / 'sort' (German, interestingly: 'Letter') but it could conceivably also be done per word, per shape group (SG), or per 'syllable'. Using a bigger unit than the glyf for composition allows to reduce the number of <use/> tags needed, at the cost of having intermediary <use/> tags. Where we now have <use/> tags that refer directly to outlines defined as <path/> elements, we'd then build lines with <use/> tags that represent entire (parts of) words; words and parts thereof are in turn defined as series of relatively positioned <use/> tags that themselves refer to <path/>s. Given a long enough text in a uniform font, arrange()/shape_text() would only have to be called for the ever decreasing portions of text that have not yet been arranged; whether this would lead to a more efficient / faster / space-saving process is an open question.

IDs of SVG Paths / Glyf Outlines

  • Shape IDs (SIDs)

    • identify unscaled outlines, each being an SVG path element that pictures a glyf outline in a 1000⨉1000 em square.
    • initial letter o (Outline)
    • followed by decimal Glyf ID / gid
    • followed by fontnick; these must always start with a letter
    • o${gid}${fontnick}
    • Ex.: o42foo identifies a glyf outline with GID 42 from font foo
  • Special SIDs

    • o0${fontnick} represents the .notdef element of a font ${fontnick}; this will normally displayed by a highly visible rectangle with an overlayed text element.
    • oshy-${fontnick} represents U+00ad, the soft hyphen (SHY) which is inserted manually or by using a hyphenating algorithm to signal positions in a text where line breaking may occur by inserting a hyphen.
    • owbr-${fontnick} represents U+200b, the zero-width space (WBR) which is inserted manually or by applying the Unicode Line Breaking Algorithm(UAX#14) to signal positions in a text where line breaking may occur without inserting a hyphen.
    • Normally SHYs and WBRs will not be output to the typeset SVG; when they are, they may be displayed as vertical or slanted lines.

Selectable Text

Since printable text is rendered as SVG elements such as <path/>, text will not be selectable unless special care is taken.

  • We overlay the entire document with a single custom HTML <textcontainer> element.
  • Inside this, each line of output text is represented by an absolutely positioned <span> element.
  • The first line of each paragraph is preceded by an unclosed HTML5 <p> element.
  • There must be no space between the <span>s or between a <p> and a <span>, which may either be achieved by putting them end-to-end, or else use HTML comments to hide a newline character (this latter option will only be used when more human-readable output is being called for).
  • The final character of each <span> should be the one that would be used in absence of the line break, i.e. a space if the next line starts with a new word, and no space in the case of a hyphenated word.
  • Therefore, depending on where a line break happens to occur, render run-of-the-mill procedure as
    • <span>run-of-the-</span></span>mill procedure<span> (run-of-the-⏎mill procedure)
    • <span>run-of-the-mill </span></span>procedure<span> (run-of-the-mill⏎procedure)
    • <span>run-of-the-mill pro</span></span>cedure<span> (run-of-the-mill pro-⏎cedure)
  • Each page gets its own <textcontainer> element to provide a convenient reference point for relative coordinates. It turns out that when the user selects text crossing the boundaries of the block elements (as in one </span> </textcontainer><textcontainer><span>two </span><span>three</span></textcontainer>) they will get one ⏎two three with a newline inserted where the page break was. This is ideal since paragraph breaks are rendered as empty lines (two line breaks in a row); markup languages like MarkDown and the reformatting utilites of text editors alike will do the right thing in this case.
  • All elements except <textcontainer> are in the CSS set to be non-selectable with user-select: none;.
  • To summarize, in the copied text, paragraphs will appear as single lines, separated by blank lines (double newlines), and page breaks as single newlines.
  • The spans must appear in logical, not geometric order, i.e. they must appear in the order the source text was intended to be read. The browser will take care of proper hilite: when one starts to mark text near the bottom of the page in the left hand column and moves the cursor over to the right, the hilites section will 'snap' to include the rest of the left column (after & below), and the beginning of the right column (before & above) the current cursor position (and thus include all of the text that logically comes between these two points).
  • The text is sized, positioned and formatted such as to closely mimick the positions and sizes of the SVG outlines; it is given the CSS transparent color to make it invisible (unless when selected).
  • As an aside, it does not seem to be possible to obtain full compatibility with the standard <span> element just by defining a custom element and declaring it to display: inline;; when text is copied across multiple such elements, line breaks will still be inserted by the browser.
<textcontainer><!--?textcontainer-start?-->
<p><!--
--><span style='left:0mm;top:40mm;'>first line of text </span><!--
--><span style='left:00mm;top:50mm;'>second line-</span><!--
--><span style='left:50mm;top:40mm;'>of-text; third line </span><!--
--><span style='left:50mm;top:50mm;'>fourth line</span><!--
--><p><!--
--><span style='left:0mm;top:40mm;'>first line of text </span><!--
--><span style='left:00mm;top:50mm;'>second line-</span><!--
--><span style='left:50mm;top:40mm;'>of-text; third line </span><!--
--><span style='left:50mm;top:50mm;'>fourth line</span>
<!--?textcontainer-end?--></textcontainer>

DB Structure

  • table ads:
    • field ads.gid: Glyf ID, the numeric ID of the respective glyf (outline, shape) in the current font. By convention and OTF specification, GID 0 is used to represent any codepoint that cannot be represented with the current font; many applications will show those as so-called 'tofus' (often a rectangle like ▯); beyond that, Glyf IDs are arbitrary and vary from font to font.
    • fields ads.b1, ads.b2: The indexes of the respective first bytes that correspond to the respective current and next ADs in the source text.
      • The b1 value of an AD always equals the b2 value of the preceding AD (except for the first AD in a paragraph, whose b1 is always 0 without a predecessor); conversely, the b2 value of AD always equals the b1 of the succeeding AD (except for the last AD in a paragraph, whose b2 always equals the byte length of the source text without a successor).
      • These constraints mean that fields b1 and b2 can be interpreted as implementing a doubly linked list; this is a simple ('single-track') linked list when only looking at track trk: 1 and a multiply linked list when one looks at all the tracks.
        • Multiple tracks occur because of the need to re-shape Shape Groups (SGs) of interdependent outlines in the presence of a Soft Hyphen (SHY); in the below example, there's
          • a soft hyphen between the two fs of affirm (i.e. af¬firm, where the SHY is not normally visible and does not inhibit the usage of GID 85 that pictures ff nor the application of kerning between ff and i) on track trk: 1 and an alternative track trk: > 1 (an arbitrary but locally unique number greater than 1), and
          • a Hard Hyphen (HHY) between the two fs of affirm (i.e. af-, firm).
          • Observe that some GIDs differ between the two tracks, that's because one track uses a ligature where the other track uses two (identical) outlines for the same portion of the source text.
      • The pair ( b1, b2 ) can not only be used to sort ADs such that they are in logical order (for which see Chapter 2.2: Unicode Design Principles in The Unicode Standard (v14), p19).
      • Since byte_range := ( b1, b2 ) refers to a range of contiguous bytes in the source text, one can, by preserving the two fields across transformations, pinpoint, for any outline in the resulting output (HTML or PDF), the exact location in the source text that is responsible for the outline in question.
      • One may say that the coordinate tuple ( x1, x2 ) reifies (materializes) the visual ordering whereas the byte index tuple ( b1, b2 ) reifies the logical ordering of a given source text.

( x1, x2 )            ┌ 6110 ┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄ 6874 ┐
                      │                                      │
gid          23         85          3         -1         176        40         180         3
          ┌───────┐  ┌───────┐  ┌───────┐  ┌───────┐  ┌───────┐  ┌───────┐  ┌───────┐  ┌───────┐
trk: 1  ──┼─> a <─┼─┬┼─> f <─┼──┼─> ¬ <─┼──┼─> f <─┼──┼─> i <─┼┬─┼─> r <─┼──┼─> m <─┼──┼─> ␣ <─┼──
          └───────┘ ┆└───────┘  └───────┘  └───────┘  └───────┘┆ └───────┘  └───────┘  └───────┘
b1, b2      10, 11  ┆  11, 12     12, 14     14, 15     15, 16 ┆   16, 17     17, 18     18, 19
                    ┆                                          ┆
                    ┆                                          ┆
gid                 ┆   28         50         28         176   ┆
                    ┆┌───────┐  ┌───────┐  ┌───────┐  ┌───────┐┆
trk: > 1            └┼─> f <─┼──┼─> - <─┼──┼─> f <─┼──┼─> i <─┼┘
                     └───────┘  └───────┘  └───────┘  └───────┘
b1, b2                 11, 12     12, 14     14, 15     15, 16
                      │                │    │                │
( x1, x2 )            └ 6110 ┄┄┄┄ 6711 ┘    └ 6359 ┄┄┄┄ 6874 ┘

excerpt of table `ads` with `trk` track 1:               .. and with `trk` track > 1:
┌─────┬────┬────┬─────┬──────┬──────┬──────┬──────┐      ┌─────┬────┬────┬─────┬──────┬──────┬──────┬──────┐
│ adi │ b1 │ b2 │ sgi │ osgi │  gid │ chrs │  x   │      │ adi │ b1 │ b2 │ sgi │ osgi │  gid │ chrs │  x   │
├─────┼────┼────┼─────┼──────┼──────┼──────┼──────┤      ├─────┼────┼────┼─────┼──────┼──────┼──────┼──────┤
│     │    │    │     │      │      │      │      │      │     │    │    │     │      │      │      │      │
│  10 │ 10 │ 11 │ 11  │ null │   23 │ 'a'  │ 5610 │      │     │    │    │     │      │      │      │      │
│  11 │ 11 │ 12 │ 12  │ null │   85 │ 'f'  │ 6110 │      │  47 │ 11 │ 12 │ xxx │  12  │   28 │ 'f'  │ 6110 │
│  12 │ 12 │ 14 │ 12  │ null │    3 │ '¬'  │ 6671 │      │  48 │ 12 │ 14 │ xxx │  12  │   50 │ '-'  │ 6422 │
│  13 │ 14 │ 15 │ 12  │ null │   -1 │ 'f'  │ 6671 │      │  49 │ 14 │ 15 │ xxx │  12  │   28 │ 'f'  │ 6359 │
│  14 │ 15 │ 16 │ 12  │ null │  176 │ 'i'  │ 6671 │      │  50 │ 15 │ 16 │ xxx │  12  │  176 │ 'i'  │ 6671 │
│  15 │ 16 │ 17 │ 13  │ null │   40 │ 'r'  │ 6874 │      │     │    │    │     │      │      │      │      │
│  16 │ 17 │ 18 │ 13  │ null │  180 │ 'm'  │ 7259 │
│  17 │ 18 │ 19 │ 14  │ null │    3 │ ' '  │ 8014 │
│     │    │    │     │      │      │      │      │



To Do

  • [–] [RustyBuzz-WASM] modify glyph_to_svg_pathdata() / rectangle_from_bbox() to return coordinates instead of ready-built SVG <rect/> as wee need those numbers to construct the line bounding box

  • [–] [RustyBuzz-WASM] implement caching of font face in glyph_to_svg_pathdata() to avoid re-building it for each single outline; benchmark

  • [–] Two independent imports of [RustyBuzz-WASM] will share state if they happen to import the same file

  • [+] keep bounding box as { x, y, x1, y1, } in outline table to make it easier to compute composite bounding box

  • [–] [RustyBuzz-WASM] implement

    despace_svg_pathdata = ( svg_pathda ) ->
      R = svg_pathda
      R = R.replace /([0-9])\x20([^0-9])/g, '$1$2'
      R = R.replace /([^0-9])\x20([0-9])/g, '$1$2'
      return R
  • [–] [RustyBuzz-WASM] implement SVG path compression

  • [+] allow to pass in custom instance of rustybuzz-wasm as cfg.RBW for testing purposes

  • [–] [RustyBuzz-WASM] move to WASI to enable reading from file system &c:

  • [–] document SVG construction principles

  • [–] document SVG units (mm, px, implicit)

  • [–] implement CSS scaling function (so 1mm in CSS shows up as 1mm in rendering)

  • [+] rename shape_text() to arrange():

    • arrange() -> ads: gets an arrangement of SIDs (arrangement data items); this called by
    • compose() -> { known_ods, new_ods, missing_chrs, ads, fm, }: turns an arrangement into a hitherto unnamed data structure that indicates which outlines to put where, which SIDs are known, which are new, which are missing from the current font, and fontmetrics
    • distribute(): distributes text over lines
  • [–] [icql-dba-hollerith] update & rebrand as dbay-hollerith, removing peer dependency on icql-dba

  • [–] Ensure that entire text gets reflected in SVG comments when so configured, even while omitting refs for whitespace.

  • [+] [hengist] Implement a 'Glyf Grid' to see which outlines are mapped to which GIDs

  • [+] Rename field ads.alt to ads.trk ('track', as used in this README)

  • [+] Remove field ads.adi (or rename ads.id -> ads.adi) if possible since fields id and b1 are better fits for the same puproses (identifying an AD, logical ordering of ADs).

  • [–] (re-)implement display of additional data / invisible elements / text metrics.

    • Symbols should take account of font metrics, especially ascender, baseline, descender, italic angle (skewX), en-width (for tofus / missing glyfs)

    • [–] SPC symbol for space; should show actual extent, similar to

      <!--&spc;--><path transform='skewX(-14)' d='M 2350 0 L 2350 200 L 2593 200 L 2593 0' style='fill:transparent;stroke:red;stroke-width:10mm'/>

      (albeit with style attributes in <style> section). Since each SPC has (potentially) an individual width, no <use/> tags are used, instead, elements will be generated by an approprate method.

    • [–] WBR automatic and manual line break opportunities (can we distinguish between manual, automatic?)

    • [–] SHY hyphenation points

    • [–] GID Glyph IDs

    • [–] 'Tofus' for missing characters (one for half-width, one for full-width codepoints)

    • [–] BBOX glyph bounding box

    • [–] NL manual line breaks

Package Sidebar

Install

npm i dbay-rustybuzz

Weekly Downloads

1

Version

1.5.0

License

MIT

Unpacked Size

3.09 MB

Total Files

48

Last publish

Collaborators

  • loveencounterflow