Table of Contents generated with DocToc
There's no documentation, yet, just demos, tests, benchmarks, and a few scattered notes
-
Prepare
- Initialize
- Normalize text (replacing e.g. symbols like
­
,&wbr;
with their Unicode equivalents etc.) - Split into paragraphs
- Hyphenate
- Register fonts in DB and in
rustybuzz-wasm
-
Arrange (
shape_text()
). Translate text to a series of( x, y )
-positioned glyfs (a.k.a. shapes or outlines, identified by numerical Glyf IDs (GIDs)). Input is (possibly hyphenated) text, output is a series of positioned Arrangement Data items (ADs) on a single, long line. -
Distribute (
distribute()
). Given the metrics of a paragraph (i.e. line lengths) and shaped text (in the form of ADs), break text into individual lines by looking for good points in the ADs (spaces, hyphens, WBRs) where line wrapping may occur. Input is ADs and shape of paragraph, output is in tableslines
,line_ads
with line numbers and corrected coordinates. -
Compose (
compose()
)-
XXX Update table
outlines
with the outlines of all glyfs needed for composition - YYY write HTML+SVG
-
XXX Update table
-
Arrangement and composition is currently done per glyf / outline / 'sort' (German, interestingly: 'Letter') but it could conceivably also be done per word, per shape group (SG), or per 'syllable'. Using a bigger unit than the glyf for composition allows to reduce the number of
<use/>
tags needed, at the cost of having intermediary<use/>
tags. Where we now have<use/>
tags that refer directly to outlines defined as<path/>
elements, we'd then build lines with<use/>
tags that represent entire (parts of) words; words and parts thereof are in turn defined as series of relatively positioned<use/>
tags that themselves refer to<path/>
s. Given a long enough text in a uniform font,arrange()
/shape_text()
would only have to be called for the ever decreasing portions of text that have not yet been arranged; whether this would lead to a more efficient / faster / space-saving process is an open question.
-
Shape IDs (SIDs)
- identify unscaled outlines, each being an SVG path element that pictures a glyf outline in a 1000⨉1000 em square.
- initial letter
o
(Outline) - followed by decimal Glyf ID /
gid
- followed by
fontnick
; these must always start with a letter o${gid}${fontnick}
- Ex.:
o42foo
identifies a glyf outline with GID42
from fontfoo
-
Special SIDs
-
o0${fontnick}
represents the.notdef
element of a font${fontnick}
; this will normally displayed by a highly visible rectangle with an overlayed text element. -
oshy-${fontnick}
represents U+00ad, the soft hyphen (SHY) which is inserted manually or by using a hyphenating algorithm to signal positions in a text where line breaking may occur by inserting a hyphen. -
owbr-${fontnick}
represents U+200b, the zero-width space (WBR) which is inserted manually or by applying the Unicode Line Breaking Algorithm(UAX#14) to signal positions in a text where line breaking may occur without inserting a hyphen. - Normally SHYs and WBRs will not be output to the typeset SVG; when they are, they may be displayed as vertical or slanted lines.
-
Since printable text is rendered as SVG elements such as <path/>
, text will not be selectable
unless special care is taken.
- We overlay the entire document with a single custom HTML
<textcontainer>
element. - Inside this, each line of output text is represented by an absolutely positioned
<span>
element. - The first line of each paragraph is preceded by an unclosed HTML5
<p>
element. - There must be no space between the
<span>
s or between a<p>
and a<span>
, which may either be achieved by putting them end-to-end, or else use HTML comments to hide a newline character (this latter option will only be used when more human-readable output is being called for). - The final character of each
<span>
should be the one that would be used in absence of the line break, i.e. a space if the next line starts with a new word, and no space in the case of a hyphenated word. - Therefore, depending on where a line break happens to occur, render
run-of-the-mill procedure
as-
<span>run-of-the-</span></span>mill procedure<span>
(run-of-the-⏎mill procedure
) -
<span>run-of-the-mill </span></span>procedure<span>
(run-of-the-mill⏎procedure
) -
<span>run-of-the-mill pro</span></span>cedure<span>
(run-of-the-mill pro-⏎cedure
)
-
- Each page gets its own
<textcontainer>
element to provide a convenient reference point for relative coordinates. It turns out that when the user selects text crossing the boundaries of the block elements (as inone </span> </textcontainer><textcontainer><span>two </span><span>three</span></textcontainer>
) they will getone ⏎two three
with a newline inserted where the page break was. This is ideal since paragraph breaks are rendered as empty lines (two line breaks in a row); markup languages like MarkDown and the reformatting utilites of text editors alike will do the right thing in this case. - All elements except
<textcontainer>
are in the CSS set to be non-selectable withuser-select: none;
. - To summarize, in the copied text, paragraphs will appear as single lines, separated by blank lines (double newlines), and page breaks as single newlines.
- The spans must appear in logical, not geometric order, i.e. they must appear in the order the source text was intended to be read. The browser will take care of proper hilite: when one starts to mark text near the bottom of the page in the left hand column and moves the cursor over to the right, the hilites section will 'snap' to include the rest of the left column (after & below), and the beginning of the right column (before & above) the current cursor position (and thus include all of the text that logically comes between these two points).
- The text is sized, positioned and formatted such as to closely mimick the positions and sizes of the SVG
outlines; it is given the CSS
transparent
color to make it invisible (unless when selected). - As an aside, it does not seem to be possible to obtain full compatibility with the standard
<span>
element just by defining a custom element and declaring it todisplay: inline;
; when text is copied across multiple such elements, line breaks will still be inserted by the browser.
<textcontainer><!--?textcontainer-start?-->
<p><!--
--><span style='left:0mm;top:40mm;'>first line of text </span><!--
--><span style='left:00mm;top:50mm;'>second line-</span><!--
--><span style='left:50mm;top:40mm;'>of-text; third line </span><!--
--><span style='left:50mm;top:50mm;'>fourth line</span><!--
--><p><!--
--><span style='left:0mm;top:40mm;'>first line of text </span><!--
--><span style='left:00mm;top:50mm;'>second line-</span><!--
--><span style='left:50mm;top:40mm;'>of-text; third line </span><!--
--><span style='left:50mm;top:50mm;'>fourth line</span>
<!--?textcontainer-end?--></textcontainer>
-
table
ads
:-
field
ads.gid
: Glyf ID, the numeric ID of the respective glyf (outline, shape) in the current font. By convention and OTF specification, GID0
is used to represent any codepoint that cannot be represented with the current font; many applications will show those as so-called 'tofus' (often a rectangle like ▯); beyond that, Glyf IDs are arbitrary and vary from font to font. -
fields
ads.b1
,ads.b2
: The indexes of the respective first bytes that correspond to the respective current and next ADs in the source text.- The
b1
value of an AD always equals theb2
value of the preceding AD (except for the first AD in a paragraph, whoseb1
is always0
without a predecessor); conversely, theb2
value of AD always equals theb1
of the succeeding AD (except for the last AD in a paragraph, whoseb2
always equals the byte length of the source text without a successor). - These constraints mean that fields
b1
andb2
can be interpreted as implementing a doubly linked list; this is a simple ('single-track') linked list when only looking at tracktrk: 1
and a multiply linked list when one looks at all the tracks.- Multiple tracks occur because of the need to re-shape Shape Groups (SGs) of interdependent outlines
in the presence of a Soft Hyphen (SHY); in the below example, there's
- a soft hyphen between the two
f
s ofaffirm
(i.e.af¬firm
, where the SHY is not normally visible and does not inhibit the usage of GID 85 that picturesff
nor the application of kerning betweenff
andi
) on tracktrk: 1
and an alternative tracktrk: > 1
(an arbitrary but locally unique number greater than 1), and - a Hard Hyphen (HHY) between the two
f
s ofaffirm
(i.e.af-
,firm
). - Observe that some GIDs differ between the two tracks, that's because one track uses a ligature where the other track uses two (identical) outlines for the same portion of the source text.
- a soft hyphen between the two
- Multiple tracks occur because of the need to re-shape Shape Groups (SGs) of interdependent outlines
in the presence of a Soft Hyphen (SHY); in the below example, there's
- The pair
( b1, b2 )
can not only be used to sort ADs such that they are in logical order (for which see Chapter 2.2: Unicode Design Principles in The Unicode Standard (v14), p19). - Since
byte_range := ( b1, b2 )
refers to a range of contiguous bytes in the source text, one can, by preserving the two fields across transformations, pinpoint, for any outline in the resulting output (HTML or PDF), the exact location in the source text that is responsible for the outline in question. - One may say that the coordinate tuple
( x1, x2 )
reifies (materializes) the visual ordering whereas the byte index tuple( b1, b2 )
reifies the logical ordering of a given source text.
- The
-
field
( x1, x2 ) ┌ 6110 ┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄ 6874 ┐
│ │
gid 23 85 3 -1 176 40 180 3
┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐
trk: 1 ──┼─> a <─┼─┬┼─> f <─┼──┼─> ¬ <─┼──┼─> f <─┼──┼─> i <─┼┬─┼─> r <─┼──┼─> m <─┼──┼─> ␣ <─┼──
└───────┘ ┆└───────┘ └───────┘ └───────┘ └───────┘┆ └───────┘ └───────┘ └───────┘
b1, b2 10, 11 ┆ 11, 12 12, 14 14, 15 15, 16 ┆ 16, 17 17, 18 18, 19
┆ ┆
┆ ┆
gid ┆ 28 50 28 176 ┆
┆┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐┆
trk: > 1 └┼─> f <─┼──┼─> - <─┼──┼─> f <─┼──┼─> i <─┼┘
└───────┘ └───────┘ └───────┘ └───────┘
b1, b2 11, 12 12, 14 14, 15 15, 16
│ │ │ │
( x1, x2 ) └ 6110 ┄┄┄┄ 6711 ┘ └ 6359 ┄┄┄┄ 6874 ┘
excerpt of table `ads` with `trk` track 1: .. and with `trk` track > 1:
┌─────┬────┬────┬─────┬──────┬──────┬──────┬──────┐ ┌─────┬────┬────┬─────┬──────┬──────┬──────┬──────┐
│ adi │ b1 │ b2 │ sgi │ osgi │ gid │ chrs │ x │ │ adi │ b1 │ b2 │ sgi │ osgi │ gid │ chrs │ x │
├─────┼────┼────┼─────┼──────┼──────┼──────┼──────┤ ├─────┼────┼────┼─────┼──────┼──────┼──────┼──────┤
│ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │
│ 10 │ 10 │ 11 │ 11 │ null │ 23 │ 'a' │ 5610 │ │ │ │ │ │ │ │ │ │
│ 11 │ 11 │ 12 │ 12 │ null │ 85 │ 'f' │ 6110 │ │ 47 │ 11 │ 12 │ xxx │ 12 │ 28 │ 'f' │ 6110 │
│ 12 │ 12 │ 14 │ 12 │ null │ 3 │ '¬' │ 6671 │ │ 48 │ 12 │ 14 │ xxx │ 12 │ 50 │ '-' │ 6422 │
│ 13 │ 14 │ 15 │ 12 │ null │ -1 │ 'f' │ 6671 │ │ 49 │ 14 │ 15 │ xxx │ 12 │ 28 │ 'f' │ 6359 │
│ 14 │ 15 │ 16 │ 12 │ null │ 176 │ 'i' │ 6671 │ │ 50 │ 15 │ 16 │ xxx │ 12 │ 176 │ 'i' │ 6671 │
│ 15 │ 16 │ 17 │ 13 │ null │ 40 │ 'r' │ 6874 │ │ │ │ │ │ │ │ │ │
│ 16 │ 17 │ 18 │ 13 │ null │ 180 │ 'm' │ 7259 │
│ 17 │ 18 │ 19 │ 14 │ null │ 3 │ ' ' │ 8014 │
│ │ │ │ │ │ │ │ │
-
[–] [RustyBuzz-WASM] modify
glyph_to_svg_pathdata()
/rectangle_from_bbox()
to return coordinates instead of ready-built SVG<rect/>
as wee need those numbers to construct the line bounding box -
[–] [RustyBuzz-WASM] implement caching of font face in
glyph_to_svg_pathdata()
to avoid re-building it for each single outline; benchmark -
[–] Two independent imports of [RustyBuzz-WASM] will share state if they happen to import the same file
-
[+] keep bounding box as
{ x, y, x1, y1, }
in outline table to make it easier to compute composite bounding box -
[–] [RustyBuzz-WASM] implement
despace_svg_pathdata = ( svg_pathda ) -> R = svg_pathda R = R.replace /([0-9])\x20([^0-9])/g, '$1$2' R = R.replace /([^0-9])\x20([0-9])/g, '$1$2' return R
-
[–] [RustyBuzz-WASM] implement SVG path compression
-
[+] allow to pass in custom instance of
rustybuzz-wasm
ascfg.RBW
for testing purposes -
[–] [RustyBuzz-WASM] move to WASI to enable reading from file system &c:
- https://github.com/topheman/webassembly-wasi-experiments#file-system-access
- https://github.com/rustwasm/wasm-pack/issues/654
-
rustwasmc
doesn't allow file access for Rust versions >= 1.51, see https://github.com/second-state/rustwasmc#known-issues - https://www.secondstate.io/articles/wasi-access-system-resources/
- https://blog.knoldus.com/hosting-wasm-modules-in-rust-easily-using-wasmi/
- https://lib.rs/search?q=wasi
- https://wasi.dev/
- https://github.com/bytecodealliance/wasmtime/blob/main/docs/WASI-documents.md
- https://wasmer.io/
- https://de.wikipedia.org/wiki/WebAssembly#Nutzung_au%C3%9Ferhalb_des_Browsers
- https://forum.holochain.org/t/wasmi-vs-wasmer/1929
-
[–] document SVG construction principles
-
[–] document SVG units (mm, px, implicit)
-
[–] implement CSS scaling function (so
1mm
in CSS shows up as 1mm in rendering) -
[+] rename
shape_text()
toarrange()
:-
arrange() -> ads
: gets an arrangement of SIDs (a
rrangementd
ata items
); this called by -
compose() -> { known_ods, new_ods, missing_chrs, ads, fm, }
: turns an arrangement into a hitherto unnamed data structure that indicates which outlines to put where, which SIDs are known, which are new, which are missing from the current font, and fontmetrics -
distribute()
: distributes text over lines
-
-
[–] [icql-dba-hollerith] update & rebrand as
dbay-hollerith
, removing peer dependency onicql-dba
-
[–] Ensure that entire text gets reflected in SVG comments when so configured, even while omitting refs for whitespace.
-
[+] [hengist] Implement a 'Glyf Grid' to see which outlines are mapped to which GIDs
-
[+] Rename field
ads.alt
toads.trk
('track', as used in this README) -
[+] Remove field
ads.adi
(or renameads.id
->ads.adi
) if possible since fieldsid
andb1
are better fits for the same puproses (identifying an AD, logical ordering of ADs). -
[–] (re-)implement display of additional data / invisible elements / text metrics.
-
Symbols should take account of font metrics, especially ascender, baseline, descender, italic angle (skewX), en-width (for tofus / missing glyfs)
-
[–] SPC symbol for space; should show actual extent, similar to
<!--&spc;--><path transform='skewX(-14)' d='M 2350 0 L 2350 200 L 2593 200 L 2593 0' style='fill:transparent;stroke:red;stroke-width:10mm'/>
(albeit with style attributes in
<style>
section). Since each SPC has (potentially) an individual width, no<use/>
tags are used, instead, elements will be generated by an approprate method. -
[–] WBR automatic and manual line break opportunities (can we distinguish between manual, automatic?)
-
[–] SHY hyphenation points
-
[–] GID Glyph IDs
-
[–] 'Tofus' for missing characters (one for half-width, one for full-width codepoints)
-
[–] BBOX glyph bounding box
-
[–] NL manual line breaks
-