Documentation for the AscToRTF conversion utility : Understanding the RTF generated

The latest version of these files is available online at http://www.jafsoft.com/doco/docindex.html

Understanding the RTF generated

Before converting files to RTF, AscToRTF first attempts to analyse your document looking for the following components.

Text layout

Paragraph detection

Indentation detection

Bullets and list detection

Definition detection

Text formatting

Centred text detection

Quoted line detection

Emphasis detection

Unix Emphasis character detection

Adding hyperlinks

Contents List detection

Cross-reference detection

URL detection

Usenet Newsgroup detection

E-mail address detection

User-specified keywords

Headings and section titles

Numbered heading detection

Capitalised heading detection

Underlined heading detection

Embedded heading detection

Key phrase headings

Numbered paragraph detection

Pre-formatted text

Line detection

Form feed page markers

User defined pre-formatted text

Automatically detected pre-formatted text

Table detection

Code sample detection

ASCII art and diagram detection

Text block detection

Other formatted text

Adding features to the document

Adding a Document Title

Adding a Contents list

The use of RTF stylesheets

Text layout

The software can detect several types of text layout. For more details see the following topics.

Paragraph detection

Indentation detection

Hanging paragraph indent detection

Bullets and list detection

Definition detection

Paragraph detection

AscToRTF can automatically detect paragraphs in your document. Normally this is done by detecting blank lines between paragraphs, but when there are no blank lines other features such as short lines at the end of a paragraph and an offset at the start of each new paragraph may also be taken into account.
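The blank-line case can be illustrated with a minimal Python sketch. This is not AscToRTF's own code; the function name `split_paragraphs` is hypothetical, and the short-line and offset heuristics mentioned above are deliberately left out.

```python
def split_paragraphs(text):
    """Group lines into paragraphs, treating blank lines as separators."""
    paragraphs, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            # A blank line closes the paragraph collected so far.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

Runs of several blank lines are treated as a single separator.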

Indentation detection

AscToRTF performs statistical analysis on the document to determine at what character positions indentations occur. This information is used on the output pass to determine the indentation level for each source line.

In calculating the indent positions AscToRTF first converts all tabs to spaces. This may result in unexpected indent positions, but shouldn't normally be a problem. If it is, adjust the Tab size policy.

AscToRTF may reject indentations that appear too close together, so as to keep the number of indent levels manageable.
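As a rough illustration of this kind of analysis, the sketch below builds a histogram of leading-space widths after expanding tabs, then drops rare positions and positions too close to an accepted one. The function name and the `tab_size`, `min_gap` and `min_share` parameters are assumptions made for the example, not AscToRTF policies.

```python
from collections import Counter

def indent_levels(lines, tab_size=8, min_gap=2, min_share=0.05):
    """Estimate indent positions from a histogram of leading-space widths."""
    counts = Counter()
    for line in lines:
        expanded = line.expandtabs(tab_size)      # tabs become spaces first
        if expanded.strip():
            counts[len(expanded) - len(expanded.lstrip(" "))] += 1
    total = sum(counts.values())
    levels = []
    for pos in sorted(counts):
        if counts[pos] / total < min_share:
            continue                              # too rare to be a real level
        if levels and pos - levels[-1] < min_gap:
            continue                              # too close to the previous level
        levels.append(pos)
    return levels
```

With these defaults, lines indented by 0, 2, 3 and 8 columns would yield the levels 0, 2 and 8, the position at column 3 being rejected as too close to column 2.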

You can override the analysis by specifying your own indentation policy. This can sometimes be useful to add an extra indentation level, or to better match up bullet paragraphs with non-bullet paragraphs.

Hanging paragraph indent detection

Some documents have hanging paragraph indents. That is, the first line of each paragraph starts at an offset to the rest of the paragraph.

AscToRTF struggles heroically with this, and tries not to treat this as text at two indent levels, but it does occasionally get confused.

If writing a text file from scratch with AscToRTF in mind, then it is best to avoid this practice.

Bullets and list detection

AscToRTF detects and supports several types of bullets and lists. However it doesn't attempt to convert these into auto-numbered lists (introduced in a later version of RTF). This has the effect of putting the bulleted text one level of indentation to the right of the current text.

Should the analysis fail, you can override any and all of these via the analysis bullet policies.

Such text is marked up using the "Bullet" Style. See "the use of RTF stylesheets".

Bullet paragraphs

AscToRTF will attempt to detect bullet paragraphs, that is, paragraphs that belong to the bullet point. To do this it attempts to match the indentation of follow-on lines with that of the text after the bullet character(s) on the bullet line itself.

Currently this detection only stretches to the paragraph containing the bullet.

Possible problems

Numbered bullets may sometimes get confused with numbered sections. This can be corrected by switching off numbered sections (if there aren't any), replacing the numbered bullets by letters or roman numerals, or by moving the numbered bullets to a different indentation level from the section numbers.
AscToRTF currently only detects the first paragraph belonging to a bullet. If the bullet has several paragraphs there may be alignment problems, as the positioning of the second and subsequent paragraphs will depend on the indentation policy. Sometimes careful balancing of the indentations and the indentation policies can sort the problem.

Bullet chars

Bullet chars are lines of the type

        - this is a bullet line

        - this is a bullet paragraph
          because it carries over onto
          more lines

That is, a single character followed by the bullet line. AscToRTF can determine via statistical analysis which character, if any, is being used in this way. Special attention is paid to the '-' and 'o' characters.
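One way to picture the statistical step is to count how often each candidate character starts a line and is followed by white space. The sketch below is only an illustration under that assumption; `guess_bullet_char` and `min_count` are hypothetical names, and the candidate set (any punctuation character, plus 'o') is a guess.

```python
import re
from collections import Counter

# A candidate bullet: optional indent, one punctuation character or 'o',
# then white space and the start of the bullet text.
BULLET_LINE = re.compile(r"\s*([^\w\s]|o)\s+\S")

def guess_bullet_char(lines, min_count=2):
    """Return the most frequent bullet-like line prefix, or None."""
    counts = Counter()
    for line in lines:
        match = BULLET_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    if not counts:
        return None
    char, seen = counts.most_common(1)[0]
    return char if seen >= min_count else None
```

Requiring a minimum count keeps a single stray dash from being treated as the document's bullet character.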

Numbered bullet detection

AscToRTF can spot numbered bullets. These can sometimes be confused with section headings in some documents. This is one area where the use of a document policy really pays dividends in sorting the sheep from the goats.

Alphabetic bullet detection

AscToRTF detects upper and lower case alphabetic bullets.

Roman Numeral bullet detection

AscToRTF detects upper and lower case roman numeral bullets.

Definition detection

AscToRTF will search for definitions. Definitions consist of a definition term and then the definition description.

One-line definitions

Definition paragraphs

One-line definitions

A definition line is a single line that appears to be defining something. Usually this is a line with either a colon (:) or an equals sign (=) in it. For example

        IMHO = In my humble opinion

        Address : Somewhere over the rainbow.

AscToRTF attempts to determine what definition characters are used and whether they are "strong" (only ever used in a definition) or "weak" (only sometimes used in a definition).

AscToRTF marks up definition lines by placing a line break on the end of the line to preserve the original line structure. Where this decision is made incorrectly, unexpected breaks can appear in the text.

AscToRTF offers the option of marking up the definition term in bold. This is not the default behaviour however.

Definition paragraphs

AscToRTF also recognises the use of definition paragraphs such as :-

      Note:     This is a definition paragraph whereby the whole
                paragraph is defining the term shown on the first line.
                Unfortunately AscToRTF currently only copes with single
                paragraphs (i.e. not with continuation paragraphs), and
                only with single word definitions.

AscToRTF can detect such definitions, subject to the current limitations:

Only one word definition terms are detected

Only the first definition paragraph is detected. Whether or not subsequent paragraphs are aligned correctly will depend on the indentation policy applied to it.

These limitations will hopefully be removed in later versions.

Where definition paragraphs are detected the definition will be marked up as hanging paragraphs and (optionally) can have the definition term highlighted in bold.

Text formatting

In addition to various types of formatted text layouts, the software can detect a number of special types of text formatting, including the following.

Centred text

Quoted lines (such as in emails)

Emphasised text

Unix emphasis characters

Centred text detection

New in version 2.0

AscToRTF can be made to attempt automatic detection of centred text. When enabled, the indentation and length of each line are compared to the nominal page width within a specified tolerance (see page width and Automatic centring tolerance).

If the line appears centred (and meets a few other conditions) then it will be rendered centred in the output.

This option is normally left switched off, as it is fairly prone to errors, not least because the calculation is sensitive to getting the page width calculation correct. When it goes wrong you are liable to find the document centres lines that shouldn't be.
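The core comparison can be sketched as follows. This is an illustration only: the function name, the default page width of 80 and the tolerance of 2 are assumptions, and the "few other conditions" mentioned above are not modelled.

```python
def looks_centred(line, page_width=80, tolerance=2):
    """True if the left and right margins differ by at most `tolerance`."""
    text = line.expandtabs().rstrip()
    stripped = text.lstrip()
    if not stripped:
        return False
    left = len(text) - len(stripped)      # leading white space
    right = page_width - len(text)        # space left before the page edge
    return left > 0 and abs(left - right) <= tolerance
```

The sketch also shows why the page width matters so much: an error of a few columns in `page_width` shifts `right` by the same amount and can flip the result.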

Quoted line detection

AscToRTF recognises that, especially in Internet files, it is increasingly common to quote from other text sources such as e-mail. The convention used in such cases is to insert a quote character such as ">" at the start of each line.

Consequently, AscToRTF adds a line break at the end of such lines to preserve the line structure of the original, and marks them up in italics to differentiate the quoted text.

Such text is marked up using the "Quoted" style. See "the use of RTF stylesheets".

Emphasis detection

AscToRTF can look for text emphasised by placing asterisks (*) or underscores (_) either side of it. AscToRTF will convert the enclosed text to bold and italic respectively.

AscToRTF will also look for combinations of asterisks and underscores which will be placed in bold italic. The asterisks and underscores should be properly nested.

The emphasised word or phrase should span no more than a few lines, and in particular should not span a blank line. If the phrase is longer, or if AscToRTF fails to match opening and closing emphasis marks, the characters are left unconverted.

Tests are made to ignore double asterisks and underscores, and sometimes adjacent punctuation will prevent the text being marked up.

Only markup that occurs in matched pairs over 2-3 lines will be converted, so _this and that* won't be converted.
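The matching rules above can be approximated with regular expressions. The sketch below is not the program's own algorithm: `mark_emphasis` and `max_newlines` are hypothetical names, nested bold italic is not handled, and the output simply uses the standard RTF `\b` and `\i` control words inside a group.

```python
import re

def _make(mark):
    m = re.escape(mark)
    # The opening mark must not follow a word character or another mark, the
    # body must start and end with a non-space, non-mark character, and the
    # closing mark must not be followed by a word character or another mark.
    return re.compile(rf"(?<![\w{m}]){m}([^\s*_](?:[^*_]*?[^\s*_])?){m}(?![\w{m}])")

BOLD, ITALIC = _make("*"), _make("_")

def mark_emphasis(text, max_newlines=2):
    """Wrap *bold* and _italic_ spans in RTF groups; leave anything else alone."""
    def wrap(control):
        def repl(match):
            body = match.group(1)
            if "\n\n" in body or body.count("\n") > max_newlines:
                return match.group(0)     # spans a blank line or too many lines
            return "{" + control + " " + body + "}"
        return repl
    text = BOLD.sub(wrap(r"\b"), text)
    return ITALIC.sub(wrap(r"\i"), text)
```

Mismatched pairs such as `_this and that*` and doubled marks such as `**` fall through unchanged.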

Unix emphasis character detection

AscToRTF also tries to handle use of Ctrl-H in Unix documents. In such documents Ctrl-H can be used to overstrike characters. Common effects are double printing and underlining. Where detected AscToRTF will use bold and underlining markup.

Examples could include:-

The word this^H^H^H^H____ is underlined. The word that^H^H^H^Hthat is bold (overwritten twice).
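To make the overstrike idea concrete, the sketch below replays a line as a printer would, treating Ctrl-H (`"\b"` in Python) as a step back of one column, and then classifies each column. It is only an illustration; the function name and the returned `(char, style)` pairs are assumptions.

```python
def resolve_overstrikes(line):
    """Return (char, style) per column; style is '', 'bold' or 'underline'."""
    cells, pos = [], 0
    for ch in line:
        if ch == "\b":                    # Ctrl-H: move back one column
            pos = max(pos - 1, 0)
            continue
        if pos == len(cells):
            cells.append([ch])
        else:
            cells[pos].append(ch)         # struck over an existing character
        pos += 1
    result = []
    for strikes in cells:
        base = next((c for c in strikes if c != "_"), "_")
        if base != "_" and "_" in strikes:
            style = "underline"           # character plus underscore
        elif strikes.count(base) > 1:
            style = "bold"                # same character printed twice
        else:
            style = ""
        result.append((base, style))
    return result
```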

Adding hyperlinks

The software can add active hyperlinks to the following :-

Cross-references to numbered sections

URLs of various types

Usenet newsgroups

email addresses

User-specified keywords

Contents List detection

Unlike AscToHTM, AscToRTF leaves any detected contents list intact and unchanged. However, since headings are marked up in a Heading style, it should be possible to create a TOC in Word from the marked up headings. This being the case, the original text TOC is redundant and best deleted.

See adding a contents list.

Cross-reference detection

AscToRTF can convert cross-references to other sections into hyperlinks to those sections. Unfortunately this is currently only possible for second, third, fourth... level numeric headings (n.n, n.n.n, n.n.n.n etc).

This is because the error rate becomes too high on single numbers/letters or roman numerals. This may be refined in future releases, although it's hard to see how that would work.
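A small sketch shows why multi-level numbers are the safer target: a pattern that requires at least one dot skips bare numbers altogether, and checking against the headings actually found filters out most remaining false hits. The names `SECTION_REF`, `find_section_refs` and `known_sections` are hypothetical.

```python
import re

# At least two numeric levels (n.n, n.n.n, ...), not glued to other text.
SECTION_REF = re.compile(r"(?<![\w.])\d+(?:\.\d+)+(?!\w)")

def find_section_refs(text, known_sections):
    """Return the n.n-style numbers in `text` that match a known heading."""
    return [m.group(0) for m in SECTION_REF.finditer(text)
            if m.group(0) in known_sections]
```

For example, with headings 2.1 and 3.4.2 known, the text "See 2.1 and 3.4.2, but not 7 or 9.9." yields only the first two numbers.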

It is possible to use AscToRTF tags though, for example the GOTO command and POPUP command can create links to named sections.

For example

        See [[goto cross-reference detection]]

becomes

See cross-reference detection

URL detection

AscToRTF can convert any URLs in the document to hyperlinks. This includes HTTP and FTP URLs and any web addresses beginning with www.

The domain name part of the URL will be checked against the known domain name structures and country codes to check it falls within an allowed group. So www.somewhere.thing won't be allowed as ".thing" isn't a proper top level domain.
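The domain check can be pictured as below. This is a hedged sketch, not the program's logic: `KNOWN_TLDS` is a tiny illustrative subset rather than the list AscToRTF uses, the names are hypothetical, and IP-address URLs are not handled.

```python
import re

# Illustrative subset only; a real list would be far longer.
KNOWN_TLDS = {"com", "org", "net", "edu", "gov", "uk", "de", "fr"}

URL = re.compile(
    r"\b(?:(?:https?|ftp)://|www\.)"               # scheme, or a bare www. prefix
    r"([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)"         # host name with at least one dot
    r"(/\S*)?"                                     # optional path
)

def find_urls(text):
    """Return URL candidates whose top-level domain is in KNOWN_TLDS."""
    urls = []
    for match in URL.finditer(text):
        tld = match.group(1).rsplit(".", 1)[-1].lower()
        if tld in KNOWN_TLDS:
            urls.append(match.group(0))
    return urls
```

Here `www.somewhere.thing` is matched by the pattern but discarded by the top-level domain check.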

URLs that use IP addresses or some more obscure methods of specifying domain names will also be recognised, but the link will be changed wherever possible to either a domain name or an IP address. This de-obfuscates the obscure references so beloved by spammers.

Unlike AscToHTM, AscToRTF will only convert full URLs (i.e. those where a site name is supplied) into hyperlinks. If a URL like "\home\index.html" is detected it is left unconverted. This is because the relationship between source and target is less likely to be reliable.

Usenet Newsgroup detection

AscToRTF can convert any newsgroup names it spots into hyperlinks to those newsgroups. Because this is prone to error, AscToRTF currently only converts newsgroups in known USENET hierarchies such as rec.gardens by default.

This can be overcome either by

placing "news:" in front of the newsgroup name (e.g. news:this.is.a.newsgroup.honest)
relaxing this condition via a document policy (see Only use known groups).
specifying the newsgroup hierarchy as recognised via a policy (see Recognised USENET groups).

E-mail address detection

AscToRTF can convert any email addresses into hypertext mailto: links. As with URL detection, the domain name is checked to see it falls into a recognised group.

User-specified keywords

AscToRTF can convert user-specified keywords into hyperlinks. The word or phrase to be converted must lie on a single line in the source document. Care should be taken to ensure keywords are unambiguous. Normally I mark my keywords in [] brackets if authoring for conversion by AscToRTF.

See the discussion in Using link dictionary files.

Headings and section titles

AscToRTF recognises various types of headings. Where headings are found, and deemed to be consistent with the prevailing document policy (correct indentation, right type, in numerical sequence etc), AscToRTF will use the standard "Heading n" styles.

In addition to this, AscToRTF will insert a named bookmark to allow hyperlink jumps to this point. These bookmarks are used for example in any cross-reference hyperlinks that AscToRTF generates, and also by any GOTO tags.

Numbered heading detection

Sections of type N.N.N can be checked for consistency, and references to them can be spotted and converted into hyperlinks.

At present more exotic numbering schemes using roman numerals and letters of the alphabet are not fully supported.

Capitalised heading detection

AscToRTF can treat wholly capitalised lines as headings. It also allows for such headings to be spread over more than one line.

Underlined heading detection

AscToRTF can recognize underlined text (e.g. a row of minus signs), and optionally promote the preceding line to be a section header.

The "underlining" line should have no gaps in it, and should be a similar length to the preceding heading. If these conditions aren't met you'll probably get a horizontal rule instead.
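The two conditions can be expressed as a short test. The sketch below is illustrative only; the function name, the set of accepted underline characters and the length `tolerance` are assumptions.

```python
def is_underline_for(heading, underline, tolerance=3):
    """True if `underline` is an unbroken rule of similar length to `heading`."""
    h, u = heading.strip(), underline.strip()
    if not h or len(u) < 3:
        return False
    if len(set(u)) != 1 or u[0] not in "-=_~*":
        return False                      # gaps or mixed characters
    return abs(len(u) - len(h)) <= tolerance
```

A line that fails this test would fall back to being treated as an ordinary line, which is why a gap or a much longer row of dashes tends to produce a horizontal rule.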

If you're authoring a file from scratch, it is probably best to use underlined headings for ease of use.

Embedded heading detection

New in version 2.0

The program can look for headings "embedded" in the first paragraph. Such headings are expected to be a complete sentence or phrase in UPPER CASE at the start of a paragraph. Where detected the heading will be marked up in bold, rather than in a "Heading n" style, although it will still be added to, and accessible from, any hyperlinked contents list you generate for the document.

At present such headings are not auto-detected... you need to switch on the Expect Embedded headings policy.

Key phrase headings

New in version 2.0

The program can now look for lines that start with particular words or phrases (such as "Chapter", "Part", "Title") of your choice and treat these lines as headings. Previously this only worked in a limited way if the heading line was also numbered ("Chapter 1" etc).

To use this feature, set the policy Heading Key phrases.

Numbered paragraph detection

Some types of documents use what look like section numbers to number paragraphs (e.g. legal documents, or sets of rules).

AscToRTF can recognize this, and mark up such lines by placing the number in bold, and not using the "Heading n" style on the whole line.

Mail and USENET headers

Some documents, especially those that were originally email or USENET posts, come with header lines, usually in the form of a number of lines with a keyword followed by a colon and then some value.

AscToRTF can recognize these (to a limited extent). Where these are detected the program will parse the header lines to extract the Subject, Author and Date of the article concerned. A heading containing this information will then be generated to replace all the unsightly header lines.
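For illustration, the same extraction can be done with Python's standard `email` package. This is a sketch of the idea rather than AscToRTF's parser; `summarise_headers` and the " | " separator are choices made for the example.

```python
from email import message_from_string

def summarise_headers(raw):
    """Build a one-line heading from the Subject, From and Date headers."""
    msg = message_from_string(raw)
    parts = [msg.get("Subject"), msg.get("From"), msg.get("Date")]
    return " | ".join(part for part in parts if part)
```

Headers that are missing are simply skipped, and all other header lines are ignored.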

Pre-formatted text

The software can detect various forms of pre-formatted text. This is text laid out in such a way that the spacing used is critical. Spacing is not normally preserved in conversion to RTF, so the correct detection and handling of these special types of text is quite important.

Types of text recognised include the following

Lines

Form feed page markers

User defined pre-formatted text

Automatically detected pre-formatted text

Tables

Code samples

Diagrams and ASCII art

Text blocks

Other formatted text

Line detection

Lines are interpreted in context. If they appear to be underlining text, or part of some pre-formatted structure such as a table, then they are treated as such. Otherwise they become horizontal rules.

An attempt is made to interpret half-lines etc as such, although the effect is only approximate.

Form feed page markers

Form feeds or page breaks become page breaks in the RTF.

User defined pre-formatted text

AscToRTF allows users to define their own regions of pre-formatted text, using the BEGIN_PRE and END_PRE pre-processor tags (see Using the pre-processor).

Such areas are marked up in the "Preformatted" style (see "the use of RTF stylesheets"), which uses a non-proportional font to preserve the relative spacing.

For example :-

      The use of BEGIN_PRE and END_PRE preprocessor
        commands (see 7.1) in
          the text documents
            tells AscToRTF that
              this portion of the
            document
          has been formatted
        by the user and
      should be left unchanged.

Automatically detected pre-formatted text

AscToRTF attempts to spot sections of preformatted text. This can vary from a single line (e.g. a line with a page number on the right-hand margin) to a complete table of data.

Where such text is detected AscToRTF analyses the section to determine what type of pre-formatted text it is. Options include

Tables

Code samples

ASCII Art and diagrams

some other formatted text

A number of policies allow you to control

whether or not the program looks for such text

how sensitive it is to "pre-formatted" text

how inclined the program is to "extend" the region to adjacent lines

whether or not table generation should be attempted

various aspects of any table analysis that is carried out.

See Pre-formatted text policies for full details.

You can adjust the sensitivity of AscToRTF to pre-formatted text by setting the minimum number of lines required for a pre-formatted region using the Minimum size of automatic <PRE> section policy.

RTF ignores all white space in the source document, thus any hand-crafted layout information would normally get lost. When AscToRTF detects such regions it marks them up in fixed width font which tells RTF this region is pre-formatted.

When tables are detected, AscToRTF will attempt to generate the correct RTF table.

When AscToRTF gets the detection wrong you can use the AscToRTF pre-processor to mark up regions of your document you wish preserved.

Table detection

Tables are marked out by their use of white space, and by a regular pattern of gaps or vertical bars being spotted on each line. AscToRTF will attempt to spot the table, its columns, its headings, its cell alignment and entries that span multiple columns or rows.

Should AscToRTF wrongly detect the extent of a table, you can mark up a section of text by using the TABLE pre-processor markup (see the Tag manual). Alternatively you can try adding blank lines before and after, as the analysis uses white space to delimit tables.

You can alter the characteristics of all tables via the table policies (see Formatting policies).

You can alter the characteristics of all or individual tables via the table pre-processor commands (see TABLE).

Or you can suppress the whole thing altogether via the Attempt TABLE generation policy.

Tables will be marked up using the "Table" style. See "the use of RTF stylesheets".

Code sample detection

AscToRTF attempts to recognize code fragments in technical documents. The code is assumed to be "C++" or "Java"-like, and key indicators are, for example, the presence of ";" characters on the end of lines.
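A crude version of such an indicator is the share of lines that end in code-like punctuation. The sketch below is an assumption-laden illustration: only the ";" test comes from the description above, while the brace characters, the function name and the `threshold` value are additions for the example.

```python
def looks_like_code(lines, threshold=0.5):
    """True if at least `threshold` of the non-blank lines end in ; { or }."""
    body = [line.rstrip() for line in lines if line.strip()]
    if not body:
        return False
    # ";" is the indicator named in the text; "{" and "}" are assumed extras.
    hits = sum(1 for line in body if line.endswith((";", "{", "}")))
    return hits / len(body) >= threshold
```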

Should AscToRTF wrongly detect the extent of a code fragment, you can mark up a section of text by using the CODE pre-processor markup.

Or you can suppress the whole thing altogether via the policy Expect code samples.

Code samples will be marked up using the "Code" style. See "the use of RTF stylesheets".

ASCII art and diagram detection

AscToRTF attempts to recognize ASCII art and diagrams in documents. Key indicators include large numbers of non-alphanumeric characters and the use of white space.

However, some diagrams use the same mix of line and alphabetic characters as tables, so the two sometimes get confused.

Should AscToRTF wrongly detect the extent or type of a diagram, you can mark up a section of text by using the DIAGRAM pre-processor markup.

Diagrams are marked up using the "Diagram" style. See "the use of RTF stylesheets".

Text block detection

New in version 2.0

If AscToRTF detects a block of text at a large indent, it will now place that text in such a way as to preserve as faithfully as possible the original indent.

Other formatted text

If AscToRTF detects formatted text, but decides that it is neither table, code nor art (and it knows what it likes), then the text may be put out "as normal", but with the original line structure preserved.

In such regions other markup (such as bullets) may not be processed as it would be elsewhere.

Adding features to the document

As well as detecting features present in the source text, the software allows you to add features that you would expect in the output file but that can't be inferred from the input.

These include the following.

Document title

A working contents list

Adding a Document Title

AscToRTF can calculate - or be told - the title of a document. This will be placed in the document properties section in the header of each RTF file produced.

The title is determined in the order shown below. Once one step returns a value, the subsequent ones are ignored.

If a TITLE command is placed in the source text, that value is used
If the Document title policy is set then this value is used.
Finally, if none of the above result in a title the text "Converted from <filename>" is used.
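The precedence above amounts to a simple fallback chain, sketched here with hypothetical names (`choose_title`, `title_command`, `title_policy`):

```python
def choose_title(title_command=None, title_policy=None, filename="input.txt"):
    """Pick the document title using the precedence described above."""
    if title_command:                     # TITLE command in the source text
        return title_command
    if title_policy:                      # Document title policy
        return title_policy
    return f"Converted from {filename}"   # last-resort default
```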

Adding a Contents list

AscToRTF can detect the presence of a contents list in the original document, or it can insert a field code that will generate a contents list from the headings that it observes. This is possible because AscToRTF marks headings up in the "Heading n" styles. See The use of RTF stylesheets.

The contents field that is added can be recalculated in Word by pressing F9.

There are a number of policies that give you control over how and where a contents list is generated (see contents policies).

Contents lists placement

By default the contents list will be placed at the top of the output file. You can cause contents lists to be placed wherever you want by using the CONTENTS_LIST preprocessor command (see pre-processor directives).

Contents list detection

AscToRTF can detect contents lists in a number of ways:

By detecting "table of contents", "end contents" or something similar in the text.
By spotting that the numbering sequence appears twice. AscToRTF will assume the first set is the contents list.
By spotting pre-processor markup.

This is often a hit-and-miss procedure, and is liable to error.
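The repeated-numbering test can be sketched as follows. It is an illustration of the idea only; `NUMBERED` and `contents_list_span` are hypothetical names, and the sketch looks at leading section numbers and nothing else.

```python
import re

# A line that starts with a section number such as "2", "2." or "2.1".
NUMBERED = re.compile(r"\s*(\d+(?:\.\d+)*)\.?\s+\S")

def contents_list_span(lines):
    """Return (first, last) line indexes of a suspected contents list, or None."""
    seen = {}
    for index, line in enumerate(lines):
        match = NUMBERED.match(line)
        if not match:
            continue
        number = match.group(1)
        if number in seen:
            # The numbering has restarted, so the numbered lines seen so far
            # are taken to be the contents list.
            return (min(seen.values()), max(seen.values()))
        seen[number] = index
    return None
```

A numbered bullet list early in the document would trigger the same test, which is one reason the procedure is error-prone.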

Should the analysis fail, you can attempt to correct it via the Contents lists policies.

The use of RTF stylesheets

AscToRTF supports the use of stylesheets, that is, the marking up of text in particular styles. AscToRTF uses this to identify how the text was analysed; thus headings acquire a "Heading n" style, and bulleted lists are marked up in the "Bullet" style.

Initially most of these styles are the same, but if you use a word processor that supports RTF stylesheets (such as Word), you'll be able to globally change attributes like font face and colour. For example you could turn all code samples green by changing the attributes of the "Code" style.

Styles are implemented in a hierarchy, with style attributes being inherited from their parents. Later versions of AscToRTF may allow style attributes to be selected before conversion.
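Inheritance of this kind can be pictured with a small sketch that uses a few of the style names from this section. The attribute names and values in `OWN` are assumptions chosen for illustration (only the italic Quoted text and the 8pt Courier preformatted text come from the descriptions in this section), and `resolve` is a hypothetical helper.

```python
# Child style -> parent style, for a few of the styles in the hierarchy.
PARENT = {"Body": "Normal", "Quoted": "Body", "Preform": "Normal"}

# Attributes each style sets itself; everything else is inherited.
OWN = {
    "Normal": {"font": "user-supplied", "italic": False},
    "Quoted": {"italic": True},
    "Preform": {"font": "Courier", "size_pt": 8},
}

def resolve(style):
    """Merge attributes from the root of the hierarchy down to `style`."""
    chain = []
    while style is not None:
        chain.append(style)
        style = PARENT.get(style)
    attrs = {}
    for name in reversed(chain):          # parents first, so children override
        attrs.update(OWN.get(name, {}))
    return attrs
```

Changing an attribute on a parent then shows up in every child that does not override it.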

The style hierarchy is as follows

      Normal                            (generic normal text style)
        |
        +-- 1 Body                      (main body text)
        |       |
        |       +--- 11 ShortLine       (short lines)
        |       +--- 12 Bullet          (bullets and numbered lists)
        |       +--- 13 Quoted          ("quoted" text as found in emails)
        |       +--- 14 Hanging         (hanging paragraphs)
        |       +--- 15 Definition      (definitions)
        |
        +-- 2 Table                     (Table text)
        +-- 3 Preform                   (preformatted text)
        +-- 4 Diagram                   (diagrams)
        +-- 5 Code                      (code samples)
        |
        +-- 6 Heading                   (generic heading style)
        |       |
        |       +--- 61 Heading1        (level 1 headings)
        |       +--- 62 Heading2        (level 2 headings)
        |       +--- 63 Heading3        (level 3 headings)
        |       +--- 64 Heading4        (level 4 headings)
        |       +--- 65 Heading5        (level 5 headings)
        |
        +-- 7 TOC                       (generic TOC style)
        |       |
        |       +--- 71 TOC1            (level 1 TOC entry)
        |       +--- 72 TOC2            (level 2 TOC entry)
        |       +--- 73 TOC3            (level 3 TOC entry)
        |       +--- 74 TOC4            (level 4 TOC entry)
        |       +--- 75 TOC5            (level 5 TOC entry)
        o

The default implementations of these styles are as follows:-

Body Uses the user-supplied font. Creates
justified text by default.

ShortLine Same as Body, but with a \par at the end
of each line to preserve the original line structure.
These paragraphs have zero spacing before
and after, to closely mimic the original text
file structure.

Bullet Styling is the same as Body, but the bullet
itself is output using a hanging indent with a
tab after the bullet.

Quoted Text is placed in italics, and left justified.
Each line is given a \par to preserve the original
line structure.

Hanging The text is divided into two parts. The first
is placed on the left, and the "hanging" part is
placed on the right, after a tab. The position
of the tab stop is calculated according to
the size of the text to be placed on the left.
Often text that AscToHTM would put in a table comes
out as a hanging list.

Definition Much like Hanging. The definition term is on left,
the rest is hung on the right after a tab. Options
exist to allow the definition term to be made bold.

Table The text is styled as in Body, but is placed into
cells in a table. Table analysis is complex, and
deserves a document in its own right, but
in essence the text is placed in cells and
aligned according to original placement and data
type. The whole process can sometimes go wrong.

Preformatted Preformatted text is output in a non-proportional font
(usually Courier) with no spacing between lines and
a \par on each line to preserve the line structure.
A font size of 8pt is used as this best represents
80 characters across a page without wrapping.

Diagram Same as Preformatted.

Code Same as Preformatted.

Heading Heading itself is unused, but acts as a common parent
for the actual styles "Heading 1", "Heading 2" etc.
These are set to be the same as the Microsoft Word
equivalents.

TOC The table of contents style TOC itself is unused,
but acts as a common parent for the actual styles
"TOC 1", "TOC 2" etc. These are set to be the same
as the Microsoft Word equivalents.
