Documentation for the AscToHTM conversion utility : Understanding the HTML generated

Understanding the HTML generated

Before converting files to HTML, AscToHTM first attempts to analyse your document looking for the following components.

Text layout
Text formatting
Adding hyperlinks
Headings and section titles
Diagrams and tables
Automatically detected pre-formatted text

In addition to the HTML generated as a result of analysis, you have the ability to customise and add to the HTML created.

Adding HTML features to the document

Text layout

The software can detect several types of text layout. For more details see the following topics.

Paragraph detection
Indentation detection
Hanging paragraph indent detection
Bullets and list detection
Contents list generation
Definition detection

Paragraph detection

AscToHTM can automatically detect paragraphs in your document. Normally this is done by detecting blank lines between paragraphs, but when there are no blank lines other features such as short lines at the end of a paragraph and an offset at the start of each new paragraph may also be taken into account.
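
As a rough illustration of the blank-line case only, the following Python sketch splits text into paragraphs wherever a blank line occurs. The function name split_paragraphs is hypothetical, and this is not AscToHTM's actual algorithm, which may also weigh short last lines and first-line offsets.

```python
def split_paragraphs(text):
    """Split text into paragraphs, using blank lines as separators."""
    paragraphs, current = [], []
    for line in text.splitlines():
        if line.strip():
            # Non-blank line: it belongs to the paragraph being built.
            current.append(line.strip())
        elif current:
            # Blank line: close the paragraph, if one is open.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```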

Indentation detection

AscToHTM performs statistical analysis on the document to determine at what character positions indentations occur. The software attempts to honour your indentation pattern by using <BLOCKQUOTE> .. </BLOCKQUOTE> markup, however the effect can only ever be approximate.

In calculating the indent positions AscToHTM first converts all tabs to spaces. This may result in unexpected indent positions, but shouldn't normally be a problem. If it is, adjust the Tab size policy.

AscToHTM may reject indentations that appear too close together, so as to keep the number of indent levels manageable.
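
The steps described above (expand tabs, count where lines start, discard positions that are too close together) can be sketched in Python as follows. This is an assumption-laden illustration rather than the program's real analysis: the function name detect_indents and the parameters tab_size, min_lines and min_gap are all hypothetical.

```python
from collections import Counter

def detect_indents(text, tab_size=8, min_lines=2, min_gap=2):
    """Return the indent positions used often enough to count as levels."""
    counts = Counter()
    for line in text.splitlines():
        expanded = line.expandtabs(tab_size)   # tabs become spaces first
        if expanded.strip():
            counts[len(expanded) - len(expanded.lstrip(" "))] += 1
    accepted = []
    for pos in sorted(p for p, n in counts.items() if n >= min_lines):
        # Reject a position that sits too close to the previous accepted one.
        if not accepted or pos - accepted[-1] >= min_gap:
            accepted.append(pos)
    return accepted
```

With the default values, a document whose lines start at columns 0, 2, 3 and 8 would yield the levels 0, 2 and 8, because column 3 is too close to column 2.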

You can override the analysis by specifying your own indentation policy. This can sometimes be useful to add an extra indentation level, or to better match up bullet paragraphs with non-bullet paragraphs.

Hanging paragraph indent detection

Some documents have hanging paragraph indents. That is, the first line of each paragraph starts at an offset to the rest of the paragraph.

AscToHTM struggles heroically with this, and tries not to treat this as text at two indent levels, but it does occasionally get confused.

If writing a text file from scratch with AscToHTM in mind, then it is best to avoid this practice.

Bullets and list detection

AscToHTM attempts to automatically detect bullets of several types:

Numbered bullets
Alphabetic bullets (upper and lower case)
Roman numeral bullets (upper and lower case)
Hyphen `-` bullet points
Small letter `o` bullet points
Bullets using other characters

Should the analysis fail, you can override any and all of these via the analysis bullet policies.

AscToHTM will use <UL>..<LI>..</UL> or <OL>..<LI>..</OL> markup for bullets. This has the effect of putting the bulleted text one level of indentation to the right of the current text.

Bullet paragraphs

AscToHTM will attempt to detect bullet paragraphs, that is, paragraphs that belong to the bullet point. To do this it attempts to match the indentation of follow-on lines with that past the bullet character(s) on the bullet line itself.

Currently this detection only stretches to the paragraph containing the bullet.

Possible problems

Numbered bullets may sometimes get confused with numbered sections. This can be corrected by switching off numbered sections (if there aren't any), replacing the numbered bullets by letters or roman numerals, or by moving the numbered bullets to a different indentation level from the section numbers.
AscToHTM currently only detects the first paragraph belonging to a bullet. If the bullet has several paragraphs there may be alignment problems, as the positioning of the second and subsequent paragraphs will depend on the indentation policy. Sometimes careful balancing of the indentations and the indentation policies can sort the problem.

Bullet chars

Bullet chars are lines of the type

        - this is a bullet line

        - this is a bullet paragraph
          because it carries over onto
          more lines

That is, a single character followed by the bullet line. AscToHTM can determine via statistical analysis which character, if any, is being used in this way. Special attention is paid to the '-' and 'o' characters.
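
One simple way to picture this kind of statistical analysis is to count which character most often opens a line in the bullet position. The Python sketch below does only that; the name guess_bullet_char, the min_count threshold and the regular expression are illustrative assumptions, not the program's actual rules.

```python
import re
from collections import Counter

# A candidate bullet line: optional indent, one punctuation character
# (or a lone letter 'o'), white space, then the bullet text.
BULLET_LINE = re.compile(r"^\s*([^\w\s]|o)\s+\S")

def guess_bullet_char(lines, min_count=2):
    """Return the most frequent bullet character, or None if none stands out."""
    counts = Counter()
    for line in lines:
        match = BULLET_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    if not counts:
        return None
    char, seen = counts.most_common(1)[0]
    return char if seen >= min_count else None
```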

Numbered bullet detection

AscToHTM can spot numbered bullets. These can sometimes be confused with section headings in some documents. This is one area where the use of a document policy really pays dividends in sorting the sheep from the goats.

Alphabetic bullet detection

AscToHTM detects upper and lower case alphabetic bullets.

Roman Numeral bullet detection

AscToHTM detects upper and lower case roman numeral bullets.

Contents list generation

AscToHTM can detect the presence of a contents list in the original document, or it can generate a contents list for you from the headings that it observes. There are a number of policies that give you control over how and where a contents list is generated (see Contents list policies).

There are four different situations in which contents lists may or may not be generated. These are :-

Contents lists in default conversions
Contents lists in conversions to a single HTML file
Contents lists in conversions to multiple HTML files
Contents lists in conversions to frames

Contents lists in default conversions

By default AscToHTM will not generate a contents list for a file unless it already has one.

If it should detect a contents list in the document, then that list is changed into hyperlinks to the named sections. This only works currently for files with numbered headings.

Where an existing list is detected, headings shown in the contents list are converted into links, and the link text is that in the original contents list, and not the text in the actual heading (often they are different).

Note: AscToHTM currently only detects numbered contents lists, and is occasionally prone to error when they are present. If you experience problems, either delete the contents list and get AscToHTM to generate one for you, or mark up the existing list using the contents pre-processor commands (see Pre-processor section delimiters).

Contents lists in conversions to a single HTML file

As described in Contents lists in default conversions, AscToHTM will not generate a contents list by default unless the file already has one.

Requesting a contents list

You can request that a contents list is always generated, by using the Add contents list policy. In this case a contents list is either

made from the existing contents list, or

generated from the observed headings. In this case the contents list will only be as good as the detection of headings in the rest of the document permits.

Forcing a generated contents list

You can force a generated list to be used by disabling the Use any existing contents list policy.

If an existing contents list is present, it is deleted from the output. Normally it's best to either use the existing contents list, or to delete it from the source text and request a generated list.

Contents lists placement

By default the contents list is placed at the top of the output file. In earlier versions of AscToHTM the contents list was always placed in a separate file.

You can cause contents lists to be placed wherever you want by using the CONTENTS_LIST preprocessor command. If you do this, then a contents list is placed only where you place CONTENTS_LIST markers.

Generating a contents list in a separate file

If you select the Generate external contents file policy the contents list is placed in a separate file, and a hyperlink to that file called "Contents List" is placed at the top of the HTML page generated from the document.

You can choose the name of the external file using the External contents list filename policy. If omitted, the file is called "Contents_<filename>", where <filename> is the name of the document being converted.

Contents lists in conversions to multiple HTML files

AscToHTM can be made to split the output into many files. At present this is only possible at detected section headings. Each generated page usually has a navigation bar, which includes a hyperlink back to the following section in any contents list.

The behaviour is identical to that in Contents lists in conversions to a single HTML file, except that

the output is now split into several files.

the options to generate an external contents list in a separate file are no longer available.

if the contents list is being generated, it is now placed at the foot of the first document, rather than at the top (unless the CONTENTS_LIST preprocessor command is used).

This is usually before the first heading (which now starts the second document), and after any document preamble.

Note: Where the original contents list is used when splitting files, it is possible that not every file is directly accessible from the contents list, and that the back links to the contents list may not function as expected. In such cases you can go from the contents list to a major section, and then use the navigation bars to page through to the minor section.

Contents lists in conversions to frames

Contents list generation for the main document will proceed as described in the previous sections.

When making a set of frames, you can elect to have a contents frame generated (the default behaviour), and this will have a generated list placed in a frame on the left. This can mean you have a contents list in the contents frame on the left, and also at the top of the first page in the main document. For this reason the main frame often starts by displaying the second page.

The number of levels shown in the contents frame list can be controlled by policy. Alternatively you can replace the whole contents of the contents frame by defining a CONTENTS_FRAME HTML fragment.

Definition detection

AscToHTM will search for definitions. Definitions consist of a definition term and then the definition description.

One-line definitions
Definition paragraphs

One-line definitions

A definition line is a single line that appears to be defining something. Usually this is a line with either a colon (:) or an equals sign (=) in it. For example

        IMHO = In my humble opinion

        Address : Somewhere over the rainbow.

AscToHTM attempts to determine what definition characters are used and whether they are "strong" (only ever used in a definition) or "weak" (only sometimes used in a definition).

AscToHTM marks up definition lines by placing a line break on the end of the line to preserve the original line structure. Where this decision is made incorrectly unexpected breaks can appear in text.

AscToHTM offers the option of marking up the definition term in bold. This is not the default behaviour however.
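
To make the idea of a one-line definition concrete, here is a minimal Python sketch that splits a line into term, definition character and description. It assumes the definition character is followed by white space, which conveniently skips things like URLs. The name parse_definition and the regular expression are hypothetical, and no attempt is made to model the "strong" and "weak" distinction.

```python
import re

# term, then ':' or '=', then white space, then the description
DEFINITION = re.compile(
    r"^\s*(?P<term>[^:=]+?)\s*(?P<sep>[:=])\s+(?P<desc>\S.*)$"
)

def parse_definition(line):
    """Return (term, separator, description), or None if not a definition."""
    match = DEFINITION.match(line)
    if not match:
        return None
    return (match["term"], match["sep"], match["desc"])
```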

Definition paragraphs

AscToHTM also recognises the use of definition paragraphs such as :-

      Note:     This is a definition paragraph whereby the whole
                paragraph is defining the term shown on the first line.
                Unfortunately AscToHTM currently only copes with single
                paragraphs (i.e. not with continuation paragraphs), and
                only with single word definitions.

AscToHTM can detect such definitions, subject to the current limitations

Only one word definition terms are detected

Only the first definition paragraph is detected. Whether or not subsequent paragraphs are aligned correctly will depend on the indentation policy applied to it.

These limitations will hopefully be removed in later versions.

Where definition paragraphs are detected the definition can be marked up in <DL> ... <DT>..</DT> <DD>..</DD> </DL> and (optionally) can have the definition term highlighted in <B> ... </B> markup.

Text formatting

In addition to various types of formatted text layouts, the software can detect a number of special types of text formatting, including the following.

Centred text
Quoted lines (such as in emails)
Emphasised text
Unix emphasis characters

Centred text detection

AscToHTM can be made to attempt automatic detection of centred text. When enabled, the indentation and length of each line is compared to the nominal page width within a specified tolerance (see page width and Automatic centring tolerance).

If the line appears centred (and meets a few other conditions) then it will be rendered centred in the output.
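
The comparison can be pictured as checking that the left and right margins of a line are roughly equal. The Python sketch below is one possible reading of that test; the name looks_centred and the default values for page_width and tolerance are assumptions, and the "few other conditions" mentioned above are not modelled.

```python
def looks_centred(line, page_width=80, tolerance=2):
    """Guess whether a line is centred within the nominal page width."""
    text = line.expandtabs().rstrip()
    stripped = text.lstrip()
    if not stripped:
        return False                      # blank lines are never centred
    left = len(text) - len(stripped)      # left margin in characters
    right = page_width - len(text)        # space remaining on the right
    return left > 0 and abs(left - right) <= tolerance
```

The sketch also shows why the calculation is sensitive to the page width: a wrong page_width shifts the right margin of every line at once.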

This option is normally left switched off, as it is fairly prone to errors, not least because the calculation is sensitive to getting the page width calculation correct. When it goes wrong you are liable to find the document centres lines that shouldn't be.

Quoted line detection

AscToHTM recognises that, especially in Internet files, it is increasingly common to quote from other text sources such as e-mail. The convention used in such cases is to insert a quote character such as ">" at the start of each line.

Consequently, AscToHTM adds a line break at the end of such lines to preserve the line structure of the original, and marks it up in italics to differentiate the quoted text.
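
A minimal Python sketch of this treatment follows. The exact tags AscToHTM emits are not stated here, so the use of <I> and <BR>, along with the name mark_quoted_line and the quote_char parameter, should be read as assumptions.

```python
import html

def mark_quoted_line(line, quote_char=">"):
    """Italicise a quoted line and end it with a line break."""
    escaped = html.escape(line)           # '>' must become '&gt;' in HTML
    if line.lstrip().startswith(quote_char):
        return f"<I>{escaped}</I><BR>"
    return escaped
```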

Emphasis detection

AscToHTM can look for text emphasised by placing asterisks (*) or underscores (_) either side of it. AscToHTM will convert the enclosed text to bold and italic markup respectively.

AscToHTM will also look for combinations of asterisks and underscores which will be placed in bold italic. The asterisks and underscores should be properly nested.

The emphasised word or phrase should span no more than a few lines, and in particular should not span a blank line. If the phrase is longer, or if AscToHTM fails to match opening and closing emphasis marks, the characters are left unconverted.

Tests are made to ignore double asterisks and underscores, and sometimes adjacent punctuation will prevent the text being marked up.

Only markup that occurs in matched pairs over 2-3 lines will be converted, so _this and that* won't be converted.

For example the following two paragraphs :-

        Here are *bold* and _italic_ words.  The phrase _The Guardian Newspaper_
        would appear in italics.  The words *_this_* and _*that*_ would appear in
        bold italics.

        The program can cope with phrases such as _Alice in
        Wonderland_ which span more than one line.

Becomes

Here are bold and italic words. The phrase The Guardian Newspaper would appear in italics. The words this and that would appear in bold italics.

The program can cope with phrases such as Alice in Wonderland which span more than one line.
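
The matching rules above can be approximated with two regular expressions, as in the Python sketch below. It is an illustration under stated assumptions, not the program's implementation: the <B> and <I> tags, the name mark_emphasis and the patterns themselves are hypothetical, and the function expects a single paragraph so that a match can never span a blank line.

```python
import re

# An opening mark must not follow a letter, digit or another mark, and a
# closing mark must not precede one. Doubled marks (** or __) are skipped.
BOLD = re.compile(
    r"(?<![A-Za-z0-9*])\*(?!\*)(\S(?:[^*]*?\S)?)\*(?![A-Za-z0-9*])"
)
ITALIC = re.compile(r"(?<!\w)_(?!_)(\S(?:[^_]*?\S)?)_(?!\w)")

def mark_emphasis(paragraph):
    """Convert *bold* and _italic_ marks within one paragraph."""
    paragraph = BOLD.sub(r"<B>\1</B>", paragraph)
    return ITALIC.sub(r"<I>\1</I>", paragraph)
```

Unmatched marks such as _this and that* are left alone because neither pattern finds a closing partner of the same kind.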

Unix emphasis character detection

AscToHTM also tries to handle use of Ctrl-H in Unix documents. In such documents Ctrl-H can be used to overstrike characters. Common effects are double printing and underlining. Where detected AscToHTM will use bold and underlining markup.

Examples could include:-

The word this^H^H^H^H____ is underlined. The word that^H^H^H^Hthat is bold (overwritten twice).
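
Overstriking can be modelled by replaying the text onto a row of character cells, moving left on each Ctrl-H. The Python sketch below does this for a single line; the name render_overstrikes and the <B> and <U> output tags are assumptions made for illustration.

```python
from itertools import groupby

def render_overstrikes(line):
    """Replay Ctrl-H (backspace) overstrikes and emit bold/underline tags."""
    cells = []                  # each cell is [char, bold, underlined]
    pos = 0
    for ch in line:
        if ch == "\b":          # Ctrl-H moves the print position left
            pos = max(0, pos - 1)
            continue
        if pos == len(cells):
            cells.append([ch, False, False])
        else:
            cell = cells[pos]
            if ch == cell[0]:
                cell[1] = True              # same character twice: bold
            elif ch == "_":
                cell[2] = True              # underscore over text: underline
            elif cell[0] == "_":
                cell[0], cell[2] = ch, True # text over underscore: underline
            else:
                cell[0] = ch                # anything else simply overwrites
        pos += 1
    out = []
    for (bold, under), run in groupby(cells, key=lambda c: (c[1], c[2])):
        text = "".join(c[0] for c in run)
        if bold:
            text = f"<B>{text}</B>"
        if under:
            text = f"<U>{text}</U>"
        out.append(text)
    return "".join(out)
```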

Adding hyperlinks

The software can add active hyperlinks to the following :-

Cross-references to numbered sections
URLs of various types
Usenet newsgroups
email addresses
User-specified keywords

You can control which features get hyperlinks added by modifying the available hyperlink policies.

Contents List detection

AscToHTM will attempt to detect a contents list in a number of ways :

By detecting "table of contents", "end contents" or something similar in the text.
By spotting the numbering sequence has been repeated twice. AscToHTM will assume the first set is the contents list.
By spotting pre-processor markup.

This is often a hit-and-miss procedure, and is liable to error.

Should the analysis fail, you can attempt to correct it via the contents list policies.

In addition to converting existing contents lists, AscToHTM can generate a contents list from the observed headings.

The various situations in which a contents list may or may not be created are described in Contents list generation.

Cross-reference detection

AscToHTM can convert cross-references to other sections into hyperlinks to those sections. Unfortunately this is currently only possible for second, third, fourth... level numeric headings (n.n, n.n.n, n.n.n.n etc).

This is because the error rate becomes too high on single numbers/letters or roman numerals. This may be refined in future releases, although it's hard to see how that would work.
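
The restriction to multi-level numbers is easy to express as a pattern. The Python sketch below finds candidate references of the form n.n, n.n.n and so on, and ignores single numbers. The name find_section_refs and the pattern are hypothetical, and a real converter would also have to check each candidate against the headings actually found (an IP address, for instance, would match this pattern too).

```python
import re

# Two or more dot-separated numbers, not embedded in a longer word.
SECTION_REF = re.compile(r"(?<![\w.])\d+(?:\.\d+)+(?!\w)")

def find_section_refs(text):
    """Return candidate multi-level section numbers found in the text."""
    return SECTION_REF.findall(text)
```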

It is possible to use AscToHTM tags though, for example the GOTO command and POPUP command can create links to named sections.

For example

        See [[goto cross-reference detection]]

becomes

See cross-reference detection

URL detection

AscToHTM can convert any URLs in the document to hyperlinks. This includes HTTP and FTP URLs and any web addresses beginning with www.

The domain name part of the URL will be checked against the known domain name structures and country codes to check it falls within an allowed group. So www.somewhere.thing won't be allowed as ".thing" isn't a proper top level domain.
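
A check of this kind might look like the Python sketch below. The set KNOWN_TLDS is a tiny illustrative subset chosen for the example, not the list AscToHTM uses, and the name has_known_tld is hypothetical. URLs that use IP addresses would need separate handling and are simply rejected here.

```python
from urllib.parse import urlsplit

# Illustrative subset only; a real list would be far longer.
KNOWN_TLDS = {"com", "org", "net", "edu", "gov", "uk", "de", "fr"}

def has_known_tld(url):
    """Check that the host name of a URL ends in a recognised domain."""
    if "://" not in url:
        url = "http://" + url             # bare www. addresses have no scheme
    host = urlsplit(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower() in KNOWN_TLDS
```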

URLs that use IP addresses or some more obscure methods of specifying domain names will also be recognised, but the link will be changed wherever possible to either a domain name or an IP address. This will de-obfuscate any obscure references so beloved by spammers.

Usenet Newsgroup detection

AscToHTM can convert any newsgroup names it spots into hyperlinks to those newsgroups. Because this is prone to error, AscToHTM currently only converts newsgroups in known USENET hierarchies such as rec.gardens by default.

This can be overcome either by

placing "news:" in front of the newsgroup name (e.g. news:this.is.a.newsgroup.honest)
relaxing this condition via a document policy (see Only use known groups).
specifying the newsgroup hierarchy as recognised via a policy (see Recognised USENET groups).
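
The default rule and the "news:" override can be sketched in Python as follows. The hierarchy list is a small assumed sample, and the name newsgroup_link and the exact form of the generated link are illustrative only.

```python
# Assumed sample of well-known hierarchies; not the program's actual list.
KNOWN_HIERARCHIES = {"alt", "comp", "misc", "news", "rec", "sci", "soc", "talk"}

def newsgroup_link(token, known=KNOWN_HIERARCHIES):
    """Return a news: hyperlink for a newsgroup name, or None if rejected."""
    explicit = token.startswith("news:")
    name = token[len("news:"):] if explicit else token
    parts = name.split(".")
    if len(parts) < 2 or not all(parts):
        return None                        # not a dotted group name
    if explicit or parts[0] in known:
        return f'<A HREF="news:{name}">{name}</A>'
    return None
```

Passing a larger set as the known argument corresponds loosely to extending the recognised hierarchies by policy.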

E-mail address detection

AscToHTM can convert any email addresses into hypertext mailto: links. As with URL detection, the domain name is checked to see it falls into a recognised group.

User-specified keywords

AscToHTM can convert user-specified keywords into hyperlinks. The words or phrase to be converted must lie on a single line in the source document. Care should be taken to ensure keywords are unambiguous. Normally I mark my keywords in [] brackets if authoring for conversion by AscToHTM.

See the discussion in Using link dictionary files.

Headings and section titles

AscToHTM recognises various types of headings. Where headings are found, and deemed to be consistent with the prevailing document policy (correct indentation, right type, in numerical sequence etc), AscToHTM will use the standard "Heading n" styles.

In addition to this, AscToHTM will insert a named bookmark to allow hyperlink jumps to this point. These bookmarks are used for example in any cross-reference hyperlinks that AscToHTM generates, and also by any GOTO tags.

The configuration of these options is described under Headings Policies.

Numbered heading detection

Sections of type N.N.N can be checked for consistency, and references to them can be spotted and converted into hyperlinks.

At present more exotic numbering schemes using roman numerals and letters of the alphabet are not fully supported.

Capitalised heading detection

AscToHTM can treat wholly capitalised lines as headings. It also allows for such headings to be spread over more than one line.

Underlined heading detection

AscToHTM can recognize underlined text (e.g. a row of minus signs), and optionally promote the preceding line to be a section header.

The "underlining" line should have no gaps in it, and should be a similar length to the preceding heading. If these conditions aren't met you'll probably get a horizontal rule instead.

If you're authoring a file from scratch, it is probably best to use underlined headings for ease of use.

The level of heading associated with an underlined heading depends on the underline character as follows:-

'****' level 1

'====','////' level 2

'----','____','~~~~' level 3

'....' level 4

The actual markup that each heading gets may depend on your policies. In particular level 3 and level 4 headings may be given the same size markup to prevent the level 4 heading becoming smaller than the text it is heading. However the logical difference will be maintained, e.g. in a generated contents list, or when choosing the level of heading at which to split large files into many HTML pages.
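
The underline rules listed above can be summarised in a short Python sketch. The character-to-level mapping comes from the list above; the name underlined_heading_level and the tolerance parameter (how much the two line lengths may differ) are assumptions.

```python
UNDERLINE_LEVELS = {"*": 1, "=": 2, "/": 2, "-": 3, "_": 3, "~": 3, ".": 4}

def underlined_heading_level(heading, underline, tolerance=2):
    """Return the heading level implied by an underline, or None."""
    heading, underline = heading.strip(), underline.strip()
    if not heading or not underline:
        return None
    if len(set(underline)) != 1:
        return None            # gaps or mixed characters: not an underline
    if abs(len(underline) - len(heading)) > tolerance:
        return None            # lengths too different: more likely a rule
    return UNDERLINE_LEVELS.get(underline[0])
```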

Embedded heading detection

The program can look for headings "embedded" in the first paragraph. Such headings are expected to be a complete sentence or phrase in UPPER CASE at the start of a paragraph. Where detected the heading will be marked up in bold, rather than <Hn> markup, although it will still be added to, and accessible from any hyperlinked contents list you generate for the document.

At present such headings are not auto-detected... you need to switch on the Expect Embedded headings policy.

Key phrase headings

The program can now look for lines that start with particular words or phrases of your choice (such as "Chapter", "Part", "Title") and treat these lines as headings. Previously this only worked in a limited way if the heading line was also numbered ("Chapter 1" etc).

To use this feature, set the policy Heading Key phrases.

Numbered paragraph detection

Some types of documents use what look like section numbers to number paragraphs (e.g. legal documents, or sets of rules).

AscToHTM can recognize this, and mark up such lines by placing the number in bold, and not using the "Heading n" style on the whole line.

Mail and USENET headers

Some documents, especially those that were originally email or USENET posts, come with header lines, usually in the form of a number of lines with a keyword followed by a colon and then some value.

AscToHTM can recognize these (to a limited extent). Where these are detected the program will parse the header lines to extract the Subject, Author and Date of the article concerned. A heading containing this information will then be generated to replace all the unsightly header lines.

Heading policies in a policy file

AscToHTM has the following section heading policies that will normally be correctly calculated on the analysis pass :-

Expect Numbered Headings

Expect Underlined Headings

Expect Capitalised Headings

Expect Embedded Headings

Heading key phrases

Check indentation for consistency

Expect Second Word Headings

First Section Number

Smallest allowed <Hn> tag

Largest allowed <Hn> tag

Preserve underlining of headings

Section headers are far and away the most complex things the analysis pass has to detect, and the most likely area for errors to occur.

AscToHTM will also document to a policy file the headings it finds. This is still to be finalised, but currently has the format

      We have 4 recognised headings
          Heading level 0 = "" N at indent 0
          Heading level 1 = "" N.N at indent 0
          Contents level 0 = "" N at indent 0
          Contents level 1 = "" N.N at indent 2

AscToHTM will read in such lines from a policy text file, but does not yet fully support editing these via the Windows interface.

The syntax is explained below, but this will probably change in future releases. You can edit these lines in your policy file, and through the policy options in Windows.

The lines are currently structured as follows

      Line component   Value

      xxxx             Either "Heading" or "Contents" according
                       to the part of the policy being described

      Level n          Level number, starting at 0 for chapters
                       1 for level 1 headings etc.

      "Some_word"      Any text that may be expected to occur before
                       the heading number. E.g. "Chapter" or "Section"
                       or "[". The case is unimportant.

      N.Nx             The style of the heading number. This will
                       ultimately (in later versions) be read
                       as a series of number/separator pairs.

                       The proposed format is
                         "N"       = number
                         "i" / "I" = lower/upper case roman numeral
                       with an 'x' at the end signalling that trailing
                       letters may be expected (e.g. 5.6a, 5.6b)

      at indent n      The indentation that this heading is expected
                       at. This is important in helping to eliminate
                       false candidates.
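
For readers who want to process such policy lines themselves, the Python sketch below parses the format shown in the sample above. It is based only on those sample lines, so the pattern, the name parse_heading_policy and the dictionary keys are assumptions, and the format itself may change in future releases.

```python
import re

# Matches lines such as:  Heading level 1 = "" N.N at indent 0
POLICY_LINE = re.compile(
    r'^\s*(?P<kind>Heading|Contents) level (?P<level>\d+) = '
    r'"(?P<prefix>[^"]*)" (?P<style>\S+) at indent (?P<indent>\d+)\s*$'
)

def parse_heading_policy(line):
    """Split one heading policy line into its components, or return None."""
    match = POLICY_LINE.match(line)
    if not match:
        return None
    return {
        "kind": match["kind"],            # "Heading" or "Contents"
        "level": int(match["level"]),
        "prefix": match["prefix"],        # e.g. "Chapter", often empty
        "style": match["style"],          # e.g. "N.N"
        "indent": int(match["indent"]),
    }
```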

Pre-formatted text

The software can detect various forms of pre-formatted text. This is text laid out in such a way that the spacing used is critical. Spacing is not normally preserved in conversion to HTML, so the correct detection and handling of these special types of text is quite important.

Types of text recognised include the following

Lines
Form feed page markers
User defined pre-formatted text
Automatically detected pre-formatted text

Line detection

Lines are interpreted in context. If they appear to be underlining text, or part of some pre-formatted structure such as a table, then they are treated as such. Otherwise they become horizontal rules.

An attempt is made to interpret half-lines etc as such, although the effect is only approximate.

Form feed page markers

Form feeds or page breaks become <HR> tags in the HTML, as HTML doesn't really support the concept of (layout) pages.

User defined pre-formatted text

AscToHTM allows users to define their own regions of pre-formatted text, using the BEGIN_PRE and END_PRE pre-processor tags (see Using the pre-processor).

Such areas are marked up in <PRE> tags, which use a non-proportional font to preserve the relative spacing.

For example :-

      The use of BEGIN_PRE and END_PRE pre-processor
        commands (see 7.1) in
          the text documents
            tells AscToHTM that
              this portion of the
            document
          has been formatted
        by the user and
      should be left unchanged.

Automatically detected pre-formatted text

AscToHTM attempts to spot sections of preformatted text. This can vary from a single line (e.g. a line with a page number on the right-hand margin) to a complete table of data.

Where such text is detected AscToHTM analyses the section to determine what type of pre-formatted text it is. Options include

Tables
Code samples
ASCII Art and diagrams
some other formatted text

A number of policies allow you to control

whether or not the program looks for such text
how sensitive it is to "pre-formatted" text
how inclined the program is to "extend" the region to adjacent lines
whether or not table generation should be attempted
various aspects of any table analysis that is carried out.

See Pre-formatted text policies for full details.

You can adjust the sensitivity of AscToHTM to pre-formatted text by setting the minimum number of lines required for a pre-formatted region using the Minimum automatic <PRE> size policy.

HTML collapses white space in the source document, so any hand-crafted layout information would normally get lost. When AscToHTM detects such regions it marks them up in a fixed width font, which tells HTML this region is pre-formatted.

When tables are detected, AscToHTM will attempt to generate the correct HTML table.

When AscToHTM gets the detection wrong you can use the AscToHTM pre-processor to mark up regions of your document you wish preserved.

Table detection

Tables are marked out by their use of white space, and a regular pattern of gaps or vertical bars being spotted on each line. AscToHTM will attempt to spot the table, its columns, its headings, its cell alignment and entries that span multiple columns or rows.

Should AscToHTM wrongly detect the extent of a table, you can mark up a section of text by using the TABLE pre-processor markup (see the Tag manual). Alternatively you can try adding blank lines before and after, as the analysis uses white space to delimit tables.

You can alter the characteristics of all tables via the table policies (see Table generation policies).

You can alter the characteristics of all or individual tables via the table pre-processor commands (see TABLE).

Or you can suppress the whole thing altogether via the Attempt TABLE generation policy.

Code sample detection

AscToHTM attempts to recognize code fragments in technical documents. The code is assumed to be "C++" or "Java"-like, and key indicators are, for example, the presence of ";" characters on the end of lines.

Should AscToHTM wrongly detect the extent of a code fragment, you can mark up a section of text by using the CODE pre-processor markup.

Or you can suppress the whole thing altogether via the policy Expect code samples.

ASCII art and diagram detection

AscToHTM attempts to recognize ASCII art and diagrams in documents. Key indicators include large numbers of non-alphanumeric characters and the use of white space.

However, some diagrams use the same mix of line and alphabetic characters as tables, so the two sometimes get confused.

Should AscToHTM wrongly detect the extent or type of a diagram, you can mark up a section of text by using the DIAGRAM pre-processor markup.

Text block detection

(New in version 5.0)

If AscToHTM detects a block of text at a large indent, it will now place that text in such a way as to preserve as faithfully as possible the original indent.

Other formatted text

If AscToHTM detects formatted text, but decides that it is neither table, code nor art (and it knows what it likes), then the text may be put out "as normal", but with the original line structure preserved.

In such regions other markup (such as bullets) may not be processed as it would be elsewhere.
