Documentation for the Detagger html to text converter and markup removal utility

The latest version of these files is available online at http://www.jafsoft.com/doco/docindex.html



Using policy files

Options available from the conversion options menu are also known as "policies". All of JafSoft Limited's conversion tools use the same policy file mechanism.

Options can be saved to these policy files, so that different sets of options can be stored and easily reloaded for later conversions.

Policy files are just plain text files, with one policy per line. Each policy line takes the form

<policy name> : <policy value>

You can edit these files in a text editor, but you must be careful to use the correct policy name.

In Detagger almost all policies can be set through the conversion options menu.

Contents of this section

What are Policy files?
Alphabetical list of Detagger policies
Markup removal policies
Detag policies
Detag tables policies
Tag manipulation policies
Text conversion policies
Data Extraction policies
Text Format policies
Page Layout options
Bullet options
Heading options
Miscellaneous Formatting policies
Dialogue options
Other Options
Unicode Options
Text Hyperlink policies
Text Marker policies
Text paragraph policies
Text Table policies
Table border options
Table width options
Miscellaneous Table options
Miscellaneous policies
Configuration file policies
Other policies

What are Policy files?

Detagger has a large number of options available to influence the processing of your text files. These options are called "policies" as they govern how the source file should be interpreted and converted.

Policies may be saved in text files, known as policy files. These files have a ".pol" extension by default. The policy files are usually updated by changing the policies and saving the changes in a new file. Because they are text files you can also edit them directly in a text editor. The files hold one policy per line, with each line in the form

<policy name> : <policy value>

The use of policy files allows a given set of options to be saved and reused for other conversions, or for later conversions of the same file. See Using policy files for more information.
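
As an illustration of the line format described above, here is a minimal Python sketch that reads policy lines into a dictionary. It is not part of Detagger. The function name parse_policy_text is hypothetical, the sample values are made up for illustration, and the sketch assumes that policy names never contain a colon.

```python
def parse_policy_text(text):
    """Parse '<policy name> : <policy value>' lines into a dict."""
    policies = {}
    for line in text.splitlines():
        # Skip blank lines and anything without a name/value separator.
        if ":" not in line:
            continue
        # Split on the first colon only, so values may contain colons.
        name, value = line.split(":", 1)
        policies[name.strip()] = value.strip()
    return policies


# The policy names below appear in this section; the values are
# assumptions chosen purely for illustration.
sample = """\
Minimum table depth : 2
Maximum table depth : 2
Output table format : 2
"""
policies = parse_policy_text(sample)
print(policies["Minimum table depth"])
```

A dictionary keeps only the last value if the same policy appears twice, which may or may not match how the program itself treats repeated lines.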


Alphabetical list of Detagger policies

Here is an alphabetic list of policy names. Where possible a link is supplied to the equivalent option in the user interface.

Markup removal policies

These policies allow you to control the tag removal process. You can choose what tags are to be removed, or what manipulations you'd like performed on each tag.

Detag policies
Detag tables policies
Tag manipulation policies

Detag policies

These policies allow you to choose which tags are to be removed from the markup.

Remove all tags

Remove All Tags

Menu location: Configuration Options -> Markup Manipulation ->Detagger options

When this option is selected, all tags will be removed from the markup. This effectively turns the conversion into an HTML-to-text conversion.

Remove All HTML Tags

Menu location: Configuration Options -> Markup Manipulation ->Detagger options

When this option is selected, all HTML tags will be removed from the markup. This will leave only the non-HTML tags.

Remove <!DOCTYPE> tags

Menu location: Configuration Options -> Markup Manipulation ->Detagger options

When this option is selected any <!DOCTYPE> tag in the document will be removed. This can be useful for removing a DOCTYPE that is no longer valid, e.g. if you are concatenating the results or migrating the files to some form of XML.

Remove HTML <HEAD> section

Menu location: Configuration Options -> Markup Manipulation ->Detagger options

When this option is selected any <HEAD>...</HEAD> section in the document is removed. This might be a useful precursor to merging two HTML files together, although the <HTML> and <BODY> tags will be left in the output.

Remove HTML-style comments

Menu location: Configuration Options -> Markup Manipulation ->Detagger options

When this option is selected all <!-- ... --> style comments are removed.
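
To show the effect of this option, the following Python sketch strips comments with a regular expression. It is only an approximation for illustration, not Detagger's own implementation, and the function name strip_comments is hypothetical.

```python
import re

# Non-greedy match from "<!--" to the nearest "-->", across line breaks.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments(html):
    """Return html with all <!-- ... --> comments removed."""
    return COMMENT_RE.sub("", html)


print(strip_comments("<p>Kept<!-- dropped --> text</p>"))
```

A regular expression like this does not understand <SCRIPT> or <STYLE> sections, so comment-like text inside them would be removed as well.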

Remove HTML <SCRIPT> section

Menu location: Configuration Options -> Markup Manipulation ->Detagger options

When this option is selected all <SCRIPT>..</SCRIPT> sections are removed, effectively removing all scripted content from the file.

Remove HTML <OBJECT> tags

Menu location: Configuration Options -> Markup Manipulation ->Detagger options

When this option is selected all <OBJECT>..</OBJECT> sections are removed, effectively removing embedded active content from the file.

Remove HTML <DIV> tags

New in version 2.4

Menu location: Configuration Options -> Markup Manipulation ->Detagger options

When this option is selected all <DIV..> and </DIV> tags are removed.


Remove HTML <SPAN> tags

New in version 2.4

Menu location: Configuration Options -> Markup Manipulation ->Detagger options

When this option is selected all <SPAN..> and </SPAN> tags are removed.

Remove HTML <FORM>..</FORM> tags

Menu location: Configuration Options -> Markup Manipulation ->Detagger options

When this option is selected all tags associated with forms will be removed. This includes the <FORM>, <INPUT>, <SELECT>, <OPTION>, <TEXTAREA>, <FIELDSET>, <LEGEND> and <LABEL> tags.

Any visible markup (such as tables) integrated with the <FORM> is left intact. This may not always be desirable, and this issue may be addressed in later releases.

Remove HTML <FONT> tags

Menu location: Configuration Options -> Markup Manipulation ->Detagger options

When this option is selected all <FONT..> and </FONT> tags are removed. <FONT> tags are deprecated in later versions of HTML in favour of CSS style sheets. <FONT> tags can also make a file much larger than it need be, so removing them can be desirable for a number of reasons.

Remove emphasis tags

Menu location: Configuration Options -> Markup Manipulation ->Detagger options

When this option is selected, all emphasis markup such as <B>, <I>, <STRONG> and <EM> tags are removed.

Replace hyperlinks by the display value

Menu location: Configuration Options -> Markup Manipulation ->Detagger options

When this option is selected all hyperlinks are removed. Each hyperlink is replaced by its visible content; only the "link" part is removed.
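
The effect can be sketched in Python as follows. This is an illustration only, not the program's implementation. The name unlink is hypothetical, and the sketch assumes that no attribute value contains a ">" character.

```python
import re

# Matches an opening <a ...> tag or a closing </a> tag, in any letter case.
# Requiring whitespace or ">" after the "a" avoids matching tags such as <abbr>.
ANCHOR_TAG_RE = re.compile(r"</?a(\s[^>]*)?>", re.IGNORECASE)


def unlink(html):
    """Remove anchor tags but keep the text they enclose."""
    return ANCHOR_TAG_RE.sub("", html)


print(unlink('<a href="http://www.jafsoft.com/detagger/">HTML to text</a>'))
```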

Remove style sheet

Menu location: Configuration Options -> Markup Manipulation ->Detagger options

When this option is selected all <STYLE>..</STYLE> sections are removed and all references to external style sheets are removed.

Remove HTML <IMG> tags

Menu location: Configuration Options -> Markup Manipulation ->Detagger options

When this option is selected all images defined as <IMG> tags are removed.

Remove non-standard tags

Menu location: Configuration Options -> Markup Manipulation ->Detagger options

When this option is selected all tags not recognised as being part of HTML are removed. The standard used is currently HTML 4.0 Transitional.

See also Keep deprecated tags

Keep deprecated tags

Menu location: Configuration Options -> Markup Manipulation ->Detagger options

When the option Remove non-standard tags is selected, this option determines that any "deprecated" tags may be left in. These are tags recognised in earlier versions of HTML, but which are no longer strictly supported.

Usually these tags are still supported in browsers, and sometimes removing these tags will adversely affect your page's appearance, as newer forms of tagging are often required to achieve the same effect.

Remove non-standard tag attributes

Menu location: Configuration Options -> Markup Manipulation ->Detagger options

When this option is selected all tag attributes not recognised as being part of HTML are removed. The standard used is currently HTML 4.0 Transitional.

Note, in HTML 4.0 many tag attributes were deprecated, so that HTML code that relied heavily on tag attributes (e.g. to set paragraph alignment) may need to be heavily re-written to achieve the same effect under the later standard.

See also Keep deprecated tag attributes

Keep deprecated tag attributes

Menu location: Configuration Options -> Markup Manipulation ->Detagger options

When the option Remove non-standard tag attributes is selected, this option determines that any "deprecated" tag attributes may be left in. These are attributes that were recognised in earlier versions of HTML, but which are no longer strictly supported.

Usually these attributes are still supported in browsers, and sometimes removing these attributes will adversely affect your page's appearance, as newer forms of tagging are often required to achieve the same effect.

Remove all non-HTML tags

Menu location: Configuration Options -> Markup Manipulation ->Detagger options

When selected all tags not recognised as HTML will be removed from the file.

Depending on the source of the HTML these could be XML tags, or proprietary tags added by the software used to create the HTML.

Remove all x-tags used in mail messages

Menu location: Configuration Options -> Markup Manipulation ->Detagger options

When selected, tags believed to be added by an email package will be removed. For example Eudora adds tags in the form <x-whatever> to mark up key aspects of an HTML message. This option should remove such tags.

Remove Microsoft Office tags

Menu location: Configuration Options -> Markup Manipulation ->Detagger options

When selected tags believed to have been added by MS Office applications are removed. Some versions of MS Office add a lot of tagging - particularly to HTML exported from Excel - to describe a document's ownership and structure.

This option will attempt to remove (and to a limited extent tidy up) such markup.


Detag tables policies

These options allow a few limited tag manipulations to be performed on any <TABLE>s in the HTML during the markup removal.

Paragraphs

Attributes

Remove HTML table tags

Menu location: Configuration Options -> Markup Manipulation ->Tables

When this option is selected all "table" tags are removed (i.e. <table>, <tr>, <th>, <td>, <thead>, <tbody> and <tfoot>). This effectively removes all the table structure in a document, which can be useful if you want to view the HTML on a device with a small display less suited to tables.

Remove HTML <P> tags from tables

Menu location: Configuration Options -> Markup Manipulation ->Tables

When selected this option will replace <p>..</p> tags by a suitable pattern of <br> tags. <br> tags will be inserted between each paragraph in each cell. If a cell has only one paragraph, nothing is inserted.

This option can be useful when tidying up HTML created by certain word processing packages which needlessly insert <p>..</p> markup.

Remove HTML alignment attributes from tables

Menu location: Configuration Options -> Markup Manipulation ->Tables

When selected all alignment attributes ("align" and "valign") are discarded from the various TABLE, TBODY, THEAD, TFOOT, TR, TH and TD tags.

Remove HTML color attributes from tables

Menu location: Configuration Options -> Markup Manipulation ->Tables

When selected all colour attributes ("bgcolor") are discarded from the various TABLE, TBODY, THEAD, TFOOT, TR, TH and TD tags.

Remove HTML size attributes from tables

Menu location: Configuration Options -> Markup Manipulation ->Tables

When selected all sizing attributes ("cellpadding", "cellspacing", "height" and "width") are discarded from the various TABLE, TBODY, THEAD, TFOOT, TR, TH and TD tags.

Note logical size attributes such as "colspan" and "rowspan" are left intact, as is "border".


Tag manipulation policies

These options allow a few limited tag manipulations to be performed during the markup removal.

tag conversions

character replacements

Convert tags to lower case

Menu location: Configuration Options -> Markup Manipulation -> Tag manipulation

When selected all tags and attributes will be converted to lower case. Any attribute values will be left unchanged.
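
The distinction between attribute names (converted) and attribute values (left alone) can be sketched in Python as below. This is an illustrative approximation rather than Detagger's own code. The function names are hypothetical, and the sketch assumes that attribute values do not contain "<" or ">".

```python
import re

# A start or end tag: optional slash, tag name, then everything up to ">".
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*)>")
# An attribute name, optionally followed by =value (quoted or unquoted).
ATTR_RE = re.compile(
    r"""([A-Za-z_:][-A-Za-z0-9_:.]*)(\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'>]+))?"""
)


def _lower_tag(match):
    slash, name, rest = match.groups()
    # Lower-case each attribute name; keep its value exactly as written.
    rest = ATTR_RE.sub(lambda m: m.group(1).lower() + (m.group(2) or ""), rest)
    return "<" + slash + name.lower() + rest + ">"


def lowercase_tags(html):
    """Lower-case tag and attribute names, leaving values and text unchanged."""
    return TAG_RE.sub(_lower_tag, html)


print(lowercase_tags('<TD WIDTH="30%" Class=Menu>Text</TD>'))
```

Swapping .lower() for .upper() in both places would give the equivalent of the upper case option.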

Convert tags to UPPER case

Menu location: Configuration Options -> Markup Manipulation -> Tag manipulation

When selected all tags and attributes will be converted to UPPER CASE. Any attribute values will be left unchanged.

Replace entities by text

Menu location: Configuration Options -> Markup Manipulation -> Tag manipulation

When selected, any character entities such as &copy; or &#163; will be converted to their ANSI equivalents.

Note, not all such entities have an ANSI equivalent, and those that don't will be left unchanged.

The ANSI character set is an "8-bit" character set, that is, each character is represented by a value in the range 0-255. Of these the first 128 characters are known as the "7-bit" characters. Whilst almost all character sets support 7-bit characters, not all support the "upper" 8-bit characters, so you may not want to allow 8-bit characters. For that reason there are two more options.

See also Allow ANSI alternatives (e.g. space for &nbsp;) and Allow 8-bit ANSI values in output

Allow ANSI alternatives (e.g. space for &nbsp;)

Menu location: Configuration Options -> Markup Manipulation -> Tag manipulation

When selected, any entity that maps onto an upper 8-bit character will be allowed (e.g. &copy; will be replaced by the copyright symbol).

Allow 8-bit ANSI values in output

Menu location: Configuration Options -> Markup Manipulation -> Tag manipulation

The GUI shows this option as "Use 7-bit ASCII alternatives where possible" (which, in fact, has the exact opposite meaning). When that is set, this policy is disabled (and vice versa).

When this policy is enabled (i.e. 'Use 7-bit...' is unchecked), then any 8-bit characters are allowed to pass unchanged. The display of such characters will depend on the language and Operating System of the computer used to view the results file.

When this policy is disabled (i.e. 'Use 7-bit...' is checked), then any entity that can be approximated by using one or more 7-bit characters will be replaced by that approximation. For example &nbsp; will become a single space, while &mdash; will become two hyphens "--".
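
The two behaviours described above can be sketched in Python. This is a sketch under stated assumptions, not the program's implementation: "ANSI" is taken to mean the Windows-1252 code page, the approximation table SEVEN_BIT holds only the two examples given in the text, numeric entities such as &#163; are not handled, and the function name replace_entities is hypothetical.

```python
import re
from html.entities import name2codepoint

# Hypothetical 7-bit approximations; only the two examples from the text.
SEVEN_BIT = {"nbsp": " ", "mdash": "--"}
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def replace_entities(text, allow_8bit):
    """Replace named entities, optionally restricting output to 7-bit."""
    def repl(match):
        name = match.group(1)
        if not allow_8bit and name in SEVEN_BIT:
            return SEVEN_BIT[name]
        codepoint = name2codepoint.get(name)
        if codepoint is None:
            return match.group(0)      # unknown entity: leave unchanged
        char = chr(codepoint)
        if codepoint < 128:
            return char                # plain 7-bit character
        if allow_8bit:
            try:
                char.encode("cp1252")  # assumed meaning of "ANSI"
            except UnicodeEncodeError:
                return match.group(0)  # no ANSI equivalent: leave unchanged
            return char
        return match.group(0)          # 8-bit not allowed, no approximation
    return ENTITY_RE.sub(repl, text)


print(replace_entities("A&nbsp;B &mdash; C", allow_8bit=False))
```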


Text conversion policies

The following screens allow access to options to control various aspects of the text conversion process.

Data Extraction policies

New in version 2.4

These policies help you refine the use of Detagger as a data extraction tool. Note, Detagger wasn't designed as a data mining tool, but with these options you can choose to focus only on data in tables at a certain level, and to then turn the selected data into a delimited format that will make it easier to post-process the results file (e.g. to import it into a spreadsheet or database).

Table extraction policies

These options can be used to tell the program to only extract data from tables, or that only certain level(s) of nested table should be extracted from the source document.

Table data handling

These options specify how the data inside tables should be handled. Only one option may be selected. If comma- or tab-delimited data is selected, then the text will be output between table delimiters.

In the Policy File the policy Output table format takes the value 1, 2 or 3 according to which of the following is selected. In the user interface you need only select the desired option.

If you choose to convert the table into a delimited data format, the following two options become available :-

Show only table data

New in version 2.4

Menu location: Configuration Options -> Conversion to text -> Data extraction

When enabled, only that part of the input file contained in HTML <TABLE> markup will be included in the output. How that data is formatted will depend on the other text table policies, and the remaining data extraction policies that have been chosen.

To further refine which table data is extracted, use the Minimum table depth and Maximum table depth policies.

Minimum table depth

New in version 2.4

Menu location: Configuration Options -> Conversion to text -> Data extraction

This policy, together with Maximum table depth specifies a range of table depths from which data should be extracted.

Many HTML pages use tables to lay out a page as well as to mark up tabular data. These nested table structures can be quite complex and can get in the way of accessing the inner data. Similarly, menus and the like are often implemented by "mini" tables inside tables, inside tables etc.

With these policies you can elect to ignore all data placed in tables at a level higher or lower than that of interest.

Consider the HTML code

        <TABLE WIDTH="100%">
          <TR>
             <TD WIDTH="30%">
                <!--- left hand menu -->
                <a href="http://www.jafsoft.com/products.html">This is menu item 1</a><br>
                <a href="http://www.jafsoft.com/detagger/">This is menu item 2</a><br>
             </TD>
             <TD WIDTH="70%">

                <!-- main part of the page -->
                <TABLE>
                 <TR>
                   <TH>Date</TH>
                   <TH>Total</TH>
                 </TR>
                 <TR>
                   <TD>April 13</TD>
                   <TD>1,001</TD>
                 </TR>
                 <TR>
                   <TD>May 21</TD>
                   <TD>908</TD>
                 </TR>
               </TABLE>
             </TD>
           </TR>
        </TABLE>
        <TABLE WIDTH="100%">
           <TR>
             <TD COLSPAN="2">
               <TABLE WIDTH="100%">
                 <TR>
                   <TD>
                      <TABLE WIDTH="100%">
                        <TR>
                          <TD><a href="http://www.jafsoft.com/asctohtm/">Text to HTML</a></TD>
                          <TD><a href="http://www.jafsoft.com/asctotab/">Text to table</a></TD>
                          <TD><a href="http://www.jafsoft.com/asctortf/">Text to RTF</a></TD>
                          <TD><a href="http://www.jafsoft.com/asctopdf/">Text to PDF</a></TD>
                          <TD><a href="http://www.jafsoft.com/detagger/">HTML to text</a></TD>
                        </TR>
                      </TABLE>
                    </TD>
                 </TR>
               </TABLE>
             </TD>
           </TR>
         </TABLE>

When converted to text this becomes

        +-------------------+----------------+
        |This is menu item 1|+--------+-----+|
        |This is menu item 2||Date    |Total||
        |                   |+--------+-----+|
        |                   ||April 13|1,001||
        |                   |+--------+-----+|
        |                   ||May 21  |908  ||
        |                   |+--------+-----+|
        +-------------------+----------------+
        +---------------------------------------------------------------------+
        |+-----------------------------------------------------------------+  |
        ||+------------+-------------+-----------+-----------+------------+|  |
        |||Text to HTML|Text to table|Text to RTF|Text to PDF|HTML to text||  |
        ||+------------+-------------+-----------+-----------+------------+|  |
        |+-----------------------------------------------------------------+  |
        +---------------------------------------------------------------------+

That is, the page consists of a table of data containing dates and totals, but in addition this is placed in an outer table with a menu on the left. At the foot of the page is a navigation menu, implemented as a heavily nested table.

NOTE: The hyperlinks have been removed using the text hyperlink policies.

To get rid of the outer table set the Minimum table depth to 2. Doing this gives the following

        +--------+-----+
        |Date    |Total|
        +--------+-----+
        |April 13|1,001|
        +--------+-----+
        |May 21  |908  |
        +--------+-----+
        +-----------------------------------------------------------------+
        |+------------+-------------+-----------+-----------+------------+|
        ||Text to HTML|Text to table|Text to RTF|Text to PDF|HTML to text||
        |+------------+-------------+-----------+-----------+------------+|
        +-----------------------------------------------------------------+

Note that the text in the outer table (in this case the menu text) has been discarded, but we still have a doubly nested footer table. If we also set the Maximum table depth to 2, we get

        +--------+-----+
        |Date    |Total|
        +--------+-----+
        |April 13|1,001|
        +--------+-----+
        |May 21  |908  |
        +--------+-----+
        ++
        ||
        ++

That is, the footer becomes an empty table (when borders are displayed). Using the table data handling options you can now convert this to, for example, CSV format.

        "Date","Total"
        "April 13","1,001"
        "May 21","908"

Note, we've switched off borders and removed delimited table markers in this output.

Note, the Minimum table depth and Maximum table depth are fairly broad-brush policies: if there had been multiple tables with nested content at level 2, then it would all be included in the output. That said, these policies do offer some prospect of focusing the output on the data you want.
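
The depth filtering described above can be sketched with Python's standard html.parser module. This is an illustration of the idea only, not how Detagger is implemented. The class TableDepthExtractor, the function extract_rows and the shortened SAMPLE markup are all hypothetical, and the sketch assumes well-formed markup in which every table, row and cell tag is explicitly closed.

```python
from html.parser import HTMLParser


class TableDepthExtractor(HTMLParser):
    """Collect rows of cell text from tables within a range of nesting depths."""

    def __init__(self, min_depth, max_depth):
        super().__init__()
        self.min_depth, self.max_depth = min_depth, max_depth
        self.depth = 0    # current table nesting level (outermost table = 1)
        self.row = {}     # depth -> cells of the row currently open there
        self.cell = {}    # depth -> text fragments of the cell open there
        self.rows = []    # completed, non-empty rows

    def wanted(self):
        return self.min_depth <= self.depth <= self.max_depth

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
        elif tag == "tr" and self.wanted():
            self.row[self.depth] = []
        elif tag in ("td", "th") and self.wanted():
            self.cell[self.depth] = []

    def handle_endtag(self, tag):
        if tag == "table":
            self.depth -= 1
        elif tag in ("td", "th") and self.depth in self.cell:
            text = " ".join("".join(self.cell.pop(self.depth)).split())
            self.row.setdefault(self.depth, []).append(text)
        elif tag == "tr" and self.depth in self.row:
            row = self.row.pop(self.depth)
            if any(row):              # skip rows whose cells are all empty
                self.rows.append(row)

    def handle_data(self, data):
        # Text only counts towards a cell opened at the current depth, so
        # content of deeper or shallower tables is discarded.
        if self.depth in self.cell:
            self.cell[self.depth].append(data)


def extract_rows(html, min_depth, max_depth):
    parser = TableDepthExtractor(min_depth, max_depth)
    parser.feed(html)
    parser.close()
    return parser.rows


# A shortened stand-in for the page discussed above.
SAMPLE = """
<table><tr>
  <td>Menu item</td>
  <td><table>
    <tr><th>Date</th><th>Total</th></tr>
    <tr><td>April 13</td><td>1,001</td></tr>
  </table></td>
</tr></table>
<table><tr><td><table><tr><td><table>
  <tr><td>Footer link</td></tr>
</table></td></tr></table></td></tr></table>
"""
print(extract_rows(SAMPLE, 2, 2))
```

With both depths set to 2 the menu text (depth 1) and the footer link (depth 3) are dropped, mirroring the worked example above.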


Maximum table depth

New in version 2.4

Menu location: Configuration Options -> Conversion to text -> Data extraction

See the discussion in Minimum table depth


Output table format

Menu location: Configuration Options -> Conversion to text -> Data extraction

In the Policy File this policy takes the value 1, 2 or 3 according to which of the following is selected.

Any tables processed during the conversion will then be formatted according to the selection made.

Convert table to plain text

Menu location: Configuration Options -> Conversion to text -> Data extraction

When selected any tables will be converted into plain text. The software will look at the row and column structure of the HTML original, and attempt to lay this out in the current page width, although this may not always be possible.

See also Output table format

Convert table to comma-delimited data

Menu location: Configuration Options -> Conversion to text -> Data extraction

When selected any tables will be converted into comma-delimited data. Each row is put out as a row of comma-delimited data, with the data values themselves in quotes. The resulting data is in a format suitable for importing into spreadsheets.

Where tables are nested, by default only the innermost table will be converted in this way. Because there could be multiple tables in a file, each table is delimited as follows

        $_$_BEGIN_COMMA_DELIMITED_TABLE
        ...
        comma-delimited data rows
        ...
        $_$_END_TABLE

You can change this behaviour through the Convert only innermost tables and Add delimited table markers options.

See also Output table format
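
For comparison, output of the same shape can be produced with Python's standard csv module, as in the sketch below. The function name to_comma_delimited is hypothetical, and this is not the program's own code. In particular, the csv module escapes a quote inside a value by doubling it, which is an assumption and may differ from what Detagger does.

```python
import csv
import io


def to_comma_delimited(rows, add_markers=True):
    """Format rows as quoted, comma-delimited text, optionally with markers."""
    buffer = io.StringIO()
    # QUOTE_ALL puts every value in double quotes, as described above.
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(rows)
    body = buffer.getvalue()
    if add_markers:
        return "$_$_BEGIN_COMMA_DELIMITED_TABLE\n" + body + "$_$_END_TABLE\n"
    return body


rows = [["Date", "Total"], ["April 13", "1,001"], ["May 21", "908"]]
print(to_comma_delimited(rows), end="")
```

Quoting every value is what keeps a figure such as 1,001 from being split into two columns on import.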

Convert table to tab-delimited data

Menu location: Configuration Options -> Conversion to text -> Data extraction

When selected any tables will be converted into tab-delimited data. Each row is put out as a row of tab-delimited data, with the data values themselves in quotes. The resulting data is in a format suitable for importing into spreadsheets.

Where tables are nested, by default only the innermost table will be converted in this way. Because there could be multiple tables in a file, each table is delimited as follows

        $_$_BEGIN_DELIMITED_TABLE
        ...
        tab-delimited data rows
        ...
        $_$_END_TABLE

You can change this behaviour through the Convert only innermost tables and Add delimited table markers options.

See also Output table format

Convert only innermost tables

Menu location: Configuration Options -> Conversion to text -> Data extraction

For nested tables conversion to a delimited format can become a bit of a nightmare. Usually it will be the innermost table which is data, with outer tables being used for page layout, and so by default the software will only convert the innermost table of a nested set into delimited format.

However this may not always be the case, and so this option can be switched on to convert all levels of table into delimited format. In such cases you may have to tidy up the text (to delete any unwanted portions) before importing it into a spreadsheet.

Note, from version 2.4 onwards, the policies maximum table depth and minimum table depth offer a bit more control than this policy.

Add delimited table markers

Menu location: Configuration Options -> Conversion to text -> Data extraction

By default markers are put round the delimited data to separate it from normal text, and from other sections of delimited data. This makes sense for a file which contains non-table elements, or multiple tables, but is probably unnecessary for files that contain just a single table.

When enabled each table (or sub-table) will appear like this

        $_$_BEGIN_DELIMITED_TABLE
        ...
        delimited data rows
        ...
        $_$_END_TABLE

You can use this option to control this behaviour.
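
When the markers are present, a post-processing script can use them to pull each table out of the results file. The following Python sketch shows one way to do that. The function name split_delimited_tables is hypothetical. The sketch treats any line starting with "$_$_BEGIN" as the start of a table, so that it accepts both marker spellings shown in this section, and it assumes the rows use the quoting conventions understood by Python's csv module.

```python
import csv


def split_delimited_tables(text, delimiter=","):
    """Return a list of tables, each a list of rows, found between markers."""
    tables = []
    current = None                    # lines of the table being read, if any
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("$_$_BEGIN"):
            current = []
        elif stripped.startswith("$_$_END_TABLE"):
            if current is not None:
                tables.append(list(csv.reader(current, delimiter=delimiter)))
            current = None
        elif current is not None:
            current.append(stripped)
        # Lines outside any pair of markers are ordinary text and are ignored.
    return tables


results = """Some ordinary text
$_$_BEGIN_COMMA_DELIMITED_TABLE
"Date","Total"
"May 21","908"
$_$_END_TABLE
More ordinary text
"""
print(split_delimited_tables(results))
```

For tab-delimited output the same function can be called with delimiter="\t".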


Text Format policies

Detagger has a number of options to allow you to tailor the conversion to text. Some control what is copied across to the output, but most offer options to format the output above and beyond the formatting implicit in the original HTML.

Page Layout options

These options control how the text is laid out on the page.

Preserve all white space from the original source

Menu location: Configuration Options -> Conversion to text -> Formatting

When this option is selected the text is laid out as it was in the original HTML file, minus the actual tags. If this option is selected then all other text formatting options are ignored.

This option is suitable when the input file is only lightly tagged (e.g. a document with large <PRE> sections).

Impose a page width on the output

Menu location: Configuration Options -> Conversion to text -> Formatting

When selected this means that the lines of the text file should be formatted to match a target page width. This will involve moving text around within a paragraph if necessary.

Target page width

Menu location: Configuration Options -> Conversion to text -> Formatting

When line formatting is switched on, this is the target page width. If omitted, a default page width of 76 characters is applied.

See also target table width

Right justify the output text

Menu location: Configuration Options -> Conversion to text -> Formatting

When selected white space will be added to each line inside a paragraph so that the right margins are aligned as well as the left.
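
Wrapping to a page width and then justifying can be sketched in Python as follows. How Detagger distributes the extra white space is not documented here, so spreading it evenly from the left is an assumption, and the function names are hypothetical. The default width of 76 is the documented default target page width.

```python
import textwrap


def justify_line(line, width):
    """Pad the gaps between words so the line is exactly width characters."""
    words = line.split()
    if len(words) < 2:
        return line
    gaps = len(words) - 1
    spaces = width - sum(len(word) for word in words)
    if spaces < gaps:                 # line is already too long to pad
        return " ".join(words)
    base, extra = divmod(spaces, gaps)
    pieces = []
    for index, word in enumerate(words[:-1]):
        # The leftmost gaps absorb any spaces that do not divide evenly.
        pieces.append(word + " " * (base + (1 if index < extra else 0)))
    pieces.append(words[-1])
    return "".join(pieces)


def justify_paragraph(text, width=76):
    """Wrap text to width and justify every line except the last."""
    lines = textwrap.wrap(text, width)
    return "\n".join([justify_line(l, width) for l in lines[:-1]] + lines[-1:])


print(justify_paragraph("aaa bbb ccc ddd eee", 8))
```

The last line of a paragraph is left ragged, as is usual for justified text.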

Output indentation positions

Menu location: Configuration Options -> Conversion to text -> Formatting

This is a comma-separated list of up to 8 levels of indentation, specifying how the output text should be indented when laying out indented text such as nested lists.

By convention the first value should be a zero to set the left hand margin to start at the beginning of the line.


Bullet options

These options control the presentation of bullets and lists in the output text document.

Bullet point characters

Menu location: Configuration Options -> Conversion to text -> Formatting

When converting to text, this identifies characters which - if they end up at the start of a line by themselves - can be taken to be bullet points. The hyphen character '-' is implicitly regarded as a bullet point, but characters such as 'o', 'q' and '§' can sometimes appear in the output text as bullet points depending on how the original HTML was generated.

When a bullet point is found to match one of these characters, the first of the text bullet characters is used as a replacement.

Text bullet characters

Menu location: Configuration Options -> Conversion to text -> Formatting

When converting to text, this is a comma-separated list that specifies which characters are to be used as the bullet at each level of list. The special value "middot" will be taken to mean the "middot" character.

At the same time, any "middot" characters occurring in the text will be replaced by the first "bullet" on this list.

For example the value

(+),+,-

would convert any level 1 list bullets to "(+)", level 2 to "+" and level 3 to "-". At the same time any middot characters '·' will be converted into "(+)" (the first bullet on the list).

The default value is

middot,middot,middot,middot,middot,middot
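
The way such a value maps onto list levels can be sketched in Python. The function names are hypothetical, and reusing the last bullet for levels deeper than the list is an assumption rather than documented behaviour.

```python
MIDDOT = "\u00b7"  # the "middot" character


def parse_bullets(policy_value):
    """Split the policy value into one bullet string per list level."""
    return [MIDDOT if item.strip() == "middot" else item.strip()
            for item in policy_value.split(",")]


def bullet_for_level(bullets, level):
    """Return the bullet for a 1-based list level."""
    # Assumption: levels beyond the end of the list reuse the last bullet.
    return bullets[min(level, len(bullets)) - 1]


def replace_middots(text, bullets):
    """Replace middot characters in the text by the first bullet."""
    return text.replace(MIDDOT, bullets[0])


bullets = parse_bullets("(+),+,-")
print(bullet_for_level(bullets, 2))
```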

List item templates

New in version 2.4

Menu location: Configuration Options -> Conversion to text -> Formatting

This policy is a comma-separated list of "templates" to be used when outputting ordered lists. The policy allows you to specify the format of each list item for up to 6 levels of list. The format should include an "x", which will be replaced by the item number.

For example a value of

(x), x), x:

would lead to a list structure as follows

        (1) item at list level 1
            a) item at list level 2
                i: item at list level 3

The default value is

x),x),x),x),x),x)
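
Template substitution of this kind can be sketched in Python as below. The function names are hypothetical. The sketch takes the item label (1, a, i and so on) as an input, because how the program chooses the numbering style at each level is not described here.

```python
def parse_templates(policy_value):
    """Split the policy value into one template per list level."""
    return [template.strip() for template in policy_value.split(",")]


def format_item(templates, level, label):
    """Substitute the label for the first "x" in the template for this level."""
    # Assumption: levels beyond the end of the list reuse the last template.
    template = templates[min(level, len(templates)) - 1]
    return template.replace("x", str(label), 1)


templates = parse_templates("(x), x), x:")
print(format_item(templates, 1, 1), "item at list level 1")
```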

Heading options

These options control how headings in the original document are presented.

Heading underline characters

Menu location: Configuration Options -> Conversion to text -> Formatting

This is a comma-separated list controlling the underlining character (or character pattern) at each heading level. A value of ",,,,," (or indeed blank) would suppress all heading underlining.

As an example the value

=+ , -

would cause <H1> headings to be underlined with the pattern "=+=+=+=", <H2> headings to be underlined with "-------" and all other heading levels not to be underlined at all.
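
Repeating a pattern under a heading can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes the underline is made exactly as long as the heading text, which the examples suggest but the text does not state.

```python
def parse_underlines(policy_value):
    """Split the policy value into one underline pattern per heading level."""
    return [pattern.strip() for pattern in policy_value.split(",")]


def underline(heading, level, patterns):
    """Return the heading followed by its underline, if the level has one."""
    pattern = patterns[level - 1] if level <= len(patterns) else ""
    if not pattern:
        return heading                # blank pattern: no underlining
    # Repeat the pattern past the heading length, then cut it to fit.
    repeats = len(heading) // len(pattern) + 1
    return heading + "\n" + (pattern * repeats)[:len(heading)]


patterns = parse_underlines("=+ , -")
print(underline("Heading", 1, patterns))
```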


Miscellaneous Formatting policies

Detagger has a number of options to allow you to tailor the conversion to text. Some control what is copied across to the output, but most offer options to format the output above and beyond the formatting implicit in the original HTML.

Dialogue options

For those HTML files that represent works of fiction, Detagger has some sophisticated "dialogue" detection and formatting options. Dialogue is deemed to be any words or phrases in double quotation marks. Where they occur at the start of a line this can often (but not always) signify the next line of dialogue (i.e. a new "character" speaking).

Care is taken, as far as is possible, to distinguish text that happens to be in "quotes" from true dialogue. However this will never be a 100% accurate process.

Look for dialogue lines

Menu location: Configuration Options -> Conversion to text -> Miscellaneous Formatting

When this option is switched on, the program will attempt to spot dialogue at the start of a line and format it accordingly, with each new speaker starting a new paragraph in the output.

Apply extra dialogue checks

Menu location: Configuration Options -> Conversion to text -> Miscellaneous Formatting

When this option is enabled, extra tests are made to check the validity of lines believed to be the start or end of a dialogue phrase. Tests include looking for suitable use of capitalisation and punctuation inside and outside the quoted text.

Some of these tests are biased towards the text being in English.

Break lines where dialogue starts in the middle

Menu location: Configuration Options -> Conversion to text -> Miscellaneous Formatting

When this option is enabled, the software will attempt to spot new dialogue from a different character that appears deep inside a paragraph. When this is detected, the larger paragraph will be broken so that the dialogue of the new character starts in a new paragraph.


Other Options

Remove all horizontal rules and lines

Menu location: Configuration Options -> Conversion to text -> Miscellaneous Formatting

When selected all horizontal lines and rules in the input will be omitted in the output.

Highlight bold and italic text

Menu location: Configuration Options -> Conversion to text -> Miscellaneous Formatting

When selected, HTML marked as <b>bold</b> or <i>italic</i> can be emphasised in the output text as bold and italic.

This can work well when the occasional word is emphasised, but in some HTML pages entire menus are placed in bold, and in such files this option is probably best switched off.


Unicode Options

Detagger has a limited ability to deal with Unicode in the HTML files.

At present the following options are available

May add Unicode marker to output file

Menu location: Configuration Options -> Conversion to text -> Miscellaneous Formatting

When selected, files which are detected to contain Unicode characters will get a "Unicode file marker" output at the start of the file.

A Unicode file marker at the top of a file is recognised by some software applications (e.g. text editors) as marking the file as containing Unicode. The marker characters themselves don't get displayed.

Where possible, the software will create a UTF-8 file. Full Unicode support has not been tested, and it is not expected that Detagger will support all types of Unicode.

Input text encoding

Menu location: (none at present)

The program can detect Unicode files on input if a Byte Order Mark (BOM) is present or, under some circumstances, if Unicode HTML entities are present in the input text. In files without a BOM the software may fail to detect that the input is Unicode.

In such circumstances this policy allows you to tell the software that the input should be treated as Unicode. The possible values for this policy are

auto        automatic detection (the default)
UTF8        UTF-8
UTF16-BE    UTF-16 "Big Endian"
UTF16-LE    UTF-16 "Little Endian"
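
To show why a BOM makes detection straightforward, here is a minimal Python sketch that maps the standard BOM byte sequences to the policy values listed above. It is not Detagger's own detection code; the function name guess_encoding_from_bom is hypothetical, and the sketch deliberately ignores the HTML-entity checks mentioned earlier.

```python
import codecs

# Standard BOM byte sequences, paired with the policy values listed above.
# None of these three sequences is a prefix of another, so order is not significant.
_BOMS = [
    (codecs.BOM_UTF8, "UTF8"),
    (codecs.BOM_UTF16_BE, "UTF16-BE"),
    (codecs.BOM_UTF16_LE, "UTF16-LE"),
]

def guess_encoding_from_bom(data: bytes):
    """Return the policy-style value for the BOM at the start of data, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None  # no BOM: the encoding would have to be supplied by hand

print(guess_encoding_from_bom(b"\xef\xbb\xbf<html>"))
```

A result of None corresponds to the case described above, where the file has no BOM and the policy has to be set explicitly.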

For a fuller discussion see Working with Unicode


Text Hyperlink policies

These options control what happens to any hyperlinks in the original document. Since text files don't support hyperlinks, the options are to ignore the link entirely, only use the display text, or to turn the link into a reference and add a reference table at the end, listing the URLs the links pointed to.

Hyperlink removal

Display of URLs

Images

Preserve hyperlinks in text output

Menu location: Configuration Options -> Conversion to text -> Hyperlinks

This is a peculiar one added in response to a customer request. When enabled during a conversion to text, all hyperlinks are left intact, so what you end up with is a text file with HTML hyperlinks in it. This may be of interest to those wishing to import text into a database for display on HTML pages.

It is expected that in the conversion any HTML entities (specifically &amp;) in the URL will get converted to their ASCII equivalent. This may cause usability problems with the link after conversion.

Omit email hyperlinks from the output

Menu location: Configuration Options -> Conversion to text -> Hyperlinks

When this option is selected, all visible email addresses are omitted from the output. This can be a useful privacy option.

Omit local hyperlinks from the output

Menu location: Configuration Options -> Conversion to text -> Hyperlinks

When this option is selected, any link to a local resource (a jump point, or a non http-qualified URL) is omitted from the output. This can be used to remove local navigation links from documents where "next", "previous", "top of page" links will mean nothing in the final text.

If this option isn't selected, the display part of such links will be copied to the output text.

Text to replace omitted links by

Menu location: Configuration Options -> Conversion to text -> Hyperlinks

If either of the two previous options is selected, then this is the text that any deleted links will be replaced by. If set to blank the links are completely removed.

Add URL references at end of file

Menu location: Configuration Options -> Conversion to text -> Hyperlinks

If this option is selected then hyperlinks to resources are replaced by the display text, and a reference number [n] is added after it. A full reference table, listing the original URL for each reference number, is then added at the end of the file.

When selected the option Display link URLs is disabled.
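
As a rough illustration of the reference-table style described above, the following Python sketch replaces each link with its display text plus a reference number and appends a table of URLs. It is not Detagger's implementation: the class LinkReferencer, the function links_to_references, the spacing around [n] and the layout of the table are all assumptions made for the example.

```python
from html.parser import HTMLParser

class LinkReferencer(HTMLParser):
    """Collects text, numbering each <a href=...> link in order of appearance."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []      # pieces of output text
        self.urls = []     # URLs in reference order
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # named anchors without an href are ignored
                self.urls.append(href)
                self._in_link = True

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.out.append(f" [{len(self.urls)}]")
            self._in_link = False

    def handle_data(self, data):
        self.out.append(data)

def links_to_references(html):
    """Return the text with [n] references and a URL table at the end."""
    parser = LinkReferencer()
    parser.feed(html)
    parser.close()
    body = "".join(parser.out)
    table = "\n".join(f"[{n}] {url}" for n, url in enumerate(parser.urls, 1))
    return body + ("\n\n" + table if table else "")

print(links_to_references('See <a href="http://www.example.com/">the site</a> now.'))
```

Python's parser unescapes entities in attribute values, so an &amp; in a URL comes out as a plain ampersand in the table.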

Display link URLs

Menu location: Configuration Options -> Conversion to text -> Hyperlinks

When selected the URL for hyperlinks is displayed in brackets in the main text, after the display text.

When selected the option Add URL references at end of file is disabled.

Replace <IMG> tags by a text marker

Menu location: Configuration Options -> Conversion to text -> Hyperlinks

When selected any <IMG> tags in the original are replaced by an "[Image]" marker.

See also Use the ALT attribute to replace <IMG> tags

Use the ALT attribute to replace <IMG> tags

Menu location: Configuration Options -> Conversion to text -> Hyperlinks

When selected any ALT attribute on a <IMG> tag will be used as the replacement text marker for the tag.

See also Replace <IMG> tags by a text marker
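
The interaction of the two <IMG> options can be sketched in Python as follows. This is an illustration only: the function replace_images and its parameters are hypothetical, and wrapping the ALT text in square brackets to match the "[Image]" marker is an assumption, since the exact output format is not stated here.

```python
import re

_IMG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
_ALT = re.compile(r"""\balt\s*=\s*(?:"([^"]*)"|'([^']*)')""", re.IGNORECASE)

def replace_images(html, use_alt=True, marker="[Image]"):
    """Replace each <IMG> tag by its ALT text (if wanted and present) or by marker."""
    def repl(match):
        if use_alt:
            alt = _ALT.search(match.group(0))
            if alt:
                text = alt.group(1) if alt.group(1) is not None else alt.group(2)
                if text.strip():
                    return f"[{text.strip()}]"  # assumed format for ALT markers
        return marker
    return _IMG.sub(repl, html)

print(replace_images('<IMG SRC="logo.gif" ALT="Company logo"> and <img src="x.gif">'))
```

The regular expressions are a simplification and would not cope with a ">" character inside an attribute value.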


Text Marker policies

New in V2.3

These policies allow you to specify special "markers" that should be added to the output to delimit tables and lists. This can be useful if you want to pass the output to some further software package for post-processing.

Tables

Lists

End list marker

Menu location: Configuration Options -> Conversion to text -> Markers

See start list marker


End table marker

Menu location: Configuration Options -> Conversion to text -> Markers

See start table marker

Start list marker

Menu location: Configuration Options -> Conversion to text -> Markers

When converting to text, this option identifies a marker that will be output on the line before the start of any marked-up list that is detected. This can be useful if you want to subsequently identify lists in the text.

See also End list marker

Start table marker

Menu location: Configuration Options -> Conversion to text -> Markers

When converting to text, this option identifies a marker that will be output on the line before the start of any marked-up table that is detected. This can be useful if you want to subsequently identify tables in the text.

See also End table marker
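
As an example of the kind of post-processing these markers make possible, the Python sketch below pulls out every block of lines lying between a start marker and an end marker. The marker strings shown are made up for the example (you choose your own in the policies above), the function name extract_marked_blocks is hypothetical, and the sketch assumes each marker sits on a line of its own.

```python
def extract_marked_blocks(text, start_marker, end_marker):
    """Return the text found between each start/end marker pair."""
    blocks, current = [], None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == start_marker:
            current = []                      # begin collecting a new block
        elif stripped == end_marker and current is not None:
            blocks.append("\n".join(current))
            current = None                    # stop collecting
        elif current is not None:
            current.append(line)
    return blocks

sample = "Intro\n[[TABLE-START]]\na | b\n1 | 2\n[[TABLE-END]]\nOutro"
print(extract_marked_blocks(sample, "[[TABLE-START]]", "[[TABLE-END]]"))
```

The same function works for lists if it is given the list markers instead.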


Text paragraph policies

New in V2.3
These options control the layout of text into sentences and paragraphs. <P> tags in the original text are preserved, but some HTML files use means other than that to lay out text (e.g. multiple <BR> tags). In such cases Detagger applies extra intelligence to detect the paragraph structure.

Output each paragraph on a single line

Menu location: Configuration Options -> Conversion to text -> Paragraphs and sentences

When selected this option specifies that each paragraph is put out as a single line (i.e. with a single hard break). This produces text that will display well in those environments that automatically wrap text.

This option won't work on text inside a table unless you switch off Attempt to parse tables.
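
The shape of the output this option produces can be illustrated with a short Python sketch that takes already-wrapped text and joins each paragraph onto one line. It only mimics the end result on plain text, assuming paragraphs are separated by blank lines; Detagger itself works from the HTML, and the function name unwrap_paragraphs is hypothetical.

```python
import re

def unwrap_paragraphs(text):
    """Join the lines of each blank-line-separated paragraph into a single line."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)

print(unwrap_paragraphs("one two\nthree\n\nfour\nfive\n"))
```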

Preserve short lines

Menu location: Configuration Options -> Conversion to text -> Paragraphs and sentences

When selected any line deemed to be "short" will keep its line break, even if text is rearranged to fit into a target page width.

First line indent for paragraphs

Menu location: Configuration Options -> Conversion to text -> Paragraphs and sentences

Specifies the number of characters by which the first line in a new paragraph should be indented relative to those that follow.
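
The effect of a first line indent can be reproduced with Python's standard textwrap module, as in the sketch below. The function name and the width and indent values are chosen purely for illustration and are not Detagger defaults.

```python
import textwrap

def indent_first_line(paragraph, indent=2, width=10):
    """Wrap a paragraph, indenting only its first line by `indent` spaces."""
    return textwrap.fill(paragraph, width=width, initial_indent=" " * indent)

print(indent_first_line("aaa bbb ccc ddd"))
```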

Treat short lines as paragraph endings

Menu location: Configuration Options -> Conversion to text -> Paragraphs and sentences

Specifies that in files where there are no paragraph markers, a short line amongst longer lines should be taken as signalling a paragraph end. This can be useful when converting files that use multiple <BR> tags, but where you want a different page width in the output. Without this test there would be no paragraphs detected.
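
A minimal Python sketch of this test is given below. How Detagger decides that a line is "short" is not documented here, so the rule used (shorter than a fixed fraction of the longest line) and the names split_on_short_lines and short_ratio are assumptions.

```python
def split_on_short_lines(lines, short_ratio=0.6):
    """Group lines into paragraphs, ending a paragraph after each short line."""
    longest = max((len(line) for line in lines), default=0)
    paragraphs, current = [], []
    for line in lines:
        current.append(line)
        if len(line) < short_ratio * longest:   # assumed definition of "short"
            paragraphs.append(" ".join(current))
            current = []
    if current:                                 # trailing lines with no short line
        paragraphs.append(" ".join(current))
    return paragraphs

lines = [
    "The quick brown fox jumps over",
    "the lazy dog and runs far away",
    "from here.",
    "A second block of text follows",
    "on these lines.",
]
for paragraph in split_on_short_lines(lines):
    print(paragraph)
```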

Insert gap between sentences

Menu location: Configuration Options -> Conversion to text -> Paragraphs and sentences

When selected the software will impose a 2-character space between sentences, regardless of what spacing was in the original. This is a common style in written text.
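
The two-space convention can be sketched with a single regular expression in Python. The rule for what counts as a sentence boundary (terminal punctuation followed by a capital letter or an opening quote) is an assumption for the example, as is the function name normalise_sentence_gaps.

```python
import re

# Sentence end: . ! or ? followed by spaces and then a capital or a double quote.
_GAP = re.compile(r'([.!?])[ \t]+(?=[A-Z"])')

def normalise_sentence_gaps(text):
    """Force exactly two spaces between sentences, whatever the original spacing."""
    return _GAP.sub(r"\1  ", text)

print(normalise_sentence_gaps("One. Two!   Three? four"))
```

A rule this simple also widens the gap after abbreviations such as "Mr.", so a real implementation would need further checks.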


Text Table policies

These policies control how (if at all) Detagger will attempt to faithfully represent tables in the output.

The software does a reasonable job of representing simple tables as text, and can even cope with nested tables to a limited extent.

Attempt to parse tables

Menu location: Configuration Options -> Conversion to text -> Tables

When selected Detagger will attempt to correctly format any HTML tables in the text.

In doing so Detagger will attempt to preserve the width, alignment etc of the original, but this process can only ever be approximate due to the quite different formats of HTML and text. Bear in mind that if the software is adding emphasis characters (see Highlight bold and italic text) or URL references (see text hyperlink policies), then this will end up with more output text to fit in the table.

Although you can try to adjust table to page width, if the table is too wide or the page too narrow, then this will often fail - particularly for heavily nested tables.


Table border options

These options specify whether any tables should have borders added in the text file. This only applies when converting to plain text (see Output table format). By default the software will replicate the border status of the outermost tables, but suppress borders in any inner, nested tables. This is because the space taken by borders in the ASCII file limits the room for data.

Add border to all tables

Menu location: Configuration Options -> Conversion to text -> Tables

When selected a border will be added round each table created. This will be regardless of whether the original table set a border or not.

If omitted, then the border attribute of the original HTML should be honoured.

Suppress borders on nested tables

Menu location: Configuration Options -> Conversion to text -> Tables

When table borders are present (or switched on), this option suppresses any borders on tables that appear inside other tables. This is done because nesting tables is usually done to achieve layout, and in the text file white space usually looks better (less distracting).

This option will not prevent a border being added to the outermost table (if one is requested).

Allow blank row separator lines

Menu location: Configuration Options -> Conversion to text -> Tables

This specifies whether or not blank lines are allowed in the output between rows. When enabled, a blank line is output between each table row (when there is no border). This can space out the table, making it easier to read, especially if some of the cell contents are split over several lines.

This option only works when the table is being converted to plain text as opposed to a data delimited format.


Table width options

Controls the width the table will take on the output "page". The default behaviour is to regard 800 pixels as 80 characters wide.
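
The default scaling works out at 10 pixels per character. As a worked example in Python (the function name pixels_to_chars is hypothetical, and the rounding and the minimum of one character are assumptions):

```python
def pixels_to_chars(width_px, px_per_char=10):
    """Convert an HTML pixel width to a character width (800 px -> 80 chars)."""
    return max(1, round(width_px / px_per_char))

for px in (800, 400, 120):
    print(px, "pixels ->", pixels_to_chars(px), "characters")
```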

Adjust table to page width

Menu location: Configuration Options -> Conversion to text -> Tables

When selected this specifies that any table should attempt to fit the current target table width.

For a reasonable page size and simple table, Detagger can usually re-arrange the table's contents cell-by-cell to fit the table to the page.

However for narrow pages and/or large tables or heavily nested tables it becomes virtually impossible to achieve this goal.

Target table width

Menu location: Configuration Options -> Conversion to text -> Tables

When table formatting is being attempted, this is the target page width for tables. This is the maximum width that a table should be allowed to grow to. In some nested tables however this limit may on occasion be exceeded.

If omitted the value will default to the target page width value, which in turn will default to 76 characters.

Ignore table WIDTH attributes

New in version 2.4

Menu location: Configuration Options -> Conversion to text -> Tables

Whenever Detagger converts a table with explicit WIDTH attributes, it tries to honour this layout. However, sometimes the widths set in an HTML table don't give enough room to lay out the text when converting to text.

This can happen for a number of reasons

In such cases this policy allows all WIDTH attributes to be ignored by Detagger. When WIDTH attributes are ignored, Detagger is free to do the best it can to fit the data into the space available, subject to any limits suggested by the Target table width and Adjust table to page width policies.

May break words to fit target width

New in version 2.4

Menu location: Configuration Options -> Conversion to text -> Tables

When Detagger converts tables, it will do a best-fit to the available page width, but there are occasions when it is difficult to fit a wide table into a narrow target page width. This is especially true of tables where small font sizes have been used in the HTML to achieve a fit, because in plain ASCII text Detagger can't use an equivalent trick.

When compressing a table to fit a small target width, Detagger will split text across multiple lines, moving words onto the next line to make the column narrower, but it won't break individual words in two in order to achieve a fit.

This option tells Detagger that it can, if necessary, break up long words in order to narrow a column to fit. Detagger implements this in a fairly brutal manner - for example it doesn't hyphenate the broken text - and so this option should be used sparingly.
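
The difference between the two behaviours can be seen with Python's textwrap module, which has a comparable switch. This is an analogy rather than Detagger's code; the cell text and the 12-character column width are made up for the example.

```python
import textwrap

cell = "See http://www.example.com/a/very/long/path for details"

# Words are moved to the next line but never split: the URL overflows the column.
keep = textwrap.wrap(cell, width=12, break_long_words=False)

# Long words may be cut, without hyphenation, so every line fits the column.
split = textwrap.wrap(cell, width=12, break_long_words=True)

print(max(len(line) for line in keep), "characters needed without breaking words")
print(max(len(line) for line in split), "characters needed when words may be broken")
```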

Nested Table scaling factor (percent)

New in version 2.4

Menu location: Configuration Options -> Conversion to text -> Tables

When table widths are not supplied (or the policy Ignore table WIDTH attributes is enabled), then Detagger sometimes struggles to fit heavily nested tables to a restrictive target page or table width.

The reason for this is that Detagger first lays out the inner tables, and then embeds these in the outer tables when they are laid out. The problem is that the inner tables may expand to be too wide, making it impossible to get the outer tables to fit on the page.

This policy scales back the amount of space a nested table can take. It's expressed as a percentage. 100(%) implies no limit on the inner table, giving the behaviour of earlier versions of Detagger. 0 implies the maximum restriction on inner tables (although this does not, of course, reduce them to zero width).

The default value is 75.

If you find that heavily nested tables aren't fitting into the desired page width, try reducing this value.

If you find that inner tables are getting broken over several lines, try setting this policy to 100. If that still fails, you may need to consider disabling the policy Adjust table to page width and increasing the value of Target table width.

NOTE:
This policy is ignored if the Target table width is set larger than 100, since it is then assumed the user is allowing wide tables to preserve the layout.

Miscellaneous Table options

These are miscellaneous options related to how tables are converted and the markup found inside tables is treated.

Allow headings inside tables

Menu location: Configuration Options -> Conversion to text -> Tables

When this option is selected it prevents any heading markup inside tables being interpreted as such. Not only will this suppress any underlining the heading would have had added, but it will also prevent the width of the whole heading being used in calculating the "minimum width" required for each column.

As a result this will actually help make some tables narrower, as the "minimum" width required drops as the "heading" text is now allowed to be split over two or more lines.

Default table indentation

Menu location: Configuration Options -> Conversion to text -> Tables

This option specifies the value - in spaces - of an indentation to be applied to all tables in the output. This indentation will reduce the page width available to the table.


Miscellaneous policies

These policies are not set via the main options screens.

Configuration file policies

These policies record the location of any additional configuration files that are to be used in the conversion. They are usually selected using the Configuration Files menu.

Fragments File

Menu location: Configuration Options -> Configuration Files -> Text fragments file

In the full version you can add headers and footers to each text file created by using a text fragments file. This policy allows you to specify the file in which those fragments are defined.

If you don't select a fragment file, then no headers or footers will be added.

Note: In the evaluation version a standard header and footer are added, and this feature is not available. It is only available in the registered version.

See also Using a Text Fragments File

Text Commands File

Menu location: Configuration Options -> Configuration Files -> Text commands file

Specifies the location of any text commands file to be used to define text manipulations to be performed on the text as it is being read in, and prior to conversion.


Other policies

Some policies cannot be set through the normal options screens. These include

Allow by-line to be used for Author field

Menu location: (none at present. Edit the policy file)

When selected, Detagger will search the first 40 lines of the document looking for a "By" line in the hope of identifying the author. Any value located this way will then be available for use in a generated text fragment via the AUTHOR fragment tag.

Note:
This option can be edited into a policy file, but cannot currently be set via the GUI.
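
Since policies like this one have to be edited into the policy file by hand, it can help to check the file afterwards. The Python sketch below reads lines of the "<policy name> : <policy value>" form described at the start of this document. The function name parse_policy_text is hypothetical, and the sample lines assume that the policy names match the headings in this section and take simple numeric values; check a policy file saved by Detagger for the exact spellings.

```python
def parse_policy_text(text):
    """Parse "name : value" lines into a dict, skipping lines with no colon."""
    policies = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may themselves contain colons.
        name, sep, value = line.partition(":")
        if sep and name.strip():
            policies[name.strip()] = value.strip()
    return policies

# Sample lines are assumptions, not copied from a real policy file.
sample = "Lines to ignore at start of file : 2\n\nTarget table width : 76\n"
for name, value in parse_policy_text(sample).items():
    print(f"{name!r} -> {value!r}")
```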

Concatenate results into one file

Menu location: On main dialog, set 'Output Type' to concatenate files

When selected this specifies that when converting multiple files at once, all the results should be concatenated into a single results file.

This has the same effect as selecting "Concatenate results into one file" as the output type.

Lines to ignore at start of file

Menu location: (none at present. Edit the policy file)

This specifies how many lines from the input files should be ignored at the start of the file. These lines will be discarded from the output.

This can be useful when converting a file copied from a news feed or similar source that adds a small data header to the file.

Lines to ignore at end of file

Menu location: (none at present. Edit the policy file)

This specifies how many lines from the input files should be ignored at the end of the file. Up to 40 lines may be ignored in this way. These lines will be discarded from the output.

This can be useful when converting a file copied from a news feed or similar source that adds a small data footer to the file.
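
The combined effect of the two "lines to ignore" policies amounts to slicing the input by line, as in this Python sketch. The function name trim_lines and its parameters are hypothetical, and the sketch does not enforce the 40-line limit mentioned above.

```python
def trim_lines(text, skip_start=0, skip_end=0):
    """Discard skip_start lines from the top and skip_end lines from the bottom."""
    lines = text.splitlines()
    end = len(lines) - skip_end
    # max() guards against skipping more lines than the file contains.
    return "\n".join(lines[skip_start:max(end, skip_start)])

print(trim_lines("header 1\nheader 2\nbody\nfooter", skip_start=2, skip_end=1))
```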




Converted from a single text file by AscToHTM
© 1997-2005 John A Fotheringham