Documentation for the Detagger markup removal utility : Using policy files

Documentation for the Detagger HTML-to-text converter and markup removal utility

The latest version of these files is available online at http://www.jafsoft.com/doco/docindex.html

Using policy files

Options available from the conversion options menu are also known as "policies". All of JafSoft Limited's conversion tools use the same policy file mechanism.

Options can be saved to policy files, so that different sets of options can be stored and easily reloaded during later conversions.

Policy files are just plain text files, with one policy per line. Each policy line takes the form

<policy name> : <policy value>

You can edit these files in a text editor, but you must take care to use the correct policy name on each line.

In Detagger almost all policies can be set through the conversion options menu.
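The one-policy-per-line format described above is simple enough to read with a few lines of code. The following is a minimal sketch in Python, not part of Detagger itself. The function name parse_policy_text is hypothetical, the policy values shown ("80", "Yes", the file path) are assumptions made for illustration, and the handling of lines without a colon is a guess rather than documented behaviour.

```python
def parse_policy_text(text):
    """Parse policy-file text into a dict of {policy name: policy value}."""
    policies = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first colon only, so values such as file paths
        # may themselves contain colons.
        name, separator, value = line.partition(":")
        if not separator:
            continue  # assumption: lines without a colon are ignored
        policies[name.strip()] = value.strip()
    return policies


# Policy names are taken from the alphabetical list in this section;
# the values are illustrative assumptions only.
sample = """\
Target page width : 80
Remove HTML <FONT> tags : Yes
Fragments file : C:\\docs\\fragments.txt
"""

policies = parse_policy_text(sample)
print(policies["Target page width"])  # prints 80
```

Splitting on the first colon only is a deliberate choice here: none of the policy names listed in this section contain a colon, whereas a policy value such as a Windows file path may.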

Contents of this section

What are Policy files?
Alphabetical list of Detagger policies
Markup removal policies

Detag policies
Detag tables policies
Tag manipulation policies

Text conversion policies

Data Extraction policies
Text Format policies
Page Layout options
Bullet options
Heading options
Miscellaneous Formatting policies
Dialogue options
Other Options
Unicode Options
Text Hyperlink policies
Text Marker policies
Text paragraph policies
Text Table policies
Table border options
Table width options
Miscellaneous Table options

Miscellaneous policies

Configuration file policies
Other policies

What are Policy files?

Detagger has a large number of options available to influence the processing of your text files. These options are called "policies" as they govern how the source file should be interpreted and converted.

Policies may be saved in text files, known as policy files. These files have a ".pol" extension by default. Policy files are usually updated by changing the policies and saving the changes in a new file. Because they are text files, you can also edit them directly in a text editor. The files contain one policy per line of text, in the form

PolicyText : <policy value>

The use of policy files allows a given set of options to be saved and reused for other conversions, or for later conversions of the same file. See Using policy files for more information.
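As an illustration of saving a set of options for reuse, the sketch below writes a small policy set to a ".pol" file and reads it back. It is a hedged example in Python rather than anything Detagger provides: the function name format_policy_text, the file name wide.pol, the policy values and the UTF-8 encoding are all assumptions.

```python
import tempfile
from pathlib import Path


def format_policy_text(policies):
    """Render {policy name: policy value} as one policy per line."""
    return "".join(f"{name} : {value}\n" for name, value in policies.items())


# Policy names come from the alphabetical list in this section;
# the values are illustrative assumptions.
base = {"Remove all tags": "Yes", "Target page width": "80"}

# Reuse the same set of options with one value changed.
wide = {**base, "Target page width": "132"}

# Save the variant under the default ".pol" extension, then reload the text.
with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "wide.pol"  # hypothetical file name
    path.write_text(format_policy_text(wide), encoding="utf-8")
    saved_text = path.read_text(encoding="utf-8")

print(saved_text, end="")
```

Keeping each variant in its own file mirrors the usual workflow described above, in which changed policies are saved to a new file rather than overwriting the original.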

Alphabetical list of Detagger policies

Here is an alphabetical list of policy names. Where possible a link is supplied to the equivalent option in the user interface.

Add border to all tables

Add delimited table markers

Add URL references at end of file

Adjust table to page width

Allow 8-bit ANSI values in output

Allow ANSI alternatives (e.g. space for &nbsp;)

Allow blank row separator lines

Allow by-line to be used for Author field (not available in the GUI)

Allow headings inside tables

Apply extra dialogue checks

Attempt to parse tables

Break lines where dialogue starts in the middle

Bullet point characters

Concatenate results into one file

Convert only innermost tables

Convert tags to lower case

Convert tags to upper case

Default table indentation

Display link URLs

End list marker

End table marker

First line indent for paragraphs

Fragments file

Heading underline characters

Highlight bold and italic text

Ignore table WIDTH attributes (new in version 2.4)

Impose a page width on the output

Insert gap between sentences

Input text encoding

Keep deprecated tag attributes

Keep deprecated tags

Lines to ignore at end of file

Lines to ignore at start of file

List item templates (new in version 2.4)

Look for dialogue lines

Maximum table depth (new in version 2.4)

May add Unicode marker to output file

May break words to fit target width (new in version 2.4)

Minimum table depth (new in version 2.4)

Nested Table scaling factor (percent) (new in version 2.4)

Omit email hyperlinks from the output

Omit local hyperlinks from the output

Output each paragraph on a single line

Output indentation positions

Output table format

Preserve all white space from the original source

Preserve hyperlinks in text output

Preserve short lines

Remove <!DOCTYPE> tags

Remove all horizontal rules and lines

Remove all HTML tags

Remove all non-HTML tags

Remove all tags

Remove all x-tags used in mail messages

Remove emphasis tags

Remove HTML <DIV> tags (new in version 2.4)

Remove HTML <FONT> tags

Remove HTML <FORM>..</FORM> tags

Remove HTML <HEAD> section

Remove HTML <IMG> tags

Remove HTML <OBJECT> tags

Remove HTML <P> tags from tables

Remove HTML <SCRIPT> section

Remove HTML <SPAN> tags (new in version 2.4)

Remove HTML alignment attributes from tables

Remove HTML color attributes from tables

Remove HTML size attributes from tables

Remove HTML table tags

Remove HTML-style comments

Remove Microsoft Office tags

Remove non-standard tag attributes

Remove non-standard tags

Remove style sheet

Replace <IMG> tags by a text marker

Replace entities by text

Replace hyperlinks by the display value

Right justify the output text

Show only table data (new in version 2.4)

Start list marker

Start table marker

Suppress borders on nested tables

Target page width

Target table width

Text bullet characters

Text commands file

Text to replace omitted links by

Treat short lines as paragraph endings

Use the ALT attribute to replace <IMG> tags
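Because a hand-edited policy file must use the correct policy name, it can help to check each name against the list above before running a conversion. The sketch below shows one possible way to do this in Python with the standard difflib module. It is an illustration only: check_policy_name is a hypothetical helper, KNOWN_POLICIES holds just a few names copied from the list, and whether Detagger itself matches names case-sensitively is not stated here, so the exact comparison is an assumption.

```python
import difflib

# A small subset of the policy names listed above; a real check would
# include the full list.
KNOWN_POLICIES = [
    "Remove all tags",
    "Remove all HTML tags",
    "Target page width",
    "Target table width",
    "Preserve short lines",
]


def check_policy_name(name, known=KNOWN_POLICIES):
    """Return (is_known, suggestions) for a policy name.

    Assumption: names must match exactly, including case.
    """
    if name in known:
        return True, []
    # Suggest the closest known names, best match first.
    return False, difflib.get_close_matches(name, known, n=3, cutoff=0.6)


print(check_policy_name("Target page width"))
print(check_policy_name("Target page widht"))  # mistyped name
```

For the mistyped name, the first suggestion returned is "Target page width", which is usually enough to spot the error.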

Markup removal policies

These policies allow you to control the tag removal process. You can choose which tags are to be removed, and which manipulations you'd like performed on each tag.

Detag policies
Detag tables policies
Tag manipulation policies

Detag policies

These policies allow you to choose which tags are to be removed from the markup.

Remove all tags

Remove all HTML tags
Remove all non-HTML tags
- remove tags and attributes added by email systems
- remove tags and attributes added by MS Office