Documentation for the Detagger html to text converter and markup removal utility
The latest version of these files is available online at http://www.jafsoft.com/doco/docindex.html
Options available from the conversion options menu are also known as "policies". All of JafSoft Limited's conversion tools use the same policy file mechanism.
Options can be saved to these policy files so that different sets of options can be saved and then easily reloaded during later conversions.
Policy files are just plain text files, with one policy per line. Each policy line takes the form
<policy name> : <policy value>
You can edit these files in a text editor, but must be careful to use the correct policy name.
In Detagger almost all policies can be set through the conversion options menu.
Contents of this section
What are Policy files?
Alphabetical list of Detagger policies
Markup removal policies
Detag policies
Text conversion policies
Detag tables policies
Tag manipulation policies
Data Extraction policies
Miscellaneous policies
Text Format policies
Page Layout options
Bullet options
Heading options
Miscellaneous Formatting policies
Dialogue options
Other Options
Unicode Options
Text Hyperlink policies
Text Marker policies
Text paragraph policies
Text Table policies
Table border options
Table width options
Miscellaneous Table options
Configuration file policies
Other policies
Detagger has a large number of options available to influence the processing of your text files. These options are called "policies" as they govern how the source file should be interpreted and converted.
Policies may be saved in text files, known as policy files. These files have a ".pol" extension by default. The policy files are usually updated by changing the policies and saving the changes in a new file. Because they are text files you can also edit them directly in a text editor. The files contain one policy per line of text, in the form
<policy name> : <policy value>
The use of policy files allows a given set of options to be saved and reused for other conversions, or for later conversions of the same file. See Using policy files for more information.
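As an illustration of the "one policy per line" format, here is a minimal Python sketch that reads and writes such a file. It is not part of Detagger. The function names are hypothetical, the sample values ("Yes", "76") are assumptions about how values are written, and skipping lines without a colon is an assumption rather than documented behaviour.

```python
def parse_policy_file(text):
    """Parse '<policy name> : <policy value>' lines into a dict."""
    policies = {}
    for line in text.splitlines():
        # Assumption: lines without a ':' separator (e.g. blank lines) are skipped.
        if ":" not in line:
            continue
        # Split on the first colon only, so values may themselves contain colons.
        name, _, value = line.partition(":")
        policies[name.strip()] = value.strip()
    return policies


def format_policy_file(policies):
    """Write the policies back out, one per line."""
    return "\n".join(f"{name} : {value}" for name, value in policies.items()) + "\n"


# The policy names below appear in this document; the values are assumed.
sample = """\
Remove HTML <FONT> tags : Yes
Target page width : 76
Heading underline characters : =+ , -
"""

policies = parse_policy_file(sample)
print(policies["Target page width"])
```

Because the policy name is the lookup key, a misspelt name in a hand-edited file would simply not match, which is why care is needed when editing these files directly.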
Here is an alphabetic list of policy names. Where possible a link is supplied to the equivalent option in the user interface.
- Add border to all tables
- Add delimited table markers
- Add URL references at end of file
- Adjust table to page width
- Allow 8-bit ANSI values in output
- Allow ANSI alternatives (e.g. space for &nbsp;)
- Allow blank row separator lines
- Allow by-line to be used for Author field (not available in the GUI)
- Allow headings inside tables
- Apply extra dialogue checks
- Attempt to parse tables
- Break lines where dialogue starts in the middle
- Bullet point characters
- Concatenate results into one file
- Convert only innermost tables
- Convert tags to lower case
- Convert tags to upper case
- Default table indentation
- Display link URLs
- End list marker
- End table marker
- First line indent for paragraphs
- Fragments file
- Heading underline characters
- Highlight bold and italic text
- Ignore table WIDTH attributes New in version 2.4
- Impose a page width on the output
- Insert gap between sentences
- Input text encoding
- Keep deprecated tag attributes
- Keep deprecated tags
- Lines to ignore at end of file
- Lines to ignore at start of file
- List item templates New in version 2.4
- Look for dialogue lines
- Maximum table depth New in version 2.4
- May add Unicode marker to output file
- May break words to fit target width New in version 2.4
- Minimum table depth New in version 2.4
- Nested Table scaling factor (percent) New in version 2.4
- Omit email hyperlinks from the output
- Omit local hyperlinks from the output
- Output each paragraph on a single line
- Output indentation positions
- Output table format
- Preserve all white space from the original source
- Preserve hyperlinks in text output
- Preserve short lines
- Remove <!DOCTYPE> tags
- Remove all horizontal rules and lines
- Remove all HTML tags
- Remove all non-HTML tags
- Remove all tags
- Remove all x-tags used in mail messages
- Remove emphasis tags
- Remove HTML <DIV> tags New in version 2.4
- Remove HTML <FONT> tags
- Remove HTML <FORM>..</FORM> tags
- Remove HTML <HEAD> section
- Remove HTML <IMG> tags
- Remove HTML <OBJECT> tags
- Remove HTML <P> tags from tables
- Remove HTML <SCRIPT> section
- Remove HTML <SPAN> tags New in version 2.4
- Remove HTML alignment attributes from tables
- Remove HTML color attributes from tables
- Remove HTML size attributes from tables
- Remove HTML table tags
- Remove HTML-style comments
- Remove Microsoft Office tags
- Remove non-standard tag attributes
- Remove non-standard tags
- Remove style sheet
- Replace <IMG> tags by a text marker
- Replace entities by text
- Replace hyperlinks by the display value
- Right justify the output text
- Show only table data New in version 2.4
- Start list marker
- Start table marker
- Suppress borders on nested tables
- Target page width
- Target table width
- Text bullet characters
- Text commands file
- Text to replace omitted links by
- Treat short lines as paragraph endings
- Use the ALT attribute to replace <IMG> tags
These policies allow you to control the tag removal process. You can choose what tags are to be removed, or what manipulations you'd like performed on each tag.
Detag policies
Detag tables policies
Tag manipulation policies
These policies allow you to choose which tags are to be removed from the markup.
- Remove <!DOCTYPE> tags
- Remove HTML <HEAD> section
- Remove all comments
- Remove HTML <SCRIPT> section
- Remove HTML <OBJECT> tags
- Remove HTML <DIV> tags
- Remove HTML <FORM>..</FORM> tags
- Remove HTML <FONT> tags
- Remove HTML <SPAN> tags
- Remove emphasis tags
- Remove all hyperlinks
- Remove style sheet
- Remove HTML <IMG> tags
- Remove non-standard tags
- Remove non-standard tag attributes
Menu location: Configuration Options -> Markup Manipulation ->Detagger options
When this option is selected, all tags will be removed from the markup. This effectively turns the conversion into an HTML-to-text conversion.
Menu location: Configuration Options -> Markup Manipulation ->Detagger options
When this option is selected, all HTML tags will be removed from the markup. This will leave only the non-HTML tags.
Menu location: Configuration Options -> Markup Manipulation ->Detagger options
When this option is selected any <!DOCTYPE> tag in the document will be removed. This can be useful for removing a DOCTYPE that is no longer valid, e.g. if you are concatenating the results or migrating the files to XML of some description.
Menu location: Configuration Options -> Markup Manipulation ->Detagger options
When this option is selected any <HEAD>...</HEAD> section in the document is removed. This might be a useful precursor to merging two HTML files together, although the <HTML> and <BODY> tags will be left in the output.
Menu location: Configuration Options -> Markup Manipulation ->Detagger options
When this option is selected all <!-- ... --> style comments are removed.
Menu location: Configuration Options -> Markup Manipulation ->Detagger options
When this option is selected all <SCRIPT>..</SCRIPT> sections are removed, effectively removing all scripted content from the file.
Menu location: Configuration Options -> Markup Manipulation ->Detagger options
When this option is selected all <OBJECT>..</OBJECT> sections are removed, effectively removing embedded active content from the file.
New in version 2.4
Menu location: Configuration Options -> Markup Manipulation ->Detagger options
When this option is selected all <DIV..> and </DIV> tags are removed.
New in version 2.4
Menu location: Configuration Options -> Markup Manipulation ->Detagger options
When this option is selected all <SPAN..> and </SPAN> tags are removed.
Menu location: Configuration Options -> Markup Manipulation ->Detagger options
When this option is selected all tags associated with forms will be removed. This includes the <FORM>, <INPUT>, <SELECT>, <OPTION>, <TEXTAREA>, <FIELDSET>, <LEGEND> and <LABEL> tags.
Any visible markup (such as tables) integrated with the <FORM> is left intact. This may not always be desirable, and this issue may be addressed in later releases.
Menu location: Configuration Options -> Markup Manipulation ->Detagger options
When this option is selected all <FONT..> and </FONT> tags are removed. <FONT> tags are deprecated in later versions of HTML in favour of CSS style sheets. <FONT> tags can also make a file much larger than it need be, so removing them can be desirable for a number of reasons.
Menu location: Configuration Options -> Markup Manipulation ->Detagger options
When this option is selected, all emphasis markup such as <B>, <I>, <STRONG> and <EM> tags are removed.
Menu location: Configuration Options -> Markup Manipulation ->Detagger options
When this option is selected all hyperlinks are removed. Each hyperlink is replaced by its visible content; only the "link" part is removed.
Menu location: Configuration Options -> Markup Manipulation ->Detagger options
When this option is selected all <STYLE>..</STYLE> sections and all references to external style sheets are removed.
Menu location: Configuration Options -> Markup Manipulation ->Detagger options
When this option is selected all Images defined as <IMG> tags are removed.
Menu location: Configuration Options -> Markup Manipulation ->Detagger options
When this option is selected all tags not recognised as being part of HTML are removed. The standard used is currently HTML 4.0 Transitional.
See also Keep deprecated tags
Menu location: Configuration Options -> Markup Manipulation ->Detagger options
When the option Remove non-standard tags is selected, this option determines that any "deprecated" tags may be left in. These are tags recognised in earlier versions of HTML, but which are no longer strictly supported.
Usually these tags are still supported in browsers, and sometimes removing these tags will adversely affect your page's appearance, as newer forms of tagging are often required to achieve the same effect.
Menu location: Configuration Options -> Markup Manipulation ->Detagger options
When this option is selected all tag attributes not recognised as being part of HTML are removed. The standard used is currently HTML 4.0 Transitional.
Note, in HTML 4.0 many tag attributes were deprecated, so that HTML code that relied heavily on tag attributes (e.g. to set paragraph alignment) may need to be heavily re-written to achieve the same effect under the later standard.
See also Keep deprecated tag attributes
Menu location: Configuration Options -> Markup Manipulation ->Detagger options
When the option Remove non-standard tag attributes is selected, this option determines that any "deprecated" tag attributes may be left in. These are attributes that were recognised in earlier versions of HTML, but which are no longer strictly supported.
Usually these attributes are still supported in browsers, and sometimes removing these attributes will adversely affect your page's appearance, as newer forms of tagging are often required to achieve the same effect.
Menu location: Configuration Options -> Markup Manipulation ->Detagger options
When selected all tags not recognised as HTML will be removed from the file.
Depending on the source of the HTML these could be XML tags, or proprietary tags added by the software used to create the HTML.
Menu location: Configuration Options -> Markup Manipulation ->Detagger options
When selected, tags believed to be added by an email package will be removed. For example, Eudora adds tags in the form <x-whatever> to mark up key aspects of an HTML message. This option should remove such tags.
Menu location: Configuration Options -> Markup Manipulation ->Detagger options
When selected tags believed to have been added by MS Office applications are removed. Some versions of MS Office add a lot of tagging - particularly to HTML exported from Excel - to describe a document's ownership and structure.
This option will attempt to remove (and to a limited extent tidy up) such markup.
These options allow a few limited tag manipulations to be performed on any <TABLE>s in the HTML during the markup removal.
Paragraphs
Attributes
Menu location: Configuration Options -> Markup Manipulation ->Tables
When this option is selected all "table" tags are removed (i.e. <table>, <tr>, <th>, <td>, <thead>, <tbody> and <tfoot>). This effectively removes all the table structure in a document, which can be useful if you want to view the HTML on a device with a small display less suited to tables.
Menu location: Configuration Options -> Markup Manipulation ->Tables
When selected this option will replace <p>..</p> tags by a suitable pattern of <br> tags. <br> tags will be inserted between each paragraph in each cell. If a cell has only one paragraph, nothing is inserted.
This option can be useful when tidying up HTML created by certain word processing packages which needlessly insert <p>..</p> markup.
Menu location: Configuration Options -> Markup Manipulation ->Tables
When selected, all alignment attributes ("align" and "valign") are discarded from the various TABLE, TBODY, THEAD, TFOOT, TR, TH and TD tags.
Menu location: Configuration Options -> Markup Manipulation ->Tables
When selected, all colour attributes ("bgcolor") are discarded from the various TABLE, TBODY, THEAD, TFOOT, TR, TH and TD tags.
Menu location: Configuration Options -> Markup Manipulation ->Tables
When selected all sizing attributes ("cellpadding", "cellspacing", "height" and "width") are discarded from the various TABLE, TBODY, THEAD, TFOOT, TR, TH and TD tags.
Note that logical size attributes such as "colspan" and "rowspan" are left intact, as is "border".
These options allow a few limited tag manipulations to be performed during the markup removal.
tag conversions
character replacements
Menu location: Configuration Options -> Markup Manipulation -> Tag manipulation
When selected all tags and attributes will be converted to lower case. Any attribute values will be left unchanged.
Menu location: Configuration Options -> Markup Manipulation -> Tag manipulation
When selected all tags and attributes will be converted to UPPER CASE. Any attribute values will be left unchanged.
Menu location: Configuration Options -> Markup Manipulation -> Tag manipulation
When selected, any character entities such as &copy; or &pound; will be converted to their ANSI equivalents.
Note, not all such entities have an ANSI equivalent, and those that don't will be left unchanged.
The ANSI character set is an "8-bit" character set, that is, each character is represented by a value in the range 0-255. Of these the first 128 characters are known as the "7-bit" characters. Whilst almost all character sets support 7-bit characters, not all support the "upper" 8-bit characters, so you may not want to allow 8-bit characters. For that reason there are two more options.
See also Allow ANSI alternatives (e.g. space for &nbsp;) and Allow 8-bit ANSI values in output
Menu location: Configuration Options -> Markup Manipulation -> Tag manipulation
When selected, any entity that maps onto an upper 8-bit character will be allowed (e.g. &copy; will be replaced by the copyright symbol).
Menu location: Configuration Options -> Markup Manipulation -> Tag manipulation
The GUI shows this option as "Use 7-bit ASCII alternatives where possible" (which, in fact, has the exact opposite meaning). When that is set, this policy is disabled (and vice versa).
When this policy is enabled (i.e. 'Use 7-bit...' is unchecked), then any 8-bit characters are allowed to pass unchanged. The display of such characters will depend on the language and Operating System of the computer used to view the results file.
When this policy is disabled (i.e. 'Use 7-bit...' is checked), then any entity that can be approximated by using one or more 7-bit characters will be replaced by that approximation. For example &nbsp; will become a single space, while &mdash; will become two hyphens "--".
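The interaction between entity replacement, 8-bit characters and 7-bit approximations can be sketched in Python as follows. This is an illustration only, not Detagger's implementation. It assumes "ANSI" means Windows code page 1252, the function and parameter names are hypothetical, and the table of approximations contains only the two examples given in the text.

```python
import html
import re

# Only these two approximations come from the text; a real table would be larger.
SEVEN_BIT_ALTERNATIVES = {"nbsp": " ", "mdash": "--"}

ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def replace_entities(text, allow_8bit=True, use_7bit_alternatives=False):
    """Replace named entities by ANSI characters where the options allow it."""

    def substitute(match):
        entity, name = match.group(0), match.group(1)
        if use_7bit_alternatives and name in SEVEN_BIT_ALTERNATIVES:
            return SEVEN_BIT_ALTERNATIVES[name]
        char = html.unescape(entity)
        if len(char) != 1:
            return entity  # unknown entity: leave unchanged
        try:
            # Assumption: "ANSI" is taken to be Windows code page 1252.
            encoded = char.encode("cp1252")
        except UnicodeEncodeError:
            return entity  # no ANSI equivalent: leave unchanged
        if encoded[0] < 128 or allow_8bit:
            return char
        return entity  # 8-bit character not allowed: keep the entity

    return ENTITY.sub(substitute, text)


print(replace_entities("Fish&nbsp;&amp;&nbsp;chips &mdash; &copy; 2004",
                       allow_8bit=False, use_7bit_alternatives=True))
```

Entities that have neither an allowed ANSI character nor a 7-bit approximation are passed through untouched, matching the note above that such entities are left unchanged.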
The following screens allow access to options to control various aspects of the text conversion process.
New in version 2.4
These policies help you refine the use of Detagger as a data extraction tool. Note, Detagger wasn't designed as a data mining tool, but with these options you can choose to focus only on data in tables at a certain level, and to then turn the selected data into a delimited format that will make it easier to post-process the results file (e.g. to import it into a spreadsheet or database).
Table extraction policies
These options can be used to tell the program to extract data only from tables, or to extract only certain level(s) of nested table from the source document.
Table data handling
These options specify how the data inside tables should be handled. Only one option may be selected. If comma- or tab-delimited data is selected, then the text will be output between table delimiters.
In the Policy File the policy Output table format takes the value 1, 2 or 3 according to which of the following is selected. In the user interface you need only select the desired option.
If you choose to convert the table into a delimited data format, the following two options become available :-
New in version 2.4
Menu location: Configuration Options -> Conversion to text -> Data extraction
When enabled, only that part of the input file contained in HTML <TABLE> markup will be included in the output. How that data is formatted will depend on the other text table policies, and the remaining data extraction policies that have been chosen.
To further refine which table data is extracted, use the Minimum table depth and Maximum table depth policies.
New in version 2.4
Menu location: Configuration Options -> Conversion to text -> Data extraction
This policy, together with Maximum table depth, specifies a range of table depths from which data should be extracted.
Many HTML pages use tables to lay out a page as well as to mark up tabular data. These nested table structures can be quite complex and can get in the way of accessing the inner data. Similarly, menus and the like are often implemented by "mini" tables inside tables, inside tables etc.
With these policies you can elect to ignore all data placed in tables at a level higher or lower than that of interest.
Consider the HTML code
<TABLE WIDTH="100%">
  <TR>
    <TD WIDTH="30%">
      <!--- left hand menu -->
      <a href="http://www.jafsoft.com/products.html">This is menu item 1</a><br>
      <a href="http://www.jafsoft.com/detagger/">This is menu item 2</a><br>
    </TD>
    <TD WIDTH="70%">
      <!-- main part of the page -->
      <TABLE>
        <TR> <TH>Date</TH> <TH>Total</TH> </TR>
        <TR> <TD>April 13</TD> <TD>1,001</TD> </TR>
        <TR> <TD>May 21</TD> <TD>908</TD> </TR>
      </TABLE>
    </TD>
  </TR>
</TABLE>
<TABLE WIDTH="100%">
  <TR>
    <TD COLSPAN="2">
      <TABLE WIDTH="100%">
        <TR>
          <TD>
            <TABLE WIDTH="100%">
              <TR>
                <TD><a href="http://www.jafsoft.com/asctohtm/">Text to HTML</a></TD>
                <TD><a href="http://www.jafsoft.com/asctotab/">Text to table</a></TD>
                <TD><a href="http://www.jafsoft.com/asctortf/">Text to RTF</a></TD>
                <TD><a href="http://www.jafsoft.com/asctopdf/">Text to PDF</a></TD>
                <TD><a href="http://www.jafsoft.com/detagger/">HTML to text</a></TD>
              </TR>
            </TABLE>
          </TD>
        </TR>
      </TABLE>
    </TD>
  </TR>
</TABLE>
When converted to text this becomes
+-------------------+----------------+
|This is menu item 1|+--------+-----+|
|This is menu item 2||Date    |Total||
|                   |+--------+-----+|
|                   ||April 13|1,001||
|                   |+--------+-----+|
|                   ||May 21  |908  ||
|                   |+--------+-----+|
+-------------------+----------------+
+---------------------------------------------------------------------+
|+-----------------------------------------------------------------+  |
||+------------+-------------+-----------+-----------+------------+|  |
|||Text to HTML|Text to table|Text to RTF|Text to PDF|HTML to text||  |
||+------------+-------------+-----------+-----------+------------+|  |
|+-----------------------------------------------------------------+  |
+---------------------------------------------------------------------+
That is, the page consists of a table of data containing dates and totals, but in addition this is placed in an outer table with a menu on the left. At the foot of the page is a navigation menu, implemented as a heavily nested table.
NOTE: The hyperlinks have been removed using the text hyperlink policies.
To get rid of the outer table set the Minimum table depth to 2. Doing this gives the following
+--------+-----+
|Date    |Total|
+--------+-----+
|April 13|1,001|
+--------+-----+
|May 21  |908  |
+--------+-----+
+-----------------------------------------------------------------+
|+------------+-------------+-----------+-----------+------------+|
||Text to HTML|Text to table|Text to RTF|Text to PDF|HTML to text||
|+------------+-------------+-----------+-----------+------------+|
+-----------------------------------------------------------------+
Note that the text in the outer table (in this case the menu text) has been discarded, but we still have a doubly nested footer table. If we also set the Maximum table depth to 2, we get
+--------+-----+
|Date    |Total|
+--------+-----+
|April 13|1,001|
+--------+-----+
|May 21  |908  |
+--------+-----+
++
||
++
That is, the footer becomes an empty table (when borders are displayed). Using the table data handling options you can now convert this to, for example, CSV format.
"Date","Total"
"April 13","1,001"
"May 21","908"
Note, we've switched off borders and removed delimited table markers in this output.
Note, the Minimum table depth and maximum table depth are fairly broad brush policies - if there had been multiple tables with nested content at level 2, then it would all be included in the output. That said, these policies do offer some prospect of focusing the output on the data you want.
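The effect of a depth range can be illustrated outside Detagger with a short Python sketch built on the standard html.parser module. This is not how Detagger is implemented. The class and function names are hypothetical, the outermost table is counted as depth 1 (matching the example above), and the sketch assumes every <TR>, <TD> and <TH> has an explicit closing tag.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect rows from tables whose nesting depth lies in a given range."""

    def __init__(self, min_depth=1, max_depth=99):
        super().__init__()
        self.min_depth, self.max_depth = min_depth, max_depth
        self.depth = 0        # current <table> nesting level (outermost = 1)
        self.rows = []        # completed rows that fall inside the depth range
        self.row_stack = []   # one list of cells per open <tr>
        self.cell_stack = []  # one list of text fragments per open <td>/<th>

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
        elif tag == "tr":
            self.row_stack.append([])
        elif tag in ("td", "th"):
            self.cell_stack.append([])

    def handle_data(self, data):
        # Text belongs to the innermost open cell; text outside cells is dropped.
        if self.cell_stack:
            self.cell_stack[-1].append(data)

    def handle_endtag(self, tag):
        if tag == "table":
            self.depth -= 1
        elif tag in ("td", "th") and self.cell_stack:
            text = " ".join("".join(self.cell_stack.pop()).split())
            if self.row_stack:
                self.row_stack[-1].append(text)
        elif tag == "tr" and self.row_stack:
            row = self.row_stack.pop()
            in_range = self.min_depth <= self.depth <= self.max_depth
            if in_range and any(row):  # skip rows with no text of their own
                self.rows.append(row)


def extract_csv(html_text, min_depth=1, max_depth=99):
    parser = TableExtractor(min_depth, max_depth)
    parser.feed(html_text)
    parser.close()
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(parser.rows)
    return out.getvalue()


page = """
<TABLE><TR>
<TD><a href="http://www.jafsoft.com/products.html">This is menu item 1</a><br></TD>
<TD><TABLE>
<TR><TH>Date</TH><TH>Total</TH></TR>
<TR><TD>April 13</TD><TD>1,001</TD></TR>
<TR><TD>May 21</TD><TD>908</TD></TR>
</TABLE></TD>
</TR></TABLE>
"""
print(extract_csv(page, min_depth=2, max_depth=2), end="")
```

As with the policies themselves, the filter is broad brush: every table at the chosen depth contributes rows, whether or not it holds the data you are interested in.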
New in version 2.4
Menu location: Configuration Options -> Conversion to text -> Data extraction
See the discussion in Minimum table depth
Menu location: Configuration Options -> Conversion to text -> Data extraction
In the Policy File this policy takes the value 1, 2 or 3 according to which of the following is selected.
Any tables processed during the conversion will then be formatted according to the selection made.
Menu location: Configuration Options -> Conversion to text -> Data extraction
When selected any tables will be converted into plain text. The software will look at the row and column structure of the HTML original, and attempt to lay this out in the current page width, although this may not always be possible.
See also Output table format
Menu location: Configuration Options -> Conversion to text -> Data extraction
When selected any tables will be converted into comma-delimited data. Each row is put out as a row of comma-delimited data, with the data values themselves in quotes. The resulting data is in a format suitable for importing into spreadsheets.
Where tables are nested, by default only the innermost table will be converted in this way. Because these could be multiple tables in a file, each table is delimited as follows
$_$_BEGIN_COMMA_DELIMITED_TABLE
... comma-delimited data rows ...
$_$_END_TABLE
You can change this behaviour through the options:-
See also Output table format
Menu location: Configuration Options -> Conversion to text -> Data extraction
When selected any tables will be converted into tab-delimited data. Each row is put out as a row of tab-delimited data, with the data values themselves in quotes. The resulting data is in a format suitable for importing into spreadsheets.
Where tables are nested, by default only the innermost table will be converted in this way. Because these could be multiple tables in a file, each table is delimited as follows
$_$_BEGIN_DELIMITED_TABLE
... tab-delimited data rows ...
$_$_END_TABLE
You can change this behaviour through the options:-
See also Output table format
Menu location: Configuration Options -> Conversion to text -> Data extraction
For nested tables conversion to a delimited format can become a bit of a nightmare. Usually it will be the innermost table which is data, with outer tables being used for page layout, and so by default the software will only convert the innermost table of a nested set into delimited format.
However this may not always be the case, and so this option can be switched on to convert all levels of table into delimited format. In such cases you may have to tidy up the text (to delete any unwanted portions) before importing it into a spreadsheet.
Note, from version 2.4 onwards, the policies maximum table depth and minimum table depth offer a bit more control than this policy.
Menu location: Configuration Options -> Conversion to text -> Data extraction
By default markers are put round the delimited data to separate it from normal text, and from other sections of delimited data. This makes sense for a file which contains non-table elements, or multiple tables, but is probably unnecessary for files that contain just a single table.
When enabled each table (or sub-table) will appear like this
$_$_BEGIN_DELIMITED_TABLE
... delimited data rows ...
$_$_END_TABLE
You can use this option to control this behaviour.
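When the markers are left in, a post-processing script can use them to pull each table out of the results file. The following Python sketch shows one way this might be done. The function name is hypothetical, and the sketch assumes each marker sits on a line of its own and that any line starting with "$_$_BEGIN_" and ending with "TABLE" opens a table.

```python
import csv


def split_delimited_tables(text, delimiter=","):
    """Return one list of rows for each marked table found in the text."""
    tables, current = [], None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("$_$_BEGIN_") and stripped.endswith("TABLE"):
            current = []  # start collecting data rows
        elif stripped == "$_$_END_TABLE":
            if current is not None:
                tables.append(list(csv.reader(current, delimiter=delimiter)))
            current = None
        elif current is not None:
            current.append(line)
        # Lines outside any marked table are ordinary text and are ignored.
    return tables


results = """Some ordinary text
$_$_BEGIN_DELIMITED_TABLE
"Date","Total"
"April 13","1,001"
"May 21","908"
$_$_END_TABLE
More ordinary text
"""
for table in split_delimited_tables(results):
    print(table)
```

For tab-delimited output the same function could be called with delimiter="\t".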
Detagger has a number of options to allow you to tailor the conversion to text. Some control what is copied across to the output, but most offer options to format the output above and beyond the formatting implicit in the original HTML.
These options control how the text is laid out on the page.
Menu location: Configuration Options -> Conversion to text -> Formatting
When this option is selected the text is laid out as it was in the original HTML file, minus the actual tags. If this option is selected then all other text formatting options are ignored.
This option is suitable when the input file is only lightly tagged (e.g. a document with large <PRE> sections).
Menu location: Configuration Options -> Conversion to text -> Formatting
When selected, the lines of the text file are formatted to match a target page width. This will involve moving text around within a paragraph if necessary.
Menu location: Configuration Options -> Conversion to text -> Formatting
When line formatting is switched on, this is the target page width. If omitted there will in any case be a default page width applied (set to 76 characters).
See also target table width
Menu location: Configuration Options -> Conversion to text -> Formatting
When selected white space will be added to each line inside a paragraph so that the right margins are aligned as well as the left.
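The combination of a target page width and right justification can be sketched in Python with the standard textwrap module. This is an illustration rather than Detagger's own algorithm. The function names are hypothetical, and the choices to spread extra spaces from the left and to leave the last line of a paragraph unjustified are assumptions.

```python
import textwrap


def justify_line(line, width):
    """Pad the gaps between words so that the line is exactly `width` wide."""
    words = line.split()
    if len(words) < 2:
        return line
    gaps = len(words) - 1
    spaces = width - sum(len(word) for word in words)
    base, extra = divmod(spaces, gaps)
    padded = [word + " " * (base + (1 if i < extra else 0))
              for i, word in enumerate(words[:-1])]
    return "".join(padded) + words[-1]


def format_paragraph(text, width=76, justify=False):
    """Wrap a paragraph to the target width, optionally aligning both margins."""
    lines = textwrap.wrap(text, width=width)
    if justify:
        # Assumption: the final line of a paragraph is left ragged.
        lines = [justify_line(line, width) for line in lines[:-1]] + lines[-1:]
    return "\n".join(lines)


paragraph = "the quick brown fox jumps over the lazy dog " * 5
print(format_paragraph(paragraph, width=30, justify=True))
```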
Menu location: Configuration Options -> Conversion to text -> Formatting
This is a comma-separated list of up to 8 levels of indentation, specifying how the output text should be indented when laying out indented text such as nested lists.
By convention the first value should be a zero to set the left hand margin to start at the beginning of the line.
These options control the presentation of bullets and lists in the output text document.
Menu location: Configuration Options -> Conversion to text -> Formatting
When converting to text, this identifies characters which - if they end up at the start of a line by themselves - can be taken to be bullet points. The hyphen character '-' is implicitly regarded as a bullet point, but characters such as 'o', 'q' and '§' can sometimes appear in the output text as bullet points depending on how the original HTML was generated.
When a bullet point is found to match one of these characters, the first of the text bullet characters is used as a replacement.
Menu location: Configuration Options -> Conversion to text -> Formatting
When converting to text, this is a comma-separated list that specifies which characters are to be used as the bullet at each level of list. The special value "middot" will be taken to mean the "middot" character.
At the same time, any "middot" characters occurring in the text will be replaced by the first "bullet" on this list.
For example the value
(+),+,-
would convert any level 1 list bullets to "(+)", level 2 to "+" and level 3 to "-". At the same time any middot characters '·' will be converted into "(+)" (the first bullet on the list).
The default value is
middot,middot,middot,middot,middot,middot
New in version 2.4
Menu location: Configuration Options -> Conversion to text -> Formatting
This policy is a comma-separated list of "templates" to be used when outputting ordered lists. The policy allows you to specify the format of each list item for up to 6 levels of list. The format should include an "x", which will be replaced by the item number.
For example a value of
(x), x), x:
would lead to a list structure as follows
(1) item at list level 1
    a) item at list level 2
       i: item at list level 3
The default value is
x),x),x),x),x),x)
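To make the template substitution concrete, here is a small Python sketch. It is an illustration only. The function names are hypothetical, and the numbering style used at each level (numbers, then letters, then roman numerals) is inferred from the example above rather than taken from a documented rule.

```python
def to_letter(n):
    """1 -> 'a', 2 -> 'b', ... (valid for 1 to 26 in this sketch)."""
    return chr(ord("a") + n - 1)


def to_roman(n):
    """1 -> 'i', 4 -> 'iv', 9 -> 'ix', ..."""
    pairs = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"), (100, "c"),
             (90, "xc"), (50, "l"), (40, "xl"), (10, "x"), (9, "ix"),
             (5, "v"), (4, "iv"), (1, "i")]
    out = []
    for value, symbol in pairs:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


# Assumption: the style cycles through numbers, letters and roman numerals.
STYLES = [str, to_letter, to_roman]


def format_item(policy_value, level, number):
    """Build the label for item `number` at list `level` (both 1-based)."""
    templates = [t.strip() for t in policy_value.split(",")]
    template = templates[min(level, len(templates)) - 1]
    style = STYLES[(level - 1) % len(STYLES)]
    # Replace only the first "x", which stands for the item number.
    return template.replace("x", style(number), 1)


for level in (1, 2, 3):
    print("   " * (level - 1) + format_item("(x), x), x:", level, 1),
          f"item at list level {level}")
```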
These options control how headings in the original document are presented
Menu location: Configuration Options -> Conversion to text -> Formatting
This is a comma-separated list controlling the underlining character (or character pattern) at each heading level. A value of ",,,,," (or indeed blank) would suppress all heading underlining.
As an example the value
=+ , -
would cause <H1> headings to be underlined with the pattern "=+=+=+=", <H2> headings to be underlined with "-------", and all other heading levels not to be underlined at all.
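How a per-level pattern is repeated under a heading can be sketched in Python as follows. This is an illustration only. The function names are hypothetical, and the sketch assumes the underline is made exactly as long as the heading text.

```python
def underline(heading, pattern):
    """Repeat `pattern` under the heading, trimmed to the heading's length."""
    pattern = pattern.strip()
    if not pattern:
        return heading  # a blank entry means no underlining
    repeats = len(heading) // len(pattern) + 1
    return heading + "\n" + (pattern * repeats)[:len(heading)]


def underline_for_level(heading, level, policy_value):
    """Pick the pattern for a heading level (1-based) from the policy value."""
    patterns = policy_value.split(",")
    pattern = patterns[level - 1] if level <= len(patterns) else ""
    return underline(heading, pattern)


print(underline_for_level("Chapter", 1, "=+ , -"))
print(underline_for_level("Section", 2, "=+ , -"))
print(underline_for_level("Details", 3, "=+ , -"))
```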
For those HTML files that represent works of fiction, Detagger has some sophisticated "dialogue" detection and formatting options. Dialogue is deemed to be any words or phrases in double quotation marks. Where they occur at the start of a line this can often (but not always) signify the next line of dialogue (i.e. a new "character" speaking).
Care is taken, as far as possible, to distinguish text that happens to be in "quotes" from true dialogue. However this will never be a 100% accurate process.
Menu location: Configuration Options -> Conversion to text -> Miscellaneous Formatting
When this option is switched on, the program will attempt to spot dialogue at the start of a line and format it accordingly, with each new speaker starting a new paragraph in the output.
Menu location: Configuration Options -> Conversion to text -> Miscellaneous Formatting
When this option is enabled, extra tests are made to check the validity of lines believed to be the start or end of a dialogue phrase. Tests include looking for suitable use of capitalisation and punctuation inside and outside the quoted text.
Some of these tests are biased towards the text being in English.
Menu location: Configuration Options -> Conversion to text -> Miscellaneous Formatting
When this option is enabled, the software will attempt to spot new dialogue from a different character that appears deep inside a paragraph. When this is detected, the larger paragraph will be broken so that the dialogue of the new character starts in a new paragraph.
Menu location: Configuration Options -> Conversion to text -> Miscellaneous Formatting
When selected all horizontal lines and rules in the input will be omitted in the output.
Menu location: Configuration Options -> Conversion to text -> Miscellaneous Formatting
When selected, HTML marked as <b>bold</b> or <i>italic</i> can be emphasised in the output text as bold and italic.
This can work well when the occasional word is emphasised, but in some HTML pages entire menus are placed in bold, and in such files this option is probably best switched off.
Detagger has a limited ability to deal with Unicode in the HTML files.
At present the following options are available
Menu location: Configuration Options -> Conversion to text -> Miscellaneous Formatting
When selected, files which are detected to contain Unicode characters will get a "Unicode file marker" output at the start of the file.
A Unicode file marker at the top of a file is recognised by some software applications (e.g. text editors) as marking the file as containing Unicode. The marker characters themselves don't get displayed.
Where possible, the software will create a UTF8 file. Full Unicode support has not been tested, and it is not expected that Detagger will support all types of Unicode.
Menu location: (none at present)
The program can detect Unicode files on input if a Byte Order Mark (BOM) is present, or - under some circumstances - if Unicode HTML entities are present in the input text. In files without a BOM, however, the software may fail to detect that the input is Unicode.
In such circumstances this policy allows you to tell the software that the input should be treated as Unicode. The possible values for this policy are
auto      - automatic detection (the default)
UTF8      - UTF-8
UTF16-BE  - UTF-16 "Big Endian"
UTF16-LE  - UTF-16 "Little Endian"
For a fuller discussion see Working with Unicode
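As an illustration of the BOM check described above, here is a minimal Python sketch. It is not Detagger's own code; the function name sniff_bom is hypothetical, and the return values simply reuse the policy values listed above. It only looks at the leading bytes, so a file without a BOM falls back to "auto", which is exactly the case where setting the policy by hand helps.

```python
import codecs

# The three byte order marks that correspond to the policy values above.
_BOMS = [
    (codecs.BOM_UTF8, "UTF8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF16-BE"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF16-LE"),  # FF FE
]

def sniff_bom(data: bytes) -> str:
    """Return the encoding implied by a leading BOM, or "auto" if none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return "auto"
```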
These options control what happens to any hyperlinks in the original document. Since text files don't support hyperlinks, the options are to ignore the link entirely, only use the display text, or to turn the link into a reference and add a reference table at the end, listing the URLs the links pointed to.
Hyperlink removal
Display of URLs
Images
Menu location: Configuration Options -> Conversion to text -> Hyperlinks
This is a peculiar one added in response to a customer request. When enabled during a conversion to text, all hyperlinks are left intact, so what you end up with is a text file with HTML hyperlinks in it. This may be of interest to those wishing to import text into a database for display on HTML pages.
It is expected that in the conversion any HTML entities (specifically &amp;) in the URL will get converted to their ASCII equivalent. This may cause usability problems with the link after conversion.
Menu location: Configuration Options -> Conversion to text -> Hyperlinks
When this option is selected, all visible email addresses are omitted from the output. This can be a useful privacy option.
Menu location: Configuration Options -> Conversion to text -> Hyperlinks
When this option is selected, any link to a local resource (a jump point, or a non http-qualified URL) is omitted from the output. This can be used to remove local navigation links from documents where "next", "previous", "top of page" links will mean nothing in the final text.
If this option isn't selected, the display part of such links will be copied to the output text.
Menu location: Configuration Options -> Conversion to text -> Hyperlinks
If either of the two previous options is selected, then this is the text that any deleted links will be replaced by. If set to blank the links are completely removed.
Menu location: Configuration Options -> Conversion to text -> Hyperlinks
If this option is selected then hyperlinks to resources are replaced by the display text, with a reference number [n] added after it. A full reference table, listing the original URL that matches each reference number, is then added at the end of the file.
When selected the option Display link URLs is disabled.
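The reference-number scheme can be sketched in a few lines of Python. This is an illustration only, not Detagger's implementation: the function name, the "References" heading and the exact layout of the table are assumptions, and the regular expression handles only simple links with a double-quoted href.

```python
import re

# Simplified pattern: double-quoted href, no nested tags in the link text.
_LINK = re.compile(r'<a\s+[^>]*href="([^"]*)"[^>]*>(.*?)</a>',
                   re.IGNORECASE | re.DOTALL)

def links_to_references(html: str) -> str:
    """Replace each link by its display text plus [n], then append
    a table listing the URL behind each reference number."""
    urls = []

    def repl(match):
        urls.append(match.group(1))
        return f"{match.group(2)} [{len(urls)}]"

    body = _LINK.sub(repl, html)
    if not urls:
        return body
    table = "\n".join(f"[{i}] {url}" for i, url in enumerate(urls, 1))
    return f"{body}\n\nReferences\n{table}"
```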
Menu location: Configuration Options -> Conversion to text -> Hyperlinks
When selected the URL for hyperlinks is displayed in brackets in the main text, after the display text.
When selected the option Add URL references at end of file is disabled.
Menu location: Configuration Options -> Conversion to text -> Hyperlinks
When selected any <IMG> tags in the original are replaced by an "[Image]" marker.
See also Use the ALT attribute to replace <IMG> tags
Menu location: Configuration Options -> Conversion to text -> Hyperlinks
When selected any ALT attribute on a <IMG> tag will be used as the replacement text marker for the tag.
See also Replace <IMG> tags by a text marker
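The two image options can be pictured with the following Python sketch. It is a rough approximation under stated assumptions: the function name is hypothetical, the ALT text is assumed to be copied verbatim, and falling back to "[Image]" when no ALT attribute exists is a guess rather than documented behaviour.

```python
import re

_IMG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
_ALT = re.compile(r'\balt="([^"]*)"', re.IGNORECASE)

def replace_images(html: str, use_alt: bool = False) -> str:
    """Replace each <IMG> tag by its ALT text (if requested and present)
    or by the fixed "[Image]" marker."""
    def repl(match):
        alt = _ALT.search(match.group(0)) if use_alt else None
        if alt and alt.group(1):
            return alt.group(1)
        return "[Image]"
    return _IMG.sub(repl, html)
```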
New in V2.3
These policies allow you to specify special "markers" that should be added to the output to delimit tables and lists. This can be useful if you want to pass the output to some further software package for post-processing.
Tables
Lists
Menu location: Configuration Options -> Conversion to text -> Markers
Menu location: Configuration Options -> Conversion to text -> Markers
Menu location: Configuration Options -> Conversion to text -> Markers
When converting to text, this option identifies a marker that will be output on the line before the start of any marked-up list that is detected. This can be useful if you want to subsequently identify lists in the text.
See also End list marker
Menu location: Configuration Options -> Conversion to text -> Markers
When converting to text, this option identifies a marker that will be output on the line before the start of any marked-up table that is detected. This can be useful if you want to subsequently identify tables in the text.
See also End table marker
New in V2.3
These options control the layout of text into sentences and paragraphs.
<P> tags in the original text are preserved, but some HTML files use other means to lay out text (e.g. multiple <BR> tags). In such cases Detagger applies extra intelligence to detect the paragraph structure.
Menu location: Configuration Options -> Conversion to text -> Paragraphs and sentences
When selected this option specifies that each paragraph is put out as a single line (i.e. with a single hard break). This produces text that will display well in those environments that automatically wrap text.
This option won't work on text inside a table unless you switch off Attempt to parse tables.
Menu location: Configuration Options -> Conversion to text -> Paragraphs and sentences
When selected any line deemed to be "short" will keep its line break, even if text is rearranged to fit into a target page width.
Menu location: Configuration Options -> Conversion to text -> Paragraphs and sentences
Specifies the number of characters by which the first line in a new paragraph should be indented relative to those that follow.
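The effect of a first-line indent on a re-wrapped paragraph can be sketched with Python's standard textwrap module. The function name and parameters are hypothetical; the default width of 76 is taken from the default target page width mentioned elsewhere in this documentation.

```python
import textwrap

def format_paragraph(text: str, width: int = 76, first_indent: int = 0) -> str:
    """Re-wrap a paragraph to the page width, indenting only its first line."""
    collapsed = " ".join(text.split())  # discard the original line breaks
    return textwrap.fill(collapsed, width=width,
                         initial_indent=" " * first_indent)
```

Note that the indent counts against the page width, so the first line holds slightly less text than those that follow.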
Menu location: Configuration Options -> Conversion to text -> Paragraphs and sentences
Specifies that in files where there are no paragraph markers, a short line amongst longer lines should be taken as signalling a paragraph end. This can be useful when converting files that use multiple <BR> tags, but where you want a different page width in the output. Without this test there would be no paragraphs detected.
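The short-line test can be approximated as follows. This Python sketch is only one plausible reading of the rule: the function name and the 60% threshold are assumptions, not values taken from Detagger.

```python
def split_on_short_lines(lines, short_fraction=0.6):
    """Group lines into paragraphs, ending a paragraph after any line
    that is much shorter than the longest line in the input."""
    if not lines:
        return []
    threshold = max(len(line) for line in lines) * short_fraction
    paragraphs, current = [], []
    for line in lines:
        current.append(line)
        if len(line) < threshold:  # a short line closes the paragraph
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```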
Menu location: Configuration Options -> Conversion to text -> Paragraphs and sentences
When selected the software will impose a 2-character space between sentences, regardless of what spacing was in the original. This is a common style in written text.
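A simple way to picture this option is the Python sketch below. It is not Detagger's algorithm: the function name is hypothetical, and the rule used to spot a sentence end (a full stop, exclamation mark or question mark, then spaces, then a capital letter or quote) is an assumption that will misfire on abbreviations such as "Mr. Smith".

```python
import re

def two_space_sentences(text: str) -> str:
    """Force exactly two spaces after sentence-ending punctuation."""
    # The lookahead keeps the first character of the next sentence in place.
    return re.sub(r'([.!?])[ \t]+(?=[A-Z"\'])', r"\1  ", text)
```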
These policies control how (if at all) Detagger will attempt to faithfully represent tables in the output.
The software does a reasonable job of representing simple tables as text, and can even cope with nested tables to a limited extent.
Menu location: Configuration Options -> Conversion to text -> Tables
When selected Detagger will attempt to correctly format any HTML tables in the text.
In doing so Detagger will attempt to preserve the width, alignment etc of the original, but this process can only ever be approximate due to the quite different formats of HTML and text. Bear in mind that if the software is adding emphasis characters (see Highlight bold and italic text) or URL references (see text hyperlink policies), then this will end up with more output text to fit in the table.
Although you can try to adjust table to page width, if the table is too wide or the page too narrow, then this will often fail - particularly for heavily nested tables.
These options specify whether any tables should have borders added in the text file. This only applies when converting to plain text (see Output table format). By default the software will replicate the border status of the outermost tables, but suppress borders in any inner, nested tables. This is because the space taken by borders in the ASCII file limits the room for data.
Menu location: Configuration Options -> Conversion to text -> Tables
When selected a border will be added round each table created. This will be regardless of whether the original table set a border or not.
If omitted, then the border attribute of the original HTML should be honoured.
Menu location: Configuration Options -> Conversion to text -> Tables
When table borders are present (or switched on), this option suppresses any borders on tables that appear inside other tables. This is done because nesting tables is usually done to achieve layout, and in the text file white space usually looks better (less distracting).
This option will not prevent a border being added to the outermost table (if one is requested).
Menu location: Configuration Options -> Conversion to text -> Tables
This specifies whether or not blank lines are allowed in the output between rows. When enabled, a blank line is output between each table row (when there is no border). This can space out the table, making it easier to read, especially if some of the cell contents are split over several lines.
This option only works when the table is being converted to plain text as opposed to a data delimited format.
Controls the width the table will take on the output "page". The default behaviour is to regard 800 pixels as 80 characters wide.
Menu location: Configuration Options -> Conversion to text -> Tables
When selected this specifies that any table should attempt to fit the current target table width.
For a reasonable page size and simple table, Detagger can usually re-arrange the table's contents cell-by-cell to fit the table to the page.
However for narrow pages and/or large tables or heavily nested tables it becomes virtually impossible to achieve this goal.
Menu location: Configuration Options -> Conversion to text -> Tables
When table formatting is being attempted, this is the target page width for tables. This is the maximum width that a table should be allowed to grow to. In some nested tables however this limit may on occasion be exceeded.
If omitted the value will default to the target page width value, which in turn will default to 76 characters.
New in version 2.4
Menu location: Configuration Options -> Conversion to text -> Tables
Whenever Detagger converts a table with explicit WIDTH attributes, it tries to honour this layout. However, sometimes the widths set in an HTML table don't give enough room to lay out the text when converting to text.
This can happen for a number of reasons
- the widths have been wrongly chosen to be too small in the HTML, but the browser has widened the table (thereby effectively ignoring the set width)
- text in the table has been set to a small font size that fits the HTML width. On conversion to text small sizes can't be honoured, and so the allocated width may be too narrow for the text to be placed in the cell
- sometimes the "nowrap" attribute has been set on a cell to stop it breaking over several lines. Again most browsers will honour this request by widening the table if need be, but this option isn't always available to Detagger, especially in heavily nested tables.
In such cases this policy allows all WIDTH attributes to be ignored by Detagger. When WIDTH attributes are ignored, Detagger is free to do the best it can to fit the data into the space available, subject to any limits suggested by the Target table width and Adjust table to page width policies.
New in version 2.4
Menu location: Configuration Options -> Conversion to text -> Tables
When Detagger converts tables, it will do a best-fit to the available page width, but there are occasions when it is difficult to fit a wide table into a narrow target page width. This is especially true of tables where small font sizes have been used in the HTML to achieve a fit; in plain ASCII text Detagger can't use an equivalent trick.
When compressing a table to fit a small target width, Detagger will split text across multiple lines, moving words onto the next line to make the column narrower, but it won't break individual words in two in order to achieve a fit.
This option tells Detagger that it can, if necessary, break up long words in order to narrow a column to fit. Detagger implements this in a fairly brutal manner - for example it doesn't hyphenate the broken text - and so this option should be used sparingly.
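The difference between the two behaviours can be demonstrated with Python's standard textwrap module, which offers a comparable switch. This is an analogy rather than Detagger's code; the function name fit_cell and the sample text are made up for the illustration.

```python
import textwrap

def fit_cell(text: str, width: int, break_words: bool) -> list[str]:
    """Wrap cell text to a column width, optionally cutting long words."""
    return textwrap.wrap(text, width=width, break_long_words=break_words)

cell = "See supercalifragilistic notes"
# Words move to the next line but stay whole, so the long word overflows.
whole = fit_cell(cell, 10, break_words=False)
# The long word is cut wherever the column ends, with no hyphen added.
broken = fit_cell(cell, 10, break_words=True)
```

With breaking disabled the 20-character word forces the column wider than 10 characters; with breaking enabled every line fits, at the cost of readability.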
New in version 2.4
Menu location: Configuration Options -> Conversion to text -> Tables
When table widths are not supplied (or the policy Ignore table width attributes is enabled), then Detagger sometimes struggles to fit heavily nested tables to a restrictive target page or table width.
The reason for this is that Detagger first lays out the inner tables, and then embeds these in the outer tables when they are laid out. The problem is that the inner tables may expand to be too wide, making it impossible to get the outer tables to fit on the page.
This policy scales back the amount of space a nested table can take. It's expressed as a percentage. 100(%) implies no limit on the inner table, giving the behaviour of earlier versions of Detagger. 0 implies the maximum restriction on inner tables (although this isn't, of course, to zero width).
The default value is 75.
If you find that heavily nested tables aren't fitting into the desired page width, try reducing this value.
If you find that inner tables are getting broken over several lines, try setting this policy to 100. If that still fails, you may need to consider disabling the policy Adjust table to page width and increasing the value of Target table width
These are miscellaneous options related to how tables are converted and the markup found inside tables is treated.
Menu location: Configuration Options -> Conversion to text -> Tables
When this option is selected it prevents any heading markup inside tables being interpreted as such. Not only will this suppress any underlining the heading would have had added, but it will also prevent the width of the whole heading being used in calculating the "minimum width" required for each column.
As a result this will actually help make some tables narrower, as the "minimum" width required drops as the "heading" text is now allowed to be split over two or more lines.
Menu location: Configuration Options -> Conversion to text -> Tables
This option specifies the value - in spaces - of an indentation to be applied to all tables in the output. This indentation will reduce the page width available to the table.
These policies are not set via the main options screens.
These policies record the location of any additional configuration files that are to be used in the conversion. They are usually selected using the Configuration Files menu
Menu location: Configuration Options -> Configuration Files -> Text fragments file
In the full version you can add headers and footers to each text file created by using a text fragments file. This policy allows you to specify the file in which those fragments are defined.
If you don't select a fragment file, then no headers or footers will be added.
Note: In the evaluation version, a standard header and footer are added, and this feature is not available. It is available in the registered version.
See also Using a Text Fragments File
Menu location: Configuration Options -> Configuration Files -> Text commands file
Specifies the location of any text commands file to be used to define text manipulations to be performed on the text as it is being read in, and prior to conversion.
Some policies cannot be set through the normal options screens. These include
Menu location: (none at present. Edit the policy file)
When selected, Detagger will search the first 40 lines of the document looking for a "By" line in the hope of identifying the author. Any value located this way, will then be available for use in a generated text fragment via the AUTHOR fragment tag.
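A search of this kind might look like the Python sketch below. The function name and the pattern are assumptions; the documentation does not say exactly what Detagger accepts as a "By" line, so this version simply takes any line that starts with the word "by".

```python
import re

# A line consisting of "By" followed by the author's name.
_BYLINE = re.compile(r"^\s*by\s+(.+?)\s*$", re.IGNORECASE)

def find_author(lines, limit=40):
    """Return the name from the first "By" line within the first
    `limit` lines, or None if there is no such line."""
    for line in lines[:limit]:
        match = _BYLINE.match(line)
        if match:
            return match.group(1)
    return None
```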
Menu location: On main dialog, set 'Output Type' to concatenate files
When selected this specifies that when converting multiple files at once, all the results should be concatenated into a single results file.
This has the same effect as selecting "Concatenate results into one file" as the output type.
Menu location: (none at present. Edit the policy file)
This specifies how many lines from the input files should be ignored at the start of the file. These lines will be discarded from the output.
This can be useful when converting a file copied from a news feed or similar source that adds a small data header to the file.
Menu location: (none at present. Edit the policy file)
This specifies how many lines from the input files should be ignored at the end of the file. Up to 40 lines may be ignored in this way. These lines will be discarded from the output.
This can be useful when converting a file copied from a news feed or similar source that adds a small data footer to the file.
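The combined effect of the two line-skipping policies can be sketched in Python as follows. The function and parameter names are hypothetical; the cap of 40 trailing lines comes from the description above.

```python
def trim_lines(lines, skip_start=0, skip_end=0):
    """Drop `skip_start` lines from the top and `skip_end` lines
    from the bottom of the input before conversion."""
    skip_end = min(skip_end, 40)  # at most 40 trailing lines may be ignored
    end = len(lines) - skip_end
    # max() keeps the slice empty rather than wrapping round on short files.
    return lines[skip_start:max(end, skip_start)]
```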
Converted from a single text file by AscToHTM © 1997-2005 John A Fotheringham