Documentation for the AscToHTM Text to HTML converter



Using Text Command Files

(New in version 5.0)
As of version 5.0, AscToHTM allows the use of "Text Commands". These are commands that allow you to modify the text before it is converted, or to label certain lines as being of a particular type.

The commands should be placed in an external "Text Command File". This file can be selected using the Conversion Options -> Config File Locations menu option.

Contents of this section

Text Commands available
  Text Command : ignore_line
  Text Command : remove_text
  Text Command : replace_text
  Text Command : treat_line
  Text Command : meta_tag_line
Text Command line elements
  line_selection
  line_match
  match_type
  replace_type
  as_line_type
Text Command tags
  The DATA tag
An example Text Command File

Text Commands available


Text Command : ignore_line

The ignore_line command identifies lines that should be ignored in the input.

Syntax:

        ignore_line <line_selection>


Any line matching the specified line_selection criteria will be omitted from the output. This can be a useful way of discarding page markers in an input file, as these don't always convert well.


Text Command : remove_text

The remove_text command identifies text that should be removed from the input.

Syntax:

        remove_text <match_type> "match string"

Any line containing text that matches the specified match_type for the supplied "match string" will have the matching text removed.


Text Command : replace_text

The replace_text command identifies text that should be replaced in the input.

Syntax:

        replace_text <match_type> "match string" by_string "new string"
or
        replace_text <match_type> "match string" by_character "<char>"

Any line containing text that matches the specified match_type for the supplied "match string" will have the matching text replaced.

If the replacement is specified as

        by_string "new string"

then the text is replaced by the new string. If the replacement is specified as

        by_character "<char>"

then the matched text is replaced by a string of equal length consisting of this single character repeated. This can be useful, for example, to replace change bar characters by spaces in a document where the change bars have confused the program, or to replace other characters inside a table that are confusing the detection of the table's true layout.


Text Command : treat_line

The treat_line command allows you to specify how a line should be regarded during the analysis of the file.

Syntax:

        treat_line <line_selection> <as_line_type>

With this command any line that matches the specified line_selection criteria will be regarded as the specified as_line_type.

For example the command

        treat_line starting_with string "news" as_heading_1

specifies that any line in which the string "news" is found at the start (in any mix of upper and lower case, since the string match type is case-insensitive) should be considered as a level 1 heading.


Text Command : meta_tag_line

The meta_tag_line is meant solely for HTML conversion. It identifies lines that should be converted into HTML META tags.

Syntax:

        meta_tag_line "tag name" <line_selection> [remove_match_text]

This command specifies that any line matching the line_selection criteria should be used to create a META tag called "tag name". The value of this META tag will be the line itself. If the remove_match_text argument is supplied, the match text itself will be removed from the value.

For example the command

        meta_tag_line "author" starting_with string "author: " remove_match_text

will match the line

        Author: Dr John A Fotheringham

remove the "author: " text from this line, and create a META tag as follows

        <META NAME="author" CONTENT="Dr John A Fotheringham">

This can be useful when processing text files created by other systems that add "tagging" and catalogue information at the top.
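
The behaviour described above can be modelled in a few lines of Python. This is only an illustrative sketch, not AscToHTM's own code: the function name meta_tag_line and its parameters are hypothetical, it handles only a starting_with string selection, and the HTML escaping of the value is an assumption that the documentation does not confirm.

```python
import html

def meta_tag_line(tag_name, line, match_string, remove_match_text=False):
    """Hypothetical model of: meta_tag_line "tag" starting_with string "match"."""
    stripped = line.strip()
    # "string" matching is case-insensitive; leading white space is ignored.
    if not stripped.lower().startswith(match_string.lower()):
        return None
    # Optionally drop the matched text from the value of the META tag.
    value = stripped[len(match_string):] if remove_match_text else stripped
    # Escaping the value is an assumption made for this sketch.
    return '<META NAME="{}" CONTENT="{}">'.format(
        html.escape(tag_name, quote=True),
        html.escape(value.strip(), quote=True),
    )

tag = meta_tag_line("author", "Author: Dr John A Fotheringham", "author: ",
                    remove_match_text=True)
```

Lines that do not match the selection return None in this sketch, which stands in for "no META tag is created".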

Note
You could combine this command with a matching ignore_line command to ensure that the line became a META tag, but wasn't included in the conversion output itself.


Text Command line elements

line_selection

The line_selection element is actually a combination of a number of simpler elements, as follows:

Syntax:

        <line_match> <match_type> "match string"

That is, the line_selection consists of a line_match, a match_type, and then the actual "match string" to be matched. All three elements must be present for the line_selection to be valid.

The following are all valid examples

        starting_with   string          "Chapter"
        starting_with   exact_phrase    "Author : "
 
        containing      phrase          "click here"
        containing      string          "http://"
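
To make the "all three elements must be present" rule concrete, here is a minimal Python sketch of how such a line_selection could be split and checked. The function parse_line_selection is hypothetical, and the use of shell-style quoting rules (via shlex) is an assumption; AscToHTM's real parser may treat quotes and backslashes differently. The wildcard match type is left out because it is not yet supported.

```python
import shlex

LINE_MATCHES = {"starting_with", "containing"}
MATCH_TYPES = {"string", "exact_string", "phrase", "exact_phrase"}

def parse_line_selection(text):
    """Split a line_selection into (line_match, match_type, match_string)."""
    # shlex keeps the quoted match string, including its spaces, as one token.
    tokens = shlex.split(text)
    if len(tokens) != 3:
        raise ValueError("a line_selection needs exactly three elements")
    line_match, match_type, match_string = tokens
    if line_match not in LINE_MATCHES:
        raise ValueError("unknown line_match: " + line_match)
    if match_type not in MATCH_TYPES:
        raise ValueError("unknown match_type: " + match_type)
    return line_match, match_type, match_string

selection = parse_line_selection('starting_with exact_phrase "Author : "')
```

Note that the trailing space inside "Author : " is preserved, which matters for the match.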


line_match

The line_match element specifies where on the input line the specified text should be located. The options are

  starting_with   Text should be at start of line (ignoring any white space)
  containing      Text can be anywhere on the input line

Care should be used when using the containing option, as false matches are more likely to occur.


match_type

The match_type element specifies how any supplied match string should be matched. The options are

  string          This specifies that a string should be matched.
                  This is, in fact, the most general of match types
                  and is the one that would normally be used. This
                  match type is case-insensitive.
  exact_string    Same as "string", but case-sensitive.
  phrase          A "phrase" is a string that is surrounded by white space
                  and/or punctuation on either side (see below).
                  This match type is case-insensitive.
  exact_phrase    Same as "phrase", but case-sensitive.
  wildcard        Not yet supported (*)

The match_type phrase is a special case. This is a string that is surrounded by white space or punctuation on either side. So whereas the string "the" would match "then", the phrase "the" wouldn't because the "n" in "then" is not a white space character.

The start and end of a line count as white space, and any leading or trailing punctuation is allowed. Phrase is therefore a more precise match - even for single words - than string.

Consider the following example, concentrating on the letters "ten" in the word "tense"

        This is a tense situation....

The following would apply

  match_type                        Matches?
  string "ten"                      Yes. The "ten" matches the first three
                                    characters of "tense" in the middle of
                                    the line.
  exact_string "Ten"                No. The "t" in "tense" is lower case, so
                                    the match fails.
  phrase "ten"                      No. "ten" is not surrounded by white space
                                    or punctuation because it is followed by "se".
  exact_phrase "tense situation"    Yes. The case matches, and there is a space
                                    before and punctuation (the "....") afterwards.
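
The four supported match types can be approximated with regular expressions. The sketch below is an illustration only: the function matches is hypothetical, and it assumes that any character other than a letter or digit counts as "white space or punctuation", which may not be exactly how AscToHTM draws the line.

```python
import re

def matches(line, match_type, text):
    """Hypothetical model of string, exact_string, phrase and exact_phrase."""
    # The exact_ variants are case-sensitive; the others are not.
    flags = 0 if match_type.startswith("exact_") else re.IGNORECASE
    pattern = re.escape(text)
    if match_type.endswith("phrase"):
        # A phrase may not touch a letter or digit on either side.
        # The start and end of the line satisfy both conditions.
        pattern = r"(?<![A-Za-z0-9])" + pattern + r"(?![A-Za-z0-9])"
    return re.search(pattern, line, flags) is not None

line = "This is a tense situation...."
results = {
    "string": matches(line, "string", "ten"),
    "exact_string": matches(line, "exact_string", "Ten"),
    "phrase": matches(line, "phrase", "ten"),
    "exact_phrase": matches(line, "exact_phrase", "tense situation"),
}
```

Under these assumptions the results agree with the table above: string and exact_phrase match, while exact_string and phrase do not.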


replace_type

The replace_type element is used in the replace_text command to specify what type of text replacement should be executed. The element should be immediately followed by the replacement text in quotes.

There are two options:-

  by_string       The matched text should simply be replaced
                  by the replacement text.
  by_character    The matched text should be replaced by an
                  equal length string composed solely of the
                  single character in the replacement text.

The by_character option allows a string to be "blanked out" by the character of your choice, but without altering the line length or spacing etc. This can be useful, for example, to replace all DOS line drawing characters by blanks in a table, so as to let the software make a better stab at detecting the table layout.
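
The difference between the two replacement types is easiest to see side by side. The following Python sketch is a model under stated assumptions, not the program's implementation: replace_text is a hypothetical function, it uses a case-insensitive "string" match, and "||" stands in for whatever characters are confusing the table detection.

```python
import re

def replace_text(line, match, replacement, by_character=False):
    """Hypothetical model of replace_text using a case-insensitive string match."""
    pattern = re.compile(re.escape(match), re.IGNORECASE)
    if by_character:
        fill = replacement[0]  # only the single replacement character is used
        # Repeat the character so line length and column positions are preserved.
        return pattern.sub(lambda m: fill * len(m.group(0)), line)
    # by_string: substitute the replacement text as-is.
    return pattern.sub(lambda m: replacement, line)

row = "|| Name   || Price ||"
blanked = replace_text(row, "||", " ", by_character=True)  # same length as row
swapped = replace_text(row, "||", "|")                     # line gets shorter
```

Because by_character keeps every column in place, the remaining text still lines up with the rows above and below it.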


as_line_type

The as_line_type element is used by the treat_line command to specify how the matching line should be treated. The as_line_type assigns to the matching line a type that would otherwise have to be detected automatically by the program. It can therefore help the analysis if you can tell the program how such lines should be treated.

The options are

  as_heading_<n>        Where <n> is "1","2"..."6". The matched
                        line is treated as a heading of level <n>.
  as_bullet             The matched line is treated as being
                        an unordered list item (bullet).
  as_alpha_bullet       The matched line is treated as being an
                        item on an alphabetic list.
  as_capalpha_bullet    The matched line is treated as being an
                        item on an UPPER CASE alphabetic list.
  as_roman_bullet       The matched line is treated as being an
                        item on a roman numeral list.
  as_caproman_bullet    The matched line is treated as being an
                        item on an UPPER CASE roman numeral list.
  as_quoted             The matched line is treated as being "quoted
                        text", as lines in emails that start with
                        a ">" are.
  as_new_page           The matched line is treated as being the
                        start of a new page.
  as_number_bullet      The matched line is treated as being an
                        item on a numbered list.

For example the command

        treat_line starting_with string ":" as_quoted

can be used to ensure that lines that start with ":" are treated as if they are "quoted text" such as one finds inside emails. See quoted line detection.

Text Command tags

To add further flexibility to your text commands, the DATA tag has been added. This tag allows your replacement strings to contain variable text, although only to a limited extent.


The DATA tag

Syntax:

        DATA <type_of_data>

where <type_of_data> can be

  VERSION         Program name and version
  TITLE           Document title
  AUTHOR          Document Author (if known)
  IN_FILENAME     Input filename
  OUT_FILENAME    Output filename
  IN_FILESIZE     Input file size
  OUT_FILESIZE    Output file size (approximate if known)
  IN_FILEDATE     Input file's date
  TIMESTAMP       Current time

Text commands are executed against the input text, and can be used to substitute data on input before conversion. For example the command

    replace_text string "Created by [author]" by_string "created by [[DATA AUTHOR]]</TD>"

would allow a standard line to have the author's name embedded in it. Note that for this to work the DATA tag has to know the value of Author at execution time, so you would most likely need to set this in the policy file.

For this reason some of the DATA statements will be less useful than others (for example, OUT_FILESIZE won't work in AscToHTM because the output size is not known while the input is still being read).
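
The substitution step can be pictured as a simple lookup. The Python sketch below is an illustration under assumptions: expand_data_tags is a hypothetical function, the tag is taken to be written as [[DATA <type_of_data>]] as in the example above, and leaving unknown values untouched is a choice made for the sketch rather than documented behaviour.

```python
import re

def expand_data_tags(text, values):
    """Replace each [[DATA NAME]] tag with values[NAME] when it is known."""
    def lookup(match):
        # Assumption: a tag whose value is not known is left unchanged.
        return values.get(match.group(1), match.group(0))
    return re.sub(r"\[\[DATA\s+([A-Z_]+)\]\]", lookup, text)

known = {"AUTHOR": "Dr John A Fotheringham"}  # e.g. set via the policy file
expanded = expand_data_tags("created by [[DATA AUTHOR]]", known)
unknown = expand_data_tags("size: [[DATA OUT_FILESIZE]]", known)
```

The second call shows why a value that is not available at execution time, such as OUT_FILESIZE, cannot be substituted.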


An example Text Command File

Below is an example Text Command file:

        treat_line starting_with exact_string "new page" as_new_page

        treat_line starting_with string "head_1" as_heading_1
        treat_line starting_with string "head_2" as_heading_2
        treat_line starting_with string "head_3" as_heading_3
 
        remove_text exact_string "head_1"
        remove_text exact_string "head_2"
        remove_text exact_string "head_3"
 
        ignore_line containing exact_string "PAGE"

In this example lines starting with "new page" are treated as page breaks. Lines starting with "head_1" etc. are treated as headings, and then the text "head_1" is removed. In this way you could label your heading lines without the labelling appearing in the output. Finally, any line containing the exact_string "PAGE" is discarded. Note that by using "exact_string" you ensure that the case is matched, so "PAGE" matches but "page" does not.
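
As a rough illustration of the combined effect, the Python sketch below hard-codes the commands from the example file and applies them to a few input lines. It is a model only: apply_commands and the line-type names it returns are hypothetical, and the order in which it applies the commands (ignore, then treat, then remove) is an assumption that happens not to matter for this input.

```python
def apply_commands(lines):
    """Return (line_type, text) pairs for the lines that are kept."""
    result = []
    for line in lines:
        stripped = line.lstrip()
        # ignore_line containing exact_string "PAGE" (case-sensitive)
        if "PAGE" in line:
            continue
        line_type = "text"
        # treat_line starting_with exact_string "new page" as_new_page
        if stripped.startswith("new page"):
            line_type = "new_page"
        for level in (1, 2, 3):
            label = "head_%d" % level
            # treat_line starting_with string "head_<n>" (case-insensitive)
            if stripped.lower().startswith(label):
                line_type = "heading_%d" % level
            # remove_text exact_string "head_<n>" (case-sensitive)
            line = line.replace(label, "")
        result.append((line_type, line.strip()))
    return result

sample = ["head_1 Introduction", "Some text", "--- PAGE 2 ---",
          "new page", "head_2 Details", "page one of many"]
converted = apply_commands(sample)
```

The "--- PAGE 2 ---" line is dropped, while "page one of many" survives because the exact_string match is case-sensitive.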




Converted from a single text file by AscToHTM
© 1997-2004 John A Fotheringham