Supported Validators¶
RedPen supports the following validators.
- SentenceLength
- InvalidExpression
- InvalidWord
- SpaceAfterPeriod
- CommaNumber
- WordNumber
- SuggestExpression
- InvalidSymbol
- SpaceWithSymbol
- KatakanaEndHyphen
- KatakanaSpellCheck
- SectionLength
- SpaceBetweenAlphabeticalWord
- ParagraphNumber
- ParagraphStartWith
- Contraction
- Spelling
- DoubledWord
- SuccessiveWord
- DuplicatedSection
SentenceLength¶
SentenceLength validator checks the length of sentences in input doucment. If the length of the sentence is over the specified maximum length, the validator returns the warning.
Property | Default Value | Description |
---|---|---|
"max_len" |
50 | Maximum length of sentence. |
InvalidExpression¶
InvalidExpression validator checks if input sentences contains invalid expressions (words or phrases). If the input sentence contains invalid expressions, this validaor retuns the warning.
Property | Default Value | Description |
---|---|---|
"dict" |
None | File name of dictionary. |
"list" |
None | List of invalid expression split by comma. |
The dictionary is a set of words or exressions. The following is the example of the dictionary.
like
you know
hey
kidding
why is hell
...
InvalidWord¶
InvalidWord validator checks if input sentences contains invalid words. If the input sentence contains invalid words, this validaor retuns the warning.
Property | Default Value | Description |
---|---|---|
"dict" |
None | File name of dictionary. |
"list" |
None | List of invalid expression split by comma. |
The dictionary is a set of words. The following is the example of the dictionary.
like
hey
wow
...
SpaceAfterPeriod¶
SpaceAfterPeriod validator checks if there is a white space after the end of input sentences (except for the last sentence of paragraph). If the input sentence does not contain the white space returns the wanring.
CommaNumber¶
CommaNumber validator checks the number of commas.
Property | Default Value | Description |
---|---|---|
"max_num" |
4 | Maximum number of commas in a sentence. |
WordNumber¶
WordNumber validator checks the number of word in one setnece.
Property | Default Value | Description |
---|---|---|
"max_num" |
50 | Maximum number of words in a sentence. |
SuggestExpression¶
SuggestExpression validator works the sample as the InvalidExpression validator. If the input sentence contains invalid expressions, this validaor retuns the warning and suggest the correct expression.
Property | Default Value | Description |
---|---|---|
"dict" |
None | File name of dictionary. |
The dictionary is a TSV file with two columns. First column contains the invalid expression, and the second expression is for suggested expression.
SVM Support Vector Machine
LLVM Low Level Virtual Machine
...
InvalidSymbol¶
Some symbols or characters have the difference characters with the same role. For example question mark ”? (0x003F)” have another variation “?(0xFF1F)” in the unicode table. InvalidSymbol checks if input sentences contains invalid characters or symbols. We write the symbols and character settings into character setting file (char-table.xml). In the setting file, we write the symbols we should use in the document, and in addition the invalid symbols. The details of the character settings are described in the next section.
SpaceWithSymbol¶
Some symbols need space before or after them. For example, we add add space left brancket “(”. we add the setting in the character setting file (char-table.xml).
KatakanaEndHyphen¶
KatakanaEndHyphen validator checks the end hyphens of Katakana words in Japanese documents. Japanese Katakana words have variations in end hyphen. For example, “computer” is written in Katakana by “コンピュータ (without hyphen) ”, and “コンピューター (with hypen) ”. This validator check if Katakana words ending format is match the predefined standard. See JIS Z8301, G.6.2.2 b) G.3.
- a: Words of 3 characters or more can not have the end hyphen.
- b: Words of 2 characters or less can have the end hyphen.
- c: A compound word applies a and b for each component.
- d: In the cases from a to c, the length of a syllable which are represented as a hyphen, flip syllable, and stuffed syllable is 1 except for Youon.
KatakanaSpellCheck¶
KatakanaSpellCheck validator checks the Katakana words has the very similar words in the document. For example when there is a Katakana word “インデックス” and the variation “インデクス” in the same document, this validator returns the warning.
SectionLength¶
SectionLength validator checks the character number of input seciton.
Property | Default Value | Description |
---|---|---|
"max_num" |
1000 | Maximum number of words in a seciton. |
ParagraphNumber¶
ParagraphNumber validator checks the number of paragraph in one input section.
Property | Default Value | Description |
---|---|---|
max_num" |
5 | Maximum number of paragraphs in a seciton. |
ParagraphStartWith¶
ParagraphStartWith validator checks if the characters in the beggning of paragraphs follows the style.
Property | Default Value | Description |
---|---|---|
start_with |
” “ | Characters in the beggning of paragraphs. |
SpaceBetweenAlphabeticalWord¶
SpaceBetweenAlphabeticalWord validator checks if the alphabet words are surrounded with white spaces. This validator is used in Non-latin languages such as Japanese or Chrinese.
Contraction¶
Contraction validator throws a error when contractions are used in the documents in which more than half of verbs are written in non contracted form.
Spelling¶
Spelling validator throws a error if threre are spelling mistaks in the input documents. This validator works only in English documents.
DoubledWord¶
DoubledWord validator throws a error if a word is used more than once. For example a input document has a following sentence, the validator reports a error since good is used twice.
the good item is very good.
Property | Default Value | Description |
---|---|---|
"dict" |
None | File name of skip list dictionary. |
"list" |
None | List of skip words split by comma. |
SuccessiveWord¶
SuccessiveWord validator throws a error if a word is used in succession. For example a input document has a following sentence, the validator reports a error since is is used in succession.
the item is is very good.
DuplicatedSection¶
DuplicatedSection validator throws a error if there are section pairs which have almost the same contents.