Supported Validators¶
RedPen supports the following validators.
- SentenceLength
- InvalidExpression
- InvalidWord
- SpaceAfterPeriod
- CommaNumber
- WordNumber
- SuggestExpression
- InvalidSymbol
- SpaceWithSymbol
- KatakanaEndHyphen
- KatakanaSpellCheck
- SectionLength
- SpaceBetweenAlphabeticalWord
- ParagraphNumber
- ParagraphStartWith
- Contraction
- Spelling
- DoubledWord
- SuccessiveWord
- DuplicatedSection
- JapaneseStyle
SentenceLength¶
SentenceLength validator checks the length of sentences in the input document. If the length of the sentence is greater than the specified maximum length, the validator generates a warning.
Property | Default Value | Description |
---|---|---|
"max_len" |
50 | Maximum length of sentence. |
InvalidExpression¶
InvalidExpression validator checks if input sentences contain invalid expressions (words or phrases). If the input sentence contains invalid expressions, the validator generates a warning.
Property | Default Value | Description |
---|---|---|
"dict" |
None | File name of dictionary. |
"list" |
None | List of invalid expression split by comma. |
The dictionary is a set of words or expressions. The following is an example of a dictionary.
like
you know
hey
kidding
what the hell
...
InvalidWord¶
InvalidWord validator checks if input sentences contain invalid words. If the input sentence contains invalid words, the validator generates a warning.
Property | Default Value | Description |
---|---|---|
"dict" |
None | File name of dictionary. |
"list" |
None | List of invalid expression split by comma. |
The dictionary is a set of words. The following is an example of a dictionary.
like
hey
wow
...
SpaceAfterPeriod¶
SpaceAfterPeriod validator checks if there is a white space at the end of input sentences (except for the very last sentence of paragraph). If the input sentence does end with a white space, a warning is given.
CommaNumber¶
CommaNumber validator checks the number of commas in a sentence.
Property | Default Value | Description |
---|---|---|
"max_num" |
4 | Maximum number of commas in a sentence. |
WordNumber¶
WordNumber validator checks the number of words in one setnece.
Property | Default Value | Description |
---|---|---|
"max_num" |
50 | Maximum number of words in a sentence. |
SuggestExpression¶
SuggestExpression validator works in a similar way to the InvalidExpression validator. If the input sentence contains invalid expressions, this validator returns a warning suggesting the correct expression.
Property | Default Value | Description |
---|---|---|
"dict" |
None | File name of dictionary. |
The dictionary is a TSV file with two columns. First column contains the invalid expression, and the second column contains a suggested replacement expression.
SVM Support Vector Machine
LLVM Low Level Virtual Machine
...
InvalidSymbol¶
Some symbols or characters have alternate characters with the same role. For example question mark ”? (0x003F)” has another unicode variation “?(0xFF1F)”. InvalidSymbol checks if input sentences contains invalid characters or symbols. The symbols and character settings are entered into the character setting file (char-table.xml). In this file, we write the symbols we should use in the document and their invalid counterparts. The details of these settings is described in the next section.
SpaceWithSymbol¶
Some symbols need space before or after them. For example, if we want to ensure a space is added before a left parentheses “(”, we could add this preference to the character setting file (char-table.xml).
KatakanaEndHyphen¶
KatakanaEndHyphen validator checks the end hyphens of Katakana words in Japanese documents. Japanese Katakana words have variations in their end hyphen. For example, “computer” is written in Katakana as “コンピュータ” (without hyphen), and “コンピューター” (with hypen). This validator checks to ensure that Katakana words match the predefined standard. See JIS Z8301, G.6.2.2 b) G.3.
- a: Words of 3 characters or more cannot have an end hyphen.
- b: Words of 2 characters or less can have an end hyphen.
- c: A compound word should apply a and b to each component word.
- d: In the cases from a to c, the length of a syllable which is represented by a hyphen is 1 except for Youon.
KatakanaSpellCheck¶
KatakanaSpellCheck validator checks if Katakana words have very similar words with different spellings in the document. For example, if the Katakana word “インデックス” and the variation “インデクス” exist within the same document, this validator will return a warning.
SectionLength¶
SectionLength validator checks the maximum number of words allowed in an section.
Property | Default Value | Description |
---|---|---|
"max_num" |
1000 | Maximum number of words in a section. |
ParagraphNumber¶
ParagraphNumber validator checks the maximum number of paragraphs allowed in one section.
Property | Default Value | Description |
---|---|---|
max_num" |
5 | Maximum number of paragraphs in a seciton. |
ParagraphStartWith¶
ParagraphStartWith validator checks to see if the characters at the beginning of paragraphs conforms to the correct style.
Property | Default Value | Description |
---|---|---|
start_with |
” “ | Characters in the beginning of paragraphs. |
SpaceBetweenAlphabeticalWord¶
SpaceBetweenAlphabeticalWord validator checks that alphabetic words are surrounded with whitespace. This validator is used in non-latin languages such as Japanese or Chinese.
Contraction¶
Contraction validator throws an error when contractions are used in a document in which more than half of the verbs are written in non-contracted form.
Spelling¶
Spelling validator throws an error if there are spelling mistakes in the input documents. This validator only works for English documents.
DoubledWord¶
DoubledWord validator throws an error if a word is used more than once in a sentence. For example, if an input document contains the following sentence, the validator will report an error since good is used twice.
this good item is very good.
Property | Default Value | Description |
---|---|---|
"dict" |
None | File name of skip list dictionary. |
"list" |
None | List of skip words split by comma. |
SuccessiveWord¶
SuccessiveWord validator throws an error if the same word is used twice in succession. For example, if an input document contains the following sentence, the validator will report an error since is is used twice in succession.
the item is is very good.
DuplicatedSection¶
DuplicatedSection validator throws an error if there are section pairs which have almost the same content.
JapaneseStyle¶
JapaneseStyle validator reports errors if the input file contains both “dearu” and “desu-masu” style.