Supported Validators¶
RedPen supports the following validators.
- SentenceLength
- InvalidExpression
- InvalidWord
- SpaceBeginningOfSentence
- CommaNumber
- WordNumber
- SuggestExpression
- InvalidSymbol
- SymbolWithSpace
- KatakanaEndHyphen
- KatakanaSpellCheck
- SectionLength
- SpaceBetweenAlphabeticalWord
- ParagraphNumber
- ParagraphStartWith
- Contraction
- Spelling
- DoubledWord
- SuccessiveWord
- DuplicatedSection
- JapaneseStyle
- DoubleNegative
- FrequentSentenceStart
- UnexpandedAcronym
- WordFrequency
- Hyphenation
- NumberFormat
- ParenthesizedSentence
- WeakExpression
SentenceLength¶
SentenceLength validator checks the length of sentences in the input document. If the length of the sentence is greater than the specified maximum length, the validator generates a warning.
Properties¶
Property | Default Value | Description |
---|---|---|
max_len | 50 | Maximum length of sentence. |
Supported languages¶
SentenceLength can be applied to any language.
InvalidExpression¶
InvalidExpression validator checks if input sentences contain invalid expressions (words or phrases). If the input sentence contains invalid expressions, the validator generates a warning.
Properties¶
Property | Default Value | Description |
---|---|---|
dict | None | File name of dictionary. |
list | None | Comma-separated list of invalid expressions. |
The dictionary is a set of words or expressions. The following is an example of a dictionary.
like
you know
hey
kidding
what the hell
...
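As a rough illustration of the check described above, the following sketch flags every dictionary entry that occurs in a sentence. The function name `find_invalid_expressions` is hypothetical, and case-insensitive substring matching is an assumption; RedPen's actual matching rules may differ.

```python
def find_invalid_expressions(sentence, expressions):
    """Return the dictionary entries that occur in the sentence."""
    lowered = sentence.lower()
    # Assumption: plain case-insensitive substring matching.
    return [e for e in expressions if e.lower() in lowered]


dictionary = ["like", "you know", "hey", "kidding", "what the hell"]
print(find_invalid_expressions("You know, I was just kidding.", dictionary))
```

Each returned entry would correspond to one warning.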
Supported languages¶
InvalidExpression can be applied to any language.
InvalidWord¶
InvalidWord validator checks if input sentences contain invalid words. If the input sentence contains invalid words, the validator generates a warning.
Properties¶
Property | Default Value | Description |
---|---|---|
dict | None | File name of dictionary. |
list | None | Comma-separated list of invalid words. |
The dictionary is a set of words. The following is an example of a dictionary.
like
hey
wow
...
Supported Languages¶
InvalidWord can be applied to any language, but the default dictionaries are supplied only for English and Japanese.
SpaceBeginningOfSentence¶
SpaceBeginningOfSentence validator checks whether there is white space between sentences, that is, at the end of each input sentence except the last sentence of a paragraph. If a sentence is not followed by white space, a warning is given.
Supported languages¶
SpaceBeginningOfSentence can be applied to any language.
CommaNumber¶
CommaNumber validator checks the number of commas in a sentence.
Properties¶
Property | Default Value | Description |
---|---|---|
max_num | 4 | Maximum number of commas in a sentence. |
Supported languages¶
CommaNumber can be applied to any language.
WordNumber¶
WordNumber validator checks the number of words in one sentence.
Properties¶
Property | Default Value | Description |
---|---|---|
max_num | 50 | Maximum number of words in a sentence. |
Supported languages¶
WordNumber can be applied to any language except for some Asian languages (such as Chinese or Thai), since RedPen does not have a tokenizer for those languages.
SuggestExpression¶
SuggestExpression validator works in a similar way to the InvalidExpression validator. If the input sentence contains invalid expressions, this validator returns a warning suggesting the correct expression.
Properties¶
Property | Default Value | Description |
---|---|---|
dict | None | File name of dictionary. |
The dictionary is a TSV file with two columns. The first column contains the invalid expression, and the second column contains a suggested replacement expression.
SVM Support Vector Machine
LLVM Low Level Virtual Machine
...
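The two-column dictionary format can be illustrated with a short sketch. It assumes the columns are separated by a single tab character and that expressions are matched as plain substrings; the function names `load_suggestions` and `suggest` are hypothetical and are not part of RedPen.

```python
def load_suggestions(tsv_text):
    """Parse 'invalid<TAB>suggested' lines into a dict."""
    suggestions = {}
    for line in tsv_text.splitlines():
        parts = line.split("\t", 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        invalid, suggested = parts
        suggestions[invalid.strip()] = suggested.strip()
    return suggestions


def suggest(sentence, suggestions):
    """Return (invalid, suggested) pairs for entries found in the sentence."""
    # Assumption: case-sensitive substring matching.
    return [(bad, good) for bad, good in suggestions.items() if bad in sentence]


table = load_suggestions("SVM\tSupport Vector Machine\nLLVM\tLow Level Virtual Machine\n")
print(suggest("We trained an SVM on the data.", table))
```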
Supported languages¶
SuggestExpression can be applied to any language, but the default dictionaries are provided only for English and Japanese.
InvalidSymbol¶
Some symbols or characters have alternate characters with the same role. For example, the question mark “? (0x003F)” has another Unicode variation, “？ (0xFF1F)”. InvalidSymbol checks whether input sentences contain invalid characters or symbols. The symbol and character settings are written in the character setting file (char-table.xml). In this file, we write the symbols that should be used in the document and their invalid counterparts. The details of these settings are described in the next section.
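The idea of pairing each preferred symbol with its invalid counterparts can be sketched as a simple lookup. The `PREFERRED` dict here is a hypothetical stand-in for the character setting file and does not reflect its actual format; `find_invalid_symbols` is likewise an illustrative name.

```python
# Hypothetical mapping: invalid variant -> symbol that should be used instead.
PREFERRED = {
    "\uFF1F": "?",  # fullwidth question mark
    "\uFF01": "!",  # fullwidth exclamation mark
}


def find_invalid_symbols(sentence, preferred=PREFERRED):
    """Return (position, invalid_char, preferred_char) for each invalid symbol."""
    return [(i, ch, preferred[ch]) for i, ch in enumerate(sentence) if ch in preferred]


print(find_invalid_symbols("Is this ok\uFF1F"))
```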
Supported languages¶
InvalidSymbol works for any language. See the settings of symbols in the Configuration page.
SymbolWithSpace¶
Some symbols need a space before or after them. For example, if we want to ensure a space is added before a left parenthesis “(”, we could add this preference to the character setting file (char-table.xml).
Supported languages¶
SymbolWithSpace works for any language.
KatakanaEndHyphen¶
KatakanaEndHyphen validator checks the end hyphens of Katakana words in Japanese documents. Japanese Katakana words have variations in their end hyphen. For example, “computer” is written in Katakana either as “コンピュータ” (without hyphen) or as “コンピューター” (with hyphen). This validator checks that Katakana words match the predefined standard. See JIS Z8301, G.6.2.2 b) G.3.
- a: Words of 3 characters or more cannot have an end hyphen.
- b: Words of 2 characters or less can have an end hyphen.
- c: A compound word should apply a and b to each component word.
- d: In the cases from a to c, the length of a syllable which is represented by a hyphen is 1 except for Youon.
Supported languages¶
KatakanaEndHyphen works only for Japanese texts.
KatakanaSpellCheck¶
KatakanaSpellCheck validator checks if Katakana words have very similar words with different spellings in the document. For example, if the Katakana word “インデックス” and the variation “インデクス” exist within the same document, this validator will return a warning.
Properties¶
Property | Default Value | Description |
---|---|---|
dict | None | Path to a user dictionary containing a skip list of Katakana words. |
min_ratio | 0.2 | Minimum similarity threshold. KatakanaSpellCheck reports an error when the similarity of a pair of words is more than min_ratio. |
min_freq | 5 | Minimum word frequency threshold. KatakanaSpellCheck checks words whose frequency is less than min_freq. |
Supported languages¶
KatakanaSpellCheck works only for Japanese texts.
SectionLength¶
SectionLength validator checks the maximum number of words allowed in a section.
Properties¶
Property | Default Value | Description |
---|---|---|
max_num | 1000 | Maximum number of words in a section. |
Supported languages¶
SectionLength works for any language.
ParagraphNumber¶
ParagraphNumber validator checks the maximum number of paragraphs allowed in one section.
Properties¶
Property | Default Value | Description |
---|---|---|
max_num | 5 | Maximum number of paragraphs in a section. |
Supported languages¶
ParagraphNumber works for any language.
ParagraphStartWith¶
ParagraphStartWith validator checks whether the characters at the beginning of paragraphs conform to the correct style.
Properties¶
Property | Default Value | Description |
---|---|---|
start_with | “ ” | Characters at the beginning of paragraphs. |
Supported languages¶
ParagraphStartWith works for any language.
SpaceBetweenAlphabeticalWord¶
SpaceBetweenAlphabeticalWord validator checks that alphabetic words are surrounded by white space. This validator is used for non-Latin languages such as Japanese or Chinese.
Supported languages¶
SpaceBetweenAlphabeticalWord works for languages whose words are not separated by white space, such as Japanese or Chinese.
Contraction¶
Contraction validator throws an error when contractions are used in a document in which more than half of the verbs are written in non-contracted form.
Supported languages¶
Contraction works only for English texts.
Spelling¶
Spelling validator throws an error if there are spelling mistakes in the input documents. This validator only works for English documents.
Supported languages¶
Spelling works only for English texts.
DoubledWord¶
DoubledWord validator throws an error if a word is used more than once in a sentence. For example, if an input document contains the following sentence, the validator will report an error since good is used twice.
this good item is very good.
Properties¶
Property | Default Value | Description |
---|---|---|
dict | None | File name of skip list dictionary. |
list | None | Comma-separated list of words to skip. |
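A minimal sketch of the doubled-word check, including a skip list, might look like the following. Splitting on white space, stripping surrounding punctuation, and lowercasing are simplifying assumptions; RedPen relies on its own tokenizers, and `find_doubled_words` is a hypothetical name.

```python
from collections import Counter

PUNCTUATION = ".,;:!?\"'()"


def find_doubled_words(sentence, skip_words=frozenset()):
    """Return the words that occur more than once in the sentence."""
    words = [w.strip(PUNCTUATION).lower() for w in sentence.split()]
    counts = Counter(w for w in words if w and w not in skip_words)
    return sorted(w for w, n in counts.items() if n > 1)


print(find_doubled_words("this good item is very good."))
print(find_doubled_words("this good item is very good.", skip_words={"good"}))
```

The second call shows the effect of a skip list: words listed there are never reported.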
Supported languages¶
DoubledWord works for any language except for Chinese or other Asian languages. Note that the default dictionaries are supplied for Japanese and English.
SuccessiveWord¶
SuccessiveWord validator throws an error if the same word is used twice in succession. For example, if an input document contains the following sentence, the validator will report an error since is is used twice in succession.
the item is is very good.
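The difference from DoubledWord is that only adjacent repetitions count. A small sketch, again assuming naive white-space tokenization and case-insensitive comparison (the name `find_successive_words` is hypothetical):

```python
PUNCTUATION = ".,;:!?\"'()"


def find_successive_words(sentence):
    """Return each word that immediately repeats the word before it."""
    words = [w.strip(PUNCTUATION).lower() for w in sentence.split()]
    # Compare every word with its predecessor.
    return [cur for prev, cur in zip(words, words[1:]) if cur and cur == prev]


print(find_successive_words("the item is is very good."))
```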
Supported languages¶
SuccessiveWord works for any language except for Chinese or other Asian languages.
DuplicatedSection¶
DuplicatedSection validator throws an error if there are section pairs which have almost the same content.
Supported languages¶
DuplicatedSection works for any language.
JapaneseStyle¶
JapaneseStyle validator reports errors if the input file contains both “dearu” and “desu-masu” style.
Supported languages¶
JapaneseStyle works only for Japanese texts.
DoubleNegative¶
DoubleNegative validator reports an error when an input sentence contains a double negative expression.
Supported languages¶
DoubleNegative works only for English and Japanese texts.
FrequentSentenceStart¶
This validator reports an error if too many sentences start with the same sequence of words.
Properties¶
Property | Default Value | Description |
---|---|---|
leading_word_limit | 3 | Number of words starting each sentence to consider. |
percentage_threshold | 25 | Maximum percentage of sentences that can start with the same words. |
min_sentence_count | 5 | Minimum number of sentences required for the validator to report errors. |
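How the three properties interact can be sketched as follows. This is an illustration under stated assumptions, not RedPen's implementation: it counts every leading sequence of 1 to `leading_word_limit` words, compares case-insensitively, and treats the threshold as exclusive. The function name is hypothetical.

```python
from collections import Counter


def frequent_sentence_starts(sentences, leading_word_limit=3,
                             percentage_threshold=25, min_sentence_count=5):
    """Return {leading words: percentage} for starts used too often."""
    if len(sentences) < min_sentence_count:
        return {}  # too few sentences to judge
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        # Count each leading sequence of 1..leading_word_limit words.
        for n in range(1, min(leading_word_limit, len(words)) + 1):
            counts[" ".join(words[:n])] += 1
    total = len(sentences)
    return {start: 100 * n / total for start, n in counts.items()
            if 100 * n / total > percentage_threshold}


sample = ["The cat sat.", "The dog ran.", "The bird flew.", "A fish swam.", "It rained."]
print(frequent_sentence_starts(sample))
```

Here three of five sentences start with "the", which is 60% and above the default threshold of 25.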
Supported languages¶
FrequentSentenceStart works for any language.
UnexpandedAcronym¶
This validator ensures that there are candidates for expanded versions of acronyms somewhere in the document.
That is, if there exists an acronym ABC in the document, then there must also exist a sequence of capitalized words such as Axxx Bxx Cxxx.
Properties¶
Property | Default Value | Description |
---|---|---|
min_acronym_length | 3 | Minimum size for the acronym. |
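The acronym rule can be sketched roughly as follows. The sketch treats any all-uppercase word of at least `min_acronym_length` letters as an acronym, and any run of consecutive capitalized words as a candidate expansion; it ignores punctuation between words, which is a simplification. The function name `unexpanded_acronyms` is hypothetical.

```python
import re


def unexpanded_acronyms(text, min_acronym_length=3):
    """Return acronyms with no matching run of capitalized words in the text."""
    words = re.findall(r"[A-Za-z]+", text)
    acronyms = {w for w in words if w.isupper() and len(w) >= min_acronym_length}
    # Collect the initials of each run of capitalized (not all-caps) words.
    runs, current = [], ""
    for w in words:
        if w[0].isupper() and not w.isupper():
            current += w[0]
        else:
            if current:
                runs.append(current)
            current = ""
    if current:
        runs.append(current)
    return sorted(a for a in acronyms if not any(a in run for run in runs))


text = "A Support Vector Machine (SVM) is a classifier. LLVM is a compiler toolkit."
print(unexpanded_acronyms(text))
```

In this sample, SVM is covered by "Support Vector Machine", while LLVM has no expansion and is reported.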
Supported languages¶
UnexpandedAcronym works only for English texts.
WordFrequency¶
This validator ensures that specific words in the document do not occur too frequently. It calculates how frequently each word is used and compares the result to a reference histogram of word frequency for written English.
Excessive deviation from normal usage generates a validation error.
Properties¶
Property | Default Value | Description |
---|---|---|
deviation_factor | 3 | Permitted factor of deviation from the norm. So if a word is normally used 3% of the time, your document can use it up to 9% of the time. |
min_word_count | 200 | Minimum number of words in a document before this validator starts to validate. |
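The arithmetic behind `deviation_factor` can be made concrete with a short sketch. The reference percentages passed in here are made-up illustrative values, not RedPen's actual histogram, and the function name `overused_words` is hypothetical.

```python
from collections import Counter


def overused_words(words, reference_percentages, deviation_factor=3,
                   min_word_count=200):
    """Return {word: percentage} for words used more than factor x the norm."""
    if len(words) < min_word_count:
        return {}  # document too short to validate
    counts = Counter(w.lower() for w in words)
    total = len(words)
    overused = {}
    for word, normal_pct in reference_percentages.items():
        actual_pct = 100 * counts[word] / total
        # A word normally used 3% of the time may be used up to 9%.
        if actual_pct > normal_pct * deviation_factor:
            overused[word] = actual_pct
    return overused


document = ["the"] * 30 + ["cat"] * 170
print(overused_words(document, {"the": 3.0, "cat": 90.0}))
```

In this toy document "the" makes up 15% of the words, which exceeds the permitted 9%.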
Supported languages¶
WordFrequency works only for English texts.
Hyphenation¶
This validator ensures that sequences of words that are hyphenated in the dictionary are hyphenated in your document.
Supported languages¶
Hyphenation works only for English texts.
NumberFormat¶
This validator ensures that numbers in a sentence are formatted using commas (e.g., 12,000 instead of 12000), and don’t have excessive decimal points.
Properties¶
Property | Default Value | Description |
---|---|---|
decimal_delimiter_is_comma | false | Change the decimal delimiter from . to , (as in Europe). |
ignore_years | false | Ignore 4-digit integers (2015, 1998). |
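A simplified sketch of the comma check and the `ignore_years` property follows. It assumes that any integer part of four or more digits without commas should be flagged, and it does not model `decimal_delimiter_is_comma` or the decimal-point check; the exact thresholds RedPen uses may differ. The function name is hypothetical.

```python
import re

# Digits, optionally grouped by commas, with an optional decimal part.
NUMBER = re.compile(r"\d+(?:,\d+)*(?:\.\d+)?")


def unformatted_numbers(sentence, ignore_years=False):
    """Return numbers whose integer part has 4+ digits and no commas."""
    flagged = []
    for match in NUMBER.finditer(sentence):
        token = match.group()
        integer_part = token.split(".")[0]
        if "," in integer_part or len(integer_part) < 4:
            continue  # already delimited, or too short to need commas
        if ignore_years and token.isdigit() and len(token) == 4:
            continue  # treat plain 4-digit integers as years
        flagged.append(token)
    return flagged


sentence = "The budget grew from 12,000 to 120000 in 2015."
print(unformatted_numbers(sentence))
print(unformatted_numbers(sentence, ignore_years=True))
```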
Supported languages¶
NumberFormat works for texts written in European languages such as English or French.
ParenthesizedSentence¶
This validator generates errors if parenthesized sentences (such as this) are used too frequently, or are nested too heavily.
Properties¶
Property | Default Value | Description |
---|---|---|
max_nesting_level | 2 | The limit on how deeply parenthesized expressions may be nested. |
max_count | 1 | The number of parenthesized expressions allowed. |
max_length | 4 | The maximum number of words in a parenthesized expression. |
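The three limits can be illustrated with a single pass over the characters of a sentence. This sketch assumes each limit is inclusive (a value equal to the limit is accepted) and that `max_count` applies per sentence; neither is confirmed by the text above. The function name `check_parentheses` is hypothetical.

```python
def check_parentheses(sentence, max_nesting_level=2, max_count=1, max_length=4):
    """Return the list of limits ('count', 'nesting', 'length') exceeded."""
    open_positions = []  # stack of indexes of unmatched "("
    count = 0
    deepest = 0
    too_long = False
    for i, ch in enumerate(sentence):
        if ch == "(":
            open_positions.append(i)
            count += 1
            deepest = max(deepest, len(open_positions))
        elif ch == ")" and open_positions:
            start = open_positions.pop()
            # Count the words between the matching parentheses.
            if len(sentence[start + 1:i].split()) > max_length:
                too_long = True
    problems = []
    if count > max_count:
        problems.append("count")
    if deepest > max_nesting_level:
        problems.append("nesting")
    if too_long:
        problems.append("length")
    return problems


print(check_parentheses("This validator (such as this) works."))
print(check_parentheses("a (b (c (d)))"))
```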
Supported languages¶
ParenthesizedSentence works only for texts written in European languages.