Chinese word-segmented writing, or Chinese word-separated writing, is a style of written Chinese in which spaces are placed between words, as in written English. Chinese sentences are traditionally written as unbroken strings of characters, with no marks between words. Word segmentation according to context, whether done consciously or unconsciously, is therefore left to the reader.
Word-segmented writing has several advantages. An important one is that it resolves ambiguous texts for which only the author knows the intended meaning and the correct segmentation. For example, "美國會不同意。" (simplified: "美国会不同意。") may mean "美國 會 不同意。" / "美国 会 不同意。" (The US will not agree.) or "美 國會 不同意。" / "美 国会 不同意。" (The US Congress does not agree.).
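The ambiguity can be made concrete with a small sketch that lists every way an unsegmented string can be split into dictionary words. The function name `all_segmentations` and the five-entry `TOY_LEXICON` are assumptions made for illustration; they do not come from any real dictionary or library.

```python
def all_segmentations(text, lexicon):
    """Return every way to split `text` into words that all appear in `lexicon`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words at all
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Keep this word and segment the remainder recursively.
            for rest in all_segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results


# Hypothetical toy lexicon, just large enough to show the ambiguity.
TOY_LEXICON = {"美", "美国", "国会", "会", "不同意"}

readings = all_segmentations("美国会不同意", TOY_LEXICON)
for words in readings:
    print(" ".join(words))
```

With this lexicon the string has exactly two readings, "美 国会 不同意" and "美国 会 不同意". Nothing in the character string itself says which one the author meant, which is the gap that word-segmented writing closes.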
In ancient China, texts were written without punctuation marks, so readers had to spend considerable time finding sentence boundaries. The present punctuation marks were not adopted until the early 1900s.
In the 1950s, a proposal to adopt word-segmented writing was raised in a discussion among Chinese linguists, but it was not passed.
In 1987, the idea of Chinese word-segmented writing was put forward again by Chen Liwei at an international conference on Chinese information processing.
Chinese word-segmented writing was first put into practice no later than 1998, when a paper entitled "Written Chinese Word Segmentation Revisited: Ten advantages of word-segmented writing" was published in a key academic journal in China. The whole paper, seven pages in all, was written with word segmentation, with the abstract presented as: 摘要: 分词 的 引入 对 现代 汉语 的 运用、研究 和 计算机 信息 处理 等 都 具有 相当 重要 的 意义。本文 阐述 书面 汉语 分词 连写 的 十 大 好处, 并 讨论 一些 实施 方面 的 问题。文章 全文 分词 连写。
In 2018, a one-paragraph article entitled "Word segmentation of Hanzi" was published on Wikiversity, with the Chinese text word-segmented as follows: 历史上，中国古文 是 没有 标点符号的。读者 需要 付出 额外的 精力 专注于 断句，而且 稍有差池 便会 造成 误读。所谓 差之毫厘 失之千里。引入 标点符号 是 一次 重大的 文字改革，使得 汉字文本 阅读效率 有了 很大的 提高。但 […]， 还未达到 尽善尽美的 程度。至少 在 阅读效率 方面 仍然 存在着 一个 显而易见的 障碍 - 断词 （汉字的 分词连写）。
The first book written with word segmentation was 语言理论 (Language Theories), published in 2000.
Chinese is usually written in Chinese characters, so Chinese word-segmented writing mainly concerns the segmentation of Chinese character text. Some methods and guidelines follow.
The most important purpose of word-segmented writing is to express the writer's intended meaning accurately and clearly. For example, the traditional unsegmented text "乒乓球拍卖完了。" has two possible meanings, which can be expressed in word-segmented writing as "乒乓 球拍 卖完了。" (Ping pong bats are sold out) and "乒乓球 拍卖 完了。" (The ping pong balls have been auctioned). The author chooses the segmentation that expresses the intended meaning without ambiguity.
A writer who is unsure whether a character string is a legitimate word can look it up in a reliable dictionary, such as Xiandai Hanyu Cidian or CEDICT, or judge whether it qualifies as a word on lexical, morphological, and syntactic grounds.
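The dictionary check described here can be sketched as a small helper that flags any token of a segmented sentence that is missing from a word list. The name `unlisted_words` and the six-entry `TOY_WORDLIST` are hypothetical; in practice the word list would be loaded from a real dictionary, and a flagged token would still need the linguistic judgment mentioned above.

```python
# Chinese punctuation that may be attached to a token and should be ignored.
PUNCTUATION = "，。、；：？！"


def unlisted_words(segmented_sentence, wordlist):
    """Return the space-separated tokens that are not found in `wordlist`."""
    tokens = segmented_sentence.split()
    cleaned = [token.strip(PUNCTUATION) for token in tokens]
    return [token for token in cleaned if token and token not in wordlist]


# Hypothetical stand-in for a real dictionary's headword list.
TOY_WORDLIST = {"乒乓", "球拍", "卖完了", "乒乓球", "拍卖", "完了"}

print(unlisted_words("乒乓 球拍 卖完了。", TOY_WORDLIST))  # every token is listed
print(unlisted_words("乒乓球 拍 卖完了。", TOY_WORDLIST))  # "拍" is not in the toy list
```

An empty result means only that each token is a listed word; it does not show that the segmentation matches the intended meaning, since both readings of the ping pong sentence pass the check.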
In spoken language, there is usually a pause between two words, and no pause is allowed within a word, so it is natural to represent that pause with a space between words in written language.
Further methods for identifying word boundaries are described in the "Word boundaries" section of the article Word.
The space between two words should be half the width of a Chinese character, which is narrower than the distance between two lines. Because the average Chinese word is about two characters long, a full-width space, which is wider than the inter-line distance, would make the lines of words look scattered rather than compact.
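The effect of the space width can be illustrated with a rough column count. The sketch below assumes a monospace-style model in which wide and fullwidth characters take two columns and everything else takes one, uses the ASCII space (U+0020) as a stand-in for a half-width space, and uses the ideographic space (U+3000) as the full-width one. Real fonts and layout engines may differ, and `display_columns` is a hypothetical helper name.

```python
import unicodedata


def display_columns(text):
    """Approximate width in columns: wide/fullwidth characters count 2, others 1."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


words = ["乒乓", "球拍", "卖完了"]

unsegmented = "".join(words)
half_width = " ".join(words)        # U+0020, one column per space
full_width = "\u3000".join(words)   # U+3000 IDEOGRAPHIC SPACE, two columns per space

for label, line in [("none", unsegmented), ("half", half_width), ("full", full_width)]:
    print(label, display_columns(line))
```

Under this model the three lines occupy 14, 16, and 18 columns, so full-width spaces add twice as much empty room as half-width ones. With words averaging about two characters, that difference is what makes a line look scattered.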
To help the reader further, proper nouns should also be marked, for example by underlining. This is already done in the Holy Bible (Union Version with modern punctuation).
Pinyin is usually used to mark the pronunciation of Chinese characters, but in elementary Chinese teaching or teaching Chinese as a foreign language, Pinyin is sometimes used to express Chinese directly. Therefore, Pinyin writing is also a kind of Chinese writing, and it can also be an important reference for Chinese character word segmentation. "Basic Rules of Chinese Pinyin Orthography" is the Chinese national standard for Pinyin writing and word segmentation. Its main content "5. General rules" is excerpted as follows:
The general rules are
In addition to the general rules, there are specific rules for nouns, verbs, adjectives, pronouns, numerals, quantifiers, adverbs, prepositions, conjunctions, auxiliary words, interjections, onomatopoeias, idioms, sayings, as well as names of people and places.
Below is an example with a longer text from the Chinese version of the United Nations Universal Declaration of Human Rights:
Article 1 of the Universal Declaration of Human Rights in simplified Chinese characters: 人人生而自由，在尊严和权利上一律平等。他们赋有理性和良心，并应以兄弟关系的精神相对待。
The pinyin transcription can be word-segmented as: Rénrén shēng ér zìyóu, zài zūnyán hé quánlì shàng yīlǜ píngděng. Tāmen fùyǒu lǐxìng hé liángxīn, bìng yīng yǐ xiōngdì guānxì de jīngshén xiāng duìdài. Accordingly, the Chinese character text can be segmented as: 人人 生 而 自由，在 尊严 和 权利 上 一律 平等。他们 赋有 理性 和 良心，并 应 以 兄弟 关系 的 精神 相 对待。
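The step from segmented pinyin to segmented characters can be sketched as follows, assuming that each syllable corresponds to exactly one character and that the pinyin has already been split into words and syllables by hand. The nested-list input format and the function name `segment_by_pinyin` are assumptions for illustration, and punctuation is left out for simplicity.

```python
def segment_by_pinyin(hanzi, pinyin_words):
    """Split `hanzi` so each word has one character per syllable of its pinyin word.

    `pinyin_words` is a list of words, each given as a list of syllables.
    """
    total_syllables = sum(len(syllables) for syllables in pinyin_words)
    if total_syllables != len(hanzi):
        raise ValueError("syllable count does not match character count")

    words, position = [], 0
    for syllables in pinyin_words:
        # Take as many characters as this pinyin word has syllables.
        words.append(hanzi[position:position + len(syllables)])
        position += len(syllables)
    return words


# First clause of Article 1, with the pinyin pre-split into words and syllables.
clause = "人人生而自由"
clause_pinyin = [["rén", "rén"], ["shēng"], ["ér"], ["zì", "yóu"]]

print(" ".join(segment_by_pinyin(clause, clause_pinyin)))
```

The one-character-per-syllable assumption holds for this clause but not for every text, so the sketch raises an error when the counts disagree instead of guessing.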
Until word-segmented writing becomes widespread, language information processing often relies on computer-based word segmentation. Although the quality of such systems has improved over time, manual post-editing is still required.
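As a minimal illustration of computer-based segmentation, the sketch below uses forward maximum matching, a classic dictionary-based heuristic chosen here only as an example; the text above does not name a particular method. The six-entry `TOY_LEXICON` and the `max_word_len` parameter are assumptions.

```python
def forward_maximum_matching(text, lexicon, max_word_len=4):
    """Greedily take the longest lexicon word at each position, scanning left to right."""
    words, position = [], 0
    while position < len(text):
        longest = min(max_word_len, len(text) - position)
        for length in range(longest, 0, -1):
            candidate = text[position:position + length]
            # A single character is accepted as a fallback so the scan always advances.
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                position += length
                break
    return words


# Hypothetical toy lexicon covering both readings of the ping pong sentence.
TOY_LEXICON = {"乒乓", "乒乓球", "球拍", "拍卖", "卖完了", "完了"}

print(" ".join(forward_maximum_matching("乒乓球拍卖完了", TOY_LEXICON)))
```

The greedy scan returns "乒乓球 拍卖 完了" (the auction reading) even when the author meant that the bats are sold out. Only one of the two valid readings is ever produced, which is one reason manual post-editing remains necessary.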