Querying Text Files

Shouldn't we be using ASCII 31 as a delimiter for text fields instead of a comma or a tab? The non-printable control characters ASCII 28, 29, 30 and 31 make it easy to write and read files without worrying much about escaping, quoting and the other restrictions that come with printable delimiters. These four were designed for this very purpose:

28: File Separator
29: Group Separator
30: Record Separator
31: Unit Separator
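
As a rough illustration, here is a minimal Python sketch of a round trip that uses ASCII 31 between fields and ASCII 30 between records. The helper names `dump_records` and `load_records` are made up for this example, and the sketch assumes the field values never contain the separator characters themselves (it raises an error if they do).

```python
US = "\x1f"  # ASCII 31, Unit Separator: placed between fields
RS = "\x1e"  # ASCII 30, Record Separator: placed between records


def dump_records(records):
    """Join fields with US and records with RS, with no quoting or escaping."""
    for record in records:
        for field in record:
            # Assumption of this sketch: separators never occur in the data.
            if US in field or RS in field:
                raise ValueError("field contains a separator character")
    return RS.join(US.join(record) for record in records)


def load_records(text):
    """Split on RS, then on US, to recover the original fields."""
    if not text:
        return []
    return [record.split(US) for record in text.split(RS)]


# Fields with commas, tabs, newlines and quotes need no special handling.
rows = [
    ["1", "Hello, world", "line one\nline two"],
    ["2", "tab\there", 'quote "inside"'],
]
encoded = dump_records(rows)
decoded = load_records(encoded)
```

Because the delimiters are characters that ordinary text almost never contains, the reader is just two `split` calls. The trade-off is that most editors do not display these characters, so such files are harder to inspect by hand.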

Naive parsing of comma- and tab-delimited files breaks when a text field itself contains a comma, a tab or a newline. Even though every file becomes a database when AWK'ed hard enough, AWK'ing becomes a lot easier on text files that use ASCII delimiters 28-31. The same goes for Base64-encoded binary data, since the Base64 alphabet contains none of these control characters.

As far as query planning and indexing go, I'd rather expand queries with synonyms, optional stemming, etc. than run them through the same tokenizer/filter I used for indexing. This avoids rebuilding my index whenever those rules change and gives users better control over their queries.
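
The query-side expansion described above might look roughly like the following Python sketch. Everything in it is an assumption made for illustration: the `SYNONYMS` table, the crude suffix stripping in `naive_variants` (a stand-in for a real stemmer) and the whitespace tokenization are all hypothetical. The point is only that the stored text is left untouched and all normalization happens at query time.

```python
# Hypothetical synonym table; in practice this would be user-editable data.
SYNONYMS = {
    "car": {"automobile", "auto"},
    "fast": {"quick", "rapid"},
}


def naive_variants(term):
    """Crude stand-in for stemming: also try the term without a common suffix."""
    variants = {term}
    for suffix in ("ing", "ed", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            variants.add(term[: -len(suffix)])
    return variants


def expand_query(query, use_synonyms=True, use_stemming=False):
    """Turn each query term into a sorted list of acceptable alternatives."""
    expanded = []
    for term in query.lower().split():
        alternatives = {term}
        if use_synonyms:
            alternatives |= SYNONYMS.get(term, set())
        if use_stemming:
            for alt in list(alternatives):
                alternatives |= naive_variants(alt)
        expanded.append(sorted(alternatives))
    return expanded


def matches(expanded, document):
    """A document matches if every term group has at least one hit (AND of ORs)."""
    tokens = set(document.lower().split())
    return all(tokens & set(group) for group in expanded)


with_synonyms = expand_query("fast car")
without_synonyms = expand_query("fast car", use_synonyms=False)
hit = matches(with_synonyms, "a quick automobile")
miss = matches(without_synonyms, "a quick automobile")
```

Since the expansion is switched on per query (`use_synonyms`, `use_stemming`), editing the synonym table or turning stemming off requires no reindexing, and the user decides how loosely each query should match.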