Allow escaping of non-printable characters in CSV output/input #124

@hgschmie

We have a use case where we are replacing an existing CSV writer (based on Apache Commons CSV) with Jackson. The old CSV writer was configured to write "special characters" such as CR and LF as \r and \n. Jackson does not support this; it adheres to RFC 4180, which defines no escaping. This causes a lot of pain for our customers, as the data we write often contains CR and LF characters.

Here is some test code:

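import java.io.PrintWriter;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.dataformat.csv.CsvFactory;
import com.fasterxml.jackson.dataformat.csv.CsvGenerator;
import com.fasterxml.jackson.dataformat.csv.CsvSchema;

public class CsvTest {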

    public static final void main(String... args) throws Exception {

        CsvSchema schema = CsvSchema.emptySchema().withEscapeChar('\\');
        CsvFactory csvFactory = new SchemaAwareCsvFactory(schema);

        ObjectMapper csvMapper = new ObjectMapper(csvFactory);

        String [] line = new String [] { "a", "\n", "\"", ","};

        csvMapper.writeValue(new PrintWriter(System.out), line);

    }

    // Is there actually a better way to set the schema for an ObjectMapper? This seems painful.
    public static class SchemaAwareCsvFactory extends CsvFactory {
        SchemaAwareCsvFactory(CsvSchema schema) {
            super();
            this._schema = schema;
        }
    }
}

Which produces

a,"
","""",","

I can get it to produce

"a","
","\"",","

by adding

csvFactory.enable(CsvGenerator.Feature.ALWAYS_QUOTE_STRINGS);
csvFactory.enable(CsvGenerator.Feature.ESCAPE_QUOTE_CHAR_WITH_ESCAPE_CHAR);

But what I am actually looking for is

"a","\n","\"",","

There seems to be no way to get the generator to write control characters in escaped form (and probably no way to get the parser to read them back). Having CR and LF within quotes is legal from the RFC 4180 PoV; however, most of the CSV that our systems produce gets parsed by legacy ("brain dead") tools that assume that every LF is a record separator.
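
To make the requested behavior concrete, here is a minimal sketch of the transformation in plain Java, independent of Jackson. The class `ControlCharEscaper` and its methods are hypothetical names, not part of Jackson or Commons CSV. The sketch assumes that backslash is the escape character, that every field is quoted, and that the backslash itself is also escaped so the output can be read back unambiguously.

```java
import java.util.StringJoiner;

// Hypothetical helper for illustration only; not an existing Jackson or Commons CSV API.
final class ControlCharEscaper {

    private ControlCharEscaper() {}

    /** Replaces CR, LF, the quote character and the backslash with backslash sequences. */
    static String escape(String value) {
        StringBuilder sb = new StringBuilder(value.length() + 8);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '\r': sb.append("\\r"); break;
                case '\n': sb.append("\\n"); break;
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break; // assumption: escape the escape char too
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Reverses escape(): turns \r, \n, \" and \\ back into the original characters. */
    static String unescape(String value) {
        StringBuilder sb = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == '\\' && i + 1 < value.length()) {
                char next = value.charAt(++i);
                switch (next) {
                    case 'r': sb.append('\r'); break;
                    case 'n': sb.append('\n'); break;
                    default:  sb.append(next); // covers \" and \\
                }
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Formats one record with every field quoted and escaped, without a line terminator. */
    static String formatRow(String[] fields) {
        StringJoiner row = new StringJoiner(",");
        for (String field : fields) {
            row.add('"' + escape(field) + '"');
        }
        return row.toString();
    }

    public static void main(String[] args) {
        // Same test row as in the issue: prints "a","\n","\"",","
        System.out.println(formatRow(new String[] {"a", "\n", "\"", ","}));
    }
}
```

With this scheme a record never contains a raw CR or LF, so a tool that splits on every LF still sees exactly one line per record.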

Apache Commons CSV has a nice summary on their Javadoc page for CSVFormat: https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVFormat.html#DEFAULT (and below)

(god, CSV is such a mess. And that is the standard format for enterprise data???)
