August 7, 2013
PC World reported on the findings of computer scientist David Kriesel, who last week posted a number of scans on his website after discovering that information had changed between the original documents and the scanned copies.
Kriesel used a Xerox WorkCentre 7556 to scan some building construction documents, with the scanned floorplan marking each room with its area in square metres – 14.13m², 21.11m² and 17.42m² – but on studying the scans, he discovered that the figures had changed from the original document. Having switched off optical character recognition, Kriesel knew “it wasn’t related to that”, and so he investigated further.
He soon found that “when scanned in TIFF mode, a pixel-for-pixel reproduction, the copy was identical to the original”, but when images were compressed, “things started getting weird”, with one image showing every room marked with 14.13m², another with two rooms marked 17.42m², and a third with two rooms labelled 14.13m².
Kriesel stated on his website that “there seems to be a correlation between font size, scan dpi used”, and that he “was able to reliably reproduce the error for 200 DPI PDF scans without OCR, of sheets with Arial 7pt and 8pt numbers”. After publicising his findings, he immediately began receiving emails from other Xerox users, and narrowed the issue down to the scanner’s JBIG2 image compression.
In order to “reduce file space”, the compression software looks for areas in an image “that are similar”, then “makes one compressed version and reuses it across all the similar areas”. Because the numbers were printed “in a small, fine font, the scanner apparently mistook them for identical and reused data resulting in figures for room area getting reproduced”.
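The substitution effect can be illustrated with a toy model. The Python sketch below is not Xerox's code or a JBIG2 implementation; it only mimics the general idea of storing one bitmap per group of similar-looking patches and reusing it. The 4x5 pixel digits, the pixel-difference measure and the `max_diff` threshold are all assumptions made up for the illustration.

```python
# Toy model of "store one patch, reuse it for similar patches".
# Hypothetical 4x5 bitmaps; real scans and real JBIG2 encoders are far more complex.
GLYPHS = {
    "1": ("0010", "0110", "0010", "0010", "0111"),
    "6": ("0110", "1000", "1110", "1001", "0110"),
    "8": ("0110", "1001", "0110", "1001", "0110"),
}


def pixel_diff(a, b):
    """Count the pixels that differ between two equally sized bitmaps."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))


def compress(patches, max_diff):
    """Replace each patch with a reference to the first stored patch that is similar enough."""
    dictionary, refs = [], []
    for patch in patches:
        for index, stored in enumerate(dictionary):
            if pixel_diff(patch, stored) <= max_diff:
                refs.append(index)  # reuse the stored patch, even if it is not identical
                break
        else:
            dictionary.append(patch)  # nothing similar yet, so store this patch
            refs.append(len(dictionary) - 1)
    return dictionary, refs


def decompress(dictionary, refs):
    """Rebuild the page from the stored patches."""
    return [dictionary[index] for index in refs]


def roundtrip(text, max_diff):
    """Render text as bitmaps, compress, decompress, and read the digits back."""
    patches = [GLYPHS[ch] for ch in text]
    restored = decompress(*compress(patches, max_diff))
    lookup = {bitmap: ch for ch, bitmap in GLYPHS.items()}
    return "".join(lookup[patch] for patch in restored)


print(roundtrip("1618", max_diff=0))  # 1618: only identical patches are merged
print(roundtrip("1618", max_diff=2))  # 1616: the "8" is replaced by the stored "6"
```

In this toy, the "6" and "8" bitmaps differ by only two pixels, so a threshold of two treats them as the same patch and the output contains a clean, plausible, but wrong digit. That is consistent with the article's account that small, fine fonts were affected: the fewer pixels a character occupies, the fewer pixels separate it from a lookalike.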
In a statement responding to the claims, Xerox confirmed the JBIG2 software “as being at the root of the problem”, noting that it “stems from a combination of compression level and resolution setting”, and that the machines have been warning users “for years” that this could occur. The OEM continued by stating: “The devices mentioned are shipped from the factory with a compression level and resolution that produces scanned files which are optimized for viewing or printing while maintaining a reasonable file size.
“We do not normally see a character substitution issue with the factory default settings however, the defect may be seen at lower quality and resolution settings. For data integrity purposes, we recommend the use of the factory defaults with a quality level set to ‘higher’.”