SAS formats with string values of length longer than 10 are parsed incorrectly #378

@hpoettker

Summary

ReadStat currently reads formats in SAS catalog files incorrectly when

  • the values of a format are strings,
  • and one of the string values is longer than 10 characters.

The code that reads such formats contains a hard-coded maximum of 16 characters, but it also reads 6 bytes of padding, so the effective maximum length is currently 10.

The problem also affects the write feature, which would need to be changed to align with the read feature.

Resolution

I'll describe the patterns I've observed in SAS catalog files for formats with string values below.

Adjusting the code along these lines allows me to read a large number of large SAS formats from real-world little-endian catalog files, both 32-bit and 64-bit.

As part of such a change, one could not only update the existing write feature for 32-bit files to match the read feature, but also add support for writing 64-bit catalog files to facilitate testing.

I'd be happy to open a PR. Please let me know if you're interested.

Hexdump examples

The SAS catalog files look slightly different depending on whether the longest string value of a format needs more than 16 bytes or not.

Consider the following SAS code:

PROC FORMAT LIBRARY=mylib;
  VALUE $fmt16c
    'One' = '1'
    'Two' = '2'
    'abcdeABCDEabcdeA' = '3'
  ;
  VALUE $fmt17c
    'One' = '1'
    'Two' = '2'
    'abcdeABCDEabcdeAB' = '3'
  ;
RUN;

In the little-endian 64-bit case, these are the relevant snippets of the hexdumps of the produced catalog file:

000050b0  00 00 00 00 00 00 10 00  10 00 01 00 01 00 00 01  |................|
000050c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000050d0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 04 00  |................|
000050e0  20 00 00 00 0a 00 02 b0  53 7f 00 00 00 00 00 00  | .......S.......|
000050f0  00 00 00 00 4f 6e 65 20  20 20 20 20 20 20 20 20  |....One         |
00005100  20 20 20 20 04 00 20 00  00 00 0a 00 02 b0 53 7f  |    .. .......S.|
00005110  00 00 01 00 00 00 00 00  00 00 54 77 6f 20 20 20  |..........Two   |
00005120  20 20 20 20 20 20 20 20  20 20 04 00 20 00 00 00  |          .. ...|
00005130  0a 00 02 b0 53 7f 00 00  02 00 00 00 00 00 00 00  |....S...........|
00005140  61 62 63 64 65 41 42 43  44 45 61 62 63 64 65 41  |abcdeABCDEabcdeA|
00005150  05 00 06 00 00 00 06 00  01 00 31 00 05 00 06 00  |..........1.....|
00005160  00 00 06 00 01 00 32 00  05 00 06 00 00 00 06 00  |......2.........|
00005170  01 00 33 00 00 00 00 00  00 00 00 00 00 00 00 00  |..3.............|
...
000052b0  00 00 11 00 11 00 01 00  01 00 00 01 01 00 00 00  |................|
000052c0  00 00 04 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000052d0  00 00 00 00 00 00 00 00  00 00 04 00 1a 00 00 00  |................|
000052e0  03 00 03 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000052f0  00 4f 6e 65 00 00 00 00  00 00 04 00 1a 00 00 00  |.One............|
00005300  03 00 03 00 00 00 00 00  01 00 00 00 00 00 00 00  |................|
00005310  00 54 77 6f 00 00 00 00  00 00 04 00 28 00 00 00  |.Two........(...|
00005320  03 00 11 00 00 00 00 00  02 00 00 00 00 00 00 00  |................|
00005330  00 61 62 63 64 65 41 42  43 44 45 61 62 63 64 65  |.abcdeABCDEabcde|
00005340  41 42 00 00 00 00 00 00  05 00 06 00 00 00 06 00  |AB..............|
00005350  01 00 31 00 05 00 06 00  00 00 06 00 01 00 32 00  |..1...........2.|
00005360  05 00 06 00 00 00 06 00  01 00 33 00 00 00 00 00  |..........3.....|

As usual, the corresponding 32-bit case looks similar but a bit denser.

My observations are that when a single string value of a format gets longer than 16 characters (as in $fmt17c above) then

  • a flag (or offset) for the format, at positions 0x50c0 / 0x52bc above, switches from 0 to 1,
  • the offset of the beginning of the string, counted from the 0x04 at positions 0x512a / 0x531a above, increases from 22 to 23, and this holds for all string values of the format, including the short ones,
  • the length of a string value, which is not given explicitly in the first format, is now provided at offset 8 from the same 0x04, e.g. at positions 0x52e2, 0x5302, 0x5322 for the second format,
  • the padding to the right of the string value switches from 0x20 (' ') to 0x00, again for all strings.
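
To make the observations above concrete, here is a minimal Python sketch that decodes the value records copied from the two hexdumps. It is an illustration only, not ReadStat code. The names parse_value_record and parse_values are hypothetical, the sketch covers only the little-endian 64-bit layout shown above, and it assumes that the 2 bytes following the 0x04 marker hold a 16-bit length such that a record spans 6 + that length bytes. That assumption matches the dumps but is an inference, as is the ASCII decoding.

```python
import struct

MARKER = 0x04    # every value record in the dumps starts with 04 00
PREFIX_LEN = 6   # assumption: record size = 6 + 16-bit field at offset 2


def parse_value_record(buf, pos, long_strings):
    """Decode one string-value record starting at the 0x04 marker at `pos`.

    `long_strings` is the per-format flag that switches from 0 to 1 once a
    value exceeds 16 characters. Returns (value, offset of the next record).
    """
    marker, counted_len = struct.unpack_from("<HH", buf, pos)
    if marker != MARKER:
        raise ValueError(f"expected 0x04 marker at offset {pos:#x}")
    end = pos + PREFIX_LEN + counted_len
    if end > len(buf):
        raise ValueError("record extends past the end of the buffer")
    if long_strings:
        # Explicit length at offset 8, string at offset 23, 0x00 padding after.
        (str_len,) = struct.unpack_from("<H", buf, pos + 8)
        start = pos + 23
        raw = buf[start:start + str_len]
    else:
        # No explicit length: string at offset 22, right-padded with 0x20.
        # Stripping also drops genuine trailing spaces of a value.
        raw = buf[pos + 22:end].rstrip(b" ")
    # Assumption: values are ASCII; real files may use another encoding.
    return raw.decode("ascii"), end


def parse_values(buf, count, long_strings):
    """Decode `count` consecutive value records from the start of `buf`."""
    values, pos = [], 0
    for _ in range(count):
        value, pos = parse_value_record(buf, pos, long_strings)
        values.append(value)
    return values


SP13 = "20" * 13  # thirteen bytes of 0x20 padding after a 3-character value

# Records of $fmt16c, copied from 0x50de..0x514f of the first dump.
FMT16C = bytes.fromhex(
    "0400 20000000 0a0002b0 537f0000 00000000 00000000 4f6e65" + SP13 +
    "0400 20000000 0a0002b0 537f0000 01000000 00000000 54776f" + SP13 +
    "0400 20000000 0a0002b0 537f0000 02000000 00000000"
    "6162636465 4142434445 6162636465 41"
)

# Records of $fmt17c, copied from 0x52da..0x5347 of the second dump.
FMT17C = bytes.fromhex(
    "0400 1a000000 0300 0300 00000000 00000000 00000000 00"
    "4f6e65 000000000000"
    "0400 1a000000 0300 0300 00000000 01000000 00000000 00"
    "54776f 000000000000"
    "0400 28000000 0300 1100 00000000 02000000 00000000 00"
    "6162636465 4142434445 6162636465 4142 000000000000"
)

print(parse_values(FMT16C, 3, long_strings=False))
print(parse_values(FMT17C, 3, long_strings=True))
```

The sketch reads the length from offset 8 when the flag is set rather than taking the last 16 bytes of the record, because in the second dump each value is followed by 6 bytes of 0x00 padding.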
