How to Fix: Read file bytes difference.

If your program reads a file and the returned byte count does not match what you expect, the bug is usually not in the filesystem at all. It is almost always caused by how the file descriptor position changes between calls, how partial reads are handled, or how the test compares buffers after multiple low-level I/O operations.

Understanding the Root Cause

This issue typically appears when using POSIX APIs such as read, pread, readv, or duplicated file descriptors in a C test case. The important detail is that the file offset belongs to the open file description behind the descriptor, and every successful read() advances it by the number of bytes returned. If the next operation assumes the offset is still at the beginning of the file, the program will observe a byte difference, shortened reads, or mismatched buffer contents.

There are a few technical reasons this happens:

  • Sequential reads advance the cursor: calling read(fd, ...) twice does not read the same bytes twice unless you reposition with lseek().
  • Duplicated descriptors share state: a descriptor created with dup(), or inherited across fork(), refers to the same open file description as the original, so both handles share one offset. Two independent open() calls on the same path each get their own offset.
  • Partial reads are valid: a single read() is not guaranteed to fill the entire buffer, especially on pipes, sockets, procfs entries, or special files.
  • Buffer comparison mistakes: comparing the full allocated buffer instead of the actual returned byte count can produce false differences because unread bytes remain unchanged.
  • Text expectations on binary data: if the file contains \0 or other non-printable bytes, string functions like strcmp() or strlen() stop at the first null byte and will report misleading results. Assumptions about newline translation cause similar confusion.

In short, the root bug is usually a mismatch between expected file position and actual file position, combined with insufficient validation of the number of bytes truly read.
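
The first two causes above can be reproduced in a few lines. The following is a minimal sketch, not taken from any particular test suite: the helper names (make_scratch_file, offset_after_two_reads, offset_seen_after_dup_read) and the /tmp path template are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create a scratch file holding `data` and return a
 * descriptor positioned at offset 0, or -1 on failure. The file is unlinked
 * immediately, so it disappears once every descriptor is closed. */
int make_scratch_file(const char *data, size_t len) {
    char path[] = "/tmp/offset_demo_XXXXXX";
    assert(data != NULL);
    int fd = mkstemp(path);
    if (fd < 0) {
        return -1;
    }
    unlink(path);
    if (write(fd, data, len) != (ssize_t)len ||
        lseek(fd, 0, SEEK_SET) == (off_t)-1) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Two back-to-back read() calls on one descriptor: the second call continues
 * where the first one stopped. Returns the offset after both reads, or -1. */
off_t offset_after_two_reads(int fd, char first[4], char second[4]) {
    if (read(fd, first, 4) != 4) {
        return -1;
    }
    if (read(fd, second, 4) != 4) {
        return -1;
    }
    return lseek(fd, 0, SEEK_CUR);
}

/* Read 4 bytes through a dup()'d descriptor, then report the offset as seen
 * through the ORIGINAL descriptor. Returns that offset, or -1 on failure. */
off_t offset_seen_after_dup_read(int fd) {
    char tmp[4];
    int copy = dup(fd);
    if (copy < 0) {
        return -1;
    }
    ssize_t n = read(copy, tmp, sizeof tmp);
    close(copy);
    if (n != 4) {
        return -1;
    }
    return lseek(fd, 0, SEEK_CUR);
}
```

With a 12-byte file containing ABCDEFGHIJKL, the first buffer receives ABCD, the second receives EFGH, and the offset ends at 8. A further 4-byte read through the duplicate moves the offset seen by the original descriptor to 12, even though the original was never read from in that step.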

Step-by-Step Solution

The safest fix is to make every read operation explicit about offset, length, and result checking.

1. Always check the return value from read APIs

ssize_t n = read(fd, buf, buf_size);
if (n < 0) {
    perror("read");
    exit(EXIT_FAILURE);
}

A return of -1 means an error, with the cause in errno. A return of 0 means end of file. A positive value smaller than the requested size is a partial read, which must be handled explicitly.

2. Reset the file offset when re-reading the same file

if (lseek(fd, 0, SEEK_SET) == (off_t)-1) {
    perror("lseek");
    exit(EXIT_FAILURE);
}

ssize_t n = read(fd, buf, buf_size);
if (n < 0) {
    perror("read");
    exit(EXIT_FAILURE);
}

If the test expects the same bytes again, reposition the descriptor before the second read.

3. Prefer pread() when the test depends on a fixed offset

ssize_t n = pread(fd, buf, buf_size, 0);
if (n < 0) {
    perror("pread");
    exit(EXIT_FAILURE);
}

pread() reads from a specific offset without changing the descriptor’s current position. This is often the cleanest fix for tests that compare byte ranges repeatedly.
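
To confirm that behavior in a test, the offset can be sampled before and after the call. This is an illustrative sketch; the function name pread_leaves_offset_alone is hypothetical and not part of any library.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Read up to `len` bytes at absolute offset `at` with pread() and report
 * whether the descriptor's own offset is the same before and after the call.
 * Returns 1 if the offset is unchanged, 0 if it moved, -1 on error. */
int pread_leaves_offset_alone(int fd, void *buf, size_t len, off_t at) {
    assert(buf != NULL);
    off_t before = lseek(fd, 0, SEEK_CUR);
    if (before == (off_t)-1) {
        return -1;
    }
    if (pread(fd, buf, len, at) < 0) {
        return -1;
    }
    off_t after = lseek(fd, 0, SEEK_CUR);
    if (after == (off_t)-1) {
        return -1;
    }
    return before == after;
}
```

On a regular file this should return 1 regardless of where the descriptor was positioned. Note that lseek() and pread() both fail on pipes and sockets, so the check only applies to seekable files.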

4. Compare only the bytes actually read

ssize_t n1 = pread(fd1, buf1, sizeof(buf1), 0);
ssize_t n2 = pread(fd2, buf2, sizeof(buf2), 0);

if (n1 < 0 || n2 < 0) {
    perror("pread");
    exit(EXIT_FAILURE);
}

if (n1 != n2) {
    fprintf(stderr, "byte count mismatch: %zd vs %zd\n", n1, n2);
    exit(EXIT_FAILURE);
}

if (memcmp(buf1, buf2, (size_t)n1) != 0) {
    fprintf(stderr, "content mismatch\n");
    exit(EXIT_FAILURE);
}

This prevents stale bytes in the buffer from affecting the result.
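
The false difference is easy to trigger on purpose. The sketch below is an illustration only: the function name compare_two_ways, the 8-byte capacity, and the 0xAA/0x55 fill patterns are assumptions chosen to simulate leftover data in two buffers.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

enum { CAP = 8 };  /* buffer capacity, deliberately larger than the test file */

/* Read the start of the same file into two buffers that were pre-filled with
 * different junk, then compare them two ways. Stores 1 in *full_differs if
 * comparing the whole buffers reports a difference, and 1 in *prefix_differs
 * if comparing only the bytes actually read does. Returns the number of
 * bytes read, or -1 on error. */
ssize_t compare_two_ways(int fd, int *full_differs, int *prefix_differs) {
    unsigned char a[CAP], b[CAP];
    assert(full_differs != NULL && prefix_differs != NULL);

    memset(a, 0xAA, sizeof a);  /* simulate stale contents */
    memset(b, 0x55, sizeof b);

    ssize_t na = pread(fd, a, sizeof a, 0);
    ssize_t nb = pread(fd, b, sizeof b, 0);
    if (na < 0 || nb < 0 || na != nb) {
        return -1;
    }

    *full_differs = memcmp(a, b, sizeof a) != 0;
    *prefix_differs = memcmp(a, b, (size_t)na) != 0;
    return na;
}
```

For a 3-byte file, the full-buffer comparison reports a difference because bytes 3 through 7 were never overwritten, while the length-limited comparison reports a match.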

5. Use a loop if the full file content must be consumed

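/* Read up to len bytes, looping over partial reads.
 * Returns the number of bytes read (less than len only at EOF), or -1 on error. */
ssize_t read_all(int fd, unsigned char *buf, size_t len) {
    size_t total = 0;
    while (total < len) {
        ssize_t n = read(fd, buf + total, len - total);
        if (n < 0) {
            return -1;   /* errno is set by read() */
        }
        if (n == 0) {
            break;       /* end of file */
        }
        total += (size_t)n;
    }
    return (ssize_t)total;
}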

This is critical when reading from sources that may legitimately return fewer bytes per call.
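
The same loop can be combined with the fixed-offset approach from step 3, since pread() may also return fewer bytes than requested. The following is a sketch under that assumption; pread_full is a hypothetical name, not a standard function.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Read up to `len` bytes starting at absolute offset `offset`, looping over
 * partial reads. The descriptor's own offset is not used or modified.
 * Returns the number of bytes read (less than len only at EOF), or -1 on
 * error. */
ssize_t pread_full(int fd, unsigned char *buf, size_t len, off_t offset) {
    size_t total = 0;
    assert(buf != NULL);
    while (total < len) {
        /* Advance the explicit offset by hand, because pread() never does. */
        ssize_t n = pread(fd, buf + total, len - total, offset + (off_t)total);
        if (n < 0) {
            return -1;
        }
        if (n == 0) {
            break;  /* end of file */
        }
        total += (size_t)n;
    }
    return (ssize_t)total;
}
```

For a 10-byte file containing 0123456789, requesting 4 bytes at offset 6 returns 4 and fills the buffer with 6789. Requesting 8 bytes at the same offset also returns 4, because end of file is reached first.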

6. Example corrected test pattern

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <errno.h>

static int get_fd(const char *filename, int flags) {
    int fd = open(filename, flags);
    if (fd < 0) {
        perror("open");
        exit(EXIT_FAILURE);
    }
    return fd;
}

int main(void) {
    const char *path = "test.bin";
    unsigned char a[256];
    unsigned char b[256];

    int fd = get_fd(path, O_RDONLY);

    ssize_t na = pread(fd, a, sizeof(a), 0);
    if (na < 0) {
        perror("pread a");
        close(fd);
        return EXIT_FAILURE;
    }

    ssize_t nb = pread(fd, b, sizeof(b), 0);
    if (nb < 0) {
        perror("pread b");
        close(fd);
        return EXIT_FAILURE;
    }

    if (na != nb) {
        fprintf(stderr, "Different byte counts: %zd vs %zd\n", na, nb);
        close(fd);
        return EXIT_FAILURE;
    }

    if (memcmp(a, b, (size_t)na) != 0) {
        fprintf(stderr, "Read file bytes differ\n");
        close(fd);
        return EXIT_FAILURE;
    }

    printf("Reads match: %zd bytes\n", na);
    close(fd);
    return EXIT_SUCCESS;
}

If your original test used repeated read() calls on the same descriptor, switching to pread() or calling lseek() before the second read will usually resolve the issue immediately.

Common Edge Cases

  • EOF reached earlier than expected: if the file is smaller than the buffer, read() returns only available bytes. That is normal behavior, not corruption.
  • Shared offsets across dup() or fork(): if another code path reads from the same open file description, your test may start from a different offset than expected.
  • Special files: files under /proc, device nodes, or pseudo-filesystems may not behave like regular disk files and can return dynamic content or short reads.
  • Mixing stdio and low-level I/O: combining fread() with read() on the same underlying file can cause confusing offset behavior because stdio buffering adds another layer of state.
  • Comparing as strings: binary file data may contain null bytes, so use memcmp() instead of strcmp().
  • Interrupted system calls: if a signal is delivered while read() is blocked, the call may fail with errno set to EINTR before any data is transferred. Robust code retries the operation when appropriate.
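
For the interrupted-call case, one common pattern is a thin wrapper that repeats the call while it fails with EINTR. This is a minimal sketch with a hypothetical name, read_retry. It assumes that retrying is always acceptable, which may not hold if your program uses signals to cancel blocking I/O.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Call read() and repeat it for as long as it fails with EINTR.
 * Any other result (data, EOF, or a different error) is returned as-is. */
ssize_t read_retry(int fd, void *buf, size_t len) {
    ssize_t n;
    assert(buf != NULL);
    do {
        n = read(fd, buf, len);
    } while (n < 0 && errno == EINTR);
    return n;
}
```

The wrapper still returns partial reads unchanged, so it is meant to replace the bare read() call inside a loop like the one in step 5, not the loop itself.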

FAQ

Why do two reads from the same file descriptor return different bytes?

Because the first read() advances the file offset. The second read starts where the first one ended unless you call lseek() or use pread().

Why does memcmp show differences even when the file is the same?

You may be comparing more bytes than were actually read. Always compare only the exact number returned by read() or pread().

Should I use read() or pread() for deterministic tests?

Use pread() when the test depends on a fixed offset and should not be affected by descriptor position changes. It is usually the most deterministic choice for byte-for-byte verification.

For deeper reference on POSIX file I/O semantics, review the read(2) and pread(2) manual pages. They explain exactly how offsets, partial reads, and return values work at the system-call level.
