This is meant to be a brief primer on regular files, which are easy to overlook in their simplicity. This isn’t meant to be exhaustive, and there’s a lot I won’t talk about (inodes!, file descriptors!, POSIX permissioning!).
Regular files are one of the 7 standard Unix file types:
| File Type | Purpose | Common Examples |
| --- | --- | --- |
| Regular | Persistent data storage | `.txt`, `.jpg`, `.mp3` |
| Directory | Structuring the file system | `/home`, `/etc`, `/usr` |
| Block | Buffered access to hardware devices | Hard drives (`/dev/sda`) |
| Character | Unbuffered access to hardware devices | Terminals (`/dev/tty`) |
| Pipe | Uni-directional inter-process communication | Pipes created with `mkfifo` |
| Socket | Bi-directional inter-process communication | Network sockets |
| Symbolic link | Pointing to another file or directory | Files created with `ln -s` |
As the table hints at, they serve a crucial purpose: permanent data storage in file systems. As such, I’ll try to convey 3 key characteristics of regular files:
- They’re unstructured streams of bytes
- They have random access patterns
- They allow multiple access
Regular files as unstructured streams of bytes on blocks
If you take 1 idea from this, it’s that files are streams of bytes with no underlying structure written on blocks.
Filesystems use “blocks” as the lowest unit of physical storage. Blocks are typically 4KB in size, and a file is stored across 1 or more blocks. Each file’s metadata records which blocks it occupies, so when the OS interacts with the file it knows which blocks to read from and write to.
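You can see this block accounting yourself. Here’s a minimal sketch (assuming a Unix system with Python; the file name `example.txt` is hypothetical) using `os.stat`:

```python
import os

# Create a tiny file and inspect how the filesystem accounts for it.
with open("example.txt", "wb") as f:
    f.write(b"hello")  # 5 bytes of content

st = os.stat("example.txt")
print(st.st_size)     # logical size in bytes: 5
print(st.st_blksize)  # filesystem's preferred block size, often 4096
print(st.st_blocks)   # allocated storage, counted in 512-byte units
```

Note that `st_blocks` is counted in 512-byte units for historical reasons, so even this 5-byte file will typically occupy a full 4KB block on disk.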
The operating system, then, just provides blocks to read from and write to, and you can manipulate a file’s data through fairly flexible system call APIs (e.g. `open`, `read`, `write`). This means that it’s up to the applications opening a file to interpret its bytes and check whether the data is as expected or corrupted, which is why corrupted files happen in the first place!
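To make this concrete, here’s a small sketch using Python’s `os` module, which exposes the `open`/`read`/`write` system calls almost directly (the file name `raw.bin` is hypothetical):

```python
import os

# The kernel accepts any byte sequence; it attaches no meaning to the contents.
fd = os.open("raw.bin", os.O_CREAT | os.O_WRONLY, 0o644)
os.write(fd, b"\xde\xad\xbe\xef")
os.close(fd)

# Reading returns the same raw bytes; interpreting them is the application's job.
fd = os.open("raw.bin", os.O_RDONLY)
print(os.read(fd, 4))  # b'\xde\xad\xbe\xef'
os.close(fd)
```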
Images as regular files
For example, PNG images have a structured format that specifies how bytes should be laid out for the computer to interpret the file (see the PNG file format). Yet that doesn’t stop me from creating an arbitrary file with a `.png` extension that doesn’t conform to the format, or from modifying an existing one to corrupt it!
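As a quick sketch of that (the file name `fake.png` is hypothetical): a real PNG must start with the 8-byte signature `\x89PNG\r\n\x1a\n`, but the file system will happily store whatever bytes we hand it under a `.png` name:

```python
# Write bytes that are definitely not a valid PNG to a .png file.
with open("fake.png", "wb") as f:
    f.write(b"definitely not a valid PNG")
# The write succeeds; only an application that parses the PNG format
# (an image viewer, for example) will notice anything is wrong.
```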
Code as regular files
Another example is that we can all write wrong code, like a `.c` file with Python code:
```python
def greet(name):
    print(f"Hello, {name}!")

greet("world")
```
I can still save it because the file system lets me edit the contents of my files indiscriminately (assuming I have the appropriate permissions). However, if I try to compile the program I’ll get compile-time errors because it doesn’t conform to the C syntax the compiler expects.
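For instance, compiling the file above (saved under a hypothetical name like wrong.c) fails immediately; the exact message varies by compiler, but with gcc it looks something like:

```
$ gcc wrong.c
wrong.c:1:1: error: unknown type name 'def'
```

The file system stored the bytes without complaint; it’s the compiler, an application, that rejects them.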
Postgres as regular files
A good example of this idea is SQL databases like Postgres, which heavily use regular files to store data.
Naturally, the application logic of Postgres, just like that of any program, is a collection of source code in regular files that gets compiled to a binary (the source is on GitHub). In my Unix file system, my Postgres binary is in `/usr/local/opt/postgresql@14`.
Yet it’s not just the source code: the underlying data read and written through Postgres also goes to regular files. Like other databases, Postgres typically stores its data under `/var/` directories, which by convention hold files that change during a program’s normal operations (mine is in `/usr/local/var/postgresql@14`). The databases themselves are organized within the `base/` subdirectory, with each subdirectory named after the object ID (OID) that Postgres associates with each database.
This is a simplified tree view of what said directory might look like:
```
/usr/local/var/postgresql@14
├── base
│   ├── 12345
│   ├── 14033
│   ├── 14034
│   └── 16385
├── global
└── ...
```
Where, again, Postgres associates each OID with a database name, so there’s a 1-to-1 mapping between them:
```
postgres=# SELECT oid, datname FROM pg_database;
  oid  |  datname
-------+-----------
 14034 | postgres
 12345 | template1
 14033 | template0
 16385 | homework
 ...
```
For example, if I `ls` in one of the subdirectories for a database’s files, I get a list of the files that store the database’s data:
```
$ ls /usr/local/var/postgresql@14/base/16385
1247  1249  1255  1259  ...
```
And each file is a regular file that can be read from or written to according to its file permissions (note how `file 1247` describes the file as `data`):
```
$ file 1247
1247: data
```
We can even use `psql` to find that the file `1247` stores the data for `pg_type`, which is a critical system catalog with Postgres’s type definitions:
```
postgres=# SELECT relname FROM pg_class WHERE oid = 1247;
 relname
---------
 pg_type
(1 row)
```
So deleting this file would mean the database wouldn’t be able to interpret its data or run queries! How could it without understanding its data types?
We can easily verify this (don’t try this at home!):
```
$ rm /usr/local/var/postgresql@14/base/16385/1247
```
And then try to run a query that uses a data type in that database:
```
homework=# SELECT 1;
2024-12-30 22:02:44.498 GMT [60150] ERROR:  could not open file "base/16385/1247": No such file or directory
2024-12-30 22:02:44.498 GMT [60150] STATEMENT:  SELECT 1;
ERROR:  could not open file "base/16385/1247": No such file or directory
```
Structured vs unstructured file types: contrasting regular and directory files
This feature of the user being able to write any data to regular files doesn’t apply to all file types. For example, you can’t just open a directory file and write to it like a regular file. This is why the program below will output `Error occurred: [Errno 21] Is a directory: 'test_dir'`:
```python
import os

os.makedirs("test_dir", exist_ok=True)

try:
    # Trying to write to a directory as if it were a regular file fails.
    with open("test_dir", "w") as f:
        f.write("some data")
except OSError as e:
    print(f"Error occurred: {e}")
```
The reason for this is that directories are a structured file format. To protect that structure, Unix provides APIs that give the user less control, preventing them from corrupting their file system, unlike regular files, which don’t have those protections. That’s why you can’t use `open` or `write` on directories and must instead use more controlled APIs like `opendir` and `readdir`, for both safety and convenience.
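For instance, here’s a minimal sketch of the controlled path (on POSIX systems, Python’s `os.scandir` is built on `opendir`/`readdir`; `test_dir` is the directory from the example above):

```python
import os

# Instead of raw bytes, the directory API hands back structured entries.
with os.scandir("test_dir") as entries:
    for entry in entries:
        kind = "directory" if entry.is_dir() else "file"
        print(f"{entry.name}: {kind}")
```

We never see the directory’s on-disk byte layout; the kernel only lets us read it as a list of entries.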
Random access
Our Postgres example hinted at another key feature of regular files: random access.
Random access means that we can read byte data at any location in the file’s byte stream. This contrasts with sequential access, where data must be read in order, so you can’t read an arbitrary byte without first reading all the bytes before it. This is why the `lseek()` system call exists: to reposition the offset at which the next read or write on a file happens.
```python
import os

with open("example.txt", "wb") as f:
    f.write(b"hello world")

fd = os.open("example.txt", os.O_RDONLY)
os.lseek(fd, 6, os.SEEK_SET)  # jump straight to byte offset 6
print(os.read(fd, 5))         # b'world' -- bytes 0-5 were never read
os.close(fd)
```
As a contrast to regular files, pipes must be read sequentially! You can’t just jump to an arbitrary point in a pipe’s data.
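We can check this directly: seeking on a pipe fails with ESPIPE. A minimal sketch:

```python
import os

read_fd, write_fd = os.pipe()
os.write(write_fd, b"hello")

try:
    os.lseek(read_fd, 2, os.SEEK_SET)  # pipes don't support random access
except OSError as e:
    print(f"Cannot seek on a pipe: {e}")  # e.g. [Errno 29] Illegal seek

os.close(read_fd)
os.close(write_fd)
```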
Multiple access
Multiple access means that multiple processes can open the same file and read from and write to it concurrently. Since regular files handle persistent storage, this is helpful in cases like:
- Multiple services writing to the same log files
- Multiple database systems reading the same tables
- Multiple processes reading the same shared config
As a corollary, multiple access is why we have concurrency risks and locking mechanisms to address them. If you have multiple processes writing to the same file, you have a potential race condition that could create inconsistencies.
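As a minimal sketch of the idea (two file objects standing in for two processes; `shared.log` is a hypothetical log file):

```python
# Two independent writers appending to the same file, as two services
# logging to a shared file would.
writer_a = open("shared.log", "a")
writer_b = open("shared.log", "a")

writer_a.write("service A: started\n")
writer_b.write("service B: started\n")
writer_a.flush()
writer_b.flush()

# A third opener can read everything both writers produced.
with open("shared.log") as reader:
    print(reader.read(), end="")

writer_a.close()
writer_b.close()
```

Append mode keeps each write positioned at the end of the file, but nothing coordinates the two writers beyond that, which is exactly where the race conditions above come from.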
Again, pipes are a contrast: a pipe connects 1 reader to 1 writer rather than supporting many of each.
Conclusion
The key idea I wanted to convey is that our systems run on regular files because persistent storage depends on regular files to store data on the filesystem’s blocks. A file is just some metadata plus a collection of blocks, and as long as you have the right permissions you can write anything to it, because applications are responsible for interpreting the data.
As I’ll explore in future articles, there’s a need for other file types with different trade-offs than regular files. For starters, there are other use cases where:
- Data doesn’t need to persist
- Access patterns should be sequential rather than random
- Latency is critical
In cases like these, regular files are probably not the best fit because they involve writing to disk. Reading and writing from memory has much lower latency (~100 nanoseconds for memory vs ~100K to 2M nanoseconds for disk, depending on the storage device), so that’s a big motivation for other file types with different trade-offs.