Editing large files in place

Running out of disk space seems to be an all too common problem lately, especially when dealing with large databases. One situation that came up recently involved a client who needed to import a large Postgres dump file into a new database. Unfortunately, they were very low on disk space and the file needed to be modified. Without going into all the reasons, we needed the databases to use template1 as the template database, and not template0. This was a very large, multi-gigabyte file, and the amount of space left on the disk was measured in megabytes. It would have taken too long to copy the file somewhere else to edit it, so I did a low-level edit using the Unix utility dd. The rest of this post gives the details.

To demonstrate the problem and the solution, we'll need a disk partition that has little-to-no free space available. In Linux, it's easy enough to create such a thing by using a RAM disk. Most Linux distributions already have these ready to go. We'll check it out with:

$ ls -l /dev/ram*
brw-rw---- 1 root disk 1,  0 2009-12-14 13:04 /dev/ram0
brw-rw---- 1 root disk 1,  1 2009-12-14 22:27 /dev/ram1

From the above, we see that there are some RAM disks available (there are actually 16 of them on my box, but I only showed two). Here are the steps to create a usable partition from /dev/ram1, and to then check the size:

$ mkdir /home/greg/ramtest

$ sudo mke2fs /dev/ram1
mke2fs 1.41.4 (27-Jan-2009)
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
4096 inodes, 16384 blocks
819 blocks (5.00%) reserved for the super user
First data block=1
Maximum filesystem blocks=16777216
2 block groups
8192 blocks per group, 8192 fragments per group
2048 inodes per group
Superblock backups stored on blocks:

Writing inode tables: done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 29 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.

$ sudo mount /dev/ram1 /home/greg/ramtest

$ sudo chown greg:greg /home/greg/ramtest

$ df -h /dev/ram1
Filesystem            Size  Used Avail Use% Mounted on
/dev/ram1              16M  140K   15M   1% /home/greg/ramtest

First we created a new directory to serve as the mount point, then we used the mke2fs utility to create a new file system (ext2) on the RAM disk at /dev/ram1. It's a fairly verbose program by default, but there is nothing in the output that's really important for this example. Then we mounted our new filesystem on the directory we just created. Finally, we changed the ownership of the directory so that an ordinary user (e.g. 'greg') can read and write to it. At this point, we've got a directory/filesystem that is just under 16 MB in size (we could have made the usable space much closer to 16 MB by specifying -m 0 to mke2fs, but the actual size doesn't matter).
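
If your system has no RAM disk devices handy, a tmpfs mount (something the comments below touch on) gives an equivalent small, memory-backed filesystem to experiment with. A rough sketch of the same setup, with the size chosen to match:

$ mkdir /home/greg/ramtest
$ sudo mount -t tmpfs -o size=16M tmpfs /home/greg/ramtest
$ sudo chown greg:greg /home/greg/ramtest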

To simulate what happened, let's create a database dump and then bloat it until it takes up all the available space:

$ cd /home/greg/ramtest

$ pg_dumpall >

$ ls -l
-rw-r--r-- 1 greg greg 3685 2009-12-15 10:42

$ dd seek=3685 if=/dev/zero bs=1024 count=99999
dd: writing '': No space left on device
13897+0 records in
13896+0 records out
14229504 bytes (14 MB) copied, 0.0814188 s, 175 MB/s

$ df -h .
Filesystem            Size  Used Avail Use% Mounted on
/dev/ram1              16M   15M     0 100% /home/greg/ramtest

First we created the dump, then we found its size, and told dd via the 'seek' argument to skip over the existing data before writing (one caveat: seek is measured in blocks of the bs size rather than in bytes, so seek=3685 with bs=1024 actually jumps well past the end of the 3685-byte file; the gap is stored as a sparse hole, which is harmless here since the only goal is to fill up the partition). We used the special file /dev/zero as the 'if' (input file), and our existing dump as the 'of' (output file). Finally, we told it to write chunks of 1024 bytes at a time, and to attempt to add 99,999 of those chunks. Since this is approximately 100 MB, we ran out of disk space quickly, as we intended. The filesystem is now at 100% usage, and will refuse any further writes to it.
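
As an aside, a reasonably recent GNU dd can skip the manual seek arithmetic and simply append for you. A sketch of the same bloating step, using a made-up name for the dump file:

$ dd if=/dev/zero of=dump.sql bs=1024 count=99999 oflag=append conv=notrunc

The oflag=append opens the output file in append mode, and conv=notrunc again keeps dd from truncating it first.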

To recap, we need to replace the first three instances of template0 with template1. Let's use grep to view the lines:

$ grep --text --max-count=3 template

We need the --text argument here because grep correctly surmises that we've changed the file from text to binary with the addition of all those zeroes on the end. We also used the --max-count argument to stop processing once we've found the lines we want, which is very handy when the actual file is gigabytes in size!
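
Spelled out with a stand-in name for the dump file, that invocation looks like this:

$ grep --text --max-count=3 template dump.sql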

There are two major problems with using a normal text editor to change the file. First, the file (in the real situation, not this example!) was very, very large. We only needed to edit something at the very top of the file, so loading the entire thing into an editor is very inefficient. Second, editors need to save their changes somewhere, and there just was not enough room to do so.

Attempting to edit with emacs gives us: emacs: IO error writing /home/greg/ramtest/ No space left on device

An attempt with vi gives us: vi: Write error in swap file on startup. "" E514: write error (file system full?)

Although emacs gives the better error message (why is vim making a guess and outputting some cryptic E514 error?), the advantage always goes to vi in cases like this, as emacs has a longstanding limitation that prevents it from even opening very large files.

What about something more low-level like sed? Unfortunately, while sed is more efficient than emacs or vim, it still needs to read the old file and write the new one. We can't do that writing as we have no disk space! More importantly, in sed there is no way (that I could find anyway) to tell it to stop processing after a certain number of matches.
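
Had there been a database server ready to accept the data directly, another option (one the comments below explore) is to never write the edited dump back to disk at all, but to stream it straight into psql through a filter that stops substituting after the first few hundred lines. A sketch using perl, with a stand-in file name and an arbitrary cutoff of 300 lines:

$ perl -pe 's/template0/template1/ if $. <= 300' dump.sql | psql

Restricting the substitution to the top of the file also avoids accidentally rewriting a 'template0' that happens to appear somewhere in the data itself.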

What we need is something *really* low-level. The utility dd comes to the rescue again. We can use dd to truly edit the file in place. Basically, we're going to overwrite some of the bytes on disk, without needing to change anything else. First though, we have to figure out exactly which bytes to change. The grep program has a nice option called --byte-offset that can help us out:

$ grep --text --byte-offset --max-count=3 template

This tells us the offset for each line, but we want to replace the number '0' in 'template0' with the number '1'. Rather than count it out manually, let's just use another Unix utility, hexdump, to help us find exactly where that character sits:

$ grep --text --byte-offset --max-count=3 template | hexdump -C
00000000  33 30 31 3a 43 52 45 41  54 45 20 44 41 54 41 42  |301:CREATE DATAB|
00000010  41 53 45 20 67 72 65 67  20 57 49 54 48 20 54 45  |ASE greg WITH TE|
00000020  4d 50 4c 41 54 45 20 3d  20 74 65 6d 70 6c 61 74  |MPLATE = templat|
00000030  65 30 20 4f 57 4e 45 52  20 3d 20 67 72 65 67 20  |e0 OWNER = greg |
00000040  45 4e 43 4f 44 49 4e 47  20 3d 20 27 55 54 46 38  |ENCODING = 'UTF8|

Each hexdump line shows 16 characters, so the first three lines come to 48 characters; then we add two for the 'e0', subtract four for the '301:' prefix, and get 301+48+2-4=347. We subtract one more as we want to seek to the point just before that character, and we can now use our dd command:

$ echo 1 | dd seek=346 bs=1 count=1 conv=notrunc
1+0 records in
1+0 records out
1 byte (1 B) copied, 0.00012425 s, 8.0 kB/s

Instead of an input file (the 'if' argument), we simply pass the number '1' via stdin to the dd command. We use our calculated seek, tell it to copy a single byte (bs=1), one time (count=1), and (this is very important!) tell dd NOT to truncate the file when it is done (conv=notrunc). Technically, we are sending two characters to the dd program, the number one and a newline, but the bs=1 argument ensures only the first character is being copied. We can now verify that the change was made as we expected:

$ grep --text --byte-offset --max-count=3 TEMPLATE
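
As an aside, grep can do most of that byte arithmetic for us: adding --only-matching makes --byte-offset report the position of the match itself rather than of the whole line, so the '0' sits exactly eight bytes past the reported offset. A sketch of how the first edit could have been scripted, again with a stand-in file name:

$ offset=$(grep --text --byte-offset --only-matching --max-count=1 template0 dump.sql | cut -d: -f1)
$ printf 1 | dd of=dump.sql seek=$((offset + 8)) bs=1 count=1 conv=notrunc

Using printf instead of echo also sidesteps the trailing newline mentioned above.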

Now for the other two entries. From before, the magic number is 45 (the offset of the '0' within each CREATE DATABASE line). The second line starts at byte 380, so we add 380 to 45 to get 425. The third line starts at byte 459 and the name of the database is 1 character longer, so we add 459+45+1 = 505:

$ echo 1 | dd seek=425 bs=1 count=1 conv=notrunc
1+0 records in
1+0 records out
1 byte (1 B) copied, 0.000109234 s, 9.2 kB/s

$ echo 1 | dd seek=505 bs=1 count=1 conv=notrunc
1+0 records in
1+0 records out
1 byte (1 B) copied, 0.000109932 s, 9.1 kB/s

$ grep --text --byte-offset --max-count=3 TEMPLATE

Success! On the real system, the database was loaded with no errors, and the large file was removed. If you've been following along and need to clean up:

$ cd ~
$ sudo umount /home/greg/ramtest
$ rmdir ramtest

Keep in mind that dd is a very powerful and thus very dangerous utility, so treat it with care. It can be invaluable for times like this however!
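
A cheap way to build confidence before pointing dd at the real thing is to rehearse on a copy of just the top of the file (which is all we care about here), assuming some other partition has a little room to spare. A sketch with placeholder names:

$ head -c 100000 dump.sql > /tmp/dump-head
$ grep --text --byte-offset --max-count=3 template0 /tmp/dump-head
$ echo 1 | dd of=/tmp/dump-head seek=346 bs=1 count=1 conv=notrunc
$ grep --text --max-count=3 TEMPLATE /tmp/dump-head

Once the rehearsal produces exactly the change you expect, repeat the same dd commands against the real file.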


Adrian said...

Nice work, and nice write-up! Thanks for sharing!

Moltonel said...

This trick can come in handy, but gets painful when you need to alter the number of characters.

What about piping sed output to psql, instead of asking sed to write a file? I've used this technique on pg dumps before, out of laziness when I could easily have edited the file. It feels natural enough.

Don't know about stopping after N replacements with sed. Sure it's possible, but I don't want to dig in when you can just specify line numbers, or just ignore the negligible overhead of sed'ing the entire file (we'll be waiting for postgres anyway, not sed).

Oh, and apropos emacs, what kind of DBA is still using a 32-bit OS these days? :p

Platonides said...

A quite risky operation.
I would have probably taken out ~1 sector/page, edited it and then replaced the original one.

Not the main point of this post, since you already had the file loaded, but instead of manually copying the file size to the seek parameter, you can do
dd seek=`stat -c %s` if=/dev/zero bs=1024 count=99999

Greg Sabino Mullane said...


True, about a piped-to-psql sed not being expensive compared to the other bits, but my worry there would be about replacing something it shouldn't. It's easy enough to know that 'template0' is a unique string in the first few lines, but what if it appears buried in the data later on in the 50GB+ file? Also, in this case there were similar lines immediately after the three in question that I did *not* want to be replaced. If I did have to alter the number of characters, I probably would write a quick perl script to pipe it through. All the power of sed, plus I can tell it to stop processing after a certain point and just turn into a dumb pipe (e.g. (print and next) if $. > 300)

As far as emacs, that's seldom in my control on client boxes, and not all distros have a 64-bit compiled emacs available. (While even the 64-bit version has a limit (unlike vi), at that point I'm not likely to edit directly anyway, but use something like dd or split. :) But there are plenty of times when I've wanted to edit a few hundred meg file and emacs failed me so I had to use vi.

Greg Sabino Mullane said...


Yes, quite risky, but also quite efficient, and this work was done under a lot of pressure to get things done quick. In my defense, I did test the process out first by doing a head -10000 largefile > foobar, and then modifying foobar. Thanks for the stat -c trick!

Moltonel said...

Here, I looked it up and it is actually very simple:

sed 20,30s/template0/template1/
will do the work only for lines 20 to 30.

sed '15s/template0/template1/;17s/template0/template1/'
will do the work specifically for line 15 and 17.

Ezekiel said...

Really awesome tricks, thanks Greg!

Add the "-i" or "--in-place" flag to sed to make it modify in-place.

Using dd to copy a piece of the file to another machine over ssh for editing ("vim -R" to open in "read-only" mode and refrain from creating a swap file on-disk. You can still write changes when finished.) might also be useful under stress; then use dd again to put the chunk right back over the file.

Certainly enjoyed the post!

Jon Jensen said...

Ezekiel, I just did a quick check out of curiosity and found that sed -i does "in-place" edits the same way Perl does: by writing a new file and then moving it in place of the original. Worth noting in case anyone reading is misled by the term "in-place".

Is there anything more automated than Greg's method that does true in-place edits of the blocks of an existing inode where they are? There are nice hex editors etc., but anything commonly distributed on e.g. Linux systems?

Johan Chang said...

Why not just use tmpfs?

Greg Sabino Mullane said...

Ezekiel: nice tip about editing just part of the file, then plopping it back in. I'd probably make two copies and diff them before dd-ing back in place.

Moltonel: thanks, I suspected sed had some simple solution like that.

Johan Chang: No particular reason to use ramdisk over tmpfs. I'll use tmpfs in my next demo to keep things balanced.

Anonymous said...

As Moltonel pointed out, sed can do changes based on line numbers.

But please note that in

sed '15s/template0/template1/;17s/template0/template1/'

separating sed commands with ';' is a non-POSIX extension and might not work with all the sed versions out there.
