IDE Flash / Sandisk IDE Flash
----------------------------------

QNX's Fsys filesystem is designed to be 'fault resistant'. It is not a
'fault tolerant' filesystem. It is not based on transactions, but on a
series of integral pieces:

    bitmap file to handle allocation/deallocation of blocks
    root blocks, directory blocks, inode tables

All these structures are deemed 'critical' structures in that if
corruption occurs in them, there is a possibility of data loss and loss
of filesystem integrity. The QNX System Architecture Book, pages 91-92,
describes this in more detail. The attempt is always made to stay in a
'sane' state.

Sandisk/ATA issue:

QNX writes 1 block (or sector) at a time to the media. Upon loss of
power it is possible for the Sandisk to be in mid-write of a block.
Current information from Sandisk is that when this happens the data in
the block will be unusable, and upon reread of that block on next power
up QNX will see it as a bad block (in other words, when the block is
read, the ATA command will return with EIO, indicating a failure to
read). It is the current understanding that if you then do a 'write' to
this bad block, the circuitry will change from the bad block state to a
good state and do the write correctly.

Note that this behaviour is common in all drives that do ECC error
detection. Even a regular ATA harddisk could write a bad block on power
down.

As well, there are other implications... on power down there is no
guarantee that even the block addressing circuitry will be correct. We
could write data to position C but it's possible that the addressing
circuitry could be shifted to position D.

Note that chkfsys will not try to write to a block that is marked by
Sandisk as being a bad block. The only recovery of a bad block that is
marked bad due to a power loss is to write back to the block. Then upon
next read by the Sandisk circuitry the bad block could be in a usable
mode and be readable. This is the way the Sandisk flash IDE is designed...
it is their mechanism to ensure that incorrect data is not presented to
a program.

Shutdown issue
------------------

Let's say your system can lose power at any time. Let's say you do a
chkfsys on power up. The chkfsys command will want to automatically
update directory structures, inode tables and bitmaps. If you lose power
during this update then the entire filesystem could be corrupted beyond
repair.

Also, if power is lost while writing a directory entry, the entire
directory block being written could be written incorrectly.

Possible solutions?
-------------------------

First off, QNX does not believe that there is any 100% safe solution
that does not involve a UPS and some form of coordinated shutdown
procedure.

To become more fault resistant, all of the following approaches could be
used:

1. Only do a chkfsys when you know you have enough power to run it to
   completion.

2. Pregrow all files so that no growing of extents and no updating of
   directory entries occurs on a grow.

3. Use Fsys options and driver options to minimize cache and write
   immediately to disk.
   e.g.
       Fsys -A -c0K &
       Fsys.ata -w 1000000 &      # for 1 second busy wait

   ** This last option, to wait for 1 second for DRQ/BSY, is an
   important one. The Sandisk can spin for up to 1 second while
   shuffling a bad block. If you do not set this option then it is
   possible to have file system corruption.

4. Program in such a manner that data is synced to disk as soon as
   possible. Details follow in Appendix A.

5. Try to partition your flash. Have read-only mount points where
   appropriate. If possible, consider breaking up your data writes into
   separate directory structures as described below:

Directory blocks:
----------------------------------------------------------------

Problem: if a directory block is destroyed, all 8 directory entries in
it will be gone (each dir entry is 64 bytes and a block is 512 bytes,
therefore 8 entries per block).

How do we minimize this possibility?
A possible solution is described below:

Build systems with 1 file in each directory block. Fill the remaining
7 entries with empty files.
e.g.
    /dataarea1/
        critical_file
        empty1
        empty2
        empty3
        empty4
        empty5
        empty6
        empty7
    /dataarea2/
        critical_file
        empty1
        empty2
        empty3
        empty4
        empty5
        empty6
        empty7

This way, loss of a directory block will only affect 1 of the critical
files.

Appendix A:
------------------

Use fd-based I/O rather than FILE * based I/O when possible. For
synchronisation with the OS the former is preferred; the extra level of
buffering provided by FILEs can get in the way, and routines like
fflush() are misleading because on their own they do very little for
robustness. If you need, for example, the text support routines of a
FILE, you should first set up an fd yourself and fdopen() it to get a
FILE; you can also use fileno() to get the fd of a FILE for passing to
lower-level I/O calls. You may also want to look into setvbuf().

Once you are fd-based, you can control the performance/reliability
ratio of file output in a number of ways:

(i)  Give open() the O_SYNC flag; this will make all writes synchronous,
     and block until completed;

(ii) Use the fsync() call periodically; this will flush any dirty disk
     blocks to the physical device, and block until completed.

It's worth saying again that if your file access is FILE-based, you must
first flush your local dirty stdio buffers with fflush(f) before calling
fsync(fileno(f)). Think of the file access as being layered:

    physical disk <-> Fsys cache <-> stdio lib <-> your app

You have to move the data to be written securely all the way along (its
normal/default path is more leisurely).

There are also global Fsys options, in particular the '-d' delay, that
can be used to reduce the window of dirty cache data, as well as options
on a per-mount basis.

Chkfsys...

Chkfsys can only recover data back to a known state, i.e.
the last time the on-disk inode (the structure which maintains all the
information about a file) was updated. Under normal circumstances, the
on-disk inode is only updated periodically, such as when a file is first
created (it will show a size of 0) and when the file is closed. There
are other events (some dictated by POSIX and others by the design of
Fsys) which will cause the inode to be updated. Examples are:

    when an extent grows or a new extent is created,
    when stat/fstat is called for that file,
    when fsync/fdatasync is called for that file,
    sometime after sync is called (there is no guaranteed time for this),
    before a write operation completes if O_SYNC or O_DSYNC is in effect.

If you want guaranteed recoverable write operations, you will have to
open the file with O_SYNC or O_DSYNC. With one of those flags set, a
call to write won't return until the data has been written to disk
(actually, passed to the controller by the driver -- if the controller
buffers the writes then all bets are off, but this is unusual) and the
inode has been similarly updated if required. When this is done,
chkfsys will be able to recover up to the last successful write
operation. Of course, this slows down file writing considerably.

A faster alternative is to write the data normally and periodically, at
suitable synchronization points, call fsync or fdatasync on that file.
These calls tell Fsys to flush any buffered writes for the specified
file and, if required, update the on-disk inode. This could be faster
because you could issue a number of (buffered) writes followed by a
single synchronization call. Similar to the previous suggestion, chkfsys
would then be able to recover to the last time you called one of these
functions.

Appendix B
----------------------------

Program to try to write back data to a disk in order to kick blocks
marked as unusable by the Sandisk circuitry back to a usable state.

Please note that this code is provided as an example.
It is assumed that the user will test to their level of satisfaction
that their system performs correctly.

/*******************************************************************************
 * dblock.c
 *   Program to read every block in a mounted partition and write back bad
 *   blocks.
 *
 ******************************************************************************/

#ifdef __USAGE
%C use:

%C [-d mountpoint] [-b numblocks_per_read_or_write] [-v]

    mountpoint = filesystem to check. Must be raw device.
                 Default is /dev/hd0t77
    numblocks  = the max number of blocks to read or write at once.
                 Default is the max number supported by the driver.
    v          = verbose

e.g.
    %C -d /dev/hd0t77 -v        Check first QNX partition.
#endif

/*
 * The header names are missing from this listing. The set below is an
 * assumed minimal one; <sys/disk.h> and <sys/fsys.h> are assumed to supply
 * block_read(), block_write(), _BLOCK_SIZE and _MAX_BLOCK_OPS on QNX 4.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/disk.h>
#include <sys/fsys.h>

char *diskd = "/dev/hd0t77";
int blocks_to_read = _MAX_BLOCK_OPS - 1;
int verbose = 0;
char null_block[_BLOCK_SIZE];   /* filler block (0xff bytes) written over bad blocks */
int fd;                         /* descriptor for disk device */

static void do_block_test(void);
static void try_write_back(long bcount, int num_blocks);

int main(int argc, char *argv[])
{
    int c;

    while ((c = getopt(argc, argv, "d:b:v")) != -1) {
        switch (c) {
        case 'd': diskd = optarg;                 break;
        case 'b': blocks_to_read = atoi(optarg);  break;
        case 'v': verbose++;                      break;
        }
    }
    if (blocks_to_read < 1)     /* guard against "-b 0" or a non-numeric value */
        blocks_to_read = 1;

    do_block_test();
    return 0;
}

static void do_block_test(void)
{
    int num_blocks;
    char *buf;
    long bcount = 1;            /* first block to read */
    unsigned total_blocks = 0;
    int keepgo = 1;
    int read_errno;

    if (verbose)
        printf("Opening drive %s\n", diskd);

    fd = open(diskd, O_RDWR | O_DSYNC);
    if (fd == -1) {
        fprintf(stderr, "Error opening drive. (%s)\n", strerror(errno));
        exit(EXIT_FAILURE);
    }

    printf("operation: block size = %d, max blocks per op = %d\n",
           _BLOCK_SIZE, blocks_to_read);

    buf = (char *) malloc(_BLOCK_SIZE * blocks_to_read);
    if (buf == NULL) {
        fprintf(stderr, "Error allocating buffer. (%s)\n", strerror(errno));
        exit(EXIT_FAILURE);
    }

    printf("hit enter to start\n");
    getchar();

    while (keepgo) {
        num_blocks = block_read(fd, bcount, blocks_to_read, buf);
        read_errno = errno;     /* save before printf() can overwrite it */
        if (verbose)
            printf("read %d blocks starting from %ld: wanted %d blocks.\n",
                   num_blocks, bcount, blocks_to_read);

        if (num_blocks == -1) {
            switch (read_errno) {
            case EBADF:         /* device does not support block I/O */
                fprintf(stderr, "Device %s does not support block_read. (%s)\n",
                        diskd, strerror(read_errno));
                keepgo = 0;
                break;
            case EINVAL:        /* must be beyond end of disk... finished */
                keepgo = 0;
                break;
            case EIO:           /* a bad block somewhere in this range */
                fprintf(stderr,
                        "Error reading at block address %ld for %d blocks. (%s)\n",
                        bcount, blocks_to_read, strerror(read_errno));
                try_write_back(bcount, blocks_to_read);
                bcount += blocks_to_read;   /* move past the range just handled */
                break;
            default:            /* unexpected error: stop rather than loop */
                fprintf(stderr, "Error reading at block address %ld. (%s)\n",
                        bcount, strerror(read_errno));
                keepgo = 0;
                break;
            }
        } else if (num_blocks == 0) {
            keepgo = 0;         /* nothing more to read */
        } else {
            total_blocks += num_blocks;
            bcount += num_blocks;
        }
    }
    printf("total blocks read: %u\n", total_blocks);
    free(buf);
    close(fd);
}

/*
 * Re-read the range one block at a time and overwrite only the blocks
 * that fail with EIO, so that readable data in the range is left alone.
 */
static void try_write_back(long bcount, int num_blocks)
{
    static char probe[_BLOCK_SIZE];     /* scratch buffer for single-block reads */
    int i;
    int write_errno;

    memset(null_block, 0xff, _BLOCK_SIZE);

    for (i = 0; i < num_blocks; i++) {
        if (block_read(fd, bcount + i, 1, probe) != -1)
            continue;                   /* block is readable */
        if (errno != EIO)
            return;                     /* e.g. EINVAL: beyond end of disk */

        if (block_write(fd, bcount + i, 1, null_block) != -1) {
            if (verbose)
                printf("wrote null block at %ld\n", bcount + i);
            continue;
        }

        write_errno = errno;
        switch (write_errno) {
        case EBADF:             /* device does not support block I/O */
            fprintf(stderr, "Device does not support block_write. (%s)\n",
                    strerror(write_errno));
            break;
        case EINVAL:            /* must be beyond end of disk... finished */
            break;
        case EIO:               /* the write itself failed */
            fprintf(stderr, "Error writing at block address %ld. (%s)\n",
                    bcount + i, strerror(write_errno));
            fprintf(stderr, "Critical error... aborting.\n");
            exit(EXIT_FAILURE);
        }
    }
}
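
Sketch for Appendix A: the layering described there (your app, stdio
lib, Fsys cache, physical disk) can be illustrated with a small helper
that appends one record through a FILE and then pushes it along each
layer in turn. The function name write_record_durably() and the 0644
file mode are illustrative assumptions, not part of the original note,
and the code uses only POSIX calls rather than anything specific to
Fsys. Treat it as a minimal sketch, not a tested recipe for a given
target.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: append one text record to 'path' and do not
 * return until it has been handed down through every layer.
 * Returns 0 on success, -1 on any failure.
 */
int write_record_durably(const char *path, const char *record)
{
    int fd;
    FILE *fp;

    assert(path != NULL && record != NULL);

    /* Set up the fd ourselves, then wrap it in a FILE for text output. */
    fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd == -1)
        return -1;

    fp = fdopen(fd, "a");
    if (fp == NULL) {
        close(fd);
        return -1;
    }

    if (fputs(record, fp) == EOF ||     /* app -> stdio buffer */
        fflush(fp) == EOF ||            /* stdio buffer -> filesystem cache */
        fsync(fileno(fp)) == -1) {      /* filesystem cache -> device */
        fclose(fp);
        return -1;
    }

    /* fclose() also closes the underlying fd. */
    return fclose(fp) == EOF ? -1 : 0;
}
```

A caller might use it as
write_record_durably("/dataarea1/critical_file", "some record\n") and
treat a return of -1 as "not known to be on disk". Opening the file
with O_SYNC instead would make the explicit fsync() unnecessary, at the
cost of every write blocking until it completes.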