IDE Flash / Sandisk IDE Flash
----------------------------------

QNX's Fsys filesystem is designed to be 'fault resistant'. It is not a
'fault tolerant' filesystem. It is not based on transactions; instead it
relies on a series of integral on-disk structures:
       bitmap file to handle allocation/deallocation of blocks
       root blocks, directory blocks, inode tables

    All these structures are deemed 'critical' in that if corruption
    occurs in them, there is a possibility of data loss and loss of
    filesystem integrity.

    The QNX System Architecture Book, pages 91-92, describes this in more
    detail. An attempt is always made to stay in a 'sane' state.


Sandisk/ATA issue:

QNX writes 1 block (or sector) at a time to the media. Upon loss of power it
is possible for the Sandisk to be in the middle of writing a block. Current
information from Sandisk is that when this happens the data in the block
will be unusable, and when that block is reread on the next power up QNX
will see it as a bad block (in other words, the ATA read command will fail
and the read will return EIO).
It is our current understanding that if you then write to this bad block,
the circuitry will change the block from the bad state to a good state and
perform the write correctly.
Note that this behaviour is common to all drives that do ECC error detection.
Even a regular ATA hard disk could write a bad block on power down.

There are other implications as well. On power down there is no guarantee
that even the block addressing circuitry will be correct. We could write
data to position C, but it is possible that the addressing circuitry has
shifted to position D.

Note that chkfsys will not try to write to a block that is marked by Sandisk
as being a bad block. 
The only recovery of a bad block that is marked bad due to a power loss is
to write back to the block. Then upon next read by the Sandisk circuitry the
bad block could then be in a usable mode and be readable. This is the way
the Sandisk flash IDE is designed... it is their mechanism to ensure that
incorrect data is not presented to a program. 

Shutdown issue
------------------

Let's say your system can lose power at any time. 
Let's say you do a chkfsys on power up. 
The chkfsys command will want to automatically update directory structures,
inode tables and bitmaps.
If you lose power during this update then the entire filesystem could be
corrupted beyond repair.

Also, if power is lost while writing a directory entry, the entire directory
block being written could be written incorrectly.

Possible solutions?
-------------------------

First off, QNX does not believe that there is any 100% safe solution that
does not involve a UPS and some form of coordinated shutdown procedure.

To become more fault resistant all of the following approaches could be used:

1. only do a chkfsys when you know you have enough power to run it to
   completion

2. pregrow all files so that no growing of extents and no updating of
   directory entries occur on a grow.

3. use Fsys options and driver options to minimize cache and write immediately
   to disk.

      e.g. Fsys -A -c0K &
           Fsys.ata -w 1000000 &    # for 1 second busy wait

   ** this last option to wait for 1 second for DRQ/BSY is an important one.
      The Sandisk can spin for up to 1 second while shuffling a bad block. If
      you do not set this option then it is possible to have file system
      corruption.

4. program in such a manner that data is synced to disk as soon as possible.

   Details follow in appendix A.

5. try to partition your flash. Have read-only mount points where appropriate.
   If possible, consider breaking up your data writes into separate directory
   structures as described below:

   Directory blocks:
   ----------------------------------------------------------------
   problem:  if a directory block is destroyed, all 8 directory entries will be
             gone. (each dir entry is 64 bytes, a block is 512 bytes, therefore
             8 entries per block)

   how do we minimize this possibility?

   A possible solution is described below:
   Build systems with 1 file in each directory block. Fill the remaining
   7 entries with empty files.
      e.g.

      /dataarea1/
            critical_file
            empty1
            empty2
            empty3
            empty4
            empty5
            empty6
            empty7

      /dataarea2/
            critical_file
            empty1
            empty2
            empty3
            empty4
            empty5
            empty6
            empty7

      This way, loss of a directory block will only affect 1 of the critical
      files.
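
      Approaches 2 and 5 above could be sketched roughly as follows. This
      is a minimal sketch written against plain POSIX calls and not tested
      on QNX 4. The helper names pregrow_file() and make_padded_dir() are
      made up for illustration. Whether the eight entries really end up in
      a single directory block depends on how Fsys lays out a new directory,
      which this sketch does not verify.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_BYTES       512   /* block size quoted in the text above */
#define ENTRIES_PER_BLOCK 8     /* 512 bytes / 64-byte directory entries */

/* Hypothetical helper: create 'path' and write zero-filled data until it is
 * 'size' bytes long, then fsync, so later writes need not grow the file.
 * O_EXCL makes it refuse to touch a file that already exists. */
int pregrow_file(const char *path, off_t size)
{
    char zeros[BLOCK_BYTES];
    off_t done = 0;
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);

    if (fd == -1)
        return -1;
    memset(zeros, 0, sizeof zeros);
    while (done < size) {
        size_t want = (size - done < (off_t)sizeof zeros)
                          ? (size_t)(size - done) : sizeof zeros;
        ssize_t n = write(fd, zeros, want);
        if (n <= 0) {
            close(fd);
            return -1;
        }
        done += n;
    }
    if (fsync(fd) == -1) {
        close(fd);
        return -1;
    }
    return close(fd);
}

/* Hypothetical helper: create 'dir' holding critical_file followed by seven
 * empty padding files (empty1..empty7), as in the /dataarea1 layout. */
int make_padded_dir(const char *dir)
{
    char path[256];
    int i, fd;

    if (mkdir(dir, 0755) == -1)
        return -1;
    for (i = 0; i < ENTRIES_PER_BLOCK; i++) {
        if (i == 0)
            snprintf(path, sizeof path, "%s/critical_file", dir);
        else
            snprintf(path, sizeof path, "%s/empty%d", dir, i);
        fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);
        if (fd == -1)
            return -1;
        close(fd);
    }
    return 0;
}
```

      Both helpers would be run once at install time, while power is known
      to be good, so that normal operation only rewrites blocks that already
      exist.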


Appendix A:
------------------
Use fd-based I/O rather than FILE * based I/O when possible.

For synchronisation with the OS the former is preferred; the
extra level of buffering provided by FILEs can get in the
way, and routines like fflush() are misleading: they empty the
stdio buffer but do little for robustness on their own.

If you need, for example, the text support routines of a FILE, you
should first set up an fd yourself and fdopen() it to get a FILE. You
can also use fileno() to get the fd of a FILE for passing to lower-level
I/O calls.  You may want to look into setvbuf() as well.

Once you are fd-based, you can control the performance/reliability
trade-off of file output in a number of ways:
(i)  Give open() the O_SYNC flag; this will make all writes synchronous,
     and block until completed;
(ii) Use the fsync() call periodically; this will flush any dirty disk
     blocks to the physical device, and block until completed.

It's worth saying again that if your file access is FILE-based, you must
first flush your local dirty stdio buffers with fflush(f) before calling
fsync(fileno(f)).  Think of the file access as being layered:
    physical disk <-> Fsys cache <-> stdio lib <-> your app
You have to move the data to be written securely all the way along (its
normal/default path is more leisurely).

There are also global Fsys options, in particular the '-d' delay, that can
be used to reduce the window of dirty cache data, as well as options on
a per-mount basis.  

Chkfsys...

Chkfsys can only recover data back to a known state, i.e. the last time the
on-disk inode (the structure which maintains all the information about a
file) was updated.  Under normal circumstances, the on-disk inode is only
updated at certain points, such as when a file is first created (it will
show a size of 0) and when the file is closed.

There are other events (some dictated by POSIX and others by the design of
Fsys) which will cause the inode to be updated.  Examples are: when an extent
grows or a new extent is created, when stat/fstat is called for that file,
when fsync/fdatasync is called for that file, sometime after sync is called
(there is no guaranteed time for this) or before a write operation completes
if O_SYNC or O_DSYNC is in effect.

If you want guaranteed recoverable write operations, you will have to open
the file with O_SYNC or O_DSYNC.  With one of those flags set, a call to
write won't return until the data has been written to disk (actually, passed
to the controller by the driver -- if the controller buffers the writes then
all bets are off, but this is unusual) and the inode has been similarly
updated if required.  When this is done, then chkfsys will be able to recover
up to the last successful write operation.  Of course, this slows down file
writing considerably.

A faster alternative is to write the data normally, and periodically, at
suitable synchronization points, call fsync or fdatasync on that file.  These
calls tell Fsys to flush any buffered writes for the specified file and, if
required, update the on-disk inode.  This could be faster because you could
issue a number of (buffered) writes followed by a single synchronization
call.  Similar to the previous suggestion, chkfsys would then be able to
recover to the last time you called one of these functions.
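
A minimal sketch of this faster alternative is shown below: several
ordinary writes followed by one synchronization call. The helper name
write_batch() and its record layout are assumptions for the example; fsync()
could be replaced by fdatasync() where that call is available.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: issue 'count' ordinary (buffered) writes of 'len'
 * bytes each from 'recs', then one fsync() as the synchronization point. */
int write_batch(int fd, const char *recs, size_t len, int count)
{
    int i;

    for (i = 0; i < count; i++) {
        if (write(fd, recs + (size_t)i * len, len) != (ssize_t)len)
            return -1;
    }
    return fsync(fd);    /* one flush covers the whole batch */
}
```

After a power loss, the data would be recoverable up to the last batch for
which this function returned 0.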


Appendix B
----------------------------
Program to try to write back data to a disk in order to kick blocks marked
as unusable by the Sandisk circuitry back to a usable state.
Please note that this code is provided as an example. It is assumed that the
user will test to their level of satisfaction that their system performs
correctly.


/*******************************************************************************
 *  dblock.c
 *  Program to read every block in a mounted partition and write back bad
 *  blocks.
 *
 ******************************************************************************/
#ifdef __USAGE
%C
use:
%C  [-d mountpoint] [-b numblocks_per_read_or_write] [-v]
    mountpoint = filesystem to check. Must be raw device.
                 Default is /dev/hd0t77
    numblocks  = the max number of blocks to read or write at once.
                 Default is the max number supported by the driver.
    v          = verbose

e.g.
%C -d /dev/hd0t77 -v
    Check first QNX partition.
#endif

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <errno.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/osinfo.h>
#include <sys/psinfo.h>
#include <sys/fd.h>
#include <sys/name.h>
#include <sys/sidinfo.h>
#include <sys/irqinfo.h>
#include <sys/timers.h>
#include <i86.h>
#include <time.h>
#include <sys/proc_msg.h>
#include <sys/kernel.h>
#include <sys/qnxterm.h>
#include <sys/stat.h>
#include <signal.h>
#include <setjmp.h>
#include <process.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <limits.h>
#include <sys/disk.h>
#include <sys/fsys.h>

char *diskd        = "/dev/hd0t77";
int blocks_to_read = _MAX_BLOCK_OPS - 1;
int verbose        = 0;

char null_block[_BLOCK_SIZE];     /* block to write back (filled with 0xff) */
int fd;                           /* descriptor for disk device */

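int main(int argc, char *argv[])
{
   int     c;

   while ( ( c = getopt( argc, argv, "d:b:v" ) ) != -1 ) {
         switch(c) {
            case 'd': diskd = optarg;
                      break;
            case 'b': blocks_to_read = atoi(optarg);
                      break;
            case 'v': verbose++;
                      break;
            }
         }

   /* reject a zero or negative -b value before it reaches malloc() */
   if ( blocks_to_read < 1 ) {
      fprintf(stderr, "-b must be at least 1.\n");
      exit(-1);
      }

   do_block_test();
   return 0;
}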

do_block_test()
{
int num_blocks;
int nreq;                  /* blocks requested per read; drops to 1 on EIO */
char *buf;
long bcount = 1;
unsigned total_blocks = 0;
int keepgo = 1;
int read_errno;

if ( verbose ) {
   printf("Opening drive %s\n", diskd);
   }
fd = open(diskd, O_RDWR | O_DSYNC);
if ( fd == -1 ) {
   fprintf(stderr,"Error opening drive. (%s)\n",strerror(errno));
   exit(-1);
   }

printf("operation: block size = %d, max blocks per op = %d\n",
       _BLOCK_SIZE, blocks_to_read);

buf = (char *) malloc( _BLOCK_SIZE * blocks_to_read );
if ( buf == NULL ) {
   fprintf(stderr,"Error allocating buffer. (%s)\n",strerror(errno));
   exit(-1);
   }

printf("hit enter to start\n");
getchar();

nreq = blocks_to_read;
while ( keepgo ) {
  num_blocks = block_read( fd, bcount, nreq, buf );
  read_errno = errno;      /* save before printf() can change errno */
  if ( verbose )
     printf("read %d blocks starting from %ld: wanted %d blocks.\n",
             num_blocks, bcount, nreq);

  if ( num_blocks == -1 ) {
     switch ( read_errno ) {
        case EBADF:
            /* device does not support block I/O */
            fprintf(stderr,
                "Device %s does not support block_read. (%s)\n",
                diskd, strerror(read_errno) );
            keepgo = 0;
            break;
        case EINVAL:
            /* must be beyond end of disk... finished */
            keepgo = 0;
            break;
        case EIO:
            if ( nreq > 1 ) {
               /* a bad block somewhere in this range: retry one block
                  at a time so only the failing block is rewritten */
               nreq = 1;
               break;
               }
            fprintf(stderr,
                "Error reading at block address %ld. (%s)\n",
                 bcount, strerror(read_errno));
            try_write_back( bcount, 1 );
            bcount += 1;              /* step past the rewritten block */
            nreq = blocks_to_read;    /* resume multi-block reads */
            break;
        default:
            /* unexpected error: stop rather than retry forever */
            fprintf(stderr,
                "Unexpected error at block address %ld. (%s)\n",
                 bcount, strerror(read_errno));
            keepgo = 0;
            break;
        }
     }
  else if ( num_blocks == 0 ) {
     /* no progress was made; stop instead of looping */
     keepgo = 0;
     }
  else {
     total_blocks += num_blocks;
     bcount += num_blocks;
     }
  }

printf("total blocks read: %u\n", total_blocks);
}

try_write_back( long bcount, int num_blocks )
{
int i, ret;
int write_errno;

  /* the fill pattern is 0xff, not zeros, despite the "null block" name */
  memset( null_block, 0xff, _BLOCK_SIZE );

  /* try to write the fill block over each block in the range */
  for ( i=0; i < num_blocks; i++ ) {
      ret = block_write( fd, bcount+i, 1, null_block );
      write_errno = errno;   /* save before printf() can change errno */

      if ( ret == -1 ) {
         switch ( write_errno ) {
            case EBADF:
                /* device does not support block I/O */
                fprintf(stderr,
                    "Device does not support block_write. (%s)\n",
                     strerror(write_errno) );
                break;
            case EINVAL:
                /* must be beyond end of disk... finished */
                break;
            case EIO:
                /* still a bad block after the write attempt */
                fprintf(stderr,
                    "Error writing at block address %ld. (%s)\n",
                     bcount+i, strerror(write_errno));
                fprintf(stderr, "Critical error... aborting.\n");
                exit(-1);
            }
         }
      else if ( verbose )
         printf("wrote fill block at %ld\n", bcount+i );
      }
}