com.bigdata.journal
Class BufferedDiskStrategy

java.lang.Object
  extended by com.bigdata.rawstore.AbstractRawStore
      extended by com.bigdata.rawstore.AbstractRawWormStore
          extended by com.bigdata.journal.AbstractBufferStrategy
              extended by com.bigdata.journal.BufferedDiskStrategy
All Implemented Interfaces:
IBufferStrategy, IDiskBasedStrategy, IAddressManager, IMRMW, IMROW, IRawStore, IStoreSerializer, IUpdateStore, IWORM

public class BufferedDiskStrategy
extends AbstractBufferStrategy
implements IDiskBasedStrategy, IUpdateStore

A disk-based strategy where a large buffer is used to minimize the chance that a read will read through to the disk (under normal circumstances the on-disk file will be fully buffered). This is especially important during asynchronous overflow processing as the data written onto BTrees has been appended onto the store and more or less random reads are required to traverse the BTree tuples in index order.

This strategy is designed for use with StoreManager. The expectation is that the store will be fully buffered MOST of the time. Typically the Options#INITIAL_EXTENT will be set equal to the Options#MAXIMUM_EXTENT using a value on the order of 200M. Normally, overflow will be triggered before the user extent is saturated and the disk file will remain fully buffered. In these cases there will be NO reads through to the disk. Note that neither the DirectBufferStrategy nor the MappedBufferStrategy is suitable for asynchronous overflow, precisely because the JVM does not handle extending a mapped file or the correct release of direct ByteBuffers.

There are a variety of reasons why overflow processing might not be initiated before the user extent overflows (asynchronous overflow may still be running on the old journal, the last set of tasks executing may have written more data that remains in the user extent, etc.). Regardless, in any of these situations the backing file on the disk will be extended BUT NOT the buffer. The buffer itself IS NOT extended for reasons that mostly have to do with memory leaks in the JVM for direct ByteBuffers (in fact, the caller must provide the buffer via the ctor, in a manner very similar to how the write cache is managed for the DiskOnlyStrategy).

The buffer provides both a write cache and a read cache, but only until it is full. On commit, all bytes from the last byte flushed to the backing file will be transferred from the buffer to the backing file. On restart, as much data in the user extent as will fit is read from the backing file into the buffer.

Once the buffer is full, a small WriteCache is allocated using the DirectBufferPool.INSTANCE and reads beyond the extent covered by the buffer go straight through to the disk. Writes are buffered in a write cache. The cache is flushed when it would overflow. As a result only large sequential writes are performed on the store. Reads read through the write cache for consistency.
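
The read path described above can be sketched in plain java.nio. This is a minimal illustration under stated assumptions, not the BufferedDiskStrategy implementation: the class name BufferedReadSketch, its fields, and the read(offset, nbytes) signature are hypothetical, and the write cache is omitted. It shows a fixed-capacity buffer loaded from the backing file on startup, reads served from a read-only view when the record lies wholly inside the buffered region, and reads going through to the disk otherwise.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch of a buffered read path; names are illustrative only. */
final class BufferedReadSketch {

    private final ByteBuffer buffer;   // fixed-capacity image of the start of the file
    private final FileChannel channel; // the backing file
    private final int buffered;        // number of file bytes held in the buffer

    BufferedReadSketch(FileChannel channel, int capacity) throws IOException {
        this.channel = channel;
        this.buffer = ByteBuffer.allocate(capacity);
        // On startup, read as much of the file as will fit into the buffer.
        long pos = 0;
        while (buffer.hasRemaining()) {
            int n = channel.read(buffer, pos);
            if (n <= 0) break; // end of file
            pos += n;
        }
        this.buffered = buffer.position();
    }

    /** True iff every byte of the file is held in memory. */
    boolean isFullyBuffered() throws IOException {
        return channel.size() <= buffered;
    }

    ByteBuffer read(long offset, int nbytes) throws IOException {
        if (offset + nbytes <= buffered) {
            // Record lies wholly inside the buffer: return a read-only view.
            ByteBuffer view = buffer.asReadOnlyBuffer();
            view.limit((int) (offset + nbytes));
            view.position((int) offset);
            return view.slice();
        }
        // Otherwise read through to the disk (no split reads).
        ByteBuffer dst = ByteBuffer.allocate(nbytes);
        long pos = offset;
        while (dst.hasRemaining()) {
            int n = channel.read(dst, pos);
            if (n < 0) throw new IOException("unexpected end of file");
            pos += n;
        }
        dst.flip();
        return dst;
    }

    /** Writes a 16-byte file, buffers the first 8 bytes, and reads on both paths. */
    static String demo() {
        try {
            Path file = Files.createTempFile("sketch", ".jnl");
            try {
                byte[] data = new byte[16];
                for (int i = 0; i < data.length; i++) data[i] = (byte) i;
                Files.write(file, data);
                try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                    BufferedReadSketch s = new BufferedReadSketch(ch, 8);
                    ByteBuffer a = s.read(2, 3);  // served from the buffer
                    ByteBuffer b = s.read(10, 4); // served from the disk
                    return s.isFullyBuffered() + " " + a.get(0) + " " + a.remaining()
                            + " " + b.get(0) + " " + b.remaining();
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch a record that straddles the end of the buffered region is read entirely from the disk, which mirrors the "no split reads" intent stated in the TODO notes for this class.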

One other advantage of this strategy is that interrupts of NIO operations are much less likely to cause the backing FileChannel to be closed asynchronously since most reads will be serviced by the buffer rather than touching the disk. Also, for reads which are serviced by the buffer, we can offer higher concurrency (reads through to the disk are serialized).

Version:
$Id: BufferedDiskStrategy.java 2265 2009-10-26 12:51:06Z thompsonbry $
Author:
Bryan Thompson
See Also:
BufferMode.BufferedDisk, TestBufferedDiskJournal
TODO:
Modify to accept the DirectBufferPool instance to be used as a ctor parameter. That keeps the (de-)allocation local and allows us to configure the size of the backing buffer when we set up the StoreManager and to use small buffers for the unit tests (there can be a distinct DirectBufferPool for TestBufferedDiskJournal). The AbstractJournal will need another parameter for the DirectBufferPool. That argument should be optional and default to null for all of the other buffer modes. We would still pass in the writeCache.
Add a test suite for this variant.
Test out on a cluster. Examine the behavior of a series of overflow operations and see how much this does to improve throughput.
Modify asynchronous overflow to always "finish up" (handle each index before it ends). Explicitly track the indices that have to be processed and those not yet finished. What to do about indices that will not overflow? This is a pretty critical issue as we otherwise could wind up keeping a number of historical journals on hand. Normally I would expect such issues mainly with new applications that are being tested, in which case (a) test on a test federation and (b) the indices can be dropped.
Test correct force(boolean) (must transfer bytes written since the last force).
Verify that reads return an immutable view of the buffer and that high concurrency for reads is allowed when the record to be read lies within the buffered region.
Test correct transition from the buffered extent onto the unbuffered extent.
Verify that records which extend across the buffer are NOT stored in the buffer (no split reads for sanity's sake).
Report whether or not the on-disk write cache is enabled for each platform in AbstractStatisticsCollector. Offer guidance on how to disable that write cache.
Test verifying that the write cache comes online atomically.
Test verifying that writeCache is restored iff necessary on restart.
Test verifying writeCacheOffset is restored correctly on restart (i.e., you can continue to append to the store after restart and the result is valid).
Test verifying that the buffer position and limit are updated correctly by write(ByteBuffer) regardless of the code path.
If possible, refactor to share a common base class with the DiskOnlyStrategy. The main points of departure are the lack of an option for a read cache in this class and the differences in how the buffer is layered in.
Due to the high memory burden, this variant might not be the default for the unit tests of the services. However, it should be the default for deployed distributed federations.
FIXME Examine behavior when write caching is enabled/disabled for the OS. This has a profound impact. Asynchronous writes of multiple buffers, and the use of smaller buffers, may be absolutely necessary when the write cache is disabled. It may be that swapping sets in because the Windows write cache is being overworked, in which case doing incremental and async IO would help. Compare with behavior on server platforms. See http://support.microsoft.com/kb/259716, http://www.accucadd.com/TechNotes/Cache/WriteBehindCache.htm, http://msdn2.microsoft.com/en-us/library/aa365165.aspx, http://www.jasonbrome.com/blog/archives/2004/04/03/writecache_enabled.html, http://support.microsoft.com/kb/811392, http://mail-archives.apache.org/mod_mbox/db-derby-dev/200609.mbox/%3C44F820A8.6000000@sun.com%3E

                /sbin/hdparm -W 0 /dev/hda 0 Disable write caching
                /sbin/hdparm -W 1 /dev/hda 1 Enable write caching
 

Nested Class Summary
static interface BufferedDiskStrategy.Options
          Options for the BufferedDiskStrategy.
 
Field Summary
 DiskOnlyStrategy.StoreCounters storeCounters
          Counters on IRawStore and disk access.
 
Fields inherited from class com.bigdata.journal.AbstractBufferStrategy
bufferMode, ERR_ADDRESS_IS_NULL, ERR_ADDRESS_NOT_WRITTEN, ERR_BAD_RECORD_SIZE, ERR_BUFFER_EMPTY, ERR_BUFFER_NULL, ERR_INT32, ERR_NOT_OPEN, ERR_READ_ONLY, ERR_RECORD_LENGTH_ZERO, ERR_TRUNCATE, initialExtent, log, maximumExtent, nextOffset, WARN
 
Fields inherited from class com.bigdata.rawstore.AbstractRawWormStore
am
 
Fields inherited from class com.bigdata.rawstore.AbstractRawStore
serializer
 
Fields inherited from interface com.bigdata.rawstore.IAddressManager
NULL
 
Method Summary
 long allocate(int nbytes)
          Allocate a record without writing it on the store.
 void close()
          Closes the file immediately (without flushing any pending writes).
 void closeForWrites()
          Extended to discard the write cache.
 void deleteResources()
          Deletes the backing file(s) (if any) and clears any records for the store from the IGlobalLRU.
 void force(boolean metadata)
          Flushes the optional writeCache before syncing the disk.
 FileChannel getChannel()
          The channel used to read and write on the file.
 CounterSet getCounters()
          Return interesting information about the write cache and file operations.
 long getExtent()
          The current size of the journal in bytes.
 File getFile()
          The backing file.
 int getHeaderSize()
          The size of the file header in bytes.
 RandomAccessFile getRandomAccessFile()
          The object used to read and write on that file.
 long getUserExtent()
          The size of the user data extent in bytes.
 boolean isFullyBuffered()
          True iff the store is fully buffered (all reads are against memory).
 boolean isStable()
          True iff backed by stable storage.
 ByteBuffer read(long addr)
          Note: ClosedChannelException and AsynchronousCloseException can get thrown out of this method (wrapped as RuntimeExceptions) if a reader task is interrupted.
 ByteBuffer readRootBlock(boolean rootBlock0)
          Read the specified root block from the backing file.
 long transferTo(RandomAccessFile out)
          A block operation that transfers the serialized records (aka the written on portion of the user extent) en masse from the buffer onto an output file.
 void truncate(long newExtent)
          Either truncates or extends the journal.
 void update(long addr, int off, ByteBuffer data)
          Updates a region of a record.
 long write(ByteBuffer data)
          Write the data (unisolated).
 void writeRootBlock(IRootBlockView rootBlock, ForceEnum forceOnCommit)
          Write the root block onto stable storage (ie, flush it through to disk).
 
Methods inherited from class com.bigdata.journal.AbstractBufferStrategy
assertOpen, destroy, getBufferMode, getInitialExtent, getMaximumExtent, getNextOffset, getResourceMetadata, getUUID, isOpen, isReadOnly, overflow, size, transferFromDiskTo
 
Methods inherited from class com.bigdata.rawstore.AbstractRawWormStore
getAddressManager, getByteCount, getOffset, getOffsetBits, packAddr, toAddr, toString, unpackAddr
 
Methods inherited from class com.bigdata.rawstore.AbstractRawStore
deserialize, deserialize, deserialize, serialize
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface com.bigdata.journal.IBufferStrategy
getBufferMode, getInitialExtent, getMaximumExtent, getNextOffset
 
Methods inherited from interface com.bigdata.rawstore.IRawStore
destroy, getResourceMetadata, getUUID, isOpen, isReadOnly, size
 
Methods inherited from interface com.bigdata.rawstore.IAddressManager
getByteCount, getOffset, packAddr, toAddr, toString, unpackAddr
 
Methods inherited from interface com.bigdata.rawstore.IStoreSerializer
deserialize, deserialize, deserialize, serialize
 

Field Detail

storeCounters

public final DiskOnlyStrategy.StoreCounters storeCounters
Counters on IRawStore and disk access.

Method Detail

getHeaderSize

public final int getHeaderSize()
Description copied from interface: IDiskBasedStrategy
The size of the file header in bytes.

Specified by:
getHeaderSize in interface IBufferStrategy
Specified by:
getHeaderSize in interface IDiskBasedStrategy

getFile

public final File getFile()
Description copied from interface: IDiskBasedStrategy
The backing file.

Specified by:
getFile in interface IDiskBasedStrategy
Specified by:
getFile in interface IRawStore

getRandomAccessFile

public final RandomAccessFile getRandomAccessFile()
Description copied from interface: IDiskBasedStrategy
The object used to read and write on that file.

Specified by:
getRandomAccessFile in interface IDiskBasedStrategy

getChannel

public final FileChannel getChannel()
Description copied from interface: IDiskBasedStrategy
The channel used to read and write on the file.

Specified by:
getChannel in interface IDiskBasedStrategy

getCounters

public CounterSet getCounters()
Return interesting information about the write cache and file operations.

Specified by:
getCounters in interface IBufferStrategy
Specified by:
getCounters in interface IRawStore

isStable

public final boolean isStable()
Description copied from interface: IRawStore
True iff backed by stable storage.

Specified by:
isStable in interface IRawStore

isFullyBuffered

public boolean isFullyBuffered()
Description copied from interface: IRawStore
True iff the store is fully buffered (all reads are against memory). Implementations MAY change the value returned by this method over the life cycle of the store, e.g., to conserve memory a store may drop or decrease its buffer if it is backed by disk.

Note: This does not guarantee that the OS will not swap the buffer onto disk.

Specified by:
isFullyBuffered in interface IRawStore

force

public void force(boolean metadata)
Flushes the optional writeCache before syncing the disk.

Specified by:
force in interface IRawStore
Parameters:
metadata - If true, then force both the file contents and the file metadata to disk.
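
The ordering described here (flush pending cached writes, then sync) can be illustrated with the JDK's FileChannel.force(boolean). The following is a minimal sketch under stated assumptions, not this class's implementation: ForceSketch, its fields, and its write(byte[]) method are hypothetical names introduced for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch: a small write cache that is flushed before the channel is forced. */
final class ForceSketch {

    private final FileChannel channel;
    private final ByteBuffer writeCache;
    private long fileOffset = 0; // file position at which the next flushed byte belongs

    ForceSketch(FileChannel channel, int cacheSize) {
        this.channel = channel;
        this.writeCache = ByteBuffer.allocate(cacheSize);
    }

    void write(byte[] data) throws IOException {
        // Flush first if the record would overflow the cache.
        if (data.length > writeCache.remaining()) flush();
        if (data.length > writeCache.capacity()) {
            // Records larger than the whole cache go straight to the file.
            ByteBuffer src = ByteBuffer.wrap(data);
            while (src.hasRemaining()) fileOffset += channel.write(src, fileOffset);
            return;
        }
        writeCache.put(data);
    }

    /** Writes all cached bytes onto the file as one sequential run. */
    void flush() throws IOException {
        writeCache.flip();
        while (writeCache.hasRemaining()) {
            fileOffset += channel.write(writeCache, fileOffset);
        }
        writeCache.clear();
    }

    /** Flush cached writes first, then ask the OS to sync the file. */
    void force(boolean metadata) throws IOException {
        flush();
        channel.force(metadata);
    }

    /** Returns the file size before and after force(true). */
    static String demo() {
        try {
            Path file = Files.createTempFile("force", ".jnl");
            try (FileChannel ch = FileChannel.open(file,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                ForceSketch s = new ForceSketch(ch, 16);
                s.write(new byte[] {1, 2, 3, 4, 5});
                long before = ch.size(); // bytes are still only in the cache
                s.force(true);
                long after = ch.size();
                return before + " " + after;
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Forcing the channel without flushing first would sync a file that does not yet contain the cached bytes, which is why the flush comes first in this sketch.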

close

public void close()
Closes the file immediately (without flushing any pending writes).

Specified by:
close in interface IRawStore
Overrides:
close in class AbstractBufferStrategy

deleteResources

public void deleteResources()
Description copied from interface: IRawStore
Deletes the backing file(s) (if any) and clears any records for the store from the IGlobalLRU.

Specified by:
deleteResources in interface IRawStore

getExtent

public final long getExtent()
Description copied from interface: IBufferStrategy
The current size of the journal in bytes. When the journal is backed by a disk file this is the actual size on disk of that file. The initial value for this property is set by Options.INITIAL_EXTENT.

Specified by:
getExtent in interface IBufferStrategy

getUserExtent

public final long getUserExtent()
Description copied from interface: IBufferStrategy
The size of the user data extent in bytes.

Note: The size of the user extent is always smaller than the value reported by IBufferStrategy.getExtent() since the latter also reports the space allocated to the journal header and root blocks.

Specified by:
getUserExtent in interface IBufferStrategy

read

public ByteBuffer read(long addr)
Note: ClosedChannelException and AsynchronousCloseException can get thrown out of this method (wrapped as RuntimeExceptions) if a reader task is interrupted.

Specified by:
read in interface IRawStore
Parameters:
addr - A long integer that encodes both the offset from which the data will be read and the #of bytes to be read. See IAddressManager.toAddr(int, long).
Returns:
The data read. The buffer will be flipped to prepare for reading (the position will be zero and the limit will be the #of bytes read).
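
To illustrate how a single long can encode both an offset and a byte count, here is a minimal sketch. The 42-bit offset split, the bit layout, and the class name AddrSketch are assumptions made for the example; consult IAddressManager for the actual encoding used by the store.

```java
/** Hypothetical sketch of packing (offset, byte count) into one long. */
final class AddrSketch {

    // Assumed split: 42 bits of offset, leaving 22 bits of byte count.
    static final int OFFSET_BITS = 42;
    static final int BYTE_COUNT_BITS = 64 - OFFSET_BITS;
    static final long BYTE_COUNT_MASK = (1L << BYTE_COUNT_BITS) - 1;

    /** Packs the byte count into the low bits and the offset into the high bits. */
    static long toAddr(int nbytes, long offset) {
        if (nbytes < 0 || nbytes > BYTE_COUNT_MASK) {
            throw new IllegalArgumentException("byte count out of range: " + nbytes);
        }
        if (offset < 0 || offset >= (1L << OFFSET_BITS)) {
            throw new IllegalArgumentException("offset out of range: " + offset);
        }
        return (offset << BYTE_COUNT_BITS) | nbytes;
    }

    /** Unsigned shift recovers the offset even when the high bit is set. */
    static long getOffset(long addr) {
        return addr >>> BYTE_COUNT_BITS;
    }

    static int getByteCount(long addr) {
        return (int) (addr & BYTE_COUNT_MASK);
    }
}
```

For example, AddrSketch.toAddr(100, 4096) yields an address from which getOffset returns 4096 and getByteCount returns 100.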

allocate

public long allocate(int nbytes)
Description copied from interface: IUpdateStore
Allocate a record without writing it on the store.

Note: The contents of the record having that address are undefined unless and until data is written onto the record using IUpdateStore.update(long, int, ByteBuffer), and only those bytes actually written will be defined.

Specified by:
allocate in interface IUpdateStore
Parameters:
nbytes - The #of bytes in the record.
Returns:
The address of the record.

update

public void update(long addr,
                   int off,
                   ByteBuffer data)
Description copied from interface: IUpdateStore
Updates a region of a record. The record may have been written or simply allocated. The bytes in data from the Buffer.position() to the Buffer.limit() will be written starting at off bytes into the record identified by the addr. The state of the other bytes in the record is unchanged. If their state was undefined (e.g., the record was IUpdateStore.allocate(int)'d but not written) then their state will remain undefined.

Specified by:
update in interface IUpdateStore
Parameters:
addr - The address of an existing record.
off - The offset into that record at which the data will be written.
data - The data to be written.
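
The allocate-then-update pattern can be sketched against a toy in-memory store. Everything here is hypothetical and for illustration only: UpdateSketch is not part of the library, it identifies a record by a plain offset and length rather than an encoded address, and its unwritten bytes happen to be zero whereas the real contract leaves them undefined.

```java
import java.nio.ByteBuffer;

/** Hypothetical in-memory sketch of allocate() followed by a partial update(). */
final class UpdateSketch {

    private final ByteBuffer store = ByteBuffer.allocate(64);
    private int nextOffset = 0;

    /** Reserves nbytes and returns the record's offset; nothing is written. */
    int allocate(int nbytes) {
        if (nbytes <= 0 || nextOffset + nbytes > store.capacity()) {
            throw new IllegalArgumentException("cannot allocate " + nbytes + " bytes");
        }
        int offset = nextOffset;
        nextOffset += nbytes;
        return offset;
    }

    /** Writes data[position..limit) starting at off bytes into the record. */
    void update(int recordOffset, int recordLength, int off, ByteBuffer data) {
        if (off < 0 || off + data.remaining() > recordLength) {
            throw new IllegalArgumentException("update exceeds the record");
        }
        ByteBuffer dst = store.duplicate(); // independent position, shared content
        dst.position(recordOffset + off);
        dst.put(data); // bytes outside the updated region are untouched
    }

    byte byteAt(int index) {
        return store.get(index);
    }

    /** Allocates an 8-byte record and updates bytes 2..4 of it. */
    static String demo() {
        UpdateSketch s = new UpdateSketch();
        int addr = s.allocate(8);
        ByteBuffer data = ByteBuffer.wrap(new byte[] {0, 7, 8, 9, 0});
        data.position(1); // only the bytes between position and limit are written
        data.limit(4);
        s.update(addr, 8, 2, data);
        return s.byteAt(addr + 1) + " " + s.byteAt(addr + 2) + " "
                + s.byteAt(addr + 4) + " " + s.byteAt(addr + 5);
    }
}
```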

write

public long write(ByteBuffer data)
Description copied from interface: IRawStore
Write the data (unisolated).

Specified by:
write in interface IRawStore
Parameters:
data - The data. The bytes from the current Buffer.position() to the Buffer.limit() will be written and the Buffer.position() will be advanced to the Buffer.limit(). The caller may subsequently modify the contents of the buffer without changing the state of the store (i.e., the data are copied into the store).
Returns:
A long integer that encodes both the offset from which the data may be read and the #of bytes to be read. See IAddressManager.
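
The buffer contract for write (position advanced to the limit, data copied so the caller may reuse the buffer) can be demonstrated with a toy append-only store. This is a sketch under stated assumptions: WriteSketch and its 32/32-bit toy address are hypothetical and do not reflect the real store or its address encoding.

```java
import java.nio.ByteBuffer;

/** Hypothetical in-memory sketch of the write(ByteBuffer) buffer contract. */
final class WriteSketch {

    private final ByteBuffer store = ByteBuffer.allocate(64);

    /** Appends data[position..limit) and returns a toy address (offset, length). */
    long write(ByteBuffer data) {
        int nbytes = data.remaining();
        if (nbytes == 0) throw new IllegalArgumentException("empty record");
        int offset = store.position();
        store.put(data); // copies the bytes and advances data.position() to data.limit()
        return ((long) offset << 32) | nbytes;
    }

    /** Returns a read-only view positioned at zero with limit == record length. */
    ByteBuffer read(long addr) {
        int offset = (int) (addr >>> 32);
        int nbytes = (int) addr;
        ByteBuffer view = store.asReadOnlyBuffer();
        view.limit(offset + nbytes);
        view.position(offset);
        return view.slice();
    }

    /** Writes a record, then mutates the caller's buffer to show the store kept a copy. */
    static String demo() {
        WriteSketch s = new WriteSketch();
        ByteBuffer data = ByteBuffer.wrap(new byte[] {1, 2, 3});
        long addr = s.write(data);
        int positionAfterWrite = data.position(); // equals data.limit()
        data.put(0, (byte) 99);                   // does not affect the stored record
        return positionAfterWrite + " " + s.read(addr).get(0);
    }
}
```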

readRootBlock

public ByteBuffer readRootBlock(boolean rootBlock0)
Description copied from interface: IBufferStrategy
Read the specified root block from the backing file.

Specified by:
readRootBlock in interface IBufferStrategy

writeRootBlock

public void writeRootBlock(IRootBlockView rootBlock,
                           ForceEnum forceOnCommit)
Description copied from interface: IBufferStrategy
Write the root block onto stable storage (ie, flush it through to disk).

Specified by:
writeRootBlock in interface IBufferStrategy
Parameters:
rootBlock - The root block. Which root block is indicated by IRootBlockView.isRootBlock0().

truncate

public void truncate(long newExtent)
Description copied from interface: IBufferStrategy
Either truncates or extends the journal.

Note: Implementations of this method MUST be synchronized so that the operation is atomic with respect to concurrent writers.

Specified by:
truncate in interface IBufferStrategy
Parameters:
newExtent - The new extent of the journal. This value represents the total extent of the journal, including any root blocks together with the user extent.

transferTo

public long transferTo(RandomAccessFile out)
                throws IOException
Description copied from interface: IBufferStrategy
A block operation that transfers the serialized records (aka the written on portion of the user extent) en masse from the buffer onto an output file. The buffered records are written "in order" starting at the current position on the output file. The file is grown if necessary. The file position is advanced to the last byte written on the file.

Note: Implementations of this method MUST be synchronized so that the operation is atomic with respect to concurrent writers.

Specified by:
transferTo in interface IBufferStrategy
Parameters:
out - The file to which the buffer contents will be transferred.
Returns:
The #of bytes written.
Throws:
IOException
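
The block transfer described above can be sketched with RandomAccessFile and FileChannel from the JDK. This is an illustration only, assuming a single in-memory buffer whose first nextOffset bytes hold the written records; TransferSketch and its static transferTo helper are hypothetical and omit the synchronization that the contract requires.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch of transferring the written portion of a buffer onto a file. */
final class TransferSketch {

    /** Writes buffer[0..nextOffset) at the file's current position; returns the byte count. */
    static long transferTo(ByteBuffer buffer, int nextOffset, RandomAccessFile out)
            throws IOException {
        ByteBuffer src = buffer.asReadOnlyBuffer(); // leave the caller's position untouched
        src.limit(nextOffset);
        src.position(0);
        FileChannel channel = out.getChannel();
        long start = out.getFilePointer();
        long written = 0;
        while (src.hasRemaining()) {
            // Positional writes grow the file as needed.
            written += channel.write(src, start + written);
        }
        out.seek(start + written); // advance the file position past the last byte written
        return written;
    }

    /** Transfers 6 bytes after a 4-byte prefix; returns count, file pointer, and length. */
    static String demo() {
        try {
            Path file = Files.createTempFile("transfer", ".seg");
            try (RandomAccessFile out = new RandomAccessFile(file.toFile(), "rw")) {
                out.write(new byte[4]); // pre-existing content; the transfer starts after it
                ByteBuffer buffer = ByteBuffer.allocate(32);
                buffer.put(new byte[] {1, 2, 3, 4, 5, 6});
                long n = transferTo(buffer, buffer.position(), out);
                return n + " " + out.getFilePointer() + " " + out.length();
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```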

closeForWrites

public void closeForWrites()
Extended to discard the write cache.

Note: The file is NOT closed and re-opened in a read-only mode in order to avoid causing difficulties for concurrent readers.

Specified by:
closeForWrites in interface IBufferStrategy
Overrides:
closeForWrites in class AbstractBufferStrategy


Copyright © 2006-2009 SYSTAP, LLC. All Rights Reserved.