Package org.neo4j.io.pagecache

The Neo4j PageCache API

This package contains the API for the page caching mechanism used in Neo4j. How to acquire a concrete implementation of the API depends on the implementation in question. The Kernel implements its own mechanism to seek out and instantiate implementations of this API, based on the database configuration.

Page Caching Concepts

The purpose of a page cache is to cache data from files on a storage device, and keep the most often used data in memory where access is fast. This duplicates the most popular data from the file into memory. Assuming that not all data can fit in memory (even though it sometimes can), the least used data is pushed out of memory when we need data that is not already in the cache. This is called eviction, and choosing what to evict is the responsibility of the eviction algorithm that runs inside the page cache implementation.

A file must first be "mapped" into the page cache before the page cache can cache the contents of the file. When you no longer have an immediate use for the contents of the file, it can be "unmapped." Mapping a file using the map method gives you a PagedFile object, through which the contents of the file can be accessed. Once a file has been mapped with the page cache, it should no longer be accessed directly through the file system, because the page cache will keep changes in memory, thinking it is managing the only authoritative copy.

If a file is mapped more than once, the same PagedFile is returned, and its reference counter is incremented. Unmapping decrements the reference counter, discarding the PagedFile from the cache if the counter reaches zero. If the last reference was unmapped, then all dirty pages for that file will be flushed before the file is discarded from the cache.
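The reference counting described above can be sketched as follows. This is a minimal illustration, not the page cache implementation: SketchPagedFile and SketchMappingRegistry are hypothetical names, and the "flushed" flag merely stands in for writing out dirty pages.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a mapped file; not the real PagedFile interface.
final class SketchPagedFile {
    final String path;
    int referenceCount; // number of outstanding map() calls for this file
    boolean flushed;    // set when the last unmap "flushes" the dirty pages

    SketchPagedFile(String path) {
        this.path = path;
    }
}

// Hypothetical registry illustrating map/unmap reference counting.
final class SketchMappingRegistry {
    private final Map<String, SketchPagedFile> mapped = new HashMap<>();

    // Mapping the same path again returns the same object and bumps the counter.
    synchronized SketchPagedFile map(String path) {
        SketchPagedFile file = mapped.computeIfAbsent(path, SketchPagedFile::new);
        file.referenceCount++;
        return file;
    }

    // Unmapping decrements the counter; the last unmap flushes and discards.
    synchronized void unmap(SketchPagedFile file) {
        if (file.referenceCount <= 0) {
            throw new IllegalStateException("File is not mapped: " + file.path);
        }
        file.referenceCount--;
        if (file.referenceCount == 0) {
            file.flushed = true; // stands in for flushing all dirty pages
            mapped.remove(file.path);
        }
    }

    synchronized boolean isMapped(String path) {
        return mapped.containsKey(path);
    }
}
```

In this sketch, two map calls for the same path must be balanced by two unmap calls before the file is flushed and forgotten.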

A "page" is a space that can fit a quantity of data, and is part of a larger whole. This larger whole can either be a file, or the memory allocated for the page cache. We refer to these two types of pages as "file pages" and "cache pages" respectively. Pages are the unit by which the popularity of data is tracked, and the unit in which data is moved into memory and out to storage. When a cache page is holding the contents of a file page, the two are said to be "bound" to one another.

Each PagedFile object has a translation table that logically translates file page ids for the given file into cache page ids. The concrete implementations are typically more like Maps, where the keys are the file page ids and the values are the concrete page objects that currently hold those particular file pages.

File pages are typically sized as a multiple of the size of the records they contain, so that you are guaranteed to be able to read or write a record in full whenever you pin a page. File pages should be as large as they can possibly be, while still being no larger than the cache page size. The filePageId can then be computed from the recordId as the integer division recordId / recordsPerPage, while the remainder of that same division gives the position of the record within the page. Multiplying that position by the record size gives the byte offset into the page.
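The arithmetic can be sketched as follows. The class and method names are hypothetical and chosen for illustration; the byte offset is taken to be the record's position within the page multiplied by the record size.

```java
// Hypothetical helper illustrating record addressing; not part of the API.
final class SketchRecordAddressing {
    // Largest multiple of recordSize that still fits in one cache page.
    static int filePageSize(int cachePageSize, int recordSize) {
        return cachePageSize - (cachePageSize % recordSize);
    }

    // Number of whole records that fit in one file page.
    static int recordsPerPage(int filePageSize, int recordSize) {
        return filePageSize / recordSize;
    }

    // The file page that holds the given record.
    static long filePageId(long recordId, int recordsPerPage) {
        return recordId / recordsPerPage;
    }

    // Byte offset of the record within its file page.
    static int offsetInPage(long recordId, int recordsPerPage, int recordSize) {
        return (int) (recordId % recordsPerPage) * recordSize;
    }
}
```

For example, with an 8192 byte cache page and 15 byte records, the file page size is 8190 bytes and holds 546 records, so record 1000 lives in file page 1 at byte offset 6810.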

If a file page is not in memory, but someone needs it, a page fault occurs. Page faulting means finding a free page and swapping the contents of the given file page into it. This has to be done in a thread-safe way, because multiple threads may race to discover that a page they want is not in memory, and this may be the same page. Page faulting also has to update the translation table, which again needs to be done in a thread-safe manner. Finally, page faulting needs to take races with eviction into consideration, as the faulted pages are transitioning from free to bound, while eviction is a process that transitions pages from bound to free.
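One way to picture a thread-safe page fault is the sketch below. It assumes a ConcurrentHashMap as the translation table and a concurrent queue as the free list, and it relies on computeIfAbsent to ensure that racing threads fault a given file page only once. All class names are hypothetical, the races with eviction are left out, and an actual implementation may synchronise quite differently.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical cache page; -1 means the page is free (not bound).
final class SketchCachePage {
    long boundFilePageId = -1;
}

// Hypothetical per-file state: a translation table plus a shared free list.
final class SketchFaultingFile {
    private final ConcurrentHashMap<Long, SketchCachePage> translationTable =
            new ConcurrentHashMap<>();
    private final ConcurrentLinkedQueue<SketchCachePage> freeList =
            new ConcurrentLinkedQueue<>();
    final AtomicInteger faults = new AtomicInteger();

    SketchFaultingFile(int cachePages) {
        for (int i = 0; i < cachePages; i++) {
            freeList.add(new SketchCachePage());
        }
    }

    // Returns the cache page bound to the file page, faulting it in if needed.
    SketchCachePage pin(long filePageId) {
        // computeIfAbsent runs the faulting function at most once per key,
        // even when several threads ask for the same missing page.
        return translationTable.computeIfAbsent(filePageId, id -> {
            SketchCachePage page = freeList.poll();
            if (page == null) {
                // A real page cache would wait for eviction to free a page.
                throw new IllegalStateException("No free pages");
            }
            page.boundFilePageId = id; // stands in for swapping in the contents
            faults.incrementAndGet();
            return page;
        });
    }
}
```

Many threads pinning the same missing file page observe a single fault and receive the same cache page.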

If there are no, or not enough, free pages, then eviction occurs. Each page has a usage counter that is incremented on access and decremented by the dedicated eviction thread. If the counter reaches zero, the page is evicted. If the page is dirty because it has received writes since it was faulted, it is flushed before it is evicted and added back to the list of free pages.
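The usage counter scheme can be sketched as a single-threaded sweep. This is an illustration under assumptions: the cap on the counter, the sweep order, and all names are hypothetical, and a real eviction thread must also coordinate with concurrent page faults and page locks.

```java
import java.util.List;

// Hypothetical page with just enough state to illustrate eviction.
final class SketchEvictablePage {
    static final int MAX_USAGE = 4; // assumption: the counter is capped

    int usageCounter;
    boolean bound;
    boolean dirty;
    int flushCount; // how many times this page was written back

    static SketchEvictablePage boundPage() {
        SketchEvictablePage page = new SketchEvictablePage();
        page.bound = true;
        return page;
    }

    // Every access bumps the counter; writes also mark the page dirty.
    void access(boolean write) {
        if (usageCounter < MAX_USAGE) {
            usageCounter++;
        }
        if (write) {
            dirty = true;
        }
    }
}

final class SketchEvictor {
    // Sweeps the pages, decrementing counters, until one page reaches zero.
    // Returns the freed page, or null if no page is bound.
    static SketchEvictablePage evictOne(List<SketchEvictablePage> pages) {
        while (true) {
            boolean sawBoundPage = false;
            for (SketchEvictablePage page : pages) {
                if (!page.bound) {
                    continue;
                }
                sawBoundPage = true;
                if (page.usageCounter > 0) {
                    page.usageCounter--;
                }
                if (page.usageCounter == 0) {
                    if (page.dirty) {
                        page.flushCount++; // stands in for writing to storage
                        page.dirty = false;
                    }
                    page.bound = false; // the page is free again
                    return page;
                }
            }
            if (!sawBoundPage) {
                return null;
            }
        }
    }
}
```

Frequently accessed pages survive more sweeps than rarely accessed ones, which is what keeps popular data in memory.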

Knowledge of how to move file pages in and out of cache pages is contained in a so-called PageSwapper. The Page implementations themselves know how to do IO that moves data in and out of their respective memory area, but it is the swapper that tells them what file to use for IO, where in that file the data is located, and how much data needs to be moved. Every PagedFile has its own dedicated PageSwapper, which is instantiated for the given file by the PageSwapperFactory.

Once a file has been mapped, and a PagedFile object made available, the io method can be used to interact with the contents of the file. It takes in an initial file page id and a bitmap of intentions, such as what locking behaviour to use, and returns a PageCursor object. The PageCursor is the window into the data managed by the page cache.

Initially, the PageCursor is not bound to any page. Calling the PageCursor.next() method on the cursor will advance it to its next page. The first page that the cursor binds to is the page with the file page id given to the io method. From then on, the cursor will scan linearly through the file.

The next method returns true if it successfully bound to the next page in its sequence. This is usually the case, but when PagedFile.PF_SHARED_LOCK or PagedFile.PF_NO_GROW is specified, the next method will return false if the cursor would otherwise move beyond the end of the file.

The next method will acquire the desired lock on the page (as specified by the pf_flags argument to the io method call), and then we can do the IO we intended. Following the IO, the PageCursor.shouldRetry() method must be consulted, and the IO must be redone on the page if it returns true. This is best done in a do-while loop. This retrying allows some optimistic optimisations in the page cache that improve performance on average.
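The shape of the retry loop can be sketched without the real classes. The SketchCursor interface below is a hypothetical, cut-down model of a cursor with next, shouldRetry and getLong methods, and SketchOptimisticCursor is a fake that simulates one inconsistent read per page. Neither is the actual PageCursor.

```java
// Hypothetical, minimal model of a cursor; not the real PageCursor.
interface SketchCursor {
    boolean next();        // bind to the next page, false at end of file
    boolean shouldRetry(); // true if the reads since next() must be redone
    long getLong();        // read a value from the current page
}

// Fake cursor whose first read on every page is inconsistent.
final class SketchOptimisticCursor implements SketchCursor {
    private final long[] pages; // one long value per simulated page
    private int current = -1;
    private boolean conflictPending;
    int reads; // counts calls to getLong, including retried ones

    SketchOptimisticCursor(long[] pages) {
        this.pages = pages;
    }

    @Override
    public boolean next() {
        if (current + 1 >= pages.length) {
            return false;
        }
        current++;
        conflictPending = true; // simulate a concurrent write on first read
        return true;
    }

    @Override
    public long getLong() {
        reads++;
        return conflictPending ? -1L : pages[current]; // -1 models garbage
    }

    @Override
    public boolean shouldRetry() {
        boolean retry = conflictPending;
        conflictPending = false;
        return retry;
    }
}

final class SketchReader {
    // Sums one value per page, redoing the read whenever the cursor asks.
    static long sumAll(SketchCursor cursor) {
        long sum = 0;
        while (cursor.next()) {
            long value;
            do {
                value = cursor.getLong();
            } while (cursor.shouldRetry());
            sum += value; // only use data after shouldRetry returned false
        }
        return sum;
    }
}
```

Note that values read inside the do-while loop are only acted upon after shouldRetry has returned false, since anything read before that point may be inconsistent.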

Here's a logical overview of a page cache:


     +---------------[ PageCache ]-----------------------------------+
     |                                                               |
     |  * PageSwapperFactory{ FileSystemAbstraction }                |
     |  * evictionThread                                             |
     |  * a large collection of Page objects:                        |
     |                                                               |
     |  +---------------[ Page ]----------------------------------+  |
     |  |                                                         |  |
     |  |  * usageCounter                                         |  |
     |  |  * some kind of read/write lock                         |  |
     |  |  * a cache page sized buffer                            |  |
     |  |  * binding metadata{ filePageId, PageSwapper }          |  |
     |  |                                                         |  |
     |  +---------------------------------------------------------+  |
     |                                                               |
     |  * linked list of mapped PagedFile instances:                 |
     |                                                               |
     |  +--------------[ PagedFile ]------------------------------+  |
     |  |                                                         |  |
     |  |  * referenceCounter                                     |  |
     |  |  * PageSwapper{ StoreChannel, filePageSize }            |  |
     |  |  * PageCursor freelists                                 |  |
     |  |  * translation table:                                   |  |
     |  |                                                         |  |
     |  |  +--------------[ translation table ]----------------+  |  |
     |  |  |                                                   |  |  |
     |  |  |  A translation table is basically a map from      |  |  |
     |  |  |  file page ids to Page objects. It is updated     |  |  |
     |  |  |  concurrently by page faulters and the eviction   |  |  |
     |  |  |  thread.                                          |  |  |
     |  |  |                                                   |  |  |
     |  |  +---------------------------------------------------+  |  |
     |  +---------------------------------------------------------+  |
     +---------------------------------------------------------------+

     +--------------[ PageCursor ]-----------------------------------+
     |                                                               |
     |  * currentPage: Page                                          |
     |  * page lock metadata                                         |
     |                                                               |
     +---------------------------------------------------------------+
 

Copyright © 2002–2017 The Neo4j Graph Database Project. All rights reserved.