
### Background Introduction

Percolator is a web search indexing system with a new architecture, introduced by Google in 2010, that incrementally
processes updates to a large data set. The main reason Google developed Percolator is the limitations of MapReduce: to update the search index after recrawling some small portion of the web, MapReduce must reprocess the entire repository, not just the new documents. Bigtable alone is also insufficient because it does not provide multi-row transactions (it only supports atomic read-modify-write operations on individual rows), which are required for incrementally updating Google's index. Percolator fixes these limitations: it is built on top of Bigtable (which in turn runs on the Google File System, GFS), updates the index incrementally so that work already done is not redone, and provides transactional guarantees.

Trade-offs: Percolator trades efficient use of resources for scalability. It uses twice as many resources as the previous system to process the same crawl rate, and it consumes roughly 30 times more CPU per transaction than a standard DBMS.

### Notable Features of Percolator

ACID transactions

  • Cross-row and cross-table

Snapshot isolation

  • Reads are from one consistent snapshot and writes are performed at another
  • I note that while snapshot isolation is not perfect (it permits write skew), it can achieve great performance (see the sketch after this list)
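
As a toy illustration (my own, not from the paper), the sketch below shows the classic write-skew anomaly that snapshot isolation permits: two transactions read the same snapshot, write to disjoint cells, and both commit, breaking an invariant that any serial execution would preserve. The variables and the x + y > 0 invariant are made up for the example.

```cpp
// Toy illustration of write skew under snapshot isolation (not from the paper).
// Invariant we would like to preserve: x + y > 0.
#include <cassert>
#include <iostream>

int main() {
    int x = 1, y = 1;  // committed state

    // Both transactions read from the same consistent snapshot.
    int snap_x = x, snap_y = y;

    // T1: "if x + y > 1, I may set x to -1" -- checks the invariant on its snapshot.
    int t1_new_x = (snap_x + snap_y > 1) ? -1 : snap_x;

    // T2: "if x + y > 1, I may set y to -1" -- same check, same snapshot.
    int t2_new_y = (snap_x + snap_y > 1) ? -1 : snap_y;

    // The write sets {x} and {y} are disjoint, so snapshot isolation sees no
    // write-write conflict and lets both transactions commit.
    x = t1_new_x;
    y = t2_new_y;

    std::cout << "x + y = " << x + y << '\n';  // prints 0: the invariant is broken
    assert(x + y <= 0 && "write skew: no serial execution could reach this state");
}
```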

### Architecture Design

  • Three binaries run on every machine
    • Bigtable tablet server
    • GFS chunkserver
    • Percolator worker
      • The computations, named observers, are linked into the worker
      • The worker scans for updates to the columns
  • Timestamp Oracle
    • Strictly increasing timestamps
  • Lightweight Lock Service
    • Just an optimization, not required

### Design Notes

(1) A table is indexed by row and column. Since MVCC is used for snapshot isolation, each cell is actually a series of timestamped values (supported natively in Bigtable); see the sketch after these notes.
(2) No central transaction manager
Thus there is no global deadlock detector either; the drawback is increased latency when transactions conflict.

(3) Transactions are implemented in C++
(4) Although blocking calls are used, a thread pool is used for parallelism
The authors found this to be simpler, and it allows for some clever batching
(5) Transactions allow engineers to create consistent secondary indexes
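
To make design note (1) concrete, here is a minimal sketch (my own reconstruction, not Percolator code) of a Bigtable-style table in which each (row, column) cell is a series of timestamped versions, and a snapshot read returns the newest version at or below the read timestamp. The `Cell`, `Table`, and `Read` names are made up for the example.

```cpp
// Minimal sketch (my assumption of the layout, not Percolator code): each
// (row, column) cell holds multiple timestamped versions, as Bigtable does natively.
#include <cstdint>
#include <iostream>
#include <iterator>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Versions are ordered by timestamp so we can find "latest <= snapshot" cheaply.
using Cell = std::map<uint64_t, std::string>;                       // timestamp -> value
using Table = std::map<std::pair<std::string, std::string>, Cell>;  // (row, column) -> cell

// Snapshot read: newest version whose timestamp is <= ts, if any.
std::optional<std::string> Read(const Table& t, const std::string& row,
                                const std::string& col, uint64_t ts) {
    auto it = t.find({row, col});
    if (it == t.end()) return std::nullopt;
    auto v = it->second.upper_bound(ts);    // first version strictly after ts
    if (v == it->second.begin()) return std::nullopt;
    return std::prev(v)->second;
}

int main() {
    Table t;
    t[{"doc1", "contents:data"}][5] = "v1";
    t[{"doc1", "contents:data"}][9] = "v2";
    std::cout << Read(t, "doc1", "contents:data", 7).value_or("<none>") << '\n';  // v1
    std::cout << Read(t, "doc1", "contents:data", 9).value_or("<none>") << '\n';  // v2
}
```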

### Transaction Isolation

Each column A is represented as a data column `A:data`; it also has a lock column `A:lock` and a write column `A:write`.
A worker coordinates a two-phase commit (2PC) protocol (a sketch follows the list below)

  • Prewrite: lock the cell being written to

    • If there is another lock at any timestamp, the transaction aborts
    • If there is a write record after the current transaction’s start timestamp, abort the transaction (write-write conflict)
    • Write the lock and data to each cell at the start timestamp
  • Commit

    • Get a commit timestamp from the timestamp oracle

    • Release each lock and replace it with a write record

    • A write record, stored in `A:write`, points to the timestamp of the prewrite from the first phase.

    • Bigtable supports atomic row operations

    • If the transaction fails to complete at this stage, it will be rolled forward by other workers lazily

  • Get

    • If there is a lock in [0, start_timestamp], wait until the lock is released
    • Otherwise return the data
    • Notice that no transaction T1 could commit before T2’s start_timestamp without first having taken a prewrite lock on the column, since the commit_timestamp is only obtained after the prewrite phase.
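
Below is a simplified, single-threaded sketch of the prewrite/commit/get flow over the `data`, `lock`, and `write` columns, reconstructed from the notes above. It is not the actual Percolator code: per-row atomicity, the primary/secondary lock distinction, and waiting on locks are all elided, and the type and function names are my own.

```cpp
// Sketch of the prewrite/commit/get flow over the data/lock/write columns.
// Simplified single-threaded reconstruction of the notes above, not Percolator code.
#include <cstdint>
#include <iostream>
#include <iterator>
#include <map>
#include <optional>
#include <string>

struct ColumnVersions {
    std::map<uint64_t, std::string> data;   // A:data  -> value written at start_ts
    std::map<uint64_t, std::string> lock;   // A:lock  -> points at the primary lock
    std::map<uint64_t, uint64_t>    write;  // A:write -> commit_ts maps to the data's start_ts
};
using Table = std::map<std::string, ColumnVersions>;  // key: "row/column"

// Prewrite one cell: abort on any lock, or on a write record newer than start_ts.
bool Prewrite(Table& t, const std::string& cell, const std::string& value,
              uint64_t start_ts, const std::string& primary) {
    auto& c = t[cell];
    if (!c.lock.empty()) return false;                            // lock at any timestamp
    if (!c.write.empty() && c.write.rbegin()->first >= start_ts)  // write-write conflict
        return false;
    c.data[start_ts] = value;
    c.lock[start_ts] = primary;                                   // record the primary lock
    return true;
}

// Commit one cell: drop the lock taken at start_ts and add a write record at commit_ts.
void Commit(Table& t, const std::string& cell, uint64_t start_ts, uint64_t commit_ts) {
    auto& c = t[cell];
    c.lock.erase(start_ts);
    c.write[commit_ts] = start_ts;   // the write record points back at the prewrite's data
}

// Get at start_ts: conflict if a lock exists in [0, start_ts]; otherwise follow the
// newest write record at or below start_ts to the data version it points to.
std::optional<std::string> Get(Table& t, const std::string& cell, uint64_t start_ts) {
    auto& c = t[cell];
    auto l = c.lock.upper_bound(start_ts);
    if (l != c.lock.begin()) return std::nullopt;   // a real system would wait / clean up
    auto w = c.write.upper_bound(start_ts);
    if (w == c.write.begin()) return std::nullopt;  // nothing committed yet
    return c.data.at(std::prev(w)->second);
}

int main() {
    Table t;
    uint64_t start_ts = 10, commit_ts = 11;  // would come from the timestamp oracle
    Prewrite(t, "doc1/contents", "hello", start_ts, /*primary=*/"doc1/contents");
    std::cout << Get(t, "doc1/contents", 15).value_or("<blocked or empty>") << '\n';  // blocked by lock
    Commit(t, "doc1/contents", start_ts, commit_ts);
    std::cout << Get(t, "doc1/contents", 15).value_or("<blocked or empty>") << '\n';  // hello
    std::cout << Get(t, "doc1/contents", 10).value_or("<blocked or empty>") << '\n';  // committed at 11 > 10
}
```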

### Lazy Lock Cleanup

  • Each transaction has a primary lock, all other locks in the prewrite phase point to this primary lock
  • If T1 encounters a lock left by T2, it checks whether T2 has likely failed
    • Workers have an ephemeral path in Chubby
    • There is also a timeout
    • The wall time in the lock can be updated periodically for long running transactions
      • I’d like to see the implementation of this. My guess is that the worker has a thread that periodically checks and refreshes all the locks held by the worker across all transactions
  • If T1 concludes that T2 has failed, it attempts to clean up the primary lock first (this is atomic), rolling back the transaction
  • If the crash was in the commit phase, the primary will have been converted to a write record, and T1 is responsible for rolling the transaction forward (see the sketch below)
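
A small sketch of the roll-back vs. roll-forward decision, as I understand it from the notes above (not the paper's code): when T1 finds a stale lock, it inspects the primary cell; if the primary lock is still there, erasing it atomically aborts T2, otherwise a write record pointing back at the same start timestamp means T2 committed, so the stranded cell must be rolled forward. The `Cell` and `CleanupStaleLock` names are invented for the example.

```cpp
// Sketch of the roll-back / roll-forward decision T1 makes when it finds a stale
// lock left by T2 (my reconstruction of the notes above, not the paper's code).
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

struct Cell {
    std::map<uint64_t, std::string> lock;   // start_ts -> primary cell it points to
    std::map<uint64_t, uint64_t>    write;  // commit_ts -> start_ts of committed data
};

// T1 found a lock at start_ts on `stranded` that points to `primary`.
void CleanupStaleLock(Cell& primary, Cell& stranded, uint64_t start_ts) {
    if (primary.lock.count(start_ts)) {
        // T2 never committed: removing the primary lock atomically aborts it,
        // after which the stranded lock can be rolled back too.
        primary.lock.erase(start_ts);
        stranded.lock.erase(start_ts);
        std::cout << "rolled back transaction @" << start_ts << '\n';
        return;
    }
    // Primary lock is gone: if it was replaced by a write record, T2 committed,
    // so T1 must roll the stranded cell forward to the same commit timestamp.
    for (const auto& [commit_ts, data_ts] : primary.write) {
        if (data_ts == start_ts) {
            stranded.lock.erase(start_ts);
            stranded.write[commit_ts] = start_ts;
            std::cout << "rolled forward to commit_ts " << commit_ts << '\n';
            return;
        }
    }
}

int main() {
    Cell primary, stranded;
    stranded.lock[7] = "row1/A";            // secondary lock pointing at the primary
    primary.write[9] = 7;                   // primary already committed at ts 9
    CleanupStaleLock(primary, stranded, 7); // prints: rolled forward to commit_ts 9
}
```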

### Timestamp Oracle

  • Creates a batch of timestamps and logs the highest timestamp to stable storage; the oracle can then hand out timestamps from memory. If the oracle crashes, it simply skips ahead to beyond the highest logged timestamp (see the sketch after this list)
  • It is not sharded: a single machine serves roughly 2 million timestamps/sec
  • Workers batch calls to the timestamp server for several transactions
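
A minimal sketch of the oracle's batching idea, assuming a made-up `TimestampOracle` class and a stubbed-out stable-storage write: timestamps are handed out from memory, only the upper bound of each allocated block is logged, and recovery restarts strictly above the logged bound so no timestamp is handed out twice.

```cpp
// Sketch of the oracle's batching idea (my own simplification, not Google's code):
// allocate a block of timestamps, persist only the highest one, serve from memory.
#include <cstdint>
#include <iostream>

class TimestampOracle {
  public:
    uint64_t GetTimestamp() {
        if (next_ == logged_limit_) {
            // Out of pre-allocated timestamps: "log" a new limit to stable storage
            // (stubbed out here) before handing out anything beyond it.
            logged_limit_ += kBatchSize;
            PersistLimit(logged_limit_);
        }
        return next_++;
    }

    // After a crash, restart strictly above the last logged limit so no timestamp
    // is ever handed out twice, even if part of the batch was never used.
    void RecoverFromCrash() { next_ = logged_limit_; }

  private:
    static constexpr uint64_t kBatchSize = 1000;
    void PersistLimit(uint64_t /*limit*/) { /* write-ahead log / stable storage here */ }
    uint64_t next_ = 0;          // next timestamp to hand out (in memory)
    uint64_t logged_limit_ = 0;  // highest timestamp covered by the durable log
};

int main() {
    TimestampOracle oracle;
    std::cout << oracle.GetTimestamp() << ' ' << oracle.GetTimestamp() << '\n';  // 0 1
    oracle.RecoverFromCrash();                   // pretend the oracle crashed and restarted
    std::cout << oracle.GetTimestamp() << '\n';  // 1000: skips past the logged limit
}
```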

### Notifications

  • Like a trigger, but does not run in the same transaction as the write
    • This is not for data integrity but incremental processing
  • At most one observer’s transaction will commit for a given change
  • Multiple changes may be batched into one observer transaction, however
    • The authors say batching allows them to deal with cells that change quite a lot
  • Each `foo:data` column has an additional `foo:observer0` column for each observer
    • It stores the start timestamp of the latest observer transaction for that cell
    • If two transactions start for the same change, only one of them will commit, because of a conflict on the ack column (see the sketch after this list)
  • Tables may be trillions of cells but only have millions of updates
    • During a write to a watched column, an additional column is written to for each dirty cell
    • These columns are in a separate locality group so that workers only need to scan the notify columns, not the actual data
  • Each worker scans with many threads, picking a random tablet and then a random start key
    • Workers acquire locks in the lightweight lock service prior to running observers on a row
    • The authors make it sound like a lock is acquired for each row scanned, but I imagine it’s more a range of keys? Seeing as the authors are fans of batching calls across transactions, they probably batch requests for row-level locks together.
    • It’s OK if the lock service goes down, it’s just an optimization
  • The authors noted that threads tended to pile up at a slow section of the keyspace. Many threads reaching the same point compounded the problem
    • If a thread notices it is scanning the same row as another thread it jumps to a randomly selected key
  • Weak Notification
    • Write only to notify column
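
A rough sketch of the at-most-one-commit behaviour around the per-observer ack column (my simplification: a plain check-and-set on the ack timestamp stands in for the real transactional conflict on that column). The `Row`, `notify_ts`, and `observer0_ack` names are invented for the example.

```cpp
// Sketch of the at-most-one-commit guarantee via the per-observer ack column
// (my simplification: the ack check-and-set stands in for a full transaction commit).
#include <cstdint>
#include <iostream>

struct Row {
    uint64_t notify_ts = 0;      // timestamp of the latest change to the watched column
    uint64_t observer0_ack = 0;  // start timestamp of the latest observer run ("ack" column)
};

// Returns true if this observer run commits; false if another run already handled
// the change (it would lose the conflict on the ack column in the real system).
bool RunObserver(Row& row, uint64_t start_ts) {
    if (row.observer0_ack >= row.notify_ts) return false;  // change already observed
    // ... observer body would read the row and write derived data here ...
    row.observer0_ack = start_ts;                           // commit point: update the ack
    return true;
}

int main() {
    Row row;
    row.notify_ts = 42;                         // a write to the watched column at ts 42
    std::cout << RunObserver(row, 50) << '\n';  // 1: first run commits
    std::cout << RunObserver(row, 51) << '\n';  // 0: duplicate run sees the ack and skips
}
```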

### Batching

A worker waits until it has collected several prewrite operations before issuing them, e.g. it waits several seconds to build a batch (see the sketch below)
Reads are also batched; the added latency is offset by prefetching other columns in the same row
Prefetching decisions are made based on past read patterns
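
A small sketch of the worker-side batching idea (my own illustration, not the paper's code): operations are buffered and flushed together once the batch is large enough or old enough. The class name and the thresholds are invented.

```cpp
// Sketch of worker-side batching (my own illustration): buffer operations and
// flush them together once the batch is big enough or old enough.
#include <chrono>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

class PrewriteBatcher {
  public:
    void Add(const std::string& op) {
        if (pending_.empty()) batch_started_ = Clock::now();
        pending_.push_back(op);
        if (pending_.size() >= kMaxBatch || Age() >= kMaxDelay) Flush();
    }

  private:
    using Clock = std::chrono::steady_clock;
    static constexpr std::size_t kMaxBatch = 4;          // flush when this many ops queue up
    static constexpr std::chrono::seconds kMaxDelay{2};  // ...or when the oldest op is this stale

    Clock::duration Age() const { return Clock::now() - batch_started_; }

    void Flush() {
        std::cout << "sending batch of " << pending_.size() << " prewrites\n";
        pending_.clear();  // a real worker would issue one RPC covering all of them
    }

    std::vector<std::string> pending_;
    Clock::time_point batch_started_;
};

int main() {
    PrewriteBatcher b;
    for (int i = 0; i < 9; ++i) b.Add("prewrite#" + std::to_string(i));
    // prints "sending batch of 4 prewrites" twice; one prewrite remains queued
}
```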

### Reference

http://blog.octo.com/en/my-reading-of-percolator-architecture-a-google-search-engine-component/