Advanced topics

When are events published?

Execution events are recorded roughly "as they are happening" inside the execution daemon: you see a BLOCK_START event at roughly the same moment that the execution daemon begins processing a new block, followed by the start of the first transaction (a TXN_HEADER_START event) about 1 millisecond later. Most transaction-related events are recorded less than one microsecond after the transaction they describe has completed.

Execution of a typical transaction will emit a few dozen events, but a large transaction can emit hundreds of events. The TXN_EVM_OUTPUT event -- which is recorded as soon as the transaction is finished -- provides a summary accounting of how many more events related to that transaction will follow (how many logs, how many call frames, etc.), so that any memory can be preallocated. Such an event is referred to as a "header event" in the documentation: an event whose content describes the number of subsequent, related events that will be recorded.
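As a rough illustration of how a header event enables preallocation, the sketch below sizes per-transaction storage from the counts carried in a header. The names used here (`txn_header`, `txn_accum`, `log_count`, `call_frame_count`) are hypothetical stand-ins and do not reflect the real TXN_EVM_OUTPUT payload layout; logs and call frames are reduced to opaque 64-bit values to keep the example small.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical summary carried by a header event. */
struct txn_header {
    uint32_t log_count;
    uint32_t call_frame_count;
};

/* Hypothetical per-transaction accumulator sized from the header. */
struct txn_accum {
    uint64_t *logs;
    uint64_t *call_frames;
    uint32_t logs_expected, logs_seen;
    uint32_t frames_expected, frames_seen;
};

/* Allocate exactly as much storage as the header says will follow.
 * Returns 0 on success, -1 if allocation fails. */
static int txn_accum_init(struct txn_accum *a, const struct txn_header *h)
{
    /* calloc(0, ...) may return NULL, so always request at least one slot. */
    a->logs = calloc(h->log_count ? h->log_count : 1, sizeof *a->logs);
    a->call_frames = calloc(h->call_frame_count ? h->call_frame_count : 1,
                            sizeof *a->call_frames);
    if (a->logs == NULL || a->call_frames == NULL) {
        free(a->logs);
        free(a->call_frames);
        return -1;
    }
    a->logs_expected = h->log_count;
    a->frames_expected = h->call_frame_count;
    a->logs_seen = a->frames_seen = 0;
    return 0;
}

static void txn_accum_add_log(struct txn_accum *a, uint64_t log)
{
    assert(a->logs_seen < a->logs_expected); /* header promised this many */
    a->logs[a->logs_seen++] = log;
}

static void txn_accum_add_call_frame(struct txn_accum *a, uint64_t frame)
{
    assert(a->frames_seen < a->frames_expected);
    a->call_frames[a->frames_seen++] = frame;
}

/* Non-zero once every event announced by the header has arrived. */
static int txn_accum_complete(const struct txn_accum *a)
{
    return a->logs_seen == a->logs_expected &&
           a->frames_seen == a->frames_expected;
}

static void txn_accum_free(struct txn_accum *a)
{
    free(a->logs);
    free(a->call_frames);
}
```

Because the counts are known up front, the reader never has to grow a buffer while the follow-up events are arriving, and it can tell when a transaction's event set is complete.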

All these events are recorded as soon as the transaction is "committed" to the currently-executing block. This happens before the block has finished executing, and should not be confused with the unrelated notion of "commitment" in the consensus algorithm. Although there are complex speculative execution optimizations inside the execution daemon, the recording of a transaction takes place when all work on a particular transaction has finished. This is referred to as "transaction commit" time.

This is different from the block-at-a-time style of update you would see in, for example, the Geth real-time events WebSocket protocol (which our RPC server also supports). Certain properties of the block (its hash, its state root, etc.) are not known at the time you see the transactions. If you would like block-at-a-time updates, the Rust SDK contains some utilities which will aggregate the events back into complete, block-oriented updates.

One thing to be careful of: although transactions are always committed to a block in index order, they might be recorded out of order. That is, you must assume that the set of execution events that make up transactions 2 and 3 could be "mixed together" in any order. This is because of optimizations in the event recording code path.

However, for a particular transaction (e.g., transaction 3) events pertaining to that transaction are always recorded in the same order: first all of the logs, then all the call frames, then all the state access records. Each of these is recorded in index order, i.e., log 2 is always recorded before log 3.
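The ordering rules above can be expressed as a small checker: events from different transactions may interleave freely, but within one transaction the categories must appear as logs, then call frames, then state accesses, each in index order. The `ev` struct, the `ev_kind` enum, and `stream_order_ok` are hypothetical names for this sketch; real events carry this information in a different form.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Categories in the order they are recorded for a single transaction. */
enum ev_kind { EV_LOG = 0, EV_CALL_FRAME = 1, EV_STATE_ACCESS = 2 };

/* Hypothetical, simplified view of a transaction-related event. */
struct ev {
    uint32_t txn_index;  /* which transaction the event belongs to */
    enum ev_kind kind;   /* log, call frame, or state access */
    uint32_t item_index; /* index within its category */
};

#define MAX_TXNS 8u

/* Per-transaction position: current category and next expected index. */
struct txn_cursor {
    enum ev_kind kind;
    uint32_t next_item;
};

/* Returns 1 if every transaction's events arrive in category order and
 * index order, regardless of how transactions interleave; 0 otherwise. */
static int stream_order_ok(const struct ev *evs, size_t n)
{
    struct txn_cursor cur[MAX_TXNS] = {0}; /* all start at EV_LOG, item 0 */
    assert(evs != NULL || n == 0);

    for (size_t i = 0; i < n; ++i) {
        const struct ev *e = &evs[i];
        if (e->txn_index >= MAX_TXNS)
            return 0;
        struct txn_cursor *c = &cur[e->txn_index];
        if (e->kind < c->kind)
            return 0; /* went back to an earlier category */
        if (e->kind > c->kind) {
            c->kind = e->kind; /* moved on to the next category */
            c->next_item = 0;
        }
        if (e->item_index != c->next_item)
            return 0; /* items within a category must be in index order */
        c->next_item++;
    }
    return 1;
}
```

A reader that keeps one cursor (or one accumulator) per transaction index, as here, does not need to care how the events of different transactions are mixed together.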

Sequence numbers and the lifetime detection algorithm

All event descriptors are tagged with an incrementing sequence number starting at 1. Sequence numbers are 64-bit unsigned integers which do not repeat unless the execution daemon is restarted. Zero is not a valid sequence number.

Also note that the sequence number modulo the descriptor array size equals the array index where the next event descriptor will be located. This is shown below with a concrete example where the descriptor array size is 64. Note that the last valid index in the array is 63, then access wraps around to the beginning of the array at index 0.

╔═...════════════════════════Event descriptor array═══╬══════════════════...═╗
║                                                     │                      ║
║  ┌─Event────────┐┌─Event────────┐┌─Event────────┐   │  ┌─Event─────────┐   ║
║  │              ││              ││              │   │  │               │   ║
║  │ seqnum = 318 ││ seqnum = 319 ││ seqnum = 320 │   │  │ seqnum = 257  │   ║
║  │              ││              ││              │   │  │               │   ║
║  └──────────────┘└──▲───────────┘└──────────────┘   │  └───────────────┘   ║
║         61          │   62              63          │          0           ║
╚═...═════════════════╬═══════════════════════════════╬══════════════════...═╝
                      │                               │
                      ■                               ◇
                 Next event                      Ring buffer
                                               wrap-around to
      ┌──────────────────────────────┐          zero is here
      │last read sequence number     │
      │(last_seqno) is initially 318 │
      └──────────────────────────────┘

In this example:

  • We keep track of the "last seen sequence number" (last_seqno) which has value 318 to start; being the "last" sequence number means we have already finished reading the event with this sequence number, which lives at array index 61

  • 318 % 64 is 62, so we will find the potential next event at that index if it has been produced

  • Observe that the sequence number of the item at index 62 is 319, which is the last seen sequence number plus 1 (319 == 318 + 1). This means that event 319 has been produced, and its data can be safely read from that slot

  • When we're ready to advance to the next event, the last seen sequence number will be incremented to 319. As before, we can find the next event (if it has been produced) at 319 % 64 == 63. The event at this index bears the sequence number 320, which is again the last seen sequence number + 1, therefore this event is also valid

  • When advancing a second time, we increment the last seen sequence number to 320. This time, the sequence number of the event at index 320 % 64 == 0 is not 321, but a smaller number, 257 (321 - 64). This means the next event has not been written yet, and we are seeing an older event in the same slot. We've seen all of the currently available events, and will need to check again later once a new event is written

  • Alternatively we might have seen a much larger sequence number, like 385 (321 + 64). This would mean that we consumed events too slowly, so slowly that the writer wrapped all the way around the array and overwrote event 321 with event 385 before we read it. Overwritten events are lost: they can be replayed using services external to the event ring API, but within the event ring API itself there is no way to recover them
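The walk-through above can be sketched as a small polling routine. This is a single-threaded model rather than the real reader API: `ring_write`, `ring_poll`, and the `poll_result` values are hypothetical names, and a real reader would load the sequence number from shared memory with acquire semantics instead of a plain read.

```c
#include <assert.h>
#include <stdint.h>

#define RING_CAPACITY 64u /* descriptor array size used in the example */

/* Only the sequence number matters for this sketch. */
struct event_descriptor {
    uint64_t seqno;
};

enum poll_result {
    POLL_READY,     /* next event is available; last_seqno was advanced */
    POLL_NOT_READY, /* slot still holds an older event; try again later */
    POLL_GAP        /* slot holds a newer event; the one we wanted is lost */
};

/* Writer side of the model: the event with sequence number `seqno` is
 * stored at index (seqno - 1) % capacity, so that `seqno % capacity` is
 * the index of the event that follows it. */
static void ring_write(struct event_descriptor *ring, uint64_t seqno)
{
    assert(seqno != 0); /* zero is not a valid sequence number */
    ring[(seqno - 1) % RING_CAPACITY].seqno = seqno;
}

/* Reader side: look at the slot where the next event would live and
 * compare its sequence number with last_seqno + 1. */
static enum poll_result ring_poll(const struct event_descriptor *ring,
                                  uint64_t *last_seqno)
{
    uint64_t expected = *last_seqno + 1;
    uint64_t seen = ring[*last_seqno % RING_CAPACITY].seqno;

    if (seen == expected) {
        *last_seqno = expected; /* event produced; safe to consume */
        return POLL_READY;
    }
    if (seen < expected)
        return POLL_NOT_READY;  /* an older event is still in the slot */
    return POLL_GAP;            /* writer lapped us and overwrote it */
}
```

With a power-of-two capacity, the modulo can be replaced by a bit mask; the model uses `%` to stay close to the text.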

Lifetime of an event payload, zero copy vs. memcpy APIs

Because of the descriptor overwrite behavior, an event descriptor might be overwritten by the execution daemon while a reader is still examining its data. To deal with this, the reader API makes a copy of the event descriptor. If it detects that the event descriptor changed during the copy operation, it reports a gap. Copying an event descriptor is fast, because it is only a single cache line in size.
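A minimal sketch of this copy-then-recheck pattern is shown below, using C11 atomics for the sequence number. The descriptor fields and the `copy_descriptor` function are assumptions made for illustration, not the real descriptor layout or reader API. The plain field reads would formally race with a concurrent writer under the ISO C memory model, so a production implementation has to follow the ordering rules of the actual writer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor layout living in shared memory. */
struct event_descriptor {
    _Atomic uint64_t seqno;
    uint32_t event_type;
    uint32_t payload_size;
    uint64_t payload_buf_offset;
};

/* Reader-private copy; safe to inspect at leisure once validated. */
struct descriptor_copy {
    uint64_t seqno;
    uint32_t event_type;
    uint32_t payload_size;
    uint64_t payload_buf_offset;
};

/* Copy the descriptor for `expected_seqno` out of `slot`.
 * Returns true if the slot held that event both before and after the
 * copy; false means the slot did not hold it or changed meanwhile. */
static bool copy_descriptor(struct event_descriptor *slot,
                            uint64_t expected_seqno,
                            struct descriptor_copy *out)
{
    assert(slot != NULL && out != NULL);

    if (atomic_load_explicit(&slot->seqno, memory_order_acquire) !=
        expected_seqno)
        return false;

    out->seqno = expected_seqno;
    out->event_type = slot->event_type;
    out->payload_size = slot->payload_size;
    out->payload_buf_offset = slot->payload_buf_offset;

    /* Keep the field reads above ordered before the re-check below. */
    atomic_thread_fence(memory_order_acquire);
    return atomic_load_explicit(&slot->seqno, memory_order_relaxed) ==
           expected_seqno;
}
```

If the function returns false after the first check passed, the copied fields may be a mix of two events and must be discarded; the caller would report a gap.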

This is not the case for event payloads, which could potentially be very large. This means a memcpy(3) of an event payload could be expensive, and it would be advantageous to read the payload bytes directly from the payload buffer's shared memory segment: a "zero-copy" API. This exposes the user to the possibility that the event payload could be overwritten while the reader is still using it, so two solutions are provided:

  1. A simple detection mechanism allows payload overwrite to be detected at any time: the writer keeps track of the minimum payload offset value (before modular arithmetic is applied) that is still valid. If the offset value in the event descriptor is smaller than this, it is no longer safe to read the event payload

  2. A payload memcpy-style API is also provided. This uses the detection mechanism above in the following way: first, the payload is copied to a user-provided buffer. Before returning, it checks if the lifetime remained valid after the copy finished. If so, then an overwrite did not occur during the copy, so the copy must be valid. Otherwise, the copy is invalid
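Both mechanisms can be modeled in a few lines. The sketch below is a single-threaded simulation under stated assumptions: `payload_ring`, `payload_write`, `payload_check`, and `payload_memcpy` are hypothetical names, the buffer is tiny, and the writer's bookkeeping fields would need to be atomics in real shared memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAYLOAD_BUF_SIZE 256u

/* Simulated payload buffer with the writer's bookkeeping. Offsets are
 * "unwrapped": they only ever grow, and are reduced modulo the buffer
 * size when the bytes are actually addressed. */
struct payload_ring {
    uint8_t bytes[PAYLOAD_BUF_SIZE];
    uint64_t next_offset;  /* where the writer will allocate next */
    uint64_t window_start; /* minimum unwrapped offset still valid */
};

/* Writer: reserve space, retire anything that is about to be
 * overwritten, then write. Returns the payload's unwrapped offset. */
static uint64_t payload_write(struct payload_ring *r, const void *src,
                              uint32_t size)
{
    assert(size <= PAYLOAD_BUF_SIZE);
    uint64_t off = r->next_offset;
    r->next_offset += size;
    if (r->next_offset - r->window_start > PAYLOAD_BUF_SIZE)
        r->window_start = r->next_offset - PAYLOAD_BUF_SIZE;
    for (uint32_t i = 0; i < size; ++i)
        r->bytes[(off + i) % PAYLOAD_BUF_SIZE] = ((const uint8_t *)src)[i];
    return off;
}

/* Solution 1: the payload is intact as long as its offset is not below
 * the writer's minimum valid offset. */
static bool payload_check(const struct payload_ring *r, uint64_t offset)
{
    return offset >= r->window_start;
}

/* Solution 2: copy first, then confirm the payload was still valid
 * after the copy finished. On false, the contents of dst are garbage. */
static bool payload_memcpy(const struct payload_ring *r, uint64_t offset,
                           uint32_t size, void *dst)
{
    if (!payload_check(r, offset))
        return false;
    for (uint32_t i = 0; i < size; ++i)
        ((uint8_t *)dst)[i] = r->bytes[(offset + i) % PAYLOAD_BUF_SIZE];
    return payload_check(r, offset);
}
```

Note that the model's writer advances the minimum valid offset before it touches the bytes, which is what makes the check after the copy conclusive.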

The reason to prefer the zero-copy APIs is that they do less work. The reason to prefer memcpy APIs is that it is not always easy (or possible) to "undo" the work you did if you find out later that the event payload was corrupted by an overwrite while you were working with it. The most logical thing to do in that case is to start by copying the data to a stable location, and if the copy isn't valid, to never start the operation.

An example user of the zero-copy API is the eventwatch example C program, which can turn events into printed strings that are sent to stdout. The expensive work of formatting a hexdump of the event payload is performed using the original payload memory. If an overwrite happened during the string formatting, the hexdump output buffer will be wrong, but that is OK: it will not be sent to stdout until the end. Once formatting is complete, eventwatch checks if the payload expired and if so, writes an error to stderr instead of writing the formatted buffer to stdout.
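The pattern described here (do the expensive work on the original memory, then validate before committing the result) might look roughly like the sketch below. It is not the eventwatch source: `payload_view`, `payload_expired`, `format_hex_zero_copy`, and `print_event` are hypothetical names, and the writer's minimum valid offset is modeled as a plain variable rather than a value in shared memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical zero-copy view of one event payload. */
struct payload_view {
    const uint8_t *bytes;         /* points directly at the payload */
    uint64_t offset;              /* unwrapped offset from the descriptor */
    const uint64_t *window_start; /* writer's minimum valid offset */
};

static bool payload_expired(const struct payload_view *v)
{
    return v->offset < *v->window_start;
}

/* Format `size` payload bytes as hex into `out`, reading the payload in
 * place. Returns true only if the payload was still valid afterwards;
 * on false, `out` may contain garbage and must not be used. */
static bool format_hex_zero_copy(const struct payload_view *v, size_t size,
                                 char *out, size_t out_cap)
{
    assert(out_cap >= 2 * size + 1);
    for (size_t i = 0; i < size; ++i)
        snprintf(out + 2 * i, 3, "%02x", (unsigned)v->bytes[i]);
    out[2 * size] = '\0';
    return !payload_expired(v); /* validate after the work is done */
}

/* Nothing reaches `ok` unless the formatted text is known to be good.
 * Returns 0 on success, -1 if the payload expired during formatting. */
static int print_event(const struct payload_view *v, size_t size,
                       FILE *ok, FILE *err)
{
    char line[2 * 64 + 1];
    if (size > 64)
        size = 64; /* keep the sketch's output buffer small */
    if (format_hex_zero_copy(v, size, line, sizeof line)) {
        fprintf(ok, "%s\n", line);
        return 0;
    }
    fprintf(err, "payload expired during formatting\n");
    return -1;
}
```

This works because the only side effect, the write to the output stream, is deferred until after the validity check; the formatting buffer itself is disposable.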

Whether you should copy or not depends on the characteristics of the reader, namely how easily it can deal with "aborting" processing.