Design Document for the Test Execution Engine

The test execution engine (TEE) will read the XML format generated by the CPN Tools test generator. This format should be as general as possible in order to support as many use cases as possible, but our main focus here is on distributed systems.

In the first incarnation of the tool, we will limit ourselves to testing from a client's perspective. That is, we consider that the TEE simulates one or more clients invoking RPC calls on a server or a group of servers in the case of Gorums. For now, we don't use multiple client processes, but simulate multiple clients by making RPC calls from different goroutines. As such we provide the <Concurrent> tag to allow the TEE runtime to start a goroutine for each <Call> tag specified within a <Concurrent> tag. Further, if one or more <Call> tags are specified outside a <Concurrent> tag, the TEE runtime invokes these calls sequentially. In addition to this, we can also specify a <Sequential> tag to accomplish the same. This is mainly useful for constructing a sequence of calls concurrent with another sequence of calls.

Here is an example XML (an alternative example will be given below):

<Test Name="CrashFailurePaxos" Type="systemtest">
    <TypeAssignments>
        <SystemParameterTypes>
            <Field Name="SystemSize" Type="int"></Field>
            <Field Name="QuorumSize" Type="int"></Field>
            <Field Name="ServerIDs" Type="*[]int"></Field>
        </SystemParameterTypes>
        <InputType Name="Server" Type="">
            <Field Name="IDs" Type="*[]string"></Field>
        </InputType>
        <OracleType Name="LegalResponses" Type="*[]string"></OracleType>
        <OracleType Name="ExpectedLeaders" Type="*[]string"></OracleType>
    </TypeAssignments>
    <SystemParameters>
    </SystemParameters>
    <TestCase ID="1">
        <Concurrent>
            <Call Name="Failure">
                <InputValues>
                    <Server IDs="8080,8081"/>
                </InputValues>
            </Call>
            <Call Name="Prepare">
                <InputValues>
                    <PrepareMsg Rnd="42"/>
                </InputValues>
            </Call>
            <Call Name="Prepare" Fail="0.5">
                <InputValues>
                    <PrepareMsg Rnd="43"/>
                </InputValues>
            </Call>
        </Concurrent>
        <Call Name="Accept">
            <InputValues>
                <AcceptMsg Rnd="42" Val="paxos"/>
            </InputValues>
        </Call>
        <Call Name="Commit">
            <InputValues>
                <LearnMsg Rnd="42" Val="paxos"/>
            </InputValues>
        </Call>
        <Oracles>
            <ExpectedLeaders>8082</ExpectedLeaders>
            <LegalResponses>paxos</LegalResponses>
        </Oracles>
    </TestCase>
</Test>

Types used in <InputValues> and <Oracles> must either be defined in the <TypeAssignments> part, or a Go type with the same name must already exist, so that the tool can find and match against those types. For this to work, the fields of the Go struct must be annotated with the xml:"..." tag, as shown in this example for the <AcceptMsg> type:

type AcceptMsg struct{
    Rnd uint32 `xml:"Rnd"`
    Val string `xml:"Val"`
}
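As a sketch of how such a struct could be filled from the test file with Go's encoding/xml package: note that the example XML passes Rnd and Val as attributes, and encoding/xml only maps attributes to fields whose tag carries the ",attr" option. The variant below therefore uses xml:"Rnd,attr"; whether the TEE should use attributes or child elements is an open choice, and decodeAccept is a hypothetical helper.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// AcceptMsg mirrors the struct above, but with ",attr" in the tags because
// the example test file writes Rnd and Val as XML attributes.
// (Assumption of this sketch, not a decision made in the design.)
type AcceptMsg struct {
	Rnd uint32 `xml:"Rnd,attr"`
	Val string `xml:"Val,attr"`
}

// decodeAccept parses a single <AcceptMsg .../> element.
func decodeAccept(data string) (AcceptMsg, error) {
	var m AcceptMsg
	err := xml.Unmarshal([]byte(data), &m)
	return m, err
}

func main() {
	m, err := decodeAccept(`<AcceptMsg Rnd="42" Val="paxos"/>`)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("%+v\n", m) // prints "{Rnd:42 Val:paxos}"
}
```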

The <InputValues> of the different <Call> tags should match up with the expected input arguments to those calls. To help ensure that these fields match up we require that every method name mentioned in a <Call> is defined in a user-defined TestAdapter interface, of which an example is shown below.

// TestAdapter specifies the methods that can be invoked by the
// Test Execution Engine to drive the execution of a test.
type TestAdapter interface {
  Prepare(PrepareMsg) PromiseMsg
  Accept(AcceptMsg) LearnMsg
  Commit(LearnMsg)
  Failure(string)
}

The user must both define this interface and provide an implementation of each of its methods. Often these implementations are simple forwarding methods, but they can also provide custom functionality. For example, one can imagine failure injection methods that produce behaviors that cannot easily be simulated by other means.

Moreover, each of the <InputValues> used in the XML format must correspond to the types used in those methods. Typically, these input types are defined as message types in the proto file and generated by the protoc compiler, and so we do not need to specify these details in the <TypeAssignments> part. In addition to the proto message types, input values may assume any of the built-in datatypes of the Go language.

However, the exact data type representation of the proto messages must also be known to the CPN Tools test generator. To assist with this, we could implement a translation function that converts proto message types and Go types to a format that can be used by CPN Tools.

Note that the <Call> tag can take a Fail attribute, which specifies the probability with which the given call fails.

We also specify several <Oracles>, such as <LegalResponses> and <ExpectedLeaders>, which are checked after the execution. To allow the TEE to obtain the results of an execution, the user must also implement the following interface, which is called after an execution.

// TestOracles specifies the methods that can be invoked by the
// Test Execution Engine to obtain the results of a test execution.
type TestOracles interface {
  ExpectedLeaders() string
  LegalResponses() string
}
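A possible check of these results against the <Oracles> values could look as follows. This is only a sketch: the Expected struct, the Check function and the fakeOracles type are hypothetical, and it assumes that each oracle lists the set of acceptable values and that the observed value must be one of them.

```go
package main

import "fmt"

// TestOracles is the interface shown above.
type TestOracles interface {
	ExpectedLeaders() string
	LegalResponses() string
}

// Expected holds the oracle values parsed from the <Oracles> element.
// (Hypothetical type for this sketch.)
type Expected struct {
	Leaders   []string
	Responses []string
}

func contains(xs []string, x string) bool {
	for _, v := range xs {
		if v == x {
			return true
		}
	}
	return false
}

// Check returns one message per oracle that the observed result violates;
// an empty result means the test case passed.
func Check(o TestOracles, want Expected) []string {
	var failures []string
	if got := o.ExpectedLeaders(); !contains(want.Leaders, got) {
		failures = append(failures, fmt.Sprintf("leader %q not in %v", got, want.Leaders))
	}
	if got := o.LegalResponses(); !contains(want.Responses, got) {
		failures = append(failures, fmt.Sprintf("response %q not in %v", got, want.Responses))
	}
	return failures
}

// fakeOracles stands in for a user-provided implementation.
type fakeOracles struct{ leader, response string }

func (f fakeOracles) ExpectedLeaders() string { return f.leader }
func (f fakeOracles) LegalResponses() string  { return f.response }

func main() {
	want := Expected{Leaders: []string{"8082"}, Responses: []string{"paxos"}}
	fmt.Println(len(Check(fakeOracles{"8082", "paxos"}, want)))   // prints "0"
	fmt.Println(len(Check(fakeOracles{"8082", "lamport"}, want))) // prints "1"
}
```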

An Alternative Example

The above example is perhaps too fine-grained for some test cases. The following example instead simulates two proposers that both believe they are the leader and propose different values. The proposer of the value paxos fails in the second phase, and so the only legal response is lamport.

<Test Name="CrashFailurePaxos" Type="systemtest">
    <TypeAssignments>
        <SystemParameterTypes>
            <Field Name="SystemSize" Type="int"></Field>
            <Field Name="QuorumSize" Type="int"></Field>
            <Field Name="ServerIDs" Type="*[]int"></Field>
        </SystemParameterTypes>
        <InputType Name="Server" Type="">
            <Field Name="IDs" Type="*[]string"></Field>
        </InputType>
        <OracleType Name="LegalResponses" Type="*[]string"></OracleType>
        <OracleType Name="ExpectedLeaders" Type="*[]string"></OracleType>
    </TypeAssignments>
    <SystemParameters>
    </SystemParameters>
    <TestCase ID="1">
        <Concurrent>
            <Call Name="Failure">
                <InputValues>
                    <Server IDs="8080,8081"/>
                </InputValues>
            </Call>
            <Call Name="RunPaxosPhases">
                <InputValues>
                    <Msg string="paxos"/>
                    <FailurePhase string="2"/>
                </InputValues>
            </Call>
            <Call Name="RunPaxosPhases">
                <InputValues>
                    <Msg string="lamport"/>
                </InputValues>
            </Call>
        </Concurrent>
        <Oracles>
            <ExpectedLeaders>8082</ExpectedLeaders>
            <LegalResponses>lamport</LegalResponses>
        </Oracles>
    </TestCase>
</Test>

Note that the RunPaxosPhases method is a custom implementation of a Paxos proposer, similar to the one currently implemented in proposer.go#runPaxosPhases, in which we can trigger the failure of individual Paxos phases based on the input provided in the test case. Below is an excerpt of the code needed. It largely replicates the proposer's code, except that after each phase it checks whether that phase should fail according to the test case input.

func (ta *PaxosTestAdapter) RunPaxosPhases(msg string, failurePhase string) error {
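        p := ta.GetProposer()
        // Assumption: the excerpt does not show where ctx comes from; a background context is used here.
        ctx := context.Background()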
        // access proposer state in mutual exclusion for use below; avoid holding lock during quorum calls
        p.m.RLock()
        crnd, cval := p.crnd, p.cval
        p.m.RUnlock()

        // ******************************************************
        //  PHASE ONE: send Prepare to obtain quorum of Promises
        preMsg := &PrepareMsg{Rnd: crnd}
        p.logf("Sending  Phase 1a msg: %v\n", preMsg)
        prmMsg, err := p.config.Prepare(ctx, preMsg)
        err = checkFailurePhase(err, "1", failurePhase)
        if err != nil {
            return err
        }
        p.logf("Received Phase 1b msg: %v\n", prmMsg)

        // ******************************************************
        //  PHASE TWO: send Accept to obtain quorum of Learns
        if prmMsg.GetVrnd() != Ignore {
                // promise msg has a locked-in value
                cval = prmMsg.GetVval()
                // update proposer state in mutual exclusion
                p.m.Lock()
                p.cval = cval
                p.m.Unlock()
        }
        // use local proposer's cval or locked-in value from promise msg, if any.
        accMsg := &AcceptMsg{Rnd: crnd, Val: cval}
        p.logf("Sending  Phase 2a msg: %v\n", accMsg)
        lrnMsg, err := p.config.Accept(ctx, accMsg)
        err = checkFailurePhase(err, "2", failurePhase)
        if err != nil {
            return err
        }
        p.logf("Received Phase 2b msg: %v\n", lrnMsg)

        // ******************************************************
        //  PHASE THREE: send Commit to obtain a quorum of Acks
        p.logf("Sending  Phase 3a msg: %v\n", lrnMsg)
        ackMsg, err := p.config.Commit(ctx, lrnMsg)
        err = checkFailurePhase(err, "3", failurePhase)
        if err != nil {
            return err
        }
        p.logf("Received Phase 3b msg: %v\n", ackMsg)
        return nil
}

func checkFailurePhase(err error, phase, failurePhase string) error {
    if err != nil {
        // if there was an actual error; always return that first
        return err
    }
    if phase == failurePhase {
        return fmt.Errorf("Paxos test adapter failed phase %s", phase)
    }
    return nil
}

Running the Test Execution Engine

When running the TEE tool, we first parse the XML into instances of the relevant datatypes, which are then used to drive the test execution. The main test function will be similar to the TestSystem function in system_tests.go, but much simpler. It should first start the servers:

  for _, id := range serverIDs {
    go ServerStart(id, addrs, quorumSize)
  }

Side note: This initialization of the system should ideally be specified in and derived from the XML file, but we suggest postponing that for now.

We should also create an instance of the TestAdapter. The test adapter is implemented by the user and will typically take care of establishing the gRPC connections necessary for the different <Call> methods used. An example of the setup needed is the newPaxosConfig function in paxos_gorums_helper.go, except that the test adapter must store the resulting state so that its methods can access it.

The TestAdapter implementation will thus serve as a state object for the TEE, and so when invoking the different methods on it, those method implementations will have access to the necessary server references and quorum call references to be able to invoke the gRPC and Quorum Calls on those servers and configurations.

When the main test function has access to the test adapter, it is trivial (maybe not) to process the different parts of the test execution. Basically, for each test case, invoke each specified call with the specified parameters obtained by parsing the XML into the relevant datatypes (see the Q&A below).

Here is pseudocode for a sequential execution:

  for each testcase t {
      for each call c in t {
          m = get method name of c
          inputValues = get input values of c
          result = invoke m using reflection with inputValues
          // not sure if we should check each individual invocation??
          if result != c.expected {
              fail
          }
      }
      // here we check if the test passed/failed.
  }
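The "invoke m using reflection" step of the pseudocode could be sketched with Go's reflect package as follows. The Invoke helper and the demoAdapter type are hypothetical, and the sketch assumes non-variadic adapter methods whose arguments have already been decoded into the right Go types.

```go
package main

import (
	"fmt"
	"reflect"
)

// Stand-ins for the proto-generated message types.
type PrepareMsg struct{ Rnd uint32 }
type PromiseMsg struct{ Rnd uint32 }

// demoAdapter stands in for a user-provided TestAdapter implementation.
type demoAdapter struct{}

func (demoAdapter) Prepare(m PrepareMsg) PromiseMsg { return PromiseMsg{Rnd: m.Rnd} }

// Invoke looks up the exported method name on target and calls it with args.
// It reports an error instead of panicking when the method is missing or the
// arguments do not match the method's signature.
func Invoke(target interface{}, name string, args ...interface{}) ([]interface{}, error) {
	m := reflect.ValueOf(target).MethodByName(name)
	if !m.IsValid() {
		return nil, fmt.Errorf("adapter has no method %q", name)
	}
	t := m.Type()
	if t.NumIn() != len(args) {
		return nil, fmt.Errorf("%s expects %d arguments, got %d", name, t.NumIn(), len(args))
	}
	in := make([]reflect.Value, len(args))
	for i, a := range args {
		v := reflect.ValueOf(a)
		if !v.IsValid() || !v.Type().AssignableTo(t.In(i)) {
			return nil, fmt.Errorf("argument %d of %s: got %T, want %s", i, name, a, t.In(i))
		}
		in[i] = v
	}
	out := m.Call(in)
	results := make([]interface{}, len(out))
	for i, o := range out {
		results[i] = o.Interface()
	}
	return results, nil
}

func main() {
	out, err := Invoke(demoAdapter{}, "Prepare", PrepareMsg{Rnd: 42})
	fmt.Printf("%+v %v\n", out[0], err) // prints "{Rnd:42} <nil>"

	_, err = Invoke(demoAdapter{}, "Accept")
	fmt.Println(err) // prints: adapter has no method "Accept"
}
```

Validating the argument count and types before calling matters here because reflect's Call panics on a mismatch, and a malformed test file should produce a test error rather than crash the engine.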

Open Questions (and Some Answers)

  1. Q: Can we use XML tags for proto defined message types, such as PrepareMsg and AcceptMsg and be able to parse the XML file into instances of those datatypes directly to avoid doubly defined datatypes?

    A: Yes. One approach is to ensure that we have access to the datatypes produced by the protoc compiler in the same folder from which the TEE tool is running, i.e. those in the .pb.go file. However, to allow XML parsing into those datatypes, we need to add the necessary Go XML tags to the relevant Go file. However, this Go file is generated, and so we shouldn't change it manually. Instead we can use an extension, which is currently not supported by the golang/protobuf package, but is supported by gogo/protobuf and its moretags extension. Since Gorums is still using gogo/protobuf we can easily leverage moretags to support XML-based message datatypes. In the future, hopefully they will implement go_tag in golang/protobuf. See Proposal for adding go_tag.

    Another option to consider is to replace the XML format with JSON, which is already supported by golang/protobuf.

    If we stick with XML, the proto file needs to be modified as follows:

    import "github.com/gogo/protobuf/gogoproto/gogo.proto";
    
    message PrepareMsg {
      uint32 rnd = 1 [(gogoproto.moretags) = "xml:\"Rnd\""];
    }
  2. Q: How to determine if a test passed/failed? Where do we do the check? See the pseudocode above.

  3. Q: If the pass/fail check is done after the execution of a sequence of RPC calls, how does the oracle learn the result of the execution?

    A: We could add a Query() method to the TestAdapter that must be implemented by the user, which should compute the final result of an execution based on the state of the system.

    I think I found a reasonable solution by requiring that a TestOracles interface be implemented. See above.

Future Updates and Ideas

A future version of the tool could start separate processes on the same machine or on different machines, using e.g. ssh, but this is not a priority now. We also limit ourselves to gRPC-based frameworks for now, although we aim to make the tool flexible enough to support other distributed computing technologies.