Implementing shared timeouts in a multi-threaded context model

This post describes how shared timeouts are implemented in a multi-threaded Java application. It discusses their advantages and disadvantages, and covers the difficulties we met in practice.

Why shared timeout

In a complex microservice system, messages are the basic means of communication between modules. Every API call or message therefore carries a timeout, so that no task is left unfinished indefinitely.

Without a timeout, an operation may never finish and never return a response. Timeouts protect both the user and the system from that situation.

But what is a shared timeout, and what does it do?

For example, one API call may send several messages to different services before it completes. If the timeout expires at any point in that process, the API should return a timeout error, as expected. The messages should therefore all share a single timeout period.

Assume the API timeout is T and the API sends three messages in sequence, where message1 takes time T1, message2 takes T2, and message3 takes T3. With a shared timeout, message1's timeout is T, message2's timeout is T - T1, and message3's timeout is T - T1 - T2. Each message receives only the time that remains, which ensures that every sub-message times out when the API as a whole should.
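The arithmetic above can be walked through with a few lines of Java. This is only a sketch: the class and method names are made up for illustration and are not part of ZStack, and the numbers are arbitrary.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not ZStack code: walks the T, T - T1, T - T1 - T2 arithmetic.
class SharedTimeoutBudget {
    // Returns the timeout each sub-message receives, given the API timeout and
    // the time each message actually consumed (all in milliseconds).
    static List<Long> timeoutsFor(long apiTimeoutMs, long... usedMs) {
        List<Long> timeouts = new ArrayList<>();
        long remaining = apiTimeoutMs;
        for (long used : usedMs) {
            timeouts.add(Math.max(remaining, 0)); // budget handed to this message
            remaining -= used;                    // what is left for the next one
        }
        return timeouts;
    }

    public static void main(String[] args) {
        // T = 30s; message1 takes 5s, message2 takes 10s, message3 takes 2s.
        System.out.println(timeoutsFor(30_000, 5_000, 10_000, 2_000)); // [30000, 25000, 15000]
    }
}
```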

Application-level implementation

For message-level timeouts, the easiest place to implement timeout management is the message bus. Every time a message is received, the bus checks its timeout header and calculates the remaining timeout, which seems to make sense.

But if we track the timeout by subtracting the time each message used, the mechanism becomes more complex: the time usage of every message has to be recorded, and in most production deployments message profiling is disabled to avoid the performance overhead.

To solve this problem, we store a deadline as metadata for the message lifecycle. Every newly arrived message is assigned a deadline based on its configured timeout, and every sub-message calculates its remaining timeout by subtracting the current time from the deadline. All the mechanism needs is a common service for getting the current time, which makes it more efficient.
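A minimal sketch of the deadline idea follows. The class name and methods are hypothetical and only illustrate the calculation; the current time is passed in as a parameter so the arithmetic is easy to follow.

```java
// Hypothetical sketch: the deadline is fixed once and travels with every sub-message.
class MessageDeadline {
    private final long deadlineMs; // absolute time in epoch milliseconds

    private MessageDeadline(long deadlineMs) {
        this.deadlineMs = deadlineMs;
    }

    // Called once, when the top-level message arrives with its configured timeout.
    static MessageDeadline startingAt(long nowMs, long timeoutMs) {
        return new MessageDeadline(nowMs + timeoutMs);
    }

    // Called for each sub-message: remaining timeout = deadline - current time.
    long remainingMs(long nowMs) {
        return Math.max(deadlineMs - nowMs, 0);
    }

    boolean isExpired(long nowMs) {
        return nowMs >= deadlineMs;
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        MessageDeadline deadline = startingAt(start, 30_000);
        System.out.println(deadline.remainingMs(start + 5_000));  // 25000
        System.out.println(deadline.remainingMs(start + 15_000)); // 15000
        System.out.println(deadline.isExpired(start + 31_000));   // true
    }
}
```

No per-message time usage has to be recorded here; only the deadline and a clock are needed.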

More challenges

In a multi-threaded application, maintaining the lifecycle of a message's timeout really matters.

In particular, ZStack uses an in-process microservice architecture in which messages pass through memory or HTTP. For API messages the timeout always works well, but internal messages cause more problems.

For example, an async invoker utility function may send several messages that share the same thread before they are handled, so the thread's context, where the timeout is stored, is used as each message's initial context.

When a thread moves on to other work, we can clear the context through the thread pool's thread lifecycle hook.
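One way to picture such a hook is the afterExecute callback of java.util.concurrent.ThreadPoolExecutor. The sketch below assumes the timeout lives in a ThreadLocal; ThreadDeadline and ContextClearingPool are hypothetical names, and ZStack's real thread pool may do this differently.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical holder for the per-thread deadline; not ZStack's actual class.
class ThreadDeadline {
    static final ThreadLocal<Long> DEADLINE_MS = new ThreadLocal<>();
}

// A pool that wipes the per-thread deadline after every task, so a pooled
// thread never carries one task's timeout into the next task.
class ContextClearingPool extends ThreadPoolExecutor {
    ContextClearingPool(int threads) {
        super(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<Runnable>());
    }

    @Override
    protected void afterExecute(Runnable task, Throwable error) {
        super.afterExecute(task, error);
        ThreadDeadline.DEADLINE_MS.remove();
    }

    public static void main(String[] args) throws Exception {
        ContextClearingPool pool = new ContextClearingPool(1);
        // First task leaves a deadline behind on the worker thread.
        pool.submit(() -> ThreadDeadline.DEADLINE_MS.set(System.currentTimeMillis() + 30_000)).get();
        // Second task runs on the same worker and should find nothing.
        Callable<Long> read = () -> ThreadDeadline.DEADLINE_MS.get();
        System.out.println("next task sees: " + pool.submit(read).get()); // next task sees: null
        pool.shutdown();
    }
}
```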

But in some use cases the timeout does not work as expected:

  • GC task (an in-memory task triggered at a fixed interval or by a system event)
  • Thread-level task (async and sync task queues)

GC task

A GC task handles unexpected async operations, retries deleting resources, and so on.

If the thread that submits the task also executes it, that thread's context is used directly and interferes with the GC task's execution. It is therefore a good choice to always start a GC job on a new thread.

In a multi-threaded application, message delivery and handling involve different threads, especially when a task driver is used to construct workflows and execute tasks. The timeout context therefore needs to be passed from one thread to another.

Assume Thread1 runs task1 and finishes by submitting a new task2, and some time later Thread2 starts to handle task2. By then task2 must already carry the timeout; otherwise the timeout cannot be passed on.
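A common way to make task2 carry the timeout is to capture it at the moment of submission and reinstate it on the thread that eventually runs the task. The sketch below assumes the deadline is kept in a ThreadLocal; DeadlineContext and propagating are hypothetical names, not ZStack's API.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical per-thread store for the deadline; a stand-in, not ZStack's API.
class DeadlineContext {
    private static final ThreadLocal<Long> DEADLINE_MS = new ThreadLocal<>();

    static void set(long deadlineMs) { DEADLINE_MS.set(deadlineMs); }
    static Long get() { return DEADLINE_MS.get(); }
    static void clear() { DEADLINE_MS.remove(); }

    // Reads the deadline on the submitting thread (Thread1) and returns a task
    // that installs it on whichever thread runs it later (Thread2).
    static Runnable propagating(Runnable task) {
        final Long captured = DEADLINE_MS.get();
        return () -> {
            if (captured != null) {
                DEADLINE_MS.set(captured);
            } else {
                DEADLINE_MS.remove();
            }
            try {
                task.run();
            } finally {
                DEADLINE_MS.remove(); // leave the worker thread clean
            }
        };
    }

    public static void main(String[] args) throws Exception {
        ExecutorService thread2 = Executors.newSingleThreadExecutor();
        AtomicReference<Long> seen = new AtomicReference<>();

        DeadlineContext.set(System.currentTimeMillis() + 30_000); // task1 on Thread1
        Runnable task2 = propagating(() -> seen.set(DeadlineContext.get()));
        DeadlineContext.clear(); // Thread1 moves on before task2 runs

        thread2.submit(task2).get();
        thread2.shutdown();
        System.out.println("task2 saw deadline: " + seen.get());
    }
}
```

Because the deadline is read when the wrapper is created, it does not matter how long task2 waits or what Thread1 does in the meantime.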

Fixed thread task

When the same thread handles all tasks, each task should store its context when it is submitted to the task queue.

The executor needs to restore the task's context before running it.
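As a rough illustration, a queue entry can hold the task together with the deadline that was current when it was submitted. ContextCarryingQueue is a hypothetical name and the ThreadLocal stands in for the real task context; the demo submits and drains on one thread only to keep it short.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

class ContextCarryingQueue {
    // Hypothetical per-thread deadline slot, standing in for the real task context.
    static final ThreadLocal<Long> DEADLINE_MS = new ThreadLocal<>();

    // A queued task together with the deadline that was current at submission.
    private static final class Entry {
        final Runnable task;
        final Long deadlineMs;

        Entry(Runnable task, Long deadlineMs) {
            this.task = task;
            this.deadlineMs = deadlineMs;
        }
    }

    private final Queue<Entry> pending = new ArrayDeque<>();

    // Called by any thread: store the caller's context next to the task.
    synchronized void submit(Runnable task) {
        pending.add(new Entry(task, DEADLINE_MS.get()));
    }

    private synchronized Entry next() {
        return pending.poll();
    }

    // Called by the single worker thread: restore, run, then clear.
    void drain() {
        for (Entry e = next(); e != null; e = next()) {
            if (e.deadlineMs != null) {
                DEADLINE_MS.set(e.deadlineMs);
            } else {
                DEADLINE_MS.remove();
            }
            try {
                e.task.run();
            } finally {
                DEADLINE_MS.remove();
            }
        }
    }

    public static void main(String[] args) {
        ContextCarryingQueue queue = new ContextCarryingQueue();
        List<Long> seen = new ArrayList<>();
        DEADLINE_MS.set(100L);
        queue.submit(() -> seen.add(DEADLINE_MS.get()));
        DEADLINE_MS.set(200L);
        queue.submit(() -> seen.add(DEADLINE_MS.get()));
        DEADLINE_MS.remove();
        queue.drain();
        System.out.println(seen); // [100, 200]
    }
}
```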

Benefits of the concept

In an in-process architecture, using one global timeout throughout the lifecycle of an API call or inner task allows the whole timeout to be managed in one place.

A simple implementation

In a Java program, using AOP to maintain the timeout get/set operations seems a good choice.

A typical ZStack task workflow usually starts with an API call. When a new API message arrives, ZStack sets a timeout on it, but for sub-messages to know the parent message's timeout, the timeout information has to be carried along with the message.

So a design named TaskContext was created to hold the global variables for the whole task lifecycle. With AOP, every async task uses its parent's TaskContext, and the thread's context is cleared before the task starts. With TaskContext, the timeout can be managed.

But some scenarios still need to be discussed. They are listed here before the details:

  • An inner message is used as the start of a task and should support a timeout.
  • An API uses an inner message that is itself configured with a timeout, which should be supported.
  • How does a newly introduced mechanism become aware of the timeout in the task context?
  • How do we keep the task context from being corrupted?

Inner message

Timeout configuration at the inner-message level needs to be supported, because some tasks, such as the GC tasks mentioned above, may send inner messages directly, so a timeout may be required for those tasks.

Duplicate configuration

An API message may use a workflow that contains several inner messages. When those messages all have their own timeout configuration, we need to use the original timeout rather than the newly configured one.
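The rule can be sketched as a small decision: if a deadline was inherited from the parent, keep it; only a message that starts a new task gets a deadline from its own configuration. DeadlineResolver and its parameters are hypothetical names used for illustration, not ZStack's timeout manager.

```java
class DeadlineResolver {
    // inheritedDeadlineMs: deadline found in the task context, or null when
    // this message starts a new task (for example, one sent by a GC task).
    static long resolve(Long inheritedDeadlineMs, long nowMs, long configuredTimeoutMs) {
        if (inheritedDeadlineMs != null) {
            return inheritedDeadlineMs;     // keep the original API deadline
        }
        return nowMs + configuredTimeoutMs; // no parent: start a fresh deadline
    }

    public static void main(String[] args) {
        long now = 1_000L;
        System.out.println(resolve(31_000L, now, 60_000L)); // 31000
        System.out.println(resolve(null, now, 60_000L));    // 61000
    }
}
```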

Aware of timeout

This is not practical, because the task context is an in-memory variable keyed by thread ID, so every time execution switches threads the task context needs to be copied to the new thread. If a mechanism does not support copying the task context, the timeout is lost, and if an inner message is then used, the timeout manager sets a new timeout. There seems to be no good way to make a new mechanism aware of this. That is the shortcoming of using AOP.

Do not touch task context

Only the timeout manager should use the task context for timeout handling. Other uses of the task context, including manually clearing it or setting values, should be avoided.

But for some reasons access to TaskContext has to remain available to the core module, for timeouts or other context uses (for example, the task ID). So we only make sure it is cleared after a thread context switch, and only assign values to fields known to the framework, to prevent other developers from corrupting it.

Conclusion

In practice the task context behaves much like a global variable available to every thread. Protecting it from abuse and from OOM is the first task, and AOP is involved to solve that problem. How to trace the task context seems to be the next valuable goal for this version of the code.

Check the code

If you are interested in this feature, check the code at https://github.com/zstackio/zstack