# Error Reference
## `task manager is shutting down`

- Type: Go `error` (specific string match)
- Symptoms: Calling `QueueTask`, `RunTask`, or other submission methods returns an error with the message "task manager is shutting down" or "task manager is shutting down (context done)".
- Properties: None (standard `error` interface, typically created with `errors.New` or wrapped by `fmt.Errorf`)
Scenarios:

1. Attempting to submit a new task after `manager.Stop()` or `manager.Kill()` has been called.

   Example:

   ```go
   manager.Stop()                   // Initiates shutdown
   _, err := manager.QueueTask(...) // Will likely return 'task manager is shutting down'
   ```

   Reason: The `TaskManager` has transitioned to a non-running state (stopping or killed) and no longer accepts new tasks.

2. Attempting to submit a new task after the context provided to `NewTaskManager` (`config.Ctx`) has been cancelled externally.

   Example:

   ```go
   ctx, cancel := context.WithCancel(context.Background())
   config := tasker.Config{Ctx: ctx, ...}
   manager := tasker.NewTaskManager(config)
   cancel()                         // Cancels the main context
   _, err := manager.QueueTask(...) // Will likely return 'task manager is shutting down (context done)'
   ```

   Reason: The primary control context for the manager has been cancelled, signaling it to begin shutdown.
- Diagnosis: Check application lifecycle management. Verify whether `Stop()` or `Kill()` is being called prematurely. Ensure task submissions are coordinated with the manager's active state.
- Resolution: Only submit tasks when the `TaskManager` is in a running state. Add checks around submission calls (e.g., `if manager.isRunning() { ... }`). Ensure graceful shutdown routines are triggered only when no more new tasks are expected.
- Prevention: Implement robust application lifecycle hooks. Use `defer manager.Stop()` in `main` or the main goroutine. Integrate `manager.Stop()` or `manager.Kill()` with OS signals (e.g., `SIGINT`, `SIGTERM`).
- Handling Patterns: Check the returned error message for the substring "task manager is shutting down" (e.g., with `strings.Contains`; `errors.Is` only helps if the library exports a sentinel error value, and comparing against a fresh `errors.New` never matches), then discard or log the task rather than retrying. See the sketch after this list.
- Propagation Behavior: This error is returned directly to the caller of submission methods (`QueueTask`, `RunTask`, etc.) and propagated through result/error channels or callbacks for asynchronous methods.
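A minimal sketch of this handling pattern, assuming only the error message documented above. `isShuttingDownErr` and the `submit` closure are illustrative stand-ins (a real call would be something like `manager.QueueTask(...)`), and the signal-derived context is what would be passed as `Config.Ctx`:

```go
package main

import (
	"context"
	"errors"
	"log"
	"os"
	"os/signal"
	"strings"
	"syscall"
)

// isShuttingDownErr reports whether err carries the documented shutdown message.
// Substring matching is used because the message may be wrapped (e.g., via fmt.Errorf).
func isShuttingDownErr(err error) bool {
	return err != nil && strings.Contains(err.Error(), "task manager is shutting down")
}

func main() {
	// Derive the manager's control context from OS signals so that SIGINT/SIGTERM
	// trigger a graceful shutdown; this ctx is what you would pass as Config.Ctx.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()
	_ = ctx // wire into tasker.Config{Ctx: ctx, ...} and defer manager.Stop() here

	// submit stands in for a real submission call such as manager.QueueTask(...).
	submit := func() error {
		return errors.New("task manager is shutting down (context done)")
	}

	if err := submit(); isShuttingDownErr(err) {
		// Discard or log the task instead of retrying: the manager will not accept it again.
		log.Printf("dropping task: %v", err)
	}
}
```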
## `processor_crash`

- Type: Go `error` (example custom error string)
- Symptoms: A task function explicitly returns an error indicating an underlying resource or worker failure (e.g., a connection drop or service outage). Logs may show `WARN: Detected unhealthy error: processor_crash. Worker will be replaced.`
- Properties: None (typically a simple `errors.New` or a custom error type from the application's domain)
Scenarios:

A task encounters a critical, unrecoverable error during its execution that indicates the worker's associated resource is faulty.

Example:

```go
func myProblematicTask(ctx context.Context, proc *ImageProcessor) (string, error) {
    if rand.Intn(2) == 0 { // 50% chance to simulate a crash
        return "", errors.New("processor_crash") // This triggers CheckHealth to return false
    }
    return "processed", nil
}

// In Config: CheckHealth: func(err error) bool { return err.Error() != "processor_crash" }
```

Reason: The `CheckHealth` function (provided in `tasker.Config`) detected this specific error string and returned `false`, signaling `tasker` that the worker or its resource is unhealthy.
- Diagnosis: Review the `CheckHealth` implementation in `tasker.Config`. Analyze task logs for the specific error returned by the task function. Check the health of any external services that the resource (`R`) interacts with.
- Resolution: If the error genuinely means the resource is unusable, `tasker`'s retry mechanism (if `MaxRetries > 0`) will attempt to re-process the task on a new worker. Ensure `OnCreate` for the resource reliably creates new, healthy instances. If the external service is down, that requires external intervention.
- Prevention: Implement robust error handling within your task functions. Use circuit breakers or exponential backoff for external service calls. Monitor resource health externally to proactively replace faulty instances.
- Handling Patterns: `tasker` handles this automatically by replacing the worker and potentially retrying the task. For the task caller it is treated as a normal task failure, with the possibility of being retried internally. A sentinel-based `CheckHealth` sketch follows this list.
- Propagation Behavior: This error is returned by the task function and then passed to the `CheckHealth` function. If `CheckHealth` returns `false`, `tasker` handles worker replacement and task retry internally. If retries are eventually exhausted, the error is returned to the original task submitter.
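A sketch of a sentinel-based `CheckHealth` callback, following the `func(error) bool` signature shown in the example above. `ErrProcessorCrash` is a hypothetical application-level sentinel; if your tasks return plain `errors.New` strings, the string comparison from the example works just as well:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrProcessorCrash is a hypothetical domain sentinel for an unrecoverable
// worker/resource failure; tasks return it (or wrap it) instead of a bare string.
var ErrProcessorCrash = errors.New("processor_crash")

// checkHealth follows the CheckHealth signature shown above (func(error) bool).
// Returning false tells tasker the worker's resource is unhealthy, so the worker
// is replaced and the failed task becomes eligible for retry.
func checkHealth(err error) bool {
	return !errors.Is(err, ErrProcessorCrash)
}

func main() {
	wrapped := fmt.Errorf("image pipeline failed: %w", ErrProcessorCrash)
	fmt.Println(checkHealth(nil))     // true: healthy
	fmt.Println(checkHealth(wrapped)) // false: unhealthy, worker gets replaced
}
```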
## `max retries exceeded`

- Type: Go `error` (specific string match, often wrapped)
- Symptoms: A task that previously triggered an unhealthy worker condition (i.e., `CheckHealth` returned `false`) is reported as failed with a message indicating retry exhaustion.
- Properties: None (standard `error` interface)
Scenarios:

A task repeatedly fails with an error that `CheckHealth` identifies as unhealthy, and the number of retry attempts reaches `MaxRetries`.

Example:

```go
// Config with MaxRetries: 1
// Task returns errors.New("processor_crash") (unhealthy)
// First attempt fails, worker replaced, task re-queued (retry 1/1)
// Second attempt fails with "processor_crash"
// -> Result: Task Failed: max retries exceeded: computation_error_task_X
```

Reason: The task manager exhausted all allowed retries for a task that consistently caused unhealthy worker conditions or was placed on an unhealthy worker, and the task could not complete successfully.
- Diagnosis: This error indicates a persistent problem with the task or the resources it is trying to use. Examine the underlying error that caused `CheckHealth` to return `false` repeatedly.
- Resolution: Investigate the root cause of the persistent unhealthy condition. This might involve debugging the task logic, checking external service health, or reviewing resource creation/destruction (`OnCreate`/`OnDestroy`). Consider increasing `MaxRetries` if the errors are truly transient but require more attempts.
- Prevention: Improve task robustness so transient errors are handled internally without relying solely on `tasker`'s retries. Ensure external dependencies are stable. Set `MaxRetries` appropriately for the expected transience of errors.
- Handling Patterns: Catch this error and log it as a critical task failure; potentially escalate to an alert if many tasks hit this state. These tasks cannot be recovered by `tasker`'s internal mechanism. See the sketch after this list.
- Propagation Behavior: This error is returned directly to the caller of the submission methods (`QueueTask`, `QueueTaskWithPriority`, etc.) or delivered via callbacks/channels after all retries are exhausted.
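A sketch of caller-side handling, assuming the documented message appears in the returned error. `handleTaskResult` and `alertCritical` are hypothetical helpers, not part of `tasker`:

```go
package main

import (
	"errors"
	"log"
	"strings"
)

// handleTaskResult is a sketch of caller-side handling for a terminal failure.
// The "max retries exceeded" check relies on the message documented above.
func handleTaskResult(taskID string, err error) {
	switch {
	case err == nil:
		return
	case strings.Contains(err.Error(), "max retries exceeded"):
		// tasker will not retry this task again; escalate for manual follow-up.
		log.Printf("CRITICAL: task %s exhausted retries: %v", taskID, err)
		alertCritical(taskID, err)
	default:
		log.Printf("task %s failed: %v", taskID, err)
	}
}

// alertCritical is a placeholder for your alerting or metrics pipeline.
func alertCritical(taskID string, err error) {}

func main() {
	handleTaskResult("task_X", errors.New("max retries exceeded: computation_error_task_X"))
}
```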
## `failed to create temporary resource`

- Type: Go `error` (specific string match, wrapped with `%w`)
- Symptoms: A call to `RunTask` returns an error with this message, often wrapping another underlying error from `OnCreate`.
- Properties: None (standard `error` interface; contains the underlying error via wrapping)
Scenarios:

When `RunTask` is called and the `resourcePool` is empty (or full, causing it to bypass the pool), the manager attempts to create a new, temporary resource via `OnCreate`, but `OnCreate` returns an error.

Example:

```go
// In Config: OnCreate: func() (*BadResource, error) { return nil, errors.New("connection_refused") }
_, err := manager.RunTask(...) // Will return 'failed to create temporary resource: connection_refused'
```

Reason: The `OnCreate` function, responsible for providing a resource to the task, failed during a `RunTask` call, preventing the task from executing.
- Diagnosis: Inspect the wrapped error within the returned `failed to create temporary resource` error. The wrapped error (`errors.Unwrap(err)`) contains the specific reason `OnCreate` failed.
- Resolution: Debug the `OnCreate` function. Common causes include incorrect connection strings, unavailable external services, insufficient permissions, or resource exhaustion. Ensure `OnCreate` is robust and handles its own potential errors gracefully.
- Prevention: Run pre-flight checks for external dependencies. Handle errors robustly within `OnCreate`. Consider implementing retries within `OnCreate` itself for transient resource creation issues.
- Handling Patterns: Log the error, unwrapping the root cause (see the sketch after this list). This typically indicates an environment or configuration issue that needs attention.
- Propagation Behavior: This error is returned directly to the caller of `RunTask`.
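A sketch of the diagnosis step described above, built only on the standard `errors` package. `inspectRunTaskErr` is a hypothetical helper, and the error constructed in `main` simulates the documented wrapping:

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"strings"
)

// inspectRunTaskErr is a sketch of diagnosing a RunTask failure caused by OnCreate.
// The wrapped cause is recovered with errors.Unwrap; prefer errors.Is/errors.As if
// your OnCreate returns sentinel or typed errors.
func inspectRunTaskErr(err error) {
	if err == nil || !strings.Contains(err.Error(), "failed to create temporary resource") {
		return
	}
	if cause := errors.Unwrap(err); cause != nil {
		log.Printf("OnCreate failed: %v", cause) // e.g. "connection_refused"
	}
}

func main() {
	// Simulate the wrapped error the manager would return from RunTask.
	err := fmt.Errorf("failed to create temporary resource: %w", errors.New("connection_refused"))
	inspectRunTaskErr(err)
}
```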
## `priority queue full, task requeue failed`

- Type: Go `error` (specific string match, wrapped with `%w`)
- Symptoms: A task that was meant to be re-queued (because `CheckHealth` returned `false` and retries remained) fails with this error, indicating it could not be re-added to the priority queue.
- Properties: None (standard `error` interface; contains the underlying error via wrapping)
Scenarios:

A task fails, `CheckHealth` indicates an unhealthy worker, retries are allowed, and `tasker` attempts to re-queue the task to the priority queue, but the priority queue's buffer is full.

Example:

```go
// Config: MaxRetries: 1, priority queue buffer is small
// Task fails with an unhealthy error, manager tries to re-queue it
// Priority queue is full at that exact moment
// -> Result: Task Failed: priority queue full, task requeue failed: [original_error]
```

Reason: The internal priority queue did not have capacity to accept a task that needed to be re-queued after an unhealthy worker condition. This usually happens under extreme load or when the priority queue's buffer size is too small relative to `MaxRetries` and the task failure rate.
- Diagnosis: Check the `priorityQueue` buffer size (not directly configurable by the user; tied to `WorkerCount` in the default implementation). Review overall system load and task failure rates. Because this is an internal `tasker` queue, the error usually points to overloaded conditions.
- Resolution: This indicates that even the priority retry mechanism is overloaded. Consider increasing `WorkerCount` or `MaxWorkerCount` to add processing capacity, or reduce the rate of unhealthy errors. The task is permanently failed from `tasker`'s perspective.
- Prevention: Ensure adequate worker capacity (`WorkerCount`, `MaxWorkerCount`). Minimize unhealthy worker conditions through robust resources. For very high load, consider a custom `MetricsCollector` to monitor queue depths and pre-scale.
- Handling Patterns: Log this error. The task is definitively lost from `tasker`'s management; manual intervention or external retry logic may be needed (see the sketch after this list).
- Propagation Behavior: This error is returned to the original caller of the submission methods or delivered via callbacks/channels.
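A sketch of external dead-lettering for tasks lost this way. `handleRequeueFailure` and the `deadLetter` channel are hypothetical stand-ins for whatever durable store (database table, message broker) your application uses:

```go
package main

import (
	"errors"
	"log"
	"strings"
)

// deadLetter is a stand-in for an external retry store: a buffered channel here,
// but in practice a durable queue the application can replay from.
var deadLetter = make(chan string, 128)

// handleRequeueFailure is a sketch: once tasker reports that the priority queue was
// full during an internal requeue, the task is gone from tasker's perspective, so we
// record its ID for external replay.
func handleRequeueFailure(taskID string, err error) {
	if err == nil || !strings.Contains(err.Error(), "priority queue full, task requeue failed") {
		return
	}
	log.Printf("task %s lost by tasker (requeue failed): %v", taskID, err)
	select {
	case deadLetter <- taskID:
	default:
		log.Printf("dead-letter buffer full, dropping task %s", taskID)
	}
}

func main() {
	handleRequeueFailure("task_42", errors.New("priority queue full, task requeue failed: processor_crash"))
}
```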