running long deep-research jobs on vercel's 13-minute functions
the short version: a deep-research job can run for hours, but a vercel function dies at ~13 minutes. the fix isn't a bigger function — it's making the job resumable and letting a cron poll it to completion.
deep research arena fires one research question at several providers and compares the outputs. the problem is that deep research is slow. some of these jobs run for hours, while vercel serverless functions cap at 800 seconds (about 13 minutes) on pro before they get killed mid-execution. the work outlives the function running it.
the first version streamed in whatever finished inside that window and marked everything else failed. that held up fine until i added the premium tiers. providers like parallel ultra8x and gemini-max can run for over an hour, and even gemini standard at around 20 minutes blows past 800 seconds, so none of them could ever finish before the function was killed at the 13-minute mark.
the fix was to stop running the job to completion inside one request. when an async provider starts, i persist its handle (the provider key, task id, and start time) to a jsonb column. at around 780 seconds the function returns early instead of marking the slot failed, so the slot stays pending with its task id sitting in the db. a cron then runs every minute, finds the runs still in progress, polls the stored handles, and saves each result whenever it lands.
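here is a minimal sketch of the request-path half of that, not the actual arena code. every name in it (AsyncHandle, Slot, PollResult, nextSlot, outOfBudget, driveSlot) is made up for illustration, and the poll and save callbacks stand in for whatever provider client and db layer you have. only the 780 and 800 second figures come from the setup above.

```typescript
// Hypothetical names throughout; only the 780 s / 800 s figures are from the post.
type AsyncHandle = { provider: string; taskId: string; startedAt: number };

type Slot =
  | { status: "pending"; handle: AsyncHandle }
  | { status: "done"; report: string }
  | { status: "failed"; error: string };

type PollResult =
  | { state: "running" }
  | { state: "done"; report: string }
  | { state: "error"; message: string };

const SOFT_DEADLINE_MS = 780_000; // leave ~20 s of headroom before the 800 s kill

// True once this invocation should stop polling and return.
function outOfBudget(invocationStartMs: number, nowMs: number): boolean {
  return nowMs - invocationStartMs >= SOFT_DEADLINE_MS;
}

// Map one poll result onto the slot. "running" keeps the slot pending with its
// handle intact; only a provider-reported error marks it failed.
function nextSlot(slot: Slot, poll: PollResult): Slot {
  if (slot.status !== "pending") return slot; // already settled, nothing to do
  switch (poll.state) {
    case "running":
      return slot;
    case "done":
      return { status: "done", report: poll.report };
    case "error":
      return { status: "failed", error: poll.message };
  }
}

// Request path: persist the handle, poll until done or out of budget.
async function driveSlot(
  handle: AsyncHandle,
  poll: (h: AsyncHandle) => Promise<PollResult>,
  save: (slot: Slot) => Promise<void>,
  now: () => number = Date.now,
  pollEveryMs = 5_000,
): Promise<Slot> {
  const invocationStart = now();
  const pending: Slot = { status: "pending", handle };
  await save(pending); // persist the handle first, so a cron can find it later

  while (!outOfBudget(invocationStart, now())) {
    const next = nextSlot(pending, await poll(handle));
    if (next.status !== "pending") {
      await save(next);
      return next;
    }
    await new Promise((resolve) => setTimeout(resolve, pollEveryMs));
  }
  return pending; // out of time: return early, the slot stays pending in the db
}
```

the point of splitting out nextSlot is that a cron handler could call the same function on each stored handle, so "running" means the same thing on both paths and running out of time is never treated as a failure.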
with that in place the run finishes whenever the providers finish, decoupled from any single function's lifetime. i kept streaming for the fast providers since that path was already working, and the cron just acts as the durable backstop for the slow ones. the request path and the cron share one module, and every save is an atomic jsonb merge, so they can poll the same handle concurrently without colliding.
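to make the "atomic jsonb merge" part concrete, here is a rough sketch under some assumptions: the db is postgres, the table is a hypothetical runs(id, slots jsonb), and the names SlotMap, buildSlotMerge, and mergeSlots are invented for the example. the sql uses postgres's jsonb || operator, and the in-memory part only models why merging per key is safer than writing the whole object back.

```typescript
// Hypothetical schema: runs(id text primary key, slots jsonb not null).
type SlotMap = Record<string, unknown>;

// One UPDATE statement that merges a patch into the stored object. Postgres
// `||` on two jsonb objects is a shallow merge: keys in the patch replace
// same-named keys and every other key is left untouched.
function buildSlotMerge(runId: string, patch: SlotMap): { text: string; values: string[] } {
  return {
    text: "UPDATE runs SET slots = slots || $1::jsonb WHERE id = $2",
    values: [JSON.stringify(patch), runId],
  };
}

// In-memory model of the same shallow merge.
function mergeSlots(current: SlotMap, patch: SlotMap): SlotMap {
  return { ...current, ...patch };
}

const stored: SlotMap = { parallel: "pending", gemini: "pending" };

// Read-modify-write: both writers start from the same snapshot and write the
// whole object back, so the later write erases the earlier writer's result.
let row: SlotMap = stored;
row = { ...stored, parallel: "done" }; // request path writes its snapshot
row = { ...stored, gemini: "done" }; // cron writes its own snapshot; parallel is lost
const lastWriteWins = row;

// Merge: each writer sends only its own key, applied to the current row.
const merged = mergeSlots(mergeSlots(stored, { parallel: "done" }), { gemini: "done" });
```

in the read-modify-write version the parallel slot ends up pending again, while the merged version keeps both results. and if the request path and the cron both save the same finished slot, the second merge just rewrites the same value, which is why polling the same handle twice is harmless.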
i tested it on a parallel ultra8x run. it was still in progress at 14 minutes, past the point where the old version would have killed it, and the cron saved a complete 26,686-character report at 22 minutes.
the takeaway is that maxDuration is a hard ceiling: you can raise it up to the plan limit (800 seconds on pro) but not past it. if the job can exceed that, the move is to make the job resumable and let a scheduler drive it to completion. you persist a handle, not a process, and because the saves are atomic merges you mostly get to ignore the concurrency.