In this case, the Transformers v5 upgrade is what made the problem visible, but it is probably not the root cause itself.
Bottom line
I would treat this as a DataParallel + generate() integration failure, not as a Helsinki-NLP-specific regression and not as strong evidence that Python 3.12 or Ubuntu 24 broke Marian. The most likely immediate cause is:
generate() needs to know the model’s device.
- In current Transformers, ModuleUtilsMixin.device is implemented as return next(param.device for param in self.parameters()).
- PyTorch has public issue history showing that inside nn.DataParallel replica forward calls, self.parameters() can be empty.
- If those two facts meet, StopIteration is the natural outcome. (GitHub)
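The failure mode can be illustrated without GPUs or even PyTorch. The sketch below mimics the device-resolution pattern described above using stand-in objects (FakeParam and resolve_device are hypothetical names for illustration, not Transformers API):

```python
# Stand-in for torch.nn.Parameter: only the .device attribute matters here.
class FakeParam:
    def __init__(self, device):
        self.device = device

def resolve_device(parameters):
    # Mirrors the Transformers pattern described above:
    # next(param.device for param in self.parameters())
    return next(param.device for param in parameters)

# Normal module: at least one parameter, so a device is found.
print(resolve_device([FakeParam("cuda:0")]))  # cuda:0

# A DataParallel replica inside forward: no parameters yielded.
try:
    resolve_device([])
except StopIteration:
    print("StopIteration: no parameters to read a device from")
```

The point is that nothing in this pattern guards against an empty parameter iterator; the emptiness is exactly what DataParallel replicas can exhibit mid-forward.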
Why your theory makes sense technically
The key detail is the current Transformers implementation of .device. It does not do anything fancy there; it simply asks the module for its parameters and takes the device of the first one. If no parameter is yielded, that line fails immediately with StopIteration. (GitHub)
That lines up unusually well with PyTorch’s DataParallel behavior. PyTorch issue #49828 shows a minimal example where, inside forward, list(self.parameters()) becomes [] under nn.DataParallel. That is not a theoretical concern; it is a public repro on PyTorch’s tracker. So your description — “it blows up when something downstream interrogates self.model.device inside generate()” — is mechanically very plausible. (GitHub)
There is also older Transformers issue history showing the same general failure shape: code paths using next(self.parameters()) under DataParallel can raise StopIteration. An old XLNet issue shows exactly that pattern in forward execution under DataParallel. (GitHub)
Why this is probably not really “a Marian problem”
Everything about the failure points to the wrapper architecture rather than the translation model family.
Public Hugging Face issue history shows several recurring classes of wrapper-related problems around generation:
- generate() on a DataParallel-wrapped model not being workable in practice. (GitHub)
- seq2seq examples breaking under DataParallel, with the issue disappearing when restricted to one visible GPU. (GitHub)
- pipeline and other high-level APIs also not reliably supporting a DataParallel-wrapped object. (GitHub)
That pattern is much broader than Marian. It says: generation-oriented Hugging Face code and DataParallel have been a rough combination for years. Your translation models are just another place where that rough edge is surfacing. (GitHub)
Why it may have worked for years and then failed after your upgrade
That is believable, and it still does not point to Python 3.12 as the root cause.
DataParallel is one of those APIs where small changes in PyTorch internals, wrapper behavior, generation logic, or device-resolution timing can change whether a latent bug becomes visible. PyTorch still documents DataParallel as a replicated-per-forward, single-process, multi-thread abstraction, and still recommends DistributedDataParallel instead. Hugging Face has also continued evolving generation internals in v5 and in the releases around it, so a formerly lucky code path can become unlucky without the model itself changing. (PyTorch Documentation)
So my reading is:
- your upgrade likely exposed the problem,
- but the design weakness was already there.
Why DataParallel is the wrong fit for your exact workload
Your workload is unusually clear:
- one machine,
- two GPUs,
- inference only,
- models that fit comfortably on a single GPU,
- goal is throughput on batched translation.
That is almost the textbook case for one process per GPU with one ordinary model replica per process.
PyTorch’s own comparison explains why:
- DataParallel is single-process, multi-threaded,
- DistributedDataParallel is multi-process,
- DataParallel pays thread/GIL overhead,
- it also pays per-iteration replication overhead,
- and scattering/gathering adds more overhead,
- so DDP is usually faster even on a single machine. (PyTorch Documentation)
That point matters even more for generate(). Translation generation is not just “run one forward once.” It is an iterative control loop that keeps checking model state, generation configuration, and device placement. That is exactly the kind of path where wrapper abstractions tend to leak. (GitHub)
Should you switch to DDP?
Directionally, yes. Practically, the cleaner answer is:
Use the DDP architecture idea, but not necessarily the DDP wrapper.
For pure inference, the important part is not gradient synchronization. The important part is:
- one process owns GPU 0,
- one process owns GPU 1,
- each process loads a normal unwrapped model,
- each process translates its shard of the batch,
- you merge outputs back in original order.
That gets you the benefit PyTorch wants you to have — multi-process, one replica per GPU — without forcing you into training-style distributed ceremony you do not need. PyTorch’s DDP tutorial and Hugging Face Accelerate’s distributed-inference guide both point in that direction. Accelerate specifically documents split_between_processes() for exactly this sort of multi-GPU inference sharding. (PyTorch Documentation)
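The shard-and-merge part of that design is small enough to sketch in plain Python. The helper names (shard, merge) are illustrative, not library API; Accelerate's split_between_processes performs the equivalent split for you:

```python
def shard(batch, num_procs):
    # Contiguous split, keeping (original_index, item) so order can be restored.
    indexed = list(enumerate(batch))
    per = -(-len(indexed) // num_procs)  # ceiling division
    return [indexed[p * per:(p + 1) * per] for p in range(num_procs)]

def merge(results):
    # results: (original_index, output) pairs collected from all processes.
    return [out for _, out in sorted(results)]

shards = shard(["a", "b", "c", "d", "e"], 2)
# Each process would run generate() on its own shard; upper-casing stands in
# for the actual translation step here.
outputs = merge([(i, s.upper()) for part in shards for i, s in part])
print(outputs)  # ['A', 'B', 'C', 'D', 'E']
```

Carrying the original index through the pipeline is what makes "merge outputs back in original order" trivial regardless of which process finishes first.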
Why I would not “hand-jam threads”
I would hand-jam processes, not threads.
Threads are the wrong escape hatch here because DataParallel is already the thread-based solution. PyTorch’s own comparison explicitly frames DataParallel as single-process, multi-threaded, with GIL contention and replication overhead, while the preferred alternative is multi-process. If you replace DataParallel with your own thread orchestration, you are staying on the same side of the architectural boundary that PyTorch is already telling you to leave. (PyTorch Documentation)
So if you want a robust “I control it myself” implementation, the right homemade version is:
- worker process on GPU 0,
- worker process on GPU 1,
- persistent model instance per worker,
- input queue,
- output queue,
- batch sharding by index.
That is much closer to the officially recommended direction than a thread pool.
A subtle but important caveat about DDP wrappers
There is one thing Dr. Google often leaves out: wrapping the model in DistributedDataParallel and then calling model.generate(...) on the wrapper is not always smooth either. Hugging Face has a public issue where a DDP-wrapped model raised AttributeError: 'DistributedDataParallel' object has no attribute 'generate'. (GitHub)
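The shape of that failure is easy to reproduce with stand-in classes (FakeDDP and FakeModel are illustrative, not PyTorch API): the wrapper only forwards the forward pass, so generate() is reachable only through .module:

```python
class FakeModel:
    def generate(self, prompt):
        # Stand-in for a Hugging Face model's generate().
        return prompt + " -> translated"

class FakeDDP:
    # Stand-in for nn.parallel.DistributedDataParallel: it holds the real
    # model as .module and forwards forward(), but not arbitrary methods.
    def __init__(self, module):
        self.module = module

wrapped = FakeDDP(FakeModel())
print(hasattr(wrapped, "generate"))       # False: calling it raises AttributeError
print(wrapped.module.generate("hello"))   # works, but couples code to .module
```

Reaching through .module works, but it couples calling code to the wrapper's internals, which is part of why the recommendation below avoids the wrapper entirely for inference.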
That is why my real recommendation is not “blindly wrap in DDP.” It is:
- adopt the process-per-GPU design,
- keep each process’s model as a plain model object for inference,
- call normal model.generate(...) inside that process,
- use a launcher or helper library only for process orchestration.
For inference, that is usually simpler and more reliable than leaning on wrapper methods.
What I would do in your exact setup
For Helsinki-NLP/opus-mt-ru-en and Helsinki-NLP/opus-mt-zh-en, on a 2-GPU single machine, I would do this:
- Drop nn.DataParallel entirely for inference.
- Keep one ordinary model replica per GPU.
- Split each incoming batch into two shards.
- Run generate() independently in each process.
- Reassemble outputs in input order.
- Keep models loaded persistently so you do not repay load cost on every request. (PyTorch Documentation)
If you want the least boilerplate, use Accelerate distributed inference. If you want maximum explicit control and easy debugging, use plain Python multiprocessing or torchrun with one process per GPU. Either route is more appropriate than DataParallel for this workload. (Hugging Face)
What I would not do
I would not:
- spend much more time trying to patch around self.model.device under DataParallel,
- assume downgrading Python will fix it,
- assume downgrading Transformers will fix it,
- rely on model.module.generate(...) as the long-term answer,
- replace DataParallel with a custom thread-based design. (PyTorch Documentation)
The reason is simple: all of those keep you too close to the wrapper behavior that is already failing.
My actual diagnosis, stated plainly
If I had to summarize your case in one sentence:
generate() is touching model device state; Transformers currently resolves .device by iterating parameters; DataParallel replicas can expose no parameters inside forward; so StopIteration under DataParallel is a believable architectural failure, not an especially convincing sign that Marian or Python 3.12 is broken. (GitHub)
Recommendation
For your case, I would make this call:
- keep the Marian/SentencePiece fix separate,
- stop using nn.DataParallel for translation inference,
- move to one process per GPU,
- use plain unwrapped models in those processes,
- shard batches explicitly,
- use Accelerate or plain multiprocessing depending on how much control you want. (Hugging Face)
This also fits the broader current state of the Transformers stack and the separate Marian/backend diagnosis already established here.