Systems engineering runs on text. Requirements, interface specs, verification procedures, trade studies, review comments. Most of the artifacts that define a program are written in natural language before they are anything else. That one fact explains why large language models landed in this field faster than almost anyone expected, and why INCOSE went to the trouble of publishing a dedicated Survey of LLM Applications for Systems Engineering to make sense of a space that shifts month to month.
But "LLMs fit systems engineering" and "LLMs are safe to trust on a flight-critical program" are two very different claims. This piece looks at where AI copilots in model-based systems engineering (MBSE) are genuinely pulling their weight, where they quietly fall over, and how careful teams keep the productivity without inheriting the risk.
Why systems engineering suits LLMs better than most fields
Most enterprise AI pilots stall because the underlying data is messy, private, and badly structured. Systems engineering starts from the opposite position. Its work products are text-heavy, follow recognizable conventions, and are written against published standards like ISO/IEC/IEEE 15288 and the requirements guidance in ISO/IEC/IEEE 29148. To a language model, a lot of SE work looks like a well-defined translation or rewriting task.
The INCOSE survey makes the point that generative AI in the SE community already covers a wide spectrum: drafting individual requirements, producing full text-based documents, and generating models themselves. It also flags a pattern worth absorbing: teams rarely stick with a raw, general-purpose model for long. They adapt it for SE work through fine-tuning and retrieval-augmented generation (RAG), grounding the output in their own authoritative data instead of the open internet.
Where AI copilots already earn their keep
The most believable wins right now sit in the unglamorous middle of the workflow, where an LLM speeds up a human rather than replacing their judgment.
- Requirements authoring and cleanup. Drafting candidate "shall" statements, flagging vague or unverifiable language, and rewriting requirements into a consistent house style.
- Requirements analysis. Scanning a requirement set for ambiguity, contradictions, duplication, and missing conditions. Humans do this well but cannot do it tirelessly across thousands of statements.
- Derivation and decomposition. Proposing subsystem requirements derived from a system-level need, which an engineer then accepts, edits, or throws out.
- Architecture first drafts. Sketching candidate functions, modes, and component breakdowns. Research on generative design of model-based spacecraft architectures found that models could produce architectures whose functions, modes, and components were generally traceable back to requirements. That is a real result, with an important caveat we will get to.
- Cross-model alignment. Recent work on LLM-assisted semantic alignment in collaborative MBSE using SysML v2 explores using LLMs to reconcile terminology and structure when several teams contribute to one system model. Mismatched terminology is a constant source of pain on distributed programs.
The common thread is augmentation. The copilot shortens the trip from blank page to reviewable draft, and the engineer stays the one who signs off.
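To make the "flagging vague or unverifiable language" item concrete, here is a minimal sketch of the kind of deterministic check a team might run alongside a copilot, so the LLM's suggestions get compared against something rule-based. The word list, the `Finding` type, and the `lint_requirement` function are all hypothetical names invented for illustration; they do not come from any standard or tool.

```python
import re
from dataclasses import dataclass

# Hypothetical word list for illustration only. A real program would maintain
# its own list, for example derived from its requirements style guide.
VAGUE_TERMS = ("adequate", "as appropriate", "fast", "user-friendly", "robust", "sufficient")


@dataclass
class Finding:
    req_id: str
    issue: str


def lint_requirement(req_id: str, text: str) -> list[Finding]:
    """Flag a missing 'shall' and any vague terms in one requirement statement."""
    findings = []
    lowered = text.lower()
    # Assumption: house style requires the keyword "shall" in every requirement.
    if not re.search(r"\bshall\b", lowered):
        findings.append(Finding(req_id, "missing 'shall'"))
    # Whole-word matches only, so "fast" does not fire on "fastener".
    for term in VAGUE_TERMS:
        if re.search(rf"\b{re.escape(term)}\b", lowered):
            findings.append(Finding(req_id, f"vague term: '{term}'"))
    return findings


for finding in lint_requirement("SYS-001", "The pump should be fast and robust."):
    print(finding.req_id, finding.issue)
```

A check like this catches only surface patterns. It says nothing about whether a requirement is correct or verifiable in context, which is where the engineer's review still carries the weight.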
The traceability and hallucination problem
Here is where the enthusiasm needs a hard edge. The same spacecraft-architecture study that found LLM output "generally traceable" also found that the generated designs did not match the traceability quality of existing, human-built ones. In systems engineering that gap is the whole game, because the value of a model lives in defensible links between need, requirement, design, and verification.
LLMs are probabilistic text generators, not reasoning engines with a ground truth. That produces a familiar set of failure modes the SE literature keeps pointing at: invented references to requirements or components that do not exist, confident but wrong restatements of an interface, silent disagreement between two outputs generated minutes apart, and a general opacity that makes the results hard to check. A recent framework for risk assessment of LLMs in systems engineering treats these as lifecycle risks (reliability, alignment, bias, limited interpretability), not cosmetic bugs.
The compliance fallout is direct. As the INCOSE survey puts it, the lack of clear quality-assurance pathways compounds hallucination, ambiguity, and inconsistency, "thereby undermining compliance with standards such as INCOSE, ISO/IEC/IEEE 15288, and IEEE 29148." A requirement you cannot trust to trace is not a faster requirement. It is a liability that happens to look finished.
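Of the failure modes above, invented references are among the few that can be checked mechanically. The sketch below assumes a hypothetical ID convention (two to five capital letters, a hyphen, and at least three digits) and a baseline available as a plain set of IDs; both are illustrative assumptions, as are the names `ID_PATTERN` and `find_unknown_ids`.

```python
import re

# Hypothetical ID convention such as "SYS-042" or "PWR-007".
# Adjust the pattern to whatever scheme the program actually uses.
ID_PATTERN = re.compile(r"\b[A-Z]{2,5}-\d{3,}\b")


def find_unknown_ids(draft: str, baseline_ids: set[str]) -> list[str]:
    """Return IDs cited in the draft that are absent from the baseline, in first-seen order."""
    unknown = []
    for cited in ID_PATTERN.findall(draft):
        if cited not in baseline_ids and cited not in unknown:
            unknown.append(cited)
    return unknown


# Illustrative baseline and LLM draft, not real program data.
baseline = {"SYS-001", "SYS-002", "PWR-010"}
draft = "The battery heater requirement derives from SYS-001 and SYS-009 and allocates to PWR-010."

print(find_unknown_ids(draft, baseline))  # ['SYS-009']
```

This only catches references to elements that do not exist. A link to a real but wrong requirement, or a confident misstatement of an interface, passes straight through and still needs a human reviewer.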
How serious teams contain the risk
The teams getting real value are not the ones with the cleverest prompts. They are the ones who treat the LLM as a constrained part inside a governed process. A few patterns show up again and again.
- Ground the model in the model. RAG against the authoritative system model and requirements baseline, rather than the open web, reduces the room for hallucination and keeps suggestions tied to real, identifiable elements.
- Keep a human in the verification loop. Copilots draft, engineers approve, and every acceptance is logged so each AI-assisted change has an owner and an audit trail.
- Make traceability a first-class output. If the tool proposes a requirement or component, it should propose the trace links too, so a reviewer can check them on the spot instead of reconstructing them later.
- Watch the cost-of-expertise tradeoff. Work presented at INCOSE's 2025 International Symposium on performance tradeoffs in LLMs for systems engineering is a useful reminder that a bigger or more specialized model is not automatically better. The right model depends on the job.
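Two of these patterns, the human verification loop and traceability as a first-class output, can be sketched as a small review gate. Everything here is a hypothetical illustration under simple assumptions: the `Suggestion` and `AuditLog` types, the `review` function, and the idea that the baseline is a set of element IDs are inventions for this example, not any vendor's API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Suggestion:
    text: str
    trace_links: list[str]  # baseline element IDs the copilot proposes as parents
    status: str = "proposed"


@dataclass
class AuditLog:
    entries: list[dict] = field(default_factory=list)

    def record(self, suggestion: Suggestion, reviewer: str, decision: str) -> None:
        # Each decision gets an owner and a timestamp.
        self.entries.append({
            "text": suggestion.text,
            "trace_links": list(suggestion.trace_links),
            "reviewer": reviewer,
            "decision": decision,
            "at": datetime.now(timezone.utc).isoformat(),
        })


def review(suggestion: Suggestion, reviewer: str, accept: bool,
           baseline_ids: set[str], log: AuditLog) -> Suggestion:
    """Apply a human decision; refuse to accept anything without valid trace links."""
    missing = [link for link in suggestion.trace_links if link not in baseline_ids]
    if accept and (not suggestion.trace_links or missing):
        raise ValueError(f"cannot accept: no trace links or unknown links {missing}")
    suggestion.status = "accepted" if accept else "rejected"
    log.record(suggestion, reviewer, suggestion.status)
    return suggestion


# Illustrative usage with made-up IDs and reviewer name.
log = AuditLog()
baseline = {"SYS-001", "SYS-002"}
draft = Suggestion("The heater shall hold the battery above 0 degrees C.", ["SYS-001"])
review(draft, reviewer="j.doe", accept=True, baseline_ids=baseline, log=log)
print(draft.status, len(log.entries))
```

The design choice worth noting is that the gate fails closed: an acceptance without checkable trace links raises an error instead of being logged, so the audit trail never contains an approved item that cannot be traced.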
Tooling architecture matters here too. A copilot bolted onto a document store inherits all of that store's ambiguity. A copilot that works on a structured, SysML v2-based model can operate on typed elements and existing trace links instead of loose prose, which keeps its suggestions anchored to the authoritative model and to the review state regulated programs expect. Where the copilot lives varies by vendor: some add assistants to established platforms like Dassault's Cameo, others sit alongside open-source tools like Eclipse Capella, and newer entrants such as Dalus build the copilot directly into a SysML v2-native core. For defense and aerospace teams, deployment posture counts as much as features, so SOC 2 compliance, government-cloud options, and on-premises hosting belong in any serious evaluation from the start rather than being treated as an afterthought.
What AI copilots will not solve
Set expectations before you set up a pilot. LLMs will not supply the domain knowledge your program lacks. They recombine patterns, and a confidently wrong subsystem decomposition can cost more to unwind than it saved to draft. They will not replace verification and validation, since a model that "looks complete" is the exact failure mode V&V exists to catch. They will not settle organizational disagreement about what the system should do. If stakeholders have not aligned, the copilot just writes fluent versions of the confusion. And on their own they will not hand you a defensible digital thread. Traceability is an engineering discipline that tooling can support but cannot manufacture.
There is a quieter risk too: automation complacency. The faster and more fluent the drafts, the easier it is for reviewers to wave them through. The teams that win deliberately put friction back in at the steps where a human signature carries legal and safety weight.
Bottom line
LLMs in systems engineering have moved past the first peak of the hype cycle into something more useful and more sober. Used as copilots that draft requirements, surface inconsistencies, sketch architectures, and align models, they can meaningfully compress the slow early phases of MBSE. Used as oracles, they put back exactly the ambiguity and untraceability that MBSE was built to remove. The thing that separates teams in 2026 is not whether they use AI. It is whether the AI is grounded in their authoritative model, kept inside a human-governed verification loop, and held to the same traceability bar as everything else on the program. Get that right and the copilot is a real force multiplier. Get it wrong and you have just automated the production of plausible-looking technical debt.