1.2 Nomenclature: Naming Conventions, One-Letter and Three-Letter Codes

Foundational 7 min read APS Editorial Article 1.2

Peptide nomenclature is not bookkeeping. In a field where a single transposed residue can determine biological activity, the conventions governing how sequences are named and written carry real scientific weight. This article covers the one-letter and three-letter amino acid codes, sequence directionality, systematic naming, and the most consequential points of confusion in the literature.

Key Terms

N-terminus: The end of a peptide chain bearing a free or modified α-amino group; written at the left by convention.
C-terminus: The end of a peptide chain bearing a free or modified α-carboxyl group; written at the right by convention.
One-letter code: Single-character identifiers for the twenty canonical amino acids, standardized for sequence databases and computational use.
Three-letter code: Three-character identifiers for the twenty canonical amino acids, preferred in structural and chemical contexts.
International Nonproprietary Name (INN): A WHO-assigned generic name for a pharmaceutical substance, including approved peptide therapeutics.

Why Nomenclature Matters

The conventions governing how peptides are named, written, and communicated carry real scientific weight, because a single transposed residue can mean the difference between a bioactive sequence and an inert one. A researcher who misreads a sequence direction, misassigns a modification, or conflates two naming systems risks describing, synthesizing, or testing the wrong molecule. Understanding the logic behind peptide nomenclature, meaning not just the rules but why the rules exist, is the foundation for reading the literature accurately and communicating your own work without ambiguity.

The Amino Acid Codes

The twenty canonical amino acids are identified by two shorthand systems that coexist in the literature and serve different purposes.^[5]

The three-letter code is the older of the two conventions. It is generally the first three letters of the amino acid name: Gly for glycine, Ala for alanine, Phe for phenylalanine. Most three-letter codes are intuitive enough to be learned quickly. The exceptions, Asn for asparagine, Gln for glutamine, Ile for isoleucine, and Trp for tryptophan, require deliberate memorization. The three-letter code is preferred in detailed structural discussions, in tables, and whenever ambiguity about modification state needs to be resolved explicitly.

The one-letter code was introduced as sequence databases grew and single-character representation became computationally and typographically necessary. It is less intuitive: G for glycine and A for alanine are straightforward, but K for lysine, Q for glutamine, and W for tryptophan are not, and the ambiguity codes B for asparagine or aspartate, Z for glutamine or glutamate, and X for an unknown or unspecified residue reflect the compromises made when fitting twenty-plus entries into the Roman alphabet. The one-letter code is standard in sequence alignments, database entries, and any context where long sequences must be presented compactly.

Both systems are given in the table below. Every peptide scientist should know both without reference.

The Standard Amino Acid Codes

Amino Acid	Three-Letter Code	One-Letter Code
Alanine	Ala	A
Arginine	Arg	R
Asparagine	Asn	N
Aspartate	Asp	D
Cysteine	Cys	C
Glutamate	Glu	E
Glutamine	Gln	Q
Glycine	Gly	G
Histidine	His	H
Isoleucine	Ile	I
Leucine	Leu	L
Lysine	Lys	K
Methionine	Met	M
Phenylalanine	Phe	F
Proline	Pro	P
Serine	Ser	S
Threonine	Thr	T
Tryptophan	Trp	W
Tyrosine	Tyr	Y
Valine	Val	V

Ambiguity codes: B = Asp or Asn; Z = Glu or Gln; J = Leu or Ile; X = unknown or non-standard.
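As a minimal sketch, the table above can be held as a lookup and used to convert between the two notations. The function names `to_one_letter` and `to_three_letter` are illustrative only, and the sketch assumes hyphen-separated three-letter input with no modifications or ambiguity codes.

```python
# Three-letter -> one-letter codes for the twenty canonical amino acids,
# transcribed from the table above.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Glu": "E", "Gln": "Q", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
# Inverse lookup: one-letter -> three-letter.
ONE_TO_THREE = {one: three for three, one in THREE_TO_ONE.items()}


def to_one_letter(hyphenated):
    """Convert e.g. 'Gly-Arg-Gly-Asp-Ser' to 'GRGDS' (KeyError on unknown codes)."""
    return "".join(THREE_TO_ONE[code] for code in hyphenated.split("-"))


def to_three_letter(sequence):
    """Convert e.g. 'GRGDS' to 'Gly-Arg-Gly-Asp-Ser' (KeyError on unknown letters)."""
    return "-".join(ONE_TO_THREE[letter] for letter in sequence)


print(to_one_letter("Gly-Arg-Gly-Asp-Ser"))  # GRGDS
print(to_three_letter("IKVAV"))              # Ile-Lys-Val-Ala-Val
```

Raising on unknown codes, rather than silently substituting X, is a deliberate choice here: a residue outside the table is something the reader should see.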

Sequence Direction: N to C by Convention

Peptide sequences are written from left to right, amino terminus to carboxyl terminus, by universal convention. This is not arbitrary. The ribosome synthesizes peptides in the N-to-C direction, and the biological logic of peptide function is largely organized around this directionality. When a sequence is written H-GRGDS-OH, the H denotes a free amino terminus and the OH a free carboxyl terminus. When those termini are modified, whether acetylated, amidated, or cyclized, the modification is stated explicitly, as in Ac-GRGDS-NH2 for an N-acetylated, C-amidated peptide.

Reversing this convention, even accidentally, produces a different molecule with different properties. The sequence GRGDS and its retro sequence SDGRG are not the same peptide and do not behave identically in biological assays, even though they share the same amino acid composition. Directionality is not a formality.
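The point about GRGDS and SDGRG can be illustrated with a short sketch: the two strings have identical composition but are different sequences. The helper names `retro` and `same_composition` are hypothetical, chosen for this example.

```python
from collections import Counter


def retro(sequence):
    """Return the sequence read in the opposite (C-to-N) direction."""
    return sequence[::-1]


def same_composition(a, b):
    """True if both sequences contain the same residues in the same counts."""
    return Counter(a) == Counter(b)


forward = "GRGDS"          # written N-terminus to C-terminus
reverse = retro(forward)

print(reverse)                             # SDGRG
print(same_composition(forward, reverse))  # True
print(forward == reverse)                  # False
```

A composition check alone therefore cannot confirm that a sequence was transcribed in the right direction; only a position-by-position comparison can.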

Naming Peptides Systematically

For short peptides, systematic nomenclature uses the suffix -yl for all residues except the C-terminal one, which retains its full name. A tripeptide of glycine, alanine, and valine in that order is glycyl-alanyl-valine, abbreviated Gly-Ala-Val or GAV. Hyphens between three-letter codes are standard in the older literature; modern usage often omits them for sequences longer than four or five residues.

This systematic approach becomes cumbersome for longer sequences and is rarely used beyond the tetrapeptide level. In practice, sequences longer than four residues are written as strings of one-letter or three-letter codes, with modifications and unusual residues noted explicitly.
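The -yl rule described above can be sketched as a small function. The acyl names in the lookup (for example tryptophyl and cysteinyl, which are irregular) are the standard chemical names and are supplied here as an assumption of the example rather than taken from this article; `systematic_name` is a hypothetical helper, and the hyphenated output follows the glycyl-alanyl-valine style used above.

```python
# Acyl (-yl) forms used for every residue except the C-terminal one.
ACYL = {
    "A": "alanyl", "R": "arginyl", "N": "asparaginyl", "D": "aspartyl",
    "C": "cysteinyl", "E": "glutamyl", "Q": "glutaminyl", "G": "glycyl",
    "H": "histidyl", "I": "isoleucyl", "L": "leucyl", "K": "lysyl",
    "M": "methionyl", "F": "phenylalanyl", "P": "prolyl", "S": "seryl",
    "T": "threonyl", "W": "tryptophyl", "Y": "tyrosyl", "V": "valyl",
}
# Full names, used for the C-terminal residue.
FULL = {
    "A": "alanine", "R": "arginine", "N": "asparagine", "D": "aspartic acid",
    "C": "cysteine", "E": "glutamic acid", "Q": "glutamine", "G": "glycine",
    "H": "histidine", "I": "isoleucine", "L": "leucine", "K": "lysine",
    "M": "methionine", "F": "phenylalanine", "P": "proline", "S": "serine",
    "T": "threonine", "W": "tryptophan", "Y": "tyrosine", "V": "valine",
}


def systematic_name(sequence):
    """Name a short peptide given as one-letter codes, N-terminus first."""
    if not sequence:
        raise ValueError("sequence must contain at least one residue")
    # All residues but the last take the -yl form; the last keeps its full name.
    parts = [ACYL[r] for r in sequence[:-1]] + [FULL[sequence[-1]]]
    return "-".join(parts)


print(systematic_name("GAV"))  # glycyl-alanyl-valine
```

Running the same function on a pentapeptide such as GRGDS makes it clear why the systematic form is abandoned for longer sequences.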

Naming Modified and Non-Canonical Residues

The two-code system covers only the twenty canonical amino acids. The far larger universe of non-canonical residues, post-translational modifications, and synthetic building blocks requires additional conventions.

For post-translational modifications, the modification is usually indicated either in parentheses following the affected residue or as a prefix: Ser(p) or pSer for phosphoserine, Lys(Ac) for acetyllysine. For non-canonical amino acids without an established abbreviation, three-letter codes are assigned by convention within a paper and defined at first use. The IUPAC recommendations provide a framework, though complete standardization remains aspirational rather than achieved.

D-amino acids are distinguished from their L-counterparts by lowercase one-letter codes in some notations, or by the prefix D- in three-letter notation: D-Phe, D-Ala. This distinction is critical in contexts where stereochemistry affects biological activity or proteolytic stability, which is to say, most contexts of practical importance.
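A minimal sketch of the lowercase convention follows, assuming a notation in which a lowercase one-letter code marks a D-residue. Not every source uses this notation, so the convention should be confirmed before parsing real data. The function name `expand_with_stereo` is hypothetical, and glycine is left unprefixed because it is achiral.

```python
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "E": "Glu", "Q": "Gln", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def expand_with_stereo(sequence):
    """Expand one-letter codes to three-letter codes, marking lowercase as D-."""
    parts = []
    for letter in sequence:
        three = ONE_TO_THREE[letter.upper()]
        # Lowercase means D-configuration; glycine has no stereocentre to mark.
        if letter.islower() and three != "Gly":
            three = "D-" + three
        parts.append(three)
    return "-".join(parts)


print(expand_with_stereo("GrGDS"))  # Gly-D-Arg-Gly-Asp-Ser
```

Note that case-insensitive tools will silently discard this information, which is one reason the explicit D- prefix is safer in written work.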

Peptide Names in the Literature

Beyond systematic nomenclature, many peptides carry common names that have accumulated historical weight: oxytocin, vasopressin, substance P, alamethicin, melittin. These names carry no structural information but are so embedded in the literature that they function as stable identifiers. When encountering a named peptide for the first time, the first task is to find its sequence; the name alone tells you nothing about its chemistry.

Synthetic peptides derived from known proteins are often identified by their protein of origin: the fibronectin-derived peptide GRGDS, the laminin-derived IKVAV. This origin-based naming encodes useful biological context but requires knowledge of the parent protein to interpret fully.

Therapeutic peptides acquire International Nonproprietary Names (INNs) through the WHO naming system, typically carrying the stem -tide for peptides: exenatide, liraglutide, ziconotide.^[6] These names are used in clinical and regulatory contexts and are distinct from the chemical names used in the research literature.

Common Points of Confusion

Three points of confusion recur with enough frequency to deserve explicit attention.

First, asparagine versus aspartate and glutamine versus glutamate are regularly confused, both in reading and in writing. The pairs share three-letter codes that differ by one letter, Asn/Asp and Gln/Glu, and one-letter codes that share no obvious relationship to the names. The charged residues carry the D and E codes; the amide residues carry N and Q. These pairs must be committed to memory because confusing them in a sequence is a meaningful chemical error.

Second, leucine and isoleucine are isomers and therefore isobaric: they share the same molecular formula, C6H13NO2, and cannot be distinguished by mass alone in standard mass spectrometry. The ambiguity code J exists precisely because MS-based sequencing cannot resolve them without additional data. Awareness of this limitation is essential when interpreting mass spectrometric sequence data.
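The isobaric relationship can be checked with a short calculation. This sketch uses the monoisotopic masses of the most abundant isotopes of C, H, N, and O, which are standard reference values supplied here rather than taken from this article. Valine is included only as a contrast, and `collapse_leu_ile` is a hypothetical helper showing how L and I might be merged to J before comparing mass-derived sequences.

```python
# Monoisotopic masses (Da) of the most abundant isotopes.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}

# Elemental composition of the free amino acids.
FORMULA = {
    "Leu": {"C": 6, "H": 13, "N": 1, "O": 2},
    "Ile": {"C": 6, "H": 13, "N": 1, "O": 2},
    "Val": {"C": 5, "H": 11, "N": 1, "O": 2},
}


def monoisotopic_mass(formula):
    """Sum element masses weighted by their counts in the formula."""
    return sum(MONOISOTOPIC[element] * count for element, count in formula.items())


def collapse_leu_ile(sequence):
    """Replace L and I with the ambiguity code J."""
    return sequence.replace("L", "J").replace("I", "J")


leu = monoisotopic_mass(FORMULA["Leu"])
ile = monoisotopic_mass(FORMULA["Ile"])
val = monoisotopic_mass(FORMULA["Val"])

print(leu == ile)                 # True: identical formula, identical mass
print(leu == val)                 # False
print(collapse_leu_ile("IKVAV"))  # JKVAV
```

Because the two formulas are identical, no improvement in mass accuracy separates the residues; distinguishing them requires information beyond the intact residue mass.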

Third, the B, Z, and X ambiguity codes appear in database entries when sequencing data is incomplete or ambiguous. A sequence containing X is not a fully defined peptide. When these codes appear in the literature, they signal a gap in characterization that should be noted.
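One way to act on this is to scan a sequence for ambiguity codes before treating it as fully defined. The sketch below uses the B, Z, J, and X meanings listed with the table earlier in this article; `ambiguous_positions` and `is_fully_defined` are illustrative names, and positions are counted from 1 at the N-terminus.

```python
# Ambiguity codes and their meanings, as listed under the code table.
AMBIGUITY = {
    "B": "Asp or Asn",
    "Z": "Glu or Gln",
    "J": "Leu or Ile",
    "X": "unknown or non-standard",
}


def ambiguous_positions(sequence):
    """Return (position, code, meaning) for each ambiguity code found."""
    return [
        (position, letter, AMBIGUITY[letter])
        for position, letter in enumerate(sequence, start=1)
        if letter in AMBIGUITY
    ]


def is_fully_defined(sequence):
    """True if the sequence contains no ambiguity codes."""
    return not ambiguous_positions(sequence)


print(ambiguous_positions("GRXDS"))  # [(3, 'X', 'unknown or non-standard')]
print(is_fully_defined("GRGDS"))     # True
```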

References

[5] IUPAC-IUB Joint Commission on Biochemical Nomenclature (1984). Nomenclature and symbolism for amino acids and peptides. European Journal of Biochemistry, 138(1), 9–37.
[6] WHO International Nonproprietary Names (INN) Programme. World Health Organization. https://www.who.int/teams/health-product-and-policy-standards/inn

