Filtering

In this example, we demonstrate some features of GlyLES for extracting information from glycans on both the monomeric and the atomic level.

Monomer Level

These functions deal with finding monomers and their attached functional groups.

Parse only the Tree

To work with glycans beyond simple conversion, one can work directly on the structure of a glycan. This is possible without parsing the whole glycan into SMILES: setting tree_only=True when instantiating a glycan reads in only the monosaccharide tree.

[1]:
from glyles import Glycan


glycan = Glycan("Man(a1-2)Gal")  # read full glycan
print(glycan.glycan_smiles)  # better use glycan.get_smiles(), but good for demonstration

glycan = Glycan("Man(a1-2)Gal", tree_only=True)
print(glycan.glycan_smiles)
O1C(O)[C@H](O[C@H]2O[C@H](CO)[C@@H](O)[C@H](O)[C@@H]2O)[C@@H](O)[C@@H](O)[C@H]1CO
None

Count substructures

Using GlyLES, you can count how often a certain substructure occurs in a glycan. By default, only the type of monosaccharide has to match; the enantiomeric form, the configuration (alpha, beta, or undefined), the bonds between the monosaccharides, and any attached functional groups are ignored. In addition, you have to specify what to match against, here all monomers (match_nodes). Later, we will see that there are other possibilities.

Instead of passing a glycan object to the count function, you can also pass the IUPAC-condensed string.

[2]:
glycan = Glycan("Fuc(a1-2)[GalNAc(a1-3)]Gal(b1-4)GlcNAc(b1-3)[Fuc(a1-2)[GalNAc(a1-3)]Gal(b1-4)GlcNAc(b1-6)]Gal(b1-3)[GlcNAc(a1-4)Gal(b1-4)GlcNAc6S(b1-6)]GalNAc")
sub = Glycan("Gal")

print(glycan.count(sub, match_nodes=True))
print(glycan.count("Gal", match_nodes=True))
7
7
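
To make the default matching rule concrete, the following minimal sketch reproduces the node count above with plain string handling. It is not how GlyLES works internally. It assumes that the monosaccharide type is given by the first three letters of each name (true for Gal, Glc, Fuc, and Man, but not for names such as Neu5Ac), and the helper names monomer_names and count_type are made up for this illustration.

```python
import re


def monomer_names(iupac):
    """Split an IUPAC-condensed string into its monosaccharide names."""
    # Replace linkages such as "(a1-2)" and branch brackets by spaces.
    stripped = re.sub(r"\([^)]*\)|[\[\]]", " ", iupac)
    return stripped.split()


def count_type(iupac, mono_type):
    """Count monomers whose assumed base type (first three letters) matches."""
    return sum(name[:3] == mono_type for name in monomer_names(iupac))


iupac = "Fuc(a1-2)[GalNAc(a1-3)]Gal(b1-4)GlcNAc(b1-3)[Fuc(a1-2)[GalNAc(a1-3)]Gal(b1-4)GlcNAc(b1-6)]Gal(b1-3)[GlcNAc(a1-4)Gal(b1-4)GlcNAc6S(b1-6)]GalNAc"
print(count_type(iupac, "Gal"))  # 7: four Gal and three GalNAc
```

The three GalNAc monomers are counted because functional groups are ignored in this mode.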

This is possible for the root-monomer …

[3]:
print(glycan.count("Gal", match_root=True))
1

… and the leaves as well.

[4]:
print(glycan.count("Fuc", match_leaves=True))
2
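
As a rough illustration of what root and leaves mean in an IUPAC-condensed string, the sketch below reads them off the notation: the root is written last, and every chain (the main chain and each bracketed branch) starts with a leaf. This is an assumption-based string sketch, not the GlyLES implementation; root_and_leaves is a hypothetical helper, and the type is again taken to be the first three letters of a name.

```python
import re


def root_and_leaves(iupac):
    """Return (root name, list of leaf names) of an IUPAC-condensed string."""
    # Drop linkages such as "(a1-2)" so that only names and brackets remain.
    no_bonds = re.sub(r"\([^)]*\)", " ", iupac)
    # The root is the last name in the string.
    root = re.split(r"[\s\[\]]+", no_bonds.strip())[-1]
    # A leaf stands at the start of the string or directly after "[".
    leaves = re.findall(r"(?:^|\[)\s*([A-Za-z0-9]+)", no_bonds)
    return root, leaves


iupac = "Fuc(a1-2)[GalNAc(a1-3)]Gal(b1-4)GlcNAc(b1-3)[Fuc(a1-2)[GalNAc(a1-3)]Gal(b1-4)GlcNAc(b1-6)]Gal(b1-3)[GlcNAc(a1-4)Gal(b1-4)GlcNAc6S(b1-6)]GalNAc"
root, leaves = root_and_leaves(iupac)
print(root)    # GalNAc
print(leaves)  # ['Fuc', 'GalNAc', 'Fuc', 'GalNAc', 'GlcNAc']
print(sum(leaf[:3] == "Fuc" for leaf in leaves))  # 2
```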

This can be extended to bigger substructures.

Matching substructures of more than one monomer against only the root or only the leaves of a glycan is not possible.

[5]:
print(glycan.count("Gal(a1-2)GlcNAc", match_nodes=True))
3

Exact substructure matches

Counting can also require exact matches, i.e., the matching monosaccharides must carry exactly the same modifications as the query (match_all_fg). This includes the enantiomeric form, the configuration (alpha, beta, or undefined), and any attached functional groups. Similarly, matching of the bonds between the monosaccharides can be requested (match_edges). Finally, both matching filters can be combined.

[6]:
# Matching monosaccharides exactly
print(glycan.count("Gal(a1-2)Glc", match_nodes=True, match_all_fg=True))
print(glycan.count("Gal(a1-2)GlcNAc", match_nodes=True, match_all_fg=True))

# Matching bonds between monosaccharides exactly
print(glycan.count("Gal(a1-2)Glc", match_nodes=True, match_edges=True))
print(glycan.count("Gal(b1-4)Glc", match_nodes=True, match_edges=True))

# Matching both monosaccharides and their bonds exactly
print(glycan.count("Gal(a1-2)GlcNAc", match_nodes=True, match_all_fg=True, match_edges=True))
print(glycan.count("Gal(b1-4)Glc", match_nodes=True, match_all_fg=True, match_edges=True))
print(glycan.count("Gal(b1-4)GlcNAc", match_nodes=True, match_all_fg=True, match_edges=True))
0
2
0
3
0
0
2

Partially matching substructures

It is also possible to match only some functional groups (match_some_fg). This is an intermediate stage between matching no functional groups and matching all of them. The query glycan has to contain every functional group that must match. A monomer in the glycan that carries at least the functional groups of the query, and possibly more, is counted as a match. This can be combined with matching edges as well.

[7]:
print(glycan.count("Gal(b1-4)GlcNAc", match_some_fg=True, match_nodes=True))
3
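
The difference between ignoring, partially matching, and exactly matching functional groups can be pictured as a comparison of sets. The sketch below is only a mental model: representing functional groups as string labels such as "NAc" and "6S" is an assumption made here, and fg_match is a hypothetical helper, not part of GlyLES.

```python
def fg_match(query_fgs, node_fgs, mode="none"):
    """Compare two collections of functional-group labels under one of three modes."""
    query_fgs, node_fgs = set(query_fgs), set(node_fgs)
    if mode == "none":  # default: functional groups are ignored
        return True
    if mode == "some":  # the node needs at least the groups of the query
        return query_fgs <= node_fgs
    if mode == "all":  # the groups have to be identical
        return query_fgs == node_fgs
    raise ValueError(f"unknown mode: {mode}")


query = {"NAc"}  # functional groups of the query monomer GlcNAc
node = {"NAc", "6S"}  # functional groups of the monomer GlcNAc6S
for mode in ("none", "some", "all"):
    print(mode, fg_match(query, node, mode))
# none True
# some True
# all False
```

Under this model, GlcNAc6S matches the query GlcNAc partially but not exactly, which is consistent with the counts above: three matches for Gal(b1-4)GlcNAc with match_some_fg, but only two with match_all_fg.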

Filtering and Ordering Glycans based on Structural Properties

Using these counts, you can filter and sort lists of glycans based on structural properties.

[8]:
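# read the IUPAC-condensed string from the first column and close the file afterwards
with open("files/pubchem_poly.tsv", "r") as f:
    names = [line.split("\t")[0] for line in f]
glycans = [(name, Glycan(name)) for name in names]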
query = Glycan("Tal")
len(list(filter(lambda x: x[1].count(query, match_nodes=True) > 0, glycans)))
[8]:
0
[9]:
query = Glycan("Gal")
print("\n".join([f"{x[1].count(query, match_nodes=True)}: {x[0]}" for x in sorted(glycans, key=lambda x: x[1].count(query, match_nodes=True), reverse=True)][:10]))
4: Gal(b1-4)GlcNAc(b1-4)[GalNAc(b1-4)GlcNAc(b1-2)]Man(a1-3)[GalNAc(b1-4)GlcNAc(b1-2)[GalNAc(b1-4)GlcNAc(b1-6)]Man(a1-6)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc
3: GalNAc(b1-4)GlcNAc(b1-2)[GalNAc(b1-4)GlcNAc(b1-6)]Man(a1-6)[GalNAc(b1-4)GlcNAc(b1-2)Man(a1-3)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc
3: Gal(b1-4)GlcNAc(b1-2)[GalNAc(b1-4)GlcNAc(b1-6)]Man(a1-6)[GalNAc(b1-4)GlcNAc(b1-2)Man(a1-3)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc
3: Gal(b1-4)GlcNAc(b1-4)[GalNAc(b1-4)GlcNAc(b1-2)]Man(a1-3)[GalNAc(b1-4)GlcNAc(b1-2)Man(a1-6)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc
3: Gal(b1-4)GlcNAc(b1-6)[GalNAc(b1-4)GlcNAc(b1-2)]Man(a1-6)[GalNAc(b1-4)GlcNAc(b1-2)Man(a1-3)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc
3: Gal(a1-3)Gal(b1-4)GlcNAc(b1-2)Man(a1-3)[GalNAc(b1-4)GlcNAc(b1-2)Man(a1-6)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc
3: Fuc(a1-2)[GalNAc(a1-3)]Gal(b1-4)GlcNAc(b1-2)Man(a1-3)[Gal(b1-4)GlcNAc(b1-2)Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc
3: Fuc(a1-2)[GalNAc(a1-3)]Gal(b1-4)GlcNAc(b1-2)Man(a1-6)[Gal(b1-4)GlcNAc(b1-2)Man(a1-3)]Man(b1-4)GlcNAc(b1-4)GlcNAc
3: Gal(a1-3)Gal(b1-4)GlcNAc(b1-2)Man(a1-6)[NeuAc(a2-6)GalNAc(b1-4)GlcNAc(b1-2)Man(a1-3)]Man(b1-4)GlcNAc(b1-4)[Fuc(a1-6)]GlcNAc
3: Fuc(a1-2)[GalNAc(a1-3)]Gal(b1-4)GlcNAc(b1-2)Man(a1-6)[NeuAc(a2-3)Gal(b1-4)GlcNAc(b1-2)Man(a1-3)]Man(b1-4)GlcNAc(b1-4)GlcNAc

Atomic Level

Counting functional groups

Most of the functionality shown so far is already available in tools like glycowork, glypy, GlycoQL, or others. Now, we demonstrate functionality that works directly on the atomic structure of glycans, which is a new feature and not reproducible with glycowork and the others.

Using GlyLES, one can generate a glycan from its IUPAC-condensed string and then search for functional groups in it. The query can be given either as the IUPAC abbreviation of a functional group or as a SMILES or SMARTS string that is matched against the structure. Here, one has to be careful: the SMILES query CO also matches R1-C-O-C-R2, even though one might only want to match C-O-H groups. Therefore, it can be easier to pass the SMARTS string [#6]-[#8]-[#1] to match more precisely.

Another note on counting motifs: there are two options to match the query CO to the structure R1-C-O-C-R2, one with the left carbon and one with the right carbon. In the count method, we ensure that every atom is part of at most one match. So matching CO to R1-C-O-C-R2 yields one hit, whereas matching it to R1-C-O-O-C-R2 yields two.

[10]:
print(glycan.count_functional_groups("CO"))  # count C-O matches (ethers as well as hydroxy groups)
print(glycan.count_functional_groups("[#6]-[#8]-[#1]"))  # or insert it as SMARTS to get the matches you meant
print(glycan.count_functional_groups("S"))  # find the sulfate group
print(glycan.count_functional_groups("NC(=O)C"))  # count NAc groups
56
31
1
7
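
The rule that no atom may take part in more than one match can be sketched in a few lines. The example below is a simplified, greedy illustration under stated assumptions and not the code used inside GlyLES: the candidate matches are written out by hand as tuples of atom indices, and count_disjoint is a hypothetical helper name.

```python
def count_disjoint(matches):
    """Greedily count matches so that no atom index is used twice."""
    used = set()
    count = 0
    for match in matches:
        if used.isdisjoint(match):  # none of these atoms is taken yet
            used.update(match)
            count += 1
    return count


# R1-C-O-C-R2 with atoms C(0)-O(1)-C(2): both candidates share the oxygen.
print(count_disjoint([(0, 1), (2, 1)]))  # 1
# R1-C-O-O-C-R2 with atoms C(0)-O(1)-O(2)-C(3): the candidates are disjoint.
print(count_disjoint([(0, 1), (3, 2)]))  # 2
```

A greedy pass like this depends on the order of the candidates and is not guaranteed to find the largest possible set of disjoint matches in general; it is enough to show why the two structures above give one and two hits.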

Counting possible sites for deprotonation

With GlyLES, you can also count possible sites of deprotonation, either as the absolute number of oxygen atoms that can be deprotonated or as the number of functional groups (such as acid, sulfate, or phosphate groups) that can be deprotonated.

[11]:
print(glycan.count_protonation(groups=False))  # count all possible oxygen atoms
print(glycan.count_protonation(groups=True))  # count all functional groups that can be deprotonated, no matter how often
2
1