
Solubility data set, SMILES format
Please acknowledge Dr Jarmo Huuskonen in any publication that uses these data.

train.smi = 1033 compounds in training set
test1.smi = 258 compounds in test set #1
test2.smi = 21 compounds in test set #2

Attention! Some of the compounds use a notation for explicit hydrogens that some programs cannot parse. For example, line 79 contains the SMILES string:
79 n(H)(c(c(c1cccc2)ccc3)c3)c12

If changed to
79 [nH](c(c(c1cccc2)ccc3)c3)c12

the problem will be resolved. CACTVS, for example, can also read the first format shown. I did not want to alter the files I received, so the original data sets are provided here unchanged. This also means that some missing spacers have been left uncorrected; I will provide a corrected data set here as soon as I find the time.
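
Until a corrected data set is available, the substitution shown above can be applied programmatically. The following is a minimal Python sketch, not part of the original data set: the function name fix_explicit_h is hypothetical, and it assumes that the only problematic pattern is an aromatic nitrogen written as n(H). Other atoms written with the same notation, if any exist in the files, would need their own rule.

```python
import re

# Assumption: the problematic notation is an aromatic nitrogen followed by
# an explicit hydrogen written as a branch, i.e. the literal text "n(H)".
_N_WITH_H = re.compile(r"n\(H\)")


def fix_explicit_h(text: str) -> str:
    """Rewrite every 'n(H)' in the given text as the bracket atom '[nH]'."""
    return _N_WITH_H.sub("[nH]", text)


if __name__ == "__main__":
    # The entry quoted above, including its leading number.
    print(fix_explicit_h("79 n(H)(c(c(c1cccc2)ccc3)c3)c12"))
```

Because the function only replaces the literal n(H) pattern, it can be applied to each line of a file as a whole, and any other fields on the line are left untouched.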