• Are the MEDLINE abstracts meant to be used as input data?
Yes. In fact, you will probably need to use them to achieve competitive accuracy.

• Why do the abstracts often contain references to gene names followed by a “p”? For example, abstract 10022848 references “sec4p” and “sec15p”, but the file gene-abstracts.txt associates this abstract with the genes “sec4” and “sec15”.
The “p” suffix is often used to refer to the protein encoded by a given gene. For example, “sec4p” denotes the protein encoded by the gene “sec4”. Since the protein is the “product” of the gene, you can treat a reference to “sec4p” as saying something about “sec4”.

• Are some of the abstracts more relevant than others?
Yes, certainly. The relevance of the abstracts probably varies widely.

• Is the relation represented by the protein-protein interaction table symmetric? Why are some pairs listed both ways, while most are not?
Yes, the relation is symmetric, so the order of the pair in each row does not matter. Some pairs are listed in both orders simply be