T. Jarrett, IPAC
(970818)
** updated ** (970903)
We generate "oblique" decision trees (ODT) using an algorithm developed by Murthy, Kasif & Salzberg (1994). See, for example, Oblique decision trees and The OC1 decision tree software system.
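To illustrate what "oblique" means here: each internal node of such a tree tests a linear combination of the input parameters (a hyperplane) rather than a single parameter against a threshold. The sketch below is a minimal illustration, not the OC1 code. The class names, the two-parameter toy tree, its coefficients, and the convention that a positive score follows the right branch are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass
class Leaf:
    label: str  # class assigned to any source that reaches this leaf


@dataclass
class ObliqueNode:
    weights: Sequence[float]  # one coefficient per input parameter
    bias: float               # constant term of the hyperplane
    left: "Tree"              # followed when the linear score is <= 0 (assumed convention)
    right: "Tree"             # followed when the linear score is > 0 (assumed convention)


Tree = Union[Leaf, ObliqueNode]


def classify(tree: Tree, features: Sequence[float]) -> str:
    """Walk the tree, evaluating one hyperplane test per internal node."""
    node = tree
    while isinstance(node, ObliqueNode):
        score = sum(w * x for w, x in zip(node.weights, features)) + node.bias
        node = node.right if score > 0 else node.left
    return node.label


# Hypothetical one-node tree over two made-up parameters.
toy_tree = ObliqueNode(
    weights=[1.0, -1.0],
    bias=0.0,
    left=Leaf("star"),
    right=Leaf("galaxy"),
)

print(classify(toy_tree, [2.0, 1.0]))
print(classify(toy_tree, [1.0, 2.0]))
```

An ordinary axis-parallel decision tree is the special case in which every node has exactly one non-zero weight.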
The ODT is generated from an input list of sources detected and classified using GALWORKS. This set is referred to as a "training set", consisting of real galaxies, stars, double stars and triple stars. Artifacts, including meteor streaks and horizontal stripes associated with bright stars, are not currently used in this classification scheme.
Once the ODT is built, it can be used to classify candidate extended sources. For testing purposes, however, we use the ODT built from the training set to classify previously classified sets (generally, these are also training sets) so that we can evaluate the accuracy of the classification. The ultimate goal is to produce a reliable list of extended sources with minimal contamination from non-extended sources.
We produce the training sets as follows. All extended source candidates
from GALWORKS are visually examined and classified using the JHK images plus
the corresponding optical DSS image. Classification consists of
NOTE: each band is processed separately
The integrated flux and various scoring parameters used to generate the decision trees are:
The decision tree consists of both a "pruned" and an "unpruned" version. To "prune" a decision tree is to eliminate "branches" deemed redundant or unnecessary to the decision tree. There is always a danger of "overfitting" the data, especially if the input training sets are small. On the other hand, as we shall see below, "unpruned" decision trees carry the full power of the hyper-plane decision tree machine. Results from both "pruned" and "unpruned" decision trees are presented below.
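As an illustration of what pruning does, the sketch below applies reduced-error pruning to a tiny hand-built tree: a subtree is replaced by a single leaf whenever that leaf makes no more errors on held-out samples than the subtree does. This is a generic technique chosen for the example; the text above does not say which pruning procedure was actually used, and the tree, the samples, and all names here are hypothetical. Taking the leaf label from the held-out samples is a simplification.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Sequence, Tuple, Union


@dataclass
class Leaf:
    label: str


@dataclass
class Node:
    weights: Sequence[float]
    bias: float
    left: "Tree"   # followed when the linear score is <= 0 (assumed convention)
    right: "Tree"  # followed when the linear score is > 0 (assumed convention)


Tree = Union[Leaf, Node]
Sample = Tuple[Sequence[float], str]  # (parameters, true class)


def goes_right(node: Node, x: Sequence[float]) -> bool:
    return sum(w * v for w, v in zip(node.weights, x)) + node.bias > 0


def predict(tree: Tree, x: Sequence[float]) -> str:
    while isinstance(tree, Node):
        tree = tree.right if goes_right(tree, x) else tree.left
    return tree.label


def errors(tree: Tree, samples: List[Sample]) -> int:
    return sum(1 for x, y in samples if predict(tree, x) != y)


def prune(tree: Tree, samples: List[Sample]) -> Tree:
    """Bottom-up reduced-error pruning against the held-out samples."""
    if isinstance(tree, Leaf) or not samples:
        return tree
    right_samples = [s for s in samples if goes_right(tree, s[0])]
    left_samples = [s for s in samples if not goes_right(tree, s[0])]
    candidate = Node(tree.weights, tree.bias,
                     prune(tree.left, left_samples),
                     prune(tree.right, right_samples))
    majority = Counter(y for _, y in samples).most_common(1)[0][0]
    leaf = Leaf(majority)
    # Keep the simpler leaf unless the subtree is strictly better.
    return leaf if errors(leaf, samples) <= errors(candidate, samples) else candidate


# Toy one-parameter tree whose inner node misclassifies held-out stars.
unpruned = Node([1.0], -5.0,
                left=Node([1.0], -2.0, left=Leaf("star"), right=Leaf("galaxy")),
                right=Leaf("galaxy"))
held_out = [([1.0], "star"), ([3.0], "star"), ([4.0], "star"),
            ([6.0], "galaxy"), ([7.0], "galaxy")]

pruned = prune(unpruned, held_out)
print(errors(unpruned, held_out), errors(pruned, held_out))
```

In this toy case the pruned tree does better on the held-out samples, which is the overfitting scenario. The results reported in this document go the other way when the trees are evaluated on their own training sets, which is expected, since an unpruned tree fits its training data most closely.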
Four training sets were generated from a small subset of the 2MASS scans observed to date. The fields of study consisted of
The training sets generated from the latter three fields, COMA, Hercules and the mid-galactic latitude field (Hercules supercluster), can be applied to any field of relatively low source density (i.e., glat > 20 deg or so). Accordingly, the latter three training sets are combined to form one large training set of about 2100 objects, most of which are galaxies (of all shapes, sizes, fluxes, morphologies, etc.). The first training set, corresponding to the low galactic latitude field (glat = 8 to 10 deg), should only be applied to fields with relatively high source density.
The following tables show the results of applying the ODT generated from the super training set (COMA+HERCULES+SUPERCLUSTER) to the individual training sets of COMA, Hercules and the Hercules supercluster. Caution: the results from this exercise should not be overinterpreted, since the ODT is generated from the same training sets on which we attempt a blind classification (that is, the ODT is incestuously related to the fields that we hope to classify). As more training sets are added with future study of different (and diverse) fields, this bias problem should diminish.
**Note: see latest results below for "Mid47" field **
Table Structure:
For completeness, we want "ngot" to be equal to "ngal", or as close as possible. For reliability, we want "nfa" to be zero or as close as possible to zero.
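The two goals above can be expressed as simple ratios. The sketch below assumes that "ngal" is the number of true galaxies in the set, "ngot" is the number of those galaxies recovered by the ODT, and "nfa" is the number of non-galaxies classified as galaxies (false alarms). These meanings are inferred from the sentence above, and the reliability formula is one common convention rather than something stated in this document. The numbers in the usage lines are invented.

```python
def completeness(ngot: int, ngal: int) -> float:
    """Fraction of the true galaxies that the classifier recovered."""
    if ngal <= 0:
        raise ValueError("ngal must be positive")
    return ngot / ngal


def reliability(ngot: int, nfa: int) -> float:
    """Fraction of the objects called 'galaxy' that really are galaxies.

    Assumed definition: ngot / (ngot + nfa).
    """
    called_galaxy = ngot + nfa
    if called_galaxy <= 0:
        raise ValueError("no objects were classified as galaxies")
    return ngot / called_galaxy


# Invented counts, for illustration only (not 2MASS results).
ngal, ngot, nfa = 200, 190, 10
print(f"completeness = {completeness(ngot, ngal):.3f}")
print(f"reliability  = {reliability(ngot, nfa):.3f}")
```

With these definitions, "ngot" equal to "ngal" gives a completeness of 1, and "nfa" equal to zero gives a reliability of 1.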
COMA
Notice that the "unpruned" ODT classifies the training set much more accurately than the "pruned" version. The completeness and reliability (C&R) are excellent for the "unpruned" ODT.
Hercules
As with Coma, the Hercules results show that the "unpruned" ODT classifies the training set much more accurately than the "pruned" version. The C&R is excellent for the "unpruned" ODT.
Hercules Supercluster
As with Coma and Hercules, the mid-galactic latitude field results show that the "unpruned" ODT classifies the training set much more accurately than the "pruned" version. The C&R is excellent for the "unpruned" ODT.
What is Inside the Decision Tree Blackbox?
For the brave hearted, take a look at the decision tree equations here:
J band ODT for the COMA+Herc+Supercluster training set: unpruned
J band ODT for the COMA+Herc+Supercluster training set: pruned
Eight contiguous scans toward the general direction of [glong=116 and glat=-47] were obtained on the night of 970731. See GALWORKS Performance on a High Glat Field. We can use this field to independently test the Coma+Herc+Mid30 training set.
The Mid47 source list was evaluated with the ODT built from the Coma+Herc+Mid30 training set, with classification performed separately on each band. This information was then "merged" to form a 3-channel classification. In general, we use a "two-band" rule to classify; that is, the object must be similarly classified in at least two bands. The results for this exercise are given below. We also show some results using "one-band" and "three-band" rules, given at the end.
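The merging step can be sketched as a simple vote over the per-band classifications. This is a minimal illustration under stated assumptions, not the actual merging code: the function name, the dictionary layout, the use of None for a band with no classification, and the "unclassified" fallback when no two bands agree are all hypothetical, since the text does not say how such cases are handled.

```python
from collections import Counter
from typing import Mapping, Optional


def two_band_class(band_classes: Mapping[str, Optional[str]],
                   fallback: str = "unclassified") -> str:
    """Return the class shared by at least two bands, else the fallback.

    band_classes maps a band name ("J", "H", "K") to the class assigned
    in that band, or to None if the band has no classification.
    """
    votes = Counter(label for label in band_classes.values() if label is not None)
    for label, count in votes.most_common():
        if count >= 2:
            return label
    return fallback


print(two_band_class({"J": "galaxy", "H": "galaxy", "K": "star"}))
print(two_band_class({"J": "galaxy", "H": "star", "K": "double star"}))
```

With three bands, at most one class can collect two or more votes, so the result of the vote is unambiguous.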
Pruned ODT
Unpruned ODT
** Two-Band Rule applied here **
The "unpruned" ODT performs better than the pruned version,
at least with regard to false sources. There is still an
unacceptably high rate of contamination of the galaxy
lists, from 1 to 5% false sources. The internal completeness
is very good: only 1 or 3% of the galaxies are not recovered
with ODT classification.
** Two-Band Rule applied here **
It is instructive to observe what happens to the classification when using "one-band" and "three-band" rules. The former should help galaxy completeness (i.e., only one band need be classified as a galaxy to make the cut), but at the cost of reliability. The latter "three-band" rule works inversely: it sacrifices completeness to maximize reliability. As seen below, the loss of galaxies is perhaps unacceptably high (~10%), with only a slight gain in reliability (1 or 2%, although every % point is a hard-fought battle).
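The trade-off between the three rules can be sketched by counting, for each rule, how many true galaxies pass the cut and how many non-galaxies slip through. The six sources below are invented purely to show the mechanics and are not 2MASS results; the function names and the (ngal, ngot, nfa) tuple layout are assumptions made for this example.

```python
from typing import Dict, List, Tuple

Source = Tuple[str, Dict[str, str]]  # (true class, per-band ODT class)


def passes_galaxy_cut(band_classes: Dict[str, str], min_bands: int) -> bool:
    """True if at least min_bands bands classify the object as a galaxy."""
    n_galaxy = sum(1 for label in band_classes.values() if label == "galaxy")
    return n_galaxy >= min_bands


def tally(sources: List[Source], min_bands: int) -> Tuple[int, int, int]:
    """Return (ngal, ngot, nfa) for an N-band rule."""
    ngal = sum(1 for truth, _ in sources if truth == "galaxy")
    ngot = sum(1 for truth, bands in sources
               if truth == "galaxy" and passes_galaxy_cut(bands, min_bands))
    nfa = sum(1 for truth, bands in sources
              if truth != "galaxy" and passes_galaxy_cut(bands, min_bands))
    return ngal, ngot, nfa


# Invented toy sources (not real data).
toy_sources: List[Source] = [
    ("galaxy", {"J": "galaxy", "H": "galaxy", "K": "galaxy"}),
    ("galaxy", {"J": "galaxy", "H": "galaxy", "K": "star"}),
    ("galaxy", {"J": "galaxy", "H": "star", "K": "star"}),
    ("star", {"J": "galaxy", "H": "star", "K": "star"}),
    ("star", {"J": "star", "H": "star", "K": "star"}),
    ("double star", {"J": "galaxy", "H": "galaxy", "K": "star"}),
]

for min_bands in (1, 2, 3):
    print(min_bands, tally(toy_sources, min_bands))
```

On this toy list, the one-band rule recovers all three galaxies but admits two false alarms, while the three-band rule admits none but recovers only one galaxy.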
Unpruned ODT w/ One-band and Three-band Rules
** One-Band Rule applied here **
** Three-Band Rule applied here **