Application of Oblique Decision Trees
to the GALWORKS Extended Source Candidates

T. Jarrett, IPAC
(970818)

** updated ** (970903)

We generate "oblique" decision trees (ODT) using an algorithm developed by Murthy, Kasif & Salzberg (1994). See, for example, "Oblique decision trees" and "The OC1 decision tree software system".
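As a rough illustration of what makes a tree "oblique", the sketch below contrasts the test applied at a single node of an oblique tree (a linear combination of several attributes, i.e. a hyperplane) with the single-attribute threshold test of an ordinary axis-parallel tree. The function names and the example coefficients are hypothetical and are not taken from the OC1 software or from the GALWORKS trees.

```python
from typing import Sequence


def oblique_test(coeffs: Sequence[float], offset: float, x: Sequence[float]) -> bool:
    """Hyperplane test: True when sum(a_i * x_i) + offset > 0."""
    return sum(a * xi for a, xi in zip(coeffs, x)) + offset > 0.0


def axis_parallel_test(index: int, threshold: float, x: Sequence[float]) -> bool:
    """Ordinary decision-tree test on a single attribute: x[index] > threshold."""
    return x[index] > threshold


# Illustrative two-attribute source (values are made up).
source = [2.0, 1.0]

# Oblique split along the line x0 - x1 = 0.
above_diagonal = oblique_test([1.0, -1.0], 0.0, source)

# An axis-parallel split is the special case with one non-zero coefficient:
# x0 > 1.5 is the same as 1.0*x0 + 0.0*x1 - 1.5 > 0.
same_as_axis = oblique_test([1.0, 0.0], -1.5, source) == axis_parallel_test(0, 1.5, source)
```

Because each node can cut across several parameters at once, an oblique tree can separate classes whose boundary is not aligned with any single parameter axis.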

The ODT is generated from an input list of sources detected and classified using GALWORKS. This set is referred to as a "training set", consisting of real galaxies, stars, double stars and triple stars. Artifacts, including meteor streaks and horizontal stripes associated with bright stars, are not currently used in this classification scheme.

Once the ODT is built, it can then be used to classify candidate extended sources. For testing purposes, however, we apply the ODT built from the training set to previously classified sets (generally, these are also training sets) so that we may evaluate the accuracy of the classification. The ultimate goal is to produce a reliable list of extended sources with minimal contamination from non-extended sources.


We produce the training sets as follows. All extended source candidates from GALWORKS are visually examined and classified using the JHK images plus the corresponding optical DSS image. Classification consists of

NOTE: each band is processed separately

The integrated flux and various scoring parameters used to generate the decision trees are:

Each decision tree is produced in both a "pruned" and an "unpruned" version. To "prune" a decision tree is to eliminate "branches" deemed redundant or unnecessary to the decision tree. Pruning guards against "overfitting" the data, which is always a danger, especially if the input training sets are small. On the other hand, as we shall see below, "unpruned" decision trees carry the full power of the hyper-plane decision tree machine. Results from both "pruned" and "unpruned" decision trees are presented below.
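The idea of pruning can be sketched with a simple bottom-up scheme, often called reduced-error pruning: a branch is collapsed into a single leaf whenever doing so does not increase the number of misclassifications on a set of held-out sources. This is a minimal sketch under stated assumptions; it is not necessarily the pruning procedure used by the OC1 software, and the `Node` layout, the function names and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Sample = Tuple[Sequence[float], str]  # (attribute vector, true class)


@dataclass
class Node:
    label: str                            # majority class at this node
    coeffs: Optional[List[float]] = None  # hyperplane coefficients; None for a leaf
    offset: float = 0.0
    left: Optional["Node"] = None         # followed when the test is False
    right: Optional["Node"] = None        # followed when the test is True


def goes_right(node: Node, x: Sequence[float]) -> bool:
    """Oblique test at an internal node."""
    return sum(a * xi for a, xi in zip(node.coeffs, x)) + node.offset > 0.0


def classify(node: Node, x: Sequence[float]) -> str:
    """Walk from the given node down to a leaf and return its class."""
    while node.coeffs is not None:
        node = node.right if goes_right(node, x) else node.left
    return node.label


def errors(node: Node, samples: List[Sample]) -> int:
    """Count the samples that the subtree rooted at node misclassifies."""
    return sum(1 for x, y in samples if classify(node, x) != y)


def prune(node: Node, samples: List[Sample]) -> Node:
    """Prune children first, then collapse this node if a leaf does no worse."""
    if node.coeffs is None:
        return node
    right_samples = [s for s in samples if goes_right(node, s[0])]
    left_samples = [s for s in samples if not goes_right(node, s[0])]
    node.left = prune(node.left, left_samples)
    node.right = prune(node.right, right_samples)
    leaf = Node(label=node.label)
    return leaf if errors(leaf, samples) <= errors(node, samples) else node


# Toy tree: the right-hand branch splits again, but both of its leaves say "galaxy".
tree = Node(
    label="galaxy", coeffs=[1.0, 0.0],
    left=Node(label="star"),
    right=Node(label="galaxy", coeffs=[0.0, 1.0],
               left=Node(label="galaxy"), right=Node(label="galaxy")),
)
held_out = [([-1.0, 0.0], "star"), ([1.0, 1.0], "galaxy"), ([1.0, -1.0], "galaxy")]
pruned = prune(tree, held_out)
```

In this toy case the redundant right-hand branch collapses to a single "galaxy" leaf, while the root split survives because removing it would misclassify the star.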


Four training sets were generated from a small subset of the 2MASS scans observed to date. The fields of study consisted of

The training sets generated from the latter three fields (COMA, Hercules and the mid-galactic latitude field in the Hercules supercluster) can be applied to any relatively low source density field (i.e., glat > 20 deg or so). Accordingly, the latter three training sets are combined to form one large training set, consisting of about 2100 objects, most of which are galaxies (of all shapes, sizes, fluxes, morphologies, etc.). The first training set, corresponding to the low galactic latitude field (glat = 8 to 10 deg), should only be applied to fields with relatively high source density.

The following tables show the results of applying the ODT generated from the super training set (COMA+HERCULES+SUPERCLUSTER) to the individual training sets of COMA, Hercules and the Herc supercluster. Caution: the results from this exercise should not be overinterpreted, since the ODT is generated from the same training sets on which we attempt a blind classification (that is, the ODT is incestuously related to the fields that we hope to classify). As more training sets are added with future study of different (and diverse) fields, this bias problem will be reduced.

**Note: see latest results below for "Mid47" field **


Table Structure:

For completeness, we want "ngot" to be equal to "ngal", or as close as possible. For reliability, we want "nfa" to be zero or as close as possible to zero.
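The two goals above can be turned into numbers with a short calculation. The sketch below assumes that "ngal" is the number of true galaxies in the set, "ngot" is the number of those galaxies recovered by the ODT, and "nfa" is the number of non-galaxies wrongly classified as galaxies; these readings are inferred from the sentence above and should be checked against the table definitions. The function name and the input counts are illustrative only and are not results from this study.

```python
import math
from typing import Tuple


def completeness_reliability(ngal: int, ngot: int, nfa: int) -> Tuple[float, float]:
    """Return (completeness, reliability) as fractions between 0 and 1.

    completeness = ngot / ngal            (fraction of true galaxies recovered)
    reliability  = ngot / (ngot + nfa)    (fraction of the output list that is real)
    """
    if ngal <= 0:
        raise ValueError("ngal must be positive")
    completeness = ngot / ngal
    selected = ngot + nfa
    # Reliability is undefined when nothing was classified as a galaxy.
    reliability = ngot / selected if selected > 0 else math.nan
    return completeness, reliability


# Made-up counts, for illustration only.
comp, rel = completeness_reliability(ngal=100, ngot=95, nfa=5)
contamination = 1.0 - rel  # fraction of false sources in the galaxy list
```

With these definitions, "ngot" equal to "ngal" gives a completeness of 1, and "nfa" equal to zero gives a reliability of 1.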

COMA

Notice that the "unpruned" ODT classifies the training set much more accurately than the "pruned" version. The completeness and reliability (C&R) are excellent for the "unpruned" ODT version.

Hercules

As with Coma, the Herc results show that the "unpruned" ODT classifies the training set much more accurately than the "pruned" version. The C&R is excellent for the "unpruned" ODT version.

Hercules Supercluster

As with Coma and Hercules, the mid-galactic latitude field results show that the "unpruned" ODT classifies the training set much more accurately than the "pruned" version. The C&R is excellent for the "unpruned" ODT version.


What is Inside the Decision Tree Blackbox?

For the brave hearted, take a look at the decision tree equations here:

J band ODT for the COMA+Herc+Supercluster training set: unpruned

J band ODT for the COMA+Herc+Supercluster training set: pruned


High Galactic Latitude Field: Glat = -47 degrees

Eight contiguous scans toward the general direction of [glong=116 and glat=-47] were obtained on the night of 970731. See GALWORKS Performance on a High Glat Field. We can use this field to independently test the Coma+Herc+Mid30 training set.

The Mid47 source list was evaluated with the ODT built from the Coma+Herc+Mid30 training set, with classification performed separately on each band. This information was then "merged" to form a 3-channel classification. In general, we use a "two-band" rule to classify; that is, the object must be similarly classified in at least two bands. The results for this exercise are given below. We also show some results using "one-band" and "three-band" rules, given at the end.
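The band-merging step can be pictured as a simple vote over the per-band classifications. The sketch below is one possible reading of the "one-band", "two-band" and "three-band" rules as applied to the galaxy class; the function name, the class labels and the dictionary layout are hypothetical and do not describe the actual GALWORKS implementation.

```python
from typing import Mapping


def is_galaxy(band_classes: Mapping[str, str], min_bands: int = 2) -> bool:
    """Accept a source as a galaxy if at least min_bands bands classify it as one.

    min_bands = 1, 2 or 3 corresponds to the one-, two- and three-band rules.
    """
    votes = sum(1 for cls in band_classes.values() if cls == "galaxy")
    return votes >= min_bands


# Illustrative per-band classifications for two sources.
agreeing = {"J": "galaxy", "H": "galaxy", "K": "star"}
marginal = {"J": "galaxy", "H": "star", "K": "double star"}

two_band = [is_galaxy(agreeing, 2), is_galaxy(marginal, 2)]
one_band = [is_galaxy(agreeing, 1), is_galaxy(marginal, 1)]
three_band = [is_galaxy(agreeing, 3), is_galaxy(marginal, 3)]
```

Lowering `min_bands` admits more sources (favoring completeness), while raising it admits fewer (favoring reliability).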

Pruned ODT

Unpruned ODT

The "unpruned" ODT performs better than the pruned version, at least with regard to false sources. There is still an unacceptably high rate of contamination of the galaxy lists, from 1 to 5% false sources. The internal completeness is very good: only 1 or 3% of the galaxies are not recovered with ODT classification.

It is instructive to observe what happens with the classification when using "one-band" and "three-band" rules. The former should help galaxy completeness (i.e., only one band need be classified as a galaxy to make the cut), but at the cost of reliability. The latter "three-band" rule works inversely: it sacrifices completeness to maximize reliability. As seen below, the loss of galaxies is perhaps unacceptably high (~10%), with only a slight gain in reliability (1 or 2%, although every % point is a hard-fought battle).

Unpruned ODT w/ One-band and Three-band Rules