Algorithm Descriptions
 Archive Intro
 Table Descriptions
 Schema Browser
 Introduction to SQL
 Form Query User Guide
 Query Limits
 How To
 SkyServer Sites
 SkyServer Traffic Page
 Web Browsers
 Contact Help Desk

Match and MatchHead Tables

Jim Gray, Alex Szalay, Robert Lupton, Jeff Munn

May 2003, revised January, May, June, July, December 2004

The SDSS data can be used for temporal studies of objects that are re-observed at different times.  The SDSS survey observes about 10% of the Northern survey area 2 or more times, and observes the Southern stripe more than a dozen times. 

The match table is intended to make temporal queries easy by providing a precomputed list of all objects that were observed multiple times.  More formally,

Match = { (ObjID1,ObjID2) | Objid1 and ObjID2 are both from different runs (i.e. different observations).
  And they are within 1 arcsecond of one another
  And are both good (star or galaxy or unknown)
  And are both fully deblended (no children)
  And they are primary or secondary (not family or outside) }

But, as always, there are complications.

Green, Yellow, Red: What if ObjID2 in Run2 is missing?
It could be missing because it was not seen or because it was masked. What about the "edge cases" at the edge of run1 or of run2. Perhaps it is an edge object. We color-code these edge cases as "yellow" objects and the masked objects as "red" objects. And of course, we flag the missing but matched objects as "green’.

Surrogate: When an object is missing in Run2, what do we put in the match table?
We could fabricate ObjID2, or we could find the closest ObjID2 in Run2 and just flag it and record the distance to it (which will be more than 1 arcsecond).

Computing the Match table

The Match table is computed by using the Neighbors table and has a very similar schema (the Neighbors table only stores mode (1,2) (aka primary/secondary) and type (3,5,6) (aka galaxy, unknown, star) objects.

       objID1 bigint not null, objID2 bigint not null, -- object pair
       run1 smallint not null, run2 smallint not null, -- their run numbers
       type1 tinyint not null, type2 tinyint not null, -- star, galaxy,...
       mode1 tinyint not null, mode2 tinyint not null, -- primary, secondary,...
       distance float not null,			       -- in arcminutes
       miss char not null,			       -- " " no miss, RGY:red,green,yellow
       matchHead bigint not null,		       -- see below.
       primary key (objID1, ObjID2)
) ON [Neighbors]
-- now populate the table
SELECT objID as objID1, neighborObjID as objID2,
       (objID & 0x0000FFFF00000000)/power(cast(2 as bigint),32) as run1,
       (NeighborObjID & 0x0000FFFF00000000)/power(cast(2 as bigint),32) run2,
       type as type1, neighborType as type2, 
       mode as mode1, neighborMode as mode2,
       distance,  ' ' as miss, 0 as matchHead
FROM Neighbors
WHERE distance < 1.0/60.0	-- within 1 arcsecond of one another

One arcsecond is a large error in Sloan Positioning - the vast majority is within a ˝ arcsecond (95%). But a particular cluster may not form a complete graph (all members connected to all others). To make the graph fully transitive, we repeatedly execute the query to add the "curved" arcs in Figure 1. Notice that that figure shows two objects observed in four runs, and that the two objects are observed only once in the middle two runs. The whole collection is closed to make a "bundle" that will have a matchHead object (the smallest objID of the bundle).
Figure 1: Six objects from 4 runs forming a bundle.

declare @Trip table (
	objid1 bigint,ObjID2 bigint,
	run1 smallint, run2 smallint,
	type1 tinyint, type2 tinyint, 
	mode1 tinyInt, mode2 tinyInt,
	primary key (objID1,ObjID2)

WHILE (1=1) BEGIN delete @Trip -- empty the triples table insert @Trip -- compute new pairs based on triples select distinct a.objID1, b.ObjID2, a.run1, b.run2, a.type1, b.type2, a.mode1, b.mode2 from Match a join Match b on a.ObjID2 = b.objID1 where a.objID1 != b.ObjID2 -- different objects and not exists ( select objID1 -- and pair not already in Match from match c where c.objID1 = a.objID1 and c.objID2 = b.objID2)
IF @@rowcount = 0 BREAK -- stop if none found. select "adding " + cast(count(*) as varchar(20)) + " triples." from @Trip -- now add these into Match and repeat till no more rows. insert Match -- else add pair to Match. select T.*,N.distance, 0,0 from @Trip T join Neighbors N on t.objID1 = N.objID and t.ObjID2 = N.NeighborObjID END

Computing the MatchHead table

Now each cluster of objects in the Match table is fully connected. We can name the clusters in the Match table by the minimum (non zero) objID in the cluster and can compute the MatchHead table that describes the global properties of the cluster: its name, its average RA and DEC and the variance in RA, DEC.

-- build a table of cluster IDs (minimum object ID of each cluster).

CREATE TABLE MatchHead ( objID bigint not null primary key, -- id of match head object averageRa float not null default 0, -- average right ascension averageDec float not null default 0, -- average declination varRa float not null default 0, -- variance in RA varDec float not null default 0, -- variance in DEC matchCount tinyInt not null default 0, -- real number in cluster missCounttinyInt not null default 0 -- fake objects in cluster ) ON [Neighbors]

Figure 2: Green area of A must match B, yellow area may match B. Red areas of B (masks) have a good reason for the A object to have a missing object. In these missing cases, we pick the "nearest" object in B to be a surrogate for A in the match table.

-- compute the minimum object IDs.

declare @MinID table ( objID bigint primary key )
insert @MinID select distinct objID1 from Match MinId where 0 = ( select count(*) from Match m where MinId.objID1 = m.objID1 and MinId.objID1 > m.ObjID2 )
-- compute all pairs of objIDs in a -- cluster (including x,x for headID)
declare @pairs table ( objID1 bigint not null, objID2 bigint not null, primary key(objID1,objID2) )
insert @pairs select minID.objID, m.ObjID2 from @MinID minID join Match m on minID.objID = m.objID1
insert @pairs select objID, objID from @MinID
-- now populate the MatchHead table with minObjID and statistics
insert MatchHead select MinID.objID, -- for each MatchHead ID avg(ra), avg(dec), -- get avg ra,dec of group coalesce(stdev(ra),0), coalesce(stdev(dec),0), -- get stddev of group count(distinct -- count runs (p.objid2 & 0x0000FFFF00000000)), 0 -- count misses later From @MinID as MinID join @pairs as p on MinID.objID = p.objID1 join PhotoObjAll as o on p.objID2 = o.objID group by MinID.objID
-- now set MatchHead on each object
UPDATE Match SET Matchhead = P.objID1 FROM Match M join @Pairs P on M.objID1 = P.ObjID2
Figure 3: A green surrogate for the right 4 objects and a red (masked) surrogate for the leftmost one.
The number missing from the cluster is computed in the next section.

Matching the Missing Objects

There may be an object in camcol A that should have matching objects in an overlapping camcol B (see figure 2). In particular, any object in the green part of A should have a matching object in B (in Figure 2). Objects in A that are near the edge of B (10 pixels ~4 arc seconds = the yellow part of B) may have matching objects in B.In some cases the B area is masked (red) and that explains why there is not a match.

If a "green" A object does not match a B object then either the object is moving or variable or masked. We can check the masks to see if the (A.ra, A.dec) is masked in B. If not, we assume that A is just "missing."

Similarly, if a "yellow" A object does not match a B object, then either the object is moving or variable or masked or the edge effects caused the object to be missing.In these edge cases we check to see if (A.ra, A.dec) is masked in B, if not we call the object missing-edge.

So, missing objects come in 3 varieties:
Hit   0
Missing Green1
Missing EdgeYellow2
Masked Red 3

In each of these cases we create a match object as the closest object in B to A and Match.flag is set to Green, Red or Yellow.These "fake" objects do not contribute to the cluster average or variance or centroid.

We add this object to A’s cluster (along with all the edges), and we increment the cluster miss count by the number of records we add to the cluster.
Figure 4: Graph showing distance vs frequency of missies of various colors. This is data from a small sample of the BestDR1 database.

The logic for computing missing objects is as follows.

   For each RunA in the Regions table.
      For each RunB overlapping RunA other than RunA
         Let R_ABYellow be the Region RunAÇ((runB+ ε))
         Find (x,y) where x and y in R_ABYellow and run.x = RunA and run.y = RunB
            And x not in match and y the closest object in RunB.
         If the position of x is in runB - ε we mark the pair as green.
         If the position of x is masked in RunB then we mark the pair as red.
         We now add the (x,y) pairs to the match table and
            If x is not in match, we add x to matchhead with a miss count of 1
            If x is in match, we propagate x.matchHead to this match entry
               and increment the matchHead miss count

The actual code is a little more complex (about 700 lines of SQL). In the personal SkyServerDR1 there are about 20,000 matches and 10,000 object misses, so it seems that the misses will make an interesting study.
Figure 5: Statistics from full DR1 dataset showing distribution of miss distances. Red objects are not being computed. On BestDR1, the statistics are:
Match 12,431,518
Green 5,446,930
Yellow 912,571
Red ??

The results of this are that a bundle can have dangling pointers to these surrogate objects.Figure 3 shows the diagram of Figure 1 where a fifth overlapping run has been added. The leftmost object is masked in this new run and so we find a surrogate "red" object for it. The other objects are also have no match in this run but are not masked and are closest to the green (right) object in the figure.

It takes 4 minutes to compute on the personal SkyServer DR1, It will take a bit longer on the thousand times larger Dr2, but ...

As per Robert’s request, surrogate match objects are found rather than invented. Sometimes we have to look far away for them (500 arcseconds in some cases).

Misses are painted Yellow (near the edge), Red (masked), and Green (well inside the overlap). Most misses are Green.
color count
Y 3,516
R 58
G 15,264

The graphs of distances are shown in Figure 4.