Classifying serendipitous X-ray sources with Machine Learning

16 October 2015

Modern-day observational astronomy has been moving steadily towards bigger telescopes and deeper surveys. A number of facilities have recently been (or will soon be) commissioned to survey the sky in unprecedented detail: at radio wavelengths, the upcoming Square Kilometre Array (SKA) telescope and its operational Australian precursors, the Murchison Widefield Array (MWA) and the Australian Square Kilometre Array Pathfinder (ASKAP); in the visible bands, the Large Synoptic Survey Telescope (LSST) and SkyMapper; at higher energies, the soon-to-be-launched Spectrum Roentgen Gamma (SRG) space telescope. These facilities represent a dramatic increase in the volume of data collected, which will be immensely challenging to process and utilise in real time.

Novel methods to quickly and accurately identify astrophysical sources, and to flag particularly rare objects, are needed to meet this challenge. In the 2014 publication by former CAASTRO PhD student Kitty Lo and colleagues (see press release), the Random Forest algorithm, a supervised ensemble machine learning method, was applied to classify the variable X-ray sources in the second XMM-Newton Serendipitous Source catalogue (2XMM). Building on this work, CAASTRO Affiliate Dr Sean Farrell (University of Sydney) led the team that applied the same method to the 3XMM catalogue, the largest X-ray source catalogue ever produced (a 40% increase over 2XMM, with 372,728 unique sources, of which 3,696 are flagged as variable). The variable X-ray sources were classified into six distinct categories of object: Active Galactic Nuclei (AGN), Cataclysmic Variables (CVs), Gamma-Ray Bursts (GRBs), stars, Ultraluminous X-ray Sources (ULXs) and X-ray Binaries (XRBs), with a classification accuracy of ~92%. The Random Forest algorithm was also applied for the first time to data quality control, where it identified spurious detections with an accuracy of ~95%. Quality control is one of the tasks in astronomical surveys that relies most heavily on human inspection, which makes this result particularly significant.
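
For readers unfamiliar with the technique, the following is a minimal sketch of the idea behind a Random Forest: many decision trees are each trained on a bootstrap resample of the data, each split considers only a random subset of features, and the final class is decided by majority vote. It is a toy illustration written with the Python standard library only, not the team's pipeline. The six class names come from the text above; the four numeric features, the synthetic Gaussian data, and all function names and parameter values are assumptions made purely for illustration.

```python
import random
from collections import Counter

CLASSES = ["AGN", "CV", "GRB", "star", "ULX", "XRB"]

def make_synthetic(n_per_class, seed):
    """Synthetic stand-in data (NOT 3XMM): one Gaussian blob per class."""
    rng = random.Random(seed)
    rows, labels = [], []
    for k, name in enumerate(CLASSES):
        # Hypothetical class centres in a 4-feature space.
        centre = [((k * (j + 1)) % 6) * 3.0 for j in range(4)]
        for _ in range(n_per_class):
            rows.append([rng.gauss(c, 0.5) for c in centre])
            labels.append(name)
    return rows, labels

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, rng, n_try, depth=0, max_depth=6):
    """Grow one tree; a leaf is a class label, a node is (feature, threshold, left, right)."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth >= max_depth or len(set(labels)) == 1:
        return majority
    best = None
    # Only a random subset of features is considered at each split.
    for f in rng.sample(range(len(rows[0])), n_try):
        # Dropping the largest value keeps the right-hand side non-empty.
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:  # every sampled feature was constant in this node
        return majority
    _, f, t = best
    li = [i for i, r in enumerate(rows) if r[f] <= t]
    ri = [i for i, r in enumerate(rows) if r[f] > t]
    return (f, t,
            build_tree([rows[i] for i in li], [labels[i] for i in li],
                       rng, n_try, depth + 1, max_depth),
            build_tree([rows[i] for i in ri], [labels[i] for i in ri],
                       rng, n_try, depth + 1, max_depth))

def predict_tree(node, row):
    """Walk from the root to a leaf and return the leaf's class label."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train each tree on a bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    n_try = max(1, int(len(rows[0]) ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx], rng, n_try))
    return forest

def predict_forest(forest, row):
    """Majority vote over all trees in the forest."""
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]

train_rows, train_labels = make_synthetic(30, seed=1)
test_rows, test_labels = make_synthetic(10, seed=2)
forest = fit_forest(train_rows, train_labels)
predictions = [predict_forest(forest, r) for r in test_rows]
accuracy = sum(p == y for p, y in zip(predictions, test_labels)) / len(test_labels)
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

The accuracy printed here reflects only how well separated the synthetic blobs are and says nothing about the ~92% figure reported for 3XMM. In practice one would more likely reach for an established library implementation than hand-rolled trees; the sketch is meant only to make the bootstrap-and-vote mechanism concrete.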

Beyond classifying the entire variable source component of 3XMM, the team discovered a number of exotic outlier sources that may represent entirely new classes of objects. Three particularly interesting objects were identified: a new candidate supergiant fast X-ray transient (SFXT), a 400-second period X-ray pulsar, and an eclipsing binary system with a 5-hour orbital period coincident with a known Cepheid variable star. All these objects are very rare and could provide unique insight into the most extreme physical processes known, and their discovery highlights the effectiveness of the Random Forest technique. In the era of large surveys, machine learning appears to be rapidly becoming an invaluable tool for the modern-day astronomer.


Publication details:

S. Farrell, T. Murphy and K. Lo (The Astrophysical Journal 2015): "Autoclassification of the Variable 3XMM Sources Using the Random Forest Machine Learning Algorithm"