Dataset for machine learning MK stellar classification

I would like to create a program for automatic Morgan-Keenan stellar classification using machine learning. For that, I need a dataset of stars with known absolute magnitude, temperature and luminosity class (0, Ia, Ib, …, VII). I found some datasets, but they contain only a few hundred stars each and not all star types are represented.

Is there a sufficiently large dataset (at least 1000 stars) in which all types of stars are represented, from hypergiants to dwarfs, with all the information mentioned above?

As far as I know, there is the XHIP catalog, available via VizieR. You can enter a range in UMag (or in the B and V bands), for example -20 … 20. Check the boxes SpType and Tc to also get the spectral type and temperature (double-check the literature for the quantities you need).

In the left-hand column titled Preferences, set max to unlimited (this is the number of rows returned). The box just below sets the download format; if you want *.csv, select CDS Portal and click Submit.

You will be redirected to the CDS portal. Click Save and then MyData; this shows a list of the datasets you saved, where you can select the file format (csv, fits, etc.) and download.

With a simple range of -20 to 20 in absolute V magnitude you get more than 100k stars.

Edit: I noticed that downloading the data through the CDS portal does not give you the columns you checked, but a fixed set instead. I would suggest selecting ascii text/plain and then saving the resulting page.
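Once a table with an SpType column has been downloaded, the spectral type strings still have to be split into a temperature class and a luminosity class before they can serve as training labels. The following is a minimal sketch of that step, assuming plain MK strings such as "G2V" or "K0III". The function name parse_sptype is hypothetical, and peculiarity suffixes, composite types and white-dwarf designations are deliberately not handled.

```python
import re

# Luminosity classes, ordered so that longer tokens are tried before their
# prefixes (e.g. "Iab" before "Ia", "III" before "II", "VII" before "VI").
_LUM_CLASSES = ["Iab", "Ia", "Ib", "III", "II", "IV", "VII", "VI", "V", "0"]

_PATTERN = re.compile(
    r"^\s*(?P<temp>[OBAFGKM])"        # temperature class letter
    r"(?P<sub>\d(?:\.\d)?)?"          # optional subclass, e.g. 2 or 1.5
    r"\s*(?P<lum>" + "|".join(_LUM_CLASSES) + r")?"  # optional luminosity class
)


def parse_sptype(sptype):
    """Split an MK spectral type string into (letter, subclass, luminosity).

    Returns None when the string does not start with one of O, B, A, F, G, K, M.
    Missing subclass or luminosity class is returned as None.
    """
    match = _PATTERN.match(sptype)
    if match is None:
        return None
    sub = match.group("sub")
    subclass = float(sub) if sub is not None else None
    return match.group("temp"), subclass, match.group("lum")


for example in ["G2V", "K0III", "M1.5Iab", "A0", "DA2"]:
    print(example, "->", parse_sptype(example))
```

Rows for which the luminosity class comes back as None can then be dropped or handled separately, depending on how many labelled stars remain per class.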

Applied Statistical Methods in Astronomy: Gaussian Processes and Machine Learning

Modern telescopes provide challenging data, not only in quantity but also quality, demanding new methods and techniques for scientific inference. New algorithms specific to astronomical problems are being developed and brought to the community by a new generation of scientists. This special session will focus on statistical methods that were published within the last year. This session originates from the AAS Working Group on Astroinformatics and Astrostatistics and the CHASC International Center for Astrostatistics. The goal of this Special Session is to review advances in the newly popular methods of Gaussian processes and machine learning, to present applications to data, and to discuss current issues and future perspectives. These methods have applications across the entire spectrum of astronomical research and are being rapidly developed. They have to be presented at large forums to make the community aware of the rapid progress being made in this field. The specific invited talks include discussions of the application of Gaussian processes to time-series spectra in exoplanet research, machine learning techniques to quantify variability states of a micro-quasar, machine learning applications in cosmology, and source detection. We also intend to have an associated poster session allowing contributions from the entire community. The session schedule will allow for discussion and input from the audience.


Chair: Aneta Siemiginowska (CfA/ CHASC)

2:02 pm: Dan Foreman-Mackey (Center for Computational Astrophysics, Flatiron Institute)

An astronomer's introduction to Gaussian Processes

Abstract: Gaussian Processes (GPs) are a class of stochastic models that have become widely used in astronomy. A general introduction to GP modeling can be mystifying so, in this talk, I will introduce GP modeling with a focus on applications from the astronomy literature. I will summarize the basic theory and motivate the broad applicability of these methods. Finally, I will discuss some of the computational limitations of simple implementations of GP models and describe some recent developments that make these models more broadly tractable.
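To make the basic theory mentioned in the abstract concrete, here is a minimal sketch of GP regression with NumPy. It is not taken from the talk: the squared-exponential kernel, the hyperparameter values and the function names are all assumptions chosen for illustration. The Cholesky factorization is the step that scales as the cube of the number of data points, which is the computational limitation of simple implementations that the abstract alludes to.

```python
import numpy as np


def sq_exp_kernel(x1, x2, amp=1.0, scale=1.0):
    """Squared-exponential covariance between two sets of 1-D inputs."""
    diff = x1[:, None] - x2[None, :]
    return amp**2 * np.exp(-0.5 * (diff / scale) ** 2)


def gp_predict(x_train, y_train, x_test, noise=0.1, amp=1.0, scale=1.0):
    """Posterior mean and covariance of a zero-mean GP at x_test."""
    K = sq_exp_kernel(x_train, x_train, amp, scale)
    K += noise**2 * np.eye(len(x_train))          # observational noise term
    K_star = sq_exp_kernel(x_test, x_train, amp, scale)
    K_ss = sq_exp_kernel(x_test, x_test, amp, scale)

    L = np.linalg.cholesky(K)                     # O(n^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_star @ alpha

    v = np.linalg.solve(L, K_star.T)
    cov = K_ss - v.T @ v
    return mean, cov


# Toy data: a noiseless sine curve sampled at 20 points.
x = np.linspace(0.0, 5.0, 20)
y = np.sin(x)
mean, cov = gp_predict(x, y, x, noise=1e-2)
print(np.max(np.abs(mean - y)))   # residual at the training points
```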

Notes: Dan Foreman-Mackey is an Associate Research Scientist at the Flatiron Institute's Center for Computational Astrophysics.

2:22 pm: Ian Czekala (Stanford)

Using Gaussian Processes to Construct Flexible Models of Stellar Spectra

Abstract: The use of spectra is fundamental to astrophysical fields ranging from exoplanets to stars to galaxies. In spite of this ubiquity, or perhaps because of it, there are a plethora of use cases that do not yet have physics-based forward models that can fit high signal-to-noise data to within the observational noise. These inadequacies result in subtle but systematic residuals not captured by any model, which complicates and biases parameter inference. Fortunately, the now-prevalent collection and archiving of large spectral datasets also provides an opening for empirical, data-driven approaches. We introduce one example of a time-series dataset of high-resolution stellar spectra, as is commonly delivered by planet-search radial velocity instruments like TRES, HIRES, and HARPS. Measurements of radial velocity variations of stars and their companions are essential for stellar and exoplanetary study; these measurements provide access to the fundamental physical properties that dictate all phases of stellar evolution and facilitate the quantitative study of planetary systems. In observations of a (spatially unresolved) spectroscopic binary star, one only ever records the composite sum of the spectra from the primary and secondary stars, complicating photospheric analysis of each individual star. Our technique "disentangles" the composite spectra by treating each underlying stellar spectrum as a Gaussian process, whose posterior predictive distribution is inferred simultaneously with the orbital parameters. To demonstrate the potential of this technique, we deploy it on red-optical time-series spectra of the mid-M-dwarf eclipsing binary LP661-13, which was recently discovered by the MEarth project. We successfully reconstruct the primary and secondary stellar spectra and report orbital parameters with improved precision compared to traditional radial velocity analysis techniques.

Notes: Ian Czekala is a KIPAC Postdoctoral Fellow at Stanford University.

2:38 pm: Daniela Huppenkothen (University of Washington)

Classifying Black Hole States with Machine Learning

Abstract: Galactic black hole binaries are known to go through different states with apparent signatures in both X-ray light curves and spectra, leading to important implications for accretion physics as well as our knowledge of General Relativity. Existing frameworks of classification are usually based on human interpretation of low-dimensional representations of the data, and generally only apply to fairly small data sets. Machine learning, in contrast, allows for rapid classification of large, high-dimensional data sets. In this talk, I will report on advances made in classification of states observed in Black Hole X-ray Binaries, focusing on the two sources GRS 1915+105 and Cygnus X-1, and show both the successes and limitations of using machine learning to derive physical constraints on these systems.

Notes: Daniela Huppenkothen is an Associate Director for the DIRAC Institute at the University of Washington, Seattle.

2:54 pm: Michelle Ntampaka (Harvard University)

Dynamical Mass Measurements of Contaminated Galaxy Clusters Using Support Distribution Machines

Abstract: We study dynamical mass measurements of galaxy clusters contaminated by interlopers and show that a modern machine learning (ML) algorithm can predict masses by better than a factor of two compared to a standard scaling relation approach. We create two mock catalogs from MultiDark's publicly available N-body MDPL1 simulation, one with perfect galaxy cluster membership information and the other where a simple cylindrical cut around the cluster center allows interlopers to contaminate the clusters. In the standard approach, we use a power-law scaling relation to infer cluster mass from galaxy line-of-sight (LOS) velocity dispersion. Assuming perfect membership knowledge, this unrealistic case produces a wide fractional mass error distribution, with a width E=0.87. Interlopers introduce additional scatter, significantly widening the error distribution further (E=2.13). We employ the support distribution machine (SDM) class of algorithms to learn from distributions of data to predict single values. Applied to distributions of galaxy observables such as LOS velocity and projected distance from the cluster center, SDM yields better than a factor-of-two improvement (E=0.67) for the contaminated case. Remarkably, SDM applied to contaminated clusters is better able to recover masses than even the scaling relation approach applied to uncontaminated clusters. We show that the SDM method more accurately reproduces the cluster mass function, making it a valuable tool for employing cluster observations to evaluate cosmological models.

Notes: Michelle Ntampaka is a Data Science Initiative Fellow at Harvard University.

3:10 pm: Tamas Budavari (Johns Hopkins University)

Sub-Band Image Reconstruction using DCR

Abstract: Refraction by the atmosphere causes the astrometric positions of sources to depend on the airmass through which an observation was taken. This shift is dependent on the underlying spectral energy distribution of the source and the filter or bandpass through which it is observed. Wavelength-dependent refraction within a single passband is often referred to as differential chromatic refraction (DCR). With a new generation of astronomical surveys undertaking repeated observations of the same part of the sky over a range of different airmasses and parallactic angles, DCR should be a detectable and measurable astrometric signal. Here we introduce a novel procedure that uses this subtle signal to infer the underlying spectral energy distribution of a source: we solve for multiple latent images at specific wavelengths via a generalized deconvolution procedure built on robust statistics.

Notes: Tamas Budavari is an Assistant Professor in Applied Math at the Johns Hopkins University.

Infra-Red Astronomy Satellite Project Database

John Stutz
It's possible that one of John's colleagues actually provided this to UCI, perhaps Mike Marshall (MARSHALL%PLU '@'

The Infra-Red Astronomy Satellite (IRAS) was the first attempt to map the full sky at infra-red wavelengths. This could not be done from ground observatories because large portions of the infra-red spectrum are absorbed by the atmosphere. The primary observing program was the full high resolution sky mapping performed by scanning at 4 frequencies. The Low Resolution Observation (IRAS-LRS) program observed high intensity sources over two continuous spectral bands. This database derives from a subset of the higher quality LRS observations taken between 12h and 24h right ascension.

This database contains 531 high quality spectra derived from the IRAS-LRS database. The original data contained 100 spectral measurements in each of two overlapping bands. Of these, 44 blue band and 49 red band channels contain usable flux measurements. Only these are included here. The original spectral intensity values are compressed to 4 digits, and each spectrum includes 5 rescaling parameters. We have used the LRS specified algorithm to rescale these to units of spectral intensity (Janskys). Total intensity differences have been eliminated by normalizing each spectrum to a mean value of 5000.
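The final normalization step described above is simple enough to sketch. The function below rescales one spectrum so that its mean equals 5000. The function name and the toy input are illustrative assumptions, and the LRS-specific decompression with the 5 rescaling parameters is not reproduced here.

```python
def normalize_to_mean(spectrum, target_mean=5000.0):
    """Rescale a list of flux values so that their mean equals target_mean."""
    mean = sum(spectrum) / len(spectrum)
    if mean == 0:
        raise ValueError("cannot normalize a spectrum with zero mean flux")
    factor = target_mean / mean
    return [value * factor for value in spectrum]


# Toy 3-channel "spectrum"; the real records have 93 usable channels.
print(normalize_to_mean([1.0, 2.0, 3.0]))
```

Because every spectrum is scaled to the same mean, a classifier trained on these values sees only the shape of the spectral curve, not the overall brightness of the source.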

This database was originally obtained for use in development and testing of our AutoClass system for Bayesian classification. We have not retained any results from this development, having concentrated our efforts on a 5425 element version of the same data. Our classifications were based upon simultaneous modeling of all 93 spectral intensities. With the larger database we were able to find classes that correspond well with known spectral types associated with particular stellar types. We also found classes that match with the spectra expected of certain stellar processes under investigation by Ames astronomers. These classes have considerably enlarged the set of stars being investigated by those researchers.

The original fortran data file is given in The file spectra-2.head contains information about the .data file contents and how to rescale the compressed spectral intensities.

Attribute Information:

1. LRS-name: (Suspected format: 5 digits, "+" or "-", 4 digits)
2. LRS-class: integer - The LRS-class values range from 0 - 99 with the 10's digit giving the basic class and the 1's digit giving the subclass. These classes are based on features (peaks, valleys, and trends) of the spectral curves.
3. ID-type: integer
4. Right-Ascension: float - Astronomical longitude. 1h = 15deg
5. Declination: float - Astronomical latitude, from -90 to +90 degrees.
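Two of the attributes above involve a small conversion: the LRS-class integer packs a basic class and a subclass into its two digits, and right ascension in hours maps to degrees at 15 degrees per hour. A short sketch of both follows; the function names are hypothetical, and whether the data file stores right ascension in hours or degrees should be checked against the header file.

```python
def split_lrs_class(lrs_class):
    """Return (basic_class, subclass) from a two-digit LRS-class value."""
    if not 0 <= lrs_class <= 99:
        raise ValueError("LRS-class must be between 0 and 99")
    return divmod(lrs_class, 10)   # tens digit, ones digit


def ra_hours_to_degrees(ra_hours):
    """Convert right ascension from hours to degrees (1h = 15deg)."""
    return ra_hours * 15.0


print(split_lrs_class(29))        # (2, 9)
print(ra_hours_to_degrees(12.0))  # 180.0
```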

In recent years we have seen much discussion of Machine Learning & Artificial Intelligence. Everyone is talking about it, but few know exactly what to do. A Machine is a semi or fully automated device that magnifies human physical and/or mental capabilities in performing one or more operations. The term Learning describes the ability to improve behavior based on experience. Machine learning is the technology that allows systems to learn directly from examples, data, and experience. Machine learning sits at the intersection of artificial intelligence, data science, and statistics, and has applications in robotics. It uses elements of each of these fields to process data in a way that can detect and learn from patterns, predict future activity, or make decisions.

“Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed” — Arthur L. Samuel, AI pioneer.

Branches of machine learning

There are three key branches of machine learning: supervised learning, unsupervised learning, and reinforcement learning.

Many other specialized forms of ML also exist, such as semi-supervised learning, transfer learning, and multitask learning.

Machine Learning in Astronomical Data Analysis

By processing the large amounts of data now being generated in fields such as life sciences, particle physics, astronomy, the social sciences, and more, machine learning could be a key enabler for a range of scientific fields, pushing forward the boundaries of science. It could become a key tool for researchers to analyze these large data sets, detecting previously unforeseen patterns or extracting unexpected insights.

Research in astronomy generates large amounts of data. For example, once up and running, the Large Synoptic Survey Telescope (LSST) is expected to create over 15 terabytes of astronomical data each night from its images of the night sky. [LSST (Large Synoptic Survey Telescope). See (accessed 22 March 2017)] In analyzing this data, a key challenge for astronomy is to detect interesting features or signals from the noise and to assign these to the correct category or phenomenon.

According to an article by Eric M Howard, Macquarie University, a solid machine learning algorithm must provide a consolidated interface between the astronomical data processing problem and a statistical method of analyzing the information. The computational method provided should be able to automate a large number of processes that are responsible for the data flow, find patterns in digital data and translate them into useful information. Machine learning covers several techniques to interpret and analyze existing data using models for data behavior and data-based statistical inferences, such as regression, supervised classification, maximum likelihood estimators or Bayesian methods.

Some Real-World Problems That ML can be used for:

· Finding new pulsars from existing data sets

· Identifying the properties of stars

· Classifying supernova remnant spectra

· Clustering starspot evolution tracks

· Modeling a complex stellar flare

· Modeling photometric metallicities

· Correctly classifying galaxies

Beyond astronomy, ML has many applications in science and a wide range of other fields. The skills developed by astronomers as they investigate and implement ML techniques will also serve them in cross-disciplinary endeavors and will be an excellent way for future students to enhance their skillsets for non-astronomy career paths.


School of Information Engineering, Nanjing Audit University, Nanjing, Jiangsu, China

Peng Yang, Guowei Yang & Fanlong Zhang

School of Information Engineering, Nanchang Hangkong University, Nanchang, Jiangxi, China

School of Astronomy and Space Science, Nanjing University, Nanjing, Jiangsu, China

National Astronomical Observatories, Chinese Academy of Sciences (NAOC), Beijing, China

NGC 4725 is an intermediate barred spiral galaxy with a prominent ring structure about 40 million light-years away in…

NGC 4707 has a morphological type of Sm or Im, meaning that it is mostly irregular or has very weak spiral arms. The…

A scatter matrix plot gives a quick view of the relationships between parameters such as the physical size and stellar mass of galaxies.

Unsupervised learning methods such as PCA and t-SNE can help further in evaluating this data.

Here is the PCA plot of this data with 6 parameters.

The results are bimodal. The PCA projection separates the galaxies by type: elliptical (red) and spiral (blue).
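For readers who want to reproduce this kind of projection, here is a minimal PCA sketch using NumPy's SVD. It does not use the galaxy table from the post: the two synthetic groups of 6-parameter points below are an assumption that merely stands in for two galaxy types, and the function name pca is illustrative.

```python
import numpy as np


def pca(X, n_components=2):
    """Project standardized data onto its leading principal components.

    Returns the projected scores and the fraction of variance explained
    by each retained component.
    """
    X = np.asarray(X, dtype=float)
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize each column
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T
    explained = S**2 / np.sum(S**2)
    return scores, explained[:n_components]


# Synthetic stand-in: two well-separated groups described by 6 parameters.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=(50, 6))
group_b = rng.normal(loc=4.0, scale=1.0, size=(50, 6))
X = np.vstack([group_a, group_b])

scores, explained = pca(X)
print(explained)
```

With data like this the first component carries most of the variance, and the two groups fall on opposite sides of zero along it, which is the bimodal pattern described above.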

Another approach is to run the t-SNE unsupervised learning algorithm with various perplexity values. The values tried here are 5, 10, 15, 30, 40 and 50.

These plots corroborate the PCA analysis. We can zoom in further and select a pocket within this data for further analysis.

Plotting mstar and phys_size for this selected subset of 30 galaxies against their morphological type code “t” shows:

The galaxies identified in the pocket of 30 are highly concentrated and of low stellar mass.

On April 25th this year, GAIA published its DR2 archive. I was going through this archive and stumbled upon this video.

Some quick plotting based on the above learning gave below visualizations

Visualizing Cepheid variables based on the limited arc data. Cepheid variables are standard candles used to gauge distances in space. Because the intrinsic luminosity of a Cepheid can be inferred from its pulsation period, comparing it with the observed brightness gives its distance from Earth. Right Ascension and Declination are positional coordinates. Right Ascension is the angular distance of a point measured eastward along the celestial equator from the Sun's position at the March equinox. Declination is the angular distance of a point north or south of the celestial equator.
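The standard-candle idea reduces to the distance modulus: once the absolute magnitude M is known and the apparent magnitude m is measured, the distance in parsecs is 10^((m - M + 5) / 5). The sketch below applies that formula; the function name is illustrative, the absolute magnitude is assumed to be supplied from elsewhere, and interstellar extinction is ignored.

```python
def distance_parsecs(apparent_mag, absolute_mag):
    """Distance in parsecs from the distance modulus m - M = 5 log10(d) - 5."""
    return 10 ** ((apparent_mag - absolute_mag + 5.0) / 5.0)


# A star whose apparent and absolute magnitudes are equal lies at 10 pc.
print(distance_parsecs(5.0, 5.0))
# A star that appears 10 magnitudes fainter than its absolute magnitude.
print(distance_parsecs(10.0, 0.0))
```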

Using Bokeh plot in python to plot GAIA exoplanet data with radius and mass compared to earth.

There are many other parameters like luminosity, temperature etc which can be visualized from this data. In my next article, I am planning to pay tribute to KEPLER by creating some of the visualizations and inferences from that data and to welcome TESS in 2019.

A more consistent and deeper initiative can create a boom in collaboration between astronomers, statisticians, data scientists and information & computer professionals, thereby helping to accelerate our understanding of the SPACE around us.

#datascience #astronomy #GAIA

Dhaval Mandalia enjoys data science, project management, training executives and writing about management strategies. He’s also a contributing member in the management association community in Gujarat. Follow him on Twitter and Facebook.

May 1, 2020 1625 words 8 minutes

This is a follow-on from a previous post on tangent spaces. It is part of a series of posts that I write on the basics of differential or Riemannian geometry, providing the necessary background for reading some of the more advanced textbooks on general relativity. I will introduce the derivative of a smooth map between manifolds and work a little bit with the definition(s) to become familiar with the thicket of mathematical notation. Then I narrow in on the special case where f is a real-valued function, introduce the notions of dual space and covectors, and finally show the connection to the differential known from basic calculus.


Stellar spectral classification is one of the most fundamental tasks in survey astronomy. Many automated classification methods have been applied to spectral data. However, their main limitation is that the model parameters must be tuned repeatedly to deal with different data sets. In this paper, we utilize the Bayesian support vector machines (BSVM) to classify the spectral subclass data. Based on Gibbs sampling, BSVM can infer all model parameters adaptively according to different data sets, which allows us to circumvent the time-consuming cross validation for penalty parameter. We explored different normalization methods for stellar spectral data, and the best one has been suggested in this study. Finally, experimental results on several stellar spectral subclass classification problems show that the BSVM model not only possesses good adaptability but also provides better prediction performance than traditional methods.

2. Exploring Machine Learning Classification to predict galaxy classes

There is a wide range of galaxy types observed by the Sloan Digital Sky Survey in the Galaxy Zoo. In this activity, we will limit our dataset to three types of galaxy: spirals, ellipticals and mergers, as shown below.

The galaxy catalog we are using is a sample of galaxies where at least 20 human classifiers (such as yourself) have come to a consensus on the galaxy type. Examples of spiral and elliptical galaxies were selected where there was a unanimous classification. Due to low sample numbers, we included merger examples where at least 80% of human classifiers selected the merger class. We need this high quality data to train our classifier.

The features that we will be using to do our galaxy classification are color index, adaptive moments, eccentricities and concentrations. These features are provided as part of the SDSS catalogue.

Color indices are the same colors (u-g, g-r, r-i, and i-z) we used for regression. Studies of galaxy evolution tell us that spiral galaxies have younger star populations and therefore are ‘bluer’ (brighter at lower wavelengths). Elliptical galaxies have an older star population and are brighter at higher wavelengths (‘redder’).

Eccentricity approximates the shape of the galaxy by fitting an ellipse to its profile. Eccentricity is the ratio of the two axes (semi-major and semi-minor). The De Vaucouleurs model was used to obtain these two axes. To simplify our experiments, we will use the median eccentricity across the 5 filters.

Adaptive moments also describe the shape of a galaxy. They are used in image analysis to detect similar objects at different sizes and orientations. We use the fourth moment here for each band.

Concentration is similar to the luminosity profile of the galaxy, which measures what proportion of a galaxy’s total light is emitted within what radius. A simplified way to represent this is to take the ratio of the radii containing 50% and 90% of the Petrosian flux.

The Petrosian method allows us to compare the radial profiles of galaxies at different distances. We will use the concentration from the u, r and z bands. For these experiments, we will define concentration as the ratio of the two Petrosian radii: conc = petroR50 / petroR90.

We have extracted the SDSS and Galaxy Zoo data for 780 galaxies; the first few rows of the dataset are shown below:

As described earlier, the data has the following fields:

  • colors: u-g, g-r, r-i, and i-z
  • eccentricity: ecc
  • 4th adaptive moments: m4_u, m4_g, m4_r, m4_i, and m4_z
  • 50% Petrosian: petroR50_u, petroR50_r, petroR50_z
  • 90% Petrosian: petroR90_u, petroR90_r, petroR90_z.
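The fields listed above can be turned into one feature vector per galaxy. The sketch below assumes each galaxy is available as a dictionary keyed by those field names, and it takes concentration to be the ratio of the 50% to the 90% Petrosian radius in each band. The function name and all numbers in the example row are made up for illustration and are not taken from the catalogue.

```python
def build_features(row):
    """Assemble the 13 classification features for one galaxy record."""
    colors = [row["u-g"], row["g-r"], row["r-i"], row["i-z"]]
    moments = [row[f"m4_{band}"] for band in "ugriz"]
    # Concentration per band: radius holding 50% of the Petrosian flux
    # divided by the radius holding 90% of it.
    concentrations = [
        row[f"petroR50_{band}"] / row[f"petroR90_{band}"] for band in "urz"
    ]
    return colors + [row["ecc"]] + moments + concentrations


# Illustrative record with invented values.
example_row = {
    "u-g": 1.8, "g-r": 0.8, "r-i": 0.4, "i-z": 0.3,
    "ecc": 0.6,
    "m4_u": 2.0, "m4_g": 2.1, "m4_r": 2.2, "m4_i": 2.3, "m4_z": 2.4,
    "petroR50_u": 2.0, "petroR50_r": 3.0, "petroR50_z": 4.0,
    "petroR90_u": 5.0, "petroR90_r": 7.5, "petroR90_z": 8.0,
}

features = build_features(example_row)
print(len(features), features[-3:])
```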

Now, let’s split the data, generate the features, and train a decision tree classifier. We then perform a held-out validation by predicting classes for the test set and keeping the actual classes for later comparison.

The decision tree learnt with grid search cross validation is shown below:

The accuracy of classification problems is a lot simpler to calculate than for regression problems. The simplest measure is the fraction of objects that are correctly classified, as shown below. The accuracy measure is often called the model score. While the way of calculating the score can vary depending on the model, the accuracy is the most common for classification problems.

In addition to an overall accuracy score, we’d also like to know where our model is going wrong. For example, were the incorrectly classified mergers misclassified as spirals or ellipticals? To answer this type of question we use a confusion matrix. The confusion matrix computed for our problem is shown below:
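Both measures can be computed in a few lines. The sketch below is a plain-Python illustration with a tiny invented set of labels, not the tutorial's 780-galaxy results; rows of the matrix are the actual classes and columns the predicted ones, which is one common convention.

```python
def accuracy(predicted, actual):
    """Fraction of objects whose predicted class equals the actual class."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def confusion_matrix(predicted, actual, classes):
    """Count matrix with actual classes as rows and predicted as columns."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for p, a in zip(predicted, actual):
        matrix[index[a]][index[p]] += 1
    return matrix


classes = ["elliptical", "merger", "spiral"]
actual = ["spiral", "spiral", "merger", "elliptical", "merger", "elliptical"]
predicted = ["spiral", "merger", "merger", "elliptical", "spiral", "elliptical"]

print(accuracy(predicted, actual))
for name, row in zip(classes, confusion_matrix(predicted, actual, classes)):
    print(name, row)
```

Off-diagonal entries show which classes get confused with each other; in this toy example one merger is labelled as a spiral and one spiral as a merger.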

Random Forest

So far we have used a single decision tree model. However, we can improve the accuracy of our classification by using a collection (or ensemble) of trees, known as a random forest.

A random forest is a collection of decision trees that have each been independently trained using different subsets of the training data and/or different combinations of features in those subsets.

When making a prediction, every tree in the forest gives its own prediction and the most common classification is taken as the overall forest prediction (in regression the mean prediction is used).
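The aggregation step described above can be sketched on its own, leaving aside how the individual trees are trained. The function names are illustrative and the per-tree predictions are invented.

```python
from collections import Counter


def forest_classify(tree_predictions):
    """Return the most common class among the per-tree predictions.

    Ties are broken in favor of the class encountered first.
    """
    return Counter(tree_predictions).most_common(1)[0][0]


def forest_regress(tree_predictions):
    """Return the mean of the per-tree numeric predictions."""
    return sum(tree_predictions) / len(tree_predictions)


print(forest_classify(["spiral", "merger", "spiral", "elliptical", "spiral"]))
print(forest_regress([1.0, 2.0, 3.0]))
```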

The following figure shows the confusion matrix computed with random forest classifier.

Did the random forest improve the accuracy of the model? The answer is yes: we see a substantial increase in accuracy. When we look at the 10-fold cross validation results, we see that the random forest systematically outperforms a single decision tree: The random forest is around

Title: An expert computer program for classifying stars on the MK spectral classification system

This paper describes an expert computer program (MKCLASS) designed to classify stellar spectra on the MK Spectral Classification system in a way similar to humans: by direct comparison with the MK classification standards. Like an expert human classifier, the program first comes up with a rough spectral type, and then refines that spectral type by direct comparison with MK standards drawn from a standards library. A number of spectral peculiarities, including barium stars, Ap and Am stars, λ Bootis stars, carbon-rich giants, etc., can be detected and classified by the program. The program also evaluates the quality of the delivered spectral type. The program currently is capable of classifying spectra in the violet-green region in either the rectified or flux-calibrated format, although the accuracy of the flux calibration is not important. We report on tests of MKCLASS on spectra classified by human classifiers; those tests suggest that over the entire HR diagram, MKCLASS will classify in the temperature dimension with a precision of 0.6 spectral subclass, and in the luminosity dimension with a precision of about one half of a luminosity class. These results compare well with human classifiers.
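The idea of classifying by direct comparison with standards can be illustrated with a deliberately simple sketch. This is not the MKCLASS algorithm, whose details are not given here; it only picks the library entry with the smallest sum of squared differences after scaling each spectrum to unit mean. The function name, the labels and the three-point toy spectra are all assumptions made for illustration.

```python
import numpy as np


def closest_standard(spectrum, standards):
    """Return the label of the standard spectrum closest to the input.

    `standards` maps a spectral type label to a flux array sampled on the
    same wavelength grid as `spectrum`.
    """
    spectrum = np.asarray(spectrum, dtype=float)
    spectrum = spectrum / spectrum.mean()          # remove overall flux scale
    best_label, best_distance = None, np.inf
    for label, template in standards.items():
        template = np.asarray(template, dtype=float)
        template = template / template.mean()
        distance = np.sum((spectrum - template) ** 2)
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label


# Toy library: a spectrum falling with wavelength and one rising with it.
toy_standards = {"A0 V": [3.0, 2.0, 1.0], "K0 III": [1.0, 2.0, 3.0]}
print(closest_standard([2.0, 4.0, 6.0], toy_standards))
```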
